CN118016073A - Classroom coarse granularity sound event detection method based on audio and video feature fusion - Google Patents

Classroom coarse granularity sound event detection method based on audio and video feature fusion Download PDF

Info

Publication number
CN118016073A
Authority
CN
China
Prior art keywords
audio
frame
information
video
classroom
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311820919.0A
Other languages
Chinese (zh)
Inventor
许炜
崔玉蕾
周为
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202311820919.0A priority Critical patent/CN118016073A/en
Publication of CN118016073A publication Critical patent/CN118016073A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 - Feature extraction; Face representation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/18 - Artificial neural networks; Connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the technical field of intelligent classrooms, and particularly relates to a classroom coarse-granularity sound event detection method based on audio and video feature fusion, which comprises the following steps: performing face detection on the video data frame by frame with a video information processing model and extracting the mouth state information of everyone in each frame; performing human body posture detection on the video data frame by frame and extracting the posture information of everyone in each frame; splicing all the mouth state information and all the posture information in time order to form the video action features; extracting audio features from the audio data frame by frame with an audio information processing model, converting the audio data into text and extracting text features frame by frame; splicing the audio features and the text features in time order to form the audio information features; and, based on the video action features and the audio information features, outputting a detection and classification result of the speaking role for each frame with a feature fusion and classification model, thereby obtaining the coarse-granularity sound event detection result. The invention can improve the detection precision of classroom sound events.

Description

Classroom coarse granularity sound event detection method based on audio and video feature fusion
Technical Field
The invention belongs to the technical field of intelligent classrooms, and particularly relates to a classroom coarse-granularity sound event detection method based on audio and video feature fusion.
Background
Classroom activity detection has long been a hot topic, and experts and scholars continue to research it; by analyzing the behaviors of students and teachers in class and adjusting the content accordingly after class, both the teaching skills of teachers and the learning efficiency of students can be improved.
To realize classroom activity detection, a high-quality, fine-grained record of classroom activity is indispensable: it is necessary to determine whether someone is speaking and who the speaker is, and to record the start and stop time of each utterance of the teacher and of the students in the classroom, i.e., to perform coarse-granularity sound event detection for the classroom. However, separate activity records for the teacher and the students are very difficult to obtain unless a dedicated person records everything that happens in the classroom, or the teacher and every student each wear an independent sound-receiving device; neither is realistic. In practice only one or two sound-receiving devices are available, and they capture a mixture of all the sounds in the classroom.
Some existing research performs classroom sound event detection based on audio alone. When the quality of the collected audio is poor, the classroom environment is complex, or the teacher's voice is very close to that of some students, the accuracy of classroom sound event detection suffers. Moreover, in some situations, such as teacher-student interaction, the speakers switch extremely quickly, which poses a great challenge for audio-only classroom sound event detection: the event switching points often cannot be detected well, resulting in erroneous segmentations.
Disclosure of Invention
Aiming at the defects and improvement demands of the prior art, the invention provides a classroom coarse granularity sound event detection method based on audio and video feature fusion, which aims to improve the coarse granularity sound event detection precision of a classroom.
In order to achieve the above object, according to an aspect of the present invention, there is provided a classroom coarse granularity sound event detection method based on audio and video feature fusion, including:
Acquiring video data and audio data generated in a classroom;
Adopting the constructed video information processing model to perform face detection on video data frame by frame, and extracting state information of all mouths in each frame; carrying out human body posture detection on the video data frame by frame, and extracting posture information of all people in each frame; splicing the state information of all the mouths and the gesture information of all the people according to a time sequence to be used as video action characteristics;
Extracting audio characteristics from audio data frame by adopting the constructed audio information processing model, converting the audio data into text, and extracting text characteristics from the text frame by frame; splicing the audio features and the text features according to a time sequence to serve as audio information features;
Based on the video action features and the audio information features, a constructed feature fusion and classification model based on an attention mechanism outputs a detection and classification result of the speaking role for each frame, so that the classroom coarse-granularity sound event detection result is obtained; each coarse-granularity sound event comprises the event start and stop time and the corresponding speaking role, and the speaking roles are divided into three categories: teacher, student and mixed.
Further, the part of the video information processing model for extracting the state information of all the mouths in each frame is obtained by training and constructing based on MTCNN algorithm.
Further, the state information of each mouth is state information composed of four key points of the left and right mouth corners and the upper and lower lips.
Further, when a mask is recognized, mask information is used as the state information of the corresponding mouth.
Further, the part of the video information processing model for extracting the gesture information of all people in each frame is obtained by training and constructing based on AlphaPose algorithm.
Further, the video data includes teacher video data and student video data;
and splicing the video action features obtained from the teacher video data and the video action features obtained from the student video data in time order to form the total video action features, which are input into the feature fusion and classification model together with the audio information features.
Further, the posture information of the teacher extracted based on the teacher video data is posture information composed of 15 key points of the head, neck, left and right shoulders, left and right elbows, left and right hands, left and right ankles, left and right knees, left and right crotch bones, and torso;
The posture information of the student extracted based on the student video data is posture information composed of eight key points of a head, a neck, left and right shoulders, left and right elbows and left and right hands.
Further, the audio features are mel-frequency cepstral coefficients.
Further, the method for splicing the audio feature and the text feature according to the time sequence is as follows:
And respectively inputting the audio features and the text features into a CNN feature extraction network, splicing the results after feature extraction according to time sequence alignment, and inputting the results into an RNN network to obtain the audio information features combined with the context information.
The invention also provides a computer readable storage medium, which stores a computer program, wherein when the computer program is run by a processor, the equipment where the storage medium is located is controlled to execute the classroom coarse granularity sound event detection method based on the audio and video feature fusion.
In general, through the above technical solutions conceived by the present invention, the following beneficial effects can be obtained:
(1) Based on the multi-modal data produced in a real classroom environment, the invention converts the audio into text and extracts text features, which is equivalent to detecting classroom coarse-granularity sound events by combining data of three modalities. Specifically, three kinds of features are obtained by extracting and analyzing the classroom audio and video information: audio features, text features and video action features. Detecting classroom coarse-granularity sound events with this multi-modal fusion effectively avoids the poor segmentation results that arise in the audio-only mode when the audio quality is poor, the classroom environment is complex, or the teacher's voice is very close to that of some students, and thus improves the detection precision of classroom coarse-granularity sound events.
(2) In the invention the video data comprise teacher video data and student video data, collected separately so that teacher actions and student actions are each captured clearly, which helps to improve the coarse-granularity sound event detection precision.
Drawings
Fig. 1 is a schematic block diagram of a classroom coarse granularity sound event detection method based on audio and video feature fusion according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
Example 1
A class coarse granularity sound event detection method based on audio and video feature fusion is shown in fig. 1, and comprises the following steps:
Acquiring video data and audio data generated in a classroom;
Adopting the constructed video information processing model to perform face detection on video data frame by frame, and extracting state information of all mouths in each frame; carrying out human body posture detection on the video data frame by frame, and extracting posture information of all people in each frame; splicing the state information of all the mouth parts and the gesture information of all the people according to a time sequence to be used as video action characteristics;
extracting audio characteristics from audio data frame by adopting the constructed audio information processing model, converting the audio data into text, and extracting text characteristics from the text frame by frame; splicing the audio features and the text features according to a time sequence to serve as audio information features;
Based on the video action features and the audio information features, a constructed feature fusion and classification model based on an attention mechanism outputs a detection and classification result of the speaking role for each frame, so that the classroom coarse-granularity sound event detection result is obtained; each coarse-granularity sound event comprises the event start and stop time and the corresponding speaking role, and the speaking roles are divided into three categories: teacher, student and mixed.
In this embodiment, the classroom coarse-granularity sound event detection model classifies each sound event into one of three classes, teacher, student and babble (i.e., mixed, with several people speaking simultaneously), and outputs the classroom coarse-granularity sound event detection result, including the start and stop time of each event and its corresponding speaking role. The present embodiment combines mouth features, text features and action features to determine the speaking role.
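To make the output format concrete, the following minimal sketch (not taken from the patent text) shows how per-frame speaking-role predictions can be merged into coarse-granularity events with start and stop times; the frame rate, label names and minimum event length are assumptions:

```python
# Minimal sketch: merge per-frame speaking-role predictions into coarse-granularity
# sound events (start time, stop time, role). Frame rate, label names and the
# minimum-run threshold are assumptions, not specified in the patent text.
FRAME_RATE = 25  # frames per second (assumption)
LABELS = {0: "teacher", 1: "student", 2: "babble"}

def frames_to_events(frame_labels, min_frames=5):
    """Collapse per-frame class indices into (start_s, end_s, role) events.

    Runs shorter than `min_frames` are dropped (an assumed post-processing
    choice, echoing the rule that extremely short sounds are left unlabeled).
    """
    events, start, current = [], 0, None
    for i, lab in enumerate(list(frame_labels) + [None]):  # sentinel flushes the last run
        if lab != current:
            if current is not None and i - start >= min_frames:
                events.append((start / FRAME_RATE, i / FRAME_RATE, LABELS[current]))
            start, current = i, lab
    return events

print(frames_to_events([0] * 50 + [2] * 3 + [1] * 40))
# [(0.0, 2.0, 'teacher'), (2.12, 3.72, 'student')]  -- the 3-frame babble run is dropped
```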
Face detection for the teacher and the students in the classroom is performed with an algorithm such as RetinaFace or MTCNN; meanwhile, the facial key points of each detected face are extracted with a facial landmark algorithm (for example, Dlib or MTCNN). As a preferred implementation, the video feature extraction (i.e., the facial key points) is carried out with the MTCNN algorithm, which completes the two tasks of face detection and key point extraction at the same time; if RetinaFace is selected for face detection, the facial key points are extracted with an algorithm such as Dlib.
MTCNN comprises a three-stage network structure: P-Net, which quickly generates candidate windows; R-Net, which filters and selects the candidate windows with higher precision; and O-Net, which generates the final bounding boxes and facial key points. P-Net is a fully convolutional network consisting of three convolution layers with 3x3 kernels; through three output branches it outputs whether the region contains a face, the candidate boxes of possible faces and the corresponding facial key points, and these candidate regions are input into R-Net for further processing. Compared with P-Net, R-Net adds a fully connected layer of dimension 128 after the last convolution layer, which refines the input and discards most of the erroneous candidates; finally, three fully connected layers output the face classification, the bounding box and the facial key points, respectively. The O-Net network is more complex: the information passed from R-Net undergoes four convolutions and three pooling operations, after which a fully connected layer of dimension 256 followed by three fully connected layers outputs the face classification, the bounding box and the facial key points, respectively.
Early face algorithms and their corresponding data sets annotate five key points, the left and right eyes, the nose and the left and right mouth corners, and use these five key points for face recognition and detection. This is not suitable for the scenario of this embodiment, which requires mouth state features and an indication of whether a mask is worn, so the key point information is modified and a mask key point is added. That is, since the mouth features are what matter when detecting whether someone is speaking, the feature point extraction targets are preferably changed to four key points, the left and right mouth corners and the upper and lower lips; and considering that a student or teacher may be wearing a mask, when a mask is recognized during feature point extraction, mask information is output as the mouth key points, where the mask information may, for example, be marked with a dedicated feature map.
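For illustration only, the sketch below runs an off-the-shelf MTCNN (the facenet-pytorch package is an assumption) frame by frame and keeps the two mouth-corner landmarks of every detected face; the retrained four-point mouth model with a mask key point described above is not publicly available, so this is only an approximation of that step:

```python
# Minimal sketch (assumption: the facenet-pytorch MTCNN implementation). The standard
# model returns five landmarks per face (eyes, nose, two mouth corners); only the two
# mouth corners are kept here, approximating the mouth-state extraction described above.
import cv2
import torch
from facenet_pytorch import MTCNN

mtcnn = MTCNN(keep_all=True, device="cuda" if torch.cuda.is_available() else "cpu")

def mouth_states_per_frame(video_path):
    cap = cv2.VideoCapture(video_path)
    per_frame = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        boxes, probs, landmarks = mtcnn.detect(rgb, landmarks=True)
        # landmarks: (n_faces, 5, 2) = left eye, right eye, nose, left/right mouth corner
        mouths = [] if landmarks is None else [lm[3:5].flatten() for lm in landmarks]
        per_frame.append(mouths)
    cap.release()
    return per_frame  # one list of mouth-corner coordinate vectors per video frame
```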
For the acquisition of posture information, a human body pose estimation algorithm (such as AlphaPose) is adopted to extract human body key points. AlphaPose is preferred; it is a multi-person pose estimation method that uses a top-down detection strategy, first detecting the bounding box of each person in the picture and then estimating the pose within each bounding box independently. It mainly comprises two networks. (1) The symmetric spatial transformer network, which contains a spatial transformer network (STN), a spatial de-transformer network (SDTN) and a single-person pose estimator (SPPE), with the SPPE located between the STN and the SDTN. A parallel SPPE branch is connected to the STN to optimize the network: the parallel SPPE mainly judges whether the pose extracted by the STN is centered, back-propagates the error to the STN and updates the STN network weights; it does not contribute to the output and is only used to optimize the STN network. (2) The parametric pose non-maximum suppression network: human body detection inevitably produces redundant detection boxes, i.e., one detection target may have several boxes, which in turn produces redundant pose detections, so a pose distance metric is used to compare the similarity of poses and remove similar ones.
In order to collect better and clearer action characteristics, preferably, the video data comprise teacher video data and student video data; for example, a camera may be disposed in front of the classroom to collect video data of students, and a camera may be disposed behind the classroom to collect video data of teachers.
The video action features obtained from the teacher video data and the video action features obtained from the student video data are then spliced in time order as the total video action features, which are input into the feature fusion and classification model together with the audio information features.
As a preferred embodiment, the posture information of the teacher extracted based on the teacher video data is posture information composed of 15 key points of the head, neck, left and right shoulders, left and right elbows, left and right hands, left and right ankles, left and right knees, left and right crotch bones, and torso;
The posture information of the student extracted based on the student video data is posture information composed of eight key points of the head, the neck, the left and right shoulders, the left and right elbows and the left and right hands.
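As an illustration of how the per-frame features described above could be assembled, the sketch below concatenates the teacher pose (15 key points), the student poses (8 key points each) and the mouth state information into one video-action vector per frame; the fixed maximum number of students, the zero-padding for missing detections and the flat coordinate layout are assumptions not specified in the text:

```python
# Minimal sketch: assemble one video-action feature vector per frame by concatenating
# mouth states and pose key points. The fixed student count, zero-padding and layout
# are assumptions, not part of the patent text.
import numpy as np

N_TEACHER_KPTS = 15   # head, neck, shoulders, elbows, hands, ankles, knees, hips, torso
N_STUDENT_KPTS = 8    # head, neck, shoulders, elbows, hands
MAX_STUDENTS = 40     # assumption: pad/truncate to a fixed class size

def frame_action_feature(teacher_pose, student_poses, mouth_states):
    """teacher_pose: (15, 2); student_poses: list of (8, 2); mouth_states: list of (8,)."""
    teacher = np.asarray(teacher_pose, dtype=np.float32).reshape(-1)
    students = np.zeros((MAX_STUDENTS, N_STUDENT_KPTS * 2), dtype=np.float32)
    for i, pose in enumerate(student_poses[:MAX_STUDENTS]):
        students[i] = np.asarray(pose, dtype=np.float32).reshape(-1)
    mouths = np.zeros((MAX_STUDENTS + 1, 8), dtype=np.float32)  # 4 mouth points x 2 coords
    for i, m in enumerate(mouth_states[:MAX_STUDENTS + 1]):
        mouths[i] = np.asarray(m, dtype=np.float32)
    return np.concatenate([teacher, students.reshape(-1), mouths.reshape(-1)])

# Stacking frame_action_feature(...) over time yields the video action feature sequence Xv.
```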
Extraction of text features and audio features: the classroom audio data is transcribed into text data, preferably using a tool such as Whisper or Kaldi; the transcription result contains time stamps corresponding to the text content and is input into a Bert model. Bert uses a multi-layer bidirectional Transformer as its feature extractor to extract word vector features that contain context information. The word vector features are obtained by combining the word embedding vector and the position embedding of the input text, x_{ti} = w_i + p_i^{embedding}, where x_{ti} denotes the word vector feature of the i-th word, w_i the word embedding vector of the i-th word and p_i^{embedding} the position embedding vector of the i-th word; the finally obtained word vector features are Xt = [x_{t1}, x_{t2}, ..., x_{tn}].
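For illustration, contextual word-vector features can be extracted with a BERT encoder as sketched below; the Hugging Face transformers library and the bert-base-chinese checkpoint are assumptions, the text above only specifying that a Bert model is used:

```python
# Minimal sketch (assumption: Hugging Face transformers with a Chinese BERT checkpoint).
# BERT internally sums word and position embeddings, matching x_ti = w_i + p_i above,
# and returns one contextual vector per token.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()

def text_features(transcript: str) -> torch.Tensor:
    inputs = tokenizer(transcript, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # (n_tokens, 768) word-vector features Xt
```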
Audio features Xs, such as MFCC or LLD features, are extracted using an audio processing tool. Preferably, this example extracts mel-frequency cepstral coefficients (MFCC features). Let x(n) denote the original audio signal. It is first preprocessed (silence removal, noise reduction, pre-emphasis, etc.), and the preprocessed audio signal is divided into overlapping frames of length N; preferably a 50% overlap ratio is used, so that adjacent frames share an overlapping region and the variation between them is not too large. A window function w(n) is applied to each frame, i.e., s(n) = x(n) * w(n), where s(n) is the windowed signal. A Fast Fourier Transform (FFT) is applied to the windowed signal, S(k) = FFT[s(n)], where S(k) is the frequency-domain signal and k is the frequency index. The energy spectrum of each frame is the squared magnitude of the frequency-domain signal, E(k) = |S(k)|^2. A set of mel filters, evenly distributed on the mel scale, is designed, and the energy spectrum is filtered by this filter bank to obtain the filter-bank output H_m = \sum_{k'} E(k') H(k', m), where m denotes the filter index and H(k', m) is the response of the m-th filter at frequency index k'. Taking the logarithm of the filter-bank output gives the log energy spectrum M_m = \log(H_m), and applying a Discrete Cosine Transform (DCT) to the log energy spectrum gives the cepstral coefficients
Xs_m = \sum_{j=1}^{J} M_j \cos[\pi m (j - 1/2) / J],
where J is the number of mel filters and Xs_m is the m-th cepstral coefficient. Finally, the resulting Xs_m are used as the MFCC feature vector of the audio signal.
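A minimal sketch of the MFCC pipeline described above follows (framing with 50% overlap, windowing, FFT, mel filter bank, logarithm, DCT); the sampling rate, frame length, number of filters and number of coefficients are illustrative assumptions, and librosa is assumed only for building the mel filter bank:

```python
# Minimal sketch of the described MFCC pipeline: frame with 50% overlap, window,
# FFT energy spectrum, mel filter bank, log, DCT. Parameter values are assumptions.
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_features(x, sr=16000, frame_len=400, n_fft=512, n_mels=26, n_mfcc=13):
    hop = frame_len // 2                                   # 50% overlap between frames
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (n_mels, n_fft//2+1)
    window = np.hamming(frame_len)
    coeffs = []
    for start in range(0, len(x) - frame_len + 1, hop):
        s = x[start:start + frame_len] * window            # s(n) = x(n) * w(n)
        S = np.fft.rfft(s, n=n_fft)                        # S(k) = FFT[s(n)]
        E = np.abs(S) ** 2                                 # E(k) = |S(k)|^2
        H = mel_fb @ E                                     # mel filter-bank energies H_m
        M = np.log(H + 1e-10)                              # log energy spectrum M_m
        coeffs.append(dct(M, type=2, norm="ortho")[:n_mfcc])  # cepstral coefficients Xs_m
    return np.stack(coeffs)                                # (n_frames, n_mfcc) MFCC matrix Xs
```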
In a preferred embodiment, the method for splicing the audio feature and the text feature according to the time sequence is as follows:
And respectively inputting the audio features and the text features into a CNN feature extraction network, splicing the results after feature extraction according to time sequence alignment, and inputting the results into an RNN network to obtain the audio information features combined with the context information.
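A minimal PyTorch sketch of this arrangement is given below; the use of a GRU as the RNN, the hidden sizes and the bidirectionality are assumptions, since the text only specifies a CNN feature extraction network followed by an RNN:

```python
# Minimal PyTorch sketch of the audio-information branch: per-modality 1-D CNNs,
# frame-wise concatenation along the time axis, then an RNN (a GRU is assumed)
# for context. All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class AudioInfoBranch(nn.Module):
    def __init__(self, mfcc_dim=13, text_dim=768, hidden=128):
        super().__init__()
        self.audio_cnn = nn.Conv1d(mfcc_dim, hidden, kernel_size=3, padding=1)
        self.text_cnn = nn.Conv1d(text_dim, hidden, kernel_size=3, padding=1)
        self.rnn = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, mfcc, text):
        # mfcc: (B, T, mfcc_dim); text: (B, T, text_dim), already aligned to the same T frames
        a = torch.relu(self.audio_cnn(mfcc.transpose(1, 2))).transpose(1, 2)
        t = torch.relu(self.text_cnn(text.transpose(1, 2))).transpose(1, 2)
        fused = torch.cat([a, t], dim=-1)          # frame-wise concatenation in time order
        Xa, _ = self.rnn(fused)                    # context-aware audio information features
        return Xa                                  # (B, T, 2 * hidden)
```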
The constructed attention-based feature fusion and classification model outputs the detection and classification result of the speaking role for each frame. The video action feature Xv and the audio information feature Xa are input into a fusion model with an attention mechanism, fused, and fed into a fully connected layer to obtain the integrated classroom sound event detection result. Taking the audio information feature as the key and the video action feature as the query, the relevance s_i of Xa_i and Xv_i is computed, where Xa_i denotes the audio information feature at the i-th moment and Xv_i the video action feature at the i-th moment: s_i = V \tanh(W \cdot Xv_i + U \cdot Xa_i), where W and U are weight parameters, tanh is the activation function and V is the multi-modal fusion weight parameter. The normalized attention coefficient \alpha_i is obtained by normalizing s_i, and the final fused feature can be expressed as Y_i = \alpha_i \cdot Xa_i + (1 - \alpha_i) \cdot Xv_i.
And finally, inputting the fusion characteristics Yi into a full-connection layer network for classification to obtain a class coarse-granularity sound event detection result.
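The sketch below illustrates this attention-gated fusion and the fully connected classifier: s_i = V tanh(W·Xv_i + U·Xa_i), a normalized coefficient alpha_i (a sigmoid is assumed here, since the normalization is not spelled out above), Y_i = alpha_i·Xa_i + (1 − alpha_i)·Xv_i, and a linear layer over the three speaking-role classes; projecting both modalities to a common dimension before mixing is also an assumption:

```python
# Minimal PyTorch sketch of attention-gated audio-video fusion and frame-wise
# classification. Sigmoid normalization of s_i and the common-dimension projection
# are assumptions; dimensions are illustrative.
import torch
import torch.nn as nn

class AttentionFusionClassifier(nn.Module):
    def __init__(self, video_dim, audio_dim, dim=256, n_classes=3):  # teacher / student / babble
        super().__init__()
        self.W = nn.Linear(video_dim, dim, bias=False)   # weight W acting on Xv_i
        self.U = nn.Linear(audio_dim, dim, bias=False)   # weight U acting on Xa_i
        self.V = nn.Linear(dim, 1, bias=False)           # multi-modal fusion weight V
        self.proj_v = nn.Linear(video_dim, dim)
        self.proj_a = nn.Linear(audio_dim, dim)
        self.fc = nn.Linear(dim, n_classes)

    def forward(self, Xv, Xa):
        # Xv: (B, T, video_dim) video action features; Xa: (B, T, audio_dim) audio info features
        s = self.V(torch.tanh(self.W(Xv) + self.U(Xa)))  # s_i = V tanh(W Xv_i + U Xa_i)
        alpha = torch.sigmoid(s)                         # normalized attention coefficient
        Y = alpha * self.proj_a(Xa) + (1 - alpha) * self.proj_v(Xv)  # fused feature Y_i
        return self.fc(Y)                                # per-frame speaking-role logits
```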
Regarding the construction of a video information processing model, an audio information processing model and a feature fusion and classification model based on an attention mechanism, the three can be independently trained, wherein:
The loss function L_mouth of the part of the video information processing model used to extract the state information of all mouths in each frame is expressed as:
L_{mouth} = \sum_{t} \sum_{i=1}^{e} l_{bce}(p_{i,t}, \hat{p}_{i,t})
In this expression, i indexes the mouth key points, e is the number of mouth key points, t denotes time, l_{bce} denotes the binary cross-entropy loss, p_{i,t} denotes the true state of the mouth key point p_i at time t, and \hat{p}_{i,t} denotes the predicted probability of the mouth key point p_i at time t.
The loss function L_posture of the part of the video information processing model used to extract the posture information of all persons in each frame is expressed as:
L_{posture} = \sum_{t} \sum_{i=1}^{g} l_{bce}(a_{i,t}, \hat{a}_{i,t})
In this expression, i indexes the human body key points, g is the number of human body key points, t denotes time, a_{i,t} denotes the true state of the posture key point a_i at time t, and \hat{a}_{i,t} denotes the predicted probability of the posture key point a_i at time t.
The loss function L_word of the part of the audio information processing model used to extract text features from the text frame by frame is expressed as:
L_{word} = -\sum_{i=1}^{W} q_i \log \hat{q}_i
In this expression, W denotes the vocabulary size of the speech transcription text, q_i denotes the true result of the i-th word of the transcription result, and \hat{q}_i denotes the predicted result of the i-th word of the transcription result.
The loss function L_fusion used to fuse the video action features and the audio information features in the feature fusion and classification model is expressed as:
L_{fusion} = -\sum_{t} \sum_{i=1}^{z} y_{i,t} \log \hat{y}_{i,t}
In this expression, i indexes the sound detection label types, z is the number of sound detection labels, t denotes time, y_{i,t} denotes the true state of the classroom sound event detection result y_i at time t in the feature fusion model, and \hat{y}_{i,t} denotes the predicted probability of the classroom sound event detection result y_i at time t.
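As a small illustration, the fusion loss written above (cross entropy summed over label types and frames) could be computed as follows; encoding the per-frame target as an integer role label is an assumption:

```python
# Minimal sketch of the frame-wise fusion loss L_fusion: cross entropy summed over
# label types i and frames t. Integer role labels (0..z-1) per frame are assumed.
import torch
import torch.nn.functional as F

def fusion_loss(logits, targets):
    """logits: (T, z) per-frame class scores; targets: (T,) integer role labels."""
    return F.cross_entropy(logits, targets, reduction="sum")  # sums -y_{i,t} log(yhat_{i,t})
```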
During video annotation, when a teacher or student wears a mask, the mask is annotated as the mouth key points. The start and end time point of each event and the corresponding label are marked manually; different students are not distinguished from one another and all share the unified label of student, and when a sound is extremely short it is not given any label. The annotated classroom audio and video data set is obtained according to these labeling rules.
In general, the invention detects classroom coarse-granularity sound events by fusing data of multiple modalities, adds video-modality features of teachers and students, and detects the mouth state and body posture of the current speaker. Because the video modality better reflects the current classroom sound event, this solves the problem that the accuracy of classroom sound event detection suffers when only audio is used, for example when the collected audio quality is poor, the classroom environment is complex, or the teacher's voice is very close to that of some students. At the same time, the audio-video fusion approach also avoids the problem that classroom sound events cannot be detected from the video modality alone when students and teachers wear masks.
Example two
A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor controls a device in which the storage medium is located to perform a class coarse granularity sound event detection method based on audio-video feature fusion as described above.
The related technical solution is the same as the first embodiment, and will not be described herein.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. A class coarse granularity sound event detection method based on audio and video feature fusion is characterized by comprising the following steps:
Acquiring video data and audio data generated in a classroom;
Adopting the constructed video information processing model to perform face detection on video data frame by frame, and extracting state information of all mouths in each frame; carrying out human body posture detection on the video data frame by frame, and extracting posture information of all people in each frame; splicing the state information of all the mouths and the gesture information of all the people according to a time sequence to be used as video action characteristics;
Extracting audio characteristics from audio data frame by adopting the constructed audio information processing model, converting the audio data into text, and extracting text characteristics from the text frame by frame; splicing the audio features and the text features according to a time sequence to serve as audio information features;
Based on the video action characteristics and the audio information characteristics, a constructed characteristic fusion and classification model based on an attention mechanism is adopted, and a detection classification result of each frame of speaking roles is output, so that a class coarse-granularity sound event detection result is obtained, each coarse-granularity sound event comprises event start-stop time and a speaking role corresponding to the event start-stop time, and the speaking roles are divided into three categories of teachers, students and mixtures.
2. The method for detecting class coarse-granularity sound events according to claim 1, wherein the part of the video information processing model for extracting the state information of all the mouths in each frame is obtained by training based on MTCNN algorithm.
3. The classroom coarse granularity sound event detection method according to claim 1 or 2, wherein the status information of each mouth is status information composed of four key points of left and right mouth corners and upper and lower lips.
4. The classroom coarse granularity sound event detection method according to claim 1 or 2, wherein when the mask is identified, the state information of the corresponding mouth is identified using mask information.
5. The method for detecting class coarse-granularity sound events according to claim 1, wherein the part of the video information processing model for extracting the gesture information of all people in each frame is obtained by training based on AlphaPose algorithm.
6. The classroom coarse granularity sound event detection method according to any one of claims 1 to 5, wherein the video data includes teacher video data and student video data;
and splicing the video motion characteristics obtained based on the teacher video data and the video motion characteristics obtained based on the student video data according to a time sequence to serve as total video motion characteristics, and inputting the characteristics fusion and classification model with the audio information characteristics.
7. The classroom coarse granularity sound event detection method according to claim 6, wherein the posture information of the teacher extracted based on the teacher video data is posture information composed of 15 key points of the head, neck, left and right shoulders, left and right elbows, left and right hands, left and right ankles, left and right knees, left and right crotch bones, and torso;
The posture information of the student extracted based on the student video data is posture information composed of eight key points of a head, a neck, left and right shoulders, left and right elbows and left and right hands.
8. The classroom coarse granularity sound event detection method according to claim 1, wherein the audio feature is mel-frequency cepstral coefficient.
9. The method for detecting class coarse granularity sound events according to claim 1, wherein the manner of splicing the audio feature and the text feature according to the time sequence is as follows:
And respectively inputting the audio features and the text features into a CNN feature extraction network, splicing the results after feature extraction according to time sequence alignment, and inputting the results into an RNN network to obtain the audio information features combined with the context information.
10. A computer readable storage medium, characterized by a computer program stored in the computer readable storage medium, wherein the computer program, when executed by a processor, controls a device in which the storage medium is located to perform a class coarse granularity sound event detection method based on audio-video feature fusion according to one of claims 1 to 9.
CN202311820919.0A 2023-12-27 2023-12-27 Classroom coarse granularity sound event detection method based on audio and video feature fusion Pending CN118016073A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311820919.0A CN118016073A (en) 2023-12-27 2023-12-27 Classroom coarse granularity sound event detection method based on audio and video feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311820919.0A CN118016073A (en) 2023-12-27 2023-12-27 Classroom coarse granularity sound event detection method based on audio and video feature fusion

Publications (1)

Publication Number Publication Date
CN118016073A true CN118016073A (en) 2024-05-10

Family

ID=90941908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311820919.0A Pending CN118016073A (en) 2023-12-27 2023-12-27 Classroom coarse granularity sound event detection method based on audio and video feature fusion

Country Status (1)

Country Link
CN (1) CN118016073A (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160314789A1 (en) * 2015-04-27 2016-10-27 Nuance Communications, Inc. Methods and apparatus for speech recognition using visual information
US20190341025A1 (en) * 2018-04-18 2019-11-07 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing
CN110473548A (en) * 2019-07-31 2019-11-19 华中师范大学 A kind of classroom Internet analysis method based on acoustic signal
CN110807370A (en) * 2019-10-12 2020-02-18 南京摄星智能科技有限公司 Multimode-based conference speaker identity noninductive confirmation method
CN111161715A (en) * 2019-12-25 2020-05-15 福州大学 Specific sound event retrieval and positioning method based on sequence classification
CN111341318A (en) * 2020-01-22 2020-06-26 北京世纪好未来教育科技有限公司 Speaker role determination method, device, equipment and storage medium
CN111785287A (en) * 2020-07-06 2020-10-16 北京世纪好未来教育科技有限公司 Speaker recognition method, speaker recognition device, electronic equipment and storage medium
CN112599135A (en) * 2020-12-15 2021-04-02 华中师范大学 Teaching mode analysis method and system
KR20210064018A (en) * 2019-11-25 2021-06-02 광주과학기술원 Acoustic event detection method based on deep learning
CN114282621A (en) * 2021-12-29 2022-04-05 湖北微模式科技发展有限公司 Multi-mode fused speaker role distinguishing method and system
CN114399818A (en) * 2022-01-05 2022-04-26 广东电网有限责任公司 Multi-mode face emotion recognition method and device
WO2022110354A1 (en) * 2020-11-30 2022-06-02 清华珠三角研究院 Video translation method, system and device, and storage medium
WO2022116420A1 (en) * 2020-12-01 2022-06-09 平安科技(深圳)有限公司 Speech event detection method and apparatus, electronic device, and computer storage medium
CN115719516A (en) * 2022-11-30 2023-02-28 华中师范大学 Multichannel-based classroom teaching behavior identification method and system


Similar Documents

Publication Publication Date Title
CN112489635B (en) Multi-mode emotion recognition method based on attention enhancement mechanism
CN112259106B (en) Voiceprint recognition method and device, storage medium and computer equipment
Gao et al. Sign language recognition based on HMM/ANN/DP
CN112151030B (en) Multi-mode-based complex scene voice recognition method and device
US7636662B2 (en) System and method for audio-visual content synthesis
CN103218842B (en) A kind of voice synchronous drives the method for the three-dimensional face shape of the mouth as one speaks and facial pose animation
CN107972028B (en) Man-machine interaction method and device and electronic equipment
CN113822192A (en) Method, device and medium for identifying emotion of escort personnel based on Transformer multi-modal feature fusion
CN103996155A (en) Intelligent interaction and psychological comfort robot service system
Sahoo et al. Emotion recognition from audio-visual data using rule based decision level fusion
CN116304973A (en) Classroom teaching emotion recognition method and system based on multi-mode fusion
CN117765981A (en) Emotion recognition method and system based on cross-modal fusion of voice text
CN114724224A (en) Multi-mode emotion recognition method for medical care robot
CN1952850A (en) Three-dimensional face cartoon method driven by voice based on dynamic elementary access
CN115188074A (en) Interactive physical training evaluation method, device and system and computer equipment
Pujari et al. A survey on deep learning based lip-reading techniques
CN116883888A (en) Bank counter service problem tracing system and method based on multi-mode feature fusion
Liu et al. Real-time speech-driven animation of expressive talking faces
Bera et al. Identification of mental state through speech using a deep learning approach
Lan et al. Low level descriptors based DBLSTM bottleneck feature for speech driven talking avatar
CN118016073A (en) Classroom coarse granularity sound event detection method based on audio and video feature fusion
Saranya et al. Text Normalization by Bi-LSTM Model with Enhanced Features to Improve Tribal English Knowledge
Sajid et al. Multimodal emotion recognition using deep convolution and recurrent network
Hsu et al. Attentively-coupled long short-term memory for audio-visual emotion recognition
Sharma et al. Classroom student emotions classification from facial expressions and speech signals using deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination