WO2024077511A1 - Interaction statistics method, apparatus, device, system and storage medium

Interaction statistics method, apparatus, device, system and storage medium

Info

Publication number
WO2024077511A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
sound object
audio frames
frames
classroom
Prior art date
Application number
PCT/CN2022/124805
Other languages
English (en)
French (fr)
Inventor
李波
Original Assignee
广州视源电子科技股份有限公司
广州视睿电子科技有限公司
Priority date
Filing date
Publication date
Application filed by 广州视源电子科技股份有限公司, 广州视睿电子科技有限公司
Priority to CN202280007331.0A (CN118202410A)
Priority to PCT/CN2022/124805
Publication of WO2024077511A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • the embodiments of the present application relate to the field of audio processing technology, and in particular to an interactive statistics method, device, equipment, system and storage medium.
  • an active classroom atmosphere can make the relationship between teachers and students more harmonious, enrich the teaching content, and thus improve the quality and efficiency of classroom teaching.
  • the interaction between lecturers and students can be used as an evaluation indicator of classroom atmosphere.
  • School administrators can understand the interaction between teachers and students in the classroom by attending classes or listening to classroom recordings.
  • each school has dozens or even hundreds of classes every day. In this case, how to enable school administrators to quantify and effectively grasp the interaction between teachers and students in each class every day has become a technical problem that needs to be solved urgently.
  • the embodiments of the present application provide an interaction statistics method, device, equipment, system and storage medium to solve the technical problem that some technologies cannot enable school administrators to quantify and effectively grasp the interaction between teachers and students in each class.
  • an embodiment of the present application provides an interactive statistics method, including:
  • counting the number of teacher-student interactions during the teaching process according to the arrangement order of each audio frame in the time domain and the corrected sound object label of each audio frame includes:
  • the number of teacher-student interactions during the teaching process is counted.
  • the number of teacher-student interactions during the teaching process is counted according to the arrangement order of each audio frame with the sound object label of the lecturer and each audio frame with the sound object label of the non-lecturer in the time domain, including:
  • the number of times the sound object is changed from the lecturer to the non-lecturer is determined, and the number of changes is used as the number of teacher-student interactions in the teaching process.
  • the step of modifying the sound object label of the audio frame according to the first audio feature of the adjacent audio frame includes:
  • the sound object label of the audio frame is modified according to the first classification result.
  • the step of correcting the sound object label of the audio frame according to the first classification result includes:
  • the number of audio frames corresponding to the two sound object labels of the mutually adjacent audio frames is counted in all the audio frames;
  • the sound object label corresponding to the larger audio frame quantity value is selected as the corrected sound object label of the adjacent audio frame.
  • the step of correcting the sound object label of the audio frame according to the first classification result includes:
  • a first number of audio frames are respectively selected from the audio frames corresponding to each of the sound object labels
  • the sound object label with the highest similarity to the adjacent audio frame is selected as the corrected sound object label of the adjacent audio frame.
  • the interactive statistics method further includes:
  • the second classification result being a single-person voice and emotion classification result, a multi-person voice classification result, or an environmental auxiliary sound classification result;
  • the effective interaction moments and the invalid interaction moments in the teaching process are determined according to the second classification results.
  • when the second classification result is a single-person voice and emotion classification result, the second classification result also includes an age classification result of the single-person voice object;
  • the method further includes:
  • the audio frames corresponding to the lecturer are deleted.
  • obtaining a second classification result corresponding to each audio frame according to the second audio feature includes:
  • Each of the second audio features is spliced into a spectrogram, the spectrogram is sent to a second classification model, and a second classification result corresponding to each of the audio frames is obtained based on the second classification model.
  • before extracting the first audio feature of each audio frame in the classroom audio data, the method includes:
  • the audio frames belonging to the environmental noise in the classroom audio data are filtered out.
  • an embodiment of the present application further provides an interactive statistics device, including:
  • a first feature extraction unit is used to extract a first audio feature of each audio frame in the classroom audio data, where the classroom audio data is audio data obtained after recording the teaching process;
  • an audio clustering unit configured to cluster the audio frames based on the first audio features to obtain a plurality of audio classes
  • a label setting unit configured to set a corresponding sound object label for each audio frame based on the audio class, wherein the sound object labels of the audio frames in the same audio class are the same;
  • a label correction unit used to correct the sound object label of the audio frame according to the first audio feature of the mutually adjacent audio frames, wherein the mutually adjacent audio frames are two audio frames that are adjacent in the time domain in the classroom audio data;
  • the frequency counting unit is used to count the number of teacher-student interactions during the teaching process according to the arrangement order of each audio frame in the time domain and the corrected sound object label of each audio frame.
  • an embodiment of the present application further provides an interactive statistics device, including:
  • one or more processors;
  • a memory for storing one or more programs
  • when the one or more programs are executed by the one or more processors, the one or more processors implement the interactive statistical method as described in the first aspect.
  • the interactive statistics device also includes: a first recording device, which is used to record classroom audio data during the teaching process.
  • an embodiment of the present application further provides an interactive statistics system, including: a recording device and an interactive statistics device;
  • the recording device includes a second recording device, which is used to record classroom audio data during the teaching process and send the classroom audio data to the interactive statistics device;
  • the interactive statistics device includes a memory and one or more processors
  • the memory is used to store one or more programs, and the one or more programs are executed by the one or more processors, so that the one or more processors implement the interactive statistical method as described in the first aspect.
  • an embodiment of the present application further provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the interactive statistical method as described in the first aspect.
  • the first audio feature of each audio frame is extracted, and then an appropriate sound object label is set for each audio frame based on the first audio feature.
  • the audio frames belonging to the same sound object can be clearly identified, and then the number of teacher-student interactions can be counted.
  • the above process can be automatically implemented by relying on interactive statistical equipment, which is easy to quantify, so that school administrators do not need to sit in on the class or listen to the recorded classroom audio data one by one, and can quickly grasp the interaction between teachers and students in each classroom.
  • FIG1 is a flow chart of an interactive statistics method provided by an embodiment of the present application.
  • FIG2 is a framework diagram of a first audio feature extraction process provided by an embodiment of the present application.
  • FIG3 is a flow chart of an interactive statistics method provided by an embodiment of the present application.
  • FIG4 is a schematic diagram of a training structure of a first classification model training provided by an embodiment of the present application.
  • FIG5 is a schematic diagram of a sound object label provided by an embodiment of the present application.
  • FIG6 is a flow chart of an interactive statistics method provided by an embodiment of the present application.
  • FIG. 7 is a structural table diagram of a second classification model provided by an embodiment of the present application.
  • FIG8 is a schematic diagram of a second classification model application provided by an embodiment of the present application.
  • FIG9 is a schematic diagram of a second audio feature provided by an embodiment of the present application.
  • FIG10 is a schematic diagram of an interaction invalid time provided by an embodiment of the present application.
  • FIG11 is a schematic diagram of an effective interaction time provided by an embodiment of the present application.
  • FIG12 is a diagram of an activity display interface provided by an embodiment of the present application.
  • FIG13 is a schematic diagram of the structure of an interactive statistics device provided by an embodiment of the present application.
  • FIG14 is a schematic diagram of the structure of an interactive statistics device provided by an embodiment of the present application.
  • FIG. 15 is a schematic diagram of the structure of an interactive statistics system provided by an embodiment of the present application.
  • An embodiment of the present application provides an interactive statistical method, which can be executed by an interactive statistical device.
  • the interactive statistical device can be implemented by software and/or hardware.
  • the interactive statistical device can be composed of two or more physical entities or one physical entity.
  • the interactive statistical device is a computer device with data analysis and processing capabilities, such as a computer, a mobile phone, an interactive tablet, etc.
  • the interactive statistical device when the interactive statistical device executes the interactive statistical method through an application installed therein, the interactive statistical device may also be the application itself.
  • FIG1 is a flow chart of an interactive statistics method provided by an embodiment of the present application.
  • the interactive statistics method includes:
  • Step 110 extract the first audio feature of each audio frame in the classroom audio data, where the classroom audio data is audio data obtained after recording the teaching process.
  • the teaching process refers to the teaching process of the lecturer to the students.
  • the lecturer teaches the students in the classroom, and the recording device is turned on during the teaching process to collect audio data during the teaching process.
  • the audio data collected during the teaching process is recorded as classroom audio data.
  • the classroom audio data can reflect the actual situation of the teaching process, and each class can have corresponding classroom audio data.
  • the recording device is a device with a recording function.
  • the recording device can be used as an independent physical entity, which is distinguished from the interactive statistical device.
  • the physical entity containing the recording device is recorded as a recording device.
  • Each classroom can be equipped with a recording device to record classroom audio data.
  • each classroom can also be equipped with an interactive statistical device.
  • the recording device and the interactive statistical device in the classroom can be connected in a wired or wireless manner to realize sending the classroom audio data recorded by the recording device to the interactive statistical device.
  • the interactive statistical device can also be placed in an office other than the classroom.
  • an interactive statistical device can establish a wired or wireless connection with the recording device in one or more classrooms to obtain the classroom audio data recorded by the recording device.
  • the recording device can be integrated into the interactive statistical device. In this case, each classroom is equipped with an interactive statistical device, which has a recording function and can record classroom audio data.
  • classroom audio data recorded by the recording device is data in the time domain.
  • the interactive statistics device obtains the classroom audio data recorded by the recording device.
  • the classroom audio data obtained by the interactive statistics device is the audio data obtained after the recording device finishes recording, that is, the classroom audio data obtained after the class ends.
  • classroom video data can also be recorded.
  • classroom video data refers to video data obtained after video recording of the teaching process.
  • the recording method of classroom video data is currently not limited.
  • After the interactive statistics device obtains the classroom audio data, it extracts the audio features of each audio frame.
  • the audio features can reflect the signal characteristics of the audio frame.
  • the audio features may be MFCC (Mel Frequency Cepstral Coefficient) features or Fbank (Filter bank) features.
  • the audio features of each audio frame in the classroom audio data are recorded as the first audio feature of the corresponding audio frame. After that, the number of teacher-student interactions can be counted based on the first audio features.
  • the MFCC feature is used as the first audio feature for description.
  • Figure 2 is a framework diagram of the extraction process of the first audio feature provided by an embodiment of the present application.
  • the process of extracting the first audio feature includes: pre-emphasis, framing, windowing, discrete Fourier transform (DFT), Mel filter bank, Log and inverse discrete Fourier transform (iDFT).
  • Pre-emphasis is used to compensate for the amplitude of the high-frequency part of the classroom audio data.
  • the formula for pre-emphasis is: y(t) = x(t) - a·x(t-1)
  • x(t) is the sampling point of the classroom audio data at the t-th time
  • x(t-1) is the sampling point of the classroom audio data at the (t-1)-th time
  • a is the pre-emphasis coefficient, generally taking the value of 0.97
  • y(t) is the result after pre-emphasis of x(t).
  • Framing means dividing classroom audio data into multiple segments.
  • the audio data contained in each segment can be considered as a frame of audio data, that is, as an audio frame.
  • Each audio frame is arranged in the time domain according to the order of collection time.
  • Windowing is used to smooth each audio frame and eliminate the signal (i.e., audio data) discontinuity that may be caused at both ends of each audio frame.
  • using a Hamming window to implement windowing is used as an example. At this time, the formula of the Hamming window is: W(n) = (1 - a) - a·cos(2πn/(N - 1)), and the windowed data signal is S'(n) = x(n + iL)·W(n).
  • W(n) can also be written as W(n, a) or Wham(n), 0 ≤ n ≤ N - 1, N is the window length of the Hamming window, i is the frame index, L is the frame shift, x(n + iL) represents the (n + iL)-th sampling point in the window, a is generally 0.46, and S'(n) is the data signal obtained after windowing (i.e., the data signal after the audio frame is windowed).
  • DFT is used to obtain the spectrum of the windowed data signal.
  • the DFT formula is: X_a(k) = Σ_{n=0}^{N'-1} x(n)·e^(-j2πkn/N'), 0 ≤ k ≤ N' - 1
  • x(n) is the nth sampling point in the windowed data signal
  • X a (k) is the kth point in the spectrum obtained after performing DFT on x(n)
  • N' is the total number of sampling points of the windowed data signal.
  • the Mel filter bank can simplify the amplitude of the signal of the classroom audio data in the frequency domain.
  • the formula of the Mel filter bank (a set of M triangular filters) is:
    H_m(k) = 0, for k < f(m-1) or k > f(m+1);
    H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), for f(m-1) ≤ k ≤ f(m);
    H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), for f(m) < k ≤ f(m+1)
  • H_m(k) is the response of the m-th Mel filter at the k-th point of the spectrum, and applying the filters to the spectrum gives the data signal processed by the Mel filter bank.
  • M is the total number of Mel filters
  • f(m) is the m-th point in the spectrum obtained after DFT
  • f(m+1) is the (m+1)-th point in the spectrum obtained after DFT
  • f(m-1) is the (m-1)-th point in the spectrum obtained after DFT.
  • the Log operation takes the logarithm of the Mel filter bank outputs: s(m) = ln( Σ_{k=0}^{N'-1} |X_a(k)|²·H_m(k) ), 0 ≤ m < M, where s(m) represents the data signal obtained after the Log operation, and the meanings of the other parameters can refer to the aforementioned related formulas.
  • iDFT is used to convert the data signal in the frequency domain into the data signal in the time domain, and then obtain the MFCC features of each audio frame.
  • the formula of iDFT is: C(n) = Σ_{m=1}^{M} s(m)·cos(πn(m - 0.5)/M), n = 1, 2, ..., L
  • C(n) represents the data signal obtained after the iDFT operation
  • L represents the length (i.e., the number of coefficients) of the data signal obtained after the iDFT operation
  • the Fbank feature may also be used as the first audio feature.
  • the Fbank feature may be obtained after performing the Log operation in the aforementioned calculation process, without the final iDFT step.
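  • The feature extraction pipeline described above (pre-emphasis, framing, windowing, DFT, Mel filter bank, Log and iDFT) can be sketched as follows. This is a minimal illustration using the librosa library, assuming a 16 kHz mono recording at a hypothetical path "classroom.wav" and a 25 ms frame length with a 10 ms frame shift; the application itself does not fix these values.
```python
# Minimal sketch of per-frame MFCC / Fbank extraction with librosa.
import librosa
import numpy as np

def extract_first_audio_features(path="classroom.wav", sr=16000,
                                 frame_len=0.025, frame_shift=0.010,
                                 n_mfcc=13):
    y, sr = librosa.load(path, sr=sr, mono=True)
    # Pre-emphasis: y(t) = x(t) - 0.97 * x(t-1)
    y = librosa.effects.preemphasis(y, coef=0.97)
    n_fft = int(sr * frame_len)      # window length (framing + Hamming window)
    hop = int(sr * frame_shift)      # frame shift
    # Mel filter bank + Log gives Fbank features; the final DCT/iDFT step yields MFCCs.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop,
                                window="hamming")
    fbank = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop, window="hamming"))
    # One column per audio frame; transpose so each row is a first audio feature.
    return mfcc.T, fbank.T
```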
  • step 120 is executed.
  • Step 120 cluster the audio frames based on the first audio features to obtain multiple audio classes.
  • the audio frames are clustered based on each first audio feature, so that audio frames with similar first audio features are clustered into one class.
  • the clustered audio frames are recorded as audio classes.
  • the first audio feature can reflect the characteristics of the audio frame in the frequency spectrum, and the characteristics of the frequency spectrum of each audio frame in the audio data generated based on the same person speaking are also very similar (that is, the similarity between the audio features of each audio frame when the same person speaks is relatively high).
  • each audio frame can be clustered based on the first audio feature of each audio frame, so that audio frames with highly similar first audio features are clustered into one class.
  • it can be considered that one audio class obtained by clustering corresponds to one speaking object (that is, the sound-making object). This process can be considered as a preliminary unsupervised clustering of the first audio feature.
  • clustering using the Kmeans clustering algorithm is described as an example.
  • the clustering process can be: randomly select a set number of first audio features as the initial cluster centers, wherein the set number can be set according to the actual scenario. For example, in a classroom teaching scenario, the lecturer usually speaks mainly, and students are occasionally required to answer questions. Therefore, in combination with the classroom teaching scenario, the set number is set to 3, and 3 is an empirical value. Afterwards, the distance between the first audio feature of each audio frame in the classroom audio data and the three initial cluster centers is calculated, wherein the distance calculation method is currently not limited, and the closer the distance, the more similar the two first audio features are.
  • Select the cluster center with the smallest distance from the first audio feature, and classify the first audio feature and this cluster center into one category. After all audio frames have corresponding categories, the initial three audio classes can be obtained. Afterwards, recalculate the cluster center of each audio class, wherein the calculation formula of the cluster center can be: a_j = (1/|c_j|)·Σ_{x∈c_j} x, where a_j is the cluster center of the j-th audio class currently calculated, c_j is the set of first audio features in the j-th audio class, and |c_j| is the number of first audio features in c_j. The assignment and recalculation steps are repeated until the cluster centers no longer change or a maximum number of iterations is reached, and the final audio classes are obtained.
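  • A minimal sketch of the unsupervised clustering in Step 120, assuming the per-frame feature matrix from the previous sketch and the empirical cluster count of 3 mentioned above; scikit-learn's KMeans stands in for the hand-written Kmeans iteration described in the text, and the speaker1/speaker2/speaker3 naming anticipates the sound object labels of Step 130.
```python
# Minimal sketch of clustering first audio features into audio classes.
from sklearn.cluster import KMeans

def cluster_audio_frames(first_audio_features, n_classes=3):
    # first_audio_features: one row per audio frame (e.g. the MFCC matrix above).
    kmeans = KMeans(n_clusters=n_classes, n_init=10, random_state=0)
    class_ids = kmeans.fit_predict(first_audio_features)
    # Step 130: one sound object label per audio class, e.g. speaker1..speaker3.
    labels = [f"speaker{c + 1}" for c in class_ids]
    return labels, kmeans.cluster_centers_
```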
  • Step 130 Set a corresponding sound object label for each audio frame based on the audio class, and the sound object labels of audio frames in the same audio class are the same.
  • the sound object refers to the object that makes a sound (i.e., speaks) in the classroom.
  • the sound object can be a lecturer or a student.
  • the lecturer is an adult and the student is a minor.
  • the student can also be an adult, and the interactive statistical method provided in this embodiment is also applicable.
  • a sound object label is set for each audio class, and different audio classes have different sound object labels.
  • One sound object label corresponds to a type of speaking object, and each audio frame in an audio class has the same sound object label.
  • the sound object label can be used to identify the audio frames in the classroom audio data that are considered to be the same speaker.
  • the generation rules of the sound object labels are not currently limited, and the sound object labels can be composed of letters, numbers, Chinese characters and/or symbols. For example, there are currently three audio classes, and the sound object labels corresponding to the three audio classes can be recorded as speaker1, speaker2 and speaker3, respectively.
  • Step 140 Correct the sound object label of the audio frame according to the first audio feature of the adjacent audio frames, where the adjacent audio frames are two audio frames that are adjacent in the time domain in the classroom audio data.
  • the sound object label of each audio frame is corrected so that each audio frame belonging to the same sound object corresponds to the same sound object label as much as possible.
  • the first audio features of adjacent audio frames are used to correct the sound object label of each audio frame.
  • Adjacent audio frames refer to two audio frames that are adjacent in the time domain in each audio frame of the classroom audio data, that is, two audio frames collected successively.
  • the length of two audio frames can be used as the window length, starting from the first audio frame of the classroom audio data, sliding the length of one audio frame each time, and taking the two audio frames in the window as adjacent audio frames, that is, an audio frame and each of its adjacent audio frames can constitute adjacent audio frames for correcting the sound object label.
  • a neural network model is pre-trained, and the neural network model is a deep network model for identifying whether mutually adjacent audio frames belong to the same sound object, that is, whether the mutually adjacent audio frames are the same speaker.
  • the neural network model is a classification model, which can output two types of results.
  • the neural network model is recorded as a first classification model, and the result output by the neural network model is recorded as a first classification result, wherein the first classification result is that the mutually adjacent audio frames belong to the same sound object or the mutually adjacent audio frames do not belong to the same sound object (that is, they belong to different sound objects), that is, the first classification result includes two types.
  • one of the first classification results is output based on the currently input mutually adjacent audio frame. For example, each time the two first audio features of a mutually adjacent audio frame are input into the first classification model, the first classification model determines whether the mutually adjacent audio frames belong to the same sound object based on the first audio features of the mutually adjacent audio frames, and then outputs the first classification result.
  • the first classification result output by the first classification model is that adjacent audio frames belong to the same sound object
  • the sound object labels of the adjacent audio frames are the same, it means that the sound object labels of the adjacent audio frames are accurate and no correction is required. If the sound object labels of the adjacent audio frames are different, it means that the sound object labels of the adjacent audio frames are inaccurate and need to be corrected.
  • the number of audio frames with the same sound object label as the adjacent audio frames is counted. At this time, the two sound object labels of the adjacent audio frames have corresponding audio frame number values.
  • the sound object label with a larger audio frame number value is selected as the sound object label of the adjacent audio frame, that is, the two sound object labels of the adjacent audio frames are unified into the sound object label with a larger audio frame number value.
  • the first classification result output by the first classification model is that the adjacent audio frames do not belong to the same sound object, if the sound object labels of the adjacent audio frames are different, it means that the sound object labels of the adjacent audio frames are accurate and no correction is required. If the sound object labels of the adjacent audio frames are the same, it means that the sound object labels of the adjacent audio frames are inaccurate and need to be corrected.
  • the correction of the sound object labels of all audio frames can be achieved.
  • Step 150 Count the number of teacher-student interactions during the teaching process based on the arrangement order of each audio frame in the time domain and the corrected sound object label of each audio frame.
  • the number of times the sound object changes during the teaching process can be determined. For example, according to the acquisition timing of each audio frame (the acquisition timing can be reflected by the arrangement order in the time domain), when the sound object label changes, the number of changes is increased by 1, and then based on the number of changes, the number of interactions of the speakers in the classroom is obtained.
  • the sound objects in the classroom are mainly lecturers and students, that is, the interaction is mainly between lecturers and students. Therefore, the number of interactions counted based on the changes in the sound object labels can be used as the number of teacher-student interactions, that is, the number of interactions between lecturers and students.
  • the audio frames belonging to the lecturer and the audio frames belonging to non-lecturers can be determined first, and then the number of teacher-student interactions can be counted.
  • the non-lecturer's voice object mainly refers to students.
  • the lecturer is the main speaker, that is, the number of audio frames belonging to the lecturer in the classroom audio data should be the largest. Therefore, the number of audio frames corresponding to each voice object label can be combined to determine the voice object label belonging to the lecturer, and the remaining voice object labels can be understood as the voice object labels belonging to the students.
  • the voice object label with the largest number of audio frames is selected as the voice object label belonging to the lecturer, and the other voice object labels are used as the voice object labels belonging to the students (i.e., non-lecturers).
  • the audio frames belonging to the lecturer and the audio frames belonging to the students are clarified, and then the number of teacher-student interactions during the teaching process is counted in combination with the audio frames of the lecturer and the students.
  • the counting can be based on the acquisition timing of each audio frame.
  • each time the sound object changes from the lecturer to a non-lecturer, the number of teacher-student interactions is increased by 1.
  • the number of interactions between teachers and students and the classroom label are associated and saved so that the number of interactions between teachers and students can be retrieved through the classroom label.
  • Each classroom label corresponds to a teaching process (i.e., a class).
  • the relevant data of the corresponding teaching process can be retrieved through the classroom label.
  • the label content of the classroom label is currently not limited.
  • the classroom label includes time, topic, and lecturer.
  • the classroom audio data and the classroom label are also associated and saved so that the classroom audio data can be retrieved through the classroom label.
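  • A minimal sketch of associating the interaction count and the classroom audio data with a classroom label for later retrieval; the (time, topic, lecturer) key follows the label content mentioned above, and the in-memory dictionary is only an illustrative stand-in for whatever storage the interactive statistics device actually uses.
```python
# Minimal sketch of saving and retrieving results by classroom label.
interaction_records = {}

def save_interaction_count(time, topic, lecturer, count, audio_path):
    classroom_label = (time, topic, lecturer)      # one label per teaching process
    interaction_records[classroom_label] = {
        "teacher_student_interactions": count,
        "classroom_audio_data": audio_path,
    }

def lookup_by_classroom_label(time, topic, lecturer):
    # Returns the saved record for that class, or None if not found.
    return interaction_records.get((time, topic, lecturer))
```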
  • the technical means solves the technical problem that some technologies cannot enable school administrators to quantify and effectively grasp the interaction between teachers and students in the classroom.
  • the first audio feature of each audio frame is extracted, and then an appropriate sound object label is set for each audio frame based on the first audio feature.
  • the audio frames belonging to the same sound object can be clearly identified, and then the number of teacher-student interactions can be counted.
  • the above process can be automatically realized by relying on interactive statistical equipment, which is easy to quantify, so that school administrators do not need to sit in on the class or listen to the recorded classroom audio data one by one, and can quickly grasp the interaction between teachers and students in each classroom.
  • FIG3 is a flow chart of an interactive statistics method provided by an embodiment of the present application. This embodiment is exemplified on the basis of the above embodiment. Referring to FIG3, the interactive statistics method includes:
  • Step 210 filter out audio frames belonging to environmental noise in classroom audio data, where the classroom audio data is audio data obtained after recording the teaching process.
  • the recording device when recording classroom audio data, the recording device will record all sound data within the collection range, so that the classroom audio data includes not only the sound of the speaker, but also the environmental noise unrelated to the teaching process, wherein environmental noise refers to non-speech sounds unrelated to the classroom. It can be understood that if there is no sound during the teaching process, then the audio frames obtained based on the silence in the classroom audio data also belong to environmental noise.
  • the interaction statistics device obtains the classroom audio data, it filters out the environmental noise in the classroom audio data.
  • the technical means used to filter out audio frames that belong to environmental noise in classroom audio data are not currently limited.
  • a neural network model is pre-trained, and the neural network model is used to predict whether each audio frame is environmental noise, and then the environmental noise is filtered out based on the prediction results of the neural network model.
  • voice activity detection by performing voice activity detection on classroom audio data, the audio frames that belong to environmental noise are filtered out.
  • taking voice activity detection as an example how to filter out audio frames that belong to environmental noise is described.
  • the formula used for voice activity detection is the Gaussian probability: p(x_k | r_k) = exp( -(x_k - u_z)ᵀ Σ⁻¹ (x_k - u_z) / 2 ) / ( (2π)^(D/2)·|Σ|^(1/2) ), where D is the dimension of the energy feature vector (i.e., the number of subbands).
  • x_k represents the energy feature vector of the currently processed audio frame.
  • the energy feature vector refers to the energy sequence of the currently processed audio frame in each subband.
  • the subband energy is a common indicator for detecting abnormal sounds.
  • There are currently six subbands, and their frequency domains are: 80Hz-250Hz, 250Hz-500Hz, 500Hz-1KHz, 1KHz-2KHz, 2KHz-3KHz and 3KHz-4KHz.
  • x_k is the energy feature vector (feature_vector) variable and stores the sequence composed of the energies of each subband. It can be understood that there is a significant difference between the energy feature vectors of an audio frame belonging to environmental noise and an audio frame belonging to the speech of the sound object (i.e., an audio frame belonging to speech).
  • u_z and Σ are the mean and variance of the energy feature vectors of the audio frames belonging to the speech of the sound object, obtained based on the statistics of historical classroom audio data.
  • u_z and Σ belong to the statistical characteristics of the audio frames and can determine the probability of a Gaussian distribution.
  • r_k is the combination of u_z and Σ, that is, r_k includes u_z and Σ.
  • the probability that each audio frame belongs to environmental noise and the probability that it belongs to the speech audio of the sound-emitting object can be determined, and then the probability of belonging to environmental noise is filtered out according to each probability.
  • a probability threshold is set (set in combination with actual conditions), and when the probability that the audio frame belongs to environmental noise is higher than the probability threshold, the audio frame can be filtered out.
  • classroom audio data used subsequently are all audio frames from which environmental noise has been filtered out.
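  • A minimal sketch of the subband-energy, Gaussian-probability voice activity check described above, assuming u_z and Σ have already been estimated from historical classroom audio data; the six subband edges follow the text, while the probability threshold and the decision direction (low likelihood under the speech model is treated as environmental noise) are illustrative assumptions.
```python
# Minimal sketch of filtering environmental-noise frames via subband energies.
import numpy as np

SUBBANDS = [(80, 250), (250, 500), (500, 1000),
            (1000, 2000), (2000, 3000), (3000, 4000)]  # Hz, from the text

def subband_energies(frame, sr):
    # Energy of the frame in each of the six subbands (the vector x_k).
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in SUBBANDS])

def speech_probability(x_k, u_z, sigma):
    # Multivariate Gaussian likelihood of x_k under the speech model r_k = (u_z, Σ).
    d = x_k - u_z
    inv = np.linalg.inv(sigma)
    norm = np.sqrt(((2 * np.pi) ** len(x_k)) * np.linalg.det(sigma))
    return float(np.exp(-0.5 * d @ inv @ d) / norm)

def is_environmental_noise(frame, sr, u_z, sigma, prob_threshold=1e-6):
    # Assumption: a frame whose likelihood under the speech model is below the
    # threshold is treated as environmental noise and filtered out.
    return speech_probability(subband_energies(frame, sr), u_z, sigma) < prob_threshold
```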
  • Step 220 extract the first audio feature of each audio frame in the classroom audio data.
  • Step 230 cluster the audio frames based on the first audio features to obtain multiple audio classes.
  • the number of audio classes is a preset second number, which can be set according to actual conditions.
  • the second number is set to 3 or 4 to meet the basic needs of teacher-student interaction statistics. For example, taking the second number as 3 and using the Kmeans clustering algorithm as an example, at this time, the initial cluster center is 3, and after clustering, three audio classes can be obtained.
  • the audio frames corresponding to the students and the audio frames corresponding to the lecturer can be clustered into different audio classes based on the first audio features.
  • Step 240 Set a corresponding sound object label for each audio frame based on the audio class, and the sound object labels of audio frames in the same audio class are the same.
  • Step 250 input the two first audio features corresponding to the adjacent audio frames into the first classification model together, and obtain the first classification result of the adjacent audio frames based on the first classification model.
  • the first classification result is that the adjacent audio frames belong to the same sound-emitting object or the adjacent audio frames belong to different sound-emitting objects.
  • the adjacent audio frames are two audio frames that are adjacent in the time domain in the classroom audio data.
  • the first classification model is a deep network model, and its specific structure is not limited at present. After the two first audio features of the adjacent audio frames are input into the first classification model, the first classification model outputs a first classification result based on the first audio features, and the first classification result is 0 or 1, 0 indicates that the two audio frames of the adjacent audio frames belong to different sound objects, and 1 indicates that the two audio frames of the adjacent audio frames belong to the same sound object.
  • the first classification model uses supervised training, and the training data is based on thousands of classroom audio data recorded by a recording device (which can be considered as historical classroom audio data), and the first classification results of whether adjacent audio frames in the thousands of classroom audio data belong to the same sound object are known (the known first classification results can be considered as the true first classification results).
  • the cross entropy loss function is used in the training process of the first classification model, wherein the formula of the cross entropy loss function is: J_CE = NLL(x, y) = -Σ_i [ y_i·log(ŷ_i) + (1 - y_i)·log(1 - ŷ_i) ]
  • y_i represents the true first classification result, which is used to indicate whether the mutually adjacent audio frames during training actually come from the same sound object. ŷ_i represents the predicted first classification result, which is used to indicate whether the mutually adjacent audio frames predicted by the first classification model belong to the same sound object.
  • NLL(x, y) and J_CE represent the specific value of the loss function obtained for the first audio features of the currently input mutually adjacent audio frames.
  • FIG4 is a schematic diagram of a training structure for training a first classification model provided by an embodiment of the present application.
  • Network1 and Network2 represent two deep network models with the same structure. input1 and input2 represent the two first audio features corresponding to mutually adjacent audio frames, where input1 corresponds to the first audio feature of the previous audio frame and input2 corresponds to the first audio feature of the subsequent audio frame. After the two first audio features are respectively input into the two deep network models, the output results (i.e., the predicted first classification results) obtained based on the two deep network models are substituted into the loss function to obtain the specific value of the loss function. The network parameters of Network1 and Network2 are then adjusted based on the specific value, the first audio features of new mutually adjacent audio frames are input again, and the above process is repeated until the training is completed, wherein the judgment condition for the completion of the training can be that the number of trainings reaches a preset number of times or the loss function converges.
  • the gradient descent optimization algorithm is used to make the specific value of the loss function as small as possible, wherein gradient descent is a first-order optimization algorithm that is often used in the training process of neural networks.
  • the first classification model is obtained based on Network1 and Network2.
  • the first classification model is trained and applied in an interactive statistical device, or the first classification model is trained in other devices and then deployed in the interactive statistical device for application.
  • In the application of the first classification model, starting from the first audio frame of the classroom audio data, the two first audio features of the adjacent audio frames are input into the first classification model to obtain the first classification result of the two adjacent audio frames. Afterwards, the window is moved by the distance of one audio frame each time, and the two first audio features of the adjacent audio frames obtained after the movement are input into the first classification model again, and so on, so that each pair of adjacent audio frames has a corresponding first classification result.
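  • A minimal PyTorch sketch of the first classification model described above: two encoders process the first audio features of mutually adjacent audio frames, and a classification head outputs the probability that they belong to the same sound object, trained with a cross-entropy loss and gradient descent. The shared encoder weights, layer sizes and the 13-dimensional MFCC input are illustrative assumptions; the application does not limit the specific structure.
```python
# Minimal sketch of a pairwise "same sound object" classifier.
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    def __init__(self, feat_dim=13, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())

    def forward(self, x):
        return self.net(x)

class FirstClassificationModel(nn.Module):
    def __init__(self, feat_dim=13, hidden=64):
        super().__init__()
        self.encoder = FrameEncoder(feat_dim, hidden)   # shared by input1 and input2 (assumption)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, input1, input2):
        h = torch.cat([self.encoder(input1), self.encoder(input2)], dim=-1)
        return torch.sigmoid(self.head(h)).squeeze(-1)  # P(same sound object)

# Training step with the cross-entropy loss J_CE and gradient descent.
model = FirstClassificationModel()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = nn.BCELoss()   # binary cross entropy, matching J_CE above

def train_step(prev_frames, next_frames, same_object_labels):
    optimizer.zero_grad()
    pred = model(prev_frames, next_frames)
    loss = criterion(pred, same_object_labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```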
  • Step 260 Modify the sound object label of the audio frame according to the first classification result.
  • the process of modifying the utterance object label includes the following four schemes:
  • Solution 1 When the sound object labels of adjacent audio frames are the same and the first classification result is that the adjacent audio frames belong to the same sound object, this step can be further refined as follows: when adjacent audio frames correspond to the same sound object label and the first classification result is that the adjacent audio frames belong to the same sound object, the sound object labels of the adjacent audio frames are maintained.
  • Solution 2 The sound object labels of adjacent audio frames are different and the first classification result is that the adjacent audio frames belong to the same sound object.
  • this step can be further refined as follows: when adjacent audio frames correspond to different sound object labels and the first classification result is that the adjacent audio frames belong to the same sound object, in all audio frames, the number of audio frames corresponding to the two sound object labels of the adjacent audio frames are counted respectively; among the two audio frame number values obtained by counting, the sound object label corresponding to the larger audio frame number value is selected as the corrected sound object label of the adjacent audio frame.
  • the sound object labels corresponding to adjacent audio frames are recorded as target sound object labels.
  • the number of audio frames with the same sound object label as the target sound object label is counted, and each target sound object label corresponds to an audio frame number value.
  • the audio frame number value with the larger number value is selected and the target sound object label corresponding to the audio frame number value is determined.
  • the target sound object label is used as the sound object label of the two adjacent audio frames to achieve the correction of the sound object label.
  • the sound object labels of adjacent audio frames are respectively: speaker1 and speaker2.
  • the first classification result of the first classification model for the adjacent audio frames is that the two audio frames belong to the same sound object.
  • the number of audio frames whose sound object label belongs to speaker1 and the number of audio frames whose sound object label belongs to speaker2 in each audio frame of the classroom audio data are counted. Then, among the two audio frame number values, the audio frame number value with a larger numerical value is selected. For example, the audio frame number value of speaker2 is larger. Therefore, the sound object labels of the adjacent audio frames are all set to speaker2 to complete the correction of the sound object labels of the adjacent audio frames. It should be noted that in actual applications, there is no separate step of marking the target sound object label; the target sound object label is used here only for the convenience of explaining the technical solution.
  • Solution three The sound object labels of adjacent audio frames are the same and the first classification result is that the adjacent audio frames belong to different sound objects.
  • this step can be refined as follows: when adjacent audio frames correspond to the same sound object label and the first classification result is that the adjacent audio frames belong to different sound objects, cache the adjacent audio frames; after all adjacent audio frames in the classroom audio data are processed, select a first number of audio frames from each audio frame corresponding to each sound object label; determine the similarity between the audio frame selected corresponding to each sound object label and each audio frame in the adjacent audio frames; based on the similarity, select the sound object label with the highest similarity to the adjacent audio frame as the corrected sound object label of the adjacent audio frame.
  • the first quantity can be determined based on the total number of audio frames contained in the classroom audio data.
  • the sampling frequency is the same when recording each class audio data, and the duration of each class is relatively fixed and does not fluctuate much, so the range of the total number of audio frames contained in the collected classroom audio data is relatively fixed. Therefore, the first quantity can be used continuously after being pre-set according to the total number of audio frames.
  • when the first classification result is that the adjacent audio frames belong to different sound objects and the adjacent audio frames have the same sound object label, the adjacent audio frames are cached first.
  • after all adjacent audio frames in the classroom audio data have been processed (that is, the sound object labels of the adjacent pairs formed by each audio frame have been corrected), the cached audio frames are obtained.
  • if the sound object label of a cached audio frame is also corrected while processing another adjacent pair, the audio frame may be removed from the cache, or the audio frame may be retained in the cache so that its sound object label is corrected again after all audio frames are processed. If an audio frame forms adjacent audio frames with its previous audio frame and its sound object label is corrected based on that classification result, and the audio frame then forms adjacent audio frames with the subsequent audio frame for which the classification result leads to caching, the audio frame is cached.
  • a first number of audio frames are selected from each audio frame corresponding to each sound object label.
  • the similarity between the selected audio frame and the cached audio frame is calculated, wherein the similarity of the two audio frames is determined by calculating the similarity of the first audio features of the two audio frames (which can be reflected by the distance).
  • the sound object label with the highest similarity to the cached audio frame is determined based on the similarity and used as the sound object label of the audio frame.
  • the method for determining the sound object label with the highest similarity can be to select the similarity with the highest value among the similarities between each audio frame (a total of the first number) corresponding to the sound object label and the cached audio frame.
  • each sound object label corresponds to a similarity with the highest value, and then a maximum value is selected from the similarities with the highest values, and the sound object label corresponding to the maximum value is used as the sound object label with the highest similarity.
  • the similarities between each audio frame (a total of the first number) corresponding to the sound object label and the cached audio frame are averaged.
  • each sound object label corresponds to an average similarity value, and the highest average similarity value is selected from the average similarity values, and the corresponding sound object label is used as the sound object label with the highest similarity.
  • other methods can also be used to select sound object labels, which are not currently limited.
  • adjacent audio frames include audio frame 1 and audio frame 2, and the first classification results corresponding to the two are that the adjacent audio frames belong to different sound objects and the sound object labels of the adjacent audio frames are both speaker 1.
  • the two audio frames are cached.
  • the cached audio frame 1 and audio frame 2 are obtained.
  • 10 audio frames are selected from each audio frame corresponding to the three sound object labels, and the similarity between the selected audio frames and audio frame 1 and audio frame 2 is calculated respectively.
  • the sound object label of audio frame 1 is modified to speaker 1 and the sound object label of audio frame 2 is modified to speaker 3.
  • Solution 4 When the first classification result is that adjacent audio frames belong to different sound objects and the sound object labels of adjacent audio frames are different, this step can be further refined as follows: when adjacent audio frames correspond to different sound object labels and the first classification result is that adjacent audio frames belong to different sound objects, the sound object labels of adjacent audio frames are maintained.
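  • The four correction schemes above can be sketched together as follows. The parameter names are assumptions: labels is the per-frame sound object label list from clustering, features the per-frame first audio features, same_object(i) the first classification result for the pair (frame i, frame i+1), and first_number the sampling count from Solution 3; for the cached frames, the highest-valued-similarity variant described above is used.
```python
# Minimal sketch of correcting sound object labels with the first classification results.
from collections import Counter
import numpy as np

def correct_labels(labels, features, same_object, first_number=10):
    labels = list(labels)
    cached = []                                   # Solution 3: frames deferred for later
    for i in range(len(labels) - 1):
        same_label = labels[i] == labels[i + 1]
        if same_object(i):
            if not same_label:                    # Solution 2: keep the more frequent label
                counts = Counter(labels)
                winner = max((labels[i], labels[i + 1]), key=lambda s: counts[s])
                labels[i] = labels[i + 1] = winner
            # Solution 1: same label and same object -> keep the labels as they are
        else:
            if same_label:                        # Solution 3: cache, correct after all pairs
                cached.extend([i, i + 1])
            # Solution 4: different labels and different objects -> keep the labels
    # Solution 3: re-label cached frames by similarity to sampled frames of each label.
    for idx in cached:
        best_label, best_sim = labels[idx], -np.inf
        for label in set(labels):
            candidates = [j for j, l in enumerate(labels)
                          if l == label and j != idx][:first_number]
            # Similarity measured as negative distance between first audio features.
            sims = [-np.linalg.norm(features[idx] - features[j]) for j in candidates]
            if sims and max(sims) > best_sim:
                best_sim, best_label = max(sims), label
        labels[idx] = best_label
    return labels
```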
  • FIG. 5 is a schematic diagram of a sound object label provided by an embodiment of the present application. Referring to Figure 5, it shows some audio frames of classroom audio data and corresponding sound object labels. Each rectangle in Figure 5 represents an audio frame, and each audio frame has a corresponding sound object label.
  • the corresponding audio frames when the sound object speaks should be continuous. Therefore, after correcting the sound object labels of each audio frame, if the sound object label of a certain audio frame is inconsistent with the sound object labels of the previous and next audio frames, the audio frame can be deleted to avoid the situation of incorrect correction.
  • Step 270 Count the number of audio frames corresponding to each sound object label in all audio frames.
  • the number of audio frames with the same sound object label is counted. In this case, there is a corresponding number of audio frames for each sound object label.
  • Step 280 Select the sound object label with the largest audio frame quantity value as the sound object label belonging to the lecturer.
  • the lecturer speaks the most, so the number of audio frames corresponding to the lecturer is also the largest and is greater than the number of audio frames corresponding to the students. Therefore, the number of lecturers in the classroom and the number of audio frames corresponding to each sound object label can be combined to select the sound object labels with the largest numbers of audio frames, where the number of selected sound object labels is equal to the number of lecturers.
  • the selected sound object label is used as the sound object label belonging to the lecturer. After determining the sound object label belonging to the lecturer, the remaining sound object labels can be considered as the sound object labels belonging to the students.
  • the number of lecturers is generally 1, so the sound object label with the largest number of audio frames can be used as the sound object label of the lecturer, and the other sound object labels can be used as the sound object labels of non-lecturers, that is, the sound object labels of students.
  • each sound object label corresponds to a non-lecturer sound object. For example, when there are 2 non-lecturer sound object labels, it means that there are 2 non-lecturer sound objects in total.
  • Step 290 Count the number of teacher-student interactions during the teaching process according to the arrangement order of the audio frames with the sound object label of the lecturer and the audio frames with the sound object label of the non-lecturer in the time domain.
  • the audio frames belonging to the lecturer can be determined based on the sound object label belonging to the lecturer.
  • the audio frames belonging to the non-lecturer (i.e., the student) can be determined based on the sound object label belonging to the non-lecturer.
  • the number of times the lecturer and the non-lecturer's sound are changed in the time domain can be determined.
  • the change of the sound object is: speaker1-speaker2-speaker3-speaker1, and the corresponding number of changes is 3.
  • the number of changes is used as the number of interactions between teachers and students.
  • the number of times the object is changed from the lecturer to the student is determined, and the number of changes is used as the number of interactions between teachers and students.
  • the number of times the sound object is changed from the lecturer to the student is used as the number of teacher-student interactions for description.
  • this step can be further refined as follows: according to the arrangement order of each audio frame with the sound object label as the lecturer and each audio frame with the sound object label as the non-lecturer in the time domain, determine the number of times the sound object is changed from the lecturer to the non-lecturer, and use the number of changes as the number of teacher-student interactions during the teaching process.
  • the teacher-student interaction process includes at least the process of changing from the lecturer's speech to the student's speech. Based on this, according to the arrangement order of the sound objects belonging to the lecturer and the sound objects belonging to non-lecturers in the time domain, the number of changes of the sound object from the lecturer to non-lecturer is determined, and the number of changes is used as the number of teacher-student interactions.
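  • A minimal sketch of Steps 270 to 290, assuming corrected_labels is the sound object label sequence of the audio frames in time-domain order: the label with the largest frame count is taken as the lecturer, and each change of the sound object from the lecturer to a non-lecturer is counted as one teacher-student interaction.
```python
# Minimal sketch of identifying the lecturer label and counting interactions.
from collections import Counter

def count_teacher_student_interactions(corrected_labels):
    counts = Counter(corrected_labels)
    lecturer = counts.most_common(1)[0][0]        # label with the most audio frames
    interactions = 0
    for prev, curr in zip(corrected_labels, corrected_labels[1:]):
        if prev == lecturer and curr != lecturer:
            interactions += 1                     # lecturer -> non-lecturer change
    return lecturer, interactions
```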
  • the sound object label is corrected based on the first audio feature of the adjacent audio frame, ensuring the accuracy of the sound object label, and then ensuring the accuracy of the number of teacher-student interactions.
  • the audio frames belonging to the lecturer can be determined, which further ensures the accuracy of the number of teacher-student interactions.
  • the above process can be automated by interactive statistics equipment, which is easy to quantify, so that school administrators can quickly grasp the interaction between teachers and students in each classroom.
  • FIG6 is a flow chart of an interactive statistics method provided by an embodiment of the present application.
  • the interactive statistics method is based on the aforementioned interactive statistics method, and adds technical means for marking effective and invalid moments of interaction in the classroom.
  • the effective moments of interaction can also be understood as the highlight moments of the classroom atmosphere, at which time the classroom atmosphere is more active and relaxed.
  • the invalid interaction moments can also be understood as the low moments of the classroom atmosphere, at which time the classroom atmosphere is duller and more sluggish.
  • the interactive statistical method includes:
  • Step 310 Acquire the classroom audio data recorded during the teaching process. Then, execute steps 320 and 390.
  • the audio frames belonging to the ambient noise in the classroom audio data may also be filtered out.
  • step 310 after obtaining classroom audio data, in addition to counting the number of interactions between teachers and students, effective interaction moments and invalid interaction moments in the classroom can also be marked, that is, after step 310, steps 320 and 390 can be executed respectively.
  • Step 320 extract the first audio feature of each audio frame in the classroom audio data.
  • Step 330 cluster the audio frames based on the first audio features to obtain multiple audio classes.
  • Step 340 Set a corresponding sound object label for each audio frame based on the audio class, and the sound object labels of audio frames in the same audio class are the same.
  • Step 350 modify the sound object label of the audio frame according to the first audio feature of the adjacent audio frames, where the adjacent audio frames are two audio frames that are adjacent in the time domain in the classroom audio data.
  • Step 360 Count the number of audio frames corresponding to each sound object label in all audio frames.
  • Step 370 Select the sound object label with the largest number of audio frames as the sound object label belonging to the lecturer.
  • Step 380 Count the number of teacher-student interactions during the teaching process according to the arrangement order of the audio frames with the sound object label of the lecturer and the audio frames with the sound object label of the non-lecturer in the time domain.
  • Step 390 extract the second audio feature of each audio frame in the classroom audio data.
  • the second audio feature and the first audio feature have the same type and the same extraction process, which will not be described in detail at present.
  • the first audio feature may also be used as the second audio feature, in which case step 390 need not be executed and step 3100 is executed directly after step 320.
  • Step 3100 Obtain a second classification result corresponding to each audio frame according to the second audio feature, where the second classification result is a single-person voice and emotion classification result, a multi-person voice classification result, or an environmental auxiliary sound classification result.
  • classroom audio data may include laughter, applause, multiple people speaking at the same time (such as in a classroom discussion scene), a single person speaking, and other vocal atmospheres, and the audio of a sound object also carries that object's emotion, where emotions include but are not limited to normal, angry, dull/sluggish, and excited/passionate.
  • the atmosphere classification corresponding to the classroom audio data can be determined based on the second audio feature. Currently, the atmosphere classification result is recorded as the second classification result.
  • the atmosphere classification content corresponding to the second classification result can be set according to the actual situation.
  • the second classification result is a single voice and emotion classification result, a multi-person voice classification result or an environmental auxiliary sound classification result, that is, the second classification result includes three types, and one type of second classification result can be obtained each time.
  • the single voice and emotion classification result corresponds to the single voice and emotion classification.
  • the single-person voice and emotion classification means that the atmosphere of the audio frame is a single person speaking, together with the emotion corresponding to the audio frame.
  • the single-person voice and emotion classification can further include a single voice with a normal emotion, a single voice with an angry emotion, a single voice with a dull/sluggish emotion, and a single voice with an excited/passionate emotion, a total of four sub-categories.
  • when the second classification result is a single-person voice and emotion classification result, it should be refined to the specific subcategory.
  • the multi-person voice classification result means that the atmosphere of the audio frame is multiple people speaking at the same time.
  • the environmental auxiliary sound classification result means that the atmosphere of the audio frame is an auxiliary sound related to the classroom atmosphere.
  • Environmental auxiliary sound can further include subcategories such as applause and laughter.
  • the second classification result should be refined to specific subcategories when it is the classification result of environmental auxiliary sound.
  • a neural network model is pre-trained, and the neural network model is a convolutional network model, which is used to output the second classification result corresponding to the audio frame.
  • the specific structure of the neural network model is not limited at present.
  • the neural network model can be considered as a classification model.
  • the neural network model is recorded as the second classification model. At this time, this step can specifically include: splicing each second audio feature into a spectrogram, sending the spectrogram to the second classification model, and obtaining the second classification result corresponding to each audio frame based on the second classification model.
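  • Because the exact layer configuration of FIG. 7 is not reproduced here, the following PyTorch code is only a schematic stand-in for the second classification model: per-frame second audio features are spliced into a spectrogram-like image in collection-time order and passed through a small convolutional classifier; the seven output classes (four single-voice emotions, multi-person voices, applause, laughter) are an assumed labeling, and the result for a spliced segment is assigned to every audio frame in that segment.
```python
import torch
import torch.nn as nn

NUM_CLASSES = 7  # assumed: 4 single-voice emotions + multi-person voices + applause + laughter

class SecondClassifier(nn.Module):
    """Schematic convolutional classifier; the actual structure is given in FIG. 7."""
    def __init__(self, num_classes=NUM_CLASSES):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.head = nn.Linear(32 * 4 * 4, num_classes)

    def forward(self, spectrogram):          # spectrogram: (batch, 1, feature_dim, num_frames)
        x = self.features(spectrogram)
        return self.head(x.flatten(1))       # logits over the second classification results

# Splice hypothetical per-frame second audio features (num_frames, feature_dim)
# into a single-channel spectrogram-like image in collection-time order.
features = torch.randn(200, 13)
spectrogram = features.T.unsqueeze(0).unsqueeze(0)   # (1, 1, feature_dim, num_frames)
logits = SecondClassifier()(spectrogram)
```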
  • FIG7 is a structural table diagram of a second classification model provided by an embodiment of the present application.
  • the image input to the second classification model is a spectrogram of classroom audio data, and the spectrogram can be obtained based on the second audio feature.
  • after the second audio features of the audio frames are arranged and spliced in order of the frames' collection times, a spectrogram can be obtained.
  • continuous audio frames belonging to the same sound object label can also be spliced into a spectrogram.
  • the classroom audio data can correspond to multiple spectrograms.
  • the spectrogram is input into the second classification model so that the second classification model determines the second classification result to which each audio frame in the classroom audio data belongs based on the spectrogram.
  • the second classification model uses supervised training, and the training data is based on classroom audio data of thousands of classes recorded by a recording device (which can be considered as historical classroom audio data), and the second classification results corresponding to each audio frame in the classroom audio data of thousands of classes are known (the known second classification results can be considered as the real second classification results).
  • the second classification model and the first classification model can use the same training data or different training data, which is not currently limited.
  • a cross-entropy loss function is used in the training process of the second classification model, wherein the formula of the cross-entropy loss function refers to the formula used by the first classification model, except that, in the cross-entropy loss function of the second classification model, y_i represents the true second classification result and ŷ_i represents the predicted second classification result.
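  • A hedged sketch of one supervised training step with the cross-entropy loss; the optimizer settings are assumptions, since the disclosure only specifies gradient-descent-style optimization.
```python
import torch
import torch.nn as nn

def train_step(model, optimizer, spectrogram_batch, true_second_results):
    """One supervised step.

    spectrogram_batch: (batch, 1, feature_dim, num_frames) spectrogram images.
    true_second_results: (batch,) class indices of the real second classification results.
    """
    criterion = nn.CrossEntropyLoss()             # cross-entropy between true and predicted results
    optimizer.zero_grad()
    logits = model(spectrogram_batch)             # predicted second classification results (logits)
    loss = criterion(logits, true_second_results)
    loss.backward()                               # gradient-descent-style optimization of the loss
    optimizer.step()
    return loss.item()

# Usage with the SecondClassifier sketch above (assumed optimizer settings):
# model = SecondClassifier()
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
```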
  • the training process of the second classification model is similar to that of the first classification model, and will not be described in detail here.
  • FIG 8 is a schematic diagram of a second classification model application provided by an embodiment of the present application. Referring to Figure 8, taking three spectrograms as an example, three second classification results can be obtained after the three spectrograms are input into the second classification model. In Figure 8, the three second classification results are applause, laughter, and multiple voices. It should be noted that the three second classification models shown in Figure 8 are to illustrate the classification of the three spectrograms. In actual applications, only one second classification model needs to be used.
  • Step 3110 Determine the effective interaction moments and the invalid interaction moments during the teaching process according to the second classification result.
  • the second classification results belonging to the effective interaction moments and the second classification results belonging to the invalid interaction moments are predetermined.
  • the second classification results belonging to the effective interaction moments include: applause, laughter, multiple people speaking at the same time, and a single person speaking with an excited/passionate emotion.
  • the second classification results belonging to the invalid interaction moments include: a single person speaking with an angry emotion, and a single person speaking with a dull/sluggish emotion.
  • the audio frames corresponding to the second classification results belonging to the effective interaction moments are marked as audio frames of the effective interaction moments, and the collection time corresponding to the audio frames of the effective interaction moments is determined as the effective interaction moments.
  • the audio frames corresponding to the second classification results belonging to the invalid interaction moments are marked as audio frames of the invalid interaction moments, and the collection time corresponding to the audio frames of the invalid interaction moments is determined as the invalid interaction moments.
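  • A small sketch of this marking rule; the class-name strings and the frame-to-time conversion (a 10 ms hop is assumed) are illustrative.
```python
EFFECTIVE = {"applause", "laughter", "multi_person", "single_excited"}   # assumed class names
INVALID = {"single_angry", "single_dull"}

def mark_interaction_moments(second_results, hop_seconds=0.010):
    """second_results: one second classification result per audio frame, in time order."""
    effective_times, invalid_times = [], []
    for frame_idx, result in enumerate(second_results):
        t = frame_idx * hop_seconds                 # collection time of this audio frame
        if result in EFFECTIVE:
            effective_times.append(t)
        elif result in INVALID:
            invalid_times.append(t)
    return effective_times, invalid_times
```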
  • the effective interaction time, the invalid interaction time and the number of teacher-student interactions are associated and saved, so that school administrators can clearly identify the effective interaction time and invalid interaction time in each classroom.
  • FIG9 is a schematic diagram of a second audio feature provided by an embodiment of the present application. Referring to FIG9, it is a heat map of each second audio feature obtained after processing the classroom audio data.
  • FIG10 is a schematic diagram of an invalid interaction moment provided by an embodiment of the present application. Referring to FIG10, it is the position of the audio frame belonging to the invalid interaction moment in the classroom audio data corresponding to FIG9. FIG10 can be considered as a true result.
  • FIG11 is a schematic diagram of an effective interaction moment provided by an embodiment of the present application. Referring to FIG11, it shows the probability of each second classification result corresponding to the effective interaction moment for each audio frame in the classroom audio data corresponding to FIG9. The higher the probability, the more accurate the corresponding second classification result.
  • the probability that the 4th second belongs to laughter is higher, and the probability that the 8th to 9th seconds belong to applause (i.e., clapping) is higher. Therefore, the audio frame corresponding to the 4th second and the audio frames between the 8th and 9th seconds can be marked as audio frames belonging to effective interaction moments. Moreover, in combination with FIG10, it can be seen that the audio frames marked as invalid interaction moments in FIG10 have smaller probability values of belonging to effective interaction moments in FIG11.
  • the activity corresponding to each moment in the class can be determined in combination with the effective interaction time, the invalid interaction time and the number of teacher-student interactions, wherein the calculation method of the activity is not currently limited.
  • the activity at the effective interaction time is higher.
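  • The disclosure deliberately leaves the activity calculation open; purely as one illustrative scoring (not the patent's formula), activity per minute could be derived from the marked moments and the times of lecturer-to-non-lecturer changes, with effective moments weighted positively and invalid moments negatively.
```python
from collections import Counter

def activity_per_minute(effective_times, invalid_times, interaction_times):
    """All inputs are lists of timestamps in seconds; returns {minute_index: score}.

    Illustrative scoring only: +2 per effective-moment frame, -1 per invalid-moment
    frame, +5 per teacher-student interaction falling in that minute.
    """
    score = Counter()
    for t in effective_times:
        score[int(t // 60)] += 2
    for t in invalid_times:
        score[int(t // 60)] -= 1
    for t in interaction_times:
        score[int(t // 60)] += 5
    return dict(score)
```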
  • school administrators can determine the teaching quality of the class by checking the activity of the class. It should be noted that lecturers can also view the statistical results of their own teaching process.
  • Figure 12 is an activity display interface diagram provided by an embodiment of the present application, which shows the activity of a class at various times. It can be seen from Figure 12 that the activity is the highest in the 4th minute after the start of the class.
  • the classroom atmosphere can also be determined based on the second audio features of each audio frame in the classroom audio data (i.e., single-person voice and emotion classification results, multi-person voice classification results, or environmental auxiliary sound classification results), thereby obtaining effective interaction times and invalid interaction times.
  • This allows for the identification of more dimensional information from classroom audio data alone, making it easier for school administrators to better grasp the classroom situation.
  • when the second classification model recognizes a single person's voice, it can also predict the age of the corresponding sound object.
  • when the second classification result is a single-person voice and emotion classification result, the second classification result also includes the age classification result of the single-person sound object. Accordingly, after the sound object label with the largest number of audio frames is selected as the sound object label belonging to the lecturer, the method also includes: obtaining the age distribution range corresponding to the lecturer, and deleting, from the audio frames corresponding to the lecturer, the audio frames whose age classification results do not fall within the age distribution range.
  • when the second classification model outputs a single-person voice and emotion classification result, it can also output the age classification result of the sound object; that is, when the second classification model is trained, training for predicting the age classification result when a single person speaks is added.
  • the age classification result can be understood as the predicted age of the single-person voice object, which can be a specific age value or an age range (the span value of the age range can be set according to actual conditions).
  • the label of the sound object can also be corrected according to the age classification result.
  • the age distribution of the lecturer and the student is different, so there are obvious differences in the audio features when the two speak.
  • the age distribution range of the lecturer is pre-set, and the age distribution range can be set according to the actual situation to reflect the age distribution of the lecturer.
  • when the second classification result includes the single-person voice, emotion classification and age classification results, each audio frame belonging to the lecturer is obtained, and it is determined whether the age classification result corresponding to each such audio frame falls within the age distribution range. If it does, the audio frame should indeed be an audio frame of the lecturer; if not, the audio frame is not an audio frame of the lecturer and is therefore deleted.
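  • A sketch of this filtering step; the age distribution range used below is an assumed example, not a value given in the disclosure.
```python
def filter_lecturer_frames(lecturer_frames, age_results, age_range=(22, 65)):
    """Keep only lecturer-labelled frames whose predicted age falls in the lecturer's age range.

    lecturer_frames: indices of frames whose sound object label is the lecturer.
    age_results: mapping frame index -> predicted age from the second classification model.
    age_range: assumed lecturer age distribution range; frames without an age
               prediction (e.g. not single-person voice) are kept.
    """
    low, high = age_range
    return [i for i in lecturer_frames
            if low <= age_results.get(i, low) <= high]
```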
  • the accuracy of the audio frame to which the lecturer belongs can be ensured, thereby ensuring the accuracy of the number of teacher-student interactions.
  • Figure 13 is a structural diagram of an interactive statistics device provided by an embodiment of the present application.
  • the interactive statistics device includes: a first feature extraction unit 401, an audio clustering unit 402, a label setting unit 403, a label correction unit 404 and a count statistics unit 405.
  • the first feature extraction unit 401 is used to extract the first audio feature of each audio frame in the classroom audio data, and the classroom audio data is the audio data obtained after recording the teaching process;
  • the audio clustering unit 402 is used to cluster each audio frame based on each first audio feature to obtain multiple audio classes;
  • the label setting unit 403 is used to set a corresponding sound object label for each audio frame based on the audio class, and the sound object labels of each audio frame in the same audio class are the same;
  • the label correction unit 404 is used to correct the sound object label of the audio frame according to the first audio features of the adjacent audio frames, and the adjacent audio frames are two audio frames adjacent in the time domain in the classroom audio data;
  • the count statistics unit 405 is used to count the number of teacher-student interactions in the teaching process according to the arrangement order of each audio frame in the time domain and the corrected sound object label of each audio frame.
  • the count statistics unit 405 includes: a first count statistics subunit, used to count the number of audio frames corresponding to each of the sound object labels in all the audio frames; a lecturer label determination subunit, used to select the sound object label with the largest number of audio frames as the sound object label belonging to the lecturer; and an interaction count statistics subunit, used to count the number of teacher-student interactions during the teaching process according to the arrangement order, in the time domain, of the audio frames whose sound object label is the lecturer and the audio frames whose sound object label is a non-lecturer.
  • the interaction count statistics subunit is specifically used to determine the number of times the sound object is changed from lecturer to non-lecturer based on the arrangement order of each audio frame with the sound object label as lecturer and each audio frame with the sound object label as non-lecturer in the time domain, and use the number of changes as the number of teacher-student interactions during the teaching process.
  • the label correction unit 404 includes: a first classification subunit, used to input two first audio features corresponding to adjacent audio frames into a first classification model together, and obtain a first classification result of the adjacent audio frames based on the first classification model, the first classification result being that the adjacent audio frames belong to the same sound object or the adjacent audio frames belong to different sound objects; a first correction subunit, used to correct the sound object label of the audio frame according to the first classification result.
  • the first correction subunit includes: a second quantity counting subunit, which is used to count the audio frame quantity values corresponding to the two sound object labels of the adjacent audio frames in all the audio frames when the adjacent audio frames correspond to different sound object labels and the first classification result is that the adjacent audio frames belong to the same sound object; and a quantity selection subunit, which is used to select the sound object label corresponding to the larger audio frame quantity value from the two audio frame quantity values obtained by counting as the corrected sound object label of the adjacent audio frame.
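  • A sketch of this count-based correction rule (adjacent frames predicted to be the same sound object but carrying different labels): both frames are reassigned to whichever of the two labels covers more frames overall.
```python
from collections import Counter

def correct_same_object(frame_labels, i, j):
    """frame_labels: per-frame sound object labels; i, j: indices of the adjacent frames.

    Called when the first classification result says frames i and j share one sound
    object but their labels differ: keep the label with the larger overall frame count.
    """
    counts = Counter(frame_labels)
    winner = max((frame_labels[i], frame_labels[j]), key=lambda lab: counts[lab])
    frame_labels[i] = frame_labels[j] = winner
    return frame_labels
```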
  • the first correction subunit includes: an audio caching subunit, which is used to cache the adjacent audio frames when the adjacent audio frames correspond to the same sound object label and the first classification result is that the adjacent audio frames belong to different sound objects; an audio selection subunit, which is used to select a first number of audio frames from each audio frame corresponding to each sound object label after all adjacent audio frames in the classroom audio data have been processed; a similarity determination subunit, which is used to determine the similarity between the audio frame selected corresponding to each sound object label and each audio frame in the adjacent audio frames; and a label selection subunit, which is used to select the sound object label with the highest similarity to the adjacent audio frame as the corrected sound object label of the adjacent audio frame based on the similarity.
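  • And a sketch of the similarity-based correction (adjacent frames sharing a label but predicted to be different sound objects): after all adjacent pairs are processed, a cached frame is compared against a first number of frames sampled per label and takes the most similar label; cosine similarity and the sample size are assumptions.
```python
import numpy as np

def relabel_cached_frame(cached_feature, frames_by_label, features, first_number=10):
    """Pick the sound object label whose sampled frames are most similar to the cached frame.

    cached_feature: first audio feature of the cached frame.
    frames_by_label: {label: [frame indices currently carrying that label]}.
    features: array of first audio features, indexed by frame.
    first_number: how many frames to sample per label (assumed value).
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    best_label, best_sim = None, -np.inf
    for label, idxs in frames_by_label.items():
        sample = idxs[:first_number]                 # first_number frames for this label
        sim = max(cosine(features[k], cached_feature) for k in sample)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label
```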
  • it also includes: a second feature extraction unit, used to extract the second audio feature of each audio frame in the classroom audio data; a second classification unit, used to obtain a second classification result corresponding to each audio frame based on the second audio feature, the second classification result being a single-person voice and emotion classification result, a multi-person voice classification result or an environmental auxiliary sound classification result; an interaction determination unit, used to determine the effective interaction time and invalid interaction time during the teaching process based on the second classification result.
  • when the second classification result is a single-person voice and emotion classification result, the second classification result also includes an age classification result of the single-person sound object; the device also includes: an age acquisition unit, used to obtain the age distribution range corresponding to the lecturer after the sound object label with the largest number of audio frames is selected as the sound object label belonging to the lecturer; and an audio deletion unit, used to delete, from the audio frames corresponding to the lecturer, the audio frames whose age classification results do not fall within the age distribution range.
  • the second classification unit is specifically used to: splice each of the second audio features into a spectrogram, send the spectrogram to a second classification model, and obtain a second classification result corresponding to each of the audio frames based on the second classification model, and the second classification result is a single-person voice and emotion classification result, a multi-person voice classification result, or an environmental auxiliary sound classification result.
  • the device also includes a noise filtering unit, which is used to filter out audio frames belonging to environmental noise in the classroom audio data before the first audio feature of each audio frame in the classroom audio data is extracted.
  • An interactive statistics device provided in one embodiment of the present application is included in interactive statistics equipment and can be used to execute the interactive statistics method in any of the above embodiments, and has corresponding functions and beneficial effects.
  • FIG14 is a schematic diagram of the structure of an interactive statistics device provided by an embodiment of the present application.
  • the interactive statistics device includes a processor 50 and a memory 51; the number of processors 50 in the interactive statistics device can be one or more, and FIG14 takes one processor 50 as an example; the processor 50 and the memory 51 in the interactive statistics device can be connected via a bus or other means, and FIG14 takes the connection via a bus as an example.
  • the memory 51 is a computer-readable storage medium that can be used to store software programs, computer executable programs and modules, such as the program instructions/modules corresponding to the interactive statistics device in the embodiment of the present application when executing the interactive statistics method (for example, the first feature extraction unit 401, the audio clustering unit 402, the label setting unit 403, the label correction unit 404 and the count statistics unit 405 in the interactive statistics device).
  • the processor 50 executes various functional applications and data processing of the interactive statistics device by running the software programs, instructions and modules stored in the memory 51, that is, realizes the interactive statistics method executed by the above-mentioned interactive statistics device.
  • the memory 51 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and at least one application required for a function; the data storage area may store data created according to the use of the interactive statistical device, etc.
  • the memory 51 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one disk storage device, a flash memory device, or other non-volatile solid-state storage device.
  • the memory 51 may further include a memory remotely arranged relative to the processor 50, and these remote memories may be connected to the interactive statistical device via a network. Examples of the above-mentioned network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the interactive statistics device may also include an input device, which may be used to receive input digital or character information and generate key signal input related to user settings and function control of the interactive statistics device.
  • the interactive statistics device may also include an output device, which may include an audio playback device and other devices.
  • the interactive statistics device may also include a display screen, which may be a display screen with a touch function, and the display screen is displayed according to the instructions of the processor 50.
  • the interactive statistics device may also include a communication device for realizing a communication function.
  • the interactive statistics device may further include a recording device.
  • the recording device included in the interactive statistics device is recorded as a first recording device, and the first recording device is used to record classroom audio data during the teaching process.
  • the interactive statistics device is integrated with a recording function and can record the sound in the classroom.
  • the interactive statistical device may not include the first recording device, but cooperate with an independent recording device to obtain the classroom audio data recorded by the recording device.
  • the above-mentioned interactive statistical equipment includes an interactive statistical device, which can be used to execute the interactive statistical method executed by the interactive statistical equipment, and has corresponding functions and beneficial effects.
  • FIG15 is a schematic diagram of the structure of an interactive statistics system provided by an embodiment of the present application.
  • the interactive statistics system includes a recording device 60 and an interactive statistics device 70 .
  • the recording device 60 includes a second recording device, which is used to record classroom audio data during the teaching process and send the classroom audio data to the interactive statistics device 70;
  • the interactive statistics device 70 includes a memory and one or more processors.
  • the memory is used to store one or more programs, and the one or more programs are executed by the one or more processors, so that the one or more processors implement the interactive statistics method provided in the above embodiments.
  • the recording device included in the recording device 60 is recorded as a second recording device, and the classroom audio data is recorded by the second recording device.
  • recording the classroom audio data and processing the classroom audio data are thus handled by different physical entities.
  • the recording device 60 and the interactive statistics device 70 can be connected by wire or wirelessly.
  • the interactive statistics device 70 can refer to the relevant description of the interactive statistics device 70 in the above-mentioned embodiment, and the difference is that the current interactive statistics device 70 may have the first recording device or may not have the first recording device. That is, regardless of whether the interactive statistics device 70 has the first recording device, the classroom audio data is recorded by the recording device 60, so that the installation location of the interactive statistics device 70 can be more flexible and not limited to the classroom.
  • an interactive statistics device 70 can be coordinated with one or more recording devices 60 .
  • FIG. 15 currently takes one recording device 60 as an example.
  • the classroom audio data recorded by the recording device 60 is sent to the interactive statistics device 70.
  • the interactive statistics device 70 obtains the classroom audio data and processes it to implement the interactive statistics method in the aforementioned embodiment, and has corresponding functions and beneficial effects.
  • One embodiment of the present application also provides a storage medium containing computer-executable instructions, which, when executed by a computer processor, are used to perform relevant operations in the interactive statistical method provided in any embodiment of the present application, and have corresponding functions and beneficial effects.
  • the application can adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment in combination with software and hardware.
  • the application can adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
  • the application is described with reference to the flowchart and/or block diagram of the method, device (system) and computer program product according to the embodiment of the application. It should be understood that each flow and/or box in the flow chart and/or block diagram and the combination of the flow chart and/or box in the flow chart and/or block diagram can be realized by computer program instructions.
  • These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of a computer or other programmable data processing device produce a device for realizing the function specified in one flow chart or multiple flows and/or one box or multiple boxes of a block diagram.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product including an instruction device that implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device so that a series of operating steps are performed on the computer or other programmable device to produce a computer-implemented process, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • a computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.
  • the memory may include non-permanent storage in a computer-readable medium, random access memory (RAM) and/or non-volatile memory in the form of read-only memory (ROM) or flash memory (flash RAM).
  • Computer readable media include permanent and non-permanent, removable and non-removable media that can be implemented by any method or technology to store information.
  • Information can be computer readable instructions, data structures, program modules or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disk read-only memory (CD-ROM), digital versatile disk (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices or any other non-transmission media that can be used to store information that can be accessed by a computing device.
  • computer readable media does not include temporary computer readable media (transitory media), such as modulated data signals and carrier waves.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

本申请实施例公开了一种互动统计方法、装置、设备、系统及存储介质,该方法包括:提取授课过程中录制的课堂音频数据中每个音频帧的第一音频特征;基于各第一音频特征对各音频帧进行聚类,得到多个音频类;基于音频类为每个音频帧设置对应的发声对象标签,同一音频类中各音频帧的发声对象标签相同;根据互邻音频帧的第一音频特征对音频帧的发声对象标签进行修正,互邻音频帧为课堂音频数据中在时域上相邻的两个音频帧;根据各音频帧在时域上的排列顺序以及各音频帧修正后的发声对象标签,统计授课过程中的师生互动次数。采用上述方法可以解决相关技术中,无法使学校的管理者量化且有效的掌握各节课堂的师生互动情况的技术问题。

Description

互动统计方法、装置、设备、系统及存储介质 技术领域
本申请实施例涉及音频处理技术领域,尤其涉及一种互动统计方法、装置、设备、系统及存储介质。
背景技术
教学场景下,活跃的课堂气氛可以使师生关系更为融洽,使教学内涵更加丰富,进而提高课堂教学的质量和效率。其中,讲师和学生的互动情况可以作为课堂气氛的一个评价指标,互动越多,表明课堂气氛越活跃。学校的管理者可以通过旁听课堂或收听课堂录音来明确课堂上的师生互动情况,但是,每个学校每天都有数十节乃至数百节课,这种情况下,如何让学校的管理者量化且有效的掌握每天各节课堂的师生互动情况,成为了亟需解决的技术问题。
发明内容
本申请实施例提供了一种互动统计方法、装置、设备、系统及存储介质,以解决一些技术中,无法使学校的管理者量化且有效的掌握各节课堂的师生互动情况的技术问题。
第一方面,本申请一个实施例提供了一种互动统计方法,包括:
提取所述课堂音频数据中每个音频帧的第一音频特征,所述课堂音频数据是对授课过程进行录音后得到的音频数据;
基于各所述第一音频特征对各所述音频帧进行聚类,得到多个音频类;
基于所述音频类为每个所述音频帧设置对应的发声对象标签,同一所述音频类中各所述音频帧的发声对象标签相同;
根据互邻音频帧的所述第一音频特征对音频帧的发声对象标签进行修正,所述互邻音频帧为所述课堂音频数据中在时域上相邻的两个音频帧;
根据各所述音频帧在时域上的排列顺序以及各所述音频帧修正后的发声对象标签,统计授课过程中的师生互动次数。
在本申请一个实施例中,所述根据各所述音频帧在时域上的排列顺序以及各所述音频帧修正后的发声对象标签,统计授课过程中的师生互动次数,包括:
在全部所述音频帧中,统计每个所述发声对象标签对应的音频帧数量值;
选择所述音频帧数量值最大的发声对象标签作为属于讲师的发声对象标签;
根据发声对象标签为讲师的各音频帧以及发声对象标签为非讲师的各音频帧在时域上的排列顺序,统计授课过程中的师生互动次数。
在本申请一个实施例中,所述根据发声对象标签为讲师的各音频帧以及发声对象标签为 非讲师的各音频帧在时域上的排列顺序,统计授课过程中的师生互动次数,包括:
根据发声对象标签为讲师的各音频帧以及发声对象标签为非讲师的各音频帧在时域上的排列顺序,确定发声对象由讲师更换为非讲师的更换次数,并将所述更换次数作为所述授课过程中的师生互动次数。
在本申请一个实施例中,所述根据互邻音频帧的所述第一音频特征对音频帧的发声对象标签进行修正,包括:
将互邻音频帧对应的两个第一音频特征一同输入至第一分类模型,基于所述第一分类模型得到所述互邻音频帧的第一分类结果,所述第一分类结果为所述互邻音频帧属于同一发声对象或所述互邻音频帧属于不同发声对象;
根据所述第一分类结果修正音频帧的发声对象标签。
在本申请一个实施例中,所述根据所述第一分类结果修正音频帧的发声对象标签,包括:
所述互邻音频帧对应不同的发声对象标签且所述第一分类结果为所述互邻音频帧属于同一发声对象时,在全部所述音频帧中,分别统计所述互邻音频帧的两个发声对象标签对应的音频帧数量值;
在统计得到的两个所述音频帧数量值中,选择较大的音频帧数量值对应的发声对象标签作为所述互邻音频帧修正后的发声对象标签。
在本申请一个实施例中,所述根据所述第一分类结果修正音频帧的发声对象标签,包括:
所述互邻音频帧对应相同的发声对象标签且所述第一分类结果为所述互邻音频帧属于不同发声对象时,缓存所述互邻音频帧;
所述课堂音频数据中各互邻音频帧均处理完毕后,在每个所述发声对象标签对应的各音频帧中分别选择第一数量的音频帧;
确定每个所述发声对象标签对应选择的音频帧与所述互邻音频帧中每个音频帧的相似性;
根据所述相似性,选择与所述互邻音频帧相似度最高的发声对象标签作为所述互邻音频帧修正后的发声对象标签。
在本申请一个实施例中,所述互动统计方法还包括:
提取所述课堂音频数据中每个音频帧的第二音频特征;
根据所述第二音频特征得到各音频帧对应的第二分类结果,所述第二分类结果为单人发声及情绪分类结果、多人发声分类结果或环境辅助音分类结果;
根据所述第二分类结果确定授课过程中的互动有效时刻和互动无效时刻。
在本申请一个实施例中,所述第二分类结果为单人发声及情绪分类结果时,所述第二分类结果还包括单人发声对象的年龄分类结果;
所述选择所述音频帧数量值最大的发声对象标签作为属于讲师的发声对象标签之后,还包括:
获取讲师对应的年龄分布范围;
在所述讲师对应的各音频帧中,删除所述年龄分类结果未落在所述年龄分布范围的音频帧。
在本申请一个实施例中,所述根据所述第二音频特征得到各音频帧对应的第二分类结果,包括:
将各所述第二音频特征拼接成语谱图,将所述语谱图发送至第二分类模型,基于所述第二分类模型得到各所述音频帧对应的第二分类结果。
在本申请一个实施例中,所述提取课堂音频数据中每个音频帧的第一音频特征之前,包括:
滤除所述课堂音频数据中属于环境噪声的音频帧。
第二方面,本申请一个实施例还提供了一种互动统计装置,包括:
第一特征提取单元,用于提取所述课堂音频数据中每个音频帧的第一音频特征,所述课堂音频数据是对授课过程进行录音后得到的音频数据;
音频聚类单元,用于基于各所述第一音频特征对各所述音频帧进行聚类,得到多个音频类;
标签设置单元,用于基于所述音频类为每个所述音频帧设置对应的发声对象标签,同一所述音频类中各所述音频帧的发声对象标签相同;
标签修正单元,用于根据互邻音频帧的所述第一音频特征对音频帧的发声对象标签进行修正,所述互邻音频帧为所述课堂音频数据中在时域上相邻的两个音频帧;
次数统计单元,用于根据各所述音频帧在时域上的排列顺序以及各所述音频帧修正后的发声对象标签,统计授课过程中的师生互动次数。
第三方面,本申请一个实施例还提供了一种互动统计设备,包括:
一个或多个处理器;
存储器,用于存储一个或多个程序;
当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现如第一方面所述的互动统计方法。
在本申请一个实施例中,所述互动统计设备还包括:第一录音装置,所述第一录音装置用于录制授课过程中的课堂音频数据。
第四方面,本申请一个实施例还提供了一种互动统计系统,包括:录音设备和互动统计 设备;
所述录音设备包括第二录音装置,所述第二录音装置用于录制授课过程中的课堂音频数据,并将所述课堂音频数据发送至所述互动统计设备;
所述互动统计设备包括存储器以及一个或多个处理器;
所述存储器用于存储一个或多个程序,所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现如第一方面所述的互动统计方法。
第五方面,本申请一个实施例还提供了一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现如第一方面所述的互动统计方法。
本申请一个实施例中,通过提取授课过程中录制的课堂音频数据中每个音频帧的第一音频特征,基于第一音频特征对各音频帧进行聚类,以将第一音频特征相似的音频帧聚为一类,并且为同一类的音频帧设置相同的发声对象标签,之后,基于时域上互邻音频帧的第一音频特征修正音频帧的发声对象标签,进而基于修正后的发声对象标签,统计授课过程中的师生互动次数的技术手段,解决了一些技术中,无法使学校的管理者量化且有效的掌握课堂师生互动情况的技术问题,利用音频处理技术,提取各音频帧的第一音频特征,进而基于第一音频特征为各音频帧设置适当的发声对象标签,可以明确属于同一发声对象的音频帧,进而统计出师生互动次数,上述过程可以依靠互动统计设备自动化实现,便于量化,使得学校的管理者无需旁听课堂也无需逐个听取录制的课堂音频数据,便可快速掌握各课堂的师生互动情况。
附图说明
图1为本申请一个实施例提供的一种互动统计方法的流程图;
图2为本申请一个实施例提供的一种第一音频特征的提取流程框架图;
图3为本申请一个实施例提供的一种互动统计方法的流程图;
图4为本申请一个实施例提供的一种第一分类模型训练时的训练结构示意图;
图5为本申请一个实施例提供的一种发声对象标签示意图;
图6为本申请一个实施例提供的一种互动统计方法的流程图;
图7为本申请一个实施例提供的第二分类模型的结构表格图;
图8为本申请一个实施例提供的一种第二分类模型应用示意图;
图9为本申请一个实施例提供的一种第二音频特征示意图;
图10为本申请一个实施例提供的一种互动无效时刻示意图;
图11为本申请一个实施例提供的一种互动有效时刻示意图;
图12为本申请一个实施例提供的一种活跃度显示界面图;
图13为本申请一个实施例提供的一种互动统计装置的结构示意图;
图14为本申请一个实施例提供的一种互动统计设备的结构示意图;
图15为本申请一个实施例提供的一种互动统计系统的结构示意图。
具体实施方式
下面结合附图和实施例对本申请作详细说明。可以理解的是,此处所描述的具体实施例用于解释本申请,而非对本申请的限定。另外还需要说明的是,为了便于描述,附图中仅示出了与本申请相关的部分而非全部结构。
本申请一个实施例提供了一种互动统计方法,该互动统计方法可以由互动统计设备执行,互动统计设备可以通过软件和/或硬件的方式实现,互动统计设备可以是两个或多个物理实体构成,也可以是一个物理实体构成。
一个实施例中,互动统计设备是具有数据分析以及处理能力的计算机设备,如互动统计设备可以是电脑、手机、交互平板等,一些实施方式中,互动统计设备通过其中安装的应用程序执行互动统计方法时,互动统计设备还可以是应用程序本身。
互动统计设备执行互动统计方法时,图1为本申请一个实施例提供的一种互动统计方法的流程图,参考图1,该互动统计方法包括:
步骤110、提取课堂音频数据中每个音频帧的第一音频特征,课堂音频数据是对授课过程进行录音后得到的音频数据。
授课过程是指讲师面向学生的教学过程,一个实施例中,讲师在教室内对学生进行授课,且授课过程中开启录音装置,以采集授课过程中的音频数据,当前,将采集的授课过程中的音频数据记为课堂音频数据,通过课堂音频数据可以反映授课过程的真实情况,每节课均可存在对应的课堂音频数据。可理解,录音装置是具有录音功能的装置。一种可选方式,录音装置可以作为独立的物理实体,与互动统计设备区别开来,当前,将包含录音装置的物理实体记为录音设备。每个教室均可配置录音设备,以实现录制课堂音频数据。可选的,每个教室还可配置互动统计设备。教室中的录音设备和互动统计设备可以通过有线或无线的方式连接,以实现将录音设备录制的课堂音频数据发送至互动统计设备。还可选的,互动统计设备还可以放置在非教室的办公室中,此时,一个互动统计设备可与一个或多个教室中的录音设备建立有线或无线连接,以获取录音设备录制的课堂音频数据。另一种可选方式,录音装置可集成在互动统计设备中,此时,每个教室均配置有互动统计设备,该互动统计设备具有录音功能且可以录制课堂音频数据。
需要说明,录制装置录制的课堂音频数据为时域上的数据。
示例性的,互动统计设备获取录音装置录制到的课堂音频数据。当前,互动统计设备获 取的课堂音频数据为录音装置结束录制后得到的音频数据,即课堂结束后得到的课堂音频数据。
可选的,授课过程中,除了录制课堂音频数据,还可以录制课堂视频数据,课堂视频数据是指对授课过程进行视频录制后得到的视频数据,课堂视频数据的录制方式当前不作限定。
互动统计设备获取课堂音频数据后,提取其中每个音频帧的音频特征,音频特征可以反映音频帧的信号特征,如Mel频率倒谱系数(Mel Frequency Cepstrum Coefficient,MFCC)特征、Filter bank(Fbank)特征均是较为常用的音频特征。当前,将课堂音频数据中每个音视频的音频特征记为对应音频帧的第一音频特征,之后,便可基于第一音频特征统计师生互动次数。
示例性的,以MFCC特征作为第一音频特征为例进行描述,此时,图2为本申请一个实施例提供的一种第一音频特征的提取流程框架图,参考图2,提取第一音频特征的过程包括:预加重、分帧、加窗、离散傅里叶变换(Discrete Fourier transform,DFT)、Mei滤波器组、Log和离散傅里叶逆变换(iDFT)。
预加重用于补偿课堂音频数据中高频部分的振幅,预加重的公式为:
y(t)=x(t)-αx(t-1)
其中,x(t)为课堂音频数据在第t时刻的采样点,x(t-1)为课堂音频数据在第t-1时刻的采样点,α为系数,一般取值0.97,y(t)为对x(t)预加重后的结果。
分帧是指将课堂音频数据分为多段,每一段包含的音频数据可以认为是一帧音频数据,即作为一个音频帧。各音频帧根据采集的时间先后顺序在时域上排列。
加窗用于平滑各音频帧,消除各音频帧两端可能会造成的信号(即音频数据)不连续性。一个实施例中,以使用汉明窗实现加窗为例,此时,汉明窗的公式为:
S’(n)=S(n)×W(n)
Figure PCTCN2022124805-appb-000001
S(n)=x(n+iL)
其中,W(n)也可以写成W(n,a)或Wham(n),0≤n≤N-1,N是汉明窗的窗长,i是帧索引,L是帧移,x(n+iL)表示窗内第n+iL个的采样点,a一般为0.46,S’(n)为加窗后得到的数据信号(即音频帧加窗后的数据信号)。
DFT用于得到加窗后的数据信号的频谱,DFT的公式为:
Figure PCTCN2022124805-appb-000002
其中,x(n)为加窗后的数据信号中第n个采样点,0≤n≤N’-1,X a(k)为对x(n)进行DFT后得到的频谱中第k个点,0≤k≤N’-1,N’为加窗后的数据信号的采样点总数量。
Mei滤波器组可以对课堂音频数据在频域上的信号的幅值进行精简,Mei滤波器组的公式为:
Figure PCTCN2022124805-appb-000003
其中,H (m)(k)为经Mei滤波器组处理后得到的数据信号在频域上的第k个点,
Figure PCTCN2022124805-appb-000004
M为Mei滤波器的总数量,f(m)为DFT后得到的频谱中第m个点,f(m+1)为DFT后得到的频谱中第m+1个点,f(m-1)为DFT后得到的频谱中第m-1个点。
由于人耳听到的声音与信号本身的大小是幂次方关系,所以需要对Mei滤波器处理后的信号做Log(即对数),Log的公式为:
Figure PCTCN2022124805-appb-000005
其中,s(m)表示进行Log操作后得到的数据信号,其余各参数含义可参考前述相关公式。
iDFT用于将频域上的数据信号转换成时域上的数据信号,进而得到各音频帧的MFCC特征。iDFT的公式为:
Figure PCTCN2022124805-appb-000006
其中,C(n)表示iDFT操作后得到的数据信号,L表示iDFT操作后得到的数据信号的长度。
可选的,也可以使用Frank特征作为第一音频特征,此时,在前述计算过程中,进行Log操作后可得到Frank特征。
基于上述过程得到各音频帧的第一音频特征后,执行步骤120。
步骤120、基于各第一音频特征对各音频帧进行聚类,得到多个音频类。
示例性的,得到每个音频帧的第一音频特征后,基于各第一音频特征对音频帧进行聚类, 以将第一音频特征相似的音频帧聚为一个类,当前,将聚类后的音频帧记为音频类。可理解,基于步骤110中第一音频特征的提取过程可知,第一音频特征可以体现音频帧在频谱上的特征,而基于同一个人说话生成的音频数据中各音频帧在频谱上的特征也很相似(即同一个人说话时的各音频帧的音频特征间相似度较高),基于此,可以基于各音频帧的第一音频特征对各音频帧进行聚类,以将第一音频特征高度相似的音频帧聚为一类,此时,可以认为聚类得到的一个音频类对应一个说话对象(即发声对象)。该过程可以认为是对第一音频特征进行初步的无监督聚类。
一个实施例中,以使用Kmeans聚类算法进行聚类为例进行描述,此时,聚类过程可以是:随机选择设定数量的第一音频特征作为初始的聚类中心,其中,设定数量可以根据实际场景设置,例如,课堂教学场景中,通常以讲师说话为主,偶尔需要学生回答问题,因此,结合课堂教学场景,设定数量设置为3,3为经验值。之后,计算课堂音频数据中每个音频帧的第一音频特征与三个初始的聚类中心的距离,其中,距离的计算方式当前不做限定,距离越近两个第一音频特征越相似。之后,确定与第一音频特征距离最小的聚类中心,并将该第一音频特征与这个聚类中心归为一类,全部音频帧均有对应的类别后,可以得到初始的三个音频类。之后。重新计算各音频类的聚类中心,其中,聚类中心的计算公式可以为:
Figure PCTCN2022124805-appb-000007
a j为当前计算得到的第j个音频类的聚类中心,+c j为第j个音频类,|+c j+|为第j个音频类包含的音频帧个数。x表示第j个音频类对应的第一音频特征。计算得到三个聚类中心后,重复计算课堂音频数据中每个音频帧的第一音频特征与三个聚类中心的距离,以基于距离对各音频帧进行聚类,并再次基于聚类结果重新计算聚类中心,重复前述过程,直到聚类的次数达到设定的次数阈值,或者是,直到重新计算的聚类中心与前次计算的聚类中心的区别较小(即误差变化最小),之后,得到三个音频类。
步骤130、基于音频类为每个音频帧设置对应的发声对象标签,同一音频类中各音频帧的发声对象标签相同。
发声对象是指课堂中讲话的对象,当前,发声对象可以是讲师或学生。一个实施例中,讲师属于成年人,学生属于未成年人。实际应用中,学生也可以属于成年人,同样适用本实施例提供的互动统计方法。
示例性的,为每个音频类设置一个发声对象标签,不同音频类的发声对象标签不同。一个发声对象标签对应一类说话对象,一个音频类内的各音频帧具有相同的发声对象标签,通过发声对象标签可以明确课堂音频数据中被认为是同一说话人的音频帧。可选的,发声对象标签的生成规则当前不做限定,发声对象标签可以由字母、数字、汉字和/或符号组成,例如, 当前音频类为三个,三个音频类对应的发声对象标签可分别记为speaker1、speaker2和speaker3。
步骤140、根据互邻音频帧的第一音频特征对音频帧的发声对象标签进行修正,互邻音频帧为课堂音频数据中在时域上相邻的两个音频帧。
示例性的,为了保证互动统计的准确性,确定各音频帧对应的发声对象标签后,对各音频帧的发声对象标签进行修正,以使属于同一发声对象的各音频帧尽可能对应相同的发声对象标签。一个实施例中,基于发声对象一次说话过程中对应的音频帧应是连续的特性,利用互邻音频帧的第一音频特征修正各音频帧的发声对象标签,互邻音频帧是指课堂音频数据的各音频帧中在时域上相邻的两个音频帧,即先后采集的两个音频帧。选择互邻音频帧时,可以是以两个音频帧的长度作为窗口长度,从课堂音频数据的第一个音频帧开始,每次滑动一个音频帧的长度,并取窗口中的两个音频帧作为互邻音频帧,即一个音频帧与其相邻的每个音频帧均可组成互邻音频帧,以用于修正发声对象标签。
选择互邻音频帧后,基于互邻音频帧的第一音频特征确定互邻音频帧是否属于同一发声对象。一个实施例中,预先训练一个神经网络模型,该神经网络模型为深度网络模型,用于识别互邻音频帧是否属于同一发声对象,即互邻音频帧是否为同一个说话人。可选的,神经网络模型为分类模型,其可以输出两类结果,当前,将该神经网络模型记为第一分类模型,将神经网络模型输出的结果记为第一分类结果,其中,第一分类结果为互邻音频帧属于同一发声对象或互邻音频帧不属于同一发声对象(即属于不同发声对象),即第一分类结果共包括两种类型,第一分类模型工作过程中,基于当前输入的互邻音频帧输出其中一种第一分类结果。示例来说,每次将一个互邻音频帧的两个第一音频特征输入至第一分类模型,由第一分类模型基于互邻音频帧的第一音频特征确定互邻音频帧是否属于同一发声对象,进而输出第一分类结果。
一个实施例中,第一分类模型输出的第一分类结果为互邻音频帧属于同一发声对象时,如果互邻音频帧的发声对象标签相同,则说明互邻音频帧的发声对象标签准确,无需进行修正,如果互邻音频帧的发声对象标签不同,则说明互邻音频帧的发声对象标签不准确,需要进行修正。修正时,在课堂音频数据的全部音频帧中,统计与互邻音频帧具有相同发声对象标签的音频帧数量值,此时,互邻音频帧的两个发声对象标签均有对应的音频帧数量值,之后,选择音频帧数量值较多的发声对象标签作为互邻音频帧的发声对象标签,即将互邻音频帧的两个发声对象标签统一为音频帧数量值较多的发声对象标签。第一分类模型输出的第一分类结果为互邻音频帧不属于同一发声对象时,如果互邻音频帧的发声对象标签不同,则说明互邻音频帧的发声对象标签准确,无需进行修正,如果互邻音频帧的发声对象标签相同,则说明互邻音频帧的发声对象标签不准确,需要进行修正。修正时,先缓存当前的互邻音频 帧,待全部互邻音频帧均处理完成(即基于第一分类结果修正发声对象标签)后,在各发声对象标签对应的音频帧中,选择一定数量(其值可根据实际情况选择)的音频帧,之后,分别计算所选择的音频帧与互邻音频帧中两个音频帧的相似度,进而基于相似度为互邻音频帧中的两个音频帧选择合适的发声对象标签,以使互邻音频帧具有不同的发声对象标签。通过上述方式,可以实现对全部音频帧的发声对象标签的修正。
可理解,实际应用中,还可以采用其他的方式确定互邻音频帧是否属于同一发声对象,进而基于确定结果修正发声对象标签,当前不作限定。
步骤150、根据各音频帧在时域上的排列顺序以及各音频帧修正后的发声对象标签,统计授课过程中的师生互动次数。
示例性的,基于各音频帧在时域上的排列顺序以及各音频帧当前的发声对象标签,可以确定授课过程中发声对象的更换次数,例如,按照各音频帧的采集时序(采集时序可通过时域上的排列顺序体现),确定发声对象标签发生变化时,更换次数加1,进而基于更换次数,得到课堂上的说话人的互动次数,一般而言,课堂上的发声对象主要是讲师和学生,即互动主要是讲师和学生的互动,因此,可以将基于发声对象标签变化情况统计的互动次数作为师生互动次数,即讲师和学生的互动次数。
示例性的,为了保证师生互动次数的准确性,可以先确定属于讲师的音频帧和属于非讲师的音频帧,再统计师生互动次数,其中,课堂场景下,非讲师的发声对象主要是指学生。授课过程中,讲师是主要的说话人,即课堂音频数据中属于讲师的音频帧的数量应该是最多的,因此,可以结合各发声对象标签所对应的音频帧的数量值,确定属于讲师的发声对象标签,剩余的发声对象标签便可理解为属于学生的发声对象标签,例如,选择音频帧数量值最大的发声对象标签作为属于讲师的发声对象标签,其他的发声对象标签作为属于学生(即非讲师)的发声对象标签,进而明确属于讲师的音频帧和属于学生的音频帧,之后,结合讲师和学生的音频帧统计授课过程中的师生互动次数。在统计时,可以是按照各音频帧的采集时序,当确定各音频帧由讲师变化到学生时,确定师生互动次数加1。
可选的,确定师生互动次数后,将师生互动次数和课堂标签关联保存,以便通过课堂标签调取出师生互动次数。每个课堂标签对应一次授课过程(即一节课),通过课堂标签可以调取到对应的授课过程的相关数据,课堂标签的标签内容当前不作限定。例如,课堂标签包括时间、题目和讲课教师。还可选的,课堂音频数据和课堂标签也关联保存,以通过课堂标签调取课堂音频数据。
上述,通过提取授课过程中录制的课堂音频数据中每个音频帧的第一音频特征,基于第一音频特征对各音频帧进行聚类,以将第一音频特征相似的音频帧聚为一类,并且为同一类 的音频帧设置相同的发声对象标签,之后,基于时域上互邻音频帧的第一音频特征修正音频帧的发声对象标签,进而基于修正后的发声对象标签,统计授课过程中的师生互动次数的技术手段,解决了一些技术中,无法使学校的管理者量化且有效的掌握课堂师生互动情况的技术问题,利用音频处理技术,提取各音频帧的第一音频特征,进而基于第一音频特征为各音频帧设置适当的发声对象标签,可以明确属于同一发声对象的音频帧,进而统计出师生互动次数,上述过程可以依靠互动统计设备自动化实现,便于量化,使得学校的管理者无需旁听课堂也无需逐个听取录制的课堂音频数据,便可快速掌握各课堂的师生互动情况。
图3为本申请一个实施例提供的一种互动统计方法的流程图。本实施例是在上述实施例的基础上进行示例化,参考图3,该互动统计方法包括:
步骤210、滤除课堂音频数据中属于环境噪声的音频帧,课堂音频数据是对授课过程进行录音后得到的音频数据。
示例性的,录制课堂音频数据时,录音装置会对采集范围内的全部声音数据进行录制,这样使得课堂音频数据不仅包括发声对象讲话的声音,还包括与授课过程无关的环境噪声,其中,环境噪声是指与课堂无关的非讲话声音,可理解,若授课过程中出现无声音的情况,那么,课堂音频数据中基于无声音的情况得到的音频帧也属于环境噪声。
环境噪声对于确定师生互动次数没有帮助,因此,互动统计设备获取到课堂音频数据后,向滤除课堂音频数据中的环境噪声。
滤除课堂音频数据中属于环境噪声的音频帧所使用的技术手段当前不作限定。例如,预先训练一个神经网络模型,通过神经网络模型预测各音频帧是否为环境噪声,进而基于神经网络模型的预测结果过滤掉环境噪声。再如,通过对课堂音频数据进行语音活动检测,过滤掉其中属于环境噪声的音频帧。一个实施例中,以语音活动检测为例,描述如何滤除属于环境噪声的音频帧。其中,语音活动检测时使用的公式为:
Figure PCTCN2022124805-appb-000008
其中,x k表示当前处理的音频帧的能量特征向量,能量特征向量是指当前处理的音频帧在各子带中的能量序列,子带能量是检测异常声音常用的指标,当前共有六个子带,其频率域分别是:80Hz-250Hz、250Hz-500Hz、500Hz-1KHz、1KHz-2KHz、2KHz-3KHz以及3KHz-4KHz,x k为变量的特征向量(feature_vector)且存放各子带能量组成的序列,可理解,属于环境噪声的音频帧和属于发声对象讲话的音频帧(即属于语音的音频帧)的能量特征向量存在明显的差别。u z和σ是基于历史的课堂音频数据统计得到的属于发声对象讲话的音频帧的能量特征向量的均值和方差,u z和σ属于音频帧的统计特征且可以决定高斯分布的概率,r k 是u z和σ的结合,即r k包括u z和σ。Z为0或1,其中,Z=0时,f表示音频帧属于环境噪声的概率,Z=1时,f表示音频帧属于发声对象讲话音频的概率。根据上述公式,可以确定出各音频帧属于环境噪声的概率和属于发声对象讲话音频的概率,之后,根据各概率滤除属于环境噪声的概率。例如,设置一个概率阈值(结合实际情况设置),当音频帧属于环境噪声的概率高于该概率阈值,便可滤除该音频帧。
需要说明,后续使用的课堂音频数据均是已经滤除环境噪声的音频帧。
步骤220、提取课堂音频数据中每个音频帧的第一音频特征。
步骤230、基于各第一音频特征对各音频帧进行聚类,得到多个音频类。
一个实施例中,音频类的数量为预先设置的第二数量,第二数量可以结合实际情况设置,课堂教学场景下,发声对象只有讲师和学生,且以讲师说话为主,因此,第二数量设置为3或4便可以满足师生互动统计的基本需求。示例性的,以第二数量为3且采用Kmeans聚类算法为例,此时,初始的聚类中心为3,且聚类后,可以得到三个音频类。
需说明,课堂上可能有多个学生回答问题,但是,由于学生和讲师在年龄上存在明显的差别,因此,学生和讲师所对应的第一音频特征也存在较为明显的差别,而相似年龄间的第一音频特征的相似度高于非相似年龄间的第一音频特征的相似度,即使为第二数量设置较小的数值,也可以基于第一音频特征将学生对应的音频帧和讲师对应的音频帧聚类为不同的音频类。
步骤240、基于音频类为每个音频帧设置对应的发声对象标签,同一音频类中各音频帧的发声对象标签相同。
步骤250、将互邻音频帧对应的两个第一音频特征一同输入至第一分类模型,基于第一分类模型得到互邻音频帧的第一分类结果,第一分类结果为互邻音频帧属于同一发声对象或互邻音频帧属于不同发声对象,互邻音频帧为课堂音频数据中在时域上相邻的两个音频帧。
第一分类模型为深度网络模型,其具体结构当前不作限定。将互邻音频帧的两个第一音频特征输入至第一分类模型后,由第一分类模型基于第一音频特征输出第一分类结果,且第一分类结果为0或1,0表示互邻音频帧的两个音频帧属于不同发声对象,1表示互邻音频帧的两个音频帧属于相同发声对象。
一个实施例中,第一分类模型采用监督训练,训练数据是基于录音装置录制得到的数千节课的课堂音频数据(可认为是历史课堂音频数据)且数千节课的课堂音频数据中互邻音频帧是否属于同一发声对象的第一分类结果已知(已知的第一分类结果可以认为是真实的第一分类结果)。可选的,第一分类模型训练过程中使用交叉熵损失函数,其中,交叉熵损失函数的公式为:
Figure PCTCN2022124805-appb-000009
其中,y i表示真实的第一分类结果,真实的第一分类结果用于表示训练时的互邻音频帧实际上是否为同一发声对象,
Figure PCTCN2022124805-appb-000010
表示预测的第一分类结果,预测的第一分类结果用于表示第一分类模型预测的互邻音频帧是否为同一发声对象,NLL(x,y)和J CE表示当前输入互邻音频帧的第一音频特征后得到的损失函数的具体值。
图4为本申请一个实施例提供的一种第一分类模型训练时的训练结构示意图,参考图4,Network1和Network2表示两个结构一样的深度网络模型,input1和input2表示互邻音频帧对应的两个第一音频特征,input1对应互邻音频帧中前一音频帧的第一音频特征,将两个第一音频特征分别输入至两个深度网络模型后,将基于两个深度网络模型得到的输出结果(即预测的第一分类结果)代入损失函数中,以得到损失函数的具体值,进而基于具体值调整Network1和Network2的网络参数,之后,再次输入新的互邻音频帧的第一音频特征,重复上述过程,直到训练完成,其中,训练完成的判断条件可以是训练的次数达到预先设定的次数或者是损失函数收敛。在训练过程中,使用梯度下降最优算法,以使得损失函数的具体值优化至尽可能小,其中,梯度下降是一阶最优化算法,常被应用于神经网络的训练过程。训练完成后,基于Network1和Network2得到第一分类模型。可选的,第一分类模型在互动统计设备中训练并应用,或者是,第一分类模型在其他设备中训练后部署在互动统计设备中应用。第一分类模型应用过程中,从课堂音频数据的第一个音频帧开始,将互邻音频帧的两个第一音频特征输入至第一分类模型,以得到两个互邻音频帧的第一分类结果。之后,每次移动一个音频帧的距离,并再次将移动后得到的互邻音频帧的两个第一音频特征输入至第一分类模型,以此类推,进而使得每个互邻音频帧均有对应的第一分类结果。
步骤260、根据第一分类结果修正音频帧的发声对象标签。
示例性的,发声对象标签的修正过程包括下述四种方案:
方案一、互邻音频帧的发声对象标签相同且第一分类结果为互邻音频帧属于同一发声对象,此时,本步骤可细化为:互邻音频帧对应相同的发声对象标签且第一分类结果为互邻音频帧属于同一发声对象时,维持互邻音频帧的发声对象标签。
方案二、互邻音频帧的发声对象标签不同且第一分类结果为互邻音频帧属于同一发声对象,此时,本步骤可细化为:互邻音频帧对应不同的发声对象标签且第一分类结果为互邻音频帧属于同一发声对象时,在全部音频帧中,分别统计互邻音频帧的两个发声对象标签对应的音频帧数量值;在统计得到的两个所述音频帧数量值中,选择较大的音频帧数量值对应的 发声对象标签作为互邻音频帧修正后的发声对象标签。
当前,为了便于理解,将互邻音频帧对应的发声对象标签记为目标发声对象标签,在课堂音频数据的全部音频帧中,统计与目标发声对象标签具有相同的发声对象标签的音频帧数量值,每个目标发声对象标签分别对应一个音频帧数量值。之后,在两个音频帧数量值中,选择数量值较大的音频帧数量值并确定该音频帧数量值对应的目标发声对象标签,之后,将该目标发声对象标签作为互邻的两个音频帧的发声对象标签,以实现发声对象标签的修正。举例而言,互邻音频帧的发声对象标签分别是:speaker1和speaker2,第一分类模型对互邻音频帧的第一分类结果为两个音频帧属于相同的发声对象标签,此时,统计课堂音频数据的各音频帧中,发声对象标签属于speaker1的音频帧数量值以及发声对象标签属于speaker2的音频帧数量值,之后,在两个音频帧数量值中,选择数值比较大的音频帧数量值,例如,speaker2的音频帧数量值较大,因此,将互邻音频帧的发声对象标签都设置为speaker2,以完成对互邻音频帧的发声对象标签的修正。需要说明,实际应用中,并没有标记目标发声对象标签的环节,当前使用目标发声对象表仅是为了便于对技术方案的解释。
方案三、互邻音频帧的发声对象标签相同且第一分类结果为互邻音频帧属于不同发声对象,此时,本步骤可细化为:互邻音频帧对应相同的发声对象标签且第一分类结果为互邻音频帧属于不同发声对象时,缓存互邻音频帧;课堂音频数据中各互邻音频帧均处理完毕后,在每个发声对象标签对应的各音频帧中分别选择第一数量的音频帧;确定每个发声对象标签对应选择的音频帧与互邻音频帧中每个音频帧的相似性;根据相似性,选择与互邻音频帧相似度最高的发声对象标签作为互邻音频帧修正后的发声对象标签。
示例性的,第一数量可以根据课堂音频数据包含的音频帧总数量值确定,一般而言,对于学校而言,录制各节课堂音频数据时的采样频率相同,并且,每节课的时长较为固定且浮动不大,所以,采集的课堂音频数据所包含的音频帧总数量值的范围较为固定,因此,第一数量根据音频帧总数量值预先被设置后便可以持续使用。当第一分类结果为互邻音频帧属于不同发声对象且互邻音频帧具有相同发声对象标签时,先缓存互邻音频帧,当全部音频帧均处理完毕(即每个音频帧的互邻音频帧的发声对象标签均被修正)后,获取缓存的音频帧,可理解,缓存音频帧时,可以认为该音频帧的发声对象标签也被修正,以便于确定课堂音频数据中每个互邻音频帧均处理完毕。可选的,若一个音频帧与其相邻的前一音频帧被缓存,那么,当该音频帧和后一音频帧组成互邻音频帧且基于分类结果修正发声对象标签后,在缓存中剔除该音频帧,或者是,在缓存中保留该音频帧以在全部音频帧处理完毕后再次修改该音频帧的发声对象标签,若一个音频帧与其相邻的前一音频帧组成互邻音频帧且基于分类结果修正发声对象标签后,当该音频帧和后一音频帧组成互邻音频帧且基于分类结果确定被缓 存时,缓存该音频帧。获取缓存的音频帧后,在每个发声对象标签对应的各音频帧中均选择第一数量的音频帧。之后,计算所选择的音频帧和缓存的音频帧的相似度,其中,通过计算两个音频帧的第一音频特征的相似度(可通过距离体现)确定两个音频帧的相似度。之后,基于相似度确定与缓存的音频帧相似度最高的发声对象标签并作为音频帧的发声对象标签。可选的,确定相似度最高的发声对象标签的方式可以是在发声对象标签对应的各音频帧(共有第一数量)与缓存的音频帧间的各相似度中选择数值最高的相似度,此时,每个发声对象标签对应一个数值最高的相似度,之后,在各数值最高的相似度中选择一个最大值,并将最大值对应的发声对象标签作为相似度最高的发声对象标签,或者是,将发声对象标签对应的各音频帧(共有第一数量)与缓存的音频帧间的各相似度取平均,此时,每个发声对象标签对应一个相似度平均值,在各相似度平均值中选择最高的相似度平均值,并将对应的发声对象标签作为相似度最高的发声对象标签。实际应用中,也可以采用其他的方式选择发声对象标签,当前不作限定。举例而言,互邻音频帧包括音频帧1和音频帧2,且两者对应的第一分类结果为互邻音频帧属于不同发声对象且互邻音频帧的发声对象标签均为speaker1,此时,将两个音频帧缓存,之后,对课堂音频数据的全部音频帧处理完毕(即课堂音频数据中每个互邻音频帧均处理完毕)后,获取缓存的音频帧1和音频帧2,之后,在三个发声对象标签对应的各音频帧中分别选择10个音频帧,分别计算所选择的音频帧和音频帧1以及音频帧2的相似度,之后,结合相似度确定对应发声对象标签speaker1对应的10个音频帧与音频帧1相似度最高,发声对象标签speaker3对应的10个音频帧与音频帧2相似度最高,因此,修正音频帧1的发声对象标签为speaker1、音频帧2的发声对象标签为speaker3。
方案四、第一分类结果为互邻音频帧属于不同发声对象且互邻音频帧的发声对象标签不同,此时,本步骤可细化为:互邻音频帧对应不同的发声对象标签且第一分类结果为互邻音频帧属于不同发声对象时,维持互邻音频帧的发声对象标签。
修正后,课堂音频数据中每个音频帧均有修正后的发声对象标签。图5为本申请一个实施例提供的一种发声对象标签示意图,参考图5,其示出了课堂音频数据的部分音频帧以及对应的发声对象标签,图5中每个矩形代表一个音频帧,各音频帧均有对应的发声对象标签。可选的,发声对象讲话时对应的音频帧应该是连续的,所以,当修正各音频帧的发声对象标签后,若某个音频帧的发声对象标签与前后音频帧的发声对象标签均不一致,则可以删除该音频帧,以避免误修正的情况。
步骤270、在全部音频帧中,统计每个发声对象标签对应的音频帧数量值。
示例性的,在课堂音频数据的全部音频帧中,统计具有相同发声对象标签的音频帧数量值,此时,一个发声对象标签均存在对应的一个音频帧数量值。
步骤280、选择音频帧数量值最大的发声对象标签作为属于讲师的发声对象标签。
一般而言,授课场景下,讲师说话最多,讲师对应的音频帧的数量也最大,且大于学生对应的音频帧的数量,因此,可以结合课堂中讲师的数量以及各发声对象标签对应的音频帧数量值,选择音频帧数量值最大的发声对象标签,且选择的发声对象标签与讲师的数量相等,将选择的发声对象标签作为属于讲师的发声对象标签,确定属于讲师的发声对象标签后,剩余的发声对象标签可以认为是属于学生的发声对象标签。举例而言,讲师数量一般为1,因此,可以将音频帧数量值最大的一个发声对象标签作为讲师的发声对象标签,其他的发声对象标签便可作为非讲师的发声对象标签,即学生的发声对象标签。可选的,属于非讲师的发声对象标签为多个时,每个发声对象标签对应一个非讲师的发声对象,如非讲师的发声对象标签为2个时,说明非讲师的发声对象共有2个。
步骤290、根据发声对象标签为讲师的各音频帧以及发声对象标签为非讲师的各音频帧在时域上的排列顺序,统计授课过程中的师生互动次数。
示例性的,根据属于讲师的发声对象标签可以确定属于讲师的音频帧,同样,根据属于非讲师(即学生)的发声对象标签可以确定属于非讲师的音频帧,之后,根据属于讲师的各音频帧和属于非讲师的各音频帧在时域上的排列顺序,从课堂音频数据的第一个音频帧开始,可以确定讲师和非讲师在时域上发声的更换次数。例如,参考图5,发声对象的更换情况是:speaker1-speaker2-speaker3-speaker1,对应的更换次数为3。之后,将更换次数作为师生互动次数。或者是,从课堂音频数据的第一个音频帧开始,确定发生对象由讲师更换到学生的更换次数,将更换次数作为师生互动次数。一个实施例中,以将发声对象由讲师更换为学生的更换次数作为师生互动次数为进行描述,此时,本步骤可细化为:根据发声对象标签为讲师的各音频帧以及发声对象标签为非讲师的各音频帧在时域上的排列顺序,确定发声对象由讲师更换为非讲师的更换次数,并将更换次数作为授课过程中的师生互动次数。
可选的,师生互动过程至少包括由讲师说话更换为学生讲话的过程,基于此,根据属于讲师的发声对象以及属于非讲师的发声对象在时域上的排列顺序,确定发声对象由讲师变化为非讲师的更换次数,并将该更换次数作为师生互动次数。
需说明,实际应用中,存在课堂上多个发声对象同时发声的情况,一般而言,多个发声对象同时发声时得到的音频帧间的相似度高于单个发声对象发声时的音频帧,因此,按照上述方式处理后,多个发声对象同时发声时的音频帧具有相同的发声对象标签的概率较大,因此,多个发声对象同时发声的情况并不会影响师生互动统计结果。
上述,通过提取授课过程中录制的课堂音频数据中每个音频帧的第一音频特征,基于第一音频特征对各音频帧进行聚类,以将第一音频特征相似的音频帧聚为一类,并且为同一类 的音频帧设置相同的发声对象标签,之后,利用第一分类模型得到时域上互邻音频帧的第一分类结果,并基于第一分类结果修正音频帧的发声对象标签,进而基于修正后的发声对象标签,统计授课过程中属于讲师的音频帧和属于学生的音频帧,进而得到师生互动次数的技术手段,解决了一些技术中,无法使学校的管理者量化且有效的掌握课堂师生互动情况的技术问题,利用音频处理技术,提取各音频帧的第一音频特征,进而基于第一音频特征为各音频帧设置适当的发声对象标签。并且,利用发声对象讲话时生成的音频帧应该为连续的特性,基于互邻音频帧的第一音频特征修正发声对象标签,保证了发声对象标签的准确性,进而保证了统计师生互动次数的准确性。并且,结合授课过程中讲师说话多于学生说话的特性,可以确定属于讲师的音频帧,进一步保证了统计师生互动次数的准确性。上述过程可以依靠互动统计设备自动化实现,便于量化,使得学校的管理者可以快速掌握各课堂的师生互动情况。
图6为本申请一个实施例提供的一种互动统计方法的流程图,该互动统计方法是在前述互动统计方法的基础上,增加了标记课堂中互动有效时刻和互动无效时刻的技术手段。其中,互动有效时刻也可以理解为课堂氛围的高光时刻,此时,课堂氛围较为活跃、轻松。互动无效时刻也可以理解为课堂氛围的至暗时刻,此时,课堂氛围较为沉闷、拖沓。
参考图6,该互动统计方法包括:
步骤310、获取授课过程中录制的课堂音频数据。之后,执行步骤320和步骤390。
示例性的,获取课堂音频数据后,还可以滤除课堂音频数据中属于环境噪声的音频帧。
一个实施例中,获取课堂音频数据后,除了统计师生互动次数,还可以标记课堂中互动有效时刻和互动无效时刻,即步骤310后可分别执行步骤320和步骤390。
步骤320、提取课堂音频数据中每个音频帧的第一音频特征。
步骤330、基于各第一音频特征对各音频帧进行聚类,得到多个音频类。
步骤340、基于音频类为每个音频帧设置对应的发声对象标签,同一音频类中各音频帧的发声对象标签相同。
步骤350、根据互邻音频帧的第一音频特征对音频帧的发声对象标签进行修正,互邻音频帧为课堂音频数据中在时域上相邻的两个音频帧。
步骤360、在全部音频帧中,统计每个发声对象标签对应的音频帧数量值。
步骤370、选择音频帧数量值最大的发声对象标签作为属于讲师的发声对象标签。
步骤380、根据发声对象标签为讲师的各音频帧以及发声对象标签为非讲师的各音频帧在时域上的排列顺序,统计授课过程中的师生互动次数。
步骤390、提取课堂音频数据中每个音频帧的第二音频特征。
其中,第二音频特征和第一音频特征的类型相同,提取过程也相同,当前不作赘述。
一个实施例中,提取第一音频特征后,也可以将第一音频特征作为第二音频特征,即无需执行步骤390,而是在步骤320后执行步骤3100。
步骤3100、根据第二音频特征得到各音频帧对应的第二分类结果,第二分类结果为单人发声及情绪分类结果、多人发声分类结果或环境辅助音分类结果。
一个实施例中,课堂音频数据可以包含笑声、掌声、多人同时发声(如课堂讨论场景下)、单人发声等发声氛围情况,且发声对象的音频还会带有发声对象的情绪,其中,情绪包括但不限定于正常、生气、沉闷/拖沓、兴奋/激昂。由于第二音频特征可以反映音频帧的信号特征,且不同发声氛围、不同情绪下,各音频帧的第二音频特征也会不同,因此,基于第二音频特征可以确定课堂音频数据对应的氛围分类,当前,将氛围分类结果记为第二分类结果。
第二分类结果所对应的氛围分类内容可以根据实际情况设置,一个实施例中,第二分类结果为单人发声及情绪分类结果、多人发声分类结果或环境辅助音分类结果,即第二分类结果共包括三种类型,每次可以得到其中一种类型的第二分类结果。单人发声及情绪分类结果对应单人发声及情绪分类,单人发声及情绪分类是指音频帧的氛围情况为单人发声且音频帧对应的情绪,可选的,单人发声及情绪分类可进一步包括单人发声且情绪为正常、单人发声且情绪为生气、单人发声且情绪为沉闷/拖沓、单人发声且情绪为兴奋/激昂,共四个子分类。第二分类结果为单人发声及情绪分类结果时应精细到具体的子类别。多人发声分类结果是指音频帧的氛围情况为多人发声。环境辅助音分类结果是指音频帧的氛围情况为与课堂氛围有关的辅助声音,环境辅助音可进一步包括掌声、笑声等子类别,第二分类结果为环境辅助音分类结果时应精细到具体的子类别。
一个实施例中,预先训练一个神经网络模型,该神经网络模型为卷积网络模型,用于输出音频帧对应的第二分类结果。该神经网络模型的具体结构当前不作限定。该神经网络模型可以认为是分类模型,当前,将该神经网络模型记为第二分类模型,此时,本步骤可具体包括:将各第二音频特征拼接成语谱图,将语谱图发送至第二分类模型,基于第二分类模型得到各音频帧对应的第二分类结果。
示例性的,图7为本申请一个实施例提供的第二分类模型的结构表格图,参考图7,其示出了第二分类模型的结构配置、输入图像的尺寸、各网络层的类型以及参数。一个实施例中,第二分类模型输入的图像为课堂音频数据的语谱图,语谱图可以基于第二音频特征得到。将各音频帧的第二音频特征按照各音频帧的采集时间由先至后进行排列拼接后,可以得到语谱图。可选的,也可以将连续且属于相同发声对象标签的音频帧拼接成一个语谱图,此时,课堂音频数据可以对应多个语谱图。之后,将语谱图输入至第二分类模型,以使第二分类模型基于语谱图确定课堂音频数据中各音频帧所属的第二分类结果。
一个实施例中,第二分类模型采用监督训练,训练数据是基于录音装置录制得到的数千节课的课堂音频数据(可认为是历史课堂音频数据)且数千节课的课堂音频数据中各音频帧对应的第二分类结果已知(已知的第二分类结果可以认为是真实的第二分类结果)。可选的,第二分类模型和第一分类模型可以采用相同的训练数据,也可以采用不同的训练数据,当前不作限定。
可选的,第二分类模型训练过程中使用交叉熵损失函数,其中,交叉熵损失函数的公式参考第一分类模型使用的交叉熵损失函数的公式,区别在于,第二分类模型的交叉熵损失函数中,y i表示真实的第二分类结果,
Figure PCTCN2022124805-appb-000011
表示预测的第二分类结果。第二分类模型的训练过程与第一分类模型的训练过程相似,当前不作赘述。
训练第二分类模型后,应用第二分类模型。示例来说,基于第二音频特征得到语谱图后,将语谱图输入至第二分类模型,以得到第二分类结果。可理解,课堂音频数据可以对应多个第二分类结果,且每个音频帧均有对应的第二分类结果。图8为本申请一个实施例提供的一种第二分类模型应用示意图,参考图8,以三个语谱图为例,将三个语谱图输入至第二分类模型后可以得到三个第二分类结果,图8中,三个第二分类结果分别为掌声、笑声和多人发声。需要说明,图8中示出三个第二分类模型是为了对三个语谱图的分类情况进行说明,实际应用中,只需要使用一个第二分类模型即可。
步骤3110、根据第二分类结果确定授课过程中的互动有效时刻和互动无效时刻。
示例性的,预先确定属于互动有效时刻的第二分类结果和属于互动无效时刻的第二分类结果,一个实施例中,属于互动有效时刻的第二分类结果包含:掌声、笑声、多人同时发声和单人发声且情绪为兴奋/激昂,属于互动无效时刻的第二分类结果包含:单人发声且情绪为生气和单人发声且情绪为沉闷/拖沓。得到第二分类结果后,将属于互动有效时刻的第二分类结果对应的音频帧标记为互动有效时刻的音频帧,并将互动有效时刻的音频帧对应的采集时间确定为互动有效时刻,将属于互动无效时刻的第二分类结果对应的音频帧标记为互动无效时刻的音频帧,并将互动无效时刻的音频帧对应的采集时间确定为互动无效时刻。
可选的,将互动有效时刻、互动无效时刻和师生互动次数关联保存,以便于学校管理者明确各课堂中的互动有效时刻和互动无效时刻。
举例而言,图9为本申请一个实施例提供的一种第二音频特征示意图,参考图9,其是对课堂音频数据处理后得到的各第二音频特征的热力图。图10为本申请一个实施例提供的一种互动无效时刻示意图,参考图10,其是图9对应的课堂音频数据中属于互动无效时刻的音频帧位置,图10可以认为是真实的结果。图11为本申请一个实施例提供的一种互动有效时 刻示意图,参考图11,其示出了图9对应的课堂音频数据中各音频帧属于互动有效时刻对应的各第二分类结果概率,其中,概率越高,表示对应的第二分类结果越准确,由图11可知,第4秒属于笑声的概率较高,第8秒至第9秒之间属于掌声(即拍手的概率较高),因此,可以标记第4秒对应的音频帧和第8秒至第9秒之间的音频帧均属于互动有效时刻的音频帧。并且,结合图10可知,图10中互动无效时刻的音频帧在图11中属于互动有效时刻的音频帧的概率值较小。
可选的,得到互动有效时刻和互动无效时刻后,可以结合互动有效时刻、互动无效时刻和师生互动次数确定课堂各时刻对应的活跃度,其中,活跃度的计算方式当前不作限定,一般而言,互动有效时刻的活跃度较高。之后,学校管理者可以通过查看课堂的活跃度确定课堂的教学质量,需说明,讲师也可以具有查看自身授课过程的各统计结果。例如,图12为本申请一个实施例提供的一种活跃度显示界面图,其示出了一节课在各时间下的活跃度,由图12可知,课堂开始后的第4分钟的活跃度最高。
上述,在统计师生互动次数的同时,还可以基于课堂音频数据中各音频帧的第二音频特征确定课堂氛围(即单人发声及情绪分类结果、多人发声分类结果或环境辅助音分类结果),进而得到互动有效时刻和互动无效时刻,可以实现仅从课堂音频数据中识别出更多维度的信息,便于学校管理者更好的掌握课堂情况。
本申请一个实施例中,第二分类模型识别出单人发声时,还可以预测出对应发声对象所属的年龄,此时,第二分类结果为单人发声及情绪分类结果时,第二分类结果还包括单人发声对象的年龄分类结果,相应的,选择音频帧数量值最大的发声对象标签作为属于讲师的发声对象标签之后,还包括:获取讲师对应的年龄分布范围;在讲师对应的各音频帧中,删除年龄分类结果未落在年龄分布范围的音频帧。
示例性的,第二分类模型输出单人发声及情绪分类结果时,还可以输出发声对象的年龄分类结果,即在训练第二分类模型时,加入预测单人发声时的年龄分类结果的训练。其中,年龄分类结果可理解为预测的单人发声对象的年龄,其可以是具体的年龄值也可以是年龄范围(年龄范围的跨度值可以根据实际情况设置)。
示例性的,通过第二分类模型得到年龄分类结果后,还可以通过年龄分类结果修正发声对象标签,例如,讲师和学生所属的年龄分布不同,因此,两者发声时的音频特征也存在较明显的差别。实施例中,预先设置讲师的年龄分布范围,该年龄分布范围可以根据实际情况设置,用于体现讲师的年龄分布。当第二分类结果包括单人发声、情绪分类以及年龄分类结果时,获取属于讲师的各音频帧,并确定属于讲师的各音频帧对应的年龄分类结果是否满足年龄分布范围,若满足,则说明该音频帧应是讲师的音频帧,若不满足,则说明该音频帧不 是讲师的音频帧,因此,删除该音频帧。
上述,通过预测单人发声时的年龄分类,并结合讲师的年龄分布范围修正讲师所属的音频帧,可以保证讲师所属的音频帧的准确性,进而保证统计师生互动次数的准确性。
图13为本申请一个实施例提供的一种互动统计装置的结构示意图,参考图13,该互动统计装置包括:第一特征提取单元401、音频聚类单元402、标签设置单元403、标签修正单元404以及次数统计单元405。
其中,第一特征提取单元401,用于提取课堂音频数据中每个音频帧的第一音频特征,课堂音频数据是对授课过程进行录音后得到的音频数据;音频聚类单元402,用于基于各所述第一音频特征对各所述音频帧进行聚类,得到多个音频类;标签设置单元403,用于基于所述音频类为每个所述音频帧设置对应的发声对象标签,同一所述音频类中各所述音频帧的发声对象标签相同;标签修正单元404,用于根据互邻音频帧的所述第一音频特征对音频帧的发声对象标签进行修正,互邻音频帧为课堂音频数据中在时域上相邻的两个音频帧;次数统计单元405,用于根据各所述音频帧在时域上的排列顺序以及各音频帧修正后的发声对象标签,统计授课过程中的师生互动次数。
本申请一个实施例中,次数统计单元405包括:第一数量统计子单元,用于在全部所述音频帧中,统计每个所述发声对象标签对应的音频帧数量值;讲师标签确定子单元,用于选择所述音频帧数量值最大的发声对象标签作为属于讲师的发声对象标签;互动次数统计子单元,用于根据发声对象标签为讲师的各音频帧以及发声对象标签为非讲师的各音频帧在时域上的排列顺序,统计授课过程中的师生互动次数。
本申请一个实施例中,互动次数统计子单元具体用于根据发声对象标签为讲师的各音频帧以及发声对象标签为非讲师的各音频帧在时域上的排列顺序,确定发声对象由讲师更换为非讲师的更换次数,并将更换次数作为授课过程中的师生互动次数。
本申请一个实施例中,标签修正单元404包括:第一分类子单元,用于将互邻音频帧对应的两个第一音频特征一同输入至第一分类模型,基于所述第一分类模型得到所述互邻音频帧的第一分类结果,所述第一分类结果为所述互邻音频帧属于同一发声对象或所述互邻音频帧属于不同发声对象;第一修正子单元,用于根据所述第一分类结果修正音频帧的发声对象标签。
本申请一个实施例中,第一修正子单元包括:第二数量统计孙单元,用于互邻音频帧对应不同的发声对象标签且所述第一分类结果为所述互邻音频帧属于同一发声对象时,在全部所述音频帧中,分别统计所述互邻音频帧的两个发声对象标签对应的音频帧数量值;数量选择孙单元,用于在统计得到的两个所述音频帧数量值中,选择较大的音频帧数量值对应的发 声对象标签作为所述互邻音频帧修正后的发声对象标签。
本申请一个实施例中,第一修正子单元包括:音频缓存孙单元,用于互邻音频帧对应相同的发声对象标签且所述第一分类结果为所述互邻音频帧属于不同发声对象时,缓存所述互邻音频帧;音频选择孙单元,用于所述课堂音频数据中各互邻音频帧均处理完毕后,在每个发声对象标签对应的各音频帧中分别选择第一数量的音频帧;相似确定孙单元,用于确定每个发声对象标签对应选择的音频帧与所述互邻音频帧中每个音频帧的相似性;标签选择孙单元,用于根据所述相似性,选择与所述互邻音频帧相似度最高的发声对象标签作为所述互邻音频帧修正后的发声对象标签。
本申请一个实施例中,还包括:第二特征提取单元,用于提取所述课堂音频数据中每个音频帧的第二音频特征;第二分类单元,用于根据所述第二音频特征得到各音频帧对应的第二分类结果,所述第二分类结果为单人发声及情绪分类结果、多人发声分类结果或环境辅助音分类结果;互动确定单元,用于根据所述第二分类结果确定授课过程中的互动有效时刻和互动无效时刻。
本申请一个实施例中,所述第二分类结果为单人发声及情绪分类结果时,所述第二分类结果还包括单人发声对象的年龄分类结果;所述装置还包括:年龄获取单元,用于选择所述音频帧数量值最大的发声对象标签作为属于讲师的发声对象标签之后,获取讲师对应的年龄分布范围;音频删除单元,用于在所述讲师对应的各所述音频帧中,删除所述年龄分类结果未落在所述年龄分布范围的音频帧。
本申请一个实施例中,第二分类单元具体用于:将各所述第二音频特征拼接成语谱图,将所述语谱图发送至第二分类模型,基于所述第二分类模型得到各所述音频帧对应的第二分类结果,第二分类结果为单人发声及情绪分类结果、多人发声分类结果或环境辅助音分类结果。
本申请一个实施例中,还包括:噪音滤除单元,用于提取课堂音频数据中每个音频帧的第一音频特征之前,滤除所述课堂音频数据中属于环境噪声的音频帧。
本申请一个实施例提供的互动统计装置包含在互动统计设备中,且可用于执行上述任意实施例中互动统计方法,具备相应的功能和有益效果。
值得注意的是,上述互动统计装置的实施例中,所包括的各个单元和模块只是按照功能逻辑进行划分的,但并不局限于上述的划分,只要能够实现相应的功能即可;另外,各功能单元的具体名称也只是为了便于相互区分,并不用于限制本发明的保护范围。
图14为本申请一个实施例提供的一种互动统计设备的结构示意图。如图14所示,该互动统计设备包括处理器50以及存储器51;互动统计设备中处理器50的数量可以是一个或多 个,图14中以一个处理器50为例;互动统计设备中的处理器50以及存储器51可以通过总线或其他方式连接,图14中以通过总线连接为例。
存储器51作为一种计算机可读存储介质,可用于存储软件程序、计算机可执行程序以及模块,如本申请实施例中的互动统计设备执行互动统计方法时对应的程序指令/模块(例如,互动统计装置中的第一特征提取单元401、音频聚类单元402、标签设置单元403、标签修正单元404以及次数统计单元405)。处理器50通过运行存储在存储器51中的软件程序、指令以及模块,从而执行互动统计设备的各种功能应用以及数据处理,即实现上述互动统计设备执行的互动统计方法。
存储器51可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序;存储数据区可存储根据互动统计设备的使用所创建的数据等。此外,存储器51可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他非易失性固态存储器件。在一些实例中,存储器51可进一步包括相对于处理器50远程设置的存储器,这些远程存储器可以通过网络连接至互动统计设备。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。
互动统计设备还可包括输入装置,输入装置可用于接收输入的数字或字符信息,以及产生与互动统计设备的用户设置以及功能控制有关的键信号输入。互动统计设备还可包括输出装置,输出装置可包括音频播放装置等设备。互动统计设备还可包括显示屏,该显示屏可以是具有触摸功能的显示屏,显示屏根据处理器50的指示进行显示。互动统计设备还可包括通信装置,以用于实现通信功能。
本申请一个实施例中,互动统计设备还可包含录音装置,当前,将互动统计设备包含的录音装置记为第一录音装置,第一录音装置用于录制授课过程中的课堂音频数据。此时,互动统计设备集成有录音功能,并可对课堂上的声音进行录制。
需说明,实际应用中,互动统计设备也可以不包含第一录音装置,而是协同独立的录音设备,以得到录音设备录制的课堂音频数据。
上述互动统计设备包含互动统计装置,可以用于执行互动统计设备执行的互动统计方法,具备相应的功能和有益效果。
FIG. 15 is a schematic structural diagram of an interaction statistics system according to an embodiment of the present application. Referring to FIG. 15, the interaction statistics system includes a recording device 60 and an interaction statistics device 70.
The recording device 60 includes a second recording apparatus, and the second recording apparatus is configured to record the classroom audio data during the teaching process and send the classroom audio data to the interaction statistics device 70. The interaction statistics device 70 includes a memory and one or more processors. The memory is configured to store one or more programs, and the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the interaction statistics method provided in the foregoing embodiments.
Exemplarily, the recording apparatus included in the recording device 60 is denoted as the second recording apparatus, and the classroom audio data is recorded through the second recording apparatus. In this case, the recording of the classroom audio data and the processing of the classroom audio data are performed by different physical entities. Further, the recording device 60 and the interaction statistics device 70 may be connected in a wired or wireless manner.
Here, for the interaction statistics device 70, reference may be made to the relevant description of the interaction statistics device in the foregoing embodiments, with the difference that the interaction statistics device 70 may or may not be provided with the first recording apparatus. That is, regardless of whether the interaction statistics device 70 has the first recording apparatus, the classroom audio data is recorded by the recording device 60, which makes the installation location of the interaction statistics device 70 more flexible and not limited to the classroom.
It should be noted that one interaction statistics device 70 may cooperate with one or more recording devices 60; one recording device 60 is taken as an example in FIG. 15.
The classroom audio data recorded by the recording device 60 is sent to the interaction statistics device 70, and the interaction statistics device 70 processes the classroom audio data after acquiring it, so as to implement the interaction statistics method in the foregoing embodiments, with the corresponding functions and beneficial effects.
An embodiment of the present application further provides a storage medium containing computer-executable instructions, where the computer-executable instructions, when executed by a computer processor, are used to perform the relevant operations in the interaction statistics method provided in any embodiment of the present application, with the corresponding functions and beneficial effects.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system or a computer program product.
Therefore, the present application may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or the other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are executed on the computer or the other programmable device to produce computer-implemented processing, and the instructions executed on the computer or the other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPU), an input/output interface, a network interface and a memory. The memory may include a non-persistent memory, a random access memory (RAM) and/or a non-volatile memory in computer-readable media, such as a read-only memory (ROM) or a flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, modules of a program or other data. Examples of computer storage media include, but are not limited to, a phase-change memory (PRAM), a static random access memory (SRAM), a dynamic random access memory (DRAM), other types of random access memory (RAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory or other memory technologies, a compact disc read-only memory (CD-ROM), a digital versatile disc (DVD) or other optical storage, a magnetic cassette, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or device including a series of elements includes not only those elements but also other elements not explicitly listed, or further includes elements inherent to such a process, method, commodity or device. In the absence of further limitations, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, commodity or device including the element.
Note that the above are only preferred embodiments of the present application and the technical principles applied. Those skilled in the art will understand that the present application is not limited to the specific embodiments described herein, and various obvious changes, readjustments and substitutions can be made by those skilled in the art without departing from the protection scope of the present application. Therefore, although the present application has been described in considerable detail through the above embodiments, the present application is not limited to the above embodiments and may further include more other equivalent embodiments without departing from the concept of the present application, and the scope of the present application is determined by the scope of the appended claims.

Claims (15)

  1. An interaction statistics method, comprising:
    extracting a first audio feature of each audio frame in classroom audio data, wherein the classroom audio data is audio data obtained by recording a teaching process;
    clustering the audio frames based on the first audio features to obtain multiple audio classes;
    setting a corresponding sound object label for each audio frame based on the audio classes, wherein the audio frames in the same audio class share the same sound object label;
    correcting the sound object labels of audio frames according to the first audio features of adjacent audio frames, wherein the adjacent audio frames are two audio frames adjacent in the time domain in the classroom audio data; and
    counting the number of teacher-student interactions during the teaching process according to the arrangement order of the audio frames in the time domain and the corrected sound object label of each audio frame.
  2. The interaction statistics method according to claim 1, wherein counting the number of teacher-student interactions during the teaching process according to the arrangement order of the audio frames in the time domain and the corrected sound object label of each audio frame comprises:
    counting, among all the audio frames, the number of audio frames corresponding to each sound object label;
    selecting the sound object label with the largest number of audio frames as the sound object label belonging to a lecturer; and
    counting the number of teacher-student interactions during the teaching process according to the arrangement order, in the time domain, of the audio frames whose sound object label is the lecturer and the audio frames whose sound object label is not the lecturer.
  3. The interaction statistics method according to claim 2, wherein counting the number of teacher-student interactions during the teaching process according to the arrangement order, in the time domain, of the audio frames whose sound object label is the lecturer and the audio frames whose sound object label is not the lecturer comprises:
    determining, according to the arrangement order, in the time domain, of the audio frames whose sound object label is the lecturer and the audio frames whose sound object label is not the lecturer, the number of switches of the sound object from the lecturer to a non-lecturer, and using the number of switches as the number of teacher-student interactions during the teaching process.
  4. The interaction statistics method according to claim 1, wherein correcting the sound object labels of audio frames according to the first audio features of adjacent audio frames comprises:
    inputting the two first audio features corresponding to adjacent audio frames together into a first classification model, and obtaining, based on the first classification model, a first classification result of the adjacent audio frames, wherein the first classification result is that the adjacent audio frames belong to the same sound object or that the adjacent audio frames belong to different sound objects; and
    correcting the sound object labels of the audio frames according to the first classification result.
  5. The interaction statistics method according to claim 4, wherein correcting the sound object labels of the audio frames according to the first classification result comprises:
    when the adjacent audio frames correspond to different sound object labels and the first classification result is that the adjacent audio frames belong to the same sound object, counting, among all the audio frames, the number of audio frames corresponding to each of the two sound object labels of the adjacent audio frames; and
    selecting, from the two counted numbers of audio frames, the sound object label corresponding to the larger number of audio frames as the corrected sound object label of the adjacent audio frames.
  6. The interaction statistics method according to claim 4 or 5, wherein correcting the sound object labels of the audio frames according to the first classification result comprises:
    when the adjacent audio frames correspond to the same sound object label and the first classification result is that the adjacent audio frames belong to different sound objects, caching the adjacent audio frames;
    after all adjacent audio frames in the classroom audio data have been processed, selecting a first number of audio frames from the audio frames corresponding to each sound object label;
    determining the similarity between the audio frames selected for each sound object label and each audio frame of the adjacent audio frames; and
    selecting, according to the similarity, the sound object label with the highest similarity to the adjacent audio frames as the corrected sound object label of the adjacent audio frames.
  7. The interaction statistics method according to claim 2, further comprising:
    extracting a second audio feature of each audio frame in the classroom audio data;
    obtaining a second classification result corresponding to each audio frame according to the second audio features, wherein the second classification result is a single-person vocalization and emotion classification result, a multi-person vocalization classification result or an environmental auxiliary sound classification result; and
    determining valid interaction moments and invalid interaction moments during the teaching process according to the second classification results.
  8. The interaction statistics method according to claim 7, wherein, when the second classification result is a single-person vocalization and emotion classification result, the second classification result further comprises an age classification result of the single sound object; and
    after selecting the sound object label with the largest number of audio frames as the sound object label belonging to the lecturer, the method further comprises:
    acquiring an age distribution range corresponding to the lecturer; and
    deleting, from the audio frames corresponding to the lecturer, the audio frames whose age classification results do not fall within the age distribution range.
  9. The interaction statistics method according to claim 7, wherein obtaining a second classification result corresponding to each audio frame according to the second audio features comprises:
    concatenating the second audio features into a spectrogram, sending the spectrogram to a second classification model, and obtaining, based on the second classification model, the second classification result corresponding to each audio frame.
  10. The interaction statistics method according to any one of claims 1-9, wherein, before extracting the first audio feature of each audio frame in the classroom audio data, the method comprises:
    filtering out the audio frames belonging to environmental noise in the classroom audio data.
  11. An interaction statistics apparatus, comprising:
    a first feature extraction unit, configured to extract a first audio feature of each audio frame in classroom audio data, wherein the classroom audio data is audio data obtained by recording a teaching process;
    an audio clustering unit, configured to cluster the audio frames based on the first audio features to obtain multiple audio classes;
    a label setting unit, configured to set a corresponding sound object label for each audio frame based on the audio classes, wherein the audio frames in the same audio class share the same sound object label;
    a label correction unit, configured to correct the sound object labels of audio frames according to the first audio features of adjacent audio frames, wherein the adjacent audio frames are two audio frames adjacent in the time domain in the classroom audio data; and
    a count statistics unit, configured to count the number of teacher-student interactions during the teaching process according to the arrangement order of the audio frames in the time domain and the corrected sound object label of each audio frame.
  12. An interaction statistics device, comprising:
    one or more processors; and
    a memory, configured to store one or more programs,
    wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the interaction statistics method according to any one of claims 1-10.
  13. The interaction statistics device according to claim 12, further comprising: a first recording apparatus, wherein the first recording apparatus is configured to record classroom audio data during a teaching process.
  14. An interaction statistics system, comprising a recording device and an interaction statistics device, wherein
    the recording device comprises a second recording apparatus, and the second recording apparatus is configured to record classroom audio data during a teaching process and send the classroom audio data to the interaction statistics device;
    the interaction statistics device comprises a memory and one or more processors; and
    the memory is configured to store one or more programs, and the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the interaction statistics method according to any one of claims 1-10.
  15. A computer-readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the interaction statistics method according to any one of claims 1-10.
PCT/CN2022/124805 2022-10-12 2022-10-12 Interaction statistics method, device, equipment, system and storage medium WO2024077511A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202280007331.0A CN118202410A (zh) 2022-10-12 2022-10-12 Interaction statistics method, device, equipment, system and storage medium
PCT/CN2022/124805 WO2024077511A1 (zh) 2022-10-12 2022-10-12 Interaction statistics method, device, equipment, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/124805 WO2024077511A1 (zh) 2022-10-12 2022-10-12 Interaction statistics method, device, equipment, system and storage medium

Publications (1)

Publication Number Publication Date
WO2024077511A1 true WO2024077511A1 (zh) 2024-04-18

Family

ID=90668590

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/124805 WO2024077511A1 (zh) 2022-10-12 2022-10-12 互动统计方法、装置、设备、系统及存储介质

Country Status (2)

Country Link
CN (1) CN118202410A (zh)
WO (1) WO2024077511A1 (zh)

Also Published As

Publication number Publication date
CN118202410A (zh) 2024-06-14


Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 202280007331.0

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22961703

Country of ref document: EP

Kind code of ref document: A1