CN114822557A - Method, device, equipment and storage medium for distinguishing different sounds in classroom - Google Patents

Method, device, equipment and storage medium for distinguishing different sounds in classroom

Info

Publication number
CN114822557A
CN114822557A (application number CN202210339090.1A)
Authority
CN
China
Prior art keywords
sound
voiceprint
classroom
spectrogram
vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210339090.1A
Other languages
Chinese (zh)
Inventor
孙德宇
谷娜娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zonekey Modern Technology Co ltd
Original Assignee
Beijing Zonekey Modern Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zonekey Modern Technology Co ltd filed Critical Beijing Zonekey Modern Technology Co ltd
Priority to CN202210339090.1A
Publication of CN114822557A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification

Abstract

The method comprises: collecting classroom sound and inputting it into a trained voiceprint model to obtain voiceprint vectors for multiple sound segments; judging, according to each voiceprint vector, whether the corresponding sound segment is a non-teacher sound; and, if so, converting the corresponding sound segment into a spectrogram, inputting the spectrogram into a trained sound classification model, and classifying the sound segment. The training method of the sound classification model comprises: extracting the Mel spectral features of each training sample in a training sample set; converting the Mel spectral features into a two-dimensional Mel spectrogram; and inputting the Mel spectrogram into the sound classification model, which is trained using a VGG11 network structure. The application has the effect of distinguishing different sounds in a classroom.

Description

Method, device, equipment and storage medium for distinguishing different sounds in classroom
Technical Field
The present application relates to the field of sound classification technology, and in particular, to a method, an apparatus, a device, and a storage medium for distinguishing different sounds in a classroom.
Background
In a classroom there are often many kinds of sound, for example the teacher's voice, a single student's voice, the sound of many students reading together, the sound of discussion, and noise. In classroom analysis it is often necessary to distinguish these different sounds in order to analyze the different teaching behaviors occurring in the classroom.
At present, the traditional method distinguishes sounds in a classroom by means of a teacher's voice collected in advance, but it only separates the teacher's voice from non-teacher sounds; it does not further distinguish a single student's voice, the sound of many students reading at the same time, the sound of discussion, or noise.
Disclosure of Invention
In order to distinguish different sounds in a classroom, the application provides a method, a device, equipment and a storage medium for distinguishing different sounds in the classroom.
In a first aspect, the present application provides a method for training a sound classification model, which adopts the following technical scheme:
a training method of a sound classification model comprises the following steps:
converting each training sample in a training sample set into a first spectrogram, wherein the first spectrogram is a two-dimensional Mel spectrogram;
and inputting the first spectrogram into the sound classification model, and training the sound classification model by using a VGG11 network structure.
By adopting this technical scheme: at present the two-dimensional Mel spectrogram is mainly used for speech recognition and voiceprint recognition, whereas in this application it is applied to sound classification. Speech recognition identifies the content of speech; voiceprint recognition identifies who is speaking; sound classification distinguishes the category of a sound regardless of its content or its speaker. Few, if any, network structures are purpose-built for training sound classification models, so the VGG11 network structure from image recognition is used to train the sound classification model, and a large number of experiments show that the two-dimensional Mel spectrogram combines very well with the VGG11 network structure as a training sample for the sound classification model. In summary, based on the Mel spectrogram and the VGG11 network structure, sound processing and image recognition are combined so that the sound classification model can be trained well and an accurate sound classification model can be obtained.
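For illustration only, the following sketch (not part of the patent's disclosure; it assumes the librosa and numpy packages, a mono audio file, and example parameter values) shows one common way of turning a training sample into a two-dimensional Mel spectrogram that can be treated as a grey-scale image for a VGG-style network.

```python
# Illustrative sketch (not the patent's code): convert an audio clip into a
# two-dimensional log-Mel spectrogram that can be treated as a grey-scale image.
# Assumes the librosa and numpy packages; parameter values are examples only.
import librosa
import numpy as np

def audio_to_mel_image(wav_path, sr=16000, n_mels=64, n_fft=1024, hop_length=256):
    """Load an audio file and return a normalised 2-D log-Mel spectrogram."""
    y, sr = librosa.load(wav_path, sr=sr, mono=True)              # waveform
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)                  # log scale
    # Min-max normalisation so the spectrogram behaves like an image.
    return (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
```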
In a second aspect, the present application provides a method for distinguishing different sounds in a classroom, which adopts the following technical scheme:
a method for distinguishing different sounds in a classroom comprises the following steps:
collecting classroom sounds, and inputting the classroom sounds into a trained voiceprint model to obtain voiceprint vectors of a plurality of sections of sound fragments;
judging whether a sound segment corresponding to the voiceprint vector is a non-teacher sound or not according to the voiceprint vector;
if so, extracting the characteristics of the sound segment corresponding to the voiceprint vector to obtain a second spectrogram, inputting the second spectrogram into a trained sound classification model, and classifying the sound segment corresponding to the voiceprint vector according to the voiceprint vector; wherein the second spectrogram is a two-dimensional Mel spectrogram;
the training method of the sound classification model comprises the following steps:
converting each training sample in a training sample set into a first spectrogram, wherein the first spectrogram is a two-dimensional Mel spectrogram;
inputting the first spectrogram into the sound classification model, and training the sound classification model by using a VGG11 network structure.
By adopting this technical scheme: in a classroom scene there is usually only one teacher, so the teacher's voice is uniform, while there are many students, so their voices are varied; the teacher usually moves within the area of the lecture platform while the students are concentrated in the middle and rear of the classroom; and in a typical class the teacher's voice is present most of the time and students rarely speak at the same time as the teacher. The collected classroom sound can therefore be separated well into teacher sound and non-teacher sound. Accordingly, it is first determined whether a sound segment is a non-teacher sound; the teacher sound is not classified further, and only the non-teacher sound is further classified so as to distinguish the voice of a single student, the sound of many students reading at the same time, the sound of discussion, noise, and so on.
Preferably, the training method of the voiceprint model includes:
acquiring an open-source sound data set, making pre-collected classroom sounds into a classroom sound data set, and taking the open-source sound data set and the classroom sound data set as a sample set together;
and inputting the samples in the sample set into the voiceprint model, and training the voiceprint model by utilizing a deep learning algorithm.
By adopting this technical scheme: at present, voiceprint models are usually trained only on open-source sound data sets, most of which are collected from near-field recordings and video sound on video websites, whereas most sound in a classroom environment is picked up by ceiling microphones and is far-field sound. There is therefore a mismatch between the collection environment and the usage environment, so a traditional voiceprint model performs slightly worse when applied in a classroom, and collecting a new sound data set is costly and lacks uniform standards. The sample set in this application adds, on top of the open-source sound data set, a classroom sound data set made from a large amount of classroom sound, and the voiceprint model is trained on this combined sample set, so that the voiceprint model is suited to use in the classroom environment and the accuracy of the voiceprint vectors it outputs is improved.
Preferably, the acquiring the classroom sound and inputting the classroom sound into the trained voiceprint model to obtain the voiceprint vectors of the multiple sections of sound fragments includes:
dividing the classroom sound into a plurality of sound segments;
and respectively carrying out voiceprint extraction on the multiple sections of sound fragments to obtain the voiceprint vectors.
Preferably, the dividing the classroom sound into a plurality of sound segments includes:
dividing the classroom sound into a plurality of segments, wherein adjacent segments have shared parts and non-shared parts;
respectively calculating the voiceprint feature matching degree of the shared part and the non-shared part of the adjacent segments;
acquiring a switching point based on the voiceprint feature matching degree;
and dividing the classroom sound into a plurality of sound segments according to the switching point.
By adopting this technical scheme: switching points are detected on the basis of the voiceprint feature matching degree, and the classroom sound is divided into multiple sound segments, each of which contains a single kind of sound, for example one segment is teacher sound and another is noise; this makes it convenient to classify each sound segment later.
Preferably, the determining, according to the voiceprint vector, whether the sound segment corresponding to the voiceprint vector is a non-teacher sound includes:
and (3) carrying out voiceprint clustering based on the voiceprint vector by adopting a method of combining a BIRTCH clustering algorithm and a Calinski-Harabaz index, and judging whether the sound segment corresponding to the voiceprint vector is non-teacher sound.
By adopting the technical scheme, the BIRTCH clustering algorithm is utilized to perform voiceprint clustering on the voiceprint vector, and the Calinski-Harabaz index is utilized to evaluate the characteristic of good clustering effect so as to improve the clustering accuracy, so that the clustering result is more accurate.
Preferably, the performing voiceprint clustering based on the voiceprint vectors by combining the BIRCH clustering algorithm and the Calinski-Harabasz index, and determining whether the sound segment corresponding to a voiceprint vector is a non-teacher sound, includes:
clustering all the voiceprint vectors by using the BIRCH clustering algorithm, and dividing all the voiceprint vectors into a first class and a second class;
performing second-level clustering, again with the BIRCH clustering algorithm, on all the voiceprint vectors in the first class and on all the voiceprint vectors in the second class respectively;
respectively acquiring a first index and a second index; the first index is the Calinski-Harabasz index obtained from the second-level clustering of all the voiceprint vectors in the first class, and the second index is the Calinski-Harabasz index obtained from the second-level clustering of all the voiceprint vectors in the second class;
judging whether the first index is larger than the second index;
if yes, judging the sound segment corresponding to the voiceprint vector in the first class as non-teacher sound;
if not, the sound segment corresponding to the voiceprint vector in the second class is judged to be non-teacher sound.
By adopting this technical scheme: teacher sound and non-teacher sound differ noticeably in a classroom environment, so the BIRCH clustering algorithm is first used to cluster all the voiceprint vectors into two classes on the basis of that difference; after this first clustering, however, it is not yet clear which class corresponds to non-teacher sound and which to teacher sound. Second-level clustering is therefore performed on all the voiceprint vectors in the first class and in the second class respectively to obtain a first index and a second index, because the Calinski-Harabasz index can evaluate the clustering effect and teacher sound and non-teacher sound can be told apart by this property: there is only a small number of teachers in a class, so teacher sound is uniform; if the sound segments corresponding to a class of voiceprint vectors are teacher sound, the second-level clustering effect is poor and the Calinski-Harabasz index is small, so the class with the smaller index is judged to be teacher sound. The students in a classroom are many, so non-teacher sounds are diverse; if the sound segments corresponding to a class of voiceprint vectors are non-teacher sound, the second-level clustering effect is good and the Calinski-Harabasz index is large, so the class with the larger index is judged to be non-teacher sound.
In a third aspect, the present application provides a device for distinguishing different voices in a classroom, which adopts the following technical scheme:
a device for distinguishing different voices in a classroom comprises,
the acquisition module is used for acquiring classroom sounds and inputting the classroom sounds into a trained voiceprint model to obtain voiceprint vectors of a plurality of sections of sound fragments;
the judging module is used for judging, according to the voiceprint vector, whether the sound segment corresponding to the voiceprint vector is a non-teacher sound, and if so, switching to the classification module; and
the classification module is used for extracting the characteristics of the sound segments corresponding to the voiceprint vectors to obtain a second spectrogram, inputting the second spectrogram into a trained sound classification model, and classifying the sound segments corresponding to the voiceprint vectors according to the voiceprint vectors; wherein the second spectrogram is a two-dimensional Mel spectrogram;
the training method of the sound classification model comprises the following steps:
converting each training sample in a training sample set into a first spectrogram, wherein the first spectrogram is a two-dimensional Mel spectrogram;
inputting the first spectrogram into the sound classification model, and training the sound classification model by using a VGG11 network structure.
In a fourth aspect, the present application provides a computer device, which adopts the following technical solution:
a computer device comprising a memory and a processor, the memory having stored thereon a computer program that can be loaded by the processor and that performs the method of distinguishing between different sounds in a classroom as described in any one of the first aspect.
In a fifth aspect, the present application provides a computer-readable storage medium, which adopts the following technical solutions:
a computer readable storage medium storing a computer program which can be loaded by a processor and which implements the method of distinguishing between different sounds in a classroom according to any one of the first aspect.
Drawings
Fig. 1 is a schematic flowchart of a method for distinguishing different sounds in a classroom according to an embodiment of the present application.
Fig. 2 is a schematic diagram of classroom sounds provided by an embodiment of the present application.
Fig. 3 is a flowchart illustrating a method for determining non-teacher voices and teacher voices according to an embodiment of the present application.
Fig. 4 is a block diagram illustrating a structure of a device for distinguishing different sounds in a classroom according to an embodiment of the present disclosure.
Fig. 5 is a schematic structural diagram of a computer device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
The present embodiment provides a method for distinguishing different sounds in a classroom, as shown in fig. 1, the main flow of the method is described as follows (steps S101 to S103):
step S101: and (3) collecting classroom sounds, and inputting the classroom sounds into the trained voiceprint model to obtain the voiceprint vectors of the multiple sections of sound fragments.
The training method of the voiceprint model comprises the following specific steps:
the method comprises the steps of acquiring open-source sound data on the network, wherein the sound data on the network are usually collected from near field recording and video sound on a video website, and sorting and combining the acquired sound data into a sound data set. The classroom sound data are collected in real time through audio collecting equipment installed in a classroom, and the collected classroom sound data are sorted and combined into a classroom sound data set. And taking the open-source sound data set and the classroom sound data set together as a sample set.
And inputting the samples in the sample set into the voiceprint model, and training the voiceprint model by using a deep learning algorithm. Wherein, the audio acquisition equipment can be a microphone in a classroom.
In this embodiment, when a class begins, the audio acquisition equipment collects the classroom sound; the classroom sound is input into the trained voiceprint model, the voiceprint model is used to divide the classroom sound into multiple sound segments, and voiceprint extraction is performed on each sound segment to obtain its voiceprint vector.
One possible implementation of dividing the classroom sound into multiple sound segments is as follows: divide the classroom sound into a plurality of segments, where adjacent segments have a shared part and non-shared parts; respectively calculate the voiceprint feature matching degree between the shared part and the non-shared parts of the adjacent segments; obtain the switching points based on the voiceprint feature matching degrees; and divide the classroom sound into multiple sound segments according to the switching points.
The specific method for obtaining the switching point is as follows: the voiceprint feature matching degree between the shared part and each of the two adjacent segments' non-shared parts is compared; the segment whose non-shared part matches the shared part better is judged to contain speech from the same person as the shared part, so the switching point is the boundary between the shared part and the non-shared part of the other segment.
The above is exemplified: referring to fig. 2, the axis is a time axis, and at the beginning of a classroom, classroom sounds begin to be collected, which is recorded as 0 second on the axis. Every 2 seconds of classroom sound is taken as a segment, 1 second of sounds in adjacent segments are repeated, and the repeated sounds are taken as a shared portion between adjacent segments.
There are 4 segments shown in Fig. 2: segment A [0s, 2s], segment B [1s, 3s], segment C [2s, 4s] and segment D [3s, 5s]. Taking segments B and C, which together span 1s to 4s, the shared part of segments B and C is [2s, 3s], the non-shared part of segment B is [1s, 2s], and the non-shared part of segment C is [3s, 4s].
Voiceprint extraction is performed on the [1s, 2s], [2s, 3s] and [3s, 4s] parts respectively to obtain 3 voiceprint feature vectors. The voiceprint feature matching degree between the voiceprint feature vector of the [2s, 3s] part and that of the [1s, 2s] part is calculated and taken as the first matching degree; the voiceprint feature matching degree between the voiceprint feature vector of the [2s, 3s] part and that of the [3s, 4s] part is calculated and taken as the second matching degree.
It is then judged whether the first matching degree is greater than the second matching degree; if yes, the 3rd second is taken as the switching point; if not, the 2nd second is taken as the switching point.
It should be noted that if the first matching degree equals the second matching degree, segment B and segment C are judged to contain speech from the same person and there is no switching point between them. In practice the two matching degrees are rarely exactly equal, so a switching point is found with high probability and the classroom sound is cut into many sound segments.
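For illustration only, the sketch below (not from the patent; extract_voiceprint() is an assumed helper returning an embedding vector, and cosine similarity stands in for the voiceprint feature matching degree) shows how the switching point could be chosen for one pair of adjacent 2-second segments with a 1-second overlap, as in the example above.

```python
# Illustrative sketch (not the patent's code) of the switch-point decision for
# two adjacent 2-second segments with a 1-second overlap, e.g. B = [1s, 3s] and
# C = [2s, 4s]. extract_voiceprint() is an assumed helper returning a 1-D
# embedding vector; cosine similarity stands in for the matching degree.
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def find_switch_point(audio, sr, b_start, extract_voiceprint):
    """Return the switching point (in seconds) between segment B
    [b_start, b_start+2] and segment C [b_start+1, b_start+3]."""
    part = lambda t0, t1: audio[int(t0 * sr):int(t1 * sr)]
    shared = extract_voiceprint(part(b_start + 1, b_start + 2))   # e.g. [2s, 3s]
    b_only = extract_voiceprint(part(b_start, b_start + 1))       # e.g. [1s, 2s]
    c_only = extract_voiceprint(part(b_start + 2, b_start + 3))   # e.g. [3s, 4s]
    first_match = cosine_sim(shared, b_only)    # shared part vs. B's own part
    second_match = cosine_sim(shared, c_only)   # shared part vs. C's own part
    # The shared part belongs to the segment it matches better, so the switch
    # point is where the shared part meets the other segment's non-shared part.
    return b_start + 2 if first_match > second_match else b_start + 1
```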
Judging whether the classroom is finished or not; if yes, the process proceeds to step S102.
The specific method for judging the end of the classroom comprises the following steps: judging whether classroom sound exists or not; if not, timing; judging whether the timing time is greater than the preset time or not; if yes, the classroom is judged to be finished.
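As a small illustration of this end-of-class check (not the patent's code; the 600-second preset and the has_sound() helper are assumptions), the logic could look like the following.

```python
# Illustrative sketch (not the patent's code): judge that the class has ended
# when no classroom sound has been detected for a preset time.
import time

def wait_for_class_end(has_sound, preset_seconds=600, poll=1.0):
    """has_sound: assumed callable returning True while classroom sound is detected."""
    silent_since = None
    while True:
        if has_sound():
            silent_since = None                      # sound resumed, reset the timer
        else:
            silent_since = silent_since or time.monotonic()
            if time.monotonic() - silent_since > preset_seconds:
                return                               # silence exceeded the preset time
        time.sleep(poll)
```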
The present application further provides another implementable embodiment for segmenting a classroom sound into multiple sound segments, specifically as follows:
dividing classroom sound into a plurality of segments, wherein shared parts and non-shared parts are arranged between adjacent segments; determining adjacent segments where sound transformation occurs; for adjacent segments with sound transformation, respectively calculating the matching degree of the voiceprint characteristics of the shared part and the non-shared part of the adjacent segments; acquiring a switching point based on the voiceprint feature matching degree; dividing the classroom sound into a plurality of sound segments according to the switching points.
The specific method for determining the adjacent segments with the sound transformation comprises the following steps: for 4 continuous segments, 3 groups of adjacent segments are sequentially arranged, the voiceprint feature matching degree between each group of adjacent segments is respectively calculated, the voiceprint feature matching degree between the first group of adjacent segments is defined as a first voiceprint feature matching degree, the voiceprint feature matching degree between the second group of adjacent segments is defined as a second voiceprint feature matching degree, and the voiceprint feature matching degree between the third group of adjacent segments is defined as a third voiceprint feature matching degree. Judging whether the difference value obtained by subtracting the first voiceprint feature matching degree from the second voiceprint feature matching degree is smaller than a first preset value or not, and the difference value obtained by subtracting the second voiceprint feature matching degree from the third voiceprint feature matching degree is larger than a second preset value or not; if yes, the second group of adjacent segments are judged to be adjacent segments with sound conversion. The following method for obtaining the switching point in the adjacent segment where the sound conversion occurs is consistent with the principle of the first implementation method for dividing the classroom sound into multiple sound segments, and is not repeated herein.
For example, referring to fig. 2, segment a and segment B are a first set of adjacent segments, segment B and segment C are a second set of adjacent segments, and segment C and segment D are a third set of adjacent segments. And calculating to obtain a first voiceprint feature matching degree based on the voiceprint feature vectors of the segment A and the segment B, calculating to obtain a second voiceprint feature matching degree based on the voiceprint feature vectors of the segment B and the segment C, calculating to obtain a third voiceprint feature matching degree based on the voiceprint feature vectors of the segment C and the segment D, and comparing the first voiceprint feature matching degree, the second voiceprint feature matching degree and the third voiceprint feature matching degree to judge the segment B and the segment C as adjacent segments with sound transformation.
It should be noted that, the classroom sound can be acquired in real time during the classroom, and the acquired classroom sound is input into the voiceprint model for processing, or the classroom sound of the whole classroom can be input into the voiceprint model for processing after the classroom is finished.
Step S102: judging whether the sound segment corresponding to the voiceprint vector is non-teacher sound or not according to the voiceprint vector; if yes, the process proceeds to step S103.
In this embodiment, a method combining the BIRCH clustering algorithm and the Calinski-Harabasz index is used to perform voiceprint clustering on all the voiceprint vectors obtained in step S101, and to judge whether the sound segment corresponding to each voiceprint vector is a non-teacher sound.
The Calinski-Harabasz index measures the quality of the voiceprint clustering: the larger the index, the tighter each class is and the more separated the classes are, that is, the better the clustering result.
Specifically, the BIRCH clustering algorithm is used to cluster all the voiceprint vectors into a first class and a second class; the BIRCH clustering algorithm is then used to perform second-level clustering on all the voiceprint vectors in the first class and on all the voiceprint vectors in the second class respectively; and a first index and a second index are acquired, where the first index is the Calinski-Harabasz index obtained from the second-level clustering of the first class and the second index is the Calinski-Harabasz index obtained from the second-level clustering of the second class.
Judging whether the first index is larger than the second index; if yes, judging that the sound segment corresponding to the voiceprint vector in the first class is non-teacher sound, and judging that the sound segment corresponding to the voiceprint vector in the second class is teacher sound; if not, the sound segment corresponding to the voiceprint vector in the first class is judged to be teacher sound, and the sound segment corresponding to the voiceprint vector in the second class is judged to be non-teacher sound.
It is to be noted that there is substantially no case where the first index is equal to the second index because there is a relatively distinct difference between the teacher sound and the non-teacher sound.
The principle of the BIRCH clustering algorithm is to build a tree structure based on its parameters; here the sample-distance threshold parameter used to decide whether two voiceprint vectors belong to the same class is the voiceprint feature similarity.
The voiceprint feature similarity is preset, and all the voiceprint vectors are clustered into two classes, the first class and the second class, according to this similarity; at this point it is not yet clear which class of voiceprint vectors corresponds to non-teacher sound. Therefore, all the voiceprint vectors in the first class and all those in the second class are clustered a second time to obtain the first index and the second index; the sound segments corresponding to the class with the larger index are non-teacher sound, and those corresponding to the class with the smaller index are teacher sound.
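A minimal sketch of this two-level procedure is given below (it is not the patent's code; it assumes scikit-learn, uses the Birch `threshold` parameter to stand in for the voiceprint feature similarity, and the value 0.5 is purely an example).

```python
# Illustrative sketch (not the patent's code): first-level BIRCH clustering into
# two classes, second-level clustering of each class, and comparison of the two
# Calinski-Harabasz indices to decide which class is the non-teacher sound.
import numpy as np
from sklearn.cluster import Birch
from sklearn.metrics import calinski_harabasz_score

def split_teacher_vs_others(voiceprints, similarity=0.5):
    """voiceprints: array of shape (n_segments, dim). Returns a boolean mask that
    is True for segments judged to be non-teacher sound."""
    voiceprints = np.asarray(voiceprints)
    first = Birch(n_clusters=2, threshold=similarity).fit_predict(voiceprints)

    def second_level_index(vectors):
        # Cluster one class again and measure how cleanly it separates.
        if len(vectors) < 3:
            return 0.0                                 # too few samples to judge
        labels = Birch(n_clusters=2, threshold=similarity).fit_predict(vectors)
        if len(set(labels)) < 2:
            return 0.0                                 # could not split further
        return calinski_harabasz_score(vectors, labels)

    idx_first = second_level_index(voiceprints[first == 0])
    idx_second = second_level_index(voiceprints[first == 1])
    non_teacher_label = 0 if idx_first > idx_second else 1   # diverse class = students
    return first == non_teacher_label
```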
The first index and the second index are computed with the same general formula:

S = [SS_B / (k - 1)] / [SS_W / (N - k)]

wherein S denotes the Calinski-Harabasz index; k denotes the number of cluster classes (in this application all the voiceprint vectors are grouped into 2 classes, so k = 2); N denotes the total number of samples (a sample is a voiceprint vector, so N is the number of voiceprint vectors); SS_W denotes the within-class compactness; and SS_B denotes the between-class separation.

The within-class compactness is expressed as:

SS_W = Σ_k Σ_{c=1..C_k} || x_kc - x_k,mean ||²

The between-class separation is expressed as:

SS_B = Σ_k C_k · || x_k,mean - X_mean ||²

wherein C_k denotes the number of samples in the k-th class, x_kc denotes the feature vector of the c-th sample of the k-th class, x_k,mean denotes the mean feature vector of the k-th class, and X_mean denotes the mean feature vector of all the samples.
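For illustration, a direct numpy implementation of this index (a sketch rather than the patent's code; it follows the SS_B and SS_W expressions above and agrees with scikit-learn's calinski_harabasz_score) might be:

```python
# Illustrative sketch: the Calinski-Harabasz index for k clusters, following the
# SS_B / SS_W definitions given above.
import numpy as np

def calinski_harabasz(X, labels):
    """X: (N, dim) voiceprint vectors; labels: cluster label of each vector."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    classes = np.unique(labels)
    N, k = len(X), len(classes)
    overall_mean = X.mean(axis=0)                   # X_mean
    ss_b, ss_w = 0.0, 0.0
    for c in classes:
        members = X[labels == c]                    # the C_k samples of class k
        class_mean = members.mean(axis=0)           # x_k,mean
        ss_b += len(members) * np.sum((class_mean - overall_mean) ** 2)
        ss_w += np.sum((members - class_mean) ** 2)
    return (ss_b / (k - 1)) / (ss_w / (N - k))      # S
```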
The present application also provides another possible embodiment, specifically as follows:
A plurality of voiceprint feature similarities are preset, and all the voiceprint vectors are clustered several times, once with each similarity, to obtain several clustering results. The Calinski-Harabasz index of each clustering is calculated, and the clustering result with the largest Calinski-Harabasz index is selected as the best clustering result; the two classes in the best clustering result are the first class and the second class.
For example, referring to Fig. 3, four voiceprint feature similarities of 0.35, 0.5, 0.65 and 0.8 are preset from manual experience, and all the voiceprint vectors are clustered four times, once with each similarity. With a similarity of 0.35 the voiceprint vectors are clustered into 2 classes, class a and class b, and the Calinski-Harabasz index of this clustering is the first clustering index; with 0.5 they are clustered into class c and class d, giving the second clustering index; with 0.65 they are clustered into class e and class f, giving the third clustering index; and with 0.8 they are clustered into class g and class h, giving the fourth clustering index. By comparison the second clustering index is the largest, so 0.5 is the best of the four similarities 0.35, 0.5, 0.65 and 0.8, class c is taken as the first class, and class d is taken as the second class.
It should be noted that fig. 3 illustrates only one case of a method for determining non-teacher voices and teacher voices, and the numerical values and calculation results are only used for illustration and do not limit the scope of the present application.
The above describes the first round of selection of the voiceprint feature similarity; a second round, or even further rounds, of selection can be carried out to find a better similarity. For example, if a similarity of 0.5 is selected in the first round, similarities of 0.44, 0.47, 0.5, 0.53 and 0.56 are set for the second round; the principle of the second round is the same as that of the first round and is not repeated here. Assuming the result of the second round is a similarity of 0.53, the clustering result obtained by clustering the voiceprint vectors with a similarity of 0.53 is taken as the best clustering result, and its two classes are the first class and the second class.
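A rough sketch of this similarity search is shown below (again illustrative rather than the patent's code; it assumes scikit-learn and, as in the earlier sketch, uses the Birch `threshold` parameter to stand in for the voiceprint feature similarity).

```python
# Illustrative sketch (not the patent's code): try several candidate voiceprint
# feature similarities and keep the two-way clustering whose Calinski-Harabasz
# index is largest.
from sklearn.cluster import Birch
from sklearn.metrics import calinski_harabasz_score

def best_two_way_clustering(voiceprints, candidates=(0.35, 0.5, 0.65, 0.8)):
    best = None                                    # (index, similarity, labels)
    for sim in candidates:
        labels = Birch(n_clusters=2, threshold=sim).fit_predict(voiceprints)
        if len(set(labels)) < 2:
            continue                               # this value did not give 2 classes
        score = calinski_harabasz_score(voiceprints, labels)
        if best is None or score > best[0]:
            best = (score, sim, labels)
    return best
```

A second round would simply call the same search again with a finer grid around the winning value, for example (0.44, 0.47, 0.50, 0.53, 0.56).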
And performing secondary clustering on all the voiceprint vectors in the first class and all the voiceprint vectors in the second class respectively to judge which class is the non-teacher sound and which class is the teacher sound, wherein the judging method is the same as the principle of the method for judging the non-teacher sound and the teacher sound through the first index and the second index, and the description is omitted here.
Step S103: extracting the characteristics of the sound segments corresponding to the voiceprint vectors to obtain a second spectrogram, inputting the second spectrogram into a trained sound classification model, and classifying the sound segments corresponding to the voiceprint vectors according to the voiceprint vectors; wherein the second spectrogram is a two-dimensional Mel spectrogram.
The training method of the sound classification model comprises the following specific steps:
converting each training sample in the training sample set into a first spectrogram; wherein the first spectrogram is a two-dimensional Mel spectrogram; and inputting the first spectrogram into a sound classification model, and training the sound classification model by using a VGG11 network structure.
The method for acquiring the training sample set is as follows: classroom sound is collected by audio acquisition equipment installed in classrooms, the classroom sound is cut into multiple pieces of sound, and each piece is used as a training sample. Each training sample is manually labeled as the voice of a single student, the sound of many students reading at the same time, the sound of discussion, noise, and so on. In this example 69000 samples were labeled in total, of which 35000 are the voice of a single student, 9000 are the sound of many students reading at the same time, 12000 are the sound of discussion, and 13000 are noise.
In this embodiment, feature extraction is performed on each sound segment that is a non-teacher sound to obtain a second spectrogram; the second spectrogram is input into the trained sound classification model, which classifies the sound according to the second spectrogram and outputs the classification result. The classification result is one of the voice of a single student, the sound of many students reading together, the sound of discussion, and noise.
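To illustrate this classification stage, the sketch below (not the patent's implementation; it assumes PyTorch/torchvision and the audio_to_mel_image() helper sketched earlier, and the class names are placeholders) adapts a VGG11 backbone to the four classes mentioned here.

```python
# Illustrative sketch (not the patent's code): a VGG11-based classifier over 2-D
# Mel spectrograms with four output classes.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg11

CLASSES = ["single_student", "group_reading", "discussion", "noise"]

def build_sound_classifier(num_classes=len(CLASSES)):
    model = vgg11(weights=None)                              # train from scratch
    # VGG11 expects 3-channel images; swap the first convolution so it accepts a
    # single-channel spectrogram, and resize the final layer to four classes.
    model.features[0] = nn.Conv2d(1, 64, kernel_size=3, padding=1)
    model.classifier[6] = nn.Linear(4096, num_classes)
    return model

def classify(model, mel_img):
    """mel_img: 2-D array from audio_to_mel_image(); returns a class name."""
    model.eval()
    x = torch.tensor(mel_img, dtype=torch.float32)[None, None]  # (1, 1, H, W)
    x = F.interpolate(x, size=(224, 224))                       # standard VGG input
    with torch.no_grad():
        logits = model(x)
    return CLASSES[int(logits.argmax(dim=1))]
```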
In summary, this application uses the sound data of a whole class and combines a voiceprint model with a sound classification model to distinguish the different sounds in a classroom automatically and without any manual interaction, so that classroom teaching analysis can be carried out more efficiently and more accurately.
In order to better implement the method, the embodiment of the present application further provides a device for distinguishing different sounds in a classroom, which may be specifically integrated in a computer device, such as a terminal or a server, where the terminal may include, but is not limited to, a mobile phone, a tablet computer, or a desktop computer.
Fig. 4 is a block diagram of a structure of a device for distinguishing different voices in a classroom according to an embodiment of the present application, where as shown in fig. 4, the device mainly includes:
the acquisition module 201 is configured to acquire classroom sounds, and input the classroom sounds into a trained voiceprint model to obtain voiceprint vectors of multiple sections of sound fragments;
the judging module 202 is configured to judge, according to the voiceprint vector, whether the sound segment corresponding to the voiceprint vector is a non-teacher sound, and if so, switch to the classification module; and
the classification module 203 is configured to perform feature extraction on the sound segments corresponding to the voiceprint vectors to obtain a second spectrogram, input the second spectrogram into a trained sound classification model, and classify the sound segments corresponding to the voiceprint vectors according to the voiceprint vectors; wherein the second spectrogram is a two-dimensional Mel spectrogram;
the training method of the sound classification model comprises the following steps:
converting each training sample in the training sample set into a first spectrogram, wherein the first spectrogram is a two-dimensional Mel spectrogram;
and inputting the first spectrogram into a sound classification model, and training the sound classification model by using a VGG11 network structure.
Various changes and specific examples in the method provided by the above embodiment are also applicable to the device for distinguishing different sounds in a classroom of the present embodiment, and through the foregoing detailed description of the method for distinguishing different sounds in a classroom, those skilled in the art can clearly know the implementation method of the device for distinguishing different sounds in a classroom in the present embodiment, and for the sake of brevity of the description, detailed descriptions are omitted here.
In order to better execute the program of the method, the embodiment of the present application further provides a computer device, as shown in fig. 5, the computer device 300 includes a memory 301 and a processor 302.
The computer device 300 may be implemented in various forms including devices such as a cell phone, a tablet computer, a palm top computer, a laptop computer, and a desktop computer.
The memory 301 may be used to store, among other things, instructions, programs, code sets, or instruction sets. The memory 301 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for at least one function (such as determining whether a sound segment corresponding to a voiceprint vector is a non-teacher sound and classifying the sound segment corresponding to the voiceprint vector, etc.), instructions for implementing a method for distinguishing different sounds in a classroom provided by the above-described embodiments, and the like; the storage data area may store data and the like involved in the method for distinguishing different sounds in a classroom provided by the above-described embodiment.
Processor 302 may include one or more processing cores. The processor 302 may invoke the data stored in the memory 301 by executing or executing instructions, programs, code sets, or instruction sets stored in the memory 301 to perform the various functions of the present application and to process the data. The Processor 302 may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor. It is understood that the electronic devices for implementing the functions of the processor 302 may be other devices, and the embodiments of the present application are not limited thereto.
An embodiment of the present application provides a computer-readable storage medium, for example, including: a U-disk, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes. The computer readable storage medium stores a computer program that can be loaded by a processor and performs the method of distinguishing different sounds in a classroom of the above-described embodiments.
The specific embodiments are merely illustrative and not restrictive. Those skilled in the art may, after reading this specification, make modifications to the embodiments as required, but such modifications remain protected by patent law within the scope of the claims of this application.

Claims (10)

1. A method for training a sound classification model, comprising:
converting each training sample in a training sample set into a first spectrogram, wherein the first spectrogram is a two-dimensional Mel spectrogram;
and inputting the first spectrogram into the sound classification model, and training the sound classification model by using a VGG11 network structure.
2. A method for distinguishing different sounds in a classroom, comprising:
collecting classroom sounds, and inputting the classroom sounds into a trained voiceprint model to obtain voiceprint vectors of a plurality of sections of sound fragments;
judging whether a sound segment corresponding to the voiceprint vector is a non-teacher sound or not according to the voiceprint vector;
if so, extracting the characteristics of the sound segment corresponding to the voiceprint vector to obtain a second spectrogram, inputting the second spectrogram into a trained sound classification model, and classifying the sound segment corresponding to the voiceprint vector according to the voiceprint vector; wherein the second spectrogram is a two-dimensional Mel spectrogram;
the training method of the sound classification model comprises the following steps:
converting each training sample in a training sample set into a first spectrogram, wherein the first spectrogram is a two-dimensional Mel spectrogram;
inputting the first spectrogram into the sound classification model, and training the sound classification model by using a VGG11 network structure.
3. The method of claim 2, wherein the training method of the voiceprint model comprises:
acquiring an open-source sound data set, making pre-collected classroom sounds into a classroom sound data set, and taking the open-source sound data set and the classroom sound data set as a sample set together;
and inputting the samples in the sample set into the voiceprint model, and training the voiceprint model by utilizing a deep learning algorithm.
4. The method of claim 2, wherein the step of collecting the classroom sounds and inputting the classroom sounds into a trained voiceprint model to obtain a voiceprint vector of a plurality of segments of sounds comprises:
dividing the classroom sound into a plurality of sound segments;
and respectively carrying out voiceprint extraction on the multiple sections of sound fragments to obtain the voiceprint vectors.
5. The method of claim 3, wherein the dividing the classroom sound into multiple sound segments comprises:
dividing the classroom sound into a plurality of segments, wherein adjacent segments have shared parts and non-shared parts;
respectively calculating the voiceprint feature matching degree of the shared part and the non-shared part of the adjacent segments;
acquiring a switching point based on the voiceprint feature matching degree;
and dividing the classroom sound into a plurality of sound segments according to the switching point.
6. The method of claim 2, wherein said determining whether the sound segment corresponding to the voiceprint vector is a non-teacher sound based on the voiceprint vector comprises:
and (3) carrying out voiceprint clustering based on the voiceprint vector by adopting a method of combining a BIRTCH clustering algorithm and a Calinski-Harabaz index, and judging whether the sound segment corresponding to the voiceprint vector is non-teacher sound.
7. The method of claim 6, wherein the performing voiceprint clustering based on the voiceprint vector by combining the BIRCH clustering algorithm and the Calinski-Harabasz index, and determining whether the sound segment corresponding to the voiceprint vector is a non-teacher sound, comprises:
clustering all the voiceprint vectors by using the BIRCH clustering algorithm, and dividing all the voiceprint vectors into a first class and a second class;
performing second-level clustering, with the BIRCH clustering algorithm, on all the voiceprint vectors in the first class and on all the voiceprint vectors in the second class respectively;
respectively acquiring a first index and a second index; the first index is the Calinski-Harabasz index obtained from the second-level clustering of all the voiceprint vectors in the first class, and the second index is the Calinski-Harabasz index obtained from the second-level clustering of all the voiceprint vectors in the second class;
judging whether the first index is larger than the second index;
if yes, judging the sound segment corresponding to the voiceprint vector in the first class as non-teacher sound;
if not, the sound segment corresponding to the voiceprint vector in the second class is judged to be non-teacher sound.
8. A device for distinguishing different voices in a classroom is characterized by comprising,
the acquisition module is used for acquiring classroom sounds and inputting the classroom sounds into a trained voiceprint model to obtain voiceprint vectors of a plurality of sections of sound fragments;
the judging module is used for judging, according to the voiceprint vector, whether the sound segment corresponding to the voiceprint vector is a non-teacher sound, and if so, switching to the classification module; and
the classification module is used for extracting the characteristics of the sound segments corresponding to the voiceprint vectors to obtain a second spectrogram, inputting the second spectrogram into a trained sound classification model, and classifying the sound segments corresponding to the voiceprint vectors according to the voiceprint vectors; wherein the second spectrogram is a two-dimensional Mel spectrogram;
the training method of the sound classification model comprises the following steps:
converting each training sample in a training sample set into a first spectrogram, wherein the first spectrogram is a two-dimensional Mel spectrogram;
inputting the first spectrogram into the sound classification model, and training the sound classification model by using a VGG11 network structure.
9. A computer device comprising a memory and a processor, the memory having stored thereon a computer program that can be loaded by the processor and that executes the method according to any of claims 2 to 7.
10. A computer-readable storage medium, in which a computer program is stored which can be loaded by a processor and which executes the method of any one of claims 2 to 7.
CN202210339090.1A 2022-04-01 2022-04-01 Method, device, equipment and storage medium for distinguishing different sounds in classroom Pending CN114822557A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210339090.1A CN114822557A (en) 2022-04-01 2022-04-01 Method, device, equipment and storage medium for distinguishing different sounds in classroom

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210339090.1A CN114822557A (en) 2022-04-01 2022-04-01 Method, device, equipment and storage medium for distinguishing different sounds in classroom

Publications (1)

Publication Number Publication Date
CN114822557A true CN114822557A (en) 2022-07-29

Family

ID=82533290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210339090.1A Pending CN114822557A (en) 2022-04-01 2022-04-01 Method, device, equipment and storage medium for distinguishing different sounds in classroom

Country Status (1)

Country Link
CN (1) CN114822557A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115206294A (en) * 2022-09-16 2022-10-18 深圳比特微电子科技有限公司 Training method, sound event detection method, device, equipment and medium
CN115206294B (en) * 2022-09-16 2022-12-06 深圳比特微电子科技有限公司 Training method, sound event detection method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
CN110147726B (en) Service quality inspection method and device, storage medium and electronic device
US10157619B2 (en) Method and device for searching according to speech based on artificial intelligence
CN110782872A (en) Language identification method and device based on deep convolutional recurrent neural network
CN108305615A (en) A kind of object identifying method and its equipment, storage medium, terminal
CN105957531B (en) Speech content extraction method and device based on cloud platform
US20100121637A1 (en) Semi-Automatic Speech Transcription
CN110544481B (en) S-T classification method and device based on voiceprint recognition and equipment terminal
Sahoo et al. Emotion recognition from audio-visual data using rule based decision level fusion
CN101710490A (en) Method and device for compensating noise for voice assessment
CN110263854B (en) Live broadcast label determining method, device and storage medium
CN111681143A (en) Multi-dimensional analysis method, device, equipment and storage medium based on classroom voice
CN109461441A (en) A kind of Activities for Teaching Intellisense method of adaptive, unsupervised formula
CN112468659A (en) Quality evaluation method, device, equipment and storage medium applied to telephone customer service
CN109800309A (en) Classroom Discourse genre classification methods and device
CN106710588B (en) Speech data sentence recognition method, device and system
CN115457966A (en) Pig cough sound identification method based on improved DS evidence theory multi-classifier fusion
CN114822557A (en) Method, device, equipment and storage medium for distinguishing different sounds in classroom
JP5626221B2 (en) Acoustic image segment classification apparatus and method
CN112052686B (en) Voice learning resource pushing method for user interactive education
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
JP4219539B2 (en) Acoustic classification device
JPWO2016152132A1 (en) Audio processing apparatus, audio processing system, audio processing method, and program
US11238289B1 (en) Automatic lie detection method and apparatus for interactive scenarios, device and medium
CN109635151A (en) Establish the method, apparatus and computer equipment of audio retrieval index

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination