CN112397074A - Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning - Google Patents
- Publication number: CN112397074A
- Application number: CN202011220705.6A
- Authority: CN (China)
- Prior art keywords: voice, MFCC, sample, class, frequency
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L17/02 — Speaker identification or verification: preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/04 — Speaker identification or verification: training, enrolment or model building
- G10L17/06 — Speaker identification or verification: decision making techniques; pattern matching strategies
- G10L17/08 — Speaker identification or verification: use of distortion metrics or a particular distance between probe pattern and reference templates
- G10L17/18 — Speaker identification or verification: artificial neural networks; connectionist approaches
- G10L25/24 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
Abstract
The invention discloses a voiceprint recognition method based on MFCC and vector element learning, which comprises the following steps: voice preprocessing; feature extraction; model training; and pattern matching. The method offers fine-grained classification and high recognition accuracy.
Description
Technical Field
The invention relates to the field of voiceprint recognition, and in particular to a voiceprint recognition method based on MFCC (Mel-frequency cepstral coefficients) and vector element learning.
Background
Voiceprint recognition, also known as speaker recognition, is a technique for determining a speaker's identity from voice. Intuitively, although a voiceprint is not as visually apparent as the individual differences between faces or fingerprints, each person's vocal tract, oral cavity and nasal cavity differ, and those individual differences are reflected in the sound they produce. If the mouth is the transmitter of sound, the human ear, acting as a receiver, likewise has the ability to distinguish voices.
MFCC: Mel-frequency cepstral coefficients. Time-domain speech is converted into the frequency domain, and the frequency-domain signal is filtered band by band to obtain the energy proportion of each frequency band; the matrix formed by these proportion coefficients is the Mel-frequency cepstral coefficients.
Meta-learning: from a network-structure point of view, meta-learning consists of two networks, a meta-net and a net. On the one hand the net acquires knowledge from the meta-net; on the other hand the meta-net observes how the net's own performance improves.
Prototype network: samples are first projected into an embedding space and the centre of each sample class is computed. At classification time, the input (such as an image) is converted by a neural network into a feature vector in this new space, so that vectors of the same class lie relatively close together and vectors of different classes lie relatively far apart. The mean of each class's vectors serves as the prototype of that class, and the class of a target is determined by comparing its distances to the class centres.
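The nearest-prototype idea described above can be sketched in a few lines of NumPy. The 2-D "embeddings" and speaker names below are made-up stand-ins for the output of a real encoder, used only to illustrate the mechanism:

```python
import numpy as np

# Toy support set: two embedded samples per speaker (values are illustrative).
support = {
    "speaker_a": np.array([[0.9, 1.1], [1.1, 0.9]]),
    "speaker_b": np.array([[-1.0, -0.8], [-0.9, -1.2]]),
}
# The prototype of each class is the mean of its embedded samples.
prototypes = {k: v.mean(axis=0) for k, v in support.items()}

def classify(query):
    # Assign the query to the class whose prototype is nearest (Euclidean distance).
    dists = {k: np.linalg.norm(query - c) for k, c in prototypes.items()}
    return min(dists, key=dists.get)

print(classify(np.array([1.0, 1.0])))  # speaker_a
```

In a real system the embeddings would come from a trained encoder rather than being hand-written, but the prototype computation and the distance comparison are exactly these two steps.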
Currently, the mainstream methods for voiceprint recognition are dynamic time warping (DTW), hidden Markov models (HMM), vector quantization (VQ), and the like. However, these methods suffer from low recognition accuracy, high computational cost, a lack of dynamic training, or over-reliance on the original speaker.
The prototype network applies not only to single-sample/few-shot learning but also to zero-shot learning. The idea for this application is: although no data samples of the current class are available, if a prototype representation of the class (meta-information) can be generated at a higher level, the corresponding classification task can be completed with the help of that meta-information.
Disclosure of Invention
To address the shortcomings of the existing mainstream voiceprint recognition algorithms, the invention aims to provide a voiceprint recognition method based on MFCC and vector element learning. The method offers fine-grained classification and high recognition accuracy.
The voiceprint recognition method based on MFCC and vector element learning comprises the following steps:
voice preprocessing: recording voice signals to obtain a voice data set, dividing the voice data set into a training set and a testing set, and then carrying out voice data enhancement and voice pre-emphasis processing on all voice signals in the voice data set;
a characteristic extraction step: performing feature extraction on voice signals in a training set after voice preprocessing by using the MFCC to obtain MFCC feature parameters;
model training: inputting MFCC characteristic parameters of the speech signals of the training set into a prototype network for model training;
pattern matching: MFCC characteristic parameters are extracted from a to-be-recognized voice signal in the test set and input into the trained prototype network for calculation; using Euclidean distance as the distance metric, the features extracted from the recognized voice are compared with the trained model feature parameters of each person, and the closest match is returned as the recognition result.
The voice preprocessing step comprises the following steps:
a voice data enhancement step: collecting speech signals of a person speaking normally through a SEEED voice capture board, and augmenting the collected speech in Praat by forward playback, reverse playback, and random deletion of partial segments;
voice pre-emphasis: the voice signal is passed through a high-pass filter to boost the high-frequency part, flattening the signal's spectrum so that the spectrum can be computed with the same signal-to-noise ratio over the whole band from low frequency to high frequency. This also removes the effects of the vocal cords and lips during phonation, compensates for the high-frequency part of the voice signal suppressed by the articulatory system, and highlights the high-frequency formants.
The feature extraction step includes:
a pre-emphasis substep: the high-frequency part of the voice signal is boosted through a filter;
a sub-frame sub-step: framing the pre-emphasized voice signal;
hamming window substep: multiplying each frame of the framed speech signal by a Hamming window;
fast fourier transform substep: performing fast Fourier transform on each frame of voice signal after the Hamming window to obtain an energy spectrum;
triangular band-pass filtering substep: inputting the energy spectrum into a triangular band-pass filter bank, smoothing the frequency spectrum, eliminating the effect of harmonic waves and highlighting the formants of the original voice;
a logarithmic energy calculation substep: calculating the logarithmic energy output by each triangular band-pass filter;
discrete cosine transform substep: substituting the logarithmic energy obtained by calculation into discrete cosine transform to obtain MFCC characteristic parameters;
dynamic difference parameter substep: the dynamic characteristics of the voice signal are represented by the differential spectrum of the MFCC, and the multidimensional MFCC characteristic parameters are obtained.
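The feature-extraction substeps above can be sketched end to end in NumPy. The frame length, filter count and coefficient count follow the values given later in this description (256-sample frames, 40 triangular filters, 13 coefficients); the half-frame hop, the pre-emphasis coefficient 0.97, and the sine test signal are assumptions for illustration:

```python
import numpy as np

def mfcc(signal, fs=8000, frame_len=256, hop=128, n_filt=40, n_ceps=13):
    """Minimal MFCC sketch following the substeps above (some parameters assumed)."""
    # pre-emphasis: boost the high-frequency part
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # framing with overlap, then Hamming window on each frame
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    # fast Fourier transform -> energy (power) spectrum per frame
    power = np.abs(np.fft.rfft(frames, frame_len)) ** 2 / frame_len
    # triangular Mel-spaced band-pass filter bank
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (fs / 2) / 700), n_filt + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((frame_len + 1) * hz_pts / fs).astype(int)
    fbank = np.zeros((n_filt, frame_len // 2 + 1))
    for m in range(1, n_filt + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    # logarithmic energy of each filter output
    log_e = np.log(power @ fbank.T + 1e-10)
    # discrete cosine transform (DCT-II), keeping the first n_ceps coefficients
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filt))
    return log_e @ dct.T

feats = mfcc(np.sin(2 * np.pi * 440 * np.arange(8000) / 8000))
print(feats.shape)  # (61, 13): 61 frames of 13 coefficients
```

Production systems would typically use a library implementation (e.g. librosa or python_speech_features) and append the dynamic difference parameters; this sketch only shows the chain of substeps.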
The pre-emphasis sub-step comprises:
H(z) = 1 − μz⁻¹ (1),
wherein μ takes a value between 0.9 and 1.0, H(z) is the z-domain transfer function of the pre-emphasis filter, and z⁻¹ denotes a one-sample delay, so the filtered output is y(n) = x(n) − μ·x(n−1).
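The transfer function H(z) = 1 − μz⁻¹ corresponds to the difference equation y(n) = x(n) − μ·x(n−1). A minimal sketch, with μ = 0.97 as an assumed typical value from the stated 0.9–1.0 range:

```python
import numpy as np

def pre_emphasis(x, mu=0.97):
    # y[n] = x[n] - mu * x[n-1]; the first sample is passed through unchanged.
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - mu * x[:-1])

y = pre_emphasis([1.0, 1.0, 1.0, 1.0])
print(y)  # a constant (zero-frequency) input is strongly attenuated after the first sample
```

A constant signal is the extreme low-frequency case: the filter nearly cancels it, which is exactly the spectral tilt toward high frequencies the step describes.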
The framing sub-step comprises:
Firstly, N sampling points are grouped into an observation unit called a frame. Typically N is 256 or 512, covering about 20–30 ms. To avoid excessive change between two adjacent frames, an overlap region of M sampling points is kept between them, with M usually about 1/2 or 1/3 of N. The sampling frequency used for speech recognition is generally 8 kHz or 16 kHz; at 8 kHz, a frame length of 256 sampling points corresponds to 256/8000 × 1000 = 32 ms.
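The frame arithmetic above is easy to check directly (the choice of M = N/2 is one of the two options the text allows):

```python
# N samples per frame at fs = 8 kHz: 256/8000 * 1000 = 32 ms per frame.
N, fs = 256, 8000
M = N // 2                        # overlap region, about 1/2 of N
frame_ms = N / fs * 1000          # duration covered by one frame
hop = N - M                       # frame shift between adjacent frames
print(frame_ms, hop * 1000 / fs)  # 32.0 16.0
```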
The Hamming window sub-step includes:
Each frame is multiplied by a Hamming window to increase the continuity between its left and right ends. If the framed signal is S(n), n = 0, 1, …, N−1, with N the frame size, then after multiplying by the Hamming window
S′(n) = S(n) × W(n),
wherein W(n) denotes the Hamming window
W(n) = (1 − a) − a·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1 (2),
different values of a generate different Hamming windows, and a is generally taken as 0.46;
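The Hamming window with a = 0.46 described above uses the same 0.54/0.46 coefficients as NumPy's built-in `np.hamming`, which makes it easy to verify:

```python
import numpy as np

def hamming_window(N, a=0.46):
    # W(n) = (1 - a) - a*cos(2*pi*n/(N-1)), the Hamming window with a = 0.46
    n = np.arange(N)
    return (1 - a) - a * np.cos(2 * np.pi * n / (N - 1))

w = hamming_window(256)
frame = np.ones(256)
windowed = frame * w              # S'(n) = S(n) x W(n) on a constant test frame
print(round(float(w[0]), 2))      # 0.08 — the window tapers the frame edges
```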
the triangular band-pass filter bank comprises 40 triangular band-pass filters, and the discrete cosine transform substep substitutes the 40 computed logarithmic energies into the discrete cosine transform to obtain 13-dimensional MFCCs.
In the model training step, the prototype network algorithm comprises:
the main idea is as follows: projecting a sample space, namely embedding the sample space into a low-dimensional space, classifying the sample space by using the similarity of the sample in the low-dimensional space, then finding the clustering center of each classification in the low-dimensional space, and measuring the classification of a new sample by using a distance function;
Assuming that the current data set is D, the samples inside it are represented as {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where x denotes the vector representation and y denotes the class label. Assuming there are K classes with N samples per class, N can be divided into N_S and N_Q (N = N_S + N_Q), and the corresponding sample sets are respectively denoted S_k, the support set, and Q_k, the query set;
for the sample points inside the support set, an encoding function f_φ is used to generate a prototype representation for each class, where the encoding function f_φ may be any information extraction method, such as a CNN or an LSTM;
for each classification k, the prototype representation is generated as
c_k = (1/|S_k|) · Σ_{(x_i, y_i) ∈ S_k} f_φ(x_i) (3);
then the distances between the query-set representations and the support-set prototypes are calculated;
finally, the probability p_φ(y = k | x) that the current sample belongs to each classification is calculated using softmax:
p_φ(y = k | x) = exp(−d(f_φ(x), c_k)) / Σ_{k′} exp(−d(f_φ(x), c_{k′})) (4),
wherein d(·) is a distance function and c_k is the clustering centre of each class; once the clustering centre of each class of samples is known, the class that sample x belongs to can be described through the distance function and the softmax function, and the probability that x belongs to the k-th class is shown in formula (4).
The objective function of the k-th class corresponding to sample x is
J(φ) = −log p_φ(y = k | x) (5),
and it is minimized by a stochastic gradient descent method to obtain the optimal parameters φ.
The pattern matching step includes:
generating an encoded representation for each sample point in the support set, generating a prototype representation for each classification by summing and averaging, and generating a vector representation for each query sample;
at the same time, the distance between each query point and each classification prototype is calculated, the softmax probabilities are computed to give the probability distribution over the classifications, and the class with the highest probability is taken as the class label of the test data.
The method of this technical scheme can identify new classes never seen during training, needing only a small amount of sample data for each class. The prototype network maps the sample data of each class into a space and extracts their "mean" as the prototype of that class. Using Euclidean distance as the distance metric, training pulls each class's data closest to its own prototype representation and pushes it further from the other classes' prototype representations. At test time, softmax over the distances from the test data to each class's prototype determines the class label. Because the main recognition process is realized by a prototype network model based on vector element learning, better classification is achieved, solving the problem of low recognition accuracy in existing voiceprint recognition methods.
The method has the advantages of fine classification and high identification accuracy.
Drawings
FIG. 1 is a schematic overall flow chart of the embodiment;
FIG. 2 is a flow chart of a partial implementation of voiceprint recognition in an embodiment;
FIG. 3 is a diagram of an embodiment of a training architecture;
FIG. 4 is an overall architecture diagram of a prototype network in an embodiment;
FIG. 5 is a basic architecture diagram of the meta-learning technique in an embodiment;
FIG. 6 is a flow chart of modeling in an embodiment.
Detailed Description
The invention is further illustrated but not limited by the following figures and examples.
Example:
for speaker recognition, the feature quantities extracted from the recognized speech must be compared with the trained model feature parameters of each person to find the closest similarity as the recognition result. For speaker confirmation, only the input voice characteristic parameters are compared with the declared speaker voice template characteristic parameters, whether the two parameters are matched or not is determined through a corresponding method, if the two parameters are matched, confirmation is carried out, and if not, rejection is carried out.
A sound wave has a corresponding amplitude at each point in time. To digitize it, the wave is sampled at equal intervals and the height of the wave at each sample point is recorded; the rate of these samples is the sampling rate. A person's voice generally spans 100 Hz–10000 Hz, and the sampling frequency is normally determined by the Nyquist sampling theorem, as shown in fig. 6; this embodiment adopts 16 kHz as the sampling frequency. An ADMP401 microphone pickup module is used to collect the voice signals; its amplifier gain reaches 67 dB and it outputs an AD signal, which makes acquisition convenient. In voiceprint recognition, the power spectrum of the voice signal is affected by lip and nose radiation and decreases as the signal frequency increases; to make the spectral distribution of the voice signal more uniform, the high-frequency part of the spectrum is boosted to reduce low-frequency interference. The resulting signals are then fed into the model, built on a Python-language processing platform, for training, as shown in fig. 3.
Referring to fig. 1 and 2, the voiceprint recognition method based on MFCC and vector element learning includes the following steps:
voice preprocessing: recording voice signals to obtain a voice data set, dividing the voice data set into a training set and a testing set, and then carrying out voice data enhancement and voice pre-emphasis processing on all voice signals in the voice data set;
a characteristic extraction step: performing feature extraction on voice signals in a training set after voice preprocessing by using the MFCC to obtain MFCC feature parameters;
model training: inputting MFCC characteristic parameters of the speech signals of the training set into a prototype network for model training;
pattern matching: MFCC characteristic parameters are extracted from a to-be-recognized voice signal in the test set and input into the trained prototype network for calculation (the overall architecture of the prototype network is shown in FIG. 4); using Euclidean distance as the distance metric, the features extracted from the recognized voice are compared with the trained model feature parameters of each person, and the closest match is taken as the recognition result.
The voice preprocessing step comprises the following steps:
a voice data enhancement step: collecting speech signals of a person speaking normally through a SEEED voice capture board, and augmenting the collected speech in Praat by forward playback, reverse playback, and random deletion of partial segments;
voice pre-emphasis: the voice signal is passed through a high-pass filter to boost the high-frequency part, flattening the signal's spectrum so that the spectrum can be computed with the same signal-to-noise ratio over the whole band from low frequency to high frequency. This also removes the effects of the vocal cords and lips during phonation, compensates for the high-frequency part of the voice signal suppressed by the articulatory system, and highlights the high-frequency formants.
The feature extraction step includes:
a pre-emphasis substep: the high-frequency part of the voice signal is boosted through a filter;
a sub-frame sub-step: framing the pre-emphasized voice signal;
hamming window substep: multiplying each frame of the framed speech signal by a Hamming window;
fast fourier transform substep: performing a fast Fourier transform on each Hamming-windowed frame of the voice signal to obtain an energy spectrum;
triangular band-pass filtering substep: inputting the energy spectrum into a triangular band-pass filter bank, smoothing the frequency spectrum, eliminating the effect of harmonics and highlighting the formants of the original voice;
a logarithmic energy calculation substep: calculating the logarithmic energy output by each triangular band-pass filter;
discrete cosine transform substep: substituting the logarithmic energy obtained by calculation into discrete cosine transform to obtain MFCC characteristic parameters;
dynamic difference parameter substep: the dynamic characteristics of the voice signal are represented by the differential spectrum of the MFCC, and the multidimensional MFCC characteristic parameters are obtained.
The pre-emphasis sub-step comprises:
H(z) = 1 − μz⁻¹ (1),
wherein μ takes a value between 0.9 and 1.0, H(z) is the z-domain transfer function of the pre-emphasis filter, and z⁻¹ denotes a one-sample delay, so the filtered output is y(n) = x(n) − μ·x(n−1).
The framing sub-step comprises:
Firstly, N sampling points are grouped into an observation unit called a frame. Typically N is 256 or 512, covering about 20–30 ms. To avoid excessive change between two adjacent frames, an overlap region of M sampling points is kept between them, with M usually about 1/2 or 1/3 of N. The sampling frequency used for speech recognition is generally 8 kHz or 16 kHz; at 8 kHz, a frame length of 256 sampling points corresponds to 256/8000 × 1000 = 32 ms.
The Hamming window sub-step includes:
Each frame is multiplied by a Hamming window to increase the continuity between its left and right ends. If the framed signal is S(n), n = 0, 1, …, N−1, with N the frame size, then after multiplying by the Hamming window
S′(n) = S(n) × W(n),
wherein W(n) denotes the Hamming window
W(n) = (1 − a) − a·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1 (2),
different values of a generate different Hamming windows, and a is generally taken as 0.46;
the triangular band-pass filter bank comprises 40 triangular band-pass filters, and the discrete cosine transform substep substitutes the 40 computed logarithmic energies into the discrete cosine transform to obtain 13-dimensional MFCCs.
In the model training step, the prototype network algorithm comprises:
the main idea is as follows: projecting a sample space, namely embedding the sample space into a low-dimensional space, classifying the sample space by using the similarity of the sample in the low-dimensional space, then finding the clustering center of each classification in the low-dimensional space, and measuring the classification of a new sample by using a distance function;
the MFCC obtained at this point is two-dimensional data; its abstract information, the voiceprint feature map, is extracted using a convolutional neural network. The network architecture trained in this embodiment is ResNet18, chosen mainly because it is lightweight and trains efficiently and stably;
Assuming that the current data set is D, the samples inside it are represented as {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where x denotes the vector representation and y denotes the class label. Assuming there are K classes with N samples per class, N can be divided into N_S and N_Q (N = N_S + N_Q), and the corresponding sample sets are respectively denoted S_k, the support set, and Q_k, the query set;
in this embodiment's actual voice training, the support set contains 5 persons with 5 voice segments each; the query set contains the same 5 persons with 15 voice segments each; and the duration of each voice segment is set to 5 seconds;
for the sample points inside the support set, an encoding function f_φ is used to generate a prototype representation for each class, where the encoding function f_φ may be any information extraction method;
for each classification k, the prototype representation is generated as
c_k = (1/|S_k|) · Σ_{(x_i, y_i) ∈ S_k} f_φ(x_i) (3);
then the distances between the query-set representations and the support-set prototypes are calculated;
finally, the probability p_φ(y = k | x) that the current sample belongs to each classification is calculated using softmax:
p_φ(y = k | x) = exp(−d(f_φ(x), c_k)) / Σ_{k′} exp(−d(f_φ(x), c_{k′})) (4),
wherein d(·) is a distance function and c_k is the clustering centre of each class; once the clustering centre of each class of samples is known, the class that sample x belongs to can be described through the distance function and the softmax function, and the probability that x belongs to the k-th class is shown in formula (4).
The objective function of the k-th class corresponding to sample x is
J(φ) = −log p_φ(y = k | x) (5),
and it is minimized by a stochastic gradient descent method to obtain the optimal parameters φ.
The traditional algorithm strategy uses a dual-threshold method for endpoint decisions. When a voice segment enters, the short-time energy and short-time zero-crossing rate curves gradually rise, then gradually fall as the signal enters a silent segment. In the unvoiced segments at the beginning and end of a voice segment, however, the short-time energy is almost zero while the short-time zero-crossing rate is large, so using short-time energy alone as the endpoint-detection criterion easily cuts off the unvoiced head and tail of the voice signal, preventing a complete segmentation; the short-time zero-crossing rate is therefore needed as a second-stage decision. This method requires slicing the signal; 20 ms slices are used in the analysis, and an FFT can be applied to obtain the corresponding waveforms. Once the individual sound waves are available, the energy contained in each frequency band is summed to form the features of a new audio segment. Addressing the general characteristics of acoustic models, other signal-transformation strategies have been proposed based on MFCC, a cepstral parameter extracted in the frequency domain on the Mel scale, which describes the nonlinear behaviour of human-ear frequency perception. The Mel scale is approximated by:
Mel(f) = 2595 × log₁₀(1 + f/700).
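The standard Mel-scale approximation Mel(f) = 2595·log₁₀(1 + f/700) and its inverse can be written directly; by construction, 1000 Hz maps to roughly 1000 mel:

```python
import numpy as np

def hz_to_mel(f):
    # Mel(f) = 2595 * log10(1 + f/700)
    return 2595 * np.log10(1 + np.asarray(f, dtype=float) / 700)

def mel_to_hz(m):
    # Inverse mapping: f = 700 * (10^(m/2595) - 1)
    return 700 * (10 ** (np.asarray(m, dtype=float) / 2595) - 1)

print(round(float(hz_to_mel(1000)), 1))  # ~1000 mel at 1 kHz
```

These two functions are exactly what spaces the centre frequencies of the 40 triangular filters in the filter bank: equal steps on the mel axis become progressively wider bands on the hertz axis.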
The MFCC characteristic parameters of the obtained speech signals are input into a prototype network under vector element learning for model training. As shown in FIG. 4, the prototype network maps the sample data of each class into a space and extracts their "mean" as the prototype of the class; using Euclidean distance as the distance metric, training pulls each class's data closest to its own prototype and further from the other classes' prototypes. At test time, softmax over the distances from the test data to each class's prototype determines the class label, thereby identifying the voiceprint.
For the prototype network, the range of application covers not only single-sample/few-shot learning but also zero-shot learning. The idea for this application is: although there are no data samples of the current class, if a prototype representation of the class, i.e. meta-information, can be generated at a higher level, as shown in fig. 5, the corresponding calculations can be completed by means of such meta-information, finishing the corresponding classification task;
the pattern matching step includes:
generating an encoded representation for each sample point in the support set, generating a prototype representation for each classification by summing and averaging, and generating a vector representation for each query sample;
at the same time, the distance between each query point and each classification prototype is calculated, the softmax probabilities are computed to give the probability distribution over the classifications, and the class with the highest probability is taken as the class label of the test data.
The results of the comparison of the method of the present example with other conventional voiceprint recognition algorithms are shown in table 1:
TABLE 1
As shown in Table 1, the method of this example achieves a higher recognition rate.
Claims (9)
1. The voiceprint recognition method based on MFCC and vector element learning is characterized by comprising the following steps of:
voice preprocessing: recording voice signals to obtain a voice data set, dividing the voice data set into a training set and a testing set, and then carrying out voice data enhancement and voice pre-emphasis processing on all voice signals in the voice data set;
a characteristic extraction step: performing feature extraction on voice signals in a training set after voice preprocessing by using the MFCC to obtain MFCC feature parameters;
model training: inputting MFCC characteristic parameters of the speech signals of the training set into a prototype network for model training;
pattern matching: MFCC characteristic parameters are extracted from a to-be-recognized voice signal in the test set and input into the trained prototype network for calculation; using Euclidean distance as the distance metric, the features extracted from the recognized voice are compared with the trained model feature parameters of each person, and the closest match is returned as the recognition result.
2. The voiceprint recognition method based on MFCC and vector element learning of claim 1, wherein the voice preprocessing step comprises:
a voice data enhancement step: collecting speech signals of a person speaking normally through a SEEED voice capture board, and augmenting the collected speech in Praat by forward playback, reverse playback, and random deletion of partial segments;
A voice pre-emphasis step: passing the voice signal through a high-pass filter to boost the high-frequency part, so that the spectrum of the signal becomes flat and retains the same signal-to-noise ratio over the whole band from low to high frequency; at the same time, the vocal-cord and lip effects in the sounding process are eliminated, the high-frequency part of the voice signal suppressed by the vocal system is compensated, and the high-frequency formants are highlighted.
3. The method of claim 1, wherein the feature extraction step comprises:
A pre-emphasis substep: boosting the high-frequency part of the voice signal through a filter;
A framing substep: framing the pre-emphasized voice signal;
A Hamming window substep: multiplying each frame of the framed voice signal by a Hamming window;
A fast Fourier transform substep: performing a fast Fourier transform on each windowed frame to obtain an energy spectrum;
A triangular band-pass filtering substep: inputting the energy spectrum into a triangular band-pass filter bank to smooth the spectrum, eliminate the effect of harmonics, and highlight the formants of the original voice;
A logarithmic energy calculation substep: calculating the logarithmic energy output by each triangular band-pass filter;
A discrete cosine transform substep: substituting the calculated logarithmic energies into a discrete cosine transform to obtain the MFCC characteristic parameters;
A dynamic difference parameter substep: representing the dynamic characteristics of the voice signal by the differential spectrum of the MFCC to obtain multidimensional MFCC characteristic parameters.
4. The method of voiceprint recognition based on MFCC and vector element learning of claim 3, wherein the pre-emphasis sub-step comprises:
H(z) = 1 − μz^(−1) (1),
wherein the value of μ is between 0.9 and 1.0, and H(z) is the transfer function of the pre-emphasis filter relating the speech signal after pre-emphasis to the speech signal before pre-emphasis; in the time domain, the pre-emphasized signal is y(n) = x(n) − μx(n − 1).
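As an illustrative sketch (not part of the claims), the filter of formula (1) can be applied in the time domain as y(n) = x(n) − μ·x(n − 1); the coefficient value 0.97 below is a typical choice within the claimed 0.9–1.0 range:

```python
import numpy as np

def pre_emphasis(signal, mu=0.97):
    """Apply the first-order high-pass filter H(z) = 1 - mu * z^(-1).

    In the time domain this is y(n) = x(n) - mu * x(n-1); the first
    sample is passed through unchanged.
    """
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - mu * signal[:-1])

# A constant (purely low-frequency) signal is almost cancelled, while
# sample-to-sample jumps (high frequencies) are preserved.
x = np.array([1.0, 1.0, 1.0, 5.0])
y = pre_emphasis(x, mu=0.97)
```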
5. The method of claim 3, wherein the framing sub-step comprises:
First, N sampling points are grouped into one observation unit, called a frame. Generally, N is 256 or 512, covering a duration of about 20–30 ms. To avoid excessive change between two adjacent frames, an overlap region of M sampling points is provided between them, where M is generally about 1/2 or 1/3 of N. The sampling frequency of the speech signal used in speech recognition is generally 8 kHz or 16 kHz; at 8 kHz, a frame length of 256 sampling points corresponds to a duration of 256/8000 × 1000 = 32 ms.
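The frame and overlap arithmetic above can be sketched as follows (illustrative values: N = 256, M = N/2, 8 kHz sampling rate):

```python
import numpy as np

def frame_signal(signal, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames.

    frame_len = N sampling points per frame; hop = N - M, where M is the
    number of overlapping points between adjacent frames (here M = N/2).
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Row i holds the indices of frame i: [i*hop, i*hop + frame_len)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

fs = 8000                          # sampling frequency in Hz
frames = frame_signal(np.zeros(8000), frame_len=256, hop=128)
frame_ms = 256 / fs * 1000         # duration of one frame in ms
```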
6. The method of voiceprint recognition based on MFCC and vector element learning of claim 3, wherein said Hamming window sub-step comprises:
Each frame is multiplied by a Hamming window to increase the continuity of the left and right ends of the frame. Let the signal after framing be S(n), n = 0, 1, …, N − 1, where N is the frame size; after multiplying by the Hamming window,
S′(n) = S(n) × W(n),
where W(n) represents the Hamming window, W(n) = (1 − a) − a × cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1; different values of a result in different Hamming windows, and typically a = 0.46.
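The windowing with a = 0.46 can be sketched as follows, using the standard Hamming form W(n) = (1 − a) − a·cos(2πn/(N − 1)) implied by the parameter a above (illustrative only):

```python
import numpy as np

def hamming_window(N, a=0.46):
    """W(n) = (1 - a) - a * cos(2*pi*n / (N - 1)), n = 0..N-1."""
    n = np.arange(N)
    return (1 - a) - a * np.cos(2 * np.pi * n / (N - 1))

N = 256
w = hamming_window(N)
frame = np.ones(N)           # a dummy frame S(n)
windowed = frame * w         # S'(n) = S(n) * W(n)
```

The window tapers both ends of the frame to 1 − 2a = 0.08 while leaving the middle near 1, which reduces spectral leakage in the subsequent FFT.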
7. The voiceprint recognition method based on MFCC and vector element learning of claim 3, wherein said triangular band-pass filter bank comprises 40 triangular band-pass filters, and said discrete cosine transform sub-step substitutes the 40 logarithmic energies obtained by calculation into discrete cosine transform to obtain MFCC of order 13.
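The chain of claim 7 (40 triangular filters, logarithmic energies, DCT, 13-order MFCC) can be sketched as follows. The HTK-style mel mapping 2595·log10(1 + f/700) is an assumption; the patent does not specify the mel formula:

```python
import numpy as np

def mel_filterbank(n_filters=40, n_fft=512, fs=8000):
    """Build triangular band-pass filters spaced evenly on the mel scale."""
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    mel_inv = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    mel_pts = np.linspace(mel(0), mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_inv(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):           # rising edge of triangle i
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):           # falling edge of triangle i
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    return fbank

def mfcc_from_power_spectrum(power, n_filters=40, n_ceps=13, fs=8000):
    """Log filterbank energies followed by a DCT-II, keeping 13 coefficients."""
    n_fft = (power.shape[-1] - 1) * 2
    fbank = mel_filterbank(n_filters, n_fft, fs)
    log_e = np.log(power @ fbank.T + 1e-10)   # logarithmic energy per filter
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return log_e @ dct.T

# Energy spectrum of one 512-point frame (illustrative random signal).
spec = np.abs(np.fft.rfft(np.random.randn(512))) ** 2
coeffs = mfcc_from_power_spectrum(spec)
```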
8. The method for voiceprint recognition based on MFCC and vector element learning of claim 1, wherein in the model training step, the prototype network algorithm comprises:
The main idea is as follows: the sample space is projected, i.e. embedded, into a low-dimensional space; samples are classified by their similarity in that low-dimensional space; the cluster center of each class is then found in the low-dimensional space; and the classification of a new sample is measured by a distance function;
Assuming that the current data set is D, the samples inside it are represented as {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where x denotes the vector representation and y denotes the class label. Assuming there are K classes with N samples per class, N can be divided into N_S and N_Q (N = N_S + N_Q); the corresponding sample sets are respectively denoted S_k, the support set, and Q_k, the query set;
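The split of each class's N samples into N_S support samples and N_Q query samples can be sketched as follows (hypothetical helper with illustrative values K = 3, N_S = N_Q = 5):

```python
import random

def make_episode(dataset, n_support=5, n_query=5):
    """Split each class's samples into a support set S_k and a query set Q_k.

    `dataset` maps a class label to a list of feature vectors
    (e.g. MFCC parameter vectors for one speaker).
    """
    support, query = {}, {}
    for label, samples in dataset.items():
        picked = random.sample(samples, n_support + n_query)
        support[label] = picked[:n_support]
        query[label] = picked[n_support:]
    return support, query

# Illustrative data: 3 classes (speakers), 10 samples each.
data = {k: [[float(k), float(i)] for i in range(10)] for k in range(3)}
S, Q = make_episode(data, n_support=5, n_query=5)
```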
For the sample points inside the support set, an encoding function f_φ is used to generate a prototype representation for each class, where the encoding function f_φ may be any information extraction method, such as a CNN or an LSTM;
For each class k, a prototype representation is generated as
c_k = (1/|S_k|) Σ_{(x_i, y_i) ∈ S_k} f_φ(x_i) (2);
Then the distance between each query-set representation and each support-set prototype is calculated;
Finally, the probability p_φ(y = k | x) that the current sample belongs to each class is calculated using softmax:
p_φ(y = k | x) = exp(−d(f_φ(x), c_k)) / Σ_{k′} exp(−d(f_φ(x), c_{k′})) (4),
wherein d(·) is a distance function and c_k is the cluster center of class k. After the cluster center of each class of samples is known, the class to which a sample x belongs can be described by the distance function and the softmax function; the probability that x belongs to the k-th class is shown in formula (4);
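A minimal numeric sketch of the prototype averaging and the softmax probability of formula (4) follows; the identity embedding stands in for f_φ, which the patent leaves open:

```python
import numpy as np

def prototypes(support):
    """Cluster center c_k = mean of the (embedded) support samples of class k."""
    return {k: np.mean(v, axis=0) for k, v in support.items()}

def class_probabilities(x, protos):
    """p(y=k|x): softmax over negative squared Euclidean distances, formula (4)."""
    labels = sorted(protos)
    d = np.array([np.sum((x - protos[k]) ** 2) for k in labels])
    logits = -d
    p = np.exp(logits - logits.max())       # shift for numerical stability
    return labels, p / p.sum()

# Two classes with two 2-D support samples each (illustrative values).
support = {0: np.array([[0.0, 0.0], [0.0, 2.0]]),
           1: np.array([[4.0, 0.0], [4.0, 2.0]])}
protos = prototypes(support)                # c_0 = (0, 1), c_1 = (4, 1)
labels, p = class_probabilities(np.array([0.5, 1.0]), protos)
```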
9. The method of claim 1, wherein the pattern matching step comprises:
generating a coded representation for each sample point in the support set, generating a prototype representation for each class by summing and averaging, and generating a vector representation for each query sample;
meanwhile, the distance between each query point and each class prototype is calculated, the softmax probabilities are computed to generate a probability distribution over the classes, and the class with the highest probability is taken as the class label of the test data.
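The matching step of this claim can be sketched end-to-end as a nearest-prototype decision (equivalent to taking the argmax of the softmax over negative distances); the identity embedding and the speaker names are assumptions for illustration:

```python
import numpy as np

def classify_queries(support, queries):
    """Label each query vector with the class of its nearest prototype
    under Euclidean distance."""
    protos = {k: np.mean(np.asarray(v), axis=0) for k, v in support.items()}
    labels = sorted(protos)
    out = []
    for q in queries:
        d = [np.linalg.norm(np.asarray(q) - protos[k]) for k in labels]
        out.append(labels[int(np.argmin(d))])
    return out

# Hypothetical enrolled speakers with 2-D feature vectors.
support = {"spk_a": [[0.0, 0.0], [1.0, 0.0]],
           "spk_b": [[10.0, 0.0], [11.0, 0.0]]}
pred = classify_queries(support, [[0.2, 0.1], [10.4, -0.2]])
```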
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011220705.6A CN112397074A (en) | 2020-11-05 | 2020-11-05 | Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112397074A true CN112397074A (en) | 2021-02-23 |
Family
ID=74597377
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011220705.6A Pending CN112397074A (en) | 2020-11-05 | 2020-11-05 | Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112397074A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113011346A (en) * | 2021-03-19 | 2021-06-22 | 电子科技大学 | Radiation source unknown signal identification method based on metric learning |
CN113658582A (en) * | 2021-07-15 | 2021-11-16 | 中国科学院计算技术研究所 | Voice-video cooperative lip language identification method and system |
CN114023312A (en) * | 2021-11-26 | 2022-02-08 | 杭州涿溪脑与智能研究所 | Voice voiceprint recognition general countermeasure disturbance construction method and system based on meta-learning |
CN116108372A (en) * | 2023-04-13 | 2023-05-12 | 中国人民解放军96901部队 | Infrasound event classification and identification method for small samples |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102324232A (en) * | 2011-09-12 | 2012-01-18 | 辽宁工业大学 | Method for recognizing sound-groove and system based on gauss hybrid models |
CN108847244A (en) * | 2018-08-22 | 2018-11-20 | 华东计算技术研究所(中国电子科技集团公司第三十二研究所) | Voiceprint recognition method and system based on MFCC and improved BP neural network |
CN111785286A (en) * | 2020-05-22 | 2020-10-16 | 南京邮电大学 | Home CNN classification and feature matching combined voiceprint recognition method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210223 |