CN112397074A - Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning - Google Patents

Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning Download PDF

Info

Publication number
CN112397074A
Authority
CN
China
Prior art keywords
voice
mfcc
sample
class
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011220705.6A
Other languages
Chinese (zh)
Inventor
林科
满瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202011220705.6A priority Critical patent/CN112397074A/en
Publication of CN112397074A publication Critical patent/CN112397074A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/08 Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Abstract

The invention discloses a voiceprint recognition method based on MFCC and vector element learning, which comprises the following steps: voice preprocessing; feature extraction; model training; and pattern matching. The method has the advantages of fine classification and high recognition accuracy.

Description

Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning
Technical Field
The invention relates to the field of voiceprint recognition, in particular to a voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning.
Background
Voiceprint recognition, also known as speaker recognition, is a technique for discriminating the identity of a speaker by voice. Intuitively, although voiceprint differences are not as visible as the individual differences between human faces or fingerprints, each person's vocal tract, oral cavity and nasal cavity differ from person to person, and these differences are therefore reflected in the voice. If the mouth is the transmitter of sound, the human ear, acting as a receiver, likewise has the ability to distinguish sounds.
MFCC: Mel-Frequency Cepstral Coefficients. The time-domain speech is converted into the frequency domain, and the frequency-domain signal is filtered band by band to obtain the proportion occupied by different frequency bands; the matrix formed by these proportion coefficients is the Mel-frequency cepstral coefficient matrix.
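As a concrete illustration of this definition, the sketch below computes a 13-coefficient MFCC matrix for one utterance using the ready-made implementation in the librosa library (rather than the step-by-step pipeline described later in this document); the file name, 16 kHz rate and coefficient count are illustrative assumptions, not values fixed by the invention.

```python
import librosa

# Load one utterance (placeholder file name, assumed 16 kHz sampling rate)
# and compute 13 Mel-frequency cepstral coefficients per frame.
y, sr = librosa.load("speaker_utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
print(mfcc.shape)
```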
Meta learning: from the point of view of network structure, meta learning consists of two networks, a meta-net and a net; on the one hand the net acquires knowledge from the meta-net, and on the other hand the meta-net observes the performance improvement of the net.
Prototype network: the samples are first projected into a space and the center of each sample class is calculated. During classification, the input (for example an image) is converted by a neural network into a new feature vector in this feature space, so that vectors of the same class are relatively close to each other and vectors of different classes are relatively far apart. At the same time, the mean of each class is calculated to represent the prototype of that class, and the class of a target is determined by comparing its distance to each class center.
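A minimal NumPy sketch of this idea, assuming that embedded feature vectors are already available; the helper names are illustrative and are not part of the patent.

```python
import numpy as np

def class_prototypes(embeddings, labels):
    """Mean embedding of each class: the class 'prototype' (its cluster center)."""
    classes = np.unique(labels)
    protos = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def nearest_prototype(query, classes, protos):
    """Assign the query to the class whose prototype is closest in Euclidean distance."""
    dists = np.linalg.norm(protos - query, axis=1)
    return classes[np.argmin(dists)]
```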
At present, the mainstream methods for voiceprint recognition are Dynamic Time Warping (DTW), hidden Markov models (HMM), Vector Quantization (VQ) and the like. However, these methods suffer from low recognition accuracy, a large amount of computation, a lack of dynamic training, or an over-reliance on the original speaker.
For the prototype network, the range of application is not limited to single-sample/few-sample learning; it also extends to zero-sample learning. The idea of this application is that, although there are no data samples of the current class, a prototype representation of the class (meta-information) can be generated at a higher level, and with the help of this meta-information the corresponding classification can still be completed.
Disclosure of Invention
Aiming at the defects of the existing mainstream algorithm for voiceprint recognition, the invention aims to provide a voiceprint recognition method based on MFCC and vector element learning. The method has the advantages of fine classification and high identification accuracy.
The voiceprint recognition method based on MFCC and vector element learning comprises the following steps:
voice preprocessing: recording voice signals to obtain a voice data set, dividing the voice data set into a training set and a testing set, and then carrying out voice data enhancement and voice pre-emphasis processing on all voice signals in the voice data set;
a characteristic extraction step: performing feature extraction on voice signals in a training set after voice preprocessing by using the MFCC to obtain MFCC feature parameters;
model training: inputting MFCC characteristic parameters of the speech signals of the training set into a prototype network for model training;
pattern matching: MFCC characteristic parameters are extracted from the voice signal to be recognized in the test set and input into the trained prototype network for calculation; using the Euclidean distance as the distance metric, the features extracted from the voice to be recognized are compared with the trained model characteristic parameters of each person, and the most similar one is taken as the recognition result.
The voice preprocessing step comprises the following steps:
a voice data enhancement sub-step: the speech signals of persons speaking normally are collected through a SEEED speech acquisition board, and the collected speech is enhanced in praat software by forward playback, reverse playback and random deletion of partial segments (a minimal code sketch of these operations is given after the voice pre-emphasis sub-step below);
voice pre-emphasis: the voice signal is passed through a high-pass filter to boost its high-frequency part, so that the spectrum of the signal becomes flatter and is maintained over the whole band from low to high frequency, allowing the spectrum to be obtained with the same signal-to-noise ratio; at the same time, the effect of the vocal cords and lips during phonation is removed, the high-frequency part of the voice signal that is suppressed by the vocal system is compensated, and the high-frequency formants are emphasized.
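The following is a minimal sketch of the three enhancement operations named in the voice data enhancement sub-step (forward playback, reverse playback, random deletion of a segment), assuming the waveform is a NumPy array; the roughly 10% deletion length is an illustrative assumption, since the patent does not specify it.

```python
import numpy as np

def augment(signal: np.ndarray, rng=None):
    """Return forward, reversed, and randomly-cut variants of one waveform."""
    rng = rng or np.random.default_rng()
    forward = signal.copy()                      # forward playback
    reverse = signal[::-1].copy()                # reverse playback
    cut_len = max(1, len(signal) // 10)          # assumed ~10% deletion
    start = rng.integers(0, len(signal) - cut_len)
    deleted = np.delete(signal, np.arange(start, start + cut_len))
    return forward, reverse, deleted
```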
The feature extraction step includes:
a pre-emphasis substep: the high-frequency part of the voice signal is boosted through a filter;
a framing sub-step: framing the pre-emphasized voice signal;
hamming window substep: multiplying each frame of the framed speech signal by a Hamming window;
fast fourier transform substep: performing fast Fourier transform on each frame of voice signal after the Hamming window to obtain an energy spectrum;
triangular band-pass filtering substep: inputting the energy spectrum into a triangular band-pass filter bank, smoothing the frequency spectrum, eliminating the effect of harmonic waves and highlighting the formants of the original voice;
a logarithmic energy calculation substep: calculating the logarithmic energy output by each triangular band-pass filter;
discrete cosine transform substep: substituting the logarithmic energy obtained by calculation into discrete cosine transform to obtain MFCC characteristic parameters;
dynamic difference parameter substep: the dynamic characteristics of the voice signal are represented by the differential spectrum of the MFCC, and the multidimensional MFCC characteristic parameters are obtained.
The pre-emphasis sub-step comprises:
H(z) = 1 − μz^(−1)    (1),
wherein μ takes a value between 0.9 and 1.0; H(z) is the transfer function of the pre-emphasis filter, whose input is the speech signal before pre-emphasis and whose output is the speech signal after pre-emphasis.
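A one-line NumPy sketch of this filter, with μ = 0.97 chosen from the stated 0.9–1.0 range as an illustrative value:

```python
import numpy as np

def pre_emphasis(x: np.ndarray, mu: float = 0.97) -> np.ndarray:
    """y[n] = x[n] - mu * x[n-1], i.e. the filter H(z) = 1 - mu * z^-1 of formula (1)."""
    return np.append(x[0], x[1:] - mu * x[:-1])
```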
The framing sub-step comprises:
First, N sampling points are grouped into one observation unit, called a frame. Typically N is 256 or 512, covering about 20–30 ms. To avoid excessive change between two adjacent frames, an overlap region of M sampling points is kept between them, where M is usually about 1/2 or 1/3 of N. The sampling frequency used for speech recognition is generally 8 kHz or 16 kHz; at 8 kHz, a frame length of 256 sampling points corresponds to 256/8000 × 1000 = 32 ms.
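A sketch of this framing, assuming N = 256 samples per frame and an overlap of N/2 (both within the stated ranges):

```python
import numpy as np

def frame_signal(x: np.ndarray, frame_len: int = 256, hop: int = 128) -> np.ndarray:
    """Split a 1-D signal into overlapping frames of frame_len samples, hop samples apart."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

# At 8 kHz sampling, each 256-sample frame covers 256 / 8000 * 1000 = 32 ms.
```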
The Hamming window sub-step includes:
Each frame is multiplied by a Hamming window to increase the continuity at the left and right ends of the frame. If the framed signal is S(n), n = 0, 1, ..., N−1, where N is the frame size, then after multiplication by the Hamming window
S′(n) = S(n) × W(n)
W(n, a) = (1 − a) − a × cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1    (2),
wherein W(n) denotes the Hamming window; different values of a produce different Hamming windows, and a is generally taken as 0.46;
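The generalized Hamming window written above can be computed directly (a = 0.46 gives the standard window):

```python
import numpy as np

def hamming_window(N: int, a: float = 0.46) -> np.ndarray:
    """W(n) = (1 - a) - a * cos(2*pi*n / (N - 1)), n = 0..N-1."""
    n = np.arange(N)
    return (1 - a) - a * np.cos(2 * np.pi * n / (N - 1))

# windowed_frames = frames * hamming_window(frames.shape[1])
```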
The triangular band-pass filter bank comprises 40 triangular band-pass filters, and the discrete cosine transform sub-step substitutes the 40 calculated logarithmic energies into a discrete cosine transform to obtain a 13th-order MFCC.
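A sketch of the last three sub-steps (40 triangular mel filters, log energies, DCT to 13 coefficients), operating on the per-frame power spectrum produced by the FFT sub-step (shape: frames × (n_fft/2 + 1)); the filter edge placement and normalization are standard textbook choices assumed here, since the patent does not fix them.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_power_spectrum(power_frames, sr=8000, n_fft=256,
                             n_filters=40, n_ceps=13):
    """40 triangular mel filters -> log energies -> DCT -> 13 coefficients per frame."""
    # mel-spaced filter edges between 0 Hz and sr/2 (assumed standard layout)
    high_mel = 2595.0 * np.log10(1.0 + (sr / 2) / 700.0)
    mel_points = np.linspace(0.0, high_mel, n_filters + 2)
    hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / sr).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    filter_energies = power_frames @ fbank.T                   # (n_frames, 40)
    log_energies = np.log(np.maximum(filter_energies, 1e-10))  # logarithmic energy
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_ceps]
```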
In the model training step, the prototype network algorithm comprises:
The main idea is as follows: the sample space is projected, that is, embedded, into a low-dimensional space, and samples are classified by their similarity in that low-dimensional space; the cluster center of each class is then found in the low-dimensional space, and the class of a new sample is measured with a distance function;
Assume that the current data set is D, with samples {(x1, y1), (x2, y2), ..., (xn, yn)}, where x denotes the vector representation and y denotes the class label. Assume there are K classes with N samples per class, where N can be divided into N_S and N_Q (N = N_S + N_Q); the corresponding sample sets are denoted S_k (the support set) and Q_k (the query set);
for the sample points in the support set, an encoding function f_w is used to generate a prototype representation for each class, where the encoding function f_w may be any information extraction method, such as a CNN or an LSTM;
for each class k, the prototype representation is generated as
c_k = (1/|S_k|) Σ_{(x_i, y_i) ∈ S_k} f_w(x_i)    (3),
wherein f_w(x_i) denotes the extracted features;
then the distances between the query-set representations and the support-set prototypes are calculated;
finally, the probability p_w(y = k | x) that the current sample belongs to each class is calculated using softmax:
p_w(y = k | x) = exp(−d(f_w(x), c_k)) / Σ_k′ exp(−d(f_w(x), c_k′))    (4),
wherein d(·) is the distance function and c_k is the cluster center of class k; once the cluster center of each class of samples is known, the class to which a sample x belongs can be described through the distance function and the softmax function, and the probability that x belongs to the k-th class is given by formula (4);
finally, the parameters w of the network f_w are obtained; the loss function used is
J(w) = −log p_w(y = k | x)    (5),
so the objective function for the k-th class corresponding to sample x is given by formula (5); this objective function is minimized by stochastic gradient descent to obtain the optimal parameters w.
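A compact PyTorch sketch of one training episode under formulas (3)-(5); the encoder f_w is assumed to be given (for example the ResNet18 described in the embodiment), and the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def prototypical_loss(support_emb, support_y, query_emb, query_y, n_classes):
    """Episode loss: softmax over negative Euclidean distances to class prototypes."""
    protos = torch.stack([support_emb[support_y == k].mean(0) for k in range(n_classes)])
    dists = torch.cdist(query_emb, protos)          # (n_query, K) Euclidean distances
    log_p = F.log_softmax(-dists, dim=1)            # formula (4) in log form
    return F.nll_loss(log_p, query_y)               # -log p_w(y = k | x), formula (5)

# Typical use: loss = prototypical_loss(f_w(xs), ys, f_w(xq), yq, K)
# loss.backward(); optimizer.step()   # stochastic gradient descent on w
```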
The pattern matching step includes:
a coded representation is generated for each sample point in the support set, a prototype representation is generated for each class by summing and averaging, and a vector representation is generated for each query sample;
at the same time, the distance between each query point and each class prototype is calculated, the softmax probabilities are computed to give the probability distribution over the classes, and the class with the highest probability is taken as the class label of the test data.
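Correspondingly, a short sketch of this matching step at test time, reusing the prototypes computed from the support set:

```python
import torch
import torch.nn.functional as F

def predict(query_emb, protos):
    """Class probability distribution and most likely label for each query embedding."""
    dists = torch.cdist(query_emb, protos)
    probs = F.softmax(-dists, dim=1)       # probability over classes, formula (4)
    return probs, probs.argmax(dim=1)      # highest-probability class = predicted label
```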
The method of this technical scheme can identify new classes never seen during training, and only a small amount of sample data is needed for each class. The prototype network maps the sample data of each class into a space and extracts their "mean" as the prototype of that class. Using the Euclidean distance as the distance metric, training makes data of a class closest to that class's prototype representation and farther from the prototype representations of other classes. During testing, a softmax over the distances from the test data to each class prototype is used to decide the class label of the test data. Because the main recognition process is realized by a prototype network model based on vector element learning, better classification is achieved, which solves the problem of low recognition accuracy of existing voiceprint recognition methods.
The method has the advantages of fine classification and high identification accuracy.
Drawings
FIG. 1 is a schematic overall flow chart of the embodiment;
FIG. 2 is a flow chart of a partial implementation of voiceprint recognition in an embodiment;
FIG. 3 is a diagram of an embodiment of a training architecture;
FIG. 4 is an overall architecture diagram of a prototype network in an embodiment;
FIG. 5 is a basic architecture diagram of the meta-learning technique in an embodiment;
FIG. 6 is a flow chart of modeling in an embodiment.
Detailed Description
The invention is further illustrated but not limited by the following figures and examples.
Example:
for speaker recognition, the feature quantities extracted from the recognized speech must be compared with the trained model feature parameters of each person to find the closest similarity as the recognition result. For speaker confirmation, only the input voice characteristic parameters are compared with the declared speaker voice template characteristic parameters, whether the two parameters are matched or not is determined through a corresponding method, if the two parameters are matched, confirmation is carried out, and if not, rejection is carried out.
A sound wave has a corresponding amplitude at every moment in time. To convert the sound wave into digital form, it is sampled at equally spaced points, and the height of the wave at those points is recorded as the sample values. The phonation frequency of a typical person is 100 Hz–10000 Hz, and the sampling frequency is generally determined by the Nyquist sampling theorem, as shown in FIG. 6; this embodiment therefore adopts 16 kHz as the sampling frequency. An ADMP401 microphone pickup module is used to collect the voice signals; the amplifier gain reaches 67 dB and an AD signal is output, which is convenient for acquisition. In voiceprint recognition, because the power spectrum of the voice signal is affected by lip and nose radiation and decreases as the signal frequency increases, the spectrum of the high-frequency part of the signal is boosted to make the spectral distribution of the voice signal more uniform and to reduce low-frequency interference in the voice signal. The obtained signals are then fed into the model for training on a Python-language processing platform, as shown in FIG. 3.
Referring to fig. 1 and 2, the voiceprint recognition method based on MFCC and vector element learning includes the following steps:
voice preprocessing: recording voice signals to obtain a voice data set, dividing the voice data set into a training set and a testing set, and then carrying out voice data enhancement and voice pre-emphasis processing on all voice signals in the voice data set;
a characteristic extraction step: performing feature extraction on voice signals in a training set after voice preprocessing by using the MFCC to obtain MFCC feature parameters;
model training: inputting MFCC characteristic parameters of the speech signals of the training set into a prototype network for model training;
pattern matching: MFCC characteristic parameters are extracted from the voice signal to be recognized in the test set and input into the trained prototype network for calculation, the overall architecture of the prototype network being as shown in FIG. 4; using the Euclidean distance as the distance metric, the features extracted from the voice to be recognized are compared with the model characteristic parameters of each person obtained through training, and the most similar one is taken as the recognition result.
The voice preprocessing step comprises the following steps:
a voice data enhancement sub-step: the speech signals of persons speaking normally are collected through a SEEED speech acquisition board, and the collected speech is enhanced in praat software by forward playback, reverse playback and random deletion of partial segments;
voice pre-emphasis: the voice signal is passed through a high-pass filter to boost its high-frequency part, so that the spectrum of the signal becomes flatter and is maintained over the whole band from low to high frequency, allowing the spectrum to be obtained with the same signal-to-noise ratio; at the same time, the effect of the vocal cords and lips during phonation is removed, the high-frequency part of the voice signal that is suppressed by the vocal system is compensated, and the high-frequency formants are emphasized.
The feature extraction step includes:
a pre-emphasis substep: the high-frequency part of the voice signal is boosted through a filter;
a framing sub-step: framing the pre-emphasized voice signal;
hamming window substep: multiplying each frame of the framed speech signal by a Hamming window;
fast fourier transform substep: performing fast Fourier transform on each frame of voice signal after the Hamming window to obtain an energy spectrum; triangular band-pass filtering substep: inputting the energy spectrum into a triangular band-pass filter bank, smoothing the frequency spectrum, eliminating the effect of harmonic waves and highlighting the formants of the original voice;
a logarithmic energy calculation substep: calculating the logarithmic energy output by each triangular band-pass filter;
discrete cosine transform substep: substituting the logarithmic energy obtained by calculation into discrete cosine transform to obtain MFCC characteristic parameters;
dynamic difference parameter substep: the dynamic characteristics of the voice signal are represented by the differential spectrum of the MFCC, and the multidimensional MFCC characteristic parameters are obtained.
The pre-emphasis sub-step comprises:
H(z) = 1 − μz^(−1)    (1),
wherein μ takes a value between 0.9 and 1.0; H(z) is the transfer function of the pre-emphasis filter, whose input is the speech signal before pre-emphasis and whose output is the speech signal after pre-emphasis. The framing sub-step comprises:
first, N sampling points are grouped into one observation unit, called a frame. Typically N is 256 or 512, covering about 20–30 ms. To avoid excessive change between two adjacent frames, an overlap region of M sampling points is kept between them, where M is usually about 1/2 or 1/3 of N. The sampling frequency used for speech recognition is generally 8 kHz or 16 kHz; at 8 kHz, a frame length of 256 sampling points corresponds to 256/8000 × 1000 = 32 ms.
The Hamming window sub-step includes:
Each frame is multiplied by a Hamming window to increase the continuity at the left and right ends of the frame. If the framed signal is S(n), n = 0, 1, ..., N−1, where N is the frame size, then after multiplication by the Hamming window
S′(n) = S(n) × W(n)
W(n, a) = (1 − a) − a × cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1    (2),
wherein W(n) denotes the Hamming window; different values of a produce different Hamming windows, and a is generally taken as 0.46;
The triangular band-pass filter bank comprises 40 triangular band-pass filters, and the discrete cosine transform sub-step substitutes the 40 calculated logarithmic energies into a discrete cosine transform to obtain a 13th-order MFCC.
In the model training step, the prototype network algorithm comprises:
The main idea is as follows: the sample space is projected, that is, embedded, into a low-dimensional space, and samples are classified by their similarity in that low-dimensional space; the cluster center of each class is then found in the low-dimensional space, and the class of a new sample is measured with a distance function;
The MFCC obtained at this point is two-dimensional data. Abstract information of the MFCC, namely a voiceprint feature map, is extracted using the idea of a convolutional neural network. The network architecture trained in this embodiment is ResNet18, chosen mainly because the network is lightweight and its training is efficient and stable;
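A sketch of such an embedding network using torchvision's ResNet18; replacing the first convolution so the network accepts a single-channel MFCC "image", and the 64-dimensional output, are assumptions made for illustration, since the patent does not specify these details.

```python
import torch.nn as nn
from torchvision.models import resnet18

class MfccEncoder(nn.Module):
    """ResNet18 used as the embedding function f_w over a 2-D MFCC feature map."""
    def __init__(self, emb_dim: int = 64):
        super().__init__()
        net = resnet18(weights=None)
        # accept a single-channel MFCC map instead of a 3-channel image (assumption)
        net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        net.fc = nn.Linear(net.fc.in_features, emb_dim)
        self.net = net

    def forward(self, x):          # x: (batch, 1, n_mfcc, n_frames)
        return self.net(x)
```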
Assume that the current data set is D, with samples {(x1, y1), (x2, y2), ..., (xn, yn)}, where x denotes the vector representation and y denotes the class label. Assume there are K classes with N samples per class, where N can be divided into N_S and N_Q (N = N_S + N_Q); the corresponding sample sets are denoted S_k (the support set) and Q_k (the query set);
In this embodiment, in the actual voice training, the support set contains 5 persons with 5 voice segments each, the query set contains the same 5 persons with 15 voice segments each, and the voice duration for each person is set to 5 seconds;
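A sketch of how such a 5-way episode (5 support and 15 query utterances per speaker) might be drawn; utterances_by_speaker is an assumed mapping from speaker ID to a list of pre-computed MFCC features, not an interface defined by the patent.

```python
import random

def sample_episode(utterances_by_speaker, n_way=5, n_support=5, n_query=15):
    """Draw one 5-way episode: 5 support + 15 query utterances per sampled speaker."""
    speakers = random.sample(list(utterances_by_speaker), n_way)
    support, query = [], []
    for label, spk in enumerate(speakers):
        clips = random.sample(utterances_by_speaker[spk], n_support + n_query)
        support += [(c, label) for c in clips[:n_support]]
        query += [(c, label) for c in clips[n_support:]]
    return support, query
```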
for the sample points in the support set, an encoding function f_w is used to generate a prototype representation for each class, where the encoding function f_w may be any information extraction method;
for each class k, the prototype representation is generated as
c_k = (1/|S_k|) Σ_{(x_i, y_i) ∈ S_k} f_w(x_i)    (3),
wherein f_w(x_i) denotes the extracted features;
then the distances between the query-set representations and the support-set prototypes are calculated;
finally, the probability p_w(y = k | x) that the current sample belongs to each class is calculated using softmax:
p_w(y = k | x) = exp(−d(f_w(x), c_k)) / Σ_k′ exp(−d(f_w(x), c_k′))    (4),
wherein d(·) is the distance function and c_k is the cluster center of class k; once the cluster center of each class of samples is known, the class to which a sample x belongs can be described through the distance function and the softmax function, and the probability that x belongs to the k-th class is given by formula (4);
finally, the parameters w of the network f_w are obtained; the loss function used is
J(w) = −log p_w(y = k | x)    (5),
so the objective function for the k-th class corresponding to sample x is given by formula (5); this objective function is minimized by stochastic gradient descent to obtain the optimal parameters w.
The traditional algorithm strategy uses a dual-threshold method for endpoint detection: when a speech segment begins, the short-time energy and short-time zero-crossing-rate curves gradually rise, and they gradually fall again once a silent segment is entered. In the unvoiced segments at the beginning and end of a speech segment, however, the short-time energy is almost zero while the short-time zero-crossing rate is relatively large, so using short-time energy alone as the endpoint-detection criterion easily cuts off the unvoiced head and tail of the speech signal and the speech segment cannot be segmented completely; the short-time zero-crossing rate is therefore needed as a second-stage criterion. This method requires slicing the signal, and 20 ms slices are used in the analysis; an FFT can be applied to obtain the corresponding waveforms, and once the individual sound waves are obtained, the energy contained in each frequency band is summed to form new audio-segment features. Aiming at the general characteristics of acoustic models, other signal transformation strategies have been proposed on the basis of the MFCC, which is a cepstral parameter extracted in the frequency domain on the MEL scale; the MEL scale describes the nonlinear behavior of human ear frequency perception, and its relationship to frequency is approximated by:
Mel(f) = 2595 × lg(1 + f/700).
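A sketch of the two pieces described above: dual-threshold endpoint detection on framed audio using short-time energy with the zero-crossing rate as a second-stage criterion, and the MEL-scale conversion. The threshold values are assumptions that would have to be tuned to the recording conditions.

```python
import numpy as np

def dual_threshold_vad(frames, energy_thr, zcr_thr):
    """Keep a frame as speech if its short-time energy exceeds energy_thr, or
    (second stage) if its zero-crossing rate exceeds zcr_thr, so that low-energy
    unvoiced onsets and tails are not cut off."""
    energy = (frames ** 2).sum(axis=1)
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    return (energy > energy_thr) | (zcr > zcr_thr)

def hz_to_mel(f):
    """MEL scale used by the MFCC: Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)
```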
The MFCC characteristic parameters of the obtained speech signals are input into the prototype network under vector element learning for model training. As shown in FIG. 4, the prototype network maps the sample data of each class into a space and extracts their "mean" as the prototype of that class; using the Euclidean distance as the distance metric, training makes data of a class closest to that class's prototype and farther from the prototypes of other classes. During testing, a softmax over the distances from the test data to each class prototype is used to decide the class label of the test data, thereby recognizing the voiceprint.
For the prototype network, the range of application is not limited to single-sample/few-sample learning; it also extends to zero-sample learning. The idea of this application is that, although there are no data samples of the current class, a prototype representation of the class, i.e. meta-information, can be generated at a higher level, as shown in FIG. 5; with the help of such meta-information, the corresponding calculations can be done and the corresponding classification task completed;
the pattern matching step includes:
a coded representation is generated for each sample point in the support set, a prototype representation is generated for each class by summing and averaging, and a vector representation is generated for each query sample;
at the same time, the distance between each query point and each class prototype is calculated, the softmax probabilities are computed to give the probability distribution over the classes, and the class with the highest probability is taken as the class label of the test data.
The results of the comparison of the method of the present example with other conventional voiceprint recognition algorithms are shown in table 1:
TABLE 1
[Table 1 is presented as an image in the original publication; its contents are not reproduced here.]
As a result, as shown in Table 1, the method of this example achieved a higher recognition rate.

Claims (9)

1. The voiceprint recognition method based on MFCC and vector element learning is characterized by comprising the following steps of:
voice preprocessing: recording voice signals to obtain a voice data set, dividing the voice data set into a training set and a testing set, and then carrying out voice data enhancement and voice pre-emphasis processing on all voice signals in the voice data set;
a characteristic extraction step: performing feature extraction on voice signals in a training set after voice preprocessing by using the MFCC to obtain MFCC feature parameters;
model training: inputting MFCC characteristic parameters of the speech signals of the training set into a prototype network for model training;
pattern matching: MFCC characteristic parameters are extracted from the voice signal to be recognized in the test set and input into the trained prototype network for calculation; using the Euclidean distance as the distance metric, the features extracted from the voice to be recognized are compared with the trained model characteristic parameters of each person, and the most similar one is taken as the recognition result.
2. The method of voice print recognition based on MFCC and vector element learning of claim 1, wherein the speech preprocessing comprises:
a voice data enhancement step: the speech signals of persons speaking normally are collected through a SEEED speech acquisition board, and the collected speech is enhanced in praat software by forward playback, reverse playback and random deletion of partial segments;
voice pre-emphasis: the voice signal is passed through a high-pass filter to boost its high-frequency part, so that the spectrum of the signal becomes flatter and is maintained over the whole band from low to high frequency, allowing the spectrum to be obtained with the same signal-to-noise ratio; at the same time, the effect of the vocal cords and lips during phonation is removed, the high-frequency part of the voice signal that is suppressed by the vocal system is compensated, and the high-frequency formants are emphasized.
3. The method of claim 1, wherein the feature extraction step comprises:
a pre-emphasis substep: the high-frequency part of the voice signal is boosted through a filter;
a framing sub-step: framing the pre-emphasized voice signal;
hamming window substep: multiplying each frame of the framed speech signal by a Hamming window;
fast fourier transform substep: performing fast Fourier transform on each frame of voice signal after the Hamming window to obtain an energy spectrum;
triangular band-pass filtering substep: inputting the energy spectrum into a triangular band-pass filter bank, smoothing the frequency spectrum, eliminating the effect of harmonic waves and highlighting the formants of the original voice;
a logarithmic energy calculation substep: calculating the logarithmic energy output by each triangular band-pass filter;
discrete cosine transform substep: substituting the logarithmic energy obtained by calculation into discrete cosine transform to obtain MFCC characteristic parameters;
dynamic difference parameter substep: the dynamic characteristics of the voice signal are represented by the differential spectrum of the MFCC, and the multidimensional MFCC characteristic parameters are obtained.
4. The method of voiceprint recognition based on MFCC and vector element learning of claim 3, wherein the pre-emphasis sub-step comprises:
H(z) = 1 − μz^(−1)    (1),
wherein μ takes a value between 0.9 and 1.0; H(z) is the transfer function of the pre-emphasis filter, whose input is the speech signal before pre-emphasis and whose output is the speech signal after pre-emphasis.
5. The method of claim 3, wherein the framing sub-step comprises:
First, N sampling points are grouped into one observation unit, called a frame. Typically N is 256 or 512, covering about 20–30 ms. To avoid excessive change between two adjacent frames, an overlap region of M sampling points is kept between them, where M is usually about 1/2 or 1/3 of N. The sampling frequency used for speech recognition is generally 8 kHz or 16 kHz; at 8 kHz, a frame length of 256 sampling points corresponds to 256/8000 × 1000 = 32 ms.
6. The method of voiceprint recognition based on MFCC and vector element learning of claim 3, wherein said Hamming window sub-step comprises:
Each frame is multiplied by a Hamming window to increase the continuity at the left and right ends of the frame. If the framed signal is S(n), n = 0, 1, ..., N−1, where N is the frame size, then after multiplication by the Hamming window
S′(n) = S(n) × W(n)
W(n, a) = (1 − a) − a × cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1    (2),
wherein W(n) denotes the Hamming window; different values of a produce different Hamming windows, and a is generally taken as 0.46.
7. The voiceprint recognition method based on MFCC and vector element learning of claim 3, wherein said triangular band-pass filter bank comprises 40 triangular band-pass filters, and said discrete cosine transform sub-step substitutes the 40 calculated logarithmic energies into a discrete cosine transform to obtain a 13th-order MFCC.
8. The method for voiceprint recognition based on MFCC and vector element learning of claim 1, wherein in the model training step, the prototype network algorithm comprises:
The main idea is as follows: the sample space is projected, that is, embedded, into a low-dimensional space, and samples are classified by their similarity in that low-dimensional space; the cluster center of each class is then found in the low-dimensional space, and the class of a new sample is measured with a distance function;
Assume that the current data set is D, with samples {(x1, y1), (x2, y2), ..., (xn, yn)}, where x denotes the vector representation and y denotes the class label. Assume there are K classes with N samples per class, where N can be divided into N_S and N_Q (N = N_S + N_Q); the corresponding sample sets are denoted S_k (the support set) and Q_k (the query set);
for the sample points in the support set, an encoding function f_w is used to generate a prototype representation for each class, where the encoding function f_w may be any information extraction method, such as a CNN or an LSTM;
for each class k, the prototype representation is generated as
c_k = (1/|S_k|) Σ_{(x_i, y_i) ∈ S_k} f_w(x_i)    (3),
wherein f_w(x_i) denotes the extracted features;
then the distances between the query-set representations and the support-set prototypes are calculated;
finally, the probability p_w(y = k | x) that the current sample belongs to each class is calculated using softmax:
p_w(y = k | x) = exp(−d(f_w(x), c_k)) / Σ_k′ exp(−d(f_w(x), c_k′))    (4),
wherein d(·) is the distance function and c_k is the cluster center of class k; once the cluster center of each class of samples is known, the class to which a sample x belongs can be described through the distance function and the softmax function, and the probability that x belongs to the k-th class is given by formula (4);
finally, the parameters w of the network f_w are obtained; the loss function used is
J(w) = −log p_w(y = k | x)    (5),
so the objective function for the k-th class corresponding to sample x is given by formula (5); this objective function is minimized by stochastic gradient descent to obtain the optimal parameters w.
9. The method of claim 1, wherein the pattern matching step comprises:
a coded representation is generated for each sample point in the support set, a prototype representation is generated for each class by summing and averaging, and a vector representation is generated for each query sample;
at the same time, the distance between each query point and each class prototype is calculated, the softmax probabilities are computed to give the probability distribution over the classes, and the class with the highest probability is taken as the class label of the test data.
CN202011220705.6A 2020-11-05 2020-11-05 Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning Pending CN112397074A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011220705.6A CN112397074A (en) 2020-11-05 2020-11-05 Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011220705.6A CN112397074A (en) 2020-11-05 2020-11-05 Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning

Publications (1)

Publication Number Publication Date
CN112397074A true CN112397074A (en) 2021-02-23

Family

ID=74597377

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011220705.6A Pending CN112397074A (en) 2020-11-05 2020-11-05 Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning

Country Status (1)

Country Link
CN (1) CN112397074A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011346A (en) * 2021-03-19 2021-06-22 电子科技大学 Radiation source unknown signal identification method based on metric learning
CN113658582A (en) * 2021-07-15 2021-11-16 中国科学院计算技术研究所 Voice-video cooperative lip language identification method and system
CN114023312A (en) * 2021-11-26 2022-02-08 杭州涿溪脑与智能研究所 Voice voiceprint recognition general countermeasure disturbance construction method and system based on meta-learning
CN116108372A (en) * 2023-04-13 2023-05-12 中国人民解放军96901部队 Infrasound event classification and identification method for small samples

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102324232A (en) * 2011-09-12 2012-01-18 辽宁工业大学 Method for recognizing sound-groove and system based on gauss hybrid models
CN108847244A (en) * 2018-08-22 2018-11-20 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Voiceprint recognition method and system based on MFCC and improved BP neural network
CN111785286A (en) * 2020-05-22 2020-10-16 南京邮电大学 Home CNN classification and feature matching combined voiceprint recognition method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102324232A (en) * 2011-09-12 2012-01-18 辽宁工业大学 Method for recognizing sound-groove and system based on gauss hybrid models
CN108847244A (en) * 2018-08-22 2018-11-20 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Voiceprint recognition method and system based on MFCC and improved BP neural network
CN111785286A (en) * 2020-05-22 2020-10-16 南京邮电大学 Home CNN classification and feature matching combined voiceprint recognition method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ANAND PRASHANT: "Few shot speaker recognition using deep neural networks", IEEE *
JAKE SNELL: "Prototypical networks for few-shot learning", Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems *
RUIRUI LI: "Automatic speaker recognition with limited data", WSDM *
隔壁的NLP小哥: "Prototype Networks" (原型网络), HTTPS://BLOG.CSDN.NET/HEI653779919/ARTICLE/DETAILS/106595614 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011346A (en) * 2021-03-19 2021-06-22 电子科技大学 Radiation source unknown signal identification method based on metric learning
CN113658582A (en) * 2021-07-15 2021-11-16 中国科学院计算技术研究所 Voice-video cooperative lip language identification method and system
CN114023312A (en) * 2021-11-26 2022-02-08 杭州涿溪脑与智能研究所 Voice voiceprint recognition general countermeasure disturbance construction method and system based on meta-learning
CN114023312B (en) * 2021-11-26 2022-08-23 杭州涿溪脑与智能研究所 Voice voiceprint recognition general countermeasure disturbance construction method and system based on meta-learning
CN116108372A (en) * 2023-04-13 2023-05-12 中国人民解放军96901部队 Infrasound event classification and identification method for small samples

Similar Documents

Publication Publication Date Title
Agrawal et al. Novel TEO-based Gammatone features for environmental sound classification
CN106935248B (en) Voice similarity detection method and device
Kumar et al. Design of an automatic speaker recognition system using MFCC, vector quantization and LBG algorithm
Muda et al. Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques
Cai et al. Sensor network for the monitoring of ecosystem: Bird species recognition
US5913188A (en) Apparatus and method for determining articulatory-orperation speech parameters
CN112397074A (en) Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning
US8036891B2 (en) Methods of identification using voice sound analysis
CN105825852A (en) Oral English reading test scoring method
CN110827857A (en) Speech emotion recognition method based on spectral features and ELM
CN111489763B (en) GMM model-based speaker recognition self-adaption method in complex environment
CN114842878A (en) Speech emotion recognition method based on neural network
Molla et al. On the effectiveness of MFCCs and their statistical distribution properties in speaker identification
Chamoli et al. Detection of emotion in analysis of speech using linear predictive coding techniques (LPC)
CN112466276A (en) Speech synthesis system training method and device and readable storage medium
Sengupta et al. Optimization of cepstral features for robust lung sound classification
Cai et al. The best input feature when using convolutional neural network for cough recognition
CN112201226B (en) Sound production mode judging method and system
Kumar et al. Text dependent speaker identification in noisy environment
CN116052689A (en) Voiceprint recognition method
Godino-Llorente et al. Automatic detection of voice impairments due to vocal misuse by means of gaussian mixture models
Estrebou et al. Voice recognition based on probabilistic SOM
Mahesha et al. Vector Quantization and MFCC based classification of Dysfluencies in Stuttered Speech
Francese et al. Automatic creation of a Vowel Dataset for performing Prosody Analysis in ASD screening
Gayathri et al. Identification of voice pathology from temporal and cepstral features for vowel ‘a’low intonation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210223

RJ01 Rejection of invention patent application after publication