CN112397074A - Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning - Google Patents
- Publication number: CN112397074A
- Application number: CN202011220705.6A
- Authority: CN (China)
- Prior art keywords: voice, MFCC, sample, class, frequency
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L17/02 — Speaker identification or verification: preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/04 — Speaker identification or verification: training, enrolment or model building
- G10L17/06 — Speaker identification or verification: decision making techniques; pattern matching strategies
- G10L17/08 — Speaker identification or verification: use of distortion metrics or a particular distance between probe pattern and reference templates
- G10L17/18 — Speaker identification or verification: artificial neural networks; connectionist approaches
- G10L25/24 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
Abstract
The invention discloses a voiceprint recognition method based on MFCC and vector element learning, which comprises the following steps: voice preprocessing; feature extraction; model training; and pattern matching. The method offers fine-grained classification and high recognition accuracy.
Description
Technical Field
The invention relates to the field of voiceprint recognition, and in particular to a voiceprint recognition method based on MFCC (Mel-frequency cepstral coefficients) and vector element learning.
Background
Voiceprint recognition, also known as speaker recognition, is a technique for determining a speaker's identity from voice. Intuitively, although a voiceprint is not as visually apparent as the individual differences between faces or fingerprints, each person's vocal tract, oral cavity and nasal cavity differ, and those individual differences are reflected in the sound they produce. If the mouth is the transmitter of sound, the human ear, acting as a receiver, likewise has the ability to distinguish voices.
MFCC: Mel-frequency cepstral coefficients. Time-domain speech is converted into the frequency domain, and the frequency-domain signal is filtered band by band to obtain the energy proportion of each frequency band; the matrix formed by these proportion coefficients is the Mel-frequency cepstral coefficients.
Meta-learning: from a network-structure point of view, meta-learning consists of two networks, a meta-net and a net. On the one hand the net acquires knowledge from the meta-net; on the other hand the meta-net observes how the net's own performance improves.
Prototype network: samples are first projected into an embedding space and the centre of each sample class is computed. At classification time, the input (such as an image) is converted by a neural network into a feature vector in this new space, so that vectors of the same class lie relatively close together and vectors of different classes lie relatively far apart. The mean of each class's vectors serves as the prototype of that class, and the class of a target is determined by comparing its distances to the class centres.
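The nearest-prototype idea described above can be sketched in a few lines of NumPy. The 2-D "embeddings" and speaker names below are made-up stand-ins for the output of a real encoder, used only to illustrate the mechanism:

```python
import numpy as np

# Toy support set: two embedded samples per speaker (values are illustrative).
support = {
    "speaker_a": np.array([[0.9, 1.1], [1.1, 0.9]]),
    "speaker_b": np.array([[-1.0, -0.8], [-0.9, -1.2]]),
}
# The prototype of each class is the mean of its embedded samples.
prototypes = {k: v.mean(axis=0) for k, v in support.items()}

def classify(query):
    # Assign the query to the class whose prototype is nearest (Euclidean distance).
    dists = {k: np.linalg.norm(query - c) for k, c in prototypes.items()}
    return min(dists, key=dists.get)

print(classify(np.array([1.0, 1.0])))  # speaker_a
```

In a real system the embeddings would come from a trained encoder rather than being hand-written, but the prototype computation and the distance comparison are exactly these two steps.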
Currently, the mainstream methods for voiceprint recognition are dynamic time warping (DTW), hidden Markov models (HMM), vector quantization (VQ), and the like. However, these methods suffer from low recognition accuracy, high computational cost, a lack of dynamic training, or over-reliance on the original speaker.
The prototype network applies not only to single-sample/few-shot learning but also to zero-shot learning. The idea for this application is: although no data samples of the current class are available, if a prototype representation of the class (meta-information) can be generated at a higher level, the corresponding classification task can be completed with the help of that meta-information.
Disclosure of Invention
To address the shortcomings of the existing mainstream voiceprint recognition algorithms, the invention aims to provide a voiceprint recognition method based on MFCC and vector element learning. The method offers fine-grained classification and high recognition accuracy.
The voiceprint recognition method based on MFCC and vector element learning comprises the following steps:
voice preprocessing: recording voice signals to obtain a voice data set, dividing the voice data set into a training set and a testing set, and then carrying out voice data enhancement and voice pre-emphasis processing on all voice signals in the voice data set;
a characteristic extraction step: performing feature extraction on voice signals in a training set after voice preprocessing by using the MFCC to obtain MFCC feature parameters;
model training: inputting MFCC characteristic parameters of the speech signals of the training set into a prototype network for model training;
pattern matching: MFCC characteristic parameters are extracted from a to-be-recognized voice signal in the test set and input into the trained prototype network for calculation; using Euclidean distance as the distance metric, the features extracted from the recognized voice are compared with the trained model feature parameters of each person, and the closest match is returned as the recognition result.
The voice preprocessing step comprises the following steps:
a voice data enhancement step: collecting speech signals of a person speaking normally through a SEEED voice capture board, and augmenting the collected speech in Praat by forward playback, reverse playback, and random deletion of partial segments;
voice pre-emphasis: the voice signal is passed through a high-pass filter to boost the high-frequency part, flattening the signal's spectrum so that the spectrum can be computed with the same signal-to-noise ratio over the whole band from low frequency to high frequency. This also removes the effects of the vocal cords and lips during phonation, compensates for the high-frequency part of the voice signal suppressed by the articulatory system, and highlights the high-frequency formants.
The feature extraction step includes:
a pre-emphasis substep: the high-frequency part of the voice signal is boosted through a filter;
a sub-frame sub-step: framing the pre-emphasized voice signal;
hamming window substep: multiplying each frame of the framed speech signal by a Hamming window;
fast fourier transform substep: performing fast Fourier transform on each frame of voice signal after the Hamming window to obtain an energy spectrum;
triangular band-pass filtering substep: inputting the energy spectrum into a triangular band-pass filter bank, smoothing the frequency spectrum, eliminating the effect of harmonic waves and highlighting the formants of the original voice;
a logarithmic energy calculation substep: calculating the logarithmic energy output by each triangular band-pass filter;
discrete cosine transform substep: substituting the logarithmic energy obtained by calculation into discrete cosine transform to obtain MFCC characteristic parameters;
dynamic difference parameter substep: the dynamic characteristics of the voice signal are represented by the differential spectrum of the MFCC, and the multidimensional MFCC characteristic parameters are obtained.
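The feature-extraction substeps above can be sketched end to end in NumPy. The frame length, filter count and coefficient count follow the values given later in this description (256-sample frames, 40 triangular filters, 13 coefficients); the half-frame hop, the pre-emphasis coefficient 0.97, and the sine test signal are assumptions for illustration:

```python
import numpy as np

def mfcc(signal, fs=8000, frame_len=256, hop=128, n_filt=40, n_ceps=13):
    """Minimal MFCC sketch following the substeps above (some parameters assumed)."""
    # pre-emphasis: boost the high-frequency part
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # framing with overlap, then Hamming window on each frame
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    # fast Fourier transform -> energy (power) spectrum per frame
    power = np.abs(np.fft.rfft(frames, frame_len)) ** 2 / frame_len
    # triangular Mel-spaced band-pass filter bank
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (fs / 2) / 700), n_filt + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((frame_len + 1) * hz_pts / fs).astype(int)
    fbank = np.zeros((n_filt, frame_len // 2 + 1))
    for m in range(1, n_filt + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    # logarithmic energy of each filter output
    log_e = np.log(power @ fbank.T + 1e-10)
    # discrete cosine transform (DCT-II), keeping the first n_ceps coefficients
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filt))
    return log_e @ dct.T

feats = mfcc(np.sin(2 * np.pi * 440 * np.arange(8000) / 8000))
print(feats.shape)  # (61, 13): 61 frames of 13 coefficients
```

Production systems would typically use a library implementation (e.g. librosa or python_speech_features) and append the dynamic difference parameters; this sketch only shows the chain of substeps.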
The pre-emphasis sub-step comprises:
H(z) = 1 − μz⁻¹ (1),
wherein μ takes a value between 0.9 and 1.0, H(z) is the z-domain transfer function of the pre-emphasis filter, and z⁻¹ denotes a one-sample delay, so the filtered output is y(n) = x(n) − μ·x(n−1).
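The transfer function H(z) = 1 − μz⁻¹ corresponds to the difference equation y(n) = x(n) − μ·x(n−1). A minimal sketch, with μ = 0.97 as an assumed typical value from the stated 0.9–1.0 range:

```python
import numpy as np

def pre_emphasis(x, mu=0.97):
    # y[n] = x[n] - mu * x[n-1]; the first sample is passed through unchanged.
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - mu * x[:-1])

y = pre_emphasis([1.0, 1.0, 1.0, 1.0])
print(y)  # a constant (zero-frequency) input is strongly attenuated after the first sample
```

A constant signal is the extreme low-frequency case: the filter nearly cancels it, which is exactly the spectral tilt toward high frequencies the step describes.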
The framing sub-step comprises:
Firstly, N sampling points are grouped into an observation unit called a frame. Typically N is 256 or 512, covering about 20–30 ms. To avoid excessive change between two adjacent frames, an overlap region of M sampling points is kept between them, with M usually about 1/2 or 1/3 of N. The sampling frequency used for speech recognition is generally 8 kHz or 16 kHz; at 8 kHz, a frame length of 256 sampling points corresponds to 256/8000 × 1000 = 32 ms.
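The frame arithmetic above is easy to check directly (the choice of M = N/2 is one of the two options the text allows):

```python
# N samples per frame at fs = 8 kHz: 256/8000 * 1000 = 32 ms per frame.
N, fs = 256, 8000
M = N // 2                        # overlap region, about 1/2 of N
frame_ms = N / fs * 1000          # duration covered by one frame
hop = N - M                       # frame shift between adjacent frames
print(frame_ms, hop * 1000 / fs)  # 32.0 16.0
```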
The Hamming window sub-step includes:
Each frame is multiplied by a Hamming window to increase the continuity between its left and right ends. If the framed signal is S(n), n = 0, 1, …, N−1, with N the frame size, then after multiplying by the Hamming window
S′(n) = S(n) × W(n),
wherein W(n) denotes the Hamming window
W(n) = (1 − a) − a·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1 (2),
different values of a generate different Hamming windows, and a is generally taken as 0.46;
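The Hamming window with a = 0.46 described above uses the same 0.54/0.46 coefficients as NumPy's built-in `np.hamming`, which makes it easy to verify:

```python
import numpy as np

def hamming_window(N, a=0.46):
    # W(n) = (1 - a) - a*cos(2*pi*n/(N-1)), the Hamming window with a = 0.46
    n = np.arange(N)
    return (1 - a) - a * np.cos(2 * np.pi * n / (N - 1))

w = hamming_window(256)
frame = np.ones(256)
windowed = frame * w              # S'(n) = S(n) x W(n) on a constant test frame
print(round(float(w[0]), 2))      # 0.08 — the window tapers the frame edges
```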
the triangular band-pass filter bank comprises 40 triangular band-pass filters, and the discrete cosine transform substep substitutes the 40 computed logarithmic energies into the discrete cosine transform to obtain 13-dimensional MFCCs.
In the model training step, the prototype network algorithm comprises:
the main idea is as follows: projecting a sample space, namely embedding the sample space into a low-dimensional space, classifying the sample space by using the similarity of the sample in the low-dimensional space, then finding the clustering center of each classification in the low-dimensional space, and measuring the classification of a new sample by using a distance function;
Assuming that the current data set is D, the samples inside it are represented as {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where x denotes the vector representation and y denotes the class label. Assuming there are K classes with N samples per class, N can be divided into N_S and N_Q (N = N_S + N_Q), and the corresponding sample sets are respectively denoted S_k, the support set, and Q_k, the query set;
for the sample points inside the support set, an encoding function f_φ is used to generate a prototype representation for each class, where the encoding function f_φ may be any information extraction method, such as a CNN or an LSTM;
for each classification k, the prototype representation is generated as
c_k = (1/|S_k|) · Σ_{(x_i, y_i) ∈ S_k} f_φ(x_i) (3);
then the distances between the query-set representations and the support-set prototypes are calculated;
finally, the probability p_φ(y = k | x) that the current sample belongs to each classification is calculated using softmax:
p_φ(y = k | x) = exp(−d(f_φ(x), c_k)) / Σ_{k′} exp(−d(f_φ(x), c_{k′})) (4),
wherein d(·) is a distance function and c_k is the clustering centre of each class; once the clustering centre of each class of samples is known, the class that sample x belongs to can be described through the distance function and the softmax function, and the probability that x belongs to the k-th class is shown in formula (4).
The objective function of the k-th class corresponding to sample x is
J(φ) = −log p_φ(y = k | x) (5),
and it is minimized by a stochastic gradient descent method to obtain the optimal parameters φ.
The pattern matching step includes:
generating an encoded representation for each sample point in the support set, generating a prototype representation for each classification by summing and averaging, and generating a vector representation for each query sample;
at the same time, the distance between each query point and each classification prototype is calculated, the softmax probabilities are computed to give the probability distribution over the classifications, and the class with the highest probability is taken as the class label of the test data.
The method of this technical scheme can identify new classes never seen during training, needing only a small amount of sample data for each class. The prototype network maps the sample data of each class into a space and extracts their "mean" as the prototype of that class. Using Euclidean distance as the distance metric, training pulls each class's data closest to its own prototype representation and pushes it further from the other classes' prototype representations. At test time, softmax over the distances from the test data to each class's prototype determines the class label. Because the main recognition process is realized by a prototype network model based on vector element learning, better classification is achieved, solving the problem of low recognition accuracy in existing voiceprint recognition methods.
The method has the advantages of fine classification and high identification accuracy.
Drawings
FIG. 1 is a schematic overall flow chart of the embodiment;
FIG. 2 is a flow chart of a partial implementation of voiceprint recognition in an embodiment;
FIG. 3 is a diagram of an embodiment of a training architecture;
FIG. 4 is an overall architecture diagram of a prototype network in an embodiment;
FIG. 5 is a basic architecture diagram of the meta-learning technique in an embodiment;
FIG. 6 is a flow chart of modeling in an embodiment.
Detailed Description
The invention is further illustrated but not limited by the following figures and examples.
Example:
for speaker recognition, the feature quantities extracted from the recognized speech must be compared with the trained model feature parameters of each person to find the closest similarity as the recognition result. For speaker confirmation, only the input voice characteristic parameters are compared with the declared speaker voice template characteristic parameters, whether the two parameters are matched or not is determined through a corresponding method, if the two parameters are matched, confirmation is carried out, and if not, rejection is carried out.
A sound wave has a corresponding amplitude at each point in time. To digitize it, the wave is sampled at equal intervals and the height of the wave at each sample point is recorded; the rate of these samples is the sampling rate. A person's voice generally spans 100 Hz–10000 Hz, and the sampling frequency is normally determined by the Nyquist sampling theorem, as shown in fig. 6; this embodiment adopts 16 kHz as the sampling frequency. An ADMP401 microphone pickup module is used to collect the voice signals; its amplifier gain reaches 67 dB and it outputs an AD signal, which makes acquisition convenient. In voiceprint recognition, the power spectrum of the voice signal is affected by lip and nose radiation and decreases as the signal frequency increases; to make the spectral distribution of the voice signal more uniform, the high-frequency part of the spectrum is boosted to reduce low-frequency interference. The resulting signals are then fed into the model, built on a Python-language processing platform, for training, as shown in fig. 3.
Referring to fig. 1 and 2, the voiceprint recognition method based on MFCC and vector element learning includes the following steps:
voice preprocessing: recording voice signals to obtain a voice data set, dividing the voice data set into a training set and a testing set, and then carrying out voice data enhancement and voice pre-emphasis processing on all voice signals in the voice data set;
a characteristic extraction step: performing feature extraction on voice signals in a training set after voice preprocessing by using the MFCC to obtain MFCC feature parameters;
model training: inputting MFCC characteristic parameters of the speech signals of the training set into a prototype network for model training;
pattern matching: MFCC characteristic parameters are extracted from a to-be-recognized voice signal in the test set and input into the trained prototype network for calculation (the overall architecture of the prototype network is shown in FIG. 4); using Euclidean distance as the distance metric, the features extracted from the recognized voice are compared with the trained model feature parameters of each person, and the closest match is taken as the recognition result.
The voice preprocessing step comprises the following steps:
a voice data enhancement step: collecting speech signals of a person speaking normally through a SEEED voice capture board, and augmenting the collected speech in Praat by forward playback, reverse playback, and random deletion of partial segments;
voice pre-emphasis: the voice signal is passed through a high-pass filter to boost the high-frequency part, flattening the signal's spectrum so that the spectrum can be computed with the same signal-to-noise ratio over the whole band from low frequency to high frequency. This also removes the effects of the vocal cords and lips during phonation, compensates for the high-frequency part of the voice signal suppressed by the articulatory system, and highlights the high-frequency formants.
The feature extraction step includes:
a pre-emphasis substep: the high-frequency part of the voice signal is boosted through a filter;
a sub-frame sub-step: framing the pre-emphasized voice signal;
hamming window substep: multiplying each frame of the framed speech signal by a Hamming window;
fast fourier transform substep: performing a fast Fourier transform on each Hamming-windowed frame of the voice signal to obtain an energy spectrum;
triangular band-pass filtering substep: inputting the energy spectrum into a triangular band-pass filter bank, smoothing the frequency spectrum, eliminating the effect of harmonics and highlighting the formants of the original voice;
a logarithmic energy calculation substep: calculating the logarithmic energy output by each triangular band-pass filter;
discrete cosine transform substep: substituting the logarithmic energy obtained by calculation into discrete cosine transform to obtain MFCC characteristic parameters;
dynamic difference parameter substep: the dynamic characteristics of the voice signal are represented by the differential spectrum of the MFCC, and the multidimensional MFCC characteristic parameters are obtained.
The pre-emphasis sub-step comprises:
H(z) = 1 − μz⁻¹ (1),
wherein μ takes a value between 0.9 and 1.0, H(z) is the z-domain transfer function of the pre-emphasis filter, and z⁻¹ denotes a one-sample delay, so the filtered output is y(n) = x(n) − μ·x(n−1).
The framing sub-step comprises:
Firstly, N sampling points are grouped into an observation unit called a frame. Typically N is 256 or 512, covering about 20–30 ms. To avoid excessive change between two adjacent frames, an overlap region of M sampling points is kept between them, with M usually about 1/2 or 1/3 of N. The sampling frequency used for speech recognition is generally 8 kHz or 16 kHz; at 8 kHz, a frame length of 256 sampling points corresponds to 256/8000 × 1000 = 32 ms.
The Hamming window sub-step includes:
Each frame is multiplied by a Hamming window to increase the continuity between its left and right ends. If the framed signal is S(n), n = 0, 1, …, N−1, with N the frame size, then after multiplying by the Hamming window
S′(n) = S(n) × W(n),
wherein W(n) denotes the Hamming window
W(n) = (1 − a) − a·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1 (2),
different values of a generate different Hamming windows, and a is generally taken as 0.46;
the triangular band-pass filter bank comprises 40 triangular band-pass filters, and the discrete cosine transform substep substitutes the 40 computed logarithmic energies into the discrete cosine transform to obtain 13-dimensional MFCCs.
In the model training step, the prototype network algorithm comprises:
the main idea is as follows: projecting a sample space, namely embedding the sample space into a low-dimensional space, classifying the sample space by using the similarity of the sample in the low-dimensional space, then finding the clustering center of each classification in the low-dimensional space, and measuring the classification of a new sample by using a distance function;
the MFCC obtained at this point is two-dimensional data; its abstract information, the voiceprint feature map, is extracted using a convolutional neural network. The network architecture trained in this embodiment is ResNet18, chosen mainly because it is lightweight and trains efficiently and stably;
Assuming that the current data set is D, the samples inside it are represented as {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where x denotes the vector representation and y denotes the class label. Assuming there are K classes with N samples per class, N can be divided into N_S and N_Q (N = N_S + N_Q), and the corresponding sample sets are respectively denoted S_k, the support set, and Q_k, the query set;
in this embodiment's actual voice training, the support set contains 5 persons with 5 voice segments each; the query set contains the same 5 persons with 15 voice segments each; and the duration of each voice segment is set to 5 seconds;
for the sample points inside the support set, an encoding function f_φ is used to generate a prototype representation for each class, where the encoding function f_φ may be any information extraction method;
for each classification k, the prototype representation is generated as
c_k = (1/|S_k|) · Σ_{(x_i, y_i) ∈ S_k} f_φ(x_i) (3);
then the distances between the query-set representations and the support-set prototypes are calculated;
finally, the probability p_φ(y = k | x) that the current sample belongs to each classification is calculated using softmax:
p_φ(y = k | x) = exp(−d(f_φ(x), c_k)) / Σ_{k′} exp(−d(f_φ(x), c_{k′})) (4),
wherein d(·) is a distance function and c_k is the clustering centre of each class; once the clustering centre of each class of samples is known, the class that sample x belongs to can be described through the distance function and the softmax function, and the probability that x belongs to the k-th class is shown in formula (4).
The objective function of the k-th class corresponding to sample x is
J(φ) = −log p_φ(y = k | x) (5),
and it is minimized by a stochastic gradient descent method to obtain the optimal parameters φ.
The traditional algorithm strategy uses a dual-threshold method for endpoint decisions. When a voice segment enters, the short-time energy and short-time zero-crossing rate curves gradually rise, then gradually fall as the signal enters a silent segment. In the unvoiced segments at the beginning and end of a voice segment, however, the short-time energy is almost zero while the short-time zero-crossing rate is large, so using short-time energy alone as the endpoint-detection criterion easily cuts off the unvoiced head and tail of the voice signal, preventing a complete segmentation; the short-time zero-crossing rate is therefore needed as a second-stage decision. This method requires slicing the signal; 20 ms slices are used in the analysis, and an FFT can be applied to obtain the corresponding waveforms. Once the individual sound waves are available, the energy contained in each frequency band is summed to form the features of a new audio segment. Addressing the general characteristics of acoustic models, other signal-transformation strategies have been proposed based on MFCC, a cepstral parameter extracted in the frequency domain on the Mel scale, which describes the nonlinear behaviour of human-ear frequency perception. The Mel scale is approximated by:
Mel(f) = 2595 × log₁₀(1 + f/700).
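The standard Mel-scale approximation Mel(f) = 2595·log₁₀(1 + f/700) and its inverse can be written directly; by construction, 1000 Hz maps to roughly 1000 mel:

```python
import numpy as np

def hz_to_mel(f):
    # Mel(f) = 2595 * log10(1 + f/700)
    return 2595 * np.log10(1 + np.asarray(f, dtype=float) / 700)

def mel_to_hz(m):
    # Inverse mapping: f = 700 * (10^(m/2595) - 1)
    return 700 * (10 ** (np.asarray(m, dtype=float) / 2595) - 1)

print(round(float(hz_to_mel(1000)), 1))  # ~1000 mel at 1 kHz
```

These two functions are exactly what spaces the centre frequencies of the 40 triangular filters in the filter bank: equal steps on the mel axis become progressively wider bands on the hertz axis.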
The MFCC characteristic parameters of the obtained speech signals are input into a prototype network under vector element learning for model training. As shown in FIG. 4, the prototype network maps the sample data of each class into a space and extracts their "mean" as the prototype of the class; using Euclidean distance as the distance metric, training pulls each class's data closest to its own prototype and further from the other classes' prototypes. At test time, softmax over the distances from the test data to each class's prototype determines the class label, thereby identifying the voiceprint.
For the prototype network, the range of application covers not only single-sample/few-shot learning but also zero-shot learning. The idea for this application is: although there are no data samples of the current class, if a prototype representation of the class, i.e. meta-information, can be generated at a higher level, as shown in fig. 5, the corresponding calculations can be completed by means of such meta-information, finishing the corresponding classification task;
the pattern matching step includes:
generating an encoded representation for each sample point in the support set, generating a prototype representation for each classification by summing and averaging, and generating a vector representation for each query sample;
at the same time, the distance between each query point and each classification prototype is calculated, the softmax probabilities are computed to give the probability distribution over the classifications, and the class with the highest probability is taken as the class label of the test data.
The results of the comparison of the method of the present example with other conventional voiceprint recognition algorithms are shown in table 1:
TABLE 1
As shown in Table 1, the method of this example achieves a higher recognition rate.
Claims (9)
1. The voiceprint recognition method based on MFCC and vector element learning is characterized by comprising the following steps of:
voice preprocessing: recording voice signals to obtain a voice data set, dividing the voice data set into a training set and a testing set, and then carrying out voice data enhancement and voice pre-emphasis processing on all voice signals in the voice data set;
a characteristic extraction step: performing feature extraction on voice signals in a training set after voice preprocessing by using the MFCC to obtain MFCC feature parameters;
model training: inputting MFCC characteristic parameters of the speech signals of the training set into a prototype network for model training;
pattern matching: MFCC characteristic parameters are extracted from a to-be-recognized voice signal in the test set and input into the trained prototype network for calculation; using Euclidean distance as the distance metric, the features extracted from the recognized voice are compared with the trained model feature parameters of each person, and the closest match is returned as the recognition result.
2. The voiceprint recognition method based on MFCC and vector element learning of claim 1, wherein the voice preprocessing step comprises:
a voice data enhancement step: collecting speech signals of a person speaking normally through a SEEED voice capture board, and augmenting the collected speech in Praat by forward playback, reverse playback, and random deletion of partial segments;
A voice pre-emphasis step: passing the voice signal through a high-pass filter to boost the high-frequency part, so that the spectrum of the signal becomes flat and retains the same signal-to-noise ratio over the whole band from low to high frequency; at the same time, the vocal-cord and lip effects in the sounding process are eliminated, the high-frequency part of the voice signal suppressed by the vocal system is compensated, and the high-frequency formants are highlighted.
3. The method of claim 1, wherein the feature extraction step comprises:
A pre-emphasis substep: boosting the high-frequency part of the voice signal through a filter;
A framing substep: framing the pre-emphasized voice signal;
A Hamming window substep: multiplying each frame of the framed voice signal by a Hamming window;
A fast Fourier transform substep: performing a fast Fourier transform on each windowed frame to obtain an energy spectrum;
A triangular band-pass filtering substep: inputting the energy spectrum into a triangular band-pass filter bank to smooth the spectrum, eliminate the effect of harmonics, and highlight the formants of the original voice;
A logarithmic energy calculation substep: calculating the logarithmic energy output by each triangular band-pass filter;
A discrete cosine transform substep: substituting the calculated logarithmic energies into a discrete cosine transform to obtain the MFCC characteristic parameters;
A dynamic difference parameter substep: representing the dynamic characteristics of the voice signal by the differential spectrum of the MFCC to obtain multidimensional MFCC characteristic parameters.
4. The method of voiceprint recognition based on MFCC and vector element learning of claim 3, wherein the pre-emphasis sub-step comprises:
H(z) = 1 − μz^(−1) (1),
wherein the value of μ is between 0.9 and 1.0, and H(z) is the transfer function of the pre-emphasis filter relating the speech signal after pre-emphasis to the speech signal before pre-emphasis; in the time domain, the pre-emphasized signal is y(n) = x(n) − μx(n − 1).
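As an illustrative sketch (not part of the claims), the filter of formula (1) can be applied in the time domain as y(n) = x(n) − μ·x(n − 1); the coefficient value 0.97 below is a typical choice within the claimed 0.9–1.0 range:

```python
import numpy as np

def pre_emphasis(signal, mu=0.97):
    """Apply the first-order high-pass filter H(z) = 1 - mu * z^(-1).

    In the time domain this is y(n) = x(n) - mu * x(n-1); the first
    sample is passed through unchanged.
    """
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - mu * signal[:-1])

# A constant (purely low-frequency) signal is almost cancelled, while
# sample-to-sample jumps (high frequencies) are preserved.
x = np.array([1.0, 1.0, 1.0, 5.0])
y = pre_emphasis(x, mu=0.97)
```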
5. The method of claim 3, wherein the framing sub-step comprises:
First, N sampling points are grouped into one observation unit, called a frame. Generally, N is 256 or 512, covering a duration of about 20–30 ms. To avoid excessive change between two adjacent frames, an overlap region of M sampling points is provided between them, where M is generally about 1/2 or 1/3 of N. The sampling frequency of the speech signal used in speech recognition is generally 8 kHz or 16 kHz; at 8 kHz, a frame length of 256 sampling points corresponds to a duration of 256/8000 × 1000 = 32 ms.
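The frame and overlap arithmetic above can be sketched as follows (illustrative values: N = 256, M = N/2, 8 kHz sampling rate):

```python
import numpy as np

def frame_signal(signal, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames.

    frame_len = N sampling points per frame; hop = N - M, where M is the
    number of overlapping points between adjacent frames (here M = N/2).
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Row i holds the indices of frame i: [i*hop, i*hop + frame_len)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

fs = 8000                          # sampling frequency in Hz
frames = frame_signal(np.zeros(8000), frame_len=256, hop=128)
frame_ms = 256 / fs * 1000         # duration of one frame in ms
```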
6. The method of voiceprint recognition based on MFCC and vector element learning of claim 3, wherein said Hamming window sub-step comprises:
Each frame is multiplied by a Hamming window to increase the continuity of the left and right ends of the frame. Let the signal after framing be S(n), n = 0, 1, …, N − 1, where N is the frame size; after multiplying by the Hamming window,
S′(n) = S(n) × W(n),
where W(n) represents the Hamming window, W(n) = (1 − a) − a × cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1; different values of a result in different Hamming windows, and typically a = 0.46.
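The windowing with a = 0.46 can be sketched as follows, using the standard Hamming form W(n) = (1 − a) − a·cos(2πn/(N − 1)) implied by the parameter a above (illustrative only):

```python
import numpy as np

def hamming_window(N, a=0.46):
    """W(n) = (1 - a) - a * cos(2*pi*n / (N - 1)), n = 0..N-1."""
    n = np.arange(N)
    return (1 - a) - a * np.cos(2 * np.pi * n / (N - 1))

N = 256
w = hamming_window(N)
frame = np.ones(N)           # a dummy frame S(n)
windowed = frame * w         # S'(n) = S(n) * W(n)
```

The window tapers both ends of the frame to 1 − 2a = 0.08 while leaving the middle near 1, which reduces spectral leakage in the subsequent FFT.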
7. The voiceprint recognition method based on MFCC and vector element learning of claim 3, wherein said triangular band-pass filter bank comprises 40 triangular band-pass filters, and said discrete cosine transform sub-step substitutes the 40 logarithmic energies obtained by calculation into discrete cosine transform to obtain MFCC of order 13.
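The chain of claim 7 (40 triangular filters, logarithmic energies, DCT, 13-order MFCC) can be sketched as follows. The HTK-style mel mapping 2595·log10(1 + f/700) is an assumption; the patent does not specify the mel formula:

```python
import numpy as np

def mel_filterbank(n_filters=40, n_fft=512, fs=8000):
    """Build triangular band-pass filters spaced evenly on the mel scale."""
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    mel_inv = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    mel_pts = np.linspace(mel(0), mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_inv(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):           # rising edge of triangle i
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):           # falling edge of triangle i
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    return fbank

def mfcc_from_power_spectrum(power, n_filters=40, n_ceps=13, fs=8000):
    """Log filterbank energies followed by a DCT-II, keeping 13 coefficients."""
    n_fft = (power.shape[-1] - 1) * 2
    fbank = mel_filterbank(n_filters, n_fft, fs)
    log_e = np.log(power @ fbank.T + 1e-10)   # logarithmic energy per filter
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return log_e @ dct.T

# Energy spectrum of one 512-point frame (illustrative random signal).
spec = np.abs(np.fft.rfft(np.random.randn(512))) ** 2
coeffs = mfcc_from_power_spectrum(spec)
```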
8. The method for voiceprint recognition based on MFCC and vector element learning of claim 1, wherein in the model training step, the prototype network algorithm comprises:
The main idea is as follows: the sample space is projected, i.e. embedded, into a low-dimensional space; samples are classified by their similarity in that low-dimensional space; the cluster center of each class is then found in the low-dimensional space; and the classification of a new sample is measured by a distance function;
Assuming that the current data set is D, the samples inside it are represented as {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where x denotes the vector representation and y denotes the class label. Assuming there are K classes with N samples per class, N can be divided into N_S and N_Q (N = N_S + N_Q); the corresponding sample sets are respectively denoted S_k, the support set, and Q_k, the query set;
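The split of each class's N samples into N_S support samples and N_Q query samples can be sketched as follows (hypothetical helper with illustrative values K = 3, N_S = N_Q = 5):

```python
import random

def make_episode(dataset, n_support=5, n_query=5):
    """Split each class's samples into a support set S_k and a query set Q_k.

    `dataset` maps a class label to a list of feature vectors
    (e.g. MFCC parameter vectors for one speaker).
    """
    support, query = {}, {}
    for label, samples in dataset.items():
        picked = random.sample(samples, n_support + n_query)
        support[label] = picked[:n_support]
        query[label] = picked[n_support:]
    return support, query

# Illustrative data: 3 classes (speakers), 10 samples each.
data = {k: [[float(k), float(i)] for i in range(10)] for k in range(3)}
S, Q = make_episode(data, n_support=5, n_query=5)
```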
For the sample points inside the support set, an encoding function f_φ is used to generate a prototype representation for each class, where the encoding function f_φ may be any information extraction method, such as a CNN or an LSTM;
For each class k, a prototype representation is generated as
c_k = (1/|S_k|) Σ_{(x_i, y_i) ∈ S_k} f_φ(x_i) (2);
Then the distance between each query-set representation and each support-set prototype is calculated;
Finally, the probability p_φ(y = k | x) that the current sample belongs to each class is calculated using softmax:
p_φ(y = k | x) = exp(−d(f_φ(x), c_k)) / Σ_{k′} exp(−d(f_φ(x), c_{k′})) (4),
wherein d(·) is a distance function and c_k is the cluster center of class k. After the cluster center of each class of samples is known, the class to which a sample x belongs can be described by the distance function and the softmax function; the probability that x belongs to the k-th class is shown in formula (4);
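A minimal numeric sketch of the prototype averaging and the softmax probability of formula (4) follows; the identity embedding stands in for f_φ, which the patent leaves open:

```python
import numpy as np

def prototypes(support):
    """Cluster center c_k = mean of the (embedded) support samples of class k."""
    return {k: np.mean(v, axis=0) for k, v in support.items()}

def class_probabilities(x, protos):
    """p(y=k|x): softmax over negative squared Euclidean distances, formula (4)."""
    labels = sorted(protos)
    d = np.array([np.sum((x - protos[k]) ** 2) for k in labels])
    logits = -d
    p = np.exp(logits - logits.max())       # shift for numerical stability
    return labels, p / p.sum()

# Two classes with two 2-D support samples each (illustrative values).
support = {0: np.array([[0.0, 0.0], [0.0, 2.0]]),
           1: np.array([[4.0, 0.0], [4.0, 2.0]])}
protos = prototypes(support)                # c_0 = (0, 1), c_1 = (4, 1)
labels, p = class_probabilities(np.array([0.5, 1.0]), protos)
```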
9. The method of claim 1, wherein the pattern matching step comprises:
generating a coded representation for each sample point in the support set, generating a prototype representation for each class by summing and averaging, and generating a vector representation for each query sample;
meanwhile, the distance between each query point and each class prototype is calculated, the softmax probabilities are computed to generate a probability distribution over the classes, and the class with the highest probability is taken as the class label of the test data.
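The matching step of this claim can be sketched end-to-end as a nearest-prototype decision (equivalent to taking the argmax of the softmax over negative distances); the identity embedding and the speaker names are assumptions for illustration:

```python
import numpy as np

def classify_queries(support, queries):
    """Label each query vector with the class of its nearest prototype
    under Euclidean distance."""
    protos = {k: np.mean(np.asarray(v), axis=0) for k, v in support.items()}
    labels = sorted(protos)
    out = []
    for q in queries:
        d = [np.linalg.norm(np.asarray(q) - protos[k]) for k in labels]
        out.append(labels[int(np.argmin(d))])
    return out

# Hypothetical enrolled speakers with 2-D feature vectors.
support = {"spk_a": [[0.0, 0.0], [1.0, 0.0]],
           "spk_b": [[10.0, 0.0], [11.0, 0.0]]}
pred = classify_queries(support, [[0.2, 0.1], [10.4, -0.2]])
```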
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011220705.6A CN112397074A (en) | 2020-11-05 | 2020-11-05 | Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112397074A true CN112397074A (en) | 2021-02-23 |
Family
ID=74597377
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011220705.6A Pending CN112397074A (en) | 2020-11-05 | 2020-11-05 | Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112397074A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113011346A (en) * | 2021-03-19 | 2021-06-22 | 电子科技大学 | Radiation source unknown signal identification method based on metric learning |
CN113658582A (en) * | 2021-07-15 | 2021-11-16 | 中国科学院计算技术研究所 | Voice-video cooperative lip language identification method and system |
CN114023312A (en) * | 2021-11-26 | 2022-02-08 | 杭州涿溪脑与智能研究所 | Voice voiceprint recognition general countermeasure disturbance construction method and system based on meta-learning |
CN116108372A (en) * | 2023-04-13 | 2023-05-12 | 中国人民解放军96901部队 | Infrasound event classification and identification method for small samples |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102324232A (en) * | 2011-09-12 | 2012-01-18 | 辽宁工业大学 | Method for recognizing sound-groove and system based on gauss hybrid models |
CN108847244A (en) * | 2018-08-22 | 2018-11-20 | 华东计算技术研究所(中国电子科技集团公司第三十二研究所) | Voiceprint recognition method and system based on MFCC and improved BP neural network |
CN111785286A (en) * | 2020-05-22 | 2020-10-16 | 南京邮电大学 | Home CNN classification and feature matching combined voiceprint recognition method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210223 |