CN112542174A - VAD-based multi-dimensional characteristic parameter voiceprint identification method - Google Patents

VAD-based multi-dimensional characteristic parameter voiceprint identification method

Info

Publication number
CN112542174A
CN112542174A
Authority
CN
China
Prior art keywords
characteristic parameters
frame
signal
frequency
mfcc
Prior art date
Legal status
Pending
Application number
CN202011557161.2A
Other languages
Chinese (zh)
Inventor
邓立新
孙明铭
濮勇
徐艳君
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202011557161.2A
Publication of CN112542174A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/20: Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/87: Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a VAD-based multi-dimensional characteristic parameter voiceprint identification method. In step S1, the input speech signal is read, pre-emphasized, framed, and windowed, converting it into a preprocessed speech signal. In step S2, endpoint detection accurately locates the start and end frames of the framed preprocessed signal and removes the silent segments. In step S3, the MFCC characteristic parameters, normalized MFCC characteristic parameters, GFCC characteristic parameters, and PNCC characteristic parameters of the endpoint-detected speech signal are extracted and combined to form multi-dimensional characteristic parameters. The method improves the accuracy of endpoint detection, reduces the amount of training data in the template-training phase, strengthens resistance to noise interference, and effectively improves the efficiency of voiceprint recognition.

Description

VAD-based multi-dimensional characteristic parameter voiceprint identification method
Technical Field
The invention belongs to the technical field of voiceprint recognition, and particularly relates to a multi-dimensional characteristic parameter voiceprint recognition method based on VAD.
Background
Voiceprint recognition, also known as speaker recognition, is one of the biometric identification techniques. It divides into two categories: speaker identification and speaker verification. Its theoretical basis is that every voice has unique characteristics, by which the voices of different speakers can be effectively distinguished.
The main flow of voiceprint recognition is to read the speech files of the speakers in the training set and extract discriminative feature information from the read speech data through specific filters. Common feature extraction methods include Mel-frequency cepstral coefficients (MFCC), Gammatone-filter cepstral coefficients (GFCC), power-normalized cepstral coefficients (PNCC), identity vectors (i-vectors), and the like. Template training is then performed with methods such as Gaussian mixture models (GMM), dynamic time warping (DTW), or artificial-neural-network template matching. Finally, features extracted from the audio data in the test set are matched against the trained templates, achieving voiceprint recognition.
In recent years, in order to improve the accuracy of voiceprint recognition, the following two types of feature extraction methods are mainly used in the field of speaker recognition.
(1) Training with a single feature extraction method chosen according to the audio type and the signal-to-noise ratio. For example, with mainstream MFCC feature extraction, endpoint detection sometimes fails to locate the start and end points of the speech accurately when the frame length and frame shift change, and the standard cepstral parameters (MFCC) reflect only the static characteristics of the speech parameters and perform poorly in terms of robustness.
(2) Computing a difference spectrum from the extracted static features by a difference method to represent the dynamic characteristics of the speech parameters, and then combining the dynamic and static features. This method can effectively improve the recognition performance of the system, but its recognition accuracy still drops sharply when the speech information contains strong noise interference.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a multi-dimensional characteristic parameter voiceprint recognition method based on VAD that improves the accuracy of endpoint detection, reduces the amount of training data in the template-training stage, strengthens resistance to noise interference, and effectively improves the efficiency of voiceprint recognition.
The invention provides a VAD-based multi-dimensional characteristic parameter voiceprint recognition method, comprising the following steps:
step S1, reading, pre-emphasizing, framing, and windowing the input speech signal, converting it into a preprocessed speech signal;
step S2, accurately detecting the start and end frames of the framed preprocessed signal through endpoint detection and removing the silent segments;
step S3, extracting the MFCC characteristic parameters, normalized MFCC characteristic parameters, GFCC characteristic parameters, and PNCC characteristic parameters of the endpoint-detected speech signal, and combining them to form the multi-dimensional characteristic parameters.
As a further technical solution of the present invention, in step S1 the speech signal reading reads the wav-format audio files in the training set, using the read() function of the scipy.io.wavfile module in Python to obtain a one-dimensional array representing the audio information.
Further, in step S1, the pre-emphasis boosts the high-frequency components with the filter

$$H(z) = 1 - u z^{-1}$$

wherein u is the pre-emphasis coefficient, with a value range of 0.9 to 1.
Further, in step S1, the framing and windowing proceed on the premise that the parameter model of a speech signal is approximately stationary over 10 ms to 30 ms, so that 1 second contains 33 to 100 frames; adjacent frames overlap, the overlap being the frame shift, and the ratio of frame shift to frame length is 1/3 to 1/2. Finally, each frame is multiplied by a Hamming window, whose expression is

$$w(n) = (1-a) - a\cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1$$

where a is the Hamming window coefficient.
Further, the endpoint detection in step S2 adopts the spectral entropy method; entropy characterizes the degree of order of a signal, and the spectral entropy method detects speech endpoints by measuring the flatness of the spectrum. Let the speech signal be $x(i)$; after windowing and framing, the $n$-th frame $x_n(m)$ is obtained, whose FFT is

$$X_n(k) = \sum_{m=0}^{N-1} x_n(m)\, e^{-j 2\pi k m / N}$$

where k indexes the k-th spectral line. The short-time energy of the speech frame in the frequency domain is

$$E_n = \sum_{k=0}^{N/2} X_n(k)\, X_n^{*}(k)$$

wherein N is the FFT length and only the positive-frequency part is taken. The energy spectrum of the k-th spectral line is $Y_n(k) = |X_n(k)|^2$, and the normalized spectral probability density function of each frequency component is

$$p_n(k) = \frac{Y_n(k)}{\sum_{l=0}^{N/2} Y_n(l)}$$

The short-time spectral entropy of the frame is then

$$H_n = -\sum_{k=0}^{N/2} p_n(k) \ln p_n(k)$$

The spectral entropy of each frame is computed in this way; the audio information from step S1 is trimmed by the spectral entropy method, and the regions rich in speech information are retained.
Further, in step S2, the endpoint detection adds a result verification mechanism: when the spectral entropy method fails, the framed speech signal is screened through an energy valve to remove the silent segments, the valve value being a threshold computed from E, the array of per-frame energies of the speech signal.
Furthermore, in step S3, the MFCC characteristic parameters are extracted as follows:
an FFT is applied to each frame of the preprocessed speech signal to obtain the spectrum of each frame, and the power spectrum of the speech signal is obtained by taking the squared modulus of the spectrum;
the energy spectrum is passed through a Mel triangular filter bank whose center frequencies are uniformly spaced on the Mel scale; the bank contains 22 to 26 filters, with the base vertices of each filter at the center frequencies of the adjacent filters, and the approximate relation between Mel frequency and frequency is

$$\mathrm{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$

wherein f is the frequency;
the logarithmic energy output by each Mel triangular filter is computed, and the MFCC feature vector is obtained through the discrete cosine transform (DCT), returning the default 13 cepstral dimensions.
Further, in step S3, the GFCC characteristic parameters are extracted as follows:
the Fourier transform of each signal frame is computed, and the absolute value of the output spectrum is taken;
the energy spectrum is passed through a Gammatone filter bank containing 20 filters, whose output response is an N × M matrix, wherein N is the number of filter channels and M is the number of frames after sampling;
the logarithmic energy output by each filter is computed, and the GFCC feature vector is obtained through the discrete cosine transform (DCT), returning the default 13 cepstral dimensions.
Further, in step S3, the PNCC characteristic parameters are extracted as follows:
the Fourier transform of each signal frame is computed, and the absolute value of the output spectrum is taken;
the energy spectrum is passed through a Gammatone filter bank containing 20 filters, whose output response is an N × M matrix, wherein N is the number of filter channels and M is the number of frames after sampling;
each frame is smoothed by averaging over the 2 frames on its left and right to obtain the average power;
the average power is normalized, and the PNCC characteristic parameters are obtained through an exponential (power-law) function that better matches the auditory characteristics of the human ear, followed by the discrete cosine transform (DCT).
Further, in step S3, the multi-dimensional characteristic parameters formed by the combination have 52 dimensions.
The invention has the advantages that speech frames are selected through endpoint detection with an added energy valve, which eliminates the interference of silent frames and noisy frames on the feature extraction results and strengthens the reliability of the endpoint detection results; and that combining the normalized MFCC characteristic parameters, the MFCC characteristic parameters, the Gammatone-filter (GFCC) characteristic parameters, and the power-normalized cepstral (PNCC) characteristic parameters into a multi-dimensional feature coefficient markedly improves recognition accuracy.
Drawings
FIG. 1 is a flow chart of voiceprint recognition in accordance with the present invention;
FIG. 2 is a schematic diagram illustrating the effect of the endpoint detection method of the present invention;
FIG. 3 is a diagram of a multi-dimensional feature set according to the present invention.
Detailed Description
Referring to fig. 1, the present embodiment provides a VAD-based multi-dimensional characteristic parameter voiceprint recognition method. Building on conventional characteristic parameter extraction and training over a speech library, it improves the feature extraction stage and mainly comprises three parts: speech signal preprocessing, endpoint detection, and characteristic parameter extraction. The method includes the following steps:
step S1, reading, pre-emphasizing, framing and windowing the input voice signal, and converting the voice signal into a voice preprocessing signal;
step S2, accurately detecting the start and stop frames of the framed voice preprocessing signal through end point detection, and removing the mute section;
and S3, extracting MFCC characteristic parameters, MFCC standardized characteristic parameters, GFCC characteristic parameters and PNCC characteristic parameters of the voice signal after the endpoint detection, and combining the MFCC characteristic parameters, the MFCC standardized characteristic parameters and the PNCC characteristic parameters to form multi-dimensional characteristic parameters.
In step S1, the speech signal reading reads the wav-format audio files in the training set, using the read() function of the scipy.io.wavfile module in Python to obtain a one-dimensional array representing the audio information.
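As an illustrative sketch of this step (the patent names only the scipy.io wavfile routine; the file name below is a placeholder):

    from scipy.io import wavfile

    # Read one wav file from the training set: returns the sampling rate and,
    # for mono audio, a one-dimensional array of samples.
    fs, signal = wavfile.read("sample.wav")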
In step S1, because the frequency response of the glottal pulse approximates a second-order low-pass filter and the oral-cavity radiation response approximates a low-order high-pass filter, the spectrum of a speech signal rolls off toward high frequencies, and the high-frequency components are usually much smaller than the low-frequency components. To increase the high-frequency resolution of the speech signal and highlight the formants of the high-frequency part, the speech signal is pre-emphasized. Pre-emphasis compensates for the loss of high-frequency components by passing the input speech signal through a high-pass filter, namely

$$H(z) = 1 - u z^{-1}$$

wherein u is the pre-emphasis coefficient, with a value range of 0.9 to 1, typically 0.97.
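A minimal Python sketch of this filter, applied sample by sample as y(n) = x(n) - u*x(n-1):

    import numpy as np

    def pre_emphasis(x, u=0.97):
        # y(0) = x(0); y(n) = x(n) - u * x(n-1) for n >= 1
        return np.append(x[0], x[1:] - u * x[:-1])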
In step S1, the framing and windowing proceed on the premise that the parameter model of a speech signal is approximately stationary over 10 ms to 30 ms, so that 1 second contains 33 to 100 frames; adjacent frames overlap, the overlap being the frame shift, and the ratio of frame shift to frame length is 1/3 to 1/2. Finally, each frame is multiplied by a Hamming window, whose expression is

$$w(n) = (1-a) - a\cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1$$

where a is the Hamming window coefficient, taken as 0.46.
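A sketch of framing and windowing under these constraints; the 25 ms frame length and 10 ms shift in the usage comment are illustrative choices consistent with the stated 10-30 ms and 1/3-1/2 ranges, not values fixed by the patent:

    import numpy as np

    def frame_and_window(x, frame_len, frame_shift, a=0.46):
        # Cut the signal into overlapping frames (assumes len(x) >= frame_len).
        n_frames = 1 + (len(x) - frame_len) // frame_shift
        frames = np.stack([x[i * frame_shift : i * frame_shift + frame_len]
                           for i in range(n_frames)])
        # Hamming window w(n) = (1 - a) - a*cos(2*pi*n/(N-1)) with a = 0.46.
        window = (1 - a) - a * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
        return frames * window

    # e.g. at fs = 16 kHz: frames = frame_and_window(signal, 400, 160)  # 25 ms / 10 ms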
In step S2, the endpoint detection adopts the spectral entropy method; entropy characterizes the degree of order of a signal, and the spectral entropy method detects speech endpoints by measuring the flatness of the spectrum. Let the speech signal be $x(i)$; after windowing and framing, the $n$-th frame $x_n(m)$ is obtained, whose FFT is

$$X_n(k) = \sum_{m=0}^{N-1} x_n(m)\, e^{-j 2\pi k m / N}$$

where k indexes the k-th spectral line. The short-time energy of the speech frame in the frequency domain is

$$E_n = \sum_{k=0}^{N/2} X_n(k)\, X_n^{*}(k)$$

wherein N is the FFT length and only the positive-frequency part is taken. The energy spectrum of the k-th spectral line is $Y_n(k) = |X_n(k)|^2$, and the normalized spectral probability density function of each frequency component is

$$p_n(k) = \frac{Y_n(k)}{\sum_{l=0}^{N/2} Y_n(l)}$$

The short-time spectral entropy of the frame is then

$$H_n = -\sum_{k=0}^{N/2} p_n(k) \ln p_n(k)$$

The spectral entropy of each frame is computed in this way; the audio information from step S1 is trimmed by the spectral entropy method, and the regions rich in speech information are retained, as shown in fig. 2.
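A sketch of the per-frame spectral entropy computation described above (the decision threshold separating speech from silence is not given numerically in the text and is left out here):

    import numpy as np

    def spectral_entropy(frames, nfft=512):
        # Energy spectrum Y_n(k) over the positive frequencies of each frame.
        Y = np.abs(np.fft.rfft(frames, n=nfft, axis=1)) ** 2
        # Normalized spectral probability density p_n(k).
        p = Y / (Y.sum(axis=1, keepdims=True) + 1e-10)
        # Short-time spectral entropy H_n = -sum_k p_n(k) * ln p_n(k).
        return -(p * np.log(p + 1e-10)).sum(axis=1)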
In step S2, the endpoint detection adds a result verification mechanism: when the spectral entropy method fails, the framed speech signal is screened through an energy valve to remove the silent segments, the valve value being a threshold computed from E, the array of per-frame energies of the speech signal.
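A sketch of the fallback energy valve; because the valve formula survives only as an image in the source, the threshold rule below (a fixed fraction of the frame-energy range) is an assumption, not the patented formula:

    import numpy as np

    def energy_valve(frames, ratio=0.25):
        # E: array of per-frame energies of the speech signal.
        E = np.sum(frames.astype(np.float64) ** 2, axis=1)
        # Assumed threshold rule -- NOT the formula from the patent image.
        threshold = E.min() + ratio * (E.max() - E.min())
        return frames[E > threshold]  # keep frames above the valve, drop silence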
In step S3, the MFCC characteristic parameters are extracted as follows:
an FFT is applied to each frame of the preprocessed speech signal to obtain the spectrum of each frame, and the power spectrum of the speech signal is obtained by taking the squared modulus of the spectrum;
the energy spectrum is passed through a Mel triangular filter bank whose center frequencies are uniformly spaced on the Mel scale; the bank contains 22 to 26 filters, with the base vertices of each filter at the center frequencies of the adjacent filters, and the approximate relation between Mel frequency and frequency is

$$\mathrm{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$

wherein f is the frequency;
the logarithmic energy output by each Mel triangular filter is computed, and the MFCC feature vector is obtained through the discrete cosine transform (DCT), returning the default 13 cepstral dimensions;
the normalized MFCC characteristic parameters are obtained by standardizing the MFCC characteristic parameters column by column: the mean of each attribute (column) is subtracted and the result is divided by its standard deviation, so that for each attribute all data cluster around 0 with unit variance, yielding normalized features with the same dimension as the MFCC characteristic parameters.
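A sketch of the MFCC and normalized-MFCC step using the third-party python_speech_features and scikit-learn packages; the patent does not name these libraries, so they stand in here for any equivalent 13-dimensional MFCC routine and per-column standardization:

    from python_speech_features import mfcc
    from sklearn.preprocessing import scale

    # 13 cepstral coefficients per frame through a 26-filter Mel bank.
    mfcc_feat = mfcc(signal, samplerate=fs, numcep=13, nfilt=26)
    # Column-wise standardization: subtract the mean and divide by the standard
    # deviation, so each attribute clusters around 0 with unit variance.
    mfcc_norm = scale(mfcc_feat)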
In step S3, the GFCC characteristic parameters are extracted as follows:
the Fourier transform of each signal frame is computed, and the absolute value of the output spectrum is taken;
the energy spectrum is passed through a Gammatone filter bank containing 20 filters, whose output response is an N × M matrix, wherein N is the number of filter channels and M is the number of frames after sampling;
the logarithmic energy output by each filter is computed, and the GFCC feature vector is obtained through the discrete cosine transform (DCT), returning the default 13 cepstral dimensions.
In step S3, the PNCC characteristic parameters are extracted as follows:
the Fourier transform of each signal frame is computed, and the absolute value of the output spectrum is taken;
the energy spectrum is passed through a Gammatone filter bank containing 20 filters, whose output response is an N × M matrix, wherein N is the number of filter channels and M is the number of frames after sampling;
each frame is smoothed by averaging over the 2 frames on its left and right to obtain the average power;
the average power is normalized, and the PNCC characteristic parameters are obtained through an exponential (power-law) function that better matches the auditory characteristics of the human ear, followed by the discrete cosine transform (DCT).
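A simplified sketch of these PNCC-specific steps; the power-law exponent 1/15 is the value commonly used for PNCC and is assumed here, since the patent only calls it an exponential function matching human hearing:

    import numpy as np
    from scipy.fftpack import dct

    def pncc_like(energies, numcep=13):
        energies = np.asarray(energies, dtype=np.float64)  # (frames x 20) Gammatone outputs
        M = energies.shape[0]
        smoothed = np.empty_like(energies)
        for m in range(M):
            lo, hi = max(0, m - 2), min(M, m + 3)          # 2 frames on each side
            smoothed[m] = energies[lo:hi].mean(axis=0)     # average power
        normalized = smoothed / (smoothed.mean() + 1e-10)  # normalize the average power
        nonlinear = normalized ** (1.0 / 15.0)             # assumed power-law exponent
        return dct(nonlinear, type=2, axis=1, norm='ortho')[:, :numcep]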
As shown in fig. 3, in step S3 the MFCC characteristic parameters, the normalized MFCC characteristic parameters, the GFCC characteristic parameters, and the PNCC characteristic parameters are combined to form the multi-dimensional characteristic parameters. Each selected characteristic parameter set has 13 dimensions, so the combined multi-dimensional characteristic parameters have 52 dimensions.
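A sketch of the final combination, assuming the four 13-dimensional parameter sets (mfcc_feat, mfcc_norm, and the gfcc_feat and pncc_feat produced by the sketches above) were computed over the same frames:

    import numpy as np

    # Frame-wise concatenation of the four 13-dimensional feature sets.
    multi_feat = np.hstack([mfcc_feat, mfcc_norm, gfcc_feat, pncc_feat])
    assert multi_feat.shape[1] == 52  # 4 x 13 = 52 dimensions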
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are intended to further illustrate the principles of the invention, and that various changes and modifications may be made without departing from the spirit and scope of the invention, which is intended to be protected by the appended claims. The scope of the invention is defined by the claims and their equivalents.

Claims (10)

1. A multi-dimensional characteristic parameter voiceprint recognition method based on VAD, characterized by comprising the following steps:
step S1, reading, pre-emphasizing, framing, and windowing the input speech signal, converting it into a preprocessed speech signal;
step S2, accurately detecting the start and end frames of the framed preprocessed signal through endpoint detection and removing the silent segments;
step S3, extracting the MFCC characteristic parameters, normalized MFCC characteristic parameters, GFCC characteristic parameters, and PNCC characteristic parameters of the endpoint-detected speech signal, and combining them to form the multi-dimensional characteristic parameters.
2. The multi-dimensional characteristic parameter voiceprint recognition method based on VAD according to claim 1, wherein in step S1 the speech signal reading reads the wav-format audio files in the training set, using the read() function of the scipy.io.wavfile module in Python to obtain a one-dimensional array representing the audio information; and wherein in step S1 the pre-emphasis boosts the high-frequency components with the filter

$$H(z) = 1 - u z^{-1}$$

wherein u is the pre-emphasis coefficient, with a value range of 0.9 to 1.
3. The multi-dimensional characteristic parameter voiceprint recognition method based on VAD according to claim 1, wherein in step S1 the framing and windowing proceed on the premise that the parameter model of a speech signal is approximately stationary over 10 ms to 30 ms, so that 1 second contains 33 to 100 frames; adjacent frames overlap, the overlap being the frame shift, and the ratio of frame shift to frame length is 1/3 to 1/2; finally, each frame is multiplied by a Hamming window, whose expression is

$$w(n) = (1-a) - a\cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1$$

wherein a is the Hamming window coefficient.
4. The multi-dimensional characteristic parameter voiceprint recognition method based on VAD according to claim 1, wherein the endpoint detection in step S2 adopts the spectral entropy method; entropy characterizes the degree of order of a signal, and spectral-entropy endpoint detection detects speech endpoints by measuring the flatness of the spectrum; the speech signal is $x(i)$, and after windowing and framing the $n$-th frame $x_n(m)$ is obtained, whose FFT is

$$X_n(k) = \sum_{m=0}^{N-1} x_n(m)\, e^{-j 2\pi k m / N}$$

where k indexes the k-th spectral line; the short-time energy of the speech frame in the frequency domain is

$$E_n = \sum_{k=0}^{N/2} X_n(k)\, X_n^{*}(k)$$

wherein N is the FFT length and only the positive-frequency part is taken; the energy spectrum of the k-th spectral line is $Y_n(k) = |X_n(k)|^2$; the normalized spectral probability density function of each frequency component is

$$p_n(k) = \frac{Y_n(k)}{\sum_{l=0}^{N/2} Y_n(l)}$$

and the short-time spectral entropy of the frame is

$$H_n = -\sum_{k=0}^{N/2} p_n(k) \ln p_n(k)$$

the spectral entropy of each frame is computed in this way, the audio information from step S1 is trimmed by the spectral entropy method, and the regions rich in speech information are retained.
5. The multi-dimensional characteristic parameter voiceprint recognition method based on VAD according to claim 1, wherein in step S2 the endpoint detection adds a result verification mechanism: when the spectral entropy method fails, the framed speech signal is screened through an energy valve to remove the silent segments, the valve value being a threshold computed from E, the array of per-frame energies of the speech signal.
6. The multi-dimensional characteristic parameter voiceprint recognition method based on VAD according to claim 1, wherein in step S3 the MFCC characteristic parameters are extracted as follows:
an FFT is applied to each frame of the preprocessed speech signal to obtain the spectrum of each frame, and the power spectrum of the speech signal is obtained by taking the squared modulus of the spectrum;
the energy spectrum is passed through a Mel triangular filter bank whose center frequencies are uniformly spaced on the Mel scale; the bank contains 22 to 26 filters, with the base vertices of each filter at the center frequencies of the adjacent filters, and the approximate relation between Mel frequency and frequency is

$$\mathrm{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$

wherein f is the frequency;
the logarithmic energy output by each Mel triangular filter is computed, and the MFCC feature vector is obtained through the discrete cosine transform (DCT), returning the default 13 cepstral dimensions.
7. The multi-dimensional characteristic parameter voiceprint recognition method based on VAD according to claim 1, wherein in step S3 the GFCC characteristic parameters are extracted as follows:
the Fourier transform of each signal frame is computed, and the absolute value of the output spectrum is taken;
the energy spectrum is passed through a Gammatone filter bank containing 20 filters, whose output response is an N × M matrix, wherein N is the number of filter channels and M is the number of frames after sampling;
the logarithmic energy output by each filter is computed, and the GFCC feature vector is obtained through the discrete cosine transform (DCT), returning the default 13 cepstral dimensions.
8. The multi-dimensional characteristic parameter voiceprint recognition method based on VAD according to claim 1, wherein in step S3 the PNCC characteristic parameters are extracted as follows:
the Fourier transform of each signal frame is computed, and the absolute value of the output spectrum is taken;
the energy spectrum is passed through a Gammatone filter bank containing 20 filters, whose output response is an N × M matrix, wherein N is the number of filter channels and M is the number of frames after sampling.
9. The method according to claim 8, wherein each frame is smoothed by averaging over the 2 frames on its left and right to obtain the average power; the average power is normalized, and the PNCC characteristic parameters are obtained through an exponential function and the discrete cosine transform.
10. The multi-dimensional characteristic parameter voiceprint recognition method based on VAD according to claim 1, wherein the multi-dimensional characteristic parameters formed by the combination in step S3 have 52 dimensions.
CN202011557161.2A 2020-12-25 2020-12-25 VAD-based multi-dimensional characteristic parameter voiceprint identification method Pending CN112542174A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011557161.2A CN112542174A (en) 2020-12-25 2020-12-25 VAD-based multi-dimensional characteristic parameter voiceprint identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011557161.2A CN112542174A (en) 2020-12-25 2020-12-25 VAD-based multi-dimensional characteristic parameter voiceprint identification method

Publications (1)

Publication Number Publication Date
CN112542174A true CN112542174A (en) 2021-03-23

Family

ID=75017464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011557161.2A Pending CN112542174A (en) 2020-12-25 2020-12-25 VAD-based multi-dimensional characteristic parameter voiceprint identification method

Country Status (1)

Country Link
CN (1) CN112542174A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104835498A (en) * 2015-05-25 2015-08-12 重庆大学 Voiceprint identification method based on multi-type combination characteristic parameters
CN108922541A (en) * 2018-05-25 2018-11-30 南京邮电大学 Multidimensional characteristic parameter method for recognizing sound-groove based on DTW and GMM model
CN110349598A (en) * 2019-07-15 2019-10-18 桂林电子科技大学 A kind of end-point detecting method under low signal-to-noise ratio environment
CN111785285A (en) * 2020-05-22 2020-10-16 南京邮电大学 Voiceprint recognition method for home multi-feature parameter fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王萌 (Wang Meng); 王福龙 (Wang Fulong): "MFCC speaker recognition based on endpoint detection and Gaussian filter banks", 计算机系统应用 (Computer Systems & Applications), no. 10 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113179442A (en) * 2021-04-20 2021-07-27 浙江工业大学 Voice recognition-based audio stream replacement method in video
CN113179442B (en) * 2021-04-20 2022-04-29 浙江工业大学 Voice recognition-based audio stream replacement method in video
CN113205803A (en) * 2021-04-22 2021-08-03 上海顺久电子科技有限公司 Voice recognition method and device with adaptive noise reduction capability
CN113205803B (en) * 2021-04-22 2024-05-03 上海顺久电子科技有限公司 Voice recognition method and device with self-adaptive noise reduction capability
CN114038469A (en) * 2021-08-03 2022-02-11 成都理工大学 Speaker identification method based on multi-class spectrogram feature attention fusion network
CN114038469B (en) * 2021-08-03 2023-06-20 成都理工大学 Speaker identification method based on multi-class spectrogram characteristic attention fusion network
CN113823290A (en) * 2021-08-31 2021-12-21 杭州电子科技大学 Multi-feature fusion voiceprint recognition method
CN115273863A (en) * 2022-06-13 2022-11-01 广东职业技术学院 Compound network class attendance system and method based on voice recognition and face recognition

Similar Documents

Publication Publication Date Title
CN112542174A (en) VAD-based multi-dimensional characteristic parameter voiceprint identification method
CN107610715B (en) Similarity calculation method based on multiple sound characteristics
CN106935248B (en) Voice similarity detection method and device
JP4802135B2 (en) Speaker authentication registration and confirmation method and apparatus
CN102543073B (en) Shanghai dialect phonetic recognition information processing method
CN108986824B (en) Playback voice detection method
CN102968990B (en) Speaker identifying method and system
CN110931022B (en) Voiceprint recognition method based on high-low frequency dynamic and static characteristics
CN109256127B (en) Robust voice feature extraction method based on nonlinear power transformation Gamma chirp filter
CN108198545B (en) Speech recognition method based on wavelet transformation
CN110299141B (en) Acoustic feature extraction method for detecting playback attack of sound record in voiceprint recognition
CN103646649A (en) High-efficiency voice detecting method
Vyas A Gaussian mixture model based speech recognition system using Matlab
WO2018095167A1 (en) Voiceprint identification method and voiceprint identification system
CN111489763B (en) GMM model-based speaker recognition self-adaption method in complex environment
Couvreur et al. Automatic noise recognition in urban environments based on artificial neural networks and hidden markov models
CN112233657A (en) Speech enhancement method based on low-frequency syllable recognition
Nijhawan et al. A new design approach for speaker recognition using MFCC and VAD
Kaminski et al. Automatic speaker recognition using a unique personal feature vector and Gaussian Mixture Models
Tazi A robust speaker identification system based on the combination of GFCC and MFCC methods
Kumar et al. Text dependent speaker identification in noisy environment
Wang et al. Robust Text-independent Speaker Identification in a Time-varying Noisy Environment.
Shu-Guang et al. Isolated word recognition in reverberant environments
CN108962249B (en) Voice matching method based on MFCC voice characteristics and storage medium
Bonifaco et al. Comparative analysis of filipino-based rhinolalia aperta speech using mel frequency cepstral analysis and Perceptual Linear Prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination