CN111489763A - Adaptive method for speaker recognition in complex environment based on GMM model - Google Patents

Adaptive method for speaker recognition in complex environment based on GMM model

Info

Publication number
CN111489763A
CN111489763A (application CN202010284977.6A)
Authority
CN
China
Prior art keywords
voice
model
speaker recognition
mfcc
gmm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010284977.6A
Other languages
Chinese (zh)
Other versions
CN111489763B (en)
Inventor
郭雨欣
宋雨佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202010284977.6A priority Critical patent/CN111489763B/en
Publication of CN111489763A publication Critical patent/CN111489763A/en
Application granted granted Critical
Publication of CN111489763B publication Critical patent/CN111489763B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use, for comparison or discrimination

Abstract

The invention relates to signal processing technology, in particular to an adaptive method for speaker recognition in a complex environment based on a GMM model. The method comprises a GMM-based speaker recognition model construction stage: after the voice signal is preprocessed by low-pass filtering, pre-emphasis, windowing, framing and the like, it is filtered and denoised through a Gammatone filter, and GMFCC combined characteristic parameters are extracted. It also comprises a speaker recognition and adaptation stage: speaker recognition is completed by extracting the voice characteristic parameters of the speaker to be recognized and adaptively adjusting the original model. The method overcomes defects such as low speaker recognition accuracy caused by illness or a complex environment, and provides a novel combined-characteristic-parameter method that can jointly analyze different characteristics and effectively compensate for errors caused by voice changes under differing speaker conditions, thereby improving recognition accuracy.

Description

Adaptive method for speaker recognition in complex environment based on GMM model
Technical Field
The invention belongs to the technical field of signal processing, and particularly relates to a GMM model-based speaker recognition adaptive method in a complex environment.
Background
Speaker recognition is a method of extracting features from collected voice signals of a speaker, analyzing and processing those signals, and then identifying or verifying the speaker. With the rapid development of the internet and information technology, speaker recognition is used in more and more related fields. As a leading-edge technology, it is widely applied in smart home, judicial criminal investigation, identity verification and other fields.
With the progress of speaker recognition research, key technologies mainly develop around the problems of noise elimination, feature extraction, pattern matching and the like.
How to extract the individual features of the speaker from the speech signal is the key to voiceprint recognition. A voice signal is a mixture of the characteristics of the uttered speech and the personal characteristics of the speaker. The characteristic parameters extracted from the speaker's speech signal should satisfy certain criteria: they should be robust to external factors (the speaker's health condition and emotion, dialect, imitation by others, and the like), remain stable over a long period, and be easy to extract from the speech signal.
From the acoustic level, sound feature parameters can be roughly divided into two categories. The first comprises inherent characteristics related to the speaker's physiological structure, mainly embodied in the spectral structure of the voice, including spectral envelope information that reflects vocal tract resonance and fine structural information that reflects sound-source excitation properties such as vocal cord vibration; typical parameters are the pitch period coefficient and the formants, which are not easy to imitate but are easily affected by health condition. The second mainly reflects the dynamic characteristics of vocal tract activity, namely pronunciation manner and pronunciation habits, embodied in how the audio structure changes over time; these features, such as the representative Mel-frequency cepstral coefficients, are relatively stable but easier to imitate. If the two categories are objectively weighted and fused, their complementary strengths can be combined.
Meanwhile, the captured sound is also subject to interference from surrounding noise, so effective noise removal is another important factor in achieving high speaker recognition accuracy.
Currently, adaptive techniques are also maturing. With adaptation, the model parameters can be adjusted according to the speech characteristics of the test speakers, improving recognition accuracy.
Disclosure of Invention
The invention aims to provide an adaptive method that jointly analyzes different characteristics and effectively compensates for errors caused by voice changes due to illness or noise.
In order to achieve the purpose, the invention adopts the technical scheme that the speaker recognition self-adaptive method under the complex environment based on the GMM model comprises the following steps:
step 1, constructing a speaker recognition model based on GMM;
step 1.1, collecting a certain amount of voice data as training voice data for speaker recognition, and preprocessing the extracted voice data;
step 1.2, extracting a pitch period coefficient of the preprocessed voice signal by a cepstrum method;
step 1.3, extracting MFCC coefficients of the voice signals preprocessed in step 1.1, and filtering the voice signals through a Gammatone filter;
step 1.4, processing the MFCC coefficients to obtain first-order and second-order differences of the MFCC coefficients, and adding pitch period coefficients to obtain a GMFCC combined feature vector;
step 1.5, training a Gaussian mixture model by using the acoustic spectrum characteristics of a part of voice data;
step 2, speaker identification and self-adaptation;
step 2.1, preprocessing the voice to be recognized, extracting pitch period coefficients and MFCC coefficients from the voice data to be recognized, and obtaining GMFCC characteristics of the voice to be recognized after processing;
step 2.2, self-adaptive adjustment of the GMM model is carried out through the maximum posterior probability model;
and 2.3, identifying by using the adjusted model.
In the adaptive method for speaker recognition in a complex environment based on the GMM model, the step 1.1 is implemented by the following specific steps:
step 1.1.1, collecting a certain amount of voice data to make a corpus as training voice data for speaker recognition;
step 1.1.2, carrying out low-pass filtering on the obtained voice signal, retaining frequencies below 1000 Hz, and simultaneously carrying out windowing and framing to obtain frame signals;
and 1.1.3, performing least square method de-trend processing on each frame of signal, and eliminating noise in the voice signal by using spectral subtraction.
In the adaptive method for speaker recognition in a complex environment based on the GMM model, the implementation of step 1.2 includes the following specific steps:
step 1.2.1, analyzing the preprocessed signals to obtain a linear prediction model:
$$\hat{x}_i(m) = \sum_{l=1}^{p} a_l^{(i)}\, x_i(m-l)$$
wherein a_l^{(i)} denotes the l-th LPC coefficient of the i-th frame of speech, x_i(m-l) denotes the (m-l)-th sample of the frame, and x̂_i(m) denotes the predicted m-th sample;
step 1.2.2, deducing a transfer function of a prediction error:
$$A_i(z) = 1 - \sum_{l=1}^{p} a_l^{(i)} z^{-l}$$
wherein a_l^{(i)} denotes the l-th LPC coefficient of the i-th frame of speech;
step 1.2.3, eliminating the influence of a formant by utilizing a linear prediction method;
step 1.2.4, replacing spike ('burr') values in the voice signal with the median of the neighboring points using a median filtering algorithm, eliminating the influence of spikes in the voice on voice analysis;
and step 1.2.5, detecting the pitch period of the processed voice signal by using a cepstrum method, and calculating a pitch period coefficient.
In the adaptive method for speaker recognition in a complex environment based on the GMM model, the implementation of step 1.3 includes the following specific steps:
step 1.3.1, preprocessing the voice data obtained in the step 1.1, and distributing the voice data according to Mel frequency through a triangular filter bank configured with M band-pass filters;
step 1.3.2, carrying out logarithmic energy processing on the data output by each filter bank in the step 1.3.1;
and step 1.3.3, obtaining the MFCC parameters after the data obtained in the step 1.3.2 is subjected to Discrete Cosine Transform (DCT).
In the adaptive method for speaker recognition in a complex environment based on the GMM model, the implementation of step 1.4 includes the following specific steps:
step 1.4.1, after the MFCC parameters of the voice signals are extracted in the step 1.3, extracting first-order MFCC and second-order MFCC parameters by using the following equations;
$$d_t = \frac{\sum_{\theta=1}^{\Theta} \theta\,(c_{t+\theta} - c_{t-\theta})}{2 \sum_{\theta=1}^{\Theta} \theta^{2}}$$
$$S_m = \mathrm{MFCC} + \Delta \mathrm{MFCC} + \Delta\Delta \mathrm{MFCC}$$
wherein d_t is the first-order difference cepstral coefficient at frame t, T is the cepstral coefficient dimension (t = 1, ..., T), Θ is the time span of the first derivative, taken as 1 or 2, and c_t is the t-th cepstral coefficient;
step 1.4.2, the pitch period parameter T̂ extracted in the previous step and the MFCC parameter S_m obtained above, which are used in computing the posterior probability value of the test voice file, are normalized so that T̂' and S_m' become data between 0 and 1:
$$\hat{T}' = \hat{T} / \max(\hat{T}), \qquad S_m' = S_m / \max(S_m)$$
wherein T̂ denotes the pitch period parameter, max denotes the maximum value of the corresponding vector, and T̂' and S_m' denote the normalized pitch period parameter and MFCC parameter;
step 1.4.3, calculating influence degree factors C1 and C2 of the two parameters by using an entropy weight method to form a new combined parameter GMFCC:
$$\mathrm{GMFCC} = [\,C_1 \hat{T}',\; C_2 S_m'\,]$$
in the adaptive method for speaker recognition in a complex environment based on the GMM model, the implementation of step 1.5 includes the following specific steps:
and step 1.5.1, obtaining the GMM corresponding to each sample by using an EM algorithm, wherein each GMM corresponds to a respective mean value, covariance and weight.
In the adaptive method for speaker recognition in a complex environment based on the GMM model, the step 2.1 is implemented by the following specific steps:
step 2.1.1, preprocessing the voice to be recognized, including low-pass filtering, de-trending, framing, windowing and end point detection;
step 2.1.2, filtering the voice signals using a Gammatone filter;
and 2.1.3, extracting pitch period coefficients and MFCC coefficients of the voice to be recognized through a cepstrum method, and calculating first-order MFCC and second-order MFCC parameters to form GMFCC combined parameters.
In the adaptive method for speaker recognition in a complex environment based on the GMM model, the step 2.2 is implemented by performing adaptive speaker transformation on the original model according to the parameters of the speech to be recognized by using the maximum posterior probability model to obtain the adaptive model related to the speaker.
In the adaptive method for speaker recognition in a complex environment based on the GMM model, the step 2.3 is implemented by calculating, through the GMM formula, the probability value P(Z|a) of the speech to be recognized under each trained model, wherein Z is the speech data to be recognized and a is one of the trained speaker models; the model with the maximum probability value is selected and the speech to be recognized is labeled as that speaker.
The invention has the beneficial effects that: (1) two types of voice parameters are used for recognition; adding the pitch period parameter avoids the drop in recognition rate caused by voice changes due to illness or differing emotional states, while the MFCC parameters reflect the dynamic characteristics of vocal tract activity and provide a degree of stability.
(2) The original voice data are filtered with a Gammatone filter, removing the ambient noise of the complex environment that would otherwise reduce recognition accuracy.
(3) The original GMM is modified with the maximum a posteriori probability model according to the parameter characteristics of the voice data to be recognized, realizing model adaptation and effectively improving recognition accuracy.
Drawings
FIG. 1 is a general flow chart of one embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
To overcome defects such as reduced speaker recognition accuracy caused by illness or a complex environment, this embodiment provides a novel combined-characteristic-parameter method that can jointly analyze different characteristics, effectively compensate for errors caused by voice changes due to illness or noise, and improve recognition accuracy.
A speaker recognition adaptive method in a complex environment based on a GMM model comprises the following stages. In the GMM-based speaker recognition model construction stage, after the voice signals are preprocessed by low-pass filtering, pre-emphasis, windowing, framing and the like, they are filtered and denoised through a Gammatone filter, and GMFCC combined characteristic parameters are extracted. In the speaker identification and adaptation stage, speaker identification is completed by extracting the voice characteristic parameters of the speaker to be identified and adaptively adjusting the original model.
The construction stage of the GMM-based speaker recognition model specifically comprises the following steps:
and step S1, collecting a certain amount of voice data as training voice data for speaker recognition, and preprocessing the extracted voice data.
In step S2, pitch period coefficients of the preprocessed speech signal are extracted by a cepstrum method.
In step S3, MFCC coefficients are extracted from the preprocessed speech information, and filtering is performed by a Gammatone filter.
Step S4, the MFCC coefficients are processed to obtain the first and second order differences of the MFCC coefficients, and the pitch period coefficients are added to obtain the GMFCC combined feature vector.
And step S5, training a Gaussian mixture model by using the acoustic spectrum characteristics of a part of voice data.
The speaker recognition and adaptation stage specifically comprises the following steps:
step S6, preprocessing the speech to be recognized, extracting pitch period coefficients and MFCC coefficients from the speech data to be recognized, and obtaining GMFCC characteristics of the speech to be recognized after processing.
In step S7, the GMM model is adaptively adjusted by the maximum a posteriori probability model.
In step S8, recognition is performed using the adjusted model.
In specific implementation, as shown in fig. 1, the embodiment is a speaker recognition adaptive method in a complex environment based on a GMM model, and includes 7 functional modules: a data preprocessing module, a Gammatone filtering module, a pitch period parameter extraction module, an MFCC parameter extraction module, a GMFCC combined parameter module, a GMM module and an adaptation module. The data preprocessing module performs endpoint detection, pre-emphasis, framing and windowing on the original voice data using signal processing. The Gammatone filtering module filters and denoises the original voice signal, highlighting the speaker's voice. The pitch period parameter extraction module extracts the pitch period coefficient of the original voice, which is used as a voice characteristic parameter for later training and recognition. The MFCC parameter extraction module extracts the MFCC, first-order MFCC and second-order MFCC parameters of the voice. The GMFCC combined parameter module processes the pitch period parameters and MFCCs and concatenates them into a high-dimensional combined parameter, the GMFCC. The GMM module trains on the extracted characteristic parameters; the training samples of each speaker yield a corresponding GMM matching model through the EM algorithm. The adaptation module adjusts the original model parameters according to the acoustic characteristics of a new speaker using the MAP algorithm to realize adaptation.
The method of the embodiment comprises the following steps: a GMM-based speaker recognition model construction stage and a speaker recognition and self-adaptation stage.
The construction stage of the GMM-based speaker recognition model specifically comprises the following steps:
the step S1 specifically has the following substeps:
step S11, a certain amount of speech data is collected to make a corpus as training speech data for speaker recognition.
And step S12, performing low-pass filtering on the obtained voice signal, keeping only frequencies below 1000 Hz, and simultaneously performing windowing and framing to obtain frame signals.
In step S13, a least-squares de-trending process is performed on each frame of signal, and noise in the speech signal is removed using spectral subtraction, as illustrated in the sketch below.
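For illustration only (not part of the patent text), the following is a minimal Python sketch of the preprocessing chain in steps S11-S13; the filter order, 25 ms/10 ms framing, and Hamming window are assumed values, since the patent does not specify them.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, detrend

def preprocess(signal, sr, frame_len=0.025, frame_shift=0.010):
    # S12: low-pass filter, keeping only content below 1000 Hz.
    sos = butter(8, 1000, btype="low", fs=sr, output="sos")
    filtered = sosfiltfilt(sos, signal)

    # S12: split into overlapping frames and apply a Hamming window
    # (the window type is an assumption).
    n, step = int(frame_len * sr), int(frame_shift * sr)
    frames = np.stack([filtered[i:i + n]
                       for i in range(0, len(filtered) - n + 1, step)])
    frames = frames * np.hamming(n)

    # S13: least-squares (linear) de-trending of each frame.
    return detrend(frames, axis=1, type="linear")

def spectral_subtract(frames, noise_mag):
    # S13: spectral subtraction - subtract an estimated noise magnitude
    # spectrum (e.g. averaged over non-speech frames) and floor at zero.
    spec = np.fft.rfft(frames, axis=1)
    mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
    return np.fft.irfft(mag * np.exp(1j * np.angle(spec)), axis=1)
```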
The substeps of step S2 are as follows:
step S21, analyzing the preprocessed signals to obtain a linear prediction model thereof:
$$\hat{x}_i(m) = \sum_{l=1}^{p} a_l^{(i)}\, x_i(m-l)$$
wherein a_l^{(i)} denotes the l-th LPC coefficient of the i-th frame of speech, x_i(m-l) denotes the (m-l)-th sample of the frame, and x̂_i(m) denotes the predicted m-th sample.
Step S22, deriving a transfer function of the prediction error:
$$A_i(z) = 1 - \sum_{l=1}^{p} a_l^{(i)} z^{-l}$$
wherein a_l^{(i)} denotes the l-th LPC coefficient of the i-th frame of speech.
In step S23, the influence of the formants is eliminated by the linear prediction method.
And step S24, replacing spike ('burr') values in the voice signal with the median of the neighboring points using a median filtering algorithm, eliminating the influence of spikes in the voice on voice analysis.
Step S25, the pitch period of the processed voice signal is detected using the cepstrum method and the pitch period coefficient is calculated, as in the sketch below.
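As a hedged illustration of steps S24-S25 (the patent gives no parameter values), the sketch below median-filters a frame and reads the pitch period off the dominant cepstral peak; the 50-400 Hz search band and the kernel size are assumptions.

```python
import numpy as np
from scipy.signal import medfilt

def pitch_period(frame, sr, fmin=50, fmax=400):
    # S24: median filtering to suppress spike ("burr") points.
    frame = medfilt(frame, kernel_size=5)
    # Real cepstrum: inverse FFT of the log magnitude spectrum.
    spec = np.abs(np.fft.rfft(frame)) + 1e-12
    ceps = np.fft.irfft(np.log(spec))
    # S25: the pitch period appears as a peak in the quefrency range of
    # plausible fundamentals; the frame must exceed sr/fmin samples.
    qmin, qmax = int(sr / fmax), int(sr / fmin)
    return (qmin + np.argmax(ceps[qmin:qmax])) / sr  # period in seconds
```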
the substeps of step S3 are as follows:
step S31, obtaining the processed voice data after the preprocessing of step S1, and passing the processed voice data through a triangular filter bank configured with M band-pass filters to distribute the voice data according to Mel frequency so as to satisfy the requirement of auditory habits of human ears.
In step S32, logarithmic energy processing is performed on the data output from each filter bank in step S31.
And step S33, the MFCC parameters are obtained after the data from step S32 undergo the Discrete Cosine Transform (DCT); a sketch of steps S31-S33 follows.
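A sketch of steps S31-S33 using librosa (the library choice, M = 26 mel filters, and 13 coefficients are all assumptions); the Gammatone denoising that the patent applies around this stage is omitted here.

```python
import librosa

def extract_mfcc(signal, sr, n_mels=26, n_mfcc=13):
    # S31: energies of a mel-spaced triangular band-pass filter bank.
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=n_mels)
    # S32: log energy; S33: DCT of the log energies yields the MFCCs.
    return librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=n_mfcc)
```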
The substeps of step S4 are as follows:
in step S41, after the MFCC parameters of the speech signal are extracted in step S3, the first-order MFCC and the second-order MFCC parameters can be extracted by the following equations.
$$d_t = \frac{\sum_{\theta=1}^{\Theta} \theta\,(c_{t+\theta} - c_{t-\theta})}{2 \sum_{\theta=1}^{\Theta} \theta^{2}}$$
$$S_m = \mathrm{MFCC} + \Delta \mathrm{MFCC} + \Delta\Delta \mathrm{MFCC}$$
wherein d_t is the first-order difference cepstral coefficient at frame t, T is the cepstral coefficient dimension (t = 1, ..., T), Θ is the time span of the first derivative, taken as 1 or 2, and c_t is the t-th cepstral coefficient.
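To make the difference formula concrete, here is a sketch of the Δ operator with Θ = 2 (the edge padding is an assumption); ΔΔMFCC is obtained by applying the same operator to the ΔMFCC sequence, and reading the '+' in S_m as concatenation gives the combined static-plus-dynamic vector.

```python
import numpy as np

def delta(ceps, big_theta=2):
    """ceps: (n_frames, n_coeffs) MFCC matrix; returns same-shape deltas."""
    padded = np.pad(ceps, ((big_theta, big_theta), (0, 0)), mode="edge")
    num = sum(th * (padded[big_theta + th: big_theta + th + len(ceps)]
                    - padded[big_theta - th: big_theta - th + len(ceps)])
              for th in range(1, big_theta + 1))
    return num / (2 * sum(th ** 2 for th in range(1, big_theta + 1)))

# With frames as rows (transpose librosa's output if needed):
# S_m = np.hstack([mfcc, delta(mfcc), delta(delta(mfcc))])
```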
Step S42, the pitch period parameter T̂ extracted in the previous step and the MFCC parameter S_m obtained above, which are used in computing the posterior probability value of the test voice file, are normalized so that T̂' and S_m' become data between 0 and 1:
$$\hat{T}' = \hat{T} / \max(\hat{T}), \qquad S_m' = S_m / \max(S_m)$$
wherein T̂ denotes the pitch period parameter, max denotes the maximum value of the corresponding vector, and T̂' and S_m' denote the normalized pitch period parameter and MFCC parameters.
Step S43, calculating the influence degree factors C1 and C2 of the two parameters by using an entropy weight method, and forming a new combination parameter GMFCC:
$$\mathrm{GMFCC} = [\,C_1 \hat{T}',\; C_2 S_m'\,]$$
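The patent does not spell out its entropy-weight computation; the sketch below follows the common formulation as one plausible reading, producing influence factors C1 and C2 that sum to one.

```python
import numpy as np

def entropy_weights(features):
    """features: list of 1-D non-negative arrays already scaled to [0, 1]."""
    entropies = []
    for f in features:
        p = f / (f.sum() + 1e-12)                 # each sample's proportion
        h = -(p * np.log(p + 1e-12)).sum() / np.log(len(f))  # entropy in [0, 1]
        entropies.append(h)
    d = 1.0 - np.array(entropies)                 # degree of divergence
    return d / d.sum()                            # weights summing to one

# Hypothetical usage with the normalized parameters from step S42:
# C1, C2 = entropy_weights([pitch_norm, mfcc_norm.ravel()])
# gmfcc = np.concatenate([C1 * pitch_norm, C2 * mfcc_norm.ravel()])
```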
the substeps of step S5 are as follows:
step S51, the mean, covariance and weight of each GMM model corresponding to each sample are obtained by EM algorithm.
The speaker recognition and adaptation stage specifically comprises the following steps:
the substeps of step S6 are as follows:
step S61, pre-process the speech to be recognized, including low-pass filtering, de-trending, framing, windowing, end-point detection, etc.
In step S62, the speech signal is filtered by the Gammatone filter.
Step S63, extracting pitch period coefficient and MFCC coefficient of the speech to be recognized by cepstrum method, and calculating first order MFCC and second order MFCC parameters to form GMFCC combined parameters.
The substep of step S7 is as follows:
step S71 is to perform adaptive adjustment of the GMM model through the maximum a posteriori probability model, i.e. to perform adaptive transformation of the speaker on the original model according to the parameters of the speech to be recognized by using MAP (maximum a posteriori probability model), so as to obtain the adaptive model related to the speaker.
The substep of step S8 is as follows:
step S81 identifies the voice to be identified and the probability value P (Z | a) of the original training (Z is the voice data to be identified and a is one of the training data) by GMM formula, and selects the model with the highest probability value, and then labels the voice to be identified as the speaker.
It should be understood that parts of the specification not set forth in detail are well within the prior art.
Although specific embodiments of the present invention have been described above with reference to the accompanying drawings, it will be appreciated by those skilled in the art that these are merely illustrative and that various changes or modifications may be made to these embodiments without departing from the principles and spirit of the invention. The scope of the invention is only limited by the appended claims.

Claims (9)

1. A speaker recognition self-adaptive method in a complex environment based on a GMM model is characterized by comprising the following steps:
step 1, constructing a speaker recognition model based on GMM;
step 1.1, collecting a certain amount of voice data as training voice data for speaker recognition, and preprocessing the extracted voice data;
step 1.2, extracting a pitch period coefficient of the preprocessed voice signal by a cepstrum method;
step 1.3, extracting MFCC coefficients of the voice signals preprocessed in step 1.1, and filtering the voice signals through a Gammatone filter;
step 1.4, processing the MFCC coefficients to obtain first-order and second-order differences of the MFCC coefficients, and adding pitch period coefficients to obtain a GMFCC combined feature vector;
step 1.5, training a Gaussian mixture model by using the acoustic spectrum characteristics of a part of voice data;
step 2, speaker identification and self-adaptation;
step 2.1, preprocessing the voice to be recognized, extracting pitch period coefficients and MFCC coefficients from the voice data to be recognized, and obtaining GMFCC characteristics of the voice to be recognized after processing;
step 2.2, self-adaptive adjustment of the GMM model is carried out through the maximum posterior probability model;
and 2.3, identifying by using the adjusted model.
2. The adaptive method for speaker recognition in a complex environment based on GMM model as claimed in claim 1, wherein the step 1.1 is implemented by the following steps:
step 1.1.1, collecting a certain amount of voice data to make a corpus as training voice data for speaker recognition;
step 1.1.2, carrying out low-pass filtering on the obtained voice signal, retaining frequencies below 1000 Hz, and simultaneously carrying out windowing and framing to obtain frame signals;
and 1.1.3, performing least square method de-trend processing on each frame of signal, and eliminating noise in the voice signal by using spectral subtraction.
3. The adaptive method for speaker recognition in a complex environment based on GMM model as claimed in claim 1, wherein the step 1.2 is implemented by the following steps:
step 1.2.1, analyzing the preprocessed signals to obtain a linear prediction model:
$$\hat{x}_i(m) = \sum_{l=1}^{p} a_l^{(i)}\, x_i(m-l)$$
wherein a_l^{(i)} denotes the l-th LPC coefficient of the i-th frame of speech, x_i(m-l) denotes the (m-l)-th sample of the frame, and x̂_i(m) denotes the predicted m-th sample;
step 1.2.2, deducing a transfer function of a prediction error:
$$A_i(z) = 1 - \sum_{l=1}^{p} a_l^{(i)} z^{-l}$$
wherein a_l^{(i)} denotes the l-th LPC coefficient of the i-th frame of speech;
step 1.2.3, eliminating the influence of a formant by utilizing a linear prediction method;
step 1.2.4, replacing spike ('burr') values in the voice signal with the median of the neighboring points using a median filtering algorithm, eliminating the influence of spikes in the voice on voice analysis;
and step 1.2.5, detecting the pitch period of the processed voice signal by using a cepstrum method, and calculating a pitch period coefficient.
4. The adaptive method for speaker recognition in a complex environment based on GMM model as claimed in claim 1, wherein the step 1.3 is implemented by the following steps:
step 1.3.1, preprocessing the voice data obtained in the step 1.1, and distributing the voice data according to Mel frequency through a triangular filter bank configured with M band-pass filters;
step 1.3.2, carrying out logarithmic energy processing on the data output by each filter bank in the step 1.3.1;
and step 1.3.3, obtaining the MFCC parameters after the data obtained in the step 1.3.2 is subjected to Discrete Cosine Transform (DCT).
5. The adaptive method for speaker recognition in a complex environment based on GMM model as claimed in claim 1, wherein the step 1.4 is implemented by the following steps:
step 1.4.1, after the MFCC parameters of the voice signals are extracted in the step 1.3, extracting first-order MFCC and second-order MFCC parameters by using the following equations;
$$d_t = \frac{\sum_{\theta=1}^{\Theta} \theta\,(c_{t+\theta} - c_{t-\theta})}{2 \sum_{\theta=1}^{\Theta} \theta^{2}}$$
$$S_m = \mathrm{MFCC} + \Delta \mathrm{MFCC} + \Delta\Delta \mathrm{MFCC}$$
wherein d_t is the first-order difference cepstral coefficient at frame t, T is the cepstral coefficient dimension (t = 1, ..., T), Θ is the time span of the first derivative, taken as 1 or 2, and c_t is the t-th cepstral coefficient;
step 1.4.2, the pitch period parameter T̂ extracted in the previous step and the MFCC parameter S_m obtained above, which are used in computing the posterior probability value of the test voice file, are normalized so that T̂' and S_m' become data between 0 and 1:
$$\hat{T}' = \hat{T} / \max(\hat{T}), \qquad S_m' = S_m / \max(S_m)$$
wherein T̂ denotes the pitch period parameter, max denotes the maximum value of the corresponding vector, and T̂' and S_m' denote the normalized pitch period parameter and MFCC parameter;
step 1.4.3, calculating influence degree factors C1 and C2 of the two parameters by using an entropy weight method to form a new combined parameter GMFCC:
$$\mathrm{GMFCC} = [\,C_1 \hat{T}',\; C_2 S_m'\,]$$
6. the adaptive method for speaker recognition in a complex environment based on GMM model as claimed in claim 1, wherein the step 1.5 is implemented by the following steps:
and step 1.5.1, obtaining the GMM corresponding to each sample by using an EM algorithm, wherein each GMM corresponds to a respective mean value, covariance and weight.
7. The adaptive method for speaker recognition in a complex environment based on GMM model as claimed in claim 1, wherein the step 2.1 is implemented by the following steps:
step 2.1.1, preprocessing the voice to be recognized, including low-pass filtering, de-trending, framing, windowing and end point detection;
step 2.1.2, filtering the voice signals using a Gammatone filter;
and 2.1.3, extracting pitch period coefficients and MFCC coefficients of the voice to be recognized through a cepstrum method, and calculating first-order MFCC and second-order MFCC parameters to form GMFCC combined parameters.
8. The adaptive method for speaker recognition in a complex environment based on GMM model as claimed in claim 1, wherein the step 2.2 is implemented by using a maximum a posteriori probability model to perform speaker adaptive transformation on the original model according to the parameters of the speech to be recognized, so as to obtain the speaker dependent adaptive model.
9. The adaptive method for speaker recognition in a complex environment based on the GMM model as claimed in claim 1, wherein the step 2.3 is implemented by calculating, through the GMM formula, the probability value P(Z|a) of the speech to be recognized under each trained model, Z being the speech data to be recognized and a being one of the trained models, and selecting the model with the highest probability value to label the speech to be recognized as that speaker.
CN202010284977.6A 2020-04-13 2020-04-13 GMM model-based speaker recognition self-adaption method in complex environment Active CN111489763B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010284977.6A CN111489763B (en) 2020-04-13 2020-04-13 GMM model-based speaker recognition self-adaption method in complex environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010284977.6A CN111489763B (en) 2020-04-13 2020-04-13 GMM model-based speaker recognition self-adaption method in complex environment

Publications (2)

Publication Number Publication Date
CN111489763A (en) 2020-08-04
CN111489763B CN111489763B (en) 2023-06-20

Family

ID=71812744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010284977.6A Active CN111489763B (en) 2020-04-13 2020-04-13 GMM model-based speaker recognition self-adaption method in complex environment

Country Status (1)

Country Link
CN (1) CN111489763B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102324232A (en) * 2011-09-12 2012-01-18 辽宁工业大学 Method for recognizing sound-groove and system based on gauss hybrid models
JP2016143043A (en) * 2015-02-05 2016-08-08 日本電信電話株式会社 Speech model learning method, noise suppression method, speech model learning system, noise suppression system, speech model learning program, and noise suppression program
CN104900235A (en) * 2015-05-25 2015-09-09 重庆大学 Voiceprint recognition method based on pitch period mixed characteristic parameters
CN105679312A (en) * 2016-03-04 2016-06-15 重庆邮电大学 Phonetic feature processing method of voiceprint identification in noise environment
CN106782500A (en) * 2016-12-23 2017-05-31 电子科技大学 A kind of fusion feature parameter extracting method based on pitch period and MFCC
CN107369440A (en) * 2017-08-02 2017-11-21 北京灵伴未来科技有限公司 The training method and device of a kind of Speaker Identification model for phrase sound
CN110400565A (en) * 2019-08-20 2019-11-01 广州国音智能科技有限公司 Method for distinguishing speek person, system and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
叶寒生 (Ye Hansheng) et al., "Speaker Recognition Based on Feature Information Fusion in a Noisy Environment", Computer Simulation (《计算机仿真》) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112951245A (en) * 2021-03-09 2021-06-11 江苏开放大学(江苏城市职业学院) Dynamic voiceprint feature extraction method integrated with static component
WO2022205249A1 (en) * 2021-03-31 2022-10-06 华为技术有限公司 Audio feature compensation method, audio recognition method, and related product
CN113567969A (en) * 2021-09-23 2021-10-29 江苏禹治流域管理技术研究院有限公司 Illegal sand dredger automatic monitoring method and system based on underwater acoustic signals
CN113567969B (en) * 2021-09-23 2021-12-17 江苏禹治流域管理技术研究院有限公司 Illegal sand dredger automatic monitoring method and system based on underwater acoustic signals

Also Published As

Publication number Publication date
CN111489763B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN105513605B (en) The speech-enhancement system and sound enhancement method of mobile microphone
CN106486131B (en) A kind of method and device of speech de-noising
Cai et al. Sensor network for the monitoring of ecosystem: Bird species recognition
CN105023573B (en) It is detected using speech syllable/vowel/phone boundary of auditory attention clue
JP4802135B2 (en) Speaker authentication registration and confirmation method and apparatus
Liu et al. Bone-conducted speech enhancement using deep denoising autoencoder
CN111489763B (en) GMM model-based speaker recognition self-adaption method in complex environment
CN102968990B (en) Speaker identifying method and system
CN111816218A (en) Voice endpoint detection method, device, equipment and storage medium
CN108922541A (en) Multidimensional characteristic parameter method for recognizing sound-groove based on DTW and GMM model
CN113012720B (en) Depression detection method by multi-voice feature fusion under spectral subtraction noise reduction
Hui et al. Convolutional maxout neural networks for speech separation
Liang et al. Real-time speech enhancement algorithm based on attention LSTM
CN112397074A (en) Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning
Chiu et al. Learning-based auditory encoding for robust speech recognition
CN112466276A (en) Speech synthesis system training method and device and readable storage medium
CN110415707B (en) Speaker recognition method based on voice feature fusion and GMM
CN111524520A (en) Voiceprint recognition method based on error reverse propagation neural network
Kaminski et al. Automatic speaker recognition using a unique personal feature vector and Gaussian Mixture Models
CN111862991A (en) Method and system for identifying baby crying
Cai et al. The best input feature when using convolutional neural network for cough recognition
Wang et al. Robust Text-independent Speaker Identification in a Time-varying Noisy Environment.
Tzudir et al. Low-resource dialect identification in Ao using noise robust mean Hilbert envelope coefficients
CN114302301A (en) Frequency response correction method and related product
CN107993666B (en) Speech recognition method, speech recognition device, computer equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant