CN111816187A - Deep neural network-based voice feature mapping method in complex environment - Google Patents

Deep neural network-based voice feature mapping method in complex environment

Info

Publication number
CN111816187A
Authority
CN
China
Prior art keywords
complex environment
voice
environment
feature mapping
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010635342.6A
Other languages
Chinese (zh)
Inventor
刘剑豪
王亨佳
胡乔林
高坡
都兴霖
杨华兵
王敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Air Force Early Warning Academy
Original Assignee
Air Force Early Warning Academy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Air Force Early Warning Academy filed Critical Air Force Early Warning Academy
Priority to CN202010635342.6A priority Critical patent/CN111816187A/en
Publication of CN111816187A publication Critical patent/CN111816187A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/18 Artificial neural networks; Connectionist approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Abstract

The invention relates to the technical field of voice signal processing, and discloses a deep neural network-based voice feature mapping method in a complex environment, which comprises the following steps: Step 1: constructing a large number of speech signal data pairs in clean and complex environments; Step 2: extracting the features of the speech signal in the clean environment; Step 3: extracting the features of the speech signal in the complex environment; Step 4: training a DNN model; Step 5: mapping the speech features in the complex environment using the trained DNN model. The method adopts a DNN as the mapping model, which can effectively fit the nonlinear relation between the speech signal feature parameters in a complex environment and those in a clean environment; the proposed feature mapping method maps the features of the speech signal in the complex environment and effectively improves the purity of the speech features in the complex environment; and the method has generalization capability over most acoustic scenes.

Description

Deep neural network-based voice feature mapping method in complex environment
Technical Field
The invention relates to the technical field of voice signal processing, in particular to a deep neural network-based voice feature mapping method in a complex environment.
Background
With the continuous development of speech signal processing, pattern recognition and artificial intelligence technologies, speaker recognition technology has begun to move from the laboratory to practical applications, and shows great application prospects in fields such as information security, financial verification, criminal investigation, and military and national defense security.
Speech is the communication mode that humans use most frequently and efficiently, and it is also one of the human biometric characteristics. In a good acoustic environment, speech as a human-computer interaction mode not only conveys the meaning people express, but also enables accurate and fast personal identity authentication through speaker recognition technology. These broad application demands have driven the development of speaker recognition within speech technology, making it an important element of the human-computer interaction revolution, with epoch-making significance.
However, real life is full of uncertain factors, so the robustness of speaker recognition still faces great difficulties and challenges. These factors mainly include background noise, channel differences, speech variability, short utterances, time-varying speech, and emotional influence. In particular, channel differences and environmental changes are unavoidable obstacles for speaker recognition technology in application scenarios and cause speaker recognition performance to degrade sharply.
Current solutions to the influence of environmental noise and channel differences on speaker recognition fall into three main categories: the feature domain, the model domain, and the score domain. Addressing channel interference in the feature domain is effective because it does not depend on the model, is independent of the scoring algorithm applied after model matching, and relates directly to the inherent attributes of the speech signal. Feature mapping is one such feature-domain solution: by training a mapping model over most generalized scenes, the speech signal features in a clean environment are predicted from the speech signal features in a complex environment, greatly improving the channel and noise robustness of the features.
Disclosure of Invention
In view of the above problems, the present invention is directed to providing a deep neural network-based speech feature mapping method in a complex environment, which solves the problem of speech signal spectrum distortion caused by the mismatch between the training environment and the testing environment, and also eliminates the distortion of speaker feature parameters caused by channel transmission characteristics.
The invention adopts the following technical scheme for realizing the technical purpose: the deep neural network-based voice feature mapping method under the complex environment comprises the following steps:
Step 1: constructing a large number of speech signal data pairs in clean and complex environments;
Step 2: extracting the features of the speech signal in the clean environment;
Step 3: extracting the features of the speech signal in the complex environment;
Step 4: training a DNN model;
Step 5: mapping the speech features in the complex environment using the trained DNN model.
Further, step 1 is specifically realized as follows:
The DNN-based feature mapping method is a supervised training model and requires a large number of parallel corpora, i.e., data pairs of speech in a complex environment and speech in a clean environment, so that the DNN model attains good generalization capability. In general, the main factors affecting the speaker recognition rate are environmental noise and channel difference, with additive noise in the environmental noise having the largest influence on the recognition rate; therefore, a large number of parallel corpora are constructed according to the complex environment model, given by the formula:
S = f(X, w) + αN
where α is an adjustment factor used to control the signal-to-noise ratio; X represents the speech signal collected in the clean environment; N represents the noise signal; and w is a channel transmission parameter. With this model, massive complex-environment speech data with multiple signal-to-noise ratios, multiple noise types and different channel transmissions can be constructed.
Further, the specific implementation of step 2 includes the following steps:
Step 2.1: preprocessing the speech signal in the clean environment, including sampling, quantization, pre-emphasis, endpoint detection, framing and windowing;
Step 2.2: extracting Mel-frequency cepstral coefficient (MFCC) features from the signal preprocessed in step 2.1.
Further, the specific implementation of step 3 includes the following steps:
Step 3.1: preprocessing the speech signal in the complex environment, including sampling, quantization, pre-emphasis, endpoint detection, framing and windowing;
Step 3.2: extracting MFCC features from the signal preprocessed in step 3.1.
Further, the specific implementation of step 4 includes the following steps:
Step 4.1: training a restricted Boltzmann machine (RBM)-based initialization model using the speech feature parameters in the complex environment;
Step 4.2: updating the parameters of the entire DNN using a back-propagation algorithm with a minimum mean square error criterion between the features of speech in the clean environment and speech in the complex environment.
Further, step 5 comprises the following steps:
Step 5.1: the feature mapping model is:
ĉ_i = f(c_i; w_1, …, w_j) + μ_i
where c_i is the feature of the speech signal in the complex environment; w_1, …, w_j are the DNN model parameters and f(·; w_1, …, w_j) is a nonlinear function; μ_i is a perturbation term; and ĉ_i is the feature obtained after the feature mapping;
Step 5.2: substituting the parameters estimated in step 4.2 into the feature mapping parameter masking matrix yields the speech features after the complex-environment speech passes through the feature mapping model; the features at this point can be approximately regarded as speech features in a clean environment.
The invention has the following beneficial effects:
1. The method adopts a DNN as the mapping model, which can effectively fit the nonlinear relation between the speech signal feature parameters in a complex environment and those in a clean environment.
2. The feature mapping method provided by the invention maps the features of the speech signal in the complex environment and effectively improves the purity of the speech features in the complex environment.
3. The feature mapping method provided by the invention has generalization capability over most acoustic scenes.
Drawings
FIG. 1 is a flowchart of a deep neural network-based feature mapping method according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. It is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of them; all other embodiments obtained by a person of ordinary skill in the art from the given embodiments without creative effort shall fall within the protection scope of the present invention.
Referring to fig. 1, a deep neural network-based speech feature mapping method in a complex environment includes the following steps:
Step 1: constructing data pairs of speech signals in a clean environment and a complex environment;
the concrete implementation is as follows:
the DNN-based feature mapping method is a supervised training model, needs a large amount of parallel linguistic data, and also needs a large amount of data pairs of voice under a complex environment and voice under a clean environment, so that the DNN model has better generalization capability; in general, the main factors influencing the recognition rate of a speaker are environmental noise and channel difference, and additive noise in the environmental noise has the largest influence on the recognition rate, so that a large number of parallel corpora are constructed according to a complex environment model, and the complex model is shown as the formula:
S=f(X,w)+αN
where α is an adjustment factor used to control the signal-to-noise ratio; x represents a voice signal collected in a clean environment; t represents a channel transmission matrix for controlling transmission characteristics of different channels; n represents a noise signal; w is a channel parameter; the model can construct massive voice data with multiple signal-to-noise ratios, multiple noise types and complex environments transmitted by different channels.
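By way of illustration only, the complex environment model can be instantiated as in the following minimal sketch, which assumes f(X, w) is a linear channel represented by an impulse response w and derives α from a target signal-to-noise ratio; the function name mix_at_snr and all parameter values are illustrative, not part of the original disclosure.

import numpy as np

def mix_at_snr(clean, noise, channel_ir, snr_db):
    """Build one complex-environment signal from S = f(X, w) + alpha*N,
    with f modeled here as convolution by a channel impulse response."""
    channeled = np.convolve(clean, channel_ir, mode="same")  # f(X, w)
    noise = np.resize(noise, channeled.shape)                # match lengths
    speech_power = np.mean(channeled ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Choose alpha so the mixture reaches the requested signal-to-noise ratio
    alpha = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return channeled + alpha * noise

# Toy example: one clean utterance paired with mixtures at several SNRs
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)      # placeholder for a clean utterance
noise = rng.standard_normal(16000)      # placeholder for a noise recording
channel_ir = np.array([1.0, 0.3, 0.1])  # toy channel transmission parameters w
pairs = [(clean, mix_at_snr(clean, noise, channel_ir, snr)) for snr in (0, 5, 10, 15)]

Repeating this over many utterances, noise types, channel responses and signal-to-noise ratios yields the massive parallel corpus described above.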
Step 2: extracting characteristic parameters of the voice in a clean environment;
the specific implementation comprises the following substeps:
step 2.1: preprocessing a voice signal obtained in a clean environment, including sampling, quantizing, pre-emphasizing, framing and windowing;
step 2.2: the preprocessed signal in step 2.1 is extracted as MFCC.
Step 3: extracting the feature parameters of the speech in the complex environment;
Step 3.1: preprocessing the speech signal in the complex environment, including sampling, quantization, pre-emphasis, framing and windowing;
Step 3.2: extracting MFCC features from the signal preprocessed in step 3.1.
Step 4: training the DNN mapping model on large-scale data;
Step 4.1: training a restricted Boltzmann machine (RBM)-based initialization model using the speech feature parameters in the complex environment;
Step 4.2: updating the parameters of the whole DNN using a back-propagation algorithm with a minimum mean square error criterion between the features of clean-environment speech and complex-environment speech; the objective function of the network training is:
E = (1/N) Σ_{n=1}^{N} ‖ŝ(n) − s(n)‖²
where N is the number of samples, s(n) denotes the clean speech feature, and ŝ(n) denotes the feature the network predicts from the shortwave (complex-environment) speech; minimizing E corrects the network weights. The correction formulas are:
W_l ← W_l − η·∂E/∂W_l
b_l ← b_l − η·∂E/∂b_l
where η is the learning rate; W_l and b_l are the parameters of the l-th layer; L is the total number of hidden layers; and layer L+1 is the output layer of the network.
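A minimal training sketch of step 4.2 under stated assumptions: PyTorch is used, the RBM-based initialization of step 4.1 is replaced by random initialization for brevity, and the layer sizes and learning rate are illustrative.

import torch
import torch.nn as nn

class MappingDNN(nn.Module):
    """Fully connected DNN mapping complex-environment MFCCs to clean ones.
    RBM pre-training (step 4.1) is omitted; random initialization is used."""
    def __init__(self, dim=13, hidden=512, num_hidden_layers=3):
        super().__init__()
        layers, prev = [], dim
        for _ in range(num_hidden_layers):       # the L hidden layers
            layers += [nn.Linear(prev, hidden), nn.Sigmoid()]
            prev = hidden
        layers.append(nn.Linear(prev, dim))      # layer L+1: the output layer
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = MappingDNN()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # eta, the learning rate
criterion = nn.MSELoss()  # E = (1/N) * sum over n of ||s_hat(n) - s(n)||^2

def train_step(complex_mfcc, clean_mfcc):
    """One back-propagation update of W_l and b_l under the MMSE criterion."""
    optimizer.zero_grad()
    loss = criterion(model(complex_mfcc), clean_mfcc)  # s_hat(n) vs. s(n)
    loss.backward()    # gradients dE/dW_l and dE/db_l via back-propagation
    optimizer.step()   # W_l <- W_l - eta * dE/dW_l, likewise for b_l
    return loss.item()

The sigmoid hidden units mirror the RBM-era DNNs the description implies; any modern activation would also work in this sketch.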
Step 5: performing feature mapping on the feature parameters of the complex-environment speech with the trained DNN mapping model.
Step 5.1: the feature mapping model is:
ĉ_i = f(c_i; w_1, …, w_j) + μ_i
where c_i is the feature of the speech signal in the complex environment; w_1, …, w_j are the DNN model parameters and f(·; w_1, …, w_j) is a nonlinear function; μ_i is a perturbation term; and ĉ_i is the feature obtained after the feature mapping.
Step 5.2: substituting the parameters estimated in step 4.2 into the feature mapping parameter masking matrix yields the speech features after the complex-environment speech passes through the feature mapping model; the features at this point can be approximately regarded as speech features in a clean environment.
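Continuing the sketch above under the same assumptions, the trained model maps complex-environment MFCC frames to approximately clean features; map_features is a hypothetical helper name.

import torch

def map_features(model, complex_mfcc_frames):
    """Apply the trained mapping DNN to complex-environment MFCC frames;
    the output can be approximately regarded as clean-environment features."""
    model.eval()
    with torch.no_grad():
        frames = torch.as_tensor(complex_mfcc_frames, dtype=torch.float32)
        return model(frames)

# The mapped features can then be fed to a speaker recognition back-end:
# clean_like_feats = map_features(model, complex_feats)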
In conclusion, the invention adopts a DNN as the mapping model, which can effectively fit the nonlinear relation between the speech signal feature parameters in a complex environment and those in a clean environment; the proposed feature mapping method maps the features of the speech signal in the complex environment and effectively improves the purity of the speech features in the complex environment; and the method has generalization capability over most acoustic scenes.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (6)

1. A deep neural network-based speech feature mapping method in a complex environment, characterized by comprising the following steps:
Step 1: constructing a large number of speech signal data pairs in clean and complex environments;
Step 2: extracting the features of the speech signal in the clean environment;
Step 3: extracting the features of the speech signal in the complex environment;
Step 4: training a DNN model;
Step 5: mapping the speech features in the complex environment using the trained DNN model.
2. The deep neural network-based speech feature mapping method in a complex environment according to claim 1, characterized in that step 1 is specifically realized as follows:
the DNN-based feature mapping method is a supervised training model and requires a large number of parallel corpora, i.e., data pairs of speech in a complex environment and speech in a clean environment, so that the DNN model attains good generalization capability; in general, the main factors affecting the speaker recognition rate are environmental noise and channel difference, with additive noise in the environmental noise having the largest influence on the recognition rate; therefore, a large number of parallel corpora are constructed according to the complex environment model, given by the formula:
S = f(X, w) + αN
where α is an adjustment factor used to control the signal-to-noise ratio; X represents the speech signal collected in the clean environment; N represents the noise signal; and w is a channel transmission parameter; with this model, massive complex-environment speech data with multiple signal-to-noise ratios, multiple noise types and different channel transmissions can be constructed.
3. The deep neural network-based speech feature mapping method in a complex environment according to claim 1, characterized in that the specific implementation of step 2 comprises the following steps:
Step 2.1: preprocessing the speech signal in the clean environment, including sampling, quantization, pre-emphasis, endpoint detection, framing and windowing;
Step 2.2: extracting Mel-frequency cepstral coefficient (MFCC) features from the signal preprocessed in step 2.1.
4. The deep neural network-based speech feature mapping method in a complex environment according to claim 1, characterized in that the specific implementation of step 3 comprises the following steps:
Step 3.1: preprocessing the speech signal in the complex environment, including sampling, quantization, pre-emphasis, endpoint detection, framing and windowing;
Step 3.2: extracting MFCC features from the signal preprocessed in step 3.1.
5. The deep neural network-based speech feature mapping method in a complex environment according to claim 1, characterized in that the specific implementation of step 4 comprises the following steps:
Step 4.1: training a restricted Boltzmann machine (RBM)-based initialization model using the speech feature parameters in the complex environment;
Step 4.2: updating the parameters of the entire DNN using a back-propagation algorithm with a minimum mean square error criterion between the features of speech in the clean environment and speech in the complex environment.
6. The deep neural network-based speech feature mapping method in a complex environment according to any one of claims 1 to 5, characterized in that step 5 comprises the following steps:
Step 5.1: the feature mapping model is:
ĉ_i = f(c_i; w_1, …, w_j) + μ_i
where c_i is the feature of the speech signal in the complex environment; w_1, …, w_j are the DNN model parameters and f(·; w_1, …, w_j) is a nonlinear function; μ_i is a perturbation term; and ĉ_i is the feature obtained after the feature mapping;
Step 5.2: substituting the parameters estimated in step 4.2 into the feature mapping parameter masking matrix yields the speech features after the complex-environment speech passes through the feature mapping model; the features at this point can be approximately regarded as speech features in a clean environment.
CN202010635342.6A 2020-07-03 2020-07-03 Deep neural network-based voice feature mapping method in complex environment Pending CN111816187A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010635342.6A CN111816187A (en) 2020-07-03 2020-07-03 Deep neural network-based voice feature mapping method in complex environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010635342.6A CN111816187A (en) 2020-07-03 2020-07-03 Deep neural network-based voice feature mapping method in complex environment

Publications (1)

Publication Number Publication Date
CN111816187A true CN111816187A (en) 2020-10-23

Family

ID=72855794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010635342.6A Pending CN111816187A (en) 2020-07-03 2020-07-03 Deep neural network-based voice feature mapping method in complex environment

Country Status (1)

Country Link
CN (1) CN111816187A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112530399A (en) * 2020-11-30 2021-03-19 上海明略人工智能(集团)有限公司 Method and system for expanding voice data, electronic equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106782520A (en) * 2017-03-14 2017-05-31 华中师范大学 Phonetic feature mapping method under a kind of complex environment
CN108766430A (en) * 2018-06-06 2018-11-06 华中师范大学 A kind of phonetic feature mapping method and system based on Pasteur's distance
CN109658949A (en) * 2018-12-29 2019-04-19 重庆邮电大学 A kind of sound enhancement method based on deep neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
张洪冉: "Research on the robustness of speaker recognition in noisy environments", China Master's Theses Full-text Database, Information Science and Technology Series, pages 136-357 *
王子腾 et al.: "Deep mapping network spectral/feature enhancement method for speech recognition" *
高登峰 et al.: "Deep neural network-based speech enhancement method for ground-air communication", Proceedings of the First Annual Conference on Air Traffic Management System Technology


Similar Documents

Publication Publication Date Title
CN109326302B (en) Voice enhancement method based on voiceprint comparison and generation of confrontation network
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
WO2021042870A1 (en) Speech processing method and apparatus, electronic device, and computer-readable storage medium
Yang et al. Characterizing speech adversarial examples using self-attention u-net enhancement
CN111261147B (en) Music embedding attack defense method for voice recognition system
CN105611477B (en) The voice enhancement algorithm that depth and range neutral net are combined in digital deaf-aid
CN110111803A (en) Based on the transfer learning sound enhancement method from attention multicore Largest Mean difference
CN108962237A (en) Mixing voice recognition methods, device and computer readable storage medium
CN110739003B (en) Voice enhancement method based on multi-head self-attention mechanism
CN109599109A (en) For the confrontation audio generation method and system of whitepack scene
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN109887496A (en) Orientation confrontation audio generation method and system under a kind of black box scene
CN103065629A (en) Speech recognition system of humanoid robot
CN107068167A (en) Merge speaker's cold symptoms recognition methods of a variety of end-to-end neural network structures
CN111968666A (en) Hearing aid voice enhancement method based on depth domain self-adaptive network
CN111192598A (en) Voice enhancement method for jump connection deep neural network
CN107274887A (en) Speaker's Further Feature Extraction method based on fusion feature MGFCC
CN111986679A (en) Speaker confirmation method, system and storage medium for responding to complex acoustic environment
CN115602152B (en) Voice enhancement method based on multi-stage attention network
CN112017682A (en) Single-channel voice simultaneous noise reduction and reverberation removal system
CN113763965A (en) Speaker identification method with multiple attention characteristics fused
CN111489763B (en) GMM model-based speaker recognition self-adaption method in complex environment
Li et al. Densely connected network with time-frequency dilated convolution for speech enhancement
CN104778948A (en) Noise-resistant voice recognition method based on warped cepstrum feature
CN112183582A (en) Multi-feature fusion underwater target identification method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination