CN109767760A - Far-field speech recognition method based on multi-target learning of amplitude and phase information - Google Patents

Far-field speech recognition method based on multi-target learning of amplitude and phase information

Info

Publication number: CN109767760A
Application number: CN201910134661.6A
Authority: CN (China)
Prior art keywords: phase, information, amplitude, feature, multi-target
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 党建武, 崔凌赫, 王龙标, 李东播
Current Assignee: Tianjin University
Original Assignee: Tianjin University
Application filed by Tianjin University
Priority to CN201910134661.6A; publication of CN109767760A

Landscapes

  • Electrically Operated Instructional Devices (AREA)
Abstract

The invention discloses a far-field speech recognition method based on multi-target learning of amplitude and phase information, comprising the following steps: step 1, preparing input data; step 2, extracting an amplitude feature and multiple phase features; step 3, constructing a multi-task deep neural network, inputting the extracted amplitude and phase features into the network for training, and outputting the enhanced speech and enhanced features. The enhanced speech is used for SRMR evaluation, and the enhanced features are used for speech recognition. The invention uses multi-target learning to enhance speech and features simultaneously. Compared with existing methods, it takes into account that the modified group delay cepstral coefficient (MGDCC) feature performs poorly on reverberant speech, and adds a second phase feature, the vocal tract information (PBSFVT) from phase-based source-filter separation, to compensate for the weakness of MGDCC, thereby improving speech recognition accuracy.

Description

Far-field speech recognition method based on multi-target learning of amplitude and phase information
Technical field
The invention belongs to the field of far-field speech recognition, and specifically relates to a far-field speech recognition method based on multi-target learning of amplitude and phase information.
Background art
Voice interaction is the most direct and natural mode of communication in human society. Speech recognition, one of its key technologies, converts speech signals into text. Speech recognition is an interdisciplinary field touching a wide range of areas; its ultimate goal is to enable computers to interact with humans by voice.
After years of research, near-field speech recognition has achieved important breakthroughs and greatly improved performance, but far-field speech recognition still faces many problems. In far-field conditions the target speech is often corrupted by ambient noise and reverberation, which reduces recognition accuracy and causes a sharp drop in performance. The signals captured by the microphone therefore need speech enhancement processing to remove interfering factors such as noise and reverberation.
Summary of the invention
Because phase information is severely disturbed in reverberant speech and suffers from the inherent phase-wrapping problem, the present invention uses the group delay method to avoid phase wrapping, and meanwhile tries different phase representations: modified group delay cepstral coefficients (MGDCC) and vocal tract information (PBSFVT) from phase-based source-filter separation. The complementarity of the different phase representations is exploited by using them as important auxiliary features for speech enhancement.
To solve the above problems, the present invention uses different phase representations as important auxiliary features for speech enhancement, and proposes a far-field speech recognition method based on multi-target learning of amplitude and phase information. The technical scheme is as follows: a far-field speech recognition method based on multi-target learning of amplitude and phase information, comprising the following steps:
1) Input data preparation: prepare the data of the training, development, and evaluation sets respectively;
2) Feature extraction:
(1) Amplitude-based feature extraction: frame and window the signal, and for each short-time analysis window, transform the signal from the time domain to the frequency domain by fast Fourier transform to obtain the corresponding spectrum; then filter it with a Mel filterbank to simulate the human auditory perception system;
(2) Phase-based feature extraction: extract the phase information of each speech frame, comprising two phase features, the modified group delay cepstral coefficients MGDCC and the vocal tract information PBSFVT from phase-based source-filter separation;
3) Model training: the extracted features are input into a multi-target DNN, which can learn two different targets simultaneously, modeling the commonality and difference between the targets.
In the phase-based feature extraction of step 2)-(2), the MGDCC phase feature is extracted as follows: during speech signal processing, the phase of the speech signal must be unwrapped and its negative derivative computed; the negative derivative is known as the group delay function (GDF);
The group delay function is essentially the negative derivative of the continuous (unwrapped) phase spectrum;
The continuous phase spectrum, i.e., the unwrapped phase spectrum θ(ω), gives the group delay:
τ(ω) = -dθ(ω)/dω
The group delay function can likewise be computed as:
τ(ω) = (X_R(ω)Y_R(ω) + X_I(ω)Y_I(ω)) / |X(ω)|²
where the subscripts R and I denote the real and imaginary parts, and X(ω) and Y(ω) are the Fourier transforms of x(n) and n·x(n) respectively;
The adjusted group delay coefficient can be computed as:
τ_p(ω) = (X_R(ω)Y_R(ω) + X_I(ω)Y_I(ω)) / S(ω)^{2γ}
where S(ω) denotes the smoothed version of |X(ω)|;
To reduce the spiky behavior of the spectrum, two new variables α and γ are introduced:
τ_m(ω) = (τ_p(ω)/|τ_p(ω)|) · |τ_p(ω)|^α
where both α and γ take values between 0 and 1.
In the phase-based feature extraction of step 2)-(2), the vocal tract information PBSFVT from phase-based source-filter separation is extracted as follows:
The short-time Fourier transform X(ω) can be decomposed into two parts, an all-pass phase part and a minimum-phase part:
X(ω) = |X(ω)| e^{j arg{X(ω)}} = X_MinPh(ω) X_Allp(ω)
where X_MinPh(ω) and X_Allp(ω) denote the minimum-phase part and the all-pass phase part of X(ω) after the Fourier transform, and the minimum phase relates to the original speech signal by:
|X(ω)| = |X_MinPh(ω)|
On the other hand, the relationship between the minimum phase and the all-pass phase is:
arg{X(ω)} = arg{X_MinPh(ω)} + arg{X_Allp(ω)}
The speech signal is transformed from the amplitude domain into the phase domain by the Hilbert transform, yielding the minimum-phase feature:
arg{X_MinPh(ω)} = -H{log |X(ω)|}
where H{·} denotes the Hilbert transform;
After the Fourier transform, the convolution relation becomes a multiplication, giving:
X_MinPh(ω) = E_MinPh(ω) V_MinPh(ω)
where E_MinPh(ω) and V_MinPh(ω) denote the minimum-phase source (excitation) and vocal tract components;
The minimum-phase feature is combined with vocal tract information processing: a source-filter operation in the minimum-phase domain separates the information, decomposing the minimum-phase speech signal into source information and vocal tract information, thereby obtaining separate models of the two.
Step 3) specifically comprises: constructing a multi-task deep neural network, inputting the extracted amplitude and phase features into the network for training, and outputting the enhanced speech and enhanced features.
The method further includes SRMR evaluation and speech recognition: specifically, the enhanced features output by the DNN are used for speech recognition to obtain the word error rate (WER), and the enhanced speech output is evaluated with SRMR.
Beneficial effects
The invention uses multi-target learning to enhance the speech signal and its features simultaneously. Compared with existing methods, it takes into account that the modified group delay cepstral coefficient (MGDCC) feature performs poorly on reverberant speech, and adds a second phase feature, the vocal tract information (PBSFVT) from phase-based source-filter separation, to compensate for the weakness of MGDCC, thereby improving speech recognition accuracy.
Brief description of the drawings
Fig. 1 is the basic block diagram of the multi-target learning framework proposed by the present invention.
Fig. 2 shows the minimum-phase-domain vocal tract information extraction process based on source-filter separation.
Fig. 3 is the flow chart of the method of the present invention.
Specific embodiments
The present invention is further illustrated below through specific embodiments and the accompanying drawings. The embodiments are intended to help those skilled in the art better understand the invention and do not limit it in any way.
As shown in Fig. 3, a far-field speech recognition method based on multi-target learning of amplitude and phase information comprises the following steps:
Step 1, input data preparation: the dataset is the data provided by the REVERB Challenge 2014; data preparation is performed on the training, development, and evaluation sets respectively.
Step 2, feature extraction:
1) Amplitude-based feature extraction: frame and window the signal, and for each short-time analysis window, transform the signal from the time domain to the frequency domain by fast Fourier transform to obtain the corresponding spectrum; then filter it with a Mel filterbank to simulate the human auditory perception system.
2) Phase-based feature extraction: extract the phase information of each speech frame, comprising two phase features, the modified group delay cepstral coefficients (MGDCC) and the vocal tract information (PBSFVT) from phase-based source-filter separation.
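The amplitude-feature pipeline just described (framing, windowing, FFT, Mel filterbank) can be sketched as follows; the 25 ms frame length, 10 ms shift, and 23 filters follow the settings given later in this embodiment, while the function names and the Hamming window choice are illustrative assumptions, not specified by the patent.

```python
import numpy as np

def mel_filterbank(n_filters=23, n_fft=512, sr=16000):
    """Triangular Mel filterbank matrix of shape (n_filters, n_fft//2 + 1)."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                       # rising slope
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                       # falling slope
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfb_features(signal, sr=16000, frame_ms=25, shift_ms=10, n_fft=512, n_filters=23):
    """Frame, window, FFT, and Mel-filter a signal -> log Mel filterbank features."""
    frame_len = int(sr * frame_ms / 1000)           # 400 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)               # 160 samples
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    window = np.hamming(frame_len)
    fb = mel_filterbank(n_filters, n_fft, sr)
    feats = np.empty((n_frames, n_filters))
    for t in range(n_frames):
        frame = signal[t * shift : t * shift + frame_len] * window
        spec = np.abs(np.fft.rfft(frame, n_fft))    # magnitude spectrum
        feats[t] = np.log(fb @ spec + 1e-10)        # log Mel energies
    return feats

# Example: 1 s of noise -> (98, 23) feature matrix
x = np.random.default_rng(0).standard_normal(16000)
print(mfb_features(x).shape)
```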
The phase-based feature extraction of step 2 comprises the two phase features, modified group delay cepstral coefficients (MGDCC) and vocal tract information (PBSFVT) from phase-based source-filter separation; the specific extraction processes are as follows:
1) MGDCC extraction:
During speech signal processing, the phase of the speech signal must be unwrapped and its negative derivative computed; this negative derivative is known as the group delay function (GDF) and can be used effectively to extract various speech signal parameters. The group delay function is currently the main representation of the phase spectrum and is essentially the negative derivative of the continuous (unwrapped) phase spectrum. The continuous phase spectrum, i.e., the unwrapped phase spectrum θ(ω), therefore gives:
τ(ω) = -dθ(ω)/dω
where θ(ω) is the unwrapped phase function. The group delay function can likewise be computed directly from the signal as:
τ(ω) = (X_R(ω)Y_R(ω) + X_I(ω)Y_I(ω)) / |X(ω)|²
where the subscripts R and I denote the real and imaginary parts, and X(ω) and Y(ω) are the Fourier transforms of x(n) and n·x(n) respectively. It can also be seen from this formula that the denominator vanishes when a zero of the signal lies close to the unit circle, so the function must be adjusted for the case where the denominator becomes zero. Replacing the denominator with a cepstrally smoothed spectrum solves the vanishing-denominator problem and overcomes the spiky character of the group delay spectrum. The adjusted group delay coefficient can be computed as:
τ_p(ω) = (X_R(ω)Y_R(ω) + X_I(ω)Y_I(ω)) / S(ω)^{2γ}
where S(ω) denotes the smoothed version of |X(ω)|. The original group delay function still retains the sharp formant peaks of the spectrum, which degrades speech recognition performance. To reduce this spiky behavior, two new variables α and γ, both taking values between 0 and 1, are introduced:
τ_m(ω) = (τ_p(ω)/|τ_p(ω)|) · |τ_p(ω)|^α
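A minimal numerical sketch of the modified group delay computation under the formulas above might look like the following; the cepstral smoothing used for S(ω) is simplified here to a low-quefrency cepstral lifter, and the parameter values (α, γ, lifter order) are illustrative, not the patent's.

```python
import numpy as np

def modified_group_delay(frame, n_fft=512, alpha=0.4, gamma=0.9, n_lifter=8):
    """Modified group delay spectrum of one windowed frame.

    tau_p = (X_R*Y_R + X_I*Y_I) / S^(2*gamma), where S is a cepstrally
    smoothed magnitude spectrum; tau_m = sign(tau_p) * |tau_p|^alpha.
    """
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)                 # FT of x(n)
    Y = np.fft.rfft(n * frame, n_fft)             # FT of n*x(n)
    mag = np.abs(X) + 1e-10
    # Cepstrally smoothed spectrum S(w): keep only the low-quefrency cepstrum
    cep = np.fft.irfft(np.log(mag))
    cep[n_lifter:-n_lifter] = 0.0
    S = np.exp(np.fft.rfft(cep).real) + 1e-10
    tau_p = (X.real * Y.real + X.imag * Y.imag) / S ** (2 * gamma)
    return np.sign(tau_p) * np.abs(tau_p) ** alpha

frame = np.hamming(400) * np.random.default_rng(1).standard_normal(400)
tau = modified_group_delay(frame)
print(tau.shape)   # one group delay value per FFT bin
```

In practice the MGDCC features would then be obtained by applying a DCT to τ_m(ω) and keeping the leading coefficients, as with cepstra.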
2) PBSFVT extraction:
A speech signal is a mixed-phase signal that contains both minimum-phase information and all-pass phase information. Its short-time Fourier transform X(ω) can therefore be decomposed into two parts, an all-pass phase part and a minimum-phase part:
X(ω) = |X(ω)| e^{j arg{X(ω)}} = X_MinPh(ω) X_Allp(ω)
where X_MinPh(ω) and X_Allp(ω) denote the minimum-phase part and the all-pass phase part of X(ω) after the Fourier transform, and the minimum phase relates to the original speech signal by:
|X(ω)| = |X_MinPh(ω)|
On the other hand, the relationship between the minimum phase and the all-pass phase is:
arg{X(ω)} = arg{X_MinPh(ω)} + arg{X_Allp(ω)}
For the minimum-phase part, the Hilbert transform provides the mapping between phase and amplitude, so the speech signal can be transformed from the amplitude domain into the phase domain by the Hilbert transform, which yields the minimum-phase feature:
arg{X_MinPh(ω)} = -H{log |X(ω)|}
where H{·} denotes the Hilbert transform. After the Fourier transform, the convolution of excitation and vocal tract becomes a multiplication, giving:
X_MinPh(ω) = E_MinPh(ω) V_MinPh(ω)
where E_MinPh(ω) and V_MinPh(ω) denote the minimum-phase source (excitation) and vocal tract components. The minimum-phase feature can then be combined with vocal tract information processing: a source-filter operation in the minimum-phase domain separates the information, so that the minimum-phase speech signal is decomposed into source information and vocal tract information, yielding separate models of the two.
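The Hilbert-transform relation above (minimum phase obtained from the log magnitude) is commonly implemented with the homomorphic (cepstral) method; the following sketch, an illustration rather than the patent's code, recovers the minimum-phase spectrum from the magnitude and checks that |X_MinPh(ω)| = |X(ω)| as stated above.

```python
import numpy as np

def minimum_phase_spectrum(magnitude, n_fft):
    """Minimum-phase spectrum X_MinPh(w) with the given magnitude.

    Homomorphic construction: fold the real cepstrum of log|X| to make it
    causal, which is equivalent to arg{X_MinPh} = -Hilbert{log|X|}.
    """
    log_mag = np.log(magnitude + 1e-12)
    cep = np.fft.irfft(log_mag, n_fft)        # real (symmetric) cepstrum
    fold = np.zeros(n_fft)
    fold[0] = cep[0]                          # keep quefrency 0
    fold[1 : n_fft // 2] = 2.0 * cep[1 : n_fft // 2]   # double positive side
    fold[n_fft // 2] = cep[n_fft // 2]        # keep Nyquist quefrency
    return np.exp(np.fft.rfft(fold))          # X_MinPh(w)

n_fft = 512
x = np.random.default_rng(2).standard_normal(400)
mag = np.abs(np.fft.rfft(x, n_fft))
X_mp = minimum_phase_spectrum(mag, n_fft)
# Magnitude is preserved; only the phase differs from the original signal's.
print(np.allclose(np.abs(X_mp), mag, rtol=1e-6, atol=1e-8))
```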
Step 3, model training: the extracted features are input into a multi-target DNN, which can learn two different targets simultaneously, modeling the commonality and difference between the targets.
Step 4, output: the enhanced features output by the DNN are used for speech recognition to obtain the WER (word error rate), and the enhanced speech output is evaluated with SRMR.
Fig. 1 is the basic block diagram of the proposed multi-target learning framework: the speech recognition task based on MFCC basic features is the main task, and the speech enhancement task based on spectrogram features is the auxiliary task. This front-end feature-processing model combines the speech recognition task and the speech enhancement task, using the nonlinear mapping capability of the neural network to perform dereverberation. In the front-end dereverberation regression model, the loss function is optimized with minimum mean squared error (MSE) as the objective. The multi-target neural network learns to estimate the targets of the two different tasks simultaneously, learning the commonality and difference between the two tasks. Compared with independently trained models, this approach can improve the learning efficiency and prediction accuracy of the main-task model. The DNN has 3 hidden layers of 3072 nodes each, the loss function is MSE, and the optimizer is stochastic gradient descent. The speech recognition task uses 23-dimensional MFB (Mel filterbank) features as basic features, the auxiliary speech enhancement task uses 256-dimensional spectrogram features, and the phase features MGDCC and PBSFVT are 13- and 23-dimensional, respectively.
In this multi-target learning framework, the main task is feature enhancement for the back-end speech recognition system, and the auxiliary task is speech enhancement, which improves the generalization of the main task. The correlation between the two tasks can be learned through the shared-layer representation.
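A minimal forward-pass sketch of this shared-layer multi-target network follows: 3 shared sigmoid hidden layers of 3072 units (per Table 1 below) with one output head per task. Weights here are random, the input dimensionality assumes the 23-dimensional MFB features spliced with the context of 15 frames, and the head sizes follow the text (23-dim recognition features, 256-dim spectrogram); this is an architectural illustration, not the patent's training code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class MultiTargetDNN:
    """Shared-hidden-layer DNN with two output heads (multi-target learning).

    One head regresses enhanced recognition features (main task), the
    other enhanced spectrogram frames (auxiliary task); in the patent
    both heads are trained jointly against an MSE loss with SGD.
    """

    def __init__(self, in_dim, hidden=3072, n_hidden=3,
                 main_dim=23, aux_dim=256, seed=0):
        rng = np.random.default_rng(seed)
        dims = [in_dim] + [hidden] * n_hidden
        self.shared = [((rng.standard_normal((a, b)) * 0.01).astype(np.float32),
                        np.zeros(b, dtype=np.float32))
                       for a, b in zip(dims[:-1], dims[1:])]
        self.head_main = (rng.standard_normal((hidden, main_dim)) * 0.01).astype(np.float32)
        self.head_aux = (rng.standard_normal((hidden, aux_dim)) * 0.01).astype(np.float32)

    def forward(self, x):
        h = x
        for W, b in self.shared:       # shared representation for both tasks
            h = sigmoid(h @ W + b)
        return h @ self.head_main, h @ self.head_aux

# 23-dim MFB frames spliced with a context of 15 frames -> 345-dim input
net = MultiTargetDNN(in_dim=23 * 15)
feat, spec = net.forward(np.zeros((4, 345), dtype=np.float32))
print(feat.shape, spec.shape)   # one output per task for a batch of 4 frames
```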
Fig. 2 shows the minimum-phase-domain vocal tract information extraction process based on source-filter separation. Since most of the information in the speech signal is concentrated in the low and middle frequencies, a Mel filterbank is applied, giving higher resolution at low frequencies while compressing the resolution at high frequencies.
The final step is decorrelation, mainly so that the data better match the diagonal covariance matrices used in the GMM-HMM system and more accurate alignments can be obtained from the acoustic model. Finally, the feature vectors receive post-processing such as cepstral mean normalization (CMN), and dynamic features are computed. In this task, the basic feature parameter settings are: frame length 25 ms, frame shift 10 ms, and 23 filters; a DCT is then used for decorrelation and dimensionality reduction, yielding 13-dimensional features.
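The DCT decorrelation and mean-normalization step can be sketched as follows; the orthonormal type-II DCT (written out in NumPy here to stay self-contained) is one common convention, an implementation choice rather than something the patent specifies.

```python
import numpy as np

def dct2(x, n_out):
    """Orthonormal DCT-II along the last axis, truncated to n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    basis *= np.sqrt(2.0 / n)
    basis[0] /= np.sqrt(2.0)          # orthonormal scaling of the DC row
    return x @ basis.T

def decorrelate(log_mel, n_ceps=13):
    """23-dim log-Mel frames -> 13-dim cepstra (DCT) with cepstral mean normalization."""
    ceps = dct2(log_mel, n_ceps)      # decorrelate and reduce dimension
    return ceps - ceps.mean(axis=0)   # CMN: zero mean per coefficient

log_mel = np.random.default_rng(3).standard_normal((98, 23))
feats = decorrelate(log_mel)
print(feats.shape)   # (frames, 13)
```

Delta and delta-delta dynamic features would then be appended to these 13-dimensional cepstra.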
Table 1. Structure and parameter settings of the neural network

Parameter                  Value
Shared hidden layers       3
Hidden nodes               3072 per layer
Hidden node type           Sigmoid
Loss function              MSE (mean squared error)
Optimization algorithm     Stochastic gradient descent
Context size               15
Number of iterations       30
Table 2 compares the results (WER%) of different inputs within the multi-task learning framework.
Table 1 lists the structure and specific parameter settings of the neural network; Table 2 compares experimental results on the REVERB Challenge 2014 dataset, with WER (word error rate) as the evaluation metric, demonstrating the importance of phase information in multi-target learning. In this experiment, phase information serves as important auxiliary information supplementing the amplitude features. In the present invention, the information processed in the frequency domain is transformed into the minimum-phase domain through the Hilbert feature-space transform to estimate the phase information, which yields relatively accurate phase-estimation features. Comparing the MFB+spectrogram method against the MFB+spectrogram+MGDCC+PBSFVT method, the performance of automatic speech recognition improves: the WER drops from 26.57% to 23.68%, a relative error-rate reduction of 10.88%.
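The relative error-rate reduction quoted here follows directly from the two WER figures:

```python
baseline, proposed = 26.57, 23.68            # WER% for the two systems above
relative = (baseline - proposed) / baseline  # relative error-rate reduction
print(f"{relative * 100:.2f}%")              # 10.88%
```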
Although the invention has been described above with reference to the drawings, the invention is not limited to the specific embodiments described, which are only illustrative rather than restrictive. Under the teaching of the present invention, those skilled in the art can make many variations without departing from the spirit of the invention, and these all fall within the protection of the invention.

Claims (5)

1. A far-field speech recognition method based on multi-target learning of amplitude and phase information, characterized by comprising the following steps:
1) input data preparation: preparing the data of the training, development, and evaluation sets respectively;
2) feature extraction:
(1) amplitude-based feature extraction: framing and windowing the signal, and for each short-time analysis window, transforming the signal from the time domain to the frequency domain by fast Fourier transform to obtain the corresponding spectrum, then filtering the spectrum with a Mel filterbank to simulate the human auditory perception system;
(2) phase-based feature extraction: extracting the phase information of each speech frame, comprising two phase features, the modified group delay cepstral coefficients MGDCC and the vocal tract information PBSFVT from phase-based source-filter separation;
3) model training: inputting the extracted features into a multi-target DNN, which can learn two different targets simultaneously, modeling the commonality and difference between the targets.
2. The far-field speech recognition method based on multi-target learning of amplitude and phase information according to claim 1, characterized in that the phase-based feature extraction of step 2)-(2) includes the MGDCC phase feature, whose extraction process is as follows: during speech signal processing, the phase of the speech signal is unwrapped and its negative derivative computed; the negative derivative is known as the group delay function (GDF);
The group delay function is essentially the negative derivative of the continuous (unwrapped) phase spectrum;
The continuous phase spectrum, i.e., the unwrapped phase spectrum θ(ω), gives the group delay:
τ(ω) = -dθ(ω)/dω
The group delay function can likewise be computed as:
τ(ω) = (X_R(ω)Y_R(ω) + X_I(ω)Y_I(ω)) / |X(ω)|²
wherein the subscripts R and I denote the real and imaginary parts, and X(ω) and Y(ω) are the Fourier transforms of x(n) and n·x(n) respectively;
The adjusted group delay coefficient can be computed as:
τ_p(ω) = (X_R(ω)Y_R(ω) + X_I(ω)Y_I(ω)) / S(ω)^{2γ}
wherein S(ω) denotes the smoothed version of |X(ω)|;
To reduce the spiky behavior of the spectrum, two new variables α and γ are introduced:
τ_m(ω) = (τ_p(ω)/|τ_p(ω)|) · |τ_p(ω)|^α
wherein both α and γ take values between 0 and 1.
3. The far-field speech recognition method based on multi-target learning of amplitude and phase information according to claim 1, characterized in that the phase-based feature extraction of step 2)-(2) includes the vocal tract information PBSFVT from phase-based source-filter separation, whose extraction process is as follows:
The short-time Fourier transform X(ω) can be decomposed into two parts, an all-pass phase part and a minimum-phase part:
X(ω) = |X(ω)| e^{j arg{X(ω)}} = X_MinPh(ω) X_Allp(ω)
wherein X_MinPh(ω) and X_Allp(ω) denote the minimum-phase part and the all-pass phase part of X(ω) after the Fourier transform, and the minimum phase relates to the original speech signal by:
|X(ω)| = |X_MinPh(ω)|
On the other hand, the relationship between the minimum phase and the all-pass phase is:
arg{X(ω)} = arg{X_MinPh(ω)} + arg{X_Allp(ω)}
The speech signal is transformed from the amplitude domain into the phase domain by the Hilbert transform, yielding the minimum-phase feature:
arg{X_MinPh(ω)} = -H{log |X(ω)|}
wherein H{·} denotes the Hilbert transform;
After the Fourier transform, the convolution relation becomes a multiplication, giving:
X_MinPh(ω) = E_MinPh(ω) V_MinPh(ω)
wherein E_MinPh(ω) and V_MinPh(ω) denote the minimum-phase source (excitation) and vocal tract components;
The minimum-phase feature is combined with vocal tract information processing: a source-filter operation in the minimum-phase domain separates the information, decomposing the minimum-phase speech signal into source information and vocal tract information, thereby obtaining separate models of the two.
4. The far-field speech recognition method based on multi-target learning of amplitude and phase information according to claim 1, characterized in that step 3) specifically comprises: constructing a multi-task deep neural network, inputting the extracted amplitude and phase features into the network for training, and outputting the enhanced speech and enhanced features.
5. The far-field speech recognition method based on multi-target learning of amplitude and phase information according to claim 1, characterized by further comprising SRMR evaluation and speech recognition: specifically, the enhanced features output by the DNN are used for speech recognition to obtain the word error rate (WER), and the enhanced speech output is evaluated with SRMR.
CN201910134661.6A 2019-02-23 2019-02-23 Far-field speech recognition method based on multi-target learning of amplitude and phase information Pending CN109767760A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910134661.6A CN109767760A (en) 2019-02-23 2019-02-23 Far-field speech recognition method based on multi-target learning of amplitude and phase information


Publications (1)

Publication Number Publication Date
CN109767760A (en) 2019-05-17

Family

ID=66457198

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910134661.6A Pending CN109767760A (en) 2019-02-23 2019-02-23 Far-field speech recognition method based on multi-target learning of amplitude and phase information

Country Status (1)

Country Link
CN (1) CN109767760A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07225596A (en) * 1994-02-15 1995-08-22 Sony Corp Device for analyzing/synthesizing acoustic signal
KR20090016343A (en) * 2007-08-10 2009-02-13 한국전자통신연구원 Method and apparatus for encoding/decoding signal having strong non-stationary properties using hilbert-huang transform
CN102855882A (en) * 2011-06-29 2013-01-02 自然低音技术有限公司 Perception enhancement for low-frequency sound components
CN103250208A (en) * 2010-11-24 2013-08-14 日本电气株式会社 Signal processing device, signal processing method and signal processing program
CN104823237A (en) * 2012-11-26 2015-08-05 哈曼国际工业有限公司 System, computer-readable storage medium and method for repair of compressed audio signals
CN108962277A (en) * 2018-07-20 2018-12-07 广州酷狗计算机科技有限公司 Speech signal separation method, apparatus, computer equipment and storage medium


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DONGBO LI et al.: "Multiple Phase Information Combination for Replay Attacks Detection", Interspeech 2018 *
LONGBIAO WANG et al.: "Phase Aware Deep Neural Network for Noise Robust Voice Activity", Proceedings of the IEEE International Conference on Multimedia and Expo (ICME) 2017 *
ZEYAN OO et al.: "Phase and reverberation aware DNN for distant-talking", Multimedia Tools and Applications *
徐勇: "Research on Speech Enhancement Methods Based on Deep Neural Networks", China Doctoral Dissertations Full-text Database *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190517