CN109767781A - Speech separating method, system and storage medium based on super-Gaussian priori speech model and deep learning - Google Patents
Info
- Publication number
- CN109767781A CN109767781A CN201910167788.8A CN201910167788A CN109767781A CN 109767781 A CN109767781 A CN 109767781A CN 201910167788 A CN201910167788 A CN 201910167788A CN 109767781 A CN109767781 A CN 109767781A
- Authority
- CN
- China
- Prior art keywords
- speech
- noise
- signal
- separating method
- spectral density
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 29
- 238000013135 deep learning Methods 0.000 title claims abstract description 12
- 230000006870 function Effects 0.000 claims abstract description 34
- 238000001228 spectrum Methods 0.000 claims abstract description 29
- 230000003595 spectral effect Effects 0.000 claims abstract description 22
- 238000013179 statistical model Methods 0.000 claims abstract description 13
- 238000012549 training Methods 0.000 claims abstract description 8
- 238000013528 artificial neural network Methods 0.000 claims description 18
- 238000004590 computer program Methods 0.000 claims description 9
- 238000012545 processing Methods 0.000 claims description 5
- 238000004364 calculation method Methods 0.000 claims description 3
- 238000000926 separation method Methods 0.000 claims description 3
- 230000004913 activation Effects 0.000 claims description 2
- 230000009466 transformation Effects 0.000 claims 1
- 230000002708 enhancing effect Effects 0.000 abstract description 4
- 230000009286 beneficial effect Effects 0.000 abstract description 3
- 238000004422 calculation algorithm Methods 0.000 description 2
- 239000000654 additive Substances 0.000 description 1
- 230000000996 additive effect Effects 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0264—Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/0308—Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Complex Calculations (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
The present invention provides a speech separation method, system and storage medium based on a super-Gaussian prior speech model and deep learning. The speech separation method uses the clean-speech power spectral density estimate and the noise power spectral density estimate to compute the a priori signal-to-noise ratio (SNR) in the gain function; substituting the a priori SNR into the gain function yields the gain value, the gain value is multiplied with the noisy speech spectrum to obtain an estimate of the clean-speech amplitude spectrum, and the clean speech signal is recovered using the overlap-add technique. The beneficial effects of the present invention are: by combining a traditional statistical model with deep learning, the invention not only effectively suppresses non-stationary noise signals, but also alleviates deep learning's heavy dependence on training data and its weak generalization ability. This combination makes the enhancement performance of the method highly robust across a wide range of noise environments and SNR conditions.
Description
Technical field
The present invention relates to the field of speech processing, and more particularly to a speech separation method, system and storage medium based on a super-Gaussian prior speech model and deep learning.
Background technique
Speech signals are usually polluted by interference noise from the surroundings, which poses great challenges for applications such as automatic speech recognition, human-machine dialogue and hearing aids. The performance of existing traditional speech enhancement techniques degrades heavily under non-stationary noise and at low signal-to-noise ratios. Although the recently emerged speech enhancement techniques based on deep learning can suppress non-stationary noise well, the performance of such algorithms is highly dependent on the training data, and they perform poorly on data they have not been trained on.
Summary of the invention
The present invention provides a speech separation method based on a super-Gaussian prior speech model and deep learning, comprising the following steps:
Step 1: receive a noisy speech signal;
Step 2: model the Fourier transform coefficients of the clean speech signal and of the noise signal with a super-Gaussian statistical model and a Gaussian statistical model, respectively; based on these statistical assumptions, estimate the amplitude spectrum of the clean speech signal under the minimum mean-square error criterion, obtaining the amplitude-spectrum estimate;
Step 3: estimate the clean-speech power spectral density using a deep neural network;
Step 4: track the noise power spectral density with a statistical model based on the minimum mean-square error criterion; the noise power spectral density is obtained by recursively averaging the minimum mean-square error estimate of the current noise spectrum;
Step 5: use the clean-speech power spectral density estimate from step 3 and the noise power spectral density estimate from step 4 to compute the a priori signal-to-noise ratio in the gain function; substitute the a priori SNR into the gain function to obtain the gain value; multiply the gain value with the noisy speech spectrum to obtain the estimate of the clean-speech amplitude spectrum; and recover the clean speech signal using the overlap-add technique.
As a further improvement of the present invention, in step 2, the parameter values selected for the super-Gaussian speech signal model are μ = 0.2 and β = 0.001.
As a further improvement of the present invention, in step 3, the deep neural network architecture has two hidden layers, the activation function is the rectified linear unit (ReLU), and the output layer uses the softmax function.
As a further improvement of the present invention, the first and second hidden layers each have 512 nodes, and the training dataset used is the TIMIT speech database.
As a further improvement of the present invention, in step 3, in order to train the deep neural network, the speech data is first preprocessed: clean speech is mixed with noise signals of multiple types at signal-to-noise ratios of 0, 5, 10 and 15 dB to obtain noisy speech signals; the input features of the deep neural network are 13-dimensional Mel-frequency cepstral coefficients together with their first- and second-order difference coefficients.
As a further improvement of the present invention, in step 3, each noisy speech signal is divided into frames and a 39-dimensional feature vector is extracted per frame, comprising the 13-dimensional Mel-frequency cepstral coefficients and their first- and second-order difference coefficients; furthermore, to exploit inter-frame context, the current frame is used together with the three frames on each side, 7 frames in total, so the input feature dimension of the input layer is 273.
As a further improvement of the present invention, in step 3, the cost function used by the deep neural network is the cross-entropy loss; the output layer uses softmax to output the probability that the current frame belongs to each phoneme, and the clean-speech power spectral density is estimated as the average of the per-phoneme power spectra weighted by these phoneme probabilities.
The present invention also provides a speech separation system based on a super-Gaussian prior speech model and deep learning, characterized by comprising: a memory, a processor and a computer program stored on the memory, the computer program being configured to implement the steps of the method of the present invention when invoked by the processor.
The present invention also provides a computer-readable storage medium storing a computer program, the computer program being configured to implement the steps of the method of the present invention when invoked by a processor.
The beneficial effects of the present invention are: by combining a traditional statistical model with deep learning, the invention not only effectively suppresses non-stationary noise signals, but also alleviates deep learning's heavy dependence on training data and its weak generalization ability. This combination makes the enhancement performance of the method highly robust across a wide range of noise environments and SNR conditions.
Detailed description of the invention
Fig. 1 is a diagram of the deep neural network architecture of the present invention.
Specific embodiment
The invention discloses a speech separation method based on a super-Gaussian prior speech model and deep learning that not only suppresses non-stationary noise very well, but also generalizes well to data it has not been trained on.
The present invention achieves robust speech enhancement by combining a traditional statistical model with deep learning. The method comprises four main parts: a speech gain function based on a super-Gaussian speech model assumption, estimation of the clean-speech power spectrum with a neural network, estimation of the noise power spectrum, and computation of the a priori SNR and of the gain function.
First, the signal model. We consider the additive signal model y(n) = x(n) + d(n), where y(n) is the noisy speech signal, and x(n) and d(n) are the clean speech signal and the noise signal, respectively. Applying the short-time Fourier transform yields the time-frequency relationship Y(l, k) = X(l, k) + D(l, k), where l and k index the frame and the frequency bin, respectively. The Fourier transform coefficients of the speech and of the noise signal are assumed to follow a super-Gaussian and a Gaussian distribution, respectively.
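The additive model and its time-frequency counterpart can be illustrated with SciPy (a toy sketch only, not part of the patent disclosure; the sinusoid stands in for speech):

```python
import numpy as np
from scipy.signal import stft

fs = 16000                                          # assumed sampling rate
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)    # stand-in "clean speech"
d = 0.1 * rng.standard_normal(fs)                   # stand-in noise
y = x + d                                           # y(n) = x(n) + d(n)

# Short-time Fourier transform: l indexes frames, k indexes frequency bins.
f, t, Y = stft(y, fs=fs, nperseg=512)
_, _, X = stft(x, fs=fs, nperseg=512)
_, _, D = stft(d, fs=fs, nperseg=512)

# The STFT is linear, so the additive model carries over: Y(l,k) = X(l,k) + D(l,k).
assert np.allclose(Y, X + D)
```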
The speech separation method of the present invention based on a super-Gaussian prior speech model and deep learning comprises the following steps:
Step 1: receive a noisy speech signal;
Step 2: model the Fourier transform coefficients of the clean speech signal and of the noise signal with a super-Gaussian statistical model and a Gaussian statistical model, respectively; based on these statistical assumptions, estimate the amplitude spectrum of the clean speech signal under the minimum mean-square error criterion, obtaining the amplitude-spectrum estimate given by equation (1).
Here, ξ = λx/λd denotes the a priori signal-to-noise ratio, where λx = E[|X(l, k)|²] and λd = E[|D(l, k)|²] are the clean-speech power spectral density and the noise power spectral density, respectively. Furthermore ζ = γξ/(μ + ξ), where γ = |Y(l, k)|²/λd(l, k) is the a posteriori signal-to-noise ratio, and M(·, ·; ·) denotes the confluent hypergeometric function. For the super-Gaussian speech signal model we select the parameter values μ = 0.2 and β = 0.001.
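As an illustration only (not part of the patent disclosure), the quantities entering the gain function — the a priori SNR ξ, the a posteriori SNR γ and ζ — can be computed as follows; the function name `snr_quantities` and the toy inputs are hypothetical:

```python
import numpy as np

def snr_quantities(Y, lambda_x, lambda_d, mu=0.2):
    """Quantities entering the step-2 gain function.

    Y        : noisy STFT coefficients Y(l, k)
    lambda_x : clean-speech PSD estimate (from the DNN of step 3)
    lambda_d : noise PSD estimate (from the tracker of step 4)
    mu       : shape parameter of the super-Gaussian speech prior
    """
    xi = lambda_x / lambda_d              # a priori SNR:     xi = λx / λd
    gamma = np.abs(Y) ** 2 / lambda_d     # a posteriori SNR: γ = |Y|² / λd
    zeta = gamma * xi / (mu + xi)         # ζ = γξ / (μ + ξ)
    return xi, gamma, zeta

# Toy example: |Y|² = 4, λx = λd = 1 gives ξ = 1, γ = 4, ζ = 4/1.2.
xi, gamma, zeta = snr_quantities(np.array([2.0]), np.array([1.0]), np.array([1.0]))
```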
Step 3: estimate the clean-speech power spectral density using a deep neural network.
As can be seen from equation (1), the gain function depends on the computation of the a priori SNR, which in turn requires the clean-speech power spectral density and the noise power spectral density. Step 3 therefore estimates the clean-speech power spectral density, which the present invention does with a deep neural network. The deep neural network architecture used by the present invention is shown in Fig. 1.
The network has two hidden layers, the activation function is the rectified linear unit (ReLU), and the output layer uses the softmax function. The first and second hidden layers each have 512 nodes. The training dataset used is the TIMIT speech database.
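For illustration, a minimal NumPy forward pass matching the stated dimensions (two 512-unit ReLU hidden layers, a softmax output over Q = 61 phoneme classes, 273-dimensional input) might look as follows; the random weights are placeholders, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0.0)

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical randomly initialised weights for a 273 -> 512 -> 512 -> 61 network.
W1, b1 = rng.standard_normal((273, 512)) * 0.01, np.zeros(512)
W2, b2 = rng.standard_normal((512, 512)) * 0.01, np.zeros(512)
W3, b3 = rng.standard_normal((512, 61)) * 0.01, np.zeros(61)

def forward(z):
    h1 = relu(z @ W1 + b1)           # first hidden layer, 512 ReLU units
    h2 = relu(h1 @ W2 + b2)          # second hidden layer, 512 ReLU units
    return softmax(h2 @ W3 + b3)     # P(q | Z_l) over the Q = 61 phoneme classes

probs = forward(rng.standard_normal((4, 273)))   # batch of 4 spliced frames
```

Training with the cross-entropy loss and the Adam optimizer, as the description states, would be layered on top of such a forward pass.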
To train the neural network, the speech data must first be preprocessed: we mix clean speech with noise signals of many types at signal-to-noise ratios of 0, 5, 10 and 15 dB to obtain noisy speech signals. The input features of the neural network are 13-dimensional Mel-frequency cepstral coefficients (MFCC) together with their first- and second-order difference coefficients. Each noisy speech signal is therefore divided into frames and a 39-dimensional feature vector is extracted per frame, comprising the 13-dimensional MFCC and their first- and second-order differences. Furthermore, to exploit inter-frame context, we use the current frame together with the three frames on each side, 7 frames in total, so the input feature dimension of the input layer is 273 (39 × 7); it is expressed as
Zl = [z1,l, z2,l, …, zV,l]   (3)
where V = 273.
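The 7-frame context splicing described above (39-dimensional features, current frame plus three frames on each side, giving 273 dimensions) can be sketched as follows; `splice_context` is a hypothetical helper, and edge-padding at utterance boundaries is an assumption, since the patent does not specify boundary handling:

```python
import numpy as np

def splice_context(feats, context=3):
    """Stack each frame with `context` frames on either side (edge-padded).

    feats: (n_frames, 39) per-frame features (13 MFCC + deltas + delta-deltas).
    Returns an (n_frames, 39 * (2*context + 1)) array of spliced features.
    """
    n = feats.shape[0]
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + n] for i in range(2 * context + 1)])

frames = np.arange(10 * 39, dtype=float).reshape(10, 39)   # toy 39-dim features
spliced = splice_context(frames)     # 7-frame window -> 273-dimensional input
```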
The cost function of the neural network is the cross-entropy loss. The training target is to identify which phoneme the current frame belongs to, so the output layer uses softmax to output the probability that the current frame belongs to each phoneme, expressed as P(q | Zl) and encoded with one-hot target vectors. All phonemes of the TIMIT dataset, including the silence and non-speech states, are divided into Q = 61 classes, q ∈ {1, 2, 3, …, Q}.
Finally, the clean-speech power spectral density is estimated as the average of the per-phoneme power spectra weighted by the phoneme probabilities. To train the neural network, we select the Adam optimization algorithm for the gradient computation.
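A sketch of the posterior-weighted averaging step, under the assumption that per-phoneme power spectra have been collected beforehand (the patent does not specify how; the data below are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
Q, K = 61, 257            # phoneme classes, frequency bins (512-point FFT assumed)

# Hypothetical per-phoneme clean-speech power spectra, e.g. averaged over
# the training data for each phoneme class.
phoneme_psd = rng.uniform(0.1, 1.0, size=(Q, K))

# Phoneme posteriors P(q | Z_l) for one frame, as produced by the softmax layer.
posteriors = rng.dirichlet(np.ones(Q))

# Posterior-weighted average over phoneme classes gives the PSD estimate.
lambda_x = posteriors @ phoneme_psd   # (K,) clean-speech PSD estimate
```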
Step 4: step 3 provides the estimate of the clean-speech power spectral density. As equation (1) shows, computing the a priori SNR requires both the clean-speech power spectral density and the noise power spectral density. In step 4 we therefore track the noise power spectral density with a traditional statistical model based on the minimum mean-squared error (MMSE) criterion: the noise power spectral density is obtained by recursively averaging the MMSE estimate of the current noise spectrum.
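The recursive averaging of step 4 can be sketched as a first-order smoother; the helper name and the smoothing constant `alpha` are assumptions not taken from the patent, and the MMSE periodogram estimate itself is taken as given:

```python
import numpy as np

def update_noise_psd(lambda_d_prev, noise_periodogram_mmse, alpha=0.8):
    """One recursive-averaging step of the noise PSD tracker.

    `noise_periodogram_mmse` stands for the MMSE estimate of the current
    noise periodogram E[|D(l,k)|^2 | Y(l,k)]; how that estimate is formed
    (e.g. via a speech-presence probability) is outside this sketch.
    """
    return alpha * lambda_d_prev + (1.0 - alpha) * noise_periodogram_mmse

lambda_d = np.ones(257)              # previous noise PSD estimate
current = np.full(257, 0.5)          # stand-in MMSE periodogram estimate
lambda_d = update_noise_psd(lambda_d, current)
```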
Step 5: steps 3 and 4 provide the estimates of the clean-speech and noise power spectral densities. From these estimates we compute the key variable of the gain function, the a priori SNR. Substituting it into the gain function gives the gain value; multiplying the gain value with the noisy speech spectrum yields the estimate of the clean-speech amplitude spectrum. The clean speech signal is then recovered with the overlap-add technique, the standard technique for reconstructing a time-domain signal from the frequency domain.
The invention also discloses a speech separation system based on a super-Gaussian prior speech model and deep learning, comprising: a memory, a processor and a computer program stored on the memory, the computer program being configured to implement the steps of the method of the present invention when invoked by the processor.
The invention also discloses a computer-readable storage medium storing a computer program, the computer program being configured to implement the steps of the method of the present invention when invoked by a processor.
The beneficial effects of the present invention are as follows:
1. The invention adopts a super-Gaussian distribution model that better matches the statistical properties of speech Fourier coefficients, making the estimated gain function more accurate.
2. Speech signal processing is empowered by deep learning: the powerful modelling capacity of deep learning is used to learn the mapping between noisy speech and the clean speech signal, which effectively suppresses highly non-stationary noise signals.
3. By combining a traditional statistical model with deep learning, the invention not only effectively suppresses non-stationary noise signals, but also alleviates deep learning's heavy dependence on training data and its weak generalization ability. This combination makes the enhancement performance of the method highly robust across a wide range of noise environments and SNR conditions.
The above is a further detailed description of the present invention in conjunction with specific preferred embodiments, but the specific implementation of the invention is not limited to these descriptions. For a person of ordinary skill in the art to which the present invention belongs, a number of simple deductions or substitutions may be made without departing from the inventive concept, and all such variants shall be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A speech separation method based on a super-Gaussian prior speech model and deep learning, characterized by comprising the following steps:
Step 1: receiving a noisy speech signal;
Step 2: modelling the Fourier transform coefficients of the clean speech signal and of the noise signal with a super-Gaussian statistical model and a Gaussian statistical model, respectively; based on these statistical assumptions, estimating the amplitude spectrum of the clean speech signal under the minimum mean-square error criterion, obtaining the amplitude-spectrum estimate;
Step 3: estimating the clean-speech power spectral density using a deep neural network;
Step 4: tracking the noise power spectral density with a statistical model based on the minimum mean-square error criterion, the noise power spectral density being obtained by recursively averaging the minimum mean-square error estimate of the current noise spectrum;
Step 5: using the clean-speech power spectral density estimate obtained in step 3 and the noise power spectral density estimate obtained in step 4 to compute the a priori signal-to-noise ratio in the gain function; substituting the a priori signal-to-noise ratio into the gain function to obtain the gain value; multiplying the gain value with the noisy speech spectrum to obtain the estimate of the clean-speech amplitude spectrum; and recovering the clean speech signal using the overlap-add technique.
2. The speech separation method according to claim 1, characterized in that in step 2 the amplitude-spectrum estimate is given by equation (1), wherein ξ = λx/λd denotes the a priori signal-to-noise ratio, λx = E[|X(l, k)|²] and λd = E[|D(l, k)|²] being the clean-speech power spectral density and the noise power spectral density, respectively; furthermore ζ = γξ/(μ + ξ), where γ = |Y(l, k)|²/λd(l, k) is the a posteriori signal-to-noise ratio, and M(·, ·; ·) denotes the confluent hypergeometric function.
3. The speech separation method according to claim 2, characterized in that in step 2, the parameter values selected for the super-Gaussian speech signal model are μ = 0.2 and β = 0.001.
4. The speech separation method according to claim 1, characterized in that in step 3, the deep neural network architecture has two hidden layers, the activation function is the rectified linear unit, and the output layer uses the softmax function.
5. The speech separation method according to claim 4, characterized in that the first and second hidden layers each have 512 nodes, and the training dataset used is the TIMIT speech database.
6. The speech separation method according to claim 1, characterized in that in step 3, in order to train the deep neural network, the speech data is first preprocessed: clean speech is mixed with noise signals of multiple types at signal-to-noise ratios of 0, 5, 10 and 15 dB to obtain noisy speech signals; the input features of the deep neural network are 13-dimensional Mel-frequency cepstral coefficients together with their first- and second-order difference coefficients.
7. The speech separation method according to claim 6, characterized in that in step 3, each noisy speech signal is divided into frames and a 39-dimensional feature vector is extracted per frame, comprising the 13-dimensional Mel-frequency cepstral coefficients and their first- and second-order difference coefficients; furthermore, to exploit inter-frame context, the current frame is used together with the three frames on each side, 7 frames in total, so the input feature dimension of the input layer is 273.
8. The speech separation method according to claim 6, characterized in that in step 3, the cost function used by the deep neural network is the cross-entropy loss, the output layer uses softmax to output the probability that the current frame belongs to each phoneme, and the clean-speech power spectral density is estimated as the average of the per-phoneme power spectra weighted by these phoneme probabilities.
9. A speech separation system based on a super-Gaussian prior speech model and deep learning, characterized by comprising: a memory, a processor and a computer program stored on the memory, the computer program being configured to implement the steps of the method of any one of claims 1-8 when invoked by the processor.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program being configured to implement the steps of the method of any one of claims 1-8 when invoked by a processor.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910167788.8A CN109767781A (en) | 2019-03-06 | 2019-03-06 | Speech separating method, system and storage medium based on super-Gaussian priori speech model and deep learning |
PCT/CN2019/117076 WO2020177372A1 (en) | 2019-03-06 | 2019-11-11 | Voice separation method and system based on super-gaussian prior voice module and deep learning, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910167788.8A CN109767781A (en) | 2019-03-06 | 2019-03-06 | Speech separating method, system and storage medium based on super-Gaussian priori speech model and deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109767781A true CN109767781A (en) | 2019-05-17 |
Family
ID=66457658
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910167788.8A Pending CN109767781A (en) | 2019-03-06 | 2019-03-06 | Speech separating method, system and storage medium based on super-Gaussian priori speech model and deep learning |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109767781A (en) |
WO (1) | WO2020177372A1 (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111144347A (en) * | 2019-12-30 | 2020-05-12 | 腾讯科技(深圳)有限公司 | Data processing method, device, platform and storage medium |
WO2020177372A1 (en) * | 2019-03-06 | 2020-09-10 | 哈尔滨工业大学(深圳) | Voice separation method and system based on super-gaussian prior voice module and deep learning, and storage medium |
CN112289337A (en) * | 2020-11-03 | 2021-01-29 | 北京声加科技有限公司 | Method and device for filtering residual noise after machine learning voice enhancement |
CN112653979A (en) * | 2020-12-29 | 2021-04-13 | 苏州思必驰信息科技有限公司 | Adaptive dereverberation method and device |
WO2021208287A1 (en) * | 2020-04-14 | 2021-10-21 | 深圳壹账通智能科技有限公司 | Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium |
WO2022161277A1 (en) * | 2021-01-29 | 2022-08-04 | 北京沃东天骏信息技术有限公司 | Speech enhancement method, model training method, and related device |
CN116580723A (en) * | 2023-07-13 | 2023-08-11 | 合肥星本本网络科技有限公司 | Voice detection method and system in strong noise environment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001003125B1 (en) * | 1999-07-02 | 2001-02-08 | Conexant Systems Inc | Bi-directional pitch enhancement in speech coding systems |
CN105632512A (en) * | 2016-01-14 | 2016-06-01 | 华南理工大学 | Dual-sensor voice enhancement method based on statistics model and device |
CN107610712A (en) * | 2017-10-18 | 2018-01-19 | 会听声学科技(北京)有限公司 | The improved MMSE of combination and spectrum-subtraction a kind of sound enhancement method |
CN108074582A (en) * | 2016-11-10 | 2018-05-25 | 电信科学技术研究院 | A kind of noise suppressed signal-noise ratio estimation method and user terminal |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2341299A (en) * | 1998-09-04 | 2000-03-08 | Motorola Ltd | Suppressing noise in a speech communications unit |
CN101685638B (en) * | 2008-09-25 | 2011-12-21 | 华为技术有限公司 | Method and device for enhancing voice signals |
CN104103278A (en) * | 2013-04-02 | 2014-10-15 | 北京千橡网景科技发展有限公司 | Real time voice denoising method and device |
CN103903631B (en) * | 2014-03-28 | 2017-10-03 | 哈尔滨工程大学 | Voice signal blind separating method based on Variable Step Size Natural Gradient Algorithm |
US9564144B2 (en) * | 2014-07-24 | 2017-02-07 | Conexant Systems, Inc. | System and method for multichannel on-line unsupervised bayesian spectral filtering of real-world acoustic noise |
CN107731242B (en) * | 2017-09-26 | 2020-09-04 | 桂林电子科技大学 | Gain function speech enhancement method for generalized maximum posterior spectral amplitude estimation |
CN109767781A (en) * | 2019-03-06 | 2019-05-17 | 哈尔滨工业大学(深圳) | Speech separating method, system and storage medium based on super-Gaussian priori speech model and deep learning |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001003125B1 (en) * | 1999-07-02 | 2001-02-08 | Conexant Systems Inc | Bi-directional pitch enhancement in speech coding systems |
CN105632512A (en) * | 2016-01-14 | 2016-06-01 | 华南理工大学 | Dual-sensor voice enhancement method based on statistics model and device |
CN108074582A (en) * | 2016-11-10 | 2018-05-25 | 电信科学技术研究院 | A kind of noise suppressed signal-noise ratio estimation method and user terminal |
CN107610712A (en) * | 2017-10-18 | 2018-01-19 | 会听声学科技(北京)有限公司 | The improved MMSE of combination and spectrum-subtraction a kind of sound enhancement method |
Non-Patent Citations (2)
Title |
---|
ROBERT REHR ET AL: "On the Importance of Super-Gaussian Speech Priors", IEEE *
TIMO GERKMANN ET AL: "Noise Power Estimation Based on the Probability of Speech Presence", IEEE *
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020177372A1 (en) * | 2019-03-06 | 2020-09-10 | 哈尔滨工业大学(深圳) | Voice separation method and system based on super-gaussian prior voice module and deep learning, and storage medium |
CN111144347A (en) * | 2019-12-30 | 2020-05-12 | 腾讯科技(深圳)有限公司 | Data processing method, device, platform and storage medium |
WO2021208287A1 (en) * | 2020-04-14 | 2021-10-21 | 深圳壹账通智能科技有限公司 | Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium |
CN112289337A (en) * | 2020-11-03 | 2021-01-29 | 北京声加科技有限公司 | Method and device for filtering residual noise after machine learning voice enhancement |
CN112289337B (en) * | 2020-11-03 | 2023-09-01 | 北京声加科技有限公司 | Method and device for filtering residual noise after machine learning voice enhancement |
CN112653979A (en) * | 2020-12-29 | 2021-04-13 | 苏州思必驰信息科技有限公司 | Adaptive dereverberation method and device |
WO2022161277A1 (en) * | 2021-01-29 | 2022-08-04 | 北京沃东天骏信息技术有限公司 | Speech enhancement method, model training method, and related device |
CN116580723A (en) * | 2023-07-13 | 2023-08-11 | 合肥星本本网络科技有限公司 | Voice detection method and system in strong noise environment |
CN116580723B (en) * | 2023-07-13 | 2023-09-08 | 合肥星本本网络科技有限公司 | Voice detection method and system in strong noise environment |
Also Published As
Publication number | Publication date |
---|---|
WO2020177372A1 (en) | 2020-09-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109767781A (en) | Speech separating method, system and storage medium based on super-Gaussian priori speech model and deep learning | |
CN106486131B (en) | A kind of method and device of speech de-noising | |
CN109841206A (en) | A kind of echo cancel method based on deep learning | |
CN102324232A (en) | Method for recognizing sound-groove and system based on gauss hybrid models | |
CN109949821B (en) | Method for removing reverberation of far-field voice by using U-NET structure of CNN | |
CN102693724A (en) | Noise classification method of Gaussian Mixture Model based on neural network | |
CN109887489A (en) | Speech dereverberation method based on the depth characteristic for generating confrontation network | |
CN109346084A (en) | Method for distinguishing speek person based on depth storehouse autoencoder network | |
Astudillo et al. | An uncertainty propagation approach to robust ASR using the ETSI advanced front-end | |
CN112017682A (en) | Single-channel voice simultaneous noise reduction and reverberation removal system | |
CN106373559A (en) | Robustness feature extraction method based on logarithmic spectrum noise-to-signal weighting | |
CN103021405A (en) | Voice signal dynamic feature extraction method based on MUSIC and modulation spectrum filter | |
CN108257606A (en) | A kind of robust speech personal identification method based on the combination of self-adaptive parallel model | |
González et al. | MMSE-based missing-feature reconstruction with temporal modeling for robust speech recognition | |
Roy et al. | DeepLPC-MHANet: Multi-head self-attention for augmented Kalman filter-based speech enhancement | |
Jensen et al. | Minimum mean-square error estimation of mel-frequency cepstral features–a theoretically consistent approach | |
Deligne et al. | Audio-visual speech enhancement with AVCDCN (audio-visual codebook dependent cepstral normalization) | |
Wang et al. | Improving denoising auto-encoder based speech enhancement with the speech parameter generation algorithm | |
Nathwani et al. | An extended experimental investigation of DNN uncertainty propagation for noise robust ASR | |
Xu et al. | Vector taylor series based joint uncertainty decoding. | |
Chen | Noise reduction of bird calls based on a combination of spectral subtraction, Wiener filtering, and Kalman filtering | |
Fingscheidt et al. | Data-driven speech enhancement | |
KR20170087211A (en) | Feature compensation system and method for recognizing voice | |
Chang et al. | Multiple statistical models for soft decision in noisy speech enhancement | |
Li et al. | Enhanced speech based jointly statistical probability distribution function for voice activity detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20190517 |
RJ01 | Rejection of invention patent application after publication |