CN108806708A - Voice de-noising method based on Computational auditory scene analysis and generation confrontation network model - Google Patents
Voice de-noising method based on Computational auditory scene analysis and generation confrontation network model Download PDFInfo
- Publication number
- CN108806708A (application CN201810606145.4A)
- Authority
- CN
- China
- Prior art keywords
- voice
- arbiter
- scene analysis
- auditory scene
- confrontation network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/0308—Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
Abstract
The present invention relates to a speech noise-reduction method based on computational auditory scene analysis and a generative adversarial network model, comprising: step 1, processing noisy speech with the generator and discriminator of a generative adversarial network to obtain an intermediate result; step 2, processing the intermediate result with a computational auditory scene analysis method to obtain the final result. The invention can remove part of the noise in speech signals acquired in complex-channel background environments while keeping the speech content largely undistorted.
Description
Technical field
The present invention relates to speech noise-reduction methods, and more particularly to a speech noise-reduction method based on computational auditory scene analysis and a generative adversarial network model.
Background technology
Speech is the most important means by which humans exchange information: a segment of speech carries rich information about the speaker's intent, identity, and emotional state. Speech signals can propagate through a variety of media such as air, water, and radio. During transmission, or because of limitations of the acquisition equipment, speech is usually corrupted by various kinds of noise. In certain professions in particular, ambient noise is unavoidable, and in many cases the noise is complex in type and high in intensity. Such noise severely degrades subsequent speech signal processing, for example by reducing the accuracy of speech recognition. Moreover, if such noisy speech data is processed manually, prolonged listening can damage the operator's hearing.
Invention content
The purpose of the present invention is to provide a speech noise-reduction method based on computational auditory scene analysis and a generative adversarial network model, so as to remove part of the noise in speech signals acquired in complex-channel background environments while keeping the speech content undistorted.
The present invention provides a speech noise-reduction method based on computational auditory scene analysis and a generative adversarial network model, comprising:
Step 1, processing noisy speech with the generator and discriminator of a generative adversarial network to obtain an intermediate result;
Step 2, processing the intermediate result with a computational auditory scene analysis method to obtain the final result.
Further, in step 1, the training process of the generative adversarial network comprises:
1) inputting noisy data and clean data into the discriminator, training the discriminator to judge them as different, and adjusting the discriminator's network parameters by back-propagation;
2) feeding the noisy data through the generator for noise reduction, then inputting the generator's output together with the noisy data into the discriminator, training the discriminator to judge them as identical, and adjusting the discriminator's network parameters by back-propagation;
3) fixing the discriminator parameters obtained in step 2), and adjusting the generator's network parameters by back-propagation, the goal being that the discriminator judges the generated pair as different.
Further, step 2 comprises:
taking the intermediate result as the input of computational auditory scene analysis, performing masking estimation on the input signal, and resynthesizing the intermediate result according to the estimate to obtain the noise-reduced speech data.
Compared with the prior art, the beneficial effect of the invention is that it can remove part of the noise in speech signals acquired in complex-channel background environments while keeping the speech content largely undistorted.
Description of the drawings
Fig. 1 is a flow chart of the speech noise-reduction method based on computational auditory scene analysis and a generative adversarial network model according to the present invention;
Fig. 2 shows the network structure of the generator;
Fig. 3 shows the training process of the generative adversarial network.
Specific implementation mode
The present invention is described in detail below with reference to the embodiments shown in the accompanying drawings. It should be noted, however, that these embodiments do not limit the invention; equivalent transformations or substitutions in function, method, or structure made by those of ordinary skill in the art according to these embodiments all fall within the protection scope of the present invention.
This embodiment provides a speech noise-reduction method based on computational auditory scene analysis (CASA) and a generative adversarial network (GAN) model, comprising:
Step 1, processing noisy speech with the generator and discriminator of a generative adversarial network to obtain an intermediate result;
Step 2, processing the intermediate result with a computational auditory scene analysis method to obtain the final result.
The speech noise-reduction method based on computational auditory scene analysis and a generative adversarial network model provided by this embodiment can remove part of the noise in speech signals acquired in complex-channel background environments while keeping the speech content largely undistorted.
In step 1, the training process of the generative adversarial network comprises:
1) inputting noisy data and clean data into the discriminator, training the discriminator to judge them as different, and adjusting the discriminator's network parameters by back-propagation;
2) feeding the noisy data through the generator for noise reduction, then inputting the generator's output together with the noisy data into the discriminator, training the discriminator to judge them as identical, and adjusting the discriminator's network parameters by back-propagation;
3) fixing the discriminator parameters obtained in step 2), and adjusting the generator's network parameters by back-propagation, the goal being that the discriminator judges the generated pair as different.
In the present embodiment, step 2 comprises:
taking the intermediate result as the input of computational auditory scene analysis, performing masking estimation on the input signal, and resynthesizing the intermediate result according to the estimate to obtain the noise-reduced speech data.
The invention is described in further detail below.
This embodiment performs speech noise reduction with a method based on computational auditory scene analysis and a generative adversarial network. CASA computes with the ideal binary mask (IBM) or the ideal ratio mask (IRM) as its target, converting the speech noise-reduction problem into a parameter-estimation and binary-classification problem. A GAN consists of a generator and a discriminator; its training process simulates a two-player zero-sum game in which the parameters of the generator and the discriminator are optimized in turn, the training objective being to learn an effective mapping between real data and training data. As shown in Fig. 1, y(n) is a noisy speech signal of length n with sampling rate f_s; after GAN processing an intermediate result is output, and after further CASA processing the final result x(n) is obtained. In this embodiment, the sampling rate f_s of all speech data is fixed at 16 kHz. The GAN-based and CASA-based noise-reduction stages are described in detail below.
1. Noise reduction based on GAN
The essence of a GAN is a zero-sum game between the generator and the discriminator. By adjusting the parameters throughout this continuous game, the network gradually learns the mapping relationship between specific data sets and, after training, can apply this mapping to entirely new data. For the speech noise-reduction problem in complex-channel background environments, what the GAN needs to learn is the mapping between y(n) and the noise-reduced output.
This method adopts the GAN architecture proposed by Pascual et al., in which the generator G serves as the final noise-reduction network, i.e. it performs the mapping from y(n) to the intermediate result; the discriminator D is only used for adversarial training of G during the training stage and is removed entirely at test time.
The network structure of G is shown in Fig. 2. It resembles an auto-encoder, consisting of two parts: the encoder in the lower half of Fig. 2 and the decoder in the upper half. This composition gives the network an end-to-end character: the input and output of the network are speech signals of the same length, which avoids a complicated feature-extraction procedure. The encoder and decoder have the same network structure, but with the layers arranged in reverse order, so the two are symmetric. The layers are fully convolutional, so the network contains no dense layers; the network attends to the temporal correlations of the input signal across all levels, and the number of trainable parameters is reduced.
G consists of 22 one-dimensional strided convolutional layers, each with filter width 31 and stride 2. The number of filters increases layer by layer while the width decreases. The output dimension of each encoder layer is (number of samples) x (feature maps), namely 16384 x 1, 8192 x 16, 4096 x 32, 2048 x 32, 1024 x 64, 512 x 64, 256 x 128, 128 x 128, 64 x 256, 32 x 256, 16 x 512, and 8 x 1024. The decoder network has the same filter widths and filter counts as the encoder. In addition to the connections between successive layers, G also connects each encoder layer to its corresponding decoder layer, bypassing the compression in the middle of the model; these are skip connections. In this way the low-level details are preserved, so the speech waveform can be reconstructed more accurately. The skip connections pass finely processed speech information directly to the decoding stage, and they also alleviate the vanishing-gradient problem to some extent, allowing gradients to be propagated deeper into the network during back-propagation.
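The encoder shape progression described above can be sketched numerically. The following is an illustrative reconstruction only (the channel list and stride are read off the dimensions quoted in this description; it is not the patent's actual code):

```python
# Encoder of G: 11 one-dimensional convolutions, filter width 31, stride 2.
# Each stride-2 layer halves the time axis while the channel count grows;
# the channel list below is taken from the dimensions quoted above.
channels = [1, 16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 1024]
length = 16384
shapes = [(length, channels[0])]
for ch in channels[1:]:
    length //= 2              # effect of the stride-2 convolution
    shapes.append((length, ch))
print(shapes[0], shapes[-1])  # → (16384, 1) (8, 1024)
```

The final shape (8, 1024) matches the last encoder dimension listed above; the decoder simply traverses this progression in reverse.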
The network structure of the discriminator D is similar to the encoder part of G: it is a one-dimensional convolutional structure with the same topology as a convolutional classification network. The differences are: 1) the input of D has two channels, each of 16384 sampled points; 2) virtual batch normalization is applied before the LeakyReLU activation functions, with alpha = 0.3; 3) after the last activation layer there is a one-dimensional convolutional layer with filter width 1, which does not further down-sample the hidden activations. In this way the parameter count of the fully connected layer is reduced from 8 x 1024 = 8192 to 8, and the way the 1024 channels are merged can be learned through the convolution parameters.
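The width-1 convolution in point 3) can be illustrated as follows. This is a minimal numpy sketch with random features; the variable names and initialization are assumptions for illustration, not taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 1024))  # final hidden map: 8 time steps x 1024 channels
w = 0.01 * rng.standard_normal(1024)       # one learnable width-1 filter weight per channel
merged = features @ w                      # shape (8,): the 1024 channels collapse to one
out = np.where(merged > 0, merged, 0.3 * merged)  # LeakyReLU with alpha = 0.3
print(merged.shape)                        # → (8,)
```

Only 1024 convolution weights are needed to merge the channels, and no down-sampling of the 8 time steps occurs, consistent with the description above.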
The training process of the network is shown in Fig. 3, where y denotes the noisy training data, paired with the corresponding noise-free data and with the data processed by G. The training speech data used by this method is the trainset portion of the database published by Valentini et al., containing 11572 clean utterances from 28 speakers. This method simulates speech data under complex-channel background conditions by adding Gaussian noise to the above clean speech; to imitate the complexity of real noise, noise is added at different signal-to-noise ratios, as detailed in Table 1. As the table shows, the proportions of data with relatively high SNR (40 dB) and relatively low SNR (20 dB) are small, while data at 30 dB SNR has the highest proportion; this design better simulates the noise conditions of real scenes.
Table 1. Noise levels added to the training data
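Adding Gaussian noise at a prescribed signal-to-noise ratio, as described above, amounts to scaling the noise so that the power ratio matches the target. A minimal sketch (the function name and tone signal are illustrative assumptions):

```python
import numpy as np

def add_noise_at_snr(clean, snr_db, rng=None):
    """Add white Gaussian noise to `clean` at the requested SNR in dB."""
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal(clean.shape)
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # scale the noise so that p_clean / p(scaled noise) = 10^(snr_db/10)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(1)
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s tone at f_s = 16 kHz
y = add_noise_at_snr(x, 30, rng)
snr = 10 * np.log10(np.mean(x ** 2) / np.mean((y - x) ** 2))
print(round(snr, 1))  # → 30.0
```

Repeating this at 20, 30, and 40 dB in the proportions of Table 1 would reproduce the mixing scheme described above.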
The training data is divided into batches of 100; for each batch, the learning rate is set to 0.0002, and the training process comprises the following three steps:
1) input the noisy data y and the clean data into D, train D to judge them as different (label 1), and adjust the network parameters of D by back-propagation;
2) feed the noisy data y through G for noise reduction to obtain its output, then input it together with y into D, train D to judge them as identical (label 0), and adjust the network parameters of D by back-propagation;
3) fix the network parameters of D obtained in the previous step and adjust the network parameters of G by back-propagation, the goal being that D judges the pair as different (label 1).
One pass over all training samples constitutes an epoch. Training is terminated after 86 epochs, after which the network parameters of G are fixed and G is used as the final noise-reduction network.
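The three-step update above can be sketched with a toy scalar GAN. This is an illustrative reconstruction under stated assumptions, not the patent's SEGAN code: D is a logistic model on a (candidate-clean, noisy) pair, G is a single gain parameter, and the labels follow the convention above (real pair → 1 "different", generated pair → 0 "same", G trained so that D outputs 1):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))
w = np.zeros(3)    # D parameters: weights for [candidate, noisy] plus a bias
theta = 0.1        # G parameter: the clean-speech estimate is theta * y
lr = 0.05

def d_step(x, y, label):
    """One back-propagation update of D on the pair (x, y) toward `label`."""
    global w
    p = sigmoid(w[0] * x + w[1] * y + w[2])
    g = label - p  # gradient of the log-likelihood w.r.t. the logit
    w = w + lr * np.array([np.mean(g * x), np.mean(g * y), np.mean(g)])

for step in range(200):
    c = rng.standard_normal(64)               # clean batch
    y = c + 0.5 * rng.standard_normal(64)     # noisy batch
    d_step(c, y, 1.0)                         # 1) real pair, label 1
    d_step(theta * y, y, 0.0)                 # 2) generated pair, label 0
    # 3) freeze D; move theta so that D outputs 1 on (theta * y, y)
    p = sigmoid(w[0] * theta * y + w[1] * y + w[2])
    theta += lr * np.mean((1.0 - p) * w[0] * y)  # chain rule through D's input

print(np.isfinite(theta))  # → True
```

The real network replaces the scalar gain with the convolutional G of Fig. 2 and the logistic model with the convolutional D, but the alternation of the three updates is the same.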
2. Noise reduction based on CASA
After y(n) has been processed by G, the resulting signal serves as the input to CASA. As shown in Fig. 1, masking estimation is first performed on the input signal, which is then resynthesized according to the estimate, finally yielding the noise-reduced speech data x(n).
Assume the input is composed of clean speech s(n) and a noise signal l(n), i.e. it equals s(n) + l(n) (1). Its time-frequency representation Y ∈ R^{m×n} can then be decomposed into a sparse speech term S and a low-rank noise term L:
Y = S + L (2)
The above formula can be solved by the method of robust principal component analysis (RPCA):
min_{S,L} ||L||_* + λ||S||_1 subject to Y = S + L (3)
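The decomposition in formula (3) can be sketched with a standard inexact augmented-Lagrangian solver: singular value thresholding for the low-rank term and soft thresholding for the sparse term. This is a generic RPCA sketch under assumed parameter choices (the weight lam and penalty mu are common defaults), not the exact algorithm of the patent:

```python
import numpy as np

def rpca(Y, n_iter=100):
    """Split Y into a low-rank part L and a sparse part S (inexact ALM sketch)."""
    m, n = Y.shape
    lam = 1.0 / np.sqrt(max(m, n))          # standard sparsity weight
    mu = 0.25 * m * n / np.abs(Y).sum()     # assumed penalty parameter
    S = np.zeros_like(Y)
    Z = np.zeros_like(Y)                    # dual variable for Y = L + S
    for _ in range(n_iter):
        # L-update: singular value thresholding of Y - S + Z/mu
        U, sig, Vt = np.linalg.svd(Y - S + Z / mu, full_matrices=False)
        L = U @ (np.maximum(sig - 1.0 / mu, 0.0)[:, None] * Vt)
        # S-update: entrywise soft thresholding
        R = Y - L + Z / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        Z = Z + mu * (Y - L - S)            # dual ascent on the constraint
    return L, S

rng = np.random.default_rng(0)
low_rank = np.outer(rng.standard_normal(20), rng.standard_normal(30))  # rank-1 "noise"
sparse = np.where(rng.random((20, 30)) < 0.05, 5.0, 0.0)               # sparse "speech"
L, S = rpca(low_rank + sparse)
print(L.shape, S.shape)  # → (20, 30) (20, 30)
```

In the speech setting, Y is the gammatone-domain spectrogram described below rather than a random test matrix.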
In view of the physical meaning of a spectrogram, the two components after decomposition should be non-negative, so:
min_{S,L} ||L||_* + λ||S||_1 subject to Y = S + L, S ≥ 0, L ≥ 0 (4)
The above model's conditions are too harsh, however, so a dense error term is introduced:
Y = S + L + E (5)
By introducing the auxiliary variables L+ and S+, formula (4) can be rewritten in an equality-constrained form (6), whose augmented Lagrangian is L_ρ (7), where Ω_Y, Ω_S and Ω_L are binary masking variables and ρ is a scale parameter.
The objective function in formula (7) is separable, so the ADMM algorithm can be applied: all variables in formula (7) can be updated alternately under the ADMM framework by solving the corresponding subproblems. Under the constraints of the two auxiliary variables and the three binary variables, L_ρ can be minimized by gradient descent.
After a gammatone filter-bank transformation, the input signal can thus be decomposed as above into three parts: a sparse term, a low-rank term, and a dense term. The IBM and IRM masking estimates can then be obtained as:
IBM(t, f) = 1 if S(t, f) > L(t, f), otherwise 0; IRM(t, f) = S(t, f) / (S(t, f) + L(t, f)) (8)
In this way the noisy speech signal can be resynthesized by mask-weighted summation on the spectrogram, realizing the separation of speech and noise and thereby achieving the goal of noise reduction.
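The masking estimates of formula (8) and the mask-weighted resynthesis reduce to elementwise operations on the time-frequency grid. A minimal numpy sketch with synthetic magnitudes (the shapes, names, and magnitude distributions are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
S = np.abs(rng.standard_normal((64, 100)))        # speech magnitudes (freq x time)
L = 0.3 * np.abs(rng.standard_normal((64, 100)))  # noise magnitudes
Y = S + L                                         # mixture spectrogram, as in (2)
ibm = (S > L).astype(float)                       # IBM: keep units where speech dominates
irm = S / (S + L + 1e-12)                         # IRM: soft ratio mask in [0, 1]
x_ibm = ibm * Y                                   # binary-masked resynthesis input
x_irm = irm * Y                                   # ratio-masked resynthesis input
print(ibm.min(), ibm.max())  # → 0.0 1.0
```

The masked spectrogram is then converted back to the time domain (in this method, through the gammatone filter bank) to obtain the noise-reduced waveform x(n).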
The present invention is a noise-reduction method for speech data in complex-channel background environments: it can reduce the noise of speech data acquired under complex noise conditions while keeping the speech content largely undistorted. The method can serve as a preprocessing stage for techniques such as manual speech transcription in interception environments, automatic speech recognition, voiceprint recognition, spoken keyword detection, and speech emotion analysis, reducing noise interference and improving recognition or detection accuracy. It can be applied in military fields such as information acquisition and analysis, as well as in civil fields such as big-data analysis.
The detailed descriptions listed above are merely specific illustrations of feasible embodiments of the invention and are not intended to limit its scope of protection; any equivalent implementations or modifications that do not depart from the technical spirit of the invention shall be included within its protection scope.
It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments and can be realized in other specific forms without departing from its spirit or essential attributes. Therefore, from whatever point of view, the embodiments should be regarded as exemplary and non-restrictive. The scope of the invention is defined by the appended claims rather than by the above description, and all changes falling within the meaning and range of equivalency of the claims are intended to be embraced therein.
Claims (3)
1. A speech noise-reduction method based on computational auditory scene analysis and a generative adversarial network model, characterized by comprising:
Step 1, processing noisy speech with the generator and discriminator of a generative adversarial network to obtain an intermediate result;
Step 2, processing the intermediate result with a computational auditory scene analysis method to obtain the final result.
2. The speech noise-reduction method based on computational auditory scene analysis and a generative adversarial network model according to claim 1, characterized in that, in step 1, the training process of the generative adversarial network comprises:
1) inputting noisy data and clean data into the discriminator, training the discriminator to judge them as different, and adjusting the discriminator's network parameters by back-propagation;
2) feeding the noisy data through the generator for noise reduction, then inputting the output together with the noisy data into the discriminator, training the discriminator to judge them as identical, and adjusting the discriminator's network parameters by back-propagation;
3) fixing the discriminator parameters obtained in step 2), and adjusting the generator's network parameters by back-propagation, the goal being that the discriminator judges the generated pair as different.
3. The speech noise-reduction method based on computational auditory scene analysis and a generative adversarial network model according to claim 2, characterized in that step 2 comprises:
taking the intermediate result as the input of computational auditory scene analysis, performing masking estimation on the input signal, and resynthesizing the intermediate result according to the estimate to obtain the noise-reduced speech data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810606145.4A CN108806708A (en) | 2018-06-13 | 2018-06-13 | Voice de-noising method based on Computational auditory scene analysis and generation confrontation network model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810606145.4A CN108806708A (en) | 2018-06-13 | 2018-06-13 | Voice de-noising method based on Computational auditory scene analysis and generation confrontation network model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108806708A true CN108806708A (en) | 2018-11-13 |
Family
ID=64085675
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810606145.4A Pending CN108806708A (en) | 2018-06-13 | 2018-06-13 | Voice de-noising method based on Computational auditory scene analysis and generation confrontation network model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108806708A (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110310650A (en) * | 2019-04-08 | 2019-10-08 | 清华大学 | A kind of voice enhancement algorithm based on second-order differential microphone array |
CN110363751A (en) * | 2019-07-01 | 2019-10-22 | 浙江大学 | A kind of big enteroscope polyp detection method based on generation collaborative network |
CN110503976A (en) * | 2019-08-15 | 2019-11-26 | 广州华多网络科技有限公司 | Audio separation method, device, electronic equipment and storage medium |
CN110751960A (en) * | 2019-10-16 | 2020-02-04 | 北京网众共创科技有限公司 | Method and device for determining noise data |
CN110751958A (en) * | 2019-09-25 | 2020-02-04 | 电子科技大学 | Noise reduction method based on RCED network |
CN111383651A (en) * | 2018-12-29 | 2020-07-07 | Tcl集团股份有限公司 | Voice noise reduction method and device and terminal equipment |
CN111583954A (en) * | 2020-05-12 | 2020-08-25 | 中国人民解放军国防科技大学 | Speaker independent single-channel voice separation method |
CN111933187A (en) * | 2020-09-21 | 2020-11-13 | 深圳追一科技有限公司 | Emotion recognition model training method and device, computer equipment and storage medium |
CN112133293A (en) * | 2019-11-04 | 2020-12-25 | 重庆邮电大学 | Phrase voice sample compensation method based on generation countermeasure network and storage medium |
CN112259068A (en) * | 2020-10-21 | 2021-01-22 | 上海协格空调工程有限公司 | Active noise reduction air conditioning system and noise reduction control method thereof |
CN112466320A (en) * | 2020-12-12 | 2021-03-09 | 中国人民解放军战略支援部队信息工程大学 | Underwater acoustic signal noise reduction method based on generation countermeasure network |
CN112487914A (en) * | 2020-11-25 | 2021-03-12 | 山东省人工智能研究院 | ECG noise reduction method based on deep convolution generation countermeasure network |
CN113096673A (en) * | 2021-03-30 | 2021-07-09 | 山东省计算中心(国家超级计算济南中心) | Voice processing method and system based on generation countermeasure network |
CN113160844A (en) * | 2021-04-27 | 2021-07-23 | 山东省计算中心(国家超级计算济南中心) | Speech enhancement method and system based on noise background classification |
CN113409377A (en) * | 2021-06-23 | 2021-09-17 | 四川大学 | Phase unwrapping method for generating countermeasure network based on jump connection |
CN115392325A (en) * | 2022-10-26 | 2022-11-25 | 中国人民解放军国防科技大学 | Multi-feature noise reduction modulation identification method based on cycleGan |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120010881A1 (en) * | 2010-07-12 | 2012-01-12 | Carlos Avendano | Monaural Noise Suppression Based on Computational Auditory Scene Analysis |
CN102890930A (en) * | 2011-07-19 | 2013-01-23 | 上海上大海润信息系统有限公司 | Speech emotion recognizing method based on hidden Markov model (HMM) / self-organizing feature map neural network (SOFMNN) hybrid model |
CN104064196A (en) * | 2014-06-20 | 2014-09-24 | 哈尔滨工业大学深圳研究生院 | Method for improving speech recognition accuracy on basis of voice leading end noise elimination |
CN104538043A (en) * | 2015-01-16 | 2015-04-22 | 北京邮电大学 | Real-time emotion reminder for call |
US9215527B1 (en) * | 2009-12-14 | 2015-12-15 | Cirrus Logic, Inc. | Multi-band integrated speech separating microphone array processor with adaptive beamforming |
CN107452405A (en) * | 2017-08-16 | 2017-12-08 | 北京易真学思教育科技有限公司 | A kind of method and device that data evaluation is carried out according to voice content |
CN107452389A (en) * | 2017-07-20 | 2017-12-08 | 大象声科(深圳)科技有限公司 | A kind of general monophonic real-time noise-reducing method |
CN107563428A (en) * | 2017-08-25 | 2018-01-09 | 西安电子科技大学 | Classification of Polarimetric SAR Image method based on generation confrontation network |
CN107845389A (en) * | 2017-12-21 | 2018-03-27 | 北京工业大学 | A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks |
CN107945811A (en) * | 2017-10-23 | 2018-04-20 | 北京大学 | A kind of production towards bandspreading resists network training method and audio coding, coding/decoding method |
- 2018-06-13: CN201810606145.4A patent application filed; published as CN108806708A (en); status: active, Pending
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9215527B1 (en) * | 2009-12-14 | 2015-12-15 | Cirrus Logic, Inc. | Multi-band integrated speech separating microphone array processor with adaptive beamforming |
US20120010881A1 (en) * | 2010-07-12 | 2012-01-12 | Carlos Avendano | Monaural Noise Suppression Based on Computational Auditory Scene Analysis |
CN102890930A (en) * | 2011-07-19 | 2013-01-23 | 上海上大海润信息系统有限公司 | Speech emotion recognizing method based on hidden Markov model (HMM) / self-organizing feature map neural network (SOFMNN) hybrid model |
CN104064196A (en) * | 2014-06-20 | 2014-09-24 | 哈尔滨工业大学深圳研究生院 | Method for improving speech recognition accuracy on basis of voice leading end noise elimination |
CN104538043A (en) * | 2015-01-16 | 2015-04-22 | 北京邮电大学 | Real-time emotion reminder for call |
CN107452389A (en) * | 2017-07-20 | 2017-12-08 | 大象声科(深圳)科技有限公司 | A kind of general monophonic real-time noise-reducing method |
CN107452405A (en) * | 2017-08-16 | 2017-12-08 | 北京易真学思教育科技有限公司 | A kind of method and device that data evaluation is carried out according to voice content |
CN107563428A (en) * | 2017-08-25 | 2018-01-09 | 西安电子科技大学 | Classification of Polarimetric SAR Image method based on generation confrontation network |
CN107945811A (en) * | 2017-10-23 | 2018-04-20 | 北京大学 | A kind of production towards bandspreading resists network training method and audio coding, coding/decoding method |
CN107845389A (en) * | 2017-12-21 | 2018-03-27 | 北京工业大学 | A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks |
Non-Patent Citations (4)
Title |
---|
PASCUAL S. et al.: "SEGAN: speech enhancement generative adversarial network", arXiv preprint, 30 June 2017, pages 1-5 *
CHEN Long et al.: "Speech noise-reduction method for radio interception", 《电声技术》 (Audio Engineering), vol. 42, no. 4, 30 April 2018, pages 25-30 *
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111383651A (en) * | 2018-12-29 | 2020-07-07 | Tcl集团股份有限公司 | Voice noise reduction method and device and terminal equipment |
CN110310650A (en) * | 2019-04-08 | 2019-10-08 | 清华大学 | A kind of voice enhancement algorithm based on second-order differential microphone array |
CN110363751B (en) * | 2019-07-01 | 2021-08-03 | 浙江大学 | Large intestine endoscope polyp detection method based on generation cooperative network |
CN110363751A (en) * | 2019-07-01 | 2019-10-22 | 浙江大学 | A kind of big enteroscope polyp detection method based on generation collaborative network |
CN110503976A (en) * | 2019-08-15 | 2019-11-26 | 广州华多网络科技有限公司 | Audio separation method, device, electronic equipment and storage medium |
CN110503976B (en) * | 2019-08-15 | 2021-11-23 | 广州方硅信息技术有限公司 | Audio separation method and device, electronic equipment and storage medium |
CN110751958A (en) * | 2019-09-25 | 2020-02-04 | 电子科技大学 | Noise reduction method based on RCED network |
CN110751960A (en) * | 2019-10-16 | 2020-02-04 | 北京网众共创科技有限公司 | Method and device for determining noise data |
CN110751960B (en) * | 2019-10-16 | 2022-04-26 | 北京网众共创科技有限公司 | Method and device for determining noise data |
CN112133293A (en) * | 2019-11-04 | 2020-12-25 | 重庆邮电大学 | Short-speech sample compensation method based on a generative adversarial network, and storage medium |
CN111583954A (en) * | 2020-05-12 | 2020-08-25 | 中国人民解放军国防科技大学 | Speaker-independent single-channel speech separation method |
CN111933187B (en) * | 2020-09-21 | 2021-02-05 | 深圳追一科技有限公司 | Emotion recognition model training method and device, computer equipment and storage medium |
CN111933187A (en) * | 2020-09-21 | 2020-11-13 | 深圳追一科技有限公司 | Emotion recognition model training method and device, computer equipment and storage medium |
CN112259068B (en) * | 2020-10-21 | 2023-04-11 | 上海协格空调工程有限公司 | Active noise reduction air conditioning system and noise reduction control method thereof |
CN112259068A (en) * | 2020-10-21 | 2021-01-22 | 上海协格空调工程有限公司 | Active noise reduction air conditioning system and noise reduction control method thereof |
CN112487914A (en) * | 2020-11-25 | 2021-03-12 | 山东省人工智能研究院 | ECG noise reduction method based on a deep convolutional generative adversarial network |
CN112487914B (en) * | 2020-11-25 | 2021-08-31 | 山东省人工智能研究院 | ECG noise reduction method based on a deep convolutional generative adversarial network |
CN112466320A (en) * | 2020-12-12 | 2021-03-09 | 中国人民解放军战略支援部队信息工程大学 | Underwater acoustic signal noise reduction method based on a generative adversarial network |
CN112466320B (en) * | 2020-12-12 | 2023-11-10 | 中国人民解放军战略支援部队信息工程大学 | Underwater acoustic signal noise reduction method based on a generative adversarial network |
CN113096673A (en) * | 2021-03-30 | 2021-07-09 | 山东省计算中心(国家超级计算济南中心) | Speech processing method and system based on a generative adversarial network |
CN113096673B (en) * | 2021-03-30 | 2022-09-30 | 山东省计算中心(国家超级计算济南中心) | Speech processing method and system based on a generative adversarial network |
CN113160844A (en) * | 2021-04-27 | 2021-07-23 | 山东省计算中心(国家超级计算济南中心) | Speech enhancement method and system based on noise background classification |
CN113409377A (en) * | 2021-06-23 | 2021-09-17 | 四川大学 | Phase unwrapping method based on a generative adversarial network with skip connections |
CN113409377B (en) * | 2021-06-23 | 2022-09-27 | 四川大学 | Phase unwrapping method based on a generative adversarial network with skip connections |
CN115392325A (en) * | 2022-10-26 | 2022-11-25 | 中国人民解放军国防科技大学 | Multi-feature noise-reduction modulation recognition method based on CycleGAN |
CN115392325B (en) * | 2022-10-26 | 2023-08-18 | 中国人民解放军国防科技大学 | Multi-feature noise-reduction modulation recognition method based on CycleGAN |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108806708A (en) | Voice de-noising method based on Computational auditory scene analysis and generation confrontation network model | |
CN110619885B (en) | Speech enhancement method using a generative adversarial network based on a deep fully convolutional neural network | |
Braun et al. | A curriculum learning method for improved noise robustness in automatic speech recognition | |
CN109036465B (en) | Speech emotion recognition method | |
DE112015004785B4 (en) | Method for converting a noisy signal into an enhanced audio signal | |
CN109524020B (en) | Speech enhancement processing method | |
CN108172238A (en) | Speech enhancement algorithm based on multiple convolutional neural networks in a speech recognition system | |
CN105047194B (en) | Self-learning spectrogram feature extraction method for speech emotion recognition | |
CN107845389A (en) | Speech enhancement method based on multi-resolution auditory cepstral coefficients and deep convolutional neural networks | |
Shah et al. | Time-frequency mask-based speech enhancement using convolutional generative adversarial network | |
CN106653056A (en) | Fundamental frequency extraction model based on LSTM recurrent neural network and training method thereof | |
CN109559736A (en) | Automatic dubbing method for film actors based on an adversarial network | |
CN112802491B (en) | Speech enhancement method based on a time-frequency domain generative adversarial network | |
Hui et al. | Convolutional maxout neural networks for speech separation | |
CN109890043A (en) | Wireless signal noise reduction method based on a generative adversarial network | |
CN107967920A (en) | Improved autoencoder neural network speech enhancement algorithm | |
CN111429947A (en) | Speech emotion recognition method based on multi-stage residual convolutional neural network | |
CN114428234A (en) | Radar high-resolution range profile noise reduction identification method based on GAN and self-attention | |
CN110102051A (en) | Method and device for detecting game cheating plug-ins | |
CN107516065A (en) | Complex signal denoising method combining empirical mode decomposition and dictionary learning | |
CN106204482A (en) | Mixed noise removal method based on weighted sparsity | |
CN114863938B (en) | Bird call recognition method and system based on attention residuals and feature fusion | |
Zöhrer et al. | Representation learning for single-channel source separation and bandwidth extension | |
CN114283829A (en) | Speech enhancement method based on a dynamic gated convolutional recurrent network | |
Nair et al. | MFCC-based noise reduction in ASR using Kalman filtering | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20181113 |
|