CN109326302A - A kind of sound enhancement method comparing and generate confrontation network based on vocal print - Google Patents
Publication number: CN109326302A (application CN201811353760.5A)
Authority: CN (China)
Legal status: Granted
Classifications
- G10L21/0208 — Noise filtering (G: Physics; G10: Musical instruments, acoustics; G10L: Speech analysis or synthesis, speech recognition, speech or voice processing, speech or audio coding or decoding; G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility; G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation)
- G10L17/00 — Speaker identification or verification
- G10L21/0264 — Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
Abstract
The invention discloses a speech enhancement method based on voiceprint comparison and a generative adversarial network: 1) establish three speech databases, corresponding respectively to a voiceprint recognition encoder, a noise separation system, and a speech separation system; 2) train the voiceprint recognition encoder to extract the voiceprint features of the target speaker, obtaining the target voiceprint feature; 3) convert the noisy audio into a spectrogram and feed it to the generator in the noise separation system to obtain a predicted clean audio; 4) feed the predicted clean audio and the true clean audio into the discriminator of the noise separation system for training; 5) adjust the discriminator's weight parameters so that it better distinguishes true clean audio from predicted clean audio, yielding a generator that produces near-true clean audio; 6) feed the speaker's voice into the trained generator to generate a predicted clean spectrogram and obtain the enhanced speech signal. The method has a small model size and low computational cost, is easy to port, maintains a degree of spatial invariance, and denoises well.
Description
Technical field
The present invention relates to the field of speech enhancement, and in particular to a speech enhancement method based on voiceprint comparison and a generative adversarial network.
Background technique
With the development of society and the spread of electronic products, people's demands on voice quality keep rising. Improving the mobile-communication quality of electronic products in noisy environments has become one of the most active research directions. Speech enhancement can improve the quality and intelligibility of speech in high-noise environments; it has important applications not only in hearing aids and cochlear implants, but has also been successfully applied as a preprocessing stage in speech recognition and speaker recognition systems.
Classical speech enhancement methods include spectral subtraction, Wiener filtering, statistical-model-based methods, and subspace algorithms. Since the 1980s, neural networks have also been applied to speech enhancement. In recent years, denoising autoencoders have been widely adopted; for example, recurrent denoising autoencoders handle the contextual information of audio signals well, and long short-term memory (LSTM) networks have recently been applied to denoising tasks. Although these methods achieve good results, they require large amounts of data and computation and are difficult to port to embedded devices. Moreover, they tend to depend on the training set: the clean audio they output is essentially an average of the training set's clean audio, so it is relatively blurry and the handling of detail is unsatisfactory.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing a speech enhancement method based on voiceprint comparison and a generative adversarial network. The method has a small model size and low computational cost, is easy to port, maintains a degree of spatial invariance, and denoises well.
The technical solution realizing the object of the invention is as follows:
A speech enhancement method based on voiceprint comparison and a generative adversarial network comprises the following steps:
1) Establish three speech databases, corresponding respectively to a voiceprint recognition encoder, a noise separation system, and a speech separation system.
2) Train the voiceprint recognition encoder to extract the voiceprint features of the target speaker, obtaining the target voiceprint feature.
3) Convert the noisy audio into a spectrogram and feed it into the generator of the noise separation system; according to the target voiceprint feature extracted by the voiceprint recognition encoder, the generator separates out the target speaker's voice, obtaining a predicted clean audio.
4) Feed the predicted clean audio obtained in step 3) and the true clean audio in the speech separation system of step 1) into the discriminator of the noise separation system for training, so that the discriminator can tell whether the predicted spectrogram of the speaker's voice generated by the noise separation system matches the distribution of true audio.
5) Adjust the discriminator's weight parameters so that it better distinguishes true clean audio from the predicted clean audio produced by the generator, and update the generator's weight parameters according to the discriminator's decisions, until the discriminator can no longer tell the generator's predicted audio from true clean audio; this yields a generator that produces near-true clean audio.
6) Convert the speaker's voice, collected by a microphone, into a spectrogram via the short-time Fourier transform and feed it into the trained generator to generate a predicted clean spectrogram; then convert it into an analog speech signal via the inverse short-time Fourier transform, and play the analog speech signal through a loudspeaker to obtain the enhanced speech signal.
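As a minimal sketch (not the patent's implementation), the step-6 pipeline of STFT, mask-based generation, and inverse STFT can be illustrated in NumPy. The window length, hop size, Hann window, and the identity-mask stand-in for the trained generator are all assumptions for illustration only:

```python
import numpy as np

def stft(x, n_fft=512, hop=160):
    """Short-time Fourier transform with a Hann window (assumed parameters)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)  # shape: (frames, bins)

def istft(S, n_fft=512, hop=160):
    """Inverse STFT by overlap-add (window normalization omitted for brevity)."""
    out = np.zeros(hop * (S.shape[0] - 1) + n_fft)
    for i, frame in enumerate(np.fft.irfft(S, n=n_fft, axis=1)):
        out[i * hop:i * hop + n_fft] += frame
    return out

def enhance(noisy, generator):
    """Waveform -> spectrogram -> mask -> enhanced waveform, as in step 6."""
    S = stft(noisy)
    mask = generator(np.abs(S))                            # mask in [0, 1]
    S_clean = mask * np.abs(S) * np.exp(1j * np.angle(S))  # keep noisy phase
    return istft(S_clean)

# Toy stand-in for the trained generator: an identity mask.
noisy = np.random.default_rng(0).standard_normal(16000)    # 1 s at 16 kHz
enhanced = enhance(noisy, lambda mag: np.ones_like(mag))
```

In the patent the `generator` would be the trained adversarial network; here a lambda that passes the spectrogram through unchanged merely exercises the plumbing.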
The voiceprint recognition encoder uses the 2000 NIST Speaker Recognition Evaluation corpus; the noise separation system uses the 100-nonspeech noise library; and the speech separation system uses the TIMIT corpus.
In step 2), the voiceprint recognition encoder extracts the voiceprint features of the target speaker as follows: convert the audio signal into frames of width 25 ms and step 10 ms; filter each frame with a mel filterbank and extract a 40-dimensional energy spectrum from the result as the network input; build fixed-length sliding windows over these frames and run an LSTM network on each window; then take the LSTM's output at the last frame as the voiceprint feature (d-vector) of that sliding window.
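The 25 ms / 10 ms framing and fixed-length sliding windows described above can be sketched as follows. The 16 kHz sampling rate and the window width/step of the sliding windows are assumptions, as the patent does not state them; the mel energies are random placeholders standing in for the filterbank output:

```python
import numpy as np

SR = 16000                      # assumed sampling rate
FRAME = int(0.025 * SR)         # 25 ms -> 400 samples
HOP = int(0.010 * SR)           # 10 ms -> 160 samples

def frame_signal(x, frame=FRAME, hop=HOP):
    """Split a waveform into overlapping 25 ms frames with a 10 ms step."""
    n = 1 + (len(x) - frame) // hop
    return np.stack([x[i * hop:i * hop + frame] for i in range(n)])

def sliding_windows(feats, width=80, step=40):
    """Fixed-length sliding windows over the 40-dim mel-energy frames;
    each window would be fed to the LSTM, whose last output is the d-vector."""
    return [feats[i:i + width] for i in range(0, len(feats) - width + 1, step)]

x = np.random.default_rng(1).standard_normal(SR)            # 1 s of audio
frames = frame_signal(x)                                    # (98, 400)
mel40 = np.random.default_rng(2).random((len(frames), 40))  # placeholder mel energies
wins = sliding_windows(mel40)                               # windows for the LSTM
```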
The generator consists of an 8-layer convolutional network, a 1-layer LSTM recurrent network, and a 2-layer fully connected network. Every layer uses the ReLU activation function except the last fully connected layer, which uses sigmoid. After the input signal's spectrogram passes through the convolutional layers, the voiceprint feature (d-vector) of the reference audio is concatenated frame by frame to the convolutional output, and the result is fed into the LSTM layer. Finally, the network outputs a mask with the same dimensions as the input spectrogram; multiplying the output mask with the input spectrogram yields the predicted clean audio spectrogram of the output audio.
The discriminator consists of a 2-layer convolutional network and a 2-layer fully connected neural network. Every layer uses ReLU except the last fully connected layer, which uses sigmoid. The generator's predicted clean audio spectrogram X̂ is fed into the discriminator, then the true clean audio X from step 1) is fed in, and the discriminator network is trained: it judges the generator's predicted clean audio spectrogram X̂ to be fake data and gives it a low score (close to 0), and judges the true clean audio X from step 1) to be real data and gives it a high score (close to 1). In this way it learns the distributions of real and predicted data, so that in step 6) it can tell whether the predicted spectrogram of the speaker's voice generated by the noise separation system matches the distribution of true audio.
Adjusting the discriminator's weight parameters means passing its real/fake decisions back to the generator: the generator adjusts the parameters of its network model to correct its output spectrogram, moving it closer to the true distribution and eliminating the noise components the discriminator judged as fake, so that the predicted clean spectrogram X̂ the generator produces can "fool" the discriminator into judging it to be the spectrogram X of true clean audio obtained from the TIMIT corpus. During neural network backpropagation, the discriminator becomes better at distinguishing true clean audio from the generator's predicted clean audio, i.e. it better learns the features of true clean audio; likewise, the generator keeps adjusting its parameters against the continually updated discriminator, moving the predicted spectrogram it generates toward the spectrogram of the true clean audio.
The generator and discriminator confront each other in a mutual game, forming the generative adversarial network algorithm, whose formula is as follows:

min_G max_D V(D, G) = E_{x~data}[log D(x)] + E_{n~noise}[log(1 − D(G(n)))]

To address the vanishing-gradient problem faced by the classical method, the least-squares GAN is used in place of the cross-entropy loss, giving:

min_D V(D) = (1/2) E_{x~data}[(D(x) − 1)²] + (1/2) E_{n~noise}[D(G(n))²]
min_G V(G) = (1/2) E_{n~noise}[(D(G(n)) − 1)²]

In the formulas above, G denotes the generator and D the discriminator; V represents the loss value; data denotes the corpus of true clean audio in the speech separation system of step 1), and x a true clean speech audio drawn from data; noise denotes the noisy-audio corpus in the speech separation system of step 1), and n the noisy audio drawn from noise that corresponds to x; G(n) denotes the generator denoising the noisy speech to obtain the predicted clean audio X̂; D(G(n)) denotes the discriminator judging the predicted clean audio X̂ to be fake with a low score (close to 0), while the true clean audio X is judged real with a high score (close to 1).
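The least-squares objectives above can be written directly in NumPy. Here `d_real` and `d_fake` stand for the discriminator's scores D(x) and D(G(n)) over a batch; the example score values are hypothetical:

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """Least-squares discriminator loss: push D(x) toward 1 and D(G(n)) toward 0."""
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    """Least-squares generator loss: push D(G(n)) toward 1 to 'fool' D."""
    return 0.5 * np.mean((d_fake - 1.0) ** 2)

# Hypothetical batch scores: discriminator mostly right, generator partly fooling it.
d_real = np.array([0.9, 0.8, 0.95])
d_fake = np.array([0.1, 0.4, 0.2])
d_loss = lsgan_d_loss(d_real, d_fake)
g_loss = lsgan_g_loss(d_fake)
```

A perfect discriminator (real scores all 1, fake scores all 0) drives the discriminator loss to zero, and a generator that fully fools it (fake scores all 1) drives the generator loss to zero, matching the objectives above.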
The speech enhancement method based on voiceprint comparison and a generative adversarial network provided by the invention has a small model size and low computational cost, is easy to port, maintains a degree of spatial invariance, and denoises well.
Brief description of the drawings
Fig. 1 is the flow chart of the method of the invention;
Fig. 2 is a schematic diagram of the voiceprint recognition encoder in the invention;
Fig. 3 is a schematic diagram of the generator in the invention;
Fig. 4 is a schematic diagram of the discriminator in the invention.
Specific embodiment
The present invention is further elaborated below with reference to the drawings and an embodiment, which is not a limitation of the invention.
Embodiment:
As shown in Fig. 1, a speech enhancement method based on voiceprint comparison and a generative adversarial network comprises the following steps:
1) Establish three speech databases, corresponding respectively to a voiceprint recognition encoder, a noise separation system, and a speech separation system.
2) Train the voiceprint recognition encoder to extract the voiceprint features of the target speaker, obtaining the target voiceprint feature.
3) Convert the noisy audio into a spectrogram and feed it into the generator of the noise separation system; according to the target voiceprint feature extracted by the voiceprint recognition encoder, the generator separates out the target speaker's voice, obtaining a predicted clean audio.
4) Feed the predicted clean audio obtained in step 3) and the true clean audio in the speech separation system of step 1) into the discriminator of the noise separation system for training, so that the discriminator can tell whether the predicted spectrogram of the speaker's voice generated by the noise separation system matches the distribution of true audio.
5) Adjust the discriminator's weight parameters so that it better distinguishes true clean audio from the predicted clean audio produced by the generator, and update the generator's weight parameters according to the discriminator's decisions, until the discriminator can no longer tell the generator's predicted audio from true clean audio; this yields a generator that produces near-true clean audio.
6) Convert the speaker's voice, collected by a microphone, into a spectrogram via the short-time Fourier transform and feed it into the trained generator to generate a predicted clean spectrogram; then convert it into an analog speech signal via the inverse short-time Fourier transform, and play the analog speech signal through a loudspeaker to obtain the enhanced speech signal.
The voiceprint recognition encoder uses the 2000 NIST Speaker Recognition Evaluation corpus; the noise separation system uses the 100-nonspeech noise library; and the speech separation system uses the TIMIT corpus.
The 2000 NIST Speaker Recognition Evaluation corpus is one of the datasets most commonly used in the literature for voiceprint feature extraction, where it is usually referred to directly as "CALLHOME". It contains 500 conversations distributed over 6 languages: Arabic, English, German, Japanese, Mandarin, and Spanish.
The TIMIT corpus is an acoustic-phonetic continuous speech corpus collected jointly by Texas Instruments (TI), the Massachusetts Institute of Technology (MIT), and Stanford Research Institute (SRI). It contains 6300 sentences: 630 speakers from 8 major American dialect regions each read 10 given sentences. All sentences are manually segmented and labeled at the phone level, and the dataset is split 7:3 into a training set (70%) and a test set (30%).
The 100-nonspeech noise library consists of 100 non-human noise sounds collected by Guoning Hu's team.
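The 7:3 TIMIT split described above can be sketched with the standard library as follows. Note that this sentence-level random shuffle is only an illustration of the ratio; a real split would typically be made by speaker to avoid speaker leakage between train and test:

```python
import random

def split_7_3(sentences, seed=0):
    """Shuffle a corpus and split it 70/30 into train and test sets."""
    rng = random.Random(seed)
    items = list(sentences)
    rng.shuffle(items)
    cut = int(0.7 * len(items))
    return items[:cut], items[cut:]

# TIMIT contains 6300 sentences in total.
train, test = split_7_3(range(6300))
```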
The 2000 NIST Speaker Recognition Evaluation corpus serves as the first database, used to train the voiceprint recognition encoder so that it can reliably extract a speaker's voiceprint feature (d-vector). Second, triples of data are needed to train the entire noise separation system, with three inputs: (1) clean audio from the target speaker, (2) noisy audio, and (3) a reference audio from the target speaker. The clean audio is selected from the TIMIT corpus and deformed with noise at different signal-to-noise ratios (SNR) to form the noisy audio; finally, a reference audio is randomly selected from the target speaker's remaining clean audio to complete the triple. These triples form the second database.
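Deforming clean audio with noise at a chosen signal-to-noise ratio, as described for building the triples, can be sketched as follows (the 16 kHz signal lengths and random signals are placeholders; real data would come from TIMIT and the noise library):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that the clean/noise power ratio equals the target
    SNR in dB, producing the noisy member of a (clean, noisy, reference) triple."""
    noise = np.resize(noise, clean.shape)  # loop or trim noise to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(3)
clean = rng.standard_normal(16000)   # stand-in for a TIMIT utterance
noise = rng.standard_normal(8000)    # stand-in for a noise-library clip
noisy = mix_at_snr(clean, noise, snr_db=5.0)
```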
In step 2), the voiceprint recognition encoder extracts the voiceprint features of the target speaker as shown in Fig. 2, specifically: convert the audio signal into frames of width 25 ms and step 10 ms; filter each frame with a mel filterbank and extract a 40-dimensional energy spectrum from the result as the network input; build fixed-length sliding windows over these frames and run an LSTM network on each window; then take the LSTM's output at the last frame as the voiceprint feature (d-vector) of that sliding window.
As shown in Fig. 3, the generator consists of an 8-layer convolutional network, a 1-layer LSTM recurrent network, and a 2-layer fully connected network. Every layer uses the ReLU activation function except the last fully connected layer, which uses sigmoid. After the input signal's spectrogram passes through the convolutional layers, the voiceprint feature (d-vector) of the reference audio is concatenated frame by frame to the convolutional output, and the result is fed into the LSTM layer. Finally, the network outputs a mask with the same dimensions as the input spectrogram; multiplying the output mask with the input spectrogram yields the predicted clean audio spectrogram of the output audio.
As shown in Fig. 4, the discriminator consists of a 2-layer convolutional network and a 2-layer fully connected neural network. Every layer uses ReLU except the last fully connected layer, which uses sigmoid. The generator's predicted clean audio spectrogram X̂ is fed into the discriminator, then the true clean audio X from step 1) is fed in, and the discriminator network is trained: it judges the generator's predicted clean audio spectrogram X̂ to be fake data and gives it a low score (close to 0), and judges the true clean audio X from step 1) to be real data and gives it a high score (close to 1). In this way it learns the distributions of real and predicted data, so that in step 6) it can tell whether the predicted spectrogram of the speaker's voice generated by the noise separation system matches the distribution of true audio.
Adjusting the discriminator's weight parameters means passing its real/fake decisions back to the generator: the generator adjusts the parameters of its network model to correct its output spectrogram, moving it closer to the true distribution and eliminating the noise components the discriminator judged as fake, so that the predicted clean spectrogram X̂ the generator produces can "fool" the discriminator into judging it to be the spectrogram X of true clean audio obtained from the TIMIT corpus. During neural network backpropagation, the discriminator becomes better at distinguishing true clean audio from the generator's predicted clean audio, i.e. it better learns the features of true clean audio; likewise, the generator keeps adjusting its parameters against the continually updated discriminator, moving the predicted spectrogram it generates toward the spectrogram of the true clean audio.
The generator and discriminator confront each other in a mutual game, forming the generative adversarial network algorithm, whose formula is as follows:

min_G max_D V(D, G) = E_{x~data}[log D(x)] + E_{n~noise}[log(1 − D(G(n)))]

To address the vanishing-gradient problem faced by the classical method, the least-squares GAN is used in place of the cross-entropy loss, giving:

min_D V(D) = (1/2) E_{x~data}[(D(x) − 1)²] + (1/2) E_{n~noise}[D(G(n))²]
min_G V(G) = (1/2) E_{n~noise}[(D(G(n)) − 1)²]

In the formulas above, G denotes the generator and D the discriminator; V represents the loss value; data denotes the corpus of true clean audio in the speech separation system of step 1), and x a true clean speech audio drawn from data; noise denotes the noisy-audio corpus in the speech separation system of step 1), and n the noisy audio drawn from noise that corresponds to x; G(n) denotes the generator denoising the noisy speech to obtain the predicted clean audio X̂; D(G(n)) denotes the discriminator judging the predicted clean audio X̂ to be fake with a low score (close to 0), while the true clean audio X is judged real with a high score (close to 1).
Claims (7)
1. A speech enhancement method based on voiceprint comparison and a generative adversarial network, characterized by comprising the following steps:
1) establish three speech databases, corresponding respectively to a voiceprint recognition encoder, a noise separation system, and a speech separation system;
2) train the voiceprint recognition encoder to extract the voiceprint features of the target speaker, obtaining the target voiceprint feature;
3) convert the noisy audio into a spectrogram and feed it into the generator of the noise separation system; according to the target voiceprint feature extracted by the voiceprint recognition encoder, the generator separates out the target speaker's voice, obtaining a predicted clean audio;
4) feed the predicted clean audio obtained in step 3) and the true clean audio in the speech separation system of step 1) into the discriminator of the noise separation system for training, so that the discriminator can tell whether the predicted spectrogram of the speaker's voice generated by the noise separation system matches the distribution of true audio;
5) adjust the discriminator's weight parameters so that it better distinguishes true clean audio from the predicted clean audio produced by the generator, and update the generator's weight parameters according to the discriminator's decisions, until the discriminator can no longer tell the generator's predicted audio from true clean audio, yielding a generator that produces near-true clean audio;
6) convert the speaker's voice, collected by a microphone, into a spectrogram via the short-time Fourier transform and feed it into the trained generator to generate a predicted clean spectrogram; then convert it into an analog speech signal via the inverse short-time Fourier transform, and play the analog speech signal through a loudspeaker to obtain the enhanced speech signal.
2. The speech enhancement method based on voiceprint comparison and a generative adversarial network according to claim 1, characterized in that the voiceprint recognition encoder uses the 2000 NIST Speaker Recognition Evaluation corpus; the noise separation system uses the 100-nonspeech noise library; and the speech separation system uses the TIMIT corpus.
3. The speech enhancement method based on voiceprint comparison and a generative adversarial network according to claim 1, characterized in that, in step 2), the voiceprint recognition encoder extracts the voiceprint features of the target speaker as follows: convert the audio signal into frames of width 25 ms and step 10 ms; filter each frame with a mel filterbank and extract a 40-dimensional energy spectrum from the result as the network input; build fixed-length sliding windows over these frames and run an LSTM network on each window; then take the LSTM's output at the last frame as the voiceprint feature (d-vector) of that sliding window.
4. The speech enhancement method based on voiceprint comparison and a generative adversarial network according to claim 1, characterized in that the generator consists of an 8-layer convolutional network, a 1-layer LSTM recurrent network, and a 2-layer fully connected network; every layer uses the ReLU activation function except the last fully connected layer, which uses sigmoid; after the input signal's spectrogram passes through the convolutional layers, the voiceprint feature (d-vector) of the reference audio is concatenated frame by frame to the convolutional output and the result is fed into the LSTM layer; finally, the network outputs a mask with the same dimensions as the input spectrogram, and multiplying the output mask with the input spectrogram yields the predicted clean audio spectrogram of the output audio.
5. The speech enhancement method based on voiceprint comparison and a generative adversarial network according to claim 1, characterized in that the discriminator consists of a 2-layer convolutional network and a 2-layer fully connected neural network; every layer uses ReLU except the last fully connected layer, which uses sigmoid; the generator's predicted clean audio spectrogram X̂ is fed into the discriminator, then the true clean audio X from step 1) is fed in, and the discriminator network is trained: it judges the generator's predicted clean audio spectrogram X̂ to be fake data and gives it a low score (close to 0), and judges the true clean audio X from step 1) to be real data and gives it a high score (close to 1); in this way it learns the distributions of real and predicted data, so that in step 6) it can tell whether the predicted spectrogram of the speaker's voice generated by the noise separation system matches the distribution of true audio.
6. The speech enhancement method based on voiceprint comparison and a generative adversarial network according to claim 1, characterized in that adjusting the discriminator's weight parameters means passing its real/fake decisions back to the generator: the generator adjusts the parameters of its network model to correct its output spectrogram, moving it closer to the true distribution and eliminating the noise components the discriminator judged as fake, so that the predicted clean spectrogram X̂ the generator produces can "fool" the discriminator into judging it to be the spectrogram X of true clean audio obtained from the TIMIT corpus; during neural network backpropagation, the discriminator becomes better at distinguishing true clean audio from the generator's predicted clean audio, i.e. it better learns the features of true clean audio; likewise, the generator keeps adjusting its parameters against the continually updated discriminator, moving the predicted spectrogram it generates toward the spectrogram of the true clean audio.
7. The speech enhancement method based on voiceprint comparison and a generative adversarial network according to claim 1, characterized in that the generator and discriminator confront each other in a mutual game, forming the generative adversarial network algorithm, whose formula is as follows:

min_G max_D V(D, G) = E_{x~data}[log D(x)] + E_{n~noise}[log(1 − D(G(n)))]

To address the vanishing-gradient problem faced by the classical method, the least-squares GAN is used in place of the cross-entropy loss, giving:

min_D V(D) = (1/2) E_{x~data}[(D(x) − 1)²] + (1/2) E_{n~noise}[D(G(n))²]
min_G V(G) = (1/2) E_{n~noise}[(D(G(n)) − 1)²]

In the formulas above, G denotes the generator and D the discriminator; V represents the loss value; data denotes the corpus of true clean audio in the speech separation system of step 1), and x a true clean speech audio drawn from data; noise denotes the noisy-audio corpus in the speech separation system of step 1), and n the noisy audio drawn from noise that corresponds to x; G(n) denotes the generator denoising the noisy speech to obtain the predicted clean audio X̂; D(G(n)) denotes the discriminator judging the predicted clean audio X̂ to be fake with a low score (close to 0), while the true clean audio X is judged real with a high score (close to 1).
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN201811353760.5A | 2018-11-14 | 2018-11-14 | Voice enhancement method based on voiceprint comparison and generation of confrontation network |

Publications (2)

| Publication Number | Publication Date |
| --- | --- |
| CN109326302A | 2019-02-12 |
| CN109326302B | 2022-11-08 |

Family: ID=65257213; application CN201811353760.5A is Active, granted as CN109326302B (CN).
Cited By (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110164470A (en) * | 2019-06-12 | 2019-08-23 | 成都嗨翻屋科技有限公司 | Voice separation method, device, user terminal and storage medium |
CN110211591A (en) * | 2019-06-24 | 2019-09-06 | 卓尔智联(武汉)研究院有限公司 | Interview data analysing method, computer installation and medium based on emotional semantic classification |
CN110289004A (en) * | 2019-06-18 | 2019-09-27 | 暨南大学 | A kind of artificial synthesized vocal print detection system and method based on deep learning |
CN110619885A (en) * | 2019-08-15 | 2019-12-27 | 西北工业大学 | Method for generating confrontation network voice enhancement based on deep complete convolution neural network |
CN110619886A (en) * | 2019-10-11 | 2019-12-27 | 北京工商大学 | End-to-end voice enhancement method for low-resource Tujia language |
CN110675891A (en) * | 2019-09-25 | 2020-01-10 | 电子科技大学 | Voice separation method and module based on multilayer attention mechanism |
CN110718232A (en) * | 2019-09-23 | 2020-01-21 | 东南大学 | Speech enhancement method for generating countermeasure network based on two-dimensional spectrogram and condition |
CN110853663A (en) * | 2019-10-12 | 2020-02-28 | 平安科技(深圳)有限公司 | Speech enhancement method based on artificial intelligence, server and storage medium |
CN111128197A (en) * | 2019-12-25 | 2020-05-08 | 北京邮电大学 | Multi-speaker voice separation method based on voiceprint features and generative adversarial learning |
CN111243569A (en) * | 2020-02-24 | 2020-06-05 | 浙江工业大学 | Automatic emotional speech generation method and device based on a generative adversarial network |
CN111261147A (en) * | 2020-01-20 | 2020-06-09 | 浙江工业大学 | Music-embedding attack defense method for speech recognition systems |
CN111276132A (en) * | 2020-02-04 | 2020-06-12 | 北京声智科技有限公司 | Voice processing method, electronic equipment and computer readable storage medium |
CN111341304A (en) * | 2020-02-28 | 2020-06-26 | 广州国音智能科技有限公司 | Method, device and equipment for training speech characteristics of speaker based on GAN |
CN111524526A (en) * | 2020-05-14 | 2020-08-11 | 中国工商银行股份有限公司 | Voiceprint recognition method and device |
CN111785281A (en) * | 2020-06-17 | 2020-10-16 | 国家计算机网络与信息安全管理中心 | Voiceprint recognition method and system based on channel compensation |
CN111862989A (en) * | 2020-06-01 | 2020-10-30 | 北京捷通华声科技股份有限公司 | Acoustic feature processing method and device |
CN111883091A (en) * | 2020-07-09 | 2020-11-03 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio noise reduction method and training method of audio noise reduction model |
CN112216300A (en) * | 2020-09-25 | 2021-01-12 | 三一专用汽车有限责任公司 | Noise reduction method and device for sound in driving cab of mixer truck and mixer truck |
CN112259112A (en) * | 2020-09-28 | 2021-01-22 | 上海声瀚信息科技有限公司 | Echo cancellation method combining voiceprint recognition and deep learning |
CN112687275A (en) * | 2020-12-25 | 2021-04-20 | 北京中科深智科技有限公司 | Voice filtering method and filtering system |
CN112802491A (en) * | 2021-02-07 | 2021-05-14 | 武汉大学 | Speech enhancement method based on a time-frequency domain generative adversarial network |
CN112989108A (en) * | 2021-02-24 | 2021-06-18 | 腾讯科技(深圳)有限公司 | Language detection method and device based on artificial intelligence and electronic equipment |
CN113035217A (en) * | 2021-03-01 | 2021-06-25 | 武汉大学 | Voice enhancement method based on voiceprint embedding under low signal-to-noise ratio condition |
WO2021203880A1 (en) * | 2020-04-10 | 2021-10-14 | 华为技术有限公司 | Speech enhancement method, neural network training method, and related device |
CN113571084A (en) * | 2021-07-08 | 2021-10-29 | 咪咕音乐有限公司 | Audio processing method, device, equipment and storage medium |
CN113707168A (en) * | 2021-09-03 | 2021-11-26 | 合肥讯飞数码科技有限公司 | Voice enhancement method, device, equipment and storage medium |
CN113724713A (en) * | 2021-09-07 | 2021-11-30 | 科大讯飞股份有限公司 | Voice recognition method, device, equipment and storage medium |
CN113823293A (en) * | 2021-09-28 | 2021-12-21 | 武汉理工大学 | Speaker recognition method and system based on voice enhancement |
WO2022077305A1 (en) * | 2020-10-15 | 2022-04-21 | Beijing Didi Infinity Technology And Development Co., Ltd. | Method and system for acoustic echo cancellation |
CN114609493A (en) * | 2022-05-09 | 2022-06-10 | 杭州兆华电子股份有限公司 | Partial discharge signal identification method with enhanced signal data |
JP2022536190A (en) * | 2020-04-28 | 2022-08-12 | 平安科技(深圳)有限公司 | Voiceprint Recognition Method, Apparatus, Equipment, and Storage Medium |
US20220358904A1 (en) * | 2019-03-20 | 2022-11-10 | Research Foundation Of The City University Of New York | Method for extracting speech from degraded signals by predicting the inputs to a speech vocoder |
US11514925B2 (en) * | 2020-04-30 | 2022-11-29 | Adobe Inc. | Using a predictive model to automatically enhance audio having various audio quality issues |
WO2023020500A1 (en) * | 2021-08-17 | 2023-02-23 | 中移(苏州)软件技术有限公司 | Speech separation method and apparatus, and storage medium |
WO2023102930A1 (en) * | 2021-12-10 | 2023-06-15 | 清华大学深圳国际研究生院 | Speech enhancement method, electronic device, program product, and storage medium |
CN116458894A (en) * | 2023-04-21 | 2023-07-21 | 山东省人工智能研究院 | Electrocardiogram signal enhancement and classification method based on a composite generative adversarial network |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1327976A1 (en) * | 2001-12-21 | 2003-07-16 | Cortologic AG | Method and system for recognition of speech in a noisy environment |
WO2017168870A1 (en) * | 2016-03-28 | 2017-10-05 | ソニー株式会社 | Information processing device and information processing method |
CN108074244A (en) * | 2017-09-07 | 2018-05-25 | 汉鼎宇佑互联网股份有限公司 | Safe-city traffic flow statistics method fusing deep learning and background subtraction |
CN108597496A (en) * | 2018-05-07 | 2018-09-28 | 广州势必可赢网络科技有限公司 | Speech generation method and device based on a generative adversarial network |
CN108682418A (en) * | 2018-06-26 | 2018-10-19 | 北京理工大学 | Speech recognition method based on pre-training and bidirectional LSTM |
2018-11-14: Application CN201811353760.5A filed in China; granted as CN109326302B (status: Active)
Cited By (55)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220358904A1 (en) * | 2019-03-20 | 2022-11-10 | Research Foundation Of The City University Of New York | Method for extracting speech from degraded signals by predicting the inputs to a speech vocoder |
CN110164470A (en) * | 2019-06-12 | 2019-08-23 | 成都嗨翻屋科技有限公司 | Voice separation method, device, user terminal and storage medium |
CN110289004A (en) * | 2019-06-18 | 2019-09-27 | 暨南大学 | Artificially synthesized voiceprint detection system and method based on deep learning |
CN110289004B (en) * | 2019-06-18 | 2021-09-07 | 暨南大学 | Artificial synthesis voiceprint detection system and method based on deep learning |
CN110211591A (en) * | 2019-06-24 | 2019-09-06 | 卓尔智联(武汉)研究院有限公司 | Interview data analysing method, computer installation and medium based on emotional semantic classification |
CN110211591B (en) * | 2019-06-24 | 2021-12-21 | 卓尔智联(武汉)研究院有限公司 | Interview data analysis method based on emotion classification, computer device and medium |
CN110619885A (en) * | 2019-08-15 | 2019-12-27 | 西北工业大学 | Generative adversarial network speech enhancement method based on a deep fully convolutional neural network |
CN110619885B (en) * | 2019-08-15 | 2022-02-11 | 西北工业大学 | Generative adversarial network speech enhancement method based on a deep fully convolutional neural network |
CN110718232A (en) * | 2019-09-23 | 2020-01-21 | 东南大学 | Speech enhancement method based on a two-dimensional spectrogram and a conditional generative adversarial network |
CN110675891A (en) * | 2019-09-25 | 2020-01-10 | 电子科技大学 | Voice separation method and module based on multilayer attention mechanism |
CN110675891B (en) * | 2019-09-25 | 2020-09-18 | 电子科技大学 | Voice separation method and module based on multilayer attention mechanism |
CN110619886B (en) * | 2019-10-11 | 2022-03-22 | 北京工商大学 | End-to-end voice enhancement method for low-resource Tujia language |
CN110619886A (en) * | 2019-10-11 | 2019-12-27 | 北京工商大学 | End-to-end voice enhancement method for low-resource Tujia language |
CN110853663A (en) * | 2019-10-12 | 2020-02-28 | 平安科技(深圳)有限公司 | Speech enhancement method based on artificial intelligence, server and storage medium |
CN110853663B (en) * | 2019-10-12 | 2023-04-28 | 平安科技(深圳)有限公司 | Speech enhancement method based on artificial intelligence, server and storage medium |
WO2021068338A1 (en) * | 2019-10-12 | 2021-04-15 | 平安科技(深圳)有限公司 | Speech enhancement method based on artificial intelligence, server and storage medium |
CN111128197B (en) * | 2019-12-25 | 2022-05-13 | 北京邮电大学 | Multi-speaker voice separation method based on voiceprint features and generative adversarial learning |
CN111128197A (en) * | 2019-12-25 | 2020-05-08 | 北京邮电大学 | Multi-speaker voice separation method based on voiceprint features and generative adversarial learning |
CN111261147A (en) * | 2020-01-20 | 2020-06-09 | 浙江工业大学 | Music-embedding attack defense method for speech recognition systems |
CN111276132A (en) * | 2020-02-04 | 2020-06-12 | 北京声智科技有限公司 | Voice processing method, electronic equipment and computer readable storage medium |
CN111243569B (en) * | 2020-02-24 | 2022-03-08 | 浙江工业大学 | Automatic emotional speech generation method and device based on a generative adversarial network |
CN111243569A (en) * | 2020-02-24 | 2020-06-05 | 浙江工业大学 | Automatic emotional speech generation method and device based on a generative adversarial network |
CN111341304A (en) * | 2020-02-28 | 2020-06-26 | 广州国音智能科技有限公司 | Method, device and equipment for training speech characteristics of speaker based on GAN |
WO2021203880A1 (en) * | 2020-04-10 | 2021-10-14 | 华为技术有限公司 | Speech enhancement method, neural network training method, and related device |
JP2022536190A (en) * | 2020-04-28 | 2022-08-12 | 平安科技(深圳)有限公司 | Voiceprint Recognition Method, Apparatus, Equipment, and Storage Medium |
JP7184236B2 (en) | 2020-04-28 | 2022-12-06 | 平安科技(深圳)有限公司 | Voiceprint Recognition Method, Apparatus, Equipment, and Storage Medium |
US11514925B2 (en) * | 2020-04-30 | 2022-11-29 | Adobe Inc. | Using a predictive model to automatically enhance audio having various audio quality issues |
CN111524526A (en) * | 2020-05-14 | 2020-08-11 | 中国工商银行股份有限公司 | Voiceprint recognition method and device |
CN111524526B (en) * | 2020-05-14 | 2023-11-17 | 中国工商银行股份有限公司 | Voiceprint recognition method and voiceprint recognition device |
CN111862989B (en) * | 2020-06-01 | 2024-03-08 | 北京捷通华声科技股份有限公司 | Acoustic feature processing method and device |
CN111862989A (en) * | 2020-06-01 | 2020-10-30 | 北京捷通华声科技股份有限公司 | Acoustic feature processing method and device |
CN111785281A (en) * | 2020-06-17 | 2020-10-16 | 国家计算机网络与信息安全管理中心 | Voiceprint recognition method and system based on channel compensation |
CN111883091A (en) * | 2020-07-09 | 2020-11-03 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio noise reduction method and training method of audio noise reduction model |
CN112216300A (en) * | 2020-09-25 | 2021-01-12 | 三一专用汽车有限责任公司 | Noise reduction method and device for sound in driving cab of mixer truck and mixer truck |
CN112259112A (en) * | 2020-09-28 | 2021-01-22 | 上海声瀚信息科技有限公司 | Echo cancellation method combining voiceprint recognition and deep learning |
CN115668366A (en) * | 2020-10-15 | 2023-01-31 | 北京嘀嘀无限科技发展有限公司 | Acoustic echo cancellation method and system |
WO2022077305A1 (en) * | 2020-10-15 | 2022-04-21 | Beijing Didi Infinity Technology And Development Co., Ltd. | Method and system for acoustic echo cancellation |
CN112687275A (en) * | 2020-12-25 | 2021-04-20 | 北京中科深智科技有限公司 | Voice filtering method and filtering system |
CN112802491B (en) * | 2021-02-07 | 2022-06-14 | 武汉大学 | Speech enhancement method based on a time-frequency domain generative adversarial network |
CN112802491A (en) * | 2021-02-07 | 2021-05-14 | 武汉大学 | Speech enhancement method based on a time-frequency domain generative adversarial network |
CN112989108A (en) * | 2021-02-24 | 2021-06-18 | 腾讯科技(深圳)有限公司 | Language detection method and device based on artificial intelligence and electronic equipment |
CN113035217A (en) * | 2021-03-01 | 2021-06-25 | 武汉大学 | Voice enhancement method based on voiceprint embedding under low signal-to-noise ratio condition |
CN113035217B (en) * | 2021-03-01 | 2023-11-10 | 武汉大学 | Voice enhancement method based on voiceprint embedding under low signal-to-noise ratio condition |
CN113571084A (en) * | 2021-07-08 | 2021-10-29 | 咪咕音乐有限公司 | Audio processing method, device, equipment and storage medium |
CN113571084B (en) * | 2021-07-08 | 2024-03-22 | 咪咕音乐有限公司 | Audio processing method, device, equipment and storage medium |
WO2023020500A1 (en) * | 2021-08-17 | 2023-02-23 | 中移(苏州)软件技术有限公司 | Speech separation method and apparatus, and storage medium |
CN113707168A (en) * | 2021-09-03 | 2021-11-26 | 合肥讯飞数码科技有限公司 | Voice enhancement method, device, equipment and storage medium |
CN113724713A (en) * | 2021-09-07 | 2021-11-30 | 科大讯飞股份有限公司 | Voice recognition method, device, equipment and storage medium |
CN113823293B (en) * | 2021-09-28 | 2024-04-26 | 武汉理工大学 | Speaker recognition method and system based on voice enhancement |
CN113823293A (en) * | 2021-09-28 | 2021-12-21 | 武汉理工大学 | Speaker recognition method and system based on voice enhancement |
WO2023102930A1 (en) * | 2021-12-10 | 2023-06-15 | 清华大学深圳国际研究生院 | Speech enhancement method, electronic device, program product, and storage medium |
CN114609493B (en) * | 2022-05-09 | 2022-08-12 | 杭州兆华电子股份有限公司 | Partial discharge signal identification method with enhanced signal data |
CN114609493A (en) * | 2022-05-09 | 2022-06-10 | 杭州兆华电子股份有限公司 | Partial discharge signal identification method with enhanced signal data |
CN116458894B (en) * | 2023-04-21 | 2024-01-26 | 山东省人工智能研究院 | Electrocardiogram signal enhancement and classification method based on a composite generative adversarial network |
CN116458894A (en) * | 2023-04-21 | 2023-07-21 | 山东省人工智能研究院 | Electrocardiogram signal enhancement and classification method based on a composite generative adversarial network |
Also Published As
Publication number | Publication date |
---|---|
CN109326302B (en) | 2022-11-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109326302A (en) | Speech enhancement method based on voiceprint comparison and generative adversarial network | |
CN104732978B (en) | Text-dependent speaker recognition method based on combined deep learning | |
CN105632501A (en) | Deep-learning-technology-based automatic accent classification method and apparatus | |
Hui et al. | Convolutional maxout neural networks for speech separation | |
Kinoshita et al. | Text-informed speech enhancement with deep neural networks. | |
Xu et al. | Global variance equalization for improving deep neural network based speech enhancement | |
CN108615533A (en) | High-performance speech enhancement method based on deep learning | |
CN108962229A (en) | Single-channel, unsupervised target speaker voice extraction method | |
CN110136709A (en) | Speech recognition method and video conferencing system based on speech recognition | |
CN112331218B (en) | Single-channel voice separation method and device for multiple speakers | |
CN111899757A (en) | Single-channel voice separation method and system for target speaker extraction | |
Do et al. | Speech source separation using variational autoencoder and bandpass filter | |
Soleymani et al. | Prosodic-enhanced siamese convolutional neural networks for cross-device text-independent speaker verification | |
CN111798875A (en) | VAD implementation method based on three-value quantization compression | |
CN106297769B (en) | Discriminative feature extraction method applied to language identification | |
Bhardwaj et al. | Deep neural network trained Punjabi children speech recognition system using Kaldi toolkit | |
CN105845143A (en) | Speaker confirmation method and speaker confirmation system based on support vector machine | |
CN110246518A (en) | Speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic and static fusion features | |
Sekkate et al. | Speaker identification for OFDM-based aeronautical communication system | |
Gadasin et al. | Using Formants for Human Speech Recognition by Artificial Intelligence | |
Tan et al. | Denoised senone i-vectors for robust speaker verification | |
CN112466276A (en) | Speech synthesis system training method and device and readable storage medium | |
Le et al. | Personalized speech enhancement combining band-split rnn and speaker attentive module | |
Wang et al. | Robust speech recognition from ratio masks | |
CN111785262B (en) | Speaker age and gender classification method based on residual error network and fusion characteristics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||