CN108574911A - Unsupervised single-microphone speech denoising method and system - Google Patents

Unsupervised single-microphone speech denoising method and system

Info

Publication number
CN108574911A
Authority
CN
China
Prior art keywords
voice
noise
matrix
dictionary
present frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710137778.0A
Other languages
Chinese (zh)
Other versions
CN108574911B (en)
Inventor
李军锋
李煦
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS filed Critical Institute of Acoustics CAS
Priority to CN201710137778.0A
Publication of CN108574911A
Application granted
Publication of CN108574911B
Legal status: Active
Anticipated expiration

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 - Circuits for transducers, loudspeakers or microphones
    • H04R3/04 - Circuits for transducers, loudspeakers or microphones for correcting frequency response
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2430/00 - Signal processing covered by H04R, not provided for in its groups

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses an unsupervised single-microphone speech denoising method. The method includes: step 1) extracting spectra from collected speech training data that covers all phonemes, applying k-means clustering to the magnitude spectra to obtain a speech dictionary for each class, and then combining all class-specific speech dictionaries into one complete speech dictionary W_S; step 2) applying a short-time Fourier transform to the noisy signal arriving at the current time to obtain the current-frame magnitude spectrum x_t, combining it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t], combining the noise matrix W_N estimated in the previous frame with the speech dictionary W_S into the total dictionary matrix W = [W_S W_N], and performing iterative non-negative matrix factorization on X to obtain the estimated noise matrix and the speech and noise weight vector of the current frame; step 3) reconstructing the denoised speech signal of the current frame from the estimated noise matrix and noise weight vector.

Description

Unsupervised single-microphone speech denoising method and system
Technical field
The present invention relates to the field of speech signal processing, and more particularly to an unsupervised single-microphone speech denoising method and system.
Background technology
In many application scenarios (such as voice communication, automatic speech recognition and hearing aids), the speech signal is inevitably corrupted by ambient noise, for example road noise, wind noise and circuit noise, so algorithms must be designed to denoise the noisy signal picked up by the device. Many hearing devices (or instruments) have only one microphone to pick up the speech signal, so the algorithm has to remove the noise from a single noisy signal, which further increases the difficulty of the problem.
A traditional single-microphone speech denoising algorithm mainly consists of two parts: noise estimation and gain computation (a minimal sketch is given below). Such algorithms usually assume that the noise is stationary, so they suppress stationary noise well; in many cases, however, the noise is non-stationary and hard to estimate accurately, which leads to poor denoising performance. In recent years, data-driven single-microphone denoising algorithms, such as those based on non-negative matrix factorization (NMF), have attracted wide attention because they suppress non-stationary noise better.
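For illustration only, a minimal sketch of the traditional two-part scheme in Python, assuming the noisy magnitude spectrogram is already available as a NumPy array and that the first few frames contain only noise; the function name, the number of noise-only frames and the gain floor are illustrative choices, not taken from the patent.

```python
import numpy as np

def classic_gain_denoise(noisy_mag, noise_frames=10, floor=0.05):
    """Traditional two-part denoiser: (1) estimate a stationary noise
    spectrum from the first few frames, (2) compute a per-bin gain and
    apply it to every frame.  noisy_mag has shape (freq_bins, frames)."""
    noise_est = noisy_mag[:, :noise_frames].mean(axis=1, keepdims=True)
    snr = np.maximum(noisy_mag**2 - noise_est**2, 0.0) / (noise_est**2 + 1e-12)
    gain = np.maximum(snr / (1.0 + snr), floor)   # Wiener-type gain with a floor
    return gain * noisy_mag
```

Because the noise estimate is computed once and assumed stationary, this kind of scheme degrades quickly when the noise changes over time, which is exactly the limitation discussed above.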
In NMF-based algorithms, non-negative matrix factorization is first applied to speech and noise training data to obtain the corresponding dictionary matrices, which describe the spectral structure of speech and noise. In the denoising stage, the noisy signal is decomposed into the product of the dictionary matrix and a weight matrix, and the denoised speech signal is finally reconstructed by multiplying the speech dictionary with its corresponding weights (a sketch of this supervised baseline follows this paragraph). Because this class of algorithms does not rely on a stationary-noise assumption, it can suppress non-stationary noise well, which is favourable for practical use. It also has limitations: it needs training data for the specific speaker and the specific noise type, and matched training data are hard to obtain in advance in many scenarios, which limits its applicability. Moreover, such algorithms typically denoise a whole utterance at once, whereas practical applications require real-time processing.
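A minimal sketch of this conventional supervised NMF baseline, assuming the speech and noise dictionaries W_speech and W_noise have been trained beforehand on matched data and that magnitude spectrograms are available as NumPy arrays; all names and parameter values here are illustrative.

```python
import numpy as np

def estimate_weights(X, W, n_iter=100, eps=1e-12, seed=0):
    """Fit non-negative weights H (atoms x frames) to the magnitude
    spectrogram X (freq x frames) for a FIXED dictionary W, using
    KL-divergence multiplicative updates."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], X.shape[1])) + eps
    ones = np.ones_like(X)
    for _ in range(n_iter):
        H *= (W.T @ (X / (W @ H + eps))) / (W.T @ ones + eps)
    return H

def supervised_nmf_denoise(X, W_speech, W_noise, eps=1e-12):
    """Batch supervised NMF denoising: decompose the whole noisy
    spectrogram against pre-trained speech and noise dictionaries and
    rebuild the speech part with a Wiener-style mask."""
    W = np.hstack([W_speech, W_noise])   # combined dictionary [W_S W_N]
    H = estimate_weights(X, W)
    k = W_speech.shape[1]
    S = W_speech @ H[:k]                 # reconstructed speech spectrogram
    N = W_noise @ H[k:]                  # reconstructed noise spectrogram
    return X * S / (S + N + eps)         # masked (denoised) magnitude
```

Note that both dictionaries are fixed here and the whole utterance is processed at once; the method of the invention instead learns the noise dictionary online, frame by frame, and needs no noise training data.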
Summary of the invention
The object of the present invention is to overcome the limitation of traditional NMF-based denoising algorithms, which depend on training data for a specific speaker and specific noise types, by proposing an unsupervised single-microphone speech denoising method that is convenient to apply in real scenarios.
To achieve the above goal, the present invention provides an unsupervised single-microphone speech denoising method, the method including:
Step 1) extract spectra from the collected speech training data covering all phonemes, then apply k-means clustering to the magnitude spectra to obtain a speech dictionary for each class; then combine all class-specific speech dictionaries into one complete speech dictionary W_S;
Step 2) apply a short-time Fourier transform to the noisy signal arriving at the current time to obtain the current-frame magnitude spectrum x_t, then combine it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t]; combine the noise matrix W_N estimated in the previous frame with the speech dictionary W_S into the total dictionary matrix W = [W_S W_N]; perform iterative non-negative matrix factorization on the output speech spectrum X to obtain the estimated noise matrix and the speech and noise weight vector of the current frame;
Step 3) reconstruct the denoised speech signal of the current frame from the estimated noise matrix and noise weight vector.
In the above technical solution, step 1) specifically includes:
Step 101) collect a large amount of clean speech as training data; the collected training data should cover all phonemes;
Step 102) pre-process the collected speech training data and then extract its spectrum;
Step 103) apply k-means clustering to the magnitude spectra of the training data so that speech frames with similar spectral structure are grouped into one class, obtaining the class magnitude spectra S^(g), g = 1, ..., G, where G is the total number of clusters;
Step 104) apply non-negative matrix factorization separately to each class magnitude spectrum S^(g), g = 1, ..., G, obtained by clustering, yielding a speech dictionary for each class; then combine all class-specific speech dictionaries into one complete speech dictionary.
In the above technical solution, the specific implementation of step 104) is:
First, apply the following non-negative matrix factorization to each class of clustered magnitude spectra:
S^(g) ≈ W_S^(g) H^(g), g = 1, ..., G
where S^(g) is the speech spectrum belonging to class g and W_S^(g) is the dictionary matrix of class g obtained from the decomposition, which describes one kind of spectral structure;
All class-specific speech dictionaries are then combined into one complete speech dictionary:
W_S = [W_S^(1), W_S^(2), ..., W_S^(G)].
In the above technical solution, step 2) specifically includes:
Step 2-1) for the current-frame noisy magnitude spectrum x_t: if x_t arrives before the L-th frame, output it directly without processing; otherwise combine it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t];
Step 2-2) combine the noise matrix W_N estimated in the previous frame with the speech dictionary W_S into the total dictionary matrix W = [W_S W_N]; the initial value of W_N is a random non-negative matrix; initialize the current-frame weight vector h_t with random non-negative values and combine it with the weight vectors estimated for the previous L frames into the weight matrix H = [h_{t-L}, ..., h_{t-1}, h_t], where h_t = [h_{S,t}^T h_{N,t}^T]^T, h_{S,t} contains the current frame's speech weights for the G sub-dictionaries, and h_{N,t} is the noise weight vector;
Step 2-3) once X, W and H are determined, compute the similarity value V between X and WH:
V = X ./ (WH)
where ./ denotes point-wise division;
Step 2-4) take the last column v_t of V and update the current frame's weight vector h_t with a multiplicative rule, where .* denotes point-wise multiplication;
Step 2-5) apply a sparsity penalty to the speech part h_{S,t} of h_t and update it, where λ and ε are coefficients;
Step 2-6) update the current-frame noise matrix W_N and normalize it;
Step 2-7) check whether the W_N obtained in step 2-6) has converged; if so, go to step 3); otherwise replace W_N with the newly updated noise matrix and return to step 2-3).
In the above technical solution, step 3) specifically includes:
Step 3-1) after the noise matrix and the current-frame weight vector have been estimated in step 2), reconstruct the separated current-frame speech spectrum s_t = W_S h_{S,t} and noise spectrum n_t = W_N h_{N,t};
Step 3-2) obtain the final denoised speech spectrum by Wiener filtering, i.e. by applying a gain computed from the reconstructed speech and noise spectra to the noisy spectrum;
Step 3-3) combine the denoised speech spectrum with the phase of the current noisy frame and recover the denoised time-domain waveform by the inverse Fourier transform.
The present invention also provides an unsupervised single-microphone speech denoising system, the system including:
a speech dictionary generation module for extracting spectra from the collected speech training data covering all phonemes, applying k-means clustering to the magnitude spectra to obtain a speech dictionary for each class, and then combining all class-specific speech dictionaries into one complete speech dictionary W_S;
a noise dictionary generation module for applying a short-time Fourier transform to the noisy signal arriving at the current time to obtain the current-frame magnitude spectrum x_t, combining it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t], combining the noise matrix W_N estimated in the previous frame with the speech dictionary W_S into the total dictionary matrix W = [W_S W_N], and performing iterative non-negative matrix factorization on the output speech spectrum X to obtain the estimated noise matrix and the speech and noise weight vector of the current frame; and
a denoising module for reconstructing the denoised speech signal of the current frame from the estimated noise matrix and the current frame's speech and noise weight vector.
In the above technical solution, the specific implementation of the speech dictionary generation module is:
Step 101) collect a large amount of clean speech as training data; the collected training data should cover all phonemes;
Step 102) pre-process the collected speech training data and then extract its spectrum;
Step 103) apply k-means clustering to the magnitude spectra of the training data so that speech frames with similar spectral structure are grouped into one class, obtaining the class magnitude spectra S^(g), g = 1, ..., G, where G is the total number of clusters;
Step 104) apply non-negative matrix factorization separately to each class magnitude spectrum S^(g), g = 1, ..., G, obtained by clustering, yielding a speech dictionary for each class; then combine all class-specific speech dictionaries into one complete speech dictionary:
First, apply the following non-negative matrix factorization to each class of clustered magnitude spectra:
S^(g) ≈ W_S^(g) H^(g), g = 1, ..., G
where S^(g) is the speech spectrum belonging to class g and W_S^(g) is the dictionary matrix of class g obtained from the decomposition, which describes one kind of spectral structure;
All class-specific speech dictionaries are then combined into one complete speech dictionary:
W_S = [W_S^(1), W_S^(2), ..., W_S^(G)].
In the above technical solution, the specific implementation of the noise dictionary generation module is:
Step 2-1) for the current-frame noisy magnitude spectrum x_t: if x_t arrives before the L-th frame, output it directly without processing; otherwise combine it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t];
Step 2-2) combine the noise matrix W_N estimated in the previous frame with the speech dictionary W_S into the total dictionary matrix W = [W_S W_N]; the initial value of W_N is a random non-negative matrix; initialize the current-frame weight vector h_t with random non-negative values and combine it with the weight vectors estimated for the previous L frames into the weight matrix H = [h_{t-L}, ..., h_{t-1}, h_t], where h_t = [h_{S,t}^T h_{N,t}^T]^T, h_{S,t} contains the current frame's speech weights for the G sub-dictionaries, and h_{N,t} is the noise weight vector;
Step 2-3) once X, W and H are determined, compute the similarity value V between X and WH:
V = X ./ (WH)
where ./ denotes point-wise division;
Step 2-4) take the last column v_t of V and update the current frame's weight vector h_t with a multiplicative rule, where .* denotes point-wise multiplication;
Step 2-5) apply a sparsity penalty to the speech part h_{S,t} of h_t and update it, where λ and ε are coefficients;
Step 2-6) update the current-frame noise matrix W_N and normalize it;
Step 2-7) check whether the W_N obtained in step 2-6) has converged; if so, go to step 3); otherwise replace W_N with the newly updated noise matrix and return to step 2-3).
In the above technical solution, the specific implementation of the denoising module is:
Step 3-1) after the noise matrix and the current-frame weight vector have been estimated in step 2), reconstruct the separated current-frame speech spectrum s_t = W_S h_{S,t} and noise spectrum n_t = W_N h_{N,t};
Step 3-2) obtain the final denoised speech spectrum by Wiener filtering, i.e. by applying a gain computed from the reconstructed speech and noise spectra to the noisy spectrum;
Step 3-3) combine the denoised speech spectrum with the phase of the current noisy frame and recover the denoised time-domain waveform by the inverse Fourier transform.
The advantages of the invention are:
1. The method of the invention removes the dependence on speaker-specific and noise-specific training data, which widens the range of applications of the algorithm;
2. The invention realizes an online denoising algorithm, which gives it strong practicality.
Description of the drawings
Fig. 1 is a flow chart of the unsupervised single-microphone speech denoising method of the present invention;
Fig. 2 illustrates the online NMF decomposition algorithm proposed by the present invention.
Detailed description of the embodiments
The present invention is described in more detail below with reference to the drawings and specific embodiments.
As shown in Fig. 1, an unsupervised single-microphone speech denoising method includes:
Step 1) extract spectra from the collected speech training data covering all phonemes, then apply k-means clustering to the magnitude spectra to obtain a speech dictionary for each class; then combine all class-specific speech dictionaries into one complete speech dictionary; this specifically includes:
Step 101) collect a large amount of clean speech as training data;
The training data can be obtained from the many open-source speech corpora, and the collected data should cover all phonemes;
Step 102) pre-process the collected speech training data and then extract the spectrum of the speech signal;
Pre-processing of the speech signal includes: first zero-padding each frame of the speech signal to N points, where N = 2^i, i is an integer and i ≥ 8; then windowing or pre-emphasizing each frame, using a Hamming or Hanning window as the window function. A minimal sketch of this pre-processing is given below.
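A minimal pre-processing sketch in Python, assuming a 16 kHz sampling rate; the frame length, hop size and FFT size (N = 512 = 2^9) are illustrative choices, since the patent only requires N = 2^i with i ≥ 8.

```python
import numpy as np

def frame_magnitude_spectra(signal, frame_len=400, hop=160, n_fft=512):
    """Split a waveform into frames, apply a Hamming window (a Hanning
    window works equally well), zero-pad each frame to n_fft points and
    return magnitude spectra (freq_bins x frames) plus the phases needed
    later to resynthesize the denoised waveform."""
    window = np.hamming(frame_len)
    n_frames = max(0, 1 + (len(signal) - frame_len) // hop)
    mags, phases = [], []
    for k in range(n_frames):
        frame = signal[k * hop:k * hop + frame_len] * window
        spec = np.fft.rfft(frame, n=n_fft)   # zero-pads the frame to n_fft
        mags.append(np.abs(spec))
        phases.append(np.angle(spec))
    return np.array(mags).T, np.array(phases).T
```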
Step 103) apply k-means clustering to the magnitude spectra of the training data so that speech frames with similar spectral structure are grouped into one class, obtaining the class magnitude spectra S^(g), g = 1, ..., G, where G is the total number of clusters;
Step 104) apply non-negative matrix factorization separately to each class magnitude spectrum S^(g), g = 1, ..., G, obtained by clustering, yielding a speech dictionary for each class; then combine all class-specific speech dictionaries into one complete speech dictionary;
First, apply the following non-negative matrix factorization to each class of clustered magnitude spectra:
S^(g) ≈ W_S^(g) H^(g), g = 1, ..., G
where S^(g) is the speech spectrum belonging to class g and W_S^(g) is the dictionary matrix of class g obtained from the decomposition, which describes one kind of spectral structure (for example, each sub-dictionary describes one phoneme);
All class-specific speech dictionaries are then combined into one complete speech dictionary:
W_S = [W_S^(1), W_S^(2), ..., W_S^(G)].
In this way, the speech dictionary obtained by the above clustering and decomposition describes the speech components in noisy speech; a minimal sketch of this dictionary-learning step follows.
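A minimal dictionary-learning sketch, assuming scikit-learn's KMeans for the clustering step; the number of clusters G and the number of atoms per class are illustrative values, and the plain KL-divergence NMF used here stands in for whatever NMF variant an implementation might prefer.

```python
import numpy as np
from sklearn.cluster import KMeans

def nmf_kl(V, rank, n_iter=100, eps=1e-12, seed=0):
    """Plain KL-divergence NMF with Lee-Seung multiplicative updates:
    V (freq x frames) is approximated by W (freq x rank) @ H (rank x frames)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
    return W, H

def build_speech_dictionary(S_train, n_clusters=32, atoms_per_class=5):
    """Cluster training magnitude spectra (freq x frames) with k-means,
    run NMF on each cluster separately, and concatenate the per-class
    dictionaries into one complete speech dictionary W_S."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(S_train.T)  # cluster frames
    dictionaries = []
    for g in range(n_clusters):
        S_g = S_train[:, labels == g]          # all frames assigned to class g
        if S_g.shape[1] == 0:
            continue                           # skip empty clusters
        W_g, _ = nmf_kl(S_g, atoms_per_class)
        W_g /= W_g.sum(axis=0, keepdims=True) + 1e-12   # normalize each atom
        dictionaries.append(W_g)
    return np.hstack(dictionaries)             # W_S = [W_S^(1) ... W_S^(G)]
```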
Step 2) apply a short-time Fourier transform (STFT) to the noisy signal arriving at the current time to obtain the current-frame magnitude spectrum x_t, then combine it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t], and perform iterative non-negative matrix factorization (NMF) on the output speech spectrum to obtain the noise dictionary and the speech and noise weight vector of the current frame;
As shown in Fig. 2, step 2) specifically includes:
Step 2-1) for the current-frame noisy magnitude spectrum x_t: if x_t arrives before the L-th frame, output it directly without processing; otherwise combine it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t];
Step 2-2) combine the noise matrix W_N estimated in the previous frame with the speech dictionary W_S into the total dictionary matrix W = [W_S W_N]; the initial value of W_N is a random non-negative matrix; initialize the current-frame weight vector h_t with random non-negative values and combine it with the weight vectors estimated for the previous L frames into the weight matrix H = [h_{t-L}, ..., h_{t-1}, h_t], where h_t = [h_{S,t}^T h_{N,t}^T]^T, h_{S,t} contains the current frame's speech weights for the sub-dictionaries, and h_{N,t} is the noise weight vector;
Step 2-3) once X, W and H are determined, compute the similarity value V between X and WH:
V = X ./ (WH)
where ./ denotes point-wise division;
Step 2-4) take the last column v_t of V and update the current frame's weight vector h_t with a multiplicative rule, where .* denotes point-wise multiplication;
Step 2-5) apply a sparsity penalty to the speech part h_{S,t} of h_t and update it, where λ and ε are coefficients;
Step 2-6) update the noise matrix W_N and normalize it;
In the above steps, ./ denotes point-wise division and .* denotes point-wise multiplication.
Step 2-7) check whether the W_N obtained in step 2-6) has converged; if so, go to step 2-8); otherwise replace W_N with the newly updated noise matrix and return to step 2-3);
In this example, W_N is generally taken to have converged after 50 iterations, and the iteration then stops.
Step 2-8) update the buffer of the L previously processed speech spectra: add the newly processed frame and remove the earliest one, and store the noise matrix W_N and the weight vector h_t for processing the next frame. A sketch of this per-frame online NMF procedure is given below.
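A sketch of one pass of this online NMF stage, assuming Python with NumPy. The exact update formulas for h_t, the sparsity penalty (λ, ε) and W_N appear as equations in the original patent document and are not reproduced here; the code below uses standard KL-divergence multiplicative updates with a simple L1-style penalty on the speech weights as a plausible stand-in, so it should be read as illustrative rather than as the patented update rules.

```python
import numpy as np

def online_nmf_step(x_t, X_buf, W_S, W_N, H_buf, n_iter=50,
                    lam=0.1, eps=1e-12, seed=None):
    """One frame of the online NMF stage (steps 2-1 .. 2-8).
    x_t   : current-frame magnitude spectrum, shape (freq_bins,)
    X_buf : list of the L previously processed magnitude spectra
    W_S   : fixed speech dictionary; W_N: noise dictionary from the last frame
    H_buf : list of the weight vectors estimated for the previous L frames
    Returns the updated noise dictionary and the current frame's
    speech and noise weight vectors."""
    rng = np.random.default_rng(seed)
    X = np.column_stack(list(X_buf) + [x_t])        # X = [x_{t-L} ... x_t]
    W = np.hstack([W_S, W_N])                        # total dictionary [W_S W_N]
    k = W_S.shape[1]                                 # number of speech atoms
    h_t = rng.random(W.shape[1]) + eps               # random init for frame t
    H = np.column_stack(list(H_buf) + [h_t])         # past weights are reused
    ones_X = np.ones_like(X)
    for _ in range(n_iter):
        V = X / (W @ H + eps)                        # point-wise ratio X ./ (WH)
        v_t = V[:, -1]                               # last column: current frame
        h_t = H[:, -1] * (W.T @ v_t) / (W.T @ np.ones_like(v_t) + eps)
        h_t[:k] /= (1.0 + lam)                       # sparsity penalty on speech part
        H[:, -1] = h_t
        V = X / (W @ H + eps)                        # recompute with new weights
        W_N = W[:, k:] * ((V @ H[k:].T) / (ones_X @ H[k:].T + eps))
        W_N /= W_N.sum(axis=0, keepdims=True) + eps  # normalize noise atoms
        W[:, k:] = W_N                               # speech dictionary stays fixed
    return W_N, h_t[:k], h_t[k:]
```

In use, the caller keeps X_buf and H_buf as sliding windows of length L (for the first L frames the patent simply outputs the noisy frame unprocessed), appends the newly processed frame and h_t after each call, and carries W_N over to the next frame.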
Step 3) reconstruct the denoised speech signal of the current frame from the estimated noise matrix and weight vector; this specifically includes:
Step 3-1) after the noise matrix and the current-frame weight vector have been estimated in step 2), reconstruct the separated current-frame speech spectrum s_t = W_S h_{S,t} and noise spectrum n_t = W_N h_{N,t};
Step 3-2) obtain the final denoised speech spectrum by Wiener filtering, i.e. by applying a gain computed from the reconstructed speech and noise spectra to the noisy spectrum;
Step 3-3) combine the denoised speech spectrum with the phase of the current noisy frame and recover the denoised time-domain waveform by the inverse Fourier transform; a minimal sketch of this reconstruction step follows.
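A minimal reconstruction sketch matching the pre-processing sketch above; the magnitude-domain mask s / (s + n) is one common Wiener-type gain used with NMF (a power-domain variant s² / (s² + n²) is equally plausible), and the frame parameters are the same illustrative values as before.

```python
import numpy as np

def reconstruct_frame(x_t, phase_t, W_S, W_N_hat, h_S, h_N,
                      frame_len=400, n_fft=512, eps=1e-12):
    """Steps 3-1 .. 3-3: rebuild the separated speech and noise spectra,
    apply a Wiener-type gain to the noisy magnitude, re-attach the noisy
    phase and return the denoised time-domain frame."""
    s_hat = W_S @ h_S                         # reconstructed speech spectrum
    n_hat = W_N_hat @ h_N                     # reconstructed noise spectrum
    gain = s_hat / (s_hat + n_hat + eps)      # Wiener-type mask
    clean_mag = gain * x_t                    # denoised magnitude spectrum
    spec = clean_mag * np.exp(1j * phase_t)   # combine with the noisy phase
    return np.fft.irfft(spec, n=n_fft)[:frame_len]

# The denoised waveform is obtained by overlap-adding the returned frames
# (dividing by the summed squared window if a perfect-reconstruction
# window/hop pair is used).
```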
In addition, the present invention also provides an unsupervised single-microphone speech denoising system, the system including:
a speech dictionary generation module for extracting spectra from the collected speech training data covering all phonemes, applying k-means clustering to the magnitude spectra to obtain a speech dictionary for each class, and then combining all class-specific speech dictionaries into one complete speech dictionary W_S;
a noise dictionary generation module for applying a short-time Fourier transform to the noisy signal arriving at the current time to obtain the current-frame magnitude spectrum x_t, combining it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t], combining the noise matrix W_N estimated in the previous frame with the speech dictionary W_S into the total dictionary matrix W = [W_S W_N], and performing iterative non-negative matrix factorization to obtain the estimated noise matrix and the speech and noise weight vector of the current frame; and
a denoising module for reconstructing the denoised speech signal of the current frame from the estimated noise matrix and the current frame's speech and noise weight vector.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the invention has been described in detail with reference to the embodiments, those of ordinary skill in the art will understand that modifications or equivalent substitutions of the technical solution of the present invention that do not depart from its spirit and scope shall all fall within the scope of the claims of the present invention.

Claims (9)

1. An unsupervised single-microphone speech denoising method, the method comprising:
step 1) extracting spectra from the collected speech training data covering all phonemes, then applying k-means clustering to the magnitude spectra to obtain a speech dictionary for each class, and then combining all class-specific speech dictionaries into one complete speech dictionary W_S;
step 2) applying a short-time Fourier transform to the noisy signal arriving at the current time to obtain the current-frame magnitude spectrum x_t, then combining it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t], combining the noise matrix W_N estimated in the previous frame with the speech dictionary W_S into the total dictionary matrix W = [W_S W_N], and performing iterative non-negative matrix factorization on the output speech spectrum X to obtain the estimated noise matrix and the speech and noise weight vector of the current frame;
step 3) reconstructing the denoised speech signal of the current frame from the estimated noise matrix and the speech and noise weight vector.
2. The unsupervised single-microphone speech denoising method according to claim 1, wherein step 1) specifically comprises:
step 101) collecting a large amount of clean speech as training data, the collected training data covering all phonemes;
step 102) pre-processing the collected speech training data and then extracting its spectrum;
step 103) applying k-means clustering to the magnitude spectra of the training data so that speech frames with similar spectral structure are grouped into one class, obtaining the class magnitude spectra S^(g), g = 1, ..., G, where G is the total number of clusters;
step 104) applying non-negative matrix factorization separately to each class magnitude spectrum S^(g), g = 1, ..., G, obtained by clustering to obtain a speech dictionary for each class, and then combining all class-specific speech dictionaries into one complete speech dictionary.
3. The unsupervised single-microphone speech denoising method according to claim 2, wherein the specific implementation of step 104) is:
first, applying the following non-negative matrix factorization to each class of clustered magnitude spectra:
S^(g) ≈ W_S^(g) H^(g), g = 1, ..., G
where S^(g) is the speech spectrum belonging to class g and W_S^(g) is the dictionary matrix of class g obtained from the decomposition, which describes one kind of spectral structure;
then combining all class-specific speech dictionaries into one complete speech dictionary:
W_S = [W_S^(1), W_S^(2), ..., W_S^(G)].
4. The unsupervised single-microphone speech denoising method according to claim 3, wherein step 2) specifically comprises:
step 2-1) for the current-frame noisy magnitude spectrum x_t: if x_t arrives before the L-th frame, outputting it directly without processing; otherwise combining it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t];
step 2-2) combining the noise matrix W_N estimated in the previous frame with the speech dictionary W_S into the total dictionary matrix W = [W_S W_N], the initial value of W_N being a random non-negative matrix; initializing the current-frame weight vector h_t with random non-negative values and combining it with the weight vectors estimated for the previous L frames into the weight matrix H = [h_{t-L}, ..., h_{t-1}, h_t], where h_t = [h_{S,t}^T h_{N,t}^T]^T, h_{S,t} contains the current frame's speech weights for the G sub-dictionaries, and h_{N,t} is the noise weight vector;
step 2-3) once X, W and H are determined, computing the similarity value V between X and WH:
V = X ./ (WH)
where ./ denotes point-wise division;
step 2-4) taking the last column v_t of V and updating the current frame's weight vector h_t with a multiplicative rule, where .* denotes point-wise multiplication;
step 2-5) applying a sparsity penalty to the speech part h_{S,t} of h_t and updating it, where λ and ε are coefficients;
step 2-6) updating the current-frame noise matrix W_N and normalizing it;
step 2-7) checking whether the W_N obtained in step 2-6) has converged; if so, going to step 3); otherwise replacing W_N with the newly updated noise matrix and returning to step 2-3).
5. The unsupervised single-microphone speech denoising method according to claim 4, wherein step 3) specifically comprises:
step 3-1) after the noise matrix and the current-frame weight vector have been estimated in step 2), reconstructing the separated current-frame speech spectrum s_t = W_S h_{S,t} and noise spectrum n_t = W_N h_{N,t};
step 3-2) obtaining the final denoised speech spectrum by Wiener filtering, i.e. by applying a gain computed from the reconstructed speech and noise spectra to the noisy spectrum;
step 3-3) combining the denoised speech spectrum with the phase of the current noisy frame and recovering the denoised time-domain waveform by the inverse Fourier transform.
6. An unsupervised single-microphone speech denoising system, the system comprising:
a speech dictionary generation module for extracting spectra from the collected speech training data covering all phonemes, applying k-means clustering to the magnitude spectra to obtain a speech dictionary for each class, and then combining all class-specific speech dictionaries into one complete speech dictionary W_S;
a noise dictionary generation module for applying a short-time Fourier transform to the noisy signal arriving at the current time to obtain the current-frame magnitude spectrum x_t, combining it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t], combining the noise matrix W_N estimated in the previous frame with the speech dictionary W_S into the total dictionary matrix W = [W_S W_N], and performing iterative non-negative matrix factorization on the output speech spectrum X to obtain the estimated noise matrix and the speech and noise weight vector of the current frame; and
a denoising module for reconstructing the denoised speech signal of the current frame from the estimated noise matrix and the speech and noise weight vector of the current frame.
7. The unsupervised single-microphone speech denoising system according to claim 6, wherein the specific implementation of the speech dictionary generation module is:
step 101) collecting a large amount of clean speech as training data, the collected training data covering all phonemes;
step 102) pre-processing the collected speech training data and then extracting its spectrum;
step 103) applying k-means clustering to the magnitude spectra of the training data so that speech frames with similar spectral structure are grouped into one class, obtaining the class magnitude spectra S^(g), g = 1, ..., G, where G is the total number of clusters;
step 104) applying non-negative matrix factorization separately to each class magnitude spectrum S^(g), g = 1, ..., G, obtained by clustering to obtain a speech dictionary for each class, and then combining all class-specific speech dictionaries into one complete speech dictionary:
first, applying the following non-negative matrix factorization to each class of clustered magnitude spectra:
S^(g) ≈ W_S^(g) H^(g), g = 1, ..., G
where S^(g) is the speech spectrum belonging to class g and W_S^(g) is the dictionary matrix of class g obtained from the decomposition, which describes one kind of spectral structure;
then combining all class-specific speech dictionaries into one complete speech dictionary:
W_S = [W_S^(1), W_S^(2), ..., W_S^(G)].
8. The unsupervised single-microphone speech denoising system according to claim 7, wherein the specific implementation of the noise dictionary generation module is:
step 2-1) for the current-frame noisy magnitude spectrum x_t: if x_t arrives before the L-th frame, outputting it directly without processing; otherwise combining it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t];
step 2-2) combining the noise matrix W_N estimated in the previous frame with the speech dictionary W_S into the total dictionary matrix W = [W_S W_N], the initial value of W_N being a random non-negative matrix; initializing the current-frame weight vector h_t with random non-negative values and combining it with the weight vectors estimated for the previous L frames into the weight matrix H = [h_{t-L}, ..., h_{t-1}, h_t], where h_t = [h_{S,t}^T h_{N,t}^T]^T, h_{S,t} contains the current frame's speech weights for the G sub-dictionaries, and h_{N,t} is the noise weight vector;
step 2-3) once X, W and H are determined, computing the similarity value V between X and WH:
V = X ./ (WH)
where ./ denotes point-wise division;
step 2-4) taking the last column v_t of V and updating the current frame's weight vector h_t with a multiplicative rule, where .* denotes point-wise multiplication;
step 2-5) applying a sparsity penalty to the speech part h_{S,t} of h_t and updating it, where λ and ε are coefficients;
step 2-6) updating the current-frame noise matrix W_N and normalizing it;
step 2-7) checking whether the W_N obtained in step 2-6) has converged; if so, going to step 3); otherwise replacing W_N with the newly updated noise matrix and returning to step 2-3).
9. The unsupervised single-microphone speech denoising system according to claim 8, wherein the specific implementation of the denoising module is:
step 3-1) after the noise matrix and the current-frame weight vector have been estimated in step 2), reconstructing the separated current-frame speech spectrum s_t = W_S h_{S,t} and noise spectrum n_t = W_N h_{N,t};
step 3-2) obtaining the final denoised speech spectrum by Wiener filtering, i.e. by applying a gain computed from the reconstructed speech and noise spectra to the noisy spectrum;
step 3-3) combining the denoised speech spectrum with the phase of the current noisy frame and recovering the denoised time-domain waveform by the inverse Fourier transform.
CN201710137778.0A 2017-03-09 2017-03-09 Unsupervised single-microphone speech denoising method and system Active CN108574911B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710137778.0A CN108574911B (en) 2017-03-09 2017-03-09 Unsupervised single-microphone speech denoising method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710137778.0A CN108574911B (en) 2017-03-09 2017-03-09 Unsupervised single-microphone speech denoising method and system

Publications (2)

Publication Number Publication Date
CN108574911A true CN108574911A (en) 2018-09-25
CN108574911B CN108574911B (en) 2019-10-22

Family

ID=63577827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710137778.0A Active CN108574911B (en) Unsupervised single-microphone speech denoising method and system

Country Status (1)

Country Link
CN (1) CN108574911B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011107951A1 (en) * 2010-03-02 2011-09-09 Nokia Corporation Method and apparatus for upmixing a two-channel audio signal
US20130132085A1 (en) * 2011-02-21 2013-05-23 Gautham J. Mysore Systems and Methods for Non-Negative Hidden Markov Modeling of Signals
CN105657535A (en) * 2015-12-29 2016-06-08 北京搜狗科技发展有限公司 Audio recognition method and device
CN105957537A (en) * 2016-06-20 2016-09-21 安徽大学 Voice denoising method and system based on L1/2 sparse constraint convolution non-negative matrix decomposition

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109545240A (en) * 2018-11-19 2019-03-29 清华大学 Sound separation method for human-computer interaction
CN109545240B (en) * 2018-11-19 2022-12-09 清华大学 Sound separation method for man-machine interaction
CN113823305A (en) * 2021-09-03 2021-12-21 深圳市芒果未来科技有限公司 Method and system for suppressing noise of metronome in audio

Also Published As

Publication number Publication date
CN108574911B (en) 2019-10-22

Similar Documents

Publication Publication Date Title
Michelsanti et al. Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification
CN106971741B (en) Method and system for voice noise reduction for separating voice in real time
CN110634502B (en) Single-channel voice separation algorithm based on deep neural network
JP5230103B2 (en) Method and system for generating training data for an automatic speech recognizer
KR100908121B1 (en) Speech feature vector conversion method and apparatus
CN109767756B (en) Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
Yu et al. Adversarial network bottleneck features for noise robust speaker verification
GB2560174A (en) A feature extraction system, an automatic speech recognition system, a feature extraction method, an automatic speech recognition method and a method of train
CN110942766A (en) Audio event detection method, system, mobile terminal and storage medium
Soe Naing et al. Discrete Wavelet Denoising into MFCC for Noise Suppressive in Automatic Speech Recognition System.
CN113763965A (en) Speaker identification method with multiple attention characteristics fused
CN108574911B (en) Unsupervised single-microphone speech denoising method and system
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
Hossain et al. Dual-transform source separation using sparse nonnegative matrix factorization
Wang et al. Robust speech recognition from ratio masks
Adam et al. Wavelet based Cepstral Coefficients for neural network speech recognition
Nataraj et al. Single channel speech enhancement using adaptive filtering and best correlating noise identification
Rodomagoulakis et al. Improved frequency modulation features for multichannel distant speech recognition
CN108573698B (en) Voice noise reduction method based on gender fusion information
Chowdhury et al. Speech enhancement using k-sparse autoencoder techniques
KR100329596B1 (en) Text-Independent Speaker Identification Using Telephone Speech
Oh et al. Preprocessing of independent vector analysis using feed-forward network for robust speech recognition
Srinivasarao Speech signal analysis and enhancement using combined wavelet Fourier transform with stacked deep learning architecture
Lee et al. Speech coding and noise reduction using ica-based speech features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant