CN114566179A - Time delay controllable voice noise reduction method - Google Patents

Time delay controllable voice noise reduction method

Info

Publication number
CN114566179A
CN114566179A (application CN202210258932.0A)
Authority
CN
China
Prior art keywords
voice
gain function
complex
domain filter
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210258932.0A
Other languages
Chinese (zh)
Inventor
邱锋海
王之禹
项京朋
Current Assignee
Beijing Sound+ Technology Co ltd
Original Assignee
Beijing Sound+ Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sound+ Technology Co ltd filed Critical Beijing Sound+ Technology Co ltd
Priority to CN202210258932.0A priority Critical patent/CN114566179A/en
Publication of CN114566179A publication Critical patent/CN114566179A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 - Processing in the time domain
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 - Speech or audio signals analysis-synthesis techniques using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0212 - Speech or audio signals analysis-synthesis techniques using spectral analysis, using orthogonal transformation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques using predictive techniques
    • G10L19/26 - Pre-filtering or post-filtering
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 - Processing in the frequency domain
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks

Abstract

The present application relates to a voice noise reduction method with controllable time delay, comprising the following steps: framing the noisy speech and applying a time-to-frequency-domain transform to obtain the complex spectrum of the noisy speech; determining a gain function, real or complex, from the complex spectrum of the noisy speech; determining a time-domain filter from the gain function, the order of the time-domain filter being set according to the delay requirement; and feeding the noisy speech into the time-domain filter for noise reduction processing that meets the delay requirement, obtaining clean speech. With the method provided by the embodiments of the present application, state-of-the-art voice noise reduction performance can be achieved at low delay, while reducing computational complexity and improving robustness.

Description

Time delay controllable voice noise reduction method
Technical Field
The present application relates to the multimedia field, and in particular, to a time delay controllable voice noise reduction method.
Background
Voice noise reduction has very important applications in voice communication, speech recognition, hearing aids, cochlear implants, and the like, and can significantly improve communication quality and interactive experience. Speech noise reduction methods can be divided into unsupervised methods, including spectral subtraction, subspace methods, etc., and supervised methods, including non-negative matrix factorization, dictionary learning, deep neural network methods, etc. Currently, most speech noise reduction is performed in the time-frequency domain: the data is first framed and windowed, then Fourier-transformed; a gain function is estimated by unsupervised inference or by a supervised method and applied to the complex spectrum of the noisy signal; finally the time-domain signal is reconstructed by Overlap-Add (OLA). With this type of approach, the delay is determined by the frame length. In many systems the delay is severely constrained: in hearing aid systems, the total signal processing delay must be kept within 4 ms to reduce the comb effect while meeting the minimum perceivable time difference; current TWS (True Wireless Stereo) earphones have a pass-through (Transparency) mode which, once activated, behaves like a hearing aid and likewise requires the delay to stay within 4 ms; and in sound reinforcement systems that suppress noise in the microphone pickup signal, the delay requirement is even stricter, since excessive algorithmic delay causes an audible lag of the reinforced sound and, in severe cases, echo. Therefore, a low-delay, high-performance voice noise reduction method has important application value.
One way to reduce the delay is to shorten the frame length, for example to 4 ms. However, research has shown that with an excessively short frame length the frequency resolution of the spectrum after the Fourier transform is too low: an unsupervised method then cannot effectively suppress noise between speech harmonics, while a supervised method loses discrimination between noise features and speech features, which severely degrades its performance and, in severe cases, prevents training of the supervised learning model from converging.
Another way to reduce the delay is to keep a long frame and reduce the frame shift, for example to 4 ms or even 2 ms. However, most existing delay-controllable voice noise reduction methods use a frequency-domain analysis-synthesis approach and reconstruct the enhanced time-domain signal by overlap-add, so the delay is still determined by the frame length. It is worth mentioning that existing speech separation methods adopting an end-to-end time-domain approach have a delay theoretically determined by the frame shift, but their performance is inferior to frequency-domain analysis-synthesis, their computational complexity is higher, and their stability is insufficient.
Disclosure of Invention
The purpose of the present application is to allow the time delay to be set according to the practical application, so as to achieve controllable delay, reduce computational complexity, and improve robustness.
In order to achieve the above object, the present application provides a time delay controllable voice noise reduction method, including the following steps: framing the voice with noise, and transforming a time domain and a frequency domain to obtain a complex frequency spectrum of the voice with noise; determining a gain function according to the complex frequency spectrum of the voice with noise; the gain function is real or complex; determining a time domain filter according to the gain function, wherein the order of the time domain filter is set according to the time delay requirement; and inputting the voice with noise into the time domain filter to perform noise reduction processing meeting the time delay requirement to obtain pure voice.
As a preferred embodiment, the determining a gain function according to the complex spectrum of the noisy speech includes: determining a magnitude spectrum of a complex frequency spectrum of the noisy speech; inputting the magnitude spectrum of the complex frequency spectrum of the voice with noise into the deep learning network, wherein the deep learning network is a real network; and determining a gain function according to the mapping target of the real network, wherein the gain function is a real number.
As a preferred embodiment, the determining a gain function according to the mapping target of the real network includes: in the case where the mapping target of the deep learning network is the magnitude spectrum of the clean speech, the gain function is the ratio of the magnitude spectrum of the clean speech to the magnitude spectrum of the noisy speech; and in the case where the mapping target of the deep learning network is the compressed magnitude spectrum of the clean speech, the gain function is the ratio of the compressed magnitude spectrum of the clean speech to the magnitude spectrum of the noisy speech.
As a preferred embodiment, the determining the gain function according to the complex spectrum includes: determining the real part and the imaginary part of the complex spectrum of the noisy speech and inputting them into the deep learning network, the deep learning network being a complex network; or determining the real and imaginary parts of a compressed complex spectrum of the noisy speech and inputting them into the complex network; and determining a gain function according to a mapping target of the complex network, the gain function being complex.
As a preferred embodiment, the determining a gain function according to the mapping objective of the complex network includes: under the condition that the mapping target of the complex network is a complex spectrum of pure voice, obtaining a gain function according to the ratio of the complex spectrum of the pure voice to the voice with noise, wherein the gain function is complex; or obtaining a gain function according to a ratio of the compressed complex spectrum of the clean speech to the noisy speech under the condition that the mapping target of the complex network is the compressed complex spectrum of the clean speech, wherein the gain function is complex.
As a preferred embodiment, the determining a time-domain filter according to the gain function, where an order of the time-domain filter is set according to a delay control requirement, includes: approximating the gain function by using a finite impulse response time domain filter, wherein the gain function is a fitting value of the finite impulse response time domain filter; and determining the order of the finite impulse response time domain filter according to the time delay control requirement.
As a preferred embodiment, the determining a time-domain filter according to the gain function, where an order of the time-domain filter is set according to a delay control requirement, includes: approximating the gain function by using an infinite impulse response time domain filter, wherein the gain function is a fitting value of the infinite impulse response time domain filter; determining the amplitude-frequency response of the infinite impulse response time domain filter; and determining the order of the infinite impulse response time domain filter according to the amplitude value of the gain function and the amplitude-frequency response, wherein the order of the infinite impulse response time domain filter meets the time delay control requirement.
As a preferred embodiment, the inputting the noisy speech into the time-domain filter to perform noise reduction processing meeting the delay requirement to obtain clean speech includes: and inputting the voice with noise into a finite impulse response time domain filter or an infinite impulse response time domain filter, and performing noise reduction processing according with the time delay requirement to obtain pure voice.
As a preferred embodiment, the inputting the noisy speech into the time-domain filter to perform noise reduction processing meeting the delay requirement to obtain clean speech includes: dividing the frequency of the voice with noise to obtain a first sub-band signal and a second sub-band signal; the first sub-band signal is processed by an infinite impulse response time domain filter to obtain estimated middle and low frequency voice; the second sub-band signal is processed by a finite impulse response time domain filter to obtain estimated high-frequency voice; and synthesizing the medium and low frequency voice signal and the high frequency voice signal to obtain pure voice.
As a preferred embodiment, the inputting the time-domain signal of the noisy speech into the time-domain filter to perform noise reduction processing meeting the delay requirement to obtain clean speech includes: determining a finite impulse response time-domain filter mapped by the deep learning network with order 2·t_d·f_s/1000, where t_d is the delay requirement (in milliseconds) and f_s is the sampling frequency (in Hz); convolving the l-th frame of noisy speech with the finite impulse response time-domain filter obtained by mapping that frame through the deep learning network, to obtain the l-th frame of clean speech, l being a natural number; and arranging the clean speech frames in time order to obtain clean speech meeting the delay requirement.
By adopting the method provided by the embodiment of the application, the advanced voice noise reduction performance can be achieved under the condition of low time delay, the operation complexity is reduced, and the robustness is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments disclosed in this specification, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments disclosed in this specification, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a delay-controllable voice denoising method according to an embodiment of the present application;
fig. 2 is a flow chart of FIR time-domain filter design in a time-delay controllable voice noise reduction method according to an embodiment of the present application;
fig. 3 is a diagram of a spectrogram test effect before and after Babble noise processing according to an embodiment of the present application.
Detailed Description
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, references to the terms "first/second/third" or module A, module B, module C, etc. are used solely to distinguish between similar objects and do not denote a particular order or importance; the specific order or sequence may be interchanged, where permissible, so that the embodiments of the application described herein can be practiced in an order other than that shown or described herein.
In the following description, reference numerals indicating steps, such as S110, S120, etc., do not necessarily indicate that the steps are performed in that order; the order of the steps may be interchanged, or steps may be performed simultaneously, where permissible.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. The following examples are only for illustrating the technical solutions of the present application more clearly, and the protection scope of the present application is not limited thereby.
It should be noted that the term "first" in the description and claims of the embodiments of the present application is used for distinguishing different objects, and is not used for describing a specific order of the objects. For example, the first speech segment is used to distinguish between different speech segments, rather than to describe a particular order of target objects. In the embodiments of the present application, words such as "exemplary," "for example," or "such as" are used to mean serving as examples, illustrations, or illustrations. Any embodiment or design described herein as "exemplary," "for example," or "such as" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the words "exemplary," "for example," or "such as" are intended to present relevant concepts in a concrete fashion.
First, the principle of a delay-controllable speech noise reduction method provided in the embodiment of the present application is introduced.
Suppose that the time domain signal with noise picked up by the microphone is x (n):
x(n) = s(n) + d_s(n) + d_t(n)   (1)
In formula (1), s(n) is clean speech, d_s(n) is stationary noise, and d_t(n) is transient noise. After the Short-Time Fourier Transform (STFT), the signal model can be expressed as:
X(k,l) = S(k,l) + D_s(k,l) + D_t(k,l)   (2)
In formula (2), k and l denote the k-th frequency bin and the l-th frame, respectively; S(k,l) is the complex spectrum of the clean speech, D_s(k,l) and D_t(k,l) are the complex spectra of the stationary noise and the transient noise, respectively, and X(k,l) is the complex spectrum of the noisy speech. Taking X(k,l) as an example, the short-time Fourier transform is:
X(k,l) = Σ_{n=0}^{N-1} x(n + lR) e^{-j2πkn/N}   (3)
In formula (3), R is the frame shift, and N is the frame length.
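The framing and transform of Eq. (3) can be sketched directly. The snippet below is a minimal illustration, not the patent's implementation; a Hann analysis window is an assumption of this sketch, since the equation itself does not fix one.

```python
import cmath
import math

def stft_frame(x, l, N, R):
    # Complex spectrum X(k, l) of frame l per Eq. (3): frame length N, frame shift R.
    # A Hann analysis window is assumed here; the patent does not specify the window.
    w = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]
    frame = [w[n] * x[n + l * R] for n in range(N)]
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

# A test tone at normalized frequency 0.25 should peak at bin k = N * 0.25 = 4.
x = [math.sin(2 * math.pi * 0.25 * n) for n in range(64)]
X0 = stft_frame(x, l=0, N=16, R=8)
peak_bin = max(range(16), key=lambda k: abs(X0[k]))
```

For real input the spectrum is conjugate-symmetric, so the mirrored bin N − k carries the same magnitude as bin k.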
In the case where only the noisy time-domain signal x(n), or its complex spectrum X(k,l), is picked up by the microphone, the purpose of single-channel speech noise reduction is to estimate the clean speech s(n) or its complex spectrum S(k,l) by some speech noise reduction method, which can generally be written as:
Ŝ(k,l) = G(k,l) X(k,l)   (4)
in the prior art, the value of G (k, l) is usually real, and the value is between 0 and 1.
To reduce spectral leakage, existing methods generally decompose the noisy speech by subband analysis, but subband decomposition introduces a certain delay and increases the computational complexity. Thus, some existing methods estimate the gain function at each time-frequency point, or estimate the noise power at each time-frequency point and then compute the gain function from the noisy speech power and the noise power at that point. Because these methods rely on voice endpoint detection or minimum-statistics techniques when estimating the noise power, the power of non-stationary noise is difficult to estimate accurately, and hence the gain function G(k,l) is difficult to estimate accurately. Since the accuracy of the estimate of G(k,l) directly affects the time-domain filter h_l(n), the existing methods, although capable of controlling the delay, suppress non-stationary noise poorly.
In some cases, a speech noise reduction method based on a deep learning network may be adopted that does not explicitly estimate G(k,l), but directly maps the complex spectrum or the magnitude spectrum of the clean speech, e.g.:
Ŝ(k,l) = DL_sNet{X(k,l)}   (5)
whether formula (4) or formula (5) is used, the pure speech is reconstructed by overlap-add method, and the time delay is determined by the frame length N.
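The conventional pipeline around Eqs. (4)-(5) (per-frame gain, inverse transform, overlap-add) can be sketched to make the frame-length delay concrete: frame l is complete only once sample l·R + N − 1 has arrived, so the output lags by the frame length N, not the frame shift R. A Hann window at 50% overlap, which satisfies the constant-overlap-add condition, is an assumption of this sketch.

```python
import cmath
import math

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

def enhance_ola(x, gains, N, R):
    # Apply the per-frame gain (Eq. 4) and reconstruct by overlap-add (OLA).
    w = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]  # Hann, COLA at R = N/2
    y = [0.0] * len(x)
    num_frames = (len(x) - N) // R + 1
    for l in range(num_frames):
        X = dft([w[n] * x[l * R + n] for n in range(N)])
        s = idft([gains[l][k] * X[k] for k in range(N)])
        for n in range(N):
            y[l * R + n] += s[n]
    return y

N, R = 16, 8
x = [math.sin(0.3 * n) for n in range(80)]
unity = [[1.0] * N for _ in range((len(x) - N) // R + 1)]
y = enhance_ola(x, unity, N, R)  # with unit gains, the interior is reconstructed exactly
```

With unit gains the interior samples (where two Hann windows overlap and sum to one) reproduce the input, confirming the analysis-synthesis chain is consistent.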
The embodiment of the application provides a time delay controllable voice noise reduction method, which comprises the steps of carrying out framing and time-frequency domain transformation on voice with noise to obtain a complex frequency spectrum of the voice with noise; determining a gain function of a complex frequency spectrum of the voice with noise; the gain function is real or complex; determining a time domain filter according to the gain function, wherein the order of the time domain filter is determined according to the time delay control requirement; and inputting the voice with the noise into a time domain filter for noise reduction to obtain pure voice meeting the time delay requirement.
Fig. 1 is a flowchart of a voice denoising method with controllable delay according to an embodiment of the present application. As shown in FIG. 1, delay-controllable voice noise reduction may be achieved by the following steps S1-S4.
And S1, framing the voice with noise, and transforming the time domain and the frequency domain to obtain the complex frequency spectrum of the voice with noise.
In an implementation manner, the noisy speech x (n) picked up by the microphone may be passed through a subband analysis filter, and a subband signal of the noisy speech of the l-th frame of the k-th frequency point may be output.
According to the above embodiment, the noisy speech x(n) may be passed through a subband analysis filter, and the full-band signal may be divided into time-domain signals of at least two subbands, denoted x_L(n) and x_H(n), where x_L(n) is the low- and mid-band subband speech (e.g., below 4000 Hz) and x_H(n) is the subband speech above 4000 Hz.
In an implementation manner, the noisy speech X (n) picked up by the microphone may be subjected to a frequency band analysis, and a complex spectrum of the noisy speech of the l frame at the k frequency point is output, where the complex spectrum of the noisy speech of the l frame at the k frequency point may be marked as X (k, l).
By comparison, the present method adopts frequency-band analysis, i.e., the short-time Fourier transform, to compute the complex spectrum of the noisy speech. It can achieve the goal of delay-controllable voice noise reduction with lower computational complexity than a subband-analysis scheme, while still obtaining satisfactory performance.
S2, determining a gain function according to the complex spectrum of the noisy speech, the gain function being real or complex.
In one implementation, a conventional speech noise reduction method is used to estimate the gain function G(k,l) for the subband signal of the noisy speech in the l-th frame at the k-th frequency point.
In an implementation manner, a gain function G (k, l) may be obtained by mapping a complex spectrum of a noisy speech in an l-th frame of a k-th frequency point by using a deep learning network.
S3, determining a time-domain filter h_l(n) according to the gain function G(k,l), where the order of the time-domain filter is determined by the delay control requirement.
S4, inputting the noisy speech x(n) into the time-domain filter h_l(n) for filtering to obtain the enhanced clean speech ŝ(n).
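Steps S3-S4 amount to running each frame's filter h_l(n) over the incoming samples by direct convolution. The sketch below is a minimal illustration of that filtering step; a real system would typically crossfade between successive frames' filters, a refinement omitted here.

```python
def filter_framewise(x, filters, R):
    # S4 sketch: for frame l, the R new input samples are filtered by direct
    # convolution with that frame's time-domain filter h_l(n); outputs are
    # concatenated in time.
    y = []
    for l, h in enumerate(filters):
        for n in range(l * R, min((l + 1) * R, len(x))):
            acc = 0.0
            for m in range(len(h)):
                if n - m >= 0:
                    acc += h[m] * x[n - m]
            y.append(acc)
    return y

# With an identity filter h = [1, 0, 0, 0] the output equals the input.
x = [0.1 * n for n in range(20)]
h_id = [1.0, 0.0, 0.0, 0.0]
y = filter_framewise(x, [h_id] * 5, R=4)
```

The identity-filter check confirms the convolution indexing: each output sample depends only on the current and past input samples, which is what keeps the scheme causal and low-delay.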
In the speech noise reduction method with controllable delay proposed in the embodiment of the present application, step S2 employs a speech noise reduction method of a deep learning network, and G (k, l) may be a real number or a complex number. In a low signal-to-noise ratio scenario, a complex gain function is used, which generally exhibits better performance since the clean speech phase can be estimated simultaneously.
In one implementation, step S2 is implemented by the following steps.
S21, obtaining a gain function G (k, l) by using deep learning network mapping, that is:
G(k,l)=DL_gNet{X(k,l)} (6)
In equation (6), the gain function G(k,l) may be real or complex; accordingly, the deep learning network may be a real network or a complex network. When the deep learning network adopts a real network, the input is the magnitude spectrum |X(k,l)| of the complex spectrum of the noisy speech, the compressed magnitude spectrum |X(k,l)|^β, Mel-frequency cepstral coefficients (MFCC), or the like.
The mapping target of the deep learning network is the real-valued G(k,l), and the cost function of the deep learning network can be determined by SA (signal approximation), i.e.:
Loss = Σ_{k,l} ( (G(k,l)|X(k,l)|)^β − |S(k,l)|^β )²   (7)
where β takes a value between 0 and 1, typically 0.5.
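A numeric sketch of the SA cost above; the compressed-magnitude form of Eq. (7) is a reconstruction from context (the β defined beside it), so the exact form is an assumption.

```python
def sa_loss(G, X_mag, S_mag, beta=0.5):
    # Signal-approximation (SA) cost: compare the gained noisy magnitude with the
    # clean magnitude, both compressed by beta. The precise form of Eq. (7) is
    # assumed here, not taken verbatim from the patent.
    total = 0.0
    for g, x, s in zip(G, X_mag, S_mag):
        total += ((g * x) ** beta - s ** beta) ** 2
    return total

# The oracle gain (ratio of clean to noisy magnitude) drives the loss to zero.
X_mag = [2.0, 4.0, 1.0]
S_mag = [1.0, 1.0, 0.5]
oracle = [s / x for s, x in zip(S_mag, X_mag)]
```

A unit gain (no suppression) leaves a positive residual, which is what the training minimizes.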
In an implementation manner, in the case that the deep learning network is a real network, the magnitude spectrum of the complex spectrum of the noisy speech may be input into the deep learning network, and the output mapping target is a gain function, where the gain function is a real number.
In the case where the real network does not directly map the real gain function G(k,l), i.e., the mapping target is the magnitude spectrum of the clean speech or its compressed value |S(k,l)|^β, the gain function is determined as the ratio of the mapping target to the noisy speech magnitude spectrum, i.e.:
G(k,l) = |Ŝ(k,l)| / |X(k,l)|  (or |Ŝ(k,l)|^β / |X(k,l)| for the compressed target)   (8)
in this embodiment, in the case where the mapping target of the deep learning network is the magnitude spectrum of clean speech, the gain function is the ratio of the magnitude spectrum of clean speech to the magnitude spectrum of noisy speech; in the case where the mapping target of the deep learning network is the magnitude spectrum compression value of the clean speech, the gain function is the ratio of the magnitude spectrum compression value of the clean speech to the magnitude spectrum of the noisy speech.
In one implementation, when the deep learning network is a complex network, the input to the deep learning network is the real part and imaginary part of the complex spectrum X(k,l) of the noisy speech: X(k,l) = X_r(k,l) + j·X_i(k,l), i.e., X_r(k,l) and X_i(k,l), the real and imaginary parts of X(k,l), can be used as the input of the complex network.
In this embodiment, the real part and the imaginary part of the compressed complex spectrum of the noisy speech may also be used as input to the deep learning network, the compressed complex spectrum being:
X^(c)(k,l) = |X(k,l)|^β e^{j∠X(k,l)} = X_r^(c)(k,l) + j·X_i^(c)(k,l)   (9)
where X_r^(c)(k,l) and X_i^(c)(k,l) are the real and imaginary parts of the compressed complex spectrum X^(c)(k,l).
Obviously, the compressed complex spectrum X^(c)(k,l) changes the magnitude of the original complex spectrum X(k,l) from |X(k,l)| to |X(k,l)|^β without changing the phase; the compressed complex spectrum generally yields better noise reduction performance.
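The magnitude-compression of Eq. (9) is straightforward to sketch: keep the phase, raise the magnitude to the power β.

```python
import cmath

def compress_spectrum(X, beta=0.5):
    # Eq. (9): preserve each bin's phase, compress its magnitude to |X|**beta.
    out = []
    for v in X:
        mag, ph = abs(v), cmath.phase(v)
        out.append(cmath.rect(mag ** beta, ph))
    return out

X = [3 + 4j, -2j, 0.25]
Xc = compress_spectrum(X)
```

For the first bin, |3 + 4j| = 5, so the compressed magnitude is 5**0.5 while the angle is untouched.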
The typical mapping target is the complex G(k,l), and the cost function of the deep learning can also be SA (signal approximation), i.e.:
Loss_all = α·Loss_mag + (1−α)·Loss_complex   (10)
where α is between 0 and 1, and the complex-domain loss function Loss_complex is:
Loss_complex = Σ_{k,l} [ (Re{G(k,l)X^(c)(k,l)} − S_r^(c)(k,l))² + (Im{G(k,l)X^(c)(k,l)} − S_i^(c)(k,l))² ]   (11)
where Re{G(k,l)X^(c)(k,l)} and Im{G(k,l)X^(c)(k,l)} are the real and imaginary parts of G(k,l)X^(c)(k,l), and S_r^(c)(k,l) and S_i^(c)(k,l) are the real and imaginary parts of the compressed complex spectrum of the clean speech spectrum S(k,l). In the speech noise reduction task, α is typically 0.5; its value balances speech distortion against the amount of noise reduction.
If the complex network does not directly map the complex G(k,l), i.e., the mapping target is the clean speech complex spectrum or its compressed complex spectrum, then the gain function can be obtained as the ratio of the mapping target to the (correspondingly compressed) noisy speech spectrum, i.e.:
G(k,l) = Ŝ^(c)(k,l) / X^(c)(k,l)   (12)
in one implementation, in the case that the mapping target of the deep learning network is the complex spectrum of the clean speech, the gain function can be obtained according to the ratio of the complex spectrum of the clean speech to the noisy speech, and the gain function is complex.
In one implementation, in the case that the mapping target of the deep learning network is a compressed complex spectrum of clean speech, a gain function is obtained according to a ratio of the compressed complex spectrum of clean speech to noisy speech, and the gain function is complex.
The deep learning network adopted by the delay-controllable voice noise reduction method provided in the embodiments of the present application may be a fully connected network (FC), a convolutional neural network (CNN), a long short-term memory network (LSTM), or the like. The size of the model can be determined according to the computing and storage resources of the chip or platform, and the choice of model can be determined by the acceleration kernel that the chip or platform provides. In one implementation, the time-domain filter design of step S3 can be realized by the following steps.
S31: when a finite impulse response (FIR) filter is used, take the gain function G(k, l) as the fitting target of the FIR time-domain filter.

In one implementation, the gain function G(k, l) may be approximated with a linear-phase FIR filter.
S32: determine the order of the FIR filter according to the delay control requirement.

Because a linear-phase FIR filter has a symmetric impulse response, filtering the noisy speech time-domain signal with it introduces a delay of exactly half the filter order (in samples).
Illustratively, when the delay requirement is t_d milliseconds and the sampling rate is f_s (in Hz), the order of the FIR filter is at most 2·t_d·f_s/1000. For example, for a delay requirement t_d of 4 milliseconds and a sampling rate f_s of 16000 Hz, the maximum length of the FIR filter is 128 points.
Fig. 2 is a flow chart of FIR time domain filter design. As shown in fig. 2, the FIR time-domain filter can be obtained by window function design, and step S32 can be realized by the following steps S321 to S323.
S321: transform the gain function G(k, l) of frame l back to the time domain via the inverse short-time Fourier transform, obtaining the time-domain gain function G_l(n) of the frame-l signal.

S322: shift G_l(n) to the right by n_0 samples in the time domain, obtaining G_l(n - n_0).

S323: truncate G_l(n - n_0) to obtain the FIR time-domain filter h_l(n) for frame l.
The FIR time domain filter obtained by the window function design has the advantages of low operation complexity and stable performance.
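Steps S321-S323 above can be sketched in a few lines of NumPy. The Hamming truncation window and the half-length shift n_0 are assumptions; the patent only requires a shift and a truncation:

```python
import numpy as np

def fir_from_gain(G_l, fir_len):
    """Window-method FIR design (S321-S323) for the frame-l gain function.

    G_l: one-sided gain function of frame l (rfft bins).
    fir_len: odd FIR length chosen from the delay requirement (S32).
    """
    n0 = (fir_len - 1) // 2
    g = np.fft.irfft(G_l)                   # S321: back to the time domain
    g = np.roll(g, n0)                      # S322: right shift by n0 samples
    h = g[:fir_len] * np.hamming(fir_len)   # S323: truncate with a window
    return h

# A unity gain function yields (approximately) a pure delay of n0 samples.
h = fir_from_gain(np.ones(65), 33)
```

The all-pass check at the end illustrates the delay behavior: a unity gain gives an impulse at n0 = 16, i.e. a 16-sample delay, consistent with the half-order delay of a linear-phase FIR filter.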
In one implementation, an infinite impulse response (IIR) time-domain filter can be obtained with a minimum-phase design. Because an IIR time-domain filter can be designed as a minimum-phase filter, its delay is shorter than that of an FIR time-domain filter. The drawback of the IIR approximation is that linear phase cannot be guaranteed, i.e. the delay may differ from frequency point to frequency point. The higher the order of the IIR time-domain filter, the more accurate the approximation and the higher the complexity; an order that is too low may cause a large approximation error and thus unstable noise reduction performance.
In this embodiment, step S3 may be implemented by the following steps S31'-S33'.

S31': when an IIR time-domain filter is used, take the gain function as the fitting target of the IIR time-domain filter and obtain the filter with a minimum-phase design.

S32': compute the amplitude-frequency response of the IIR time-domain filter from its coefficients.

S33': determine the order of the infinite impulse response time-domain filter from the amplitude of the gain function and the amplitude-frequency response, such that the order meets the delay control requirement.
In one implementation, the amplitude-frequency response of the IIR time-domain filter is compared with the amplitude |G(k, l)| of the gain function; when the error is large, the order of the IIR time-domain filter is increased to reduce the approximation error while still meeting the delay control requirement.
The delay-controllable speech noise reduction method of this embodiment fits the gain function G(k, l) with an IIR time-domain filter, so the delay can be kept within a small range.
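The patent does not specify how the minimum-phase IIR fit of steps S31'-S33' is computed. One plausible sketch, assumed for illustration, builds a minimum-phase spectrum with magnitude |G(k, l)| via real-cepstrum folding and fits an IIR filter to its impulse response with Prony's method:

```python
import numpy as np

def minimum_phase_spectrum(mag, n_fft):
    """Minimum-phase spectrum with the given magnitude (real-cepstrum folding)."""
    cep = np.fft.irfft(np.log(np.maximum(mag, 1e-8)), n_fft)
    fold = np.zeros(n_fft)
    fold[0] = cep[0]
    fold[1:n_fft // 2] = 2.0 * cep[1:n_fft // 2]
    fold[n_fft // 2] = cep[n_fft // 2]
    return np.exp(np.fft.rfft(fold))

def prony_fit(h, nb, na):
    """Least-squares Prony fit of impulse response h by B(z)/A(z)."""
    N = len(h)
    # Tail equations h[n] = -sum_k a_k h[n-k] for n > nb give the denominator.
    rows = [[h[n - k] if n - k >= 0 else 0.0 for k in range(1, na + 1)]
            for n in range(nb + 1, N)]
    a_tail, *_ = np.linalg.lstsq(np.array(rows), -h[nb + 1:], rcond=None)
    a = np.concatenate(([1.0], a_tail))
    # Numerator matches the first nb+1 samples of h exactly.
    b = np.convolve(a, h)[:nb + 1]
    return b, a

# Flat gain stays flat after the minimum-phase construction.
H = minimum_phase_spectrum(np.full(65, 2.0), 128)
# Prony recovers a known one-pole filter from its impulse response 0.5**n.
b, a = prony_fit(0.5 ** np.arange(32), 0, 1)
```

Step S33' then compares the amplitude response of the fitted (b, a) against |G(k, l)| and increases the orders nb, na until the error is acceptable, subject to the delay constraint.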
Because the computational complexity of IIR time-domain filter design and filtering is directly tied to the IIR order, the IIR time-domain filter is better suited to approximating peaks and valleys, whereas the FIR time-domain filter is better suited to approximating gentle amplitude-frequency curves. For speech signals, significant peaks and valleys usually occur only in voiced segments in the mid-to-low frequency band below 4000 Hz; the high frequency band above 4000 Hz has no significant peaks and valleys.
In one implementation, a mixture of FIR and IIR time-domain filters may be adopted; step S3 is then implemented by the following steps S31''-S32''.

S31'': fit the gain function G(k, l) of the noisy speech in the band below 4000 Hz with an IIR filter.

S32'': fit the gain function G(k, l) of the noisy speech in the band above 4000 Hz with an FIR filter, to reduce algorithm complexity.
In one implementation, step S4 may be implemented by the following steps S41-S44.

S41: split the noisy speech into frequency bands to obtain a first sub-band signal and a second sub-band signal;
In one implementation, the time-domain signal x(n) of the noisy speech can be split by a sub-band analysis filter into two sub-band time-domain signals, denoted the first sub-band signal x_L(n) and the second sub-band signal x_H(n).
S42: pass the first sub-band signal through the infinite impulse response time-domain filter to obtain the estimated mid-to-low-frequency speech.

In one implementation, the first sub-band signal x_L(n) is passed through the IIR time-domain filter to obtain the estimated mid-to-low-frequency speech ŝ_L(n).
S43: pass the second sub-band signal through the finite impulse response time-domain filter to obtain the estimated high-frequency speech.

In one implementation, the second sub-band signal x_H(n) is passed through the FIR time-domain filter to obtain the estimated high-frequency speech ŝ_H(n).
S44: synthesize the mid-to-low-frequency speech and the high-frequency speech to obtain the full-band clean speech.

In one implementation, the two sub-band speech signals ŝ_L(n) and ŝ_H(n) are synthesized to obtain the final estimated full-band clean speech ŝ(n) in the time domain.
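Steps S41-S44 can be sketched as follows, with a complementary FIR crossover standing in for a real QMF analysis-synthesis bank (the patent does not fix the filter bank); the 4000 Hz cutoff, the 65-tap crossover, and the per-band enhancement filters g_low and g_high are illustrative:

```python
import numpy as np

def crossover(num_taps, cutoff_hz, fs):
    """Complementary FIR pair: h_low + h_high equals a delayed unit impulse,
    so the two bands sum back to a pure delay of (num_taps-1)//2 samples."""
    d = (num_taps - 1) // 2
    n = np.arange(num_taps) - d
    h_low = (2 * cutoff_hz / fs) * np.sinc(2 * cutoff_hz / fs * n)
    h_low *= np.hamming(num_taps)
    h_high = -h_low.copy()
    h_high[d] += 1.0
    return h_low, h_high

def subband_enhance(x, g_low, g_high, cutoff_hz=4000.0, fs=16000.0, num_taps=65):
    h_low, h_high = crossover(num_taps, cutoff_hz, fs)
    x_l = np.convolve(x, h_low)      # S41: first sub-band (below 4 kHz)
    x_h = np.convolve(x, h_high)     # S41: second sub-band (above 4 kHz)
    s_l = np.convolve(x_l, g_low)    # S42: low band through its (IIR) filter
    s_h = np.convolve(x_h, g_high)   # S43: high band through its (FIR) filter
    return s_l + s_h                 # S44: sub-band synthesis

x = np.random.default_rng(0).standard_normal(400)
y = subband_enhance(x, np.array([1.0]), np.array([1.0]))
```

With identity per-band filters the chain reduces to a pure delay of (num_taps-1)//2 samples, which makes the reconstruction property easy to check.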
In one implementation, step S4 may be implemented by an FIR time domain filter based on deep learning network mapping, including the following steps:
S41': determine the FIR time-domain filter mapped by the deep learning network, with order 2·t_d·f_s/1000.
The enhanced speech time-domain signal is transformed by STFT, and the cost function of equation (10) can likewise be used to train the parameters of the deep learning network that maps the FIR time-domain filter.
S42': convolve the frame-l time-domain signal of the noisy speech with the finite impulse response time-domain filter mapped by the deep learning network for frame l, obtaining the frame-l clean speech that meets the delay requirement:

ŝ_l(n) = x_l(n) * h_l(n)    (13)

In equation (13), x_l(n) is the frame-l time-domain signal of the noisy speech, h_l(n) is the FIR time-domain filter mapped by the deep learning network for frame l, and ŝ_l(n) is the frame-l clean speech that meets the delay requirement.
The time-domain convolution in equation (13) can also be realized quickly by frequency-domain multiplication. To guarantee linear convolution, h_l(n) should be zero-padded before the Fourier transform, and x_l(n) should include the signal of the previous frame, so that the two sequences have the same length after zero padding.
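The frequency-domain realization described above can be sketched as one overlap-save step; as the text notes, the history kept from the previous frame must span at least len(h) - 1 samples so that the retained tail of the circular convolution equals the linear convolution. All names are illustrative:

```python
import numpy as np

def fft_filter_frame(x_prev, x_cur, h):
    """Filter the current frame with h_l(n) via frequency-domain multiplication.

    x_prev: previous-frame history, len(x_prev) >= len(h) - 1.
    x_cur:  current frame of noisy speech, x_l(n).
    h:      FIR time-domain filter h_l(n) for this frame.
    Returns the linear-convolution output samples aligned with x_cur.
    """
    x = np.concatenate([x_prev, x_cur])
    n_fft = len(x)
    H = np.fft.rfft(h, n_fft)                # zero-pads h to the FFT length
    y = np.fft.irfft(np.fft.rfft(x) * H, n_fft)
    return y[-len(x_cur):]                   # only these samples are linear

rng = np.random.default_rng(2)
x = rng.standard_normal(128)
h = rng.standard_normal(33)
out = fft_filter_frame(x[:64], x[64:], h)
```

The result matches direct time-domain convolution on the current frame, at O(N log N) cost instead of O(N·len(h)).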
Example 1
Embodiment 1 of the present application provides a time delay controllable voice noise reduction method, including:
S51: frame the noisy-speech time-domain signal x(n), with frame length N and frame shift R, and apply the short-time Fourier transform to obtain the noisy-speech complex spectrum X(k, l);
S52: directly estimate a real gain function G(k, l) with a real network, or directly estimate a complex gain function G(k, l) with a complex network;
S53: design an FIR time-domain filter by the window-function method, with the order determined by the delay; or approximate the gain function G(k, l) directly with an IIR time-domain filter;
S54: pass the noisy-speech time-domain signal x(n) through the FIR or IIR time-domain filter to obtain the full-band clean speech ŝ(n) in the time domain.
Example 2
Embodiment 2 of the present application provides a time delay controllable voice noise reduction method, including:
S61: frame the noisy-speech time-domain signal x(n), with frame length N and frame shift R, and apply the short-time Fourier transform to obtain the noisy-speech complex spectrum X(k, l);
S62: estimate the clean-speech magnitude spectrum |Ŝ(k, l)| with a real network, or estimate the real and imaginary parts of the clean-speech complex spectrum Ŝ(k, l) with a complex network;
S63: compute the gain function G(k, l) using equation (7) or equation (11);
S64: design an FIR time-domain filter by the window-function method, with the order determined by the delay; or approximate the gain function G(k, l) directly with an IIR time-domain filter;
S65: pass the noisy-speech time-domain signal x(n) through the FIR or IIR time-domain filter to obtain the full-band clean speech ŝ(n) in the time domain.
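Putting Example 2 together end to end (illustrative only): the sketch below replaces the real/complex network of steps S62-S63 with a simple Wiener-style gain computed from an initial noise estimate, and uses the window-method FIR of S64; the frame sizes, the noise-estimation heuristic, and all names are assumptions:

```python
import numpy as np

def denoise_low_delay(x, frame_len=512, hop=256, fir_len=129, noise_frames=5):
    win = np.hanning(frame_len)
    n_frames = (len(x) - frame_len) // hop + 1
    frames = np.stack([x[l * hop: l * hop + frame_len] * win
                       for l in range(n_frames)])
    X = np.fft.rfft(frames, axis=1)                      # S61: STFT
    # Initial noise PSD from the first few frames (stand-in for the network).
    noise_psd = np.mean(np.abs(X[:noise_frames]) ** 2, axis=0)
    n0 = (fir_len - 1) // 2
    y = np.zeros(len(x) + fir_len)
    for l in range(n_frames):
        snr = np.maximum(np.abs(X[l]) ** 2 / (noise_psd + 1e-10) - 1.0, 0.0)
        G = snr / (snr + 1.0)                            # stand-in for S62-S63
        # S64: window-method FIR from the frame-l gain function.
        h = np.roll(np.fft.irfft(G), n0)[:fir_len] * np.hamming(fir_len)
        seg = x[l * hop: l * hop + hop]                  # S65: per-frame filtering
        y[l * hop: l * hop + hop + fir_len - 1] += np.convolve(seg, h)
    return y[n0: n0 + len(x)]                            # compensate the n0 delay

x = np.sin(2 * np.pi * 440 * np.arange(4096) / 16000.0)
y = denoise_low_delay(x)
```

With fir_len = 129 at 16 kHz, the algorithmic delay of the filtering path is n0 = 64 samples, i.e. 4 milliseconds, matching the delay budget discussed above.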
Example 3
Embodiment 3 of the present application provides a time delay controllable voice noise reduction method, including:
S71: frame the noisy-speech time-domain signal x(n), with frame length N and frame shift R, and apply the short-time Fourier transform to obtain the noisy-speech complex spectrum X(k, l);
S72: directly or indirectly estimate the gain function G(k, l) by deep learning;
S73: approximate the gain function G(k, l) below 4000 Hz with an IIR time-domain filter, and the gain function G(k, l) above 4000 Hz with an FIR time-domain filter;
S74: pass the noisy-speech time-domain signal x(n) through a sub-band analysis filter to obtain the sub-band signal x_L(n) below 4000 Hz and the sub-band signal x_H(n) above 4000 Hz;
S75: pass x_L(n) through the IIR time-domain filter to obtain the sub-band enhanced speech ŝ_L(n) below 4000 Hz, and pass x_H(n) through the FIR time-domain filter to obtain the sub-band enhanced speech ŝ_H(n) above 4000 Hz;
S76: synthesize the two sub-band speech signals ŝ_L(n) and ŝ_H(n) to obtain the final estimated full-band clean speech ŝ(n) in the time domain.
Example 4
Embodiment 4 of the present application provides a time delay controllable voice noise reduction method, including:
S81: frame the noisy-speech time-domain signal x(n), with frame length N and frame shift R, and apply the short-time Fourier transform to obtain the noisy-speech complex spectrum X(k, l);
S82: directly or indirectly estimate the gain function G(k, l) by deep learning;
S83: use the noisy-speech complex spectrum X(k, l) and the gain function G(k, l) as inputs to the deep learning network to map the frame-l FIR time-domain filter h_l(n);
S84: compute the frame-l enhanced speech time-domain signal ŝ_l(n) with equation (12), or use frequency-domain multiplication instead of time-domain convolution to obtain the enhanced speech time-domain signal ŝ_l(n). The output signals of the frames, arranged in time sequence, form the full-band clean speech ŝ(n) in the time domain.
Addressing the low-delay requirement of single-channel speech noise reduction, the embodiments of the present application provide a delay-controllable deep learning noise reduction scheme. It retains the strengths of deep learning speech enhancement, namely strong suppression of highly non-stationary noise and recovery of speech in low signal-to-noise scenarios, while also making the delay controllable. If the delay requirement is no larger than the frame length, the enhanced clean speech can be synthesized directly by the overlap-add method. If the delay requirement is far smaller than the frame length, e.g. 4 milliseconds, the gain function is approximated in the time domain: with an FIR time-domain filter, an IIR time-domain filter, a mixture of the two, or a time-domain filter mapped directly by the deep learning network.
Fig. 3 compares spectrograms before and after processing speech corrupted by Babble noise, according to an embodiment of the present application: (a) noisy speech; (b) gain function estimated by a conventional method and time-domain filter designed by the window-function method; (c) gain function estimated by deep learning and time-domain filter designed by the window-function method; (d) gain function estimated by deep learning and time-domain filter mapped by deep learning. The test results show that the method of the embodiments achieves state-of-the-art speech noise reduction performance at low delay.
The delay-controllable speech noise reduction method of the embodiments performs single-channel noise reduction based on delay-controllable deep learning, combining a supervised learning method (deep learning) with a time-domain filtering method.
In this method, a gain function for each time-frequency point is derived in the time-frequency domain by deep learning, a time-domain filter is optimized for each frame, and filtering enhancement is performed in the time domain to realize speech noise reduction.
The embodiments of the present application provide three ways of fitting the gain function for the optimal time-domain filter design: infinite impulse response (IIR) filters, finite impulse response (FIR) filters, and direct deep learning network mapping.
To reduce computational complexity and improve robustness while accounting for speech characteristics, the embodiments also propose a sub-band analysis-synthesis approach: a sub-band analysis filter first splits the full-band signal into two sub-bands; an IIR time-domain filter is fitted at low frequencies and an FIR time-domain filter at high frequencies; the two sub-bands are then enhanced by time-domain filtering separately; finally, sub-band synthesis reconstructs the full-band speech time-domain signal.
Those of ordinary skill in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of the two; the components and steps of the examples have been described above in terms of their functions to illustrate clearly the interchangeability of hardware and software. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments, objects, technical solutions and advantages of the present application are described in further detail, it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present application, and are not intended to limit the scope of the present application, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present application should be included in the scope of the present application.

Claims (10)

1. A time delay controllable voice noise reduction method is characterized by comprising the following steps:
framing the voice with noise, and performing time-frequency domain transformation to obtain a complex frequency spectrum of the voice with noise;
determining a gain function according to the complex frequency spectrum of the voice with the noise; the gain function is real or complex;
determining a time domain filter according to the gain function, wherein the order of the time domain filter is set according to the time delay requirement;
and inputting the voice with noise into the time domain filter to perform noise reduction processing meeting the time delay requirement to obtain pure voice.
2. The method of claim 1, wherein determining a gain function from the complex spectrum of the noisy speech comprises:
determining a magnitude spectrum of a complex frequency spectrum of the noisy speech;
inputting the magnitude spectrum of the complex frequency spectrum of the voice with noise into the deep learning network, wherein the deep learning network is a real network;
and determining a gain function according to the mapping target of the real network, wherein the gain function is a real number.
3. The method of claim 2, wherein determining a gain function based on the mapped target of the real network comprises:
under the condition that the mapping target of the deep learning network is the magnitude spectrum of pure voice, the gain function is the ratio of the magnitude spectrum of the pure voice to the magnitude spectrum of the voice with noise;
and under the condition that the mapping target of the deep learning network is the magnitude spectrum compression value of the pure voice, the gain function is the ratio of the magnitude spectrum compression value of the pure voice and the magnitude spectrum of the voice with noise.
4. The method of claim 1, wherein determining a gain function from the complex spectrum of the noisy speech comprises:
determining a real part and an imaginary part of the complex frequency spectrum of the voice with noise, and inputting the real part and the imaginary part of the complex frequency spectrum of the voice with noise into the deep learning network, wherein the deep learning network is a complex network; or
Determining real and imaginary parts of a compressed complex spectrum of the noisy speech; inputting the real and imaginary parts of the compressed complex spectrum of the noisy speech into the complex network;
determining a gain function according to a mapping target of the complex network; the gain function is a complex number.
5. The method of claim 4, wherein determining a gain function based on the mapped target of the complex network comprises:
under the condition that the mapping target of the complex network is the complex spectrum of the pure voice, obtaining a gain function according to the ratio of the complex spectrum of the pure voice to the voice with noise, wherein the gain function is complex; or
And under the condition that the mapping target of the complex network is the compressed complex spectrum of the pure voice, obtaining a gain function according to the ratio of the compressed complex spectrum of the pure voice and the voice with noise, wherein the gain function is complex.
6. The method according to any of claims 1-5, wherein said determining a time-domain filter according to said gain function, the order of said time-domain filter being set according to a delay control requirement, comprises:
approximating the gain function by using a finite impulse response time domain filter, wherein the gain function is a fitting value of the finite impulse response time domain filter;
and determining the order of the finite impulse response time domain filter according to the time delay control requirement.
7. The method according to any of claims 1-5, wherein said determining a time-domain filter according to said gain function, the order of said time-domain filter being set according to a delay control requirement, comprises:
approximating the gain function by using an infinite impulse response time domain filter, wherein the gain function is a fitting value of the infinite impulse response time domain filter;
determining the amplitude-frequency response of the infinite impulse response time domain filter;
and determining the order of the infinite impulse response time domain filter according to the amplitude value of the gain function and the amplitude-frequency response, wherein the order of the infinite impulse response time domain filter meets the time delay control requirement.
8. The method according to any of claims 1-5, wherein said inputting said noisy speech into said time-domain filter for denoising in accordance with said delay requirement to obtain clean speech comprises:
and inputting the voice with noise into a finite impulse response time domain filter or an infinite impulse response time domain filter, and performing noise reduction processing according with the time delay requirement to obtain pure voice.
9. The method according to any of claims 1-5, wherein said inputting said noisy speech into said time-domain filter for denoising in accordance with said delay requirement to obtain clean speech comprises:
dividing the frequency of the voice with noise to obtain a first sub-band signal and a second sub-band signal;
the first sub-band signal is processed by an infinite impulse response time domain filter to obtain estimated middle and low frequency voice;
the second sub-band signal is processed by a finite impulse response time domain filter to obtain estimated high-frequency voice;
and synthesizing the medium and low frequency voice signal and the high frequency voice signal to obtain pure voice.
10. The method according to any one of claims 1 to 5, wherein said inputting said time-domain signal of the noisy speech into said time-domain filter for performing noise reduction processing meeting said delay requirement to obtain clean speech comprises:
determining a finite impulse response time domain filter mapped by the deep learning network, the order being 2·t_d·f_s/1000; wherein t_d is the delay requirement and f_s is the sampling frequency;
convolving the l frame of speech with noise with a finite impulse response time domain filter obtained by mapping the l frame of speech with noise by a deep learning network to obtain the l frame of pure speech; l is a natural number;
and arranging the pure voice of the frame l according to time to obtain the pure voice meeting the time delay requirement.
CN202210258932.0A 2022-03-16 2022-03-16 Time delay controllable voice noise reduction method Pending CN114566179A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210258932.0A CN114566179A (en) 2022-03-16 2022-03-16 Time delay controllable voice noise reduction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210258932.0A CN114566179A (en) 2022-03-16 2022-03-16 Time delay controllable voice noise reduction method

Publications (1)

Publication Number Publication Date
CN114566179A true CN114566179A (en) 2022-05-31

Family

ID=81719039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210258932.0A Pending CN114566179A (en) 2022-03-16 2022-03-16 Time delay controllable voice noise reduction method

Country Status (1)

Country Link
CN (1) CN114566179A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115132231A (en) * 2022-08-31 2022-09-30 安徽讯飞寰语科技有限公司 Voice activity detection method, device, equipment and readable storage medium
CN115132231B (en) * 2022-08-31 2022-12-13 安徽讯飞寰语科技有限公司 Voice activity detection method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN107845389B (en) Speech enhancement method based on multi-resolution auditory cepstrum coefficient and deep convolutional neural network
CN107452389B (en) Universal single-track real-time noise reduction method
CN110085249B (en) Single-channel speech enhancement method of recurrent neural network based on attention gating
Hermansky et al. RASTA processing of speech
CN110867181B (en) Multi-target speech enhancement method based on SCNN and TCNN joint estimation
Doclo et al. GSVD-based optimal filtering for single and multimicrophone speech enhancement
US8880396B1 (en) Spectrum reconstruction for automatic speech recognition
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
Xiao et al. Normalization of the speech modulation spectra for robust speech recognition
CN108172231A (en) A kind of dereverberation method and system based on Kalman filtering
CN112735456B (en) Speech enhancement method based on DNN-CLSTM network
CN105679330B (en) Based on the digital deaf-aid noise-reduction method for improving subband signal-to-noise ratio (SNR) estimation
CN113077806B (en) Audio processing method and device, model training method and device, medium and equipment
CN110808057A (en) Voice enhancement method for generating confrontation network based on constraint naive
WO2022256577A1 (en) A method of speech enhancement and a mobile computing device implementing the method
CN110970044B (en) Speech enhancement method oriented to speech recognition
Doclo et al. Multimicrophone noise reduction using recursive GSVD-based optimal filtering with ANC postprocessing stage
Li et al. A multi-objective learning speech enhancement algorithm based on IRM post-processing with joint estimation of SCNN and TCNN
CN114566179A (en) Time delay controllable voice noise reduction method
CN110931034B (en) Pickup noise reduction method for built-in earphone of microphone
CN115424627A (en) Voice enhancement hybrid processing method based on convolution cycle network and WPE algorithm
CN114023352B (en) Voice enhancement method and device based on energy spectrum depth modulation
CN113066483B (en) Sparse continuous constraint-based method for generating countermeasure network voice enhancement
CN114189781A (en) Noise reduction method and system for double-microphone neural network noise reduction earphone
EP2063420A1 (en) Method and assembly to enhance the intelligibility of speech

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination