CN114566179A - Time delay controllable voice noise reduction method - Google Patents

Time delay controllable voice noise reduction method

Info

Publication number
CN114566179A
CN114566179A (application CN202210258932.0A)
Authority
CN
China
Prior art keywords
voice
gain function
complex
domain filter
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210258932.0A
Other languages
Chinese (zh)
Inventor
邱锋海
王之禹
项京朋
Current Assignee
Beijing Sound+ Technology Co ltd
Original Assignee
Beijing Sound+ Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sound+ Technology Co ltd filed Critical Beijing Sound+ Technology Co ltd
Priority to CN202210258932.0A priority Critical patent/CN114566179A/en
Publication of CN114566179A publication Critical patent/CN114566179A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 - Processing in the time domain
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 - Speech or audio signals analysis-synthesis techniques using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0212 - Speech or audio signals analysis-synthesis techniques using spectral analysis, using orthogonal transformation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques using predictive techniques
    • G10L19/26 - Pre-filtering or post-filtering
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 - Processing in the frequency domain
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks

Abstract

The present application relates to a voice noise reduction method with controllable time delay, comprising the following steps: framing the noisy speech and applying a time-to-frequency-domain transform to obtain the complex spectrum of the noisy speech; determining a gain function, real or complex, from the complex spectrum of the noisy speech; determining a time-domain filter from the gain function, the order of the time-domain filter being set according to the delay requirement; and feeding the noisy speech into the time-domain filter for noise reduction processing that meets the delay requirement, obtaining clean speech. With the method provided by the embodiments of the present application, state-of-the-art voice noise reduction performance can be achieved at low delay, while reducing computational complexity and improving robustness.

Description

Time delay controllable voice noise reduction method
Technical Field
The present application relates to the multimedia field, and in particular, to a time delay controllable voice noise reduction method.
Background
Voice noise reduction has very important applications in voice communication, speech recognition, hearing aids, cochlear implants, and the like, and can significantly improve communication quality and interactive experience. Speech noise reduction methods can be divided into unsupervised methods, including spectral subtraction, subspace methods, etc., and supervised methods, including non-negative matrix factorization, dictionary learning, deep neural network methods, etc. Currently, most speech noise reduction is performed in the time-frequency domain: the data is first framed and windowed, then Fourier-transformed; a gain function is estimated by unsupervised inference or by a supervised method and applied to the complex spectrum of the noisy signal; finally the time-domain signal is reconstructed by Overlap-Add (OLA). With this type of approach, the delay is determined by the frame length. In many systems the delay is severely constrained: in hearing aid systems, the total signal processing delay must be kept within 4 ms to reduce the comb effect while meeting the minimum perceivable time difference; current TWS (True Wireless Stereo) earphones have a pass-through (Transparency) mode which, once activated, behaves like a hearing aid and likewise requires the delay to stay within 4 ms; and in sound reinforcement systems that suppress noise in the microphone pickup signal, the delay requirement is even stricter, since excessive algorithmic delay causes an audible lag of the reinforced sound and, in severe cases, echo. Therefore, a low-delay, high-performance voice noise reduction method has important application value.
One way to reduce the delay is to shorten the frame length, for example to 4 ms. However, research has shown that with an excessively short frame length the frequency resolution of the spectrum after the Fourier transform is too low: an unsupervised method then cannot effectively suppress noise between speech harmonics, while a supervised method loses discrimination between noise features and speech features, which severely degrades its performance and, in severe cases, prevents training of the supervised learning model from converging.
Another way to reduce the delay is to keep a long frame and reduce the frame shift, for example to 4 ms or even 2 ms. However, most existing delay-controllable voice noise reduction methods use a frequency-domain analysis-synthesis approach and reconstruct the enhanced time-domain signal by overlap-add, so the delay is still determined by the frame length. It is worth mentioning that existing speech separation methods adopting an end-to-end time-domain approach have a delay theoretically determined by the frame shift, but their performance is inferior to frequency-domain analysis-synthesis, their computational complexity is higher, and their stability is insufficient.
Disclosure of Invention
The purpose of the present application is to allow the time delay to be set according to the practical application, so as to achieve controllable delay, reduce computational complexity, and improve robustness.
In order to achieve the above object, the present application provides a time delay controllable voice noise reduction method, including the following steps: framing the voice with noise, and transforming a time domain and a frequency domain to obtain a complex frequency spectrum of the voice with noise; determining a gain function according to the complex frequency spectrum of the voice with noise; the gain function is real or complex; determining a time domain filter according to the gain function, wherein the order of the time domain filter is set according to the time delay requirement; and inputting the voice with noise into the time domain filter to perform noise reduction processing meeting the time delay requirement to obtain pure voice.
As a preferred embodiment, the determining a gain function according to the complex spectrum of the noisy speech includes: determining a magnitude spectrum of a complex frequency spectrum of the noisy speech; inputting the magnitude spectrum of the complex frequency spectrum of the voice with noise into the deep learning network, wherein the deep learning network is a real network; and determining a gain function according to the mapping target of the real network, wherein the gain function is a real number.
As a preferred embodiment, the determining a gain function according to the mapping target of the real network includes: in the case where the mapping target of the deep learning network is the magnitude spectrum of the clean speech, the gain function is the ratio of the magnitude spectrum of the clean speech to the magnitude spectrum of the noisy speech; and in the case where the mapping target of the deep learning network is the compressed magnitude spectrum of the clean speech, the gain function is the ratio of the compressed magnitude spectrum of the clean speech to the magnitude spectrum of the noisy speech.
As a preferred embodiment, the determining the gain function according to the complex spectrum includes: determining the real part and the imaginary part of the complex spectrum of the noisy speech and inputting them into the deep learning network, the deep learning network being a complex network; or determining the real and imaginary parts of a compressed complex spectrum of the noisy speech and inputting them into the complex network; and determining a gain function according to a mapping target of the complex network, the gain function being complex.
As a preferred embodiment, the determining a gain function according to the mapping objective of the complex network includes: under the condition that the mapping target of the complex network is a complex spectrum of pure voice, obtaining a gain function according to the ratio of the complex spectrum of the pure voice to the voice with noise, wherein the gain function is complex; or obtaining a gain function according to a ratio of the compressed complex spectrum of the clean speech to the noisy speech under the condition that the mapping target of the complex network is the compressed complex spectrum of the clean speech, wherein the gain function is complex.
As a preferred embodiment, the determining a time-domain filter according to the gain function, where an order of the time-domain filter is set according to a delay control requirement, includes: approximating the gain function by using a finite impulse response time domain filter, wherein the gain function is a fitting value of the finite impulse response time domain filter; and determining the order of the finite impulse response time domain filter according to the time delay control requirement.
As a preferred embodiment, the determining a time-domain filter according to the gain function, where an order of the time-domain filter is set according to a delay control requirement, includes: approximating the gain function by using an infinite impulse response time domain filter, wherein the gain function is a fitting value of the infinite impulse response time domain filter; determining the amplitude-frequency response of the infinite impulse response time domain filter; and determining the order of the infinite impulse response time domain filter according to the amplitude value of the gain function and the amplitude-frequency response, wherein the order of the infinite impulse response time domain filter meets the time delay control requirement.
As a preferred embodiment, the inputting the noisy speech into the time-domain filter to perform noise reduction processing meeting the delay requirement to obtain clean speech includes: and inputting the voice with noise into a finite impulse response time domain filter or an infinite impulse response time domain filter, and performing noise reduction processing according with the time delay requirement to obtain pure voice.
As a preferred embodiment, the inputting the noisy speech into the time-domain filter to perform noise reduction processing meeting the delay requirement to obtain clean speech includes: dividing the frequency of the voice with noise to obtain a first sub-band signal and a second sub-band signal; the first sub-band signal is processed by an infinite impulse response time domain filter to obtain estimated middle and low frequency voice; the second sub-band signal is processed by a finite impulse response time domain filter to obtain estimated high-frequency voice; and synthesizing the medium and low frequency voice signal and the high frequency voice signal to obtain pure voice.
As a preferred embodiment, the inputting the time-domain signal of the noisy speech into the time-domain filter to perform noise reduction processing meeting the delay requirement to obtain clean speech includes: determining a finite impulse response time-domain filter mapped by the deep learning network with order 2·t_d·f_s/1000, where t_d is the delay requirement (in milliseconds) and f_s is the sampling frequency (in Hz); convolving the l-th frame of noisy speech with the finite impulse response time-domain filter obtained by mapping that frame through the deep learning network, to obtain the l-th frame of clean speech, l being a natural number; and arranging the clean speech frames in time order to obtain clean speech meeting the delay requirement.
By adopting the method provided by the embodiment of the application, the advanced voice noise reduction performance can be achieved under the condition of low time delay, the operation complexity is reduced, and the robustness is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments disclosed in this specification, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments disclosed in this specification, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a delay-controllable voice denoising method according to an embodiment of the present application;
fig. 2 is a flow chart of FIR time-domain filter design in a time-delay controllable voice noise reduction method according to an embodiment of the present application;
fig. 3 is a diagram of a spectrogram test effect before and after Babble noise processing according to an embodiment of the present application.
Detailed Description
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, references to the terms "first/second/third" or module A, module B, module C, etc. are used solely to distinguish between similar objects and do not denote a particular order or importance; the specific order or sequence may be interchanged, where permissible, so that the embodiments of the application described herein can be practiced in an order other than that shown or described herein.
In the following description, reference numerals indicating steps, such as S110, S120, etc., do not necessarily indicate that the steps are performed in that order; the order of the steps may be interchanged, or steps may be performed simultaneously, where permissible.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. The following examples are only for illustrating the technical solutions of the present application more clearly, and the protection scope of the present application is not limited thereby.
It should be noted that the term "first" in the description and claims of the embodiments of the present application is used for distinguishing different objects, and is not used for describing a specific order of the objects. For example, the first speech segment is used to distinguish between different speech segments, rather than to describe a particular order of target objects. In the embodiments of the present application, words such as "exemplary," "for example," or "such as" are used to mean serving as examples, illustrations, or illustrations. Any embodiment or design described herein as "exemplary," "for example," or "such as" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the words "exemplary," "for example," or "such as" are intended to present relevant concepts in a concrete fashion.
First, the principle of a delay-controllable speech noise reduction method provided in the embodiment of the present application is introduced.
Suppose that the time domain signal with noise picked up by the microphone is x (n):
x(n) = s(n) + d_s(n) + d_t(n)   (1)
In formula (1), s(n) is clean speech, d_s(n) is stationary noise, and d_t(n) is transient noise. After the Short-Time Fourier Transform (STFT), the signal model can be expressed as:
X(k,l) = S(k,l) + D_s(k,l) + D_t(k,l)   (2)
In formula (2), k and l denote the k-th frequency bin and the l-th frame, respectively; S(k,l) is the complex spectrum of the clean speech, D_s(k,l) and D_t(k,l) are the complex spectra of the stationary noise and the transient noise, respectively, and X(k,l) is the complex spectrum of the noisy speech. Taking X(k,l) as an example, the short-time Fourier transform is:
X(k,l) = Σ_{n=0}^{N-1} x(n + lR) e^{-j2πkn/N}   (3)
In formula (3), R is the frame shift, and N is the frame length.
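The framing and transform of Eq. (3) can be sketched directly. The snippet below is a minimal illustration, not the patent's implementation; a Hann analysis window is an assumption of this sketch, since the equation itself does not fix one.

```python
import cmath
import math

def stft_frame(x, l, N, R):
    # Complex spectrum X(k, l) of frame l per Eq. (3): frame length N, frame shift R.
    # A Hann analysis window is assumed here; the patent does not specify the window.
    w = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]
    frame = [w[n] * x[n + l * R] for n in range(N)]
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

# A test tone at normalized frequency 0.25 should peak at bin k = N * 0.25 = 4.
x = [math.sin(2 * math.pi * 0.25 * n) for n in range(64)]
X0 = stft_frame(x, l=0, N=16, R=8)
peak_bin = max(range(16), key=lambda k: abs(X0[k]))
```

For real input the spectrum is conjugate-symmetric, so the mirrored bin N − k carries the same magnitude as bin k.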
In the case where only the noisy time-domain signal x(n), or its complex spectrum X(k,l), is picked up by the microphone, the purpose of single-channel speech noise reduction is to estimate the clean speech s(n) or its complex spectrum S(k,l) by some speech noise reduction method, which can generally be written as:
Ŝ(k,l) = G(k,l) X(k,l)   (4)
in the prior art, the value of G (k, l) is usually real, and the value is between 0 and 1.
To reduce spectral leakage, existing methods generally decompose the noisy speech by subband analysis, but subband decomposition introduces a certain delay and increases the computational complexity. Thus, some existing methods estimate the gain function at each time-frequency point, or estimate the noise power at each time-frequency point and then compute the gain function from the noisy speech power and the noise power at that point. Because these methods rely on voice endpoint detection or minimum-statistics techniques when estimating the noise power, the power of non-stationary noise is difficult to estimate accurately, and hence the gain function G(k,l) is difficult to estimate accurately. Since the accuracy of the estimate of G(k,l) directly affects the time-domain filter h_l(n), the existing methods, although capable of controlling the delay, suppress non-stationary noise poorly.
In some cases, a speech noise reduction method based on a deep learning network may be adopted that does not explicitly estimate G(k,l), but directly maps the complex spectrum or the magnitude spectrum of the clean speech, e.g.:
Ŝ(k,l) = DL_sNet{X(k,l)}   (5)
whether formula (4) or formula (5) is used, the pure speech is reconstructed by overlap-add method, and the time delay is determined by the frame length N.
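The conventional pipeline around Eqs. (4)-(5) (per-frame gain, inverse transform, overlap-add) can be sketched to make the frame-length delay concrete: frame l is complete only once sample l·R + N − 1 has arrived, so the output lags by the frame length N, not the frame shift R. A Hann window at 50% overlap, which satisfies the constant-overlap-add condition, is an assumption of this sketch.

```python
import cmath
import math

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

def enhance_ola(x, gains, N, R):
    # Apply the per-frame gain (Eq. 4) and reconstruct by overlap-add (OLA).
    w = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]  # Hann, COLA at R = N/2
    y = [0.0] * len(x)
    num_frames = (len(x) - N) // R + 1
    for l in range(num_frames):
        X = dft([w[n] * x[l * R + n] for n in range(N)])
        s = idft([gains[l][k] * X[k] for k in range(N)])
        for n in range(N):
            y[l * R + n] += s[n]
    return y

N, R = 16, 8
x = [math.sin(0.3 * n) for n in range(80)]
unity = [[1.0] * N for _ in range((len(x) - N) // R + 1)]
y = enhance_ola(x, unity, N, R)  # with unit gains, the interior is reconstructed exactly
```

With unit gains the interior samples (where two Hann windows overlap and sum to one) reproduce the input, confirming the analysis-synthesis chain is consistent.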
The embodiment of the application provides a time delay controllable voice noise reduction method, which comprises the steps of carrying out framing and time-frequency domain transformation on voice with noise to obtain a complex frequency spectrum of the voice with noise; determining a gain function of a complex frequency spectrum of the voice with noise; the gain function is real or complex; determining a time domain filter according to the gain function, wherein the order of the time domain filter is determined according to the time delay control requirement; and inputting the voice with the noise into a time domain filter for noise reduction to obtain pure voice meeting the time delay requirement.
Fig. 1 is a flowchart of a voice denoising method with controllable delay according to an embodiment of the present application. As shown in FIG. 1, delay-controllable voice noise reduction may be achieved by the following steps S1-S4.
And S1, framing the voice with noise, and transforming the time domain and the frequency domain to obtain the complex frequency spectrum of the voice with noise.
In an implementation manner, the noisy speech x (n) picked up by the microphone may be passed through a subband analysis filter, and a subband signal of the noisy speech of the l-th frame of the k-th frequency point may be output.
According to the above embodiment, the noisy speech x(n) may be passed through a subband analysis filter, and the full-band signal may be divided into time-domain signals of at least two subbands, denoted x_L(n) and x_H(n), where x_L(n) is the low- and mid-band subband speech (e.g., below 4000 Hz) and x_H(n) is the subband speech above 4000 Hz.
In an implementation manner, the noisy speech X (n) picked up by the microphone may be subjected to a frequency band analysis, and a complex spectrum of the noisy speech of the l frame at the k frequency point is output, where the complex spectrum of the noisy speech of the l frame at the k frequency point may be marked as X (k, l).
By comparison, the present method adopts frequency-band analysis, i.e., the short-time Fourier transform, to compute the complex spectrum of the noisy speech. It can achieve the goal of delay-controllable voice noise reduction with lower computational complexity than a subband-analysis scheme, while still obtaining satisfactory performance.
S2, determining a gain function according to the complex spectrum of the noisy speech, the gain function being real or complex.
In one implementation, a conventional speech noise reduction method is used to estimate the gain function G(k,l) for the subband signal of the noisy speech in the l-th frame at the k-th frequency point.
In an implementation manner, a gain function G (k, l) may be obtained by mapping a complex spectrum of a noisy speech in an l-th frame of a k-th frequency point by using a deep learning network.
S3, determining a time-domain filter h_l(n) according to the gain function G(k,l), where the order of the time-domain filter is determined by the delay control requirement.
S4, inputting the noisy speech x(n) into the time-domain filter h_l(n) for filtering to obtain the enhanced clean speech ŝ(n).
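Steps S3-S4 amount to running each frame's filter h_l(n) over the incoming samples by direct convolution. The sketch below is a minimal illustration of that filtering step; a real system would typically crossfade between successive frames' filters, a refinement omitted here.

```python
def filter_framewise(x, filters, R):
    # S4 sketch: for frame l, the R new input samples are filtered by direct
    # convolution with that frame's time-domain filter h_l(n); outputs are
    # concatenated in time.
    y = []
    for l, h in enumerate(filters):
        for n in range(l * R, min((l + 1) * R, len(x))):
            acc = 0.0
            for m in range(len(h)):
                if n - m >= 0:
                    acc += h[m] * x[n - m]
            y.append(acc)
    return y

# With an identity filter h = [1, 0, 0, 0] the output equals the input.
x = [0.1 * n for n in range(20)]
h_id = [1.0, 0.0, 0.0, 0.0]
y = filter_framewise(x, [h_id] * 5, R=4)
```

The identity-filter check confirms the convolution indexing: each output sample depends only on the current and past input samples, which is what keeps the scheme causal and low-delay.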
In the speech noise reduction method with controllable delay proposed in the embodiment of the present application, step S2 employs a speech noise reduction method of a deep learning network, and G (k, l) may be a real number or a complex number. In a low signal-to-noise ratio scenario, a complex gain function is used, which generally exhibits better performance since the clean speech phase can be estimated simultaneously.
In one implementation, step S2 is implemented by the following steps.
S21, obtaining a gain function G (k, l) by using deep learning network mapping, that is:
G(k,l)=DL_gNet{X(k,l)} (6)
In equation (6), the gain function G(k,l) may be real or complex; accordingly, the deep learning network may be a real network or a complex network. When the deep learning network adopts a real network, the input is the magnitude spectrum |X(k,l)| of the complex spectrum of the noisy speech, the compressed magnitude spectrum |X(k,l)|^β, Mel-frequency cepstral coefficients (MFCC), or the like.
The mapping target of the deep learning network is the real-valued G(k,l), and the cost function of the deep learning network can be determined by SA (signal approximation), i.e.:
Loss = Σ_{k,l} ( (G(k,l)|X(k,l)|)^β − |S(k,l)|^β )²   (7)
where β takes a value between 0 and 1, typically 0.5.
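A numeric sketch of the SA cost above; the compressed-magnitude form of Eq. (7) is a reconstruction from context (the β defined beside it), so the exact form is an assumption.

```python
def sa_loss(G, X_mag, S_mag, beta=0.5):
    # Signal-approximation (SA) cost: compare the gained noisy magnitude with the
    # clean magnitude, both compressed by beta. The precise form of Eq. (7) is
    # assumed here, not taken verbatim from the patent.
    total = 0.0
    for g, x, s in zip(G, X_mag, S_mag):
        total += ((g * x) ** beta - s ** beta) ** 2
    return total

# The oracle gain (ratio of clean to noisy magnitude) drives the loss to zero.
X_mag = [2.0, 4.0, 1.0]
S_mag = [1.0, 1.0, 0.5]
oracle = [s / x for s, x in zip(S_mag, X_mag)]
```

A unit gain (no suppression) leaves a positive residual, which is what the training minimizes.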
In an implementation manner, in the case that the deep learning network is a real network, the magnitude spectrum of the complex spectrum of the noisy speech may be input into the deep learning network, and the output mapping target is a gain function, where the gain function is a real number.
In the case where the real network does not directly map the real gain function G(k,l), i.e., the mapping target is the magnitude spectrum of the clean speech or its compressed value |S(k,l)|^β, the gain function is determined as the ratio of the mapping target to the noisy speech magnitude spectrum, i.e.:
G(k,l) = |Ŝ(k,l)| / |X(k,l)|  (or |Ŝ(k,l)|^β / |X(k,l)| for the compressed target)   (8)
in this embodiment, in the case where the mapping target of the deep learning network is the magnitude spectrum of clean speech, the gain function is the ratio of the magnitude spectrum of clean speech to the magnitude spectrum of noisy speech; in the case where the mapping target of the deep learning network is the magnitude spectrum compression value of the clean speech, the gain function is the ratio of the magnitude spectrum compression value of the clean speech to the magnitude spectrum of the noisy speech.
In one implementation, when the deep learning network is a complex network, the input to the deep learning network is the real part and imaginary part of the complex spectrum X(k,l) of the noisy speech: X(k,l) = X_r(k,l) + j·X_i(k,l), i.e., X_r(k,l) and X_i(k,l), the real and imaginary parts of X(k,l), can be used as the input of the complex network.
In this embodiment, the real part and the imaginary part of the compressed complex spectrum of the noisy speech may also be used as input to the deep learning network, the compressed complex spectrum being:
X^(c)(k,l) = |X(k,l)|^β e^{j∠X(k,l)} = X_r^(c)(k,l) + j·X_i^(c)(k,l)   (9)
where X_r^(c)(k,l) and X_i^(c)(k,l) are the real and imaginary parts of the compressed complex spectrum X^(c)(k,l).
Obviously, the compressed complex spectrum X^(c)(k,l) changes the magnitude of the original complex spectrum X(k,l) from |X(k,l)| to |X(k,l)|^β without changing the phase; the compressed complex spectrum generally yields better noise reduction performance.
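The magnitude-compression of Eq. (9) is straightforward to sketch: keep the phase, raise the magnitude to the power β.

```python
import cmath

def compress_spectrum(X, beta=0.5):
    # Eq. (9): preserve each bin's phase, compress its magnitude to |X|**beta.
    out = []
    for v in X:
        mag, ph = abs(v), cmath.phase(v)
        out.append(cmath.rect(mag ** beta, ph))
    return out

X = [3 + 4j, -2j, 0.25]
Xc = compress_spectrum(X)
```

For the first bin, |3 + 4j| = 5, so the compressed magnitude is 5**0.5 while the angle is untouched.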
The typical mapping target is the complex G(k,l), and the cost function of the deep learning can also be SA (signal approximation), i.e.:
Loss_all = α·Loss_mag + (1−α)·Loss_complex   (10)
where α is between 0 and 1, and the complex-domain loss function Loss_complex is:
Loss_complex = Σ_{k,l} [ (Re{G(k,l)X^(c)(k,l)} − S_r^(c)(k,l))² + (Im{G(k,l)X^(c)(k,l)} − S_i^(c)(k,l))² ]   (11)
where Re{G(k,l)X^(c)(k,l)} and Im{G(k,l)X^(c)(k,l)} are the real and imaginary parts of G(k,l)X^(c)(k,l), and S_r^(c)(k,l) and S_i^(c)(k,l) are the real and imaginary parts of the compressed complex spectrum of the clean speech spectrum S(k,l). In the speech noise reduction task, α is typically 0.5; its value balances speech distortion against the amount of noise reduction.
If the complex network does not directly map the complex G(k,l), i.e., the mapping target is the clean speech complex spectrum or its compressed complex spectrum, then the gain function can be obtained as the ratio of the mapping target to the (correspondingly compressed) noisy speech spectrum, i.e.:
G(k,l) = Ŝ^(c)(k,l) / X^(c)(k,l)   (12)
in one implementation, in the case that the mapping target of the deep learning network is the complex spectrum of the clean speech, the gain function can be obtained according to the ratio of the complex spectrum of the clean speech to the noisy speech, and the gain function is complex.
In one implementation, in the case that the mapping target of the deep learning network is a compressed complex spectrum of clean speech, a gain function is obtained according to a ratio of the compressed complex spectrum of clean speech to noisy speech, and the gain function is complex.
The deep learning network adopted by the delay-controllable voice noise reduction method provided in the embodiments of the present application may be a fully connected network (FC), a convolutional neural network (CNN), a long short-term memory network (LSTM), or the like. The size of the model can be determined according to the computing and storage resources of the chip or platform, and the choice of model can be determined by the acceleration kernel that the chip or platform provides. In one implementation, the time-domain filter design of step S3 can be realized by the following steps.
S31: when a finite impulse response (FIR) filter is used, take the gain function G(k, l) as the fitting target of the FIR time-domain filter.

In one implementation, the gain function G(k, l) may be approximated with a linear-phase FIR filter.
S32: determine the order of the FIR filter according to the delay control requirement.

Because a linear-phase FIR filter has a symmetric impulse response, filtering the noisy speech time-domain signal with it introduces a delay of exactly half the filter order (in samples).
Illustratively, when the delay requirement is t_d milliseconds and the sampling rate is f_s (in Hz), the order of the FIR filter is at most 2·t_d·f_s/1000. For example, for a delay requirement t_d of 4 milliseconds and a sampling rate f_s of 16000 Hz, the maximum length of the FIR filter is 128 points.
Fig. 2 is a flow chart of FIR time domain filter design. As shown in fig. 2, the FIR time-domain filter can be obtained by window function design, and step S32 can be realized by the following steps S321 to S323.
S321: transform the gain function G(k, l) of frame l back to the time domain via the inverse short-time Fourier transform, obtaining the time-domain gain function G_l(n) of the frame-l signal.

S322: shift G_l(n) to the right by n_0 samples in the time domain, obtaining G_l(n - n_0).

S323: truncate G_l(n - n_0) to obtain the FIR time-domain filter h_l(n) for frame l.
The FIR time domain filter obtained by the window function design has the advantages of low operation complexity and stable performance.
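Steps S321-S323 above can be sketched in a few lines of NumPy. The Hamming truncation window and the half-length shift n_0 are assumptions; the patent only requires a shift and a truncation:

```python
import numpy as np

def fir_from_gain(G_l, fir_len):
    """Window-method FIR design (S321-S323) for the frame-l gain function.

    G_l: one-sided gain function of frame l (rfft bins).
    fir_len: odd FIR length chosen from the delay requirement (S32).
    """
    n0 = (fir_len - 1) // 2
    g = np.fft.irfft(G_l)                   # S321: back to the time domain
    g = np.roll(g, n0)                      # S322: right shift by n0 samples
    h = g[:fir_len] * np.hamming(fir_len)   # S323: truncate with a window
    return h

# A unity gain function yields (approximately) a pure delay of n0 samples.
h = fir_from_gain(np.ones(65), 33)
```

The all-pass check at the end illustrates the delay behavior: a unity gain gives an impulse at n0 = 16, i.e. a 16-sample delay, consistent with the half-order delay of a linear-phase FIR filter.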
In one implementation, an infinite impulse response (IIR) time-domain filter can be obtained with a minimum-phase design. Because an IIR time-domain filter can be designed as a minimum-phase filter, its delay is shorter than that of an FIR time-domain filter. The drawback of the IIR approximation is that linear phase cannot be guaranteed, i.e. the delay may differ from frequency point to frequency point. The higher the order of the IIR time-domain filter, the more accurate the approximation and the higher the complexity; an order that is too low may cause a large approximation error and thus unstable noise reduction performance.
In this embodiment, step S3 may be implemented by the following steps S31'-S33'.

S31': when an IIR time-domain filter is used, take the gain function as the fitting target of the IIR time-domain filter and obtain the filter with a minimum-phase design.

S32': compute the amplitude-frequency response of the IIR time-domain filter from its coefficients.

S33': determine the order of the infinite impulse response time-domain filter from the amplitude of the gain function and the amplitude-frequency response, such that the order meets the delay control requirement.
In one implementation, the amplitude-frequency response of the IIR time-domain filter is compared with the amplitude |G(k, l)| of the gain function; when the error is large, the order of the IIR time-domain filter is increased to reduce the approximation error while still meeting the delay control requirement.
The delay-controllable speech noise reduction method of this embodiment fits the gain function G(k, l) with an IIR time-domain filter, so the delay can be kept within a small range.
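The patent does not specify how the minimum-phase IIR fit of steps S31'-S33' is computed. One plausible sketch, assumed for illustration, builds a minimum-phase spectrum with magnitude |G(k, l)| via real-cepstrum folding and fits an IIR filter to its impulse response with Prony's method:

```python
import numpy as np

def minimum_phase_spectrum(mag, n_fft):
    """Minimum-phase spectrum with the given magnitude (real-cepstrum folding)."""
    cep = np.fft.irfft(np.log(np.maximum(mag, 1e-8)), n_fft)
    fold = np.zeros(n_fft)
    fold[0] = cep[0]
    fold[1:n_fft // 2] = 2.0 * cep[1:n_fft // 2]
    fold[n_fft // 2] = cep[n_fft // 2]
    return np.exp(np.fft.rfft(fold))

def prony_fit(h, nb, na):
    """Least-squares Prony fit of impulse response h by B(z)/A(z)."""
    N = len(h)
    # Tail equations h[n] = -sum_k a_k h[n-k] for n > nb give the denominator.
    rows = [[h[n - k] if n - k >= 0 else 0.0 for k in range(1, na + 1)]
            for n in range(nb + 1, N)]
    a_tail, *_ = np.linalg.lstsq(np.array(rows), -h[nb + 1:], rcond=None)
    a = np.concatenate(([1.0], a_tail))
    # Numerator matches the first nb+1 samples of h exactly.
    b = np.convolve(a, h)[:nb + 1]
    return b, a

# Flat gain stays flat after the minimum-phase construction.
H = minimum_phase_spectrum(np.full(65, 2.0), 128)
# Prony recovers a known one-pole filter from its impulse response 0.5**n.
b, a = prony_fit(0.5 ** np.arange(32), 0, 1)
```

Step S33' then compares the amplitude response of the fitted (b, a) against |G(k, l)| and increases the orders nb, na until the error is acceptable, subject to the delay constraint.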
Because the computational complexity of IIR time-domain filter design and filtering is directly tied to the IIR order, the IIR time-domain filter is better suited to approximating peaks and valleys, whereas the FIR time-domain filter is better suited to approximating gentle amplitude-frequency curves. For speech signals, significant peaks and valleys usually occur only in voiced segments in the mid-to-low frequency band below 4000 Hz; the high frequency band above 4000 Hz has no significant peaks and valleys.
In one implementation, a mixture of FIR and IIR time-domain filters may be adopted; step S3 is then implemented by the following steps S31''-S32''.

S31'': fit the gain function G(k, l) of the noisy speech in the band below 4000 Hz with an IIR filter.

S32'': fit the gain function G(k, l) of the noisy speech in the band above 4000 Hz with an FIR filter, to reduce algorithm complexity.
In one implementation, step S4 may be implemented by the following steps S41-S44.

S41: split the noisy speech into frequency bands to obtain a first sub-band signal and a second sub-band signal;
In one implementation, the time-domain signal x(n) of the noisy speech can be split by a sub-band analysis filter into two sub-band time-domain signals, denoted the first sub-band signal x_L(n) and the second sub-band signal x_H(n).
S42: pass the first sub-band signal through the infinite impulse response time-domain filter to obtain the estimated mid-to-low-frequency speech.

In one implementation, the first sub-band signal x_L(n) is passed through the IIR time-domain filter to obtain the estimated mid-to-low-frequency speech ŝ_L(n).
S43: pass the second sub-band signal through the finite impulse response time-domain filter to obtain the estimated high-frequency speech.

In one implementation, the second sub-band signal x_H(n) is passed through the FIR time-domain filter to obtain the estimated high-frequency speech ŝ_H(n).
S44: synthesize the mid-to-low-frequency speech and the high-frequency speech to obtain the full-band clean speech.

In one implementation, the two sub-band speech signals ŝ_L(n) and ŝ_H(n) are synthesized to obtain the final estimated full-band clean speech ŝ(n) in the time domain.
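Steps S41-S44 can be sketched as follows, with a complementary FIR crossover standing in for a real QMF analysis-synthesis bank (the patent does not fix the filter bank); the 4000 Hz cutoff, the 65-tap crossover, and the per-band enhancement filters g_low and g_high are illustrative:

```python
import numpy as np

def crossover(num_taps, cutoff_hz, fs):
    """Complementary FIR pair: h_low + h_high equals a delayed unit impulse,
    so the two bands sum back to a pure delay of (num_taps-1)//2 samples."""
    d = (num_taps - 1) // 2
    n = np.arange(num_taps) - d
    h_low = (2 * cutoff_hz / fs) * np.sinc(2 * cutoff_hz / fs * n)
    h_low *= np.hamming(num_taps)
    h_high = -h_low.copy()
    h_high[d] += 1.0
    return h_low, h_high

def subband_enhance(x, g_low, g_high, cutoff_hz=4000.0, fs=16000.0, num_taps=65):
    h_low, h_high = crossover(num_taps, cutoff_hz, fs)
    x_l = np.convolve(x, h_low)      # S41: first sub-band (below 4 kHz)
    x_h = np.convolve(x, h_high)     # S41: second sub-band (above 4 kHz)
    s_l = np.convolve(x_l, g_low)    # S42: low band through its (IIR) filter
    s_h = np.convolve(x_h, g_high)   # S43: high band through its (FIR) filter
    return s_l + s_h                 # S44: sub-band synthesis

x = np.random.default_rng(0).standard_normal(400)
y = subband_enhance(x, np.array([1.0]), np.array([1.0]))
```

With identity per-band filters the chain reduces to a pure delay of (num_taps-1)//2 samples, which makes the reconstruction property easy to check.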
In one implementation, step S4 may be implemented by an FIR time domain filter based on deep learning network mapping, including the following steps:
S41': determine the FIR time-domain filter mapped by the deep learning network, with order 2·t_d·f_s/1000.
The enhanced speech time-domain signal is transformed by STFT, and the cost function of equation (10) can likewise be used to train the parameters of the deep learning network that maps the FIR time-domain filter.
S42': convolve the frame-l time-domain signal of the noisy speech with the finite impulse response time-domain filter mapped by the deep learning network for frame l, obtaining the frame-l clean speech that meets the delay requirement:

ŝ_l(n) = x_l(n) * h_l(n)    (13)

In equation (13), x_l(n) is the frame-l time-domain signal of the noisy speech, h_l(n) is the FIR time-domain filter mapped by the deep learning network for frame l, and ŝ_l(n) is the frame-l clean speech that meets the delay requirement.
The time-domain convolution in equation (13) can also be realized quickly by frequency-domain multiplication. To guarantee linear convolution, h_l(n) should be zero-padded before the Fourier transform, and x_l(n) should include the signal of the previous frame, so that the two sequences have the same length after zero padding.
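The frequency-domain realization described above can be sketched as one overlap-save step; as the text notes, the history kept from the previous frame must span at least len(h) - 1 samples so that the retained tail of the circular convolution equals the linear convolution. All names are illustrative:

```python
import numpy as np

def fft_filter_frame(x_prev, x_cur, h):
    """Filter the current frame with h_l(n) via frequency-domain multiplication.

    x_prev: previous-frame history, len(x_prev) >= len(h) - 1.
    x_cur:  current frame of noisy speech, x_l(n).
    h:      FIR time-domain filter h_l(n) for this frame.
    Returns the linear-convolution output samples aligned with x_cur.
    """
    x = np.concatenate([x_prev, x_cur])
    n_fft = len(x)
    H = np.fft.rfft(h, n_fft)                # zero-pads h to the FFT length
    y = np.fft.irfft(np.fft.rfft(x) * H, n_fft)
    return y[-len(x_cur):]                   # only these samples are linear

rng = np.random.default_rng(2)
x = rng.standard_normal(128)
h = rng.standard_normal(33)
out = fft_filter_frame(x[:64], x[64:], h)
```

The result matches direct time-domain convolution on the current frame, at O(N log N) cost instead of O(N·len(h)).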
Example 1
Embodiment 1 of the present application provides a time delay controllable voice noise reduction method, including:
S51: frame the noisy-speech time-domain signal x(n), with frame length N and frame shift R, and apply the short-time Fourier transform to obtain the noisy-speech complex spectrum X(k, l);
S52: directly estimate a real gain function G(k, l) with a real network, or directly estimate a complex gain function G(k, l) with a complex network;
S53: design an FIR time-domain filter by the window-function method, with the order determined by the delay; or approximate the gain function G(k, l) directly with an IIR time-domain filter;
S54: pass the noisy-speech time-domain signal x(n) through the FIR or IIR time-domain filter to obtain the full-band clean speech ŝ(n) in the time domain.
Example 2
Embodiment 2 of the present application provides a time delay controllable voice noise reduction method, including:
S61: frame the noisy-speech time-domain signal x(n), with frame length N and frame shift R, and apply the short-time Fourier transform to obtain the noisy-speech complex spectrum X(k, l);
S62: estimate the clean-speech magnitude spectrum |Ŝ(k, l)| with a real network, or estimate the real and imaginary parts of the clean-speech complex spectrum Ŝ(k, l) with a complex network;
S63: compute the gain function G(k, l) using equation (7) or equation (11);
S64: design an FIR time-domain filter by the window-function method, with the order determined by the delay; or approximate the gain function G(k, l) directly with an IIR time-domain filter;
S65: pass the noisy-speech time-domain signal x(n) through the FIR or IIR time-domain filter to obtain the full-band clean speech ŝ(n) in the time domain.
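Putting Example 2 together end to end (illustrative only): the sketch below replaces the real/complex network of steps S62-S63 with a simple Wiener-style gain computed from an initial noise estimate, and uses the window-method FIR of S64; the frame sizes, the noise-estimation heuristic, and all names are assumptions:

```python
import numpy as np

def denoise_low_delay(x, frame_len=512, hop=256, fir_len=129, noise_frames=5):
    win = np.hanning(frame_len)
    n_frames = (len(x) - frame_len) // hop + 1
    frames = np.stack([x[l * hop: l * hop + frame_len] * win
                       for l in range(n_frames)])
    X = np.fft.rfft(frames, axis=1)                      # S61: STFT
    # Initial noise PSD from the first few frames (stand-in for the network).
    noise_psd = np.mean(np.abs(X[:noise_frames]) ** 2, axis=0)
    n0 = (fir_len - 1) // 2
    y = np.zeros(len(x) + fir_len)
    for l in range(n_frames):
        snr = np.maximum(np.abs(X[l]) ** 2 / (noise_psd + 1e-10) - 1.0, 0.0)
        G = snr / (snr + 1.0)                            # stand-in for S62-S63
        # S64: window-method FIR from the frame-l gain function.
        h = np.roll(np.fft.irfft(G), n0)[:fir_len] * np.hamming(fir_len)
        seg = x[l * hop: l * hop + hop]                  # S65: per-frame filtering
        y[l * hop: l * hop + hop + fir_len - 1] += np.convolve(seg, h)
    return y[n0: n0 + len(x)]                            # compensate the n0 delay

x = np.sin(2 * np.pi * 440 * np.arange(4096) / 16000.0)
y = denoise_low_delay(x)
```

With fir_len = 129 at 16 kHz, the algorithmic delay of the filtering path is n0 = 64 samples, i.e. 4 milliseconds, matching the delay budget discussed above.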
Example 3
Embodiment 3 of the present application provides a time delay controllable voice noise reduction method, including:
S71: frame the noisy-speech time-domain signal x(n), with frame length N and frame shift R, and apply the short-time Fourier transform to obtain the noisy-speech complex spectrum X(k, l);
S72: directly or indirectly estimate the gain function G(k, l) by deep learning;
S73: approximate the gain function G(k, l) below 4000 Hz with an IIR time-domain filter, and the gain function G(k, l) above 4000 Hz with an FIR time-domain filter;
S74: pass the noisy-speech time-domain signal x(n) through a sub-band analysis filter to obtain the sub-band signal x_L(n) below 4000 Hz and the sub-band signal x_H(n) above 4000 Hz;
S75: pass x_L(n) through the IIR time-domain filter to obtain the sub-band enhanced speech ŝ_L(n) below 4000 Hz, and pass x_H(n) through the FIR time-domain filter to obtain the sub-band enhanced speech ŝ_H(n) above 4000 Hz;
S76: synthesize the two sub-band speech signals ŝ_L(n) and ŝ_H(n) to obtain the final estimated full-band clean speech ŝ(n) in the time domain.
Example 4
Embodiment 4 of the present application provides a time delay controllable voice noise reduction method, including:
S81: frame the noisy-speech time-domain signal x(n), with frame length N and frame shift R, and apply the short-time Fourier transform to obtain the noisy-speech complex spectrum X(k, l);
S82: directly or indirectly estimate the gain function G(k, l) by deep learning;
S83: use the noisy-speech complex spectrum X(k, l) and the gain function G(k, l) as inputs to the deep learning network to map the frame-l FIR time-domain filter h_l(n);
S84: compute the frame-l enhanced speech time-domain signal ŝ_l(n) with equation (12), or use frequency-domain multiplication instead of time-domain convolution to obtain the enhanced speech time-domain signal ŝ_l(n). The output signals of the frames, arranged in time sequence, form the full-band clean speech ŝ(n) in the time domain.
Addressing the low-delay requirement of single-channel speech noise reduction, the embodiments of the present application provide a delay-controllable deep learning noise reduction scheme. It retains the strengths of deep learning speech enhancement, namely strong suppression of highly non-stationary noise and recovery of speech in low signal-to-noise scenarios, while also making the delay controllable. If the delay requirement is no larger than the frame length, the enhanced clean speech can be synthesized directly by the overlap-add method. If the delay requirement is far smaller than the frame length, e.g. 4 milliseconds, the gain function is approximated in the time domain: with an FIR time-domain filter, an IIR time-domain filter, a mixture of the two, or a time-domain filter mapped directly by the deep learning network.
Fig. 3 compares spectrograms before and after processing speech corrupted by Babble noise, according to an embodiment of the present application: (a) noisy speech; (b) gain function estimated by a conventional method and time-domain filter designed by the window-function method; (c) gain function estimated by deep learning and time-domain filter designed by the window-function method; (d) gain function estimated by deep learning and time-domain filter mapped by deep learning. The test results show that the method of the embodiments achieves state-of-the-art speech noise reduction performance at low delay.
The delay-controllable speech noise reduction method of the embodiments performs single-channel noise reduction based on delay-controllable deep learning, combining a supervised learning method (deep learning) with a time-domain filtering method.
In this method, a gain function for each time-frequency point is derived in the time-frequency domain by deep learning, a time-domain filter is optimized for each frame, and filtering enhancement is performed in the time domain to realize speech noise reduction.
The embodiments of the present application provide three ways of fitting the gain function for the optimal time-domain filter design: infinite impulse response (IIR) filters, finite impulse response (FIR) filters, and direct deep learning network mapping.
To reduce computational complexity and improve robustness while accounting for speech characteristics, the embodiments also propose a sub-band analysis-synthesis approach: a sub-band analysis filter first splits the full-band signal into two sub-bands; an IIR time-domain filter is fitted at low frequencies and an FIR time-domain filter at high frequencies; the two sub-bands are then enhanced by time-domain filtering separately; finally, sub-band synthesis reconstructs the full-band speech time-domain signal.
Those of ordinary skill in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of the two; the components and steps of the examples have been described above in terms of their functions to illustrate clearly the interchangeability of hardware and software. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments, objects, technical solutions and advantages of the present application are described in further detail, it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present application, and are not intended to limit the scope of the present application, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present application should be included in the scope of the present application.

Claims (10)

1. A time delay controllable voice noise reduction method is characterized by comprising the following steps:
framing the voice with noise, and performing time-frequency domain transformation to obtain a complex frequency spectrum of the voice with noise;
determining a gain function according to the complex frequency spectrum of the voice with the noise; the gain function is real or complex;
determining a time domain filter according to the gain function, wherein the order of the time domain filter is set according to the time delay requirement;
and inputting the voice with noise into the time domain filter to perform noise reduction processing meeting the time delay requirement to obtain pure voice.
2. The method of claim 1, wherein determining a gain function from the complex spectrum of the noisy speech comprises:
determining a magnitude spectrum of a complex frequency spectrum of the noisy speech;
inputting the magnitude spectrum of the complex frequency spectrum of the voice with noise into the deep learning network, wherein the deep learning network is a real network;
and determining a gain function according to the mapping target of the real network, wherein the gain function is a real number.
3. The method of claim 2, wherein determining a gain function based on the mapped target of the real network comprises:
under the condition that the mapping target of the deep learning network is the magnitude spectrum of pure voice, the gain function is the ratio of the magnitude spectrum of the pure voice to the magnitude spectrum of the voice with noise;
and under the condition that the mapping target of the deep learning network is the magnitude spectrum compression value of the pure voice, the gain function is the ratio of the magnitude spectrum compression value of the pure voice and the magnitude spectrum of the voice with noise.
4. The method of claim 1, wherein determining a gain function from the complex spectrum of the noisy speech comprises:
determining a real part and an imaginary part of the complex frequency spectrum of the voice with noise, and inputting the real part and the imaginary part of the complex frequency spectrum of the voice with noise into the deep learning network, wherein the deep learning network is a complex network; or
Determining real and imaginary parts of a compressed complex spectrum of the noisy speech; inputting the real and imaginary parts of the compressed complex spectrum of the noisy speech into the complex network;
determining a gain function according to a mapping target of the complex network; the gain function is a complex number.
5. The method of claim 4, wherein determining a gain function based on the mapped target of the complex network comprises:
under the condition that the mapping target of the complex network is the complex spectrum of the pure voice, obtaining a gain function according to the ratio of the complex spectrum of the pure voice to the voice with noise, wherein the gain function is complex; or
And under the condition that the mapping target of the complex network is the compressed complex spectrum of the pure voice, obtaining a gain function according to the ratio of the compressed complex spectrum of the pure voice and the voice with noise, wherein the gain function is complex.
6. The method according to any of claims 1-5, wherein said determining a time-domain filter according to said gain function, the order of said time-domain filter being set according to a delay control requirement, comprises:
approximating the gain function by using a finite impulse response time domain filter, wherein the gain function is a fitting value of the finite impulse response time domain filter;
and determining the order of the finite impulse response time domain filter according to the time delay control requirement.
7. The method according to any of claims 1-5, wherein said determining a time-domain filter according to said gain function, the order of said time-domain filter being set according to a delay control requirement, comprises:
approximating the gain function by using an infinite impulse response time domain filter, wherein the gain function is a fitting value of the infinite impulse response time domain filter;
determining the amplitude-frequency response of the infinite impulse response time domain filter;
and determining the order of the infinite impulse response time domain filter according to the amplitude value of the gain function and the amplitude-frequency response, wherein the order of the infinite impulse response time domain filter meets the time delay control requirement.
8. The method according to any of claims 1-5, wherein said inputting said noisy speech into said time-domain filter for denoising in accordance with said delay requirement to obtain clean speech comprises:
and inputting the voice with noise into a finite impulse response time domain filter or an infinite impulse response time domain filter, and performing noise reduction processing according with the time delay requirement to obtain pure voice.
9. The method according to any of claims 1-5, wherein said inputting said noisy speech into said time-domain filter for denoising in accordance with said delay requirement to obtain clean speech comprises:
dividing the frequency of the voice with noise to obtain a first sub-band signal and a second sub-band signal;
the first sub-band signal is processed by an infinite impulse response time domain filter to obtain estimated middle and low frequency voice;
the second sub-band signal is processed by a finite impulse response time domain filter to obtain estimated high-frequency voice;
and synthesizing the medium and low frequency voice signal and the high frequency voice signal to obtain pure voice.
10. The method according to any one of claims 1 to 5, wherein said inputting said time-domain signal of the noisy speech into said time-domain filter for performing noise reduction processing meeting said delay requirement to obtain clean speech comprises:
determining a finite impulse response time domain filter mapped by the deep learning network, the order being 2·t_d·f_s/1000; wherein t_d is the delay requirement and f_s is the sampling frequency;
convolving the l frame of speech with noise with a finite impulse response time domain filter obtained by mapping the l frame of speech with noise by a deep learning network to obtain the l frame of pure speech; l is a natural number;
and arranging the pure voice of the frame l according to time to obtain the pure voice meeting the time delay requirement.
CN202210258932.0A 2022-03-16 2022-03-16 Time delay controllable voice noise reduction method Pending CN114566179A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210258932.0A CN114566179A (en) 2022-03-16 2022-03-16 Time delay controllable voice noise reduction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210258932.0A CN114566179A (en) 2022-03-16 2022-03-16 Time delay controllable voice noise reduction method

Publications (1)

Publication Number Publication Date
CN114566179A true CN114566179A (en) 2022-05-31

Family

ID=81719039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210258932.0A Pending CN114566179A (en) 2022-03-16 2022-03-16 Time delay controllable voice noise reduction method

Country Status (1)

Country Link
CN (1) CN114566179A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115132231A (en) * 2022-08-31 2022-09-30 安徽讯飞寰语科技有限公司 Voice activity detection method, device, equipment and readable storage medium
CN115132231B (en) * 2022-08-31 2022-12-13 安徽讯飞寰语科技有限公司 Voice activity detection method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN107845389B (en) Speech enhancement method based on multi-resolution auditory cepstrum coefficient and deep convolutional neural network
CN107452389B (en) Universal single-track real-time noise reduction method
CN110085249B (en) Single-channel speech enhancement method of recurrent neural network based on attention gating
Hermansky et al. RASTA processing of speech
CN110867181B (en) Multi-target speech enhancement method based on SCNN and TCNN joint estimation
Doclo et al. GSVD-based optimal filtering for single and multimicrophone speech enhancement
US8880396B1 (en) Spectrum reconstruction for automatic speech recognition
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
Xiao et al. Normalization of the speech modulation spectra for robust speech recognition
CN108172231A (en) A kind of dereverberation method and system based on Kalman filtering
CN112735456B (en) Speech enhancement method based on DNN-CLSTM network
CN105679330B (en) Based on the digital deaf-aid noise-reduction method for improving subband signal-to-noise ratio (SNR) estimation
CN113077806B (en) Audio processing method and device, model training method and device, medium and equipment
CN110808057A (en) Voice enhancement method for generating confrontation network based on constraint naive
WO2022256577A1 (en) A method of speech enhancement and a mobile computing device implementing the method
CN110970044B (en) Speech enhancement method oriented to speech recognition
Doclo et al. Multimicrophone noise reduction using recursive GSVD-based optimal filtering with ANC postprocessing stage
Li et al. A multi-objective learning speech enhancement algorithm based on IRM post-processing with joint estimation of SCNN and TCNN
CN114566179A (en) Time delay controllable voice noise reduction method
CN110931034B (en) Pickup noise reduction method for built-in earphone of microphone
CN115424627A (en) Voice enhancement hybrid processing method based on convolution cycle network and WPE algorithm
CN114023352B (en) Voice enhancement method and device based on energy spectrum depth modulation
CN113066483B (en) Sparse continuous constraint-based method for generating countermeasure network voice enhancement
CN114189781A (en) Noise reduction method and system for double-microphone neural network noise reduction earphone
EP2063420A1 (en) Method and assembly to enhance the intelligibility of speech

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination