CN114067825A - Comfort noise generation method based on time-frequency masking estimation and application thereof - Google Patents
- Publication number
- CN114067825A (application CN202111360253.6A)
- Authority
- CN
- China
- Prior art keywords
- time
- frequency
- noise
- estimation
- comfort noise
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0224—Processing in the time domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/42—Systems providing special services or facilities to subscribers
- H04M3/56—Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
- H04M3/568—Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
Abstract
The invention relates to the technical field of noise processing, and in particular discloses a comfort noise generation method based on time-frequency masking estimation and an application thereof. The method comprises the following steps: S1, converting the time-domain signal x(n) picked up by the microphone element into a time-frequency-domain signal to obtain the spectrum X(l, k) of the microphone signal in the l-th frame and the k-th frequency band; S2, estimating the power spectral density of the comfort noise; S3, generating comfort noise of the corresponding energy; S4, synthesizing the target speech. Because the stationary noise component is estimated from time-frequency masking information obtained by deep learning, the scheme avoids accumulating speech energy into the stationary noise component, which would otherwise produce excessive comfort noise; on the other hand, comfort noise is added selectively per time-frequency unit, so no noise is introduced into speech-dominated time-frequency units.
Description
Technical Field
The present invention relates to the field of noise processing technologies, and in particular, to a comfort noise generation method based on time-frequency masking estimation and an application thereof.
Background
Noise suppression and speech enhancement are key techniques for improving speech communication quality in conferencing systems and conferencing equipment. The traditional signal-processing approach tracks the noise power spectral density and the speech power spectral density in the signal, constructs a masking value between 0 and 1 in the frequency domain based on Wiener filtering, and suppresses background noise by applying the mask to the microphone signal. The drawback of this signal-processing technique is that it cannot effectively handle non-stationary noise in the environment, and speech distortion becomes excessive under strong noise interference. At present, time-frequency masking estimation based on deep learning is another common approach to noise suppression; its main idea is to estimate the time-frequency masking value directly from the mixed signal by training on pairs of noisy and clean speech data. The deep-learning approach handles non-stationary noise better, but it suffers from distortion caused by over-suppression of speech.
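The Wiener-style masking described above can be sketched as follows; this is a generic textbook form given for illustration, not a formula taken from this patent:

```python
import numpy as np

def wiener_mask(speech_psd, noise_psd, eps=1e-12):
    """Classical Wiener-style gain in [0, 1]: the ratio of estimated
    speech power to total power, applied per frequency bin."""
    return speech_psd / (speech_psd + noise_psd + eps)

# A bin with speech power 3 and noise power 1 gets a gain of 0.75.
gain = wiener_mask(np.array([3.0]), np.array([1.0]))
```

Multiplying each microphone-spectrum bin by this gain attenuates noise-dominated bins toward 0 while leaving speech-dominated bins near 1.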
Therefore, in summary, the main disadvantages of the prior art are:
The signal-processing approach of tracking the stationary component of the background noise produces excessive comfort noise in scenes with high environmental noise.
Existing comfort noise generation methods add noise to all time-frequency units indiscriminately, so a certain amount of noise is added even to speech-dominated time-frequency regions.
How to significantly improve perceived listening quality while generating a moderate amount of comfort noise is therefore an urgent and difficult problem.
The existing scheme estimates the stationary component of the environmental noise and then generates white noise of equal energy to add to the spectrum, so as to reduce the impact of speech distortion on listening perception.
The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
Disclosure of Invention
The invention aims to provide a comfort noise generation method based on time-frequency masking estimation and an application thereof, which can improve communication quality when applied to noise suppression and speech enhancement in voice conference systems and the like.
In order to achieve the above object, the present invention provides a comfort noise generation method based on time-frequency masking estimation, comprising the steps of:
S1, converting the time-domain signal x(n) picked up by the microphone element into a time-frequency-domain signal to obtain the spectrum X(l, k) of the microphone signal in the l-th frame and the k-th frequency band;
S2, estimating the power spectral density of the comfort noise;
S3, generating comfort noise of the corresponding energy;
S4, synthesizing the target speech.
In a specific implementation scenario, in the first step, the microphone array signal is first subjected to signal decomposition: a Fourier transform converts the time-domain signal into a frequency-domain signal to obtain its spectrum, which facilitates subsequent noise processing. In the second step, comfort noise power spectral density estimation is performed on the frequency-domain signal; this step comprises, in order, noise power spectral density estimation, stationary noise power spectral density estimation, and comfort noise energy estimation, where the noise power spectral density estimation uses time-frequency masking information from the prior art. In the third step, the comfort noise spectrum is generated; it bridges estimation and synthesis and facilitates further speech processing and analysis. In the fourth step, the target speech is estimated: after a frequency-domain estimate of the target speech is obtained, an inverse Fourier transform yields the target time-domain signal, i.e., the target speech signal, which is then output.
Alternatively, the calculation formula of the frequency spectrum X (l, k) in S1 is as follows:
where N is the frame length 512, w(n) is a Hamming window of length 512, n is the time index, l is the time-frame index, k is the frequency index, and X(l, k) is the spectrum of the microphone signal in the l-th frame and the k-th frequency band.
Optionally, the S2 specifically includes:
S21, obtaining a time-frequency masking value M(l, k), and calculating the environmental-noise power spectral density ρ_v(k) for each frequency band k:
where |·| denotes the modulus of a complex number, and α is the smoothing factor between adjacent frames, with a value range between 0 and 1;
S22, estimating the stationary-noise power spectral density ρ_min(k) for each frequency band k:
the stationary-noise power spectral density represents the minimum component of the tracked noise, i.e., the minimum value of the noise component in the signal; α is the same smoothing factor as in step S21, and γ is the stationary-noise control factor, with a value less than 1;
s23, calculating comfort noise energy ζ:
where K has a value equal to half the frame length.
Optionally, the smoothing factor α is 0.95.
Optionally, the stationary noise control factor γ is 0.08.
Optionally, the S3 specifically includes:
generating a comfort noise power spectrum v (l, k):
where σ (n) is a white noise sequence with energy of 1 and length of 512.
Optionally, the S4 specifically includes:
s41, calculating the frequency domain estimation of the target voice according to the following formula:
where X(l, k) is the spectrum, M(l, k) is the time-frequency masking value, and υ(l, k) is the comfort-noise power spectrum;
s42, performing inverse Fourier transform to obtain target voice time domain estimation:
where w(n) is the Hamming window of frame length 512.
The invention also provides a comfort noise generation system based on time-frequency masking estimation, the system being used to implement the comfort noise generation method based on time-frequency masking estimation, and comprising:
the signal decomposition module, used for converting the time-domain signal x(n) picked up by the microphone element into a time-frequency-domain signal and obtaining the spectrum X(l, k) of the microphone signal in the l-th frame and the k-th frequency band;
a comfort noise power spectral density estimation module for estimating a comfort noise power spectral density;
the comfortable noise generating module is used for generating comfortable noise with corresponding energy;
and the target voice synthesis module is used for synthesizing the target voice.
Optionally, the comfort noise power spectral density estimation module includes a noise power spectral density estimation module, a stationary noise power spectral density estimation module, and a comfort noise energy estimation module, which sequentially process the signal.
The invention also provides an electronic device comprising a memory and a processor, wherein the processor is used for realizing the steps of the comfort noise generation method based on the time-frequency masking estimation when executing the computer management program stored in the memory.
Compared with the prior art, the comfort noise generation method based on time-frequency masking estimation and the application thereof provided by the invention comprise the following steps: S1, converting the time-domain signal x(n) picked up by the microphone element into a time-frequency-domain signal to obtain the spectrum X(l, k) of the microphone signal in the l-th frame and the k-th frequency band; S2, estimating the power spectral density of the comfort noise; S3, generating comfort noise of the corresponding energy; S4, synthesizing the target speech. Because the stationary noise component is estimated from time-frequency masking information obtained by deep learning, the scheme avoids accumulating speech energy into the stationary noise component, which would otherwise produce excessive comfort noise; on the other hand, comfort noise is added selectively per time-frequency unit, so no noise is introduced into speech-dominated time-frequency units.
Drawings
Fig. 1 is a functional block diagram of a comfort noise generation system based on time-frequency masking estimation according to the present invention.
Detailed Description
The following detailed description of the present invention is provided in conjunction with the accompanying drawings, but it should be understood that the scope of the present invention is not limited to the specific embodiments.
Throughout the specification and claims, unless explicitly stated otherwise, the word "comprise", or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated element or component but not the exclusion of any other element or component.
Example one
A comfort noise generation method based on time-frequency masking estimation according to the preferred embodiment of the present invention comprises the following steps:
S1, converting the time-domain signal x(n) picked up by the microphone element into a time-frequency-domain signal to obtain the spectrum X(l, k) of the microphone signal in the l-th frame and the k-th frequency band;
S2, estimating the power spectral density of the comfort noise;
S3, generating comfort noise of the corresponding energy;
S4, synthesizing the target speech.
In a preferred embodiment, the spectrum X (l, k) in S1 is calculated as follows:
obtaining a time-frequency domain representation by performing a short-time fourier transform on the time-domain signal x (n):
where N is the frame length 512, w(n) is a Hamming window of length 512, n is the time index, l is the time-frame index, and k is the frequency index. X(l, k) is the spectrum of the microphone signal in the l-th frame and the k-th frequency band. A frequency band refers to the signal component corresponding to a certain frequency. The Hamming window function gives the window value for each sample time n; the Hamming window is standard prior art and is not described further herein.
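The framing and transform described above can be sketched in Python. The 50% frame shift and the use of a real FFT are illustrative assumptions; the text fixes only the frame length N = 512 and the Hamming window:

```python
import numpy as np

def frame_spectrum(x, l, N=512, hop=256):
    """Spectrum X(l, k) of the l-th frame: apply a length-N Hamming
    window w(n) to the frame, then take the FFT.  The 50% hop is an
    assumed analysis parameter, not stated in the text."""
    w = np.hamming(N)                      # analysis window w(n)
    frame = x[l * hop : l * hop + N] * w   # windowed frame
    return np.fft.rfft(frame)              # X(l, k), k = 0 .. N/2

# A 1 kHz tone at a 16 kHz sampling rate lands exactly in bin
# k = 1000 / (16000 / 512) = 32.
fs = 16000
n = np.arange(2 * 512)
x = np.sin(2 * np.pi * 1000 * n / fs)
X0 = frame_spectrum(x, 0)
```

The real FFT returns N/2 + 1 = 257 bins per frame, matching the claim that K equals half the frame length (plus the DC bin).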
Example two
This embodiment is the same as the comfort noise generation method based on time-frequency masking estimation in the first embodiment, with the following differences:
S2 specifically includes: assuming the time-frequency masking value estimated by deep learning is M(l, k), the environmental-noise power spectral density ρ_v(k) is calculated for each frequency band k:
where |·| denotes the modulus of a complex number, and α is the smoothing factor between adjacent frames, with a value range between 0 and 1. The invention preferably takes α = 0.95: if the value is too small, the power spectral density estimate fluctuates unstably; if it is too large, the energy estimate is over-smoothed and the ability to model non-stationary noise is reduced. M(l, k) is a prior-art masking value obtained by deep-learning estimation, with values between 0 and 1. When M(l, k) < 0.5, the time-frequency unit is regarded as dominated by environmental noise and the noise power spectral density is updated; otherwise the unit is regarded as speech-dominated and the update of the noise power spectral density is suspended. The result of this step is used to update the stationary-noise power spectral density in (3) below.
The stationary-noise power spectral density ρ_min(k) is estimated for each frequency band k:
This power spectral density represents the minimum component of the tracked noise, i.e., the minimum value of the noise component in the signal, obtained without directly smoothing the microphone signal. Here α is the same smoothing factor as in step S21. γ is the stationary-noise control factor; since stationary noise is only part of the noise, the stationary-noise energy is less than the total noise energy, i.e., γ < 1. The control factor is taken as γ = 0.08, which produces an appropriate comfort-noise energy and avoids excessive comfort noise. This step is used to calculate the comfort-noise energy.
Calculating comfort noise energy ζ:
this step calculates the physical meaning that the value ρ for all the frequency bands kmin(k) Averaging, where K represents the number of all bands. The value of K is equal to half the frame length, i.e. 256. This step is used for comfort noise energy in the subsequent steps.
Step S3 specifically includes: generating a comfort noise power spectrum v (l, k):
where σ (n) is a white noise sequence of energy 1 and length 512. This step is used as the next step to calculate the final speech spectrum.
Step S4 specifically includes: obtaining a frequency-domain estimate of the target speech from the comfort-noise power spectrum obtained in step (5):
Namely, when the energy after time-frequency masking is less than the comfort-noise energy, comfort noise is added to that time-frequency unit, avoiding distortion caused by local energy loss. Conversely, when the energy after time-frequency masking is greater than the comfort-noise energy, no noise is added, avoiding excessive added noise energy.
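The per-bin rule above can be sketched as below. Whether the comfort noise replaces or is summed with the masked value in low-energy bins is ambiguous in the text; this sketch uses replacement, which is one plausible reading:

```python
import numpy as np

def apply_comfort_noise(X_l, M_l, v_l):
    """Where the masked magnitude |M(l,k) X(l,k)| falls below the
    comfort-noise magnitude |v(l,k)|, take the comfort noise;
    otherwise keep the masked signal unchanged."""
    masked = M_l * X_l
    return np.where(np.abs(masked) < np.abs(v_l), v_l, masked)
```

Speech-dominated bins (large M·X) pass through untouched, so no noise is introduced where speech dominates, while heavily suppressed bins are floored at the comfort-noise level.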
Performing inverse Fourier transform to obtain target voice time domain estimation:
where w(n) is the Hamming window of frame length 512.
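The inverse transform with the synthesis window can be sketched as follows; the Hamming synthesis window is taken from the text, while the overlap-add across frames needed to reconstruct the full signal is omitted here:

```python
import numpy as np

def synthesize_frame(S_l, N=512):
    """Inverse FFT of the estimated frame spectrum, weighted by a
    Hamming synthesis window; successive frames would be combined
    by overlap-add (not shown) to form the output waveform."""
    return np.fft.irfft(S_l, N) * np.hamming(N)
```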
The time-domain estimated signal can be converted directly into a voltage signal by digital-to-analog conversion and played through a loudspeaker as enhanced speech.
As shown in fig. 1, an embodiment of the present invention further provides a comfort noise generation system based on time-frequency masking estimation, where the system is configured to implement the comfort noise generation method based on time-frequency masking estimation according to the two previous embodiments, where the system includes:
the signal decomposition module, used for converting the time-domain signal x(n) picked up by the microphone element into a time-frequency-domain signal and obtaining the spectrum X(l, k) of the microphone signal in the l-th frame and the k-th frequency band;
a comfort noise power spectral density estimation module for estimating a comfort noise power spectral density;
the comfortable noise generating module is used for generating comfortable noise with corresponding energy;
and the target voice synthesis module is used for synthesizing the target voice.
Specifically, the comfort noise power spectral density estimation module includes a noise power spectral density estimation module, a stationary noise power spectral density estimation module, and a comfort noise energy estimation module, which sequentially process signals.
An embodiment of the present invention further provides an electronic device, which includes a memory and a processor, where the processor is configured to implement the steps of the comfort noise generation method based on time-frequency masking estimation as described in the two previous embodiments when executing a computer management program stored in the memory.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing descriptions of specific exemplary embodiments of the present invention have been presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain certain principles of the invention and its practical application to enable one skilled in the art to make and use various exemplary embodiments of the invention and various alternatives and modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.
Claims (10)
1. A comfort noise generation method based on time-frequency masking estimation is characterized by comprising the following steps:
S1, converting the time-domain signal x(n) picked up by the microphone element into a time-frequency-domain signal to obtain the spectrum X(l, k) of the microphone signal in the l-th frame and the k-th frequency band;
S2, estimating the power spectral density of the comfort noise;
S3, generating comfort noise of the corresponding energy;
S4, synthesizing the target speech.
2. The method for comfort noise generation based on time-frequency masking estimation according to claim 1, characterized in that the calculation formula of the frequency spectrum X (l, k) in S1 is as follows:
where N is the frame length 512, w(n) is a Hamming window of length 512, n is the time index, l is the time-frame index, k is the frequency index, and X(l, k) is the spectrum of the microphone signal in the l-th frame and the k-th frequency band.
3. The method for comfort noise generation based on time-frequency masking estimation as claimed in claim 2, wherein said S2 specifically includes:
S21, obtaining a time-frequency masking value M(l, k), and calculating the environmental-noise power spectral density ρ_v(k) for each frequency band k:
where |·| denotes the modulus of a complex number, and α is the smoothing factor between adjacent frames, with a value range between 0 and 1;
S22, estimating the stationary-noise power spectral density ρ_min(k) for each frequency band k:
the stationary-noise power spectral density represents the minimum component of the tracked noise, i.e., the minimum value of the noise component in the signal; α is the same smoothing factor as in step S21, and γ is the stationary-noise control factor, with a value less than 1;
s23, calculating comfort noise energy ζ:
where K has a value equal to half the frame length.
4. A comfort noise generation method based on time-frequency masking estimation according to claim 3, characterized in that the smoothing factor α is 0.95.
5. A comfort noise generation method based on time-frequency masking estimation according to claim 3, characterized in that the stationary noise control factor γ is 0.08.
7. The method for comfort noise generation based on time-frequency masking estimation as claimed in claim 6, wherein said S4 specifically includes:
s41, calculating the frequency domain estimation of the target voice according to the following formula:
where X(l, k) is the spectrum, M(l, k) is the time-frequency masking value, and υ(l, k) is the comfort-noise power spectrum;
s42, performing inverse Fourier transform to obtain target voice time domain estimation:
where w(n) is the Hamming window of frame length 512.
8. A comfort noise generation system based on time-frequency masking estimation, characterized in that the system is configured to implement the comfort noise generation method based on time-frequency masking estimation according to any one of claims 1 to 7, and comprises:
the signal decomposition module, used for converting the time-domain signal x(n) picked up by the microphone element into a time-frequency-domain signal and obtaining the spectrum X(l, k) of the microphone signal in the l-th frame and the k-th frequency band;
a comfort noise power spectral density estimation module for estimating a comfort noise power spectral density;
the comfortable noise generating module is used for generating comfortable noise with corresponding energy;
and the target voice synthesis module is used for synthesizing the target voice.
9. The time-frequency mask estimation based comfort noise generation system according to claim 8, wherein the comfort noise power spectral density estimation module comprises a noise power spectral density estimation module, a stationary noise power spectral density estimation module, and a comfort noise energy estimation module, which sequentially process signals.
10. An electronic device, comprising a memory and a processor, the processor being configured to implement the steps of the comfort noise generation method based on time-frequency masking estimation according to any of claims 1-7 when executing a computer management program stored in the memory.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111360253.6A CN114067825A (en) | 2021-11-17 | 2021-11-17 | Comfort noise generation method based on time-frequency masking estimation and application thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114067825A true CN114067825A (en) | 2022-02-18 |
Family
ID=80273356
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111360253.6A Pending CN114067825A (en) | 2021-11-17 | 2021-11-17 | Comfort noise generation method based on time-frequency masking estimation and application thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114067825A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023226592A1 (en) * | 2022-05-25 | 2023-11-30 | 青岛海尔科技有限公司 | Noise signal processing method and apparatus, and storage medium and electronic apparatus |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101258057B1 (en) * | 2011-11-16 | 2013-04-24 | 한국과학기술원 | Apparatus and method for auditory masking-based adjusting the amplitude of phone ringing sounds under acoustic noise environments |
CN113030862A (en) * | 2021-03-12 | 2021-06-25 | 中国科学院声学研究所 | Multi-channel speech enhancement method and device |
CN113160845A (en) * | 2021-03-29 | 2021-07-23 | 南京理工大学 | Speech enhancement algorithm based on speech existence probability and auditory masking effect |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||