CN113948101A - Noise suppression method and device based on spatial discrimination detection - Google Patents

Noise suppression method and device based on spatial discrimination detection

Info

Publication number: CN113948101A
Application number: CN202111216600.8A
Authority: CN (China)
Prior art keywords: filter, module, noise, spatial, frequency domain
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 何平, 蒋升
Current Assignee: Suirui Technology Group Co Ltd
Original Assignee: Suirui Technology Group Co Ltd
Priority date (filing date): 2021-10-19
Application filed by Suirui Technology Group Co Ltd
Priority to CN202111216600.8A
Publication of CN113948101A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 - Processing in the time domain
    • G10L21/0232 - Processing in the frequency domain
    • G10L2021/02082 - Noise filtering the noise being echo, reverberation of the speech

Abstract

The invention discloses a noise suppression method and device based on spatial discrimination detection, belonging to the field of information processing. The method comprises the following steps: S1: calculating a steering vector and a super-directional filter from the time domain signal of each microphone; S2: converting the initialized signals into time-frequency domain signals and constructing frequency domain prediction vectors; S3: performing noise suppression filter calculation on the time-frequency domain signals to construct a noise and reverberation suppression filter, where the calculation comprises computing spatial discrimination coefficients and spatial weight information, updating a weighted autocorrelation matrix, and constructing the noise and reverberation suppression filter; S4: obtaining the frequency domain estimate of the target speech from the resulting filter, and from it the time domain estimate of the target speech. The invention ensures the optimality of the filter and improves noise suppression and reverberation suppression performance.

Description

Noise suppression method and device based on spatial discrimination detection
Technical Field
The present invention belongs to the field of information processing, and in particular, relates to a noise suppression method and apparatus based on spatial discrimination detection.
Background
In practical applications such as online conference systems and smart-home voice interaction, the speaker is at some distance from the microphone, so reverberation and noise are picked up together with the speech, degrading voice communication quality and interaction accuracy. On the one hand, reverberation from multiple wall reflections degrades the performance of the noise suppression filter, especially in highly reverberant scenes such as conference rooms; on the other hand, the presence of background noise likewise degrades reverberation suppression.
The currently common scheme first converts each channel's time domain signal into the time-frequency domain by short-time Fourier transform, then designs a group of filters that estimate the correlation of each time-frequency unit with the historical signal; since this correlation is caused by reverberation, the filters cancel the reverberation based on it. Next, an ideal steering vector is calculated from the azimuth of the speaker relative to the microphone array, and a filter is designed by minimizing the noise energy. The two filters perform reverberation suppression and noise suppression in sequence, which is usually significantly less effective than in scenes where noise or reverberation exists alone.
In the prior art, the noise suppression methods mainly have the following defects:
1) When noise and reverberation exist simultaneously but are modeled independently, the performance of sequential reverberation suppression and noise suppression is generally reduced markedly.
2) In a strongly reverberant scene, a steering vector based purely on azimuth information does not match the true steering vector, causing speech distortion and reducing voice interaction quality.
The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
Disclosure of Invention
The invention aims to provide a noise suppression method and apparatus based on spatial discrimination detection, which ensure the optimality of the filter and improve noise suppression and reverberation suppression performance.
In order to achieve the above object, the present invention provides a noise suppression method based on spatial discrimination detection, comprising the following steps:
s1: calculating a steering vector and a super-directional filter from the time domain signal of each microphone;
s2: converting the initialized signals into time-frequency domain signals, and constructing frequency domain prediction vectors;
s3: performing noise suppression filter calculation on the time-frequency domain signals to obtain a filter for constructing noise and reverberation suppression; wherein the noise suppression filter calculation comprises: calculating a spatial discriminative coefficient and spatial weight information, updating a weighted autocorrelation matrix, and constructing a noise and reverberation suppression filter;
s4: and according to the obtained filter, obtaining the frequency domain estimation of the target voice, and further obtaining the time domain estimation of the target voice.
Further, before the step S1, the method further includes acquiring the voice signal x_m(n) of each microphone;
In step S1, the method specifically includes the following steps:
s101: for each frequency band k, a target speech steering vector u(k) is calculated:
u(k) = [e^{-jω_k·τ_1(θ)}, e^{-jω_k·τ_2(θ)}, ..., e^{-jω_k·τ_M(θ)}]^T;
τ_m(θ) = q(θ)·d_m^T / c;
q(θ) = [cos(θ), sin(θ)];
s102: for each frequency band k, compute a super-directional filter h(k):
h(k) = R^{-1}(k)·u(k) / (u^H(k)·R^{-1}(k)·u(k)).
further, the step S2 includes the following steps:
s201: performing a short-time Fourier transform on the time domain signal x_m(n) to obtain the time-frequency domain expression:
X_m(l,k) = Σ_{n=0}^{N-1} w(n)·x_m(l·N_s + n)·e^{-j2πnk/N} (N_s denotes the frame shift);
s202: for each frequency band k, constructing frequency domain prediction vectors X(l,k) and X^(l,k):
X(l,k) = [X_1(l,k), X_2(l,k), ..., X_M(l,k)]^T;
X^(l,k) = [X^T(l,k), X^T(l-1,k), ..., X^T(l-L,k)]^T.
Further, the step S3 includes the following steps:
s301: calculating spatial discrimination coefficients and spatial weight information of the current frame:
the spatial discrimination coefficients are calculated as follows:
ρ_s(l) = α·ρ_s(l-1) + (1-α)·Σ_k |h^H(k)·X(l,k)|^2;
ρ_x(l) = α·ρ_x(l-1) + (1-α)·(1/M)·Σ_k ||X(l,k)||^2;
where ρ_s(l) and ρ_x(l) respectively denote the energy estimates of the speech-direction signal and of the microphone signals for frame l, and the difference between the two energy distributions expresses the spatial discrimination;
the spatial weight information β(l) of the current frame is calculated as follows:
β(l) = f(ρ_s(l), ρ_x(l))  [formula image not reproduced in the source; the weight is small in time-frequency regions where the target speech dominates];
s302: updating the weighted autocorrelation matrix;
for each band k, the weighted autocorrelation matrix R^(l,k) is updated as follows:
R^(l,k) = α·R^(l-1,k) + (1-α)·β(l)·X^(l,k)·(X^(l,k))^H;
s303: updating the noise and reverberation suppression filter:
for each frequency band k, the noise and reverberation suppression filter G(l,k) is constructed as follows:
G(l,k) = (R^(l,k))^{-1}·ū(k) / (ū^H(k)·(R^(l,k))^{-1}·ū(k));
ū(k) = [u^T(k), 0, ..., 0]^T.
still further, the step S4 includes the steps of:
s401: obtaining the frequency domain estimate S^(l,k) of the target speech from the solved noise and reverberation suppression filter:
S^(l,k) = G^H(l,k)·X^(l,k);
s402: performing an inverse Fourier transform on the frequency domain estimate of the target speech to obtain the final target speech estimate s^(n):
s^_l(n) = (1/N)·Σ_{k=0}^{N-1} S^(l,k)·e^{j2πnk/N}, with the frames combined by overlap-add.
The invention also provides a noise suppression device based on spatial discrimination detection, comprising an initialization module, a signal decomposition module, a filter calculation module and a target speech estimation module;
the initialization module is used for calculating a steering vector and a super-directional filter from the time domain signal of each microphone;
the signal decomposition module is used for converting the initialized signals into time-frequency domain signals and constructing frequency domain prediction vectors;
the filter calculation module is used for performing noise suppression filter calculation on the time-frequency domain signals to construct a noise and reverberation suppression filter; wherein the filter calculation module comprises: a first calculation module for calculating spatial discrimination coefficients and spatial weight information, a first update module for updating the weighted autocorrelation matrix, and a first construction module for constructing the noise and reverberation suppression filter;
and the target speech estimation module is used for obtaining the frequency domain estimate of the target speech from the resulting filter, and from it the time domain estimate of the target speech.
Further, the initialization module is also used for acquiring the voice signal x_m(n) of each microphone;
The initialization module is configured to:
for each frequency band k, calculate a target speech steering vector u(k):
u(k) = [e^{-jω_k·τ_1(θ)}, e^{-jω_k·τ_2(θ)}, ..., e^{-jω_k·τ_M(θ)}]^T;
τ_m(θ) = q(θ)·d_m^T / c;
q(θ) = [cos(θ), sin(θ)];
and for each frequency band k, compute a super-directional filter h(k):
h(k) = R^{-1}(k)·u(k) / (u^H(k)·R^{-1}(k)·u(k)).
further, the signal decomposition module comprises a signal conversion module and a vector construction module;
the signal conversion module is used for performing a short-time Fourier transform on the time domain signal x_m(n) to obtain the time-frequency domain expression:
X_m(l,k) = Σ_{n=0}^{N-1} w(n)·x_m(l·N_s + n)·e^{-j2πnk/N};
the vector construction module constructs, for each frequency band k, frequency domain prediction vectors X(l,k) and X^(l,k):
X(l,k) = [X_1(l,k), X_2(l,k), ..., X_M(l,k)]^T;
X^(l,k) = [X^T(l,k), X^T(l-1,k), ..., X^T(l-L,k)]^T.
Further, in the filter calculation module:
in the first calculation module, the spatial discrimination coefficients are calculated as follows:
ρ_s(l) = α·ρ_s(l-1) + (1-α)·Σ_k |h^H(k)·X(l,k)|^2;
ρ_x(l) = α·ρ_x(l-1) + (1-α)·(1/M)·Σ_k ||X(l,k)||^2;
where ρ_s(l) and ρ_x(l) respectively denote the energy estimates of the speech-direction signal and of the microphone signals for frame l, and the difference between the two energy distributions expresses the spatial discrimination;
the spatial weight information β(l) of the current frame is calculated as follows:
β(l) = f(ρ_s(l), ρ_x(l))  [formula image not reproduced in the source; the weight is small in time-frequency regions where the target speech dominates];
in the first update module, for each frequency band k, the weighted autocorrelation matrix R^(l,k) is updated as follows:
R^(l,k) = α·R^(l-1,k) + (1-α)·β(l)·X^(l,k)·(X^(l,k))^H;
in the first construction module, for each frequency band k, the noise and reverberation suppression filter G(l,k) is constructed as follows:
G(l,k) = (R^(l,k))^{-1}·ū(k) / (ū^H(k)·(R^(l,k))^{-1}·ū(k));
ū(k) = [u^T(k), 0, ..., 0]^T.
furthermore, the target speech estimation module comprises a frequency domain estimation module and a target speech estimation module;
the frequency domain estimation module is used for obtaining the frequency domain estimate S^(l,k) of the target speech from the solved noise and reverberation suppression filter:
S^(l,k) = G^H(l,k)·X^(l,k);
the target speech estimation module is used for performing an inverse Fourier transform on the frequency domain estimate of the target speech to obtain the final target speech estimate s^(n):
s^_l(n) = (1/N)·Σ_{k=0}^{N-1} S^(l,k)·e^{j2πnk/N}, with the frames combined by overlap-add.
Compared with the prior art, in particular with cascaded dereverberation and denoising schemes, the noise suppression method and device based on spatial discrimination detection provided by the invention model noise and reverberation jointly and design a single globally optimal filter, so that reverberation and noise suppression achieve a better effect. In addition, the spatial discrimination weight designed by the invention adaptively identifies time-frequency regions where speech is dominant, which further prevents the target speech from being over-cancelled as noise or reverberation and improves robustness.
Drawings
Fig. 1 is a flowchart of a noise suppression method based on spatial discrimination detection in the present embodiment.
Fig. 2 is a diagram of the Hamming window function used in this embodiment.
Fig. 3 is a schematic diagram of noise suppression based on spatial discrimination detection in the present embodiment.
Detailed Description
The following detailed description of the present invention is provided in conjunction with the accompanying drawings, but it should be understood that the scope of the present invention is not limited to the specific embodiments.
Throughout the specification and claims, unless explicitly stated otherwise, the word "comprise", or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated element or component but not the exclusion of any other element or component.
As shown in fig. 1, the noise suppression method based on spatial discrimination detection according to the preferred embodiment of the present invention adopts a unified filter design rule that models noise and reverberation jointly, ensuring the optimality of the filter; it also designs a group of spatial discrimination features that update the speaker steering vector in real time in reverberant scenes, improving noise suppression and reverberation suppression performance.
The method is applied to a system based on a microphone array, and specifically comprises the following four implementation steps:
s1: the time domain signal for each microphone is subjected to steering vector and super-directional filter calculations.
Before step S1, the method further includes acquiring the voice signals of the microphones, as follows: let x_m(n) denote the original time domain signal picked up in real time by the m-th of M microphone elements, where m is the microphone index, ranging from 1 to M, and n is the time index; the direction θ of the target speech relative to the microphone array is known.
The target speech is the speech signal corresponding to the target direction. For a speech separation task, the target direction of the signal to be extracted is known in advance; for example, a large-screen voice communication device may be expected to separate the target speech signal arriving from the 90-degree direction.
Specifically, the step S1 includes the following steps:
s101: for each frequency band k (k = 1, 2, ..., K), where a frequency band is the signal component corresponding to a certain frequency, a target speech steering vector u(k) is calculated. The specific calculation formula is as follows:
u(k) = [e^{-jω_k·τ_1(θ)}, e^{-jω_k·τ_2(θ)}, ..., e^{-jω_k·τ_M(θ)}]^T;
τ_m(θ) = q(θ)·d_m^T / c;
q(θ) = [cos(θ), sin(θ)].
wherein f_k is the frequency corresponding to band k, and K is determined by the subsequent Fourier transform: if the frame length is 512, K is half of the frame length; c is the sound speed, c = 340 m/s; d_m is the two-dimensional coordinate of the m-th microphone; the superscript H denotes the conjugate transpose operator; j is the imaginary unit (j^2 = -1); q(θ) is the direction vector; ω_k = 2π·f_k is the circle (angular) frequency of band k.
Step S101 initializes the steering vector, which represents the signal difference of each microphone element in the θ direction under an ideal scene free of reverberation and array-element mismatch. It is used to calculate the super-directional filter in the subsequent step S102.
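As a concrete illustration of step S101, the following sketch computes the ideal far-field steering vectors for all bands; the sampling rate fs, the exponent sign convention and the helper name steering_vectors are illustrative assumptions, not details taken from the patent.

```python
# Illustrative sketch of step S101 (far-field model; fs and all names are assumptions).
import numpy as np

def steering_vectors(mic_xy, theta_deg, fs=16000, n_fft=512, c=340.0):
    """Ideal steering vectors u(k) for bands k = 0..n_fft//2; returns (K+1, M)."""
    theta = np.deg2rad(theta_deg)
    q = np.array([np.cos(theta), np.sin(theta)])       # q(θ) = [cos θ, sin θ]
    tau = mic_xy @ q / c                               # τ_m(θ) = q(θ)·d_m^T / c, in seconds
    f = np.arange(n_fft // 2 + 1) * fs / n_fft         # band frequencies f_k
    omega = 2.0 * np.pi * f                            # circle frequencies ω_k
    return np.exp(-1j * np.outer(omega, tau))          # u_m(k) = e^{-j ω_k τ_m(θ)}

# Example: 8-microphone linear array with 3.5 cm spacing, target at 90 degrees
mics = np.stack([np.arange(8) * 0.035, np.zeros(8)], axis=1)
U = steering_vectors(mics, 90.0)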
s102: for each frequency band k, a super-directional filter h(k) is calculated. The specific calculation formula is as follows:
h(k) = R^{-1}(k)·u(k) / (u^H(k)·R^{-1}(k)·u(k));
where R(k) denotes the autocorrelation matrix of the uniform scattered (diffuse) field and the superscript -1 denotes the matrix inverse. In theory, the super-directional filter fully preserves the signal from the target direction θ while maximally suppressing uniform scattered field noise.
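A minimal sketch of step S102 follows, assuming the standard sinc coherence model for the uniform scattered (diffuse) field and a small diagonal loading term eps for numerical invertibility; both are practical assumptions rather than details stated in the patent.

```python
# Sketch of step S102: super-directive (MVDR-against-diffuse-noise) filters.
import numpy as np

def superdirective_filters(mic_xy, U, fs=16000, n_fft=512, c=340.0, eps=1e-3):
    """h(k) = R^{-1}(k)u(k) / (u^H(k)R^{-1}(k)u(k)) per band; returns (K+1, M)."""
    M = mic_xy.shape[0]
    dist = np.linalg.norm(mic_xy[:, None, :] - mic_xy[None, :, :], axis=-1)
    f = np.arange(n_fft // 2 + 1) * fs / n_fft
    H = np.zeros((f.size, M), dtype=complex)
    for k, fk in enumerate(f):
        R = np.sinc(2.0 * fk * dist / c)               # diffuse coherence sin(x)/x with x = 2π f d / c
        Rinv = np.linalg.inv(R + eps * np.eye(M))      # diagonal loading for invertibility
        u = U[k]
        H[k] = Rinv @ u / (u.conj() @ Rinv @ u)
    return H
```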
S2: and converting the initialized signals into time-frequency domain signals, and constructing frequency domain prediction vectors.
Specifically, the step S2 includes the steps of:
s201: performing a short-time Fourier transform on the time domain signal x_m(n) to obtain the time-frequency domain expression; the purpose is to convert the time domain signal into a time-frequency domain signal. The specific calculation formula is as follows:
X_m(l,k) = Σ_{n=0}^{N-1} w(n)·x_m(l·N_s + n)·e^{-j2πnk/N};
wherein N is the frame length, N = 512; N_s is the frame shift; w(n) is a Hamming window of length 512, where n is the sample index within the frame, so w(n) is the window value at sample n; l is the time frame index, in units of frames; k is the frequency index. X_m(l,k) is the spectrum of the m-th microphone signal in the l-th frame and the k-th frequency band. The Hamming window function used in the invention is shown in fig. 2.
s202: for each frequency band k, constructing frequency domain prediction vectors X(l,k) and X^(l,k). The specific calculation formulas are as follows:
X(l,k) = [X_1(l,k), X_2(l,k), ..., X_M(l,k)]^T;
X^(l,k) = [X^T(l,k), X^T(l-1,k), ..., X^T(l-L,k)]^T.
wherein the superscript T denotes the transpose operator, and L is the length of the filter's look-back over previous time frames, typically ranging from 3 to 20. From the above formulas it can be seen that X(l,k) is an M×1 column vector and X^(l,k) is an M(L+1)×1 column vector. The frequency domain prediction vector X^(l,k) is constructed in order to predict the noise and reverberation components in subsequent steps.
The method uses L = 12, which effectively saves storage and computation time without noticeably degrading noise suppression performance.
The transformation from the time domain signal to the time-frequency domain can be completed by the above step S2.
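The two sub-steps of S2 can be sketched as follows; the 50% frame shift (hop = 256) and the zero-padding of frames before l = 0 are assumptions, since the patent does not state them.

```python
# Sketch of step S2: STFT analysis and stacking of the prediction vector.
import numpy as np

def stft(x, n_fft=512, hop=256):
    """x: (M, n_samples) time signals -> X: (M, n_frames, n_fft//2 + 1) spectra X_m(l,k)."""
    w = np.hamming(n_fft)                                   # Hamming window w(n), see fig. 2
    n_frames = 1 + (x.shape[1] - n_fft) // hop
    frames = np.stack([x[:, l * hop:l * hop + n_fft] * w for l in range(n_frames)], axis=1)
    return np.fft.rfft(frames, axis=-1)

def prediction_vector(X, l, k, L=12):
    """Stacked vector X^(l,k) = [X^T(l,k), ..., X^T(l-L,k)]^T for frame l, band k."""
    cols = [X[:, l - i, k] if l - i >= 0 else np.zeros(X.shape[0], dtype=complex)
            for i in range(L + 1)]
    return np.concatenate(cols)                             # shape (M*(L+1),)
```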
S3: and performing noise suppression filter calculation on the time-frequency domain signals to obtain a filter for constructing noise and reverberation suppression.
Wherein the calculation of the noise suppression filter comprises: calculating spatial discriminative coefficients and spatial weight information, updating a weighted autocorrelation matrix, and constructing a noise and reverberation suppression filter.
Specifically, the step S3 includes the steps of:
s301: calculating a spatial discriminative coefficient and spatial weight information of a current frame;
the spatial discrimination coefficients are calculated as follows:
ρ_s(l) = α·ρ_s(l-1) + (1-α)·Σ_k |h^H(k)·X(l,k)|^2;
ρ_x(l) = α·ρ_x(l-1) + (1-α)·(1/M)·Σ_k ||X(l,k)||^2;
wherein |·| denotes the modulus of a complex number; α is the smoothing factor between adjacent frames, with a value between 0 and 1. In the present invention α = 0.92 is preferred: if α is less than 0.88, the variation range of the energy estimate exceeds 20% and the estimate is unstable, and if α is greater than 0.96, the energy estimate is over-smoothed and the spatial discrimination falls below 40 degrees. The value 0.92 balances robustness and accuracy well.
In the formulas, ρ_s(l) and ρ_x(l) respectively denote the energy estimates of the speech-direction signal and of the microphone signals for frame l, and ρ_s(l-1) and ρ_x(l-1) the corresponding estimates for frame l-1; the difference between the two energy distributions expresses the spatial discrimination.
The spatial weight information β(l) of the current frame is calculated as follows:
β(l) = f(ρ_s(l), ρ_x(l))  [formula image not reproduced in the source; per the surrounding description, the weight is small in time-frequency regions where the target speech is dominant]
the spatial weight information calculated in step S301 is used to update the weighted autocorrelation matrix in the subsequent steps.
S302: updating the weighted autocorrelation matrix;
for each band k, the weighted autocorrelation matrix R^(l,k) is updated as follows:
R^(l,k) = α·R^(l-1,k) + (1-α)·β(l)·X^(l,k)·(X^(l,k))^H;
where α is the smoothing factor between adjacent frames, the same as in step S301. Through the dynamic weight information in the formula, the correlation matrix selectively accumulates the noise and reverberation components, so that noise and reverberation can be suppressed without distorting the target speech. The weighted autocorrelation matrix represents the correlation between the weighted microphone signals; because the weight is small where the target speech is dominant, it mainly retains the correlation information of the noise and reverberation. It is used for the computation of the final filter in subsequent steps.
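A sketch of the recursive update of step S302 under the same assumptions (the exact placement of the weight β in the original formula image is not known):

```python
# Sketch of step S302: weighted recursive autocorrelation update.
import numpy as np

def update_weighted_autocorr(R_prev, x_hat, beta, alpha=0.92):
    """R^(l,k) = α R^(l-1,k) + (1-α) β(l) X^(l,k) X^(l,k)^H."""
    return alpha * R_prev + (1.0 - alpha) * beta * np.outer(x_hat, x_hat.conj())
```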
S303: the noise and reverberation suppressed filters are updated.
For each frequency band k, the noise and reverberation suppression filter G(l,k) is constructed as follows:
G(l,k) = (R^(l,k))^{-1}·ū(k) / (ū^H(k)·(R^(l,k))^{-1}·ū(k));
ū(k) = [u^T(k), 0, ..., 0]^T;
wherein ū(k) is the M(L+1)×1 column vector obtained by extending u(k) from step S101 with zero vectors; this extension gives the filter both noise reduction and reverberation reduction capability.
The noise and reverberation suppression filter is used to perform the frequency domain estimation of the target speech in the subsequent step S4.
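Step S303 then amounts to a distortionless solve against the zero-extended steering vector, in the style of a convolutional (WPD-type) beamformer; the diagonal loading term below is a practical assumption.

```python
# Sketch of step S303: construct G(l,k) from the weighted autocorrelation matrix.
import numpy as np

def noise_reverb_filter(R, u, L=12, eps=1e-6):
    """G(l,k) = R^{-1} ū / (ū^H R^{-1} ū) with ū = [u^T, 0, ..., 0]^T."""
    M = u.shape[0]
    u_bar = np.concatenate([u, np.zeros(M * L, dtype=complex)])  # M(L+1)-dim ū(k)
    Rinv = np.linalg.inv(R + eps * np.eye(R.shape[0]))           # loading for invertibility
    return Rinv @ u_bar / (u_bar.conj() @ Rinv @ u_bar)
```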
S4: and according to the obtained filter, obtaining the frequency domain estimation of the target voice, and further obtaining the time domain estimation of the target voice.
The method specifically comprises the following steps:
s401: obtaining the frequency domain estimate S^(l,k) of the target speech from the solved noise and reverberation suppression filter. The specific calculation formula is as follows:
S^(l,k) = G^H(l,k)·X^(l,k);
s402: performing an inverse Fourier transform on the frequency domain estimate of the target speech to obtain the final target speech estimate s^(n). The specific calculation formula is as follows:
s^_l(n) = (1/N)·Σ_{k=0}^{N-1} S^(l,k)·e^{j2πnk/N}, with the frames s^_l combined by overlap-add to give s^(n).
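A sketch of step S4: the per-band filter output gives S^(l,k), and overlap-add of the inverse FFT gives the time-domain estimate (a proper synthesis-window normalization is omitted for brevity, as an assumption).

```python
# Sketch of step S4: per-band filtering followed by inverse FFT and overlap-add.
import numpy as np

def synthesize(S, n_fft=512, hop=256):
    """S: (n_frames, K) target spectra S^(l,k) -> time-domain estimate s^(n)."""
    out = np.zeros(hop * (S.shape[0] - 1) + n_fft)
    for l in range(S.shape[0]):
        out[l * hop:l * hop + n_fft] += np.fft.irfft(S[l], n=n_fft)
    return out

# Per frame l and band k, with g and x_hat as in the sketches above:
#   S[l, k] = g.conj() @ x_hat        # S^(l,k) = G^H(l,k) X^(l,k)
```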
through the steps of the invention, the initialization, the signal decomposition, the filter calculation and the target voice estimation of the target voice estimation signal can be realized.
In practical use, based on an 8-microphone linear array, a recording data test in a conference scene with a microphone spacing of 3.5cm, a length of 8 meters and a width of 4 meters and a height of 2.5 meters shows that by adopting the algorithm, the signal-to-noise ratio can be improved by 10dB (noise energy is suppressed by 90%) and the reverberation suppression ratio is 4.5dB (reverberation energy is suppressed by 65%).
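For orientation, the sketches above can be wired together as follows, using the embodiment's parameters (8 microphones, 3.5 cm spacing, N = 512, L = 12, α = 0.92); again, this is an illustrative reconstruction, not the patent's reference implementation.

```python
# End-to-end sketch assembling the helper functions defined in the sketches above.
import numpy as np

def process(x, mic_xy, theta_deg=90.0, n_fft=512, hop=256, L=12, alpha=0.92):
    M = x.shape[0]
    U = steering_vectors(mic_xy, theta_deg, n_fft=n_fft)           # S1: u(k)
    Hsd = superdirective_filters(mic_xy, U, n_fft=n_fft)           # S1: h(k)
    X = stft(x, n_fft, hop)                                        # S2: X_m(l,k)
    n_frames, K = X.shape[1], X.shape[2]
    D = M * (L + 1)
    R = np.stack([np.eye(D, dtype=complex) for _ in range(K)])     # per-band R^(l,k)
    S = np.zeros((n_frames, K), dtype=complex)
    rho_s = rho_x = 1e-8
    for l in range(n_frames):
        beta, rho_s, rho_x = spatial_weight(X[:, l, :], Hsd, rho_s, rho_x, alpha)  # S301
        for k in range(K):
            xh = prediction_vector(X, l, k, L)
            R[k] = update_weighted_autocorr(R[k], xh, beta, alpha)  # S302
            g = noise_reverb_filter(R[k], U[k], L)                  # S303
            S[l, k] = g.conj() @ xh                                 # S401
    return synthesize(S, n_fft, hop)                                # S402
```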
As shown in fig. 3, an embodiment of the present invention provides a noise suppression apparatus based on spatial discrimination detection, applied to a microphone-array-based system, which includes an initialization module 1, a signal decomposition module 2, a filter calculation module 3 and a target speech estimation module 4.
And the initialization module 1 is used for calculating a steering vector and a super-directional filter of the time domain signal of each microphone.
The initialization module 1 can also be used to obtain the speech signals of the microphones, as follows: let x_m(n) denote the original time domain signal picked up in real time by the m-th of M microphone elements, where m is the microphone index, ranging from 1 to M, and n is the time index; the direction θ of the target speech relative to the microphone array is known.
The target speech is the speech signal corresponding to the target direction. For a speech separation task, the target direction of the signal to be extracted is known in advance; for example, a large-screen voice communication device may be expected to separate the target speech signal arriving from the 90-degree direction.
Specifically, the initialization module 1 is configured to perform the following operations:
for each frequency band k (k = 1, 2, ..., K), where a frequency band is the signal component corresponding to a certain frequency, a target speech steering vector u(k) is calculated. The specific calculation formula is as follows:
u(k) = [e^{-jω_k·τ_1(θ)}, e^{-jω_k·τ_2(θ)}, ..., e^{-jω_k·τ_M(θ)}]^T;
τ_m(θ) = q(θ)·d_m^T / c;
q(θ) = [cos(θ), sin(θ)].
wherein f_k is the frequency corresponding to band k, and K is determined by the subsequent Fourier transform: if the frame length is 512, K is half of the frame length; c is the sound speed, c = 340 m/s; d_m is the two-dimensional coordinate of the m-th microphone; the superscript H denotes the conjugate transpose operator; j is the imaginary unit (j^2 = -1); q(θ) is the direction vector; ω_k = 2π·f_k is the circle (angular) frequency of band k.
The above operation initializes the steering vector, which represents the signal difference of each microphone element in the θ direction under an ideal scene free of reverberation and array-element mismatch. It is used to calculate the super-directional filter in the subsequent operation.
For each frequency band k, a super-directional filter h(k) is calculated. The specific calculation formula is as follows:
h(k) = R^{-1}(k)·u(k) / (u^H(k)·R^{-1}(k)·u(k));
where R(k) denotes the autocorrelation matrix of the uniform scattered (diffuse) field and the superscript -1 denotes the matrix inverse. In theory, the filter fully preserves the signal from the target direction θ while maximally suppressing uniform scattered field noise.
And the signal decomposition module 2 is used for converting the initialized signal into a time-frequency domain signal and constructing a frequency domain prediction vector.
In particular, the signal decomposition module 2 comprises the following sub-modules: a signal conversion module and a vector construction module.
The signal conversion module performs a short-time Fourier transform on the time domain signal x_m(n) to obtain the time-frequency domain expression; the purpose is to convert the time domain signal into a time-frequency domain signal. The specific calculation formula is as follows:
X_m(l,k) = Σ_{n=0}^{N-1} w(n)·x_m(l·N_s + n)·e^{-j2πnk/N};
wherein N is the frame length, N = 512; N_s is the frame shift; w(n) is a Hamming window of length 512, where n is the sample index within the frame, so w(n) is the window value at sample n; l is the time frame index, in units of frames; k is the frequency index. X_m(l,k) is the spectrum of the m-th microphone signal in the l-th frame and the k-th frequency band. The Hamming window function used in the invention is shown in fig. 2.
The vector construction module constructs, for each frequency band k, frequency domain prediction vectors X(l,k) and X^(l,k). The specific calculation formulas are as follows:
X(l,k) = [X_1(l,k), X_2(l,k), ..., X_M(l,k)]^T;
X^(l,k) = [X^T(l,k), X^T(l-1,k), ..., X^T(l-L,k)]^T.
wherein the superscript T denotes the transpose operator, and L is the length of the filter's look-back over previous time frames, typically ranging from 3 to 20. From the above formulas it can be seen that X(l,k) is an M×1 column vector and X^(l,k) is an M(L+1)×1 column vector. The frequency domain prediction vector X^(l,k) is constructed in order to predict the noise and reverberation components in subsequent steps.
The method uses L = 12, which effectively saves storage and computation time without noticeably degrading noise suppression performance.
The transformation from the time domain signal to the time-frequency domain can be completed through the operation.
And the filter calculation module 3 is used for performing noise suppression filter calculation on the time-frequency domain signal to obtain a filter for constructing noise and reverberation suppression.
Wherein, the filter calculation module 3 includes: a first calculation module for calculating the spatial discrimination coefficients and spatial weight information, a first update module for updating the weighted autocorrelation matrix, and a first construction module for constructing the noise and reverberation suppression filter.
Specifically, in the first calculation module, the spatial discrimination coefficients are calculated as follows:
ρ_s(l) = α·ρ_s(l-1) + (1-α)·Σ_k |h^H(k)·X(l,k)|^2;
ρ_x(l) = α·ρ_x(l-1) + (1-α)·(1/M)·Σ_k ||X(l,k)||^2;
wherein |·| denotes the modulus of a complex number; α is the smoothing factor between adjacent frames, with a value between 0 and 1. In the present invention α = 0.92 is preferred: if α is less than 0.88, the variation range of the energy estimate exceeds 20% and the estimate is unstable, and if α is greater than 0.96, the energy estimate is over-smoothed and the spatial discrimination falls below 40 degrees. The value 0.92 balances robustness and accuracy well.
In the formulas, ρ_s(l) and ρ_x(l) respectively denote the energy estimates of the speech-direction signal and of the microphone signals for frame l, and ρ_s(l-1) and ρ_x(l-1) the corresponding estimates for frame l-1.
The spatial weight information β(l) of the current frame is calculated as follows:
β(l) = f(ρ_s(l), ρ_x(l))  [formula image not reproduced in the source; per the surrounding description, the weight is small in time-frequency regions where the target speech is dominant]
The spatial weight information calculated in the above operation is used for the subsequent update of the weighted autocorrelation matrix.
In the first updating module, for each frequency band k, the weighted autocorrelation matrix R^(l,k) is updated as follows:
R^(l,k) = α·R^(l-1,k) + (1-α)·β(l)·X^(l,k)·(X^(l,k))^H;
wherein α is the smoothing factor between adjacent frames, the same as in the first calculation module. Through the dynamic weight information in the formula, the correlation matrix selectively accumulates the noise and reverberation components, so that noise and reverberation can be suppressed without distorting the target speech. The weighted autocorrelation matrix represents the correlation between the weighted microphone signals; because the weight is small where the target speech is dominant, it mainly retains the correlation information of the noise and reverberation. It is used for the subsequent final filter calculation.
In the first construction module, for each frequency band k, the noise and reverberation suppression filter G(l,k) is constructed as follows:
G(l,k) = (R^(l,k))^{-1}·ū(k) / (ū^H(k)·(R^(l,k))^{-1}·ū(k));
ū(k) = [u^T(k), 0, ..., 0]^T;
wherein ū(k) is the M(L+1)×1 column vector obtained by extending u(k) from the initialization module 1 with zero vectors; this extension gives the filter both noise reduction and reverberation reduction capability.
The noise and reverberation suppression filter is used to perform the frequency domain estimation of the target speech in the subsequent operation.
And the target voice estimation module 4 is used for obtaining the frequency domain estimation of the target voice according to the obtained filter, and further obtaining the time domain estimation of the target voice.
Specifically, the target speech estimation module 4 includes the following sub-modules: the device comprises a frequency domain estimation module and a target voice estimation module.
The frequency domain estimation module obtains the frequency domain estimate S^(l,k) of the target speech from the solved noise and reverberation suppression filter. The specific calculation formula is as follows:
S^(l,k) = G^H(l,k)·X^(l,k);
the target speech estimation module performs an inverse Fourier transform on the frequency domain estimate of the target speech to obtain the final target speech estimate s^(n). The specific calculation formula is as follows:
s^_l(n) = (1/N)·Σ_{k=0}^{N-1} S^(l,k)·e^{j2πnk/N}, with the frames s^_l combined by overlap-add to give s^(n).
the 4 modules are all absent from the invention. And the absence of any step can cause that the target voice cannot be extracted.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing descriptions of specific exemplary embodiments of the present invention have been presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain certain principles of the invention and its practical application to enable one skilled in the art to make and use various exemplary embodiments of the invention and various alternatives and modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.

Claims (10)

1. A noise suppression method based on spatial discrimination detection is characterized by comprising the following steps:
s1: calculating a steering vector and a super-directional filter from the time domain signal of each microphone;
s2: converting the initialized signals into time-frequency domain signals, and constructing frequency domain prediction vectors;
s3: performing noise suppression filter calculation on the time-frequency domain signals to obtain a filter for constructing noise and reverberation suppression; wherein the noise suppression filter calculation comprises: calculating a spatial discriminative coefficient and spatial weight information, updating a weighted autocorrelation matrix, and constructing a noise and reverberation suppression filter;
s4: and according to the obtained filter, obtaining the frequency domain estimation of the target voice, and further obtaining the time domain estimation of the target voice.
2. The noise suppression method based on spatial discrimination detection according to claim 1, characterized in that before the step S1, the method further comprises acquiring the voice signal x_m(n) of each microphone;
In step S1, the method specifically includes the following steps:
s101: for each frequency band k, a target speech steering vector u(k) is calculated:
u(k) = [e^{-jω_k·τ_1(θ)}, e^{-jω_k·τ_2(θ)}, ..., e^{-jω_k·τ_M(θ)}]^T;
τ_m(θ) = q(θ)·d_m^T / c;
q(θ) = [cos(θ), sin(θ)];
s102: for each frequency band k, compute a super-directional filter h(k):
h(k) = R^{-1}(k)·u(k) / (u^H(k)·R^{-1}(k)·u(k)).
3. The noise suppression method based on spatial discrimination detection according to claim 2, wherein said step S2 includes the steps of:
s201: performing a short-time Fourier transform on the time domain signal x_m(n) to obtain the time-frequency domain expression:
X_m(l,k) = Σ_{n=0}^{N-1} w(n)·x_m(l·N_s + n)·e^{-j2πnk/N} (N_s denotes the frame shift);
s202: for each frequency band k, constructing frequency domain prediction vectors X(l,k) and X^(l,k):
X(l,k) = [X_1(l,k), X_2(l,k), ..., X_M(l,k)]^T;
X^(l,k) = [X^T(l,k), X^T(l-1,k), ..., X^T(l-L,k)]^T.
4. The noise suppression method based on spatial discrimination detection according to claim 3, wherein said step S3 includes the steps of:
s301: calculating the spatial discrimination coefficients and spatial weight information of the current frame:
the spatial discrimination coefficients are calculated as follows:
ρ_s(l) = α·ρ_s(l-1) + (1-α)·Σ_k |h^H(k)·X(l,k)|^2;
ρ_x(l) = α·ρ_x(l-1) + (1-α)·(1/M)·Σ_k ||X(l,k)||^2;
where ρ_s(l) and ρ_x(l) respectively denote the energy estimates of the speech-direction signal and of the microphone signals for frame l, and the difference between the two energy distributions expresses the spatial discrimination;
the spatial weight information β(l) of the current frame is calculated as follows:
β(l) = f(ρ_s(l), ρ_x(l))  [formula image not reproduced in the source; the weight is small in time-frequency regions where the target speech dominates];
s302: updating the weighted autocorrelation matrix;
for each band k, the weighted autocorrelation matrix R^(l,k) is updated as follows:
R^(l,k) = α·R^(l-1,k) + (1-α)·β(l)·X^(l,k)·(X^(l,k))^H;
s303: updating the noise and reverberation suppression filter:
for each frequency band k, the noise and reverberation suppression filter G(l,k) is constructed as follows:
G(l,k) = (R^(l,k))^{-1}·ū(k) / (ū^H(k)·(R^(l,k))^{-1}·ū(k));
ū(k) = [u^T(k), 0, ..., 0]^T.
5. The noise suppression method based on spatial discrimination detection according to claim 4, characterized in that said step S4 comprises the steps of:
s401: obtaining the frequency domain estimate S^(l,k) of the target speech from the solved noise and reverberation suppression filter:
S^(l,k) = G^H(l,k)·X^(l,k);
s402: performing an inverse Fourier transform on the frequency domain estimate of the target speech to obtain the final target speech estimate s^(n):
s^_l(n) = (1/N)·Σ_{k=0}^{N-1} S^(l,k)·e^{j2πnk/N}, with the frames combined by overlap-add.
6. A noise suppression device based on spatial discrimination detection, characterized by comprising an initialization module, a signal decomposition module, a filter calculation module and a target speech estimation module;
the initialization module is used for calculating a steering vector and a super-directional filter from the time domain signal of each microphone;
the signal decomposition module is used for converting the initialized signals into time-frequency domain signals and constructing frequency domain prediction vectors;
the filter calculation module is used for performing noise suppression filter calculation on the time-frequency domain signals to construct a noise and reverberation suppression filter; wherein the filter calculation module comprises: a first calculation module for calculating spatial discrimination coefficients and spatial weight information, a first update module for updating the weighted autocorrelation matrix, and a first construction module for constructing the noise and reverberation suppression filter;
and the target speech estimation module is used for obtaining the frequency domain estimate of the target speech from the resulting filter, and from it the time domain estimate of the target speech.
7. The apparatus according to claim 6, wherein the initialization module is further configured to acquire the voice signal x_m(n) of each microphone;
The initialization module is configured to:
for each frequency band k, calculate a target speech steering vector u(k):
u(k) = [e^{-jω_k·τ_1(θ)}, e^{-jω_k·τ_2(θ)}, ..., e^{-jω_k·τ_M(θ)}]^T;
τ_m(θ) = q(θ)·d_m^T / c;
q(θ) = [cos(θ), sin(θ)];
and for each frequency band k, compute a super-directional filter h(k):
h(k) = R^{-1}(k)·u(k) / (u^H(k)·R^{-1}(k)·u(k)).
8. The apparatus according to claim 7, wherein the signal decomposition module comprises a signal conversion module and a vector construction module;
the signal conversion module is used for performing a short-time Fourier transform on the time domain signal x_m(n) to obtain the time-frequency domain expression:
X_m(l,k) = Σ_{n=0}^{N-1} w(n)·x_m(l·N_s + n)·e^{-j2πnk/N};
the vector construction module constructs, for each frequency band k, frequency domain prediction vectors X(l,k) and X^(l,k):
X(l,k) = [X_1(l,k), X_2(l,k), ..., X_M(l,k)]^T;
X^(l,k) = [X^T(l,k), X^T(l-1,k), ..., X^T(l-L,k)]^T.
9. The noise suppression device based on spatial discrimination detection according to claim 8, wherein:
in the first calculation module, the spatial discrimination coefficients are calculated as follows:
ρ_s(l) = α·ρ_s(l-1) + (1-α)·Σ_k |h^H(k)·X(l,k)|^2;
ρ_x(l) = α·ρ_x(l-1) + (1-α)·(1/M)·Σ_k ||X(l,k)||^2;
where ρ_s(l) and ρ_x(l) respectively denote the energy estimates of the speech-direction signal and of the microphone signals for frame l, and the difference between the two energy distributions expresses the spatial discrimination;
the spatial weight information β(l) of the current frame is calculated as follows:
β(l) = f(ρ_s(l), ρ_x(l))  [formula image not reproduced in the source; the weight is small in time-frequency regions where the target speech dominates];
in the first update module, for each frequency band k, the weighted autocorrelation matrix R^(l,k) is updated as follows:
R^(l,k) = α·R^(l-1,k) + (1-α)·β(l)·X^(l,k)·(X^(l,k))^H;
in the first construction module, for each frequency band k, the noise and reverberation suppression filter G(l,k) is constructed as follows:
G(l,k) = (R^(l,k))^{-1}·ū(k) / (ū^H(k)·(R^(l,k))^{-1}·ū(k));
ū(k) = [u^T(k), 0, ..., 0]^T.
10. The noise suppression device based on spatial discrimination detection according to claim 9, wherein the target speech estimation module comprises a frequency domain estimation module and a target speech estimation module;
the frequency domain estimation module is used for obtaining the frequency domain estimate S^(l,k) of the target speech from the solved noise and reverberation suppression filter:
S^(l,k) = G^H(l,k)·X^(l,k);
the target speech estimation module is used for performing an inverse Fourier transform on the frequency domain estimate of the target speech to obtain the final target speech estimate s^(n):
s^_l(n) = (1/N)·Σ_{k=0}^{N-1} S^(l,k)·e^{j2πnk/N}, with the frames combined by overlap-add.
CN202111216600.8A (filed 2021-10-19, priority 2021-10-19): Noise suppression method and device based on spatial discrimination detection. Status: Pending. Publication: CN113948101A (en).

Priority Applications (1)

CN202111216600.8A (priority date 2021-10-19, filing date 2021-10-19): Noise suppression method and device based on spatial discrimination detection

Publications (1)

CN113948101A, published 2022-01-18

Family

ID: 79331367

Family Applications (1)

CN202111216600.8A (filed 2021-10-19): Noise suppression method and device based on spatial discrimination detection (status: pending)

Country Status (1)

CN: CN113948101A (en)

Cited By (2)

* Cited by examiner, † Cited by third party

CN117935835A * (priority date 2024-03-22, published 2024-04-26): Audio noise reduction method, electronic device and storage medium
CN117935835B * (priority date 2024-03-22, published 2024-06-07): Audio noise reduction method, electronic device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination