CN114333888A - Multi-beam joint noise reduction method and device based on white noise gain control

Info

Publication number: CN114333888A
Application number: CN202111666084.9A
Authority: CN
Inventors: 项京朋; 邱峰海
Original assignee: Beijing Sound+ Technology Co ltd
Current assignee: Beijing Sound+ Technology Co ltd
Priority date: 2021-12-30
Filing date: 2021-12-30
Publication date: 2022-04-12

Abstract

The application provides a multi-beam joint noise reduction method and device based on white noise gain control. The method comprises the following steps: performing preliminary noise reduction on voice signals acquired by an earphone microphone array by using a plurality of directional beamformers with different white noise gain lower limits, to obtain a plurality of directional beam output signals; obtaining a null notch beam output signal corresponding to the voice signals by using a null notch beamformer; and post-filtering the beam output signals by using a machine-learning-based speech enhancement model. The method combines the advantages of a plurality of directional beamformers with different white noise gain lower limits, and can improve the suppression of noise from other directions while extracting the target voice.

Description

Multi-beam joint noise reduction method and device based on white noise gain control

Technical Field

The application relates to the technical field of voice noise reduction, in particular to a multi-beam joint noise reduction method and device based on white noise gain control.

Background

Headsets have become a common electronic product in everyday entertainment and voice communications. In practical applications, the microphone of the earphone often picks up various sounds, including other noises such as subway noise, road noise, other speaker noise, etc., in addition to the human voice of the wearer. These noises often have the characteristics of randomness, non-stationarity and the like, which not only affect the call quality, but also affect the performance of voice assistant applications such as voice wakeup, voice recognition and the like.

Therefore, noise reduction schemes are required to be designed for headphones to suppress various noises and enhance the wearer's voice.

Disclosure of Invention

The application provides a multi-beam joint noise reduction method and device based on white noise gain control. A plurality of directional beamformers corresponding to different white noise gain lower limits are used to jointly reduce noise in voice signals. By combining the advantages of the plurality of directional beamformers, the beam white noise gain is controlled while the target voice is extracted, the suppression of noise from other directions is improved, and the voice noise reduction effect is ultimately improved.

In a first aspect, the present application provides a method for determining a beamformer. The beamformer is applied to headphones, and the method comprises the following steps:

acquiring a lower limit value of white noise gain corresponding to a target beam former; the target beam former is obtained by weighting and summing a diagonal loading super-directional beam former and a delay summation beam former corresponding to the earphone, the diagonal loading super-directional beam former and the delay summation beam former are determined according to a steering vector corresponding to the earphone, and the steering vector is determined according to the distance between array elements in an extra-aural microphone array of the earphone and a pointing angle corresponding to the beam former;

updating a first weight of the diagonally-loaded super-directional beamformer in the target beamformer and a second weight of the delay-sum beamformer in the target beamformer in accordance with the lower limit value;

and determining an updated target beam former according to the updated first weight and the updated second weight.

According to the scheme, the weights of the diagonally loaded super-directional beamformer and the delay-sum beamformer in the target beamformer are updated according to the white noise gain lower limit of the target beamformer, so that the white noise gain of the target beamformer does not fall below the lower limit value. The target beamformer combines the high white noise gain of the delay-sum beamformer with the high directivity of the diagonally loaded super-directional beamformer, and can achieve a certain noise reduction effect while controlling the white noise gain.

In one possible embodiment, the target beamformer comprises a directional beamformer or a null beamformer; wherein, the pointing angle corresponding to the directional beam former is different from the pointing angle corresponding to the null notch beam former.

In one possible implementation, the updating the first weight of the diagonally-loaded super-directional beamformer in the target beamformer and the second weight of the delay-sum beamformer in the target beamformer according to the lower limit value comprises:

determining a regularization parameter of the target beam former in each frequency band according to the target lower limit value, the white noise gain of the delay summation beam former in each frequency band, the white noise gain of the diagonal loading super-directional beam former in each frequency band, and the directivity index value of the diagonal loading super-directional beam former in each frequency band;

and determining the updated first weight and the updated second weight according to the regularization parameters corresponding to the frequency bands.

In a second aspect, the present application further provides a speech noise reduction method. The method is applied to earphones, and comprises the following steps:

acquiring a voice signal to be processed; the voice signal is collected by a microphone array outside the earphone;

obtaining a plurality of directional beam output signals by a plurality of directional beam formers according to the voice signals, and obtaining zero-notch beam output signals corresponding to the voice signals by a zero-notch beam former; wherein the plurality of directional beam output signals correspond to the plurality of directional beam formers one to one, and white noise gain lower limit values corresponding to the plurality of directional beam formers are different; the plurality of directional beamformers and the null beamformer are both obtained according to the method of the first aspect and its alternative embodiments,

according to the zero-notch beam output signal and the plurality of directional beam output signals, post-filtering is carried out by utilizing a voice enhancement model to obtain a frequency domain signal after noise reduction;

and obtaining a voice signal after noise reduction according to the frequency domain signal after noise reduction.

Directional beamformers with different white noise gain lower limits differ in their ability to suppress noise from other directions. For a common fixed beamformer, the lower the white noise gain of the directional beamformer, the weaker its suppression of microphone self-noise, but the better its suppression of noise from other directions. Self-noise is amplified when the white noise gain is below 0 dB. Therefore, in the above scheme, joint noise reduction is performed by using a plurality of directional beamformers with different white noise gain lower limits, so that the advantages of the plurality of directional beamformers are combined: the white noise gain in the noise-reduced voice is controlled while the suppression of noise from other directions is ensured, which improves the voice noise reduction effect.

In a third aspect, the present application further provides a method for training a speech enhancement model, where the speech enhancement model is applied to a headset, and the method includes:

acquiring a voice signal with noise and a pure voice signal;

obtaining a plurality of directional beam output signals corresponding to the voice signal with noise by using a plurality of directional beam formers corresponding to the earphone, and obtaining a zero-notch beam output signal corresponding to the voice signal with noise by using a zero-notch beam former corresponding to the earphone; wherein the plurality of directional beam formers and the plurality of directional beam output signals are in one-to-one correspondence, the plurality of directional beam formers and the null notch beam former are obtained according to the method in the first aspect and its optional embodiments, and the white noise gains corresponding to the plurality of directional beam formers have different lower limit values;

updating parameters of a speech enhancement model according to the plurality of directional beam output signals, the null notch beam output signal and the clean speech signal.

In one possible embodiment, the speech enhancement model is determined according to a GCCRN network.

In a fourth aspect, the present application further provides a device for determining a beamformer. The beam former is applied to earphones, and the determining device comprises:

the acquisition module is used for acquiring a lower limit value of white noise gain corresponding to the target beam former; the target beam former is obtained by carrying out weighted summation on a diagonal loading super-directivity beam former and a delay summation beam former corresponding to the earphone, the diagonal loading super-directivity beam former and the delay summation beam former are determined according to a steering vector corresponding to the earphone, and the steering vector is determined according to the distance between array elements in a microphone array outside the earphone and a pointing angle corresponding to the beam former;

an update module to update a first weight of the diagonally-loaded super-directional beamformer in the target beamformer and a second weight of the delay-sum beamformer in the target beamformer according to the lower limit value;

a determining module, configured to determine an updated target beamformer according to the updated first weight and the updated second weight.

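In a possible implementation manner, the update module is specifically configured to:

determine a regularization parameter of the target beamformer in each frequency band according to the lower limit value, the white noise gain of the delay-sum beamformer in each frequency band, the white noise gain of the diagonally loaded super-directional beamformer in each frequency band, and the directivity index value of the diagonally loaded super-directional beamformer in each frequency band;

and determine the updated first weight and the updated second weight according to the regularization parameters corresponding to the frequency bands.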

In a fifth aspect, the present application further provides a speech noise reduction apparatus. The voice noise reduction device is applied to a headset, and comprises:

the acquisition module is used for acquiring a voice signal to be processed; the voice signal is collected by a microphone array outside the earphone;

a beam forming module, configured to obtain, according to the voice signal, a plurality of directional beam output signals by using a plurality of directional beam formers, and obtain a null notch beam output signal corresponding to the voice signal by using a null notch beam former; wherein the plurality of directional beam output signals correspond to the plurality of directional beam formers one to one, the plurality of directional beam formers and the null notch beam former are obtained according to the method in the first aspect and the optional embodiments thereof, and white noise gain lower limit values corresponding to the plurality of directional beam formers are different;

a post-filtering module, configured to perform post-filtering by using a speech enhancement model according to the zero-notch beam output signal and the plurality of directional beam output signals, to obtain a frequency domain signal after noise reduction;

and the conversion module is used for obtaining the voice signal after noise reduction according to the frequency domain signal after noise reduction.

In a sixth aspect, the present application further provides a training apparatus for a speech enhancement model. The speech enhancement model is applied to a headset, the training device comprising:

the acquisition module is used for acquiring a voice signal with noise and a pure voice signal;

a beam forming module, configured to obtain a plurality of directional beam output signals corresponding to the noisy speech signal by using a plurality of directional beam formers corresponding to the earphones, and obtain a zero-notch beam output signal corresponding to the noisy speech signal by using a zero-notch beam former corresponding to the earphones; wherein the plurality of directional beam formers and the plurality of directional beam output signals are in one-to-one correspondence, the plurality of directional beam formers and the null notch beam former are obtained according to the method in the first aspect and its optional embodiments, and the white noise gains corresponding to the plurality of directional beam formers have different lower limit values;

and the updating module is used for updating parameters of a voice enhancement model according to the plurality of directional beam output signals, the zero notch beam output signal and the pure voice signal.

In a seventh aspect, the present application further provides a headset. The headset comprises a processor and a memory, the processor being configured to execute a computer program stored in the memory to perform the method of the second aspect.

Any one of the devices or earphones provided above is used to execute the method provided above, and therefore, the beneficial effects that can be achieved by the devices or earphones refer to the beneficial effects of the corresponding schemes in the corresponding methods provided above, and are not described herein again.

Drawings

Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;

Fig. 2 is a schematic diagram of a user wearing a headset according to an embodiment of the present application;

Fig. 3 is a white noise gain comparison diagram and a directivity index comparison diagram of each beamformer according to an embodiment of the present application;

Fig. 4 is a beam pattern comparison diagram of each beamformer provided by an embodiment of the present application;

Fig. 5 is a flowchart of a method for determining a beamformer provided in an embodiment of the present application;

Fig. 6 is a beam pattern comparison diagram of beamformers based on different white noise gain control according to an embodiment of the present application;

Fig. 7 shows a spectrogram of a 0-4 kHz signal in an actual noisy indoor environment and spectrograms of the signal processed by different directional beamformers according to an embodiment of the present application;

Fig. 8 shows a spectrogram of a 0-1 kHz signal in an actual noisy indoor environment and spectrograms of the signal processed by different directional beamformers according to an embodiment of the present application;

Fig. 9 is a beam pattern comparison diagram of null notch beamformers for two different pointing angles according to an embodiment of the present application;

Fig. 10 shows a beam pattern and a white noise gain plot for a null notch beamformer provided by an embodiment of the present application;

Fig. 11 shows spectrograms of signals received by a microphone array in an actual noisy indoor environment before and after being processed by a null notch beamformer according to an embodiment of the present application;

Fig. 12 is a schematic diagram of a multi-beam joint noise reduction process based on white noise gain control according to an embodiment of the present application;

Fig. 13 is a flowchart of a multi-beam joint noise reduction method based on white noise gain control according to an embodiment of the present application;

Fig. 14 shows a spectrogram of a signal received by a microphone array in an indoor music noise environment and spectrograms of the signal processed by different directional beamformers according to an embodiment of the present application;

Fig. 15 shows a spectrogram of a signal received by a microphone array in an outdoor noise environment and spectrograms of the signal processed by different directional beamformers according to an embodiment of the present application;

Fig. 16 is a flowchart of a method for training a speech enhancement model according to an embodiment of the present application;

Fig. 17 is a schematic structural diagram of a GCCRN network according to an embodiment of the present application;

Fig. 18 is a schematic structural diagram of a determining apparatus of a beamformer provided in an embodiment of the present application;

Fig. 19 is a schematic structural diagram of a multi-beam joint noise reduction apparatus based on white noise gain control according to an embodiment of the present application;

Fig. 20 is a schematic structural diagram of a training apparatus for a speech enhancement model according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments of the present application will be described below with reference to the accompanying drawings.

In the description of the embodiments of the present application, the words "exemplary," "for example," or "for instance" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary," "for example," or "for instance" is not to be construed as preferred or advantageous over other embodiments or designs. Rather, use of these words is intended to present relevant concepts in a concrete fashion.

In the description of the embodiments of the present application, the term "and/or" merely describes an association relationship between associated objects and means that three relationships may exist. For example, A and/or B may mean: A exists alone, B exists alone, or A and B exist at the same time. In addition, the term "plurality" means two or more unless otherwise specified. For example, a plurality of systems refers to two or more systems, and a plurality of screen terminals refers to two or more screen terminals.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.

Fig. 1 is a schematic diagram of a signal model of a headset provided in the present application. As shown in Fig. 1, when user 1 wears an earphone in a space and uses the earphone to talk, an extra-aural microphone of the earphone collects the voice of user 1 and the ambient noise around the user. The extra-aural microphone is a microphone located outside the ear when the user wears the earphone.

Fig. 2 shows a schematic view of a user wearing a headset. As shown in the front view of fig. 2, both sides of the headset are provided with an extra-aural microphone array (left side Mic1 and Mic3, right side Mic2 and Mic4) configured as a linear array. As shown in the side view of fig. 2, each of the extra-aural microphone arrays contains two microphones.

Specifically, the signal collected by the ith extra-aural microphone in the earphone can be expressed by formula (1).

x_i(n)＝s_i(n)+d_i(n)+w_i(n) (1)

In the formula (1), s_i(n) represents the voice signal of the user collected by the i-th extra-aural microphone, d_i(n) represents the ambient noise signal collected by the i-th extra-aural microphone, w_i(n) denotes the self-noise of the i-th extra-aural microphone, and i ∈ [1, M], where M represents the number of microphones in the extra-aural microphone array.

The ambient noise may include, among others, direct sound and diffuse field noise. Referring to Fig. 1, the direct sound may be sound emitted by user 2 of Fig. 1 that reaches a microphone along a straight path, and the diffuse field noise may be sound emitted by user 2 that reaches a microphone after being reflected.

Based on the signal representation shown in equation (1), the frequency domain receiving model of the i-th extra-aural microphone can be as shown in equation (2).

X_i(k,l)＝S_i(k,l)+D_i(k,l)+W_i(k,l) (2)

In the formula (2), X_i(k,l), S_i(k,l), D_i(k,l) and W_i(k,l) are the short-time spectra of x_i(n), s_i(n), d_i(n) and w_i(n), respectively, in the k-th frequency band (the corresponding frequency is denoted f_k) of the l-th frame. In practical application, x_i(n) can be converted to X_i(k,l) by a fast Fourier transform.
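As a minimal sketch of how formula (2) follows from formula (1), the short-time spectrum can be computed frame by frame and checked for linearity. The frame length, hop size, Hann window and toy signals below are illustrative assumptions, not values from this application, and a naive DFT stands in for the fast Fourier transform.

```python
import cmath
import math

def stft(x, frame_len=8, hop=4):
    """Return X[l][k]: windowed DFT of frame l at frequency band k."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        seg = [x[start + n] * window[n] for n in range(frame_len)]
        spectrum = [
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames

# Toy stand-ins for s_i(n) (speech) and d_i(n) (ambient noise); self-noise omitted.
s = [math.sin(2 * math.pi * 2 * n / 8) for n in range(16)]
d = [0.1 * (-1) ** n for n in range(16)]
x = [a + b for a, b in zip(s, d)]

X, S, D = stft(x), stft(s), stft(d)
# Largest deviation between X(k,l) and S(k,l) + D(k,l) over all frames and bands.
max_err = max(abs(X[l][k] - (S[l][k] + D[l][k]))
              for l in range(len(X)) for k in range(len(X[0])))
print(len(X), len(X[0]), max_err)
```

Because the transform is linear, the additive time-domain model carries over to every frequency band and frame, which is what formula (2) states.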

In one example, the headset may employ three noise reduction schemes for speech enhancement. Three noise reduction schemes are briefly described below.

The first is a single-microphone scheme. In this scheme the earphone is provided with a single extra-aural microphone, and a traditional spectral subtraction method or a machine-learning-based single-channel speech enhancement algorithm is used to process the frequency domain signal represented by formula (2).

The scheme has low cost, but cannot achieve good noise suppression effect under a high-noise environment. Especially when the noise includes the voice or voice-like signal of the non-wearer, the scheme cannot effectively distinguish the target voice from the interfering voice, resulting in a significant reduction in the noise suppression effect.

The second scheme combines the extra-aural microphone with an internal auxiliary sensor for noise reduction: the signals collected by the extra-aural microphone and by the internal auxiliary sensor are fused to achieve speech enhancement. Common internal auxiliary sensors include an in-ear microphone, a vibration sensor, an acceleration sensor, and the like.

This scheme relies on the internal auxiliary sensor to provide information that assists the decision. Although it can achieve a better noise reduction effect, the wearer must wear the earphone correctly to guarantee the pickup quality of the auxiliary sensor. When the earphone is worn incorrectly or the coupling is poor, the fusion of the in-ear and extra-aural signals is mismatched and the noise reduction performance degrades. In addition, the internal auxiliary sensor is relatively expensive, which increases the hardware cost of the headset.

The third is a scheme with multiple extra-aural microphones, in which generally 2 to 4 microphones are arranged in a single-side earphone to form an extra-aural microphone array. In this scheme, a beamforming method or an adaptive beamforming method is first used to perform preliminary enhancement on the frequency domain signal represented by formula (2), and a post-filtering method is then used to further enhance the beam output. A common representative method is the time-frequency-domain generalized sidelobe canceller (TF-GSC) algorithm, which obtains a preliminarily enhanced target signal and a noise signal by constructing a directional beam and a null notch beam, and then performs post-filtering to obtain the enhanced signal.

The hardware cost of this scheme is low, and its noise reduction performance is significantly better than that of the single-microphone method. However, the scheme still has some problems.

For example, the directional beams in this scheme are typically formed using a delay-and-sum beamformer (DSB) or a super-directional beamformer (SDB). The DSB does not amplify white noise, but it has poor directivity, limited noise suppression capability in each direction, and a high sidelobe level in its beam pattern. The SDB has the highest directivity index, that is, good directivity, but it severely amplifies low-frequency white noise.

As another example, null notch beams are often formed using differential beamforming or a blocking matrix design. The former also amplifies low-frequency white noise, and the latter produces a null that is too narrow. When the null is too narrow, a deviation between the preset null direction and the actual direction of the wearer's mouth sharply degrades the notch performance. These problems can lead to speech distortion after post-filtering at low signal-to-noise ratios.

Specifically, the DSB and SDB used in the third scheme may be expressed as shown in equations (3) and (4), respectively.

In the above formulas (3) and (4), v(k,θ_s) denotes the steering vector of the beamformer pointing at the person's mouth, and v^H(k,θ_s) denotes the conjugate transpose of v(k,θ_s); Γ_d(k) is the diffuse field covariance matrix of dimension M × M, whose element in the n-th row and m-th column can be calculated according to equation (5).

In the formula (5), d_mn denotes the distance between the n-th and m-th microphones, n ∈ [1, M], m ∈ [1, M], and c denotes the speed of sound.
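For orientation, the textbook forms of the delay-and-sum beamformer, the superdirective beamformer and the spherically isotropic (diffuse) noise coherence matrix that fit the symbol definitions above are sketched below. The distortionless normalization toward θ_s and the unnormalized sinc are assumptions taken from the standard literature; the notation used in equations (3) to (5) of this application may differ.

```latex
\begin{align*}
\mathbf{w}_{\mathrm{DSB}}(k) &= \frac{\mathbf{v}(k,\theta_s)}{M}, \\
\mathbf{w}_{\mathrm{SDB}}(k) &= \frac{\Gamma_d^{-1}(k)\,\mathbf{v}(k,\theta_s)}
  {\mathbf{v}^H(k,\theta_s)\,\Gamma_d^{-1}(k)\,\mathbf{v}(k,\theta_s)}, \\
\left[\Gamma_d(k)\right]_{nm} &= \frac{\sin\!\left(2\pi f_k d_{mn}/c\right)}{2\pi f_k d_{mn}/c},
  \qquad \left[\Gamma_d(k)\right]_{nn} = 1 .
\end{align*}
```

Both beamformers satisfy w^H(k) v(k,θ_s) = 1 in these forms, so a signal arriving from the pointing direction passes through undistorted.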

Generally, the indices measuring the beam performance of the beamformer include a directivity index, a white noise gain, and a beam pattern. Wherein the directivity index, the white noise gain, and the beam pattern may be determined according to formula (6), formula (7), and formula (8).

B(k,θ)＝w^H(k)v(k,θ) (8)

In the above formulas (6) to (8), the quantity given by formula (6) is the directivity index, the quantity given by formula (7) is the white noise gain, B(k,θ) denotes the beam gain in the beam pattern, w^H(k) denotes the conjugate transpose of the beamformer expression w(k), and v(k,θ) denotes the steering vector pointing in the direction of angle θ. The larger the directivity index, the stronger the beam directivity of the beamformer and the stronger its noise suppression capability in directions other than the pointing direction. The larger the white noise gain, the less the beamformer amplifies white noise and the better its robustness.
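The three performance measures can be sketched numerically for a two-microphone array. The definitions used here are the standard ones (white noise gain |w^H v|^2 / (w^H w), directivity index |w^H v|^2 / (w^H Γ_d w), both in dB, and beam gain |w^H v(k,θ)|) and are assumed to correspond to formulas (6) to (8). The 2 cm spacing, the speed of sound of 343 m/s, the sign convention of the steering vector and all function names are illustrative assumptions.

```python
import cmath
import math

C = 343.0  # assumed speed of sound in m/s

def sinc(x):
    return 1.0 if x == 0 else math.sin(x) / x

def steering(freq, spacing, theta_deg):
    """Two-element steering vector, first microphone as reference."""
    tau = spacing * math.cos(math.radians(theta_deg)) / C
    return [1.0 + 0j, cmath.exp(-2j * math.pi * freq * tau)]

def inner(a, b):
    """Return a^H b for two complex vectors."""
    return sum(p.conjugate() * q for p, q in zip(a, b))

def white_noise_gain_db(w, v):
    return 10 * math.log10(abs(inner(w, v)) ** 2 / inner(w, w).real)

def directivity_index_db(w, v, freq, spacing):
    s = sinc(2 * math.pi * freq * spacing / C)
    gamma_w = [w[0] + s * w[1], s * w[0] + w[1]]  # Gamma_d w for M = 2
    return 10 * math.log10(abs(inner(w, v)) ** 2 / inner(w, gamma_w).real)

def beam_gain(w, freq, spacing, theta_deg):
    return abs(inner(w, steering(freq, spacing, theta_deg)))

freq, spacing = 1000.0, 0.02
v = steering(freq, spacing, 0.0)
w_dsb = [x / 2 for x in v]  # delay-and-sum weights for M = 2

wng_dsb = white_noise_gain_db(w_dsb, v)
di_dsb = directivity_index_db(w_dsb, v, freq, spacing)
print(wng_dsb, di_dsb, beam_gain(w_dsb, freq, spacing, 180.0))
```

For the delay-and-sum weights the white noise gain evaluates to 10·log10(M) regardless of frequency, which is consistent with the statement below that the DSB has a constant, maximal white noise gain.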

The SDB has the largest directivity index, while the delay-sum beamformer has the maximum white noise gain (a constant) relative to the other beamformers mentioned in this application. The corresponding expressions are shown in formula (9) and formula (10), respectively.

To address the white noise amplification problem of the SDB, the SDB can be improved by applying a diagonal loading operation, which yields the diagonally loaded super-directional beamformer (SDB-DL). Specifically, Γ_d(k) in equation (4) is replaced by a matrix determined from the diagonal loading ε(k), namely Γ_ε(k) = Γ_d(k) + ε(k)I_M. The expression of SDB-DL can therefore be as shown in equation (11).

Accordingly, the white noise gain expression and the directivity expression of SDB-DL may be as shown in formula (12) and formula (13).

Although diagonal loading mitigates the white noise amplification of the SDB, the required loading amount differs between frequency points, and the loading amount for each frequency point can be determined empirically.
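The effect of the loading amount can be illustrated with a small sketch for a two-microphone array. It assumes the standard diagonally loaded form w = Γ_ε^{-1} v / (v^H Γ_ε^{-1} v) with Γ_ε = Γ_d + εI; the spacing, frequency, speed of sound and function names are illustrative assumptions, and the two ε values are simply the end points of the range stated for ε(k) in this application.

```python
import cmath
import math

C = 343.0  # assumed speed of sound in m/s

def sinc(x):
    return 1.0 if x == 0 else math.sin(x) / x

def steering(freq, spacing, theta_deg):
    tau = spacing * math.cos(math.radians(theta_deg)) / C
    return [1.0 + 0j, cmath.exp(-2j * math.pi * freq * tau)]

def inner(a, b):
    """Return a^H b for two complex vectors."""
    return sum(p.conjugate() * q for p, q in zip(a, b))

def white_noise_gain_db(w, v):
    return 10 * math.log10(abs(inner(w, v)) ** 2 / inner(w, w).real)

def sdb_dl_weights(freq, spacing, theta_deg, eps):
    """Diagonally loaded superdirective weights for a two-microphone array."""
    s = sinc(2 * math.pi * freq * spacing / C)
    v = steering(freq, spacing, theta_deg)
    a = 1.0 + eps                      # diagonal entry of Gamma_d + eps * I
    det = a * a - s * s
    # Gamma_eps^{-1} v using the closed-form inverse of a 2 x 2 matrix
    u = [(a * v[0] - s * v[1]) / det, (a * v[1] - s * v[0]) / det]
    scale = inner(v, u).real           # v^H Gamma_eps^{-1} v (real, positive)
    return [x / scale for x in u]

freq, spacing = 200.0, 0.02
v = steering(freq, spacing, 0.0)
w_light = sdb_dl_weights(freq, spacing, 0.0, 1e-6)
w_heavy = sdb_dl_weights(freq, spacing, 0.0, 1e-2)
wng_light = white_noise_gain_db(w_light, v)
wng_heavy = white_noise_gain_db(w_heavy, v)
print(wng_light, wng_heavy)
```

A heavier loading raises the white noise gain at low frequencies at the cost of directivity, but no single ε yields a prescribed white noise gain at every frequency, which motivates controlling the gain directly.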

Each of the DSB, SDB and SDB-DL mentioned above has advantages and disadvantages. The performance index comparison of the three beamformers shown in Fig. 3 is obtained from the expressions of the DSB, SDB and SDB-DL and the performance index expressions given above. As shown in panel (a) of Fig. 3, the white noise gain of the SDB near 0 Hz is about -30 dB, which indicates severe low-frequency white noise amplification; the DSB shows almost no low-frequency white noise amplification; and owing to the diagonal loading operation, the white noise amplification of SDB-DL is moderate compared with that of the SDB. As shown in panel (b) of Fig. 3, the directivity of the SDB is the best, the directivity of the DSB is the worst, and the directivity of SDB-DL is degraded compared with that of the SDB.

As the beam patterns in Fig. 4 show, the main lobe of the DSB is wide and its noise suppression capability in other directions is poor; the SDB has the narrowest main lobe and the best noise suppression capability in other directions; SDB-DL lies in between. In Fig. 4, the horizontal axis represents the pointing angle (Direction), the left vertical axis represents the frequency (Frequency), and the scale on the right represents the beam gain.

Based on the above problems of the beam former, embodiments of the present application provide a beam former based on white noise gain control, which can achieve a directivity index as high as possible on the premise of accurately controlling the white noise gain. In particular, the beamformer is a regularization weighting based beamformer, as shown in equation (14).

In equation (14), α(k) denotes the regularization parameter, and the diagonal loading ε(k) may be determined by equation (15). The value of ε(k) ranges from 10^-6 to 10^-2.

Equation (14) may be rewritten in the form shown in equation (16), where α_ε(k) is an intermediate variable. As can be seen from equation (16), the beamformer w_α(k) can be regarded as a linear combination of SDB-DL and DSB, and the two coefficients of this combination are the weights of SDB-DL and DSB, respectively.

From equations (11), (12) and (16), the white noise gain corresponding to the beamformer w_α(k) is obtained as shown in equation (17). Combining equation (16) with equation (17) yields the formula for calculating the regularization parameter α(k), as shown in equation (19).

According to formula (16) and the associated expressions, the weights of SDB-DL and DSB described above are related to the white noise gain, the steering vector and the regularization parameter of the beamformer. As can be seen from equations (10), (12) and (19), the regularization parameter is related to the white noise gain and the steering vector of the beamformer.
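One way to see why the weights depend only on white noise gains is the following short derivation. It is a sketch under stated assumptions and not necessarily the form of equation (19): the combination is taken to be convex with a hypothetical DSB weight a (playing the role of the intermediate variable), both component beamformers are assumed distortionless (w^H v = 1), the steering vector has unit-modulus entries so that w_DSB = v/M, and G denotes white noise gain on a linear scale.

```latex
\begin{align*}
\mathbf{w}_a &= (1-a)\,\mathbf{w}_{\mathrm{SDB\text{-}DL}} + a\,\mathbf{w}_{\mathrm{DSB}},
  \qquad 0 \le a \le 1, \\
\mathbf{n} &= \mathbf{w}_{\mathrm{SDB\text{-}DL}} - \mathbf{w}_{\mathrm{DSB}},
  \qquad \mathbf{v}^H\mathbf{n} = 1 - 1 = 0
  \;\Rightarrow\; \mathbf{w}_{\mathrm{DSB}}^H\mathbf{n} = 0, \\
\|\mathbf{w}_a\|^2 &= \|\mathbf{w}_{\mathrm{DSB}} + (1-a)\,\mathbf{n}\|^2
  = \frac{1}{M} + (1-a)^2\,\|\mathbf{n}\|^2,
  \qquad \|\mathbf{n}\|^2 = \frac{1}{G_{\mathrm{DL}}} - \frac{1}{M}, \\
\frac{1}{\|\mathbf{w}_a\|^2} \ge G_{\min}
  &\iff a \ge 1 - \sqrt{\frac{1/G_{\min} - 1/M}{1/G_{\mathrm{DL}} - 1/M}},
  \qquad G_{\mathrm{DL}} \le G_{\min} \le M .
\end{align*}
```

Under these assumptions the smallest admissible DSB weight is fixed by the lower limit G_min, the white noise gain G_DL of SDB-DL and the white noise gain M of the DSB, and choosing the smallest value keeps as much of the SDB-DL directivity as the lower limit allows.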

When the pointing angle in the steering vector is an angle pointing at the human mouth, the beamformer is a directional beamformer. When the pointing angle in the steering vector is an angle not pointing at the human mouth, the beamformer is a null notch beamformer.

Therefore, when determining the beamformer, a directional beamformer and a null beamformer can be obtained from different white noise gain lower limit values and pointing angles.

Based on this, the embodiments of the present application provide a method for determining a beamformer, which is used for determining the above-mentioned directional beamformer and null notch beamformer.

Fig. 5 is a flowchart of a method for determining a beamformer according to an embodiment of the present application.

As shown in fig. 5, the method includes steps S501 to S503 as follows.

In step S501, a lower limit value of a white noise gain corresponding to a target beamformer is acquired.

Specifically, the lower limit of the white noise gain can be 0 dB, -6 dB, or -12 dB. The target beamformer may be the beamformer shown in equation (16). The target beamformer is obtained by weighting and summing a diagonally loaded super-directional beamformer and a delay-and-sum beamformer corresponding to the earphone.

Specifically, as shown in equations (3) and (11), the diagonally loaded super-directional beamformer and the delay-sum beamformer corresponding to the headphones can be determined according to the steering vectors corresponding to the headphones.

In particular, the steering vector may be determined based on the distance between the array elements in the extra-aural microphone array of the earphone and the pointing angle corresponding to the target beamformer, where τ_m = d_m1 cos θ_s / c, θ_s denotes the pointing angle, and d_m1 denotes the distance between the m-th microphone and the 1st microphone. The 1st microphone may be the microphone in the microphone array that is farther from the person's mouth or the one that is closer to it.

Specifically, when the target beamformer is a directional beamformer, $\theta_s$ is the angle pointing at the mouth of the person. In one example, the angle pointing at the mouth of a person may be 0 degrees.

Specifically, when the target beamformer is a null beamformer, one pointing angle may be set for the microphone array, or a different pointing angle may be set for each microphone in the microphone array. Taking a microphone array comprising two microphones as an example, the pointing angle for the microphone near the mouth of the person may be set to 135 degrees and the pointing angle for the microphone away from the mouth of the person may be set to 100 degrees. In this way, two beamformers with different pointing directions are obtained.
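As a minimal sketch of the delay computation described above, the following Python snippet evaluates the delay tau_m = d_m1 * cos(theta_s) / c and builds a far-field steering vector for the 0 degree, 135 degree and 100 degree pointing angles. The exponential form exp(-j*2*pi*f*tau_m), the 2 cm microphone spacing, the value 343 m/s for c, and all function names are assumptions made for illustration only; the text above gives only the delay expression.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for c


def delays(distances_m, theta_deg, c=SPEED_OF_SOUND):
    """tau_m = d_m1 * cos(theta_s) / c for each microphone (d_11 = 0)."""
    d = np.asarray(distances_m, dtype=float)
    return d * np.cos(np.deg2rad(theta_deg)) / c


def steering_vector(freq_hz, distances_m, theta_deg, c=SPEED_OF_SOUND):
    """Assumed far-field model: element m is exp(-j*2*pi*f*tau_m)."""
    tau = delays(distances_m, theta_deg, c)
    return np.exp(-2j * np.pi * freq_hz * tau)


# Two microphones 2 cm apart (hypothetical geometry), evaluated at 1 kHz.
mic_distances = [0.0, 0.02]
d_mouth = steering_vector(1000.0, mic_distances, 0.0)     # directional beam
d_null_a = steering_vector(1000.0, mic_distances, 135.0)  # first null beam
d_null_b = steering_vector(1000.0, mic_distances, 100.0)  # second null beam
```

Under these assumptions, every steering vector element has unit magnitude, and only the phase of the second element changes with the pointing angle.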

In one example, the white noise gain lower limit value may be limited for different frequencies. For example, for a frequency $f_k$, when the lower limit value is lower than the white noise gain of the SDB-DL at $f_k$, the white noise gain of the SDB-DL at $f_k$ may instead be used as the white noise gain lower limit for that frequency.
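The per-frequency limiting described above can be sketched as an element-wise maximum. The function name and the SDB-DL gain values below are hypothetical and serve only to illustrate the rule.

```python
import numpy as np


def effective_wng_floor_db(requested_floor_db, sdb_dl_wng_db):
    """Per-frequency lower limit: where the requested limit lies below the
    white noise gain that SDB-DL already achieves, use the SDB-DL gain."""
    sdb = np.asarray(sdb_dl_wng_db, dtype=float)
    return np.maximum(requested_floor_db, sdb)


# Hypothetical SDB-DL white noise gains (dB) at four frequency bins.
sdb_dl_wng_db = np.array([-25.0, -14.0, -4.0, 1.5])
floor_db = effective_wng_floor_db(-6.0, sdb_dl_wng_db)
```

With a requested lower limit of -6 dB, the two low-frequency bins keep -6 dB, while the two bins where SDB-DL already exceeds -6 dB adopt the SDB-DL gain.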

In step S502, the first weight of the diagonally-loaded super-directional beamformer in the target beamformer and the second weight of the delay-sum beamformer in the target beamformer are updated according to the lower limit value.

The first weight is the weight mentioned above

The second weight is the weight mentioned above

Specifically, the regularization parameter $\alpha(k)$ may be calculated according to equation (19), and the intermediate variable $\alpha_\varepsilon(k)$ may then be calculated, so that the updated first weight and second weight can be obtained.

In step S503, an updated target beamformer is determined based on the updated first weight and the updated second weight.

Specifically, the updated target beamformer can be obtained in combination with equation (16) according to the updated first weight and the updated second weight.

In one example, when the white noise gain lower limit of the target beamformer is set to 0 dB, -6 dB, and -12 dB, three target beamformers with different white noise gain lower limits can be obtained. Fig. 6 shows the beam patterns of the beamformers for the three different lower limits. As shown in Fig. 6, the lower the white noise gain lower limit is set, the narrower the main lobe at low frequencies, and the better the suppression of noise from other directions.

The above method for determining a beamformer can mitigate the problem of white noise amplification by controlling the white noise gain lower limit of the target beamformer.
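The idea of enforcing a white noise gain lower limit by combining a diagonally loaded super-directional beamformer with a delay-sum beamformer can be illustrated with the following sketch. It is not equations (16) and (19) of this application, which are not reproduced here; instead it uses a simple closed-form interpolation between the two weight vectors as a stand-in. The diffuse-field sinc coherence model, the loading value eps, the 2 cm spacing, the 500 Hz test frequency, and all names are assumptions.

```python
import numpy as np

C = 343.0  # assumed speed of sound, m/s


def dsb_weights(d):
    """Delay-sum weights, distortionless toward steering vector d."""
    return d / np.vdot(d, d).real


def sdb_dl_weights(d, gamma, eps):
    """Super-directional weights with diagonal loading eps."""
    a = np.linalg.solve(gamma + eps * np.eye(len(d)), d)
    return a / np.vdot(d, a)


def wng_db(w, d):
    """White noise gain |w^H d|^2 / (w^H w) in dB."""
    return 10.0 * np.log10(abs(np.vdot(w, d)) ** 2 / np.vdot(w, w).real)


def blend_to_floor(w_sdb, w_dsb, floor_db):
    """Return (1-b)*w_sdb + b*w_dsb with the smallest b in [0, 1] that meets
    the white noise gain lower limit (assumed rule, not equation (19))."""
    g = 10.0 ** (floor_db / 10.0)
    u = w_sdb - w_dsb  # orthogonal to w_dsb, since both satisfy w^H d = 1
    u2 = np.vdot(u, u).real
    excess = 1.0 / g - np.vdot(w_dsb, w_dsb).real
    if excess <= 0.0:
        b = 1.0  # limit at or above the DSB gain: fall back to DSB
    elif u2 == 0.0:
        b = 0.0  # both beamformers coincide
    else:
        b = 1.0 - min(1.0, np.sqrt(excess / u2))
    return (1.0 - b) * w_sdb + b * w_dsb, b


# Hypothetical two-microphone endfire array, 2 cm spacing, at 500 Hz.
f, spacing = 500.0, 0.02
d = np.exp(-2j * np.pi * f * np.array([0.0, spacing]) / C)
coh = np.sinc(2.0 * f * spacing / C)  # diffuse-field coherence (assumed)
gamma = np.array([[1.0, coh], [coh, 1.0]])
w_dsb = dsb_weights(d)
w_sdb = sdb_dl_weights(d, gamma, eps=1e-3)
beams = {fl: blend_to_floor(w_sdb, w_dsb, fl)[0] for fl in (0.0, -6.0, -12.0)}
```

Because both weight vectors are distortionless toward d, any such blend stays distortionless, and the white noise gain rises monotonically as the blend moves toward the delay-sum weights. This is why a single scalar per frequency is sufficient in this simplified setting.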

Figs. 7 and 8 show spectrograms, over 0-4 kHz and 0-1 kHz respectively, of the output signals obtained after microphone signals recorded in an actual noisy indoor environment are processed by different beamformers. Diagram a of Fig. 7 and diagram a of Fig. 8 show the spectrogram of the speech frequency-domain signal of the corresponding frequency band received by the microphone array; diagram b of Fig. 7 and diagram b of Fig. 8 show the spectrogram after the DSB processes that signal; diagram c of Fig. 7 and diagram c of Fig. 8 show the spectrogram after the SDB processes that signal; and diagram d of Fig. 7 and diagram d of Fig. 8 show the spectrogram after that signal is processed by the directional beamformer with a white noise gain lower limit of -6 dB determined according to the method shown in Fig. 5.

Comparing diagram b with diagram d in Fig. 7, it can be seen that the DSB leaves more residual noise, whereas the interference suppression capability of the directional beamformer obtained by the method of the present application is significantly improved; the calculated average noise reduction over 0-4 kHz is improved by about 6 dB.

Comparing diagram c with diagram d in Fig. 8, it can be seen that the directional beamformer obtained by the method of the present application shows no obvious white noise amplification below 800 Hz, whereas the SDB produces severe white noise amplification, which seriously degrades voice quality. Compared with the DSB and the SDB, the directional beamformer obtained by the method can effectively suppress noise from other directions without noticeably amplifying low-frequency white noise.

In one example, when different pointing directions are set for the microphone array, two beamformers with different pointing directions can be obtained. Fig. 9 shows the beam patterns of the beamformers when the microphone array includes two microphones. Diagram a in Fig. 9 corresponds to the beam pattern of the beamformer with a pointing direction of 135 degrees, and diagram b corresponds to the beam pattern of the beamformer with a pointing direction of 100 degrees. As can be seen from Fig. 9, the two beamformers $w_{n1}(k)$ and $w_{n2}(k)$ both form null regions at 0 degrees and its vicinity, and the null depths in these regions are complementary to a certain extent. In particular, $w_{n1}(k)$ has a shallow null at high frequencies in the 0 degree direction, whereas $w_{n2}(k)$ has a deeper null there.

Thus, based on these characteristics of $w_{n1}(k)$ and $w_{n2}(k)$, the two beamformers with different pointing directions can be weighted and summed to design a new beamformer $w_n(k) = \alpha_1 w_{n1}(k) + \alpha_2 w_{n2}(k)$, i.e., the final zero-notch beamformer $w_n(k)$. Optionally, the weights $\alpha_1$ and $\alpha_2$ may both be set to 0.5.

Fig. 10 shows the zero-notch beamformer $w_n(k)$ and the corresponding white noise gain plot. $w_n(k)$ can form deeper nulls in the 0 degree direction and in the spatial region of plus or minus 45 degrees, while still maintaining higher gain in other directions. Meanwhile, the white noise gain over the full band is kept within the range of 0 dB to 2 dB, so there is no white noise amplification problem.

Fig. 11 shows spectrograms of a microphone signal recorded in an actual noisy indoor environment before and after processing by the null beamformer. It can be seen that the zero-notch beamformer obtained by the method of the present application can effectively suppress the voice component of the wearer without producing obvious white noise amplification. It should be noted that although this zero-notch beamformer does not cause low-frequency white noise amplification, its ability to suppress the target speech in the 0 degree direction below 1 kHz is also poor. Therefore, during post-filtering, a harmonic detection technique may be used to remove the target speech from the zero-notch beam output below 1 kHz, so that the low-frequency target speech is retained as far as possible in the post-filtered result.

It should be noted that when the number of microphones or the array type changes, a group of beamformers $\{w_{n,i}(k)\}_{i=1,2,\ldots,M}$ with different pointing directions can be designed, and a new zero-notch beamformer can be constructed as a linear combination of them, i.e., $w_n(k) = \sum_{i=1}^{M} \alpha_i w_{n,i}(k)$.

The specific pointing direction and weighting coefficients can be determined in advance according to the number of microphones and the array type.
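The linear combination of beamformers with different pointing directions can be sketched as follows. The array layout (beamformers, frequency bins, microphones), the random placeholder weights, and the function name are assumptions; only the weighted sum itself and the example coefficient 0.5 come from the description above.

```python
import numpy as np


def combine_beamformers(weight_sets, alphas):
    """Weighted sum over beamformers: sum_i alpha_i * w_{n,i}(k).

    weight_sets: shape (M, K, N) = (beamformers, bins, microphones).
    alphas: length-M combination coefficients.
    """
    w = np.asarray(weight_sets)
    a = np.asarray(alphas, dtype=float)
    return np.tensordot(a, w, axes=(0, 0))


# Two placeholder beamformers for 3 frequency bins and 2 microphones.
rng = np.random.default_rng(0)
w_n1 = rng.standard_normal((3, 2)) + 1j * rng.standard_normal((3, 2))
w_n2 = rng.standard_normal((3, 2)) + 1j * rng.standard_normal((3, 2))
w_n = combine_beamformers([w_n1, w_n2], [0.5, 0.5])
```

The same function covers larger arrays by passing more weight sets and coefficients, which would be fixed in advance for a given microphone count and array type.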

Based on the above method for determining a beamformer, the embodiments of the present application provide a multi-beam joint noise reduction method based on white noise gain control. As shown in Fig. 12, the scheme first uses the short-time Fourier transform (STFT) to convert the acquired speech signals $x_1(t)$ and $x_2(t)$ into the frequency domain to obtain $X_1(k, l)$ and $X_2(k, l)$. Next, a plurality of directional beamformers pointing toward the human mouth, designed based on the white noise gain control method, perform preliminary filtering to obtain directional beam output signals corresponding to different white noise gains ($Y_{0\mathrm{dB}}(k, l)$, $Y_{-6\mathrm{dB}}(k, l)$, $Y_{-12\mathrm{dB}}(k, l)$), and a zero-notch beamformer, also designed based on the white noise gain control method and with its null directed toward the human mouth, is used to obtain the zero-notch beam output signal $Y_{\mathrm{null}}(k, l)$ corresponding to the acquired speech signals. Finally, a machine-learning-based speech enhancement model performs post-filtering on the previously obtained directional beam output signals according to the zero-notch beam output signal to obtain the noise-reduced frequency-domain signal $Y_{\mathrm{enh}}(k, l)$, and the noise-reduced speech signal $y(t)$ is obtained using the inverse short-time Fourier transform (ISTFT). The scheme can solve the problem of low-frequency white noise amplification existing in the third scheme.

Fig. 13 shows a flowchart of a noise reduction method according to an embodiment of the present application.

As shown in fig. 13, the method includes: step S1301 to step S1303 as follows.

In step S1301, a speech signal to be processed is acquired.

The speech signal is collected by a microphone array external to the headset.

In step S1302, based on the voice signal, directional beam output signals corresponding to a plurality of white noise gains are obtained by a plurality of directional beam formers, and a null notch beam output signal corresponding to the voice signal is obtained by a null notch beam former.

The plurality of directional beamformers correspond to the directional beam output signals corresponding to the plurality of white noise gains one-to-one, the plurality of directional beamformers and the null beamformer are obtained by the method shown in fig. 5, and the white noise gains corresponding to the plurality of directional beamformers have different lower limit values.
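Step S1302 can be sketched as applying each beamformer to the multichannel frequency-domain signal, bin by bin. The convention that the beam output at bin k and frame l is w(k)^H X(k, l), the array shapes, and the random placeholder weights are assumptions for illustration; in practice the weights would come from the method shown in Fig. 5.

```python
import numpy as np


def apply_beamformer(w, X):
    """Beam output w(k)^H X(k, l) for every bin k and frame l.

    w: (K, N) complex weights per frequency bin and microphone.
    X: (N, K, L) complex STFT of the N microphone signals.
    """
    return np.einsum("kn,nkl->kl", np.conj(w), X)


def beam_bank(directional_ws, null_w, X):
    """Return the directional outputs (one per lower limit) and the null output."""
    directional = [apply_beamformer(w, X) for w in directional_ws]
    return directional, apply_beamformer(null_w, X)


# Hypothetical sizes: 2 microphones, 257 bins, 10 frames.
rng = np.random.default_rng(1)
N, K, L = 2, 257, 10
X = rng.standard_normal((N, K, L)) + 1j * rng.standard_normal((N, K, L))
ws = [rng.standard_normal((K, N)) + 0j for _ in range(3)]  # 0, -6, -12 dB beams
w_null = rng.standard_normal((K, N)) + 0j
Y_dir, Y_null = beam_bank(ws, w_null, X)
```

Each directional beamformer yields exactly one output signal, which matches the one-to-one correspondence stated above.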

In step S1303, post-filtering is performed by using a speech enhancement model according to the null notch beam output signal and the plurality of directional beam output signals, so as to obtain a noise-reduced speech signal.

Specifically, the real part signal and the imaginary part signal of the plurality of directional beam output signals and the real part signal and the imaginary part signal of the zero-notch beam output signal may be input into the speech enhancement model, and the noise-reduced frequency domain signal output by the speech enhancement model is obtained.

The plurality of directional beam output signals may be the signals corresponding to white noise gain lower limits of 0 dB, -6 dB and -12 dB, respectively. In this case, the input of the speech enhancement model is 8-channel data, namely the real and imaginary parts of the three directional beam output signals and of the zero-notch beam output signal. The specific training process of the speech enhancement model is described later and is not repeated here.
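The 8-channel input can be sketched by stacking the real and imaginary parts of the four beam outputs. The channel order (real part then imaginary part for each beam, with the null beam last) and the function name are assumptions; the application does not specify the ordering.

```python
import numpy as np


def stack_model_input(directional_outputs, null_output):
    """Stack real and imaginary parts of every beam output along axis 0."""
    beams = list(directional_outputs) + [null_output]
    planes = []
    for y in beams:
        planes.append(np.real(y))
        planes.append(np.imag(y))
    return np.stack(planes, axis=0)


# Placeholder beam outputs: 257 bins, 10 frames.
K, L = 257, 10
rng = np.random.default_rng(2)
outs = [rng.standard_normal((K, L)) + 1j * rng.standard_normal((K, L))
        for _ in range(4)]
features = stack_model_input(outs[:3], outs[3])
```

Three directional beams plus one null beam give four complex signals, hence eight real-valued channels.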

The noise-reduced speech signal is then obtained from the noise-reduced frequency-domain signal. Specifically, an inverse fast Fourier transform may be performed on the noise-reduced frequency-domain signal to obtain the noise-reduced speech signal.

The noise reduction method exploits the fact that target beamformers with different white noise gain lower limits differ in their noise suppression effect: beamformers with different white noise gain lower limits are used for multi-beam filtering, and the speech enhancement model is then used for post-filtering, which can improve the suppression of noise from other directions.

Fig. 14 shows a spectrogram (a diagram in fig. 14) of a signal received by a microphone array in an indoor music noise environment, a spectrogram (b diagram in fig. 14) after being processed by a directional beamformer (WNG lower limit-6 dB) based on white noise gain control, a spectrogram (c diagram in fig. 14) after being processed by a zero notch beamformer based on white noise gain control, and a spectrogram (d diagram in fig. 14) of a signal obtained by using the noise reduction method shown in fig. 13, respectively.

Fig. 15 shows a spectrogram (a diagram in fig. 15) of a signal received by a microphone in an outdoor noise environment, a spectrogram (b diagram in fig. 15) after being processed by a directional beamformer (WNG lower limit-6 dB) based on white noise gain control, a spectrogram (c diagram in fig. 15) after being processed by a zero notch beamformer based on white noise gain control, and a spectrogram (d diagram in fig. 15) of a signal obtained by using the noise reduction method shown in fig. 13, respectively.

As can be seen from fig. 14 and 15, the directional beamformer based on white noise gain control can effectively reduce noise, while the null beamformer based on white noise gain control can effectively eliminate target voice, and the null beamformer can fully retain information of music interference signals, thereby finally achieving a better voice enhancement effect.

Fig. 16 is a flowchart of a training method of a speech enhancement model according to an embodiment of the present application.

As shown in fig. 16, the method includes steps S1601 to S1603 as follows.

In step S1601, a noisy speech signal and a clean speech signal are acquired.

Noise signals are collected in different environments, such as roads, subways and meeting rooms. The noise signals are required to be diverse, for example in noise type and in direction of arrival.

A clean speech signal of a user is collected in a quiet environment. When the pure voice signals are collected, the wearing position of the earphone can be properly adjusted, so that the robustness of the model is enhanced.

Then, the noise signal and the pure voice signal are mixed according to the signal-to-noise ratio to obtain a voice signal with noise.

In one example, the noise signal and the clean speech signal may be mixed according to different signal-to-noise ratios to obtain different noisy speech signals.
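Mixing at a prescribed signal-to-noise ratio can be sketched as scaling the noise before adding it to the clean speech. The power-based SNR definition, the single-channel signals, the listed SNR values, and the function name are assumptions; the application does not state how the mixing gain is computed.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Scale noise so that 10*log10(P_clean / P_noise) equals snr_db, then add.

    The noise is assumed to be at least as long as the clean signal.
    """
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise, gain


# Placeholder signals standing in for recorded speech and noise.
rng = np.random.default_rng(3)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
noisy_by_snr = {snr: mix_at_snr(clean, noise, snr)[0] for snr in (-5, 0, 5, 10)}
```

For a microphone array recording, the same gain would presumably be applied to every channel of the noise so that the spatial cues of the noise are preserved.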

In step S1602, a plurality of directional beam output signals corresponding to the noisy speech signal are obtained by a plurality of directional beam formers corresponding to headphones, and a null-notch beam output signal corresponding to the noisy speech signal is obtained by a null-notch beam former corresponding to headphones.

The plurality of directional beam formers and the plurality of directional beam output signals are in one-to-one correspondence, the plurality of directional beam formers and the null notch beam former are obtained by the method shown in fig. 5, and the lower limit values of white noise gains corresponding to the plurality of directional beam formers are different.

In step S1603, parameters of the speech enhancement model are updated based on the plurality of directional beam output signals, the null notch beam output signal, and the clean speech signal.

Specifically, a plurality of directional beam output signals based on white noise gain lower limit control and zero-notch beam output signals based on white noise gain control may be input to the speech enhancement model, and the output of the speech enhancement model may be obtained.

Then, an error value is calculated using a loss function based on the output of the speech enhancement model and the clean speech signal, and parameters of the speech enhancement model are updated using a gradient descent method based on the error value.
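The update rule described above (compute an error with a loss function, then update the parameters by gradient descent) can be illustrated with a deliberately simplified stand-in. The sketch below replaces the speech enhancement model with a linear map from 8 input channels to the real and imaginary parts of the target, uses a mean squared error loss, and runs on synthetic data; the model form, loss, learning rate, and data are all assumptions and do not represent the network of this application.

```python
import numpy as np


def mse_loss(W, X, T):
    """Mean squared error between the linear model output W @ X and target T."""
    return np.mean((W @ X - T) ** 2)


def gradient_step(W, X, T, lr):
    """One gradient-descent update of W for the MSE loss."""
    grad = 2.0 * (W @ X - T) @ X.T / T.size
    return W - lr * grad


rng = np.random.default_rng(4)
X = rng.standard_normal((8, 2000))   # 8 input channels, flattened bins/frames
W_true = rng.standard_normal((2, 8))
T = W_true @ X                       # stand-in for clean real/imag targets
W = np.zeros((2, 8))
losses = [mse_loss(W, X, T)]
for _ in range(20):
    W = gradient_step(W, X, T, lr=0.5)
    losses.append(mse_loss(W, X, T))
```

In an actual implementation the gradient would be obtained by automatic differentiation through the network, but the structure of the loop (forward pass, loss, parameter update) is the same.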

The speech enhancement model may adopt a GCCRN (gated complex convolutional recurrent network) structure, or a gated recurrent unit (GRU) structure and a U-Net structure. Fig. 17 is a schematic diagram of a GCCRN network structure according to an embodiment of the present application. As shown in Fig. 17, the GCCRN network includes an encoder network, a recurrent network, a decoder network, and an output layer.

The encoder has 5 layers, each constructed using a conv2d convolution function. The decoder has 5 layers, each of which includes a real-part output and an imaginary-part output and uses a deconv2d deconvolution function. The recurrent network has 2 LSTM layers, and each LSTM layer uses two groups with the same network structure, which reduces the number of model parameters. The output layer is a fully connected (FC) layer.

Based on the method for determining the beamformer shown in fig. 5, an embodiment of the present application provides a device for determining a beamformer.

Fig. 18 is a schematic structural diagram of a device 1800 for determining a beamformer according to an embodiment of the present application. As shown in fig. 18, the determining device 1800 includes: an obtaining module 1801, an updating module 1802, and a determining module 1803.

The obtaining module 1801 is configured to obtain a lower limit value of a white noise gain corresponding to a target beamformer; the target beam former is obtained by weighting and summing a diagonal loading super-directivity beam former and a delay summation beam former corresponding to the earphone, the diagonal loading super-directivity beam former and the delay summation beam former are determined according to a steering vector corresponding to the earphone, and the steering vector is determined according to the distance between array elements in a microphone array outside the earphone and a pointing angle corresponding to the beam former.

The updating module 1802 is configured to update a first weight of the diagonally-loaded super-directional beamformer in the target beamformer and a second weight of the delay-sum beamformer in the target beamformer according to the lower limit value.

The determining module 1803 is configured to determine an updated target beamformer according to the updated first weight and the updated second weight.

The specific implementation process of each module may refer to the summary of the invention or the description of the determination method shown in fig. 5, and is not described herein again.

Based on the denoising method shown in fig. 13, an embodiment of the present application provides a denoising device.

Fig. 19 is a schematic structural diagram of a noise reduction device 1900 according to an embodiment of the present application. As shown in Fig. 19, the noise reduction device 1900 includes: an obtaining module 1901, a beam forming module 1902, and a post-filtering module 1903.

The obtaining module 1901 is configured to obtain a voice signal to be processed; the speech signal is collected by a microphone array external to the headset.

The beam forming module 1902 is configured to, according to the voice signal, obtain a plurality of directional beam output signals by using a plurality of directional beam formers, and obtain a null-notch beam output signal corresponding to the voice signal by using a null-notch beam former; the plurality of directional beam output signals correspond to the plurality of directional beam formers one to one, and white noise gain lower limit values corresponding to the plurality of directional beam formers are different.

The post-filtering module 1903 is configured to perform post-filtering by using a speech enhancement model according to the zero-notch beam output signal and the multiple directional beam output signals, so as to obtain a noise-reduced speech signal.

The specific implementation process of each module may refer to the summary of the invention or the introduction of the noise reduction method shown in fig. 13, and is not described herein again.

Based on the above training method of the speech enhancement model shown in fig. 16, an embodiment of the present application further provides a training apparatus.

Fig. 20 is a schematic structural diagram of a training apparatus 2000 according to an embodiment of the present application. As shown in Fig. 20, the training apparatus 2000 includes: an obtaining module 2001, a beam forming module 2002 and an updating module 2003.

The obtaining module 2001 is used for obtaining a noisy speech signal and a clean speech signal.

The beam forming module 2002 is configured to obtain a plurality of directional beam output signals corresponding to the noisy speech signal by using a plurality of directional beam formers corresponding to the earphones, and obtain a zero-notched beam output signal corresponding to the noisy speech signal by using a zero-notched beam former corresponding to the earphones; wherein the plurality of directional beam formers and the plurality of directional beam output signals are in one-to-one correspondence, and lower limit values of white noise gains corresponding to the plurality of directional beam formers are different.

The updating module 2003 is configured to update parameters of a speech enhancement model according to the plurality of directional beam output signals, the zero-notch beam output signal, and the clean speech signal.

The specific implementation process of each module may refer to the summary of the invention or the description in the training method shown in fig. 16, and is not described herein again.

In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced wholly or partially. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), among others.

It is to be understood that the various numerical references referred to in the embodiments of the present application are merely for descriptive convenience and are not intended to limit the scope of the embodiments of the present application. It should be understood that, in the embodiment of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiment of the present application.

The above specific embodiments further describe in detail the objects, technical solutions and advantages of the present application. It should be understood that the above are only specific embodiments of the present application and are not intended to limit the scope of protection of the present application; any modification, equivalent substitution, improvement and the like made on the basis of the technical solutions of the present application shall be included in the scope of protection of the present application.

Claims

1. A method for determining a beamformer to be applied to headphones, the method comprising:

2. The method of claim 1, wherein the target beamformer comprises a directional beamformer or a null beamformer; wherein, the pointing angle corresponding to the directional beam former is different from the pointing angle corresponding to the null notch beam former.

3. The method of claim 1, wherein the updating a first weight of the diagonally-loaded super-directional beamformer in the target beamformer and a second weight of the delay-sum beamformer in the target beamformer according to the lower limit value comprises:

determining regularization parameters of the target beam former in each frequency band according to the target lower limit value, the white noise gain value of the delay summation beam former in each frequency band, the white noise gain value of the diagonal loading super-directional beam former in each frequency band and the directivity index value of the diagonal loading super-directional beam former in each frequency band;

4. A method for reducing noise in speech, the method being applied to a headset, the method comprising:

obtaining a plurality of directional beam output signals by a plurality of directional beam formers according to the voice signals, and obtaining zero-notch beam output signals corresponding to the voice signals by a zero-notch beam former; wherein the plurality of directional beam output signals correspond to the plurality of directional beam formers in a one-to-one correspondence, the plurality of directional beam formers and the null notch beam former are both obtained according to the method of any one of claims 1-3, and white noise gain lower limit values corresponding to the plurality of directional beam formers are different;

and performing post-filtering by using a voice enhancement model according to the zero-notch beam output signal and the directional beam output signals to obtain a voice signal subjected to noise reduction.

5. A method for training a speech enhancement model, wherein the speech enhancement model is applied to a headset, the method comprising:

acquiring a voice signal with noise and a pure voice signal;

obtaining a plurality of directional beam output signals corresponding to the voice signal with noise by using a plurality of directional beam formers corresponding to the earphone, and obtaining a zero-notch beam output signal corresponding to the voice signal with noise by using a zero-notch beam former corresponding to the earphone; wherein the plurality of directional beamformers and the plurality of directional beam output signals are in one-to-one correspondence, the plurality of directional beamformers and the null beamformer are obtained according to the method of any one of claims 1-3, and lower limit values of white noise gains corresponding to the plurality of directional beamformers are different;

6. The method of claim 5, wherein the speech enhancement model is determined according to a GCCRN network.

7. A device for determining a beamformer to be applied to headphones, the device comprising:

8. A voice noise reduction apparatus applied to a headphone, the voice noise reduction apparatus comprising:

a beam forming module, configured to obtain, according to the voice signal, a plurality of directional beam output signals by using a plurality of directional beam formers, and obtain a null notch beam output signal corresponding to the voice signal by using a null notch beam former; wherein the plurality of directional beam output signals correspond to the plurality of directional beam formers in a one-to-one correspondence, the plurality of directional beam formers and the null notch beam former are both obtained according to the method of any one of claims 1-3, and white noise gain lower limit values corresponding to the plurality of directional beam formers are different;

and the post-filtering module is used for performing post-filtering by using a voice enhancement model according to the zero-notch beam output signal and the directional beam output signals to obtain a voice signal subjected to noise reduction.

9. An apparatus for training a speech enhancement model, wherein the speech enhancement model is applied to a headset, the apparatus comprising:

a beam forming module, configured to obtain a plurality of directional beam output signals corresponding to the noisy speech signal by using a plurality of directional beam formers corresponding to the earphones, and obtain a zero-notch beam output signal corresponding to the noisy speech signal by using a zero-notch beam former corresponding to the earphones; wherein the plurality of directional beamformers and the plurality of directional beam output signals are in one-to-one correspondence, the plurality of directional beamformers and the null beamformer are obtained according to the method of any one of claims 1-3, and lower limit values of white noise gains corresponding to the plurality of directional beamformers are different;

10. An earphone, characterized in that the earphone comprises a processor and a memory, the processor being configured to execute a computer program stored in the memory to perform the method of claim 4.