CN110858485B

CN110858485B - Voice enhancement method, device, equipment and storage medium

Info

Publication number: CN110858485B
Application number: CN201810967670.9A
Authority: CN
Inventors: 刘章; 余涛
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2018-08-23
Filing date: 2018-08-23
Publication date: 2023-06-30
Anticipated expiration: 2038-08-23
Also published as: CN110858485A

Abstract

The disclosure provides a speech enhancement method, apparatus, device, and storage medium. The outputs of two microphones in a microphone array are subtracted to obtain a first-order differential output; the first-order differential output is compared with a predetermined threshold; a masking value for each frequency point is determined based on the comparison result, where the masking value characterizes how noise masks speech in noisy speech; and speech enhancement is performed based on the masking value. The speech enhancement scheme based on this differential mask has almost no delay, is not affected by directional human-voice interference, and can effectively improve the speech recognition success rate in noisy scenarios such as subway ticket machines.

Description

Voice enhancement method, device, equipment and storage medium

Technical Field

The present disclosure relates to the field of speech enhancement, and in particular, to a method, apparatus, device, and storage medium for speech enhancement.

Background

With the development of artificial intelligence voice technology, many traditional devices, such as subway ticket machines, have a growing need for human-machine voice interaction. However, successful deployment on a subway ticket machine means coping with a highly noisy environment. The noise includes: babble noise from crowds talking, interfering speech from people standing near the ticket buyer, noise from crowd movement, mechanical noise from moving trains, interference from loudspeakers, and so on. Such high noise poses a great challenge to speech recognition: recognition performance drops sharply in this environment because conventional acoustic model techniques cannot effectively overcome babble noise and human-voice interference.

Thus, a speech enhancement scheme for noisy scenes is needed.

Disclosure of Invention

It is an object of the present disclosure to provide a speech enhancement scheme capable of improving the speech enhancement effect.

According to a first aspect of the present disclosure, a speech enhancement method is provided, comprising: subtracting the outputs of two microphones in a microphone array to obtain a first-order differential output; comparing the first-order differential output with a predetermined threshold; determining a masking value for each frequency point based on the comparison result, wherein the masking value characterizes how noise masks speech in noisy speech; and performing speech enhancement based on the masking value.

Optionally, the step of determining the masking value of each frequency point includes: setting the masking value of a frequency point to 1 when the first-order differential output is less than the predetermined threshold, and to 0 when the first-order differential output is greater than or equal to the predetermined threshold.

Optionally, the step of determining the masking value of each frequency point includes: determining a masking value estimation result for each first-order differential output based on the result of comparing each of a plurality of first-order differential outputs with the predetermined threshold; and determining a final masking value of a frequency point based on the masking values corresponding to that frequency point in the plurality of masking value estimation results.

Optionally, the step of determining the final masking value of the frequency point includes: taking the product of the masking values corresponding to the same frequency point in the plurality of masking value estimation results as the final masking value of the frequency point.

Optionally, the first order differential output is equal to the product of the filter coefficients and a matrix of time-frequency domain data of the two microphones.

Optionally, the filter coefficients are

where h(ω) is the filter coefficient, τ₀ is the distance between the two microphones divided by the speed of sound, ω is the angular frequency, and α is a parameter used to adjust the direction of the differential null.

Optionally, the speech enhancement method further comprises: calculating the relative angle between the two microphones and the speaker based on sound source position information of the speaker; and determining α in the filter coefficients based on the relative angle.

Optionally, the step of calculating the relative angles of the two microphones and the speaker comprises: determining a first direction vector from the center of the two microphones to the speaker; determining a second direction vector from one of the two microphones to the other microphone; based on the first direction vector and the second direction vector, a relative angle is calculated.

Optionally, the step of performing speech enhancement based on the masking value comprises: calculating a first correlation matrix corresponding to the speech and a second correlation matrix corresponding to the noise based on the masking value; and performing speech enhancement using a beamforming algorithm based on the first correlation matrix and the second correlation matrix.

Optionally, the first correlation matrix is the covariance matrix of the speech portion extracted, based on the masking value, from the time-frequency domain data output by the microphone array, and the second correlation matrix is the covariance matrix of the noise portion extracted, based on the masking value, from the time-frequency domain data output by the microphone array.

According to a second aspect of the present disclosure, there is also provided a speech enhancement apparatus comprising: a difference module for subtracting the outputs of two microphones in a microphone array to obtain a first-order differential output; a comparison module for comparing the first-order differential output with a predetermined threshold; a determining module for determining a masking value for each frequency point based on the comparison result, wherein the masking value characterizes how noise masks speech in noisy speech; and a speech enhancement module for performing speech enhancement based on the masking value.

Optionally, the determining module sets the masking value of a frequency point to 1 when the first-order differential output is less than the predetermined threshold, and to 0 when the first-order differential output is greater than or equal to the predetermined threshold.

Optionally, the determining module determines a masking value estimation result for each first-order differential output based on the result of comparing each of a plurality of first-order differential outputs with the predetermined threshold, and determines a final masking value of a frequency point based on the masking values of that frequency point in the plurality of masking value estimation results.

Optionally, the determining module takes the product of the masking values corresponding to the same frequency point in the plurality of masking value estimation results as the final masking value of the frequency point.

Optionally, the filter coefficients are the same as those given for the method of the first aspect.

Optionally, the apparatus further comprises: an angle calculation module for calculating the relative angle between the two microphones and the speaker based on sound source position information of the speaker; and a coefficient determination module for determining α in the filter coefficients based on the relative angle.

Optionally, the angle calculation module includes: a first direction vector determination module for determining a first direction vector from the center of the two microphones to the speaker; a second direction vector determining module for determining a second direction vector from one of the two microphones to the other microphone; and a calculation sub-module for calculating the relative angle based on the first direction vector and the second direction vector.

Optionally, the speech enhancement module includes: a matrix calculation module for calculating a first correlation matrix corresponding to the speech and a second correlation matrix corresponding to the noise based on the masking value; and a beamforming module for performing speech enhancement using a beamforming algorithm based on the first correlation matrix and the second correlation matrix.

According to a third aspect of the present disclosure, there is also provided a device supporting a voice interaction function, including: a microphone array for receiving sound input; and a terminal processor for subtracting the outputs of two microphones in the microphone array to obtain a first-order differential output, comparing the first-order differential output with a predetermined threshold, determining a masking value for each frequency point based on the comparison result, and performing speech enhancement based on the masking value, wherein the masking value characterizes how noise masks speech in noisy speech.

Optionally, the device further comprises a communication module for sending the voice-enhanced voice data to the server.

Optionally, the device is any one of the following: a ticket machine; a smart speaker; a robot; an automobile.

According to a fourth aspect of the present disclosure, there is also provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor causes the processor to perform the method as set forth in the first aspect of the present disclosure.

According to a fifth aspect of the present disclosure there is also provided a non-transitory machine-readable storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform a method as set out in the first aspect of the present disclosure.

Mask estimation is performed on the first-order differential output of the microphone array, and its timing depends only on the sound source localization information, so it introduces little or no delay. It is also not affected by directional human-voice interference. The differential-mask-based speech enhancement scheme can therefore enhance speech in real time and effectively improve the speech recognition success rate in noisy scenarios such as subway ticket machines.

Drawings

The foregoing and other objects, features and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary embodiments of the disclosure as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout exemplary embodiments of the disclosure.

Fig. 1 is a schematic diagram showing a microphone array composed of 4 microphones.

Fig. 2 is a schematic flow chart illustrating a speech enhancement method according to an embodiment of the present disclosure.

Fig. 3 is a block diagram illustrating a structure of a device supporting a voice interaction function according to an embodiment of the present disclosure.

Fig. 4 is an overall flowchart illustrating a speech enhancement scheme according to an embodiment of the present disclosure.

Fig. 5 is a schematic block diagram showing the structure of a voice enhancement apparatus according to an embodiment of the present disclosure.

Fig. 6 is a schematic block diagram showing the structure of functional modules that the speech enhancement module in fig. 5 may have.

Fig. 7 shows a schematic structural diagram of a computing device according to an embodiment of the present disclosure.

Detailed Description

Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

[ Term explanation ]

(1) Speech enhancement

Speech enhancement refers to a technique for extracting useful speech signals from noise background, suppressing and reducing noise interference when speech signals are disturbed or even submerged by various kinds of noise.

(2) Microphone array

The microphone array is an array formed by arranging a group of omnidirectional microphones positioned at different positions in space according to a certain shape rule, is a device for spatially sampling spatially-spread sound signals, and the acquired signals contain spatial position information. The array can be divided into a near field model and a far field model according to the distance between the sound source and the microphone array. Depending on the topology of the microphone array, it can be classified into a linear array, a planar array, a volumetric array, etc.

(3) Null

The lowest-gain point in the beam pattern.

(4) MVDR

MVDR (Minimum Variance Distortionless Response) is an adaptive beamforming algorithm based on the maximum signal-to-noise ratio (SNR) criterion. The MVDR algorithm adaptively minimizes the power of the array output while keeping the response in the desired direction undistorted, which maximizes the output signal-to-noise ratio.
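For reference, the MVDR criterion is commonly written as the constrained minimization below, together with its well-known closed-form solution. The symbols are introduced here for illustration only and are not defined at this point in the source: $w(\omega)$ is the beamformer weight vector, $R_{NN}(\omega)$ is the noise correlation matrix, and $d(\omega)$ is the steering vector toward the desired source.

```latex
\min_{w(\omega)} \; w(\omega)^{H} R_{NN}(\omega)\, w(\omega)
\quad \text{subject to} \quad w(\omega)^{H} d(\omega) = 1,
\qquad
w_{\mathrm{MVDR}}(\omega) =
\frac{R_{NN}^{-1}(\omega)\, d(\omega)}
     {d(\omega)^{H} R_{NN}^{-1}(\omega)\, d(\omega)}
```

The constraint expresses the "distortionless" part of the name: the desired direction passes with unit gain while everything else is minimized.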

(5) Mask

A mask, which can be translated as a masking value (also rendered in this text as concealment value or hidden value), characterizes how noise masks speech in noisy speech. Masks are mainly divided into the ideal binary mask (IBM) and the ideal ratio mask (IRM). The mask referred to in this disclosure may be regarded as an IBM.

An IBM divides the audio signal into sub-bands according to auditory perception characteristics and, according to the signal-to-noise ratio in each time-frequency unit, either sets the energy of the unit to 0 (mask value 0, when noise dominates) or leaves it unchanged (mask value 1, when the target speech dominates).

The IRM is also computed for each time-frequency unit, but unlike the IBM's "either 0 or 1", it computes the energy ratio between the speech signal and the noise, yielding a number between 0 and 1, and scales the energy of the time-frequency unit accordingly. The IRM is an evolution of the IBM: it reflects the degree of noise suppression in each time-frequency unit and can further improve the quality and intelligibility of the separated speech.
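The two mask types can be illustrated with a minimal sketch. It assumes that the clean speech energy and the noise energy of every time-frequency unit are known (which is what makes the masks "ideal"), that the IBM uses a local SNR criterion of 0 dB, and that the IRM is the plain energy ratio speech / (speech + noise). The function names and the 0 dB default are illustrative assumptions, not taken from the source.

```python
import math


def ideal_binary_mask(speech_energy, noise_energy, criterion_db=0.0):
    """Return 1 where the local SNR in dB exceeds criterion_db, else 0.

    Both inputs are lists of rows indexed as [frame][frequency_bin].
    The 0 dB default criterion is an assumption made for this sketch.
    """
    mask = []
    for s_row, n_row in zip(speech_energy, noise_energy):
        row = []
        for s, n in zip(s_row, n_row):
            if s <= 0.0:
                row.append(0)          # no speech energy: noise dominates
            elif n <= 0.0:
                row.append(1)          # no noise energy: speech dominates
            else:
                snr_db = 10.0 * math.log10(s / n)
                row.append(1 if snr_db > criterion_db else 0)
        mask.append(row)
    return mask


def ideal_ratio_mask(speech_energy, noise_energy):
    """Return speech / (speech + noise) per unit, a value between 0 and 1."""
    mask = []
    for s_row, n_row in zip(speech_energy, noise_energy):
        row = []
        for s, n in zip(s_row, n_row):
            total = s + n
            row.append(s / total if total > 0.0 else 0.0)
        mask.append(row)
    return mask


if __name__ == "__main__":
    speech = [[4.0, 1.0], [0.0, 9.0]]
    noise = [[1.0, 4.0], [2.0, 1.0]]
    print(ideal_binary_mask(speech, noise))
    print(ideal_ratio_mask(speech, noise))
```

The binary mask makes a hard keep-or-discard decision per unit, while the ratio mask scales each unit by the fraction of its energy attributed to speech.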

[ Scheme overview ]

Speech enhancement based on microphone array signal processing can greatly improve the signal-to-noise ratio and speech recognition performance. The speech recognition rate can therefore be improved by installing a microphone array on a voice interaction device (such as a ticket machine) and applying an effective enhancement scheme.

Current leading speech enhancement schemes adopt a mask (masking value) estimation framework: the mask of the target frequency points is estimated first, and spatial filtering is then performed with a beamforming method to achieve speech enhancement. Common mask estimation techniques include the clustering-based CGMM (complex Gaussian mixture model) and the neural-network-based nn-mask, but both have the drawbacks that the mask cannot be estimated in real time and that directional human-voice interference cannot be handled.

In view of this, the present disclosure proposes a mask estimation scheme based on a differential method, in which reliable mask estimates can be obtained by combining one or more pairs of differential microphones. Based on the obtained mask estimate, the correlation matrix of the speech and the correlation matrix of the noise can be calculated, and spatial filtering is then performed with a beamforming method such as MVDR or GEV (generalized eigenvalue) to achieve speech enhancement.

Aspects of the disclosure are further described below.

[ Mask estimation scheme based on the differential principle ]

A. First order differential microphone principle

By exploiting differences in spatial sound pressure, a first-order differential output can be obtained by subtracting the outputs of two omnidirectional microphones. The filter coefficients h(ω) of the first-order differential microphone depend on τ₀, ω, and α,

where τ₀ is the distance between the two microphones divided by the speed of sound, ω is the angular frequency, and α is a parameter used to adjust the direction of the differential null. First-order differential filtering yields a beam pattern that is the same at all frequencies, which may be described by the following equation,

$$B(\theta) = \frac{\cos\theta - \alpha}{1 - \alpha}$$

where θ is the relative angle between the two microphones and the speaker, and α is used to adjust the direction of the differential null. By the definition of a null, the null angle is the angle at which cos θ = α. Thus, the first-order differential filtering proposed by the present disclosure places the null in a specified direction (θ, i.e., the direction of the speaker relative to the two microphones) in order to obtain a differential mask.
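The relationship between α and the null direction can be checked numerically. The sketch below assumes the standard frequency-independent first-order pattern B(θ) = (cos θ - α) / (1 - α), which has unit gain at θ = 0° and a null where cos θ = α; the function names are illustrative.

```python
import math


def beam_pattern(theta_deg, alpha):
    """First-order differential beam pattern, assumed form (cos(theta) - alpha) / (1 - alpha)."""
    if alpha == 1.0:
        raise ValueError("alpha = 1 puts the null at 0 degrees, where unit gain is also required")
    return (math.cos(math.radians(theta_deg)) - alpha) / (1.0 - alpha)


def null_angle_deg(alpha):
    """Angle in degrees at which the pattern is zero, i.e. where cos(theta) = alpha."""
    return math.degrees(math.acos(alpha))


if __name__ == "__main__":
    alpha = 0.5
    for theta in (0, 30, 60, 90, 120, 150, 180):
        print(theta, round(beam_pattern(theta, alpha), 3))
    print("null at", null_angle_deg(alpha), "degrees")
```

With α = 0 the pattern reduces to cos θ (a dipole with its null at 90°); moving α toward 1 moves the null toward the 0° end of the array axis.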

The derivation of the formula for the first-order differential filter coefficients is described below; details that are well known in the art are not repeated in this disclosure.

A differential array responds to differences in spatial sound pressure. The first-order difference of the sound pressure can be obtained by subtracting the outputs of two closely spaced omnidirectional microphones, and likewise N microphones can yield at most the (N-1)-th order difference of the sound pressure. When designing a differential array, it is important that the microphone spacing be small enough, that is, much smaller than the wavelength of the sound signal, so that the finite difference between the microphone outputs is a good estimate of the actual difference of the sound pressure field.

A first-order differential array requires two microphones and is subject to two constraints: 1. no distortion in the reference direction θ = 0° (the direction along the axis of the two microphones), i.e., the gain in that direction equals 1; 2. the null lies within the interval 0° < θ < 180°.

These two constraints are expressed mathematically as follows:

$$d^T(\omega, \cos 0^\circ)\,h(\omega) = d^T(\omega, 1)\,h(\omega) = 1$$

$$d^T(\omega, \alpha_{1,1})\,h(\omega) = \beta_{1,1}$$

where h(ω) is the filter coefficient, $d^T(\cdot)\,h(\omega)$ denotes the response of the filter under the condition in brackets, ω is the angular frequency, $\alpha_{1,1}$ is a parameter for adjusting the direction of the null, and $\alpha_{1,1} = \cos\theta_{1,1}$ indicates that the null is placed in the target direction, hence $\beta_{1,1} = 0$.

The above constraints can be written in matrix form, and the resulting equation can further be expressed using a Vandermonde matrix, where τ₀ is the distance between the two microphones divided by the speed of sound. Solving this system yields the first-order differential filter.

When the microphone spacing is much smaller than the signal wavelength, the following approximation can be made: $e^x \approx 1 + x$. Applying this approximation to simplify the solution yields the calculation formula for the filter coefficients of the first-order differential microphone.
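The derivation can be sketched as follows. The explicit form of the steering vector is not given in this text, so the sketch assumes the common convention $d(\omega, \cos\theta) = [1,\; e^{-j\omega\tau_0\cos\theta}]^T$, in which the second microphone receives the wavefront $\tau_0\cos\theta$ later than the first. Under a different sign or ordering convention the exponents change sign accordingly.

```latex
% Two constraints stacked into a 2x2 Vandermonde system (assumed steering-vector convention)
\begin{bmatrix}
1 & e^{-j\omega\tau_0} \\
1 & e^{-j\omega\tau_0\alpha_{1,1}}
\end{bmatrix}
h(\omega)
=
\begin{bmatrix} 1 \\ 0 \end{bmatrix}

% Exact solution
h(\omega) = \frac{1}{1 - e^{-j\omega\tau_0(1-\alpha_{1,1})}}
\begin{bmatrix} 1 \\ -e^{\,j\omega\tau_0\alpha_{1,1}} \end{bmatrix}

% Small-spacing approximation e^{x} \approx 1 + x applied to the denominator
h(\omega) \approx \frac{1}{j\omega\tau_0(1-\alpha_{1,1})}
\begin{bmatrix} 1 \\ -e^{\,j\omega\tau_0\alpha_{1,1}} \end{bmatrix}

% Resulting beam pattern, again using e^{x} \approx 1 + x
d^T(\omega,\cos\theta)\,h(\omega)
= \frac{1 - e^{-j\omega\tau_0(\cos\theta-\alpha_{1,1})}}{j\omega\tau_0(1-\alpha_{1,1})}
\approx \frac{\cos\theta-\alpha_{1,1}}{1-\alpha_{1,1}}
```

The last line shows why the pattern is the same at all frequencies: after the approximation, the factor $j\omega\tau_0$ cancels, leaving a response that is zero when $\cos\theta = \alpha_{1,1}$ and equal to 1 at $\theta = 0^\circ$.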

B. Differential mask acquisition

When the null of the differential microphone is aimed in a given direction, speech from that direction is blocked. Using this principle, the time-frequency points that come from the null direction can be found on the time-frequency map. The method for obtaining the differential mask is described below, taking the linear microphone array of 4 microphones shown in Fig. 1 as an example. It should be appreciated that the differential mask acquisition method of the present disclosure is also applicable to microphone arrays with other numbers or arrangements of microphones.

1. First, the relative angle between mic1, mic2 and the speaker can be calculated from the sound source position information. Let the direction vector from the center of mic1 and mic2 to the speaker be (x_s, y_s), and let the direction vector from mic2 to mic1 be (x₁₂, y₁₂). Then

$$\cos(\theta_1) = \frac{x_s x_{12} + y_s y_{12}}{\sqrt{x_s^2 + y_s^2}\,\sqrt{x_{12}^2 + y_{12}^2}}$$

Substituting cos(θ₁) directly as α into the formula in A gives the filter coefficients H₁₂(ω) of mic1 and mic2. The filter coefficients H₃₄(ω) of mic3 and mic4 can be obtained in the same way.
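Step 1 can be sketched as a short function that computes cos(θ₁) as the normalized dot product of the two direction vectors. The 2-D coordinates, the function name, and the example positions are illustrative assumptions.

```python
import math


def cos_relative_angle(mic1, mic2, speaker):
    """Cosine of the angle between (pair center -> speaker) and (mic2 -> mic1).

    All arguments are (x, y) positions in the same planar coordinate system.
    """
    center_x = (mic1[0] + mic2[0]) / 2.0
    center_y = (mic1[1] + mic2[1]) / 2.0
    xs, ys = speaker[0] - center_x, speaker[1] - center_y   # first direction vector
    x12, y12 = mic1[0] - mic2[0], mic1[1] - mic2[1]         # second direction vector
    norm = math.hypot(xs, ys) * math.hypot(x12, y12)
    if norm == 0.0:
        raise ValueError("speaker at pair center or coincident microphones")
    cos_theta = (xs * x12 + ys * y12) / norm
    return max(-1.0, min(1.0, cos_theta))   # clamp against rounding error


if __name__ == "__main__":
    mic1, mic2 = (0.0, 0.0), (0.04, 0.0)
    alpha = cos_relative_angle(mic1, mic2, speaker=(0.02, 1.0))
    print("alpha for a speaker directly in front of the pair:", alpha)
```

The returned value is what the text uses as α, so the null of the pair is steered toward the speaker.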

2. Let the time-frequency domain data of mic1, mic2, mic3, and mic4 be X₁(t,ω), X₂(t,ω), X₃(t,ω), and X₄(t,ω), respectively. The differential output X₁₂(t,ω) of mic1 and mic2 is calculated as the product of the filter coefficients H₁₂(ω) and the time-frequency domain data of the two microphones, and the differential output X₃₄(t,ω) of mic3 and mic4 is calculated in the same way using H₃₄(ω).

3. If the frequency point (t,ω) comes from the direction of the differential null of mic1 and mic2, the modulus of X₁₂(t,ω) is theoretically zero and will be much smaller than the moduli of the original X₁(t,ω) and X₂(t,ω). A threshold e can therefore be set. When the differential output is smaller than this threshold, the frequency point (t,ω) is considered to come from the target speaker, i.e., the mask at frequency point (t,ω) is M₁₂(t,ω) = 1; otherwise M₁₂(t,ω) = 0. The specific value of the threshold e can be determined empirically.
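Steps 2 and 3 can be sketched for a single time-frequency point as follows. Several details are assumptions made for this sketch because the text does not spell them out: the small-spacing filter form h = [1, -exp(jωτ₀α)] / (jωτ₀(1 - α)), a plane-wave model in which mic2 lags mic1 by τ₀·cos θ, and a threshold applied to the differential output relative to the larger of |X₁| and |X₂|.

```python
import cmath
import math


def first_order_filter(omega, tau0, alpha):
    """Assumed small-spacing first-order differential filter (h1, h2)."""
    if omega == 0.0 or alpha == 1.0:
        raise ValueError("filter undefined for omega = 0 or alpha = 1")
    scale = 1.0 / (1j * omega * tau0 * (1.0 - alpha))
    return scale, -scale * cmath.exp(1j * omega * tau0 * alpha)


def differential_mask(x1, x2, omega, tau0, alpha, threshold):
    """Return 1 if the point looks like it comes from the null direction, else 0."""
    h1, h2 = first_order_filter(omega, tau0, alpha)
    x12 = h1 * x1 + h2 * x2                  # first-order differential output
    reference = max(abs(x1), abs(x2))        # assumed normalization
    return 1 if abs(x12) < threshold * reference else 0


if __name__ == "__main__":
    tau0 = 0.04 / 343.0                      # 4 cm spacing, speed of sound 343 m/s
    omega = 2.0 * math.pi * 1000.0           # 1 kHz
    alpha = math.cos(math.radians(60.0))     # null steered to a speaker at 60 degrees
    for angle in (60.0, 150.0):
        x1 = 1.0 + 0.0j
        x2 = x1 * cmath.exp(-1j * omega * tau0 * math.cos(math.radians(angle)))
        print(angle, differential_mask(x1, x2, omega, tau0, alpha, threshold=0.3))
```

A point arriving from the steered direction cancels in the differential output and gets mask value 1, while a point from a clearly different direction leaves a large residual and gets mask value 0.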

As shown in the beam pattern curve of fig. 1, when a null approaches one end, the response in the direction near this end is also small, which can easily lead to an estimation error. Multiple sets of first order differential microphones may be used to overcome this drawback.

In particular, M₃₄(t,ω), M₁₃(t,ω), M₂₃(t,ω), M₂₄(t,ω), etc. can be estimated in the same way, and the final mask output may be the product of the masks of the sets of first-order differential microphones. For example, the final mask output may be expressed as

$$M(t,\omega) = M_{12}(t,\omega)\cdot M_{34}(t,\omega)\cdot M_{23}(t,\omega)\cdot M_{13}(t,\omega)\cdot M_{24}(t,\omega)$$

where M(t,ω) is the final mask output, M₁₂(t,ω) is the differential mask estimation result (i.e., the masking value estimation result mentioned below) obtained for mic1 and mic2, M₃₄(t,ω) is the differential mask estimation result obtained for mic3 and mic4, M₂₃(t,ω) is the differential mask estimation result obtained for mic2 and mic3, M₁₃(t,ω) is the differential mask estimation result obtained for mic1 and mic3, and M₂₄(t,ω) is the differential mask estimation result obtained for mic2 and mic4.
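Combining the per-pair masks is an element-wise product, which for binary masks acts as a logical AND: a time-frequency point is kept only if every microphone pair agrees that it comes from the target direction. A minimal sketch, with an illustrative function name and masks stored as lists indexed [frame][frequency_bin]:

```python
def combine_masks(masks):
    """Element-wise product of several masks that all share the same shape."""
    if not masks:
        raise ValueError("at least one mask is required")
    n_frames = len(masks[0])
    n_bins = len(masks[0][0]) if n_frames else 0
    for m in masks:
        if len(m) != n_frames or any(len(row) != n_bins for row in m):
            raise ValueError("all masks must have the same shape")
    combined = [[1] * n_bins for _ in range(n_frames)]
    for m in masks:
        for t in range(n_frames):
            for k in range(n_bins):
                combined[t][k] *= m[t][k]
    return combined


if __name__ == "__main__":
    m12 = [[1, 1, 0], [1, 0, 1]]
    m34 = [[1, 0, 0], [1, 1, 1]]
    print(combine_masks([m12, m34]))
```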

[ Speech enhancement ]

After the voice mask information is acquired, voice enhancement can be performed based on the mask information. Spatial filtering may be performed, for example, using beamforming methods to achieve speech enhancement. For example, speech enhancement may be implemented based on a subspace method, the basic idea of which is to calculate an autocorrelation matrix or covariance matrix of a signal, then divide the noisy speech signal into a useful signal subspace and a noise signal subspace, and reconstruct the signal using the useful signal subspace, thereby obtaining an enhanced signal.

It can be seen that the subspace method requires constructing a covariance matrix from noisy speech signals, which is then decomposed to obtain a signal subspace and a noise subspace. In the disclosure, a correlation matrix (such as a covariance matrix) of a voice and a correlation matrix (such as a covariance matrix) of a noise can be rapidly calculated according to mask information obtained based on a differential mask obtaining mode, wherein the correlation matrix of the voice can represent a signal subspace corresponding to the voice, and the correlation matrix of the noise can represent a noise subspace corresponding to the noise. Thus, the calculated correlation matrix may be applied in a beamforming algorithm to achieve speech enhancement.

As one example of the present disclosure, the correlation matrix of the speech may be the covariance matrix of the speech portion extracted, based on the acquired mask information, from the time-frequency domain data output by the microphone array, and the correlation matrix of the noise may be the covariance matrix of the noise portion extracted, based on the acquired mask information, from the time-frequency domain data output by the microphone array. For example, the correlation matrices of the noise and the speech can be calculated by the following formulas:

$$R_{NN} = E_t\big((1 - M(t,\omega)) \cdot X(t,\omega)\,X(t,\omega)^H\big)$$

$$R_{SS} = E_t\big(M(t,\omega) \cdot X(t,\omega)\,X(t,\omega)^H\big)$$

where R_NN is the correlation matrix of the noise, R_SS is the correlation matrix of the speech, E_t denotes the expectation, which can be estimated by averaging over time, M(t,ω) is the finally calculated masking value of each frequency point (t,ω), X(t,ω) is the time-frequency domain data output by the microphone array, and X(t,ω)^H is the conjugate transpose of X(t,ω).
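The two formulas can be sketched for a single frequency bin as follows. The sketch assumes that the expectation E_t is estimated as a plain average over all frames and that each frame is a list of complex values, one per microphone; the function name is illustrative.

```python
def masked_correlation_matrices(frames, mask):
    """Return (R_SS, R_NN) for one frequency bin.

    frames: list over time of channel vectors X(t, w), each a list of complex values.
    mask:   list over time of masking values M(t, w) in [0, 1].
    """
    if not frames or len(frames) != len(mask):
        raise ValueError("frames and mask must be non-empty and equally long")
    n_ch = len(frames[0])
    r_ss = [[0j] * n_ch for _ in range(n_ch)]
    r_nn = [[0j] * n_ch for _ in range(n_ch)]
    for x, m in zip(frames, mask):
        for i in range(n_ch):
            for k in range(n_ch):
                outer = x[i] * x[k].conjugate()      # element (i, k) of X X^H
                r_ss[i][k] += m * outer
                r_nn[i][k] += (1 - m) * outer
    n_frames = len(frames)
    r_ss = [[v / n_frames for v in row] for row in r_ss]
    r_nn = [[v / n_frames for v in row] for row in r_nn]
    return r_ss, r_nn


if __name__ == "__main__":
    frames = [[1 + 0j, 1j], [2 + 0j, 0j]]
    mask = [1, 0]
    r_ss, r_nn = masked_correlation_matrices(frames, mask)
    print(r_ss)
    print(r_nn)
```

Frames with mask value 1 contribute only to the speech matrix and frames with mask value 0 only to the noise matrix, so no separate voice activity detector is needed.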

After the correlation matrix of speech and noise is obtained, speech may be enhanced based on a variety of beamforming algorithms, e.g., the above-mentioned MVDR, GEV, etc. beamforming algorithms may be used to achieve speech enhancement. The specific implementation of the beamforming algorithm is not described here.
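As an illustration of how the two correlation matrices can drive a beamformer, the sketch below uses one common mask-based MVDR formulation, in which the weights are the reference-microphone column of R_NN⁻¹·R_SS divided by the trace of that product. The source does not state which MVDR variant it uses, so this choice, the reference-microphone parameter, and all function names are assumptions.

```python
def solve(a, b):
    """Solve a * x = b for a square complex matrix a and matrix b (Gauss-Jordan)."""
    n = len(a)
    aug = [[complex(v) for v in a[i]] + [complex(v) for v in b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or ill-conditioned")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mvdr_weights(r_ss, r_nn, ref=0):
    """Weights w = (R_NN^-1 R_SS)[:, ref] / trace(R_NN^-1 R_SS)."""
    g = solve(r_nn, r_ss)
    trace = sum(g[i][i] for i in range(len(g)))
    if abs(trace) < 1e-12:
        raise ValueError("speech correlation matrix carries no energy")
    return [g[i][ref] / trace for i in range(len(g))]


def apply_beamformer(weights, x):
    """Beamformer output w^H x for one time-frequency point."""
    return sum(w.conjugate() * xi for w, xi in zip(weights, x))


if __name__ == "__main__":
    d = [1 + 0j, 1j]                                   # toy steering vector
    r_ss = [[d[i] * d[k].conjugate() for k in range(2)] for i in range(2)]
    r_nn = [[2, 0], [0, 1]]
    w = mvdr_weights(r_ss, r_nn)
    print(apply_beamformer(w, [3 * di for di in d]))   # speech as seen at mic 0
```

In a full system the weights would be computed separately for every frequency bin; with a rank-one speech matrix, as in the toy example, the beamformer passes the speech component of the reference microphone without distortion.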

This concludes the brief description of the implementation principles of the mask estimation scheme based on the differential principle and of the mask-based speech enhancement scheme of the present disclosure.

[ Speech enhancement method ]

The present disclosure may be implemented as a speech enhancement method. Fig. 2 is a schematic flow chart illustrating a speech enhancement method according to an embodiment of the present disclosure.

Referring to fig. 2, in step S210, the outputs of two microphones in a microphone array are subtracted to obtain a first order differential output.

The microphone array may be mounted on a voice interaction device such as a ticket machine. By exploiting differences in spatial sound pressure, the outputs of two microphones in the microphone array can be subtracted to obtain a first-order differential output. Here, the outputs of any one pair of microphones in the microphone array may be subtracted to obtain one first-order differential output, or the outputs of each of several pairs of microphones may be subtracted to obtain a plurality of first-order differential outputs.

In this embodiment, the first-order differential output may be equal to the product of the filter coefficients and the matrix of time-frequency domain data of the corresponding two microphones. Taking the two microphones mic1 and mic2 as an example, the first-order differential output X₁₂(t,ω) of mic1 and mic2 is the product of H₁₂(ω) and the time-frequency data of the two microphones, where H₁₂(ω) is the filter coefficients of mic1 and mic2, X₁(t,ω) is the time-frequency data of mic1, and X₂(t,ω) is the time-frequency data of mic2. The filter coefficients are as described above.

In this embodiment, the relative angles of the two microphones and the speaker may also be calculated based on the sound source position information of the speaker, and α in the filter coefficients may be determined based on the relative angles. For example, a first direction vector of the center of the two microphones to the speaker may be first determined, then a second direction vector of one of the two microphones to the other microphone may be determined, and the relative angle may be calculated based on the first direction vector and the second direction vector. The specific calculation process of the relative angle may be referred to the above related description, and will not be described herein.

It should be emphasized that the beam pattern may be described as

$$B(\theta) = \frac{\cos\theta - \alpha}{1 - \alpha}$$

The null angle is the angle at which cos θ = α, where cos θ is the cosine of the relative angle between the two microphones corresponding to the filter coefficients and the speaker.

In step S220, the first order differential output is compared with a predetermined threshold.

In step S230, the masking value of each frequency point is determined based on the comparison result.

The masking value characterizes how noise masks speech in noisy speech. In the present disclosure, the masking value (mask) may refer to an ideal binary mask (IBM); see the description of the IBM above, which is not repeated here.

As can be seen from the formula describing the beam pattern, when the null of the differential microphone is aimed in a specified direction, speech from that direction is blocked. If the frequency point (t,ω) comes from the direction of the differential null of mic1 and mic2, the modulus of X₁₂(t,ω) is theoretically zero and will be much smaller than the moduli of the original X₁(t,ω) and X₂(t,ω). A predetermined threshold can therefore be set. Each first-order differential output is compared with the predetermined threshold: the masking value of the frequency point (t,ω) is 1 when the first-order differential output is less than the predetermined threshold, and 0 when the first-order differential output is greater than or equal to the predetermined threshold.

For example, a threshold e may be set. When the differential output of mic1 and mic2 is smaller than this threshold, the frequency point (t,ω) is considered to come from the target speaker, i.e., the mask at frequency point (t,ω) is M₁₂(t,ω) = 1; otherwise M₁₂(t,ω) = 0. The specific value of the threshold e can be determined empirically.

As shown by the beam pattern curve in Fig. 1, when the null is close to one end, the response in directions near that end is also small, which can easily lead to mask estimation errors. The present disclosure therefore proposes determining a masking value estimation result for each first-order differential output based on the result of comparing each of a plurality of first-order differential outputs with the predetermined threshold, and then determining the final masking value of a frequency point based on the masking values corresponding to that frequency point in the plurality of masking value estimation results. For example, the product of the masking values corresponding to the same frequency point in the plurality of masking value estimation results can be used as the final masking value of the frequency point.

For example, assuming that the microphone array consists of the 4 microphones mic1, mic2, mic3, and mic4, the masking value estimation results M₁₂(t,ω), M₃₄(t,ω), M₁₃(t,ω), M₂₃(t,ω), M₂₄(t,ω), etc. of multiple sets of first-order differential microphones can be estimated with the above method, and the product of the calculated masking values corresponding to the same frequency point can then be taken as the masking value of that frequency point. That is, the final mask output M(t,ω) can be expressed as

$$M(t,\omega) = M_{12}(t,\omega)\cdot M_{34}(t,\omega)\cdot M_{23}(t,\omega)\cdot M_{13}(t,\omega)\cdot M_{24}(t,\omega)$$

In this way, the effect of the null being close to one end can be overcome.

In step S240, speech enhancement is performed based on the concealment value.

Based on the determined concealment values, spatial filtering can be performed by using a beam forming algorithm such as MVDR or GEV so as to realize voice enhancement. The specific calculation process is a mature technology in the field, and will not be described here again.

Briefly, a first correlation matrix for the corresponding speech and a second correlation matrix for the corresponding noise may be first calculated based on the concealment values. The first correlation matrix is a covariance matrix of a corresponding voice part extracted from time-frequency domain data output by a microphone array based on a hidden value, and the second correlation matrix is a covariance matrix of a corresponding noise part extracted from time-frequency domain data output by the microphone array based on the hidden value. Then, based on the first correlation matrix and the second correlation matrix, a beamforming algorithm is used for speech enhancement.

As an example, the first correlation matrix is R_SS = E_t(M(t, ω)·X(t, ω)X(t, ω)^H) and the second correlation matrix is R_NN = E_t((1 − M(t, ω))·X(t, ω)X(t, ω)^H), where M(t, ω) represents the concealment value of the frequency point (t, ω), X(t, ω) represents the time-frequency domain data output by the microphone array, E_t represents the mathematical expectation, and X(t, ω)^H represents the conjugate transpose of X(t, ω). X(t, ω) may include time-frequency domain data for one or more microphones in the microphone array.
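A minimal sketch of these two matrices for a single frequency ω is given below. It assumes that the expectation E_t is approximated by an average over the available frames; the function name `masked_correlation` and the nested-list representation are hypothetical.

```python
def masked_correlation(frames, mask):
    """Estimate R_SS and R_NN at one frequency from masked outer products.

    frames: list over time t of per-microphone complex vectors X(t, w).
    mask:   list over time t of concealment values M(t, w) (0 or 1).
    E_t is approximated by the mean over frames (an assumption).
    """
    n = len(frames[0])
    count = len(frames)
    r_ss = [[0j] * n for _ in range(n)]
    r_nn = [[0j] * n for _ in range(n)]
    for x, m in zip(frames, mask):
        for i in range(n):
            for j in range(n):
                outer = x[i] * x[j].conjugate()  # element (i, j) of X X^H
                r_ss[i][j] += m * outer / count
                r_nn[i][j] += (1 - m) * outer / count
    return r_ss, r_nn


# Two frames from a two-microphone array: the first is marked as speech,
# the second as noise.
r_ss, r_nn = masked_correlation([[1 + 0j, 1j], [2 + 0j, 0j]], [1, 0])
```

Each frame contributes to exactly one of the two matrices when the mask is binary, so R_SS + R_NN equals the unmasked correlation matrix of the array output.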

[ application scenario and application case ]

The voice enhancement scheme of the present disclosure is applicable to devices that support voice interaction functions in noisy environments, in particular devices that are far from the sound source (the speaker, i.e. the user giving voice instructions), such as smart speakers (e.g. Echo), robots, cars, ticket purchasing machines, etc. A noisy environment as referred to herein is an environment in which various kinds of noise are present. Taking a subway ticket purchasing machine as an example, the machine is often placed at an entrance of a subway station with heavy foot traffic, so successfully applying voice recognition technology to it requires coping with a highly noisy environment. The noise includes but is not limited to: babble noise caused by crowds talking, interference caused by speakers around the ticket buyer, noise generated by crowd movement, mechanical noise of moving subway trains, interfering sound from loudspeakers, and the like.

The voice enhancement scheme disclosed by the invention can be deployed on equipment which is applied in a noisy environment and supports a voice interaction function, and can be used for enhancing target voice so as to improve voice recognition performance.

Fig. 3 is a block diagram illustrating a structure of a device supporting a voice interaction function according to an embodiment of the present disclosure. The device shown in fig. 3 may be a voice interaction device applied to a noisy environment, and may be, but is not limited to, a smart speaker, a robot, an automobile, a ticket purchasing machine, and the like.

As shown in fig. 3, the device 300 includes a microphone array 310 and a terminal processor 320.

Microphone array 310 is configured to receive acoustic input. The acoustic input received by microphone array 310 may include both the speech of the speaker and ambient noise.

For sound signals received by the microphone array 310, the terminal processor 320 may first convert them to sound data through analog-to-digital conversion, and the sound data may then be enhanced using the voice enhancement scheme of the present disclosure. Briefly, the terminal processor 320 may subtract the outputs of two microphones in the microphone array to obtain a first-order differential output, compare the first-order differential output with a predetermined threshold, determine a concealment value for each frequency point based on the comparison result, and perform speech enhancement based on the concealment value, where the concealment value is used to characterize the masking of speech by noise in noisy speech. Details of the specific implementation of the voice enhancement scheme performed by the terminal processor 320 may be found in the related description above, and will not be repeated here.

In addition, the device 300 may also include a communication module 330. The voice data after the voice enhancement by the terminal processor 320 may be sent to the server through the communication module 330, and the server performs subsequent operations such as voice recognition, instruction issuing, and the like.

Fig. 4 is an overall flow chart illustrating a speech enhancement scheme that may be performed by the device shown in fig. 3. The VAD determination, differential mask estimation, statistic calculation, and beamforming shown in fig. 4 may be performed by the terminal processor in fig. 3.

As shown in fig. 4, the filled circles on the left represent the microphone array, and after the microphone array collects the original sound wave signal and obtains the digital sound data through ADC (analog-to-digital conversion), VAD judgment can be performed according to the voice activity detection result (i.e. VAD input). The voice activity detection result may be result information obtained based on the existing voice activity detection mode, and the principle and implementation details of the voice activity detection mode are not important in the present disclosure, and are not described herein.

According to the VAD judgment result, whether to transmit the sound data to the differential mask estimation module or directly transmit the sound data to the statistic calculation module can be determined. For example, the analog-to-digital converted sound data may be transmitted to the differential mask estimation module when the VAD input is that there is voice activity, and the analog-to-digital converted digital signal may be transmitted to the statistic calculation module when the VAD input is that there is no voice activity.

The differential mask estimation module can accept sound source position information as input and execute the mask estimation scheme of the present disclosure to obtain the time-frequency points of the noisy speech that correspond to the target speech. The sound source position information may be position information of the target speaker determined by any known localization method; for example, it may be determined by the multiple signal classification (MUSIC) algorithm. The specific method of determining the sound source position information is not the focus of the present disclosure, so the localization process is not described further here.

The statistic calculation module can receive the original audio data and the mask information of each frequency point estimated by the differential mask estimation module, and calculate a correlation matrix of corresponding voice and a correlation matrix of noise. The beam forming module can calculate the coefficient of the spatial filter through the correlation matrix of the input voice and the noise, and beam-form the original audio to output the finally enhanced voice.

Compared with mask estimation schemes based on clustering (CGMM) and on neural networks, the mask estimation scheme based on the first-order differential output has almost no delay (the latency depends on the sound source localization information, and zero or small delay can be achieved). Moreover, neither the clustering-based CGMM method nor the neural-network-based nn-mask can handle directional human voice interference, whereas the mask estimation scheme based on the first-order differential output is not affected by it. Therefore, the real-time voice enhancement scheme based on the differential mask can effectively improve the success rate of voice recognition in noisy scenes such as subway ticket purchasing machines.

[ Speech enhancement device ]

Fig. 5 is a schematic block diagram showing the structure of a voice enhancement apparatus according to an embodiment of the present disclosure. Wherein the functional modules of the speech enhancement apparatus may be implemented by hardware, software, or a combination of hardware and software that implements the principles of the present disclosure. Those skilled in the art will appreciate that the functional modules depicted in fig. 5 may be combined or divided into sub-modules to implement the principles of the invention described above. Accordingly, the description herein may support any possible combination, or division, or even further definition of the functional modules described herein.

The following is a brief description of the functional modules that the speech enhancement apparatus may have and the operations that each functional module may perform, and details related thereto may be referred to the above description in connection with fig. 2, which is not repeated here.

Referring to fig. 5, the speech enhancement apparatus 500 includes a difference module 510, a comparison module 520, a determination module 530, and a speech enhancement module 540.

The difference module 510 is configured to subtract the outputs of two microphones in the microphone array to obtain a first-order difference output. The comparison module 520 is configured to compare the first order differential output to a predetermined threshold. The determining module 530 is configured to determine a concealment value of each frequency bin based on the comparison result, and the speech enhancement module 540 is configured to perform speech enhancement based on the concealment value. Wherein the masking value is used to characterize the masking of noise to speech in noisy speech. Wherein the determining module 530 may determine a concealment value of a bin when the first order differential output is less than a predetermined threshold as 1 and determine a concealment value of a bin when the first order differential output is greater than or equal to the predetermined threshold as 0.

As an example of the present disclosure, the determining module 530 may determine a concealment value estimation result of each first-order differential output based on a result of comparing the plurality of first-order differential outputs with a predetermined threshold value, respectively, and determine a final concealment value of the frequency bin based on a concealment value of the same frequency bin among the plurality of concealment value estimation results.

In this embodiment, the first order differential output may be equal to the product of the filter coefficients and a matrix of time-frequency domain data of the two microphones. Wherein the filter coefficients are

As an example of the present disclosure, the speech enhancement apparatus 500 may further include an angle calculation module and a coefficient determination module (not shown in the figure). The angle calculation module is used for calculating the relative angle between the two microphones and the speaker based on the sound source position information of the speaker, and the coefficient determination module is used for determining α in the filter coefficients based on the relative angle.

Alternatively, the angle calculation module may include: a first direction vector determination module for determining a first direction vector from the center of the two microphones to the speaker; a second direction vector determining module for determining a second direction vector from one of the two microphones to the other microphone; and a calculation sub-module for calculating the relative angle based on the first direction vector and the second direction vector.
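The three sub-modules can be sketched as below. The source only states that the relative angle is calculated from the two direction vectors; taking the arccosine of their normalized dot product is an assumption of this sketch, as are the function name `relative_angle` and the coordinate-tuple inputs.

```python
import math


def relative_angle(mic_a, mic_b, speaker):
    """Angle in radians between the microphone axis and the direction to the speaker.

    mic_a, mic_b, speaker: coordinate tuples of equal dimension (2-D or 3-D).
    """
    center = [(a + b) / 2 for a, b in zip(mic_a, mic_b)]
    to_speaker = [s - c for s, c in zip(speaker, center)]  # first direction vector
    mic_axis = [b - a for a, b in zip(mic_a, mic_b)]       # second direction vector
    dot = sum(u * v for u, v in zip(to_speaker, mic_axis))
    norm = math.hypot(*to_speaker) * math.hypot(*mic_axis)
    # Clamp to [-1, 1] so rounding errors cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.acos(cos_theta)


# Microphones 10 cm apart on the x axis, speaker straight ahead on the y axis.
broadside = relative_angle((-0.05, 0.0), (0.05, 0.0), (0.0, 1.0))
```

With the speaker straight ahead of the pair the angle is a right angle, and with the speaker on the microphone axis it is zero.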

As shown in fig. 6, the speech enhancement module 540 includes a matrix calculation module 541 and a beamforming module 542.

The matrix calculation module 541 is configured to calculate a first correlation matrix corresponding to the speech and a second correlation matrix corresponding to the noise based on the concealment value. The first correlation matrix is a covariance matrix of the corresponding speech part extracted, based on the concealment value, from the time-frequency domain data output by the microphone array, and the second correlation matrix is a covariance matrix of the corresponding noise part extracted, based on the concealment value, from the time-frequency domain data output by the microphone array. For example, the first correlation matrix is R_SS = E_t(M(t, ω)·X(t, ω)X(t, ω)^H) and the second correlation matrix is R_NN = E_t((1 − M(t, ω))·X(t, ω)X(t, ω)^H), where M(t, ω) represents the concealment values of the different frequency points (t, ω), X(t, ω) represents the time-frequency domain data output by the microphone array, E_t represents the mathematical expectation, and X(t, ω)^H represents the conjugate transpose of X(t, ω).

The beamforming module 542 is configured to perform speech enhancement using a beamforming algorithm based on the first correlation matrix and the second correlation matrix. For example, spatial filtering may be performed using a beam forming method such as MVDR or GEV to achieve speech enhancement.
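The source does not spell out the beamformer computation. As a hedged illustration only, the sketch below uses one common reference-channel formulation of MVDR, w = (R_NN⁻¹ R_SS / tr(R_NN⁻¹ R_SS))·u, where u selects a reference microphone. It is restricted to two microphones so that the matrix inverse can be written out by hand; the function names and this choice of formulation are assumptions, not the method of the disclosure.

```python
def solve_2x2(a, b):
    """Return inv(a) @ b for 2x2 complex matrices given as nested lists."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    inv = [[a[1][1] / det, -a[0][1] / det],
           [-a[1][0] / det, a[0][0] / det]]
    return [[sum(inv[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def mvdr_weights(r_ss, r_nn, ref=0):
    """Reference-channel MVDR weights for two microphones at one frequency."""
    g = solve_2x2(r_nn, r_ss)          # R_NN^-1 R_SS
    trace = g[0][0] + g[1][1]
    return [g[i][ref] / trace for i in range(2)]  # column `ref`, normalized


def apply_beamformer(w, x):
    """Beamformer output w^H x for one time-frequency point."""
    return sum(wi.conjugate() * xi for wi, xi in zip(w, x))


# Rank-one speech matrix d d^H with d = [1, 1j], and white noise (identity).
r_ss = [[1 + 0j, -1j], [1j, 1 + 0j]]
r_nn = [[1 + 0j, 0j], [0j, 1 + 0j]]
w = mvdr_weights(r_ss, r_nn)
out = apply_beamformer(w, [1 + 0j, 1j])
```

In this toy case the output equals the speech component at the reference microphone, which is the distortionless property expected of MVDR.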

[ computing device ]

Fig. 7 illustrates a schematic diagram of a computing device that may be used to implement the above-described speech enhancement method according to an embodiment of the present disclosure.

Referring to fig. 7, a computing device 700 includes a memory 710 and a processor 720.

Processor 720 may be a multi-core processor or may include multiple processors. In some embodiments, processor 720 may include a general-purpose host processor and one or more special coprocessors, for example, a graphics processing unit (GPU), a digital signal processor (DSP), etc. In some embodiments, processor 720 may be implemented using custom circuitry, for example, an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).

Memory 710 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions that are required by the processor 720 or other modules of the computer. The persistent storage may be a readable and writable storage device, and may be a non-volatile memory device that does not lose stored instructions and data even after the computer is powered down. In some embodiments, a mass storage device (e.g., magnetic or optical disk, flash memory) is employed as the persistent storage. In other embodiments, the persistent storage may be a removable storage device (e.g., diskette, optical drive). The system memory may be a readable and writable memory device or a volatile readable and writable memory device, such as dynamic random access memory. The system memory may store instructions and data that are required by some or all of the processors at runtime. Furthermore, memory 710 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic disks and/or optical disks may also be employed. In some embodiments, memory 710 may include readable and/or writable removable storage devices such as compact discs (CDs), read-only digital versatile discs (e.g., DVD-ROMs, dual-layer DVD-ROMs), read-only Blu-ray discs, super-density discs, flash memory cards (e.g., SD cards, mini SD cards, micro-SD cards, etc.), magnetic floppy disks, and the like. The computer-readable storage medium does not contain carrier waves or transient electronic signals transmitted wirelessly or by wire.

The memory 710 has stored thereon executable code that, when executed by the processor 720, causes the processor 720 to perform the speech enhancement method described above.

The speech enhancement method, apparatus and device according to the present disclosure have been described in detail above with reference to the accompanying drawings.

Furthermore, the method according to the present disclosure may also be implemented as a computer program or computer program product comprising computer program code instructions for performing the above steps defined in the above method of the present disclosure.

Alternatively, the present disclosure may also be implemented as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or computer program, or computer instruction code) that, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the present disclosure.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A method of speech enhancement, comprising:

subtracting the outputs of two microphones in the microphone array to obtain a first-order differential output;

comparing the first order differential output with a predetermined threshold;

determining a concealment value of each frequency point based on a comparison result, wherein the concealment value is used for representing the shielding condition of noise in noisy speech on the speech, and when the first-order differential output is smaller than the preset threshold value, the nulls of the first-order differential microphones formed by the two microphones are aligned to a designated direction, and the designated direction is the direction of the two microphones relative to a speaker; and

carrying out voice enhancement based on the concealment value.

2. The voice enhancement method of claim 1, wherein the determining the concealment value for each frequency bin comprises:

a concealment value of a frequency bin when the first order differential output is less than the predetermined threshold is determined to be 1, and a concealment value of a frequency bin when the first order differential output is greater than or equal to the predetermined threshold is determined to be 0.

3. The voice enhancement method of claim 1, wherein the determining the concealment value for each frequency bin comprises:

determining a concealment value estimation result of each first-order differential output based on the results of comparing the plurality of first-order differential outputs with the predetermined threshold respectively; and

determining the final concealment value of the frequency point based on the concealment values corresponding to the same frequency point in the plurality of concealment value estimation results.

4. The speech enhancement method of claim 3, wherein the step of determining the final concealment value for the bin comprises:

taking the product of the concealment values corresponding to the same frequency point in the plurality of concealment value estimation results as the final concealment value of the frequency point.

5. The speech enhancement method of claim 1 wherein,

the first order differential output is equal to the product of the filter coefficients and a matrix of time-frequency domain data of the two microphones.

6. The speech enhancement method of claim 5, wherein the filter coefficients are

7. The speech enhancement method of claim 6, further comprising:

calculating the relative angles of the two microphones and the speaker based on the sound source position information of the speaker; and

alpha in the filter coefficients is determined based on the relative angle.

8. The speech enhancement method of claim 7, wherein the step of calculating the relative angles of the two microphones to the speaker comprises:

determining a first direction vector from the center of the two microphones to the speaker;

determining a second direction vector from one of the two microphones to the other microphone; and

calculating the relative angle based on the first direction vector and the second direction vector.

9. The speech enhancement method of claim 1, wherein the step of speech enhancing based on the concealment value comprises:

calculating a first correlation matrix corresponding to the voice and a second correlation matrix corresponding to the noise based on the concealment value; and

performing voice enhancement by using a beamforming algorithm based on the first correlation matrix and the second correlation matrix.

10. The speech enhancement method of claim 9 wherein,

the first correlation matrix is a covariance matrix of a corresponding voice part extracted from time-frequency domain data output by the microphone array based on the concealment value,

the second correlation matrix is a covariance matrix of a corresponding noise portion extracted from the time-frequency domain data output from the microphone array based on the concealment value.

11. A speech enhancement apparatus comprising:

the difference module is used for subtracting the outputs of the two microphones in the microphone array to obtain first-order difference output;

a comparison module for comparing the first-order differential output with a predetermined threshold;

the determining module is used for determining a concealment value of each frequency point based on a comparison result, wherein the concealment value is used for representing the shielding condition of noise in noisy speech on the speech, and when the first-order differential output is smaller than the preset threshold value, the nulls of the first-order differential microphones formed by the two microphones are aligned to a designated direction, and the designated direction is the direction of the two microphones relative to a speaker; and

the voice enhancement module is used for carrying out voice enhancement based on the concealment value.

12. An apparatus supporting voice interaction functions, comprising:

a microphone array for receiving sound input; and

the terminal processor is used for subtracting the outputs of the two microphones in the microphone array to obtain a first-order differential output, comparing the first-order differential output with a predetermined threshold, determining a concealment value of each frequency point based on a comparison result, and carrying out voice enhancement based on the concealment value, wherein the concealment value is used for representing the shielding condition of noise in noisy speech on the speech, and when the first-order differential output is smaller than the predetermined threshold, the nulls of the first-order differential microphones formed by the two microphones are aligned with a designated direction, and the designated direction is the direction of the two microphones relative to a speaker.

13. The apparatus of claim 12, further comprising:

the communication module is used for sending the voice data after voice enhancement to the server.

14. The apparatus of claim 12, wherein the apparatus is any one of:

a ticket purchasing machine;

an intelligent sound box;

a robot;

an automobile.

15. A computing device, comprising:

a processor; and

a memory having executable code stored thereon, which when executed by the processor causes the processor to perform the method of any of claims 1-10.

16. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 1 to 10.