WO2022105571A1

WO2022105571A1 - Speech enhancement method and apparatus, and device and computer-readable storage medium

Info

Publication number: WO2022105571A1
Application number: PCT/CN2021/127260
Authority: WO
Inventors: 赵沁; 徐国强
Original assignee: 深圳壹账通智能科技有限公司
Priority date: 2020-11-17
Filing date: 2021-10-29
Publication date: 2022-05-27
Also published as: CN112489674A

Abstract

A speech enhancement method and apparatus, and a device and a computer-readable storage medium. The method comprises: collecting a speech signal by means of a microphone array, and converting the speech signal into a frequency-domain observation signal, wherein the speech signal is a time-domain observation signal (S10); inputting the frequency-domain observation signal into a first super-directivity beam former in a generalized sidelobe canceler, so as to determine a reference speech signal output by the first super-directivity beam former (S20); inputting the frequency-domain observation signal into a second super-directivity beam former in the generalized sidelobe canceler, so as to determine a noise signal corresponding to the speech signal (S30); and determining, on the basis of the reference speech signal and the noise signal, a speech enhancement signal corresponding to the speech signal (S40). By means of the method, a speech signal from a target orientation can be effectively enhanced, noise interference can be better removed, and the accuracy of a reference speech signal and a noise signal can be effectively improved, such that the accuracy of a speech enhancement signal can be further improved.

Description

Speech enhancement method, apparatus, device, and computer-readable storage medium

This application claims priority to Chinese patent application No. 202011297820.3, entitled "Speech Enhancement Method, Apparatus, Equipment and Computer-readable Storage Medium", filed with the Patent Office of the State Intellectual Property Office of the People's Republic of China on November 17, 2020, the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the technical field of signal processing, and in particular, to a speech enhancement method, apparatus, device, and computer-readable storage medium.

Background

Smart terminal devices such as smart TVs, smart speakers, smart vending machines, and smart ticket vending machines are becoming more and more widely used. With the vigorous development of voice technology and hardware technology, voice interaction has become an important interface for intelligent human-computer interaction. However, noise is ubiquitous in real environments. Efficient back-end computation and processing depend on picking up a clean target speech signal, so front-end speech signal enhancement is essential. As speech recognition technology becomes more widespread, the demand for speech signal processing technology also grows. At present, during speech recognition or voiceprint recognition, the speech signal collected by the front-end device generally contains noise, including noise from the background environment and noise generated by the front-end device during recording. The inventor realizes that such noisy speech signals reduce the accuracy of speech recognition. Therefore, speech enhancement processing (that is, noise reduction processing) needs to be performed on the speech signal so that a speech signal that is as clean as possible is extracted from the noisy signal, making speech recognition more accurate. At present, the speech signal extracted after speech enhancement processing is not accurate enough, which is not conducive to subsequent speech recognition.

The above content is only used to assist the understanding of the technical solutions of the present application, and does not mean that the above content is the prior art.

Technical problem

One of the purposes of the embodiments of the present application is to provide a speech enhancement method, apparatus, device, and computer-readable storage medium, which aims to solve the technical problem of low accuracy of the speech signal extracted after speech enhancement processing is currently performed on the speech signal.

Technical solutions

In a first aspect, an embodiment of the present application provides a speech enhancement method, wherein the speech enhancement method includes the following steps:

The voice signal is collected by the microphone array, and the voice signal is converted into a frequency-domain observation signal, wherein the voice signal is a time-domain observation signal;

inputting the frequency domain observation signal to the first super-directional beamformer in the generalized sidelobe canceller to determine the reference speech signal output by the first super-directional beamformer;

The frequency domain observation signal is input to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer is orthogonal to the blocking matrix corresponding to the first super-directional beamformer;

A speech enhancement signal corresponding to the speech signal is determined based on the reference speech signal and the noise signal.

In a second aspect, an embodiment of the present application provides a voice enhancement device, wherein the voice enhancement device includes:

an acquisition module, configured to collect a voice signal through a microphone array and convert the voice signal into a frequency domain observation signal, wherein the voice signal is a time domain observation signal;

a first determination module, configured to input the frequency domain observation signal to a first super-directional beamformer in the generalized sidelobe canceller, to determine a reference speech signal output by the first super-directional beamformer;

a second determination module, configured to input the frequency domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller, so as to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other;

A third determining module, configured to determine a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal.

In a third aspect, an embodiment of the present application provides a speech enhancement device, including a memory, a processor, and a speech enhancement program stored in the memory and executable on the processor, where the processor, when executing the speech enhancement program, implements the steps of the speech enhancement method of the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, which may be non-volatile or volatile and which stores a computer program, where the computer program, when executed by a processor, implements the steps of the speech enhancement method of the first aspect.

Beneficial effects

Compared with the prior art, the embodiments of the present application have the following beneficial effects. A voice signal is collected through a microphone array and converted into a frequency-domain observation signal, wherein the voice signal is a time-domain observation signal; the frequency-domain observation signal is input to the first super-directional beamformer in the generalized sidelobe canceller to determine the reference speech signal output by the first super-directional beamformer; the frequency-domain observation signal is input to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other; and the speech enhancement signal corresponding to the speech signal is determined based on the reference speech signal and the noise signal. This embodiment improves the generalized sidelobe canceller by combining its structure with super-directional beamforming, which has strong directivity and a narrow main lobe. As a result, the first super-directional beamformer in the generalized sidelobe canceller can effectively enhance the speech signal from the target azimuth. At the same time, the blocking matrix part of the lower branch of the generalized sidelobe canceller is improved based on super-directional beamforming, so that noise interference is filtered out more effectively. This improves the accuracy of the calculated reference speech signal and noise signal, and thereby further improves the accuracy of the speech enhancement signal.

Description of drawings

In order to explain the technical solutions of the embodiments of the present application more clearly, the following briefly introduces the accompanying drawings used in the description of the embodiments. For those of ordinary skill in the art, other drawings can also be obtained from these drawings without any creative effort.

FIG. 1 is a schematic structural diagram of a speech enhancement device of a hardware operating environment involved in a solution according to an embodiment of the present application;

FIG. 2 is a schematic flowchart of the first embodiment of the speech enhancement method of the application;

FIG. 3 is a schematic flowchart of the second embodiment of the speech enhancement method of the present application.

The realization, functional characteristics and advantages of the purpose of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Embodiments of the present invention

It should be understood that the specific embodiments described herein are only used to explain the present application, but not to limit the present application.

The speech enhancement method, apparatus, device and computer-readable storage medium provided by this application can also be applied to the field of artificial intelligence.

As shown in FIG. 1 , FIG. 1 is a schematic structural diagram of a speech enhancement device of a hardware operating environment involved in the solution of the embodiment of the present application.

The voice enhancement device in the embodiment of this application may be a PC, or a portable terminal device with a display function such as a smart phone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, or a portable computer.

As shown in FIG. 1, the speech enhancement device may include a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to realize the connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. Optionally, the network interface 1004 may include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be high-speed RAM memory, or may be non-volatile memory, such as disk memory. Optionally, the memory 1005 may also be a storage device independent of the aforementioned processor 1001.

Optionally, the voice enhancement device may further include a camera, an RF (Radio Frequency) circuit, a sensor, an audio circuit, a WiFi module, and the like. The sensors include, for example, light sensors, motion sensors, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display screen according to the brightness of the ambient light, and the proximity sensor may turn off the display screen and/or the backlight when the voice enhancement device is moved to the ear. As a kind of motion sensor, the gravitational acceleration sensor can detect the magnitude of acceleration in all directions (generally three axes), and can detect the magnitude and direction of gravity when stationary. It can be used for applications that recognize the posture of the voice enhancement device (such as switching between horizontal and vertical screens, related games, and magnetometer attitude calibration) and for vibration recognition related functions (such as a pedometer and tapping). Of course, the voice enhancement device can also be equipped with other sensors such as a gyroscope, barometer, hygrometer, thermometer, and infrared sensor, which are not described in detail here.

Those skilled in the art can understand that the structure of the speech enhancement device shown in FIG. 1 does not constitute a limitation on the speech enhancement device, which may include more or fewer components than those shown in the figure, combine some components, or use a different component layout.

As shown in FIG. 1 , the memory 1005 as a computer storage medium may include an operating system, a network communication module, a user interface module and a speech enhancement program.

In the voice enhancement device shown in FIG. 1, the network interface 1004 is mainly used to connect to the background server and perform data communication with it; the user interface 1003 is mainly used to connect to the client and perform data communication with it; and the processor 1001 may be configured to call the speech enhancement program stored in the memory 1005 and execute the speech enhancement method provided by the embodiment of the present application.

The present application also provides a speech enhancement method. Referring to FIG. 2 , FIG. 2 is a schematic flowchart of the first embodiment of the speech enhancement method of the present application.

Step S10, collecting a voice signal through a microphone array, and converting the voice signal into a frequency-domain observation signal, wherein the voice signal is a time-domain observation signal;

The speech enhancement method proposed in this application is applied to intelligent terminal equipment and is based on a microphone array and a generalized sidelobe canceller. The microphone array is composed of multiple microphone array elements and is used to collect the sound signal in the real environment, that is, the speech signal. The generalized sidelobe canceller is an improved beamformer based on super-directional beamforming technology and includes an upper branch and a lower branch. The upper branch passes and initially enhances the speech signal from the target direction, while the lower branch filters out the speech signal from the target direction and passes the noise signal contained in the speech signal. It can be understood that, because the microphone array elements are located at different positions, there is a certain time difference between the speech signals received by the array elements, and this information can be used to determine the direction and position of the sound source.

In this embodiment, before the speech enhancement process is performed, an M-element microphone array is used to collect a speech signal in a real environment, wherein the speech signal collected through the microphone array is the time domain observation signal $x(t) = [x_1(t), x_2(t), \dots, x_M(t)]$. Preprocessing operations such as framing are performed on the time-domain observation signal, and the preprocessed signal is then processed frame by frame to obtain the frame data corresponding to the speech signal. A short-time discrete Fourier transform is then applied to the frame data to obtain the frequency domain observation signal $X_i(e^{j\omega})$, where i denotes the data of the ith frame. In the following, for simplicity, X(k) is used to represent the frequency domain data of the kth frame.
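The framing and short-time Fourier transform described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `stft_multichannel`, the frame length, the hop size, and the Hann window are all assumptions, since the text does not specify them.

```python
import numpy as np

def stft_multichannel(x, frame_len=512, hop=256):
    """Frame and transform a multichannel recording.

    x: array of shape (M, T), one row per microphone (time domain).
    Returns X of shape (K, M, frame_len // 2 + 1): frame k, microphone m, bin n.
    frame_len, hop and the Hann window are illustrative assumptions.
    """
    x = np.asarray(x, dtype=float)
    num_mics, num_samples = x.shape
    if num_samples < frame_len:
        raise ValueError("signal is shorter than one frame")
    num_frames = 1 + (num_samples - frame_len) // hop
    window = np.hanning(frame_len)
    X = np.empty((num_frames, num_mics, frame_len // 2 + 1), dtype=complex)
    for k in range(num_frames):
        # Cut frame k from every channel and apply the analysis window.
        frame = x[:, k * hop : k * hop + frame_len] * window
        # One-sided DFT along the time axis of each channel.
        X[k] = np.fft.rfft(frame, axis=-1)
    return X
```

With this layout, `X[k, :, n]` is the M-element observation vector of frame k at frequency point n, which is the quantity the beamformers operate on.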

Step S20, inputting the frequency domain observation signal to the first super-directional beamformer in the generalized sidelobe canceller to determine the reference speech signal output by the first super-directional beamformer;

In this embodiment, after the frequency domain observation signal corresponding to the speech signal is obtained, it is input to the upper branch of the generalized sidelobe canceller, where the first super-directional beamformer performs beamforming and outputs the speech signal initially enhanced toward the target direction, that is, the reference speech signal. The target direction is the direction in which the main lobe points, and the output corresponding to the main lobe is the initially enhanced reference speech signal. The direction angle corresponding to the voice signal is the angle formed between the voice signal and the plane in which the microphone array lies when the voice signal is received by the microphone array. The generalized sidelobe canceller is an improved beamformer based on super-directional beamforming technology; it includes the first super-directional beamformer in the upper branch and the second super-directional beamformer in the lower branch, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other. The first super-directional beamformer enhances the speech signal passing through the upper branch of the generalized sidelobe canceller; its strong directivity and narrow main lobe allow it to effectively enhance the speech signal from the target azimuth, with a good enhancement effect.

Step S30, inputting the frequency domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer is orthogonal to the blocking matrix corresponding to the first super-directional beamformer;

In this embodiment, after the frequency domain observation signal corresponding to the speech signal is obtained, it is input to the second super-directional beamformer in the lower branch of the generalized sidelobe canceller, so that the second super-directional beamformer performs the function of the blocking matrix of the lower branch. The direction of the interference noise is preset in the device, and the second super-directional beamformer calculates and outputs the noise signal based on the preset direction of the interference noise and the frequency domain observation signal. It can be understood that the lower branch of the generalized sidelobe canceller blocks the speech signal, so that its output is the signal part containing only interference noise.

Step S40, determining a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal.

In this embodiment, after the reference speech signal output by the upper branch and the noise signal output by the lower branch of the generalized sidelobe canceller are obtained, both are input to the adaptive noise suppressor. The adaptive noise suppressor adopts the normalized least mean square (NLMS) criterion and, based on the reference voice signal and the noise signal, adaptively filters the voice signal collected by the microphone array; the speech enhancement signal in the frequency domain is obtained once the adaptive filtering is complete. Because the speech enhancement signal output by the adaptive noise suppressor is in the frequency domain, the speech enhancement signal in the time domain is obtained by a subsequent inverse Fourier transform. Specifically, after the speech enhancement signal in the frequency domain is obtained, an inverse short-time discrete Fourier transform is performed on it to obtain and output the time domain enhancement signal.

The voice enhancement method proposed in this embodiment collects a voice signal through a microphone array and converts it into a frequency-domain observation signal, wherein the voice signal is a time-domain observation signal; inputs the frequency-domain observation signal to the first super-directional beamformer in the generalized sidelobe canceller to determine the reference speech signal output by the first super-directional beamformer; inputs the frequency-domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other; and determines the speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal. This embodiment improves the generalized sidelobe canceller by combining its structure with super-directional beamforming, which has strong directivity and a narrow main lobe. As a result, the first super-directional beamformer in the generalized sidelobe canceller can effectively enhance the speech signal from the target azimuth. At the same time, the blocking matrix part of the lower branch of the generalized sidelobe canceller is improved based on super-directional beamforming, so that noise interference is filtered out more effectively. This improves the accuracy of the calculated reference speech signal and noise signal, and thereby further improves the accuracy of the speech enhancement signal.

Based on the first embodiment, a second embodiment of the speech enhancement method of the present application is proposed. Referring to FIG. 3 , in this embodiment, step S20 includes:

Step S21, inputting the frequency domain observation signal to the first super-directional beamformer of the generalized sidelobe canceller, so as to determine the steering vector of each frequency point of the frequency domain observation signal based on the direction angle corresponding to the voice signal and the array element spacing corresponding to the microphone array;

Step S22, determining the first projection matrix of each frequency point of the frequency domain observation signal based on the steering vector of each frequency point of the frequency domain observation signal;

Step S23, determining a reference speech signal output by the first super-directional beamformer based on the first projection matrix and the frequency domain observation signal.

In this embodiment, after the frequency domain observation signal corresponding to the speech signal is obtained, it is input to the upper branch of the generalized sidelobe canceller. The first super-directional beamformer of the upper branch calculates the steering vector of each frequency point of the frequency domain observation signal based on the direction angle corresponding to the speech signal and the array element spacing of the microphone array. Based on these steering vectors, it then calculates the noise cross-correlation coefficient matrix of each frequency point, and from that matrix the first projection matrix of each frequency point. Finally, the first super-directional beamformer determines the reference speech signal output by the upper branch of the generalized sidelobe canceller based on the first projection matrix and the frequency domain observation signal.

Specifically, let the direction angle be θ and the array element spacing be d, and take the first microphone as the reference array element. For the nth frequency point of the mth array element, the steering vector of each frequency point of the frequency domain observation signal is calculated from the direction angle corresponding to the speech signal and the array element spacing of the microphone array as follows:

where f is the sampling rate, N_fft is the length of the fast Fourier transform, and c is the propagation speed of the signal, here the speed of sound.
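The formula itself is not reproduced in this text. As an illustrative assumption only, a standard far-field steering vector for a uniform linear array, written with the symbols defined above and taking θ as the angle between the arrival direction and the array axis, would be:

$$\alpha_m(\theta, n) = \exp\!\left(-j\,2\pi\,\frac{n f}{N_{fft}}\cdot\frac{(m-1)\,d\cos\theta}{c}\right), \qquad m = 1, \dots, M$$

The sign of the exponent and the use of $\cos\theta$ rather than $\sin\theta$ depend on conventions that the text does not state, so this form should be read as a sketch rather than the applicant's formula.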

After that, the calculation proceeds frequency point by frequency point: the noise cross-correlation coefficient matrix Q of the nth frequency point is calculated based on the steering vector of each frequency point of the frequency-domain observation signal, using the following formula:

where i and j represent the i-th array element and the j-th array element of the microphone array, respectively.
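The formula for Q is likewise missing from this text. Super-directional designs commonly model the noise as a spherically isotropic (diffuse) field, for which the coherence between array elements i and j is a sinc function of their distance. Under that assumption, which the text does not confirm, the matrix would read:

$$Q_{ij}(n) = \frac{\sin\!\left(2\pi f_n d_{ij}/c\right)}{2\pi f_n d_{ij}/c}, \qquad f_n = \frac{n f}{N_{fft}}, \quad d_{ij} = |i-j|\,d$$

with $Q_{ii}(n) = 1$ on the diagonal. Here $f_n$ and $d_{ij}$ are shorthand introduced for this sketch.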

Then the projection matrix of frequency point n is calculated, that is, the first projection matrix of each frequency point of the frequency domain observation signal is calculated based on the noise cross-correlation coefficient matrix of each frequency point. The calculation formula is as follows:

where α denotes the steering vector of the nth frequency point with respect to the direction θ.
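The projection matrix formula is also not reproduced here. The usual super-directional (minimum-variance, distortionless) solution, given as an assumption consistent with the symbols Q and α above, is:

$$W(\theta, n) = \frac{Q^{-1}(n)\,\alpha(\theta, n)}{\alpha^{H}(\theta, n)\,Q^{-1}(n)\,\alpha(\theta, n)}$$

This choice satisfies $W(\theta,n)^{H}\alpha(\theta,n) = 1$, so a signal arriving from direction θ passes undistorted while the modeled noise power is minimized. In practice a small diagonal term is often added to Q before inversion to keep the solution stable at low frequencies.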

Finally, the beam output signal of the upper branch is calculated, that is, the reference speech signal output by the upper branch of the generalized sidelobe canceller is determined based on the first projection matrix and the frequency domain observation signal. The calculation formula is as follows:

$$Y(k,n) = W(\theta,n)^{H}\,X(k,n)$$

where Y(k,n) is the reference speech signal corresponding to the nth frequency point of the kth frame of the frequency domain observation signal.
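The upper-branch computation (steering vector, noise cross-correlation coefficient matrix, projection matrix, output Y) can be sketched in Python as below. Because the formulas are missing from this text, the sketch relies on assumptions: a far-field uniform linear array with a cos θ delay model, a diffuse-field sinc coherence matrix for Q, the standard distortionless weight solution, and a small diagonal loading term. All function names and the `loading` parameter are hypothetical.

```python
import numpy as np

C_SOUND = 343.0  # assumed speed of sound in m/s

def steering_vector(theta, freq_hz, num_mics, spacing):
    """Far-field ULA steering vector, first microphone as reference (assumed model)."""
    delays = np.arange(num_mics) * spacing * np.cos(theta) / C_SOUND
    return np.exp(-2j * np.pi * freq_hz * delays)

def diffuse_coherence(freq_hz, num_mics, spacing):
    """Assumed diffuse-field noise coherence between every pair of elements."""
    idx = np.arange(num_mics)
    dist = np.abs(idx[:, None] - idx[None, :]) * spacing
    # np.sinc(x) is sin(pi x) / (pi x), hence the argument 2 f d / c.
    return np.sinc(2.0 * freq_hz * dist / C_SOUND)

def superdirective_weights(theta, freq_hz, num_mics, spacing, loading=1e-2):
    """W = Q^-1 a / (a^H Q^-1 a), with diagonal loading for numerical stability."""
    a = steering_vector(theta, freq_hz, num_mics, spacing)
    Q = diffuse_coherence(freq_hz, num_mics, spacing) + loading * np.eye(num_mics)
    Qinv_a = np.linalg.solve(Q, a)
    return Qinv_a / (a.conj() @ Qinv_a)

def upper_branch(X, theta, fs, n_fft, spacing):
    """Apply Y(k, n) = W(theta, n)^H X(k, n) to X of shape (K, M, n_fft // 2 + 1)."""
    num_frames, num_mics, num_bins = X.shape
    Y = np.empty((num_frames, num_bins), dtype=complex)
    for n in range(num_bins):
        w = superdirective_weights(theta, n * fs / n_fft, num_mics, spacing)
        Y[:, n] = X[:, :, n] @ w.conj()
    return Y
```

By construction the weights satisfy W^H α = 1, so a plane wave arriving from the direction θ passes through the upper branch with unit gain under the assumed model.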

Further, the calculation formulas above take a uniform linear microphone array as an example. Depending on actual needs, the speech signal can also be enhanced using other array geometries, such as a uniform circular array.

Further, the step of determining the first projection matrix of each frequency point of the frequency domain observation signal based on the steering vector of each frequency point of the frequency domain observation signal includes:

Step S221, based on the steering vector of each frequency point of the frequency-domain observation signal, calculate the noise cross-correlation coefficient matrix of each frequency point of the frequency-domain observation signal;

Step S222: Calculate the first projection matrix of each frequency point of the frequency domain observation signal based on the noise cross-correlation coefficient matrix of each frequency point.

In this embodiment, after obtaining the steering vector of each frequency point of the frequency-domain observation signal, the first super-directional beamformer of the generalized sidelobe canceller calculates the noise cross-correlation coefficient matrix of each frequency point based on those steering vectors. It then calculates the first projection matrix of each frequency point based on the noise cross-correlation coefficient matrix, so that the reference speech signal output by the upper branch of the generalized sidelobe canceller can be determined based on the first projection matrix and the frequency domain observation signal. For the example formulas used to calculate the noise cross-correlation coefficient matrix and the first projection matrix of each frequency point, refer to the previous embodiment.

Further, the step of inputting the frequency domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal includes:

Step S31, inputting the frequency domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller, to determine the second projection matrix of each frequency point of the frequency domain observation signal based on the noise direction vector;

Step S32: Determine the noise signal output by the second super-directional beamformer based on the second projection matrix and the frequency domain observation signal.

In this embodiment, after the frequency domain observation signal corresponding to the speech signal is obtained, it is input to the second super-directional beamformer in the lower branch of the generalized sidelobe canceller, so that the second super-directional beamformer performs the function of the blocking matrix of the lower branch. Specifically, first, the noise steering vector of each frequency point of the frequency-domain observation signal is calculated based on the preset direction angle of the interference noise and the array element spacing of the microphone array. Then, the second projection matrix of each frequency point is calculated based on the noise steering vector of each frequency point. Finally, the noise signal is calculated and output based on the second projection matrix and the frequency-domain observation signal, so that the generalized sidelobe canceller obtains, through the second super-directional beamformer, the noise signal that remains after the reference speech signal has been blocked. It can be understood that the lower branch of the generalized sidelobe canceller blocks the reference speech signal, so that its output is the signal part containing only interference noise, that is, the noise signal.
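The text does not give the formula for the second projection matrix. One way to obtain a beam that passes the preset interference direction while blocking the target direction is a linearly constrained minimum variance (LCMV) design with two constraints: unit gain toward the noise direction and a null toward the target. The sketch below uses that technique as a stand-in, under the same assumed array and noise-field models as a conventional super-directional design; it is not claimed to be the applicant's method, and all names are hypothetical.

```python
import numpy as np

C_SOUND = 343.0  # assumed speed of sound in m/s

def steering_vector(theta, freq_hz, num_mics, spacing):
    """Far-field ULA steering vector, first microphone as reference (assumed model)."""
    delays = np.arange(num_mics) * spacing * np.cos(theta) / C_SOUND
    return np.exp(-2j * np.pi * freq_hz * delays)

def noise_branch_weights(theta_noise, theta_target, freq_hz, num_mics,
                         spacing, loading=1e-2):
    """LCMV weights with C^H w = g, where g = [1, 0]:
    unit gain toward theta_noise, null toward theta_target.
    Requires freq_hz > 0 and distinct directions so that C has full rank."""
    idx = np.arange(num_mics)
    dist = np.abs(idx[:, None] - idx[None, :]) * spacing
    # Assumed diffuse-field coherence plus diagonal loading.
    Q = np.sinc(2.0 * freq_hz * dist / C_SOUND) + loading * np.eye(num_mics)
    C = np.stack([steering_vector(theta_noise, freq_hz, num_mics, spacing),
                  steering_vector(theta_target, freq_hz, num_mics, spacing)],
                 axis=1)
    g = np.array([1.0, 0.0])
    Qinv_C = np.linalg.solve(Q, C)
    return Qinv_C @ np.linalg.solve(C.conj().T @ Qinv_C, g)

def lower_branch(X, theta_noise, theta_target, fs, n_fft, spacing):
    """Noise reference U(k, n) = w(n)^H X(k, n) for X of shape (K, M, bins)."""
    num_frames, num_mics, num_bins = X.shape
    U = np.zeros((num_frames, num_bins), dtype=complex)
    for n in range(1, num_bins):  # bin 0 (DC) carries no direction information
        w = noise_branch_weights(theta_noise, theta_target,
                                 n * fs / n_fft, num_mics, spacing)
        U[:, n] = X[:, :, n] @ w.conj()
    return U
```

The explicit null toward the target is what keeps the desired speech out of the noise reference; if speech leaks into this branch, the adaptive stage that follows tends to cancel part of the speech along with the noise.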

For the upper branch, the steering vector of each frequency point of the frequency-domain observation signal is calculated based on the direction angle corresponding to the speech signal and the array element spacing corresponding to the microphone array. After the steering vectors are obtained, the first super-directional beamformer calculates the noise cross-correlation coefficient matrix of each frequency point based on the steering vector of that frequency point, and then calculates the first projection matrix of each frequency point based on the noise cross-correlation coefficient matrix. After the first projection matrix of each frequency point is obtained, the first super-directional beamformer determines the reference speech signal output by the upper branch of the generalized sidelobe canceller based on the first projection matrix and the frequency-domain observation signal.
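The upper-branch steps can be sketched with the textbook superdirective beamformer, w = Γ⁻¹d / (dᴴΓ⁻¹d), where d is the steering vector and Γ is a diffuse-noise coherence matrix. This is an assumption: the text names a noise cross-correlation coefficient matrix and a first projection matrix but does not state their formulas. The uniform linear array geometry, the speed of sound, the diagonal loading, and all function names below are likewise illustrative.

```python
import numpy as np

C_SOUND = 343.0  # speed of sound in m/s (assumed)


def steering_vector(freq_hz, angle_rad, num_mics, spacing_m):
    """Far-field steering vector for a uniform linear array (assumed geometry)."""
    delays = np.arange(num_mics) * spacing_m * np.cos(angle_rad) / C_SOUND
    return np.exp(-2j * np.pi * freq_hz * delays)


def superdirective_weights(freq_hz, angle_rad, num_mics, spacing_m, loading=1e-2):
    """Distortionless weights that minimise diffuse-noise power (assumed model)."""
    d = steering_vector(freq_hz, angle_rad, num_mics, spacing_m)
    idx = np.arange(num_mics)
    dist = np.abs(idx[:, None] - idx[None, :]) * spacing_m
    # Diffuse-field coherence, with diagonal loading to keep the solve well conditioned.
    gamma = np.sinc(2.0 * freq_hz * dist / C_SOUND) + loading * np.eye(num_mics)
    gi_d = np.linalg.solve(gamma, d)   # Gamma^{-1} d
    return gi_d / np.vdot(d, gi_d)     # divide by d^H Gamma^{-1} d


def reference_speech(frame, freqs_hz, angle_rad, spacing_m):
    """frame has shape (num_bins, num_mics); returns one output value per bin."""
    num_mics = frame.shape[1]
    out = np.empty(len(freqs_hz), dtype=complex)
    for k, f in enumerate(freqs_hz):
        w = superdirective_weights(f, angle_rad, num_mics, spacing_m)
        out[k] = np.vdot(w, frame[k])  # w^H x for bin k
    return out
```

Because the weights satisfy wᴴd = 1, a source arriving exactly from the steered direction passes through this sketch without distortion, while diffuse noise is attenuated.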

Further, the step of determining the speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal includes:

Step S41, inputting the reference speech signal and the noise signal into an adaptive noise suppressor, so as to perform adaptive noise suppression on the frequency domain observation signal corresponding to the speech signal based on the reference speech signal and the noise signal, to obtain an error signal corresponding to the frequency domain observation signal;

Step S42, inputting the error signal to the adaptive noise suppressor, optimizing the parameters of the adaptive noise suppressor using the normalized minimum mean square error criterion, and determining the speech enhancement signal corresponding to the speech signal after the optimization of the adaptive noise suppressor is completed.

In this embodiment, after the reference speech signal output by the upper branch and the noise signal output by the lower branch of the generalized sidelobe canceller are obtained, both signals are input to the adaptive noise suppressor. The adaptive noise suppressor performs adaptive noise suppression on the frequency-domain observation signal corresponding to the speech signal according to the reference speech signal and the noise signal, so as to suppress the noise in the speech signal as far as possible and output a high-precision speech enhancement signal. Specifically, the adaptive noise suppressor first calculates an error signal based on the reference speech signal and the noise signal, where the error signal is the frequency-domain observation signal after noise suppression. At this stage the error signal is still a speech signal of low accuracy, and noise suppression must be repeated many times to obtain a signal of high accuracy. The error signal is therefore fed back to the adaptive noise suppressor, which uses the normalized minimum mean square error criterion to optimize its parameters; after the optimization is completed, the adaptive noise suppressor outputs a high-precision speech enhancement signal.
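The feedback loop described above can be illustrated with the normalized least mean squares (NLMS) update, which is one common reading of a normalized minimum mean square error criterion; the text does not state the update formula, so this is an assumption. The sketch follows the conventional generalized sidelobe canceller arrangement, in which the filter weights the lower-branch noise signal and the result is subtracted from the upper-branch reference. The function name `nlms_step`, the step size `mu`, and the regularizer `eps` are hypothetical.

```python
import numpy as np


def nlms_step(weights, noise_ref, desired, mu=0.5, eps=1e-8):
    """One complex NLMS update.

    weights   : current filter weights, shape (L,)
    noise_ref : noise-branch samples fed to the filter, shape (L,)
    desired   : upper-branch reference sample (scalar)
    Returns the updated weights and the error (enhanced) sample.
    """
    estimate = np.vdot(weights, noise_ref)            # w^H u
    error = desired - estimate                        # residual after noise removal
    power = np.vdot(noise_ref, noise_ref).real + eps  # ||u||^2, regularised
    new_weights = weights + mu * noise_ref * np.conj(error) / power
    return new_weights, error


# Usage: identify a fixed, made-up noise path from synthetic data.
rng = np.random.default_rng(0)
true_path = np.array([0.5 - 0.2j, -0.3 + 0.1j])
w = np.zeros(2, dtype=complex)
for _ in range(2000):
    u = rng.standard_normal(2) + 1j * rng.standard_normal(2)
    d = np.vdot(true_path, u)  # noise leaking into the reference branch
    w, e = nlms_step(w, u, d)
```

In this synthetic case the weights converge to the made-up path and the error shrinks toward zero, which mirrors the repeated suppression the paragraph describes.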

Further, the step of inputting the reference speech signal and the noise signal into an adaptive noise suppressor, so as to perform adaptive noise suppression on the frequency domain observation signal corresponding to the speech signal based on the reference speech signal and the noise signal, to obtain an error signal corresponding to the frequency domain observation signal, includes:

Step S411, inputting the reference speech signal and the noise signal into an adaptive noise suppressor to determine an adjustment signal based on the weight vector corresponding to the adaptive noise suppressor and the reference speech signal;

Step S412: Adjust the frequency-domain observation signal corresponding to the speech signal based on the adjustment signal, and determine an error signal corresponding to the frequency-domain observation signal after adjustment.

In this embodiment, after the reference speech signal output by the upper branch and the noise signal output by the lower branch of the generalized sidelobe canceller are obtained, both signals are input to the adaptive noise suppressor, which performs adaptive noise suppression on the frequency-domain observation signal corresponding to the speech signal according to the reference speech signal and the noise signal, so as to suppress the noise in the speech signal as far as possible and output a high-precision speech enhancement signal. Specifically, the adaptive noise suppressor first calculates and outputs the adjustment signal based on its weight vector and the reference speech signal; the frequency-domain observation signal is then adjusted based on the adjustment signal, and the error signal corresponding to the adjusted frequency-domain observation signal is obtained. The frequency-domain observation signal may be adjusted by subtracting the adjustment signal from it, which yields the error signal corresponding to the speech signal.

Further, step S10 includes:

Step S11, collecting voice signals through a microphone array, and performing a frame division operation on the voice signals to obtain frame data corresponding to the voice signals;

Step S12: Perform short-time discrete Fourier transform on the frame data corresponding to the speech signal to obtain a frequency domain observation signal corresponding to the speech signal.

In this embodiment, before speech enhancement is performed, an M-element microphone array is used to collect a speech signal in a real environment. The speech signal collected by the microphone array is a time-domain observation signal and can be expressed as x(t) = [x_1(t), x_2(t), ..., x_M(t)]. Preprocessing operations such as framing are performed on the time-domain observation signal to obtain the frame data corresponding to the speech signal; a short-time discrete Fourier transform is then performed on the frame data to obtain the frequency-domain observation signal, which can be expressed as X_i(e^{jω}), where i denotes the i-th frame of data. In the following, for simplicity, X(k) is used to represent the frequency-domain data of the k-th frame.
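The framing and short-time discrete Fourier transform steps can be sketched as follows. The frame length, hop size, Hann window, and the function names `frame_signal` and `stft` are assumptions chosen for illustration; the text does not specify them.

```python
import numpy as np


def frame_signal(x, frame_len, hop):
    """Split x of shape (num_mics, num_samples) into overlapping frames.

    Returns an array of shape (num_mics, num_frames, frame_len).
    Assumes num_samples >= frame_len; trailing samples that do not fill
    a whole frame are dropped.
    """
    num_samples = x.shape[1]
    num_frames = 1 + (num_samples - frame_len) // hop
    starts = hop * np.arange(num_frames)[:, None]
    idx = starts + np.arange(frame_len)[None, :]   # (num_frames, frame_len)
    return x[:, idx]


def stft(x, frame_len=512, hop=256):
    """Windowed frame-wise DFT; output shape (num_mics, num_frames, frame_len // 2 + 1)."""
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    return np.fft.rfft(frames, axis=-1)


# Usage: a 4-microphone recording of a 1 kHz tone sampled at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
x = np.tile(np.sin(2 * np.pi * 1000.0 * t), (4, 1))
X = stft(x)
```

Each slice `X[m, i, :]` then plays the role of the frequency-domain observation of microphone m for frame i.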

In the speech enhancement method proposed in this embodiment, the frequency-domain observation signal is input to the first super-directional beamformer of the generalized sidelobe canceller, so as to determine the steering vector of each frequency point of the frequency-domain observation signal based on the direction angle corresponding to the speech signal and the array element spacing corresponding to the microphone array; the first projection matrix of each frequency point is determined based on the steering vector of that frequency point; and the reference speech signal output by the first super-directional beamformer is determined based on the first projection matrix and the frequency-domain observation signal. This embodiment combines the generalized sidelobe canceller structure with super-directional beamforming, which has strong directivity and a narrow main lobe. Applying the first super-directional beamformer in the upper branch of the generalized sidelobe canceller effectively enhances the speech signal from the target azimuth, so that the reference speech signal is well enhanced.

In addition, an embodiment of the present application also proposes a voice enhancement device, where the voice enhancement device includes:

Further, the first determining module is also used for:

inputting the frequency domain observation signal to the first super-directional beamformer of the generalized sidelobe canceller, so as to determine the steering vector of each frequency point of the frequency domain observation signal based on the direction angle corresponding to the speech signal and the array element spacing corresponding to the microphone array;

determining a first projection matrix of each frequency point of the frequency-domain observation signal based on the steering vector of each frequency point of the frequency-domain observation signal;

A reference speech signal output by the first super-directional beamformer is determined based on the first projection matrix and the frequency domain observation signal.

Further, the first determining module is also used for:

Calculate the noise cross-correlation coefficient matrix of each frequency point of the frequency-domain observation signal based on the steering vector of each frequency point of the frequency-domain observation signal;

Based on the noise cross-correlation coefficient matrix of each frequency point, the first projection matrix of each frequency point of the frequency domain observation signal is calculated.

Further, the second determining module is also used for:

inputting the frequency-domain observation signal to a second super-directional beamformer of a generalized sidelobe canceller to determine a second projection matrix of each frequency point of the frequency-domain observation signal based on the noise direction vector;

A noise signal output by the second super-directional beamformer is determined based on the second projection matrix and the frequency domain observation signal.

Further, the third determining module is also used for:

inputting the reference speech signal and the noise signal into an adaptive noise suppressor to perform adaptive noise suppression on the frequency domain observation signal corresponding to the speech signal based on the reference speech signal and the noise signal, to obtain an error signal corresponding to the frequency domain observation signal;

inputting the error signal to the adaptive noise suppressor, optimizing the parameters of the adaptive noise suppressor using a normalized minimum mean square error criterion, and determining the speech enhancement signal corresponding to the speech signal after the optimization of the adaptive noise suppressor is completed.

Further, the third determining module is also used for:

inputting the reference speech signal and the noise signal into an adaptive noise suppressor to determine an adjustment signal based on a weight vector corresponding to the adaptive noise suppressor and the reference speech signal;

The frequency domain observation signal corresponding to the speech signal is adjusted based on the adjustment signal, and an error signal corresponding to the adjusted frequency domain observation signal is determined.

Further, the acquisition module is also used for:

Collecting voice signals through a microphone array, and performing a frame division operation on the voice signals to obtain frame data corresponding to the voice signals;

A short-time discrete Fourier transform is performed on the frame data corresponding to the speech signal to obtain a frequency domain observation signal corresponding to the speech signal.

Embodiments of the present application also provide a voice enhancement device, including a memory, a processor, and a voice enhancement program stored in the memory and running on the processor, where the processor executes the voice enhancement program to achieve:

In addition, an embodiment of the present application also proposes a computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile, and a speech enhancement program is stored on the computer-readable storage medium. The speech enhancement program is implemented when executed by the processor:

The specific embodiments of the computer-readable storage medium of the present application are basically the same as the above-mentioned embodiments of the speech enhancement method, and are not described in detail here.

It should be noted that, herein, the terms "comprising", "including" or any other variation thereof are intended to encompass non-exclusive inclusion, such that a process, method, article or system comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or system. Without further limitation, an element qualified by the phrase "comprising a..." does not preclude the presence of additional identical elements in the process, method, article or system that includes the element.

The above-mentioned serial numbers of the embodiments of the present application are only for description, and do not represent the advantages or disadvantages of the embodiments.

From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course can also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solutions of the present application, in essence or in the parts that contribute to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to execute the methods described in the various embodiments of the present application.

The above are only the preferred embodiments of the present application, and are not intended to limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made by using the contents of the description and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of this application.

Claims

1. A speech enhancement method, wherein the speech enhancement method comprises the following steps:

The voice signal is collected by the microphone array, and the voice signal is converted into a frequency-domain observation signal, wherein the voice signal is a time-domain observation signal;

inputting the frequency domain observation signal to a first super-directional beamformer in the generalized sidelobe canceller to determine a reference speech signal output by the first super-directional beamformer;

The frequency domain observation signal is input to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer is orthogonal to the blocking matrix corresponding to the first super-directional beamformer;

A speech enhancement signal corresponding to the speech signal is determined based on the reference speech signal and the noise signal.
2. The speech enhancement method of claim 1, wherein the step of inputting the frequency domain observation signal to a first super-directional beamformer in a generalized sidelobe canceller to determine the reference speech signal output by the first super-directional beamformer comprises:

inputting the frequency domain observation signal to the first super-directional beamformer of the generalized sidelobe canceller, so as to determine the steering vector of each frequency point of the frequency domain observation signal based on the direction angle corresponding to the speech signal and the array element spacing corresponding to the microphone array;

determining a first projection matrix of each frequency point of the frequency-domain observation signal based on the steering vector of each frequency point of the frequency-domain observation signal;

A reference speech signal output by the first super-directional beamformer is determined based on the first projection matrix and the frequency domain observation signal.
3. The speech enhancement method according to claim 2, wherein the step of determining the first projection matrix of each frequency point of the frequency domain observation signal based on the steering vector of each frequency point of the frequency domain observation signal comprises:

Based on the steering vector of each frequency point of the frequency-domain observation signal, calculate the noise cross-correlation coefficient matrix of each frequency point of the frequency-domain observation signal;

Based on the noise cross-correlation coefficient matrix of each frequency point, the first projection matrix of each frequency point of the frequency domain observation signal is calculated.
4. The speech enhancement method according to claim 1, wherein the step of inputting the frequency domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal comprises:

inputting the frequency-domain observation signal to a second super-directional beamformer of a generalized sidelobe canceller to determine a second projection matrix of each frequency point of the frequency-domain observation signal based on the noise direction vector;

A noise signal output by the second super-directional beamformer is determined based on the second projection matrix and the frequency domain observation signal.
5. The speech enhancement method according to claim 1, wherein the step of determining the speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal comprises:

inputting the reference speech signal and the noise signal into an adaptive noise suppressor to perform adaptive noise suppression on the frequency domain observation signal corresponding to the speech signal based on the reference speech signal and the noise signal, to obtain an error signal corresponding to the frequency domain observation signal;

inputting the error signal to the adaptive noise suppressor, optimizing the parameters of the adaptive noise suppressor using a normalized minimum mean square error criterion, and determining the speech enhancement signal corresponding to the speech signal after the optimization of the adaptive noise suppressor is completed.
6. The speech enhancement method of claim 5, wherein the step of inputting the reference speech signal and the noise signal into an adaptive noise suppressor to perform adaptive noise suppression on the frequency domain observation signal corresponding to the speech signal based on the reference speech signal and the noise signal, to obtain an error signal corresponding to the frequency domain observation signal, comprises:

inputting the reference speech signal and the noise signal into an adaptive noise suppressor to determine an adjustment signal based on a weight vector corresponding to the adaptive noise suppressor and the reference speech signal;

The frequency domain observation signal corresponding to the speech signal is adjusted based on the adjustment signal, and an error signal corresponding to the adjusted frequency domain observation signal is determined.
7. The speech enhancement method according to any one of claims 1 to 6, wherein the step of collecting speech signals through a microphone array and converting the speech signals into frequency domain observation signals comprises:

Collecting voice signals through a microphone array, and performing a frame division operation on the voice signals to obtain frame data corresponding to the voice signals;

A short-time discrete Fourier transform is performed on the frame data corresponding to the speech signal to obtain a frequency domain observation signal corresponding to the speech signal.
8. A voice enhancement device, wherein the voice enhancement device comprises:

a collection module, configured to collect voice signals through a microphone array, and convert the voice signals into frequency-domain observation signals, wherein the voice signals are time-domain observation signals;

a first determination module, configured to input the frequency domain observation signal to a first super-directional beamformer in the generalized sidelobe canceller, to determine a reference speech signal output by the first super-directional beamformer;

a second determination module, configured to input the frequency domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller, so as to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other;

A third determining module, configured to determine a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal.
9. A speech enhancement device, wherein the speech enhancement device comprises: a memory, a processor, and a speech enhancement program stored on the memory and executable on the processor, wherein the speech enhancement program, when executed by the processor, implements:

The voice signal is collected by the microphone array, and the voice signal is converted into a frequency-domain observation signal, wherein the voice signal is a time-domain observation signal;

inputting the frequency domain observation signal to a first super-directional beamformer in the generalized sidelobe canceller to determine a reference speech signal output by the first super-directional beamformer;

The frequency domain observation signal is input to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer is orthogonal to the blocking matrix corresponding to the first super-directional beamformer;

A speech enhancement signal corresponding to the speech signal is determined based on the reference speech signal and the noise signal.
10. The speech enhancement device of claim 9, wherein the step of inputting the frequency domain observation signal to a first super-directional beamformer in a generalized sidelobe canceller to determine the reference speech signal output by the first super-directional beamformer comprises:

inputting the frequency domain observation signal to the first super-directional beamformer of the generalized sidelobe canceller, so as to determine the steering vector of each frequency point of the frequency domain observation signal based on the direction angle corresponding to the speech signal and the array element spacing corresponding to the microphone array;

determining a first projection matrix of each frequency point of the frequency-domain observation signal based on the steering vector of each frequency point of the frequency-domain observation signal;

A reference speech signal output by the first super-directional beamformer is determined based on the first projection matrix and the frequency domain observation signal.
11. The speech enhancement device according to claim 10, wherein the step of determining the first projection matrix of each frequency point of the frequency domain observation signal based on the steering vector of each frequency point of the frequency domain observation signal comprises:

Based on the steering vector of each frequency point of the frequency-domain observation signal, calculate the noise cross-correlation coefficient matrix of each frequency point of the frequency-domain observation signal;

Based on the noise cross-correlation coefficient matrix of each frequency point, the first projection matrix of each frequency point of the frequency domain observation signal is calculated.
12. The speech enhancement device of claim 9, wherein the step of inputting the frequency domain observation signal to a second super-directional beamformer of a generalized sidelobe canceller to determine the noise signal corresponding to the speech signal comprises:

inputting the frequency-domain observation signal to a second super-directional beamformer of a generalized sidelobe canceller to determine a second projection matrix of each frequency point of the frequency-domain observation signal based on the noise direction vector;

A noise signal output by the second super-directional beamformer is determined based on the second projection matrix and the frequency domain observation signal.
13. The speech enhancement device according to claim 9, wherein the step of determining the speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal comprises:

inputting the reference speech signal and the noise signal into an adaptive noise suppressor to perform adaptive noise suppression on the frequency domain observation signal corresponding to the speech signal based on the reference speech signal and the noise signal, to obtain an error signal corresponding to the frequency domain observation signal;

inputting the error signal to the adaptive noise suppressor, optimizing the parameters of the adaptive noise suppressor using a normalized minimum mean square error criterion, and determining the speech enhancement signal corresponding to the speech signal after the optimization of the adaptive noise suppressor is completed.
14. The speech enhancement device of claim 13, wherein the step of inputting the reference speech signal and the noise signal into an adaptive noise suppressor to perform adaptive noise suppression on the frequency domain observation signal corresponding to the speech signal based on the reference speech signal and the noise signal, to obtain an error signal corresponding to the frequency domain observation signal, comprises:

inputting the reference speech signal and the noise signal into an adaptive noise suppressor to determine an adjustment signal based on a weight vector corresponding to the adaptive noise suppressor and the reference speech signal;

The frequency domain observation signal corresponding to the speech signal is adjusted based on the adjustment signal, and an error signal corresponding to the adjusted frequency domain observation signal is determined.
15. The speech enhancement device according to any one of claims 9 to 14, wherein the step of collecting speech signals through a microphone array and converting the speech signals into frequency domain observation signals comprises:

Collecting voice signals through a microphone array, and performing a frame division operation on the voice signals to obtain frame data corresponding to the voice signals;

A short-time discrete Fourier transform is performed on the frame data corresponding to the speech signal to obtain a frequency domain observation signal corresponding to the speech signal.
16. A computer-readable storage medium, wherein a speech enhancement program is stored on the computer-readable storage medium, and the speech enhancement program, when executed by a processor, implements:

The voice signal is collected by the microphone array, and the voice signal is converted into a frequency-domain observation signal, wherein the voice signal is a time-domain observation signal;

inputting the frequency domain observation signal to a first super-directional beamformer in the generalized sidelobe canceller to determine a reference speech signal output by the first super-directional beamformer;

The frequency domain observation signal is input to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer is orthogonal to the blocking matrix corresponding to the first super-directional beamformer;

A speech enhancement signal corresponding to the speech signal is determined based on the reference speech signal and the noise signal.
17. The computer-readable storage medium of claim 16, wherein the step of inputting the frequency domain observation signal to a first super-directional beamformer in a generalized sidelobe canceller to determine the reference speech signal output by the first super-directional beamformer comprises:

inputting the frequency domain observation signal to the first super-directional beamformer of the generalized sidelobe canceller, so as to determine the steering vector of each frequency point of the frequency domain observation signal based on the direction angle corresponding to the speech signal and the array element spacing corresponding to the microphone array;

determining a first projection matrix of each frequency point of the frequency-domain observation signal based on the steering vector of each frequency point of the frequency-domain observation signal;

A reference speech signal output by the first super-directional beamformer is determined based on the first projection matrix and the frequency domain observation signal.
18. The computer-readable storage medium of claim 17, wherein the step of determining the first projection matrix of each frequency point of the frequency domain observation signal based on the steering vector of each frequency point of the frequency domain observation signal comprises:

Based on the steering vector of each frequency point of the frequency-domain observation signal, calculate the noise cross-correlation coefficient matrix of each frequency point of the frequency-domain observation signal;

Based on the noise cross-correlation coefficient matrix of each frequency point, the first projection matrix of each frequency point of the frequency domain observation signal is calculated.
19. The computer-readable storage medium of claim 16, wherein the step of inputting the frequency domain observation signal to a second super-directional beamformer of a generalized sidelobe canceller to determine the noise signal corresponding to the speech signal comprises:

inputting the frequency-domain observation signal to a second super-directional beamformer of a generalized sidelobe canceller to determine a second projection matrix of each frequency point of the frequency-domain observation signal based on the noise direction vector;

A noise signal output by the second super-directional beamformer is determined based on the second projection matrix and the frequency domain observation signal.
20. The computer-readable storage medium of claim 16, wherein the step of determining a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal comprises:

inputting the reference speech signal and the noise signal into an adaptive noise suppressor to perform adaptive noise suppression on the frequency domain observation signal corresponding to the speech signal based on the reference speech signal and the noise signal, to obtain an error signal corresponding to the frequency domain observation signal;

inputting the error signal to the adaptive noise suppressor, optimizing the parameters of the adaptive noise suppressor using a normalized minimum mean square error criterion, and determining the speech enhancement signal corresponding to the speech signal after the optimization of the adaptive noise suppressor is completed.