WO2022105571A1 - Speech enhancement method and apparatus, and device and computer-readable storage medium - Google Patents

Info

Publication number
WO2022105571A1
WO2022105571A1 PCT/CN2021/127260 CN2021127260W WO2022105571A1 WO 2022105571 A1 WO2022105571 A1 WO 2022105571A1 CN 2021127260 W CN2021127260 W CN 2021127260W WO 2022105571 A1 WO2022105571 A1 WO 2022105571A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
speech
domain observation
frequency
super
Prior art date
Application number
PCT/CN2021/127260
Other languages
French (fr)
Chinese (zh)
Inventor
赵沁
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2022105571A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Definitions

  • the present application relates to the technical field of signal processing, and in particular, to a speech enhancement method, apparatus, device, and computer-readable storage medium.
  • Speech enhancement processing, that is, noise reduction processing, is performed on the speech signal
  • A purer speech signal is extracted from the signal as far as possible to make speech recognition more accurate.
  • The voice signal extracted after voice enhancement processing has low accuracy, which hinders subsequent voice recognition.
  • One of the purposes of the embodiments of the present application is to provide a speech enhancement method, apparatus, device, and computer-readable storage medium, which aim to solve the technical problem that the speech signal extracted by current speech enhancement processing has low accuracy.
  • an embodiment of the present application provides a speech enhancement method, wherein the speech enhancement method includes the following steps:
  • the voice signal is collected by the microphone array, and the voice signal is converted into a frequency-domain observation signal, wherein the voice signal is a time-domain observation signal;
  • the frequency domain observation signal is input to the first super-directional beamformer in the generalized sidelobe canceller to determine the reference speech signal output by the first super-directional beamformer;
  • the frequency domain observation signal is input to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other;
  • a speech enhancement signal corresponding to the speech signal is determined based on the reference speech signal and the noise signal.
  • an embodiment of the present application provides a voice enhancement device, wherein the voice enhancement device includes:
  • an acquisition module configured to collect a voice signal through a microphone array and convert the voice signal into a frequency domain observation signal, wherein the voice signal is a time domain observation signal;
  • a first determination module configured to input the frequency domain observation signal to a first super-directional beamformer in the generalized sidelobe canceller, to determine a reference speech signal output by the first super-directional beamformer;
  • a second determination module configured to input the frequency domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller, so as to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other;
  • a third determining module configured to determine a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal.
  • an embodiment of the present application provides a speech enhancement device, including a memory, a processor, and a speech enhancement program stored in the memory and runnable on the processor, where the processor, when executing the speech enhancement program, implements:
  • the voice signal is collected by the microphone array, and the voice signal is converted into a frequency-domain observation signal, wherein the voice signal is a time-domain observation signal;
  • the frequency domain observation signal is input to the first super-directional beamformer in the generalized sidelobe canceller to determine the reference speech signal output by the first super-directional beamformer;
  • the frequency domain observation signal is input to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other;
  • a speech enhancement signal corresponding to the speech signal is determined based on the reference speech signal and the noise signal.
  • an embodiment of the present application provides a computer-readable storage medium, which may be non-volatile or volatile and which stores a computer program, where the computer program, when executed by a processor, implements:
  • the voice signal is collected by the microphone array, and the voice signal is converted into a frequency-domain observation signal, wherein the voice signal is a time-domain observation signal;
  • the frequency domain observation signal is input to the first super-directional beamformer in the generalized sidelobe canceller to determine the reference speech signal output by the first super-directional beamformer;
  • the frequency domain observation signal is input to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other;
  • a speech enhancement signal corresponding to the speech signal is determined based on the reference speech signal and the noise signal.
  • The embodiments of the present application have the following beneficial effects: voice signals are collected through a microphone array and converted into frequency-domain observation signals, wherein the voice signals are time-domain observation signals;
  • the frequency domain observation signal is input to the first super-directional beamformer in the generalized sidelobe canceller to determine the reference speech signal output by the first super-directional beamformer;
  • the frequency domain observation signal is input to a second super-directional beamformer of a generalized sidelobe canceller to determine a noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other;
  • the speech enhancement signal corresponding to the speech signal is determined based on the reference speech signal and the noise signal.
  • This embodiment improves the generalized sidelobe canceller by combining its structure with super-directional beamforming, which has strong directivity and a narrow main lobe. The first super-directional beamformer in the generalized sidelobe canceller can therefore effectively enhance the speech signal from the target azimuth. At the same time, the blocking matrix part of the lower branch of the generalized sidelobe canceller is improved based on super-directional beamforming, so that it filters out noise interference more effectively. This improves the accuracy of the calculated reference speech signal and noise signal, and in turn the accuracy of the speech enhancement signal.
  • FIG. 1 is a schematic structural diagram of a speech enhancement device of a hardware operating environment involved in a solution according to an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of the first embodiment of the speech enhancement method of the present application.
  • FIG. 3 is a schematic flowchart of the second embodiment of the speech enhancement method of the present application.
  • the speech enhancement method, apparatus, device and computer-readable storage medium provided by this application can also be applied to the field of artificial intelligence.
  • FIG. 1 is a schematic structural diagram of a speech enhancement device of a hardware operating environment involved in the solution of the embodiment of the present application.
  • The voice enhancement device in the embodiment of this application may be a PC, or a smart phone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a portable computer, or another portable terminal device with a display function.
  • the speech enhancement device may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002.
  • the communication bus 1002 is used to realize the connection and communication between these components.
  • the user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface.
  • the network interface 1004 may include a standard wired interface and a wireless interface (eg, a WI-FI interface).
  • the memory 1005 may be high-speed RAM memory, or may be non-volatile memory, such as disk memory.
  • the memory 1005 may also be a storage device independent of the aforementioned processor 1001 .
  • the voice enhancement device may further include a camera, an RF (Radio Frequency, radio frequency) circuit, a sensor, an audio circuit, a WiFi module, and the like.
  • The sensors include, for example, light sensors, motion sensors and other sensors.
  • the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display screen according to the brightness of the ambient light, and the proximity sensor may turn off the display screen and/or the backlight when the voice enhancement device is moved to the ear.
  • the gravitational acceleration sensor can detect the magnitude of acceleration in all directions (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications that recognize the posture of the voice enhancement device (such as switching between horizontal and vertical screens, related games, and magnetometer attitude calibration) and for vibration recognition functions (such as a pedometer or tapping); of course, the voice enhancement device can also be equipped with other sensors such as a gyroscope, barometer, hygrometer, thermometer, or infrared sensor, which are not described further here.
  • the structure of the speech enhancement device shown in FIG. 1 does not constitute a limitation on the speech enhancement device, which may include more or fewer components than those shown in the figure, combine some components, or use a different arrangement of components.
  • the memory 1005 as a computer storage medium may include an operating system, a network communication module, a user interface module and a speech enhancement program.
  • the network interface 1004 is mainly used to connect the background server, and perform data communication with the background server;
  • the user interface 1003 is mainly used to connect the client (client), and perform data communication with the client;
  • the processor 1001 may be configured to call the speech enhancement program stored in the memory 1005, and execute the speech enhancement method provided by the embodiment of the present application.
  • FIG. 2 is a schematic flowchart of the first embodiment of the speech enhancement method of the present application.
  • Step S10 collecting a voice signal through a microphone array, and converting the voice signal into a frequency-domain observation signal, wherein the voice signal is a time-domain observation signal;
  • the speech enhancement method proposed in this application is applied to intelligent terminal equipment, and is based on the technology of microphone array and generalized sidelobe canceller.
  • the microphone array is composed of multiple microphone array elements.
  • the microphone array is used to collect the sound signal in the real environment, that is, the speech signal.
  • the generalized sidelobe canceller is an improved beamformer based on the super-directional beamforming technology.
  • the generalized sidelobe canceller includes an upper branch and a lower branch.
  • the upper branch of the generalized sidelobe canceller is used to pass and initially enhance the speech signal in the target direction.
  • the lower branch of the generalized sidelobe canceller is used to filter out the speech signal in the target direction and pass the noise signal in the speech signal.
  • the pre-processed time-domain observation signals are divided into frames, and after framing is completed, frame data corresponding to the speech signal is obtained; the frame data is then transformed into the frequency domain.
  • the frequency domain observation signal $X_i(e^{j\omega})$ is obtained, where i denotes the i-th frame.
  • X(k) is used to represent the frequency domain data of the kth frame.
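The framing and transform of step S10 can be sketched as follows. This is a minimal illustration in Python with NumPy, not the application's own implementation: the function name `stft_frames`, the Hann window, the frame length of 512 samples and the hop of 256 samples are all assumptions chosen for the example.

```python
import numpy as np

def stft_frames(x, frame_len=512, hop=256):
    """Frame a multichannel time-domain signal and FFT each frame.

    x: array of shape (num_mics, num_samples), the time-domain observation.
    Returns X of shape (num_frames, num_mics, frame_len // 2 + 1), so that
    X[k] holds the frequency-domain data of the k-th frame.
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    num_mics, num_samples = x.shape
    if num_samples < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)  # assumed analysis window
    num_frames = 1 + (num_samples - frame_len) // hop
    X = np.empty((num_frames, num_mics, frame_len // 2 + 1), dtype=complex)
    for k in range(num_frames):
        frame = x[:, k * hop : k * hop + frame_len] * window
        X[k] = np.fft.rfft(frame, axis=-1)  # one-sided spectrum per microphone
    return X
```

Each row of `X[k]` then corresponds to one microphone array element, and each column to one frequency point of the frame.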
  • Step S20 inputting the frequency domain observation signal to the first super-directional beamformer in the generalized sidelobe canceller to determine the reference speech signal output by the first super-directional beamformer;
  • After the frequency domain observation signal corresponding to the speech signal is obtained, it is input to the upper branch of the generalized sidelobe canceller, where the super-directional beamformer performs beamforming and outputs a reference speech signal that is initially enhanced based on the target direction.
  • The target direction is the direction in which the main lobe points, and the output corresponding to the main lobe is the initially enhanced reference voice signal.
  • the direction angle corresponding to the voice signal is the angle formed by the voice signal and the plane where the microphone array is located when the voice signal is received by the microphone array.
  • the generalized sidelobe canceller is an improved beamformer based on super-directional beamforming technology.
  • the generalized sidelobe canceller includes the first super-directional beamformer of the upper branch and the second super-directional beamformer of the lower branch, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other.
  • The first super-directional beamformer is used to enhance the voice signal passing through the upper branch of the generalized sidelobe canceller; owing to its strong directivity and narrow main lobe, it can effectively enhance the voice signal from the target azimuth, and the enhancement effect is good.
  • Step S30 inputting the frequency domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other;
  • the frequency domain observation signal is input to the second super-directional beamformer of the lower branch of the generalized sidelobe canceller, so that the second super-directional beamformer realizes the function of the blocking matrix of the lower branch; that is, the function of the blocking matrix of the lower branch of the generalized sidelobe canceller is performed by the second super-directional beamformer.
  • The direction of the interference noise is preset in the device, and the second super-directional beamformer calculates and outputs the noise signal based on this preset direction and the frequency domain observation signal. The output of the lower branch of the generalized sidelobe canceller thus blocks the speech signal, so that a signal part containing only interference noise is obtained.
  • Step S40 determining a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal.
  • the adaptive noise suppressor adopts the normalized least mean square (NLMS) error criterion: based on the reference voice signal and the noise signal, the voice signal collected by the microphone array is adaptively filtered, and the speech enhancement signal in the frequency domain is obtained after the adaptive filtering is completed.
  • the speech enhancement signal output by the adaptive noise suppressor is in the frequency domain, so it is subsequently transformed to obtain the speech enhancement signal in the time domain. Specifically, after the speech enhancement signal in the frequency domain is obtained, an inverse short-time discrete Fourier transform is performed on it to obtain and output the time domain enhancement signal.
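The return to the time domain described above can be illustrated with a weighted overlap-add inverse transform. The sketch below is only one possible realization: the name `istft_overlap_add`, the Hann synthesis window, and the assumption that the analysis stage used the same Hann window with a frame length of 512 and a hop of 256 are not taken from the application.

```python
import numpy as np

def istft_overlap_add(E, frame_len=512, hop=256):
    """Reconstruct a time-domain signal from one-sided frame spectra.

    E: array of shape (num_frames, frame_len // 2 + 1), the enhanced
       frequency-domain signal, assumed to come from Hann-windowed frames.
    """
    num_frames = E.shape[0]
    window = np.hanning(frame_len)  # assumed synthesis window
    out = np.zeros(hop * (num_frames - 1) + frame_len)
    norm = np.zeros_like(out)
    for k in range(num_frames):
        frame = np.fft.irfft(E[k], n=frame_len)  # back to a windowed time frame
        start = k * hop
        out[start : start + frame_len] += frame * window
        norm[start : start + frame_len] += window ** 2
    # Divide by the accumulated squared window to undo the double windowing;
    # samples covered only by window zeros (the very edges) stay at zero.
    return out / np.maximum(norm, 1e-12)
```

With this normalization, interior samples are recovered exactly when the spectra are left unmodified, which makes it easy to check the analysis and synthesis stages in isolation.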
  • The voice enhancement method proposed in this embodiment collects voice signals through a microphone array and converts the voice signals into frequency-domain observation signals, wherein the voice signals are time-domain observation signals; the frequency-domain observation signal is input to the first super-directional beamformer in the generalized sidelobe canceller to determine the reference speech signal output by the first super-directional beamformer; the frequency domain observation signal is input to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other; and a speech enhancement signal corresponding to the speech signal is determined based on the reference speech signal and the noise signal.
  • This embodiment improves the generalized sidelobe canceller by combining its structure with super-directional beamforming, which has strong directivity and a narrow main lobe. The first super-directional beamformer in the generalized sidelobe canceller can therefore effectively enhance the speech signal from the target azimuth. At the same time, the blocking matrix part of the lower branch of the generalized sidelobe canceller is improved based on super-directional beamforming, so that it filters out noise interference more effectively. This improves the accuracy of the calculated reference speech signal and noise signal, and in turn the accuracy of the speech enhancement signal.
  • step S20 includes:
  • Step S21 the frequency domain observation signal is input to the first super-directional beamformer of the generalized sidelobe canceller, so as to determine the steering vector of each frequency point of the frequency domain observation signal based on the direction angle corresponding to the voice signal and the array element spacing corresponding to the microphone array;
  • Step S22 determining the first projection matrix of each frequency point of the frequency domain observation signal based on the steering vector of each frequency point of the frequency domain observation signal;
  • Step S23 determining a reference speech signal output by the first super-directional beamformer based on the first projection matrix and the frequency domain observation signal.
  • After the frequency domain observation signal corresponding to the speech signal is obtained, it is input to the upper branch of the generalized sidelobe canceller, and the first super-directional beamformer of the upper branch calculates the steering vector of each frequency point of the frequency domain observation signal based on the direction angle corresponding to the speech signal and the array element spacing of the microphone array.
  • After obtaining the steering vector of each frequency point, the first super-directional beamformer calculates the noise cross-correlation coefficient matrix of each frequency point of the frequency-domain observation signal based on these steering vectors.
  • Then, based on the noise cross-correlation coefficient matrix of each frequency point, the first projection matrix of each frequency point of the frequency-domain observation signal is calculated.
  • After obtaining the first projection matrix of each frequency point, the first super-directional beamformer determines the reference speech signal output by the upper branch of the generalized sidelobe canceller based on the first projection matrix and the frequency domain observation signal.
  • the steering vector of each frequency point of the frequency domain observation signal is calculated.
  • the calculation formula for calculating the steering vector of each frequency point of the frequency domain observation signal is as follows:
  • $N_{fft}$ is the length of the fast Fourier transform
  • c is the speed of the signal, here the speed of sound.
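As an illustration of the quantities named above, the following sketch computes a far-field steering vector for a uniform linear array. It is a common textbook form and not necessarily the formula of the application: the function name `steering_vector`, the sampling rate parameter `fs`, the choice of measuring the direction angle from the array axis, and the default sound speed of 343 m/s are assumptions.

```python
import numpy as np

def steering_vector(theta, n, num_mics, spacing, fs, n_fft, c=343.0):
    """Far-field steering vector of a uniform linear array at frequency point n.

    theta:    direction angle in radians, measured from the array axis (assumed)
    n:        index of the frequency point
    spacing:  array element spacing in metres
    fs:       sampling rate in Hz (assumed parameter)
    n_fft:    length of the fast Fourier transform
    c:        speed of the signal, here the speed of sound in m/s
    """
    f_n = n * fs / n_fft                      # centre frequency of point n
    m = np.arange(num_mics)                   # array element indices 0..M-1
    delay = m * spacing * np.cos(theta) / c   # propagation delay per element
    return np.exp(-1j * 2.0 * np.pi * f_n * delay)
```

For a broadside source (`theta = np.pi / 2`) all delays vanish and the vector reduces to all ones, which is a quick sanity check on the angle convention.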
  • The calculation is performed frequency point by frequency point: the noise cross-correlation coefficient matrix Q of the nth frequency point is calculated based on the steering vector of each frequency point of the frequency-domain observation signal.
  • The formula for calculating the noise cross-correlation coefficient matrix of a frequency point is as follows:
  • i and j represent the i-th array element and the j-th array element of the microphone array, respectively.
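A frequently used model for the noise cross-correlation coefficient matrix in super-directional beamforming is the spherically isotropic (diffuse) noise field, in which the entry for array elements i and j depends only on their distance. The sketch below uses that model as an assumption; it is not confirmed to be the formula of the application, and the name `diffuse_coherence` and the parameter `fs` are hypothetical.

```python
import numpy as np

def diffuse_coherence(n, num_mics, spacing, fs, n_fft, c=343.0):
    """Noise coherence matrix Q of frequency point n for a uniform linear array.

    Assumes a diffuse noise field: Q[i, j] = sin(a) / a with
    a = 2 * pi * f_n * d_ij / c, where d_ij is the distance between
    array elements i and j.
    """
    f_n = n * fs / n_fft
    idx = np.arange(num_mics)
    dist = np.abs(idx[:, None] - idx[None, :]) * spacing  # d_ij in metres
    # np.sinc(x) is sin(pi * x) / (pi * x), hence the argument 2 * f_n * d / c.
    return np.sinc(2.0 * f_n * dist / c)
```

The matrix is real and symmetric with ones on the diagonal, and at frequency point 0 every entry equals one, which is why practical designs usually add a small diagonal load before inverting it.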
  • The projection matrix of frequency point n, that is, the first projection matrix of each frequency point of the frequency domain observation signal, is calculated based on the noise cross-correlation coefficient matrix of each frequency point.
  • the calculation formula is as follows:
  • represents the steering matrix of the nth frequency point with respect to the direction $\theta$.
  • The beam output signal of the upper branch, that is, the reference speech signal output by the upper branch of the generalized sidelobe canceller, is calculated based on the first projection matrix and the frequency domain observation signal.
  • the calculation formula of the reference speech signal is as follows:
  • Y(k,n) is the reference speech signal corresponding to the nth frequency point of the kth frame of the frequency domain observation signal.
  • The formulas above take a uniform linear microphone array as an example; the enhancement of the speech signal can also be accomplished with other array geometries, such as a uniform circular array.
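The upper-branch computation described above can be sketched with the classic super-directive (MVDR under a fixed noise model) weight vector, which here plays the role of the first projection matrix. This is an assumption about the form of the projection, not the application's stated formula; the names `superdirective_weights` and `beam_output` and the diagonal loading value are illustrative.

```python
import numpy as np

def superdirective_weights(d, Q, diag_load=1e-2):
    """Weights w = Q^{-1} d / (d^H Q^{-1} d) for one frequency point.

    d: steering vector toward the target direction, shape (num_mics,)
    Q: noise cross-correlation coefficient matrix, shape (num_mics, num_mics)
    diag_load: small diagonal loading (assumed) to keep Q well conditioned
    """
    Q_loaded = Q + diag_load * np.eye(len(d))
    Qinv_d = np.linalg.solve(Q_loaded, d)
    return Qinv_d / (d.conj() @ Qinv_d)

def beam_output(w, x):
    """Reference speech output Y = w^H x for one frequency point.

    x: observation of shape (num_mics,) or (num_mics, num_frames)
    """
    return w.conj() @ x

# Usage with a 4-element array, 4 cm spacing, 1 kHz, target at 60 degrees.
m = np.arange(4)
f, spacing, c = 1000.0, 0.04, 343.0
d = np.exp(-1j * 2 * np.pi * f * m * spacing * np.cos(np.pi / 3) / c)
Q = np.sinc(2 * f * np.abs(m[:, None] - m[None, :]) * spacing / c)
w = superdirective_weights(d, Q)
```

By construction the weights satisfy the distortionless constraint, so a plane wave arriving from the target direction passes with unit gain while diffuse noise is attenuated.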
  • the step of determining the first projection matrix of each frequency point of the frequency domain observation signal based on the steering vector of each frequency point of the frequency domain observation signal includes:
  • Step S221 based on the steering vector of each frequency point of the frequency-domain observation signal, calculate the noise cross-correlation coefficient matrix of each frequency point of the frequency-domain observation signal;
  • Step S222 Calculate the first projection matrix of each frequency point of the frequency domain observation signal based on the noise cross-correlation coefficient matrix of each frequency point.
  • The first super-directional beamformer of the generalized sidelobe canceller calculates the noise cross-correlation coefficient matrix of each frequency point of the frequency-domain observation signal based on the steering vector of each frequency point; then, based on the noise cross-correlation coefficient matrix of each frequency point, it calculates the first projection matrix of each frequency point, so that the reference speech signal output by the upper branch of the generalized sidelobe canceller can be determined based on the first projection matrix and the frequency domain observation signal.
  • For the example formulas for the noise cross-correlation coefficient matrix and for the first projection matrix of each frequency point, refer to the previous embodiment.
  • the step of inputting the frequency domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal includes:
  • Step S31 inputting the frequency domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller, to determine the second projection matrix of each frequency point of the frequency domain observation signal based on the noise direction vector;
  • Step S32 Determine the noise signal output by the second super-directional beamformer based on the second projection matrix and the frequency domain observation signal.
  • The frequency domain observation signal is input to the second super-directional beamformer of the lower branch of the generalized sidelobe canceller, so that the second super-directional beamformer implements the function of the blocking matrix of the lower branch.
  • The noise steering vector of each frequency point of the frequency-domain observation signal is calculated; then, based on the noise steering vector of each frequency point, the second projection matrix of each frequency point is calculated; finally, the noise signal is calculated and output based on the second projection matrix and the frequency-domain observation signal, so that the generalized sidelobe canceller obtains, through the second super-directional beamformer, the noise signal that remains after the reference speech signal has been blocked. The output of the lower branch of the generalized sidelobe canceller thus blocks the reference speech signal, yielding the signal part containing only interference noise, that is, the noise signal.
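One way to picture a lower-branch beamformer that passes the preset noise direction while blocking the target speech is a linearly constrained design with two constraints: unit gain toward the noise direction and a null toward the target direction. The sketch below follows that idea purely as an assumption; it is not the second projection matrix of the application, and the name `lower_branch_weights` and the loading value are hypothetical.

```python
import numpy as np

def lower_branch_weights(d_noise, d_target, Q, diag_load=1e-2):
    """Constrained weights w = Q^{-1} C (C^H Q^{-1} C)^{-1} g.

    C = [d_noise, d_target] stacks the two steering vectors as columns and
    g = [1, 0] asks for unit gain toward the noise direction and a null
    toward the target direction. The two vectors must differ.
    """
    num_mics = len(d_noise)
    Q_loaded = Q + diag_load * np.eye(num_mics)  # assumed diagonal loading
    C = np.column_stack([d_noise, d_target])
    g = np.array([1.0, 0.0])
    Qinv_C = np.linalg.solve(Q_loaded, C)
    return Qinv_C @ np.linalg.solve(C.conj().T @ Qinv_C, g)

# Usage: 4-element array, 4 cm spacing, 1 kHz, target at 60 and noise at 120 degrees.
m = np.arange(4)
f, spacing, c = 1000.0, 0.04, 343.0
steer = lambda theta: np.exp(-1j * 2 * np.pi * f * m * spacing * np.cos(theta) / c)
d_target, d_noise = steer(np.pi / 3), steer(2 * np.pi / 3)
Q = np.sinc(2 * f * np.abs(m[:, None] - m[None, :]) * spacing / c)
w_low = lower_branch_weights(d_noise, d_target, Q)
noise_out = w_low.conj() @ (0.8 * d_target + 0.5 * d_noise)  # close to 0.5
```

In this toy mixture the target component is removed and only the component arriving from the preset noise direction reaches the output, which matches the blocking role described for the lower branch.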
  • The steering vector of each frequency point of the frequency-domain observation signal is calculated; after obtaining the steering vector of each frequency point, the first super-directional beamformer calculates the noise cross-correlation coefficient matrix of each frequency point of the frequency-domain observation signal based on these steering vectors; then, based on the noise cross-correlation coefficient matrix of each frequency point, the first projection matrix of each frequency point is calculated; after obtaining the first projection matrix of each frequency point, the first super-directional beamformer determines the reference speech signal output by the upper branch of the generalized sidelobe canceller based on the first projection matrix and the frequency domain observation signal.
  • the step of determining the speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal includes:
  • Step S41: inputting the reference speech signal and the noise signal into an adaptive noise suppressor, so as to perform adaptive noise suppression on the frequency-domain observation signal corresponding to the speech signal based on the reference speech signal and the noise signal, and to obtain an error signal corresponding to the frequency-domain observation signal;
  • Step S42: inputting the error signal to the adaptive noise suppressor, optimizing the parameters of the adaptive noise suppressor using the normalized minimum mean square error criterion, and determining the speech enhancement signal corresponding to the speech signal after the optimization of the adaptive noise suppressor is completed.
  • after the upper branch of the generalized sidelobe canceller outputs the reference speech signal and the lower branch outputs the noise signal, both signals are input to the adaptive noise suppressor, which performs adaptive noise suppression on the frequency-domain observation signal corresponding to the speech signal according to the reference speech signal and the noise signal, so as to suppress the noise in the speech signal as far as possible and output a high-accuracy speech enhancement signal.
  • the reference speech signal output by the upper branch and the noise signal output by the lower branch are input into the adaptive noise suppressor, which first calculates the error signal based on the reference speech signal and the noise signal, wherein the error signal is the frequency-domain observation signal after noise suppression; in practice, however, this error signal is still a speech signal of low accuracy, and noise suppression must be applied repeatedly to obtain a signal of high accuracy.
  • the error signal is input to the adaptive noise suppressor, which uses the normalized minimum mean square error criterion to optimize its own parameters and, once the optimization is completed, outputs a high-accuracy speech enhancement signal.
  • the steps of performing adaptive noise suppression to obtain an error signal corresponding to the frequency domain observation signal include:
  • Step S411 inputting the reference speech signal and the noise signal into an adaptive noise suppressor to determine an adjustment signal based on the weight vector corresponding to the adaptive noise suppressor and the reference speech signal;
  • Step S412 Adjust the frequency-domain observation signal corresponding to the speech signal based on the adjustment signal, and determine an error signal corresponding to the frequency-domain observation signal after adjustment.
  • the adjustment signal is first calculated based on the weight vector corresponding to the adaptive noise suppressor and the reference speech signal, and the adaptive noise suppressor outputs the adjustment signal; after the adjustment signal is obtained, the frequency-domain observation signal is adjusted based on the adjustment signal, and the error signal corresponding to the adjusted frequency-domain observation signal is obtained.
  • the manner of adjusting the frequency-domain observation signal based on the adjustment signal may be to subtract the adjustment signal from the frequency-domain observation signal to obtain an error signal corresponding to the speech signal.
  • step S10 includes:
  • Step S11 collecting voice signals through a microphone array, and performing a frame division operation on the voice signals to obtain frame data corresponding to the voice signals;
  • Step S12 Perform short-time discrete Fourier transform on the frame data corresponding to the speech signal to obtain a frequency domain observation signal corresponding to the speech signal.
  • preprocessing operations such as framing are performed on the above time-domain observation signal, and the preprocessed time-domain observation signal is then processed frame by frame to obtain the frame data corresponding to the speech signal;
  • a short-time discrete Fourier transform is applied to obtain the frequency-domain observation signal, which can be expressed as $X_i(e^{j\omega})$, where $i$ denotes the $i$-th frame of data.
  • X(k) is used to represent the frequency domain data of the kth frame.
  • the frequency-domain observation signal is input to the first super-directional beamformer of the generalized sidelobe canceller, so that the steering vector of each frequency point of the frequency-domain observation signal is determined based on the direction angle corresponding to the speech signal and the array element spacing of the microphone array.
  • the reference speech signal output by the first super-directional beamformer is determined based on the first projection matrix and the frequency-domain observation signal.
  • super-directional beamforming has strong directivity and a narrow main lobe, and it is applied to the upper branch of the generalized sidelobe canceller.
  • the first super-directional beamformer in the generalized sidelobe canceller can effectively enhance the speech signal from the target direction, so that the reference speech signal is well enhanced.
  • an embodiment of the present application also proposes a voice enhancement device, where the voice enhancement device includes:
  • an acquisition module, configured to collect a speech signal through a microphone array and convert the speech signal into a frequency-domain observation signal, wherein the speech signal is a time-domain observation signal;
  • a first determination module configured to input the frequency domain observation signal to a first super-directional beamformer in the generalized sidelobe canceller, to determine a reference speech signal output by the first super-directional beamformer;
  • a second determination module, configured to input the frequency-domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller, so as to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other;
  • a third determining module configured to determine a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal.
  • the first determining module is also used for:
  • inputting the frequency-domain observation signal to the first super-directional beamformer of the generalized sidelobe canceller, so as to determine the steering vector of each frequency point of the frequency-domain observation signal based on the direction angle corresponding to the speech signal and the array element spacing of the microphone array;
  • a reference speech signal output by the first super-directional beamformer is determined based on the first projection matrix and the frequency domain observation signal.
  • the first determining module is also used for:
  • the first projection matrix of each frequency point of the frequency domain observation signal is calculated.
  • the second determining module is also used for:
  • a noise signal output by the second super-directional beamformer is determined based on the second projection matrix and the frequency domain observation signal.
  • the third determining module is also used for:
  • the error signal is input to the adaptive noise suppressor, the normalized minimum mean square error criterion is used to optimize the parameters of the adaptive noise suppressor, and the speech enhancement signal corresponding to the speech signal is determined after the optimization of the adaptive noise suppressor is completed.
  • the third determining module is also used for:
  • the frequency-domain observation signal corresponding to the speech signal is adjusted based on the adjustment signal, and the error signal corresponding to the adjusted frequency-domain observation signal is determined.
  • the acquisition module is also used for:
  • a short-time discrete Fourier transform is performed on the frame data corresponding to the speech signal to obtain a frequency domain observation signal corresponding to the speech signal.
  • Embodiments of the present application also provide a speech enhancement device, including a memory, a processor, and a speech enhancement program stored in the memory and executable on the processor, wherein the processor, when executing the speech enhancement program, implements:
  • the voice signal is collected by the microphone array, and the voice signal is converted into a frequency-domain observation signal, wherein the voice signal is a time-domain observation signal;
  • the frequency-domain observation signal is input to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other;
  • a speech enhancement signal corresponding to the speech signal is determined based on the reference speech signal and the noise signal.
  • an embodiment of the present application also proposes a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile, and a speech enhancement program is stored on the computer-readable storage medium.
  • when executed by a processor, the speech enhancement program implements:
  • the voice signal is collected by the microphone array, and the voice signal is converted into a frequency-domain observation signal, wherein the voice signal is a time-domain observation signal;
  • the frequency-domain observation signal is input to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other;
  • a speech enhancement signal corresponding to the speech signal is determined based on the reference speech signal and the noise signal.
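
The adaptive noise suppression steps listed above (S41/S42 and S411/S412) can be illustrated for a single frequency point with a normalized least-mean-square (NLMS) update, which is one common way to realize a normalized mean-square-error criterion. This is only a sketch under assumptions: it follows the conventional generalized sidelobe canceller arrangement, in which the weights are applied to the noise reference and the result is subtracted from the reference speech, and this may differ from the exact signal routing of the embodiment. The names `nlms_step` and `run_nlms`, the step size `mu`, and the regularizer `eps` are hypothetical and do not come from the application.

```python
def nlms_step(weights, noise_ref, speech_ref, mu=0.5, eps=1e-8):
    """One normalized LMS update for a single frequency point (illustrative only).

    weights, noise_ref: lists of complex numbers, one entry per noise channel.
    speech_ref: complex reference speech sample from the upper branch.
    Returns the updated weights and the error signal.
    """
    # Adjustment signal: y = w^H u.
    adjustment = sum(w.conjugate() * u for w, u in zip(weights, noise_ref))
    # Error signal: reference speech minus the estimated noise leakage.
    error = speech_ref - adjustment
    # Normalize the step size by the instantaneous noise-reference power.
    power = sum(abs(u) ** 2 for u in noise_ref) + eps
    new_weights = [w + mu * u * error.conjugate() / power
                   for w, u in zip(weights, noise_ref)]
    return new_weights, error


def run_nlms(noise_frames, speech_frames, num_taps, mu=0.5):
    """Run the update over successive frames; returns final weights and all errors."""
    weights = [0j] * num_taps
    errors = []
    for u, d in zip(noise_frames, speech_frames):
        weights, e = nlms_step(weights, u, d, mu)
        errors.append(e)
    return weights, errors
```

In this sketch the error signal doubles as the enhanced output for the frame: as the weights converge, the noise component that leaked into the reference speech is progressively cancelled, which matches the description of the error signal becoming more accurate over repeated suppression.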

Abstract

A speech enhancement method and apparatus, and a device and a computer-readable storage medium. The method comprises: collecting a speech signal by means of a microphone array, and converting the speech signal into a frequency-domain observation signal, wherein the speech signal is a time-domain observation signal (S10); inputting the frequency-domain observation signal into a first super-directivity beam former in a generalized sidelobe canceler, so as to determine a reference speech signal output by the first super-directivity beam former (S20); inputting the frequency-domain observation signal into a second super-directivity beam former in the generalized sidelobe canceler, so as to determine a noise signal corresponding to the speech signal (S30); and determining, on the basis of the reference speech signal and the noise signal, a speech enhancement signal corresponding to the speech signal (S40). By means of the method, a speech signal from a target orientation can be effectively enhanced, noise interference can be better removed, and the accuracy of a reference speech signal and a noise signal can be effectively improved, such that the accuracy of a speech enhancement signal can be further improved.

Description

Speech enhancement method, apparatus, device, and computer-readable storage medium
This application claims priority to Chinese patent application No. 202011297820.3, entitled "Speech enhancement method, apparatus, device, and computer-readable storage medium", filed with the Patent Office of the State Intellectual Property Office of the People's Republic of China on November 17, 2020, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the technical field of signal processing, and in particular to a speech enhancement method, apparatus, device, and computer-readable storage medium.
Background
Smart terminal devices such as smart TVs, smart speakers, smart vending machines, and smart ticket machines are increasingly widespread. With the rapid development of speech technology and hardware, voice interaction has become an important interface for intelligent human-computer interaction. In real environments, however, noise is ubiquitous. Efficient back-end computation and processing depends on picking up a clean target speech signal, so front-end speech signal enhancement is essential. Moreover, as speech recognition technology is more widely used, demand for speech signal processing technology grows with it. At present, in speech recognition or voiceprint recognition, the speech signal collected by the front-end device generally carries noise, including noise from the background environment and noise generated during recording by the front-end device. The inventors realized that such noisy speech signals reduce the accuracy of speech recognition. Speech enhancement (that is, noise reduction) therefore needs to be applied to the speech signal in order to extract as clean a speech signal as possible and make speech recognition more accurate. The speech signals currently extracted after speech enhancement have low accuracy, which hinders subsequent speech recognition.
The above content is only intended to aid understanding of the technical solutions of the present application and does not constitute an admission that it is prior art.
Technical problem
One purpose of the embodiments of the present application is to provide a speech enhancement method, apparatus, device, and computer-readable storage medium, aiming to solve the technical problem that the speech signal currently extracted after speech enhancement processing has low accuracy.
Technical solution
In a first aspect, an embodiment of the present application provides a speech enhancement method, wherein the speech enhancement method includes the following steps:
collecting a speech signal through a microphone array, and converting the speech signal into a frequency-domain observation signal, wherein the speech signal is a time-domain observation signal;
inputting the frequency-domain observation signal to a first super-directional beamformer in a generalized sidelobe canceller, to determine a reference speech signal output by the first super-directional beamformer;
inputting the frequency-domain observation signal to a second super-directional beamformer of the generalized sidelobe canceller, to determine a noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other;
determining a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal.
In a second aspect, an embodiment of the present application provides a speech enhancement apparatus, wherein the speech enhancement apparatus includes:
an acquisition module, configured to collect a speech signal through a microphone array and convert the speech signal into a frequency-domain observation signal, wherein the speech signal is a time-domain observation signal;
a first determination module, configured to input the frequency-domain observation signal to a first super-directional beamformer in a generalized sidelobe canceller, to determine a reference speech signal output by the first super-directional beamformer;
a second determination module, configured to input the frequency-domain observation signal to a second super-directional beamformer of the generalized sidelobe canceller, to determine a noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other;
a third determination module, configured to determine a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal.
In a third aspect, an embodiment of the present application provides a speech enhancement device, including a memory, a processor, and a speech enhancement program stored in the memory and executable on the processor, wherein the processor, when executing the speech enhancement program, implements:
collecting a speech signal through a microphone array, and converting the speech signal into a frequency-domain observation signal, wherein the speech signal is a time-domain observation signal;
inputting the frequency-domain observation signal to a first super-directional beamformer in a generalized sidelobe canceller, to determine a reference speech signal output by the first super-directional beamformer;
inputting the frequency-domain observation signal to a second super-directional beamformer of the generalized sidelobe canceller, to determine a noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other;
determining a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, which may be non-volatile or volatile and which stores a computer program, wherein the computer program, when executed by a processor, implements:
collecting a speech signal through a microphone array, and converting the speech signal into a frequency-domain observation signal, wherein the speech signal is a time-domain observation signal;
inputting the frequency-domain observation signal to a first super-directional beamformer in a generalized sidelobe canceller, to determine a reference speech signal output by the first super-directional beamformer;
inputting the frequency-domain observation signal to a second super-directional beamformer of the generalized sidelobe canceller, to determine a noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other;
determining a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal.
Beneficial effects
Compared with the prior art, the embodiments of the present application have the following beneficial effects. A speech signal is collected through a microphone array and converted into a frequency-domain observation signal, wherein the speech signal is a time-domain observation signal; the frequency-domain observation signal is input to the first super-directional beamformer in the generalized sidelobe canceller to determine the reference speech signal output by the first super-directional beamformer; the frequency-domain observation signal is input to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other; and the speech enhancement signal corresponding to the speech signal is determined based on the reference speech signal and the noise signal. By combining the generalized sidelobe canceller structure with super-directional beamforming, this embodiment uses the strong directivity and narrow main lobe of super-directional beamforming to improve on the generalized sidelobe canceller. The first super-directional beamformer in the generalized sidelobe canceller can effectively enhance the speech signal from the target direction. At the same time, the blocking matrix part of the lower branch of the generalized sidelobe canceller is improved on the basis of super-directional beamforming, so that noise interference is filtered out more effectively. The accuracy of the calculated reference speech signal and noise signal is thereby improved, which further improves the accuracy of the speech enhancement signal.
Description of drawings
To explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a speech enhancement device in the hardware operating environment involved in the solutions of the embodiments of the present application;
FIG. 2 is a schematic flowchart of the first embodiment of the speech enhancement method of the present application;
FIG. 3 is a schematic flowchart of the second embodiment of the speech enhancement method of the present application.
The realization of the purpose, the functional characteristics, and the advantages of the present application will be further described with reference to the drawings in conjunction with the embodiments.
Embodiments of the present invention
It should be understood that the specific embodiments described herein are only intended to explain the present application and are not intended to limit it.
The speech enhancement method, apparatus, device, and computer-readable storage medium provided by the present application can also be applied in the field of artificial intelligence.
As shown in FIG. 1, FIG. 1 is a schematic structural diagram of a speech enhancement device in the hardware operating environment involved in the solutions of the embodiments of the present application.
The speech enhancement device of the embodiments of the present application may be a PC, or a portable terminal device with a display function such as a smartphone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, or a portable computer.
As shown in FIG. 1, the speech enhancement device may include a processor 1001 (for example, a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to realize connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be high-speed RAM or non-volatile memory, such as disk storage. The memory 1005 may optionally also be a storage device independent of the aforementioned processor 1001.
Optionally, the speech enhancement device may further include a camera, an RF (radio frequency) circuit, sensors, an audio circuit, a WiFi module, and the like. The sensors may include light sensors, motion sensors, and other sensors. Specifically, the light sensors may include an ambient light sensor, which can adjust the brightness of the display according to the ambient light, and a proximity sensor, which can turn off the display and/or backlight when the speech enhancement device is moved to the ear. As one kind of motion sensor, a gravitational acceleration sensor can detect the magnitude of acceleration in each direction (generally three axes) and, when stationary, the magnitude and direction of gravity; it can be used in applications that recognize the posture of the speech enhancement device (such as switching between landscape and portrait, related games, and magnetometer attitude calibration) and in vibration-recognition functions (such as a pedometer or tap detection). The speech enhancement device may also be equipped with other sensors such as a gyroscope, barometer, hygrometer, thermometer, and infrared sensor, which are not described further here.
Those skilled in the art will understand that the structure shown in FIG. 1 does not limit the speech enhancement device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
As shown in FIG. 1, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a speech enhancement program.
In the speech enhancement device shown in FIG. 1, the network interface 1004 is mainly used to connect to a background server and exchange data with it; the user interface 1003 is mainly used to connect to a client (user side) and exchange data with it; and the processor 1001 may be used to call the speech enhancement program stored in the memory 1005 and execute the speech enhancement method provided by the embodiments of the present application.
本申请还提供一种语音增强方法,参照图2,图2为本申请语音增强方法第一实施例的流程示意图。The present application also provides a speech enhancement method. Referring to FIG. 2 , FIG. 2 is a schematic flowchart of the first embodiment of the speech enhancement method of the present application.
Step S10, collecting a speech signal through a microphone array, and converting the speech signal into a frequency-domain observation signal, wherein the speech signal is a time-domain observation signal;
The speech enhancement method proposed in the present application is applied to an intelligent terminal device and is based on a microphone array and a generalized sidelobe canceller. The microphone array consists of multiple microphone elements and is used to collect the sound signal in a real environment, that is, the speech signal. The generalized sidelobe canceller is a beamformer improved on the basis of super-directional beamforming, and it includes an upper branch and a lower branch: the upper branch passes and preliminarily enhances the speech signal from the target direction, while the lower branch filters out the speech signal from the target direction and passes the noise contained in the speech signal. It can be understood that, because the elements of a microphone array are located at different positions, the speech signals received by the elements differ by certain time delays, and this information can be used to determine the direction and position of the sound source.
In this embodiment, before the speech enhancement process is performed, an M-element microphone array collects the speech signal in a real environment. The speech signal collected by the microphone array is the time-domain observation signal x(t) = [x_1(t), x_2(t), ..., x_M(t)]. Preprocessing such as framing is applied to the time-domain observation signal, the preprocessed signal is then processed frame by frame to obtain the frame data corresponding to the speech signal, and a short-time discrete Fourier transform is applied to the frame data to obtain the frequency-domain observation signal X_i(e^{jω}), where i denotes the i-th frame. For simplicity, X(k) is used below to denote the frequency-domain data of the k-th frame.
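As a non-limiting illustration, the framing and short-time discrete Fourier transform of step S10 can be sketched as follows. The function names, the frame length, the hop size, and the use of a Hann window are assumptions made for this sketch; the description above only requires framing followed by a per-frame transform.

```python
import cmath
import math

def frame_signal(x, frame_len, hop):
    """Split a 1-D sequence into overlapping frames; a trailing partial frame is dropped."""
    return [x[s:s + frame_len] for s in range(0, len(x) - frame_len + 1, hop)]

def hann(n_len):
    """Periodic Hann window (an assumed choice of analysis window)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n_len) for i in range(n_len)]

def dft(frame):
    """Direct O(N^2) discrete Fourier transform of one frame."""
    n_len = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * n * t / n_len)
                for t in range(n_len))
            for n in range(n_len)]

def stft_multichannel(channels, frame_len=8, hop=4):
    """Return X[k][n][m]: frame k, frequency point n, microphone m."""
    win = hann(frame_len)
    per_mic = []
    for x in channels:
        frames = frame_signal(x, frame_len, hop)
        per_mic.append([dft([w * v for w, v in zip(win, fr)]) for fr in frames])
    n_frames = len(per_mic[0])
    return [[[per_mic[m][k][n] for m in range(len(channels))]
             for n in range(frame_len)]
            for k in range(n_frames)]
```

Indexing the result as X[k][n] gives the M-element observation vector of frame k at frequency point n, which is the quantity the beamformers of the two branches operate on. A practical implementation would use an FFT rather than the direct sum shown here.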
Step S20, inputting the frequency-domain observation signal into a first super-directional beamformer in a generalized sidelobe canceller to determine a reference speech signal output by the first super-directional beamformer;
In this embodiment, after the frequency-domain observation signal corresponding to the speech signal is obtained, it is input to the upper branch of the generalized sidelobe canceller, where a super-directional beamformer performs beamforming and outputs a preliminarily enhanced speech signal for the target direction, that is, the reference speech signal. The target direction is the main-lobe direction, and the output corresponding to the main lobe is the preliminarily enhanced reference speech signal. The direction angle corresponding to the speech signal is the angle formed between the speech signal and the plane of the microphone array when the array receives the signal. The generalized sidelobe canceller is a beamformer improved on the basis of super-directional beamforming; it includes the first super-directional beamformer in the upper branch and the second super-directional beamformer in the lower branch, where the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are mutually orthogonal. The first super-directional beamformer enhances the speech signal passing through the upper branch. Because it has strong directivity and a narrow main lobe, it can effectively enhance the speech signal from the target direction.
Step S30, inputting the frequency-domain observation signal into a second super-directional beamformer of the generalized sidelobe canceller to determine a noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are mutually orthogonal;
In this embodiment, after the frequency-domain observation signal corresponding to the speech signal is obtained, it is input to the second super-directional beamformer in the lower branch of the generalized sidelobe canceller, so that the second super-directional beamformer performs the function of the blocking matrix of the lower branch. The direction of the interference noise is preset in the second super-directional beamformer, which computes and outputs the noise signal based on this preset direction and the frequency-domain observation signal. It can be understood that the output of the lower branch blocks the speech signal, leaving only the part of the signal that contains interference noise.
Step S40, determining a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal.
In this embodiment, after the reference speech signal is output by the upper branch of the generalized sidelobe canceller and the noise signal is output by the lower branch, both are input to an adaptive noise suppressor. The adaptive noise suppressor uses the normalized least mean square (NLMS) criterion to adaptively filter the speech signal collected by the microphone array based on the reference speech signal and the noise signal, and yields a speech enhancement signal in the frequency domain once the adaptive filtering is complete. Because this output is in the frequency domain, an inverse transform is required to obtain the time-domain speech enhancement signal. Specifically, an inverse short-time discrete Fourier transform is applied to the frequency-domain speech enhancement signal to obtain the time-domain enhanced signal, which is then output.
The speech enhancement method proposed in this embodiment collects a speech signal through a microphone array and converts it into a frequency-domain observation signal, where the speech signal is a time-domain observation signal; inputs the frequency-domain observation signal into the first super-directional beamformer in the generalized sidelobe canceller to determine the reference speech signal output by the first super-directional beamformer; inputs the frequency-domain observation signal into the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, where the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are mutually orthogonal; and determines the speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal. By combining the generalized sidelobe canceller structure with super-directional beamforming, this embodiment exploits the strong directivity and narrow main lobe of super-directional beamforming to improve on the generalized sidelobe canceller: the first super-directional beamformer effectively enhances the speech signal from the target direction, and the blocking-matrix part of the lower branch, likewise improved with super-directional beamforming, filters out noise interference more effectively. This improves the accuracy of the computed reference speech signal and noise signal, and in turn the accuracy of the speech enhancement signal.
Based on the first embodiment, a second embodiment of the speech enhancement method of the present application is proposed. Referring to FIG. 3, in this embodiment, step S20 includes:
Step S21, inputting the frequency-domain observation signal into the first super-directional beamformer of the generalized sidelobe canceller, to determine a steering vector of each frequency point of the frequency-domain observation signal based on the direction angle corresponding to the speech signal and the element spacing of the microphone array;
Step S22, determining a first projection matrix of each frequency point of the frequency-domain observation signal based on the steering vector of each frequency point of the frequency-domain observation signal;
Step S23, determining the reference speech signal output by the first super-directional beamformer based on the first projection matrix and the frequency-domain observation signal.
In this embodiment, after the frequency-domain observation signal corresponding to the speech signal is obtained, it is input to the upper branch of the generalized sidelobe canceller. The first super-directional beamformer in the upper branch computes the steering vector of each frequency point of the frequency-domain observation signal from the direction angle corresponding to the speech signal and the element spacing of the microphone array. From these steering vectors it computes the noise cross-correlation coefficient matrix of each frequency point, and from those matrices it computes the first projection matrix of each frequency point. Finally, it determines the reference speech signal output by the upper branch of the generalized sidelobe canceller based on the first projection matrix and the frequency-domain observation signal.
Specifically, let the direction angle be θ and the element spacing be d, and take the first microphone as the reference element. For the n-th frequency point of the data of the m-th element, the steering vector of each frequency point of the frequency-domain observation signal is computed from the direction angle corresponding to the speech signal and the element spacing of the microphone array by the following formula:
Figure PCTCN2021127260-appb-000001
where f is the sampling rate, N_fft is the length of the fast Fourier transform, and c is the propagation speed of the signal, here the speed of sound.
Next, the computation proceeds frequency point by frequency point. The noise cross-correlation coefficient matrix Q of the n-th frequency point is computed from the steering vectors of the frequency points of the frequency-domain observation signal by the following formulas:
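As a non-limiting illustration, a steering vector for a uniform linear array can be sketched as follows. The exact expression of the formula referenced above is not reproduced in this text, so the sketch assumes the common far-field form α_m(θ, n) = exp(-j·2π·f_n·(m-1)·d·cos θ / c) with f_n = n·f / N_fft, and assumes that θ is measured from the array axis. The function and parameter names are hypothetical.

```python
import cmath
import math

def steering_vector(theta, n, num_mics, d, fs, n_fft, c=343.0):
    """Far-field steering vector of a uniform linear array (assumed form).

    theta    : direction angle in radians, measured from the array axis (assumption)
    n        : index of the frequency point
    num_mics : number of array elements M
    d        : element spacing in metres
    fs       : sampling rate f in Hz
    n_fft    : transform length N_fft
    c        : propagation speed, here the speed of sound in m/s
    """
    f_n = n * fs / n_fft  # centre frequency of frequency point n
    # Element index m starts at 0, so the first microphone is the phase reference.
    return [cmath.exp(-2j * math.pi * f_n * m * d * math.cos(theta) / c)
            for m in range(num_mics)]
```

With the first microphone as the reference element, the first entry is always 1, and for a source broadside to the array (θ = π/2) every entry is approximately 1 because the wavefront reaches all elements at the same time.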
Figure PCTCN2021127260-appb-000002
Figure PCTCN2021127260-appb-000003
where i and j denote the i-th and j-th elements of the microphone array, respectively.
The projection matrix of frequency point n, that is, the first projection matrix of each frequency point of the frequency-domain observation signal, is then computed from the noise cross-correlation coefficient matrix of each frequency point by the following formula:
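As a non-limiting illustration, a noise cross-correlation coefficient matrix can be sketched as follows. The formulas referenced above are not reproduced in this text, so the sketch substitutes the spherically isotropic (diffuse) noise model that is commonly used with super-directional beamformers, Q_ij = sin(x)/x with x = 2π·f_n·|i-j|·d / c. This model, the uniform linear geometry, and all names are assumptions of the sketch.

```python
import math

def diffuse_coherence(n, num_mics, d, fs, n_fft, c=343.0):
    """Noise coherence matrix Q of frequency point n under an assumed diffuse noise field."""
    f_n = n * fs / n_fft  # centre frequency of frequency point n
    q = []
    for i in range(num_mics):
        row = []
        for j in range(num_mics):
            x = 2 * math.pi * f_n * abs(i - j) * d / c
            # sin(x)/x tends to 1 as x tends to 0 (same element, or zero frequency)
            row.append(1.0 if x == 0 else math.sin(x) / x)
        q.append(row)
    return q
```

The resulting matrix is real and symmetric with ones on the diagonal. At low frequencies its off-diagonal entries approach 1, which makes the matrix close to singular; this is why implementations of this model usually regularize it before inversion.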
Figure PCTCN2021127260-appb-000004
where α denotes the steering vector of the n-th frequency point with respect to the direction θ.
Finally, the beam output signal of the upper branch, that is, the reference speech signal output by the upper branch of the generalized sidelobe canceller, is computed from the first projection matrix and the frequency-domain observation signal by the following formula:
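As a non-limiting illustration, the weights of a super-directional beamformer at one frequency point can be sketched as follows. The formula referenced above is not reproduced in this text, so the sketch assumes the standard minimum-variance distortionless form W = Q⁻¹α / (αᴴQ⁻¹α). The diagonal loading term and all function names are assumptions added for numerical robustness and illustration.

```python
def solve(a_mat, b):
    """Solve a_mat @ x = b by Gaussian elimination with partial pivoting (complex-safe)."""
    n = len(b)
    m = [list(map(complex, a_mat[i])) + [complex(b[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0j] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def superdirective_weights(q, alpha, loading=1e-3):
    """W = Q^-1 alpha / (alpha^H Q^-1 alpha), with assumed diagonal loading on Q."""
    n = len(alpha)
    q_reg = [[q[i][j] + (loading if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    q_inv_alpha = solve(q_reg, alpha)
    denom = sum(a.conjugate() * v for a, v in zip(alpha, q_inv_alpha))
    return [v / denom for v in q_inv_alpha]
```

By construction the weights satisfy Wᴴα = 1, so a signal arriving from the target direction passes undistorted while noise matching the model Q is minimized. When Q is the identity matrix, the weights reduce to α / M, that is, a delay-and-sum beamformer.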
Y(k,n) = W(θ,n)^H X(k,n)
where Y(k,n) is the reference speech signal corresponding to the n-th frequency point of the k-th frame of the frequency-domain observation signal.
Further, the formulas above take a uniform linear microphone array as an example. Depending on actual requirements, other array geometries such as a uniform circular array may also be used to enhance the speech signal.
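As a non-limiting illustration, the output formula Y(k,n) = W(θ,n)^H X(k,n) can be applied to every frame and frequency point as follows. The data layout (nested lists indexed as W[n][m] and X[k][n][m]) and the function name are assumptions of the sketch.

```python
def beamform(weights, observations):
    """Apply Y(k, n) = W(n)^H X(k, n) for every frame k and frequency point n.

    weights[n][m]         : weight of microphone m at frequency point n
    observations[k][n][m] : frequency-domain observation of frame k, point n, microphone m
    Returns y[k][n], one complex value per frame and frequency point.
    """
    return [[sum(w.conjugate() * x for w, x in zip(weights[n], frame[n]))
             for n in range(len(frame))]
            for frame in observations]
```

The Hermitian transpose appears as the complex conjugate of each weight, so the M microphone values at a frequency point collapse into a single complex value of the reference speech signal.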
Further, the step of determining the first projection matrix of each frequency point of the frequency-domain observation signal based on the steering vector of each frequency point of the frequency-domain observation signal includes:
Step S221, computing the noise cross-correlation coefficient matrix of each frequency point of the frequency-domain observation signal based on the steering vector of each frequency point of the frequency-domain observation signal;
Step S222, computing the first projection matrix of each frequency point of the frequency-domain observation signal based on the noise cross-correlation coefficient matrix of each frequency point.
In this embodiment, after the steering vector of each frequency point of the frequency-domain observation signal is obtained, the first super-directional beamformer of the generalized sidelobe canceller computes the noise cross-correlation coefficient matrix of each frequency point from these steering vectors, and then computes the first projection matrix of each frequency point from those matrices, so that the reference speech signal output by the upper branch of the generalized sidelobe canceller can be determined based on the first projection matrix and the frequency-domain observation signal. For the example formulas used to compute the noise cross-correlation coefficient matrix and the first projection matrix, refer to the preceding embodiment.
Further, the step of inputting the frequency-domain observation signal into the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal includes:
Step S31, inputting the frequency-domain observation signal into the second super-directional beamformer of the generalized sidelobe canceller, to determine a second projection matrix of each frequency point of the frequency-domain observation signal based on the noise direction vector;
Step S32, determining the noise signal output by the second super-directional beamformer based on the second projection matrix and the frequency-domain observation signal.
In this embodiment, after the frequency-domain observation signal corresponding to the speech signal is obtained, it is input to the second super-directional beamformer in the lower branch of the generalized sidelobe canceller, so that the second super-directional beamformer performs the function of the blocking matrix of the lower branch. Specifically, the noise steering vector of each frequency point of the frequency-domain observation signal is first computed from the preset direction angle of the interference noise and the element spacing of the microphone array; the second projection matrix of each frequency point is then computed from the noise steering vectors; finally, the noise signal is computed from the second projection matrix and the frequency-domain observation signal and output, so that the generalized sidelobe canceller obtains the noise signal that remains after the second super-directional beamformer blocks the reference speech signal. It can be understood that the output of the lower branch blocks the reference speech signal, leaving only the part of the signal that contains interference noise, that is, the noise signal.
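As a non-limiting illustration of the blocking behaviour described above, the following sketch builds lower-branch weights that point toward the preset noise direction while placing a null on the target direction. It does not reproduce the second projection matrix of this embodiment, whose formula is not given here; instead it uses a simple orthogonal projection, w = a_noise − α(αᴴa_noise)/(αᴴα), as an assumed stand-in. All names are hypothetical.

```python
def block_target(a_noise, a_target):
    """Remove from the noise steering vector its component along the target steering vector."""
    num = sum(t.conjugate() * v for t, v in zip(a_target, a_noise))  # a_target^H a_noise
    den = sum(abs(t) ** 2 for t in a_target)                         # a_target^H a_target
    return [v - t * num / den for v, t in zip(a_noise, a_target)]

def apply_weights(w, x):
    """Beamformer output w^H x for one frequency point."""
    return sum(wi.conjugate() * xi for wi, xi in zip(w, x))
```

Because the resulting weights are orthogonal to the target steering vector, any component of the observation that arrives from the target direction is cancelled at the output, while a component from the preset noise direction still passes. This is the property that lets the lower branch serve as a noise reference.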
The steering vector of each frequency point of the frequency-domain observation signal is computed from the direction angle corresponding to the speech signal and the element spacing of the microphone array. After the steering vectors are obtained, the first super-directional beamformer computes the noise cross-correlation coefficient matrix of each frequency point from them, and then computes the first projection matrix of each frequency point from those matrices. After the first projection matrix of each frequency point is obtained, the first super-directional beamformer determines the reference speech signal output by the upper branch of the generalized sidelobe canceller based on the first projection matrix and the frequency-domain observation signal.
Further, the step of determining the speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal includes:
Step S41, inputting the reference speech signal and the noise signal into an adaptive noise suppressor, to perform adaptive noise suppression on the frequency-domain observation signal corresponding to the speech signal based on the reference speech signal and the noise signal, and obtain an error signal corresponding to the frequency-domain observation signal;
Step S42, inputting the error signal into the adaptive noise suppressor, optimizing the parameters of the adaptive noise suppressor using the normalized least mean square (NLMS) criterion, and determining the speech enhancement signal corresponding to the speech signal after the optimization of the adaptive noise suppressor is complete.
In this embodiment, after the reference speech signal is output by the upper branch of the generalized sidelobe canceller and the noise signal by the lower branch, both are input to the adaptive noise suppressor, which performs adaptive noise suppression on the frequency-domain observation signal corresponding to the speech signal according to the reference speech signal and the noise signal. This suppresses the noise in the speech signal as far as possible, so that the adaptive noise suppressor outputs a high-accuracy speech enhancement signal. The adaptive noise suppressor first computes an error signal from the reference speech signal and the noise signal. The error signal is the frequency-domain observation signal after noise suppression, but at this stage it is still a low-accuracy speech signal; several rounds of suppression are needed to obtain a high-accuracy signal. The error signal is therefore fed back to the adaptive noise suppressor, which optimizes its parameters using the normalized least mean square (NLMS) criterion and, once the optimization is complete, outputs the high-accuracy speech enhancement signal.
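As a non-limiting illustration, a single-tap complex NLMS update for one frequency point can be sketched as follows. The generic names u (filter input) and d (primary signal), the step size mu, and the regularization constant eps are assumptions of the sketch; which signals of the generalized sidelobe canceller are assigned to u and d is left to the caller.

```python
def nlms_step(w, u, d, mu=0.5, eps=1e-8):
    """One normalized LMS update for a single complex tap.

    w : current weight
    u : filter input for this frame
    d : primary (desired) signal for this frame
    Returns the updated weight and the error e = d - conj(w) * u.
    """
    y = w.conjugate() * u                  # filter output
    e = d - y                              # error signal
    w_new = w + mu * u * e.conjugate() / (abs(u) ** 2 + eps)  # normalized step
    return w_new, e

def run_nlms(inputs, desired, mu=0.5):
    """Run the update over a sequence of frames, starting from a zero weight."""
    w = 0j
    errors = []
    for u, d in zip(inputs, desired):
        w, e = nlms_step(w, u, d, mu)
        errors.append(e)
    return w, errors
```

Dividing the step by the input power makes the convergence rate independent of the signal level, which is the reason for preferring the normalized criterion when the level of the speech and noise varies from frame to frame. In the sketch, the error shrinks over successive frames as the weight converges, mirroring the repeated suppression described above.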
Further, the step of inputting the reference speech signal and the noise signal into the adaptive noise suppressor, to perform adaptive noise suppression on the frequency-domain observation signal corresponding to the speech signal based on the reference speech signal and the noise signal and obtain the error signal corresponding to the frequency-domain observation signal, includes:
Step S411, inputting the reference speech signal and the noise signal into the adaptive noise suppressor, to determine an adjustment signal based on the weight vector corresponding to the adaptive noise suppressor and the reference speech signal;
Step S412, adjusting the frequency-domain observation signal corresponding to the speech signal based on the adjustment signal, and determining the error signal corresponding to the adjusted frequency-domain observation signal.
In this embodiment, after the reference speech signal is output by the upper branch of the generalized sidelobe canceller and the noise signal by the lower branch, both are input to the adaptive noise suppressor, which performs adaptive noise suppression on the frequency-domain observation signal corresponding to the speech signal according to the reference speech signal and the noise signal. This suppresses the noise in the speech signal as far as possible, so that the adaptive noise suppressor outputs a high-accuracy speech enhancement signal. Specifically, the adjustment signal is first computed from the weight vector corresponding to the adaptive noise suppressor and the reference speech signal, and is output by the adaptive noise suppressor. The frequency-domain observation signal is then adjusted based on the adjustment signal to obtain the error signal. One way to perform this adjustment is to subtract the adjustment signal from the frequency-domain observation signal, which yields the error signal corresponding to the speech signal.
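As a non-limiting illustration of steps S411 and S412, the adjustment signal and the error signal of one frame can be sketched as follows. The sketch assumes one complex weight per frequency point and assumes that the frequency-domain observation of a single reference microphone is used as the signal to be adjusted; neither detail is specified above, and the names are hypothetical.

```python
def adjustment_and_error(weights, y_ref, x_obs):
    """Compute the adjustment signal and the error signal for one frame.

    weights[n] : weight of the adaptive noise suppressor at frequency point n
    y_ref[n]   : reference speech signal at frequency point n
    x_obs[n]   : frequency-domain observation at frequency point n (assumed: one microphone)
    """
    adjustment = [w.conjugate() * y for w, y in zip(weights, y_ref)]
    error = [x - a for x, a in zip(x_obs, adjustment)]  # observation minus adjustment
    return adjustment, error
```

The returned error signal is what step S42 feeds back to the adaptive noise suppressor in order to update its weights.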
Further, step S10 includes:
Step S11, collecting a speech signal through the microphone array, and performing a framing operation on the speech signal to obtain frame data corresponding to the speech signal;
Step S12, performing a short-time discrete Fourier transform on the frame data corresponding to the speech signal to obtain the frequency-domain observation signal corresponding to the speech signal.
In this embodiment, before the speech enhancement process is performed, an M-element microphone array collects the speech signal in a real environment. The speech signal collected by the microphone array is the time-domain observation signal and can be expressed as x(t) = [x_1(t), x_2(t), ..., x_M(t)]. Preprocessing such as framing is applied to the time-domain observation signal, the preprocessed signal is then processed frame by frame to obtain the frame data corresponding to the speech signal, and a short-time discrete Fourier transform is applied to the frame data to obtain the frequency-domain observation signal, which can be expressed as X_i(e^{jω}), where i denotes the i-th frame. For simplicity, X(k) is used below to denote the frequency-domain data of the k-th frame.
The speech enhancement method proposed in this embodiment inputs the frequency-domain observation signal into the first super-directional beamformer of the generalized sidelobe canceller, to determine the steering vector of each frequency point of the frequency-domain observation signal based on the direction angle corresponding to the speech signal and the element spacing of the microphone array; determines the first projection matrix of each frequency point based on these steering vectors; and determines the reference speech signal output by the first super-directional beamformer based on the first projection matrix and the frequency-domain observation signal. By combining the generalized sidelobe canceller structure with super-directional beamforming, and applying super-directional beamforming in the upper branch, this embodiment exploits the strong directivity and narrow main lobe of super-directional beamforming so that the first super-directional beamformer effectively enhances the speech signal from the target direction and improves the quality of the reference speech signal.
In addition, an embodiment of the present application further provides a speech enhancement apparatus, the speech enhancement apparatus comprising:
an acquisition module, configured to collect a speech signal through a microphone array and convert the speech signal into a frequency-domain observation signal, wherein the speech signal is a time-domain observation signal;
a first determination module, configured to input the frequency-domain observation signal to a first super-directional beamformer in a generalized sidelobe canceller to determine a reference speech signal output by the first super-directional beamformer;
a second determination module, configured to input the frequency-domain observation signal to a second super-directional beamformer of the generalized sidelobe canceller to determine a noise signal corresponding to the speech signal, wherein a constraint matrix corresponding to the second super-directional beamformer and a blocking matrix corresponding to the first super-directional beamformer are mutually orthogonal;
a third determination module, configured to determine a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal.
Further, the first determination module is also configured to:
input the frequency-domain observation signal to the first super-directional beamformer of the generalized sidelobe canceller to determine a steering vector for each frequency bin of the frequency-domain observation signal based on the direction angle corresponding to the speech signal and the element spacing corresponding to the microphone array;
determine a first projection matrix for each frequency bin of the frequency-domain observation signal based on the steering vector of each frequency bin;
determine the reference speech signal output by the first super-directional beamformer based on the first projection matrix and the frequency-domain observation signal.
Further, the first determination module is also configured to:
calculate a noise cross-correlation coefficient matrix for each frequency bin of the frequency-domain observation signal based on the steering vector of each frequency bin;
calculate the first projection matrix for each frequency bin of the frequency-domain observation signal based on the noise cross-correlation coefficient matrix of each frequency bin.
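The two calculations above can be illustrated with a textbook super-directional design. The formulas used here are assumptions, since this section does not give the applicant's expressions: the noise coherence matrix follows the common diffuse-field model Γ_ij = sin(x)/x with x = 2πf·d_ij/c, a small diagonal loading term is added for numerical stability, and the weights are w = Γ⁻¹d / (dᴴΓ⁻¹d). The function names and the uniform linear array geometry are hypothetical.

```python
import cmath
import math

C = 343.0  # assumed speed of sound, m/s


def diffuse_coherence(freq_hz, spacing_m, n_mics, loading=1e-2):
    """Gamma[i][j] = sin(x)/x with x = 2*pi*f*|i-j|*d/c, plus diagonal loading (assumed model)."""
    gamma = []
    for i in range(n_mics):
        row = []
        for j in range(n_mics):
            x = 2 * math.pi * freq_hz * abs(i - j) * spacing_m / C
            value = 1.0 if x == 0 else math.sin(x) / x
            row.append(value + (loading if i == j else 0.0))
        gamma.append(row)
    return gamma


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [[complex(v) for v in a[i]] + [complex(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0j] * n
    for i in reversed(range(n)):
        acc = aug[i][n] - sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = acc / aug[i][i]
    return z


def superdirective_weights(gamma, d):
    """w = Gamma^{-1} d / (d^H Gamma^{-1} d): unit gain toward d, minimum diffuse-noise output."""
    z = solve(gamma, d)
    denom = sum(dm.conjugate() * zm for dm, zm in zip(d, z))
    return [zm / denom for zm in z]


freq, spacing, n_mics, theta = 1000.0, 0.05, 4, math.radians(60.0)
d = [cmath.exp(-2j * math.pi * freq * m * spacing * math.cos(theta) / C) for m in range(n_mics)]
gamma = diffuse_coherence(freq, spacing, n_mics)
w = superdirective_weights(gamma, d)
response = sum(wm.conjugate() * dm for wm, dm in zip(w, d))  # gain toward the look direction
```

The distortionless constraint means `response` equals 1 up to rounding, whatever the coherence matrix is; the diagonal loading trades some directivity for robustness to microphone mismatch.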
Further, the second determination module is also configured to:
input the frequency-domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller to determine a second projection matrix for each frequency bin of the frequency-domain observation signal based on the noise direction vector;
determine the noise signal output by the second super-directional beamformer based on the second projection matrix and the frequency-domain observation signal.
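This section does not give the form of the second projection matrix, so the sketch below is only a stand-in for the lower branch: it builds weights that pass an assumed noise direction with unit gain while placing a null on the target direction, so that the branch output contains noise but no target speech. The geometry (uniform linear array, far field), the two direction angles, and the names `steer`, `inner`, and `noise_branch_weights` are all hypothetical.

```python
import cmath
import math

C = 343.0  # assumed speed of sound, m/s


def steer(freq_hz, theta_rad, spacing_m, n_mics):
    """Far-field steering vector of a uniform linear array (assumed geometry)."""
    tau = spacing_m * math.cos(theta_rad) / C
    return [cmath.exp(-2j * math.pi * freq_hz * m * tau) for m in range(n_mics)]


def inner(a, b):
    """Hermitian inner product a^H b."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


def noise_branch_weights(d_speech, d_noise):
    """Weights with unit gain toward d_noise and a null toward d_speech."""
    # Remove from d_noise its component along d_speech, then normalise the gain.
    scale = inner(d_speech, d_noise) / inner(d_speech, d_speech)
    v = [dn - scale * ds for dn, ds in zip(d_noise, d_speech)]
    gain = inner(d_noise, v)
    return [vm / gain for vm in v]


d_s = steer(1000.0, math.radians(90.0), 0.05, 4)  # target direction (assumed)
d_n = steer(1000.0, math.radians(30.0), 0.05, 4)  # noise direction (assumed)
w2 = noise_branch_weights(d_s, d_n)

# One frequency bin containing target speech (amplitude 2.0) plus noise (0.3 + 0.4j).
x = [2.0 * s + (0.3 + 0.4j) * n for s, n in zip(d_s, d_n)]
noise_ref = inner(w2, x)  # w2^H x
```

In this synthetic bin the branch output recovers the noise amplitude and rejects the target, which is the property a noise reference needs before adaptive cancellation.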
Further, the third determination module is also configured to:
input the reference speech signal and the noise signal into an adaptive noise suppressor to perform adaptive noise suppression on the frequency-domain observation signal corresponding to the speech signal based on the reference speech signal and the noise signal, so as to obtain an error signal corresponding to the frequency-domain observation signal;
input the error signal into the adaptive noise suppressor, optimize the parameters of the adaptive noise suppressor using a normalized least mean square (NLMS) error criterion, and, after the adaptive noise suppressor has been optimized, determine the speech enhancement signal corresponding to the speech signal.
Further, the third determination module is also configured to:
input the reference speech signal and the noise signal into the adaptive noise suppressor to determine an adjustment signal based on a weight vector corresponding to the adaptive noise suppressor and the reference speech signal;
adjust the frequency-domain observation signal corresponding to the speech signal based on the adjustment signal, and determine the error signal corresponding to the adjusted frequency-domain observation signal.
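The adaptive suppression step can be illustrated with a single-tap complex NLMS filter running over the frames of one frequency bin. The sketch uses the conventional generalized sidelobe canceller arrangement, in which the filtered noise reference is subtracted from the upper-branch output; the signal routing in the application may differ. The step size `mu`, the regulariser `eps`, and the name `nlms_cancel` are assumptions.

```python
import cmath


def nlms_cancel(reference, noise, mu=0.5, eps=1e-8):
    """Single-tap complex NLMS for one frequency bin, iterated over frames.

    reference[k]: upper-branch output for frame k (speech plus leaked noise)
    noise[k]:     lower-branch noise reference for frame k
    Returns the error signal per frame (the enhanced output) and the final weight.
    """
    w = 0j
    errors = []
    for d_k, u_k in zip(reference, noise):
        y_k = w.conjugate() * u_k  # adjustment signal: filtered noise reference
        e_k = d_k - y_k            # error signal after subtraction
        # Normalised update: step scaled by the instantaneous noise power.
        w += mu * u_k * e_k.conjugate() / (abs(u_k) ** 2 + eps)
        errors.append(e_k)
    return errors, w


# Synthetic check: the upper branch contains only leaked noise, h * u.
h = 0.5 - 0.2j
noise = [(1.0 + 0.1 * (k % 3)) * cmath.exp(0.7j * k) for k in range(200)]
reference = [h * u for u in noise]
errors, w = nlms_cancel(reference, noise)
```

Because the synthetic upper branch holds no speech, the error signal decays toward zero and the weight converges to the complex conjugate of the leakage coefficient `h`. With real speech present, the error signal is the enhanced output.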
Further, the acquisition module is also configured to:
collect the speech signal through the microphone array, and perform a framing operation on the speech signal to obtain frame data corresponding to the speech signal;
perform a short-time discrete Fourier transform on the frame data corresponding to the speech signal to obtain the frequency-domain observation signal corresponding to the speech signal.
An embodiment of the present application further provides a speech enhancement device, comprising a memory, a processor, and a speech enhancement program stored in the memory and executable on the processor, wherein the processor, when executing the speech enhancement program, implements:
collecting a speech signal through a microphone array, and converting the speech signal into a frequency-domain observation signal, wherein the speech signal is a time-domain observation signal;
inputting the frequency-domain observation signal to a first super-directional beamformer in a generalized sidelobe canceller to determine a reference speech signal output by the first super-directional beamformer;
inputting the frequency-domain observation signal to a second super-directional beamformer of the generalized sidelobe canceller to determine a noise signal corresponding to the speech signal, wherein a constraint matrix corresponding to the second super-directional beamformer and a blocking matrix corresponding to the first super-directional beamformer are mutually orthogonal;
determining a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal.
In addition, an embodiment of the present application further provides a computer-readable storage medium, which may be non-volatile or volatile. A speech enhancement program is stored on the computer-readable storage medium, and the speech enhancement program, when executed by a processor, implements:
collecting a speech signal through a microphone array, and converting the speech signal into a frequency-domain observation signal, wherein the speech signal is a time-domain observation signal;
inputting the frequency-domain observation signal to a first super-directional beamformer in a generalized sidelobe canceller to determine a reference speech signal output by the first super-directional beamformer;
inputting the frequency-domain observation signal to a second super-directional beamformer of the generalized sidelobe canceller to determine a noise signal corresponding to the speech signal, wherein a constraint matrix corresponding to the second super-directional beamformer and a blocking matrix corresponding to the first super-directional beamformer are mutually orthogonal;
determining a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal.
The specific embodiments of the computer-readable storage medium of the present application are substantially the same as the embodiments of the speech enhancement method described above, and are not repeated here.
It should be noted that, as used herein, the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or system that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element defined by the phrase "including a ..." does not preclude the presence of additional identical elements in the process, method, article, or system that includes that element.
The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.
From the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, or alternatively by hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied as a software product. The computer software product is stored in a storage medium as described above (such as a ROM/RAM, magnetic disk, or optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present application.
The above are only preferred embodiments of the present application and do not thereby limit its patent scope. Any equivalent structural or process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (20)

  1. A speech enhancement method, wherein the speech enhancement method comprises the following steps:
    collecting a speech signal through a microphone array, and converting the speech signal into a frequency-domain observation signal, wherein the speech signal is a time-domain observation signal;
    inputting the frequency-domain observation signal to a first super-directional beamformer in a generalized sidelobe canceller to determine a reference speech signal output by the first super-directional beamformer;
    inputting the frequency-domain observation signal to a second super-directional beamformer of the generalized sidelobe canceller to determine a noise signal corresponding to the speech signal, wherein a constraint matrix corresponding to the second super-directional beamformer and a blocking matrix corresponding to the first super-directional beamformer are mutually orthogonal;
    determining a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal.
  2. The speech enhancement method according to claim 1, wherein the step of inputting the frequency-domain observation signal to the first super-directional beamformer in the generalized sidelobe canceller to determine the reference speech signal output by the first super-directional beamformer comprises:
    inputting the frequency-domain observation signal to the first super-directional beamformer of the generalized sidelobe canceller to determine a steering vector for each frequency bin of the frequency-domain observation signal based on a direction angle corresponding to the speech signal and an element spacing corresponding to the microphone array;
    determining a first projection matrix for each frequency bin of the frequency-domain observation signal based on the steering vector of each frequency bin;
    determining the reference speech signal output by the first super-directional beamformer based on the first projection matrix and the frequency-domain observation signal.
  3. The speech enhancement method according to claim 2, wherein the step of determining the first projection matrix for each frequency bin of the frequency-domain observation signal based on the steering vector of each frequency bin comprises:
    calculating a noise cross-correlation coefficient matrix for each frequency bin of the frequency-domain observation signal based on the steering vector of each frequency bin;
    calculating the first projection matrix for each frequency bin of the frequency-domain observation signal based on the noise cross-correlation coefficient matrix of each frequency bin.
  4. The speech enhancement method according to claim 1, wherein the step of inputting the frequency-domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal comprises:
    inputting the frequency-domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller to determine a second projection matrix for each frequency bin of the frequency-domain observation signal based on the noise direction vector;
    determining the noise signal output by the second super-directional beamformer based on the second projection matrix and the frequency-domain observation signal.
  5. The speech enhancement method according to claim 1, wherein the step of determining the speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal comprises:
    inputting the reference speech signal and the noise signal into an adaptive noise suppressor to perform adaptive noise suppression on the frequency-domain observation signal corresponding to the speech signal based on the reference speech signal and the noise signal, so as to obtain an error signal corresponding to the frequency-domain observation signal;
    inputting the error signal into the adaptive noise suppressor, optimizing parameters of the adaptive noise suppressor using a normalized least mean square (NLMS) error criterion, and, after the adaptive noise suppressor has been optimized, determining the speech enhancement signal corresponding to the speech signal.
  6. The speech enhancement method according to claim 5, wherein the step of inputting the reference speech signal and the noise signal into the adaptive noise suppressor to perform adaptive noise suppression on the frequency-domain observation signal corresponding to the speech signal based on the reference speech signal and the noise signal, so as to obtain the error signal corresponding to the frequency-domain observation signal, comprises:
    inputting the reference speech signal and the noise signal into the adaptive noise suppressor to determine an adjustment signal based on a weight vector corresponding to the adaptive noise suppressor and the reference speech signal;
    adjusting the frequency-domain observation signal corresponding to the speech signal based on the adjustment signal, and determining the error signal corresponding to the adjusted frequency-domain observation signal.
  7. The speech enhancement method according to any one of claims 1 to 6, wherein the step of collecting the speech signal through the microphone array and converting the speech signal into the frequency-domain observation signal comprises:
    collecting the speech signal through the microphone array, and performing a framing operation on the speech signal to obtain frame data corresponding to the speech signal;
    performing a short-time discrete Fourier transform on the frame data corresponding to the speech signal to obtain the frequency-domain observation signal corresponding to the speech signal.
  8. A speech enhancement apparatus, wherein the speech enhancement apparatus comprises:
    an acquisition module, configured to collect a speech signal through a microphone array and convert the speech signal into a frequency-domain observation signal, wherein the speech signal is a time-domain observation signal;
    a first determination module, configured to input the frequency-domain observation signal to a first super-directional beamformer in a generalized sidelobe canceller to determine a reference speech signal output by the first super-directional beamformer;
    a second determination module, configured to input the frequency-domain observation signal to a second super-directional beamformer of the generalized sidelobe canceller to determine a noise signal corresponding to the speech signal, wherein a constraint matrix corresponding to the second super-directional beamformer and a blocking matrix corresponding to the first super-directional beamformer are mutually orthogonal;
    a third determination module, configured to determine a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal.
  9. A speech enhancement device, wherein the speech enhancement device comprises a memory, a processor, and a speech enhancement program stored on the memory and executable on the processor, and the speech enhancement program, when executed by the processor, implements:
    collecting a speech signal through a microphone array, and converting the speech signal into a frequency-domain observation signal, wherein the speech signal is a time-domain observation signal;
    inputting the frequency-domain observation signal to a first super-directional beamformer in a generalized sidelobe canceller to determine a reference speech signal output by the first super-directional beamformer;
    inputting the frequency-domain observation signal to a second super-directional beamformer of the generalized sidelobe canceller to determine a noise signal corresponding to the speech signal, wherein a constraint matrix corresponding to the second super-directional beamformer and a blocking matrix corresponding to the first super-directional beamformer are mutually orthogonal;
    determining a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal.
  10. The speech enhancement device according to claim 9, wherein the step of inputting the frequency-domain observation signal to the first super-directional beamformer in the generalized sidelobe canceller to determine the reference speech signal output by the first super-directional beamformer comprises:
    inputting the frequency-domain observation signal to the first super-directional beamformer of the generalized sidelobe canceller to determine a steering vector for each frequency bin of the frequency-domain observation signal based on a direction angle corresponding to the speech signal and an element spacing corresponding to the microphone array;
    determining a first projection matrix for each frequency bin of the frequency-domain observation signal based on the steering vector of each frequency bin;
    determining the reference speech signal output by the first super-directional beamformer based on the first projection matrix and the frequency-domain observation signal.
  11. The speech enhancement device according to claim 10, wherein the step of determining the first projection matrix for each frequency bin of the frequency-domain observation signal based on the steering vector of each frequency bin comprises:
    calculating a noise cross-correlation coefficient matrix for each frequency bin of the frequency-domain observation signal based on the steering vector of each frequency bin;
    calculating the first projection matrix for each frequency bin of the frequency-domain observation signal based on the noise cross-correlation coefficient matrix of each frequency bin.
  12. The speech enhancement device according to claim 9, wherein the step of inputting the frequency-domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal comprises:
    inputting the frequency-domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller to determine a second projection matrix for each frequency bin of the frequency-domain observation signal based on the noise direction vector;
    determining the noise signal output by the second super-directional beamformer based on the second projection matrix and the frequency-domain observation signal.
  13. The speech enhancement device according to claim 9, wherein the step of determining the speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal comprises:
    inputting the reference speech signal and the noise signal into an adaptive noise suppressor to perform adaptive noise suppression on the frequency-domain observation signal corresponding to the speech signal based on the reference speech signal and the noise signal, so as to obtain an error signal corresponding to the frequency-domain observation signal;
    inputting the error signal into the adaptive noise suppressor, optimizing parameters of the adaptive noise suppressor using a normalized least mean square (NLMS) error criterion, and, after the adaptive noise suppressor has been optimized, determining the speech enhancement signal corresponding to the speech signal.
  14. The speech enhancement device according to claim 13, wherein the step of inputting the reference speech signal and the noise signal into the adaptive noise suppressor to perform adaptive noise suppression on the frequency-domain observation signal corresponding to the speech signal based on the reference speech signal and the noise signal, so as to obtain the error signal corresponding to the frequency-domain observation signal, comprises:
    inputting the reference speech signal and the noise signal into the adaptive noise suppressor to determine an adjustment signal based on a weight vector corresponding to the adaptive noise suppressor and the reference speech signal;
    adjusting the frequency-domain observation signal corresponding to the speech signal based on the adjustment signal, and determining the error signal corresponding to the adjusted frequency-domain observation signal.
  15. The speech enhancement device according to any one of claims 9 to 14, wherein the step of collecting the speech signal through the microphone array and converting the speech signal into the frequency-domain observation signal comprises:
    collecting the speech signal through the microphone array, and performing a framing operation on the speech signal to obtain frame data corresponding to the speech signal;
    performing a short-time discrete Fourier transform on the frame data corresponding to the speech signal to obtain the frequency-domain observation signal corresponding to the speech signal.
  16. A computer-readable storage medium, wherein a speech enhancement program is stored on the computer-readable storage medium, and the speech enhancement program, when executed by a processor, implements:
    collecting a speech signal through a microphone array, and converting the speech signal into a frequency-domain observation signal, wherein the speech signal is a time-domain observation signal;
    inputting the frequency-domain observation signal to a first super-directional beamformer in a generalized sidelobe canceller to determine a reference speech signal output by the first super-directional beamformer;
    inputting the frequency-domain observation signal to a second super-directional beamformer of the generalized sidelobe canceller to determine a noise signal corresponding to the speech signal, wherein a constraint matrix corresponding to the second super-directional beamformer and a blocking matrix corresponding to the first super-directional beamformer are mutually orthogonal;
    determining a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal.
  17. The computer-readable storage medium according to claim 16, wherein the step of inputting the frequency-domain observation signal to the first super-directional beamformer in the generalized sidelobe canceller to determine the reference speech signal output by the first super-directional beamformer comprises:
    inputting the frequency-domain observation signal to the first super-directional beamformer of the generalized sidelobe canceller to determine a steering vector for each frequency bin of the frequency-domain observation signal based on a direction angle corresponding to the speech signal and an element spacing corresponding to the microphone array;
    determining a first projection matrix for each frequency bin of the frequency-domain observation signal based on the steering vector of each frequency bin;
    determining the reference speech signal output by the first super-directional beamformer based on the first projection matrix and the frequency-domain observation signal.
  18. The computer-readable storage medium according to claim 17, wherein the step of determining the first projection matrix for each frequency bin of the frequency-domain observation signal based on the steering vector of each frequency bin comprises:
    calculating a noise cross-correlation coefficient matrix for each frequency bin of the frequency-domain observation signal based on the steering vector of each frequency bin;
    calculating the first projection matrix for each frequency bin of the frequency-domain observation signal based on the noise cross-correlation coefficient matrix of each frequency bin.
  19. The computer-readable storage medium of claim 16, wherein the step of inputting the frequency-domain observation signal to a second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal comprises:
    inputting the frequency-domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller, so as to determine a second projection matrix for each frequency point of the frequency-domain observation signal based on the noise direction vector; and
    determining the noise signal output by the second super-directional beamformer based on the second projection matrix and the frequency-domain observation signal.
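In both beamformer branches, the final step combines a per-frequency projection with the frequency-domain observation signal. A minimal sketch of that step follows, assuming the projection for each frequency point reduces to a single complex weight vector and that the observation is stored as an array of shape (frequency points, microphones, frames). The function name and array layout are hypothetical and are not specified in the claims.

```python
import numpy as np


def apply_beamformer(weights, stft):
    """Apply per-frequency weights to a multichannel frequency-domain signal.

    weights: complex array, shape (n_freqs, n_mics)
    stft:    complex array, shape (n_freqs, n_mics, n_frames)
    returns: complex array, shape (n_freqs, n_frames), y[k, t] = w_k^H x[k, :, t]
    """
    return np.einsum("km,kmt->kt", weights.conj(), stft)


# Example: averaging weights applied to identical channels return that channel.
n_freqs, n_mics, n_frames = 3, 4, 5
rng = np.random.default_rng(0)
channel = rng.standard_normal((n_freqs, n_frames)) + 1j * rng.standard_normal((n_freqs, n_frames))
stft = np.repeat(channel[:, None, :], n_mics, axis=1)
weights = np.full((n_freqs, n_mics), 1.0 / n_mics, dtype=complex)
output = apply_beamformer(weights, stft)
```

The same routine would serve the second beamformer with weights derived from the noise direction vector; how those weights are derived is left to the specification.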
  20. The computer-readable storage medium of claim 16, wherein the step of determining the speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal comprises:
    inputting the reference speech signal and the noise signal into an adaptive noise suppressor, so as to perform adaptive noise suppression on the frequency-domain observation signal corresponding to the speech signal based on the reference speech signal and the noise signal, to obtain an error signal corresponding to the frequency-domain observation signal; and
    inputting the error signal into the adaptive noise suppressor, optimizing the parameters of the adaptive noise suppressor using a normalized minimum mean square error criterion, and determining the speech enhancement signal corresponding to the speech signal after the optimization of the adaptive noise suppressor is completed.
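The adaptive stage of claim 20 can be sketched as follows, reading the "normalized minimum mean square error criterion" as the standard normalized least-mean-squares (NLMS) update. That reading is an assumption, as are the single complex tap per frequency point, the step size `mu`, and the regularizer `eps`. The function name is hypothetical.

```python
import numpy as np


def nlms_suppress(ref, noise, mu=0.5, eps=1e-8):
    """Subtract an adaptively scaled noise reference from the speech reference.

    ref, noise: complex arrays, shape (n_freqs, n_frames)
    returns:    error signal e[k, t] = ref[k, t] - conj(w[k]) * noise[k, t],
                which serves as the enhanced output in this sketch.
    """
    n_freqs, n_frames = ref.shape
    w = np.zeros(n_freqs, dtype=complex)       # one adaptive tap per frequency point
    err = np.empty((n_freqs, n_frames), dtype=complex)
    for t in range(n_frames):
        u = noise[:, t]
        e = ref[:, t] - w.conj() * u           # error after noise cancellation
        err[:, t] = e
        # NLMS update: step normalized by the instantaneous noise power.
        w += mu * u * e.conj() / (np.abs(u) ** 2 + eps)
    return err


# Example: the reference contains only leaked noise, so the error decays toward zero.
rng = np.random.default_rng(0)
n = rng.standard_normal((3, 200)) + 1j * rng.standard_normal((3, 200))
leak = np.array([0.5, -0.3 + 0.2j, 1.0j])[:, None]
residual = nlms_suppress(leak * n, n)
```

When speech is present in the reference, the error signal retains it while the component correlated with the noise reference is progressively cancelled; a smaller `mu` trades convergence speed for lower misadjustment.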
PCT/CN2021/127260 2020-11-17 2021-10-29 Speech enhancement method and apparatus, and device and computer-readable storage medium WO2022105571A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011297820.3 2020-11-17
CN202011297820.3A CN112489674A (en) 2020-11-17 2020-11-17 Speech enhancement method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2022105571A1 true WO2022105571A1 (en) 2022-05-27

Family

ID=74931606

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/127260 WO2022105571A1 (en) 2020-11-17 2021-10-29 Speech enhancement method and apparatus, and device and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN112489674A (en)
WO (1) WO2022105571A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112489674A (en) * 2020-11-17 2021-03-12 深圳壹账通智能科技有限公司 Speech enhancement method, device, equipment and computer readable storage medium
CN114023307B (en) * 2022-01-05 2022-06-14 阿里巴巴达摩院(杭州)科技有限公司 Sound signal processing method, speech recognition method, electronic device, and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105792074A (en) * 2016-02-26 2016-07-20 西北工业大学 Voice signal processing method and device
WO2016167141A1 (en) * 2015-04-16 2016-10-20 ソニー株式会社 Signal processing device, signal processing method, and program
CN109389991A (en) * 2018-10-24 2019-02-26 中国科学院上海微系统与信息技术研究所 A kind of signal enhancing method based on microphone array
US10418048B1 (en) * 2018-04-30 2019-09-17 Cirrus Logic, Inc. Noise reference estimation for noise reduction
CN111341340A (en) * 2020-02-28 2020-06-26 重庆邮电大学 Robust GSC method based on coherence and energy ratio
CN112489674A (en) * 2020-11-17 2021-03-12 深圳壹账通智能科技有限公司 Speech enhancement method, device, equipment and computer readable storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016167141A1 (en) * 2015-04-16 2016-10-20 ソニー株式会社 Signal processing device, signal processing method, and program
CN105792074A (en) * 2016-02-26 2016-07-20 西北工业大学 Voice signal processing method and device
US10418048B1 (en) * 2018-04-30 2019-09-17 Cirrus Logic, Inc. Noise reference estimation for noise reduction
CN109389991A (en) * 2018-10-24 2019-02-26 中国科学院上海微系统与信息技术研究所 A kind of signal enhancing method based on microphone array
CN111341340A (en) * 2020-02-28 2020-06-26 重庆邮电大学 Robust GSC method based on coherence and energy ratio
CN112489674A (en) * 2020-11-17 2021-03-12 深圳壹账通智能科技有限公司 Speech enhancement method, device, equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN112489674A (en) 2021-03-12

Similar Documents

Publication Publication Date Title
CN107464564B (en) Voice interaction method, device and equipment
US11620983B2 (en) Speech recognition method, device, and computer-readable storage medium
JP7011075B2 (en) Target voice acquisition method and device based on microphone array
CN110491403B (en) Audio signal processing method, device, medium and audio interaction equipment
CN104246531B (en) System and method for showing user interface
WO2022105571A1 (en) Speech enhancement method and apparatus, and device and computer-readable storage medium
WO2021135628A1 (en) Voice signal processing method and speech separation method
US11284151B2 (en) Loudness adjustment method and apparatus, and electronic device and storage medium
CN110473568B (en) Scene recognition method and device, storage medium and electronic equipment
CN109887494B (en) Method and apparatus for reconstructing a speech signal
CN108109617A (en) A kind of remote pickup method
CN111863020B (en) Voice signal processing method, device, equipment and storage medium
CN111986691B (en) Audio processing method, device, computer equipment and storage medium
CN110364156A (en) Voice interactive method, system, terminal and readable storage medium storing program for executing
CN112233689B (en) Audio noise reduction method, device, equipment and medium
CN115620727B (en) Audio processing method and device, storage medium and intelligent glasses
CN115497500B (en) Audio processing method and device, storage medium and intelligent glasses
WO2022256577A1 (en) A method of speech enhancement and a mobile computing device implementing the method
CN110517702B (en) Signal generation method, and voice recognition method and device based on artificial intelligence
Bai et al. Audio enhancement and intelligent classification of household sound events using a sparsely deployed array
US11915718B2 (en) Position detection method, apparatus, electronic device and computer readable storage medium
WO2024041512A1 (en) Audio noise reduction method and apparatus, and electronic device and readable storage medium
CN111627456B (en) Noise elimination method, device, equipment and readable storage medium
CN113506582A (en) Sound signal identification method, device and system
CN113223552B (en) Speech enhancement method, device, apparatus, storage medium, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21893719

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM1205A DATED 23.08.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21893719

Country of ref document: EP

Kind code of ref document: A1