WO2022105571A1 - Speech enhancement method and apparatus, device, and computer-readable storage medium - Google Patents

Info

Publication number
WO2022105571A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
speech
domain observation
frequency
super
Prior art date
Application number
PCT/CN2021/127260
Other languages
English (en)
Chinese (zh)
Inventor
赵沁
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2022105571A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Definitions

  • the present application relates to the technical field of signal processing, and in particular, to a speech enhancement method, apparatus, device, and computer-readable storage medium.
  • speech enhancement processing, that is, noise reduction processing, is performed on a speech signal so that a purer speech signal is extracted from it as far as possible, making speech recognition more accurate.
  • at present, however, the speech signal extracted after speech enhancement processing has low accuracy, which hinders subsequent speech recognition.
  • One of the purposes of the embodiments of the present application is to provide a speech enhancement method, apparatus, device, and computer-readable storage medium, which aim to solve the technical problem that the speech signal currently extracted by speech enhancement processing has low accuracy.
  • an embodiment of the present application provides a speech enhancement method, wherein the speech enhancement method includes the following steps:
  • a speech signal is collected by a microphone array and converted into a frequency-domain observation signal, wherein the speech signal is a time-domain observation signal;
  • the frequency-domain observation signal is input to a first super-directional beamformer in a generalized sidelobe canceller to determine a reference speech signal output by the first super-directional beamformer;
  • the frequency-domain observation signal is input to a second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other;
  • a speech enhancement signal corresponding to the speech signal is determined based on the reference speech signal and the noise signal.
  • an embodiment of the present application provides a voice enhancement device, wherein the voice enhancement device includes:
  • an acquisition module configured to collect a speech signal through a microphone array and convert the speech signal into a frequency-domain observation signal, wherein the speech signal is a time-domain observation signal;
  • a first determination module configured to input the frequency-domain observation signal to a first super-directional beamformer in the generalized sidelobe canceller, to determine a reference speech signal output by the first super-directional beamformer;
  • a second determination module configured to input the frequency-domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller, so as to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other;
  • a third determining module configured to determine a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal.
  • an embodiment of the present application provides a speech enhancement device, including a memory, a processor, and a speech enhancement program stored in the memory and executable on the processor, where the processor implements the following when executing the speech enhancement program:
  • a speech signal is collected by a microphone array and converted into a frequency-domain observation signal, wherein the speech signal is a time-domain observation signal;
  • the frequency-domain observation signal is input to a first super-directional beamformer in a generalized sidelobe canceller to determine a reference speech signal output by the first super-directional beamformer;
  • the frequency-domain observation signal is input to a second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other;
  • a speech enhancement signal corresponding to the speech signal is determined based on the reference speech signal and the noise signal.
  • an embodiment of the present application provides a computer-readable storage medium, which may be non-volatile or volatile and which stores a computer program; when executed by a processor, the computer program implements the following:
  • a speech signal is collected by a microphone array and converted into a frequency-domain observation signal, wherein the speech signal is a time-domain observation signal;
  • the frequency-domain observation signal is input to a first super-directional beamformer in a generalized sidelobe canceller to determine a reference speech signal output by the first super-directional beamformer;
  • the frequency-domain observation signal is input to a second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other;
  • a speech enhancement signal corresponding to the speech signal is determined based on the reference speech signal and the noise signal.
  • the beneficial effects of the embodiments of the present application are as follows: a speech signal is collected through a microphone array and converted into a frequency-domain observation signal, wherein the speech signal is a time-domain observation signal;
  • the frequency-domain observation signal is input to the first super-directional beamformer in the generalized sidelobe canceller to determine the reference speech signal output by the first super-directional beamformer;
  • the frequency-domain observation signal is input to a second super-directional beamformer of the generalized sidelobe canceller to determine a noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other;
  • the speech enhancement signal corresponding to the speech signal is determined based on the reference speech signal and the noise signal.
  • This embodiment improves the generalized sidelobe canceller by combining its structure with super-directional beamforming. Because super-directional beamforming has strong directivity and a narrow main lobe, the first super-directional beamformer in the generalized sidelobe canceller can effectively enhance the speech signal from the target azimuth.
  • At the same time, the blocking matrix part of the lower branch of the generalized sidelobe canceller, improved on the basis of super-directional beamforming, filters out noise interference more effectively. This improves the accuracy of the calculated reference speech signal and noise signal, and thereby the accuracy of the speech enhancement signal.
  • FIG. 1 is a schematic structural diagram of a speech enhancement device of a hardware operating environment involved in a solution according to an embodiment of the present application
  • FIG. 2 is a schematic flowchart of the first embodiment of the speech enhancement method of the application
  • FIG. 3 is a schematic flowchart of the second embodiment of the speech enhancement method of the present application.
  • the speech enhancement method, apparatus, device and computer-readable storage medium provided by this application can also be applied to the field of artificial intelligence.
  • FIG. 1 is a schematic structural diagram of a speech enhancement device of a hardware operating environment involved in the solution of the embodiment of the present application.
  • the voice enhancement device in the embodiment of this application may be a PC, or a smart phone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a portable computer, or other portable terminal equipment with a display function.
  • the speech enhancement device may include: a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002.
  • the communication bus 1002 is used to realize the connection and communication between these components.
  • the user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface.
  • the network interface 1004 may include a standard wired interface and a wireless interface (e.g., a WI-FI interface).
  • the memory 1005 may be high-speed RAM memory, or may be non-volatile memory, such as disk memory.
  • the memory 1005 may also be a storage device independent of the aforementioned processor 1001.
  • the voice enhancement device may further include a camera, an RF (Radio Frequency, radio frequency) circuit, a sensor, an audio circuit, a WiFi module, and the like.
  • the sensors include, for example, light sensors, motion sensors and other sensors.
  • the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display screen according to the brightness of the ambient light, and the proximity sensor may turn off the display screen and/or the backlight when the voice enhancement device is moved to the ear.
  • the gravitational acceleration sensor can detect the magnitude of acceleration in all directions (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications that recognize the posture of the voice enhancement device (such as switching between horizontal and vertical screens, related games, and magnetometer attitude calibration) and for vibration-recognition functions (such as a pedometer or tap detection); of course, the voice enhancement device can also be equipped with other sensors such as a gyroscope, barometer, hygrometer, thermometer, or infrared sensor, which are not described further here.
  • the structure of the speech enhancement device shown in FIG. 1 does not constitute a limitation on the speech enhancement device, which may include more or fewer components than those shown in the figure, combine some components, or use a different arrangement of components.
  • the memory 1005 as a computer storage medium may include an operating system, a network communication module, a user interface module and a speech enhancement program.
  • the network interface 1004 is mainly used to connect the background server, and perform data communication with the background server;
  • the user interface 1003 is mainly used to connect to the client and perform data communication with the client;
  • the processor 1001 may be configured to call the speech enhancement program stored in the memory 1005, and execute the speech enhancement method provided by the embodiment of the present application.
  • FIG. 2 is a schematic flowchart of the first embodiment of the speech enhancement method of the present application.
  • Step S10 collecting a voice signal through a microphone array, and converting the voice signal into a frequency-domain observation signal, wherein the voice signal is a time-domain observation signal;
  • the speech enhancement method proposed in this application is applied to intelligent terminal equipment, and is based on the technology of microphone array and generalized sidelobe canceller.
  • the microphone array is composed of multiple microphone array elements.
  • the microphone array is used to collect the sound signal in the real environment, that is, the speech signal.
  • the generalized sidelobe canceller is an improved beamformer based on super-directional beamforming technology, and includes an upper branch and a lower branch.
  • the upper branch of the generalized sidelobe canceller is used to pass and initially enhance the speech signal in the target direction.
  • the lower branch of the generalized sidelobe canceller is used to block the speech signal in the target direction and pass the noise signal contained in the speech signal.
  • the pre-processed time-domain observation signal is divided into frames, yielding the frame data corresponding to the speech signal; the frame data is then transformed to the frequency domain.
  • the frequency-domain observation signal X_i(e^{jω}) is thus obtained, where i denotes the data of the i-th frame.
  • X(k) is used to represent the frequency-domain data of the k-th frame.
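  • The framing and frequency-domain conversion of step S10 can be sketched for a single microphone channel as follows. This is a minimal illustration rather than the application's implementation: the frame length, hop size, Hann window, and the helper names `frame_signal`, `hann`, `dft`, and `stft` are all assumptions, and a naive DFT stands in for the fast Fourier transform.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames; a tail shorter than frame_len is dropped."""
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, hop)]


def hann(n):
    """Periodic Hann window of length n (assumed window choice)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * t / n) for t in range(n)]


def dft(frame):
    """Naive discrete Fourier transform of one frame (O(n^2), illustration only)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def stft(samples, frame_len=8, hop=4):
    """Return X[k][n]: frequency point n of frame k for one channel."""
    window = hann(frame_len)
    return [dft([s * w for s, w in zip(frame, window)])
            for frame in frame_signal(samples, frame_len, hop)]


# Toy time-domain observation: a tone that falls exactly on frequency point 2.
signal = [math.cos(2 * math.pi * 2 * t / 8) for t in range(16)]
X = stft(signal)
```

  • For a microphone array, the same transform would be applied to each channel, so that every frame k and frequency point n yields one complex value per array element.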
  • Step S20 inputting the frequency domain observation signal to the first super-directional beamformer in the generalized sidelobe canceller to determine the reference speech signal output by the first super-directional beamformer;
  • after the frequency-domain observation signal corresponding to the speech signal is obtained, it is input to the upper branch of the generalized sidelobe canceller, where the super-directional beamformer performs beamforming and outputs a reference speech signal that is initially enhanced toward the target direction.
  • the target direction is the direction in which the main lobe points, and the output corresponding to the main lobe is the initially enhanced reference speech signal.
  • the direction angle corresponding to the voice signal is the angle formed by the voice signal and the plane where the microphone array is located when the voice signal is received by the microphone array.
  • the generalized sidelobe canceller is an improved beamformer based on super-directional beamforming technology.
  • the generalized sidelobe canceller includes the first super-directional beamformer in the upper branch and the second super-directional beamformer in the lower branch, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other.
  • the first super-directional beamformer enhances the speech signal passing through the upper branch of the generalized sidelobe canceller; because of its strong directivity and narrow main lobe, it can effectively enhance the speech signal from the target azimuth.
  • Step S30 inputting the frequency domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other;
  • the frequency-domain observation signal is input to the second super-directional beamformer in the lower branch of the generalized sidelobe canceller, so that the second super-directional beamformer performs the function of the blocking matrix of the lower branch.
  • the direction of the interference noise is preset in the device, and the second super-directional beamformer calculates and outputs the noise signal based on this preset direction and the frequency-domain observation signal. It can be understood that the lower branch of the generalized sidelobe canceller blocks the speech signal, so that its output contains only the interference noise.
  • Step S40 determining a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal.
  • the adaptive noise suppressor adopts the normalized least mean square (NLMS) error criterion: based on the reference speech signal and the noise signal, the speech signal collected by the microphone array is adaptively filtered, and the frequency-domain speech enhancement signal is obtained once the adaptive filtering is complete.
  • the speech enhancement signal output by the adaptive noise suppressor is a frequency-domain signal, so an inverse Fourier transform is subsequently applied to it to obtain the time-domain speech enhancement signal. Specifically, after the frequency-domain speech enhancement signal is obtained, an inverse short-time discrete Fourier transform is performed on it to obtain and output the time-domain enhancement signal.
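  • The NLMS adaptation described above can be sketched for a single frequency point as follows. This is a hedged illustration of a conventional normalized LMS noise canceller, not the application's exact update: the function name `nlms_bin`, the step size `mu`, the regularizer `eps`, and the single-weight filter are assumptions. The toy data assumes the upper-branch output contains only noise leaked with a fixed complex gain, so the error should decay toward zero.

```python
import cmath


def nlms_bin(reference, noise, mu=0.5, eps=1e-8):
    """Adapt one complex weight for one frequency point across frames.

    reference: upper-branch outputs Y(k, n) for frames k = 0, 1, ...
    noise:     lower-branch outputs for the same frames
    Returns the error signal per frame (the enhanced output) and the final weight.
    """
    w = 0j
    errors = []
    for y, u in zip(reference, noise):
        e = y - w.conjugate() * u                            # subtract the noise estimate
        w += mu * u * e.conjugate() / (abs(u) ** 2 + eps)    # normalized LMS update
        errors.append(e)
    return errors, w


# Toy data (assumption): noise leaks into the reference with complex gain `leak`.
leak = 0.6 - 0.3j
noise = [cmath.exp(0.7j * k) for k in range(80)]
reference = [leak * u for u in noise]
errors, w = nlms_bin(reference, noise)
```

  • In this sketch the residual shrinks by the factor (1 - mu) per frame, which is one way to read the statement that the noise must be suppressed repeatedly before the error signal becomes accurate.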
  • the voice enhancement method proposed in this embodiment collects a voice signal through a microphone array and converts it into a frequency-domain observation signal, wherein the voice signal is a time-domain observation signal; inputs the frequency-domain observation signal to the first super-directional beamformer in the generalized sidelobe canceller to determine the reference speech signal output by the first super-directional beamformer; inputs the frequency-domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other; and determines a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal.
  • This embodiment improves the generalized sidelobe canceller by combining its structure with super-directional beamforming. Because super-directional beamforming has strong directivity and a narrow main lobe, the first super-directional beamformer in the generalized sidelobe canceller can effectively enhance the speech signal from the target azimuth.
  • At the same time, the blocking matrix part of the lower branch of the generalized sidelobe canceller, improved on the basis of super-directional beamforming, filters out noise interference more effectively. This improves the accuracy of the calculated reference speech signal and noise signal, and thereby the accuracy of the speech enhancement signal.
  • step S20 includes:
  • Step S21 inputting the frequency-domain observation signal to the first super-directional beamformer of the generalized sidelobe canceller, so as to determine the steering vector of each frequency point of the frequency-domain observation signal based on the direction angle corresponding to the voice signal and the array element spacing corresponding to the microphone array;
  • Step S22 determining the first projection matrix of each frequency point of the frequency domain observation signal based on the steering vector of each frequency point of the frequency domain observation signal;
  • Step S23 determining a reference speech signal output by the first super-directional beamformer based on the first projection matrix and the frequency domain observation signal.
  • after the frequency-domain observation signal corresponding to the speech signal is obtained, it is input to the upper branch of the generalized sidelobe canceller, and the first super-directional beamformer of the upper branch calculates the steering vector of each frequency point of the frequency-domain observation signal based on the direction angle corresponding to the speech signal and the array element spacing of the microphone array.
  • after obtaining the steering vector of each frequency point, the first super-directional beamformer uses these steering vectors to calculate the noise cross-correlation coefficient matrix of each frequency point of the frequency-domain observation signal.
  • then, based on the noise cross-correlation coefficient matrix of each frequency point, the first projection matrix of each frequency point of the frequency-domain observation signal is calculated.
  • after obtaining the first projection matrix of each frequency point, the first super-directional beamformer determines the reference speech signal output by the upper branch of the generalized sidelobe canceller based on the first projection matrix and the frequency-domain observation signal.
  • the steering vector of each frequency point of the frequency domain observation signal is calculated.
  • the calculation formula for calculating the steering vector of each frequency point of the frequency domain observation signal is as follows:
  • N_fft is the length of the fast Fourier transform
  • c is the speed of the signal, here the speed of sound.
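  • The steering-vector formula itself is not reproduced in this text. As a hedged illustration only, the sketch below uses the common textbook form for a uniform linear array, in which element m is delayed by m·d·cos(θ)/c. The sampling rate `fs`, the sign convention, measuring the angle from the array axis, and the function name `steering_vector` are assumptions and may differ from the application's formula.

```python
import cmath
import math


def steering_vector(n, theta, num_mics, spacing, fs, n_fft, c=343.0):
    """Steering vector of frequency point n for a uniform linear array.

    n:        frequency point index (0 .. n_fft // 2)
    theta:    direction angle in radians, measured from the array axis (assumption)
    spacing:  array element spacing in metres
    fs:       sampling rate in Hz (assumption, not named in the source)
    n_fft:    length of the fast Fourier transform
    c:        speed of sound in m/s
    """
    f_n = n * fs / n_fft  # centre frequency of frequency point n in Hz
    return [cmath.exp(-2j * math.pi * f_n * m * spacing * math.cos(theta) / c)
            for m in range(num_mics)]


# Broadside arrival: every element sees the same phase.
d_broadside = steering_vector(n=64, theta=math.pi / 2, num_mics=4,
                              spacing=0.05, fs=16000, n_fft=512)
# Endfire arrival: the phase advances linearly along the array.
d_endfire = steering_vector(n=64, theta=0.0, num_mics=4,
                            spacing=0.05, fs=16000, n_fft=512)
```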
  • the calculation is performed frequency point by frequency point: the noise cross-correlation coefficient matrix Q of the n-th frequency point is calculated based on the steering vector of each frequency point of the frequency-domain observation signal.
  • the formula for calculating the noise cross-correlation coefficient matrix of a frequency point is as follows:
  • i and j represent the i-th array element and the j-th array element of the microphone array, respectively.
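  • The formula for Q is not reproduced in this text. A frequently used model in super-directional beamforming is the spherically isotropic (diffuse) noise field, in which the coherence between array elements i and j depends only on their distance. The sketch below assumes that model and a uniform linear array; it is not confirmed to be the application's formula, and `diffuse_coherence` and `fs` are hypothetical names.

```python
import math


def diffuse_coherence(n, num_mics, spacing, fs, n_fft, c=343.0):
    """Noise coherence matrix of frequency point n under a diffuse-field assumption.

    Entry (i, j) is sin(x) / x with x = 2*pi*f_n*|i - j|*spacing / c.
    """
    f_n = n * fs / n_fft

    def sinc(x):
        return 1.0 if x == 0 else math.sin(x) / x

    return [[sinc(2 * math.pi * f_n * abs(i - j) * spacing / c)
             for j in range(num_mics)]
            for i in range(num_mics)]


Q = diffuse_coherence(n=64, num_mics=4, spacing=0.05, fs=16000, n_fft=512)
```

  • Under this assumption each element is fully coherent with itself (ones on the diagonal) and the coherence falls off as the distance between elements or the frequency increases.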
  • next, the projection matrix of frequency point n is calculated, that is, the first projection matrix of each frequency point of the frequency-domain observation signal is calculated based on the noise cross-correlation coefficient matrix of that frequency point.
  • the calculation formula is as follows:
  • where the steering term represents the steering matrix of the n-th frequency point with respect to the direction angle.
  • the beam output signal of the upper branch, that is, the reference speech signal output by the upper branch of the generalized sidelobe canceller, is then calculated based on the first projection matrix and the frequency-domain observation signal.
  • the calculation formula of the reference speech signal is as follows:
  • Y(k,n) is the reference speech signal corresponding to the nth frequency point of the kth frame of the frequency domain observation signal.
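  • The chain from steering vector to reference speech signal can be sketched with the classical super-directional (minimum-variance distortionless) weights w = Q⁻¹d / (dᴴQ⁻¹d) and output Y = wᴴX. This is a standard construction shown under stated assumptions, not necessarily the application's first projection matrix. The diffuse-noise Q, the diagonal loading value, and all function names are hypothetical.

```python
import cmath
import math


def steering_vector(f, theta, num_mics, spacing, c=343.0):
    return [cmath.exp(-2j * math.pi * f * m * spacing * math.cos(theta) / c)
            for m in range(num_mics)]


def diffuse_coherence(f, num_mics, spacing, c=343.0, loading=1e-2):
    """Diffuse-field coherence with diagonal loading for numerical robustness (assumption)."""
    def sinc(x):
        return 1.0 if x == 0 else math.sin(x) / x
    return [[sinc(2 * math.pi * f * abs(i - j) * spacing / c)
             + (loading if i == j else 0.0)
             for j in range(num_mics)] for i in range(num_mics)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [[complex(v) for v in row] + [complex(bi)] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0j] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def superdirective_weights(f, theta, num_mics, spacing):
    d = steering_vector(f, theta, num_mics, spacing)
    g = solve(diffuse_coherence(f, num_mics, spacing), d)      # Q^{-1} d
    denom = sum(di.conjugate() * gi for di, gi in zip(d, g))   # d^H Q^{-1} d
    return [gi / denom for gi in g]


def beam_output(w, x):
    """Y = w^H x for one frame and one frequency point."""
    return sum(wi.conjugate() * xi for wi, xi in zip(w, x))


f, theta = 2000.0, math.pi / 3
w = superdirective_weights(f, theta, num_mics=4, spacing=0.05)
d_target = steering_vector(f, theta, 4, 0.05)
s = 0.8 + 0.2j                                   # toy target spectrum value
Y = beam_output(w, [s * v for v in d_target])    # target passes undistorted
```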
  • the formulas above take a uniform linear microphone array as an example.
  • the speech signal can also be enhanced with other array geometries, such as a uniform circular array.
  • the step of determining the first projection matrix of each frequency point of the frequency domain observation signal based on the steering vector of each frequency point of the frequency domain observation signal includes:
  • Step S221 based on the steering vector of each frequency point of the frequency-domain observation signal, calculate the noise cross-correlation coefficient matrix of each frequency point of the frequency-domain observation signal;
  • Step S222 Calculate the first projection matrix of each frequency point of the frequency domain observation signal based on the noise cross-correlation coefficient matrix of each frequency point.
  • the first super-directional beamformer of the generalized sidelobe canceller calculates the noise cross-correlation coefficient matrix of each frequency point of the frequency-domain observation signal based on the steering vectors of the frequency points; then, based on the noise cross-correlation coefficient matrix of each frequency point, it calculates the first projection matrix of each frequency point, so that the reference speech signal output by the upper branch of the generalized sidelobe canceller can be determined based on the first projection matrix and the frequency-domain observation signal.
  • for the example formulas for the noise cross-correlation coefficient matrix and for the first projection matrix of each frequency point, refer to the previous embodiment.
  • the step of inputting the frequency domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal includes:
  • Step S31 inputting the frequency domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller, to determine the second projection matrix of each frequency point of the frequency domain observation signal based on the noise direction vector;
  • Step S32 Determine the noise signal output by the second super-directional beamformer based on the second projection matrix and the frequency domain observation signal.
  • the frequency-domain observation signal is input to the second super-directional beamformer in the lower branch of the generalized sidelobe canceller, so that the second super-directional beamformer performs the function of the blocking matrix of the lower branch.
  • the noise steering vector of each frequency point of the frequency-domain observation signal is calculated first; then, based on the noise steering vector of each frequency point, the second projection matrix of each frequency point is calculated; finally, the noise signal is calculated and output based on the second projection matrix and the frequency-domain observation signal, so that the generalized sidelobe canceller obtains, from the second super-directional beamformer, the noise signal that remains after the reference speech signal has been blocked.
  • It can be understood that the output of the lower branch of the generalized sidelobe canceller blocks the reference speech signal, so that the signal part containing only interference noise, that is, the noise signal, is obtained.
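  • The blocking behaviour of the lower branch can be illustrated with the conventional projection-based blocking matrix B = I - d dᴴ / (dᴴd), which is orthogonal to the target steering vector d. This sketch shows the blocking property only and is not the application's second super-directional beamformer. The function names and the toy geometry are assumptions.

```python
import cmath
import math


def steering_vector(f, theta, num_mics, spacing, c=343.0):
    return [cmath.exp(-2j * math.pi * f * m * spacing * math.cos(theta) / c)
            for m in range(num_mics)]


def blocking_matrix(d):
    """Projection onto the subspace orthogonal to the target steering vector d."""
    norm = sum(abs(v) ** 2 for v in d)
    size = len(d)
    return [[(1.0 if i == j else 0.0) - d[i] * d[j].conjugate() / norm
             for j in range(size)] for i in range(size)]


def apply(matrix, x):
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


f = 2000.0
d_target = steering_vector(f, math.pi / 2, num_mics=4, spacing=0.05)
d_noise = steering_vector(f, 0.0, num_mics=4, spacing=0.05)   # preset noise direction
B = blocking_matrix(d_target)

target_leak = apply(B, d_target)    # target component is removed
noise_pass = apply(B, d_noise)      # most of the noise component survives
noise_energy = sum(abs(v) ** 2 for v in noise_pass)
```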
  • the steering vector of each frequency point of the frequency-domain observation signal is calculated; after obtaining these steering vectors, the first super-directional beamformer calculates the noise cross-correlation coefficient matrix of each frequency point of the frequency-domain observation signal; then, based on the noise cross-correlation coefficient matrix of each frequency point, it calculates the first projection matrix of each frequency point; after obtaining the first projection matrix of each frequency point, the first super-directional beamformer determines the reference speech signal output by the upper branch of the generalized sidelobe canceller based on the first projection matrix and the frequency-domain observation signal.
  • the step of determining the speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal includes:
  • Step S41: inputting the reference speech signal and the noise signal into an adaptive noise suppressor, so as to perform adaptive noise suppression on the frequency-domain observation signal corresponding to the speech signal based on the reference speech signal and the noise signal, and to obtain an error signal corresponding to the frequency-domain observation signal;
  • Step S42: inputting the error signal to the adaptive noise suppressor, optimizing the parameters of the adaptive noise suppressor using the normalized minimum mean square error criterion, and, after the optimization of the adaptive noise suppressor is completed, determining a speech enhancement signal corresponding to the speech signal.
  • After the upper branch of the generalized sidelobe canceller outputs the reference speech signal and the lower branch outputs the noise signal, both signals are input to the adaptive noise suppressor. The adaptive noise suppressor performs adaptive noise suppression on the frequency-domain observation signal corresponding to the speech signal according to the reference speech signal and the noise signal, so as to suppress the noise in the speech signal as far as possible and output a high-precision speech enhancement signal.
  • The reference speech signal output by the upper branch and the noise signal output by the lower branch are input into the adaptive noise suppressor, which first calculates the error signal based on the reference speech signal and the noise signal. The error signal is the frequency-domain observation signal after noise suppression; in practice, however, this error signal is a speech signal of low accuracy, and noise suppression must be applied repeatedly to obtain a signal of high accuracy.
  • The error signal is input to the adaptive noise suppressor, which uses the normalized minimum mean square error criterion to optimize its own parameters; after the optimization of the adaptive noise suppressor is completed, a high-precision speech enhancement signal is output.
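The error-driven optimization can be sketched with a normalized LMS (NLMS) update for one frequency bin. This is a minimal sketch under assumptions, not the patent's exact procedure: it follows the conventional generalized sidelobe canceller arrangement, in which the weight vector acts on the noise reference and the resulting adjustment is subtracted from the reference speech, and the signal roles in the text above may differ. The step size `mu`, the regularizer `eps`, and the function names are hypothetical.

```python
import numpy as np

def nlms_step(w, y_ref, u, mu=0.5, eps=1e-8):
    """One NLMS iteration for a single frequency bin.

    w     : current complex weight vector
    y_ref : reference speech value (upper-branch output) for this frame
    u     : noise reference vector (lower-branch output) for this frame
    """
    adjustment = np.vdot(w, u)            # w^H u, the estimated residual noise
    e = y_ref - adjustment                # error signal = enhanced output
    # Normalized update: step scaled by the noise reference energy.
    w_new = w + mu * u * np.conj(e) / (np.vdot(u, u).real + eps)
    return e, w_new

def run_nlms(y_ref_frames, noise_frames, mu=0.5):
    """Run NLMS over all frames; returns the error sequence and final weights."""
    w = np.zeros(noise_frames.shape[1], dtype=complex)
    errors = np.empty(len(y_ref_frames), dtype=complex)
    for i, (y, u) in enumerate(zip(y_ref_frames, noise_frames)):
        errors[i], w = nlms_step(w, y, u, mu)
    return errors, w
```

Early error values are inaccurate because the weights start at zero; as the weights converge over successive frames, the error sequence approaches the enhanced speech, which matches the remark above that suppression must be applied repeatedly.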
  • the steps of performing adaptive noise suppression to obtain an error signal corresponding to the frequency domain observation signal include:
  • Step S411 inputting the reference speech signal and the noise signal into an adaptive noise suppressor to determine an adjustment signal based on the weight vector corresponding to the adaptive noise suppressor and the reference speech signal;
  • Step S412 Adjust the frequency-domain observation signal corresponding to the speech signal based on the adjustment signal, and determine an error signal corresponding to the frequency-domain observation signal after adjustment.
  • The adjustment signal is first calculated based on the weight vector corresponding to the adaptive noise suppressor and the reference speech signal, and the adaptive noise suppressor outputs the adjustment signal; after the adjustment signal is obtained, the frequency-domain observation signal is adjusted based on the adjustment signal, and the error signal corresponding to the adjusted frequency-domain observation signal is obtained.
  • the manner of adjusting the frequency-domain observation signal based on the adjustment signal may be to subtract the adjustment signal from the frequency-domain observation signal to obtain an error signal corresponding to the speech signal.
  • step S10 includes:
  • Step S11 collecting voice signals through a microphone array, and performing a frame division operation on the voice signals to obtain frame data corresponding to the voice signals;
  • Step S12 Perform short-time discrete Fourier transform on the frame data corresponding to the speech signal to obtain a frequency domain observation signal corresponding to the speech signal.
  • Preprocessing operations such as framing are performed on the above-mentioned time-domain observation signal, and the preprocessed time-domain observation signal is then processed frame by frame to obtain the frame data corresponding to the speech signal;
  • a short-time discrete Fourier transform is applied to the frame data to obtain the frequency-domain observation signal, which can be expressed as X_i(e^{jω}), where i denotes the i-th frame of data.
  • X(k) is used to represent the frequency domain data of the kth frame.
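Steps S11 and S12 can be sketched as framing followed by a per-frame FFT. This is a minimal sketch: the frame length of 512 samples, the hop of 256 samples, and the Hann window are assumptions not stated in the text, and the function names are hypothetical. The functions process one microphone channel; a multi-channel recording can be handled by calling `stft_frames` once per channel.

```python
import numpy as np

def frame_signal(x, frame_len=512, hop=256):
    """Split a 1-D signal into overlapping frames of frame_len samples."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def stft_frames(x, frame_len=512, hop=256):
    """Short-time DFT: window each frame, then take the one-sided FFT.

    Returns an array of shape (n_frames, frame_len // 2 + 1); row i holds
    the frequency-domain observation of the i-th frame.
    """
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop)
    return np.fft.rfft(frames * np.hanning(frame_len), axis=-1)
```

With a 16 kHz sampling rate and 512-sample frames, each frequency point spans 31.25 Hz, so a 1 kHz tone peaks at frequency point 32.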
  • The frequency-domain observation signal is input to the first super-directional beamformer of the generalized sidelobe canceller, which determines the steering vector of each frequency point based on the direction angle corresponding to the speech signal and the array element spacing corresponding to the microphone array, and then determines the reference speech signal output by the first super-directional beamformer based on the first projection matrix and the frequency-domain observation signal.
  • Super-directional beamforming has strong directivity and a narrow main lobe. Applying it to the upper branch of the generalized sidelobe canceller allows the first super-directional beamformer to effectively enhance the speech signal from the target azimuth, so that the reference speech signal is well enhanced.
  • an embodiment of the present application also proposes a voice enhancement device, where the voice enhancement device includes:
  • an acquisition module, configured to collect a voice signal through a microphone array and convert the voice signal into a frequency-domain observation signal, wherein the voice signal is a time-domain observation signal;
  • a first determination module configured to input the frequency domain observation signal to a first super-directional beamformer in the generalized sidelobe canceller, to determine a reference speech signal output by the first super-directional beamformer;
  • a second determination module, configured to input the frequency-domain observation signal to the second super-directional beamformer of the generalized sidelobe canceller, so as to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer and the blocking matrix corresponding to the first super-directional beamformer are orthogonal to each other;
  • a third determining module configured to determine a speech enhancement signal corresponding to the speech signal based on the reference speech signal and the noise signal.
  • the first determining module is also used for:
  • inputting the frequency-domain observation signal to the first super-directional beamformer of the generalized sidelobe canceller, so as to determine the steering vector of each frequency point of the frequency-domain observation signal based on the direction angle corresponding to the speech signal and the array element spacing corresponding to the microphone array;
  • a reference speech signal output by the first super-directional beamformer is determined based on the first projection matrix and the frequency domain observation signal.
  • the first determining module is also used for:
  • the first projection matrix of each frequency point of the frequency domain observation signal is calculated.
  • the second determining module is also used for:
  • a noise signal output by the second super-directional beamformer is determined based on the second projection matrix and the frequency domain observation signal.
  • the third determining module is also used for:
  • the error signal is input to the adaptive noise suppressor, a normalized minimum mean square error criterion is used to optimize the parameters of the adaptive noise suppressor, and after the optimization of the adaptive noise suppressor is completed, the speech enhancement signal corresponding to the speech signal is determined.
  • the third determining module is also used for:
  • the frequency-domain observation signal corresponding to the speech signal is adjusted based on the adjustment signal, and an error signal corresponding to the adjusted frequency-domain observation signal is determined.
  • the acquisition module is also used for:
  • a short-time discrete Fourier transform is performed on the frame data corresponding to the speech signal to obtain a frequency domain observation signal corresponding to the speech signal.
  • Embodiments of the present application also provide a voice enhancement device, including a memory, a processor, and a voice enhancement program stored in the memory and running on the processor, where the processor executes the voice enhancement program to achieve:
  • the voice signal is collected by the microphone array, and the voice signal is converted into a frequency-domain observation signal, wherein the voice signal is a time-domain observation signal;
  • the frequency-domain observation signal is input to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer is orthogonal to the blocking matrix corresponding to the first super-directional beamformer;
  • a speech enhancement signal corresponding to the speech signal is determined based on the reference speech signal and the noise signal.
  • an embodiment of the present application also proposes a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile, and a speech enhancement program is stored on the computer-readable storage medium.
  • when executed by a processor, the speech enhancement program implements:
  • the voice signal is collected by the microphone array, and the voice signal is converted into a frequency-domain observation signal, wherein the voice signal is a time-domain observation signal;
  • the frequency-domain observation signal is input to the second super-directional beamformer of the generalized sidelobe canceller to determine the noise signal corresponding to the speech signal, wherein the constraint matrix corresponding to the second super-directional beamformer is orthogonal to the blocking matrix corresponding to the first super-directional beamformer;
  • a speech enhancement signal corresponding to the speech signal is determined based on the reference speech signal and the noise signal.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A speech enhancement method and apparatus, a device, and a computer-readable storage medium. The method comprises: collecting a speech signal by means of a microphone array, and converting the speech signal into a frequency-domain observation signal, the speech signal being a time-domain observation signal (S10); inputting the frequency-domain observation signal into a first super-directional beamformer in a generalized sidelobe canceller, so as to determine a reference speech signal output by the first super-directional beamformer (S20); inputting the frequency-domain observation signal into a second super-directional beamformer in the generalized sidelobe canceller, so as to determine a noise signal corresponding to the speech signal (S30); and determining, on the basis of the reference speech signal and the noise signal, a speech enhancement signal corresponding to the speech signal (S40). By means of the method, a speech signal from a target azimuth can be effectively enhanced, noise interference can be better eliminated, and the accuracy of the reference speech signal and of the noise signal can be effectively improved, so that the accuracy of the speech enhancement signal can be further improved.
PCT/CN2021/127260 2020-11-17 2021-10-29 Speech enhancement method and apparatus, device, and computer-readable storage medium WO2022105571A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011297820.3A CN112489674A (zh) 2020-11-17 2020-11-17 语音增强方法、装置、设备及计算机可读存储介质
CN202011297820.3 2020-11-17

Publications (1)

Publication Number Publication Date
WO2022105571A1 true WO2022105571A1 (fr) 2022-05-27

Family

ID=74931606

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/127260 WO2022105571A1 (fr) 2020-11-17 2021-10-29 Procédé et appareil d'amélioration de la parole, dispositif et support de stockage lisible par ordinateur

Country Status (2)

Country Link
CN (1) CN112489674A (fr)
WO (1) WO2022105571A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112489674A (zh) * 2020-11-17 2021-03-12 深圳壹账通智能科技有限公司 语音增强方法、装置、设备及计算机可读存储介质
CN114023307B (zh) * 2022-01-05 2022-06-14 阿里巴巴达摩院(杭州)科技有限公司 声音信号处理方法、语音识别方法、电子设备和存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105792074A (zh) * 2016-02-26 2016-07-20 西北工业大学 一种语音信号处理方法和装置
WO2016167141A1 (fr) * 2015-04-16 2016-10-20 ソニー株式会社 Dispositif de traitement de signal, procédé de traitement de signal et programme
CN109389991A (zh) * 2018-10-24 2019-02-26 中国科学院上海微系统与信息技术研究所 一种基于麦克风阵列的信号增强方法
US10418048B1 (en) * 2018-04-30 2019-09-17 Cirrus Logic, Inc. Noise reference estimation for noise reduction
CN111341340A (zh) * 2020-02-28 2020-06-26 重庆邮电大学 基于相干性和能量比的鲁棒gsc方法
CN112489674A (zh) * 2020-11-17 2021-03-12 深圳壹账通智能科技有限公司 语音增强方法、装置、设备及计算机可读存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016167141A1 (fr) * 2015-04-16 2016-10-20 ソニー株式会社 Dispositif de traitement de signal, procédé de traitement de signal et programme
CN105792074A (zh) * 2016-02-26 2016-07-20 西北工业大学 一种语音信号处理方法和装置
US10418048B1 (en) * 2018-04-30 2019-09-17 Cirrus Logic, Inc. Noise reference estimation for noise reduction
CN109389991A (zh) * 2018-10-24 2019-02-26 中国科学院上海微系统与信息技术研究所 一种基于麦克风阵列的信号增强方法
CN111341340A (zh) * 2020-02-28 2020-06-26 重庆邮电大学 基于相干性和能量比的鲁棒gsc方法
CN112489674A (zh) * 2020-11-17 2021-03-12 深圳壹账通智能科技有限公司 语音增强方法、装置、设备及计算机可读存储介质

Also Published As

Publication number Publication date
CN112489674A (zh) 2021-03-12

Similar Documents

Publication Publication Date Title
CN107464564B (zh) 语音交互方法、装置及设备
US11620983B2 (en) Speech recognition method, device, and computer-readable storage medium
JP7011075B2 (ja) マイク・アレイに基づく対象音声取得方法及び装置
CN110491403B (zh) 音频信号的处理方法、装置、介质和音频交互设备
JP6400566B2 (ja) ユーザインターフェースを表示するためのシステムおよび方法
WO2022105571A1 (fr) Procédé et appareil d'amélioration de la parole, dispositif et support de stockage lisible par ordinateur
WO2021135628A1 (fr) Procédé de traitement de signal vocal et procédé de séparation de la voix
US11284151B2 (en) Loudness adjustment method and apparatus, and electronic device and storage medium
CN110473568B (zh) 场景识别方法、装置、存储介质及电子设备
CN109887494B (zh) 重构语音信号的方法和装置
CN108109617A (zh) 一种远距离拾音方法
CN111986691B (zh) 音频处理方法、装置、计算机设备及存储介质
CN111863020B (zh) 语音信号处理方法、装置、设备及存储介质
CN111554321A (zh) 降噪模型训练方法、装置、电子设备及存储介质
CN110364156A (zh) 语音交互方法、系统、终端及可读存储介质
CN112233689B (zh) 音频降噪方法、装置、设备及介质
CN115620727B (zh) 音频处理方法、装置、存储介质及智能眼镜
CN115497500B (zh) 音频处理方法、装置、存储介质及智能眼镜
WO2022256577A1 (fr) Procédé d'amélioration de la parole et dispositif informatique mobile mettant en oeuvre le procédé
CN110517702B (zh) 信号生成的方法、基于人工智能的语音识别方法及装置
Bai et al. Audio enhancement and intelligent classification of household sound events using a sparsely deployed array
US11915718B2 (en) Position detection method, apparatus, electronic device and computer readable storage medium
WO2024041512A1 (fr) Procédé et appareil de réduction de bruit audio, dispositif électronique et support d'enregistrement lisible
CN111627456B (zh) 噪音排除方法、装置、设备及可读存储介质
CN113223552B (zh) 语音增强方法、装置、设备、存储介质及程序

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21893719

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM1205A DATED 23.08.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21893719

Country of ref document: EP

Kind code of ref document: A1