WO2019205798A1 - Speech enhancement method, device and equipment - Google Patents


Publication number
WO2019205798A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2019/076189
Inventor
安黄彬
Original Assignee
深圳市沃特沃德股份有限公司
Application filed by 深圳市沃特沃德股份有限公司


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02165 Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal
    • G10L2021/02166 Microphone arrays; Beamforming
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G10L25/21 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/02 Constructional features of telephone sets
    • H04M1/19 Arrangements of transmitters, receivers, or complete sets to prevent eavesdropping, to attenuate local noise or to prevent undesired transmission; Mouthpieces or receivers specially adapted therefor

Definitions

  • the present invention relates to the field of communications, and more particularly to a method, device and equipment for speech enhancement.
  • interference from environmental noise is unavoidable in existing voice communication. It causes the communication device to ultimately receive a speech signal contaminated by noise, which degrades the quality of the speech signal.
  • strong background noise seriously affects call quality, causes listening fatigue, and affects the user's daily mood and nervous activity, so call speech urgently needs noise-reduction processing to improve its intelligibility.
  • in existing approaches, the amount of frequency-domain processing is large, and the speech enhancement achieved by noise reduction still needs to be improved.
  • the main object of the present invention is to provide a method, device and equipment for speech enhancement. It aims to solve the technical problem that, in existing voice calls, speech intensity and intelligibility are low because of noise.
  • the present invention provides a speech enhancement method that collects voice signals through dual-microphone voice channels, each of which performs speech enhancement processing separately. The method includes: acquiring a frequency domain signal of a current voice signal; dividing the frequency domain signal into a plurality of sequentially arranged sub-bands according to a preset rule; calculating a first beam output of each sub-band according to a minimum variance distortionless response (MVDR) algorithm; and averaging the first beam outputs to obtain a second beam output of the frequency domain signal.
  • the present invention also provides a speech enhancement device that collects voice signals through dual-microphone voice channels, each of which performs speech enhancement processing. The device includes: a first acquisition module, configured to acquire a frequency domain signal of a current voice signal; a dividing module, configured to divide the frequency domain signal into a plurality of sequentially arranged sub-bands according to a preset rule; a calculating module, configured to calculate the first beam output of each sub-band according to the MVDR algorithm; and a second acquisition module, configured to obtain a second beam output of the frequency domain signal by averaging the first beam outputs.
  • the present invention also provides speech enhancement equipment comprising a memory, a processor and an application program. The application program is stored in the memory, configured to be executed by the processor, and configured to perform the speech enhancement method described above.
  • the present invention decomposes the wideband frequency domain signal of a voice signal collected by dual microphones into a plurality of mutually non-overlapping narrow sub-bands, and calculates the MVDR beam output of each sub-band with the MVDR (Minimum Variance Distortionless Response) algorithm.
  • the MVDR beam outputs of the sub-bands are then combined and averaged to obtain the MVDR beam output of the entire wideband frequency domain signal. This avoids the shortcomings of traditional processing methods such as delay-and-sum, sidelobe cancellation, and full-band MVDR calculation.
  • the speech enhancement effect is therefore improved. Moreover, because the invention calculates the MVDR beam output separately in each sub-band, it can track environmental noise variations within each sub-band.
  • for fluctuating noise, the smoothing factor is adjusted dynamically to improve noise processing. When processing the wideband frequency domain signal collected by the dual microphones, the invention selects only the frequency range of call speech. This increases processing speed and improves the real-time performance of noise reduction and speech enhancement, so that users hear clearer, undistorted call speech at lower signal-to-noise ratios, which has practical value.
  • FIG. 1 is a schematic flow chart of a speech enhancement method according to an embodiment of the present invention.
  • FIG. 2 is a schematic flow chart of a method for reducing the amount of frequency-domain processing in the speech enhancement method according to an embodiment of the present invention.
  • FIG. 3 is a schematic flow chart of a noise processing method in the speech enhancement method according to an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a speech enhancement device according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a dividing module according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a calculating module according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a first acquisition module according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of an optimized structure of a speech enhancement device according to an embodiment of the present invention.
  • FIG. 9 is a schematic structural diagram of a speech enhancement device according to another embodiment of the present invention.
  • FIG. 10 is a schematic structural diagram of a dividing module according to another embodiment of the present invention.
  • FIG. 11 is a schematic structural diagram of a dividing module according to still another embodiment of the present invention.
  • FIG. 12 is a schematic structural diagram of a noise processing system according to an embodiment of the present invention.
  • FIG. 13 is a schematic structural diagram of a first acquiring submodule according to still another embodiment of the present invention.
  • FIG. 14 is a schematic structural diagram of a second acquisition submodule according to still another embodiment of the present invention.
  • FIG. 15 is a schematic structural diagram of a first obtaining submodule according to still another embodiment of the present invention.
  • a speech enhancement method collects voice signals through dual-microphone voice channels, and each voice channel performs speech enhancement processing. The method includes:
  • S1 Acquire the frequency domain signal of the current voice signal.
  • the frequency domain signal is the signal data obtained by transforming the time domain signal of the voice signal collected by the dual-microphone voice channels with an FFT (Fast Fourier Transform). Because the voice signal in this embodiment is collected through dual-microphone voice channels, the voice signals of the same time domain frame collected by the left and right channels are processed synchronously.
  • in this embodiment, each of the dual-microphone voice channels is connected to its own FFT, and the transformed signal data are buffered in two buffers of the same length for subsequent processing, to enhance the speech processing effect.
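The per-channel transform and buffering described above can be sketched as follows. This is a minimal illustration under stated assumptions: a naive O(N^2) DFT stands in for the FFT so the example needs no third-party libraries, and the function names `dft` and `transform_stereo_frame` are hypothetical, not taken from the patent.

```python
import cmath
import math


def dft(frame):
    """Naive O(N^2) discrete Fourier transform of one time-domain frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def transform_stereo_frame(left_frame, right_frame):
    """Transform the same time-domain frame from both microphones.

    Returns two equal-length spectrum buffers (left, right) so that later
    steps can process the two channels synchronously.
    """
    if len(left_frame) != len(right_frame):
        raise ValueError("both channels must supply frames of the same length")
    return dft(left_frame), dft(right_frame)


# A unit impulse at t=0 on the left channel and at t=1 on the right channel.
left, right = transform_stereo_frame([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0])
```

In practice a real FFT routine would replace `dft`; the point here is only that both channels are transformed frame by frame into two buffers of equal length.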
  • S2 The foregoing frequency domain signal is divided into a plurality of sequentially arranged sub-bands according to a preset rule.
  • when applied directly to a wideband frequency domain signal, the MVDR algorithm performs poorly, causing serious speech distortion and degrading the quality of the output speech.
  • the wideband frequency domain signal is therefore divided into a plurality of non-overlapping sub-bands arranged in sequence, and the MVDR algorithm is performed on each sub-band separately to reduce speech distortion and improve the quality of the processed speech.
  • S3 Calculate the first beam output of each sub-band according to the minimum variance distortionless response (MVDR) algorithm.
  • the output weight vector of each sub-band is obtained from its corresponding covariance matrix.
  • the MVDR beamformer of this embodiment uses a linear array of identical spatial sensors. The covariance matrix of the data is estimated from the signals received by the array and used to find the angle corresponding to the maximum point, that is, the incident direction of the speech signal. The beamformer minimizes the array output power while keeping the response in the desired direction undistorted, which maximizes the output signal-to-noise ratio.
  • the MVDR algorithm is performed separately on each sub-band to obtain the first beam output (i.e., frequency data) of that sub-band. This improves the effect of the MVDR algorithm on the frequency domain signal of the voice signal and reduces speech distortion.
  • S4 Obtain the second beam output of the frequency domain signal by averaging the first beam outputs.
  • this yields the output frequency data of the frequency domain signal corresponding to the current time domain frame, which are output separately on the left and right channels of the dual-microphone voice channels. Steps S1 to S4 are then repeated until all time domain frames of the voice signal have been processed.
  • step S2 includes:
  • S200 distinguishing the sensitive frequency band in the frequency domain signal, wherein the sensitive frequency band is the first frequency band, and the frequency band of the frequency domain signal other than the sensitive frequency band is the second frequency band;
  • the sensitive frequency band of this embodiment is determined according to the use of the voice signal.
  • for example, call speech occupies 200 Hz to 3400 Hz, and its sensitive frequency band is 1 kHz to 2 kHz. Music listening occupies 50 Hz to 15000 Hz, and its sensitive frequency band is 2 kHz to 5 kHz or 1 kHz to 4 kHz.
  • the first frequency band is evenly divided into a plurality of first sub-bands.
  • the second frequency band is evenly divided into a plurality of second sub-bands, where the bandwidth of each second sub-band is greater than that of each first sub-band.
  • in other words, the sensitive frequency band is divided more finely and the rest of the spectrum more coarsely: the sub-bands inside the sensitive band are narrower than those outside it.
  • this keeps speech distortion low in the sensitive band, while the coarser division outside it avoids the extra computation that too many sub-bands would cause.
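The fine/coarse division described above can be sketched on FFT bin indices. This is an illustrative sketch only: the function names are hypothetical, and the sensitive bin range (40 to 88) and the widths (6 and 24 bins) in the usage line are made-up values chosen to show the structure, not figures from the patent.

```python
def split_bins(lo, hi, width):
    """Split the half-open bin range [lo, hi) into consecutive chunks of `width` bins.

    The last chunk is shorter if the range is not a multiple of `width`.
    """
    return [(start, min(start + width, hi)) for start in range(lo, hi, width)]


def nonuniform_subbands(total_bins, sens_lo, sens_hi, fine_width, coarse_width):
    """Narrow sub-bands inside the sensitive range [sens_lo, sens_hi), wide ones outside."""
    if not 0 <= sens_lo <= sens_hi <= total_bins:
        raise ValueError("sensitive range must lie inside the spectrum")
    if fine_width >= coarse_width:
        raise ValueError("sensitive sub-bands must be narrower than the others")
    bands = split_bins(0, sens_lo, coarse_width)               # second band, below
    bands += split_bins(sens_lo, sens_hi, fine_width)          # first (sensitive) band
    bands += split_bins(sens_hi, total_bins, coarse_width)     # second band, above
    return bands


# Hypothetical example: 144 bins, sensitive range at bins 40..87.
bands = nonuniform_subbands(144, 40, 88, 6, 24)
```

The result is a list of contiguous, non-overlapping (first_bin, end_bin) pairs; eight 6-bin sub-bands cover the sensitive range, while only five wider sub-bands cover everything else.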
  • step S3, calculating the first beam output of each sub-band according to the MVDR algorithm, includes:
  • S300 Perform voice activity detection (VAD) in each sub-band to obtain the power ratio of two adjacent non-speech segments.
  • voice activity detection is used to estimate the power spectrum of the non-speech segments (i.e., noise) in the gaps between speech, so that the trend of the surrounding environmental noise can be judged promptly and the noise tracked in detail.
  • the power variation of the non-speech segments is tracked through the change in the power ratio of two non-speech segments: an increasing power ratio indicates that the noise is getting stronger, and a decreasing one that it is getting weaker.
  • S301 Acquire, according to the power ratio, the corresponding smoothing factor used to remove the noise of the non-speech segments.
  • the smoothing factor is adjusted dynamically according to the tracked change in noise power.
  • when the environmental noise varies quickly relative to the sampling rate or the noise power is relatively weak, the smoothing factor should be set smaller; when the noise varies slowly relative to the sampling rate or the noise power is relatively strong, the smoothing factor should be larger. This tracks changes in the spatial sound field promptly, follows environmental noise changes better, adapts the degree of noise removal, and effectively smooths noise fluctuations. Reducing the influence of noise fluctuations further improves the signal-to-noise ratio of dual-microphone noise reduction and the sound quality of the output speech signal.
  • S302 Obtain the covariance matrix of the frequency band features in each sub-band according to the smoothing factor.
  • the covariance matrix is updated promptly according to the dynamically changing smoothing factor, so that the incident direction of the speech signal is determined more accurately and the influence of ambient noise on the dual-microphone voice channels is further reduced.
  • S303 Perform eigendecomposition of the covariance matrix to obtain the output weight vector of each sub-band.
  • in this embodiment, the MVDR algorithm produces a covariance matrix, and the output weight vector corresponding to it, that is, the first beam output, is obtained by eigendecomposition.
  • step S1 of acquiring the frequency domain signal of the current voice signal includes:
  • S100 Acquire a first time domain signal of a current voice signal separately collected by the dual microphone voice channel.
  • the dual microphone voice channel of this embodiment collects time domain signals of voice signals, and the time domain signals are sequentially arranged in time series.
  • the first time domain signal in this embodiment is one of these time domain signals. The term "first" and similar terms are used here only for distinction and are not limiting.
  • S101 Input the first time domain signals into the band-pass filters corresponding to the respective dual-microphone voice channels to obtain time domain signals in a specified frequency range.
  • the speech band of interest in this embodiment is the frequency range of human speech, 200 Hz to 3400 Hz, so that speech is enhanced without distorting normal speech.
  • voice signals outside 200 Hz to 3400 Hz are all filtered out during preprocessing while full coverage of 200 Hz to 3400 Hz is preserved, which reduces the amount of data to process while keeping speech undistorted.
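A 200 Hz to 3400 Hz band-pass filter of the kind described in S101 can be sketched as a second-order high-pass followed by a second-order low-pass. The patent does not specify the filter design, so the RBJ Audio EQ Cookbook biquads, the Butterworth-like Q of 1/sqrt(2) and the 16000 Hz sampling rate used here are all assumptions, and the function names are hypothetical.

```python
import math


def biquad_coeffs(kind, f0, fs, q=1 / math.sqrt(2)):
    """RBJ cookbook low-pass or high-pass biquad, normalized so that a[0] == 1."""
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    cos_w0 = math.cos(w0)
    if kind == "highpass":
        b = [(1 + cos_w0) / 2, -(1 + cos_w0), (1 + cos_w0) / 2]
    elif kind == "lowpass":
        b = [(1 - cos_w0) / 2, 1 - cos_w0, (1 - cos_w0) / 2]
    else:
        raise ValueError("kind must be 'highpass' or 'lowpass'")
    a0 = 1 + alpha
    a = [1.0, -2 * cos_w0 / a0, (1 - alpha) / a0]
    return [x / a0 for x in b], a


def biquad_filter(x, b, a):
    """Filter a sequence with one biquad section (direct form I)."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


def speech_bandpass(x, fs=16000, lo=200.0, hi=3400.0):
    """Keep roughly lo..hi Hz: high-pass at lo, then low-pass at hi."""
    x = biquad_filter(x, *biquad_coeffs("highpass", lo, fs))
    return biquad_filter(x, *biquad_coeffs("lowpass", hi, fs))
```

A 1 kHz tone passes almost unchanged, while a constant (DC) input decays to zero and a 50 Hz hum is strongly attenuated. Steeper filters would need more sections.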
  • S102 Convert the time domain signals of the specified frequency range to the frequency domain signals of the specified frequency range of the current voice signal by using a Fourier transform respectively associated with the dual microphone voice channels.
  • each time domain signal is converted into a frequency domain signal by FFT transformation.
  • the voice signals of the dual microphone voice channel are synchronized to perform the same conversion operation, and the converted data is respectively buffered in two identical buffers.
  • the method includes:
  • the time domain signal collected by the dual-microphone voice channels is converted into a frequency domain signal and then processed by noise reduction, speech enhancement and so on. The processed frequency domain signal must then be converted back to the corresponding time domain signal by an inverse Fourier transform before the listener can hear and understand it.
  • in this embodiment, the voice signals collected by the dual-microphone voice channels are processed synchronously in the left and right voice channels and combined into one at the output.
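The return to the time domain and the merge of the two channels can be sketched as below. A naive inverse DFT stands in for the inverse FFT, and averaging the two channels is only one assumed way to "combine them into one"; the patent does not give the combination rule. Function names are hypothetical.

```python
import cmath
import math


def idft(spectrum):
    """Naive inverse DFT; returns the real part (input assumed conjugate-symmetric)."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def merge_channels(left_spectrum, right_spectrum):
    """Convert both processed channels back to the time domain and average them."""
    left = idft(left_spectrum)
    right = idft(right_spectrum)
    return [(l + r) / 2 for l, r in zip(left, right)]
```

For instance, a flat spectrum [1, 1, 1, 1] returns a unit impulse and [4, 0, 0, 0] returns a constant 1, so merging them yields [1, 0.5, 0.5, 0.5].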
  • in this embodiment, the voice signal is preprocessed to reduce the amount of frequency-domain processing. This method includes the following operations, performed before step S2:
  • the specified number of FFT points in this embodiment may be, for example, 256, 1024 or 2048.
  • 1024 points are preferred, as this achieves the required processing effect within a suitable computational budget.
  • a speech signal in the 200 Hz to 3400 Hz range, transformed by a 1024-point FFT, yields a frequency domain signal of about 144 frequency points.
  • by contrast, processing the full speech segment covering 200 Hz to 3400 Hz without this preprocessing would require a full frequency domain signal of about 512 points, so the preprocessing greatly reduces the amount of calculation.
  • step S2 of dividing the frequency domain signal into a plurality of sequentially arranged sub-bands according to a preset rule includes:
  • S202 Acquire the total number of frequency points in the frequency domain signal that the Fourier transform with the specified number of points produces for the first time domain signal.
  • in this implementation, the frequency domain signal of the first time domain signal has 144 frequency points in total, and the sub-band division is based on these 144 points.
  • the frequency domain signal is then divided evenly into a plurality of sequentially arranged sub-bands according to the total number of frequency points.
  • the division may be performed by configuring the number of frequency points on each subband.
  • the number of frequency points included in each sub-band is configured to be 24, that is, the number of sub-bands of the first time-domain signal is 144 divided by 24, which is 6 sub-bands.
  • Other embodiments of the present invention may configure the number of frequency points included in each sub-band to be 8, 6, etc., so as to evenly divide the sub-bands.
  • when each sub-band contains 8 frequency points, there are 18 sub-bands; when each contains 6, there are 24.
  • to optimize speech noise reduction and enhancement, this embodiment preferably uses the division scheme with 6 frequency points per sub-band and 24 sub-bands. The more sub-bands there are, the narrower each one is and the less speech distortion remains after the MVDR algorithm, at the cost of slightly more computation. The fewer sub-bands there are, the less computation is needed, but each sub-band is wider and the distortion is larger.
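The uniform division arithmetic above (144 points split into 6, 18 or 24 sub-bands) can be checked with a short sketch. The function name is hypothetical, and bin indices are counted from the start of the retained 144-point range.

```python
def uniform_subbands(total_bins, bins_per_band):
    """Split total_bins frequency points into equal, non-overlapping sub-bands.

    Returns (first_bin, end_bin_exclusive) pairs relative to the start of the
    retained frequency range.
    """
    if total_bins % bins_per_band:
        raise ValueError("total_bins must be a multiple of bins_per_band")
    return [(start, start + bins_per_band) for start in range(0, total_bins, bins_per_band)]


for width in (24, 8, 6):
    print(width, len(uniform_subbands(144, width)))
# 24 6
# 8 18
# 6 24
```

The trade-off in the text shows up directly: narrower sub-bands mean more of them, and therefore more per-sub-band MVDR computations.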
  • the method includes:
  • S204 Calculate a frequency band center frequency corresponding to each of the first sub-band and each second sub-band respectively;
  • the center frequency of each sub-band is used to obtain its direction vector. This makes it easier to steer toward the optimal angle for collecting the speech signal and to avoid picking up the strongest noise along with it.
  • the first sub-bands are processed on the same principle as the second sub-bands; only their bandwidths differ. This embodiment uses uniformly divided sub-bands as the example for the detailed description. After the 1024-point FFT of the wideband frequency domain signal in this embodiment, the resolution of each frequency point is 1600/1024, and the frequency range 200 Hz to 3400 Hz corresponds to frequency points 12 to 207.
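Computing a sub-band's center frequency from its FFT bin indices, as in S204, amounts to multiplying the middle bin index by the bin resolution fs / N. The sampling rate is not stated unambiguously in the text, so fs = 16000 Hz in the usage line is an assumption, as is the hypothetical name `band_center_hz`.

```python
def band_center_hz(first_bin, last_bin, fs, n_fft):
    """Center frequency of a sub-band spanning FFT bins first_bin..last_bin inclusive."""
    resolution = fs / n_fft  # Hz per FFT bin
    return (first_bin + last_bin) / 2 * resolution


# A 6-bin sub-band starting at bin 12, assuming fs = 16000 Hz and a 1024-point FFT.
print(band_center_hz(12, 17, fs=16000, n_fft=1024))  # 226.5625
```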
  • S205 Calculate, according to the center frequency of the frequency band, a direction vector corresponding to each of the first sub-band and each of the second sub-bands.
  • the direction vector is calculated by substituting the center frequency calculated above into the following formula.
  • the left voice channel is taken as the reference point
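The direction-vector formula itself is not reproduced in this text. For a two-microphone array with the left channel as the reference, a common choice is v = [1, exp(-j*2*pi*f_c*tau)], where f_c is the sub-band center frequency and tau is the inter-microphone delay. The sketch below uses that standard form as an assumption, not as the patent's exact formula; `steering_vector` is a hypothetical name.

```python
import cmath
import math


def steering_vector(center_hz, tau_s):
    """Two-microphone direction vector with the left channel as phase reference.

    Assumes the right channel lags the left by tau_s seconds, so its phase at
    center_hz is rotated by -2*pi*f*tau.
    """
    return [1 + 0j, cmath.exp(-2j * math.pi * center_hz * tau_s)]


# A quarter-period delay at 1 kHz rotates the right-channel phase by -90 degrees.
v = steering_vector(1000.0, 0.00025)
```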
  • the time delay estimate tau can be obtained by cross-correlating the data collected by the two microphone voice channels.
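A brute-force cross-correlation delay estimate can be sketched as follows: try every lag in a window and keep the one with the largest correlation. The result is in samples, and tau in seconds would be lag / fs. The function name and the search window are hypothetical, and practical systems often use faster or more robust variants such as frequency-domain or GCC-PHAT correlation.

```python
def estimate_delay_samples(left, right, max_lag):
    """Lag (in samples) maximizing sum(left[n] * right[n + lag]).

    A positive result means the right channel lags the left one.
    Both inputs are assumed to have the same length.
    """
    best_lag, best_score = 0, float("-inf")
    n = len(left)
    for lag in range(-max_lag, max_lag + 1):
        # Only sum over indices where both left[i] and right[i + lag] exist.
        score = sum(left[i] * right[i + lag] for i in range(max(0, -lag), min(n, n - lag)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


left = [0, 0, 1, 0, 0, 0, 0, 0]
right = [0, 0, 0, 0, 1, 0, 0, 0]  # same impulse, two samples later
lag = estimate_delay_samples(left, right, 3)
```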
  • S206 Obtain, according to the direction vector, a covariance matrix of a frequency band feature corresponding to each first sub-band and each second sub-band, and an optimal weight coefficient corresponding to an inverse matrix of the covariance matrix.
  • signals are collected through a dual microphone voice channel, and the covariance matrix is 2 rows and 2 columns.
  • r_inv: the inverse of the covariance matrix
  • W_opt: the optimal weight coefficient of the current sub-band
  • W_opt = r_inv*vssL/(vssL'*r_inv*vssL)
  • vssL: the direction vector
  • vssL': the transpose of the direction vector.
  • the original vector is one row and two columns, and after transposition, it is two rows and one column.
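The weight formula W_opt = r_inv*vssL/(vssL'*r_inv*vssL) can be sketched for the 2x2 case. Here the prime is read as the Hermitian (conjugate) transpose, as in MATLAB notation; that reading and the helper names `inv2` and `mvdr_weights` are assumptions.

```python
def inv2(r):
    """Inverse of a 2x2 (complex) matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = r
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("covariance matrix is singular; consider diagonal loading")
    return [[d / det, -b / det], [-c / det, a / det]]


def mvdr_weights(r, v):
    """W_opt = r_inv*v / (v^H * r_inv * v) for a 2-element direction vector v."""
    r_inv = inv2(r)
    r_inv_v = [r_inv[0][0] * v[0] + r_inv[0][1] * v[1],
               r_inv[1][0] * v[0] + r_inv[1][1] * v[1]]
    denom = v[0].conjugate() * r_inv_v[0] + v[1].conjugate() * r_inv_v[1]
    return [x / denom for x in r_inv_v]


w = mvdr_weights([[2, 0.5 + 0.5j], [0.5 - 0.5j, 1]], [1, -1j])
```

The normalization by v^H r_inv v enforces w^H v = 1. That is the distortionless constraint: the signal arriving from the look direction passes with unit gain while the output power from other directions is minimized.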
  • the optimal weight coefficient corresponds to the optimal angle at which the dual-microphone voice channels pick up the user's speech within the scanned angle range. For example, when scanning from -45° to 45°, if the noise carried by the user's speech signal is lowest at 60°, then 60° is the optimal angle.
  • S207 Calculate, according to the optimal weight coefficient, a first signal output corresponding to each of the first sub-band and each of the second sub-bands.
  • S_L is the frequency vector from point Fbin_loL to point Fbin_hiL obtained by FFT of the current time domain frame acquired by the left channel, and S_R is the corresponding frequency vector for the right channel; that is, S_L and S_R are the frequency data in the corresponding sub-band.
  • Fbin_loL is the index of the lower frequency boundary of the sub-band
  • Fbin_hiL is the index of the upper frequency boundary of the sub-band
  • the frequency output data of the left and right channels corresponding to the first time domain signal are stored in buffers, and the frequency data in all sub-band buffers are added to obtain the first signal outputs of the left and right voice channels of the dual-microphone voice channels.
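Applying a sub-band's weights and gathering all sub-bands back into one spectrum can be sketched as follows. The text does not give the output formula, so the per-bin form y[k] = w^H [S_L[k], S_R[k]] is an assumption, as are both function names. Because the sub-bands do not overlap, adding each one into a zero-initialized buffer is the same as writing it into its own slot.

```python
def beamform_subband(w, s_l, s_r):
    """Per-bin beamformer output y[k] = w^H [S_L[k], S_R[k]] for one sub-band."""
    return [w[0].conjugate() * xl + w[1].conjugate() * xr for xl, xr in zip(s_l, s_r)]


def assemble_output(n_bins, subbands):
    """Add each sub-band's output into a zero-initialized full spectrum buffer.

    subbands is a list of (first_bin, outputs) pairs for non-overlapping sub-bands.
    """
    spectrum = [0j] * n_bins
    for first_bin, outputs in subbands:
        for offset, value in enumerate(outputs):
            spectrum[first_bin + offset] += value
    return spectrum


y = beamform_subband([0.5, 0.5], [2, 4], [2, 0])  # equal weights average the channels
```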
  • the method includes:
  • S208 Receive a second time domain signal with a minimum time difference from the first time domain signal according to a time sequence of the received voice signal.
  • the time domain frame data is processed one by one in chronological order.
  • the second time domain signal is subjected to the same processing process as the first time domain signal to obtain a second signal output corresponding to the second time domain signal.
  • the second signal output processing process of this embodiment is the same as the first signal output.
  • the speech intensity is improved by noise processing.
  • step S300 includes:
  • S3001 Perform voice activity detection on each sub-band during non-speech periods to obtain, for the current first non-speech segment, a first power at a first time, a second power at a second time, and a third power at a third time. The first, second and third times are consecutive and ordered from the most recent to the earliest.
  • VAD: Voice Activity Detection.
  • the noise in each sub-band is estimated during the non-speech periods detected by VAD (i.e., when the user is not speaking), and the noise power values of the last three stages are retained.
  • the most recent noise power estimate is taken at the first time, with first power P1; the time before the first time is the second time, with second power P2; and the time before the second time is the third time, with third power P3.
  • S3002 Calculate a current power change corresponding to each sub-band by calculating a ratio of the first power to the second power, and obtain a previous power change corresponding to each sub-band by calculating a ratio of the second power to the third power.
  • S3003 Acquire a power ratio of two adjacent non-speech segments by calculating a first ratio of the current power variation to the previous power variation.
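Steps S3001 to S3003 can be sketched as a small per-sub-band history of the last three noise power estimates. Using the mean |X[k]|^2 over the sub-band's bins as the power estimate is an assumption, and the class and function names are hypothetical.

```python
from collections import deque


def subband_power(bins):
    """Mean power |X[k]|^2 over the frequency points of one sub-band."""
    if not bins:
        raise ValueError("sub-band has no frequency points")
    return sum(abs(x) ** 2 for x in bins) / len(bins)


def power_ratio(p1, p2, p3):
    """First ratio (P1 / P2) / (P2 / P3), with P1 the most recent estimate."""
    current_change = p1 / p2   # S3002: current power change
    previous_change = p2 / p3  # S3002: previous power change
    return current_change / previous_change  # S3003


class NoisePowerHistory:
    """Keeps the three most recent non-speech power estimates of one sub-band."""

    def __init__(self):
        self._powers = deque(maxlen=3)  # ordered oldest (P3) to newest (P1)

    def push(self, power):
        self._powers.append(power)

    def ratio(self):
        if len(self._powers) < 3:
            return None  # not enough history yet
        p3, p2, p1 = self._powers
        return power_ratio(p1, p2, p3)
```

A steadily growing noise level (for example 1, 2, 4) gives a ratio of 1.0, while a sudden jump (1, 1, 2) gives 2.0, so the ratio reacts to changes in the trend rather than to the level itself.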
  • step S301 of the embodiment includes:
  • the preset range of this embodiment is 0.8 to 1.2 for Value, the first ratio obtained in S3003.
  • the smoothing factor is set to an initialization value, for example 1.0.
  • the method further includes:
  • a second ratio, of the initialization value to Value, is calculated and used as the smoothing factor. For example, if Value is currently 1.1, the second ratio is 1.0/1.1, so the smoothing factor at the current time is 1.0/1.1.
  • adjusting the noise-removal smoothing factor dynamically in real time reduces the influence of noise fluctuations, further improves the signal-to-noise ratio of dual-microphone noise reduction, and improves the sound quality of the output voice signal.
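The smoothing-factor rule can be sketched as below. The worked example (Value 1.1 gives 1.0/1.1) shows that a Value inside the preset range yields the second ratio; falling back to the initialization value outside the range is an inference from the surrounding steps, not something the text states outright. `smoothing_factor` is a hypothetical name.

```python
def smoothing_factor(value, init=1.0, lo=0.8, hi=1.2):
    """Smoothing factor alfa derived from the power ratio `value`.

    Inside the preset range [lo, hi], alfa is the second ratio init / value.
    Outside it, this sketch falls back to the initialization value (assumed).
    """
    if lo <= value <= hi:
        return init / value
    return init


alfa = smoothing_factor(1.1)  # equals 1.0 / 1.1, as in the worked example
```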
  • step S302 of the embodiment includes:
  • S3021 Acquire the frequency point vector of the current sub-band, from its lower boundary index to its upper boundary index.
  • S3022 Update the sub-band covariance matrix according to the smoothing factor at the current time and the frequency point vector.
  • the covariance matrix of this embodiment is updated in real time according to the following formula.
  • the processing procedure of the time domain signal collected by the dual microphone left channel is taken as an example. After the frequency domain signal corresponding to the time domain signal is divided into subbands, the covariance matrix is updated as follows.
  • R_SUBBAND_new = R_SUBBAND_old*alfa + S_L*S_L'*(1-alfa), where alfa is the smoothing factor at the current time, R_SUBBAND_new is the updated covariance matrix, R_SUBBAND_old is the covariance matrix from the previous time, S_L is the frequency vector from point Fbin_loL to point Fbin_hiL obtained by FFT of the current time domain frame acquired by the left channel, and S_L' denotes its transpose.
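The recursive covariance update can be sketched as exponential smoothing. Since step S206 states that the covariance matrix is 2x2, this sketch assumes that each snapshot is the two-channel vector x_k = [S_L[k], S_R[k]] for one frequency point and that the outer products x_k x_k^H are averaged over the sub-band's bins. That interpretation and the name `update_covariance` are assumptions.

```python
def update_covariance(r_old, s_l, s_r, alfa):
    """R_new = alfa * R_old + (1 - alfa) * mean_k(x_k x_k^H), x_k = [S_L[k], S_R[k]]."""
    n = len(s_l)
    if n == 0 or n != len(s_r):
        raise ValueError("both channels need the same, non-zero number of bins")
    inst = [[0j, 0j], [0j, 0j]]  # instantaneous 2x2 covariance estimate
    for xl, xr in zip(s_l, s_r):
        x = (xl, xr)
        for i in range(2):
            for j in range(2):
                inst[i][j] += x[i] * x[j].conjugate() / n
    return [[alfa * r_old[i][j] + (1 - alfa) * inst[i][j] for j in range(2)]
            for i in range(2)]


r = update_covariance([[1, 0], [0, 1]], [1], [1j], 0.5)
```

A larger alfa keeps more of the previous matrix, which suits slowly varying noise. A smaller alfa lets the estimate follow fast changes, matching the guidance given for S301.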
  • a speech enhancement device collects voice signals through dual-microphone voice channels, and each voice channel performs speech enhancement processing. The device includes a first acquisition module 1, configured to acquire the frequency domain signal of the current voice signal.
  • the dividing module 2 is configured to divide the frequency domain signal into a plurality of sequentially arranged sub-bands according to a preset rule.
  • the calculating module 3 is configured to calculate the first beam output of each sub-band according to the MVDR algorithm.
  • the second obtaining module 4 is configured to obtain the second beam output of the frequency domain signal by averaging the first beam outputs.
  • The dividing module 2 includes:
  • a distinguishing sub-module 200 configured to distinguish the sensitive frequency band in the frequency domain signal, where the sensitive frequency band is the first frequency band and the remaining frequency bands form the second frequency band; and a first dividing sub-module 201 configured to divide the first frequency band evenly into a plurality of first sub-bands and the second frequency band evenly into a plurality of second sub-bands, where the bandwidth of each second sub-band is greater than that of each first sub-band.
  • The calculating module 3 includes:
  • a first acquiring sub-module 300 configured to obtain, by voice activity detection in each of the sub-bands, the power ratio of two adjacent non-speech segments;
  • a second acquiring sub-module 301 configured to obtain the smoothing factor corresponding to the non-speech segments according to the power ratio;
  • a first obtaining sub-module 302 configured to obtain the covariance matrix of the frequency band features in each of the sub-bands according to the smoothing factor; and
  • an obtaining sub-module 303 configured to perform eigendecomposition of the covariance matrix to obtain the output weight vector of each sub-band, that is, the first beam output.
  • The first acquiring module 1 includes:
  • a third obtaining sub-module 100 configured to acquire the first time domain signals of the current speech signal collected separately by the dual-microphone voice channels;
  • an input sub-module 101 configured to input the first time domain signals into the band-pass filters corresponding to the respective dual-microphone voice channels, to obtain time domain signals in a specified frequency range; and
  • a conversion sub-module 102 configured to convert the time domain signals in the specified frequency range into frequency domain signals of the current speech signal in that range, using the Fourier transforms associated with the respective dual-microphone voice channels.
  • A speech enhancement apparatus includes: a conversion module 5 configured to convert the frequency domain signal into an output time domain signal by inputting the second beam output of the frequency domain signal into the inverse Fourier transformers associated with the respective dual-microphone voice channels; and an output module 6 configured to output the corresponding output time domain signals through the respective dual-microphone voice channels.
  • To reduce the amount of frequency-domain processing, the speech signals collected by the voice channels are preprocessed. A selection module 20 is connected ahead of the dividing module 2 and is configured to select a Fourier transform with a specified number of points according to the computational capacity of the frequency-domain processing platform; and an obtaining module 21 is configured to preprocess the first time domain signals of the current speech signal collected separately by the dual-microphone voice channels and then obtain the corresponding frequency domain signals through the Fourier transform with the specified number of points.
  • The dividing module 2 of this embodiment includes: a third acquiring sub-module 202 configured to acquire the total number of frequency bins of the frequency domain signal corresponding to the first time domain signal, obtained through the Fourier transform with the specified number of points; and
  • a second dividing sub-module 203 configured to divide the frequency domain signal uniformly into a plurality of sequentially arranged sub-bands according to that total number of frequency bins.
  • The dividing module 2 includes: a first calculating sub-module 204 configured to calculate the center frequency of each first sub-band and each second sub-band;
  • a sub-module 205 configured to calculate the direction vector of each first sub-band and each second sub-band from its center frequency;
  • an obtaining sub-module 206 configured to obtain, according to the direction vector, the covariance matrix of the frequency band features of each first sub-band and each second sub-band, and the optimal weight coefficients derived from the inverse of that covariance matrix; and
  • a third calculating sub-module 207 configured to calculate, according to the optimal weight coefficients, the first signal output of each first sub-band and each second sub-band.
  • The dividing module 2 includes: a receiving sub-module 208 configured to receive, according to the time order of the received speech signal, the second time domain signal that follows the first time domain signal with the smallest time difference; and a third obtaining sub-module 209 configured to subject the second time domain signal to the same process as the first time domain signal, to obtain the second signal output corresponding to the second time domain signal.
  • The process of calculating the first beam output of each of the sub-bands according to the MVDR algorithm includes a noise processing system, which improves speech intensity through noise processing.
  • The first acquiring sub-module 300 includes: a detecting unit 3001 configured to perform voice activity detection on each sub-band during non-speech periods, to obtain a first power at a first time, a second power at a second time and a third power at a third time for the current non-speech segment, where the first, second and third times are consecutive and ordered from the most recent to the earliest;
  • an obtaining unit 3002 configured to obtain the current power change of each sub-band as the ratio of the first power to the second power, and the power change at the previous time as the ratio of the second power to the third power; and
  • a first obtaining unit 3003 configured to obtain the power ratio of two adjacent non-speech segments as a first ratio, namely the current power change divided by the power change at the previous time.
  • The second acquiring sub-module 301 of this embodiment includes:
  • a determining unit 3011 configured to determine whether the first ratio is within a preset range; and a selecting unit 3012 configured to select an initial smoothing factor as the smoothing factor of the current time if the first ratio is within the preset range.
  • The second acquiring sub-module 301 further includes: a calculating unit 3013 configured to calculate a second ratio, namely the initial smoothing factor divided by the first ratio, if the first ratio is not within the preset range; and
  • a setting unit 3014 configured to set the second ratio as the smoothing factor of the current time.
  • The first obtaining sub-module 302 of this embodiment includes:
  • a second obtaining unit 3021 configured to acquire, for the current time, the frequency-bin vector from the lower boundary to the upper boundary of the sub-band; and an updating unit 3022 configured to update the covariance matrix of the sub-band according to the smoothing factor of the current time and the frequency-bin vector.
  • The present application also provides a speech enhancement device including a memory, a processor and an application program, where the application program is stored in the memory and configured to be executed by the processor to perform the speech enhancement method of any of the above embodiments.
  • The present invention encompasses devices, including the apparatus described above, for performing one or more of the methods described in this application.
  • Such devices may be specially designed and manufactured for the required purposes, or may include known devices in a general-purpose computer.
  • Such a device stores computer programs or applications that are selectively activated or reconfigured.
  • Such computer programs may be stored in a device-readable (e.g., computer-readable) medium or in any type of medium suitable for storing electronic instructions and coupled to a bus, including but not limited to any type of disk (including floppy disk, hard disk, optical disk, CD-ROM and magneto-optical disk), ROM (read-only memory), RAM (random access memory), EPROM (erasable programmable read-only memory), EEPROM (electrically erasable programmable read-only memory), flash memory, magnetic card or optical card.
  • A readable medium includes any medium that stores or transmits information in a form readable by a device (e.g., a computer).

Abstract

A speech enhancement method. Speech signals are acquired through dual-microphone speech channels, and each speech channel performs speech enhancement processing separately. The method comprises: acquiring a frequency domain signal of a current speech signal (S1); dividing the frequency domain signal into multiple sequentially arranged sub-bands according to a preset rule (S2); calculating a first beam output of each sub-band according to the minimum variance distortionless response (MVDR) algorithm (S3); and acquiring a second beam output of the frequency domain signal by averaging the first beam outputs (S4).

Description

Speech enhancement method, apparatus and device
Technical field
The present invention relates to the field of communications, and in particular to a speech enhancement method, apparatus and device.
Background art
Interference from ambient noise is unavoidable in voice communication: ambient noise causes the communication device to receive a noise-contaminated speech signal, which degrades its quality. Especially in noisy public environments such as cars, airplanes, ships, airports and shopping malls, strong background noise seriously degrades communication quality, causes listening fatigue, and affects users' mood and nervous activity, so there is an urgent need to reduce noise in call speech to improve intelligibility. However, existing dual-microphone noise reduction methods involve a large amount of frequency-domain processing, and their speech enhancement through noise reduction still needs improvement.
Technical problem
The main object of the present invention is to provide a speech enhancement method, apparatus and device that solve the technical problem of low speech intensity and intelligibility caused by noise in existing voice calls.
Technical solution
The present invention provides a speech enhancement method in which speech signals are collected through dual-microphone voice channels and each voice channel performs speech enhancement processing separately. The method includes: acquiring a frequency domain signal of the current speech signal; dividing the frequency domain signal into a plurality of sequentially arranged sub-bands according to a preset rule; calculating the first beam output of each sub-band according to the minimum variance distortionless response (MVDR) algorithm; and obtaining the second beam output of the frequency domain signal by averaging the first beam outputs.
The present invention also provides a speech enhancement apparatus in which speech signals are collected through dual-microphone voice channels and each voice channel performs speech enhancement processing separately. The apparatus includes: a first acquiring module configured to acquire a frequency domain signal of the current speech signal; a dividing module configured to divide the frequency domain signal into a plurality of sequentially arranged sub-bands according to a preset rule; a calculating module configured to calculate the first beam output of each sub-band according to the MVDR algorithm; and a second acquiring module configured to obtain the second beam output of the frequency domain signal by averaging the first beam outputs.
The present invention also provides a speech enhancement device including a memory, a processor and an application program, where the application program is stored in the memory and configured to be executed by the processor to perform the speech enhancement method described above.
Beneficial effects
Beneficial technical effects of the present invention: the invention decomposes the wideband frequency domain signal of the speech collected by the dual microphones into multiple non-overlapping narrow bands, computes the MVDR beam output of each sub-band with the MVDR (Minimum Variance Distortionless Response) algorithm, and sums and averages the sub-band MVDR beam outputs to obtain the MVDR beam output of the entire wideband signal. This avoids the poor noise reduction that conventional methods such as delay-and-sum, sidelobe cancellation and direct MVDR computation achieve on wideband frequency domain signals, and improves speech enhancement. In addition, when computing the MVDR beam output of each sub-band, the invention tracks changes in ambient noise within each sub-band and dynamically adjusts the smoothing factor for strongly fluctuating noise, improving noise handling. Finally, when processing the wideband frequency domain signal of the speech collected by the dual microphones, the invention processes only the frequency range of call speech, which increases processing speed and the real-time performance of noise reduction and speech enhancement, so that a listener can hear relatively clear, undistorted call speech even at low signal-to-noise ratios. The invention therefore has practical application value.
Description of drawings
FIG. 1 is a schematic flowchart of a speech enhancement method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a method for reducing the amount of frequency-domain processing in the speech enhancement method according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of a noise processing method in the speech enhancement method according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a speech enhancement apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a dividing module according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a calculating module according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a first acquiring module according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of an optimized structure of the speech enhancement apparatus according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a speech enhancement apparatus according to another embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a dividing module according to another embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a dividing module according to yet another embodiment of the present invention;
FIG. 12 is a schematic structural diagram of a noise processing system according to an embodiment of the present invention;
FIG. 13 is a schematic structural diagram of a first acquiring sub-module according to still another embodiment of the present invention;
FIG. 14 is a schematic structural diagram of a second acquiring sub-module according to still another embodiment of the present invention;
FIG. 15 is a schematic structural diagram of a first obtaining sub-module according to still another embodiment of the present invention.
Best mode for carrying out the invention
Referring to FIG. 1, a speech enhancement method according to an embodiment of the present invention collects speech signals through dual-microphone voice channels, and each voice channel performs speech enhancement processing separately. The method includes:
S1: Acquire the frequency domain signal of the current speech signal.
In this embodiment, the frequency domain signal is the data obtained by applying an FFT (fast Fourier transform) to the time domain signal of the speech collected by the dual-microphone voice channels. Because the speech is collected through two microphone channels, the speech in the same time-domain frame collected by the left and right channels is processed identically and synchronously. For example, each of the two microphone channels in this embodiment is connected to an FFT, and the FFT outputs are stored in two buffers of equal length for subsequent processing that enhances the speech.
S2: Divide the frequency domain signal into a plurality of sequentially arranged sub-bands according to a preset rule.
The MVDR algorithm performs poorly on wideband frequency domain signals, causing severe speech distortion and degrading the quality of the output speech. This embodiment therefore divides the wideband frequency domain signal into multiple non-overlapping, sequentially arranged sub-bands and applies the MVDR algorithm to each sub-band separately, which reduces speech distortion and improves the quality of the processed speech.
S3: Calculate the first beam output of each sub-band according to the minimum variance distortionless response algorithm.
In the MVDR algorithm of this embodiment, the output weight vector of each sub-band is obtained from the associated covariance matrix. The MVDR beamformer of this embodiment consists of a linear array of identical spatial sensors. The covariance matrix is estimated from the data received by the array and used to find the angle at which the spatial response peaks, that is, the direction of incidence of the speech signal; the weights then minimize the array output power while keeping unit (distortionless) gain in the desired direction, which maximizes the output signal-to-noise ratio. By applying the MVDR algorithm to each sub-band separately, this embodiment obtains the first beam output (i.e., frequency data) of each sub-band, which improves the result of MVDR processing on the frequency domain speech signal and reduces speech distortion.
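The per-sub-band MVDR step can be illustrated with the textbook narrowband MVDR weight w = R⁻¹a / (aᴴR⁻¹a), where a is the direction (steering) vector evaluated at the sub-band center frequency. The following is a minimal Python sketch under stated assumptions, not the patented computation: it assumes a two-microphone linear array with a hypothetical spacing, a known look direction, a speed of sound of 343 m/s, and a 2×2 spatial covariance across the two microphones. All function names are illustrative.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def steering_vector(freq_hz, mic_spacing_m, theta_rad):
    """Direction vector of a two-mic linear array at one (center) frequency.

    theta_rad is the assumed look direction, measured from broadside.
    """
    delay = mic_spacing_m * math.sin(theta_rad) / SPEED_OF_SOUND
    return [1.0 + 0j, cmath.exp(-2j * math.pi * freq_hz * delay)]


def inv2x2(m):
    """Inverse of a 2x2 complex matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("covariance matrix is singular; add diagonal loading")
    return [[d / det, -b / det], [-c / det, a / det]]


def mvdr_weights(cov, steer):
    """w = R^-1 a / (a^H R^-1 a): minimum output power, unit gain toward `steer`."""
    r_inv = inv2x2(cov)
    r_inv_a = [sum(r_inv[i][j] * steer[j] for j in range(2)) for i in range(2)]
    denom = sum(steer[i].conjugate() * r_inv_a[i] for i in range(2))
    return [x / denom for x in r_inv_a]


def beam_output(weights, snapshot):
    """Beamformer output y = w^H x for one two-channel frequency snapshot."""
    return sum(w.conjugate() * x for w, x in zip(weights, snapshot))
```

With an identity covariance the weights reduce to a/2, i.e. plain delay-and-sum; a non-trivial noise covariance tilts the weights away from that while wᴴa stays equal to 1 for a Hermitian covariance.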
S4: Obtain the second beam output of the frequency domain signal by averaging the first beam outputs.
In this embodiment, the frequency data in all sub-band buffers belonging to one time-domain frame of the speech signal are summed and averaged to obtain the output frequency data of the frequency domain signal for that frame, which is then output through the left and right channels of the dual-microphone voice channels respectively. Steps S1 to S4 are repeated until all time-domain frames of the speech signal have been processed.
Further, step S2 includes:
S200: Distinguish the sensitive frequency band in the frequency domain signal, where the sensitive frequency band is the first frequency band and the part of the frequency domain signal outside the sensitive frequency band is the second frequency band;
The sensitive frequency band of this embodiment is determined by the purpose of the speech signal. For example, call speech occupies 200 Hz to 3400 Hz, of which the sensitive band is 1 kHz to 2 kHz; music listening occupies 50 Hz to 15000 Hz, with a sensitive band of 2 kHz to 5 kHz or 1 kHz to 4 kHz.
S201: Divide the first frequency band evenly into a plurality of first sub-bands and the second frequency band evenly into a plurality of second sub-bands, where the bandwidth of the second sub-bands is greater than that of the first sub-bands.
In this embodiment, the sensitive band is divided into finer sub-bands and the bands outside it are divided more coarsely; that is, the sub-bands in the sensitive band are narrower than those outside it. This reduces speech distortion in the sensitive band, while the coarser division elsewhere avoids the extra computation that too many sub-bands would cause.
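A sketch of the non-uniform division in steps S200 and S201, expressed in FFT bin indices. The sampling rate (16 kHz), FFT size (1024), and the fine and coarse sub-band widths (4 and 16 bins) are assumptions chosen for illustration; only the 200 Hz to 3400 Hz call band and the 1 kHz to 2 kHz sensitive band come from the text. Sub-bands are half-open bin ranges [start, end), and the function names are hypothetical.

```python
def hz_to_bin(freq_hz, fs, nfft):
    """Nearest FFT bin index for a frequency in Hz."""
    return round(freq_hz * nfft / fs)


def split_uniform(lo, hi, width):
    """Split bins [lo, hi) into consecutive sub-bands of about `width` bins.

    Widths differ by at most one bin, and every bin is covered exactly once.
    """
    n = max(1, round((hi - lo) / width))
    edges = [lo + (hi - lo) * i // n for i in range(n + 1)]
    return [(edges[i], edges[i + 1]) for i in range(n)]


def split_sensitive(lo, hi, sens_lo, sens_hi, fine_width, coarse_width):
    """Fine sub-bands inside [sens_lo, sens_hi), coarse sub-bands on either side."""
    bands = []
    if sens_lo > lo:
        bands += split_uniform(lo, sens_lo, coarse_width)
    bands += split_uniform(sens_lo, sens_hi, fine_width)
    if hi > sens_hi:
        bands += split_uniform(sens_hi, hi, coarse_width)
    return bands


FS, NFFT = 16000, 1024  # assumed values
call_bands = split_sensitive(
    hz_to_bin(200, FS, NFFT), hz_to_bin(3400, FS, NFFT),
    hz_to_bin(1000, FS, NFFT), hz_to_bin(2000, FS, NFFT),
    fine_width=4, coarse_width=16,
)
```

Under these assumptions the call band splits into 25 sub-bands: 3 coarse ones below 1 kHz, 16 fine ones between 1 kHz and 2 kHz, and 6 coarse ones above 2 kHz.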
Further, step S3 of calculating the first beam output of each sub-band according to the minimum variance distortionless response algorithm includes:
S300: Perform voice activity detection in each sub-band to obtain the power ratio of two adjacent non-speech segments.
In this embodiment, voice activity detection is used to estimate the power spectrum of non-speech segments (i.e., noise) during gaps in the speech signal, so that the trend of the ambient noise can be judged promptly and the noise tracked closely. The power change of the non-speech segments is tracked through the change in the power ratio of two non-speech segments: a rising power ratio indicates that the noise is getting stronger, and a falling one that it is getting weaker.
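The power-ratio bookkeeping of step S300, detailed for units 3001 to 3003 of the apparatus, can be sketched as follows. Voice activity detection itself is not shown: the sketch assumes a separate detector decides which frames are non-speech, and the class and function names are hypothetical.

```python
def band_power(bins):
    """Mean power |X|^2 over the FFT bins of one sub-band in one frame."""
    return sum(abs(x) ** 2 for x in bins) / len(bins)


class NoisePowerTracker:
    """Tracks the three most recent non-speech powers of one sub-band.

    Only frames that an (assumed, external) voice activity detector has
    classified as non-speech should be pushed.
    """

    def __init__(self):
        self.history = []  # oldest first, newest last

    def push(self, power):
        self.history = (self.history + [power])[-3:]

    def ratio(self):
        """First ratio (P1/P2) / (P2/P3), with P1 the newest power.

        P1/P2 is the current power change and P2/P3 the previous one.
        Returns None until three positive powers have been seen.
        """
        if len(self.history) < 3 or min(self.history) <= 0:
            return None
        p3, p2, p1 = self.history
        return (p1 / p2) / (p2 / p3)
```

A ratio of 1 means the noise power is changing at the same rate as one step earlier; a ratio above 1 means it is growing faster (or falling more slowly) than before.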
S301: Obtain, according to the power ratio, the corresponding smoothing factor for removing the non-speech segments;
In this embodiment, the smoothing factor for removing the non-speech segments is adjusted dynamically according to the tracked change in noise power. When the ambient noise changes quickly relative to the sampling rate, the smoothing factor should be set smaller; when it changes slowly relative to the sampling rate, or when the noise power is relatively strong, the smoothing factor should be larger. Changes in the spatial sound field are thus tracked in time and the degree of denoising follows changes in the ambient noise, which effectively smooths noise fluctuations, reduces their influence, further improves the signal-to-noise ratio of dual-microphone noise reduction, and improves the sound quality of the output speech signal.
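The selection rule given for units 3011 to 3014 (keep the initial smoothing factor while the first ratio stays within a preset range, otherwise use the initial factor divided by the first ratio) can be written as below. The initial factor 0.9, the preset range [0.8, 1.25] and the clamping limits are assumptions: the text gives no values and does not mention clamping, which is added here only so that the factor stays a valid forgetting factor below 1.

```python
def smoothing_factor(ratio, alpha0=0.9, preset=(0.8, 1.25), alpha_min=0.5, alpha_max=0.99):
    """Smoothing factor for the current time from the first power ratio.

    ratio: (P1/P2) / (P2/P3) from the noise tracking, or None if unknown.
    alpha0: initial smoothing factor (assumed value).
    preset: range inside which alpha0 is kept (assumed values).
    alpha_min, alpha_max: clamping limits (an addition, not from the text).
    """
    if ratio is None or preset[0] <= ratio <= preset[1]:
        return alpha0
    # Second ratio = alpha0 / first ratio: a larger first ratio (faster-changing
    # noise) gives a smaller factor, i.e. faster tracking.
    return min(alpha_max, max(alpha_min, alpha0 / ratio))
```

This matches the direction stated for step S301: faster-varying noise yields a smaller smoothing factor, slower-varying noise a larger one.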
S302: Obtain the covariance matrix of the frequency band features in each sub-band according to the smoothing factor;
The covariance matrix is updated promptly according to the dynamically changing smoothing factor, so that the direction of incidence of the speech signal can be determined more accurately and the influence of ambient noise on the dual-microphone voice channels is further reduced.
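The recursive update given for step S302, R_SUBBAND_new = R_SUBBAND_old*alfa + S_L*S_L'*(1-alfa), is an exponentially weighted average of outer products. A minimal sketch follows, assuming that S_L' means the conjugate transpose (as the ' operator does in MATLAB), so that the result stays Hermitian for complex FFT data; the function names are hypothetical.

```python
def zeros(n):
    """n x n complex zero matrix as nested lists."""
    return [[0j] * n for _ in range(n)]


def update_covariance(r_old, s, alpha):
    """R_new = alpha * R_old + (1 - alpha) * s s^H for one sub-band.

    s: FFT bins Fbin_lo..Fbin_hi (inclusive) of the current frame,
       e.g. spectrum[fbin_lo:fbin_hi + 1].
    alpha: smoothing factor of the current time.
    """
    n = len(s)
    return [
        [alpha * r_old[i][j] + (1 - alpha) * s[i] * s[j].conjugate() for j in range(n)]
        for i in range(n)
    ]
```

A larger alfa gives a longer effective memory of roughly 1 / (1 - alfa) frames, which is why the text lowers it when the noise changes quickly.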
S303: Perform eigendecomposition of the covariance matrix to obtain the output weight vector of each sub-band.
In this embodiment, the MVDR processing produces a covariance matrix, and the output weight vector corresponding to that covariance matrix, i.e., the first beam output, is obtained by eigendecomposition.
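The text does not say which eigenvector is used. One common reading, taken here as an assumption, is the eigenvector of the largest eigenvalue (the principal eigenvector) of the sub-band covariance matrix. The sketch below finds it by power iteration, a swapped-in numeric method chosen to stay dependency-free; in practice a library routine such as `numpy.linalg.eigh` would normally be used.

```python
import math


def principal_eigvec(r, iters=100):
    """Dominant eigenvector of a Hermitian matrix by power iteration.

    Assumes a strict gap between the two largest eigenvalues and that the
    all-ones start vector is not orthogonal to the dominant eigenvector.
    The result is unit-norm and defined only up to a phase factor.
    """
    n = len(r)
    v = [1.0 + 0j] * n
    for _ in range(iters):
        w = [sum(r[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(abs(x) ** 2 for x in w))
        if norm == 0:
            raise ValueError("start vector lies in the null space; try another")
        v = [x / norm for x in w]
    return v
```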
Further, step S1 of acquiring the frequency domain signal of the current speech signal includes:
S100: Acquire the first time domain signals of the current speech signal collected separately by the dual-microphone voice channels.
The dual-microphone voice channels of this embodiment collect the time domain signal of the speech, which consists of time-domain frames arranged in time order. The first time domain signal is so named only to distinguish it from other time domain signals; terms such as "first" here serve only to distinguish and impose no limitation.
S101: Input the first time domain signals into the band-pass filters corresponding to the respective dual-microphone voice channels to obtain time domain signals in a specified frequency range.
In this example, only the speech band of interest is selected for processing, which reduces the amount of data processing and improves real-time performance. The speech band of interest in this embodiment is the frequency range of human speech, 200 Hz to 3400 Hz, which is sufficient for enhancing call speech without distorting normal speech. By filtering out all content outside 200 Hz to 3400 Hz during preprocessing while fully covering 200 Hz to 3400 Hz, this embodiment reduces the amount of data processing without distorting the speech.
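The band-pass filtering of step S101 can be sketched as a high-pass biquad at 200 Hz cascaded with a low-pass biquad at 3400 Hz, using the well-known Audio EQ Cookbook (RBJ) formulas with Butterworth Q = 1/√2. The filter type, order and the 16 kHz sampling rate used in the checks are assumptions; the text only specifies the 200 Hz to 3400 Hz pass band. Each microphone channel would run its own copy of the filter.

```python
import cmath
import math

BUTTERWORTH_Q = 1 / math.sqrt(2)


def biquad(kind, f0, fs, q=BUTTERWORTH_Q):
    """RBJ Audio EQ Cookbook high-pass/low-pass biquad, normalized so a[0] == 1."""
    w0 = 2 * math.pi * f0 / fs
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / (2 * q)
    if kind == "highpass":
        b = [(1 + cos_w0) / 2, -(1 + cos_w0), (1 + cos_w0) / 2]
    elif kind == "lowpass":
        b = [(1 - cos_w0) / 2, 1 - cos_w0, (1 - cos_w0) / 2]
    else:
        raise ValueError(f"unsupported filter kind: {kind}")
    a0 = 1 + alpha
    a = [1.0, -2 * cos_w0 / a0, (1 - alpha) / a0]
    return [coef / a0 for coef in b], a


def bandpass_sections(fs, f_lo=200.0, f_hi=3400.0):
    """High-pass at f_lo followed by low-pass at f_hi (about 12 dB/octave per edge)."""
    return [biquad("highpass", f_lo, fs), biquad("lowpass", f_hi, fs)]


def filter_signal(sections, x):
    """Filter samples x through the cascade (direct form I).

    Filter state starts at zero on every call; a streaming implementation
    would keep the state of each section per channel between frames.
    """
    for b, a in sections:
        x1 = x2 = y1 = y2 = 0.0
        y = []
        for xn in x:
            yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
            x2, x1 = x1, xn
            y2, y1 = y1, yn
            y.append(yn)
        x = y
    return x


def gain(sections, f, fs):
    """Magnitude response |H(e^jw)| of the cascade at frequency f in Hz."""
    z_inv = cmath.exp(-2j * math.pi * f / fs)
    h = 1.0 + 0j
    for b, a in sections:
        num = b[0] + b[1] * z_inv + b[2] * z_inv ** 2
        den = a[0] + a[1] * z_inv + a[2] * z_inv ** 2
        h *= num / den
    return abs(h)
```

A second-order section per edge keeps the cost to a few multiply-adds per sample; a steeper roll-off would need more cascaded sections.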
S102: Convert the time domain signals in the specified frequency range into frequency domain signals of the current speech signal in that range, using the Fourier transforms associated with the respective dual-microphone voice channels.
Sub-band division, noise processing and the other operations of this embodiment are performed on frequency domain signals, so each time domain signal is converted into a frequency domain signal by FFT. The speech signals of the two microphone channels undergo the same conversion synchronously, and the converted data are stored in two identical buffers.
Further, after step S4 of obtaining the second beam output of the frequency domain signal by averaging the first beam outputs, the method includes:
S5: Convert the frequency domain signal into an output time domain signal by inputting the second beam output of the frequency domain signal into the inverse Fourier transformers associated with the respective dual-microphone voice channels;
In this embodiment, the time domain speech signal collected by the dual-microphone voice channels is converted into a frequency domain signal and processed for noise reduction and speech enhancement; the processed frequency domain signal must then be converted back into a time domain signal by the inverse Fourier transformer before it can be heard and understood by the human ear.
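The per-channel transform path of steps S102 and S5 (forward FFT, processing in the frequency domain, inverse FFT) can be sketched in pure Python as below. This is an illustration under assumptions: a recursive radix-2 FFT stands in for an optimized library, windowing and overlap-add framing (which the text does not describe) are omitted, and `enhance` is a hypothetical callback standing in for the sub-band MVDR stage.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a non-zero power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(spectrum):
    """Inverse FFT via the identity ifft(X) = conj(fft(conj(X))) / N."""
    n = len(spectrum)
    return [v.conjugate() / n for v in fft([s.conjugate() for s in spectrum])]


def analyse_and_resynthesise(left_frame, right_frame, enhance):
    """Run both channels through the same FFT -> enhance -> IFFT path.

    The two spectra play the role of the two equal-length buffers in the text.
    Taking .real at the end assumes `enhance` keeps each spectrum
    conjugate-symmetric, so the resynthesized frame is real-valued.
    """
    buffers = [fft(left_frame), fft(right_frame)]
    return [[v.real for v in ifft(enhance(buf))] for buf in buffers]
```

With an identity `enhance` (for example `lambda spec: spec`) each channel is reconstructed to within floating-point error, which is a convenient sanity check before plugging in the real processing.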
S6: Output the corresponding output time domain signals through the respective dual-microphone voice channels.
In this embodiment, the speech signals collected by the dual-microphone voice channels pass through band filtering, FFT, sub-band division, noise reduction and speech enhancement, and inverse FFT, with the left and right channels processed separately and synchronously throughout; they are combined at the output.
Referring to FIG. 2, in a speech enhancement method according to another embodiment of the present invention, the speech signals collected by the voice channels are first preprocessed to reduce the amount of frequency-domain processing. The method for reducing the amount of frequency-domain processing in this embodiment includes the following operations before step S2:
S20: Select a Fourier transform with a specified number of points according to the computational capacity of the frequency-domain processing platform;
The specified number of points in this embodiment may be 1024, 2048, 256, etc.; 1024 points are preferred in this embodiment, which meets the processing requirements within a reasonable computational budget.
S21: after preprocessing, transforming the first time-domain signals of the current speech signal, captured by the respective channels of the dual-microphone voice channel, with the specified-point Fourier transform to obtain the frequency-domain signals corresponding to the first time-domain signals;
In this embodiment, a speech signal in the 200 Hz to 3400 Hz range is transformed with a 1024-point FFT, yielding a frequency-domain signal of about 144 frequency points. By contrast, processing the full spectrum that contains the 200 Hz to 3400 Hz speech segment would require handling about 512 frequency points, so the amount of computation is greatly reduced.
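The bin range covered by a frequency band follows from the FFT resolution Fs / N. The sketch below computes it with floor rounding. The defaults Fs = 16000 Hz and N = 1024 are the values given in the step S204 example. The exact count for other sampling rates or rounding conventions follows from the same formula.

```python
import math


def band_bin_range(f_lo_hz, f_hi_hz, fs=16_000, n_fft=1024):
    """Return (first_bin, last_bin, n_bins) for a band, using floor rounding.

    Bin k of an n_fft-point FFT sits at k * fs / n_fft Hz.
    """
    resolution = fs / n_fft                     # 15.625 Hz for 16 kHz / 1024
    first = math.floor(f_lo_hz / resolution)
    last = math.floor(f_hi_hz / resolution)
    return first, last, last - first + 1


one_sided_bins = 1024 // 2 + 1                  # bins 0..512 of a real FFT
speech = band_bin_range(200, 3400)
print(speech, "of", one_sided_bins, "bins")
```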
Further, step S2 of dividing the frequency-domain signal into a plurality of sequentially arranged sub-bands according to a preset rule includes:
S202: obtaining the total number of frequency points of the frequency-domain signal corresponding to the first time-domain signal, as produced by the specified-point Fourier transform;
For example, in this embodiment the first time-domain signal has 144 frequency points in total, and the sub-band division is based on these 144 points.
S203: uniformly dividing the frequency-domain signal into a plurality of sequentially arranged sub-bands according to the total number of frequency points.
In this embodiment, the sub-band division is controlled by configuring the number of frequency points in each sub-band. For example, with 24 frequency points per sub-band, the first time-domain signal yields 144 / 24 = 6 sub-bands. Other embodiments may use 8 or 6 points per sub-band so that the sub-bands are still evenly divided: 8 points per sub-band gives 18 sub-bands, and 6 points per sub-band gives 24 sub-bands. This embodiment prefers 6 points per sub-band (24 sub-bands) to optimize speech noise reduction and enhancement. The trade-off is as follows:
- More, narrower sub-bands mean the MVDR algorithm introduces less speech distortion, at the cost of slightly more computation.
- Fewer, wider sub-bands reduce computation but produce greater distortion than a division into more sub-bands.
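The uniform division described above amounts to grouping consecutive bin indices into equal-width runs. The sketch below does this as an illustration. The (lo, hi) inclusive-pair representation and the optional `first_bin` offset are assumptions, not part of the text.

```python
def uniform_subbands(total_bins, bins_per_band, first_bin=0):
    """Return (lo, hi) inclusive bin-index pairs for equal-width sub-bands."""
    if total_bins % bins_per_band != 0:
        raise ValueError("total_bins must be divisible by bins_per_band")
    return [
        (first_bin + start, first_bin + start + bins_per_band - 1)
        for start in range(0, total_bins, bins_per_band)
    ]


for width in (24, 8, 6):
    print(width, "bins per sub-band ->", len(uniform_subbands(144, width)), "sub-bands")
```

The loop reproduces the three configurations in the text: 6, 18, and 24 sub-bands for widths of 24, 8, and 6 bins.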
Further, after step S201 of uniformly dividing the first frequency band into a plurality of first sub-bands and the second frequency band into a plurality of second sub-bands, the method includes:
S204: calculating the center frequency of each first sub-band and each second sub-band;
In this embodiment, the center frequency of each sub-band is used to obtain that sub-band's direction vector. This gives better control over the optimal angle for picking up the speech signal and avoids capturing the strongest noise interference. The first and second sub-bands are processed on the same principle and differ only in bandwidth; the uniformly divided case is described in detail here as an example.

Frequency indices. After a 1024-point FFT of the wideband frequency-domain signal of this embodiment, the resolution of each frequency point is 16000/1024 Hz (15.625 Hz). The 200 Hz to 3400 Hz range therefore corresponds to frequency indices 12 to 217.

Sub-band width. Taking a uniform division into 24 sub-bands as an example, the bandwidth of each sub-band is band_siz=(up-low)/numband, rounded down to an integer, where:
- up is the frequency index for 3400 Hz;
- low is the frequency index for 200 Hz;
- numband is the number of sub-bands.
With 24 sub-bands, each sub-band spans 8 frequency indices.

Center frequency. The center-frequency index of the k-th sub-band is fv(k)=((low+(k-1)*band_siz)+(low+(k-1)*band_siz+band_siz-1))/2. The corresponding center frequency is F_center=fv(k)/FFT_siz*Fs, where FFT_siz is the Fourier transform length (1024 points) and Fs is the sampling frequency (16000 Hz).
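The fv(k) and F_center formulas above translate directly into code. The sketch uses low = 12 and up = 217 from the 16 kHz / 1024-point example, and integer division for band_siz so that each sub-band holds a whole number of bins.

```python
def subband_center_hz(k, low, band_siz, fft_siz=1024, fs=16_000):
    """Center frequency (Hz) of the k-th sub-band, with k starting at 1."""
    first = low + (k - 1) * band_siz          # lowest bin index in sub-band k
    last = first + band_siz - 1               # highest bin index in sub-band k
    fv = (first + last) / 2                   # fv(k): center bin index
    return fv / fft_siz * fs                  # F_center


low, up, numband = 12, 217, 24
band_siz = (up - low) // numband              # 8 bins per sub-band
centers = [subband_center_hz(k, low, band_siz) for k in range(1, numband + 1)]
```

With integer band sizes, the 24 sub-bands cover bins 12 to 203. The last few bins up to `up` fall outside them unless the final sub-band is widened.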
S205: calculating, from these center frequencies, the direction vector of each first sub-band and each second sub-band.
In this embodiment, the direction vector is calculated by substituting the center frequency obtained above into vssL=exp((delay)*(-j)*2*pi*F_center), where:
- vssL is the direction vector;
- j is the imaginary unit (the square root of -1);
- pi is the constant 3.1415926;
- exp(a) is the exponential function e^a, with e=2.71828183;
- delay is the vector of delay times of the left and right voice channels of the dual microphone.

The left voice channel is usually taken as the reference, so if the right channel lags the left channel by tao, then delay=[0, tao]. The time delay tao can be estimated by cross-correlating the data captured by the two microphone channels.
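The direction vector and the cross-correlation delay estimate described above can be sketched with numpy. Assumptions:
- tao is taken as the integer-sample lag at the cross-correlation peak (no sub-sample interpolation).
- The right channel is a pure delay of the left channel in the usage example.

```python
import numpy as np


def steering_vector(f_center, tao):
    """vssL = exp(-j*2*pi*F_center*delay), delay = [0, tao] (left mic is the reference)."""
    delay = np.array([0.0, tao])
    return np.exp(-1j * 2 * np.pi * f_center * delay)


def estimate_tao(left, right, fs):
    """Delay (seconds) of the right channel relative to the left, from the cross-correlation peak."""
    corr = np.correlate(right, left, mode="full")
    lag = int(np.argmax(corr)) - (len(left) - 1)   # index len(left)-1 is zero lag
    return lag / fs


rng = np.random.default_rng(1)
left = rng.standard_normal(512)
right = np.concatenate([np.zeros(3), left[:-3]])   # right lags left by 3 samples
tao = estimate_tao(left, right, 16_000)
vssL = steering_vector(242.1875, tao)              # first sub-band center frequency
```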
S206: obtaining, from the direction vectors, the covariance matrix of the band features of each first sub-band and each second sub-band, together with the optimal weight coefficients computed from the inverse of that covariance matrix.
In this embodiment the signals are captured by two microphone channels, so the covariance matrix is 2 × 2. Let r_inv denote the inverse of this covariance matrix and W_opt the optimal weight coefficients of the current sub-band. Then W_opt=r_inv*vssL/(vssL'*r_inv*vssL), where vssL is the direction vector and vssL' is its (conjugate) transpose, which turns a row vector into a column vector or vice versa. The optimal weights correspond to the best pickup angle of the dual-microphone channels while the user speaks, found within the scanning angle range. For example, when scanning from -45° to 45°, if the user's speech signal carries the least noise at 60°, then 60° is the optimal angle.
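The weight formula above is the standard MVDR solution. The sketch below computes it for a 2 × 2 Hermitian covariance matrix. It solves R x = vssL with `np.linalg.solve` instead of forming r_inv explicitly; this is an implementation choice, not something the text specifies, and is numerically equivalent for a well-conditioned R. The sample R and delay are made-up illustration values.

```python
import numpy as np


def mvdr_weights(R, vssL):
    """W_opt = R^{-1} vssL / (vssL^H R^{-1} vssL) for a Hermitian covariance R."""
    r_inv_v = np.linalg.solve(R, vssL)        # R^{-1} vssL without forming r_inv
    return r_inv_v / (vssL.conj() @ r_inv_v)


R = np.array([[2.0, 0.5 + 0.2j], [0.5 - 0.2j, 1.5]])            # illustrative covariance
vssL = np.exp(-1j * 2 * np.pi * 1000.0 * np.array([0.0, 1e-4]))  # illustrative direction
W_opt = mvdr_weights(R, vssL)
gain = W_opt.conj() @ vssL                     # distortionless: unit gain toward vssL
```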
S207: calculating, from the optimal weight coefficients, the first signal output of each first sub-band and each second sub-band.
In this embodiment, Out_L=W_opt*S_L and Out_R=W_opt*S_R, where:
- Out_L and Out_R are the output frequency data of the left and right channels;
- S_L is the frequency vector from bin Fbin_loL to bin Fbin_hiL of the FFT of the current time-domain frame captured by the left channel;
- S_R is the corresponding vector for the right channel.
S_L and S_R are thus the frequency data within the corresponding sub-band. Fbin_loL is the index of the sub-band's lower frequency boundary and Fbin_hiL is the index of its upper frequency boundary. The frequency outputs of the left and right channels are stored in a buffer. Adding the frequency data in the buffers of all sub-bands of the first time-domain signal gives the first signal output of each of the two channels of the dual microphone.
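The text does not spell out the dimensions in Out_L=W_opt*S_L. The sketch below therefore uses the textbook beamformer output y(f) = w^H x(f), where x(f) stacks the left and right bins. It writes each sub-band's result into its slice of one output spectrum, which plays the role of adding up the per-sub-band buffers. Both the combined-output form and the array layout are assumptions.

```python
import numpy as np


def beamform_subbands(X, bands, weights):
    """Apply per-sub-band MVDR weights and assemble one output spectrum.

    X       : (2, n_bins) complex spectra of the left and right channels.
    bands   : list of (Fbin_lo, Fbin_hi) inclusive bin indices.
    weights : list of length-2 complex weight vectors, one per sub-band.
    """
    out = np.zeros(X.shape[1], dtype=complex)          # bins outside all sub-bands stay 0
    for (lo, hi), w in zip(bands, weights):
        out[lo:hi + 1] = w.conj() @ X[:, lo:hi + 1]    # y(f) = w^H x(f) for each bin
    return out


spec = np.arange(10, dtype=complex)
X = np.vstack([spec, spec])                            # identical left/right spectra
bands = [(2, 4), (5, 7)]
weights = [np.array([0.5, 0.5])] * 2
y = beamform_subbands(X, bands, weights)
```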
Further, after step S207 of calculating the signal output of each first sub-band and each second sub-band from the optimal weight coefficients, the method includes:
S208: receiving, in the chronological order of the received speech signal, the second time-domain signal that is closest in time to the first time-domain signal;
In this embodiment, the time-domain frames are processed one by one in the chronological order in which they are received: frames received earlier are processed first.
S209: subjecting the second time-domain signal to the same processing as the first time-domain signal to obtain the second signal output corresponding to the second time-domain signal.
In this embodiment, the second signal output is obtained by the same process as the first signal output.
Referring to FIG. 3, in a speech enhancement method according to an embodiment of the present invention, noise processing is used to improve speech intensity while the first beam output of each sub-band is calculated with the minimum variance distortionless response (MVDR) algorithm.
Further, step S300 includes:
S3001: performing voice activity detection on each sub-band during non-speech periods to obtain, for the current first non-speech segment, a first power at a first time, a second power at a second time, and a third power at a third time. The first, second, and third times run backward in time, so the first time is the most recent.
In this embodiment, VAD (voice activity detection) is performed in each sub-band. The noise in the sub-band is estimated during the non-speech periods flagged by the VAD (that is, when the user is not speaking), keeping the noise power values of the three most recent stages. The three stages are defined as follows:
- The most recent noise power estimate is taken at the first time, with first power P1.
- The moment before the first time is the second time, with second power P2.
- The moment before the second time is the third time, with third power P3.
S3002: calculating the ratio of the first power to the second power to obtain the current power change of each sub-band, and the ratio of the second power to the third power to obtain the previous power change of each sub-band.
In this embodiment, the ratio of the first power to the second power is Vr_cur=P1/P2, and the ratio of the second power to the third power is Vr_pre=P2/P3.
S3003: calculating a first ratio of the current power change to the previous power change to obtain the power ratio of two adjacent non-speech segments.
In this embodiment, the first ratio of the current power change to the previous power change is Value=Vr_cur/Vr_pre. If Vr_cur is significantly larger than Vr_pre, the text takes this to mean that noise interference has decreased. In that case the smoothing factor should be lowered to avoid speech distortion caused by excessive smoothing.
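The power bookkeeping of steps S3001 to S3003 is sketched below. Measuring sub-band noise power as the mean |X(f)|^2 over the sub-band's bins is an assumption; the text does not say how each power value is computed. The VAD decision itself is outside the sketch.

```python
import numpy as np


def subband_noise_power(noise_bins):
    """Mean |X(f)|^2 over one sub-band during a VAD-flagged non-speech frame."""
    return float(np.mean(np.abs(noise_bins) ** 2))


def power_ratio(p1, p2, p3):
    """Value = Vr_cur / Vr_pre, with P1 the newest and P3 the oldest noise power."""
    vr_cur = p1 / p2
    vr_pre = p2 / p3
    return vr_cur / vr_pre


value = power_ratio(4.0, 2.0, 2.0)   # Vr_cur = 2, Vr_pre = 1
```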
Further, step S301 of this embodiment includes:
S3011: determining whether the first ratio is within a preset range;
In this embodiment, the preset range for Value is 0.8 to 1.2.
S3012: if so, selecting the initial smoothing factor as the smoothing factor for the current moment.
In this embodiment, if Value is within the range 0.8 to 1.2, the smoothing factor is set to its initial value, for example 1.0.
Further, after step S3011, the method also includes:
S3013: if not, calculating a second ratio of the initial smoothing factor to the first ratio;
In this embodiment, if Value is outside the range 0.8 to 1.2 (that is, greater than 1.2 or less than 0.8), the second ratio is calculated and used as the smoothing factor. For example, if the current Value is 1.1, the second ratio is 1.0/1.1, and the smoothing factor at the current moment is 1.0/1.1.
S3014: setting the second ratio as the smoothing factor for the current moment.
In this embodiment, dynamically adjusting the noise-removal smoothing factor in real time has three effects:
- it reduces the effect of noise fluctuations;
- it further improves the signal-to-noise ratio of dual-microphone noise reduction;
- it improves the sound quality of the output speech signal.
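The selection rule of steps S3011 to S3014 can be sketched as a small function. The defaults (initial value 1.0, range 0.8 to 1.2) are the example values from the text.

```python
def smoothing_factor(value, alfa_init=1.0, lo=0.8, hi=1.2):
    """Return the smoothing factor for the current moment.

    Inside [lo, hi] the initial factor is kept (S3012); otherwise the
    second ratio alfa_init / value is used (S3013, S3014).
    """
    if lo <= value <= hi:
        return alfa_init
    return alfa_init / value
```

Two consequences of the example values are worth noting:
- With alfa_init = 1.0, a factor of 1 leaves the covariance matrix unchanged in the update of step S3022.
- A Value below 0.8 yields a factor above 1, which makes the (1-alfa) weight negative.
A practical implementation would presumably bound alfa, for example to [0, 1). The text does not specify this, so the sketch leaves it out.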
Further, step S302 of this embodiment includes:
S3021: obtaining, for the current time, the frequency-point vector from the lower-boundary index to the upper-boundary index of the sub-band;
S3022: updating the sub-band covariance matrix according to the smoothing factor for the current moment and the frequency-point vector.
In this embodiment, the covariance matrix is updated in real time with the following formula. Take the time-domain signal captured by the left channel of the dual microphone as an example. After the corresponding frequency-domain signal has been divided into sub-bands, the covariance matrix is updated as R_SUBBAND_new=R_SUBBAND_old*alfa+S_L*S_L'*(1-alfa), where:
- alfa is the smoothing factor at the current moment;
- R_SUBBAND_new is the updated covariance matrix;
- R_SUBBAND_old is the covariance matrix from the previous moment;
- S_L is the frequency vector from bin Fbin_loL to bin Fbin_hiL of the FFT of the current time-domain frame captured by the left channel;
- S_L' is the transpose of that frequency vector.
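The text describes the covariance matrix as 2 × 2, but S_L*S_L' over a multi-bin sub-band of one channel would be larger. The sketch therefore assumes the usual sub-band MVDR reading: the snapshot for each bin stacks the left and right values, and the outer products are averaged over the sub-band's bins before exponential smoothing. Treat this as one plausible interpretation, not the patent's exact computation.

```python
import numpy as np


def update_subband_covariance(R_old, X_band, alfa):
    """R_new = alfa * R_old + (1 - alfa) * mean_f x(f) x(f)^H.

    R_old  : (2, 2) covariance of the sub-band from the previous moment.
    X_band : (2, m) left/right spectra over bins Fbin_lo..Fbin_hi.
    alfa   : smoothing factor for the current moment.
    """
    inst = X_band @ X_band.conj().T / X_band.shape[1]   # averaged outer product
    return alfa * R_old + (1 - alfa) * inst


R_new = update_subband_covariance(np.eye(2), np.ones((2, 4), dtype=complex), 0.5)
```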
Referring to FIG. 4, a speech enhancement apparatus according to an embodiment of the present invention captures speech signals through a dual-microphone voice channel and performs speech enhancement on each voice channel separately. It includes:
- a first acquisition module 1, configured to acquire the frequency-domain signal of the current speech signal;
- a division module 2, configured to divide the frequency-domain signal into a plurality of sequentially arranged sub-bands according to a preset rule;
- a calculation module 3, configured to calculate the first beam output of each sub-band with the minimum variance distortionless response (MVDR) algorithm;
- a second acquisition module 4, configured to obtain the second beam output of the frequency-domain signal by averaging the first beam outputs.
Those skilled in the art will understand that the apparatus of this embodiment and the method of the foregoing method embodiments complement and correspond to each other. The details and explanations given in the method embodiments apply equally to the apparatus of this embodiment, so they are not repeated here.
Referring to FIG. 5, the division module 2 includes:
a distinguishing sub-module 200, configured to identify the sensitive frequency band in the frequency-domain signal, where the sensitive band is the first frequency band and the rest of the frequency-domain signal is the second frequency band; and a first division sub-module 201, configured to divide the first frequency band evenly into a plurality of first sub-bands and the second frequency band evenly into a plurality of second sub-bands, where each second sub-band is wider than each first sub-band.
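The non-uniform division performed by sub-modules 200 and 201 can be sketched as below. Which bins count as "sensitive" is not specified in this section, so `sens_lo` and `sens_hi` are hypothetical parameters, as are the two widths. If a band's length is not a multiple of its width, the last sub-band in that band comes out shorter.

```python
def split_nonuniform(low, up, sens_lo, sens_hi, narrow, wide):
    """Split bins low..up into (lo, hi) sub-bands: `narrow`-bin sub-bands inside
    the sensitive band [sens_lo, sens_hi], `wide`-bin sub-bands elsewhere."""

    def chunk(a, b, width):
        return [(s, min(s + width - 1, b)) for s in range(a, b + 1, width)]

    bands = []
    if low < sens_lo:
        bands += chunk(low, sens_lo - 1, wide)      # second band, below the sensitive band
    bands += chunk(sens_lo, sens_hi, narrow)        # first (sensitive) band
    if sens_hi < up:
        bands += chunk(sens_hi + 1, up, wide)       # second band, above the sensitive band
    return bands
```

For example, `split_nonuniform(0, 29, 10, 19, 5, 10)` gives two 5-bin sub-bands inside the sensitive band and 10-bin sub-bands on either side.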
Referring to FIG. 6, the calculation module 3 includes:
- a first acquisition sub-module 300, configured to obtain, through voice activity detection within each sub-band, the power ratio of two adjacent non-speech segments;
- a second acquisition sub-module 301, configured to obtain, from the power ratio, the corresponding smoothing factor for the non-speech segments;
- a first derivation sub-module 302, configured to obtain, from the smoothing factor, the covariance matrix of the band features within each sub-band;
- a second derivation sub-module 303, configured to perform eigendecomposition of the covariance matrix to obtain the output weight vector of each sub-band, that is, the first beam output.
Referring to FIG. 7, the first acquisition module 1 includes:
- a third acquisition sub-module 100, configured to acquire the first time-domain signal of the current speech signal captured by each channel of the dual-microphone voice channel;
- an input sub-module 101, configured to feed each first time-domain signal into the band-pass filter of its microphone channel to obtain a time-domain signal within a specified frequency range;
- a conversion sub-module 102, configured to convert each band-limited time-domain signal, through the Fourier transform associated with its microphone channel, into a frequency-domain signal of the current speech signal within the specified frequency range.
Referring to FIG. 8, a speech enhancement apparatus according to another embodiment of the present invention includes:
- a conversion module 5, configured to convert the frequency-domain signal into output time-domain signals by feeding its second beam output into the inverse Fourier transformers associated with the respective channels of the dual-microphone voice channel;
- an output module 6, configured to output the corresponding output time-domain signal on each channel of the dual-microphone voice channel.
Referring to FIG. 9, in a speech enhancement apparatus according to another embodiment of the present invention, the speech signals captured by the voice channels are first preprocessed to reduce the amount of frequency-domain processing. The front end of the division module 2 is connected to:
- a selection module 20, configured to select a Fourier transform with a specified number of points according to the computational capacity of the frequency-domain processing platform;
- a derivation module 21, configured to preprocess the first time-domain signals of the current speech signal captured by the respective channels of the dual-microphone voice channel, and then apply the specified-point Fourier transform to obtain the frequency-domain signals corresponding to the first time-domain signals.
Referring to FIG. 10, the division module 2 of this embodiment includes:
- a third acquisition sub-module 202, configured to obtain the total number of frequency points of the frequency-domain signal corresponding to the first time-domain signal, as produced by the specified-point Fourier transform;
- a second division sub-module 203, configured to divide the frequency-domain signal uniformly into a plurality of sequentially arranged sub-bands according to that total.
Referring to FIG. 11, the division module 2 according to yet another embodiment of the present invention includes:
- a first calculation sub-module 204, configured to calculate the center frequency of each first sub-band and each second sub-band;
- a second calculation sub-module 205, configured to calculate, from these center frequencies, the direction vector of each first sub-band and each second sub-band;
- an obtaining sub-module 206, configured to obtain, from the direction vectors, the covariance matrix of the band features of each first and second sub-band and the optimal weight coefficients computed from the inverse of that covariance matrix;
- a third calculation sub-module 207, configured to calculate, from the optimal weight coefficients, the first signal output of each first sub-band and each second sub-band.
Further, the division module 2 includes:
- a receiving sub-module 208, configured to receive, in the chronological order of the received speech signal, the second time-domain signal closest in time to the first time-domain signal;
- a third derivation sub-module 209, configured to subject the second time-domain signal to the same processing as the first time-domain signal to obtain the corresponding second signal output.
Referring to FIG. 12, in a speech enhancement method according to a further embodiment of the present invention, calculating the first beam output of each sub-band with the minimum variance distortionless response algorithm involves a noise processing system, which improves speech intensity through noise processing.
Referring to FIG. 13, the first acquisition sub-module 300 includes:
- a detection unit 3001, configured to perform voice activity detection on each sub-band during non-speech periods to obtain, for the current first non-speech segment, the first power at the first time, the second power at the second time, and the third power at the third time, where the first, second, and third times run backward in time;
- an obtaining unit 3002, configured to calculate the ratio of the first power to the second power to obtain the current power change of each sub-band, and the ratio of the second power to the third power to obtain the previous power change of each sub-band;
- a first acquisition unit 3003, configured to calculate the first ratio of the current power change to the previous power change to obtain the power ratio of two adjacent non-speech segments.
Referring to FIG. 14, the second acquisition sub-module 301 of this embodiment includes:
- a determination unit 3011, configured to determine whether the first ratio is within the preset range;
- a selection unit 3012, configured to select the initial smoothing factor as the smoothing factor for the current moment if the first ratio is within the preset range.
Further, the second acquisition sub-module 301 also includes:
- a calculation unit 3013, configured to calculate the second ratio of the initial smoothing factor to the first ratio if the first ratio is outside the preset range;
- a setting unit 3014, configured to set the second ratio as the smoothing factor for the current moment.
Referring to FIG. 15, the first derivation sub-module 302 of this embodiment includes:
- a second acquisition unit 3021, configured to obtain, for the current time, the frequency-point vector from the lower-boundary index to the upper-boundary index of the sub-band;
- an update unit 3022, configured to update the covariance matrix of the sub-band according to the smoothing factor for the current moment and the frequency-point vector.
The present application further provides a speech enhancement device including a memory, a processor, and an application program. The application program is stored in the memory, configured to be executed by the processor, and configured to perform the speech enhancement method of any of the above embodiments.
Those skilled in the art will understand that the device of the present invention encompasses apparatus for performing one or more of the methods described in this application. Such apparatus may be specially designed and manufactured for the required purposes. Alternatively, it may comprise known devices in a general-purpose computer that store computer programs or application programs and are selectively activated or reconfigured by them. Such computer programs may be stored in a device-readable (for example, computer-readable) medium, or in any type of medium suitable for storing electronic instructions and coupled to a bus. Such media include, but are not limited to:
- any type of disk (floppy disk, hard disk, optical disc, CD-ROM, and magneto-optical disk);
- ROM (Read-Only Memory);
- RAM (Random Access Memory);
- EPROM (Erasable Programmable Read-Only Memory);
- EEPROM (Electrically Erasable Programmable Read-Only Memory);
- flash memory;
- magnetic cards or optical cards.
That is, a readable medium includes any medium in which a device (for example, a computer) stores or transmits information in a readable form.

Claims (17)

  1. A speech enhancement method, characterized in that speech signals are captured through a dual-microphone voice channel and speech enhancement is performed on each voice channel separately, the method comprising:
    obtaining a frequency-domain signal of the current speech signal;
    dividing the frequency-domain signal into a plurality of sequentially arranged sub-bands according to a preset rule;
    calculating a first beam output of each sub-band according to a minimum variance distortionless response algorithm; and
    obtaining a second beam output of the frequency-domain signal by averaging the first beam outputs.
  2. The speech enhancement method according to claim 1, wherein the step of dividing the frequency-domain signal into a plurality of sequentially arranged sub-bands according to a preset rule comprises:
    identifying a sensitive frequency band in the frequency-domain signal, wherein the sensitive frequency band is a first frequency band and the part of the frequency-domain signal outside the sensitive frequency band is a second frequency band; and
    evenly dividing the first frequency band into a plurality of first sub-bands and evenly dividing the second frequency band into a plurality of second sub-bands, wherein the bandwidth of each second sub-band is greater than the bandwidth of each first sub-band.
  3. The speech enhancement method according to claim 2, wherein, after the step of evenly dividing the first frequency band into a plurality of first sub-bands and evenly dividing the second frequency band into a plurality of second sub-bands, the method comprises:
    calculating a band center frequency for each of the first sub-bands and each of the second sub-bands;
    calculating, from the band center frequencies, a direction vector for each of the first sub-bands and each of the second sub-bands;
    obtaining, from the direction vectors, a covariance matrix of the band features of each of the first sub-bands and each of the second sub-bands, together with the optimal weight coefficients corresponding to the inverse of the covariance matrix; and
    calculating, from the optimal weight coefficients, a first signal output for each of the first sub-bands and each of the second sub-bands.
  4. The speech enhancement method according to claim 1, wherein the step of calculating the first beam output of each of the sub-bands according to the minimum variance distortionless response algorithm comprises:
    performing voice activity detection in each of the sub-bands to obtain the power ratio of two adjacent non-speech segments;
    obtaining, from the power ratio, a corresponding smoothing factor for removing the non-speech segments;
    obtaining, from the smoothing factor, a covariance matrix of the band features in each of the sub-bands; and
    performing eigendecomposition on the covariance matrix to obtain an output weight vector for each of the sub-bands.
  5. The speech enhancement method according to claim 1, wherein the step of obtaining the frequency-domain signal of the current speech signal comprises:
    obtaining first time-domain signals of the current speech signal, collected separately by the dual-microphone voice channels;
    inputting the first time-domain signals into band-pass filters respectively corresponding to the dual-microphone voice channels, to obtain time-domain signals within a specified frequency range; and
    converting the time-domain signals within the specified frequency range, through Fourier transforms respectively associated with the dual-microphone voice channels, into frequency-domain signals of the current speech signal within the specified frequency range.
  6. The speech enhancement method according to claim 5, wherein, after the step of obtaining the second beam output of the frequency-domain signal by averaging the first beam outputs, the method comprises:
    converting the frequency-domain signal into output time-domain signals by inputting the second beam output of the frequency-domain signal into inverse Fourier transformers respectively associated with the dual-microphone voice channels; and
    outputting the corresponding output time-domain signals through the respective dual-microphone voice channels.
  7. The speech enhancement method according to claim 1, wherein, before the step of dividing the frequency-domain signal into a plurality of sequentially arranged sub-bands according to a preset rule, the method comprises:
    selecting a Fourier transform with a specified number of frequency points according to the computational load level of the frequency-domain processing platform; and
    preprocessing the first time-domain signals of the current speech signal, collected separately by the dual-microphone voice channels, and then obtaining the frequency-domain signals corresponding to the first time-domain signals through the Fourier transform with the specified number of frequency points.
  8. The speech enhancement method according to claim 7, wherein the step of dividing the frequency-domain signal into a plurality of sequentially arranged sub-bands according to a preset rule comprises:
    obtaining the total number of frequency points of the frequency-domain signal corresponding to the first time-domain signal, as obtained through the Fourier transform with the specified number of frequency points; and
    evenly dividing the frequency-domain signal into a plurality of sequentially arranged sub-bands according to the total number of frequency points.
  9. A speech enhancement apparatus, characterized in that a speech signal is collected through dual-microphone voice channels and each voice channel separately undergoes speech enhancement processing, the apparatus comprising:
    a first obtaining module, configured to obtain a frequency-domain signal of a current speech signal;
    a dividing module, configured to divide the frequency-domain signal into a plurality of sequentially arranged sub-bands according to a preset rule;
    a calculation module, configured to calculate a first beam output for each of the sub-bands according to a minimum variance distortionless response (MVDR) algorithm; and
    a second obtaining module, configured to obtain a second beam output of the frequency-domain signal by averaging the first beam outputs.
  10. The speech enhancement apparatus according to claim 9, wherein the dividing module comprises:
    a distinguishing submodule, configured to identify a sensitive frequency band in the frequency-domain signal, wherein the sensitive frequency band is a first frequency band and the part of the frequency-domain signal outside the sensitive frequency band is a second frequency band; and
    a dividing submodule, configured to evenly divide the first frequency band into a plurality of first sub-bands and evenly divide the second frequency band into a plurality of second sub-bands, wherein the bandwidth of each second sub-band is greater than the bandwidth of each first sub-band.
  11. The speech enhancement apparatus according to claim 10, wherein the dividing module comprises:
    a first calculation submodule, configured to calculate a band center frequency for each of the first sub-bands and each of the second sub-bands;
    a second calculation submodule, configured to calculate, from the band center frequencies, a direction vector for each of the first sub-bands and each of the second sub-bands;
    an obtaining submodule, configured to obtain, from the direction vectors, a covariance matrix of the band features of each of the first sub-bands and each of the second sub-bands, together with the optimal weight coefficients corresponding to the inverse of the covariance matrix; and
    a third calculation submodule, configured to calculate, from the optimal weight coefficients, a first signal output for each of the first sub-bands and each of the second sub-bands.
  12. The speech enhancement apparatus according to claim 9, wherein the calculation module comprises:
    a first obtaining submodule, configured to perform voice activity detection in each of the sub-bands to obtain the power ratio of two adjacent non-speech segments;
    a second obtaining submodule, configured to obtain, from the power ratio, a corresponding smoothing factor for removing the non-speech segments;
    a first deriving submodule, configured to obtain, from the smoothing factor, a covariance matrix of the band features in each of the sub-bands; and
    a second deriving submodule, configured to perform eigendecomposition on the covariance matrix to obtain an output weight vector for each of the sub-bands.
  13. The speech enhancement apparatus according to claim 9, wherein the first obtaining module comprises:
    a third obtaining submodule, configured to obtain first time-domain signals of the current speech signal, collected separately by the dual-microphone voice channels;
    an input submodule, configured to input the first time-domain signals into band-pass filters respectively corresponding to the dual-microphone voice channels, to obtain time-domain signals within a specified frequency range; and
    a conversion submodule, configured to convert the time-domain signals within the specified frequency range, through Fourier transforms respectively associated with the dual-microphone voice channels, into frequency-domain signals of the current speech signal within the specified frequency range.
  14. The speech enhancement apparatus according to claim 13, comprising:
    a conversion module, configured to convert the frequency-domain signal into output time-domain signals by inputting the second beam output of the frequency-domain signal into inverse Fourier transformers respectively associated with the dual-microphone voice channels; and
    an output module, configured to output the corresponding output time-domain signals through the respective dual-microphone voice channels.
  15. The speech enhancement apparatus according to claim 9, comprising:
    a selection module, configured to select a Fourier transform with a specified number of frequency points according to the computational load level of the frequency-domain processing platform; and
    a deriving module, configured to preprocess the first time-domain signals of the current speech signal, collected separately by the dual-microphone voice channels, and then obtain the frequency-domain signals corresponding to the first time-domain signals through the Fourier transform with the specified number of frequency points.
  16. The speech enhancement apparatus according to claim 9, wherein the dividing module comprises:
    a third obtaining submodule, configured to obtain the total number of frequency points of the frequency-domain signal corresponding to the first time-domain signal, as obtained through the Fourier transform with the specified number of frequency points; and
    a second dividing submodule, configured to evenly divide the frequency-domain signal into a plurality of sequentially arranged sub-bands according to the total number of frequency points.
  17. A speech enhancement device, comprising a memory, a processor and an application program, the application program being stored in the memory and configured to be executed by the processor, characterized in that the application program is configured to perform the speech enhancement method according to any one of claims 1 to 8.
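The front end of claims 5 and 6 (band-pass filtering each microphone's time-domain signal, a Fourier transform into the frequency domain, and an inverse transform back to the time domain) can be sketched in plain Python. The filter design (a Hamming-windowed sinc FIR), the tap count and the naive O(N^2) DFT are assumptions chosen for illustration; the claims specify none of them, and a practical implementation would use an FFT.

```python
import cmath
import math


def bandpass_fir(f_lo, f_hi, fs, num_taps=63):
    """Hamming-windowed sinc band-pass FIR; an odd tap count keeps it linear-phase."""
    m = (num_taps - 1) / 2

    def ideal_lowpass(fc, n):
        # Impulse response of an ideal low-pass with cutoff fc, delayed by m samples.
        x = n - m
        wc = 2 * fc / fs  # cutoff normalised so that 1.0 is the Nyquist frequency
        return wc if x == 0 else math.sin(math.pi * wc * x) / (math.pi * x)

    taps = []
    for n in range(num_taps):
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        # Band-pass = low-pass(f_hi) minus low-pass(f_lo), then windowed.
        taps.append((ideal_lowpass(f_hi, n) - ideal_lowpass(f_lo, n)) * window)
    return taps


def fir_filter(x, taps):
    """Direct-form FIR, y[n] = sum_k taps[k] * x[n - k], zero initial state."""
    return [sum(t * x[n - k] for k, t in enumerate(taps) if n - k >= 0)
            for n in range(len(x))]


def dft(x):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft(): x[t] = (1/N) * sum_k X[k] * exp(+2j*pi*k*t/N)."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]
```

Each channel would be processed independently, e.g. `dft(fir_filter(frame, bandpass_fir(300.0, 3400.0, 8000.0)))`; the 300 to 3400 Hz range and 8 kHz sample rate are only example values, not figures from the patent.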
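The sub-band division rules of claims 2, 7 and 8 can be sketched as (start, end) index ranges over the FFT bins. The FFT-size table in `choose_fft_size` is a hypothetical mapping, the sensitive-band bin limits are caller-supplied assumptions, and the handling of remainders (spreading extra bins over the first sub-bands in claim 8, a shorter last sub-band in claim 2) is a guess, since the claims only say "evenly".

```python
def choose_fft_size(compute_level):
    """Hypothetical mapping from a platform's compute budget to an FFT size (claim 7)."""
    return {"low": 128, "medium": 256, "high": 512}[compute_level]


def uniform_subbands(num_bins, num_subbands):
    """Claim 8 sketch: split bins [0, num_bins) into contiguous, near-equal ranges."""
    base, extra = divmod(num_bins, num_subbands)
    bands, start = [], 0
    for k in range(num_subbands):
        width = base + (1 if k < extra else 0)  # spread leftover bins over the first bands
        bands.append((start, start + width))
        start += width
    return bands


def split_evenly(lo, hi, width):
    """Consecutive (start, end) ranges of `width` bins; the last one may be shorter."""
    return [(s, min(s + width, hi)) for s in range(lo, hi, width)]


def sensitive_subbands(num_bins, sens_lo, sens_hi, fine_width, coarse_width):
    """Claim 2 sketch: narrow sub-bands inside the sensitive band [sens_lo, sens_hi),
    wider sub-bands everywhere else."""
    if coarse_width <= fine_width:
        raise ValueError("second sub-bands must be wider than first sub-bands")
    return (split_evenly(0, sens_lo, coarse_width)
            + split_evenly(sens_lo, sens_hi, fine_width)
            + split_evenly(sens_hi, num_bins, coarse_width))
```

For a real-valued input, an N-point FFT yields N // 2 + 1 non-negative-frequency bins, so `choose_fft_size("medium")` gives 129 bins to partition, e.g. `uniform_subbands(129, 4)` or `sensitive_subbands(129, 8, 64, 4, 16)`.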
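Claims 1 and 3 compute, for each sub-band, a direction (steering) vector at the band center frequency and MVDR weights from the sub-band covariance matrix, w = R^-1 d / (d^H R^-1 d). The sketch below assumes a two-microphone far-field linear array, a speed of sound of 343 m/s, diagonal loading of the covariance, and a single weight applied to every bin of a sub-band. All function names and parameters are illustrative, not taken from the patent. The averaging step of claim 1 is left out because the claims do not state which quantity is averaged.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not from the patent


def steering_vector(freq_hz, mic_spacing_m, angle_rad):
    """Far-field direction vector for a 2-mic linear array (mic 1 is the phase reference)."""
    tau = mic_spacing_m * math.cos(angle_rad) / SPEED_OF_SOUND
    return [1 + 0j, cmath.exp(-2j * math.pi * freq_hz * tau)]


def covariance(snapshots, diag_load=1e-3):
    """Sample covariance R = mean(x x^H) over 2-element snapshots, plus diagonal loading."""
    if not snapshots:
        raise ValueError("need at least one snapshot")
    n = len(snapshots)
    r = [[0j, 0j], [0j, 0j]]
    for x in snapshots:
        for i in range(2):
            for j in range(2):
                r[i][j] += x[i] * x[j].conjugate() / n
    r[0][0] += diag_load
    r[1][1] += diag_load
    return r


def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def mvdr_weights(r, d):
    """MVDR weights w = R^-1 d / (d^H R^-1 d), so that w^H d = 1 (distortionless)."""
    ri = inverse_2x2(r)
    rid = [ri[0][0] * d[0] + ri[0][1] * d[1], ri[1][0] * d[0] + ri[1][1] * d[1]]
    denom = d[0].conjugate() * rid[0] + d[1].conjugate() * rid[1]
    return [v / denom for v in rid]


def subband_beam_output(frame, band, fs, n_fft, mic_spacing_m, angle_rad, snapshots):
    """Design one MVDR weight at the sub-band centre frequency and apply y(k) = w^H x(k)
    to every bin k in band = (lo, hi). frame[k] is the 2-mic spectrum [X1(k), X2(k)]."""
    lo, hi = band
    f_center = 0.5 * (lo + hi - 1) * fs / n_fft
    d = steering_vector(f_center, mic_spacing_m, angle_rad)
    w = mvdr_weights(covariance(snapshots), d)
    return [w[0].conjugate() * frame[k][0] + w[1].conjugate() * frame[k][1]
            for k in range(lo, hi)]
```

The weights pass the look direction with unit gain while minimizing output power, so an interferer that dominates the covariance snapshots is strongly attenuated; the diagonal loading keeps the 2x2 inverse well conditioned when the snapshots contain a single source.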
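Claim 4 derives a smoothing factor from the power ratio of two adjacent non-speech segments, uses it to update a sub-band covariance matrix recursively, and eigendecomposes that matrix to obtain an output weight vector. The claims specify neither the VAD, nor the ratio-to-factor mapping, nor which eigenvector becomes the weight, so the energy-threshold VAD, the dB-based mapping, the constants and the minimum-eigenvalue choice below are hypothetical, chosen only to make the steps concrete.

```python
import math


def energy_vad(frame_powers, threshold):
    """Toy energy VAD (assumption): True marks a non-speech frame (power below threshold)."""
    return [p < threshold for p in frame_powers]


def smoothing_factor(p_prev, p_curr, a_min=0.80, a_max=0.98, slope=0.05):
    """Hypothetical map from the power ratio of two adjacent non-speech segments to a
    smoothing factor: a steady noise floor (ratio near 1) gives slow averaging, a power
    jump gives faster tracking. The result is clamped to [a_min, a_max]."""
    change_db = abs(10.0 * math.log10(p_curr / p_prev))
    return max(a_min, a_max - slope * change_db)


def update_covariance(r, x, alpha):
    """Recursive estimate R <- alpha * R + (1 - alpha) * x x^H for a 2-element snapshot x."""
    return [[alpha * r[i][j] + (1 - alpha) * x[i] * x[j].conjugate() for j in range(2)]
            for i in range(2)]


def eig_hermitian_2x2(r):
    """Closed-form eigendecomposition of a 2x2 Hermitian matrix, eigenvalues ascending."""
    a, d, b = r[0][0].real, r[1][1].real, r[0][1]
    if abs(b) < 1e-12:  # already diagonal
        vecs = [[1 + 0j, 0j], [0j, 1 + 0j]]
        return ([a, d], vecs) if a <= d else ([d, a], vecs[::-1])
    mean = 0.5 * (a + d)
    half = math.sqrt((0.5 * (a - d)) ** 2 + abs(b) ** 2)
    vals = [mean - half, mean + half]
    vecs = []
    for lam in vals:
        v = [b, complex(lam - a)]  # satisfies (R - lam * I) v = 0
        norm = math.sqrt(abs(v[0]) ** 2 + abs(v[1]) ** 2)
        vecs.append([v[0] / norm, v[1] / norm])
    return vals, vecs


def output_weight(r):
    """Hypothetical choice: the eigenvector of the smallest eigenvalue of the noise covariance."""
    _, vecs = eig_hermitian_2x2(r)
    return vecs[0]
```

A closed form is used because each sub-band covariance is only 2x2 for two microphones; with more microphones a general Hermitian eigensolver would be needed.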
PCT/CN2019/076189 2018-04-27 2019-02-26 Speech enhancement method, device and equipment WO2019205798A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810395019.9A CN108447500B (en) 2018-04-27 2018-04-27 Method and device for speech enhancement
CN201810395019.9 2018-04-27

Publications (1)

Publication Number Publication Date
WO2019205798A1 (en) 2019-10-31

Family

ID=63201941

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/076189 WO2019205798A1 (en) 2018-04-27 2019-02-26 Speech enhancement method, device and equipment

Country Status (2)

Country Link
CN (1) CN108447500B (en)
WO (1) WO2019205798A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108447500B (en) * 2018-04-27 2020-08-18 深圳市沃特沃德股份有限公司 Method and device for speech enhancement
CN108717855B (en) * 2018-04-27 2020-07-28 深圳市沃特沃德股份有限公司 Noise processing method and device
CN109151211B (en) * 2018-09-30 2022-01-11 Oppo广东移动通信有限公司 Voice processing method and device and electronic equipment
CN110021307B (en) * 2019-04-04 2022-02-01 Oppo广东移动通信有限公司 Audio verification method and device, storage medium and electronic equipment
CN110838307B (en) * 2019-11-18 2022-02-25 思必驰科技股份有限公司 Voice message processing method and device
CN111179960B (en) * 2020-03-06 2022-10-18 北京小米松果电子有限公司 Audio signal processing method and device and storage medium
CN111429933B (en) * 2020-03-06 2022-09-30 北京小米松果电子有限公司 Audio signal processing method and device and storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
CN103354937A (en) * 2011-02-10 2013-10-16 杜比实验室特许公司 Post-processing including median filtering of noise suppression gains
CN104157295A (en) * 2014-08-22 2014-11-19 中国科学院上海高等研究院 Method used for detecting and suppressing transient noise
CN107749305A (en) * 2017-09-29 2018-03-02 百度在线网络技术(北京)有限公司 Method of speech processing and its device
CN108447500A (en) * 2018-04-27 2018-08-24 深圳市沃特沃德股份有限公司 The method and apparatus of speech enhancement

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
CN101599274B (en) * 2009-06-26 2012-03-28 瑞声声学科技(深圳)有限公司 Method for speech enhancement
CN101916567B (en) * 2009-11-23 2012-02-01 瑞声声学科技(深圳)有限公司 Speech enhancement method applied to dual-microphone system
CN101976565A (en) * 2010-07-09 2011-02-16 瑞声声学科技(深圳)有限公司 Dual-microphone-based speech enhancement device and method
US10074380B2 (en) * 2016-08-03 2018-09-11 Apple Inc. System and method for performing speech enhancement using a deep neural network-based signal
US10181321B2 (en) * 2016-09-27 2019-01-15 Vocollect, Inc. Utilization of location and environment to improve recognition
CN107391498B (en) * 2017-07-28 2020-10-27 深圳市沃特沃德股份有限公司 Voice translation method and device

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN112420068A (en) * 2020-10-23 2021-02-26 四川长虹电器股份有限公司 Quick self-adaptive beam forming method based on Mel frequency scale frequency division
CN112420068B (en) * 2020-10-23 2022-05-03 四川长虹电器股份有限公司 Quick self-adaptive beam forming method based on Mel frequency scale frequency division

Also Published As

Publication number Publication date
CN108447500B (en) 2020-08-18
CN108447500A (en) 2018-08-24

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 19792957

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 19792957

Country of ref document: EP

Kind code of ref document: A1