WO2019205798A1

WO2019205798A1 - Speech enhancement method, device and equipment

Info

Publication number: WO2019205798A1
Application number: PCT/CN2019/076189
Authority: WO
Inventors: 安黄彬
Original assignee: 深圳市沃特沃德股份有限公司
Priority date: 2018-04-27
Filing date: 2019-02-26
Publication date: 2019-10-31
Also published as: CN108447500B; CN108447500A

Abstract

A speech enhancement method. Speech signals are acquired by means of dual-microphone speech channels, and the various speech channels respectively perform a speech enhancement processing. The method comprises: acquiring a frequency domain signal of a current speech signal (S1); dividing the frequency domain signal according to a preset rule into multiple sequentially arranged sub-bands (S2); respectively calculating a first beam output of the various sub-bands according to the minimum variance distortionless response (MVDR) algorithm (S3); and acquiring a second beam output of the frequency domain signal by calculating the mean value of the various first beam outputs (S4).

Description

Voice enhancement method, apparatus and device

Technical field

The present invention relates to the field of communications, and more particularly to a voice enhancement method, apparatus and device.

Background technique

In existing voice communication, interference from environmental noise is unavoidable. Ambient noise causes the communication device to receive a voice signal contaminated by noise, which degrades the quality of the voice signal. In particularly noisy public environments such as cars, airplanes, boats, airports and shopping malls, strong background noise seriously affects communication quality, causes listening fatigue, and affects the user's mood and nervous activity, so noise reduction processing of the call voice is urgently needed to improve speech intelligibility. However, existing dual-microphone noise reduction methods require a large amount of frequency domain processing, and their noise reduction and voice enhancement performance still needs to be improved.

Technical problem

The main object of the present invention is to provide a voice enhancement method, apparatus and device, aiming to solve the technical problem that, in existing voice calls, speech intensity and speech intelligibility are low because of noise.

Technical solution

The present invention provides a voice enhancement method in which voice signals are collected through dual-microphone voice channels and each voice channel performs voice enhancement processing separately, the method including: acquiring a frequency domain signal of a current voice signal; dividing the frequency domain signal into a plurality of sequentially arranged sub-bands according to a preset rule; calculating a first beam output of each of the sub-bands according to the minimum variance distortionless response (MVDR) algorithm; and obtaining a second beam output of the frequency domain signal by averaging the first beam outputs.

The present invention also provides a voice enhancement apparatus in which voice signals are collected through dual-microphone voice channels and each voice channel performs voice enhancement processing separately, the apparatus including: a first acquisition module, configured to acquire a frequency domain signal of a current voice signal; a dividing module, configured to divide the frequency domain signal into a plurality of sequentially arranged sub-bands according to a preset rule; a calculating module, configured to separately calculate a first beam output of each of the sub-bands according to the minimum variance distortionless response (MVDR) algorithm; and a second acquisition module, configured to obtain a second beam output of the frequency domain signal by averaging the first beam outputs.

The present invention also provides a voice enhancement device comprising a memory, a processor and an application, the application being stored in the memory and configured to be executed by the processor, the application being configured to perform the voice enhancement method described above.

Beneficial effect

Advantageous technical effects of the present invention: the present invention decomposes the wideband frequency domain signal of a voice signal collected by dual microphones into a plurality of non-overlapping narrow bands, calculates the MVDR (Minimum Variance Distortionless Response) beam output of each sub-band by the MVDR algorithm, and averages the MVDR beam outputs of the sub-bands to obtain the MVDR beam output of the entire wideband frequency domain signal. This avoids the poor noise reduction that traditional processing methods such as delay-and-sum, sidelobe cancellation, and direct MVDR calculation achieve on wideband frequency domain signals, and improves the voice enhancement effect. In calculating the MVDR beam output of each sub-band, the present invention also tracks the variation of environmental noise in each sub-band and dynamically adjusts the smoothing factor according to the noise fluctuation, which improves the noise processing effect. When processing the wideband frequency domain signal of the voice signal collected by the dual microphones, the present invention processes only the frequency range of the call voice, which increases the processing speed and improves the real-time performance of noise reduction and voice enhancement. Users can thus hear clearer, undistorted call voice at lower signal-to-noise ratios, so the invention has practical value.

DRAWINGS

FIG. 1 is a schematic flow chart of a method for voice enhancement according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a method for reducing the frequency domain processing amount in a voice enhancement method according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart of a noise processing method in a method for voice enhancement according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an apparatus for voice enhancement according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a dividing module according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a calculating module according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a first acquisition module according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of an optimized structure of a voice enhancement apparatus according to an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of an apparatus for voice enhancement according to another embodiment of the present invention;

FIG. 10 is a schematic structural diagram of a dividing module according to another embodiment of the present invention;

FIG. 11 is a schematic structural diagram of a dividing module according to still another embodiment of the present invention;

FIG. 12 is a schematic structural diagram of a noise processing system according to an embodiment of the present invention;

FIG. 13 is a schematic structural diagram of a first acquisition submodule according to still another embodiment of the present invention;

FIG. 14 is a schematic structural diagram of a second acquisition submodule according to still another embodiment of the present invention;

FIG. 15 is a schematic structural diagram of a first acquisition submodule according to still another embodiment of the present invention.

BEST MODE FOR CARRYING OUT THE INVENTION

Referring to FIG. 1, in a voice enhancement method according to an embodiment of the present invention, voice signals are collected through dual-microphone voice channels and each voice channel performs voice enhancement processing separately. The method includes:

S1: Acquire a frequency domain signal of the current voice signal.

In this embodiment, the frequency domain signal refers to the signal data obtained by applying an FFT (Fast Fourier Transform) to the time domain signal of the voice signal collected by the dual-microphone voice channels. Because the voice signal in this embodiment is collected through dual-microphone voice channels, the voice signals of the same time domain frame collected by the left and right channels are processed identically and simultaneously. For example, each of the dual-microphone voice channels of this embodiment is connected to its own FFT, and the transformed signal data is buffered in two buffers of the same length for subsequent processing to enhance the voice processing effect.

S2: The foregoing frequency domain signal is divided into a plurality of sequentially arranged sub-bands according to a preset rule.

The MVDR algorithm does not perform well when applied directly to a wideband frequency domain signal: it causes serious speech distortion and degrades the quality of the output speech. In this embodiment, the wideband frequency domain signal is divided into a plurality of sequentially arranged, non-overlapping sub-bands, and the MVDR algorithm is applied to each sub-band separately to reduce speech distortion and improve the processed speech quality.

S3: Calculate the first beam output of each of the sub-bands according to the minimum variance distortionless response (MVDR) algorithm.

In the MVDR algorithm of this embodiment, the output weight vector of each sub-band is obtained from the associated covariance matrix. The MVDR beamformer of this embodiment is formed by a linear array of identical spatial sensors. The covariance matrix of the data is obtained from the data received by the array and is used to find the angle corresponding to the maximum point, that is, the incident direction of the speech signal. The array output power is then minimized while the signal from the desired direction is passed without distortion, which maximizes the signal-to-noise ratio. In this embodiment, the MVDR algorithm is applied to each sub-band separately to obtain the first beam output (i.e., frequency data) corresponding to each sub-band, which improves the effect of the MVDR algorithm on the frequency domain signal of the voice signal and reduces speech distortion.

S4: Obtain the second beam output of the frequency domain signal by averaging the first beam outputs.

In this embodiment, the frequency data in all the sub-band buffers corresponding to a time domain frame of the voice signal is added and then averaged to obtain the output frequency data of the frequency domain signal corresponding to that time domain frame, and the left and right channels of the dual-microphone voice channels are output separately. Steps S1 to S4 are then repeated until all the time domain frame data of the voice signal has been processed.
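The averaging in step S4 can be illustrated with a minimal sketch. The function name average_beam_outputs and the representation of each sub-band output as a single complex value are assumptions made for illustration only; the description does not fix a data layout.

```python
def average_beam_outputs(first_outputs):
    """Average the per-sub-band beam outputs of one channel and one frame.

    first_outputs: list of complex values, one per sub-band (hypothetical
    layout chosen for this sketch).
    """
    if not first_outputs:
        raise ValueError("at least one sub-band output is required")
    # Add the outputs of all sub-bands, then divide by the sub-band count.
    return sum(first_outputs) / len(first_outputs)


# Three made-up sub-band outputs for one frame of one channel.
second_output = average_beam_outputs([1 + 1j, 3 - 1j, 2 + 0j])
print(second_output)  # (2+0j)
```

In a full pipeline the same call would be made once per channel and per time domain frame.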

Further, step S2 includes:

S200: Identify the sensitive frequency band in the frequency domain signal, where the sensitive frequency band is the first frequency band and the remainder of the frequency domain signal outside the sensitive frequency band is the second frequency band.

The sensitive frequency band of this embodiment is determined according to the use of the voice signal. For example, the frequency band of call voice is 200 Hz to 3400 Hz and its sensitive frequency band is 1 kHz to 2 kHz; the frequency band for listening to music is 50 Hz to 15000 Hz and its sensitive frequency band is 2 kHz to 5 kHz or 1 kHz to 4 kHz.

S201: The first frequency band is evenly divided into a plurality of first sub-bands, and the second frequency band is evenly divided into a plurality of second sub-bands, wherein the bandwidth of the second sub-bands is greater than the bandwidth of the first sub-bands.

In this embodiment, the sensitive frequency band is divided finely and the frequency bands outside it are divided coarsely, that is, the bandwidth of a sub-band in the sensitive frequency band is smaller than the bandwidth of a sub-band outside it. Speech distortion in the sensitive frequency band is thereby reduced, while the coarser division outside the sensitive frequency band avoids the extra computation that an excessive number of sub-bands would cause.
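As a rough sketch of steps S200 and S201, the call voice band of 200 Hz to 3400 Hz can be split so that the sensitive band of 1 kHz to 2 kHz receives narrower sub-bands. The sub-band widths of 100 Hz and 200 Hz, and the function names, are hypothetical values chosen for illustration; the description does not specify them.

```python
def split_by_width(lo_hz, hi_hz, width_hz):
    """Split [lo_hz, hi_hz) into consecutive sub-bands of equal width."""
    span = hi_hz - lo_hz
    if span % width_hz != 0:
        raise ValueError("band span must be a multiple of the sub-band width")
    return [(lo_hz + i * width_hz, lo_hz + (i + 1) * width_hz)
            for i in range(span // width_hz)]


def divide_band(band, sensitive, fine_width_hz, coarse_width_hz):
    """Fine sub-bands inside the sensitive band, coarse sub-bands outside it."""
    if coarse_width_hz <= fine_width_hz:
        raise ValueError("second sub-bands must be wider than first sub-bands")
    lo, hi = band
    s_lo, s_hi = sensitive
    return (split_by_width(lo, s_lo, coarse_width_hz)     # second band, lower part
            + split_by_width(s_lo, s_hi, fine_width_hz)   # first (sensitive) band
            + split_by_width(s_hi, hi, coarse_width_hz))  # second band, upper part


# Assumed widths: 100 Hz inside 1 kHz to 2 kHz, 200 Hz elsewhere.
subbands = divide_band((200, 3400), (1000, 2000), 100, 200)
print(len(subbands))  # 21
```

With these assumed widths the result is 4 coarse sub-bands below the sensitive band, 10 fine sub-bands inside it, and 7 coarse sub-bands above it.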

Further, the step S3 of calculating the first beam output of each of the sub-bands according to the minimum variance distortionless response algorithm includes:

S300: Perform voice activity detection in each of the sub-bands to obtain the power ratio of two adjacent non-speech segments.

In this embodiment, the power spectrum of the non-speech segments (i.e., noise) is estimated by voice activity detection during the gaps in the speech signal, so that the trend of the surrounding environmental noise can be judged promptly and the noise tracked closely. The power variation of the non-speech segments is tracked through the change in the power ratio of two non-speech segments: an increase in the power ratio indicates that the noise intensity is increasing, and a decrease indicates that it is weakening.

S301: Acquire, according to the power ratio, the corresponding smoothing factor for removing the non-speech segment.

In this embodiment, the smoothing factor for removing the non-speech segment is dynamically adjusted according to the tracked change in noise power. When the environmental noise varies quickly relative to the sampling rate, the smoothing factor should be set smaller; when the environmental noise varies slowly relative to the sampling rate, or the noise power is relatively strong, the smoothing factor should be set larger. Changes in the spatial sound field are thereby tracked in time, environmental noise changes are tracked better and the degree of noise removal is adjusted accordingly, which effectively smooths the noise fluctuation, reduces its influence, further improves the signal-to-noise ratio of the dual-microphone noise reduction, and improves the sound quality of the output speech signal.

S302: Obtain the covariance matrix of the frequency band features in each of the sub-bands according to the smoothing factor.

The covariance matrix is updated promptly according to the dynamically changing smoothing factor so that the incident direction of the speech signal is determined more accurately, further reducing the influence of ambient noise on the signals collected by the dual-microphone voice channels.

S303: Perform eigendecomposition on the covariance matrix to obtain the output weight vector of each of the sub-bands.

The MVDR algorithm of this embodiment produces a covariance matrix, and the output weight vector corresponding to the covariance matrix, that is, the first beam output, is obtained by eigendecomposition.

Further, the step S1 of acquiring the frequency domain signal of the current voice signal includes:

S100: Acquire the first time domain signals of the current voice signal collected separately by the dual-microphone voice channels.

The dual-microphone voice channels of this embodiment collect time domain signals of the voice signal, and the time domain signals are arranged sequentially in time. The first time domain signal in this embodiment is any one of these time domain signals; terms such as "first" herein serve only to distinguish and are not limiting.

S101: Input the first time domain signals to the band-pass filters respectively corresponding to the dual-microphone voice channels to obtain time domain signals of a specified frequency range.

In this embodiment, only the voice frequency band of interest is selected, to reduce the amount of data processing and improve real-time performance. The voice frequency band of interest in this embodiment is the frequency range of human speech, that is, 200 Hz to 3400 Hz, which is sufficient for enhancing speech while avoiding distortion of normal speech. In this embodiment, signal components outside the 200 Hz to 3400 Hz band are all filtered out in preprocessing, and full coverage of 200 Hz to 3400 Hz is ensured, so that less data is processed while the speech remains undistorted.

S102: Convert the time domain signals of the specified frequency range into frequency domain signals of the specified frequency range of the current voice signal by Fourier transforms respectively associated with the dual-microphone voice channels.

Operations such as sub-band division and noise processing in this embodiment are performed on the frequency domain signal. In this embodiment, each time domain signal is converted into a frequency domain signal by an FFT. The voice signals of the dual-microphone voice channels undergo the same conversion synchronously, and the converted data is buffered in two identical buffers.

Further, after the step S4 of obtaining the second beam output of the frequency domain signal by averaging the first beam outputs, the method includes:

S5: Convert the frequency domain signal into an output time domain signal by inputting the second beam output of the frequency domain signal to the inverse Fourier transformers respectively associated with the dual-microphone voice channels.

In this embodiment, the time domain signal of the voice signal collected by the dual-microphone voice channels is converted into a frequency domain signal and then subjected to noise reduction, voice enhancement and other processing. The processed frequency domain signal must be converted back into the corresponding time domain signal by an inverse Fourier transformer before it can be heard and recognized by the human ear.

S6: Output the corresponding output time domain signals through the dual-microphone voice channels.

In this embodiment, the voice signals collected by the dual-microphone voice channels are processed synchronously in the left and right voice channels through frequency band filtering, FFT, sub-band division, noise reduction and inverse FFT, and are synthesized into one signal at the output.

Referring to FIG. 2, in a voice enhancement method according to another embodiment of the present invention, the voice signal is preprocessed to reduce the amount of frequency domain processing. In this embodiment, the following steps are performed before step S2:

S20: Select a Fourier transform with a specified number of frequency points according to the computing capability of the frequency domain processing platform.

The specified number of frequency points in this embodiment includes FFTs of 1024 points, 2048 points, 256 points and so on. In this embodiment, 1024 points is preferred, as it achieves a satisfactory processing effect within a suitable amount of computation.

S21: Preprocess the first time domain signals of the current voice signal collected separately by the dual-microphone voice channels, and then receive the frequency domain signals corresponding to the first time domain signals obtained by the Fourier transform with the specified number of frequency points.

In this embodiment, a voice signal with a frequency range of 200 Hz to 3400 Hz is transformed by a 1024-point FFT, yielding a frequency domain signal distributed over about 144 frequency points. By comparison, processing the full band that contains 200 Hz to 3400 Hz would require a full frequency domain signal of about 512 frequency points, so the amount of computation is greatly reduced.

Further, the step S2 of dividing the frequency domain signal into a plurality of sequentially arranged sub-bands according to a preset rule includes:

S202: Acquire the total number of frequency points of the frequency domain signal corresponding to the first time domain signal obtained by the Fourier transform with the specified number of frequency points.

For example, the total number of frequency points of the first time domain signal in this embodiment is 144, and the sub-band division is then based on these 144 points.

S203: Uniformly divide the frequency domain signal into a plurality of sequentially arranged sub-bands according to the total number of frequency points.

In the sub-band division process of this embodiment, the division may be performed by configuring the number of frequency points in each sub-band. For example, if each sub-band is configured to contain 24 frequency points, the number of sub-bands of the first time domain signal is 144 divided by 24, which is 6 sub-bands. Other embodiments of the present invention may configure each sub-band to contain 8 frequency points, 6 frequency points, and so on, so that the sub-bands are divided evenly. When each sub-band contains 8 frequency points, the number of sub-bands is 18; when each sub-band contains 6 frequency points, the number of sub-bands is 24. This embodiment prefers the division scheme in which each sub-band contains 6 frequency points and the number of sub-bands is 24, in order to optimize the noise reduction and voice enhancement effect. The more sub-bands there are, the narrower each sub-band is and the less speech distortion remains after the MVDR algorithm, although the amount of computation increases slightly; the fewer sub-bands there are, the smaller the amount of computation, but the wider each sub-band is and the greater the distortion.
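The uniform division of step S203 can be sketched as follows, using the total of 144 frequency points given in the description. The function name and the representation of each sub-band as a pair of inclusive indices are assumptions made for illustration.

```python
def uniform_subbands(total_points, points_per_subband):
    """Return (first_index, last_index) pairs, inclusive, for each sub-band."""
    if total_points % points_per_subband != 0:
        raise ValueError("total_points must divide evenly into sub-bands")
    count = total_points // points_per_subband
    return [(k * points_per_subband, (k + 1) * points_per_subband - 1)
            for k in range(count)]


# Sub-band counts for the three configurations named in the description.
for size in (24, 8, 6):
    print(size, len(uniform_subbands(144, size)))
# 24 6
# 8 18
# 6 24
```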

Further, after the step S201 of uniformly dividing the first frequency band into the plurality of first sub-bands and uniformly dividing the second frequency band into the plurality of second sub-bands, the method includes:

S204: Calculate the center frequency of each first sub-band and each second sub-band.

In this embodiment, the center frequency of a sub-band is obtained in order to derive the direction vector of the sub-band, so as to better control the optimal angle for collecting the voice signal and avoid picking up the strongest noise interference when collecting the voice signal. The first sub-bands and the second sub-bands of this embodiment are processed on the same principle and differ only in bandwidth. This embodiment is described in detail taking uniform sub-band division as an example. After the 1024-point FFT of the wideband frequency domain signal of this embodiment, the resolution of each frequency point is 16000/1024 Hz, and the frequency point indices corresponding to the frequency range of 200 Hz to 3400 Hz are 12 to 207. The bandwidth of each sub-band is band_siz=(up-low)/numband, where up is the frequency point index corresponding to 3400 Hz, low is the frequency point index corresponding to 200 Hz, and numband is the number of sub-bands. With a division into 24 sub-bands, each sub-band spans 8 frequency point indices. The center frequency point index of the k-th sub-band is fv(k)=((low+(k-1)*band_siz)+(low+(k-1)*band_siz+band_siz-1))/2, and the center frequency of the corresponding sub-band is F_center=fv(k)/FFT_siz*Fs, where FFT_siz is the Fourier transform length, i.e., 1024 points, and Fs is the sampling frequency, i.e., 16000 Hz.
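The center frequency formulas of step S204 can be checked with a short sketch. It uses the values stated in the description (low=12, up=207, numband=24, FFT_siz=1024, Fs=16000). Rounding band_siz down to an integer is an assumption, made because (207-12)/24 is not a whole number while the description gives 8 frequency points per sub-band.

```python
FFT_SIZ = 1024  # Fourier transform length
FS = 16000      # sampling frequency in Hz


def center_frequency(k, low=12, up=207, numband=24):
    """Center frequency in Hz of the k-th sub-band (k starts at 1)."""
    band_siz = (up - low) // numband      # assumed floor division, gives 8
    first = low + (k - 1) * band_siz      # lowest index of the sub-band
    last = first + band_siz - 1           # highest index of the sub-band
    fv = (first + last) / 2               # center frequency point index
    return fv / FFT_SIZ * FS


print(center_frequency(1))   # 242.1875
print(center_frequency(24))  # 3117.1875
```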

S205: Calculate, according to the center frequency, the direction vector corresponding to each first sub-band and each second sub-band.

In this embodiment, the direction vector is calculated by substituting the center frequency calculated above into the following formula: vssL=exp(delay*(-j)*2*pi*F_center), where vssL is the calculated direction vector, j is the imaginary unit (the square root of -1), pi is the constant 3.1415926, exp(a) is the exponential function with base e=2.71828183, and delay is the vector of delays of the left and right voice channels of the dual microphones. Usually, the left voice channel is taken as the reference, and the time delay of the right voice channel relative to the left voice channel is tao, so delay=[0, tao]. The time delay estimate tao can be obtained by cross-correlation of the data collected by the dual-microphone voice channels.
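A minimal sketch of step S205 follows. It assumes that tao is expressed in seconds and that the delay is estimated as the integer sample lag that maximizes a plain time domain cross-correlation; the function names and the toy signals are hypothetical, and the description does not prescribe a particular cross-correlation method.

```python
import cmath
import math

FS = 16000  # sampling frequency in Hz, as stated in the description


def estimate_delay_samples(left, right, max_lag):
    """Lag (in samples) of `right` relative to `left` with maximum correlation."""
    best_lag, best_score = 0, float("-inf")
    n = len(left)
    for lag in range(-max_lag, max_lag + 1):
        score = sum(left[i] * right[i + lag]
                    for i in range(n) if 0 <= i + lag < n)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def direction_vector(f_center_hz, tao_s):
    """vssL = exp(delay * (-j) * 2 * pi * F_center) with delay = [0, tao]."""
    delay = [0.0, tao_s]
    return [cmath.exp(d * (-1j) * 2 * math.pi * f_center_hz) for d in delay]


# Toy frames: the right channel is the left channel delayed by 2 samples.
left = [0, 0, 1, 2, 1, 0, 0, 0]
right = [0, 0, 0, 0, 1, 2, 1, 0]
lag = estimate_delay_samples(left, right, max_lag=3)
tao = lag / FS                        # assumed conversion to seconds
vssL = direction_vector(1000.0, tao)
print(lag)  # 2
```

The first element of vssL is 1 because the left channel is the reference, and the second element has unit magnitude with a phase determined by tao and the center frequency.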

S206: Obtain, according to the direction vector, the covariance matrix of the frequency band features corresponding to each first sub-band and each second sub-band, and the optimal weight coefficient corresponding to the inverse of the covariance matrix.

In this embodiment, signals are collected through dual-microphone voice channels, so the covariance matrix has 2 rows and 2 columns. The inverse of the covariance matrix is computed and denoted r_inv, and W_opt denotes the optimal weight coefficient of the current sub-band; then W_opt=r_inv*vssL/(vssL'*r_inv*vssL), where vssL is the direction vector and vssL' is its transpose. For example, if the original vector has one row and two columns, its transpose has two rows and one column. The optimal weight coefficient corresponds to the optimal angle of the dual-microphone voice channels found when searching for the user's speech within the scanning angle range. For example, if, when scanning from -45° to 45°, the noise carried by the user's speech signal is lowest at 60°, then 60° is the optimal angle.
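The weight formula W_opt=r_inv*vssL/(vssL'*r_inv*vssL) can be sketched for the 2-by-2 case in plain Python. The sketch treats vssL as a column vector and reads vssL' as the conjugate transpose, which is an assumption (it is the usual choice for complex data); the function names and the example covariance matrix are made up for illustration.

```python
def invert_2x2(r):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = r
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def mvdr_weights(r, vss):
    """W_opt = r_inv * vss / (vss' * r_inv * vss) for two channels."""
    r_inv = invert_2x2(r)
    # Numerator: r_inv * vss (matrix times column vector).
    num = [r_inv[0][0] * vss[0] + r_inv[0][1] * vss[1],
           r_inv[1][0] * vss[0] + r_inv[1][1] * vss[1]]
    # Denominator: vss' * (r_inv * vss), with ' taken as conjugate transpose.
    den = vss[0].conjugate() * num[0] + vss[1].conjugate() * num[1]
    return [x / den for x in num]


# Made-up Hermitian covariance matrix and direction vector.
r = [[2.0 + 0j, 0.5 + 0j], [0.5 + 0j, 1.0 + 0j]]
vss = [1 + 0j, -1j]
w_opt = mvdr_weights(r, vss)

# Distortionless check: the gain toward the steering direction stays at 1.
gain = sum(w.conjugate() * v for w, v in zip(w_opt, vss))
print(abs(gain - 1) < 1e-12)  # True
```

The final check reflects the distortionless constraint of MVDR: whatever the covariance matrix, the weights leave a signal arriving from the steering direction unchanged.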

S207: Calculate, according to the optimal weight coefficient, the first signal output corresponding to each first sub-band and each second sub-band.

In this embodiment, Out_L=W_opt*S_L and Out_R=W_opt*S_R, where Out_L is the output frequency data of the left channel, Out_R is the output frequency data of the right channel, S_L is the vector of frequency points from Fbin_loL to Fbin_hiL after the FFT of the current time domain frame data collected by the left channel, and S_R is the vector of frequency points from Fbin_loL to Fbin_hiL after the FFT of the current time domain frame data collected by the right channel; that is, S_L and S_R are the frequency data in the corresponding sub-band. Fbin_loL is the index of the lower frequency boundary of the sub-band, and Fbin_hiL is the index of the upper frequency boundary of the sub-band. Finally, the output frequency data of the left and right channels is stored in buffers, and the frequency data in all the sub-band buffers corresponding to the first time domain signal is added to obtain the first signal outputs of the left and right voice channels of the dual-microphone voice channels.

Further, after the step S207 of calculating the signal output corresponding to each of the first sub-band and each of the second sub-bands according to the optimal weight coefficient, the method includes:

S208: Receive, in the time order of the received voice signal, the second time domain signal, which has the smallest time difference from the first time domain signal.

In this embodiment, the time domain frame data is processed frame by frame in the time order in which the voice signals are received, that is, frames received earlier are processed first and frames received later are processed afterwards.

S209: Subject the second time domain signal to the same processing as the first time domain signal to obtain the second signal output corresponding to the second time domain signal.

The second signal output of this embodiment is obtained by the same processing as the first signal output.

Referring to FIG. 3, in the voice enhancement method according to an embodiment of the present invention, speech intensity is improved by noise processing in the course of calculating the first beam output of each sub-band according to the minimum variance distortionless response algorithm.

Further, step S300 includes:

S3001: Perform voice activity detection on each sub-band during non-speech periods to obtain the first power at a first time, the second power at a second time, and the third power at a third time of the current first non-speech segment, where the first time, the second time, and the third time are consecutive and ordered from the most recent to the earliest.

In this embodiment, VAD (Voice Activity Detection) is performed in each sub-band, and the noise in the sub-band is estimated during the non-speech periods (i.e., when the user is not speaking) identified by the VAD, with the noise power estimates of the last three periods being retained. The time of the latest noise power estimate is the first time and the corresponding first power is P1; the time preceding the first time is the second time and the corresponding second power is P2; the time preceding the second time is the third time and the corresponding third power is P3.

S3002: Obtain the current power change of each sub-band by calculating the ratio of the first power to the second power, and obtain the previous power change of each sub-band by calculating the ratio of the second power to the third power.

In this embodiment, the ratio of the first power to the second power is expressed as Vr_cur=P1/P2, and the ratio of the second power to the third power is expressed as Vr_pre=P2/P3.

S3003: Obtain the power ratio of two adjacent non-speech segments by calculating a first ratio of the current power change to the previous power change.

The first ratio of the current power change to the previous power change in this embodiment is expressed as Value=Vr_cur/Vr_pre. If Vr_cur is significantly larger than Vr_pre, indicating a reduction in noise interference, the smoothing factor should be reduced to avoid speech distortion caused by excessive smoothing.
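Steps S3001 to S3003 reduce to two ratios and their quotient, as the following sketch shows. The function name and the sample power values are hypothetical; the guard against non-positive powers is an added assumption so that the divisions are always defined.

```python
def power_change_ratio(p1, p2, p3):
    """Value = Vr_cur / Vr_pre from the last three noise power estimates.

    p1 is the most recent estimate, p2 the one before it, p3 the earliest.
    """
    if p1 <= 0 or p2 <= 0 or p3 <= 0:
        raise ValueError("noise power estimates must be positive")
    vr_cur = p1 / p2   # current power change
    vr_pre = p2 / p3   # previous power change
    return vr_cur / vr_pre


print(power_change_ratio(2.0, 2.0, 2.0))  # 1.0, steady noise
print(power_change_ratio(4.0, 2.0, 2.0))  # 2.0, noise power has started rising
```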

Further, step S301 of the embodiment includes:

S3011: Determine whether the first ratio is within a preset range.

The preset range of this embodiment is a Value of 0.8 to 1.2.

S3012: If so, select the initial smoothing factor as the smoothing factor of the current time.

In this embodiment, if Value is in the range of 0.8 to 1.2, the smoothing factor is set to its initial value, for example 1.0.

Further, after the step S3011, the method further includes:

S3013: If not, calculate a second ratio of the initial smoothing factor to the first ratio.

In this embodiment, if Value is not in the range of 0.8 to 1.2, that is, Value is greater than 1.2 or less than 0.8, the second ratio is calculated and used as the smoothing factor. For example, if the current Value is 1.1, the second ratio is 1.0/1.1 and the smoothing factor at the current time is 1.0/1.1.

S3014: Set the second ratio as the smoothing factor of the current time.

In this embodiment, the smoothing factor for noise removal is adjusted dynamically in real time, which reduces the influence of noise fluctuation, further improves the signal-to-noise ratio of the dual-microphone noise reduction, and improves the sound quality of the output voice signal.
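The selection rule of steps S3011 to S3014 can be written as a small function. The function and parameter names are hypothetical; the range of 0.8 to 1.2 and the initial value of 1.0 are taken from the description, and the range bounds are assumed to be inclusive.

```python
def smoothing_factor(value, init=1.0, lo=0.8, hi=1.2):
    """Pick the smoothing factor for the current time from the first ratio."""
    if lo <= value <= hi:
        # S3012: the first ratio is within the preset range.
        return init
    # S3013 and S3014: use the second ratio, init divided by the first ratio.
    return init / value


print(smoothing_factor(1.0))  # 1.0
print(smoothing_factor(2.0))  # 0.5
```

A first ratio well above the preset range therefore lowers the smoothing factor, in line with the adjustment described for step S3003.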

Further, step S302 of the embodiment includes:

S3021: Acquire a frequency point vector of a sub-band subscript of the current time subscript to an upper boundary; 3022: update the sub-band covariance matrix according to a smoothing factor of the current time and a frequency point vector.

The covariance matrix of this embodiment is updated in real time. Taking the processing of the time domain signal collected by the left channel of the dual microphone as an example, after the frequency domain signal corresponding to the time domain signal is divided into sub-bands, the covariance matrix is updated according to the following formula: R_SUBBAND_new=R_SUBBAND_old*alfa+S_L*S_L'*(1-alfa), where alfa is the smoothing factor of the current time, R_SUBBAND_new is the updated covariance matrix, R_SUBBAND_old is the covariance matrix of the previous time, S_L is the frequency point vector from frequency point Fbin_loL to frequency point Fbin_hiL obtained by applying the FFT to the current time domain frame data acquired by the left channel, and S_L' is the transpose of that frequency point vector.
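The update formula can be illustrated with a short sketch in plain Python. The function name `update_covariance` is hypothetical, and the sketch assumes that S_L' denotes the conjugate transpose, which is the usual choice for complex FFT bins; the source only says "transposition". The value alfa = 0.5 is chosen purely to keep the arithmetic easy to follow.

```python
def update_covariance(r_old, s, alfa):
    """Recursive update R_new = R_old*alfa + S*S'*(1 - alfa).

    r_old: n x n covariance matrix of the previous time (list of lists).
    s:     frequency point vector of the sub-band (n complex FFT bins).
    alfa:  smoothing factor of the current time.
    Assumption: S' is taken as the conjugate transpose of S.
    """
    n = len(s)
    if len(r_old) != n or any(len(row) != n for row in r_old):
        raise ValueError("r_old must be an n x n matrix matching len(s)")
    return [
        [alfa * r_old[i][j] + (1.0 - alfa) * s[i] * s[j].conjugate()
         for j in range(n)]
        for i in range(n)
    ]


# Two bins in the sub-band, starting from an all-zero covariance matrix.
r0 = [[0j, 0j], [0j, 0j]]
s_l = [1 + 0j, 0 + 1j]
r1 = update_covariance(r0, s_l, alfa=0.5)
```

A larger alfa gives more weight to the previous covariance matrix and a smaller alfa gives more weight to the current frame.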

Referring to FIG. 4, a voice enhancement apparatus according to an embodiment of the present invention collects a voice signal through dual microphone voice channels, and each voice channel separately performs voice enhancement processing. The apparatus includes: a first acquiring module 1, configured to acquire a frequency domain signal of a current voice signal; a dividing module 2, configured to divide the frequency domain signal into a plurality of sequentially arranged sub-bands according to a preset rule; a calculating module 3, configured to separately calculate a first beam output of each of the sub-bands according to the minimum variance distortionless response (MVDR) algorithm; and a second acquiring module 4, configured to acquire a second beam output of the frequency domain signal by calculating the mean value of the first beam outputs.

It can be understood by those skilled in the art that the apparatus in this embodiment and the method in the foregoing method embodiments are complementary to each other, and various details and descriptions described in the foregoing method embodiments are applicable to the apparatus in this embodiment. In order to avoid repetition, the embodiment of the device will not be described again.
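The averaging performed by the second acquiring module 4 can be sketched as an element-wise mean over the per-sub-band outputs. This is only an illustration under the assumption that every first beam output is a vector of the same length; the function name `second_beam_output` is hypothetical.

```python
def second_beam_output(first_outputs):
    """Element-wise mean of the per-sub-band first beam outputs.

    first_outputs: list of equally sized vectors (lists of complex
    numbers), one vector per sub-band. Equal length is an assumption
    of this sketch.
    """
    if not first_outputs:
        raise ValueError("at least one sub-band output is required")
    n = len(first_outputs[0])
    if any(len(v) != n for v in first_outputs):
        raise ValueError("all sub-band outputs must have the same length")
    count = len(first_outputs)
    return [sum(v[k] for v in first_outputs) / count for k in range(n)]


# Two sub-bands, each with a two-element output vector.
mean_output = second_beam_output([[1 + 0j, 2 + 0j], [3 + 0j, 4 + 0j]])
```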

Referring to FIG. 5, the foregoing dividing module 2 includes:

The distinguishing sub-module 200 is configured to distinguish the sensitive frequency band in the frequency domain signal, wherein the sensitive frequency band is the first frequency band, and the frequency band other than the sensitive frequency band is the second frequency band. The first dividing sub-module 201 is configured to evenly divide the first frequency band into a plurality of first sub-bands and evenly divide the second frequency band into a plurality of second sub-bands, wherein the bandwidth of each second sub-band is greater than the bandwidth of each first sub-band.
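A minimal sketch of this non-uniform division is given below, working on FFT bin indices. All names (`split_evenly`, `divide_bands`) and the numeric values in the example are hypothetical; the only property taken from the description is that the sensitive band is cut into narrower sub-bands than the rest of the spectrum. If a band length is not a multiple of the sub-band width, the last sub-band in that band is shorter.

```python
def split_evenly(lo, hi, width):
    """Split bin indices [lo, hi) into consecutive sub-bands of `width` bins."""
    return [(start, min(start + width, hi)) for start in range(lo, hi, width)]


def divide_bands(n_bins, sens_lo, sens_hi, narrow, wide):
    """Divide n_bins frequency points into sequentially arranged sub-bands.

    [sens_lo, sens_hi) is the sensitive (first) frequency band and is
    split into sub-bands of `narrow` bins; the remaining (second) band
    is split into sub-bands of `wide` bins, with wide > narrow.
    """
    if not 0 <= sens_lo < sens_hi <= n_bins:
        raise ValueError("sensitive band must lie inside [0, n_bins]")
    if not 0 < narrow < wide:
        raise ValueError("second sub-bands must be wider than first sub-bands")
    bands = split_evenly(0, sens_lo, wide)            # below the sensitive band
    bands += split_evenly(sens_lo, sens_hi, narrow)   # sensitive band
    bands += split_evenly(sens_hi, n_bins, wide)      # above the sensitive band
    return bands


bands = divide_bands(n_bins=32, sens_lo=8, sens_hi=16, narrow=2, wide=8)
```

In this example the sensitive band of bins 8 to 15 is cut into four sub-bands of 2 bins, while the remaining bins are grouped into sub-bands of 8 bins.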

Referring to FIG. 6, the foregoing calculation module 3 includes:

The first acquiring sub-module 300 is configured to acquire a power ratio of two adjacent non-speech segments by performing voice activation detection in each of the sub-bands. The second acquiring sub-module 301 is configured to acquire, according to the power ratio, a smoothing factor corresponding to the non-speech segment. The first obtaining sub-module 302 is configured to obtain a covariance matrix of the frequency band features in each of the sub-bands according to the smoothing factor. The second obtaining sub-module 303 is configured to perform eigendecomposition according to the covariance matrix to obtain an output weight vector of each sub-band, that is, the first beam output.

Referring to FIG. 7, the first acquiring module 1 includes:

The third obtaining sub-module 100 is configured to acquire the first time domain signals of the current voice signal separately collected by the dual microphone voice channels. The input sub-module 101 is configured to input each first time domain signal to the band pass filter corresponding to its voice channel to obtain time domain signals of a specified frequency range. The conversion sub-module 102 is configured to convert the time domain signals of the specified frequency range into frequency domain signals of the specified frequency range of the current voice signal by means of the Fourier transforms respectively associated with the dual microphone voice channels.

Referring to FIG. 8, a voice enhancement apparatus according to another embodiment of the present invention includes: a conversion module 5, configured to convert the frequency domain signal into an output time domain signal by inputting the second beam output of the frequency domain signal to the inverse Fourier transformers respectively associated with the dual microphone voice channels; and an output module 6, configured to output the corresponding output time domain signals through the dual microphone voice channels respectively.

Referring to FIG. 9, in a voice enhancement apparatus according to another embodiment of the present invention, the voice signal is first preprocessed by the voice channels to reduce the amount of frequency domain processing. The front end of the dividing module 2 is connected with: a selection module 20, configured to select the Fourier transform mode of the specified frequency points according to the calculation level of the frequency domain processing platform; and an obtaining module 21, configured to obtain, after the first time domain signals of the current voice signal separately collected by the dual microphone voice channels are preprocessed, the frequency domain signals corresponding to the first time domain signals by means of the Fourier transform of the specified frequency points.

Referring to FIG. 10, the dividing module 2 of this embodiment includes: a third acquiring sub-module 202, configured to acquire the total number of frequency points of the frequency domain signal corresponding to the first time domain signal obtained by the Fourier transform mode of the specified frequency points; and a second dividing sub-module 203, configured to evenly divide the frequency domain signal into a plurality of sequentially arranged sub-bands according to the total number of frequency points.
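The uniform division carried out by the second dividing sub-module 203 can be sketched as follows. The function name `divide_uniform` and the example sizes are hypothetical, and the handling of a remainder (spreading the extra frequency points over the first sub-bands) is an assumption of this sketch, since the source does not describe that case.

```python
def divide_uniform(n_bins, n_subbands):
    """Evenly divide n_bins frequency points into n_subbands sub-bands.

    Returns a list of (start, stop) bin index pairs, stop exclusive.
    Assumption: when n_bins is not a multiple of n_subbands, the first
    few sub-bands receive one extra frequency point each.
    """
    if not 0 < n_subbands <= n_bins:
        raise ValueError("need 0 < n_subbands <= n_bins")
    base, extra = divmod(n_bins, n_subbands)
    bands = []
    start = 0
    for k in range(n_subbands):
        width = base + (1 if k < extra else 0)
        bands.append((start, start + width))
        start += width
    return bands


uniform_bands = divide_uniform(128, 4)
```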

Referring to FIG. 11, a dividing module 2 according to another embodiment of the present invention includes: a first calculating sub-module 204, configured to separately calculate the band center frequency corresponding to each first sub-band and each second sub-band; a second calculating sub-module 205, configured to calculate, according to the band center frequency, the direction vector corresponding to each first sub-band and each second sub-band; an obtaining sub-module 206, configured to obtain, according to the direction vector, the covariance matrix of the frequency band features corresponding to each first sub-band and each second sub-band, and the optimal weight coefficient corresponding to the inverse matrix of the covariance matrix; and a third calculating sub-module 207, configured to calculate, according to the optimal weight coefficient, the first signal output corresponding to each first sub-band and each second sub-band.
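The chain from center frequency to direction vector to optimal weight can be sketched with the textbook MVDR solution w = R^-1 a / (a^H R^-1 a). The sketch below is an assumption-laden illustration, not the patented implementation: it assumes a two-microphone array so that R is a 2 x 2 matrix, a far-field plane-wave direction vector, and hypothetical values for microphone spacing, arrival angle, and speed of sound. All function names are illustrative.

```python
import cmath
import math


def steering_vector(center_hz, spacing_m, angle_rad, c=343.0):
    """Direction vector of a two-microphone array at the band center frequency."""
    delay = spacing_m * math.cos(angle_rad) / c  # inter-microphone delay in seconds
    return [1 + 0j, cmath.exp(-2j * math.pi * center_hz * delay)]


def mvdr_weights(r, a):
    """MVDR weights w = R^-1 a / (a^H R^-1 a) for a 2 x 2 covariance matrix r."""
    det = r[0][0] * r[1][1] - r[0][1] * r[1][0]
    if abs(det) < 1e-12:
        raise ValueError("covariance matrix is singular")
    inv = [[r[1][1] / det, -r[0][1] / det],
           [-r[1][0] / det, r[0][0] / det]]
    ria = [inv[0][0] * a[0] + inv[0][1] * a[1],
           inv[1][0] * a[0] + inv[1][1] * a[1]]          # R^-1 a
    denom = a[0].conjugate() * ria[0] + a[1].conjugate() * ria[1]  # a^H R^-1 a
    return [x / denom for x in ria]


def beam_output(w, x):
    """Beamformer output y = w^H x for one frequency point."""
    return sum(wi.conjugate() * xi for wi, xi in zip(w, x))


# Hypothetical sub-band: 1 kHz center frequency, 2 cm spacing, end-fire arrival.
a = steering_vector(1000.0, 0.02, 0.0)
r = [[2 + 0j, 0.5 + 0j], [0.5 + 0j, 1 + 0j]]
w = mvdr_weights(r, a)
```

By construction, the weights pass a signal arriving from the look direction without distortion, that is, w^H a equals 1 up to rounding error, while minimizing the output power contributed by other directions.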

Further, the dividing module 2 includes: a receiving sub-module 208, configured to receive, according to the time sequence of the received voice signals, a second time domain signal having the minimum time difference from the first time domain signal; and a third obtaining sub-module 209, configured to subject the second time domain signal to the same processing as the first time domain signal to obtain a second signal output corresponding to the second time domain signal.

Referring to FIG. 12, in a speech enhancement method according to another embodiment of the present invention, the process of separately calculating the first beam output of each of the sub-bands according to the minimum variance distortionless response (MVDR) algorithm includes a noise processing system, which improves speech intensity through noise processing.

Referring to FIG. 13, the first acquiring sub-module 300 includes: a detecting unit 3001, configured to obtain, by performing voice activation detection on each sub-band during a non-speech period, a first power at a first time, a second power at a second time, and a third power at a third time of the current first non-speech segment, wherein the first time, the second time, and the third time are consecutive and arranged in reverse chronological order; an obtaining unit 3002, configured to obtain the current power variation corresponding to each of the sub-bands by calculating the ratio of the first power to the second power, and to obtain the corresponding previous power variation by calculating the ratio of the second power to the third power; and a first obtaining unit 3003, configured to obtain the power ratio of two adjacent non-speech segments by calculating a first ratio of the current power variation to the previous power variation.

Referring to FIG. 14, the second acquiring sub-module 301 of this embodiment includes:

The determining unit 3011 is configured to determine whether the first ratio is within a preset range. The selecting unit 3012 is configured to select, if the first ratio is within the preset range, the initialization smoothing factor as the smoothing factor of the current time.

Further, the second acquiring sub-module 301 further includes: a calculating unit 3013, configured to calculate a second ratio of the initialization smoothing factor to the first ratio if the first ratio is not within the preset range; and a setting unit 3014, configured to set the second ratio as the smoothing factor of the current time.

Referring to FIG. 15, the first obtaining submodule 302 of this embodiment includes:

The second obtaining unit 3021 is configured to acquire the frequency point vector from the lower boundary to the upper boundary of the sub-band at the current time. The updating unit 3022 is configured to update the covariance matrix of the sub-band according to the smoothing factor of the current time and the frequency point vector.

The present application also provides a voice enhancement device including a memory, a processor, and an application, wherein the application is stored in the memory and configured to be executed by the processor, and the application is configured to perform the speech enhancement method of any of the above embodiments.

Those skilled in the art will appreciate that the speech enhancement device of the present invention and the apparatus described above are used to perform one or more of the methods described in the present application. These devices may be specially designed and manufactured for the required purposes, or may include known devices in a general-purpose computer. These devices have computer programs or applications stored therein that are selectively activated or reconfigured. Such computer programs may be stored in a device-readable (e.g., computer-readable) medium or in any type of medium suitable for storing electronic instructions and coupled to a bus, including but not limited to any type of disk (including floppy disks, hard disks, optical disks, CD-ROMs, and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, magnetic cards, or optical cards. That is, a readable medium includes any medium in which information is stored or transmitted by a device (e.g., a computer) in a readable form.

Claims

A voice enhancement method, characterized in that a voice signal is collected through dual microphone voice channels and each voice channel separately performs voice enhancement processing, the method including:

Obtaining a frequency domain signal of the current voice signal;

Dividing the frequency domain signal into a plurality of sub-bands arranged in sequence according to a preset rule;

Separately calculating a first beam output of each of the sub-bands according to a minimum variance distortionless response (MVDR) algorithm;

Obtaining a second beam output of the frequency domain signal by calculating the mean value of the first beam outputs.
The method for voice enhancement according to claim 1, wherein the step of dividing the frequency domain signal into a plurality of sequentially arranged sub-bands according to a preset rule comprises:

Distinguishing the sensitive frequency band in the frequency domain signal, wherein the sensitive frequency band is a first frequency band, and a frequency band other than the sensitive frequency band in the frequency domain signal is a second frequency band;

Evenly dividing the first frequency band into a plurality of first sub-bands, and evenly dividing the second frequency band into a plurality of second sub-bands, wherein the bandwidth of each second sub-band is greater than the bandwidth of each first sub-band.
The method for voice enhancement according to claim 2, wherein, after the step of evenly dividing the first frequency band into a plurality of first sub-bands and evenly dividing the second frequency band into a plurality of second sub-bands, the method includes:

Calculating, respectively, a frequency band center frequency corresponding to each of the first sub-band and each of the second sub-bands;

Calculating, according to the center frequency of the frequency band, a direction vector corresponding to each of the first sub-band and each of the second sub-bands;

Obtaining, according to the direction vector, a covariance matrix of a frequency band feature corresponding to each of the first subband and each of the second subbands, and an optimal weight coefficient corresponding to an inverse matrix of the covariance matrix;

And calculating, according to the optimal weight coefficient, a first signal output corresponding to each of the first sub-band and each of the second sub-bands.
The method for voice enhancement according to claim 1, wherein the step of separately calculating the first beam output of each of the sub-bands according to a minimum variance distortionless response (MVDR) algorithm comprises:

Performing voice activation detection in each of the sub-bands to obtain power ratios of two adjacent non-speech segments;

Obtaining a corresponding smoothing factor for removing the non-speech segment according to the power ratio;

Obtaining a covariance matrix of frequency band features in each of the sub-bands according to the smoothing factor;

Performing eigendecomposition according to the covariance matrix to obtain an output weight vector of each of the sub-bands.
The method of claim 1, wherein the step of acquiring a frequency domain signal of a current voice signal comprises:

Obtaining the first time domain signals of the current voice signal separately collected by the dual microphone voice channels;

Inputting each first time domain signal to the band pass filter corresponding to its voice channel to obtain time domain signals of a specified frequency range;

Converting the time domain signals of the specified frequency range into frequency domain signals of the specified frequency range of the current voice signal by means of the Fourier transforms respectively associated with the dual microphone voice channels.
The method of claim 5, wherein the step of obtaining a second beam output of the frequency domain signal by calculating the mean value of the first beam outputs comprises:

Converting the frequency domain signal into an output time domain signal by inputting the second beam output of the frequency domain signal into the inverse Fourier transformers respectively associated with the dual microphone voice channels;

Outputting the corresponding output time domain signals through the dual microphone voice channels respectively.
The method for voice enhancement according to claim 1, wherein the step of dividing the frequency domain signal into a plurality of sequentially arranged sub-bands according to a preset rule comprises:

Selecting a Fourier transform method of the specified frequency point according to the calculation level of the frequency domain processing platform;

After the first time domain signals of the current voice signal separately collected by the dual microphone voice channels are preprocessed, obtaining the frequency domain signals corresponding to the first time domain signals by means of the Fourier transform of the specified frequency point.
The method for voice enhancement according to claim 7, wherein the step of dividing the frequency domain signal into a plurality of sequentially arranged sub-bands according to a preset rule comprises:

Obtaining a total amount of frequency points of the frequency domain signal corresponding to the first time domain signal obtained by the Fourier transform method of the specified frequency point;

The frequency domain signal is evenly divided into a plurality of sequentially arranged sub-bands according to the total number of frequency points.
A voice enhancement device, characterized in that a voice signal is collected through dual microphone voice channels and each voice channel separately performs voice enhancement processing, the device including:

a first acquiring module, configured to acquire a frequency domain signal of a current voice signal;

a dividing module, configured to divide the frequency domain signal into a plurality of sequentially arranged sub-bands according to a preset rule;

a calculating module, configured to separately calculate a first beam output of each of the sub-bands according to a minimum variance distortionless response (MVDR) algorithm;

and a second acquiring module, configured to acquire a second beam output of the frequency domain signal by calculating the mean value of the first beam outputs.
The apparatus for voice enhancement according to claim 9, wherein the dividing module comprises:

a distinguishing submodule, configured to distinguish a sensitive frequency band in the frequency domain signal, wherein the sensitive frequency band is a first frequency band, and a frequency band other than the sensitive frequency band in the frequency domain signal is a second frequency band;

a first dividing submodule, configured to divide the first frequency band into a plurality of first sub-bands, and divide the second frequency band into a plurality of second sub-bands, wherein the bandwidth of each second sub-band is greater than the bandwidth of each first sub-band.
The apparatus for voice enhancement according to claim 10, wherein the dividing module comprises:

a first calculation sub-module, configured to separately calculate a frequency band center frequency corresponding to each of the first sub-band and each of the second sub-bands;

a second calculation sub-module, configured to calculate, according to the band center frequency, a direction vector corresponding to each of the first sub-bands and each of the second sub-bands;

an obtaining sub-module, configured to obtain, according to the direction vector, a covariance matrix of the frequency band features corresponding to each of the first sub-bands and each of the second sub-bands, and an optimal weight coefficient corresponding to an inverse matrix of the covariance matrix;

And a third calculating submodule, configured to calculate, according to the optimal weight coefficient, a first signal output corresponding to each of the first subband and each of the second subbands.
The apparatus for voice enhancement according to claim 9, wherein the calculation module comprises:

a first acquiring submodule, configured to obtain a power ratio of two adjacent non-speech segments by using voice activation detection in each of the sub-bands;

a second acquiring submodule, configured to acquire, according to the power ratio, a smoothing factor for removing the non-speech segment;

a first obtaining submodule, configured to obtain, according to the smoothing factor, a covariance matrix of frequency band features in each of the subbands;

And a second obtaining submodule, configured to perform eigen decomposition according to the covariance matrix to obtain an output weight vector of each of the subbands.
The apparatus for voice enhancement according to claim 9, wherein the first acquiring module comprises:

a third acquiring submodule, configured to acquire the first time domain signals of the current voice signal separately collected by the dual microphone voice channels;

an input submodule, configured to input each first time domain signal to the band pass filter corresponding to its voice channel to obtain time domain signals of a specified frequency range;

And a conversion submodule, configured to respectively convert the time domain signals of the specified frequency range into a frequency domain signal of the specified frequency range of the current voice signal by using a Fourier transform respectively associated with the dual microphone voice channel.
The apparatus for voice enhancement according to claim 13, comprising:

a conversion module, configured to convert the frequency domain signal into an output time domain signal by inputting the second beam output of the frequency domain signal into the inverse Fourier transformers respectively associated with the dual microphone voice channels;

And an output module, configured to respectively output the corresponding output time domain signals through the dual microphone voice channels.
The apparatus for voice enhancement according to claim 9, comprising:

a selection module, configured to select a Fourier transform mode of the specified frequency point according to a calculation level of the frequency domain processing platform;

an obtaining module, configured to obtain, after the first time domain signals of the current voice signal separately collected by the dual microphone voice channels are preprocessed, the frequency domain signals corresponding to the first time domain signals by means of the Fourier transform of the specified frequency point.
The apparatus for voice enhancement according to claim 9, wherein the dividing module comprises:

a third obtaining submodule, configured to obtain a total amount of frequency points of the frequency domain signal corresponding to the first time domain signal obtained by the Fourier transform method of the specified frequency point;

a second dividing submodule, configured to evenly divide the frequency domain signal into a plurality of sequentially arranged sub-bands according to the total number of frequency points.
A speech enhancement device comprising a memory, a processor and an application, the application being stored in the memory and configured to be executed by the processor, wherein the application is configured to perform the speech enhancement method according to any one of claims 1 to 8.