WO2019119593A1 - Speech enhancement method and apparatus (语音增强方法及装置)

Speech enhancement method and apparatus

Info

Publication number
WO2019119593A1
Authority
WO
WIPO (PCT)
Prior art keywords
power spectrum
noise
spectral subtraction
user
cluster
Prior art date
Application number
PCT/CN2018/073281
Other languages
English (en)
French (fr)
Inventor
胡伟湘
苗磊
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to US16/645,677 (US11164591B2)
Priority to CN201880067882.XA (CN111226277B)
Publication of WO2019119593A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L21/0232Processing in the frequency domain

Definitions

  • the present application relates to the field of voice processing technologies, and in particular, to a voice enhancement method and apparatus.
  • VoIP: Voice over Internet Protocol
  • the user's speech signal may be masked by noise in the environment (such as a street, a restaurant, or a waiting hall), lowering its intelligibility. Therefore, how to eliminate the noise in the sound signal collected by the microphone is a problem that urgently needs to be solved.
  • FIG. 1 is a schematic flow chart of a conventional spectral subtraction method.
  • a sound signal collected by a microphone is divided into a noisy speech signal and a noise signal by voice activity detection (VAD).
  • the noisy speech signal is transformed by a Fast Fourier Transform (FFT) to obtain amplitude information and phase information (the amplitude information is processed by power spectrum estimation to obtain the power spectrum of the noisy speech signal), and the noise signal is processed by noise power spectrum estimation to obtain the power spectrum of the noise signal.
  • the spectral subtraction parameters are obtained by a spectral subtraction parameter calculation process; the spectral subtraction parameters include, but are not limited to, at least one of the following: an over-subtraction factor α (α > 1) and a spectral floor β (0 < β < 1).
  • the amplitude information of the noisy speech signal is subjected to spectral subtraction processing to obtain a denoised speech signal.
  • an inverse Fast Fourier Transform (IFFT) and overlap-add processing are performed on the denoised speech signal together with the phase information of the noisy speech signal to obtain an enhanced speech signal.
  • the power spectrum is subtracted directly, which makes the denoised speech signal prone to "musical noise", directly affecting the intelligibility and naturalness of the speech signal.
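  The conventional pipeline described above can be sketched as follows. This is a minimal illustration of power spectral subtraction only; the function name and the default values of the over-subtraction factor `alpha` and spectral floor `beta` are illustrative, not taken from the application.

```python
import numpy as np

def spectral_subtraction(noisy_power, noise_power, alpha=2.0, beta=0.01):
    """Power spectral subtraction: subtract the scaled noise power spectrum,
    then floor residual bins at a fraction of the noise power so that
    over-subtracted bins do not go negative (alpha, beta are illustrative)."""
    clean_power = noisy_power - alpha * noise_power
    floor = beta * noise_power
    return np.maximum(clean_power, floor)
```

  Flooring at `beta * noise_power` rather than at zero is what limits the isolated residual peaks perceived as musical noise.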
  • the embodiment of the present application provides a voice enhancement method and apparatus that adapt the spectral subtraction parameters according to the user's voice power spectrum characteristics and/or the power spectrum characteristics of the ambient noise around the user, thereby improving the intelligibility and naturalness of the denoised voice signal and improving noise reduction performance.
  • an embodiment of the present application provides a voice enhancement method, including:
  • determining a first spectral subtraction parameter according to a power spectrum of a noisy speech signal and a power spectrum of a noise signal, where the noisy speech signal and the noise signal are obtained by dividing the sound signal collected by a microphone;
  • determining a second spectral subtraction parameter according to the first spectral subtraction parameter and a reference power spectrum, where the reference power spectrum includes a user speech predicted power spectrum and/or an ambient noise predicted power spectrum; and
  • the noisy speech signal is spectrally subtracted according to the power spectrum of the noise signal and the second spectral subtraction parameter.
  • In this way, the first spectral subtraction parameter is determined according to the power spectrum of the noisy speech signal and the power spectrum of the noise signal; the second spectral subtraction parameter is then determined according to the first spectral subtraction parameter and the reference power spectrum, and spectral subtraction processing is performed on the noisy speech signal according to the power spectrum of the noise signal and the second spectral subtraction parameter, where the reference power spectrum includes a user speech predicted power spectrum and/or an ambient noise predicted power spectrum.
  • Because the first spectral subtraction parameter is optimized to obtain the second spectral subtraction parameter, performing spectral subtraction on the noisy speech signal according to the optimized second spectral subtraction parameter not only applies over a wide signal-to-noise-ratio range, but also improves the intelligibility and naturalness of the denoised speech signal, improving noise reduction performance.
  • the reference power spectrum includes: the user voice prediction power spectrum
  • determining the second spectral subtraction parameter according to the first spectral subtraction parameter and the reference power spectrum includes:
  • determining the second spectral subtraction parameter according to a first spectral subtraction function F1(x, y), where x represents the first spectral subtraction parameter, y represents the user speech predicted power spectrum, the value of F1(x, y) is positively correlated with x, and the value of F1(x, y) is negatively correlated with y.
  • In this way, the second spectral subtraction parameter is obtained by optimizing the first spectral subtraction parameter in view of the regularity of the user voice power spectrum characteristics on the terminal device, so that performing spectral subtraction processing on the noisy speech signal according to the second spectral subtraction parameter protects the user voice on the terminal device and improves the intelligibility and naturalness of the denoised speech signal.
  • the reference power spectrum includes: the ambient noise prediction power spectrum
  • determining the second spectral subtraction parameter according to the first spectral subtraction parameter and the reference power spectrum includes:
  • determining the second spectral subtraction parameter according to a second spectral subtraction function F2(x, z), where x represents the first spectral subtraction parameter, z represents the ambient noise predicted power spectrum, the value of F2(x, z) is positively correlated with x, and the value of F2(x, z) is positively correlated with z.
  • In this way, the second spectral subtraction parameter is obtained by optimizing the first spectral subtraction parameter in view of the regularity of the noise power spectrum characteristics of the environment in which the user is located, so that performing spectral subtraction processing on the noisy speech signal according to the second spectral subtraction parameter removes the noise signal in the noisy speech signal more accurately and improves the intelligibility and naturalness of the denoised speech signal.
  • the reference power spectrum includes: a user voice prediction power spectrum and an ambient noise prediction power spectrum
  • determining the second spectral subtraction parameter according to the first spectral subtraction parameter and the reference power spectrum includes:
  • determining the second spectral subtraction parameter according to a third spectral subtraction function F3(x, y, z), where x represents the first spectral subtraction parameter, y represents the user speech predicted power spectrum, z represents the ambient noise predicted power spectrum, the value of F3(x, y, z) is positively correlated with x, negatively correlated with y, and positively correlated with z.
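  The application constrains only the monotonicity of F1, F2, and F3 (increasing in x and z, decreasing in y); it gives no concrete formulas. The sketch below is one hypothetical family satisfying those constraints for non-negative inputs; the functional form and the weights `c1`, `c2` are assumptions for illustration.

```python
def f3(x, y, z, c1=1.0, c2=1.0):
    """Hypothetical F3(x, y, z): increasing in x and z, decreasing in y,
    for non-negative x, y, z. c1 and c2 are assumed weighting constants."""
    return x * (1.0 + c2 * z) / (1.0 + c1 * y)

def f1(x, y):
    """Hypothetical F1(x, y): F3 with the ambient-noise term dropped."""
    return f3(x, y, 0.0)

def f2(x, z):
    """Hypothetical F2(x, z): F3 with the user-speech term dropped."""
    return f3(x, 0.0, z)
```

  A larger predicted user-speech power (y) lowers the parameter to protect speech; a larger predicted noise power (z) raises it to subtract noise more aggressively, matching the stated correlations.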
  • In this way, the second spectral subtraction parameter is obtained by optimizing the first spectral subtraction parameter in view of both the user voice power spectrum characteristics of the terminal device and the regularity of the ambient noise power spectrum characteristics of the user's environment, so that performing spectral subtraction processing on the noisy speech signal according to the second spectral subtraction parameter not only protects the user voice on the terminal device but also removes the noise signal in the noisy speech signal more accurately, improving the intelligibility and naturalness of the denoised speech signal.
  • before determining the second spectral subtraction parameter according to the first spectral subtraction parameter and the reference power spectrum, the method further includes:
  • determining a target user power spectrum cluster according to the power spectrum of the noisy speech signal and a user power spectrum distribution class, where the user power spectrum distribution class includes at least one user historical power spectrum cluster, and the target user power spectrum cluster is the cluster in the at least one user historical power spectrum cluster that is closest to the power spectrum of the noisy speech signal; and
  • the user voice prediction power spectrum is determined according to the power spectrum of the noisy speech signal and the target user power spectrum cluster.
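  A minimal sketch of the cluster-selection step above. The distance metric is not specified in this excerpt, so Euclidean distance between the frame's power spectrum and each cluster centroid is assumed here; the function name is illustrative.

```python
import numpy as np

def nearest_cluster(power_spectrum, centroids):
    """Return the index of the historical power-spectrum cluster whose
    centroid is closest to the current power spectrum.
    Euclidean distance is an assumption; the text only says 'closest'."""
    ps = np.asarray(power_spectrum, dtype=float)
    dists = [np.linalg.norm(ps - np.asarray(c, dtype=float)) for c in centroids]
    return int(np.argmin(dists))
```

  The same selection applies symmetrically to the noise side: replace the noisy-speech power spectrum with the noise power spectrum and the user clusters with the noise history clusters.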
  • In this way, the target user power spectrum cluster is determined according to the power spectrum of the noisy speech signal and the user power spectrum distribution class; the user speech predicted power spectrum is then determined according to the power spectrum of the noisy speech signal and the target user power spectrum cluster, so that the first spectral subtraction parameter can be further optimized according to the user speech predicted power spectrum to obtain the second spectral subtraction parameter, and spectral subtraction processing can be performed on the noisy speech signal according to the optimized second spectral subtraction parameter, protecting the user voice on the terminal device and improving the intelligibility and naturalness of the denoised speech signal.
  • before determining the second spectral subtraction parameter according to the first spectral subtraction parameter and the reference power spectrum, the method further includes:
  • determining a target noise power spectrum cluster according to the power spectrum of the noise signal and a noise power spectrum distribution class, where the noise power spectrum distribution class includes at least one noise historical power spectrum cluster, and the target noise power spectrum cluster is the cluster in the at least one noise historical power spectrum cluster that is closest to the power spectrum of the noise signal; and
  • the ambient noise predicted power spectrum is determined according to the power spectrum of the noise signal and the target noise power spectrum cluster.
  • In this way, the target noise power spectrum cluster is determined according to the power spectrum of the noise signal and the noise power spectrum distribution class; the ambient noise predicted power spectrum is then determined according to the power spectrum of the noise signal and the target noise power spectrum cluster, so that the first spectral subtraction parameter can be further optimized according to the ambient noise predicted power spectrum to obtain the second spectral subtraction parameter, and spectral subtraction processing can be performed on the noisy speech signal according to the optimized second spectral subtraction parameter; thus the noise signal in the noisy speech signal can be removed more accurately, improving the intelligibility and naturalness of the denoised speech signal.
  • before determining the second spectral subtraction parameter according to the first spectral subtraction parameter and the reference power spectrum, the method further includes:
  • determining a target user power spectrum cluster according to the power spectrum of the noisy speech signal and the user power spectrum distribution class, and determining a target noise power spectrum cluster according to the power spectrum of the noise signal and the noise power spectrum distribution class; where the user power spectrum distribution class includes at least one user historical power spectrum cluster, the target user power spectrum cluster is the cluster in the at least one user historical power spectrum cluster that is closest to the power spectrum of the noisy speech signal, the noise power spectrum distribution class includes at least one noise historical power spectrum cluster, and the target noise power spectrum cluster is the cluster in the at least one noise historical power spectrum cluster that is closest to the power spectrum of the noise signal;
  • determining the user speech predicted power spectrum according to the power spectrum of the noisy speech signal and the target user power spectrum cluster; and
  • the ambient noise predicted power spectrum is determined according to the power spectrum of the noise signal and the target noise power spectrum cluster.
  • In this way, the target user power spectrum cluster is determined according to the power spectrum of the noisy speech signal and the user power spectrum distribution class, and the target noise power spectrum cluster is determined according to the power spectrum of the noise signal and the noise power spectrum distribution class; the user speech predicted power spectrum is then determined according to the power spectrum of the noisy speech signal and the target user power spectrum cluster, and the ambient noise predicted power spectrum is determined according to the power spectrum of the noise signal and the target noise power spectrum cluster, so that the first spectral subtraction parameter can be further optimized according to the user speech predicted power spectrum and the ambient noise predicted power spectrum to obtain the second spectral subtraction parameter, and spectral subtraction processing can be performed on the noisy speech signal according to the optimized second spectral subtraction parameter; thus the user voice on the terminal device is protected, the noise signal in the noisy speech signal can be removed more accurately, and the intelligibility and naturalness of the denoised speech signal are improved.
  • determining the user speech predicted power spectrum according to the power spectrum of the noisy speech signal and the target user power spectrum cluster includes:
  • the ambient noise prediction power spectrum is determined according to the power spectrum of the noise signal and the target noise power spectrum cluster, including:
  • NP represents the power spectrum of the noise signal, NPT represents the target noise power spectrum cluster, and F5(NP, NPT) = b * NP + (1 - b) * NPT, where b represents the second estimation coefficient.
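  The F5 blend above can be sketched directly as a per-bin convex combination; the value chosen for the second estimation coefficient `b` is illustrative.

```python
import numpy as np

def predicted_noise_power(np_frame, npt_centroid, b=0.7):
    """F5(NP, NPT) = b*NP + (1-b)*NPT: blend the instantaneous noise
    power spectrum with the target noise cluster centroid.
    b=0.7 is an illustrative value, not taken from the application."""
    return b * np.asarray(np_frame, dtype=float) + (1.0 - b) * np.asarray(npt_centroid, dtype=float)
```

  With b close to 1 the prediction tracks the current frame; with b close to 0 it relies on the learned historical cluster, which helps when the instantaneous estimate is unreliable.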
  • before determining the target user power spectrum cluster according to the power spectrum of the noisy speech signal and the user power spectrum distribution class, the method further includes:
  • acquiring the user power spectrum distribution class. In this way, the user power spectrum distribution class is dynamically adjusted according to the denoised speech signal, so that the user speech predicted power spectrum can be determined more accurately and the first spectral subtraction parameter can be further optimized according to the user speech predicted power spectrum.
  • before determining the target noise power spectrum cluster according to the power spectrum of the noise signal and the noise power spectrum distribution class, the method further includes:
  • acquiring the noise power spectrum distribution class. In this way, the noise power spectrum distribution class is dynamically adjusted according to the power spectrum of the noise signal, so that the ambient noise predicted power spectrum can be determined more accurately, the first spectral subtraction parameter is further optimized according to the ambient noise predicted power spectrum to obtain the second spectral subtraction parameter, and spectral subtraction processing is performed on the noisy speech signal according to the optimized second spectral subtraction parameter; thus the noise signal in the noisy speech signal can be removed more accurately, improving noise reduction performance.
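  The excerpt states that the distribution classes are adjusted dynamically but does not give the update rule. One plausible realization is an exponential moving-average update of the matched cluster centroid, sketched below; both the rule and the `rate` value are assumptions.

```python
import numpy as np

def update_cluster(centroid, new_power_spectrum, rate=0.1):
    """Exponential moving-average update of the matched cluster centroid.
    The update rule and rate are assumptions; the excerpt only states
    that the distribution class is adjusted dynamically."""
    return (1.0 - rate) * np.asarray(centroid, dtype=float) + rate * np.asarray(new_power_spectrum, dtype=float)
```

  A small rate keeps clusters stable against outlier frames while still tracking slow changes in the user's voice or the noise environment.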
  • the embodiment of the present application provides a voice enhancement apparatus, including:
  • a first determining module configured to determine a first spectral subtraction parameter according to the power spectrum of the noisy speech signal and the power spectrum of the noise signal, where the noisy speech signal and the noise signal are obtained by dividing the sound signal collected by the microphone;
  • a second determining module configured to determine a second spectral subtraction parameter according to the first spectral subtraction parameter and the reference power spectrum; wherein the reference power spectrum comprises: a user speech prediction power spectrum and/or an environmental noise prediction power spectrum;
  • the spectral subtraction module is configured to perform spectral subtraction processing on the noisy speech signal according to the power spectrum of the noise signal and the second spectral subtraction parameter.
  • the second determining module is specifically configured to:
  • determine the second spectral subtraction parameter according to a first spectral subtraction function F1(x, y), where x represents the first spectral subtraction parameter, y represents the user speech predicted power spectrum, the value of F1(x, y) is positively correlated with x, and the value of F1(x, y) is negatively correlated with y.
  • the second determining module is specifically configured to:
  • determine the second spectral subtraction parameter according to a second spectral subtraction function F2(x, z), where x represents the first spectral subtraction parameter, z represents the ambient noise predicted power spectrum, the value of F2(x, z) is positively correlated with x, and the value of F2(x, z) is positively correlated with z.
  • the second determining module is specifically configured to:
  • determine the second spectral subtraction parameter according to a third spectral subtraction function F3(x, y, z), where x represents the first spectral subtraction parameter, y represents the user speech predicted power spectrum, z represents the ambient noise predicted power spectrum, the value of F3(x, y, z) is positively correlated with x, negatively correlated with y, and positively correlated with z.
  • the device further includes:
  • a third determining module configured to determine a target user power spectrum cluster according to the power spectrum of the noisy speech signal and a user power spectrum distribution class, where the user power spectrum distribution class includes at least one user historical power spectrum cluster, and the target user power spectrum cluster is the cluster in the at least one user historical power spectrum cluster that is closest to the power spectrum of the noisy speech signal; and
  • a fourth determining module configured to determine a user voice prediction power spectrum according to the power spectrum of the noisy speech signal and the target user power spectrum cluster.
  • the device further includes:
  • a fifth determining module configured to determine a target noise power spectrum cluster according to the power spectrum of the noise signal and a noise power spectrum distribution class, where the noise power spectrum distribution class includes at least one noise historical power spectrum cluster, and the target noise power spectrum cluster is the cluster in the at least one noise historical power spectrum cluster that is closest to the power spectrum of the noise signal; and
  • a sixth determining module configured to determine an ambient noise predicted power spectrum according to the power spectrum of the noise signal and the target noise power spectrum cluster.
  • the device further includes:
  • a third determining module configured to determine a target user power spectrum cluster according to a power spectrum of the noisy speech signal and a user power spectrum distribution class
  • a fifth determining module configured to determine a target noise power spectrum cluster according to the power spectrum of the noise signal and a noise power spectrum distribution class; where the user power spectrum distribution class includes at least one user historical power spectrum cluster, the target user power spectrum cluster is the cluster in the at least one user historical power spectrum cluster that is closest to the power spectrum of the noisy speech signal, the noise power spectrum distribution class includes at least one noise historical power spectrum cluster, and the target noise power spectrum cluster is the cluster in the at least one noise historical power spectrum cluster that is closest to the power spectrum of the noise signal;
  • a fourth determining module configured to determine a user voice prediction power spectrum according to a power spectrum of the noisy speech signal and a target user power spectrum cluster
  • a sixth determining module configured to determine an ambient noise predicted power spectrum according to the power spectrum of the noise signal and the target noise power spectrum cluster.
  • the fourth determining module is specifically configured to:
  • the sixth determining module is specifically configured to:
  • NP represents the power spectrum of the noise signal, NPT represents the target noise power spectrum cluster, and F5(NP, NPT) = b * NP + (1 - b) * NPT, where b represents the second estimation coefficient.
  • the device further includes:
  • the first obtaining module is configured to acquire a user power spectrum distribution class.
  • the device further includes:
  • the second obtaining module is configured to obtain a noise power spectrum distribution class.
  • an embodiment of the present application provides a voice enhancement apparatus, including a processor and a memory;
  • a memory is used to store program instructions
  • a processor for invoking and executing program instructions stored in the memory to implement any of the methods described in the first aspect above.
  • an embodiment of the present application provides a program for performing the method of the above first aspect when executed by a processor.
  • an embodiment of the present application provides a computer program product comprising instructions that, when run on a computer, cause the computer to perform the method of the first aspect above.
  • an embodiment of the present application provides a computer readable storage medium, where the computer readable storage medium stores instructions that, when run on a computer, cause the computer to perform the method of the first aspect.
  • FIG. 1 is a schematic flowchart of a conventional spectral subtraction method;
  • FIG. 2A is a schematic diagram of an application scenario according to an embodiment of the present application;
  • FIG. 2B is a schematic structural diagram of a terminal device with microphones according to an embodiment of the present application;
  • FIG. 2C is a schematic diagram of the speech spectra of different users according to an embodiment of the present application;
  • FIG. 2D is a schematic flowchart of a voice enhancement method according to an embodiment of the present application;
  • FIG. 3A is a schematic flowchart of a voice enhancement method according to another embodiment of the present application;
  • FIG. 3B is a schematic diagram of a user power spectrum distribution class according to an embodiment of the present application;
  • FIG. 3C is a schematic diagram of the learning process of user voice power spectrum characteristics according to an embodiment of the present application;
  • FIG. 4A is a schematic flowchart of a voice enhancement method according to another embodiment of the present application;
  • FIG. 4B is a schematic diagram of a noise power spectrum distribution class according to an embodiment of the present application;
  • FIG. 4C is a schematic diagram of the learning process of noise power spectrum characteristics according to an embodiment of the present application;
  • FIG. 5 is a schematic flowchart of a voice enhancement method according to another embodiment of the present application;
  • FIG. 6A is a first schematic flowchart of a voice enhancement method according to another embodiment of the present application;
  • FIG. 6B is a second schematic flowchart of a voice enhancement method according to another embodiment of the present application;
  • FIG. 7A is a third schematic flowchart of a voice enhancement method according to another embodiment of the present application;
  • FIG. 7B is a fourth schematic flowchart of a voice enhancement method according to another embodiment of the present application;
  • FIG. 8A is a fifth schematic flowchart of a voice enhancement method according to another embodiment of the present application;
  • FIG. 8B is a sixth schematic flowchart of a voice enhancement method according to another embodiment of the present application;
  • FIG. 9A is a schematic structural diagram of a voice enhancement apparatus according to an embodiment of the present application;
  • FIG. 9B is a schematic structural diagram of a voice enhancement apparatus according to another embodiment of the present application;
  • FIG. 10 is a schematic structural diagram of a voice enhancement apparatus according to another embodiment of the present application;
  • FIG. 11 is a schematic structural diagram of a voice enhancement apparatus according to another embodiment of the present application.
  • FIG. 2A is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • the voice enhancement method provided by the embodiment of the present application may be performed in a terminal device; of course, the embodiment of the present application may also be applied to other scenarios, which is not limited in this embodiment of the present application.
  • the terminal device 1 and the terminal device 2 are shown in FIG. 2A; of course, another number of terminal devices may be included, which is not limited in this embodiment of the present application.
  • the apparatus for performing the voice enhancement method may be the terminal device itself, or may be an apparatus within the terminal device that performs the voice enhancement method.
  • the apparatus within the terminal device that performs the voice enhancement method may be a chip system, a circuit, or a module, which is not limited in this application.
  • the terminal device involved in the present application may include, but is not limited to, any one of the following: a device having a voice communication function, such as a mobile phone, a tablet computer, a personal digital assistant, and the like, and other devices having a voice communication function.
  • the terminal device involved in the present application may include a hardware layer, an operating system layer running on the hardware layer, and an application layer running on the operating system layer.
  • the hardware layer includes hardware such as a central processing unit (CPU), a memory management unit (MMU), and a memory (also referred to as main memory).
  • the operating system may be any one or more computer operating systems that implement business processing through a process, such as a Linux operating system, a Unix operating system, an Android operating system, an iOS operating system, or a Windows operating system.
  • the application layer includes applications such as browsers, contacts, word processing software, and instant messaging software.
  • the first spectral subtraction parameter involved in the embodiment of the present application may include, but is not limited to, at least one of the following: a first over-subtraction factor α (α > 1) and a first spectral floor β (0 < β < 1).
  • the second spectral subtraction parameter involved in the embodiment of the present application is a spectral subtraction parameter obtained by optimizing the first spectral subtraction parameter.
  • the second spectral subtraction parameter involved in the embodiment of the present application may include, but is not limited to, at least one of the following: a second over-subtraction factor α' (α' > 1) and a second spectral floor β' (0 < β' < 1).
  • Each power spectrum involved in the embodiments of the present application may be a power spectrum that does not consider sub-band division, or a power spectrum that considers sub-band division (also referred to as a sub-band power spectrum).
  • When sub-band division is considered, the power spectrum of the noisy speech signal may be referred to as the sub-band power spectrum of the noisy speech signal, the power spectrum of the noise signal may be referred to as the sub-band power spectrum of the noise signal, the user speech predicted power spectrum may be referred to as the user speech predicted sub-band power spectrum, and the ambient noise predicted power spectrum may be referred to as the ambient noise predicted sub-band power spectrum.
  • Similarly, the user power spectrum distribution class may be referred to as the user sub-band power spectrum distribution class, and a user historical power spectrum cluster may be referred to as a user historical sub-band power spectrum cluster.
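  A sub-band power spectrum as described above can be obtained by aggregating per-bin FFT powers into bands. The sketch below is a minimal illustration; the band-edge layout and function name are assumptions, not taken from the application.

```python
import numpy as np

def subband_power(fft_power, band_edges):
    """Aggregate a per-bin FFT power spectrum into sub-band powers.
    band_edges is a list of [start, end) bin-index pairs per band
    (the band layout here is purely illustrative)."""
    return np.array([fft_power[s:e].sum() for s, e in band_edges])
```

  Operating on a handful of sub-bands rather than hundreds of FFT bins reduces the dimensionality of the power spectra that are clustered and compared in later steps.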
  • Spectral subtraction is usually used to eliminate noise in the sound signal.
  • the sound signal collected by the microphone is divided into a noisy speech signal and a noise signal by VAD.
  • the noisy speech signal is obtained by FFT transform to obtain amplitude information and phase information (where the amplitude information is obtained by power spectrum estimation to obtain a power spectrum of the noisy speech signal), and the noise signal is estimated by the noise power spectrum to obtain a power spectrum of the noise signal.
  • the spectral subtraction parameter is obtained by the spectral subtraction parameter calculation process.
  • the amplitude information of the noisy speech signal is subjected to spectral subtraction processing to obtain a denoised speech signal.
  • the enhanced speech signal is obtained by performing IFFT conversion and superposition processing according to the denoised speech signal and the phase information of the noisy speech signal.
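  The reconstruction step above (recombining the denoised magnitude with the noisy-speech phase before the IFFT) can be sketched as follows; the use of a real FFT and the frame length handling are implementation assumptions.

```python
import numpy as np

def enhance_frame(denoised_mag, noisy_phase):
    """Recombine the denoised magnitude spectrum with the phase of the
    noisy speech signal and invert with an IFFT, as in the final step
    of FIG. 1 (real-FFT framing is an assumption)."""
    spectrum = denoised_mag * np.exp(1j * noisy_phase)
    return np.fft.irfft(spectrum, n=2 * (len(denoised_mag) - 1))
```

  Reusing the noisy phase is standard in spectral subtraction because the ear is far less sensitive to phase errors than to magnitude errors; the overlap-add step then stitches the enhanced frames back into a continuous signal.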
  • in conventional spectral subtraction, the power spectrum is directly subtracted, so the applicable signal-to-noise ratio range is narrow.
  • when the signal-to-noise ratio is low, the intelligibility of the speech is greatly damaged; on the other hand, the denoised speech signal is prone to "music noise", which directly affects the intelligibility and naturalness of the speech signal.
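The conventional pipeline described above (FFT, power spectrum estimation, over-subtraction with a spectral floor, IFFT) can be sketched for a single frame as follows; this is a minimal illustration under assumed parameter values (alpha, beta) and a hypothetical function name, not the patent's concrete implementation:

```python
import numpy as np

def spectral_subtract_frame(noisy_frame, noise_psd, alpha=4.0, beta=0.01):
    """One frame of classic over-subtraction spectral subtraction.
    alpha: over-subtraction factor; beta: spectral floor (illustrative values)."""
    spec = np.fft.rfft(noisy_frame)
    phase = np.angle(spec)
    psd = np.abs(spec) ** 2
    clean_psd = psd - alpha * noise_psd            # subtract scaled noise estimate
    clean_psd = np.maximum(clean_psd, beta * psd)  # floor to suppress "music noise"
    return np.fft.irfft(np.sqrt(clean_psd) * np.exp(1j * phase), n=len(noisy_frame))
```

In a full system this runs per overlapping frame, with overlap-add ("superposition") reconstructing the enhanced signal.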
  • The sound signal collected by the microphone in the embodiment of the present application may be collected by dual microphones in the terminal device.
  • FIG. 2B is a schematic structural diagram of a terminal device with microphones provided by an embodiment of the present application.
  • As shown in FIG. 2B, the sound signal may be collected by the first microphone and the second microphone; of course, it may also be collected by another number of microphones in the terminal device, which is not limited in the embodiment of the present application.
  • The position of each microphone in FIG. 2B is only exemplary; the microphones may be set in other locations of the terminal device, which is not limited in the embodiment of the present application.
  • FIG. 2C is a schematic diagram of voice spectrum of different users according to an embodiment of the present application.
  • the speech spectrum characteristics of different users (such as the speech spectra corresponding to the female voices AO and DJ and the male voices MH and MS in FIG. 2C) differ from each other.
  • the call scenes of a specific user have certain regularity (for example, the user is usually in a quiet indoor office from 8:00 to 17:00 and on a noisy subway from 17:10 to 19:00); therefore, the noise power spectrum characteristics of the environment in which a particular user is located also have certain regularity.
  • The voice enhancement method and apparatus of the embodiments of the present application take into account the regularity of the user voice power spectrum characteristics of the terminal device and/or the regularity of the environmental noise power spectrum characteristics of the user's environment, and optimize the first spectral subtraction parameter to obtain a second spectral subtraction parameter.
  • Performing spectral subtraction on the noisy speech signal according to the optimized second spectral subtraction parameter not only applies to a wide signal-to-noise ratio range, but also improves the intelligibility and naturalness of the denoised speech signal, improving noise reduction performance.
  • FIG. 2D is a schematic flowchart of a voice enhancement method according to an embodiment of the present application. As shown in FIG. 2D, the method in this embodiment of the present application may include:
  • Step S201 Determine a first spectral subtraction parameter according to a power spectrum of the noisy speech signal and a power spectrum of the noise signal.
  • Exemplarily, the first spectral subtraction parameter is determined according to the power spectrum of the noisy speech signal and the power spectrum of the noise signal, where the noisy speech signal and the noise signal are obtained by dividing the sound signal collected by the microphone.
  • the method for determining the first spectral subtraction parameter according to the power spectrum of the noisy speech signal and the power spectrum of the noise signal may refer to the spectral subtraction parameter calculation process in the prior art, and details are not described herein again.
  • the first spectral subtraction parameter may include a first over-subtraction factor α and/or a first spectral order β, and may of course include other parameters, which are not limited in the embodiment of the present application.
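As a sketch of one common prior-art calculation (not necessarily the one used here), the first over-subtraction factor can be made to decrease with the frame's a-posteriori SNR while the first spectral order stays small and fixed; all constants and the function name below are illustrative assumptions:

```python
import numpy as np

def first_subtraction_params(noisy_psd, noise_psd):
    """Berouti-style heuristic: the over-subtraction factor alpha decreases
    with the a-posteriori SNR; the spectral order beta is a small constant.
    All numeric constants here are illustrative assumptions."""
    snr_db = 10 * np.log10(np.sum(noisy_psd) / max(np.sum(noise_psd), 1e-12))
    alpha = float(np.clip(4.0 - 0.15 * snr_db, 1.0, 6.0))  # keep alpha in [1, 6]
    beta = 0.02
    return alpha, beta
```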
  • Step S202 Determine a second spectral subtraction parameter according to the first spectral subtraction parameter and the reference power spectrum.
  • the first spectral subtraction parameter is optimized to obtain the second spectral subtraction parameter, so that spectral subtraction is performed on the noisy speech signal according to the second spectral subtraction parameter, thereby improving the intelligibility and naturalness of the denoised speech signal.
  • the second spectral subtraction parameter is determined according to the first spectral subtraction parameter and the reference power spectrum; wherein the reference power spectrum comprises: a user speech prediction power spectrum and/or an environmental noise prediction power spectrum.
  • Exemplarily, the second spectral subtraction parameter is determined according to the first spectral subtraction parameter, the reference power spectrum, and a spectral subtraction function, where the spectral subtraction function may include but is not limited to at least one of the following: a first spectral subtraction function F1(x, y), a second spectral subtraction function F2(x, z), and a third spectral subtraction function F3(x, y, z).
  • the user voice prediction power spectrum involved in this embodiment is: a user voice power spectrum predicted according to the user history power spectrum and the power spectrum of the noisy voice signal (which can be used to reflect the user voice power spectrum characteristics).
  • the ambient noise predicted power spectrum involved in this embodiment is an ambient noise power spectrum predicted according to the noise history power spectrum and the power spectrum of the noise signal (which can be used to reflect the ambient noise power spectrum characteristics of the user).
  • In one implementation, when the reference power spectrum includes the user speech prediction power spectrum, the second spectral subtraction parameter is determined according to the first spectral subtraction function F1(x, y);
  • where x represents the first spectral subtraction parameter
  • and y represents the user speech prediction power spectrum.
  • the value of F1(x, y) is positively related to x (i.e., the larger x is, the larger the value of F1(x, y) is)
  • and the value of F1(x, y) is negatively related to y (i.e., the larger y is, the smaller the value of F1(x, y) is).
  • the second spectral subtraction parameter is greater than or equal to the preset minimum spectral subtraction parameter and is less than or equal to the first spectral subtraction parameter.
  • 1) If the first spectral subtraction parameter includes the first over-subtraction factor α, a second spectral subtraction parameter (including a second over-subtraction factor α') is determined according to the first spectral subtraction function F1(x, y), where α' ∈ [min_α, α] and min_α represents the first preset minimum spectral subtraction parameter.
  • 2) If the first spectral subtraction parameter includes the first spectral order β, the second spectral subtraction parameter (including a second spectral order β') is determined according to F1(x, y), where β' ∈ [min_β, β] and min_β represents the second preset minimum spectral subtraction parameter.
  • 3) If the first spectral subtraction parameter includes both the first over-subtraction factor α and the first spectral order β, the second spectral subtraction parameter (including a second over-subtraction factor α' and a second spectral order β') is determined according to F1(x, y); exemplarily, α' is determined according to F1(α, y) and β' is determined according to F1(β, y), where α' ∈ [min_α, α], β' ∈ [min_β, β], min_α represents the first preset minimum spectral subtraction parameter, and min_β represents the second preset minimum spectral subtraction parameter.
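A minimal hypothetical instance of F1(x, y) satisfying the stated monotonicity and clamping requirements might look like this (the functional form and the normalization constant y_ref are assumptions, not given by the text):

```python
def f1(x, y, y_ref=1.0, min_x=1.0):
    """Hypothetical F1(x, y): positively related to x, negatively related to y,
    clamped to [min_x, x] as the text requires. y_ref is an assumed
    normalization for the predicted user speech power."""
    val = x / (1.0 + max(y, 0.0) / y_ref)  # grows with x, shrinks with y
    return min(max(val, min_x), x)         # clamp to [min_x, x]
```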
  • In this implementation, the second spectral subtraction parameter is obtained by optimizing the first spectral subtraction parameter in consideration of the regularity of the user voice power spectrum characteristics of the terminal device, so that spectral subtraction is performed on the noisy speech signal according to the second spectral subtraction parameter; thus the user voice of the terminal device can be protected, and the intelligibility and naturalness of the denoised voice signal are improved.
  • In another implementation, when the reference power spectrum includes the environmental noise prediction power spectrum, the second spectral subtraction parameter is determined according to the second spectral subtraction function F2(x, z);
  • where x represents the first spectral subtraction parameter and z represents the environmental noise prediction power spectrum.
  • the value of F2(x, z) is positively related to x (i.e., the larger x is, the larger the value of F2(x, z) is)
  • and the value of F2(x, z) is positively related to z (i.e., the larger z is, the larger the value of F2(x, z) is).
  • the second spectral subtraction parameter is greater than or equal to the first spectral subtraction parameter and is less than or equal to the preset maximum spectral subtraction parameter.
  • 1) If the first spectral subtraction parameter includes the first over-subtraction factor α, a second spectral subtraction parameter (including a second over-subtraction factor α') is determined according to the second spectral subtraction function F2(x, z), where α' ∈ [α, max_α] and max_α represents the first preset maximum spectral subtraction parameter.
  • 2) If the first spectral subtraction parameter includes the first spectral order β, the second spectral subtraction parameter (including a second spectral order β') is determined according to F2(x, z), where β' ∈ [β, max_β] and max_β represents the second preset maximum spectral subtraction parameter.
  • 3) If the first spectral subtraction parameter includes both the first over-subtraction factor α and the first spectral order β, the second spectral subtraction parameter (including a second over-subtraction factor α' and a second spectral order β') is determined according to F2(x, z); exemplarily, α' is determined according to F2(α, z) and β' is determined according to F2(β, z), where α' ∈ [α, max_α], β' ∈ [β, max_β], max_α represents the first preset maximum spectral subtraction parameter, and max_β represents the second preset maximum spectral subtraction parameter.
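Analogously, a hypothetical instance of F2(x, z) that grows with both arguments and is clamped to [x, max_x] could be (again, the functional form and z_ref are assumptions):

```python
def f2(x, z, z_ref=1.0, max_x=6.0):
    """Hypothetical F2(x, z): positively related to both x and z, clamped to
    [x, max_x]. z_ref is an assumed noise-power normalization."""
    val = x * (1.0 + max(z, 0.0) / z_ref)  # grows with both x and z
    return min(max(val, x), max_x)         # clamp to [x, max_x]
```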
  • In this implementation, the second spectral subtraction parameter is obtained by optimizing the first spectral subtraction parameter in consideration of the regularity of the noise power spectrum characteristics of the environment in which the user is located, so that spectral subtraction is performed on the noisy speech signal according to the second spectral subtraction parameter; thus the noise signal in the noisy speech signal can be removed more accurately, and the intelligibility and naturalness of the denoised speech signal are improved.
  • In yet another implementation, when the reference power spectrum includes both the user speech prediction power spectrum and the environmental noise prediction power spectrum,
  • the second spectral subtraction parameter is determined according to the third spectral subtraction function F3(x, y, z);
  • where x represents the first spectral subtraction parameter, y represents the user speech prediction power spectrum, and z represents the environmental noise prediction power spectrum; the value of F3(x, y, z) is positively related to x (i.e., the larger x is, the larger the value of F3(x, y, z) is), negatively related to y (i.e., the larger y is, the smaller the value of F3(x, y, z) is), and positively related to z (i.e., the larger z is, the larger the value of F3(x, y, z) is).
  • 1) If the first spectral subtraction parameter includes the first over-subtraction factor α, the second spectral subtraction parameter (including a second over-subtraction factor α') is determined according to the third spectral subtraction function F3(x, y, z). 2) If the first spectral subtraction parameter includes the first spectral order β, the second spectral subtraction parameter (including a second spectral order β') is determined according to F3(x, y, z).
  • 3) If the first spectral subtraction parameter includes both the first over-subtraction factor α and the first spectral order β, the second spectral subtraction parameter (including a second over-subtraction factor α' and a second spectral order β') is determined according to the third spectral subtraction function F3(x, y, z); exemplarily, α' is determined according to F3(α, y, z) and β' is determined according to F3(β, y, z).
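A hypothetical instance of F3(x, y, z) with the stated monotonicity in all three arguments might be (functional form and normalizations assumed; any clamping analogous to F1/F2 is omitted):

```python
def f3(x, y, z, y_ref=1.0, z_ref=1.0):
    """Hypothetical F3(x, y, z): positively related to x and z, negatively
    related to y. y_ref and z_ref are assumed normalization constants."""
    return x * (1.0 + max(z, 0.0) / z_ref) / (1.0 + max(y, 0.0) / y_ref)
```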
  • In this implementation, the first spectral subtraction parameter is optimized to obtain a second spectral subtraction parameter, so that spectral subtraction is performed on the noisy speech signal according to the second spectral subtraction parameter; this not only protects the user voice of the terminal device, but also removes the noise signal in the noisy speech signal more accurately, thereby improving the intelligibility and naturalness of the denoised speech signal.
  • the second spectral subtraction parameter may be determined by other methods according to the first spectral subtraction parameter and the reference power spectrum, which is not limited in the embodiment of the present application.
  • Step S203 performing spectral subtraction processing on the noisy speech signal according to the power spectrum of the noise signal and the second spectral subtraction parameter.
  • the denoised speech signal is obtained by performing spectral subtraction on the noisy speech signal according to the power spectrum of the noise signal and the second spectral subtraction parameter (obtained after the optimization of the first spectral subtraction parameter), so that IFFT conversion and superposition can further be performed on the denoised speech signal and the phase information of the noisy speech signal to obtain an enhanced speech signal.
  • the manner of performing spectral subtraction processing on the noisy speech signal according to the power spectrum of the noise signal and the second spectral subtraction parameter may refer to the spectral subtraction process in the prior art, and details are not described herein again.
  • In this embodiment, the first spectral subtraction parameter is determined according to the power spectrum of the noisy speech signal and the power spectrum of the noise signal; further, the second spectral subtraction parameter is determined according to the first spectral subtraction parameter and the reference power spectrum, and spectral subtraction is performed on the noisy speech signal according to the power spectrum of the noise signal and the second spectral subtraction parameter, where the reference power spectrum includes a user speech prediction power spectrum and/or an environmental noise prediction power spectrum.
  • Since the first spectral subtraction parameter is optimized to obtain the second spectral subtraction parameter, performing spectral subtraction on the noisy speech signal according to the optimized second spectral subtraction parameter not only applies to a wide signal-to-noise ratio range, but also improves the intelligibility and naturalness of the denoised speech signal, improving noise reduction performance.
  • FIG. 3 is a schematic flowchart of a voice enhancement method according to another embodiment of the present application.
  • the embodiment of the present application relates to an optional implementation process of determining a user voice prediction power spectrum.
  • the method further includes:
  • Step S301 Determine a target user power spectrum cluster according to a power spectrum of the noisy speech signal and a user power spectrum distribution class.
  • the user power spectrum distribution class includes at least one user history power spectrum cluster; the target user power spectrum cluster is the cluster, among the at least one user history power spectrum cluster, that is closest in distance to the power spectrum of the noisy speech signal.
  • Exemplarily, the distance between each user history power spectrum cluster in the user power spectrum distribution class and the power spectrum of the noisy speech signal is calculated separately, and the user history power spectrum cluster closest to the power spectrum of the noisy speech signal is determined as the target user power spectrum cluster.
  • the distance between any user history power spectrum cluster and the power spectrum of the noisy speech signal may be calculated by any of the following algorithms: the Euclidean distance algorithm, the Manhattan distance algorithm, the standardized Euclidean distance algorithm, and the cosine similarity algorithm.
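Step S301 can be sketched as follows; the function name and the interpretation of "cosine" as 1 minus cosine similarity are assumptions:

```python
import numpy as np

def nearest_cluster(power_spectrum, cluster_centers, metric="euclidean"):
    """Return the index of the history cluster center closest to the given
    power spectrum (a sketch of step S301)."""
    p = np.asarray(power_spectrum, dtype=float)
    def dist(c):
        c = np.asarray(c, dtype=float)
        if metric == "euclidean":
            return np.linalg.norm(p - c)
        if metric == "manhattan":
            return np.sum(np.abs(p - c))
        if metric == "cosine":  # 1 - cosine similarity
            return 1.0 - p.dot(c) / (np.linalg.norm(p) * np.linalg.norm(c))
        raise ValueError(f"unknown metric: {metric}")
    return min(range(len(cluster_centers)), key=lambda i: dist(cluster_centers[i]))
```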
  • Step S302 Determine a user voice prediction power spectrum according to a power spectrum of the noisy speech signal and a target user power spectrum cluster.
  • the user voice prediction power spectrum is exemplarily determined according to the power spectrum of the noisy speech signal, the target user power spectrum clustering, and the estimation function.
  • Exemplarily, the user voice prediction power spectrum is determined according to the first estimation function F4(SP, SPT) = a*SP + (1-a)*SPT, where SP represents the power spectrum of the noisy speech signal, SPT represents the target user power spectrum cluster, and a represents the first estimation coefficient, 0 < a < 1.
  • the value of a may be gradually reduced as the user power spectrum distribution class is gradually improved.
  • It should be noted that the first estimation function F4(SP, SPT) may also be equal to another equivalent or deformed formula of a*SP + (1-a)*SPT (or the user voice prediction power spectrum may be determined based on another estimation function equivalent to or deformed from F4(SP, SPT)), which is not limited in the embodiment of the present application.
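The first estimation function F4 translates directly into code; as noted above, a may be reduced as the user power spectrum distribution class improves:

```python
import numpy as np

def predict_user_speech_psd(sp, spt, a=0.5):
    """F4(SP, SPT) = a*SP + (1-a)*SPT from the text: blend the current noisy
    power spectrum with the target user history cluster (0 < a < 1)."""
    return a * np.asarray(sp, dtype=float) + (1 - a) * np.asarray(spt, dtype=float)
```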
  • In this embodiment, the target user power spectrum cluster is determined according to the power spectrum of the noisy speech signal and the user power spectrum distribution class; further, the user voice prediction power spectrum is determined according to the power spectrum of the noisy speech signal and the target user power spectrum cluster, so that the first spectral subtraction parameter can be optimized according to the user voice prediction power spectrum to obtain a second spectral subtraction parameter, and spectral subtraction is performed on the noisy speech signal according to the optimized second spectral subtraction parameter; thus the user voice of the terminal device is protected, and the intelligibility and naturalness of the denoised voice signal are improved.
  • Optionally, before step S301, the method further includes: acquiring the user power spectrum distribution class.
  • the user power spectrum is learned online from the user's historically denoised voice signals, and the user voice power spectrum characteristics are statistically analyzed to generate a user-specific user power spectrum distribution class, thereby implementing user voice adaptation.
  • the specific acquisition manner can be as follows:
  • FIG. 3B is a schematic diagram of a user power spectrum distribution class according to an embodiment of the present disclosure
  • FIG. 3C is a schematic flowchart of a learning process of a user voice power spectrum characteristic according to an embodiment of the present application.
  • the user power spectrum is learned offline from the user's historically denoised voice signals by applying a clustering algorithm to generate the initial distribution class of the user power spectrum; optionally, the historically denoised voice signals of other users may also be combined for the offline learning.
  • the clustering algorithm may include, but is not limited to, any of the following: K-means clustering (K-means) and K-nearest neighbor (K-NN).
  • Optionally, the clustering may be combined with a classification of pronunciation type (such as initials, finals, unvoiced sounds, voiced sounds, plosives, etc.), and of course other classification factors may also be combined, which is not limited in the embodiment of the present application.
  • Take as an example the case where the user power spectrum distribution class after the above adjustment includes user history power spectrum cluster A1, user history power spectrum cluster A2, and user history power spectrum cluster A3, and the power spectrum of the user's denoised voice signal is A4.
  • the traditional spectral subtraction algorithm or the voice enhancement method provided by the present application is used to determine the denoised speech signal; further, an adaptive clustering iteration (i.e., online learning of the user power spectrum) is performed according to the user's denoised speech signal and the last adjusted user power spectrum distribution class, and the cluster centers of the last adjusted user power spectrum distribution class are modified to output the adjusted user power spectrum distribution class.
  • When it is the first adaptive clustering iteration, the iteration is performed according to the user's denoised voice signal and the initial cluster centers in the user power spectrum initial distribution class; when it is not the first adaptive clustering iteration, the iteration is performed according to the user's denoised speech signal and the historical cluster centers in the last adjusted user power spectrum distribution class.
  • In this embodiment, the user power spectrum distribution class is dynamically adjusted each time according to the user's denoised voice signal, so that the user voice prediction power spectrum can be determined more accurately; the first spectral subtraction parameter is then optimized according to the user voice prediction power spectrum to obtain the second spectral subtraction parameter, and spectral subtraction is performed on the noisy speech signal according to the optimized second spectral subtraction parameter, thereby protecting the user voice of the terminal device and improving the noise reduction performance.
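One plausible form of such an adaptive clustering iteration (a sketch, not the patent's specific update rule) moves only the nearest cluster center a small step toward the newly observed denoised-speech power spectrum; the step size lr is an assumed parameter:

```python
import numpy as np

def online_cluster_update(psd, centers, lr=0.05):
    """One online-learning iteration: nudge the cluster center nearest to the
    new power spectrum toward it, leaving the other centers unchanged."""
    psd = np.asarray(psd, dtype=float)
    centers = [np.asarray(c, dtype=float) for c in centers]
    i = min(range(len(centers)), key=lambda k: np.linalg.norm(psd - centers[k]))
    centers[i] = centers[i] + lr * (psd - centers[i])  # partial step toward psd
    return centers
```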
  • FIG. 4 is a schematic flowchart of a voice enhancement method according to another embodiment of the present application.
  • Embodiments of the present application relate to an alternative implementation process for determining an environmental noise prediction power spectrum.
  • the method further includes:
  • Step S401 Determine a target noise power spectrum cluster according to a power spectrum of the noise signal and a noise power spectrum distribution class.
  • the noise power spectrum distribution class includes at least one noise history power spectrum cluster; the target noise power spectrum cluster is the cluster, among the at least one noise history power spectrum cluster, that is closest in distance to the power spectrum of the noise signal.
  • Exemplarily, the distance between each noise history power spectrum cluster in the noise power spectrum distribution class and the power spectrum of the noise signal is calculated separately, and the noise history power spectrum cluster closest to the power spectrum of the noise signal is determined as the target noise power spectrum cluster.
  • the distance between any noise history power spectrum cluster and the power spectrum of the noise signal may be calculated by any of the following algorithms: the Euclidean distance algorithm, the Manhattan distance algorithm, the standardized Euclidean distance algorithm, and the cosine similarity algorithm; of course, other algorithms may also be used, which are not limited in the embodiment of the present application.
  • Step S402 determining an environmental noise prediction power spectrum according to a power spectrum of the noise signal and a target noise power spectrum cluster.
  • the ambient noise prediction power spectrum is exemplarily determined according to the power spectrum of the noise signal, the target noise power spectrum clustering, and the estimation function.
  • Exemplarily, the environmental noise prediction power spectrum is determined according to the second estimation function F5(NP, NPT) = b*NP + (1-b)*NPT, where NP represents the power spectrum of the noise signal, NPT represents the target noise power spectrum cluster, and b represents the second estimation coefficient, 0 < b < 1.
  • the value of b may be gradually reduced as the noise power spectrum distribution class is gradually improved.
  • It should be noted that the second estimation function F5(NP, NPT) may also be equal to another equivalent or deformed formula of b*NP + (1-b)*NPT (or the environmental noise prediction power spectrum may be determined based on another estimation function equivalent to or deformed from F5(NP, NPT)), which is not limited in the embodiment of the present application.
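Like F4, the second estimation function F5 is a direct convex combination; b may be reduced as the noise power spectrum distribution class improves:

```python
import numpy as np

def predict_ambient_noise_psd(noise_psd, cluster_psd, b=0.5):
    """F5(NP, NPT) = b*NP + (1-b)*NPT from the text: blend the current noise
    power spectrum with the target noise history cluster (0 < b < 1)."""
    return (b * np.asarray(noise_psd, dtype=float)
            + (1 - b) * np.asarray(cluster_psd, dtype=float))
```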
  • In this embodiment, the target noise power spectrum cluster is determined according to the power spectrum of the noise signal and the noise power spectrum distribution class; further, the environmental noise prediction power spectrum is determined according to the power spectrum of the noise signal and the target noise power spectrum cluster, so that the first spectral subtraction parameter can be optimized according to the environmental noise prediction power spectrum to obtain a second spectral subtraction parameter, and spectral subtraction is performed on the noisy speech signal according to the optimized second spectral subtraction parameter; thus the noise signal in the noisy speech signal is removed more accurately, improving the intelligibility and naturalness of the denoised speech signal.
  • Optionally, before step S401, the method further includes: acquiring the noise power spectrum distribution class.
  • the noise power spectrum is learned online from the historical noise signals of the environment in which the user is located, and the noise power spectrum characteristics of that environment are statistically analyzed to generate a user-specific noise power spectrum distribution class, thereby implementing adaptation to the user's environment.
  • the specific acquisition manner can be as follows:
  • FIG. 4B is a schematic diagram of a noise power spectrum distribution class according to an embodiment of the present disclosure
  • FIG. 4C is a schematic diagram of a learning flow of noise power spectrum characteristics provided by an embodiment of the present application.
  • the noise power spectrum is learned offline from the historical noise signals of the environment in which the user is located by applying a clustering algorithm to generate the initial distribution class of the noise power spectrum; optionally, the historical noise signals of the environments in which other users are located may also be combined for the offline learning.
  • the clustering algorithm may include, but is not limited to, any of the following: K-means and K-NN.
  • Optionally, the clustering may be combined with a classification of typical environmental noise scenarios (such as a crowded place, etc.), and of course other classification factors may also be combined, which is not limited in the embodiment of the present application.
  • Take as an example the case where the noise power spectrum distribution class after the above adjustment includes noise history power spectrum cluster B1, noise history power spectrum cluster B2, and noise history power spectrum cluster B3, and the power spectrum of the noise signal is B4.
  • the power spectrum of the noise signal is determined by using the conventional spectral subtraction algorithm or the voice enhancement method provided by the present application; further, an adaptive clustering iteration (i.e., online learning of the noise power spectrum) is performed according to the power spectrum of the noise signal (e.g., B4 in FIG. 4B) and the last adjusted noise power spectrum distribution class, and the cluster centers of the last adjusted noise power spectrum distribution class are modified to output the adjusted noise power spectrum distribution class.
  • When it is the first adaptive clustering iteration (i.e., the last adjusted noise power spectrum distribution class is the noise power spectrum initial distribution class), the iteration is performed according to the power spectrum of the noise signal and the initial cluster centers; when it is not the first adaptive clustering iteration, the iteration is performed according to the power spectrum of the noise signal and the historical cluster centers in the last adjusted noise power spectrum distribution class.
  • In this embodiment, the noise power spectrum distribution class is dynamically adjusted according to the power spectrum of the noise signal, so that the environmental noise prediction power spectrum can be determined more accurately; the first spectral subtraction parameter is then optimized according to the environmental noise prediction power spectrum to obtain the second spectral subtraction parameter, and spectral subtraction is performed on the noisy speech signal according to the optimized second spectral subtraction parameter, thereby removing the noise signal in the noisy speech signal more accurately and improving the noise reduction performance.
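The offline learning step that produces the initial noise power spectrum distribution class can be sketched with plain K-means over historical noise power spectra (the initialization scheme, k, and iteration count are assumptions):

```python
import numpy as np

def kmeans_init(history_psds, k=3, iters=20):
    """Offline-learning sketch: K-means over historical noise power spectra,
    yielding the initial cluster centers of the distribution class."""
    X = np.asarray(history_psds, dtype=float)
    idx = np.linspace(0, len(X) - 1, k).astype(int)  # simple spread-out init
    centers = X[idx].copy()
    for _ in range(iters):
        # assign every spectrum to its nearest center, then recompute means
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers
```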
  • FIG. 5 is a schematic flowchart diagram of a voice enhancement method according to another embodiment of the present application.
  • the embodiment of the present application relates to an optional implementation process of determining a user voice prediction power spectrum and an environmental noise prediction power spectrum.
  • the method further includes:
  • Step S501 Determine a target user power spectrum cluster according to a power spectrum of the noisy speech signal and a user power spectrum distribution class, and determine a target noise power spectrum cluster according to the power spectrum of the noise signal and the noise power spectrum distribution class.
  • the user power spectrum distribution class includes at least one user history power spectrum cluster; the target user power spectrum cluster is the cluster, among the at least one user history power spectrum cluster, that is closest in distance to the power spectrum of the noisy speech signal; the noise power spectrum distribution class includes at least one noise history power spectrum cluster; the target noise power spectrum cluster is the cluster, among the at least one noise history power spectrum cluster, that is closest in distance to the power spectrum of the noise signal.
  • For the specific implementation of this step, refer to the related content in step S301 and step S401 in the foregoing embodiments, and details are not described herein again.
  • Step S502 Determine a user voice prediction power spectrum according to a power spectrum of the noisy speech signal and a target user power spectrum cluster.
  • For the specific implementation of this step, refer to the related content in step S302 in the foregoing embodiment, and details are not described herein again.
  • Step S503 determining an environmental noise prediction power spectrum according to a power spectrum of the noise signal and a target noise power spectrum cluster.
  • For the specific implementation of this step, refer to the related content in step S402 in the foregoing embodiment, and details are not described herein again.
  • Optionally, before step S501, the method further includes: acquiring the user power spectrum distribution class and the noise power spectrum distribution class.
  • Optionally, step S502 may be performed before step S503, or step S503 may be performed before step S502; the order is not limited in the embodiment of the present application.
  • In this embodiment, the target user power spectrum cluster is determined according to the power spectrum of the noisy speech signal and the user power spectrum distribution class, and the target noise power spectrum cluster is determined according to the power spectrum of the noise signal and the noise power spectrum distribution class. Further, the user voice prediction power spectrum is determined according to the power spectrum of the noisy speech signal and the target user power spectrum cluster, and the environmental noise prediction power spectrum is determined according to the power spectrum of the noise signal and the target noise power spectrum cluster, so that the first spectral subtraction parameter can subsequently be optimized according to the user voice prediction power spectrum and the environmental noise prediction power spectrum to obtain the second spectral subtraction parameter, and spectral subtraction processing can be performed on the noisy speech signal according to the optimized second spectral subtraction parameter. In this way, not only is the voice of the user of the terminal device protected, but the noise signal in the noisy speech signal can also be removed more accurately, improving the intelligibility and naturalness of the denoised speech signal.
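The cluster selection in step S501 can be sketched as follows. The text states only that the cluster closest in "power spectrum distance" is chosen, so the Euclidean metric, the list-of-lists representation, and the function name `nearest_cluster` are illustrative assumptions rather than the patented method itself:

```python
def nearest_cluster(power_spectrum, clusters):
    """Pick the historical power spectrum cluster (centroid) closest to the
    current power spectrum. Euclidean distance is only one plausible reading
    of "power spectrum distance"."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(clusters, key=lambda c: dist(power_spectrum, c))
```

Applied to the user power spectrum distribution class this would yield the target user power spectrum cluster, and applied to the noise power spectrum distribution class it would yield the target noise power spectrum cluster.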
  • FIG. 6A is a schematic flowchart 1 of a voice enhancement method according to another embodiment of the present application;
  • FIG. 6B is a schematic flowchart 2 of a voice enhancement method according to another embodiment of the present application.
  • the embodiment of the present application relates to an optional implementation process of how to implement a voice enhancement method when considering the regularity of the user voice power spectrum characteristics of the terminal device and considering the sub-band division.
  • As shown in FIG. 6A and FIG. 6B, the specific implementation process of the embodiment of the present application is as follows:
  • the sound signal collected by the dual microphones is divided into a noisy speech signal and a noise signal by VAD. Further, the noisy speech signal is transformed by FFT to obtain amplitude information and phase information (where the amplitude information is subjected to subband power spectrum estimation to obtain the subband power spectrum SP(m, i) of the noisy speech signal), and the noise signal is subjected to noise subband power spectrum estimation to obtain the subband power spectrum of the noise signal. Further, the first spectral subtraction parameter is obtained by the spectral subtraction parameter calculation process, where m represents the m-th subband and i represents the i-th frame (the range of i is determined according to the number of frames of the processed noisy speech signal).
  • Further, the first spectral subtraction parameter is optimized according to the user voice prediction subband power spectrum PSP(m, i). Specifically, the second spectral subtraction parameter is obtained according to the user voice prediction subband power spectrum PSP(m, i) and the first spectral subtraction parameter, where the user voice prediction subband power spectrum PSP(m, i) is determined by voice subband power spectrum estimation according to the subband power spectrum SP(m, i) of the noisy speech signal and the user history subband power spectrum cluster in the user subband power spectrum distribution class that is closest to SP(m, i) (that is, the target user power spectrum cluster, SPT(m)).
  • the amplitude information of the noisy speech signal is subjected to spectral subtraction processing to obtain a denoised speech signal.
  • the enhanced speech signal is obtained by performing IFFT conversion and superposition processing according to the denoised speech signal and the phase information of the noisy speech signal.
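The reconstruction step above (combining the denoised amplitude with the noisy signal's phase, then applying the IFFT) can be sketched per frame as follows. The direct O(N²) inverse DFT is an illustrative stand-in for a real FFT routine, and the overlap-add across frames is omitted:

```python
import cmath

def reconstruct_frame(denoised_mag, noisy_phase):
    # Recombine the denoised magnitude with the noisy signal's phase,
    # then invert with a direct inverse DFT (illustrative only; a real
    # implementation would use an FFT library and overlap-add).
    spec = [m * cmath.exp(1j * p) for m, p in zip(denoised_mag, noisy_phase)]
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]
```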
  • Optionally, user subband power spectrum online learning may be performed on the denoised voice signal to update the user subband power spectrum distribution class in real time, so that, subsequently, voice subband power spectrum estimation can be performed according to the subband power spectrum of the next noisy speech signal and the user history subband power spectrum cluster in the updated user subband power spectrum distribution class that is closest to it (that is, the next target user power spectrum cluster) to determine the next user voice prediction subband power spectrum, which is then used to optimize the next first spectral subtraction parameter.
  • In this embodiment, by considering the regularity of the user voice power spectrum characteristics of the terminal device, the first spectral subtraction parameter is optimized according to the user voice prediction subband power spectrum to obtain the second spectral subtraction parameter, so that the user voice of the terminal device can be protected and the intelligibility and naturalness of the denoised speech signal are improved.
  • Optionally, the subband division in the embodiment of the present application may use the division manner shown in Table 1, which is based on the frequency-domain values of the Fourier-transformed signal; other division manners may also be adopted, which is not limited in the embodiment of the present application. Table 1 is a reference chart for Bark critical band division.
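Subband power spectrum estimation over such a division can be sketched as follows; the band edges below are illustrative placeholders only, not the actual Table 1 Bark boundaries:

```python
# Hypothetical band edges expressed as FFT-bin indices
# (illustrative only; NOT the Table 1 values).
BAND_EDGES = [0, 2, 4, 7, 11, 16]  # five subbands covering 16 bins

def subband_power_spectrum(power_bins, edges=BAND_EDGES):
    # SP(m, i) for one frame i: total per-bin power within each subband m.
    return [sum(power_bins[lo:hi]) for lo, hi in zip(edges, edges[1:])]
```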
  • FIG. 7A is a schematic flowchart 3 of a voice enhancement method according to another embodiment of the present disclosure;
  • FIG. 7B is a schematic flowchart of a voice enhancement method according to another embodiment of the present application.
  • the embodiment of the present application relates to an optional implementation process of how to implement a speech enhancement method when considering the regularity of the environmental noise power spectrum characteristics of the user and considering the sub-band division.
  • As shown in FIG. 7A and FIG. 7B, the specific implementation process of the embodiment of the present application is as follows:
  • the sound signal collected by the dual microphones is divided into a noisy speech signal and a noise signal by VAD. Further, the noisy speech signal is transformed by FFT to obtain amplitude information and phase information (where the amplitude information is subjected to subband power spectrum estimation to obtain the subband power spectrum of the noisy speech signal), and the noise signal is subjected to noise subband power spectrum estimation to obtain the subband power spectrum NP(m, i) of the noise signal. Further, the first spectral subtraction parameter is obtained by the spectral subtraction parameter calculation process according to the subband power spectrum NP(m, i) of the noise signal and the subband power spectrum of the noisy speech signal. Further, the first spectral subtraction parameter is optimized according to the ambient noise prediction power spectrum PNP(m, i).
  • Specifically, the second spectral subtraction parameter is obtained according to the ambient noise prediction power spectrum PNP(m, i) and the first spectral subtraction parameter, where the ambient noise prediction power spectrum PNP(m, i) is determined by noise subband power spectrum estimation according to the subband power spectrum NP(m, i) of the noise signal and the noise history subband power spectrum cluster in the noise subband power spectrum distribution class that is closest to NP(m, i) (that is, the target noise subband power spectrum cluster, NPT(m)).
  • the amplitude information of the noisy speech signal is subjected to spectral subtraction processing to obtain a denoised speech signal.
  • the enhanced speech signal is obtained by performing IFFT conversion and superposition processing according to the denoised speech signal and the phase information of the noisy speech signal.
  • Optionally, noise subband power spectrum online learning may be performed on the subband power spectrum NP(m, i) of the noise signal to update the noise subband power spectrum distribution class in real time, so that, subsequently, noise subband power spectrum estimation can be performed according to the subband power spectrum of the next noise signal and the noise history subband power spectrum cluster in the updated noise subband power spectrum distribution class that is closest to it (that is, the next target noise subband power spectrum cluster) to determine the next environmental noise prediction subband power spectrum, which is then used to optimize the next first spectral subtraction parameter.
  • In this embodiment, by considering the regularity of the power spectrum characteristics of the ambient noise in the user's environment, the first spectral subtraction parameter is optimized according to the environmental noise prediction subband power spectrum to obtain the second spectral subtraction parameter, so that the noise signal in the noisy speech signal can be removed more accurately and the intelligibility and naturalness of the denoised speech signal are improved.
  • FIG. 8A is a schematic flowchart of a voice enhancement method according to another embodiment of the present disclosure;
  • FIG. 8B is a schematic flowchart of a voice enhancement method according to another embodiment of the present application.
  • the embodiment of the present application relates to an optional implementation process of how to implement the voice enhancement method when the regularity of the user voice power spectrum characteristics of the terminal device and of the power spectrum characteristics of the ambient noise in the user's environment is considered together with subband division. As shown in FIG. 8A and FIG. 8B, the specific implementation process of the embodiment of the present application is as follows:
  • the sound signal collected by the dual microphones is divided into a noisy speech signal and a noise signal by VAD. Further, the noisy speech signal is transformed by FFT to obtain amplitude information and phase information (where the amplitude information is subjected to subband power spectrum estimation to obtain the subband power spectrum SP(m, i) of the noisy speech signal), and the noise signal is subjected to noise subband power spectrum estimation to obtain the subband power spectrum NP(m, i) of the noise signal. Further, the first spectral subtraction parameter is obtained by the spectral subtraction parameter calculation process according to the subband power spectrum of the noise signal and the subband power spectrum of the noisy speech signal.
  • Further, the first spectral subtraction parameter is optimized according to the user voice prediction subband power spectrum PSP(m, i) and the ambient noise prediction power spectrum PNP(m, i). Specifically, the second spectral subtraction parameter is obtained according to PSP(m, i), PNP(m, i), and the first spectral subtraction parameter, where the user voice prediction subband power spectrum PSP(m, i) is determined by voice subband power spectrum estimation according to the subband power spectrum SP(m, i) of the noisy speech signal and the user history subband power spectrum cluster in the user subband power spectrum distribution class that is closest to SP(m, i) (that is, the target user power spectrum cluster, SPT(m)); and the ambient noise prediction power spectrum PNP(m, i) is determined by noise subband power spectrum estimation according to the subband power spectrum NP(m, i) of the noise signal and the noise history subband power spectrum cluster in the noise subband power spectrum distribution class that is closest to NP(m, i) (that is, the target noise subband power spectrum cluster, NPT(m)).
  • the amplitude information of the noisy speech signal is subjected to spectral subtraction processing to obtain a denoised speech signal.
  • the enhanced speech signal is obtained by performing IFFT conversion and superposition processing according to the denoised speech signal and the phase information of the noisy speech signal.
  • Optionally, user subband power spectrum online learning may be performed on the denoised voice signal to update the user subband power spectrum distribution class in real time, so that, subsequently, voice subband power spectrum estimation can be performed according to the subband power spectrum of the next noisy speech signal and the user history subband power spectrum cluster in the updated user subband power spectrum distribution class that is closest to it (that is, the next target user power spectrum cluster) to determine the next user voice prediction subband power spectrum, which is then used to optimize the next first spectral subtraction parameter.
  • Optionally, noise subband power spectrum online learning may be performed on the subband power spectrum of the noise signal to update the noise subband power spectrum distribution class in real time, so that, subsequently, noise subband power spectrum estimation can be performed according to the subband power spectrum of the next noise signal and the noise history subband power spectrum cluster in the updated noise subband power spectrum distribution class that is closest to it (that is, the next target noise subband power spectrum cluster) to determine the next ambient noise prediction subband power spectrum, which is then used to optimize the next first spectral subtraction parameter.
  • In this embodiment, by considering the regularity of the user voice power spectrum characteristics of the terminal device and of the power spectrum characteristics of the ambient noise in the user's environment, the first spectral subtraction parameter is optimized according to the user voice prediction subband power spectrum and the ambient noise prediction power spectrum to obtain the second spectral subtraction parameter, so that spectral subtraction processing is performed on the noisy speech signal according to the second spectral subtraction parameter. In this way, the noise signal in the noisy speech signal can be removed more accurately, and the intelligibility and naturalness of the denoised speech signal are improved.
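The online-learning updates described above are not given an explicit rule in the text; a running-mean update of the selected cluster centroid is one plausible sketch (the `count` bookkeeping and the function name are assumptions):

```python
def update_cluster(centroid, new_spectrum, count):
    # Move the nearest cluster centroid toward the newly observed
    # subband power spectrum, as a running mean over `count` prior members.
    return [(c * count + s) / (count + 1)
            for c, s in zip(centroid, new_spectrum)]
```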
  • FIG. 9A is a schematic structural diagram of a voice enhancement apparatus according to an embodiment of the present disclosure.
  • the voice enhancement apparatus 90 provided by the embodiment of the present application includes: a first determining module 901, a second determining module 902, and a spectrum subtracting module 903.
  • the first determining module 901 is configured to determine a first spectral subtraction parameter according to a power spectrum of the noisy speech signal and a power spectrum of the noise signal, where the noisy speech signal and the noise signal are obtained by dividing the sound signal collected by the microphone;
  • a second determining module 902 configured to determine a second spectral subtraction parameter according to the first spectral subtraction parameter and the reference power spectrum; wherein the reference power spectrum comprises: a user voice prediction power spectrum and/or an environmental noise prediction power spectrum;
  • the spectrum subtraction module 903 is configured to perform spectral subtraction processing on the noisy speech signal according to the power spectrum of the noise signal and the second spectral subtraction parameter.
  • the second determining module 902 is specifically configured to:
  • determine a second spectral subtraction parameter according to the first spectral subtraction function F1(x, y), where x represents the first spectral subtraction parameter, y represents the user voice prediction power spectrum, the value of F1(x, y) is positively related to x, and the value of F1(x, y) is negatively related to y.
  • the second determining module 902 is specifically configured to:
  • determine a second spectral subtraction parameter according to the second spectral subtraction function F2(x, z), where x represents the first spectral subtraction parameter, z represents the ambient noise prediction power spectrum, the value of F2(x, z) is positively related to x, and the value of F2(x, z) is positively related to z.
  • the second determining module 902 is specifically configured to:
  • determine a second spectral subtraction parameter according to the third spectral subtraction function F3(x, y, z), where x represents the first spectral subtraction parameter, y represents the user voice prediction power spectrum, z represents the environmental noise prediction power spectrum, the value of F3(x, y, z) is positively related to x, negatively related to y, and positively related to z.
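The embodiment specifies only the sign of each dependence, not the functional forms of F1, F2, and F3. The linear forms and gain constants below are purely illustrative examples that satisfy the stated monotonic relations:

```python
def F1(x, y, k=0.1):
    # Increases with the first spectral subtraction parameter x,
    # decreases with the user voice prediction power spectrum y.
    return x - k * y

def F2(x, z, k=0.1):
    # Increases with x and with the ambient noise prediction power spectrum z.
    return x + k * z

def F3(x, y, z, ky=0.1, kz=0.1):
    # Increases with x and z, decreases with y.
    return x - ky * y + kz * z
```

Any other functions with the same monotonicity (for example, multiplicative or saturating forms) would equally fit the description.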
  • the voice enhancement device 90 further includes:
  • a third determining module, configured to determine a target user power spectrum cluster according to the power spectrum of the noisy speech signal and the user power spectrum distribution class, where the user power spectrum distribution class includes: at least one user historical power spectrum cluster, and the target user power spectrum cluster is the cluster in the at least one user historical power spectrum cluster that is closest to the power spectrum of the noisy speech signal;
  • a fourth determining module configured to determine a user voice prediction power spectrum according to the power spectrum of the noisy speech signal and the target user power spectrum cluster.
  • the voice enhancement device 90 further includes:
  • a fifth determining module, configured to determine a target noise power spectrum cluster according to the power spectrum of the noise signal and the noise power spectrum distribution class, where the noise power spectrum distribution class includes: at least one noise historical power spectrum cluster, and the target noise power spectrum cluster is the cluster in the at least one noise historical power spectrum cluster that is closest to the power spectrum of the noise signal;
  • a sixth determining module configured to determine an ambient noise predicted power spectrum according to the power spectrum of the noise signal and the target noise power spectrum cluster.
  • the voice enhancement device 90 further includes:
  • a third determining module configured to determine a target user power spectrum cluster according to a power spectrum of the noisy speech signal and a user power spectrum distribution class
  • a fifth determining module, configured to determine a target noise power spectrum cluster according to the power spectrum of the noise signal and the noise power spectrum distribution class, where the user power spectrum distribution class includes: at least one user historical power spectrum cluster; the target user power spectrum cluster is the cluster in the at least one user historical power spectrum cluster that is closest to the power spectrum of the noisy speech signal; the noise power spectrum distribution class includes: at least one noise historical power spectrum cluster; and the target noise power spectrum cluster is the cluster in the at least one noise historical power spectrum cluster that is closest to the power spectrum of the noise signal;
  • a fourth determining module configured to determine a user voice prediction power spectrum according to a power spectrum of the noisy speech signal and a target user power spectrum cluster
  • a sixth determining module configured to determine an ambient noise predicted power spectrum according to the power spectrum of the noise signal and the target noise power spectrum cluster.
  • the fourth determining module is specifically configured to: determine the user voice prediction power spectrum according to the function F4(SP, SPT) = a*SP + (1 - a)*SPT, where SP represents the power spectrum of the noisy speech signal, SPT represents the target user power spectrum cluster, and a represents the first estimated coefficient.
  • the sixth determining module is specifically configured to: determine the ambient noise prediction power spectrum according to the function F5(NP, NPT) = b*NP + (1 - b)*NPT, where NP represents the power spectrum of the noise signal, NPT represents the target noise power spectrum cluster, and b represents the second estimated coefficient.
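The F5 combination b*NP + (1 - b)*NPT can be sketched as an element-wise convex combination over subbands; the particular value of b and the list representation are illustrative:

```python
def F5(NP, NPT, b=0.5):
    # Ambient noise prediction: b*NP + (1 - b)*NPT applied per subband;
    # b is the second estimated coefficient (assumed here to lie in [0, 1]).
    return [b * n + (1 - b) * nt for n, nt in zip(NP, NPT)]
```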
  • the voice enhancement device 90 further includes:
  • the first obtaining module is configured to acquire a user power spectrum distribution class.
  • the voice enhancement device 90 further includes:
  • the second obtaining module is configured to obtain a noise power spectrum distribution class.
  • the voice enhancement device of this embodiment may be used to implement the technical solution of the foregoing voice enhancement method embodiment of the present application, and the implementation principle and technical effects thereof are similar, and details are not described herein again.
  • FIG. 9B is a schematic structural diagram of a voice enhancement apparatus according to another embodiment of the present disclosure.
  • the voice enhancement apparatus provided by the embodiment of the present application may include: a VAD module, a noise estimation module, a spectrum subtraction parameter calculation module, a spectrum analysis module, a spectrum subtraction module, an online learning module, a parameter optimization module, and phase recovery. Module.
  • The VAD module is respectively connected to the noise estimation module and the spectrum analysis module; the noise estimation module is respectively connected to the online learning module and the spectral subtraction parameter calculation module; the spectrum analysis module is respectively connected to the online learning module and the spectral subtraction module; the parameter optimization module is respectively connected to the online learning module, the spectral subtraction parameter calculation module, and the spectral subtraction module; and the spectral subtraction module is also connected to the spectral subtraction parameter calculation module and the phase recovery module.
  • the VAD module is configured to divide the sound signal collected by the microphone into a noisy speech signal and a noise signal; the noise estimation module is configured to estimate a power spectrum of the noise signal; and the spectrum analysis module is configured to estimate a power spectrum of the noisy speech signal.
  • the phase recovery module is configured to recover the enhanced speech signal according to the phase information determined in the spectrum analysis module and the denoised speech signal processed by the spectral subtraction module.
  • As shown in FIG. 9B, the function of the spectral subtraction parameter calculation module may be the same as that of the first determining module 901 in the foregoing embodiment; the function of the parameter optimization module may be the same as that of the second determining module 902 in the foregoing embodiment; the function of the spectral subtraction module may be the same as that of the spectrum subtraction module 903 in the foregoing embodiment; and the function of the online learning module may be the same as all the functions of the third determining module, the fourth determining module, the fifth determining module, the sixth determining module, the first obtaining module, and the second obtaining module in the foregoing embodiments.
  • the voice enhancement device of this embodiment may be used to implement the technical solution of the foregoing voice enhancement method embodiment of the present application, and the implementation principle and technical effects thereof are similar, and details are not described herein again.
  • FIG. 10 is a schematic structural diagram of a voice enhancement apparatus according to another embodiment of the present disclosure.
  • the voice enhancement apparatus provided by the embodiment of the present application includes: a processor 1001 and a memory 1002;
  • the memory 1002 is configured to store program instructions;
  • the processor 1001 is configured to invoke and execute the program instructions stored in the memory 1002 to implement the technical solution of the voice enhancement method embodiment of the present application.
  • the implementation principle and technical effects are similar, and details are not described herein again.
  • Figure 10 only shows a simplified design of the speech enhancement device.
  • the voice enhancement device may also include any number of transmitters, receivers, processors, memories, and/or communication units, etc., which are not limited in this embodiment.
  • FIG. 11 is a schematic structural diagram of a voice enhancement apparatus according to another embodiment of the present disclosure.
  • the voice enhancement device provided by the embodiment of the present application may be a terminal device.
  • the terminal device is described by using the mobile phone 100 as an example. It should be understood that the illustrated mobile phone 100 is only one example of a terminal device, and the mobile phone 100 may have more or fewer components than those shown in the figure, may combine two or more components, or may have a different component configuration.
  • the mobile phone 100 may specifically include: a processor 101, a radio frequency (RF) circuit 102, a memory 103, a touch screen 104, a Bluetooth device 105, one or more sensors 106, a wireless fidelity (Wi-Fi) device 107, a positioning device 108, an audio circuit 109, a speaker 113, a microphone 114, a peripheral interface 110, and a power supply device 111.
  • the touch screen 104 can include: a touch panel 104-1 and a display 104-2. These components can communicate over one or more communication buses or signal lines (not shown in Figure 11).
  • the structure shown in FIG. 11 does not constitute a limitation on the mobile phone, and the mobile phone 100 may include more or fewer components than those illustrated, combine some components, or have a different component arrangement.
  • the audio components of the mobile phone 100 will be specifically described below in conjunction with the components involved in the present application, and other components will not be described in detail.
  • the audio circuit 109, the speaker 113, and the microphone 114 may provide an audio interface between the user and the mobile phone 100.
  • the audio circuit 109 can convert received audio data into an electrical signal and transmit it to the speaker 113, which converts it into a sound signal for output.
  • the microphone 114 is generally a combination of two or more microphones. The microphone 114 converts the collected sound signal into an electrical signal, which is received by the audio circuit 109 and then converted into audio data; the audio data is then output to the RF circuit 102 for transmission to, for example, another mobile phone, or output to the memory 103 for further processing.
  • the audio circuit can include a dedicated processor.
  • the technical solution in the foregoing voice enhancement method embodiment of the present application may run in a dedicated processor in the audio circuit 109, or may run in the processor 101 shown in FIG. 11; the implementation principle and technical effects are similar, and details are not described herein again.
  • the embodiment of the present application further provides a program, which is used to execute the technical solution of the foregoing voice enhancement method embodiment of the present application, and the implementation principle and technical effects thereof are similar, and details are not described herein again.
  • the embodiment of the present application further provides a computer program product including instructions, which, when run on a computer, causes the computer to execute the technical solution of the foregoing voice enhancement method embodiment of the present application; the implementation principle and technical effects are similar, and details are not described herein again.
  • the embodiment of the present application further provides a computer readable storage medium, where the computer readable storage medium stores instructions that, when run on a computer, cause the computer to execute the technical solution of the voice enhancement method embodiment of the present application; the implementation principle and technical effects are similar, and details are not described herein again.
  • the disclosed apparatus and method may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the units is only a logical function division; in actual implementation, there may be another division manner. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional units.
  • the above-described integrated unit implemented in the form of a software functional unit can be stored in a computer readable storage medium.
  • the software functional unit described above is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to perform the methods described in various embodiments of the present application. Part of the steps.
  • the foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, a network device, a terminal device, or other programmable device.
  • the computer instructions can be stored in a computer readable storage medium or transferred from one computer readable storage medium to another; for example, the computer instructions can be transferred from a website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, or microwave).
  • the computer readable storage medium can be any available media that can be accessed by a computer or a data storage device such as a server, data center, or the like that includes one or more available media.
  • the usable medium may be a magnetic medium (eg, a floppy disk, a hard disk, a magnetic tape), an optical medium (eg, a DVD), or a semiconductor medium (such as a solid state disk (SSD)).


Abstract

A voice enhancement method and apparatus. The method includes: determining a first spectral subtraction parameter according to a power spectrum of a noisy speech signal and a power spectrum of a noise signal (S201); determining a second spectral subtraction parameter according to the first spectral subtraction parameter and a reference power spectrum (S202); and performing spectral subtraction processing on the noisy speech signal according to the power spectrum of the noise signal and the second spectral subtraction parameter (S203), where the reference power spectrum includes: a user voice prediction power spectrum and/or an environmental noise prediction power spectrum. By considering the regularity of the user voice power spectrum characteristics of the terminal device and/or of the power spectrum characteristics of the ambient noise in the user's environment, the first spectral subtraction parameter is optimized to obtain the second spectral subtraction parameter, so that spectral subtraction processing is performed on the noisy speech signal according to the optimized second spectral subtraction parameter. This improves the intelligibility and naturalness of the denoised speech signal and thereby improves the noise reduction performance.

Description

语音增强方法及装置
本申请要求于2017年12月18日提交中国专利局、申请号为201711368189.X、申请名称为“一种自适应降噪的方法和终端”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及语音处理技术领域,尤其涉及一种语音增强方法及装置。
背景技术
随着通讯技术和网络技术的飞速发展,语音通信已远远超越了传统的以固定电话为主要形式的范畴,在手机通信、电视/电话会议、车载免提通信、网络电话(Voice over Internet Protocol,VoIP)等诸多领域中被广泛应用。在语音通信的应用中,可能由于环境中的噪声(例如街道、餐馆、候车室、候机厅等)使得用户的语音信号变得模糊,可懂度降低。因此,如何消除麦克风采集到的声音信号中的噪声是亟待解决的问题。
通常情况下采用谱减法消除声音信号中的噪声。图1为传统的谱减法的流程示意图,如图1所示,通过语音检测(Voice Activity Detection,VAD)将麦克风采集到的声音信号划分为带噪语音信号和噪声信号。进一步地,带噪语音信号通过快速傅立叶变换(Fast Fourier Transform,FFT)变换得到幅度信息和相位信息(其中,幅度信息通过功率谱估计得到带噪语音信号的功率谱),以及噪声信号通过噪声功率谱估计得到噪声信号的功率谱。进一步地,根据噪声信号的功率谱以及带噪语音信号的功率谱,通过谱减参数计算处理得到谱减参数;其中,谱减参数包括但不限于以下至少一项:过减因子α(α>1)和频谱阶β(0≤β≤1)。进一步地,根据噪声信号的功率谱以及谱减参数,对带噪语音信号的幅度信息进行谱减处理得到去噪后的语音信号。进一步地,根据去噪后的语音信号以及带噪语音信号的相位信息进行快速傅里叶反变换(Inverse Fast Fourier Transform,IFFT)变换以及叠加等处理,得到增强后的语音信号。
但传统的谱减法中功率谱直接相减的方式,会使去噪后的语音信号容易产生“音乐噪声”,从而会直接影响语音信号的可懂度和自然度。
发明内容
本申请实施例提供一种语音增强方法及装置,通过根据用户语音功率谱特性和/或用户所处环境噪声功率谱特性对谱减参数的适应性调整,从而提高了去噪后的语音信号的可懂度和自然度,提高了降噪性能。
第一方面,本申请实施例提供一种语音增强方法,包括:
根据带噪语音信号的功率谱以及噪声信号的功率谱,确定第一谱减参数;其中,带噪语音信号以及噪声信号为对麦克风所采集到的声音信号进行划分处理后得到的;
根据第一谱减参数以及参考功率谱确定第二谱减参数;其中,参考功率谱包括:用户语音预测功率谱和/或环境噪声预测功率谱;
根据噪声信号的功率谱和第二谱减参数对带噪语音信号进行谱减处理。
第一方面提供的语音增强方法实施例中,通过根据带噪语音信号的功率谱以及噪声信号的功率谱,确定第一谱减参数;进一步地,根据第一谱减参数以及参考功率谱确定第二谱减参数,并根据噪声信号的功率谱和第二谱减参数对带噪语音信号进行谱减处理;其中,参考功率谱包括:用户语音预测功率谱和/或环境噪声预测功率谱。可见,本实施例中,通过考虑到终端设备的用户语音功率谱特性和/或用户所处环境噪声功率谱特性的规律性,对第一谱减参数进行优化处理得到第二谱减参数,以便根据优化后的第二谱减参数对带噪语音信号进行谱减处理,不仅可以适用较宽的信噪比范围,而且提高了去噪后的语音信号的可懂度和自然度,提高了降噪性能。
在一种可能的实现方式中,若参考功率谱包括:用户语音预测功率谱,根据第一谱减参数以及参考功率谱确定第二谱减参数,包括:
根据第一谱减函数F1(x,y)确定第二谱减参数;其中,x代表第一谱减参数;y代表用户语音预测功率谱;F1(x,y)的值与x成正向关系,F1(x,y)的值与y成负向关系。
本实现方式提供的语音增强方法实施例中,通过考虑到终端设备的用户语音功率谱特性的规律性,对第一谱减参数进行优化处理得到第二谱减参数,以便根据第二谱减参数对带噪语音信号进行谱减处理,从而可以对终端设备的用户语音进行保护,提高了去噪后的语音信号的可懂度和自然度。
在一种可能的实现方式中,若参考功率谱包括:环境噪声预测功率谱,根据第一谱减参数以及参考功率谱确定第二谱减参数,包括:
根据第二谱减函数F2(x,z)确定第二谱减参数;其中,x代表第一谱减参数;z代表环境噪声预测功率谱;F2(x,z)的值与x成正向关系,F2(x,z)的值与z成正向关系。
本实现方式提供的语音增强方法实施例中,通过考虑到用户所处环境噪声功率谱特性的规律性,对第一谱减参数进行优化处理得到第二谱减参数,以便根据第二谱减参数对带噪语音信号进行谱减处理,从而可以更加准确地去掉带噪语音信号中的噪声信号,提高了去噪后的语音信号的可懂度和自然度。
在一种可能的实现方式中,若参考功率谱包括:用户语音预测功率谱和环境噪声预测功率谱,根据第一谱减参数以及参考功率谱确定第二谱减参数,包括:
根据第三谱减函数F3(x,y,z)确定第二谱减参数;其中,x代表第一谱减参数;y代表用户语音预测功率谱;z代表环境噪声预测功率谱;F3(x,y,z)的值与x成正向关系,F3(x,y,z)的值与y成负向关系,且F3(x,y,z)的值与z成正向关系。
本实现方式提供的语音增强方法实施例中,通过考虑到终端设备的用户语音功率谱特性和用户所处环境噪声功率谱特性的规律性,对第一谱减参数进行优化处理得到第二谱减参数,以便根据第二谱减参数对带噪语音信号进行谱减处理,从而不仅可以对终端设备的用户语音进行保护,还可以更加准确地去掉带噪语音信号中的噪声信号,提高了去噪后的语音信号的可懂度和自然度。
在一种可能的实现方式中,根据第一谱减参数以及参考功率谱确定第二谱减参数之前,还包括:
根据带噪语音信号的功率谱以及用户功率谱分布类确定目标用户功率谱聚类;其中,用户功率谱分布类包括:至少一个用户历史功率谱聚类;目标用户功率谱聚类为至少一个用户历史功率谱聚类中与带噪语音信号的功率谱距离最近的聚类;
根据带噪语音信号的功率谱以及目标用户功率谱聚类确定用户语音预测功率谱。
本实现方式提供的语音增强方法实施例中,通过根据带噪语音信号的功率谱以及用户功率谱分布类确定目标用户功率谱聚类;进一步地,根据带噪语音信号的功率谱以及目标用户功率谱聚类确定用户语音预测功率谱,以便进一步地根据用户语音预测功率谱对第一谱减参数进行优化处理得到第二谱减参数,并根据优化后的第二谱减参数对带噪语音信号进行谱减处理,从而可以对终端设备的用户语音进行保护,提高了去噪后的语音信号的可懂度和自然度。
在一种可能的实现方式中,根据第一谱减参数以及参考功率谱确定第二谱减参数之前,还包括:
根据噪声信号的功率谱以及噪声功率谱分布类确定目标噪声功率谱聚类;其中,噪声功率谱分布类包括:至少一个噪声历史功率谱聚类;目标噪声功率谱聚类为至少一个噪声历史功率谱聚类中与噪声信号的功率谱距离最近的聚类;
根据噪声信号的功率谱以及目标噪声功率谱聚类确定环境噪声预测功率谱。
本实现方式提供的语音增强方法实施例中,通过根据噪声信号的功率谱以及噪声功率谱分布类确定目标噪声功率谱聚类;进一步地,根据噪声信号的功率谱以及目标噪声功率谱聚类确定环境噪声预测功率谱,以便进一步地根据环境噪声预测功率谱对第一谱减参数进行优化处理得到第二谱减参数,并根据优化后的第二谱减参数对带噪语音信号进行谱减处理,从而可以更加准确地去掉带噪语音信号中的噪声信号,提高了去噪后的语音信号的可懂度和自然度。
在一种可能的实现方式中,根据第一谱减参数以及参考功率谱确定第二谱减参数之前,还包括:
根据带噪语音信号的功率谱以及用户功率谱分布类确定目标用户功率谱聚类,以及根据噪声信号的功率谱以及噪声功率谱分布类确定目标噪声功率谱聚类;其中,用户功率谱分布类包括:至少一个用户历史功率谱聚类;目标用户功率谱聚类为至少一个用户历史功率谱聚类中与带噪语音信号的功率谱距离最近的聚类;噪声功率谱分布类包括:至少一个噪声历史功率谱聚类;目标噪声功率谱聚类为至少一个噪声历史功率谱聚类中与噪声信号的功率谱距离最近的聚类;
根据带噪语音信号的功率谱以及目标用户功率谱聚类确定用户语音预测功率谱;
根据噪声信号的功率谱以及目标噪声功率谱聚类确定环境噪声预测功率谱。
本实现方式提供的语音增强方法实施例中,通过根据带噪语音信号的功率谱以及用户功率谱分布类确定目标用户功率谱聚类,以及根据噪声信号的功率谱以及噪声功率谱分布类确定目标噪声功率谱聚类;进一步地,根据带噪语音信号的功率谱以及目标用户功率谱聚类确定用户语音预测功率谱,以及根据噪声信号的功率谱以及目标噪 声功率谱聚类确定环境噪声预测功率谱,以便进一步地根据用户语音预测功率谱和环境噪声预测功率谱对第一谱减参数进行优化处理得到第二谱减参数,并根据优化后的第二谱减参数对带噪语音信号进行谱减处理,从而不仅可以对终端设备的用户语音进行保护,还可以更加准确地去掉带噪语音信号中的噪声信号,提高了去噪后的语音信号的可懂度和自然度。
在一种可能的实现方式中,根据带噪语音信号的功率谱以及目标用户功率谱聚类确定用户语音预测功率谱,包括:
根据第一估计函数F4(SP,SPT)确定用户语音预测功率谱;其中,SP代表带噪语音信号的功率谱;SPT代表目标用户功率谱聚类;F4(SP,SPT)=a*SP+(1-a)*SPT,a代表第一估计系数。
在一种可能的实现方式中,根据噪声信号的功率谱以及目标噪声功率谱聚类确定环境噪声预测功率谱,包括:
根据第二估计函数F5(NP,NPT)确定环境噪声预测功率谱;其中,NP代表噪声信号的功率谱;NPT代表目标噪声功率谱聚类;F5(NP,NPT)=b*NP+(1-b)*NPT,b代表第二估计系数。
在一种可能的实现方式中,根据带噪语音信号的功率谱以及用户功率谱分布类确定目标用户功率谱聚类之前,还包括:
获取用户功率谱分布类。
本实现方式提供的语音增强方法实施例中,通过每次根据去噪后的语音信号动态调整用户功率谱分布类,以便后续可以更加准确地确定用户语音预测功率谱,进一步根据用户语音预测功率谱对第一谱减参数进行优化处理得到第二谱减参数,并根据优化后的第二谱减参数对带噪语音信号进行谱减处理,从而可以对终端设备的用户语音进行保护,提高了降噪性能。
在一种可能的实现方式中,根据噪声信号的功率谱以及噪声功率谱分布类确定目标噪声功率谱聚类之前,还包括:
获取噪声功率谱分布类。
本实现方式提供的语音增强方法实施例中,通过每次根据噪声信号的功率谱动态调整噪声功率谱分布类,以便后续可以更加准确地确定环境噪声预测功率谱,进一步根据环境噪声预测功率谱对第一谱减参数进行优化处理得到第二谱减参数,并根据优化后的第二谱减参数对带噪语音信号进行谱减处理,从而可以更加准确地去掉带噪语音信号中的噪声信号,提高了降噪性能。
第二方面,本申请实施例提供一种语音增强装置,包括:
第一确定模块,用于根据带噪语音信号的功率谱以及噪声信号的功率谱,确定第一谱减参数;其中,带噪语音信号以及噪声信号为对麦克风所采集到的声音信号进行划分处理后得到的;
第二确定模块,用于根据第一谱减参数以及参考功率谱确定第二谱减参数;其中,参考功率谱包括:用户语音预测功率谱和/或环境噪声预测功率谱;
谱减模块,用于根据噪声信号的功率谱和第二谱减参数对带噪语音信号进行谱减处理。
在一种可能的实现方式中,若参考功率谱包括:用户语音预测功率谱,第二确定模块具体用于:
根据第一谱减函数F1(x,y)确定第二谱减参数;其中,x代表第一谱减参数;y代表用户语音预测功率谱;F1(x,y)的值与x成正向关系,F1(x,y)的值与y成负向关系。
在一种可能的实现方式中,若参考功率谱包括:环境噪声预测功率谱,第二确定模块具体用于:
根据第二谱减函数F2(x,z)确定第二谱减参数;其中,x代表第一谱减参数;z代表环境噪声预测功率谱;F2(x,z)的值与x成正向关系,F2(x,z)的值与z成正向关系。
在一种可能的实现方式中,若参考功率谱包括:用户语音预测功率谱和环境噪声预测功率谱,第二确定模块具体用于:
根据第三谱减函数F3(x,y,z)确定第二谱减参数;其中,x代表第一谱减参数;y代表用户语音预测功率谱;z代表环境噪声预测功率谱;F3(x,y,z)的值与x成正向关系,F3(x,y,z)的值与y成负向关系,且F3(x,y,z)的值与z成正向关系。
在一种可能的实现方式中,该装置还包括:
第三确定模块,用于根据带噪语音信号的功率谱以及用户功率谱分布类确定目标用户功率谱聚类;其中,用户功率谱分布类包括:至少一个用户历史功率谱聚类;目标用户功率谱聚类为至少一个用户历史功率谱聚类中与带噪语音信号的功率谱距离最近的聚类;
第四确定模块,用于根据带噪语音信号的功率谱以及目标用户功率谱聚类确定用户语音预测功率谱。
在一种可能的实现方式中,该装置还包括:
第五确定模块,用于根据噪声信号的功率谱以及噪声功率谱分布类确定目标噪声功率谱聚类;其中,噪声功率谱分布类包括:至少一个噪声历史功率谱聚类;目标噪声功率谱聚类为至少一个噪声历史功率谱聚类中与噪声信号的功率谱距离最近的聚类;
第六确定模块,用于根据噪声信号的功率谱以及目标噪声功率谱聚类确定环境噪声预测功率谱。
在一种可能的实现方式中,该装置还包括:
第三确定模块,用于根据带噪语音信号的功率谱以及用户功率谱分布类确定目标用户功率谱聚类;
第五确定模块,用于根据噪声信号的功率谱以及噪声功率谱分布类确定目标噪声功率谱聚类;其中,用户功率谱分布类包括:至少一个用户历史功率谱聚类;目标用户功率谱聚类为至少一个用户历史功率谱聚类中与带噪语音信号的功率谱距离最近的聚类;噪声功率谱分布类包括:至少一个噪声历史功率谱聚类;目标噪声功率谱聚类为至少一个噪声历史功率谱聚类中与噪声信号的功率谱距离最近的聚类;
第四确定模块,用于根据带噪语音信号的功率谱以及目标用户功率谱聚类确定用户语音预测功率谱;
第六确定模块,用于根据噪声信号的功率谱以及目标噪声功率谱聚类确定环境噪声预测功率谱。
在一种可能的实现方式中,第四确定模块具体用于:
根据第一估计函数F4(SP,SPT)确定用户语音预测功率谱;其中,SP代表带噪语音信号的功率谱;SPT代表目标用户功率谱聚类;F4(SP,SPT)=a*SP+(1-a)*SPT,a代表第一估计系数。
在一种可能的实现方式中,第六确定模块具体用于:
根据第二估计函数F5(NP,NPT)确定环境噪声预测功率谱;其中,NP代表噪声信号的功率谱;NPT代表目标噪声功率谱聚类;F5(NP,NPT)=b*NP+(1-b)*NPT,b代表第二估计系数。
在一种可能的实现方式中,该装置还包括:
第一获取模块,用于获取用户功率谱分布类。
在一种可能的实现方式中,该装置还包括:
第二获取模块,用于获取噪声功率谱分布类。
上述第二方面的实现方式所提供的语音增强装置,其有益效果可以参见上述第一方面的实现方式所带来的有益效果,在此不再赘述。
第三方面,本申请实施例提供一种语音增强装置,包括处理器和存储器;
其中,存储器,用于存储程序指令;
处理器,用于调用并执行存储器中存储的程序指令,实现如上述第一方面所描述的任意一种方法。
上述第三方面的实现方式所提供的语音增强装置,其有益效果可以参见上述第一方面的实现方式所带来的有益效果,在此不再赘述。
第四方面,本申请实施例提供一种程序,该程序在被处理器执行时用于执行以上第一方面的方法。
第五方面,本申请实施例提供一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述第一方面的方法。
第六方面,本申请实施例提供一种计算机可读存储介质,计算机可读存储介质中存储有指令,当其在计算机上运行时,使得计算机执行上述第一方面的方法。
附图说明
图1为传统的谱减法的流程示意图;
图2A为本申请实施例提供的应用场景示意图;
图2B为本申请一实施例提供的具有麦克风的终端设备的结构示意图;
图2C为本申请实施例提供的不同用户的语音频谱示意图;
图2D为本申请一实施例提供的语音增强方法的流程示意图;
图3A为本申请另一实施例提供的语音增强方法的流程示意图;
图3B为本申请实施例提供的用户功率谱分布类示意图;
图3C为本申请实施例提供的用户语音功率谱特性的学习流程示意图;
图4A为本申请另一实施例提供的语音增强方法的流程示意图;
图4B为本申请实施例提供的噪声功率谱分布类示意图;
图4C为本申请实施例提供的噪声功率谱特性的学习流程示意图;
图5为本申请另一实施例提供的语音增强方法的流程示意图;
图6A为本申请另一实施例提供的语音增强方法的流程示意图一;
图6B为本申请另一实施例提供的语音增强方法的流程示意图二;
图7A为本申请另一实施例提供的语音增强方法的流程示意图三;
图7B为本申请另一实施例提供的语音增强方法的流程示意图四;
图8A为本申请另一实施例提供的语音增强方法的流程示意图五;
图8B为本申请另一实施例提供的语音增强方法的流程示意图六;
图9A为本申请一实施例提供的语音增强装置的结构示意图;
图9B为本申请另一实施例提供的语音增强装置的结构示意图;
图10为本申请另一实施例提供的语音增强装置的结构示意图;
图11为本申请另一实施例提供的语音增强装置的结构示意图。
具体实施方式
首先,对本申请实施例中所涉及的应用场景和部分词汇进行解释说明。
图2A为本申请实施例提供的应用场景示意图。如图2A所示,当任意两个终端设备之间进行语音通信时,该终端设备中可以执行本申请实施例提供的语音增强方法;当然,本申请实施例还可以应用于其它场景,本申请实施例中,对此并不作限制。
需要说明的是,为了便于理解,图2A中仅示出两个终端设备(如终端设备1和终端设备2),当然还可以包括其它数量的终端设备,本申请实施例中对此并不作限制。
本申请实施例中,执行语音增强方法的装置可以是终端设备,也可以是终端设备中语音增强方法的装置。示例性地,终端设备中语音增强方法的装置可以是芯片系统、电路或者模块等,本申请不作限制。
本申请涉及的终端设备可以包括但不限于以下任一项:手机、平板电脑、个人数字助理等具有语音通信功能的设备,还可以是其它具有语音通信功能的设备。
本申请所涉及的终端设备可以包括硬件层、运行在硬件层之上的操作系统层,以及运行在操作系统层上的应用层。该硬件层包括中央处理器(Central Processing Unit,CPU)、内存管理单元(Memory Management Unit,MMU)和内存(也称为主存)等硬件。该操作系统可以是任意一种或多种通过进程(Process)实现业务处理的计算机操作系统,例如,Linux操作系统、Unix操作系统、Android操作系统、iOS操作系统或windows操作系统等。该应用层包含浏览器、通讯录、文字处理软件、即时通信软件等应用。
本申请实施例中的编号“第一”以及“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序,不应对本申请实施例构成任何限定。
本申请实施例涉及的第一谱减参数可以包括但不限于以下至少一项:第一过减因子α(α>1)和第一频谱阶β(0≤β≤1)。
本申请实施例中涉及的第二谱减参数为对第一谱减参数作优化处理后得到的谱减参数。
本申请实施例涉及的第二谱减参数可以包括但不限于以下至少一项:第二过减因子α'(α'>1)和第二频谱阶β'(0≤β'≤1)。
本申请实施例涉及的各功率谱可以指:不考虑子带划分的功率谱,或者考虑子带划分的功率谱(或者称之为子带功率谱)。示例性地,1)若考虑子带划分,则带噪语音信号的功率谱可以称之为带噪语音信号的子带功率谱;2)若考虑子带划分,则噪声信号的功率谱可以称之为噪声信号的子带功率谱;3)若考虑子带划分,则用户语音预测功率谱可以称之为用户语音预测子带功率谱;4)若考虑子带划分,则环境噪声预测功率谱可以称之为环境噪声预测子带功率谱;5)若考虑子带划分,则用户功率谱分布类可以称之为用户子带功率谱分布类;6)若考虑子带划分,则用户历史功率谱聚类可以称之为用户历史子带功率谱聚类;7)若考虑子带划分,则目标用户功率谱聚类可以称之为目标用户子带功率谱聚类;8)若考虑子带划分,则噪声功率谱分布类可以称之为噪声子带功率谱分布类;9)若考虑子带划分,则噪声历史功率谱聚类可以称之为噪声历史子带功率谱聚类;10)若考虑子带划分,则目标噪声功率谱聚类可以称之为目标噪声子带功率谱聚类。
通常情况下采用谱减法消除声音信号中的噪声。如图1所示,通过VAD将麦克风采集到的声音信号划分为带噪语音信号和噪声信号。进一步地,带噪语音信号通过FFT变换得到幅度信息和相位信息(其中,幅度信息通过功率谱估计得到带噪语音信号的功率谱),以及噪声信号通过噪声功率谱估计得到噪声信号的功率谱。进一步地,根据噪声信号的功率谱以及带噪语音信号的功率谱,通过谱减参数计算处理得到谱减参数。进一步地,根据噪声信号的功率谱以及谱减参数,对带噪语音信号的幅度信息进行谱减处理得到去噪后的语音信号。进一步地,根据去噪后的语音信号以及带噪语音信号的相位信息进行IFFT变换以及叠加等处理,得到增强后的语音信号。
但传统的谱减法中功率谱直接相减的方式,一方面适用的信噪比范围较窄,在信噪比较低时对语音的可懂度损伤较大,另一方面也会使去噪后的语音信号容易产生“音乐噪声”,都会直接影响语音信号的可懂度和自然度。
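上述“功率谱直接相减”的基本运算可以用如下Python草图示意(仅为便于理解的示意实现,函数名与默认参数α=4.0、β=0.01均为本示例假设的取值,并非本申请或任何现有产品的具体实现):

```python
import numpy as np

def spectral_subtraction(noisy_power, noise_power, alpha=4.0, beta=0.01):
    """传统谱减法的示意实现(假设两输入为同长度的功率谱数组)。

    alpha: 过减因子(α>1);beta: 频谱阶(0≤β≤1),用作谱下限。
    返回去噪后的功率谱:max(|Y|^2 - α|N|^2, β|N|^2)。
    """
    noisy_power = np.asarray(noisy_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    subtracted = noisy_power - alpha * noise_power   # 功率谱直接相减
    floor = beta * noise_power                       # 谱下限,抑制负值
    return np.maximum(subtracted, floor)
```

减后为负的频点被强制抬到β倍噪声功率处,这种硬性截断正是“音乐噪声”的主要来源之一。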
本申请实施例涉及的麦克风所采集到的声音信号可以为通过终端设备中的双麦克(示例性地,图2B为本申请一实施例提供的具有麦克风的终端设备的结构示意图,如图2B所示的第一麦克风和第二麦克风)所采集到的声音信号,当然还可以为通过终端设备中的其它数量个麦克风所采集到的声音信号,本申请实施例中对此并不作限制。需要说明的是,图2B中每个麦克风的位置仅为示例性地,还可以设置在终端设备的其它位置,本申请实施例中对此并不作限制。
随着终端设备的普遍使用,终端设备个性化使用趋势明显(或者说终端设备通常只会对应一个特定的用户),由于不同用户的声道特性差异明显,不同用户的语音频谱特性明显不同(或者说用户的语音频谱特性具有明显的个性化)。示例性地,图2C为本申请实施例提供的不同用户的语音频谱示意图,如图2C所示,在同样的环境噪声中(如图2C中的环境噪声频谱),不同用户即使说相同的词语,其语音频谱特性(如图2C中的女声AO对应的语音频谱、女声DJ对应的语音频谱、男声MH对应的语音频谱和男声MS对应的语音频谱)互不相同。
另外,考虑到特定用户的通话场景具有一定的规律性(例如,该用户通常8:00至17:00处于安静的室内办公,17:10至19:00处于嘈杂的地铁上等),因此,特定用户所处环境噪声功率谱特性存在一定的规律性。
本申请实施例提供的语音增强方法及装置,考虑到终端设备的用户语音功率谱特性的规律性和/或用户所处环境噪声功率谱特性的规律性,通过对第一谱减参数进行优化处理得到第二谱减参数,以便根据优化后的第二谱减参数对带噪语音信号进行谱减处理,不仅可以适用较宽的信噪比范围,而且提高了去噪后的语音信号的可懂度和自然度,提高了降噪性能。
下面以具体地实施例对本申请的技术方案以及本申请的技术方案如何解决上述技术问题进行详细说明。下面这几个具体的实施例可以相互结合,对于相同或相似的概念或过程可能在某些实施例中不再赘述。
图2D为本申请一实施例提供的语音增强方法的流程示意图。如图2D所示,本申请实施例的方法可以包括:
步骤S201、根据带噪语音信号的功率谱以及噪声信号的功率谱,确定第一谱减参数。
本步骤中,根据带噪语音信号的功率谱以及噪声信号的功率谱,确定第一谱减参数;其中,带噪语音信号以及噪声信号为对麦克风所采集到的声音信号进行划分处理后得到的。
可选地,根据带噪语音信号的功率谱以及噪声信号的功率谱,确定第一谱减参数的方式可以参考现有技术中的谱减参数计算过程,此处不再赘述。
可选地,第一谱减参数可以包括:第一过减因子α和/或第一频谱阶β,当然还可以包括其它参数,本申请实施例中对此并不作限制。
步骤S202、根据第一谱减参数以及参考功率谱确定第二谱减参数。
本步骤中,考虑到终端设备的用户语音功率谱特性和/或用户所处环境噪声功率谱特性的规律性,对第一谱减参数进行优化处理得到第二谱减参数,以便根据第二谱减参数对带噪语音信号进行谱减处理,从而可以提高去噪后的语音信号的可懂度和自然度。
具体地,根据第一谱减参数以及参考功率谱确定第二谱减参数;其中,参考功率谱包括:用户语音预测功率谱和/或环境噪声预测功率谱。示例性地,根据第一谱减参数、参考功率谱以及谱减函数确定第二谱减参数;其中,谱减函数可以包括但不限于以下至少一项:第一谱减函数F1(x,y)、第二谱减函数F2(x,z)以及第三谱减函数F3(x,y,z)。
本实施例中涉及的用户语音预测功率谱为:根据用户历史功率谱以及带噪语音信号的功率谱所预测的用户语音功率谱(可以用于体现用户语音功率谱特性)。
本实施例中涉及的环境噪声预测功率谱为:根据噪声历史功率谱以及噪声信号的功率谱所预测的环境噪声功率谱(可以用于体现用户所处环境噪声功率谱特性)。
本申请实施例下述部分中,按参考功率谱所包括内容的不同,分别对“根据第一谱减参数以及参考功率谱确定第二谱减参数”的具体实现方式进行说明:
第一种可实现方式:若参考功率谱包括:用户语音预测功率谱,根据第一谱减函数F1(x,y)确定第二谱减参数。
本实现方式中,若考虑到终端设备的用户语音功率谱特性的规律性(参考功率谱包括:用户语音预测功率谱),则根据第一谱减函数F1(x,y)确定第二谱减参数;其中,x代表第一谱减参数;y代表用户语音预测功率谱;F1(x,y)的值与x成正向关系(即x越大,则F1(x,y)的值越大),F1(x,y)的值与y成负向关系(即y越大,则F1(x,y)的值越小)。可选地,第二谱减参数大于或等于预设最小谱减参数,且小于或等于第一谱减参数。
示例性地,1)若第一谱减参数包括第一过减因子α,则根据第一谱减函数F1(x,y)确定第二谱减参数(包括第二过减因子α');其中,α'∈[min_α,α],min_α代表第一预设最小谱减参数。2)若第一谱减参数包括第一频谱阶β,则根据第一谱减函数F1(x,y)确定第二谱减参数(包括第二频谱阶β');其中,β'∈[min_β,β],min_β代表第二预设最小谱减参数。3)若第一谱减参数包括第一过减因子α和第一频谱阶β,则根据第一谱减函数F1(x,y)确定第二谱减参数(包括第二过减因子α'和第二频谱阶β');示例性地,根据第一谱减函数F1(α,y)确定α',以及根据第一谱减函数F1(β,y)确定β';其中,α'∈[min_α,α],β'∈[min_β,β],min_α代表第一预设最小谱减参数,min_β代表第二预设最小谱减参数。
本实现方式中,通过考虑到终端设备的用户语音功率谱特性的规律性,对第一谱减参数进行优化处理得到第二谱减参数,以便根据第二谱减参数对带噪语音信号进行谱减处理,从而可以对终端设备的用户语音进行保护,提高了去噪后的语音信号的可懂度和自然度。
第二种可实现方式:若参考功率谱包括:环境噪声预测功率谱,根据第二谱减函数F2(x,z)确定第二谱减参数。
本实现方式中,若考虑到用户所处环境噪声功率谱特性的规律性(参考功率谱包括:环境噪声预测功率谱),则根据第二谱减函数F2(x,z)确定第二谱减参数;其中,x代表第一谱减参数;z代表环境噪声预测功率谱;F2(x,z)的值与x成正向关系(即x越大,则F2(x,z)的值越大),F2(x,z)的值与z成正向关系(即z越大,则F2(x,z)的值越大)。可选地,第二谱减参数大于或等于第一谱减参数,且小于或等于预设最大谱减参数。
示例性地,1)若第一谱减参数包括第一过减因子α,则根据第二谱减函数F2(x,z)确定第二谱减参数(包括第二过减因子α');其中,α'∈[α,max_α],max_α代表第一预设最大谱减参数。2)若第一谱减参数包括第一频谱阶β,则根据第二谱减函数F2(x,z)确定第二谱减参数(包括第二频谱阶β');其中,β'∈[β,max_β],max_β代表第二预设最大谱减参数。3)若第一谱减参数包括第一过减因子α和第一频谱阶β,则根据第二谱减函数F2(x,z)确定第二谱减参数(包括第二过减因子α'和第二频谱阶β');示例性地,根据第二谱减函数F2(α,z)确定α',以及根据第二谱减函数F2(β,z)确定β';其中,α'∈[α,max_α],β'∈[β,max_β],max_α代表第一预设最大谱减参数,max_β代表第二预设最大谱减参数。
本实现方式中,通过考虑到用户所处环境噪声功率谱特性的规律性,对第一谱减参数进行优化处理得到第二谱减参数,以便根据第二谱减参数对带噪语音信号进行谱减处理,从而可以更加准确地去掉带噪语音信号中的噪声信号,提高了去噪后的语音信号的可懂度和自然度。
第三种可实现方式:若参考功率谱包括:用户语音预测功率谱和环境噪声预测功率谱,根据第三谱减函数F3(x,y,z)确定第二谱减参数。
本实现方式中,若考虑到终端设备的用户语音功率谱特性和用户所处环境噪声功率谱特性的规律性(参考功率谱包括:用户语音预测功率谱和环境噪声预测功率谱),则根据第三谱减函数F3(x,y,z)确定第二谱减参数;其中,x代表第一谱减参数;y代表用户语音预测功率谱;z代表环境噪声预测功率谱;F3(x,y,z)的值与x成正向关系(即x越大,则F3(x,y,z)的值越大),F3(x,y,z)的值与y成负向关系(即y越大,则F3(x,y,z)的值越小),且F3(x,y,z)的值与z成正向关系(即z越大,则F3(x,y,z)的值越大)。可选地,第二谱减参数大于或等于预设最小谱减参数,且小于或等于预设最大谱减参数。
示例性地,1)若第一谱减参数包括第一过减因子α,则根据第三谱减函数F3(x,y,z)确定第二谱减参数(包括第二过减因子α')。2)若第一谱减参数包括第一频谱阶β,则根据第三谱减函数F3(x,y,z)确定第二谱减参数(包括第二频谱阶β')。3)若第一谱减参数包括第一过减因子α和第一频谱阶β,则根据第三谱减函数F3(x,y,z)确定第二谱减参数(包括第二过减因子α'和第二频谱阶β');示例性地,根据第三谱减函数F3(α,y,z)确定α',以及根据第三谱减函数F3(β,y,z)确定β'。
本实现方式中,通过考虑到终端设备的用户语音功率谱特性和用户所处环境噪声功率谱特性的规律性,对第一谱减参数进行优化处理得到第二谱减参数,以便根据第二谱减参数对带噪语音信号进行谱减处理,从而不仅可以对终端设备的用户语音进行保护,还可以更加准确地去掉带噪语音信号中的噪声信号,提高了去噪后的语音信号的可懂度和自然度。
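以上三种可实现方式中的谱减函数(F1、F2、F3)在本申请中并未限定具体形式,下面给出一个满足上述单调关系约束(与x成正向关系、与y成负向关系、与z成正向关系)的假设性示例;函数名、归一化方式与系数c_y、c_z均为本示例自行引入的假设:

```python
import numpy as np

def optimize_subtraction_param(x, y_power, z_power, min_p, max_p,
                               c_y=0.5, c_z=0.5):
    """按F3(x,y,z)思想优化谱减参数的一种假设性实现。

    x: 第一谱减参数;y_power: 用户语音预测功率谱(向量或标量);
    z_power: 环境噪声预测功率谱(向量或标量)。
    输出与x成正向关系、与y成负向关系、与z成正向关系,
    并截断到[min_p, max_p]区间内。
    """
    y = float(np.mean(y_power))
    z = float(np.mean(z_power))
    # 归一化到[0,1),避免量纲影响(归一化方式为本示例的假设)
    y_n = y / (1.0 + y)
    z_n = z / (1.0 + z)
    candidate = x * (1.0 - c_y * y_n + c_z * z_n)
    return float(np.clip(candidate, min_p, max_p))
```

用户语音预测功率越大,参数越小(减得越保守,保护用户语音);环境噪声预测功率越大,参数越大(减得越激进,抑制噪声)。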
当然,根据第一谱减参数以及参考功率谱,还可通过其它方式确定第二谱减参数,本申请实施例中对此并不作限制。
步骤S203、根据噪声信号的功率谱和第二谱减参数对带噪语音信号进行谱减处理。
本步骤中,根据噪声信号的功率谱和第二谱减参数(对第一谱减参数优化处理后得到的)对带噪语音信号进行谱减处理得到去噪后的语音信号,以便进一步地根据去噪后的语音信号以及带噪语音信号的相位信息进行IFFT变换以及叠加等处理,得到增强后的语音信号。可选地,根据噪声信号的功率谱和第二谱减参数对带噪语音信号进行谱减处理的方式可以参考现有技术中的谱减处理过程,此处不再赘述。
本实施例中,通过根据带噪语音信号的功率谱以及噪声信号的功率谱,确定第一谱减参数;进一步地,根据第一谱减参数以及参考功率谱确定第二谱减参数,并根据噪声信号的功率谱和第二谱减参数对带噪语音信号进行谱减处理;其中,参考功率谱包括:用户语音预测功率谱和/或环境噪声预测功率谱。可见,本实施例中,通过考虑到终端设备的用户语音功率谱特性和/或用户所处环境噪声功率谱特性的规律性,对第一谱减参数进行优化处理得到第二谱减参数,以便根据优化后的第二谱减参数对带噪语音信号进行谱减处理,不仅可以适用较宽的信噪比范围,而且提高了去噪后的语音信号的可懂度和自然度,提高了降噪性能。
图3A为本申请另一实施例提供的语音增强方法的流程示意图。本申请实施例涉及的是如何确定用户语音预测功率谱的一种可选地实现过程。如图3A所示,在上述实施例的基础上,步骤S202之前,还包括:
步骤S301、根据带噪语音信号的功率谱以及用户功率谱分布类确定目标用户功率谱聚类。
其中,用户功率谱分布类包括:至少一个用户历史功率谱聚类;目标用户功率谱聚类为至少一个用户历史功率谱聚类中与带噪语音信号的功率谱距离最近的聚类。
本步骤中,示例性地,通过分别计算用户功率谱分布类中各用户历史功率谱聚类与带噪语音信号的功率谱之间的距离,并将各用户历史功率谱聚类中与带噪语音信号的功率谱之间的距离最近的用户历史功率谱聚类确定为目标用户功率谱聚类。可选地,任一用户历史功率谱聚类与带噪语音信号的功率谱之间的距离的计算方式可以采用以下算法中的任意算法:欧氏距离(Euclidean Distance)算法、曼哈顿距离(Manhattan Distance)算法、标准化欧氏距离(Standardized Euclidean distance)算法,以及夹角余弦(Cosine)算法,当然,还可以采用其它算法,本申请实施例中对此并不作限制。
步骤S302、根据带噪语音信号的功率谱以及目标用户功率谱聚类确定用户语音预测功率谱。
本步骤中,示例性地,根据带噪语音信号的功率谱、目标用户功率谱聚类以及估计函数确定用户语音预测功率谱。
可选地,根据第一估计函数F4(SP,SPT)确定用户语音预测功率谱;其中,SP代表带噪语音信号的功率谱;SPT代表目标用户功率谱聚类;F4(SP,SPT)=a*SP+(1-a)*SPT,a代表第一估计系数,0≤a≤1。可选地,a的值可以随着用户功率谱分布类的逐步完善,而逐步减小。
当然,第一估计函数F4(SP,SPT)还可以等于a*SP+(1-a)*SPT的其它等效或变形公式(或者还可以根据第一估计函数F4(SP,SPT)的其它等效或变形估计函数确定用户语音预测功率谱),本申请实施例中对此并不作限制。
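步骤S301与S302(最近聚类选取与第一估计函数F4)可以用如下草图示意。其中采用欧氏距离作为前述可选距离算法之一;函数名与参数a的取值为本示例的假设:

```python
import numpy as np

def predict_user_speech_spectrum(noisy_power, clusters, a=0.7):
    """确定目标用户功率谱聚类并按F4估计用户语音预测功率谱(示意实现)。

    noisy_power: 带噪语音信号的功率谱向量SP;
    clusters: 用户功率谱分布类中各用户历史功率谱聚类中心(同维向量列表);
    a: 第一估计系数(0≤a≤1)。
    以欧氏距离选取最近聚类SPT,返回F4(SP,SPT)=a*SP+(1-a)*SPT。
    """
    sp = np.asarray(noisy_power, dtype=float)
    centers = np.asarray(clusters, dtype=float)
    distances = np.linalg.norm(centers - sp, axis=1)  # 逐聚类计算欧氏距离
    spt = centers[int(np.argmin(distances))]          # 目标用户功率谱聚类
    return a * sp + (1.0 - a) * spt                   # F4线性加权估计
```

将欧氏距离替换为曼哈顿距离、标准化欧氏距离或夹角余弦,仅需改动distances一行。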
本实施例中,通过根据带噪语音信号的功率谱以及用户功率谱分布类确定目标用户功率谱聚类;进一步地,根据带噪语音信号的功率谱以及目标用户功率谱聚类确定用户语音预测功率谱,以便进一步地根据用户语音预测功率谱对第一谱减参数进行优化处理得到第二谱减参数,并根据优化后的第二谱减参数对带噪语音信号进行谱减处理,从而可以对终端设备的用户语音进行保护,提高了去噪后的语音信号的可懂度和自然度。
可选地,在上述实施例的基础上,步骤S301之前还包括:获取用户功率谱分布类。
本实施例中,通过对用户历史去噪后的语音信号进行用户功率谱在线学习,统计分析用户语音功率谱特性,以生成用户个性化相关的用户功率谱分布类来实现对用户语音的自适应。可选地,具体的获取方式可以参见如下内容:
图3B为本申请实施例提供的用户功率谱分布类示意图,图3C为本申请实施例提供的用户语音功率谱特性的学习流程示意图。示例性地,通过应用聚类算法对用户历史去噪后的语音信号进行用户功率谱离线学习,生成用户功率谱初始分布类;可选地,还可结合其它用户历史去噪后的语音信号进行用户功率谱离线学习。示例性地,聚类算法可以包括但不限于以下任意项:K均值(K-means)和K最近邻(K-Nearest Neighbor,K-NN)。可选地,在用户功率谱初始分布类的构建过程中可以结合发音类型(如声母、韵母、清音、浊音、爆破音等)的分类,当然还可以结合其它分类因素,本申请实施例中对此并不作限制。
结合图3B所示,以上一次调整后的用户功率谱分布类包括:用户历史功率谱聚类A1、用户历史功率谱聚类A2和用户历史功率谱聚类A3,以及用户去噪后的语音信号为A4为例进行说明。结合图3B和图3C所示,在语音通话过程中,应用传统的谱减算法或者本申请提供的语音增强方法确定用户去噪后的语音信号,进一步地,根据该用户去噪后的语音信号(如图3B中的A4)以及上一次调整后的用户功率谱分布类进行自适应聚类迭代(即用户功率谱在线学习),对上一次调整后的用户功率谱分布类的聚类中心进行修改,以输出本次调整后的用户功率谱分布类。
可选地,当第一次自适应聚类迭代时(即上一次调整后的用户功率谱分布类为用户功率谱初始分布类),则根据该用户去噪后的语音信号和用户功率谱初始分布类中的初始聚类中心进行自适应聚类迭代;当非第一次自适应聚类迭代时,则根据该用户去噪后的语音信号和上一次调整后的用户功率谱分布类中的历史聚类中心进行自适应聚类迭代。
本申请实施例中,通过每次根据用户去噪后的语音信号动态调整用户功率谱分布类,以便后续可以更加准确地确定用户语音预测功率谱,进一步根据用户语音预测功率谱对第一谱减参数进行优化处理得到第二谱减参数,并根据优化后的第二谱减参数对带噪语音信号进行谱减处理,从而可以对终端设备的用户语音进行保护,提高了降噪性能。
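上述自适应聚类迭代(在线学习)可以用如下假设性草图示意:每收到一个去噪后语音信号的功率谱,仅将与其最近的聚类中心向该样本移动一小步。函数名与学习率lr均为本示例引入的假设,并非本申请限定的迭代方式:

```python
import numpy as np

def update_cluster_centers(centers, new_spectrum, lr=0.1):
    """用户功率谱在线学习的一步更新(在线K均值式的假设性实现)。

    centers: 上一次调整后的用户功率谱分布类的聚类中心(二维数组);
    new_spectrum: 本次去噪后语音信号的功率谱;
    lr: 学习率,控制聚类中心向新样本移动的幅度。
    返回本次调整后的聚类中心。
    """
    centers = np.asarray(centers, dtype=float).copy()
    x = np.asarray(new_spectrum, dtype=float)
    idx = int(np.argmin(np.linalg.norm(centers - x, axis=1)))
    centers[idx] = (1.0 - lr) * centers[idx] + lr * x  # 仅移动最近的中心
    return centers
```

噪声功率谱分布类的在线学习与此对称,只需把输入换成噪声信号的功率谱。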
图4A为本申请另一实施例提供的语音增强方法的流程示意图。本申请实施例涉及的是如何确定环境噪声预测功率谱的一种可选地实现过程。如图4A所示,在上述实施例的基础上,步骤S202之前,还包括:
步骤S401、根据噪声信号的功率谱以及噪声功率谱分布类确定目标噪声功率谱聚类。
其中,噪声功率谱分布类包括:至少一个噪声历史功率谱聚类;目标噪声功率谱聚类为至少一个噪声历史功率谱聚类中与噪声信号的功率谱距离最近的聚类。
本实施例中,示例性地,通过分别计算噪声功率谱分布类中各噪声历史功率谱聚类与噪声信号的功率谱之间的距离,并将各噪声历史功率谱聚类中与噪声信号的功率谱之间的距离最近的噪声历史功率谱聚类确定为目标噪声功率谱聚类。可选地,任一噪声历史功率谱聚类与噪声信号的功率谱之间的距离的计算方式可以采用以下算法中的任意算法:欧氏距离算法、曼哈顿距离算法、标准化欧氏距离算法,以及夹角余弦算法,当然,还可以采用其它算法,本申请实施例中对此并不作限制。
步骤S402、根据噪声信号的功率谱以及目标噪声功率谱聚类确定环境噪声预测功率谱。
本步骤中,示例性地,根据噪声信号的功率谱、目标噪声功率谱聚类以及估计函数确定环境噪声预测功率谱。
可选地,根据第二估计函数F5(NP,NPT)确定环境噪声预测功率谱;其中,NP代表噪声信号的功率谱;NPT代表目标噪声功率谱聚类;F5(NP,NPT)=b*NP+(1-b)*NPT,b代表第二估计系数,0≤b≤1。可选地,b的值可以随着噪声功率谱分布类的逐步完善,而逐步减小。
当然,第二估计函数F5(NP,NPT)还可以等于b*NP+(1-b)*NPT的其它等效或变形公式(或者还可以根据第二估计函数F5(NP,NPT)的其它等效或变形估计函数确定环境噪声预测功率谱),本申请实施例中对此并不作限制。
本实施例中,通过根据噪声信号的功率谱以及噪声功率谱分布类确定目标噪声功率谱聚类;进一步地,根据噪声信号的功率谱以及目标噪声功率谱聚类确定环境噪声预测功率谱,以便进一步地根据环境噪声预测功率谱对第一谱减参数进行优化处理得到第二谱减参数,并根据优化后的第二谱减参数对带噪语音信号进行谱减处理,从而可以更加准确地去掉带噪语音信号中的噪声信号,提高了去噪后的语音信号的可懂度和自然度。
可选地,在上述实施例的基础上,步骤S401之前还包括:获取噪声功率谱分布类。
本实施例中,通过对用户所处环境的历史噪声信号进行噪声功率谱在线学习,统计分析用户所处环境的噪声功率谱特性,以生成用户个性化相关的噪声功率谱分布类来实现对用户语音的自适应。可选地,具体的获取方式可以参见如下内容:
图4B为本申请实施例提供的噪声功率谱分布类示意图,图4C为本申请实施例提供的噪声功率谱特性的学习流程示意图。示例性地,通过应用聚类算法对用户所处环境的历史噪声信号进行噪声功率谱离线学习,生成噪声功率谱初始分布类;可选地,还可结合其它用户所处环境的历史噪声信号进行噪声功率谱离线学习。示例性地,聚类算法可以包括但不限于以下任意项:K-means和K-NN。可选地,在噪声功率谱初始分布类的构建过程中可以结合典型的环境噪声场景(如人员密集场所等)的分类,当然还可以结合其它分类因素,本申请实施例中对此并不作限制。
结合图4B所示,以上一次调整后的噪声功率谱分布类包括:噪声历史功率谱聚类B1、噪声历史功率谱聚类B2和噪声历史功率谱聚类B3,以及噪声信号的功率谱为B4为例进行说明。结合图4B和图4C所示,在语音通话过程中,应用传统的谱减算法或者本申请提供的语音增强方法确定噪声信号的功率谱,进一步地,根据噪声信号的功率谱(如图4B中的B4)以及上一次调整后的噪声功率谱分布类进行自适应聚类迭代(即噪声功率谱在线学习),对上一次调整后的噪声功率谱分布类的聚类中心进行修改,以输出本次调整后的噪声功率谱分布类。
可选地,当第一次自适应聚类迭代时(即上一次调整后的噪声功率谱分布类为噪声功率谱初始分布类),则根据噪声信号的功率谱和噪声功率谱初始分布类中的初始聚类中心进行自适应聚类迭代;当非第一次自适应聚类迭代时,则根据噪声信号的功率谱和上一次调整后的噪声功率谱分布类中的历史聚类中心进行自适应聚类迭代。
本申请实施例中,通过每次根据噪声信号的功率谱动态调整噪声功率谱分布类,以便后续可以更加准确地确定环境噪声预测功率谱,进一步根据环境噪声预测功率谱对第一谱减参数进行优化处理得到第二谱减参数,并根据优化后的第二谱减参数对带噪语音信号进行谱减处理,从而可以更加准确地去掉带噪语音信号中的噪声信号,提高了降噪性能。
图5为本申请另一实施例提供的语音增强方法的流程示意图。本申请实施例涉及的是如何确定用户语音预测功率谱和环境噪声预测功率谱的一种可选地实现过程。如图5所示,在上述实施例的基础上,步骤S202之前,还包括:
步骤S501、根据带噪语音信号的功率谱以及用户功率谱分布类确定目标用户功率谱聚类,以及根据噪声信号的功率谱以及噪声功率谱分布类确定目标噪声功率谱聚类。
其中,用户功率谱分布类包括:至少一个用户历史功率谱聚类;目标用户功率谱聚类为至少一个用户历史功率谱聚类中与带噪语音信号的功率谱距离最近的聚类;噪声功率谱分布类包括:至少一个噪声历史功率谱聚类;目标噪声功率谱聚类为至少一个噪声历史功率谱聚类中与噪声信号的功率谱距离最近的聚类。
可选地,本步骤的具体实现方式可以参见上述实施例中关于步骤S301和步骤S401的相关内容,此处不再赘述。
步骤S502、根据带噪语音信号的功率谱以及目标用户功率谱聚类确定用户语音预测功率谱。
可选地,本步骤的具体实现方式可以参见上述实施例中关于步骤S302的相关内容,此处不再赘述。
步骤S503、根据噪声信号的功率谱以及目标噪声功率谱聚类确定环境噪声预测功率谱。
可选地,本步骤的具体实现方式可以参见上述实施例中关于步骤S402的相关内容,此处不再赘述。
可选地,在上述实施例的基础上,步骤S501之前还包括:获取用户功率谱分布类和噪声功率谱分布类。
可选地,具体的获取方式可以参见上述实施例中的相关内容,此处不再赘述。
需要说明的是,上述步骤S502和步骤S503的执行顺序可以同时并行执行,或者先执行步骤S502后执行步骤S503,或者先执行步骤S503后执行步骤S502,本申请实施例中对此并不作限制。
本实施例中,通过根据带噪语音信号的功率谱以及用户功率谱分布类确定目标用户功率谱聚类,以及根据噪声信号的功率谱以及噪声功率谱分布类确定目标噪声功率谱聚类;进一步地,根据带噪语音信号的功率谱以及目标用户功率谱聚类确定用户语音预测功率谱,以及根据噪声信号的功率谱以及目标噪声功率谱聚类确定环境噪声预测功率谱,以便进一步地根据用户语音预测功率谱和环境噪声预测功率谱对第一谱减参数进行优化处理得到第二谱减参数,并根据优化后的第二谱减参数对带噪语音信号进行谱减处理,从而不仅可以对终端设备的用户语音进行保护,还可以更加准确地去掉带噪语音信号中的噪声信号,提高了去噪后的语音信号的可懂度和自然度。
图6A为本申请另一实施例提供的语音增强方法的流程示意图一,图6B为本申请另一实施例提供的语音增强方法的流程示意图二。结合上述任意实施例,本申请实施例涉及的是当考虑到终端设备的用户语音功率谱特性的规律性以及考虑子带划分时,如何实现语音增强方法的一种可选地实现过程。如图6A和6B所示,本申请实施例的具体实现过程如下:
通过VAD将双麦克风采集到的声音信号划分为带噪语音信号和噪声信号。进一步地,带噪语音信号通过FFT变换得到幅度信息和相位信息(其中,幅度信息通过子带功率谱估计得到带噪语音信号的子带功率谱SP(m,i)),以及噪声信号通过噪声子带功率谱估计得到噪声信号的子带功率谱。进一步地,根据噪声信号的子带功率谱以及带噪语音信号的子带功率谱SP(m,i),通过谱减参数计算处理得到第一谱减参数,m代表第m个子带(m的取值范围为根据预设的子带数量确定的),i代表第i帧(i的取值范围为根据所处理的带噪语音信号的帧序列数目确定的)。进一步地,根据用户语音预测子带功率谱PSP(m,i)对第一谱减参数进行优化,示例性地,根据用户语音预测子带功率谱PSP(m,i)以及第一谱减参数得到第二谱减参数,其中,用户语音预测子带功率谱PSP(m,i)为:根据带噪语音信号的子带功率谱SP(m,i)和用户子带功率谱分布类中与带噪语音信号的子带功率谱SP(m,i)距离最近的用户历史子带功率谱聚类(即目标用户功率谱聚类,SPT(m))进行语音子带功率谱估计确定的。进一步地,根据噪声信号的子带功率谱以及第二谱减参数,对带噪语音信号的幅度信息进行谱减处理得到去噪后的语音信号。进一步地,根据去噪后的语音信号以及带噪语音信号的相位信息进行IFFT变换以及叠加等处理,得到增强后的语音信号。
可选地,还可以对去噪后的语音信号进行用户子带功率谱在线学习,以实时更新用户子带功率谱分布类,进而以便后续根据下一次的带噪语音信号的子带功率谱和更新后的用户子带功率谱分布类中与该带噪语音信号的子带功率谱距离最近的用户历史子带功率谱聚类(即下一次的目标用户功率谱聚类),进行语音子带功率谱估计确定下一次的用户语音预测子带功率谱,以便后续优化下一次的第一谱减参数。
综上所述,本申请实施例中,通过考虑到终端设备的用户语音功率谱特性的规律性,根据用户语音预测子带功率谱对第一谱减参数进行优化处理得到第二谱减参数,以便根据第二谱减参数对带噪语音信号进行谱减处理,从而可以对终端设备的用户语音进行保护,提高了去噪后的语音信号的可懂度和自然度。
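结合图6A/6B所示的处理流程,单帧信号“FFT分解出幅度/相位→对幅度作谱减→用原相位做IFFT重建”的骨架可以草拟如下(忽略VAD、子带划分与叠加等处理;函数名与默认参数均为本示例的假设):

```python
import numpy as np

def enhance_frame(frame, noise_power, alpha=4.0, beta=0.01):
    """单帧增强的示意流程:FFT取幅度/相位 -> 谱减 -> 原相位IFFT重建。

    frame: 一帧带噪语音信号;
    noise_power: 与rfft输出同长度的噪声功率谱估计。
    """
    spectrum = np.fft.rfft(frame)
    phase = np.angle(spectrum)               # 相位信息,原样保留
    power = np.abs(spectrum) ** 2            # 幅度信息(功率谱)
    denoised_power = np.maximum(power - alpha * noise_power,
                                beta * noise_power)   # 谱减处理
    magnitude = np.sqrt(denoised_power)
    # 去噪后的幅度与带噪语音的相位组合后做IFFT重建
    return np.fft.irfft(magnitude * np.exp(1j * phase), n=len(frame))
```

实际系统中还需对相邻帧做加窗与重叠相加(overlap-add),才能得到连续的增强后语音信号。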
可选地,本申请实施例涉及的子带划分的方式可以参考表1所示的划分方式(可选地,Bark域的值b=6.7asinh[(f-20)/600],f代表对信号进行傅里叶变换后的频域取值),当然还可以采用其它的划分方式,本申请实施例中对此并不作限制。
表1为Bark临界频带划分参考示意表
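表1所依据的Bark域映射公式b=6.7asinh[(f-20)/600]可以直接写成代码;其中子带编号的取整截断方式为本示例的假设,并非表1的精确临界频带边界:

```python
import math

def hz_to_bark(f_hz):
    """按文中公式 b = 6.7*asinh((f-20)/600) 将频率(Hz)映射到Bark域。"""
    return 6.7 * math.asinh((f_hz - 20.0) / 600.0)

def subband_index(f_hz, num_bands=24):
    """将频率映射到子带编号的一种假设性划分:Bark值取整并截断到合法区间。"""
    b = hz_to_bark(f_hz)
    return max(0, min(num_bands - 1, int(b)))
```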
图7A为本申请另一实施例提供的语音增强方法的流程示意图三,图7B为本申请另一实施例提供的语音增强方法的流程示意图四。结合上述任意实施例,本申请实施例涉及的是当考虑到用户所处环境噪声功率谱特性的规律性以及考虑子带划分时,如何实现语音增强方法的一种可选地实现过程。如图7A和7B所示,本申请实施例的具体实现过程如下:
通过VAD将双麦克风采集到的声音信号划分为带噪语音信号和噪声信号。进一步地,带噪语音信号通过FFT变换得到幅度信息和相位信息(其中,幅度信息通过子带功率谱估计得到带噪语音信号的子带功率谱),以及噪声信号通过噪声子带功率谱估计得到噪声信号的子带功率谱NP(m,i)。进一步地,根据噪声信号的子带功率谱NP(m,i)以及带噪语音信号的子带功率谱,通过谱减参数计算处理得到第一谱减参数。进一步地,根据环境噪声预测功率谱PNP(m,i)对第一谱减参数进行优化,示例性地,根据环境噪声预测功率谱PNP(m,i)以及第一谱减参数得到第二谱减参数,其中,环境噪声预测功率谱PNP(m,i)为:根据噪声信号的子带功率谱NP(m,i)和噪声子带功率谱分布类中与噪声信号的子带功率谱NP(m,i)距离最近的噪声历史子带功率谱聚类(即目标噪声子带功率谱聚类,NPT(m))进行噪声子带功率谱估计确定的。进一步地,根据噪声信号的子带功率谱以及第二谱减参数,对带噪语音信号的幅度信息进行谱减处理得到去噪后的语音信号。进一步地,根据去噪后的语音信号以及带噪语音信号的相位信息进行IFFT变换以及叠加等处理,得到增强后的语音信号。
可选地,还可以对噪声信号的子带功率谱NP(m,i)进行噪声子带功率谱在线学习,以实时更新噪声子带功率谱分布类,进而以便后续根据下一次的噪声信号的子带功率谱和更新后的噪声子带功率谱分布类中与该噪声信号的子带功率谱距离最近的噪声历史子带功率谱聚类(即下一次的目标噪声子带功率谱聚类),进行噪声子带功率谱估计确定下一次的环境噪声预测子带功率谱,以便后续优化下一次的第一谱减参数。
综上所述,本申请实施例中,通过考虑到用户所处环境噪声功率谱特性的规律性,根据环境噪声预测子带功率谱对第一谱减参数进行优化处理得到第二谱减参数,以便根据第二谱减参数对带噪语音信号进行谱减处理,从而可以更加准确地去掉带噪语音信号中的噪声信号,提高了去噪后的语音信号的可懂度和自然度。
图8A为本申请另一实施例提供的语音增强方法的流程示意图五,图8B为本申请另一实施例提供的语音增强方法的流程示意图六。结合上述任意实施例,本申请实施例涉及的是当考虑到终端设备的用户语音功率谱特性、用户所处环境噪声功率谱特性的规律性以及考虑子带划分时,如何实现语音增强方法的一种可选地实现过程。如图8A和8B所示,本申请实施例的具体实现过程如下:
通过VAD将双麦克风采集到的声音信号划分为带噪语音信号和噪声信号。进一步地,带噪语音信号通过FFT变换得到幅度信息和相位信息(其中,幅度信息通过子带功率谱估计得到带噪语音信号的子带功率谱SP(m,i)),以及噪声信号通过噪声子带功率谱估计得到噪声信号的子带功率谱NP(m,i)。进一步地,根据噪声信号的子带功率谱以及带噪语音信号的子带功率谱,通过谱减参数计算处理得到第一谱减参数。进一步地,根据用户语音预测子带功率谱PSP(m,i)、环境噪声预测功率谱PNP(m,i)对第一谱减参数进行优化,示例性地,根据用户语音预测子带功率谱PSP(m,i)、环境噪声预测功率谱PNP(m,i)以及第一谱减参数得到第二谱减参数;其中,用户语音预测子带功率谱PSP(m,i)为:根据带噪语音信号的子带功率谱SP(m,i)和用户子带功率谱分布类中与带噪语音信号的子带功率谱SP(m,i)距离最近的用户历史子带功率谱聚类(即目标用户功率谱聚类,SPT(m))进行语音子带功率谱估计确定的;环境噪声预测功率谱PNP(m,i)为:根据噪声信号的子带功率谱NP(m,i)和噪声子带功率谱分布类中与噪声信号的子带功率谱NP(m,i)距离最近的噪声历史子带功率谱聚类(即目标噪声子带功率谱聚类,NPT(m))进行噪声子带功率谱估计确定的。进一步地,根据噪声信号的子带功率谱以及第二谱减参数,对带噪语音信号的幅度信息进行谱减处理得到去噪后的语音信号。进一步地,根据去噪后的语音信号以及带噪语音信号的相位信息进行IFFT变换以及叠加等处理,得到增强后的语音信号。
可选地,还可以对去噪后的语音信号进行用户子带功率谱在线学习以实时更新用户子带功率谱分布类,进而以便后续根据下一次的带噪语音信号的子带功率谱和更新后的用户子带功率谱分布类中与该带噪语音信号的子带功率谱距离最近的用户历史子带功率谱聚类(即下一次的目标用户功率谱聚类),进行语音子带功率谱估计确定下一次的用户语音预测子带功率谱,以便后续优化下一次的第一谱减参数。
可选地,还可以对噪声信号的子带功率谱进行噪声子带功率谱在线学习,以实时更新噪声子带功率谱分布类,进而以便后续根据下一次的噪声信号的子带功率谱和更新后的噪声子带功率谱分布类中与该噪声信号的子带功率谱距离最近的噪声历史子带功率谱聚类(即下一次的目标噪声子带功率谱聚类),进行噪声子带功率谱估计确定下一次的环境噪声预测功率谱,以便后续优化下一次的第一谱减参数。
综上所述,本申请实施例中,通过考虑到终端设备的用户语音功率谱特性和用户所处环境噪声功率谱特性的规律性,根据用户语音预测子带功率谱和环境噪声预测子带功率谱对第一谱减参数进行优化处理得到第二谱减参数,以便根据第二谱减参数对带噪语音信号进行谱减处理,从而可以更加准确地去掉带噪语音信号中的噪声信号,提高了去噪后的语音信号的可懂度和自然度。
图9A为本申请一实施例提供的语音增强装置的结构示意图。如图9A所示,本申请实施例提供的语音增强装置90,包括:第一确定模块901、第二确定模块902以及谱减模块903。
其中,第一确定模块901,用于根据带噪语音信号的功率谱以及噪声信号的功率谱,确定第一谱减参数;其中,带噪语音信号以及噪声信号为对麦克风所采集到的声音信号进行划分处理后得到的;
第二确定模块902,用于根据第一谱减参数以及参考功率谱确定第二谱减参数;其中,参考功率谱包括:用户语音预测功率谱和/或环境噪声预测功率谱;
谱减模块903,用于根据噪声信号的功率谱和第二谱减参数对带噪语音信号进行谱减处理。
可选地,若参考功率谱包括:用户语音预测功率谱,第二确定模块902具体用于:
根据第一谱减函数F1(x,y)确定第二谱减参数;其中,x代表第一谱减参数;y代表用户语音预测功率谱;F1(x,y)的值与x成正向关系,F1(x,y)的值与y成负向关系。
可选地,若参考功率谱包括:环境噪声预测功率谱,第二确定模块902具体用于:
根据第二谱减函数F2(x,z)确定第二谱减参数;其中,x代表第一谱减参数;z代表环境噪声预测功率谱;F2(x,z)的值与x成正向关系,F2(x,z)的值与z成正向关系。
可选地,若参考功率谱包括:用户语音预测功率谱和环境噪声预测功率谱,第二确定模块902具体用于:
根据第三谱减函数F3(x,y,z)确定第二谱减参数;其中,x代表第一谱减参数;y代表用户语音预测功率谱;z代表环境噪声预测功率谱;F3(x,y,z)的值与x成正向关系,F3(x,y,z)的值与y成负向关系,且F3(x,y,z)的值与z成正向关系。
可选地,语音增强装置90还包括:
第三确定模块,用于根据带噪语音信号的功率谱以及用户功率谱分布类确定目标用户功率谱聚类;其中,用户功率谱分布类包括:至少一个用户历史功率谱聚类;目标用户功率谱聚类为至少一个用户历史功率谱聚类中与带噪语音信号的功率谱距离最近的聚类;
第四确定模块,用于根据带噪语音信号的功率谱以及目标用户功率谱聚类确定用户语音预测功率谱。
可选地,语音增强装置90还包括:
第五确定模块,用于根据噪声信号的功率谱以及噪声功率谱分布类确定目标噪声功率谱聚类;其中,噪声功率谱分布类包括:至少一个噪声历史功率谱聚类;目标噪声功率谱聚类为至少一个噪声历史功率谱聚类中与噪声信号的功率谱距离最近的聚类;
第六确定模块,用于根据噪声信号的功率谱以及目标噪声功率谱聚类确定环境噪声预测功率谱。
可选地,语音增强装置90还包括:
第三确定模块,用于根据带噪语音信号的功率谱以及用户功率谱分布类确定目标用户功率谱聚类;
第五确定模块,用于根据噪声信号的功率谱以及噪声功率谱分布类确定目标噪声功率谱聚类;其中,用户功率谱分布类包括:至少一个用户历史功率谱聚类;目标用户功率谱聚类为至少一个用户历史功率谱聚类中与带噪语音信号的功率谱距离最近的聚类;噪声功率谱分布类包括:至少一个噪声历史功率谱聚类;目标噪声功率谱聚类为至少一个噪声历史功率谱聚类中与噪声信号的功率谱距离最近的聚类;
第四确定模块,用于根据带噪语音信号的功率谱以及目标用户功率谱聚类确定用户语音预测功率谱;
第六确定模块,用于根据噪声信号的功率谱以及目标噪声功率谱聚类确定环境噪声预测功率谱。
可选地,第四确定模块具体用于:
根据第一估计函数F4(SP,SPT)确定用户语音预测功率谱;其中,SP代表带噪语音信号的功率谱;SPT代表目标用户功率谱聚类;F4(SP,SPT)=a*SP+(1-a)*SPT,a代表第一估计系数。
可选地,第六确定模块具体用于:
根据第二估计函数F5(NP,NPT)确定环境噪声预测功率谱;其中,NP代表噪声信号的功率谱;NPT代表目标噪声功率谱聚类;F5(NP,NPT)=b*NP+(1-b)*NPT,b代表第二估计系数。
可选地,语音增强装置90还包括:
第一获取模块,用于获取用户功率谱分布类。
可选地,语音增强装置90还包括:
第二获取模块,用于获取噪声功率谱分布类。
本实施例的语音增强装置,可以用于执行本申请上述语音增强方法实施例的技术方案,其实现原理和技术效果类似,此处不再赘述。
图9B为本申请另一实施例提供的语音增强装置的结构示意图。如图9B所示,本申请实施例提供的语音增强装置可以包括:VAD模块、噪声估计模块、谱减参数计算模块、频谱分析模块、谱减模块、在线学习模块、参数优化模块、以及相位恢复模块。其中,VAD模块分别连接至噪声估计模块和频谱分析模块,噪声估计模块分别连接至在线学习模块和谱减参数计算模块,频谱分析模块分别连接至在线学习模块和谱减模块,参数优化模块分别连接至在线学习模块、谱减参数计算模块和谱减模块,谱减模块还与谱减参数计算模块和相位恢复模块连接。
可选地,VAD模块用于将麦克风采集到的声音信号划分为带噪语音信号和噪声信号;噪声估计模块用于估计噪声信号的功率谱;频谱分析模块用于估计带噪语音信号的功率谱;相位恢复模块用于根据频谱分析模块中确定的相位信息和谱减模块处理后的去噪后的语音信号恢复得到增强后的语音信号。结合图9A所示,谱减参数计算模块的功能可以与上述实施例中的第一确定模块901的功能相同;参数优化模块的功能可以与上述实施例中的第二确定模块902的功能相同;谱减模块的功能可以与上述实施例中的谱减模块903的功能相同;在线学习模块的功能可以与上述实施例中的第三确定模块、第四确定模块、第五确定模块、第六确定模块、第一获取模块和第二获取模块的功能相同。
本实施例的语音增强装置,可以用于执行本申请上述语音增强方法实施例的技术方案,其实现原理和技术效果类似,此处不再赘述。
图10为本申请另一实施例提供的语音增强装置的结构示意图。如图10所示,本申请实施例提供的语音增强装置,包括:处理器1001和存储器1002;
其中,存储器1002,用于存储程序指令;
处理器1001,用于调用并执行所述存储器中存储的程序指令,实现本申请上述语音增强方法实施例的技术方案,其实现原理和技术效果类似,此处不再赘述。
可以理解的是,图10仅仅示出了语音增强装置的简化设计。在其他的实施方式中,语音增强装置还可以包含任意数量的发射器、接收器、处理器、存储器和/或通信单元等,本申请实施例中对此并不作限制。
图11为本申请另一实施例提供的语音增强装置的结构示意图。可选地,本申请实施例提供的语音增强装置可以是终端设备。如图11所示,本申请实施例中以终端设备为手机100为例进行说明。应该理解的是,图示手机100仅是终端设备的一个范例,并且手机100可以具有比图中所示出的更多的或者更少的部件,可以组合两个或更多的部件,或者可以具有不同的部件配置。
如图11所示,手机100具体可以包括:处理器101、射频(Radio Frequency,RF)电路102、存储器103、触摸屏104、蓝牙装置105、一个或多个传感器106、无线保真(Wireless-Fidelity,Wi-Fi)装置107、定位装置108、音频电路109、扬声器113、麦克风114、外设接口110以及电源装置111等部件。可选地,触摸屏104中可以包括:触控板104-1和显示器104-2。这些部件可通过一根或多根通信总线或信号线(图11中未示出)进行通信。
需要说明的是,本领域技术人员可以理解,图11中示出的硬件结构并不构成对手机的限定,手机100可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
下面结合本申请所涉及的部件对手机100的音频部件进行具体的介绍,而其他部件暂不做详细描述。
示例性地,音频电路109、扬声器113、麦克风114可提供用户与手机100之间的音频接口。音频电路109可将接收到的音频数据转换后的电信号,传输到扬声器113,由扬声器113转换为声音信号输出;另一方面,麦克风114一般是两个或者两个以上麦克风的组合,麦克风114将收集的声音信号转换为电信号,由音频电路109接收后转换为音频数据,再将音频数据输出至RF电路102以发送给比如另一手机,或者将音频数据输出至存储器103以便进一步处理。同时,音频电路可以包括专用处理器。
可选地,本申请上述语音增强方法实施例中的技术方案可以运行在音频电路109中的专用处理器,也可以运行在图11中所示的处理器101中,其实现原理和技术效果类似,此处不再赘述。
本申请实施例还提供一种程序,该程序在被处理器执行时用于执行本申请上述语音增强方法实施例的技术方案,其实现原理和技术效果类似,此处不再赘述。
本申请实施例还提供一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行本申请上述语音增强方法实施例的技术方案,其实现原理和技术效果类似,此处不再赘述。
本申请实施例还提供一种计算机可读存储介质,计算机可读存储介质中存储有指令,当其在计算机上运行时,使得计算机执行本申请上述语音增强方法实施例的技术方案,其实现原理和技术效果类似,此处不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用硬件加软件功能单元的形式实现。
上述以软件功能单元的形式实现的集成的单元,可以存储在一个计算机可读取存储介质中。上述软件功能单元存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或处理器(processor)执行本申请各个实施例所述方法的部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
本领域技术人员可以清楚地了解到,为描述的方便和简洁,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。上述描述的装置的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
本领域普通技术人员可以理解,在本申请的各种实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
在上述各实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、网络设备、终端设备或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘Solid State Disk(SSD))等。

Claims (24)

  1. 一种语音增强方法,其特征在于,包括:
    根据带噪语音信号的功率谱以及噪声信号的功率谱,确定第一谱减参数;其中,所述带噪语音信号以及所述噪声信号为对麦克风所采集到的声音信号进行划分处理后得到的;
    根据所述第一谱减参数以及参考功率谱确定第二谱减参数;其中,所述参考功率谱包括:用户语音预测功率谱和/或环境噪声预测功率谱;
    根据所述噪声信号的功率谱和所述第二谱减参数对所述带噪语音信号进行谱减处理。
  2. 根据权利要求1所述的方法,其特征在于,若所述参考功率谱包括:用户语音预测功率谱,所述根据所述第一谱减参数以及参考功率谱确定第二谱减参数,包括:
    根据第一谱减函数F1(x,y)确定所述第二谱减参数;其中,x代表所述第一谱减参数;y代表所述用户语音预测功率谱;F1(x,y)的值与x成正向关系,F1(x,y)的值与y成负向关系。
  3. 根据权利要求1所述的方法,其特征在于,若所述参考功率谱包括:环境噪声预测功率谱,所述根据所述第一谱减参数以及参考功率谱确定第二谱减参数,包括:
    根据第二谱减函数F2(x,z)确定所述第二谱减参数;其中,x代表所述第一谱减参数;z代表所述环境噪声预测功率谱;F2(x,z)的值与x成正向关系,F2(x,z)的值与z成正向关系。
  4. 根据权利要求1所述的方法,其特征在于,若所述参考功率谱包括:用户语音预测功率谱和环境噪声预测功率谱,所述根据所述第一谱减参数以及参考功率谱确定第二谱减参数,包括:
    根据第三谱减函数F3(x,y,z)确定所述第二谱减参数;其中,x代表所述第一谱减参数;y代表所述用户语音预测功率谱;z代表所述环境噪声预测功率谱;F3(x,y,z)的值与x成正向关系,F3(x,y,z)的值与y成负向关系,且F3(x,y,z)的值与z成正向关系。
  5. 根据权利要求2所述的方法,其特征在于,所述根据所述第一谱减参数以及参考功率谱确定第二谱减参数之前,还包括:
    根据所述带噪语音信号的功率谱以及用户功率谱分布类确定目标用户功率谱聚类;其中,所述用户功率谱分布类包括:至少一个用户历史功率谱聚类;所述目标用户功率谱聚类为所述至少一个用户历史功率谱聚类中与所述带噪语音信号的功率谱距离最近的聚类;
    根据所述带噪语音信号的功率谱以及所述目标用户功率谱聚类确定所述用户语音预测功率谱。
  6. 根据权利要求3所述的方法,其特征在于,所述根据所述第一谱减参数以及参考功率谱确定第二谱减参数之前,还包括:
    根据所述噪声信号的功率谱以及噪声功率谱分布类确定目标噪声功率谱聚类;其中,所述噪声功率谱分布类包括:至少一个噪声历史功率谱聚类;所述目标噪声功率谱聚类为所述至少一个噪声历史功率谱聚类中与所述噪声信号的功率谱距离最近的聚类;
    根据所述噪声信号的功率谱以及所述目标噪声功率谱聚类确定所述环境噪声预测功率谱。
  7. 根据权利要求4所述的方法,其特征在于,所述根据所述第一谱减参数以及参考功率谱确定第二谱减参数之前,还包括:
    根据所述带噪语音信号的功率谱以及用户功率谱分布类确定目标用户功率谱聚类,以及根据所述噪声信号的功率谱以及噪声功率谱分布类确定目标噪声功率谱聚类;其中,所述用户功率谱分布类包括:至少一个用户历史功率谱聚类;所述目标用户功率谱聚类为所述至少一个用户历史功率谱聚类中与所述带噪语音信号的功率谱距离最近的聚类;所述噪声功率谱分布类包括:至少一个噪声历史功率谱聚类;所述目标噪声功率谱聚类为所述至少一个噪声历史功率谱聚类中与所述噪声信号的功率谱距离最近的聚类;
    根据所述带噪语音信号的功率谱以及所述目标用户功率谱聚类确定所述用户语音预测功率谱;
    根据所述噪声信号的功率谱以及所述目标噪声功率谱聚类确定所述环境噪声预测功率谱。
  8. 根据权利要求5或7所述的方法,其特征在于,所述根据所述带噪语音信号的功率谱以及所述目标用户功率谱聚类确定所述用户语音预测功率谱,包括:
    根据第一估计函数F4(SP,SPT)确定所述用户语音预测功率谱;其中,SP代表所述带噪语音信号的功率谱;SPT代表所述目标用户功率谱聚类;F4(SP,SPT)=a*SP+(1-a)*SPT,a代表第一估计系数。
  9. 根据权利要求6或7所述的方法,其特征在于,所述根据所述噪声信号的功率谱以及所述目标噪声功率谱聚类确定所述环境噪声预测功率谱,包括:
    根据第二估计函数F5(NP,NPT)确定所述环境噪声预测功率谱;其中,NP代表所述噪声信号的功率谱;NPT代表所述目标噪声功率谱聚类;F5(NP,NPT)=b*NP+(1-b)*NPT,b代表第二估计系数。
  10. 根据权利要求5、7或8所述的方法,其特征在于,所述根据所述带噪语音信号的功率谱以及用户功率谱分布类确定目标用户功率谱聚类之前,还包括:
    获取所述用户功率谱分布类。
  11. 根据权利要求6、7或9所述的方法,其特征在于,所述根据所述噪声信号的功率谱以及噪声功率谱分布类确定目标噪声功率谱聚类之前,还包括:
    获取所述噪声功率谱分布类。
  12. 一种语音增强装置,其特征在于,包括:
    第一确定模块,用于根据带噪语音信号的功率谱以及噪声信号的功率谱,确定第一谱减参数;其中,所述带噪语音信号以及所述噪声信号为对麦克风所采集到的声音信号进行划分处理后得到的;
    第二确定模块,用于根据所述第一谱减参数以及参考功率谱确定第二谱减参数;其中,所述参考功率谱包括:用户语音预测功率谱和/或环境噪声预测功率谱;
    谱减模块,用于根据所述噪声信号的功率谱和所述第二谱减参数对所述带噪语音信号进行谱减处理。
  13. 根据权利要求12所述的装置,其特征在于,若所述参考功率谱包括:用户语音预测功率谱,所述第二确定模块具体用于:
    根据第一谱减函数F1(x,y)确定所述第二谱减参数;其中,x代表所述第一谱减参数;y代表所述用户语音预测功率谱;F1(x,y)的值与x成正向关系,F1(x,y)的值与y成负向关系。
  14. 根据权利要求12所述的装置,其特征在于,若所述参考功率谱包括:环境噪声预测功率谱,所述第二确定模块具体用于:
    根据第二谱减函数F2(x,z)确定所述第二谱减参数;其中,x代表所述第一谱减参数;z代表所述环境噪声预测功率谱;F2(x,z)的值与x成正向关系,F2(x,z)的值与z成正向关系。
  15. 根据权利要求12所述的装置,其特征在于,若所述参考功率谱包括:用户语音预测功率谱和环境噪声预测功率谱,所述第二确定模块具体用于:
    根据第三谱减函数F3(x,y,z)确定所述第二谱减参数;其中,x代表所述第一谱减参数;y代表所述用户语音预测功率谱;z代表所述环境噪声预测功率谱;F3(x,y,z)的值与x成正向关系,F3(x,y,z)的值与y成负向关系,且F3(x,y,z)的值与z成正向关系。
  16. 根据权利要求13所述的装置,其特征在于,所述装置还包括:
    第三确定模块,用于根据所述带噪语音信号的功率谱以及用户功率谱分布类确定目标用户功率谱聚类;其中,所述用户功率谱分布类包括:至少一个用户历史功率谱聚类;所述目标用户功率谱聚类为所述至少一个用户历史功率谱聚类中与所述带噪语音信号的功率谱距离最近的聚类;
    第四确定模块,用于根据所述带噪语音信号的功率谱以及所述目标用户功率谱聚类确定所述用户语音预测功率谱。
  17. 根据权利要求14所述的装置,其特征在于,所述装置还包括:
    第五确定模块,用于根据所述噪声信号的功率谱以及噪声功率谱分布类确定目标噪声功率谱聚类;其中,所述噪声功率谱分布类包括:至少一个噪声历史功率谱聚类;所述目标噪声功率谱聚类为所述至少一个噪声历史功率谱聚类中与所述噪声信号的功率谱距离最近的聚类;
    第六确定模块,用于根据所述噪声信号的功率谱以及所述目标噪声功率谱聚类确定所述环境噪声预测功率谱。
  18. 根据权利要求15所述的装置,其特征在于,所述装置还包括:
    第三确定模块,用于根据所述带噪语音信号的功率谱以及用户功率谱分布类确定目标用户功率谱聚类;
    第五确定模块,用于根据所述噪声信号的功率谱以及噪声功率谱分布类确定目标噪声功率谱聚类;其中,所述用户功率谱分布类包括:至少一个用户历史功率谱聚类;所述目标用户功率谱聚类为所述至少一个用户历史功率谱聚类中与所述带噪语音信号的功率谱距离最近的聚类;所述噪声功率谱分布类包括:至少一个噪声历史功率谱聚类;所述目标噪声功率谱聚类为所述至少一个噪声历史功率谱聚类中与所述噪声信号的功率谱距离最近的聚类;
    第四确定模块,用于根据所述带噪语音信号的功率谱以及所述目标用户功率谱聚类确定所述用户语音预测功率谱;
    第六确定模块,用于根据所述噪声信号的功率谱以及所述目标噪声功率谱聚类确定所述环境噪声预测功率谱。
  19. 根据权利要求16或18所述的装置,其特征在于,所述第四确定模块具体用于:
    根据第一估计函数F4(SP,SPT)确定所述用户语音预测功率谱;其中,SP代表所述带噪语音信号的功率谱;SPT代表所述目标用户功率谱聚类;F4(SP,SPT)=a*SP+(1-a)*SPT,a代表第一估计系数。
  20. 根据权利要求17或18所述的装置,其特征在于,所述第六确定模块具体用于:
    根据第二估计函数F5(NP,NPT)确定所述环境噪声预测功率谱;其中,NP代表所述噪声信号的功率谱;NPT代表所述目标噪声功率谱聚类;F5(NP,NPT)=b*NP+(1-b)*NPT,b代表第二估计系数。
  21. 根据权利要求16、18或19所述的装置,其特征在于,所述装置还包括:
    第一获取模块,用于获取所述用户功率谱分布类。
  22. 根据权利要求17、18或20所述的装置,其特征在于,所述装置还包括:
    第二获取模块,用于获取所述噪声功率谱分布类。
  23. 一种语音增强装置,其特征在于,包括处理器和存储器;
    其中,所述存储器,用于存储程序指令;
    所述处理器,用于调用并执行所述存储器中存储的程序指令,实现如权利要求1至11中任一项所述的方法。
  24. 一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,当所述指令在计算机上运行时,使得所述计算机执行如权利要求1至11中任一项所述的方法。
PCT/CN2018/073281 2017-12-18 2018-01-18 语音增强方法及装置 WO2019119593A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US16/645,677 US11164591B2 (en) 2017-12-18 2018-01-18 Speech enhancement method and apparatus
CN201880067882.XA CN111226277B (zh) 2017-12-18 2018-01-18 语音增强方法及装置

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711368189 2017-12-18
CN201711368189.X 2017-12-18

Publications (1)

Publication Number Publication Date
WO2019119593A1 true WO2019119593A1 (zh) 2019-06-27

Family

ID=66993022

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/073281 WO2019119593A1 (zh) 2017-12-18 2018-01-18 语音增强方法及装置

Country Status (3)

Country Link
US (1) US11164591B2 (zh)
CN (1) CN111226277B (zh)
WO (1) WO2019119593A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986693B (zh) * 2020-08-10 2024-07-09 北京小米松果电子有限公司 Audio signal processing method and apparatus, terminal device, and storage medium
CN113793620B (zh) * 2021-11-17 2022-03-08 深圳市北科瑞声科技股份有限公司 Scene-classification-based speech noise reduction method, apparatus, device, and storage medium
CN114387953B (zh) * 2022-01-25 2024-10-22 重庆卡佐科技有限公司 Speech enhancement method and speech recognition method in a vehicle-mounted environment
CN115132219A (zh) * 2022-06-22 2022-09-30 中国兵器工业计算机应用技术研究所 Speech recognition method and system for complex noise backgrounds based on secondary spectral subtraction
CN116705013B (zh) * 2023-07-28 2023-10-10 腾讯科技(深圳)有限公司 Voice wake-up word detection method and apparatus, storage medium, and electronic device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050071156A1 (en) * 2003-09-30 2005-03-31 Intel Corporation Method for spectral subtraction in speech enhancement
US20050288923A1 (en) * 2004-06-25 2005-12-29 The Hong Kong University Of Science And Technology Speech enhancement by noise masking
CN103730126A (zh) * 2012-10-16 2014-04-16 联芯科技有限公司 Noise suppression method and noise suppressor
CN104200811A (zh) * 2014-08-08 2014-12-10 华迪计算机集团有限公司 Method and apparatus for adaptive spectral-subtraction denoising of speech signals
CN104252863A (zh) * 2013-06-28 2014-12-31 上海通用汽车有限公司 Audio noise reduction processing system and method for a vehicle-mounted radio
CN107393550A (zh) * 2017-07-14 2017-11-24 深圳永顺智信息科技有限公司 Speech processing method and apparatus

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4897878A (en) * 1985-08-26 1990-01-30 Itt Corporation Noise compensation in speech recognition apparatus
US6775652B1 (en) * 1998-06-30 2004-08-10 At&T Corp. Speech recognition over lossy transmission systems
US7103540B2 (en) * 2002-05-20 2006-09-05 Microsoft Corporation Method of pattern recognition using noise reduction uncertainty
US20040078199A1 (en) * 2002-08-20 2004-04-22 Hanoh Kremer Method for auditory based noise reduction and an apparatus for auditory based noise reduction
US7133825B2 (en) * 2003-11-28 2006-11-07 Skyworks Solutions, Inc. Computationally efficient background noise suppressor for speech coding and speech recognition
CN101015001A (zh) * 2004-09-07 2007-08-08 皇家飞利浦电子股份有限公司 Telephony device with improved noise suppression capability
KR100745977B1 (ko) * 2005-09-26 2007-08-06 삼성전자주식회사 Apparatus and method for detecting voice activity
CN102436820B (zh) * 2010-09-29 2013-08-28 华为技术有限公司 High-band signal encoding method and apparatus, and high-band signal decoding method and apparatus
US9589580B2 (en) * 2011-03-14 2017-03-07 Cochlear Limited Sound processing based on a confidence measure
WO2015092943A1 (en) * 2013-12-17 2015-06-25 Sony Corporation Electronic devices and methods for compensating for environmental noise in text-to-speech applications
US9552829B2 (en) * 2014-05-01 2017-01-24 Bellevue Investments Gmbh & Co. Kgaa System and method for low-loss removal of stationary and non-stationary short-time interferences
CN104269178A (zh) * 2014-08-08 2015-01-07 华迪计算机集团有限公司 Method and apparatus for adaptive spectral subtraction and wavelet-packet denoising of speech signals
US9818084B1 (en) * 2015-12-09 2017-11-14 Impinj, Inc. RFID loss-prevention based on transition risk
US10991355B2 (en) * 2019-02-18 2021-04-27 Bose Corporation Dynamic sound masking based on monitoring biosignals and environmental noises

Also Published As

Publication number Publication date
US20200279573A1 (en) 2020-09-03
US11164591B2 (en) 2021-11-02
CN111226277A (zh) 2020-06-02
CN111226277B (zh) 2022-12-27

Similar Documents

Publication Publication Date Title
WO2019119593A1 (zh) Speech enhancement method and apparatus
US9978388B2 (en) Systems and methods for restoration of speech components
TWI730584B (zh) Keyword detection method and related apparatus
JP7498560B2 (ja) Systems and methods
US9536540B2 (en) Speech signal separation and synthesis based on auditory scene analysis and speech modeling
WO2019100500A1 (zh) Speech signal noise reduction method and device
US10045140B2 (en) Utilizing digital microphones for low power keyword detection and noise suppression
WO2020088154A1 (zh) Speech noise reduction method, storage medium, and mobile terminal
WO2016123560A1 (en) Contextual switching of microphones
CN109756818B (zh) Dual-microphone noise reduction method and apparatus, storage medium, and electronic device
CN106165015B (zh) Apparatus and method for facilitating watermarking-based echo management
CN114203163A (zh) Audio signal processing method and apparatus
WO2016119388A1 (zh) Method and apparatus for constructing a focused covariance matrix based on a speech signal
WO2022256577A1 (en) A method of speech enhancement and a mobile computing device implementing the method
CN114898762A (zh) Real-time speech noise reduction method and apparatus based on a target person, and electronic device
US20150325252A1 (en) Method and device for eliminating noise, and mobile terminal
CN112802490B (zh) Beamforming method and apparatus based on a microphone array
WO2017123814A1 (en) Systems and methods for assisting automatic speech recognition
JP6711765B2 (ja) Forming apparatus, forming method, and forming program
US20180277134A1 (en) Key Click Suppression
CN118613866A (zh) Techniques for unified acoustic echo suppression using recurrent neural networks
CN114220430A (zh) Multi-zone speech interaction method, apparatus, device, and storage medium
Kamarudin et al. Acoustic echo cancellation using adaptive filtering algorithms for Quranic accents (Qiraat) identification
CN113539300A (zh) Speech detection method and apparatus based on noise suppression, storage medium, and terminal
US20230024855A1 (en) Method and electronic device for improving audio quality

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18892192

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18892192

Country of ref document: EP

Kind code of ref document: A1