
CN114141267A - Speech enhancement method and device based on complex frequency spectrum characteristics

Info

Publication number: CN114141267A
Application number: CN202111447463.9A
Authority: CN
Inventors: 苏家雨; 王博; 欧阳鹏
Original assignee: Jiangsu Qingwei Intelligent Technology Co ltd
Current assignee: Jiangsu Qingwei Intelligent Technology Co ltd
Priority date: 2021-11-30
Filing date: 2021-11-30
Publication date: 2022-03-04

Abstract

The invention discloses a speech enhancement method and device based on complex spectral features. The method comprises: performing a Fourier transform on noisy speech to obtain a complex spectrum; calculating a log real part power spectrum from the real part and a log imaginary part power spectrum from the imaginary part of the complex spectrum; inputting the log real part power spectrum and the log imaginary part power spectrum into a pre-trained masking prediction network to obtain a first masking value corresponding to the real part and a second masking value corresponding to the imaginary part; and enhancing the real part and the imaginary part of the complex spectrum with the first masking value and the second masking value respectively, then performing an inverse Fourier transform on the enhanced complex spectrum to obtain the enhanced speech. In the scheme provided by the embodiments of the invention, masking values are obtained separately for the real part and the imaginary part of the complex spectrum of the speech in the frequency domain. Obtaining the masking values separately in this way implicitly carries phase information; that is, both the energy and the phase information of the speech signal serve as features for obtaining the masking values, so that speech distortion is reduced while noise is removed.

Description

Speech enhancement method and device based on complex frequency spectrum characteristics

Technical Field

The present invention relates to the field of speech enhancement technologies, and in particular, to a speech enhancement method and apparatus based on complex spectral features.

Background

Speech signals are corrupted by various types of environmental noise, which can severely distort them and make the speech difficult to understand. The purpose of speech enhancement is to remove or reduce these types of noise from noisy speech.

Traditional single-channel speech enhancement algorithms include spectral subtraction, methods based on minimum mean square error, and the like. However, these algorithms must first estimate the spectral information of the noise, which is difficult when the noise changes abruptly; they also need to assume that the signals follow a Gaussian distribution, which limits the enhancement effect.

Neural networks based on deep learning have therefore been widely applied to speech enhancement. Typically, a single-microphone signal is Fourier transformed, its spectral features (such as the log power spectrum or the Mel log power spectrum) are calculated, the features are fed into a deep neural network to learn a masking value, and the masking value is then applied to the noisy speech to complete the enhancement. However, existing masking enhancement methods ignore the effect of phase on the speech signal.

Disclosure of Invention

In order to solve the problem that existing masking enhancement methods ignore the effect of phase on the speech signal, embodiments of the invention provide a speech enhancement method and device based on complex spectral features. The technical scheme is as follows:

In a first aspect, a speech enhancement method based on complex spectral features is provided, the method including:

carrying out Fourier transform on the voice with noise to obtain a complex frequency spectrum expressed by the voice with noise in a frequency domain;

calculating the logarithm power spectrum of a real part in the complex frequency spectrum to obtain a logarithm real part power spectrum, and calculating the logarithm power spectrum of an imaginary part in the complex frequency spectrum to obtain a logarithm imaginary part power spectrum;

inputting the obtained logarithm real part power spectrum and logarithm imaginary part power spectrum into a pre-trained masking prediction network to obtain a first masking value corresponding to the real part and a second masking value corresponding to the imaginary part;

and respectively enhancing the real part and the imaginary part of the complex frequency spectrum by using the first masking value and the second masking value, and performing inverse Fourier transform on the enhanced complex frequency spectrum to obtain the enhanced voice corresponding to the voice with noise.

Optionally, the training process of the masking prediction network includes:

acquiring a training sample, wherein the training sample comprises a sample voice with noise and a clean voice which is used for being combined with noise to form the sample voice with noise;

carrying out Fourier transform on the sample voice with noise to obtain a sample complex frequency spectrum represented by the sample voice with noise in a frequency domain;

calculating the logarithm power spectrum of the real part in the sample complex frequency spectrum to obtain a sample logarithm real part power spectrum, and calculating the logarithm power spectrum of the imaginary part in the sample complex frequency spectrum to obtain a sample logarithm imaginary part power spectrum;

inputting the obtained sample logarithm real part power spectrum and the sample logarithm imaginary part power spectrum into an initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;

respectively enhancing the real part and the imaginary part of the sample complex frequency spectrum by using the first sample masking value and the second sample masking value, and performing inverse Fourier transform on the enhanced sample complex frequency spectrum to obtain sample enhanced voice corresponding to the sample noisy voice;

calculating a mean square error between the clean speech and the sample enhanced speech as a loss value;

under the condition that the loss value is not converged, adjusting the initial masking prediction network based on the loss value, and returning to input the obtained sample logarithm real part power spectrum and sample logarithm imaginary part power spectrum into the initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;

in case the loss values converge, the initial masking prediction network is taken as a masking prediction network for speech enhancement.
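The stopping logic of the training process above can be sketched as follows. This is a minimal illustration, not the patented implementation: the source does not define what "converged" means, so the criterion in `has_converged` (with its hypothetical `tol` and `patience` parameters) is an assumption, and both function names are introduced here for illustration only.

```python
def mse(clean, enhanced):
    """Mean squared error between two equal-length waveforms."""
    if len(clean) != len(enhanced):
        raise ValueError("waveforms must have the same length")
    return sum((c - e) ** 2 for c, e in zip(clean, enhanced)) / len(clean)


def has_converged(loss_history, tol=1e-4, patience=3):
    """Assumed criterion: the last `patience` + 1 losses span less than `tol`."""
    if len(loss_history) <= patience:
        return False
    recent = loss_history[-(patience + 1):]
    return max(recent) - min(recent) < tol


# Loss for one toy sample: clean speech versus enhanced speech.
loss = mse([1.0, 2.0], [1.0, 4.0])

# A training loop would append each new loss and stop adjusting the
# network once has_converged(...) returns True.
still_training = has_converged([1.0, 0.5, 0.2])
finished = has_converged([0.1, 0.10001, 0.10002, 0.10001])
```

In a real training run the loss would be averaged over a batch of samples before being compared against the history.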

Optionally, the step of calculating the log power spectrum of the real part in the complex frequency spectrum to obtain the log real part power spectrum includes:

calculating the log power spectrum of the real part in the complex frequency spectrum by the following expression to obtain a log real part power spectrum:

LRPS(|X(t，i)|)＝log(|X_real(t，i)|²)

wherein LRPS(|X(t, i)|) represents the log real part power spectrum, and X_real(t, i) represents the real part.

Optionally, the step of calculating the log power spectrum of the imaginary part in the complex frequency spectrum to obtain the log imaginary part power spectrum includes:

calculating the logarithm power spectrum of the imaginary part in the complex frequency spectrum to obtain a logarithm imaginary part power spectrum by the following expression:

LIPS(|X(t，i)|)＝log(|X_image(t，i)|²)

wherein LIPS(|X(t, i)|) represents the log imaginary part power spectrum, and X_image(t, i) represents the imaginary part.

In a second aspect, an apparatus for speech enhancement based on complex spectral features is provided, the apparatus comprising:

the Fourier transform module is used for carrying out Fourier transform on the voice with noise to obtain a complex frequency spectrum represented by the voice with noise in a frequency domain;

the characteristic extraction module is used for calculating the logarithm power spectrum of a real part in the complex frequency spectrum to obtain a logarithm real part power spectrum, and calculating the logarithm power spectrum of an imaginary part in the complex frequency spectrum to obtain a logarithm imaginary part power spectrum;

the masking prediction module is used for inputting the obtained logarithm real part power spectrum and logarithm imaginary part power spectrum into a pre-trained masking prediction network to obtain a first masking value corresponding to the real part and a second masking value corresponding to the imaginary part;

and the voice enhancement module is used for respectively enhancing the real part and the imaginary part of the complex frequency spectrum by using the first masking value and the second masking value, and performing inverse Fourier transform on the enhanced complex frequency spectrum to obtain the enhanced voice corresponding to the noisy voice.

Optionally, the apparatus further includes a model training module, configured to obtain the masking prediction network by:

Optionally, the feature extraction module is specifically configured to calculate a log power spectrum of a real part in the complex frequency spectrum by using the following expression to obtain a log real part power spectrum:

LRPS(|X(t，i)|)＝log(|X_real(t，i)|²)

Optionally, the feature extraction module is specifically configured to calculate a logarithmic power spectrum of an imaginary part in the complex frequency spectrum to obtain a logarithmic imaginary part power spectrum by using the following expression:

LIPS(|X(t，i)|)＝log(|X_image(t，i)|²)

In a third aspect, an electronic device is provided, which includes a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete communication with each other through the communication bus;

a memory for storing a computer program;

a processor, configured to implement the speech enhancement method based on complex spectral features according to the first aspect when executing the program stored in the memory.

In the voice enhancement process, the masking values are respectively obtained for the real part and the imaginary part of the complex frequency spectrum of the voice in the frequency domain, and the process of respectively obtaining the masking values for the real part and the imaginary part implies phase information, namely, the energy and the phase information of the voice signal are simultaneously used as characteristics to obtain the masking values, so that the voice distortion is reduced while the noise is removed.
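As a reading aid, the statement that separate masks imply phase information can be written out explicitly. The polar symbols |X| and θ, the mask symbols M_1 and M_2 (for the first and second masking values), and the reading of X_image as the real-valued coefficient of j are notation assumed here rather than taken from the source; the last line holds only where X_real and M_1 are non-zero.

```latex
\begin{aligned}
X(t,i) &= X_{\mathrm{real}}(t,i) + j\,X_{\mathrm{image}}(t,i),\\
X_{\mathrm{real}}(t,i) &= |X(t,i)|\cos\theta(t,i),\qquad
X_{\mathrm{image}}(t,i) = |X(t,i)|\sin\theta(t,i),\\
\tan\theta_{c}(t,i) &= \frac{M_2\,X_{\mathrm{image}}(t,i)}{M_1\,X_{\mathrm{real}}(t,i)}
= \frac{M_2}{M_1}\,\tan\theta(t,i).
\end{aligned}
```

When M_1 = M_2 the enhanced phase θ_c equals the noisy phase θ, which corresponds to applying a single magnitude mask; unequal masks change the phase as well as the magnitude.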

Drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a speech enhancement method based on complex spectral features according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a speech enhancement apparatus based on complex spectral features according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Referring to fig. 1, an embodiment of the present invention provides a speech enhancement method based on complex spectral features, including:

S100, carrying out Fourier transform on the voice with noise to obtain a complex frequency spectrum represented by the voice with noise in a frequency domain.

In implementation, a noisy speech signal can be written in the time domain as:

y(k)＝s(k)+n(k) (1)

wherein y(k), s(k), and n(k) represent the noisy speech, the clean speech component, and the noise component, respectively. Since speech enhancement is typically performed in the frequency domain, the Fourier transform of the noisy speech can be written as:

Y(t，k)＝S(t，k)+N(t，k) (2)

or Y(t，k)＝Y(t，k)_real+Y(t，k)_image (3)

During enhancement, the energies of the real part and the imaginary part are obtained separately, so the phase information of the noisy speech signal is implicitly included; that is, both the energy and the phase of the noisy speech signal are used as features for speech enhancement.
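Step S100 can be sketched for a single frame as follows. This is an illustrative sketch under stated assumptions: the source does not specify frame length, windowing, or the transform implementation, so a naive O(N²) DFT on an 8-sample toy frame is used to keep the example dependency-free, and the function names `dft_frame` and `split_real_imag` are hypothetical.

```python
import cmath
import math


def dft_frame(frame):
    """Naive one-sided DFT of one real-valued frame; returns bins 0..N/2."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * m / n) for m, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]


def split_real_imag(spectrum):
    """Split a complex spectrum into its real and imaginary components."""
    return [c.real for c in spectrum], [c.imag for c in spectrum]


# Toy frame: a cosine at bin 2 plus a sine at bin 3 (8 samples).
N = 8
frame = [
    math.cos(2 * math.pi * 2 * m / N) + math.sin(2 * math.pi * 3 * m / N)
    for m in range(N)
]
real, imag = split_real_imag(dft_frame(frame))
# The cosine lands in the real part of bin 2 and the sine in the
# imaginary part of bin 3, so the two parts together carry the phase.
```

A practical system would apply this per overlapping, windowed frame with a fast Fourier transform rather than the direct sum shown here.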

S110, calculating the log power spectrum of the real part in the complex frequency spectrum to obtain a log real part power spectrum, and calculating the log power spectrum of the imaginary part in the complex frequency spectrum to obtain a log imaginary part power spectrum.

In implementation, the log power spectra of the real part and the imaginary part in the complex spectrum can be calculated by the following expressions:

LRPS(|X(t，i)|)＝log(|X_real(t，i)|²) (4)

LIPS(|X(t，i)|)＝log(|X_image(t，i)|²) (5)
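Expressions (4) and (5) can be sketched for one frame as below. Two details are assumptions, since the source does not state them: the logarithm is taken as the natural log, and a small floor `EPS` is added so that a bin whose real or imaginary part is exactly zero does not produce log(0). The function names are hypothetical.

```python
import math

EPS = 1e-12  # assumed floor; the source does not say how log(0) is avoided


def log_power(values, eps=EPS):
    """log(|v|^2) per bin, with a small floor added to keep the log finite."""
    return [math.log(v * v + eps) for v in values]


def lrps_lips(spectrum):
    """Return (LRPS, LIPS) for one frame of complex bins, following (4) and (5)."""
    lrps = log_power([c.real for c in spectrum])
    lips = log_power([c.imag for c in spectrum])
    return lrps, lips


# Two example bins: one with both parts non-zero, one purely imaginary.
frame_spectrum = [complex(3.0, -4.0), complex(0.0, 2.0)]
lrps, lips = lrps_lips(frame_spectrum)
```

Note that squaring discards the sign of each part, so the features determine the phase only up to its quadrant.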

S120, inputting the obtained logarithm real part power spectrum and logarithm imaginary part power spectrum into a pre-trained masking prediction network to obtain a first masking value corresponding to the real part and a second masking value corresponding to the imaginary part.

In implementation, the log real part power spectrum and the log imaginary part power spectrum are obtained and input as features into the masking prediction network, which outputs the masking values (masks) corresponding to the real part and the imaginary part respectively. The masking prediction network is built from a GRU network: it comprises three GRU layers and one FC layer that uses sigmoid as its activation function. The specific training process comprises the following steps:

acquiring a training sample, wherein the training sample comprises a sample voice with noise and clean voice which is used for being combined with noise to form the sample voice with noise;

carrying out Fourier transform on the sample voice with noise to obtain a sample complex frequency spectrum expressed by the sample voice with noise in a frequency domain;

calculating the logarithm power spectrum of a real part in the sample complex frequency spectrum to obtain a sample logarithm real part power spectrum, and calculating the logarithm power spectrum of an imaginary part in the sample complex frequency spectrum to obtain a sample logarithm imaginary part power spectrum;

inputting the obtained sample logarithm real part power spectrum and the sample logarithm imaginary part power spectrum into an initial masking prediction network to obtain a first sample masking value corresponding to a real part and a second sample masking value corresponding to an imaginary part;

respectively enhancing a real part and an imaginary part of a sample complex frequency spectrum by using the first sample masking value and the second sample masking value, and performing Fourier inverse transformation on the enhanced sample complex frequency spectrum to obtain a sample enhanced voice corresponding to the sample noisy voice;

calculating the mean square error between the clean voice and the sample enhanced voice as a loss value;

under the condition that the loss value is not converged, adjusting the initial masking prediction network based on the loss value, and returning to input the obtained sample logarithm real part power spectrum and the sample logarithm imaginary part power spectrum into the initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;

in case the loss values converge, the initial masking prediction network is used as the masking prediction network for speech enhancement.
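The forward pass of a network with the shape described for S120 (three GRU layers followed by one sigmoid FC layer) can be sketched as follows. This is a hedged illustration only: the weights are random and untrained, biases are omitted, the gate equations follow one common GRU formulation, and the choice of a single FC layer that emits both masks side by side is an assumption, because the source does not say how the two masks are produced. The class names `GRULayer` and `MaskNet` are hypothetical, and the toy sizes are chosen for speed.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def matvec(w, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRULayer:
    """One GRU layer with random (untrained) weights; biases omitted."""

    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        # Input matrix W and recurrent matrix U per gate:
        # update (z), reset (r), candidate (n).
        self.w = {g: rand_matrix(hid_dim, in_dim, rng) for g in "zrn"}
        self.u = {g: rand_matrix(hid_dim, hid_dim, rng) for g in "zrn"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.w["z"], x), matvec(self.u["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.w["r"], x), matvec(self.u["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.w["n"], x), matvec(self.u["n"], rh))]
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


class MaskNet:
    """Three stacked GRU layers, then one sigmoid FC layer (assumed layout)."""

    def __init__(self, n_bins, hidden=64, n_layers=3, seed=0):
        rng = random.Random(seed)
        dims = [2 * n_bins] + [hidden] * n_layers
        self.layers = [GRULayer(dims[i], dims[i + 1], rng) for i in range(n_layers)]
        self.fc = rand_matrix(2 * n_bins, hidden, rng)  # assumed: emits both masks
        self.n_bins = n_bins

    def forward(self, lrps_frames, lips_frames):
        states = [[0.0] * layer.hid_dim for layer in self.layers]
        masks = []
        for lrps, lips in zip(lrps_frames, lips_frames):
            x = list(lrps) + list(lips)  # concatenate the two feature vectors
            for i, layer in enumerate(self.layers):
                states[i] = layer.step(x, states[i])
                x = states[i]
            out = [sigmoid(v) for v in matvec(self.fc, x)]
            masks.append((out[:self.n_bins], out[self.n_bins:]))
        return masks


# Toy run: 4 frames, 5 frequency bins, hidden size 8.
net = MaskNet(n_bins=5, hidden=8)
lrps_frames = [[-1.0, 0.5, 2.0, -3.0, 0.0]] * 4
lips_frames = [[0.3, -2.0, 1.0, 0.0, -0.5]] * 4
masks = net.forward(lrps_frames, lips_frames)  # list of (mask_real, mask_image)
```

Because the output layer is a sigmoid, every mask value lies between 0 and 1, so each mask can only attenuate its component.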

S130, the real part and the imaginary part of the complex frequency spectrum are respectively enhanced by utilizing the first masking value and the second masking value, and the enhanced complex frequency spectrum is subjected to Fourier inverse transformation to obtain enhanced voice corresponding to the voice with noise.

In the implementation, the real part and the imaginary part are enhanced respectively, which may be specifically expressed as:

Y_c(t，k)_real＝mask_real*Y(t，k)_real (6)

Y_c(t，k)_image＝mask_image*Y(t，k)_image (7)

y_c(k)＝ifft(Y_c(t，k)) (8)

wherein Y_c(t, k) is the enhanced complex frequency spectrum, which is subjected to an inverse fast Fourier transform to obtain the enhanced voice y_c(k).
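Expressions (6) to (8) can be sketched on a single toy frame as follows. The sketch assumes that the masks are applied to the one-sided spectrum (bins 0 to N/2) and that the remaining bins are rebuilt by conjugate symmetry before the inverse transform; the source does not describe this detail. A naive inverse DFT stands in for the ifft, and all function names are hypothetical.

```python
import cmath
import math


def apply_masks(half_spectrum, mask_real, mask_imag):
    """Scale the real and imaginary parts independently, as in (6) and (7)."""
    return [
        complex(mr * c.real, mi * c.imag)
        for c, mr, mi in zip(half_spectrum, mask_real, mask_imag)
    ]


def hermitian_extend(half_spectrum, n):
    """Rebuild all n bins of a real signal's spectrum from bins 0..n/2 (n even)."""
    mirrored = [half_spectrum[n - k].conjugate() for k in range(n // 2 + 1, n)]
    return list(half_spectrum) + mirrored


def inverse_dft(full_spectrum):
    """Naive inverse DFT as in (8); the result is real for a Hermitian spectrum."""
    n = len(full_spectrum)
    return [
        (sum(c * cmath.exp(2j * math.pi * k * m / n)
             for k, c in enumerate(full_spectrum)) / n).real
        for m in range(n)
    ]


# One-sided spectrum of the 4-sample frame [1, 2, 3, 4].
half = [complex(10, 0), complex(-2, 2), complex(-2, 0)]
ones = [1.0, 1.0, 1.0]

# All-ones masks leave the frame unchanged.
unchanged = inverse_dft(hermitian_extend(apply_masks(half, ones, ones), 4))
# Zeroing the imaginary mask changes the phase, not only the magnitude.
real_only = inverse_dft(hermitian_extend(apply_masks(half, ones, [0.0, 0.0, 0.0]), 4))
```

With a full utterance, the enhanced frames would additionally be recombined by overlap-add, which is outside the scope of this sketch.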

To verify the enhancement effect on single-microphone speech, a large amount of noisy speech data was constructed. Specifically, more than one hundred thousand collected clean speech recordings and the clean speech in the open-source AISHELL dataset were used; knocking noise, television noise, music noise and the like were collected as point-source interference, and subway noise, bus noise, office noise and the like were collected as diffuse noise. Clean speech and noise were then randomly selected and, following practical scenarios, 840,000 noisy speech samples with signal-to-noise ratios ranging from -5 dB to 15 dB were constructed: 800,000 were used for network training, 20,000 for validation and network tuning during training, and 20,000 for testing the effect after training was complete. The audio sampling rate for all data is 16 kHz.
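Constructing one noisy sample at a chosen signal-to-noise ratio can be sketched as below. This is an assumed procedure, since the source only states the SNR range and the 16 kHz sampling rate: the noise is scaled so that the ratio of clean power to noise power matches the target, then added to the clean speech. The synthetic sine and Gaussian noise stand in for real recordings, and `mix_at_snr` is a hypothetical name.

```python
import math
import random


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add."""
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        raise ValueError("noise must have non-zero power")
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(clean, noise)]


rng = random.Random(0)
SAMPLE_RATE = 16000  # sampling rate stated in the text
clean = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]
snr_db = rng.uniform(-5.0, 15.0)  # SNR range stated in the text
noisy = mix_at_snr(clean, noise, snr_db)
```

Repeating this with randomly drawn clean and noise recordings would yield a training set of the kind described.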

In the testing stage, the final noise reduction effect is compared using SI-SDR, short-time objective intelligibility (STOI), and perceptual evaluation of speech quality (PESQ). The final test results are shown in Table 1: extracting the log real part power spectrum and the log imaginary part power spectrum separately as features for speech enhancement yields higher scores on all three metrics than the existing LPS-based speech enhancement method.

Means (feature)	Network	SI-SDR (dB)	PESQ	STOI
Far-field speech with noise	-	1.63	2.12	0.75
LPS	64-64-64gru+257fc	11.75	2.67	0.82
LRPS+LIPS/mask_real+mask_image	64-64-64gru+257fc	11.97	2.88	0.87

Table 1. Comparison of test results
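For reference, the SI-SDR metric reported in Table 1 can be computed along the following lines. This sketch uses the common definition in which the estimate is projected onto the clean reference; the mean removal that is often applied beforehand is omitted for brevity, and the source does not state which exact variant was used. The function name `si_sdr` is hypothetical.

```python
import math


def si_sdr(reference, estimate):
    """Scale-invariant SDR in dB between a clean reference and an estimate."""
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / ref_energy  # optimal scaling of the reference
    target = [alpha * r for r in reference]
    target_energy = sum(t * t for t in target)
    error_energy = sum((e - t) ** 2 for e, t in zip(estimate, target))
    if error_energy == 0.0:
        return math.inf  # estimate is an exact scaled copy of the reference
    return 10 * math.log10(target_energy / error_energy)


reference = [1.0, 0.0, -1.0, 0.0]
estimate = [2.0, 0.2, -2.0, 0.2]  # scaled reference plus a small orthogonal error
score = si_sdr(reference, estimate)
# Rescaling the estimate leaves the score unchanged.
score_rescaled = si_sdr(reference, [0.5 * e for e in estimate])
```

Here the target energy is 8 and the error energy is 0.08, so the score is about 20 dB for both calls.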

Referring to fig. 2, an embodiment of the present invention provides a speech enhancement apparatus based on complex spectral features, where the apparatus includes:

the fourier transform module 200 is configured to perform fourier transform on the voice with noise to obtain a complex frequency spectrum represented by the voice with noise in a frequency domain;

a feature extraction module 210, configured to calculate a log power spectrum of a real part in the complex spectrum to obtain a log real part power spectrum, and calculate a log power spectrum of an imaginary part in the complex spectrum to obtain a log imaginary part power spectrum;

a masking prediction module 220, configured to input the obtained log real part power spectrum and log imaginary part power spectrum into a pre-trained masking prediction network, so as to obtain a first masking value corresponding to the real part and a second masking value corresponding to the imaginary part;

the speech enhancement module 230 is configured to enhance the real part and the imaginary part of the complex frequency spectrum by using the first masking value and the second masking value, and perform inverse fourier transform on the enhanced complex frequency spectrum to obtain an enhanced speech corresponding to the noisy speech.

In an implementation, the apparatus further comprises a model training module for obtaining the masking prediction network by the following steps:

In an implementation, the feature extraction module 210 is specifically configured to calculate a log power spectrum of a real part in the complex frequency spectrum by using the following expression:

LRPS(|X(t，i)|)＝log(|X_real(t，i)|²)

In an implementation, the feature extraction module 210 is specifically configured to calculate a log power spectrum of an imaginary part in the complex frequency spectrum by using the following expression to obtain a log imaginary part power spectrum:

LIPS(|X(t，i)|)＝log(|X_image(t，i)|²)

An embodiment of the present invention further provides an electronic device, as shown in fig. 3, including a processor 001, a communication interface 002, a memory 003 and a communication bus 004, where the processor 001, the communication interface 002 and the memory 003 complete mutual communication through the communication bus 004,

a memory 003 for storing a computer program;

the processor 001, when executing the program stored in the memory 003, is configured to implement the method for speech enhancement based on complex spectral features, including:

The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the electronic equipment and other equipment.

The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.

The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.

The embodiments in this specification are described in a related manner; for parts that are the same or similar among the embodiments, reference may be made from one to another, and each embodiment focuses on its differences from the others. In particular, since the apparatus and electronic device embodiments are substantially similar to the method embodiments, their description is relatively brief, and for relevant points reference may be made to the corresponding parts of the method embodiment description.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A method for speech enhancement based on complex spectral features, the method comprising:

2. The method of claim 1, wherein the training process of the masking prediction network comprises:

3. The method of claim 1, wherein the step of calculating the log power spectrum of the real part of the complex spectrum to obtain a log real part power spectrum comprises:

LRPS(|X(t，i)|)＝log(|X_real(t，i)|²)

4. The method of claim 1, wherein the step of calculating the log power spectrum of the imaginary part in the complex spectrum to obtain a log imaginary part power spectrum comprises:

LIPS(|X(t，i)|)＝log(|X_image(t，i)|²)

5. An apparatus for speech enhancement based on complex spectral features, the apparatus comprising:

6. The apparatus of claim 5, further comprising a model training module to derive a masking prediction network by:

7. The apparatus of claim 5, wherein the feature extraction module is specifically configured to calculate a log power spectrum of a real part in the complex spectrum by the following expression:

LRPS(|X(t，i)|)＝log(|X_real(t，i)|²)

8. The apparatus according to claim 5, wherein the feature extraction module is specifically configured to calculate a log power spectrum of an imaginary part in the complex spectrum by the following expression to obtain a log imaginary part power spectrum:

LIPS(|X(t，i)|)＝log(|X_image(t，i)|²)

9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;

a memory for storing a computer program;

a processor for implementing the method steps of any of claims 1 to 4 when executing a program stored in the memory.