CN114141267A - Speech enhancement method and device based on complex frequency spectrum characteristics - Google Patents
- Publication number
- CN114141267A (application CN202111447463.9A)
- Authority
- CN
- China
- Prior art keywords
- sample
- power spectrum
- logarithm
- masking
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Quality & Reliability (AREA)
- Soundproofing, Sound Blocking, And Sound Damping (AREA)
Abstract
The invention discloses a speech enhancement method and device based on complex spectral features. The method comprises: performing a Fourier transform on noisy speech to obtain a complex spectrum; calculating the log real-part power spectrum of the real part and the log imaginary-part power spectrum of the imaginary part of the complex spectrum; inputting the log real-part power spectrum and the log imaginary-part power spectrum into a pre-trained masking prediction network to obtain a first masking value corresponding to the real part and a second masking value corresponding to the imaginary part; and enhancing the real part and the imaginary part of the complex spectrum with the first and second masking values respectively, then performing an inverse Fourier transform on the enhanced complex spectrum to obtain the enhanced speech. In the scheme provided by the embodiments of the invention, masking values are obtained separately for the real part and the imaginary part of the complex spectrum of the speech in the frequency domain. Obtaining the masks separately implicitly captures phase information; that is, both the energy and the phase of the speech signal serve as features for obtaining the masking values, so speech distortion is reduced while noise is removed.
Description
Technical Field
The present invention relates to the field of speech enhancement technologies, and in particular, to a speech enhancement method and apparatus based on complex spectral features.
Background
Speech signals are corrupted by various types of environmental noise. This interference can severely distort the speech signal and make it difficult for listeners to understand what is being said. The purpose of speech enhancement is to remove or reduce the noise in noisy speech.
Traditional single-channel speech enhancement algorithms include spectral subtraction and methods based on the minimum mean square error. These algorithms must first estimate the spectrum of the noise, which is difficult when the noise changes abruptly. They also assume that the signals follow a Gaussian distribution, which limits the enhancement they can achieve.
For this reason, neural networks based on deep learning are now widely used in speech enhancement. Typically, a single-microphone signal is Fourier transformed, its spectral features (such as the log power spectrum or the Mel log power spectrum) are computed, the features are fed to a deep neural network that learns a masking value, and the masking value is then applied to the noisy speech to complete the enhancement. However, existing masking-based enhancement methods ignore the effect of phase on the speech signal.
Disclosure of Invention
In order to solve the problem that existing masking-based enhancement methods ignore the effect of phase on the speech signal, the embodiments of the invention provide a speech enhancement method and device based on complex spectral features. The technical scheme is as follows:
In a first aspect, a speech enhancement method based on complex spectral features is provided, the method including:
performing a Fourier transform on noisy speech to obtain the complex spectrum of the noisy speech in the frequency domain;
calculating the log power spectrum of the real part of the complex spectrum to obtain a log real-part power spectrum, and calculating the log power spectrum of the imaginary part of the complex spectrum to obtain a log imaginary-part power spectrum;
inputting the obtained log real-part power spectrum and log imaginary-part power spectrum into a pre-trained masking prediction network to obtain a first masking value corresponding to the real part and a second masking value corresponding to the imaginary part;
and enhancing the real part and the imaginary part of the complex spectrum with the first masking value and the second masking value respectively, and performing an inverse Fourier transform on the enhanced complex spectrum to obtain the enhanced speech corresponding to the noisy speech.
Optionally, the training process of the masking prediction network includes:
acquiring a training sample, wherein the training sample comprises a sample voice with noise and a clean voice which is used for being combined with noise to form the sample voice with noise;
carrying out Fourier transform on the sample voice with noise to obtain a sample complex frequency spectrum represented by the sample voice with noise in a frequency domain;
calculating the logarithm power spectrum of the real part in the sample complex frequency spectrum to obtain a sample logarithm real part power spectrum, and calculating the logarithm power spectrum of the imaginary part in the sample complex frequency spectrum to obtain a sample logarithm imaginary part power spectrum;
inputting the obtained sample logarithm real part power spectrum and the sample logarithm imaginary part power spectrum into an initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;
respectively enhancing the real part and the imaginary part of the sample complex frequency spectrum by using the first sample masking value and the second sample masking value, and performing inverse Fourier transform on the enhanced sample complex frequency spectrum to obtain sample enhanced voice corresponding to the sample noisy voice;
calculating a mean square error between the clean speech and the sample enhanced speech as a loss value;
if the loss value has not converged, adjusting the initial masking prediction network based on the loss value, and returning to the step of inputting the obtained sample logarithm real part power spectrum and sample logarithm imaginary part power spectrum into the initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;
if the loss value has converged, taking the initial masking prediction network, as adjusted, as the masking prediction network used for speech enhancement.
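As an illustration only, the loss computation and the convergence test in the training steps above might look like the following Python sketch. The function names `mse_loss` and `has_converged`, and the `tol` and `patience` parameters, are hypothetical; the source does not specify how convergence of the loss value is judged.

```python
import numpy as np

def mse_loss(clean, enhanced):
    """Mean square error between the clean speech and the sample enhanced speech."""
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    return float(np.mean((clean - enhanced) ** 2))

def has_converged(losses, tol=1e-4, patience=3):
    """Assumed criterion: the best loss improved by less than `tol`
    over the most recent `patience` training rounds."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return best_before - best_recent < tol
```

In a training loop, the loss of each round would be appended to a list, and the network would keep being adjusted until `has_converged` returns `True`.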
Optionally, the step of calculating the log power spectrum of the real part in the complex frequency spectrum to obtain the log real part power spectrum includes:
calculating the log power spectrum of the real part in the complex frequency spectrum by the following expression to obtain a log real part power spectrum:
$$\mathrm{LRPS}(|X(t,i)|) = \log\left(|X_{\mathrm{real}}(t,i)|^{2}\right)$$
where $\mathrm{LRPS}(|X(t,i)|)$ denotes the log real-part power spectrum and $X_{\mathrm{real}}(t,i)$ denotes the real part.
Optionally, the step of calculating the log power spectrum of the imaginary part in the complex frequency spectrum to obtain the log imaginary-part power spectrum includes:
calculating the log power spectrum of the imaginary part in the complex frequency spectrum by the following expression to obtain the log imaginary-part power spectrum:
$$\mathrm{LIPS}(|X(t,i)|) = \log\left(|X_{\mathrm{image}}(t,i)|^{2}\right)$$
where $\mathrm{LIPS}(|X(t,i)|)$ denotes the log imaginary-part power spectrum and $X_{\mathrm{image}}(t,i)$ denotes the imaginary part.
In a second aspect, an apparatus for speech enhancement based on complex spectral features is provided, the apparatus comprising:
the Fourier transform module is used for carrying out Fourier transform on the voice with noise to obtain a complex frequency spectrum represented by the voice with noise in a frequency domain;
the characteristic extraction module is used for calculating the logarithm power spectrum of a real part in the complex frequency spectrum to obtain a logarithm real part power spectrum, and calculating the logarithm power spectrum of an imaginary part in the complex frequency spectrum to obtain a logarithm imaginary part power spectrum;
the masking prediction module is used for inputting the obtained logarithm real part power spectrum and logarithm imaginary part power spectrum into a pre-trained masking prediction network to obtain a first masking value corresponding to the real part and a second masking value corresponding to the imaginary part;
and the voice enhancement module is used for respectively enhancing the real part and the imaginary part of the complex frequency spectrum by using the first masking value and the second masking value, and performing inverse Fourier transform on the enhanced complex frequency spectrum to obtain the enhanced voice corresponding to the noisy voice.
Optionally, the apparatus further includes a model training module, configured to obtain the masking prediction network through the following steps:
acquiring a training sample, wherein the training sample comprises a sample voice with noise and a clean voice which is used for being combined with noise to form the sample voice with noise;
carrying out Fourier transform on the sample voice with noise to obtain a sample complex frequency spectrum represented by the sample voice with noise in a frequency domain;
calculating the logarithm power spectrum of the real part in the sample complex frequency spectrum to obtain a sample logarithm real part power spectrum, and calculating the logarithm power spectrum of the imaginary part in the sample complex frequency spectrum to obtain a sample logarithm imaginary part power spectrum;
inputting the obtained sample logarithm real part power spectrum and the sample logarithm imaginary part power spectrum into an initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;
respectively enhancing the real part and the imaginary part of the sample complex frequency spectrum by using the first sample masking value and the second sample masking value, and performing inverse Fourier transform on the enhanced sample complex frequency spectrum to obtain sample enhanced voice corresponding to the sample noisy voice;
calculating a mean square error between the clean speech and the sample enhanced speech as a loss value;
if the loss value has not converged, adjusting the initial masking prediction network based on the loss value, and returning to the step of inputting the obtained sample logarithm real part power spectrum and sample logarithm imaginary part power spectrum into the initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;
if the loss value has converged, taking the initial masking prediction network, as adjusted, as the masking prediction network used for speech enhancement.
Optionally, the feature extraction module is specifically configured to calculate a log power spectrum of a real part in the complex frequency spectrum by using the following expression to obtain a log real part power spectrum:
$$\mathrm{LRPS}(|X(t,i)|) = \log\left(|X_{\mathrm{real}}(t,i)|^{2}\right)$$
where $\mathrm{LRPS}(|X(t,i)|)$ denotes the log real-part power spectrum and $X_{\mathrm{real}}(t,i)$ denotes the real part.
Optionally, the feature extraction module is specifically configured to calculate a logarithmic power spectrum of an imaginary part in the complex frequency spectrum to obtain a logarithmic imaginary part power spectrum by using the following expression:
$$\mathrm{LIPS}(|X(t,i)|) = \log\left(|X_{\mathrm{image}}(t,i)|^{2}\right)$$
where $\mathrm{LIPS}(|X(t,i)|)$ denotes the log imaginary-part power spectrum and $X_{\mathrm{image}}(t,i)$ denotes the imaginary part.
In a third aspect, an electronic device is provided, which includes a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the speech enhancement method based on complex spectral features according to the first aspect when executing the program stored in the memory.
In the speech enhancement process, masking values are obtained separately for the real part and the imaginary part of the complex spectrum of the speech in the frequency domain. Because the masks for the real and imaginary parts are obtained separately, phase information is implicitly included; that is, both the energy and the phase of the speech signal serve as features for obtaining the masking values, so speech distortion is reduced while noise is removed.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a speech enhancement method based on complex spectral features according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a speech enhancement apparatus based on complex spectral features according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Referring to fig. 1, an embodiment of the present invention provides a speech enhancement method based on complex spectral features, including:
S100: Perform a Fourier transform on the noisy speech to obtain the complex spectrum of the noisy speech in the frequency domain.
In implementation, a noisy speech signal can be written in the time domain as:
$$y(k) = s(k) + n(k) \tag{1}$$
where $y(k)$, $s(k)$ and $n(k)$ denote the noisy speech, the clean speech component and the noise component, respectively. Speech enhancement is typically performed in the frequency domain; after the Fourier transform, the noisy speech can be written as:
$$Y(t,k) = S(t,k) + N(t,k) \tag{2}$$
or
$$Y(t,k) = Y(t,k)_{\mathrm{real}} + j\,Y(t,k)_{\mathrm{image}} \tag{3}$$
where $Y(t,k)_{\mathrm{real}}$ and $Y(t,k)_{\mathrm{image}}$ are the real part and the imaginary part of the spectrum, and $j$ is the imaginary unit.
During enhancement, the energies of the real part and the imaginary part are computed separately, which implicitly retains the phase information of the noisy speech signal; that is, both the energy and the phase of the noisy speech signal are used as features for speech enhancement.
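A minimal sketch of step S100 in Python with NumPy is given below for illustration. The function name `stft_real_imag`, the Hann window, the frame length of 512 samples and the hop of 256 samples are assumptions; the source does not state them. A 512-point frame is chosen only because it yields 257 frequency bins, which matches the 257-unit FC layer listed in Table 1.

```python
import numpy as np

def stft_real_imag(y, frame_len=512, hop=256):
    """Frame the signal, window each frame, and return the real and
    imaginary parts of the frame-wise spectrum Y(t, k)."""
    y = np.asarray(y, dtype=float)
    if len(y) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(y) - frame_len) // hop
    window = np.hanning(frame_len)  # assumed analysis window
    frames = np.stack(
        [y[t * hop : t * hop + frame_len] * window for t in range(n_frames)]
    )
    spectrum = np.fft.rfft(frames, axis=1)  # shape (n_frames, frame_len // 2 + 1)
    return spectrum.real, spectrum.imag

# Example: one second of a 1 kHz tone sampled at 16 kHz.
fs = 16000
y = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)
y_real, y_imag = stft_real_imag(y)
```

The two returned arrays correspond to $Y(t,k)_{\mathrm{real}}$ and $Y(t,k)_{\mathrm{image}}$ in equation (3).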
S110: Calculate the log power spectrum of the real part of the complex spectrum to obtain the log real-part power spectrum, and calculate the log power spectrum of the imaginary part of the complex spectrum to obtain the log imaginary-part power spectrum.
In implementation, the log real-part power spectrum can be obtained from the real part of the complex spectrum by the following expression:
$$\mathrm{LRPS}(|X(t,i)|) = \log\left(|X_{\mathrm{real}}(t,i)|^{2}\right) \tag{4}$$
where $\mathrm{LRPS}(|X(t,i)|)$ denotes the log real-part power spectrum and $X_{\mathrm{real}}(t,i)$ denotes the real part.
The log imaginary-part power spectrum can be obtained from the imaginary part of the complex spectrum by the following expression:
$$\mathrm{LIPS}(|X(t,i)|) = \log\left(|X_{\mathrm{image}}(t,i)|^{2}\right) \tag{5}$$
where $\mathrm{LIPS}(|X(t,i)|)$ denotes the log imaginary-part power spectrum and $X_{\mathrm{image}}(t,i)$ denotes the imaginary part.
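Equations (4) and (5) can be sketched in Python with NumPy as follows. The function name `log_real_imag_power` is hypothetical. The small constant `eps`, the natural logarithm and the concatenation of the two features into one vector are assumptions: the source does not say how zero-valued bins are handled, which logarithm base is used, or how the two features are combined at the network input.

```python
import numpy as np

def log_real_imag_power(spectrum, eps=1e-12):
    """Return (LRPS, LIPS) of a complex spectrum, following equations (4) and (5).
    `eps` (an assumption) keeps the logarithm finite when a component is zero."""
    lrps = np.log(spectrum.real ** 2 + eps)
    lips = np.log(spectrum.imag ** 2 + eps)
    return lrps, lips

# One frame with two frequency bins.
X = np.array([[3.0 + 4.0j, 1.0 + 0.0j]])
lrps, lips = log_real_imag_power(X)

# Assumed input layout: both features side by side for each frame.
features = np.concatenate([lrps, lips], axis=-1)
```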
S120: Input the obtained log real-part power spectrum and log imaginary-part power spectrum into a pre-trained masking prediction network to obtain a first masking value corresponding to the real part and a second masking value corresponding to the imaginary part.
In implementation, the log real-part power spectrum and the log imaginary-part power spectrum are computed separately and fed as features into the masking prediction network, which outputs the masking values (masks) corresponding to the real part and the imaginary part respectively. The masking prediction network is built from GRU layers: it comprises three GRU layers and one fully connected (FC) layer that uses a sigmoid activation function. The specific training process includes the following steps:
acquiring a training sample, wherein the training sample comprises a sample voice with noise and clean voice which is used for being combined with noise to form the sample voice with noise;
carrying out Fourier transform on the sample voice with noise to obtain a sample complex frequency spectrum expressed by the sample voice with noise in a frequency domain;
calculating the logarithm power spectrum of a real part in the sample complex frequency spectrum to obtain a sample logarithm real part power spectrum, and calculating the logarithm power spectrum of an imaginary part in the sample complex frequency spectrum to obtain a sample logarithm imaginary part power spectrum;
inputting the obtained sample logarithm real part power spectrum and the sample logarithm imaginary part power spectrum into an initial masking prediction network to obtain a first sample masking value corresponding to a real part and a second sample masking value corresponding to an imaginary part;
respectively enhancing a real part and an imaginary part of a sample complex frequency spectrum by using the first sample masking value and the second sample masking value, and performing Fourier inverse transformation on the enhanced sample complex frequency spectrum to obtain a sample enhanced voice corresponding to the sample noisy voice;
calculating the mean square error between the clean voice and the sample enhanced voice as a loss value;
if the loss value has not converged, adjusting the initial masking prediction network based on the loss value, and returning to the step of inputting the obtained sample logarithm real part power spectrum and sample logarithm imaginary part power spectrum into the initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;
if the loss value has converged, taking the initial masking prediction network, as adjusted, as the masking prediction network used for speech enhancement.
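For illustration, the forward pass of a network with the described shape (three GRU layers followed by one sigmoid-activated FC layer) might be sketched in NumPy as below. This is not the patented implementation. The class names `GRULayer` and `MaskNet`, the random untrained weights, the particular GRU gate formulation, and an FC layer with 2 x 257 outputs that is split into two masks are all assumptions; Table 1 only lists "64-64-64gru+257fc" and does not say how the two masks are produced.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRULayer:
    """One GRU layer with randomly initialised (untrained) weights."""
    def __init__(self, n_in, n_hidden, rng):
        scale = 1.0 / np.sqrt(n_hidden)
        # Index 0: update gate z, 1: reset gate r, 2: candidate state n.
        self.W = rng.uniform(-scale, scale, (3, n_in, n_hidden))
        self.U = rng.uniform(-scale, scale, (3, n_hidden, n_hidden))
        self.b = np.zeros((3, n_hidden))
        self.n_hidden = n_hidden

    def forward(self, x):
        """x: (n_frames, n_in) -> hidden states (n_frames, n_hidden)."""
        h = np.zeros(self.n_hidden)
        states = []
        for x_t in x:
            z = sigmoid(x_t @ self.W[0] + h @ self.U[0] + self.b[0])
            r = sigmoid(x_t @ self.W[1] + h @ self.U[1] + self.b[1])
            n = np.tanh(x_t @ self.W[2] + (r * h) @ self.U[2] + self.b[2])
            h = (1.0 - z) * n + z * h
            states.append(h)
        return np.stack(states)

class MaskNet:
    """Three 64-unit GRU layers and one sigmoid FC layer (sizes from Table 1)."""
    def __init__(self, n_bins=257, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [2 * n_bins, hidden, hidden, hidden]
        self.grus = [GRULayer(sizes[i], sizes[i + 1], rng) for i in range(3)]
        # Assumption: the FC layer emits both masks, hence 2 * n_bins outputs.
        self.W_fc = rng.uniform(-0.1, 0.1, (hidden, 2 * n_bins))
        self.b_fc = np.zeros(2 * n_bins)
        self.n_bins = n_bins

    def forward(self, lrps, lips):
        h = np.concatenate([lrps, lips], axis=-1)
        for gru in self.grus:
            h = gru.forward(h)
        masks = sigmoid(h @ self.W_fc + self.b_fc)
        return masks[:, : self.n_bins], masks[:, self.n_bins :]
```

Because of the sigmoid output, every mask value lies between 0 and 1, so each mask can only attenuate the component it is applied to.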
S130: Enhance the real part and the imaginary part of the complex spectrum with the first masking value and the second masking value respectively, and perform an inverse Fourier transform on the enhanced complex spectrum to obtain the enhanced speech corresponding to the noisy speech.
In implementation, the separate enhancement of the real part and the imaginary part can be expressed as:
$$Y_c(t,k)_{\mathrm{real}} = \mathrm{mask}_{\mathrm{real}} \cdot Y(t,k)_{\mathrm{real}} \tag{6}$$
$$Y_c(t,k)_{\mathrm{image}} = \mathrm{mask}_{\mathrm{image}} \cdot Y(t,k)_{\mathrm{image}} \tag{7}$$
$$y_c(k) = \mathrm{ifft}\left(Y_c(t,k)\right) \tag{8}$$
where $Y_c(t,k)$ is the enhanced complex spectrum, to which the inverse fast Fourier transform is applied to obtain the enhanced speech $y_c(k)$.
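Equations (6) to (8) can be illustrated with the following NumPy sketch. The function name `apply_masks_and_invert` and the frame length of 512 samples are assumptions. Only the per-frame inverse transform is shown; recombining overlapping frames into a continuous waveform (for example by overlap-add) is not described in the source and is omitted here.

```python
import numpy as np

def apply_masks_and_invert(Y, mask_real, mask_imag, frame_len=512):
    """Mask the real and imaginary parts separately (equations (6) and (7)),
    then inverse-transform each frame (equation (8))."""
    Yc = mask_real * Y.real + 1j * (mask_imag * Y.imag)
    return np.fft.irfft(Yc, n=frame_len, axis=1)

# Sanity check with all-ones masks: the frames come back unchanged.
rng = np.random.default_rng(0)
frames = rng.standard_normal((4, 512))
Y = np.fft.rfft(frames, axis=1)
ones = np.ones(Y.shape)
restored = apply_masks_and_invert(Y, ones, ones)
```

With masks of all ones the operation is the identity, and with masks of all zeros the output is silence; a trained network produces values in between for each frame and frequency bin.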
To verify the effect of single-microphone speech enhancement, a large amount of noisy speech data was constructed. Specifically, more than one hundred thousand collected clean utterances, together with clean speech from the open-source AISHELL dataset, were used as clean speech. Knocking noise, television noise, music noise and the like were collected as point-source interference, and subway noise, bus noise, office noise and the like were collected as diffuse noise. Clean speech and noise were then randomly selected and, to reflect practical scenarios, mixed into 840,000 noisy utterances with signal-to-noise ratios ranging from -5 dB to 15 dB. Of these, 800,000 utterances were used for network training, 20,000 for validation and tuning of the network during training, and 20,000 for testing after training was completed. All audio is sampled at 16 kHz.
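One common way to construct a noisy utterance at a chosen signal-to-noise ratio is sketched below in Python with NumPy. The source does not describe its mixing procedure, so the function name `mix_at_snr` and the power-based scaling are assumptions; only the SNR range of -5 dB to 15 dB comes from the text.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean-to-noise power ratio equals `snr_db`, then add it.
    Assumes `noise` is at least as long as `clean` and is not silent."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that brings the noise to the power required by the target SNR.
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

# Stand-in signals: one second of random samples at 16 kHz.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
snr_db = rng.uniform(-5.0, 15.0)  # SNR range stated in the text
noisy = mix_at_snr(clean, noise, snr_db)
```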
When testing and comparing the final noise-reduction performance, the metrics used are SI-SDR, short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ). The final test results are shown in Table 1: extracting the log real-part power spectrum and the log imaginary-part power spectrum as separate features for speech enhancement yields higher scores on all three metrics than the existing speech enhancement method.
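For reference, SI-SDR (scale-invariant signal-to-distortion ratio) is commonly computed as in the NumPy sketch below. This is the widely used definition, not code from the source; the function name `si_sdr`, the mean removal and the constant `eps` are assumptions about details the source does not give.

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-12):
    """Scale-invariant SDR in dB between a clean reference and an estimate."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to find the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10.0 * np.log10(
        (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    )
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the enhanced speech does not change the score, so the metric reflects distortion and residual noise rather than output level.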
Method (features / masks) | Network | SI-SDR | PESQ | STOI |
Noisy far-field speech | - | 1.63 | 2.12 | 0.75 |
LPS | 64-64-64 GRU + 257 FC | 11.75 | 2.67 | 0.82 |
LRPS + LIPS / mask_real + mask_image | 64-64-64 GRU + 257 FC | 11.97 | 2.88 | 0.87 |
Table 1. Comparison of test results
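To make the comparison concrete, the differences between the proposed features and the LPS baseline can be computed directly from the values in Table 1. The short Python snippet below does only that arithmetic; the dictionary names are illustrative.

```python
# Scores copied from Table 1.
baseline = {"SI-SDR": 11.75, "PESQ": 2.67, "STOI": 0.82}   # LPS
proposed = {"SI-SDR": 11.97, "PESQ": 2.88, "STOI": 0.87}   # LRPS + LIPS

# Absolute improvement of the proposed features over the baseline.
gains = {name: round(proposed[name] - baseline[name], 2) for name in baseline}
print(gains)
```

With the same network size, the proposed features improve SI-SDR by 0.22, PESQ by 0.21 and STOI by 0.05 over the LPS baseline.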
Referring to fig. 2, an embodiment of the present invention provides a speech enhancement apparatus based on complex spectral features, where the apparatus includes:
the fourier transform module 200 is configured to perform fourier transform on the voice with noise to obtain a complex frequency spectrum represented by the voice with noise in a frequency domain;
a feature extraction module 210, configured to calculate a log power spectrum of a real part in the complex spectrum to obtain a log real part power spectrum, and calculate a log power spectrum of an imaginary part in the complex spectrum to obtain a log imaginary part power spectrum;
a masking prediction module 220, configured to input the obtained log real part power spectrum and log imaginary part power spectrum into a pre-trained masking prediction network, so as to obtain a first masking value corresponding to the real part and a second masking value corresponding to the imaginary part;
the speech enhancement module 230 is configured to enhance the real part and the imaginary part of the complex frequency spectrum by using the first masking value and the second masking value, and perform inverse fourier transform on the enhanced complex frequency spectrum to obtain an enhanced speech corresponding to the noisy speech.
In an implementation, the apparatus further comprises a model training module, configured to obtain the masking prediction network through the following steps:
acquiring a training sample, wherein the training sample comprises a sample voice with noise and a clean voice which is used for being combined with noise to form the sample voice with noise;
carrying out Fourier transform on the sample voice with noise to obtain a sample complex frequency spectrum represented by the sample voice with noise in a frequency domain;
calculating the logarithm power spectrum of the real part in the sample complex frequency spectrum to obtain a sample logarithm real part power spectrum, and calculating the logarithm power spectrum of the imaginary part in the sample complex frequency spectrum to obtain a sample logarithm imaginary part power spectrum;
inputting the obtained sample logarithm real part power spectrum and the sample logarithm imaginary part power spectrum into an initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;
respectively enhancing the real part and the imaginary part of the sample complex frequency spectrum by using the first sample masking value and the second sample masking value, and performing inverse Fourier transform on the enhanced sample complex frequency spectrum to obtain sample enhanced voice corresponding to the sample noisy voice;
calculating a mean square error between the clean speech and the sample enhanced speech as a loss value;
if the loss value has not converged, adjusting the initial masking prediction network based on the loss value, and returning to the step of inputting the obtained sample logarithm real part power spectrum and sample logarithm imaginary part power spectrum into the initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;
if the loss value has converged, taking the initial masking prediction network, as adjusted, as the masking prediction network used for speech enhancement.
In an implementation, the feature extraction module 210 is specifically configured to calculate a log power spectrum of a real part in the complex frequency spectrum by using the following expression:
$$\mathrm{LRPS}(|X(t,i)|) = \log\left(|X_{\mathrm{real}}(t,i)|^{2}\right)$$
where $\mathrm{LRPS}(|X(t,i)|)$ denotes the log real-part power spectrum and $X_{\mathrm{real}}(t,i)$ denotes the real part.
In an implementation, the feature extraction module 210 is specifically configured to calculate the log power spectrum of the imaginary part in the complex frequency spectrum by using the following expression to obtain the log imaginary-part power spectrum:
$$\mathrm{LIPS}(|X(t,i)|) = \log\left(|X_{\mathrm{image}}(t,i)|^{2}\right)$$
where $\mathrm{LIPS}(|X(t,i)|)$ denotes the log imaginary-part power spectrum and $X_{\mathrm{image}}(t,i)$ denotes the imaginary part.
An embodiment of the present invention further provides an electronic device, as shown in fig. 3, including a processor 001, a communication interface 002, a memory 003 and a communication bus 004, where the processor 001, the communication interface 002 and the memory 003 complete mutual communication through the communication bus 004,
a memory 003 for storing a computer program;
the processor 001, when executing the program stored in the memory 003, is configured to implement the method for speech enhancement based on complex spectral features, including:
carrying out Fourier transform on the voice with noise to obtain a complex frequency spectrum expressed by the voice with noise in a frequency domain;
calculating the logarithm power spectrum of a real part in the complex frequency spectrum to obtain a logarithm real part power spectrum, and calculating the logarithm power spectrum of an imaginary part in the complex frequency spectrum to obtain a logarithm imaginary part power spectrum;
inputting the obtained logarithm real part power spectrum and logarithm imaginary part power spectrum into a pre-trained masking prediction network to obtain a first masking value corresponding to the real part and a second masking value corresponding to the imaginary part;
and respectively enhancing the real part and the imaginary part of the complex frequency spectrum by using the first masking value and the second masking value, and performing inverse Fourier transform on the enhanced complex frequency spectrum to obtain the enhanced voice corresponding to the voice with noise.
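As a non-limiting illustration, the four steps above may be sketched for a single frame as follows. Several points are assumptions of this sketch and are not taken from the text: `numpy.fft.rfft` and `numpy.fft.irfft` stand in for the Fourier transform and its inverse, framing and windowing are omitted, "enhancing" is taken to mean element-wise multiplication by the masking value, and `predict_masks` is a hypothetical callable standing in for the pre-trained masking prediction network.

```python
import numpy as np

def enhance_frame(noisy, predict_masks, eps=1e-12):
    """Enhance one frame of noisy speech with separate real/imaginary masks."""
    noisy = np.asarray(noisy, dtype=float)
    # Step 1: Fourier transform -> complex spectrum of the noisy speech.
    X = np.fft.rfft(noisy)
    # Step 2: log power spectra of the real and imaginary parts
    # (floored at eps, an assumption, to keep the logarithm finite).
    lrps = np.log(np.maximum(X.real ** 2, eps))
    lips = np.log(np.maximum(X.imag ** 2, eps))
    # Step 3: the network predicts one masking value per part and per bin.
    m_real, m_imag = predict_masks(lrps, lips)
    # Step 4: mask each part (assumed multiplicative), then inverse transform.
    Y = m_real * X.real + 1j * (m_imag * X.imag)
    return np.fft.irfft(Y, n=len(noisy))

def identity_masks(lrps, lips):
    """Placeholder predictor: all-ones masks leave the spectrum unchanged."""
    return np.ones_like(lrps), np.ones_like(lips)
```

With `identity_masks` the function returns the input frame up to floating-point error, which is a convenient sanity check before a trained network is substituted.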
In this speech enhancement process, separate masking values are obtained for the real part and the imaginary part of the complex frequency spectrum of the speech in the frequency domain. Because the real part and the imaginary part together determine the phase, obtaining a masking value for each of them implicitly takes phase information into account. In other words, both the energy and the phase information of the speech signal serve as features for obtaining the masking values, so that noise is removed while speech distortion is reduced.
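A small numerical sketch may make the phase argument concrete. It assumes, for illustration only, that a masking value is applied by multiplication; the helper names `apply_part_masks` and `phase_degrees` and the mask values are hypothetical.

```python
import numpy as np

def apply_part_masks(X, m_real, m_imag):
    """Scale the real and imaginary parts of X by separate masks."""
    X = np.asarray(X, dtype=complex)
    return m_real * X.real + 1j * (m_imag * X.imag)

def phase_degrees(X):
    """Phase angle of each complex bin, in degrees."""
    return np.degrees(np.angle(X))

X = np.array([1.0 + 1.0j])                    # one bin with a 45-degree phase
equal_masks = apply_part_masks(X, 0.5, 0.5)   # same value on both parts
split_masks = apply_part_masks(X, 0.9, 0.1)   # different value per part
```

Here `phase_degrees(equal_masks)` stays at 45 degrees, as it would with a single magnitude mask, while `phase_degrees(split_masks)` moves to about 6.3 degrees. Two independent masking values can therefore correct the phase of a bin as well as its energy.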
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The above embodiments may be implemented wholly or partially in software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or a wireless manner (e.g., infrared, wireless, microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between those entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in this specification are described in a related manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the apparatus and electronic device embodiments are substantially similar to the method embodiments, they are described briefly; for relevant details, refer to the corresponding description of the method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.
Claims (9)
1. A method for speech enhancement based on complex spectral features, the method comprising:
carrying out Fourier transform on the voice with noise to obtain a complex frequency spectrum expressed by the voice with noise in a frequency domain;
calculating the logarithm power spectrum of a real part in the complex frequency spectrum to obtain a logarithm real part power spectrum, and calculating the logarithm power spectrum of an imaginary part in the complex frequency spectrum to obtain a logarithm imaginary part power spectrum;
inputting the obtained logarithm real part power spectrum and logarithm imaginary part power spectrum into a pre-trained masking prediction network to obtain a first masking value corresponding to the real part and a second masking value corresponding to the imaginary part;
and respectively enhancing the real part and the imaginary part of the complex frequency spectrum by using the first masking value and the second masking value, and performing inverse Fourier transform on the enhanced complex frequency spectrum to obtain the enhanced voice corresponding to the voice with noise.
2. The method of claim 1, wherein the training process of the masking prediction network comprises:
acquiring a training sample, wherein the training sample comprises a sample voice with noise and a clean voice which is used for being combined with noise to form the sample voice with noise;
carrying out Fourier transform on the sample voice with noise to obtain a sample complex frequency spectrum represented by the sample voice with noise in a frequency domain;
calculating the logarithm power spectrum of the real part in the sample complex frequency spectrum to obtain a sample logarithm real part power spectrum, and calculating the logarithm power spectrum of the imaginary part in the sample complex frequency spectrum to obtain a sample logarithm imaginary part power spectrum;
inputting the obtained sample logarithm real part power spectrum and the sample logarithm imaginary part power spectrum into an initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;
respectively enhancing the real part and the imaginary part of the sample complex frequency spectrum by using the first sample masking value and the second sample masking value, and performing inverse Fourier transform on the enhanced sample complex frequency spectrum to obtain sample enhanced voice corresponding to the sample noisy voice;
calculating a mean square error between the clean speech and the sample enhanced speech as a loss value;
if the loss value has not converged, adjusting the initial masking prediction network based on the loss value, and returning to the step of inputting the obtained sample logarithm real part power spectrum and sample logarithm imaginary part power spectrum into the initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;
if the loss value has converged, taking the initial masking prediction network as the masking prediction network for speech enhancement.
3. The method of claim 1, wherein the step of calculating the log power spectrum of the real part of the complex spectrum to obtain a log real part power spectrum comprises:
calculating the log power spectrum of the real part in the complex frequency spectrum by the following expression to obtain a log real part power spectrum:
$$\mathrm{LRPS}(|X(t,i)|) = \log\left(|X_{\mathrm{real}}(t,i)|^{2}\right)$$
where $\mathrm{LRPS}(|X(t,i)|)$ represents the log real part power spectrum, and $X_{\mathrm{real}}(t,i)$ represents the real part.
4. The method of claim 1, wherein the step of calculating the log power spectrum of the imaginary part in the complex spectrum to obtain a log imaginary part power spectrum comprises:
calculating the log power spectrum of the imaginary part in the complex frequency spectrum by the following expression to obtain a log imaginary part power spectrum:
$$\mathrm{LIPS}(|X(t,i)|) = \log\left(|X_{\mathrm{image}}(t,i)|^{2}\right)$$
where $\mathrm{LIPS}(|X(t,i)|)$ represents the log imaginary part power spectrum, and $X_{\mathrm{image}}(t,i)$ represents the imaginary part.
5. An apparatus for speech enhancement based on complex spectral features, the apparatus comprising:
the Fourier transform module is used for carrying out Fourier transform on the voice with noise to obtain a complex frequency spectrum represented by the voice with noise in a frequency domain;
the characteristic extraction module is used for calculating the logarithm power spectrum of a real part in the complex frequency spectrum to obtain a logarithm real part power spectrum, and calculating the logarithm power spectrum of an imaginary part in the complex frequency spectrum to obtain a logarithm imaginary part power spectrum;
the masking prediction module is used for inputting the obtained logarithm real part power spectrum and logarithm imaginary part power spectrum into a pre-trained masking prediction network to obtain a first masking value corresponding to the real part and a second masking value corresponding to the imaginary part;
and the voice enhancement module is used for respectively enhancing the real part and the imaginary part of the complex frequency spectrum by using the first masking value and the second masking value, and performing inverse Fourier transform on the enhanced complex frequency spectrum to obtain the enhanced voice corresponding to the noisy voice.
6. The apparatus of claim 5, further comprising a model training module to derive a masking prediction network by:
acquiring a training sample, wherein the training sample comprises a sample voice with noise and a clean voice which is used for being combined with noise to form the sample voice with noise;
carrying out Fourier transform on the sample voice with noise to obtain a sample complex frequency spectrum represented by the sample voice with noise in a frequency domain;
calculating the logarithm power spectrum of the real part in the sample complex frequency spectrum to obtain a sample logarithm real part power spectrum, and calculating the logarithm power spectrum of the imaginary part in the sample complex frequency spectrum to obtain a sample logarithm imaginary part power spectrum;
inputting the obtained sample logarithm real part power spectrum and the sample logarithm imaginary part power spectrum into an initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;
respectively enhancing the real part and the imaginary part of the sample complex frequency spectrum by using the first sample masking value and the second sample masking value, and performing inverse Fourier transform on the enhanced sample complex frequency spectrum to obtain sample enhanced voice corresponding to the sample noisy voice;
calculating a mean square error between the clean speech and the sample enhanced speech as a loss value;
if the loss value has not converged, adjusting the initial masking prediction network based on the loss value, and returning to the step of inputting the obtained sample logarithm real part power spectrum and sample logarithm imaginary part power spectrum into the initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;
if the loss value has converged, taking the initial masking prediction network as the masking prediction network for speech enhancement.
7. The apparatus of claim 5, wherein the feature extraction module is specifically configured to calculate a log power spectrum of a real part in the complex spectrum by the following expression:
$$\mathrm{LRPS}(|X(t,i)|) = \log\left(|X_{\mathrm{real}}(t,i)|^{2}\right)$$
where $\mathrm{LRPS}(|X(t,i)|)$ represents the log real part power spectrum, and $X_{\mathrm{real}}(t,i)$ represents the real part.
8. The apparatus according to claim 5, wherein the feature extraction module is specifically configured to calculate the log power spectrum of the imaginary part in the complex spectrum by the following expression, to obtain a log imaginary part power spectrum:
$$\mathrm{LIPS}(|X(t,i)|) = \log\left(|X_{\mathrm{image}}(t,i)|^{2}\right)$$
where $\mathrm{LIPS}(|X(t,i)|)$ represents the log imaginary part power spectrum, and $X_{\mathrm{image}}(t,i)$ represents the imaginary part.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 4 when executing a program stored in the memory.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111447463.9A CN114141267A (en) | 2021-11-30 | 2021-11-30 | Speech enhancement method and device based on complex frequency spectrum characteristics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111447463.9A CN114141267A (en) | 2021-11-30 | 2021-11-30 | Speech enhancement method and device based on complex frequency spectrum characteristics |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114141267A (en) | 2022-03-04 |
Family
ID=80386339
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111447463.9A Pending CN114141267A (en) | 2021-11-30 | 2021-11-30 | Speech enhancement method and device based on complex frequency spectrum characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114141267A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117935838A (en) * | 2024-03-25 | 2024-04-26 | 深圳市声扬科技有限公司 | Audio acquisition method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
El-Moneim et al. | Text-independent speaker recognition using LSTM-RNN and speech enhancement | |
CN111161752A (en) | Echo cancellation method and device | |
CN112700786B (en) | Speech enhancement method, device, electronic equipment and storage medium | |
WO2022141868A1 (en) | Method and apparatus for extracting speech features, terminal, and storage medium | |
CN114242044B (en) | Voice quality evaluation method, voice quality evaluation model training method and device | |
CN112735456A (en) | Speech enhancement method based on DNN-CLSTM network | |
Mundodu Krishna et al. | Single channel speech separation based on empirical mode decomposition and Hilbert transform | |
US10262680B2 (en) | Variable sound decomposition masks | |
CN112949708A (en) | Emotion recognition method and device, computer equipment and storage medium | |
CN111863008A (en) | Audio noise reduction method and device and storage medium | |
Shafik et al. | Speaker identification based on Radon transform and CNNs in the presence of different types of interference for Robotic Applications | |
CN115171714A (en) | Voice enhancement method and device, electronic equipment and storage medium | |
CN114141267A (en) | Speech enhancement method and device based on complex frequency spectrum characteristics | |
Mirbeygi et al. | RPCA-based real-time speech and music separation method | |
Ram et al. | Deep neural network based speech enhancement | |
CN113921030B (en) | Speech enhancement neural network training method and device based on weighted speech loss | |
Tkachenko et al. | Speech enhancement for speaker recognition using deep recurrent neural networks | |
Srinivas et al. | A classification-based non-local means adaptive filtering for speech enhancement and its FPGA prototype | |
CN113555031B (en) | Training method and device of voice enhancement model, and voice enhancement method and device | |
Sun et al. | Single-channel speech enhancement based on joint constrained dictionary learning | |
CN115859048A (en) | Noise processing method and device for partial discharge signal | |
Cuccovillo et al. | Spectral denoising for microphone classification | |
CN113707172B (en) | Single-channel voice separation method, system and computer equipment of sparse orthogonal network | |
Arcos et al. | Ideal neighbourhood mask for speech enhancement | |
CN113921027B (en) | Speech enhancement method and device based on spatial features and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||