CN114141267A - Speech enhancement method and device based on complex frequency spectrum characteristics - Google Patents

Speech enhancement method and device based on complex frequency spectrum characteristics

Info

Publication number
CN114141267A
Authority
CN
China
Prior art keywords
sample
power spectrum
logarithm
masking
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111447463.9A
Other languages
Chinese (zh)
Inventor
苏家雨
王博
欧阳鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Qingwei Intelligent Technology Co ltd
Original Assignee
Jiangsu Qingwei Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Qingwei Intelligent Technology Co ltd
Priority to CN202111447463.9A
Publication of CN114141267A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Quality & Reliability (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

The invention discloses a speech enhancement method and device based on complex spectrum features. The method comprises: performing a Fourier transform on noisy speech to obtain a complex spectrum; calculating a log real-part power spectrum from the real part and a log imaginary-part power spectrum from the imaginary part of the complex spectrum; inputting the log real-part power spectrum and the log imaginary-part power spectrum into a pre-trained masking prediction network to obtain a first masking value corresponding to the real part and a second masking value corresponding to the imaginary part; and enhancing the real part and the imaginary part of the complex spectrum with the first masking value and the second masking value respectively, then performing an inverse Fourier transform on the enhanced complex spectrum to obtain the enhanced speech. In the scheme provided by the embodiments of the invention, masking values are obtained separately for the real part and the imaginary part of the complex spectrum of the speech in the frequency domain. Obtaining the masking values separately for the two parts implicitly carries phase information; that is, both the energy and the phase information of the speech signal serve as features for obtaining the masking values, so that speech distortion is reduced while noise is removed.

Description

Speech enhancement method and device based on complex frequency spectrum characteristics
Technical Field
The present invention relates to the field of speech enhancement technologies, and in particular, to a speech enhancement method and apparatus based on complex spectral features.
Background
Speech signals are subject to interference from various types of environmental noise. This interference can severely distort the speech signal and make the speech difficult to understand. The purpose of speech enhancement is to remove or reduce such noise in noisy speech.
Traditional single-channel speech enhancement algorithms include spectral subtraction, methods based on minimum mean square error, and the like. These algorithms must first estimate the spectrum of the noise, which is difficult when the noise is abrupt; they also need to assume that the signals follow a Gaussian distribution, so the enhancement effect is limited.
Therefore, neural networks based on deep learning are widely applied to speech enhancement. Typically, a single-microphone signal is Fourier transformed, spectral features (such as the log power spectrum or the Mel log power spectrum) are calculated from it, the features are fed to a deep neural network that learns a masking value, and the masking value is then applied to the noisy speech to complete the enhancement. However, existing masking enhancement methods ignore the effect of phase on the speech signal.
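For contrast with the scheme disclosed below, the conventional masking approach can be sketched as follows. This is a minimal illustration, not code from the patent: the function names, the small floor `eps`, and the random stand-in for the network output are all assumptions. It shows that a single real-valued mask scales only the magnitude, so the noisy phase passes through unchanged.

```python
import numpy as np

def lps_features(Y, eps=1e-10):
    """Log power spectrum of a complex spectrum (eps is an assumed floor to avoid log(0))."""
    return np.log(np.abs(Y) ** 2 + eps)

def apply_magnitude_mask(Y, mask):
    """Scale the magnitude by a real mask; the noisy phase is reused as-is."""
    return mask * np.abs(Y) * np.exp(1j * np.angle(Y))

rng = np.random.default_rng(0)
# Hypothetical noisy spectrum: 4 frames x 257 frequency bins.
Y = rng.standard_normal((4, 257)) + 1j * rng.standard_normal((4, 257))
features = lps_features(Y)
mask = rng.uniform(0.1, 1.0, size=Y.shape)  # stand-in for a network output
S_hat = apply_magnitude_mask(Y, mask)
```

Because `S_hat` equals `mask * Y`, the estimate inherits the phase of the noisy input exactly, which is the limitation the last sentence of the paragraph above refers to.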
Disclosure of Invention
In order to solve the problem that existing masking enhancement methods ignore the effect of phase on the speech signal, the embodiments of the invention provide a speech enhancement method and device based on complex spectrum features. The technical scheme is as follows:
in a first aspect, a speech enhancement method based on complex spectral features is provided, the method including:
carrying out Fourier transform on the voice with noise to obtain a complex frequency spectrum expressed by the voice with noise in a frequency domain;
calculating the logarithm power spectrum of a real part in the complex frequency spectrum to obtain a logarithm real part power spectrum, and calculating the logarithm power spectrum of an imaginary part in the complex frequency spectrum to obtain a logarithm imaginary part power spectrum;
inputting the obtained logarithm real part power spectrum and logarithm imaginary part power spectrum into a pre-trained masking prediction network to obtain a first masking value corresponding to the real part and a second masking value corresponding to the imaginary part;
and respectively enhancing the real part and the imaginary part of the complex frequency spectrum by using the first masking value and the second masking value, and performing inverse Fourier transform on the enhanced complex frequency spectrum to obtain the enhanced voice corresponding to the voice with noise.
Optionally, the training process of the masking prediction network includes:
acquiring a training sample, wherein the training sample comprises a sample voice with noise and a clean voice which is used for being combined with noise to form the sample voice with noise;
carrying out Fourier transform on the sample voice with noise to obtain a sample complex frequency spectrum represented by the sample voice with noise in a frequency domain;
calculating the logarithm power spectrum of the real part in the sample complex frequency spectrum to obtain a sample logarithm real part power spectrum, and calculating the logarithm power spectrum of the imaginary part in the sample complex frequency spectrum to obtain a sample logarithm imaginary part power spectrum;
inputting the obtained sample logarithm real part power spectrum and the sample logarithm imaginary part power spectrum into an initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;
respectively enhancing the real part and the imaginary part of the sample complex frequency spectrum by using the first sample masking value and the second sample masking value, and performing inverse Fourier transform on the enhanced sample complex frequency spectrum to obtain sample enhanced voice corresponding to the sample noisy voice;
calculating a mean square error between the clean speech and the sample enhanced speech as a loss value;
under the condition that the loss value is not converged, adjusting the initial masking prediction network based on the loss value, and returning to input the obtained sample logarithm real part power spectrum and sample logarithm imaginary part power spectrum into the initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;
in case the loss values converge, the initial masking prediction network is taken as a masking prediction network for speech enhancement.
Optionally, the step of calculating the log power spectrum of the real part in the complex frequency spectrum to obtain the log real part power spectrum includes:
calculating the log power spectrum of the real part in the complex frequency spectrum by the following expression to obtain a log real part power spectrum:
$$\mathrm{LRPS}(|X(t,i)|) = \log\left(|X_{\mathrm{real}}(t,i)|^{2}\right)$$
where $\mathrm{LRPS}(|X(t,i)|)$ represents the log real part power spectrum and $X_{\mathrm{real}}(t,i)$ represents the real part.
Optionally, the step of calculating a log-imaginary power spectrum of an imaginary part in the complex frequency spectrum to obtain a log-imaginary power spectrum includes:
calculating the logarithm power spectrum of the imaginary part in the complex frequency spectrum to obtain a logarithm imaginary part power spectrum by the following expression:
$$\mathrm{LIPS}(|X(t,i)|) = \log\left(|X_{\mathrm{image}}(t,i)|^{2}\right)$$
where $\mathrm{LIPS}(|X(t,i)|)$ represents the log imaginary part power spectrum and $X_{\mathrm{image}}(t,i)$ denotes the imaginary part.
In a second aspect, an apparatus for speech enhancement based on complex spectral features is provided, the apparatus comprising:
the Fourier transform module is used for carrying out Fourier transform on the voice with noise to obtain a complex frequency spectrum represented by the voice with noise in a frequency domain;
the characteristic extraction module is used for calculating the logarithm power spectrum of a real part in the complex frequency spectrum to obtain a logarithm real part power spectrum, and calculating the logarithm power spectrum of an imaginary part in the complex frequency spectrum to obtain a logarithm imaginary part power spectrum;
the masking prediction module is used for inputting the obtained logarithm real part power spectrum and logarithm imaginary part power spectrum into a pre-trained masking prediction network to obtain a first masking value corresponding to the real part and a second masking value corresponding to the imaginary part;
and the voice enhancement module is used for respectively enhancing the real part and the imaginary part of the complex frequency spectrum by using the first masking value and the second masking value, and performing inverse Fourier transform on the enhanced complex frequency spectrum to obtain the enhanced voice corresponding to the noisy voice.
Optionally, the apparatus further includes a model training module, configured to obtain the masking prediction network by:
acquiring a training sample, wherein the training sample comprises a sample voice with noise and a clean voice which is used for being combined with noise to form the sample voice with noise;
carrying out Fourier transform on the sample voice with noise to obtain a sample complex frequency spectrum represented by the sample voice with noise in a frequency domain;
calculating the logarithm power spectrum of the real part in the sample complex frequency spectrum to obtain a sample logarithm real part power spectrum, and calculating the logarithm power spectrum of the imaginary part in the sample complex frequency spectrum to obtain a sample logarithm imaginary part power spectrum;
inputting the obtained sample logarithm real part power spectrum and the sample logarithm imaginary part power spectrum into an initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;
respectively enhancing the real part and the imaginary part of the sample complex frequency spectrum by using the first sample masking value and the second sample masking value, and performing inverse Fourier transform on the enhanced sample complex frequency spectrum to obtain sample enhanced voice corresponding to the sample noisy voice;
calculating a mean square error between the clean speech and the sample enhanced speech as a loss value;
under the condition that the loss value is not converged, adjusting the initial masking prediction network based on the loss value, and returning to input the obtained sample logarithm real part power spectrum and sample logarithm imaginary part power spectrum into the initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;
in case the loss values converge, the initial masking prediction network is taken as a masking prediction network for speech enhancement.
Optionally, the feature extraction module is specifically configured to calculate a log power spectrum of a real part in the complex frequency spectrum by using the following expression to obtain a log real part power spectrum:
$$\mathrm{LRPS}(|X(t,i)|) = \log\left(|X_{\mathrm{real}}(t,i)|^{2}\right)$$
where $\mathrm{LRPS}(|X(t,i)|)$ represents the log real part power spectrum and $X_{\mathrm{real}}(t,i)$ represents the real part.
Optionally, the feature extraction module is specifically configured to calculate a logarithmic power spectrum of an imaginary part in the complex frequency spectrum to obtain a logarithmic imaginary part power spectrum by using the following expression:
$$\mathrm{LIPS}(|X(t,i)|) = \log\left(|X_{\mathrm{image}}(t,i)|^{2}\right)$$
where $\mathrm{LIPS}(|X(t,i)|)$ represents the log imaginary part power spectrum and $X_{\mathrm{image}}(t,i)$ denotes the imaginary part.
In a third aspect, an electronic device is provided, which includes a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the speech enhancement method based on complex spectral features according to the first aspect when executing the program stored in the memory.
In the speech enhancement process, masking values are obtained separately for the real part and the imaginary part of the complex spectrum of the speech in the frequency domain. Obtaining the masking values separately for the two parts implicitly carries phase information; that is, the energy and the phase information of the speech signal are used together as features to obtain the masking values, so that speech distortion is reduced while noise is removed.
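The statement that separate masks imply phase information can be made concrete with a short worked derivation. The notation here is an assumption introduced for illustration: $Y_r$ and $Y_i$ are the real-valued real and imaginary components of one time-frequency bin, $j$ is the imaginary unit, and $m_r$, $m_i$ are the first and second masking values.

```latex
\[
\begin{aligned}
Y(t,k)     &= Y_r + jY_i, \qquad \theta = \operatorname{atan2}(Y_i,\, Y_r) \\
Y_c(t,k)   &= m_r Y_r + j\, m_i Y_i \\
|Y_c(t,k)| &= \sqrt{m_r^{2} Y_r^{2} + m_i^{2} Y_i^{2}} \\
\theta_c   &= \operatorname{atan2}(m_i Y_i,\, m_r Y_r)
\end{aligned}
\]
```

If $m_r = m_i = m > 0$, then $\theta_c = \theta$ and $|Y_c| = m\,|Y|$, which is an ordinary magnitude mask that leaves the noisy phase untouched. Whenever $m_r \neq m_i$, the enhanced phase $\theta_c$ differs from $\theta$, so the pair of masks adjusts magnitude and phase together. If both masks lie in $(0, 1)$, as the output of a sigmoid would, the signs of both components are preserved and $\theta_c$ stays in the same quadrant as $\theta$.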
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a speech enhancement method based on complex spectral features according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a speech enhancement apparatus based on complex spectral features according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Referring to fig. 1, an embodiment of the present invention provides a speech enhancement method based on complex spectral features, including:
S100, carrying out Fourier transform on the voice with noise to obtain a complex frequency spectrum represented by the voice with noise in a frequency domain.
In implementation, a noisy speech signal can be written in the time domain as:
$$y(k) = s(k) + n(k) \tag{1}$$
where $y(k)$, $s(k)$ and $n(k)$ represent the noisy speech, the clean speech component and the noise component, respectively. Speech enhancement is typically performed in the frequency domain; after a Fourier transform, the noisy speech can be written as:
$$Y(t,k) = S(t,k) + N(t,k) \tag{2}$$
or
$$Y(t,k) = Y(t,k)_{\mathrm{real}} + Y(t,k)_{\mathrm{image}} \tag{3}$$
During enhancement, the energies of the real part and the imaginary part are obtained separately, which implicitly carries the phase information of the noisy speech signal; that is, both the energy and the phase of the noisy speech signal are used as features for speech enhancement.
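Step S100 could be sketched with a short-time Fourier transform as below. The text does not state the frame length, hop size or window, so `n_fft=512`, `hop=256` and the Hann window are assumptions; a 512-point FFT is chosen only because it yields 257 frequency bins, which matches the "257fc" entry in Table 1. The test tone is a hypothetical stand-in for noisy speech.

```python
import numpy as np

def stft(y, n_fft=512, hop=256):
    """Return the complex spectrum Y[t, k] of a 1-D signal as (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack(
        [y[t * hop : t * hop + n_fft] * window for t in range(n_frames)]
    )
    # rfft keeps the n_fft // 2 + 1 non-redundant bins of a real-valued signal.
    return np.fft.rfft(frames, axis=1)

fs = 16000                                   # sampling rate stated in the text
k = np.arange(fs)                            # one second of samples
rng = np.random.default_rng(0)
y = np.sin(2 * np.pi * 440 * k / fs) + 0.1 * rng.standard_normal(fs)

Y = stft(y)
Y_real, Y_imag = Y.real, Y.imag              # the two parts used in the following steps
```

Each row of `Y` is one frame index t and each column one frequency index k, mirroring the notation Y(t, k) of equation (2).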
S110, calculating the log power spectrum of the real part in the complex frequency spectrum to obtain a log real part power spectrum, and calculating the log power spectrum of the imaginary part in the complex frequency spectrum to obtain a log imaginary part power spectrum.
In implementation, the log real part power spectrum can be obtained by calculating the log power spectrum of the real part in the complex spectrum with the following expression:
$$\mathrm{LRPS}(|X(t,i)|) = \log\left(|X_{\mathrm{real}}(t,i)|^{2}\right) \tag{4}$$
where $\mathrm{LRPS}(|X(t,i)|)$ represents the log real part power spectrum and $X_{\mathrm{real}}(t,i)$ represents the real part.
The log imaginary part power spectrum can be obtained by calculating the log power spectrum of the imaginary part in the complex spectrum with the following expression:
$$\mathrm{LIPS}(|X(t,i)|) = \log\left(|X_{\mathrm{image}}(t,i)|^{2}\right) \tag{5}$$
where $\mathrm{LIPS}(|X(t,i)|)$ represents the log imaginary part power spectrum and $X_{\mathrm{image}}(t,i)$ denotes the imaginary part.
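Expressions (4) and (5) translate almost directly into code. The sketch below makes two assumptions that the text does not specify: the logarithm is taken as the natural log, and a small floor `eps` is added before taking it. Some floor is needed in practice because, for a real-valued signal, the imaginary part of the DC bin is exactly zero and the logarithm of zero is undefined. Concatenating the two features into one vector is also an assumption about how they are fed to the network.

```python
import numpy as np

def log_real_imag_power(X, eps=1e-10):
    """Return (LRPS, LIPS) for a complex spectrum X, per expressions (4) and (5)."""
    lrps = np.log(X.real ** 2 + eps)   # log power of the real part
    lips = np.log(X.imag ** 2 + eps)   # log power of the imaginary part
    return lrps, lips

# Tiny hand-checkable example: one frame, two bins.
X = np.array([[3.0 + 4.0j, 0.0 + 1.0j]])
lrps, lips = log_real_imag_power(X)

# Assumed input layout: LRPS and LIPS side by side along the feature axis.
features = np.concatenate([lrps, lips], axis=-1)
```

For the first bin, `lrps` is log(9) and `lips` is log(16); for the second bin the real part is zero, so `lrps` falls to the floor value log(eps) instead of negative infinity.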
And S120, inputting the obtained logarithm real part power spectrum and logarithm imaginary part power spectrum into a pre-trained masking prediction network to obtain a first masking value corresponding to the real part and a second masking value corresponding to the imaginary part.
In implementation, the log real part power spectrum and the log imaginary part power spectrum are obtained separately and input as features into the masking prediction network, which outputs the masking values (masks) corresponding to the real part and the imaginary part respectively. The masking prediction network is built from GRUs: it comprises 3 GRU layers and one FC layer that uses a sigmoid as its activation function. The specific training process comprises the following steps:
acquiring a training sample, wherein the training sample comprises a sample voice with noise and clean voice which is used for being combined with noise to form the sample voice with noise;
carrying out Fourier transform on the sample voice with noise to obtain a sample complex frequency spectrum expressed by the sample voice with noise in a frequency domain;
calculating the logarithm power spectrum of a real part in the sample complex frequency spectrum to obtain a sample logarithm real part power spectrum, and calculating the logarithm power spectrum of an imaginary part in the sample complex frequency spectrum to obtain a sample logarithm imaginary part power spectrum;
inputting the obtained sample logarithm real part power spectrum and the sample logarithm imaginary part power spectrum into an initial masking prediction network to obtain a first sample masking value corresponding to a real part and a second sample masking value corresponding to an imaginary part;
respectively enhancing a real part and an imaginary part of a sample complex frequency spectrum by using the first sample masking value and the second sample masking value, and performing inverse Fourier transform on the enhanced sample complex frequency spectrum to obtain a sample enhanced voice corresponding to the sample noisy voice;
calculating the mean square error between the clean voice and the sample enhanced voice as a loss value;
under the condition that the loss value is not converged, adjusting the initial masking prediction network based on the loss value, and returning to input the obtained sample logarithm real part power spectrum and the sample logarithm imaginary part power spectrum into the initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;
in case the loss values converge, the initial masking prediction network is used as the masking prediction network for speech enhancement.
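The network shape and the loss used in the steps above can be sketched as a NumPy forward pass. This is an illustration under stated assumptions, not the patent's implementation: the class names, the weight initialisation, the GRU gate formulation (the common reset/update/candidate form), and the FC output width of 2 x 257 (so that one layer emits both masks) are all assumptions; Table 1 only lists "64-64-64gru+257fc". The "adjust the network" step needs gradients, which an automatic-differentiation framework would normally provide and which are omitted here.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class GRULayer:
    def __init__(self, n_in, n_hidden, rng):
        s = 1.0 / np.sqrt(n_hidden)
        self.W = rng.uniform(-s, s, (3, n_in, n_hidden))      # input weights: r, z, n
        self.U = rng.uniform(-s, s, (3, n_hidden, n_hidden))  # recurrent weights: r, z, n
        self.b = np.zeros((3, n_hidden))
        self.n_hidden = n_hidden

    def forward(self, x):
        """x: (frames, n_in) -> hidden states (frames, n_hidden)."""
        h = np.zeros(self.n_hidden)
        out = []
        for x_t in x:
            r = sigmoid(x_t @ self.W[0] + h @ self.U[0] + self.b[0])        # reset gate
            z = sigmoid(x_t @ self.W[1] + h @ self.U[1] + self.b[1])        # update gate
            n = np.tanh(x_t @ self.W[2] + r * (h @ self.U[2]) + self.b[2])  # candidate
            h = (1.0 - z) * n + z * h
            out.append(h)
        return np.stack(out)

class MaskNet:
    """3 GRU layers of 64 units, then one FC layer with a sigmoid."""
    def __init__(self, n_bins=257, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [2 * n_bins, hidden, hidden, hidden]
        self.grus = [GRULayer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]
        self.W_fc = rng.uniform(-0.1, 0.1, (hidden, 2 * n_bins))  # assumed width
        self.b_fc = np.zeros(2 * n_bins)
        self.n_bins = n_bins

    def forward(self, feats):
        h = feats
        for gru in self.grus:
            h = gru.forward(h)
        masks = sigmoid(h @ self.W_fc + self.b_fc)
        return masks[:, : self.n_bins], masks[:, self.n_bins :]

def mse_loss(clean, enhanced):
    """Mean square error between clean and enhanced time-domain signals."""
    return float(np.mean((clean - enhanced) ** 2))

net = MaskNet()
feats = np.random.default_rng(1).standard_normal((10, 514))  # 10 frames of LRPS+LIPS
mask_real, mask_imag = net.forward(feats)
```

Because the final activation is a sigmoid, both masks lie strictly between 0 and 1, so each component of the spectrum can only be attenuated, never amplified or sign-flipped.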
S130, enhancing the real part and the imaginary part of the complex frequency spectrum by using the first masking value and the second masking value respectively, and performing inverse Fourier transform on the enhanced complex frequency spectrum to obtain the enhanced voice corresponding to the voice with noise.
In implementation, the real part and the imaginary part are enhanced separately, which may be expressed as:
$$Y_c(t,k)_{\mathrm{real}} = \mathrm{mask}_{\mathrm{real}} \cdot Y(t,k)_{\mathrm{real}} \tag{6}$$
$$Y_c(t,k)_{\mathrm{image}} = \mathrm{mask}_{\mathrm{image}} \cdot Y(t,k)_{\mathrm{image}} \tag{7}$$
$$y_c(k) = \mathrm{ifft}(Y_c(t,k)) \tag{8}$$
where $Y_c(t,k)$ is the enhanced complex spectrum, which is subjected to an inverse fast Fourier transform to obtain the enhanced speech $y_c(k)$.
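Equations (6) to (8) could be realised as in the sketch below. The text only specifies an inverse FFT; the framing parameters (512-point frames, hop of 256, Hann window) and the weighted overlap-add used to stitch frames back into a waveform are assumptions, and the random masks are stand-ins for the network output.

```python
import numpy as np

def apply_complex_masks(Y, mask_real, mask_imag):
    """Equations (6) and (7): scale real and imaginary parts independently."""
    return mask_real * Y.real + 1j * (mask_imag * Y.imag)

def istft(Yc, n_fft=512, hop=256):
    """Equation (8): inverse FFT per frame, then weighted overlap-add."""
    window = np.hanning(n_fft)
    n_frames = Yc.shape[0]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    frames = np.fft.irfft(Yc, n=n_fft, axis=1)
    for t in range(n_frames):
        sl = slice(t * hop, t * hop + n_fft)
        out[sl] += frames[t] * window
        norm[sl] += window ** 2
    # Guard against division by ~0 at the very edges of the signal.
    return out / np.maximum(norm, 1e-8)

rng = np.random.default_rng(0)
y = rng.standard_normal(2048)                       # stand-in for noisy speech
window = np.hanning(512)
frames = np.stack([y[t * 256 : t * 256 + 512] * window for t in range(7)])
Y = np.fft.rfft(frames, axis=1)

mask_real = rng.uniform(0.0, 1.0, Y.shape)          # stand-ins for network output
mask_imag = rng.uniform(0.0, 1.0, Y.shape)
y_c = istft(apply_complex_masks(Y, mask_real, mask_imag))
```

With both masks set to one, the same pipeline returns the input signal away from the edges, which is a convenient sanity check that the analysis and synthesis stages are consistent before a trained network is plugged in.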
To verify the effect on single-microphone speech enhancement, a large amount of noisy speech data was constructed. Specifically, more than one hundred thousand collected clean speech recordings and the clean speech in the open-source AISHELL data set were used; knocking noise, television noise, music noise and the like were collected as point-source interference, and subway noise, bus noise, office noise and the like were collected as diffuse noise. Clean speech and noise were then randomly selected and, following practical scenarios, 840,000 noisy speech samples with signal-to-noise ratios ranging from -5 dB to 15 dB were constructed: 800,000 were used for network training, 20,000 for validation and network tuning during training, and 20,000 for testing the effect after training was completed. All audio was sampled at 16 kHz.
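One common way to construct a noisy sample at a chosen signal-to-noise ratio is to rescale the noise before adding it to the clean speech. The sketch below illustrates that idea; the function name and the random stand-in signals are assumptions, and the text does not say how the mixing was actually performed beyond the SNR range and sampling rate.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals snr_db, then add it."""
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * noise
    return clean + scaled_noise, scaled_noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)    # stand-in for one second of clean speech at 16 kHz
noise = rng.standard_normal(16000)    # stand-in for a recorded noise clip
snr_db = rng.uniform(-5.0, 15.0)      # SNR drawn from the range stated in the text
noisy, scaled_noise = mix_at_snr(clean, noise, snr_db)
```

Keeping `clean` alongside `noisy` gives exactly the pair that the training procedure requires: the noisy input and the clean target used in the mean-square-error loss.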
In the final noise-reduction test and comparison stage, the metrics used are SI-SDR, short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ). The final test results are shown in Table 1: using the log real part power spectrum and the log imaginary part power spectrum as separate features for speech enhancement yields higher scores on all three metrics than the existing speech enhancement method.
Method (feature) | Network | SI-SDR | PESQ | STOI
Far-field noisy speech | (none) | 1.63 | 2.12 | 0.75
LPS | 64-64-64 GRU + 257 FC | 11.75 | 2.67 | 0.82
LRPS+LIPS / mask_real+mask_image | 64-64-64 GRU + 257 FC | 11.97 | 2.88 | 0.87
Table 1. Comparison of test results
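Of the three metrics in Table 1, SI-SDR is simple enough to compute directly. The sketch below follows the usual definition (project the estimate onto the reference, then compare the energy of the projection with the energy of the residual, in dB); the function name, the mean removal and the small `eps` are assumptions, and the text does not say which implementation produced the reported numbers. PESQ and STOI rely on standardised perceptual models and are not reproduced here.

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB."""
    reference = reference - np.mean(reference)
    estimate = estimate - np.mean(estimate)
    # Optimal scaling of the reference that best explains the estimate.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(residual ** 2) + eps))

fs = 16000
k = np.arange(fs)
rng = np.random.default_rng(0)
s = np.sin(2 * np.pi * 440 * k / fs)             # stand-in for clean speech
s_hat = s + 0.1 * rng.standard_normal(fs)        # stand-in for enhanced speech
score = si_sdr(s, s_hat)
```

Rescaling the estimate, for example `si_sdr(s, 3 * s_hat)`, leaves the score essentially unchanged, which is the property that makes the metric insensitive to the overall output gain of an enhancement system.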
Referring to fig. 2, an embodiment of the present invention provides a speech enhancement apparatus based on complex spectral features, where the apparatus includes:
the fourier transform module 200 is configured to perform fourier transform on the voice with noise to obtain a complex frequency spectrum represented by the voice with noise in a frequency domain;
a feature extraction module 210, configured to calculate a log power spectrum of a real part in the complex spectrum to obtain a log real part power spectrum, and calculate a log power spectrum of an imaginary part in the complex spectrum to obtain a log imaginary part power spectrum;
a masking prediction module 220, configured to input the obtained log real part power spectrum and log imaginary part power spectrum into a pre-trained masking prediction network, so as to obtain a first masking value corresponding to the real part and a second masking value corresponding to the imaginary part;
the speech enhancement module 230 is configured to enhance the real part and the imaginary part of the complex frequency spectrum by using the first masking value and the second masking value, and perform inverse fourier transform on the enhanced complex frequency spectrum to obtain an enhanced speech corresponding to the noisy speech.
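The four modules could be composed in software roughly as follows. This is a structural sketch only: the class and method names are hypothetical, it processes a single frame for brevity, the frame length of 512 and the floor `eps` are assumptions, and the mask predictor is passed in as any callable standing in for the pre-trained network.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

import numpy as np

MaskFn = Callable[[np.ndarray, np.ndarray], Tuple[np.ndarray, np.ndarray]]

@dataclass
class ComplexSpectrumEnhancer:
    predict_masks: MaskFn          # masking prediction module 220
    n_fft: int = 512               # assumed frame length
    eps: float = 1e-10             # assumed floor to avoid log(0)

    def transform(self, frame):
        """Fourier transform module 200."""
        return np.fft.rfft(frame, n=self.n_fft)

    def extract(self, Y):
        """Feature extraction module 210: (LRPS, LIPS)."""
        return np.log(Y.real ** 2 + self.eps), np.log(Y.imag ** 2 + self.eps)

    def enhance(self, frame):
        """Speech enhancement module 230: mask both parts, then invert."""
        Y = self.transform(frame)
        m_real, m_imag = self.predict_masks(*self.extract(Y))
        Yc = m_real * Y.real + 1j * (m_imag * Y.imag)
        return np.fft.irfft(Yc, n=self.n_fft)

# Placeholder predictor that passes everything through unchanged.
identity = lambda lrps, lips: (np.ones_like(lrps), np.ones_like(lips))
enhancer = ComplexSpectrumEnhancer(predict_masks=identity)
frame = np.random.default_rng(0).standard_normal(512)
restored = enhancer.enhance(frame)
```

Injecting the predictor as a callable keeps the signal-processing modules independent of any particular network framework, so the same wrapper could be tested with the identity predictor above before a trained model is substituted.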
In an implementation, the apparatus further comprises a model training module for obtaining the masking prediction network by the following steps:
acquiring a training sample, wherein the training sample comprises a sample voice with noise and a clean voice which is used for being combined with noise to form the sample voice with noise;
carrying out Fourier transform on the sample voice with noise to obtain a sample complex frequency spectrum represented by the sample voice with noise in a frequency domain;
calculating the logarithm power spectrum of the real part in the sample complex frequency spectrum to obtain a sample logarithm real part power spectrum, and calculating the logarithm power spectrum of the imaginary part in the sample complex frequency spectrum to obtain a sample logarithm imaginary part power spectrum;
inputting the obtained sample logarithm real part power spectrum and the sample logarithm imaginary part power spectrum into an initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;
respectively enhancing the real part and the imaginary part of the sample complex frequency spectrum by using the first sample masking value and the second sample masking value, and performing inverse Fourier transform on the enhanced sample complex frequency spectrum to obtain sample enhanced voice corresponding to the sample noisy voice;
calculating a mean square error between the clean speech and the sample enhanced speech as a loss value;
under the condition that the loss value is not converged, adjusting the initial masking prediction network based on the loss value, and returning to input the obtained sample logarithm real part power spectrum and sample logarithm imaginary part power spectrum into the initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;
in case the loss values converge, the initial masking prediction network is taken as a masking prediction network for speech enhancement.
In an implementation, the feature extraction module 210 is specifically configured to calculate a log power spectrum of a real part in the complex frequency spectrum by using the following expression:
$$\mathrm{LRPS}(|X(t,i)|) = \log\left(|X_{\mathrm{real}}(t,i)|^{2}\right)$$
where $\mathrm{LRPS}(|X(t,i)|)$ represents the log real part power spectrum and $X_{\mathrm{real}}(t,i)$ represents the real part.
In an implementation, the feature extraction module 210 is specifically configured to calculate the log power spectrum of the imaginary part in the complex frequency spectrum by using the following expression to obtain the log imaginary part power spectrum:
$$\mathrm{LIPS}(|X(t,i)|) = \log\left(|X_{\mathrm{image}}(t,i)|^{2}\right)$$
where $\mathrm{LIPS}(|X(t,i)|)$ represents the log imaginary part power spectrum and $X_{\mathrm{image}}(t,i)$ denotes the imaginary part.
An embodiment of the present invention further provides an electronic device, as shown in fig. 3, including a processor 001, a communication interface 002, a memory 003 and a communication bus 004, where the processor 001, the communication interface 002 and the memory 003 complete mutual communication through the communication bus 004,
a memory 003 for storing a computer program;
the processor 001, when executing the program stored in the memory 003, is configured to implement the method for speech enhancement based on complex spectral features, including:
carrying out Fourier transform on the voice with noise to obtain a complex frequency spectrum expressed by the voice with noise in a frequency domain;
calculating the logarithm power spectrum of a real part in the complex frequency spectrum to obtain a logarithm real part power spectrum, and calculating the logarithm power spectrum of an imaginary part in the complex frequency spectrum to obtain a logarithm imaginary part power spectrum;
inputting the obtained logarithm real part power spectrum and logarithm imaginary part power spectrum into a pre-trained masking prediction network to obtain a first masking value corresponding to the real part and a second masking value corresponding to the imaginary part;
and respectively enhancing the real part and the imaginary part of the complex frequency spectrum by using the first masking value and the second masking value, and performing inverse Fourier transform on the enhanced complex frequency spectrum to obtain the enhanced voice corresponding to the voice with noise.
In the speech enhancement process, masking values are obtained separately for the real part and the imaginary part of the complex spectrum of the speech in the frequency domain. Obtaining the masking values separately for the two parts implicitly carries phase information; that is, the energy and the phase information of the speech signal are used together as features to obtain the masking values, so that speech distortion is reduced while noise is removed.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, such as a Central Processing Unit (CPU) or a Network Processor (NP); it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The above embodiments may be implemented wholly or partially in software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in this specification are described in a related manner: identical and similar parts of the embodiments may be understood by reference to one another, and each embodiment focuses on its differences from the others. In particular, because the apparatus and electronic device embodiments are substantially similar to the method embodiments, they are described briefly, and the relevant details can be found in the corresponding parts of the method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (9)

1. A method for speech enhancement based on complex spectral features, the method comprising:
carrying out Fourier transform on the voice with noise to obtain a complex frequency spectrum expressed by the voice with noise in a frequency domain;
calculating the logarithm power spectrum of a real part in the complex frequency spectrum to obtain a logarithm real part power spectrum, and calculating the logarithm power spectrum of an imaginary part in the complex frequency spectrum to obtain a logarithm imaginary part power spectrum;
inputting the obtained logarithm real part power spectrum and logarithm imaginary part power spectrum into a pre-trained masking prediction network to obtain a first masking value corresponding to the real part and a second masking value corresponding to the imaginary part;
and respectively enhancing the real part and the imaginary part of the complex frequency spectrum by using the first masking value and the second masking value, and performing inverse Fourier transform on the enhanced complex frequency spectrum to obtain the enhanced voice corresponding to the voice with noise.
2. The method of claim 1, wherein the training process of the masking prediction network comprises:
acquiring a training sample, wherein the training sample comprises a sample voice with noise and the clean voice that is combined with noise to form the sample voice with noise;
carrying out Fourier transform on the sample voice with noise to obtain a sample complex frequency spectrum represented by the sample voice with noise in a frequency domain;
calculating the logarithm power spectrum of the real part in the sample complex frequency spectrum to obtain a sample logarithm real part power spectrum, and calculating the logarithm power spectrum of the imaginary part in the sample complex frequency spectrum to obtain a sample logarithm imaginary part power spectrum;
inputting the obtained sample logarithm real part power spectrum and the sample logarithm imaginary part power spectrum into an initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;
respectively enhancing the real part and the imaginary part of the sample complex frequency spectrum by using the first sample masking value and the second sample masking value, and performing inverse Fourier transform on the enhanced sample complex frequency spectrum to obtain sample enhanced voice corresponding to the sample noisy voice;
calculating a mean square error between the clean speech and the sample enhanced speech as a loss value;
in the case that the loss value has not converged, adjusting the initial masking prediction network based on the loss value, and returning to the step of inputting the obtained sample logarithm real part power spectrum and sample logarithm imaginary part power spectrum into the initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;
in case the loss values converge, the initial masking prediction network is taken as a masking prediction network for speech enhancement.
3. The method of claim 1, wherein the step of calculating the log power spectrum of the real part in the complex spectrum to obtain a log real part power spectrum comprises:
calculating the log power spectrum of the real part in the complex frequency spectrum by the following expression to obtain a log real part power spectrum:
$$\mathrm{LRPS}(|X(t,i)|) = \log\left(|X_{\mathrm{real}}(t,i)|^{2}\right)$$
wherein $\mathrm{LRPS}(|X(t,i)|)$ represents the log real part power spectrum, and $X_{\mathrm{real}}(t,i)$ represents the real part.
4. The method of claim 1, wherein the step of calculating the log power spectrum of the imaginary part in the complex spectrum to obtain a log imaginary part power spectrum comprises:
calculating the logarithm power spectrum of the imaginary part in the complex frequency spectrum to obtain a logarithm imaginary part power spectrum by the following expression:
$$\mathrm{LIPS}(|X(t,i)|) = \log\left(|X_{\mathrm{image}}(t,i)|^{2}\right)$$
wherein $\mathrm{LIPS}(|X(t,i)|)$ represents the log imaginary part power spectrum, and $X_{\mathrm{image}}(t,i)$ represents the imaginary part.
5. An apparatus for speech enhancement based on complex spectral features, the apparatus comprising:
the Fourier transform module is used for carrying out Fourier transform on the voice with noise to obtain a complex frequency spectrum represented by the voice with noise in a frequency domain;
the characteristic extraction module is used for calculating the logarithm power spectrum of a real part in the complex frequency spectrum to obtain a logarithm real part power spectrum, and calculating the logarithm power spectrum of an imaginary part in the complex frequency spectrum to obtain a logarithm imaginary part power spectrum;
the masking prediction module is used for inputting the obtained logarithm real part power spectrum and logarithm imaginary part power spectrum into a pre-trained masking prediction network to obtain a first masking value corresponding to the real part and a second masking value corresponding to the imaginary part;
and the voice enhancement module is used for respectively enhancing the real part and the imaginary part of the complex frequency spectrum by using the first masking value and the second masking value, and performing inverse Fourier transform on the enhanced complex frequency spectrum to obtain the enhanced voice corresponding to the noisy voice.
6. The apparatus of claim 5, further comprising a model training module to derive a masking prediction network by:
acquiring a training sample, wherein the training sample comprises a sample voice with noise and the clean voice that is combined with noise to form the sample voice with noise;
carrying out Fourier transform on the sample voice with noise to obtain a sample complex frequency spectrum represented by the sample voice with noise in a frequency domain;
calculating the logarithm power spectrum of the real part in the sample complex frequency spectrum to obtain a sample logarithm real part power spectrum, and calculating the logarithm power spectrum of the imaginary part in the sample complex frequency spectrum to obtain a sample logarithm imaginary part power spectrum;
inputting the obtained sample logarithm real part power spectrum and the sample logarithm imaginary part power spectrum into an initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;
respectively enhancing the real part and the imaginary part of the sample complex frequency spectrum by using the first sample masking value and the second sample masking value, and performing inverse Fourier transform on the enhanced sample complex frequency spectrum to obtain sample enhanced voice corresponding to the sample noisy voice;
calculating a mean square error between the clean speech and the sample enhanced speech as a loss value;
in the case that the loss value has not converged, adjusting the initial masking prediction network based on the loss value, and returning to the step of inputting the obtained sample logarithm real part power spectrum and sample logarithm imaginary part power spectrum into the initial masking prediction network to obtain a first sample masking value corresponding to the real part and a second sample masking value corresponding to the imaginary part;
in case the loss values converge, the initial masking prediction network is taken as a masking prediction network for speech enhancement.
7. The apparatus of claim 5, wherein the feature extraction module is specifically configured to calculate a log power spectrum of a real part in the complex spectrum by the following expression:
$$\mathrm{LRPS}(|X(t,i)|) = \log\left(|X_{\mathrm{real}}(t,i)|^{2}\right)$$
wherein $\mathrm{LRPS}(|X(t,i)|)$ represents the log real part power spectrum, and $X_{\mathrm{real}}(t,i)$ represents the real part.
8. The apparatus according to claim 5, wherein the feature extraction module is specifically configured to calculate a log power spectrum of an imaginary part in the complex spectrum by the following expression:
$$\mathrm{LIPS}(|X(t,i)|) = \log\left(|X_{\mathrm{image}}(t,i)|^{2}\right)$$
wherein $\mathrm{LIPS}(|X(t,i)|)$ represents the log imaginary part power spectrum, and $X_{\mathrm{image}}(t,i)$ represents the imaginary part.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 4 when executing a program stored in the memory.
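As an illustration of the training procedure recited in claim 2, the following Python sketch computes the mean square error between the clean voice and the enhanced voice and repeats the adjustment until the loss stops changing. It is a toy under loud assumptions: two scalar gains (one for the real part, one for the imaginary part) stand in for the masking prediction network and ignore the log power spectrum inputs, the gradient is estimated by finite differences, and `lr`, `tol`, `h` and `max_steps` are hypothetical values chosen only for this example.

```python
import numpy as np


def enhance(noisy, gains):
    """Scale the real and imaginary parts of the spectrum, then invert."""
    spec = np.fft.rfft(noisy)
    out = gains[0] * spec.real + 1j * (gains[1] * spec.imag)
    return np.fft.irfft(out, n=len(noisy))


def mse(clean, enhanced):
    """Mean square error between clean and enhanced signals (the loss value)."""
    return float(np.mean((clean - enhanced) ** 2))


def train(clean, noisy, lr=0.5, tol=1e-9, max_steps=500, h=1e-4):
    """Adjust the two stand-in gains until the loss change falls below tol."""
    gains = np.ones(2)  # start from "no enhancement"
    prev = mse(clean, enhance(noisy, gains))
    loss = prev
    for _ in range(max_steps):
        grad = np.zeros(2)
        for k in range(2):
            step = np.zeros(2)
            step[k] = h
            # Central finite difference of the loss with respect to gain k.
            up = mse(clean, enhance(noisy, gains + step))
            down = mse(clean, enhance(noisy, gains - step))
            grad[k] = (up - down) / (2 * h)
        gains = gains - lr * grad  # adjust the stand-in "network"
        loss = mse(clean, enhance(noisy, gains))
        if abs(prev - loss) < tol:  # treated here as convergence
            break
        prev = loss
    return gains, loss


rng = np.random.default_rng(0)
t = np.arange(1024)
clean = np.sin(2 * np.pi * 5 * t / 1024)         # synthetic "clean voice"
noisy = clean + 0.5 * rng.standard_normal(1024)  # clean voice combined with noise
initial_loss = mse(clean, enhance(noisy, np.ones(2)))
gains, final_loss = train(clean, noisy)
```

A real masking prediction network would predict a mask value for every time-frequency bin from the log real part and log imaginary part power spectra and would normally be trained by backpropagation; only the loss definition and the stop-on-convergence loop carry over from this sketch.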
CN202111447463.9A 2021-11-30 2021-11-30 Speech enhancement method and device based on complex frequency spectrum characteristics Pending CN114141267A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111447463.9A CN114141267A (en) 2021-11-30 2021-11-30 Speech enhancement method and device based on complex frequency spectrum characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111447463.9A CN114141267A (en) 2021-11-30 2021-11-30 Speech enhancement method and device based on complex frequency spectrum characteristics

Publications (1)

Publication Number Publication Date
CN114141267A true CN114141267A (en) 2022-03-04

Family

ID=80386339

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111447463.9A Pending CN114141267A (en) 2021-11-30 2021-11-30 Speech enhancement method and device based on complex frequency spectrum characteristics

Country Status (1)

Country Link
CN (1) CN114141267A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117935838A (en) * 2024-03-25 2024-04-26 深圳市声扬科技有限公司 Audio acquisition method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
El-Moneim et al. Text-independent speaker recognition using LSTM-RNN and speech enhancement
CN111161752A (en) Echo cancellation method and device
CN112700786B (en) Speech enhancement method, device, electronic equipment and storage medium
WO2022141868A1 (en) Method and apparatus for extracting speech features, terminal, and storage medium
CN114242044B (en) Voice quality evaluation method, voice quality evaluation model training method and device
CN112735456A (en) Speech enhancement method based on DNN-CLSTM network
Mundodu Krishna et al. Single channel speech separation based on empirical mode decomposition and Hilbert transform
US10262680B2 (en) Variable sound decomposition masks
CN112949708A (en) Emotion recognition method and device, computer equipment and storage medium
CN111863008A (en) Audio noise reduction method and device and storage medium
Shafik et al. Speaker identification based on Radon transform and CNNs in the presence of different types of interference for Robotic Applications
CN115171714A (en) Voice enhancement method and device, electronic equipment and storage medium
CN114141267A (en) Speech enhancement method and device based on complex frequency spectrum characteristics
Mirbeygi et al. RPCA-based real-time speech and music separation method
Ram et al. Deep neural network based speech enhancement
CN113921030B (en) Speech enhancement neural network training method and device based on weighted speech loss
Tkachenko et al. Speech enhancement for speaker recognition using deep recurrent neural networks
Srinivas et al. A classification-based non-local means adaptive filtering for speech enhancement and its FPGA prototype
CN113555031B (en) Training method and device of voice enhancement model, and voice enhancement method and device
Sun et al. Single-channel speech enhancement based on joint constrained dictionary learning
CN115859048A (en) Noise processing method and device for partial discharge signal
Cuccovillo et al. Spectral denoising for microphone classification
CN113707172B (en) Single-channel voice separation method, system and computer equipment of sparse orthogonal network
Arcos et al. Ideal neighbourhood mask for speech enhancement
CN113921027B (en) Speech enhancement method and device based on spatial features and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication