WO2023045779A1 - Audio noise reduction method, apparatus, device, and storage medium

Audio noise reduction method, apparatus, device, and storage medium

Info

Publication number
WO2023045779A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
audio
spectrum
complex
denoised
Prior art date
Application number
PCT/CN2022/118040
Other languages
English (en)
French (fr)
Inventor
舒晓峰
竺烨航
尚楚翔
陈彦洁
Original Assignee
北京字跳网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京字跳网络技术有限公司
Publication of WO2023045779A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L21/0232 Processing in the frequency domain
    • G10L21/0316 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude

Definitions

  • the present disclosure relates to the field of data processing, and in particular to an audio noise reduction method, device, equipment and storage medium.
  • an embodiment of the present disclosure provides an audio noise reduction method, which can implement audio noise reduction, thereby better improving the sound quality of the audio.
  • the present disclosure provides an audio noise reduction method, the method comprising:
  • determining the denoising result audio data corresponding to the audio data to be denoised based on the first-order enhanced amplitude spectrum corresponding to the audio data to be denoised and the complex time-frequency mask.
  • the estimating the complex time-frequency mask of the audio data to be denoised by using the preset complex network model includes:
  • the complex spectrum to be denoised includes a complex spectrum determined based on the first-order enhanced amplitude spectrum corresponding to the audio data to be denoised and the original phase spectrum of the audio data to be denoised, or a complex spectrum determined based on the original spectrum and the original phase spectrum of the audio data to be denoised;
  • the determining the noise reduction result audio data corresponding to the audio data to be reduced based on the first-order enhanced amplitude spectrum corresponding to the audio data to be reduced and the complex time-frequency mask includes:
  • determining the phase-enhanced spectrum corresponding to the audio data to be denoised based on the phase gain and the original phase spectrum corresponding to the audio data to be denoised;
  • the preset real network model and the preset complex network model are used to form a two-stage time-domain convolutional network TCN model.
  • before the estimating of the amplitude time-frequency mask of the audio data to be denoised by using the preset real number network model, the method further includes:
  • the two-stage TCN model is trained by using audio training samples whose sampling rate is higher than a preset sampling rate threshold.
  • before the training of the two-stage TCN model by using the audio training samples whose sampling rate is higher than the preset sampling rate threshold, the method further includes:
  • using the audio training samples whose sampling rate is higher than the preset sampling rate threshold to train the two-stage TCN model includes:
  • the two-stage TCN model is trained by using the augmented audio training samples; wherein, the sampling rate of the augmented audio training samples is higher than a preset sampling rate threshold.
  • the present disclosure provides an audio noise reduction device, the device comprising:
  • An acquisition module configured to acquire audio data to be denoised
  • the first estimation module is configured to estimate the amplitude time-frequency mask of the audio data to be denoised by using a preset real number network model, wherein the amplitude time-frequency mask is used to determine the first-order enhanced amplitude spectrum corresponding to the audio data to be denoised;
  • the second estimation module is used to estimate the complex time-frequency mask of the audio data to be denoised by using a preset complex network model
  • the first determining module is configured to determine the noise reduction result audio data corresponding to the audio data to be reduced based on the first-order enhanced amplitude spectrum corresponding to the audio data to be reduced and the complex time-frequency mask.
  • the present disclosure provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device is made to implement the above method.
  • the present disclosure provides a device, including: a memory, a processor, and a computer program stored on the memory and operable on the processor, when the processor executes the computer program, Implement the above method.
  • the present disclosure provides a computer program product, where the computer program product includes a computer program/instruction, and when the computer program/instruction is executed by a processor, the above method is implemented.
  • An embodiment of the present disclosure provides an audio noise reduction method.
  • the audio data to be denoised is obtained, and then the amplitude time-frequency mask of the audio data to be denoised is estimated by using a preset real number network model, so that the first-order enhanced amplitude spectrum corresponding to the audio data to be denoised can be obtained.
  • the complex time-frequency mask of the audio data to be denoised is estimated by using the preset complex network model, and the denoising result audio data corresponding to the audio data to be denoised is determined by combining the first-order enhanced amplitude spectrum and the complex time-frequency mask.
  • the embodiments of the present disclosure use the preset real number network model to enhance the amplitude spectrum of the audio data to be denoised, and use the preset complex network model to simultaneously enhance the amplitude spectrum and phase spectrum of the audio data to be denoised. It can be seen that the embodiments of the present disclosure can perform noise reduction processing on the audio data to be denoised, so as to better improve the sound quality of the audio.
  • FIG. 1 is a flowchart of an audio noise reduction method provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of a two-stage TCN model provided by an embodiment of the present disclosure
  • FIG. 3 is a schematic structural diagram of an audio noise reduction device provided by an embodiment of the present disclosure.
  • Fig. 4 is a schematic structural diagram of an audio noise reduction device provided by an embodiment of the present disclosure.
  • noise in audio can be divided into at least two types: stationary noise and non-stationary noise.
  • Stationary noise means that the statistical characteristics of noise will not change with time, and common ones include white noise and pink noise.
  • Non-stationary noise means that the statistical characteristics of the noise change over time, such as keyboard sounds and mouse clicks.
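The distinction between the two noise types can be illustrated with a toy measurement. The sketch below is not part of the disclosure: the helper `segment_power`, the segment length, the signal lengths, and the click positions are all assumptions chosen for illustration. It compares the per-segment power of white noise with that of a signal containing two isolated clicks.

```python
import numpy as np

def segment_power(signal, segment_length):
    """Mean power of consecutive, non-overlapping segments (hypothetical helper)."""
    n_segments = len(signal) // segment_length
    segments = signal[: n_segments * segment_length].reshape(n_segments, segment_length)
    return np.mean(segments ** 2, axis=1)

rng = np.random.default_rng(0)  # fixed seed so the toy example is repeatable

# Stationary: white noise has roughly the same power in every segment.
white = rng.standard_normal(16000)

# Non-stationary: two isolated clicks, silence everywhere else.
clicks = np.zeros(16000)
clicks[[2000, 9000]] = 5.0

white_power = segment_power(white, 1000)   # all values close to 1.0
click_power = segment_power(clicks, 1000)  # zero except in the two click segments
```

The white-noise powers stay close to one another across segments, while the click signal concentrates all of its power in two segments, which is the kind of time-varying behavior the text calls non-stationary.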
  • audio noise reduction tools often use a single network model to achieve audio noise reduction.
  • because the complexity of such a network model is low, it is difficult to guarantee the noise reduction effect on audio; in particular, it is difficult to guarantee the suppression of non-stationary noise in the audio.
  • an embodiment of the present disclosure provides an audio noise reduction method, which uses a preset real number network model and a preset complex network model to perform noise reduction processing on the audio data to be denoised, and then combines the noise reduction results of the two to determine the denoising result audio data corresponding to the audio data to be denoised. Compared with using a single network model for audio noise reduction, the embodiments of the present disclosure can suppress non-stationary noise better, thereby ensuring the overall noise reduction effect and better improving the sound quality of the audio.
  • the embodiment of the present disclosure obtains the audio data to be denoised, and then uses a preset real number network model to estimate the amplitude time-frequency mask of the audio data to be denoised, so as to obtain the first-order enhanced amplitude spectrum corresponding to the audio data to be denoised. Furthermore, the complex time-frequency mask of the audio data to be denoised is estimated by using the preset complex network model, and the denoising result audio data corresponding to the audio data to be denoised is determined by combining the first-order enhanced amplitude spectrum and the complex time-frequency mask.
  • the embodiments of the present disclosure use the preset real number network model to enhance the amplitude spectrum of the audio data to be denoised, and use the preset complex network model to simultaneously enhance the amplitude spectrum and phase spectrum of the audio data to be denoised. It can be seen that the embodiments of the present disclosure can perform noise reduction processing on the audio data to be denoised while ensuring the noise reduction effect, thereby better improving the sound quality of the audio.
  • an embodiment of the present disclosure provides an audio noise reduction method.
  • Referring to FIG. 1, which is a flowchart of an audio noise reduction method provided by an embodiment of the present disclosure, the method includes:
  • S101 Acquire audio data to be denoised.
  • the audio data to be denoised in the embodiments of the present disclosure may be any audio segment, where the audio segment may also be an audio segment extracted from a video, or the like.
  • the embodiment of the present disclosure does not limit the audio data to be denoised.
  • the embodiment of the present disclosure may perform real-time noise reduction processing on the audio data to be reduced for noise during the audio recording stage, or may perform noise reduction processing on the audio data to be reduced during the audio editing stage.
  • Embodiments of the present disclosure do not limit noise reduction scenarios.
  • S102 Estimate an amplitude-time-frequency mask of the audio data to be denoised by using a preset real number network model.
  • the amplitude time-frequency mask is used to determine the first-order enhanced amplitude spectrum corresponding to the audio data to be denoised.
  • the preset real number network model is trained by using the audio training samples, and the trained preset real number network model is obtained, which is used to perform amplitude enhancement processing on the audio data to be denoised.
  • the preset real number network model can be realized based on any AI model.
  • for example, the preset real number network model can be implemented by a Temporal Convolutional Network (TCN), by a Recurrent Neural Network (RNN), and so on.
  • the audio data to be denoised can be input into the preset real number network model for processing, and the preset real number network model outputs the amplitude time-frequency mask of the audio data to be denoised.
  • the magnitude time-frequency mask is used to represent the proportional relationship between the enhanced magnitude spectrum and the original magnitude spectrum.
  • the amplitude time-frequency mask being used to determine the first-order enhanced amplitude spectrum corresponding to the audio data to be denoised includes: the amplitude time-frequency mask is multiplied by the original amplitude spectrum of the audio data to be denoised to obtain the first-order enhanced amplitude spectrum corresponding to the audio data to be denoised.
  • after the amplitude time-frequency mask of the audio data to be denoised is obtained, it is multiplied by the original amplitude spectrum of the audio data to be denoised to obtain the enhanced amplitude spectrum of the audio data to be denoised, which serves as the first-order enhanced amplitude spectrum.
  • the first-order enhanced amplitude spectrum is the amplitude spectrum after the frequency spectrum of the audio data to be denoised is enhanced through a preset real number network model.
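The multiplication described in S102 can be sketched in a few lines. This is a minimal illustration, assuming the spectrum is stored as a complex NumPy array of shape (frames, bins) and that the mask has the same shape; the function name `first_order_enhanced_magnitude` and the toy values are hypothetical, and the mask here is a hand-written stand-in for the output of the preset real number network model.

```python
import numpy as np

def first_order_enhanced_magnitude(spectrum, magnitude_mask):
    """Multiply the original amplitude spectrum by a real-valued time-frequency mask."""
    original_magnitude = np.abs(spectrum)
    return magnitude_mask * original_magnitude

# Toy complex spectrum, shape (frames, bins); values are illustrative only.
Y = np.array([[3 + 4j, 1 + 0j],
              [0 + 2j, 6 - 8j]])

# Stand-in for the mask a trained real-valued model would output.
mask = np.array([[0.5, 1.0],
                 [0.0, 0.1]])

Y1_mag = first_order_enhanced_magnitude(Y, mask)
```

Each mask entry scales one time-frequency bin, so a value near 0 suppresses that bin and a value near 1 leaves its amplitude unchanged.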
  • S103 Estimate a complex time-frequency mask of the audio data to be denoised by using a preset complex network model.
  • the preset complex network model is trained by using the audio training data to obtain the trained preset complex network model, which is used to simultaneously enhance the amplitude and phase of the audio data to be denoised.
  • the preset complex network model can be realized based on any AI model.
  • for example, the preset complex network model can be implemented by a Temporal Convolutional Network (TCN), by a Recurrent Neural Network (RNN), and so on.
  • the complex spectrum determined based on the original spectrum and the original phase spectrum of the audio data to be denoised is first determined as the complex spectrum to be denoised. Then, the complex spectrum to be denoised is input into a preset complex network model for processing, and the preset complex network model outputs a complex time-frequency mask corresponding to the audio data to be denoised.
  • the complex time-frequency mask is used to represent the proportional relationship between the enhanced spectrum and the original spectrum, and the complex time-frequency mask includes a real part and an imaginary part.
  • the embodiments of the present disclosure can also determine the complex spectrum determined based on the first-order enhanced amplitude spectrum and the original phase spectrum corresponding to the audio data to be denoised as the complex spectrum to be denoised, so that the preset complex network model can further enhance the amplitude and phase of the frequency spectrum of the audio data to be denoised, thereby further improving the noise reduction effect.
  • the original phase spectrum of the audio data to be denoised is first obtained, and then the frequency spectrum determined based on the first-order enhanced amplitude spectrum and the original phase spectrum corresponding to the audio data to be denoised is determined as the complex number to be denoised spectrum. Furthermore, the complex frequency spectrum to be denoised is input into a preset complex network model for processing, and the preset complex network model outputs a complex time-frequency mask corresponding to the audio data to be denoised.
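Building the input to the complex model from the first-order enhanced amplitude spectrum and the original phase spectrum can be sketched as follows. This is an illustration under assumptions: the function name `complex_input_from_first_order` is hypothetical, and how the complex network actually consumes its input (for example as separate real and imaginary channels) is not specified in the text.

```python
import numpy as np

def complex_input_from_first_order(first_order_magnitude, original_spectrum):
    """Pair the first-order enhanced amplitude with the original phase."""
    original_phase = np.angle(original_spectrum)
    return first_order_magnitude * np.exp(1j * original_phase)

# Toy values: one frame, two bins.
Y = np.array([[3 + 4j, 0 - 2j]])   # original complex spectrum
Y1_mag = np.array([[2.5, 1.0]])    # first-order enhanced amplitude spectrum

X = complex_input_from_first_order(Y1_mag, Y)  # complex spectrum to be denoised
```

The result keeps the phase of the original spectrum and takes its amplitude from the first stage, which is what lets the second stage refine both quantities.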
  • S104 Determine noise reduction result audio data corresponding to the audio data to be reduced based on the first-order enhanced magnitude spectrum corresponding to the audio data to be reduced and the complex time-frequency mask.
  • After the amplitude enhancement of the audio data to be denoised by the preset real number network model, and the simultaneous enhancement of the amplitude and phase of the audio data to be denoised by the preset complex network model, the corresponding first-order enhanced amplitude spectrum and complex time-frequency mask are obtained. Then, based on the first-order enhanced amplitude spectrum and the complex time-frequency mask corresponding to the audio data to be denoised, the denoising result audio data corresponding to the audio data to be denoised is determined, thereby realizing the noise reduction processing of the audio data to be denoised.
  • the amplitude gain and the phase gain are determined based on the complex time-frequency mask.
  • the amplitude gain is used to characterize the amplitude enhancement of the spectrum of the audio data to be denoised by the preset complex network model
  • the phase gain is used to characterize the phase enhancement of the frequency spectrum of the audio data to be denoised by the preset complex network model.
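  • based on the phase gain and the original phase spectrum corresponding to the audio data to be denoised, the phase-enhanced spectrum corresponding to the audio data to be denoised is determined.
  • in addition, based on the amplitude gain and the first-order enhanced amplitude spectrum corresponding to the audio data to be denoised, the second-order enhanced amplitude spectrum corresponding to the audio data to be denoised is determined.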
  • the second-order enhanced amplitude spectrum is an amplitude spectrum obtained by performing amplitude enhancement on the audio data to be denoised through a preset real number network model and a preset complex number network model. Furthermore, based on the second-order enhanced amplitude spectrum and phase enhanced spectrum, the enhanced spectrum corresponding to the audio data to be reduced is determined, and the noise reduction result audio data corresponding to the audio data to be reduced is determined based on the enhanced spectrum.
  • formulas (1) and (2) can be used to calculate the amplitude gain and the phase gain, respectively.
  • formula (3) can be used to calculate the enhanced spectrum corresponding to the audio data to be denoised.
  • in formula (3), Y_phase represents the original phase spectrum, and the remaining three terms represent the phase-enhanced spectrum, the first-order enhanced amplitude spectrum, and the second-order enhanced amplitude spectrum, respectively.
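The formulas themselves do not appear in this text. One plausible reading, assuming the complex time-frequency mask is converted to polar form (a common treatment of complex masks), is sketched below. Apart from Y_phase, every symbol is an assumption introduced here for illustration: M_r and M_i for the real and imaginary parts of the mask, M_mag and M_phase for the amplitude and phase gains, and the hatted and tilded quantities for the first-order and second-order results. The use of multiplication for the amplitude gain and addition for the phase gain is likewise assumed.

```latex
\begin{aligned}
M_{\mathrm{mag}} &= \sqrt{M_r^{2} + M_i^{2}}
  && \text{(amplitude gain)} \\
M_{\mathrm{phase}} &= \operatorname{arctan2}(M_i,\, M_r)
  && \text{(phase gain)} \\
\tilde{Y}_{\mathrm{mag}} &= \hat{Y}_{\mathrm{mag}} \cdot M_{\mathrm{mag}}
  && \text{(second-order enhanced amplitude spectrum)} \\
\tilde{Y}_{\mathrm{phase}} &= Y_{\mathrm{phase}} + M_{\mathrm{phase}}
  && \text{(phase-enhanced spectrum)} \\
\tilde{S} &= \tilde{Y}_{\mathrm{mag}} \, e^{\,j\,\tilde{Y}_{\mathrm{phase}}}
  && \text{(enhanced spectrum)}
\end{aligned}
```

Here \hat{Y}_{mag} stands for the first-order enhanced amplitude spectrum produced by the real number network model.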
  • the denoising result audio data corresponding to the audio data to be denoised is obtained through processing such as inverse Fourier transform.
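Step S104 can be sketched end to end for a single frame. This is a minimal sketch under stated assumptions, not the disclosed implementation: it assumes the amplitude gain multiplies the first-order amplitude and the phase gain is added to the original phase, it uses a per-frame inverse real FFT in place of a full inverse short-time transform (windowing and overlap-add are omitted), and the function names `enhance_spectrum` and `frame_to_audio` are hypothetical.

```python
import numpy as np

def enhance_spectrum(original_spectrum, magnitude_mask, complex_mask):
    """Combine the first-stage amplitude mask with the second-stage complex mask."""
    first_order_magnitude = magnitude_mask * np.abs(original_spectrum)
    amplitude_gain = np.abs(complex_mask)    # assumed polar-form amplitude gain
    phase_gain = np.angle(complex_mask)      # assumed polar-form phase gain
    second_order_magnitude = amplitude_gain * first_order_magnitude
    enhanced_phase = np.angle(original_spectrum) + phase_gain
    return second_order_magnitude * np.exp(1j * enhanced_phase)

def frame_to_audio(enhanced_spectrum, frame_length):
    """Inverse real FFT of one frame; overlap-add across frames is omitted."""
    return np.fft.irfft(enhanced_spectrum, n=frame_length)

# Toy frame: one cycle of a sine wave.
frame_length = 8
frame = np.sin(2 * np.pi * np.arange(frame_length) / frame_length)
Y = np.fft.rfft(frame)

# Stand-ins for the two model outputs: halve every amplitude, leave phase alone.
magnitude_mask = np.full(Y.shape, 0.5)
complex_mask = np.ones(Y.shape, dtype=complex)

denoised = frame_to_audio(enhance_spectrum(Y, magnitude_mask, complex_mask), frame_length)
```

With these stand-in masks the output is the input frame scaled by 0.5, which makes it easy to check that the amplitude path and the phase path are wired as described.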
  • the audio data to be denoised is obtained, and then the amplitude time-frequency mask of the audio data to be denoised is estimated by using the preset real number network model, so that the first-order enhanced amplitude spectrum corresponding to the audio data to be denoised can be obtained.
  • the complex time-frequency mask of the audio data to be denoised is estimated by using the preset complex network model, and the denoising result audio data corresponding to the audio data to be denoised is determined by combining the first-order enhanced amplitude spectrum and the complex time-frequency mask.
  • the embodiments of the present disclosure use the preset real number network model to enhance the amplitude spectrum of the audio data to be denoised, and use the preset complex network model to simultaneously enhance the amplitude spectrum and phase spectrum of the audio data to be denoised. It can be seen that the embodiments of the present disclosure can perform noise reduction processing on the audio data to be denoised, so as to better improve the sound quality of the audio.
  • the embodiments of the present disclosure can implement a preset real number network model and a preset complex number network model based on the TCN model.
  • the embodiments of the present disclosure can use the two-stage temporal convolution network TCN model to perform noise reduction processing on audio, thereby improving the sound quality of audio to a large extent.
  • the two-stage TCN model includes a real TCN model and a complex TCN model, and Y(n) is used to represent the audio data to be denoised.
  • the complex time-frequency mask corresponding to Y(n) includes the real part and the imaginary part
  • before using the two-stage TCN model to denoise audio, the two-stage TCN model is first trained. Specifically, the two-stage TCN model can be trained by using audio training data whose sampling rate is higher than the preset sampling rate threshold, so that the trained two-stage TCN model achieves a better noise reduction effect on audio data with a higher sampling rate.
  • the preset sampling rate threshold may be a value greater than 16 kHz.
  • the two-stage TCN model can be trained by using the time domain loss function SISNR.
  • the time domain loss function SISNR will not be introduced here.
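For readers unfamiliar with the loss, the sketch below shows the commonly used scale-invariant signal-to-noise ratio (SI-SNR) on one-dimensional waveforms. It follows the standard definition rather than anything specific to this disclosure; the function names, the `eps` stabilizer, and the toy signals are assumptions for illustration.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between an estimated and a clean waveform."""
    # Remove the mean so a constant offset does not affect the score.
    estimate = estimate - np.mean(estimate)
    target = target - np.mean(target)
    # Project the estimate onto the target to get the "signal" component.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    noise = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)

def si_snr_loss(estimate, target):
    """Training minimizes the negative SI-SNR."""
    return -si_snr(estimate, target)

# Toy check: a clean tone and two estimates with different noise levels.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 5 * np.arange(1000) / 1000)
noise = rng.standard_normal(1000)
noisy_estimate = clean + 0.1 * noise
better_estimate = clean + 0.01 * noise

score_noisy = si_snr(noisy_estimate, clean)
score_better = si_snr(better_estimate, clean)
score_rescaled = si_snr(2.0 * noisy_estimate, clean)
```

Rescaling the estimate leaves the score essentially unchanged, which is the property that gives the measure its name, and the less noisy estimate scores higher.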
  • preset data augmentation processing can be performed on the audio training samples to enrich the diversity of the audio training samples.
  • the preset data augmentation processing may include applying high-pass filtering, low-pass filtering, band-pass filtering, different volume settings, and/or equalization to the audio training samples according to preset probabilities.
  • the augmented audio training samples can be used to train the two-stage TCN model.
  • the sampling rate of the augmented audio training samples may be higher than a preset sampling rate threshold, so as to ensure the robustness of the two-stage TCN model to high sampling rate audio data noise reduction processing.
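The probability-driven augmentation described above can be sketched as follows. This is an illustration only: the cutoff frequencies, the volume range, the single shared probability `p`, and the function names are assumptions, the filters are crude FFT-bin masks standing in for proper filter designs, and equalization is omitted.

```python
import numpy as np

def fft_filter(audio, sample_rate, low_hz=None, high_hz=None):
    """Zero all FFT bins outside [low_hz, high_hz]; a crude stand-in for a real filter."""
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sample_rate)
    keep = np.ones_like(freqs, dtype=bool)
    if low_hz is not None:
        keep &= freqs >= low_hz
    if high_hz is not None:
        keep &= freqs <= high_hz
    return np.fft.irfft(spectrum * keep, n=len(audio))

def augment(audio, sample_rate, rng, p=0.5):
    """Apply each operation independently with probability p (assumed scheme)."""
    if rng.random() < p:   # high-pass
        audio = fft_filter(audio, sample_rate, low_hz=100.0)
    if rng.random() < p:   # low-pass
        audio = fft_filter(audio, sample_rate, high_hz=0.4 * sample_rate)
    if rng.random() < p:   # band-pass
        audio = fft_filter(audio, sample_rate, low_hz=200.0, high_hz=0.3 * sample_rate)
    if rng.random() < p:   # random volume
        audio = audio * rng.uniform(0.3, 1.0)
    return audio
```

None of these operations changes the number of samples, so the augmented sample keeps the sampling rate of the original, consistent with the requirement that augmented samples stay above the preset sampling rate threshold.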
  • the audio noise reduction method provided by the embodiments of the present disclosure can use the two-stage TCN model to achieve audio noise reduction, and in particular to suppress non-stationary noise in the audio, which further improves the noise reduction effect, improves the sound quality of the audio, and improves the user experience.
  • the present disclosure also provides an audio noise reduction device.
  • Referring to FIG. 3, which is a schematic structural diagram of an audio noise reduction device provided by an embodiment of the present disclosure,
  • the device includes:
  • An acquisition module 301 configured to acquire audio data to be denoised
  • the first estimation module 302 is configured to estimate the amplitude time-frequency mask of the audio data to be denoised by using a preset real number network model, wherein the amplitude time-frequency mask is used to determine the first-order enhanced amplitude spectrum corresponding to the audio data to be denoised;
  • the second estimation module 303 is configured to estimate the complex time-frequency mask of the audio data to be denoised by using a preset complex network model
  • a determining module 304 configured to determine noise reduction result audio data corresponding to the audio data to be reduced based on the first-order enhanced magnitude spectrum corresponding to the audio data to be reduced and the complex time-frequency mask.
  • the second estimation module includes:
  • the first determining submodule is used to determine the complex spectrum to be denoised, wherein the complex spectrum to be denoised includes a complex spectrum determined based on the first-order enhanced amplitude spectrum corresponding to the audio data to be denoised and the original phase spectrum of the audio data to be denoised, or a complex spectrum determined based on the original spectrum and the original phase spectrum of the audio data to be denoised;
  • the first processing submodule is configured to input the complex spectrum to be denoised into the preset complex network model, which, after processing, outputs the complex time-frequency mask corresponding to the audio data to be denoised.
  • the determination module includes:
  • a second determining submodule configured to determine an amplitude gain and a phase gain based on the complex time-frequency mask
  • the third determination submodule is used to determine the phase enhancement spectrum corresponding to the audio data to be reduced based on the phase gain and the original phase spectrum corresponding to the audio data to be reduced;
  • the fourth determining submodule is used to determine the second-order enhanced amplitude spectrum corresponding to the audio data to be reduced based on the amplitude gain and the first-order enhanced amplitude spectrum corresponding to the audio data to be reduced;
  • the fifth determining submodule is configured to determine the noise reduction result audio data corresponding to the audio data to be reduced based on the second-order enhanced magnitude spectrum and the phase enhanced spectrum.
  • the preset real network model and the preset complex network model are used to form a two-stage time-domain convolutional network TCN model.
  • the device further includes:
  • the training module is used to train the two-stage TCN model by using audio training samples whose sampling rate is higher than a preset sampling rate threshold.
  • the device further includes:
  • An augmentation module configured to perform preset data augmentation processing on the audio training samples to obtain augmented audio training samples
  • the training module is specifically used for:
  • training the two-stage TCN model by using the augmented audio training samples, wherein the sampling rate of the augmented audio training samples is higher than the preset sampling rate threshold.
  • the preset data augmentation processing includes performing high-pass, low-pass, band-pass, setting different volumes and/or equalizing the audio training samples according to preset probabilities.
  • the training module is specifically used to train the two-stage TCN model by using the time domain loss function SISNR.
  • the amplitude time-frequency mask being used to determine the first-order enhanced amplitude spectrum corresponding to the audio data to be denoised includes: the amplitude time-frequency mask is multiplied by the original amplitude spectrum of the audio data to be denoised to obtain the first-order enhanced amplitude spectrum corresponding to the audio data to be denoised.
  • the audio data to be denoised is obtained, and then the amplitude time-frequency mask of the audio data to be denoised is estimated by using a preset real number network model, so that the first-order enhanced amplitude spectrum corresponding to the audio data to be denoised can be obtained. Furthermore, the complex time-frequency mask of the audio data to be denoised is estimated by using the preset complex network model, and the denoising result audio data corresponding to the audio data to be denoised is determined by combining the first-order enhanced amplitude spectrum and the complex time-frequency mask.
  • the embodiments of the present disclosure use the preset real number network model to enhance the amplitude spectrum of the audio data to be denoised, and use the preset complex network model to simultaneously enhance the amplitude spectrum and phase spectrum of the audio data to be denoised. It can be seen that the embodiments of the present disclosure can perform noise reduction processing on the audio data to be denoised, so as to better improve the sound quality of the audio.
  • an embodiment of the present disclosure also provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device implements the audio noise reduction method described in the embodiments of the present disclosure.
  • the embodiment of the present disclosure further provides a computer program product, the computer program product includes a computer program/instruction, and when the computer program/instruction is executed by a processor, the audio noise reduction method described in the embodiment of the present disclosure is implemented.
  • an embodiment of the present disclosure also provides an audio noise reduction device, as shown in FIG. 4 , which may include:
  • a processor 401, a memory 402, an input device 403, and an output device 404.
  • the number of processors 401 in the audio noise reduction device may be one or more, and one processor is taken as an example in FIG. 4 .
  • the processor 401, the memory 402, the input device 403, and the output device 404 may be connected through a bus or in other ways, wherein connection through a bus is taken as an example in FIG. 4.
  • the memory 402 can be used to store software programs and modules, and the processor 401 executes various functional applications and data processing of the audio noise reduction device by running the software programs and modules stored in the memory 402 .
  • the memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function, and the like.
  • the memory 402 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage devices.
  • the input device 403 can be used to receive input digital or character information, and generate signal input related to user settings and function control of the audio noise reduction device.
  • the processor 401 loads the executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application programs stored in the memory 402, so as to realize the various functions of the above-mentioned audio noise reduction device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

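An audio noise reduction method, apparatus, device, computer-readable storage medium, and program product. The method includes: obtaining audio data to be denoised (S101); estimating a magnitude time-frequency mask of the audio data to be denoised using a preset real-valued network model, to obtain a first-order enhanced magnitude spectrum corresponding to the audio data to be denoised (S102); estimating a complex time-frequency mask of the audio data to be denoised using a preset complex-valued network model (S103); and determining, by combining the first-order enhanced magnitude spectrum and the complex time-frequency mask, denoised result audio data corresponding to the audio data to be denoised (S104).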

Description

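Audio noise reduction method, apparatus, device, and storage medium
The present disclosure claims priority to Chinese patent application No. 202111124158.6, filed on September 24, 2021 and entitled "Audio noise reduction method, apparatus, device, and storage medium", the entire contents of which are incorporated into the present disclosure by reference.
Technical Field
The present disclosure relates to the field of data processing, and in particular to an audio noise reduction method, apparatus, device, and storage medium.
Background
During audio recording, the environment, the equipment or other factors often introduce noise into the recorded audio, which results in a poor listening experience for the user.
At present, there are very few tools for audio noise reduction, and the few that exist do not deliver satisfactory results.
Therefore, how to denoise audio and thereby improve its sound quality is a technical problem that urgently needs to be solved.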
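Summary
To solve, or at least partially solve, the above technical problem, embodiments of the present disclosure provide an audio noise reduction method that can denoise audio and thereby improve its sound quality.
In a first aspect, the present disclosure provides an audio noise reduction method, the method including:
obtaining audio data to be denoised;
estimating a magnitude time-frequency mask of the audio data to be denoised using a preset real-valued network model, where the magnitude time-frequency mask is used to determine a first-order enhanced magnitude spectrum corresponding to the audio data to be denoised;
estimating a complex time-frequency mask of the audio data to be denoised using a preset complex-valued network model;
determining, based on the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised and the complex time-frequency mask, denoised result audio data corresponding to the audio data to be denoised.
In one implementation, estimating the complex time-frequency mask of the audio data to be denoised using the preset complex-valued network model includes:
determining a complex spectrum to be denoised, where the complex spectrum to be denoised includes a complex spectrum determined based on the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised and the original phase spectrum of the audio data to be denoised, or a complex spectrum determined based on the original spectrum and the original phase spectrum of the audio data to be denoised;
inputting the complex spectrum to be denoised into the preset complex-valued network model, which, after processing, outputs the complex time-frequency mask corresponding to the audio data to be denoised.
In one implementation, determining the denoised result audio data corresponding to the audio data to be denoised based on the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised and the complex time-frequency mask includes:
determining a magnitude gain and a phase gain based on the complex time-frequency mask;
determining an enhanced phase spectrum corresponding to the audio data to be denoised based on the phase gain and the original phase spectrum corresponding to the audio data to be denoised;
and determining a second-order enhanced magnitude spectrum corresponding to the audio data to be denoised based on the magnitude gain and the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised;
determining the denoised result audio data corresponding to the audio data to be denoised based on the second-order enhanced magnitude spectrum and the enhanced phase spectrum.
In one implementation, the preset real-valued network model and the preset complex-valued network model are used to form a two-stage temporal convolutional network (TCN) model.
In one implementation, before estimating the magnitude time-frequency mask of the audio data to be denoised using the preset real-valued network model, the method further includes:
training the two-stage TCN model using audio training samples whose sampling rate is higher than a preset sampling rate threshold.
In one implementation, before training the two-stage TCN model using audio training samples whose sampling rate is higher than the preset sampling rate threshold, the method further includes:
performing preset data augmentation on the audio training samples to obtain augmented audio training samples;
correspondingly, training the two-stage TCN model using audio training samples whose sampling rate is higher than the preset sampling rate threshold includes:
training the two-stage TCN model using the augmented audio training samples, where the sampling rate of the augmented audio training samples is higher than the preset sampling rate threshold.
In a second aspect, the present disclosure provides an audio noise reduction apparatus, the apparatus including:
an obtaining module, configured to obtain audio data to be denoised;
a first estimation module, configured to estimate a magnitude time-frequency mask of the audio data to be denoised using a preset real-valued network model, where the magnitude time-frequency mask is used to determine a first-order enhanced magnitude spectrum corresponding to the audio data to be denoised;
a second estimation module, configured to estimate a complex time-frequency mask of the audio data to be denoised using a preset complex-valued network model;
a first determination module, configured to determine, based on the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised and the complex time-frequency mask, denoised result audio data corresponding to the audio data to be denoised.
In a third aspect, the present disclosure provides a computer-readable storage medium storing instructions which, when run on a terminal device, cause the terminal device to implement the above method.
In a fourth aspect, the present disclosure provides a device including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the above method when executing the computer program.
In a fifth aspect, the present disclosure provides a computer program product including a computer program/instructions which, when executed by a processor, implement the above method.
Compared with the related art, the technical solutions provided by the embodiments of the present disclosure have at least the following advantages:
In the audio noise reduction method provided by the embodiments of the present disclosure, audio data to be denoised is first obtained, and a preset real-valued network model is then used to estimate the magnitude time-frequency mask of the audio data to be denoised, from which the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised can be obtained. Next, a preset complex-valued network model is used to estimate the complex time-frequency mask of the audio data to be denoised, and the first-order enhanced magnitude spectrum and the complex time-frequency mask are combined to determine the denoised result audio data corresponding to the audio data to be denoised. Embodiments of the present disclosure use the preset real-valued network model to enhance the magnitude spectrum of the audio data to be denoised, and use the preset complex-valued network model to enhance both its magnitude spectrum and its phase spectrum. Embodiments of the present disclosure can therefore denoise the audio data and improve the sound quality of the audio.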
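Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain the principles of the present disclosure.
To describe the technical solutions in the embodiments of the present disclosure or in the related art more clearly, the drawings needed for describing the embodiments or the related art are briefly introduced below. Obviously, a person of ordinary skill in the art can derive other drawings from these drawings without creative effort.
FIG. 1 is a flowchart of an audio noise reduction method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a two-stage TCN model provided by an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of an audio noise reduction apparatus provided by an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of an audio noise reduction device provided by an embodiment of the present disclosure.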
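Detailed Description
To make the above objects, features and advantages of the present disclosure clearer, the solutions of the present disclosure are further described below. It should be noted that, where no conflict arises, the embodiments of the present disclosure and the features in the embodiments may be combined with one another.
Many specific details are set forth in the following description to facilitate a full understanding of the present disclosure, but the present disclosure may also be implemented in ways other than those described here; obviously, the embodiments in the specification are only some, not all, of the embodiments of the present disclosure.
Because of the recording environment, the equipment or other factors, recorded audio may contain noise, which degrades sound quality and affects the user experience. Noise in audio can be divided into at least two types, stationary and non-stationary. Stationary noise is noise whose statistical properties do not change over time; common examples are white noise and pink noise. Non-stationary noise is noise whose statistical properties change over time; common examples are keyboard sounds and mouse clicks.
At present, audio noise reduction tools may be implemented with artificial intelligence (AI) noise reduction models. However, current AI noise reduction models usually suppress stationary noise well but suppress non-stationary noise only weakly, so the noise reduction delivered by current tools cannot guarantee a good user experience.
In practice, audio noise reduction tools often use a single network model to denoise audio. Although the complexity of such a model is low, it is difficult to guarantee the noise reduction effect, and in particular the suppression of non-stationary noise in the audio. For this reason, embodiments of the present disclosure provide an audio noise reduction method that denoises the audio data with a preset real-valued network model and a preset complex-valued network model respectively, and then combines the two results to determine the denoised result audio data corresponding to the audio data to be denoised. Compared with denoising audio using a single network model, embodiments of the present disclosure suppress non-stationary noise better, which safeguards the overall noise reduction effect and in turn improves the sound quality of the audio.
Specifically, embodiments of the present disclosure obtain audio data to be denoised and then use a preset real-valued network model to estimate the magnitude time-frequency mask of the audio data to be denoised, from which the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised can be obtained. Next, a preset complex-valued network model is used to estimate the complex time-frequency mask of the audio data to be denoised, and the first-order enhanced magnitude spectrum and the complex time-frequency mask are combined to determine the denoised result audio data corresponding to the audio data to be denoised.
Embodiments of the present disclosure use the preset real-valued network model to enhance the magnitude spectrum of the audio data to be denoised, and use the preset complex-valued network model to enhance both its magnitude spectrum and its phase spectrum. Embodiments of the present disclosure can therefore denoise the audio data while safeguarding the noise reduction effect, and in turn improve the sound quality of the audio.
On this basis, an embodiment of the present disclosure provides an audio noise reduction method. FIG. 1 is a flowchart of an audio noise reduction method provided by an embodiment of the present disclosure. The method includes: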
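S101: Obtain audio data to be denoised.
The audio data to be denoised in the embodiments of the present disclosure may be any audio clip, and the audio clip may also be one extracted from a video. Embodiments of the present disclosure place no limitation on the audio data to be denoised.
In practice, embodiments of the present disclosure may denoise the audio data in real time during the audio recording stage, or may denoise the audio data during the audio editing stage. Embodiments of the present disclosure place no limitation on the noise reduction scenario.
S102: Estimate a magnitude time-frequency mask of the audio data to be denoised using a preset real-valued network model.
The magnitude time-frequency mask is used to determine a first-order enhanced magnitude spectrum corresponding to the audio data to be denoised.
In the embodiments of the present disclosure, the preset real-valued network model is first trained with audio training samples, and the trained preset real-valued network model is then used to perform magnitude enhancement on the audio data to be denoised. The preset real-valued network model may be implemented with any AI model; for example, it may be implemented by a temporal convolutional network (TCN) or by a recurrent neural network (RNN).
In the embodiments of the present disclosure, after the preset real-valued network model has been trained, the audio data to be denoised may be input into the preset real-valued network model for processing, and the preset real-valued network model outputs the magnitude time-frequency mask of the audio data to be denoised. The magnitude time-frequency mask represents the ratio between the enhanced magnitude spectrum and the original magnitude spectrum.
In one implementation, the magnitude time-frequency mask being used to determine the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised includes: the magnitude time-frequency mask is multiplied by the original magnitude spectrum of the audio data to be denoised to obtain the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised.
After the preset real-valued network model has performed magnitude enhancement on the spectrum of the audio data to be denoised, the magnitude time-frequency mask of the audio data to be denoised is obtained. The magnitude time-frequency mask is then multiplied by the original magnitude spectrum of the audio data to be denoised to obtain the enhanced magnitude spectrum of the audio data to be denoised, which serves as the first-order enhanced magnitude spectrum. The first-order enhanced magnitude spectrum is the magnitude spectrum obtained after the spectrum of the audio data to be denoised has undergone magnitude enhancement by the preset real-valued network model.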
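The multiplication described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `apply_magnitude_mask` and the nested-list layout `[frame][frequency bin]` are assumptions made for the example and do not come from the disclosure, and a practical implementation would more likely use an array library.

```python
from typing import List

Spectrogram = List[List[float]]  # assumed layout: [frame][frequency bin]


def apply_magnitude_mask(original_mag: Spectrogram, mask: Spectrogram) -> Spectrogram:
    """Multiply the original magnitude spectrum by the mask, bin by bin.

    The result plays the role of the first-order enhanced magnitude spectrum.
    """
    if len(original_mag) != len(mask):
        raise ValueError("mask and spectrum must have the same number of frames")
    enhanced = []
    for mag_frame, mask_frame in zip(original_mag, mask):
        if len(mag_frame) != len(mask_frame):
            raise ValueError("mask and spectrum must have the same number of bins")
        enhanced.append([m * g for m, g in zip(mag_frame, mask_frame)])
    return enhanced
```

A mask value below 1 attenuates a bin and a value of 1 leaves it unchanged, which matches the description of the mask as a ratio between the enhanced and original magnitude spectra.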
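S103: Estimate a complex time-frequency mask of the audio data to be denoised using a preset complex-valued network model.
In the embodiments of the present disclosure, the preset complex-valued network model is first trained with audio training data, and the trained preset complex-valued network model is then used to enhance the magnitude and the phase of the audio data to be denoised at the same time. The preset complex-valued network model may be implemented with any AI model; for example, it may be implemented by a temporal convolutional network (TCN) or by a recurrent neural network (RNN).
In one implementation, before the trained preset complex-valued network model is used to denoise the audio data, the complex spectrum determined based on the original spectrum and the original phase spectrum of the audio data to be denoised is first taken as the complex spectrum to be denoised. The complex spectrum to be denoised is then input into the preset complex-valued network model for processing, and the preset complex-valued network model outputs the complex time-frequency mask corresponding to the audio data to be denoised. The complex time-frequency mask represents the ratio between the enhanced spectrum and the original spectrum, and it includes a real part and an imaginary part.
To improve the noise reduction effect, embodiments of the present disclosure may instead take the complex spectrum determined based on the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised and the original phase spectrum as the complex spectrum to be denoised. In this way the preset complex-valued network model can further enhance the magnitude and phase of the spectrum of the audio data to be denoised on top of the noise reduction already performed by the preset real-valued network model, which further improves the noise reduction effect.
Specifically, in one implementation, the original phase spectrum of the audio data to be denoised is first obtained, and the spectrum determined based on the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised and the original phase spectrum is taken as the complex spectrum to be denoised. The complex spectrum to be denoised is then input into the preset complex-valued network model for processing, and the preset complex-valued network model outputs the complex time-frequency mask corresponding to the audio data to be denoised.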
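S104: Determine, based on the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised and the complex time-frequency mask, denoised result audio data corresponding to the audio data to be denoised.
In the embodiments of the present disclosure, after the preset real-valued network model has enhanced the magnitude of the audio data to be denoised and the preset complex-valued network model has enhanced its magnitude and phase at the same time, the first-order enhanced magnitude spectrum and the complex time-frequency mask corresponding to the audio data to be denoised are obtained respectively. The denoised result audio data corresponding to the audio data to be denoised is then determined based on the first-order enhanced magnitude spectrum and the complex time-frequency mask, which completes the noise reduction of the audio.
In one implementation, a magnitude gain and a phase gain are first determined based on the complex time-frequency mask. The magnitude gain characterizes how the preset complex-valued network model enhances the magnitude of the spectrum of the audio data to be denoised, and the phase gain characterizes how it enhances the phase of that spectrum. An enhanced phase spectrum corresponding to the audio data to be denoised is then determined based on the phase gain and the original phase spectrum, and a second-order enhanced magnitude spectrum corresponding to the audio data to be denoised is determined based on the magnitude gain and the first-order enhanced magnitude spectrum. The second-order enhanced magnitude spectrum is the magnitude spectrum obtained after the audio data to be denoised has undergone magnitude enhancement by both the preset real-valued network model and the preset complex-valued network model. Finally, the enhanced spectrum corresponding to the audio data to be denoised is determined based on the second-order enhanced magnitude spectrum and the enhanced phase spectrum, and the denoised result audio data is determined based on the enhanced spectrum.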
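In practice, the magnitude gain and the phase gain may be calculated using formulas (1) and (2), respectively:
[Formula (1): Figure PCTCN2022118040-appb-000001]
[Formula (2): Figure PCTCN2022118040-appb-000002]
where [symbol: Figure PCTCN2022118040-appb-000003] denotes the magnitude gain, [symbol: Figure PCTCN2022118040-appb-000004] denotes the real part of the complex time-frequency mask, [symbol: Figure PCTCN2022118040-appb-000005] denotes the imaginary part of the complex time-frequency mask, and [symbol: Figure PCTCN2022118040-appb-000006] denotes the phase gain.
In addition, the enhanced spectrum corresponding to the audio data to be denoised may be calculated using formula (3):
[Formula (3): Figure PCTCN2022118040-appb-000007]
where [symbol: Figure PCTCN2022118040-appb-000008] denotes the enhanced spectrum, Y_phase denotes the original phase spectrum, [symbol: Figure PCTCN2022118040-appb-000009] denotes the enhanced phase spectrum, [symbol: Figure PCTCN2022118040-appb-000010] denotes the first-order enhanced magnitude spectrum, and [symbol: Figure PCTCN2022118040-appb-000011] denotes the second-order enhanced magnitude spectrum.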
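Formulas (1) to (3) survive in this text only as image references, so their exact form cannot be recovered here. The block below is one plausible reading that is consistent with the surrounding definitions: it assumes the standard polar decomposition of a complex mask, assumes that the phase gain is added to the original phase, and assumes that the magnitude gain multiplies the first-order enhanced magnitude. Apart from Y_phase, every symbol is an assumed name: M_r and M_i for the real and imaginary parts of the mask, M_mag and M_phase for the two gains, \hat{Y}_{mag} for the first-order enhanced magnitude spectrum, and \hat{S} for the enhanced spectrum.

```latex
\begin{align}
M_{mag}   &= \sqrt{M_r^{2} + M_i^{2}} \tag{1} \\
M_{phase} &= \operatorname{arctan2}(M_i,\, M_r) \tag{2} \\
\hat{S}   &= \underbrace{\hat{Y}_{mag}\, M_{mag}}_{\text{second-order enhanced magnitude}}
             \cdot \exp\!\Bigl( j\, \bigl( \underbrace{Y_{phase} + M_{phase}}_{\text{enhanced phase}} \bigr) \Bigr) \tag{3}
\end{align}
```

Under these assumptions, formula (3) is equivalent to multiplying the complex spectrum \hat{Y}_{mag} e^{jY_{phase}} by the complex mask M_r + jM_i.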
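After the enhanced spectrum corresponding to the audio data to be denoised is obtained, the denoised result audio data corresponding to the audio data to be denoised is obtained through processing such as an inverse Fourier transform.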
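The per-bin computation of step S104 can be sketched as follows. This is an illustration under stated assumptions rather than the disclosed formulas, which are available here only as image references: the magnitude gain is taken to be the modulus of the complex mask, the phase gain its argument, the phase gain is added to the original phase, and the magnitude gain multiplies the first-order enhanced magnitude. The function names are hypothetical, and the inverse transform back to a waveform is left out.

```python
import cmath
import math
from typing import Tuple


def mask_to_gains(mask_real: float, mask_imag: float) -> Tuple[float, float]:
    """Split one complex mask value into (magnitude gain, phase gain in radians)."""
    return math.hypot(mask_real, mask_imag), math.atan2(mask_imag, mask_real)


def enhance_bin(
    first_order_mag: float, original_phase: float, mask_real: float, mask_imag: float
) -> complex:
    """Return the enhanced spectrum value for a single time-frequency bin."""
    mag_gain, phase_gain = mask_to_gains(mask_real, mask_imag)
    second_order_mag = first_order_mag * mag_gain  # assumed: gains multiply
    enhanced_phase = original_phase + phase_gain   # assumed: phases add
    return cmath.rect(second_order_mag, enhanced_phase)
```

With these assumptions the result equals the product of the stage-two input bin and the complex mask value, so an implementation could equally perform a single complex multiplication per bin.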
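As can be seen, in the audio noise reduction method provided by the embodiments of the present disclosure, audio data to be denoised is first obtained, and a preset real-valued network model is then used to estimate the magnitude time-frequency mask of the audio data to be denoised, from which the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised can be obtained. Next, a preset complex-valued network model is used to estimate the complex time-frequency mask of the audio data to be denoised, and the first-order enhanced magnitude spectrum and the complex time-frequency mask are combined to determine the denoised result audio data corresponding to the audio data to be denoised. Embodiments of the present disclosure use the preset real-valued network model to enhance the magnitude spectrum of the audio data to be denoised, and use the preset complex-valued network model to enhance both its magnitude spectrum and its phase spectrum. Embodiments of the present disclosure can therefore denoise the audio data and improve the sound quality of the audio.
Because the TCN model performs better than other network models in the field of audio noise reduction, embodiments of the present disclosure may implement the preset real-valued network model and the preset complex-valued network model based on the TCN model. In addition, to further improve the audio noise reduction effect, embodiments of the present disclosure may denoise audio with a two-stage temporal convolutional network (TCN) model, which improves the sound quality of the audio to a greater extent.
FIG. 2 is a schematic diagram of a two-stage TCN model provided by an embodiment of the present disclosure. The two-stage TCN model includes a real-valued TCN model and a complex-valued TCN model, and Y(n) denotes the audio data to be denoised.
In practice, after Y(n) is obtained, a short-time Fourier transform (STFT) and Log|.| processing are applied to Y(n) in sequence, and the result is input into the real-valued TCN model. After processing by the real-valued TCN model, the magnitude time-frequency mask corresponding to Y(n) is output. The original magnitude spectrum of Y(n) is then obtained, and the product of the original magnitude spectrum and the magnitude time-frequency mask is calculated as the first-order enhanced magnitude spectrum [symbol: Figure PCTCN2022118040-appb-000012] corresponding to Y(n).
The complex spectrum to be denoised, determined based on the first-order enhanced magnitude spectrum [symbol: Figure PCTCN2022118040-appb-000013] and the original phase spectrum Y_phase, is input into the complex-valued TCN model. After processing by the complex-valued TCN model, the complex time-frequency mask corresponding to Y(n) is output. The complex time-frequency mask corresponding to Y(n) includes a real part [symbol: Figure PCTCN2022118040-appb-000014] and an imaginary part [symbol: Figure PCTCN2022118040-appb-000015].
It is worth noting that embodiments of the present disclosure place no limitation on the model architecture, parameters and the like used to implement the real-valued TCN model and the complex-valued TCN model.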
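The data flow of FIG. 2 described above can be sketched with the two networks treated as opaque callables. Everything named here is an assumption made for illustration: `two_stage_denoise`, the placeholder models `identity_real_model` and `identity_complex_model`, the `eps` term that keeps the logarithm finite, and the choice to apply the complex mask by complex multiplication. The STFT and its inverse are assumed to be computed elsewhere, and no TCN architecture is implied.

```python
import cmath
import math
from typing import Callable, List, Tuple

RealSpec = List[List[float]]       # assumed layout: [frame][frequency bin]
ComplexSpec = List[List[complex]]  # assumed layout: [frame][frequency bin]


def two_stage_denoise(
    noisy_spec: ComplexSpec,
    real_model: Callable[[RealSpec], RealSpec],
    complex_model: Callable[[ComplexSpec], Tuple[RealSpec, RealSpec]],
    eps: float = 1e-8,  # assumption: avoids log(0) on silent bins
) -> ComplexSpec:
    # Split the STFT of Y(n) into original magnitude and phase spectra.
    mag = [[abs(z) for z in frame] for frame in noisy_spec]
    phase = [[cmath.phase(z) for z in frame] for frame in noisy_spec]

    # Stage 1: log-magnitude features -> magnitude mask -> first-order magnitude.
    log_mag = [[math.log(m + eps) for m in frame] for frame in mag]
    mag_mask = real_model(log_mag)
    first_mag = [[m * g for m, g in zip(mf, gf)] for mf, gf in zip(mag, mag_mask)]

    # Stage 2: first-order magnitude + original phase -> complex mask.
    stage2_in = [[cmath.rect(m, p) for m, p in zip(mf, pf)]
                 for mf, pf in zip(first_mag, phase)]
    mask_re, mask_im = complex_model(stage2_in)

    # Complex multiplication scales each bin by the mask's modulus and shifts
    # its phase by the mask's argument (assumed combination rule).
    return [[z * complex(r, i) for z, r, i in zip(zf, rf, imf)]
            for zf, rf, imf in zip(stage2_in, mask_re, mask_im)]


def identity_real_model(log_mag: RealSpec) -> RealSpec:
    """Placeholder for the real-valued model: a mask of ones (no attenuation)."""
    return [[1.0 for _ in frame] for frame in log_mag]


def identity_complex_model(spec: ComplexSpec) -> Tuple[RealSpec, RealSpec]:
    """Placeholder for the complex-valued model: the mask 1 + 0j everywhere."""
    ones = [[1.0 for _ in frame] for frame in spec]
    zeros = [[0.0 for _ in frame] for frame in spec]
    return ones, zeros
```

With the two placeholder models the function returns its input up to rounding, which is a convenient sanity check before real networks are plugged in.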
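In practice, the two-stage TCN model is trained before it is used to denoise audio. Specifically, the two-stage TCN model may be trained with audio training data whose sampling rate is higher than a preset sampling rate threshold, so that the trained two-stage TCN model denoises audio data with a high sampling rate well. The preset sampling rate threshold may be a value greater than 16 kHz.
In one implementation, the two-stage TCN model may be trained with the time-domain loss function SISNR. The time-domain loss function SISNR is not described in detail here.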
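The disclosure does not define SISNR. Assuming the name refers to the widely used scale-invariant signal-to-noise ratio (SI-SNR), the quantity can be sketched as below. The zero-mean step, the `eps` stabilizer and the function names are choices made for this example, and the negative SI-SNR is used as the loss so that minimizing the loss increases the ratio.

```python
import math
from typing import Sequence


def si_snr(estimate: Sequence[float], target: Sequence[float], eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB between an estimated and a clean waveform."""
    count = len(target)
    if count == 0 or len(estimate) != count:
        raise ValueError("estimate and target must be non-empty and equally long")

    # Remove the mean so that a constant offset does not affect the score.
    est_mean = sum(estimate) / count
    tgt_mean = sum(target) / count
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]

    # Project the estimate onto the target to get the scaled target component.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    s_target = [dot / tgt_energy * t for t in tgt]
    e_noise = [e - s for e, s in zip(est, s_target)]

    signal_power = sum(s * s for s in s_target)
    noise_power = sum(r * r for r in e_noise) + eps
    return 10.0 * math.log10(signal_power / noise_power + eps)


def si_snr_loss(estimate: Sequence[float], target: Sequence[float]) -> float:
    """Training loss: the negative SI-SNR."""
    return -si_snr(estimate, target)
```

Because the estimate is projected onto the target before the ratio is formed, rescaling the estimate leaves the score essentially unchanged, which is the property the "scale-invariant" name refers to.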
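In addition, to improve the robustness of the two-stage TCN model, preset data augmentation may be applied to the audio training samples before the two-stage TCN model is trained, so as to enrich the diversity of the audio training samples.
In one implementation, the preset data augmentation includes applying, with a preset probability, processing operations such as high-pass filtering, low-pass filtering, band-pass filtering, setting different volumes and/or equalization to the audio training samples.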
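Probability-gated augmentation of this kind can be sketched as follows. The filters are deliberately crude one-pole designs, and the probabilities, filter coefficients and gain range are placeholder values; none of them come from the disclosure. Band-pass filtering and equalization are omitted to keep the example short.

```python
import random
from typing import List


def scale_volume(samples: List[float], gain: float) -> List[float]:
    """Multiply every sample by a constant gain."""
    return [gain * x for x in samples]


def one_pole_low_pass(samples: List[float], alpha: float) -> List[float]:
    """Simple recursive low-pass: y[n] = y[n-1] + alpha * (x[n] - y[n-1])."""
    out, prev = [], 0.0
    for x in samples:
        prev = prev + alpha * (x - prev)
        out.append(prev)
    return out


def simple_high_pass(samples: List[float], alpha: float) -> List[float]:
    """High-pass obtained by subtracting the low-passed signal from the input."""
    low = one_pole_low_pass(samples, alpha)
    return [x - l for x, l in zip(samples, low)]


def augment(
    samples: List[float],
    rng: random.Random,
    p_low: float = 0.3,     # placeholder probability
    p_high: float = 0.3,    # placeholder probability
    p_volume: float = 0.5,  # placeholder probability
) -> List[float]:
    """Apply each augmentation independently with its own probability."""
    out = list(samples)
    if rng.random() < p_low:
        out = one_pole_low_pass(out, alpha=0.5)
    if rng.random() < p_high:
        out = simple_high_pass(out, alpha=0.1)
    if rng.random() < p_volume:
        out = scale_volume(out, rng.uniform(0.25, 1.0))
    return out
```

Passing a seeded `random.Random` instance makes an augmentation run reproducible, which helps when comparing training configurations.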
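In practice, after preset data augmentation has been applied to the audio training samples to obtain augmented audio training samples, the two-stage TCN model may be trained with the augmented audio training samples.
In one implementation, the sampling rate of the augmented audio training samples may be higher than the preset sampling rate threshold, so as to ensure that the two-stage TCN model denoises audio data with a high sampling rate robustly.
The audio noise reduction method provided by the embodiments of the present disclosure can denoise audio with a two-stage TCN model. It suppresses non-stationary noise in the audio particularly well, which further improves the noise reduction effect, the sound quality of the audio and the user experience.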
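Based on the above method embodiments, the present disclosure further provides an audio noise reduction apparatus. FIG. 3 is a schematic structural diagram of an audio noise reduction apparatus provided by an embodiment of the present disclosure. The apparatus includes:
an obtaining module 301, configured to obtain audio data to be denoised;
a first estimation module 302, configured to estimate a magnitude time-frequency mask of the audio data to be denoised using a preset real-valued network model, where the magnitude time-frequency mask is used to determine a first-order enhanced magnitude spectrum corresponding to the audio data to be denoised;
a second estimation module 303, configured to estimate a complex time-frequency mask of the audio data to be denoised using a preset complex-valued network model;
a determination module 304, configured to determine, based on the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised and the complex time-frequency mask, denoised result audio data corresponding to the audio data to be denoised.
In one implementation, the second estimation module includes:
a first determination submodule, configured to determine a complex spectrum to be denoised, where the complex spectrum to be denoised includes a complex spectrum determined based on the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised and the original phase spectrum of the audio data to be denoised, or a complex spectrum determined based on the original spectrum and the original phase spectrum of the audio data to be denoised;
a first processing submodule, configured to input the complex spectrum to be denoised into the preset complex-valued network model, which, after processing, outputs the complex time-frequency mask corresponding to the audio data to be denoised.
In one implementation, the determination module includes:
a second determination submodule, configured to determine a magnitude gain and a phase gain based on the complex time-frequency mask;
a third determination submodule, configured to determine an enhanced phase spectrum corresponding to the audio data to be denoised based on the phase gain and the original phase spectrum corresponding to the audio data to be denoised;
a fourth determination submodule, configured to determine a second-order enhanced magnitude spectrum corresponding to the audio data to be denoised based on the magnitude gain and the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised;
a fifth determination submodule, configured to determine the denoised result audio data corresponding to the audio data to be denoised based on the second-order enhanced magnitude spectrum and the enhanced phase spectrum.
In one implementation, the preset real-valued network model and the preset complex-valued network model are used to form a two-stage temporal convolutional network (TCN) model.
In one implementation, the apparatus further includes:
a training module, configured to train the two-stage TCN model using audio training samples whose sampling rate is higher than a preset sampling rate threshold.
In one implementation, the apparatus further includes:
an augmentation module, configured to perform preset data augmentation on the audio training samples to obtain augmented audio training samples;
correspondingly, the training module is specifically configured to:
train the two-stage TCN model using the augmented audio training samples, where the sampling rate of the augmented audio training samples is higher than the preset sampling rate threshold.
In one implementation, the preset data augmentation includes applying high-pass filtering, low-pass filtering, band-pass filtering, setting different volumes and/or equalization to the audio training samples with a preset probability.
In one implementation, the training module is specifically configured to train the two-stage TCN model using the time-domain loss function SISNR.
In one implementation, the magnitude time-frequency mask being used to determine the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised includes: the magnitude time-frequency mask is multiplied by the original magnitude spectrum of the audio data to be denoised to obtain the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised.
In the audio noise reduction apparatus provided by the embodiments of the present disclosure, audio data to be denoised is first obtained, and a preset real-valued network model is then used to estimate the magnitude time-frequency mask of the audio data to be denoised, from which the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised can be obtained. Next, a preset complex-valued network model is used to estimate the complex time-frequency mask of the audio data to be denoised, and the first-order enhanced magnitude spectrum and the complex time-frequency mask are combined to determine the denoised result audio data corresponding to the audio data to be denoised. Embodiments of the present disclosure use the preset real-valued network model to enhance the magnitude spectrum of the audio data to be denoised, and use the preset complex-valued network model to enhance both its magnitude spectrum and its phase spectrum. Embodiments of the present disclosure can therefore denoise the audio data and improve the sound quality of the audio.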
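In addition to the above method and apparatus, an embodiment of the present disclosure further provides a computer-readable storage medium storing instructions which, when run on a terminal device, cause the terminal device to implement the audio noise reduction method described in the embodiments of the present disclosure.
An embodiment of the present disclosure further provides a computer program product including a computer program/instructions which, when executed by a processor, implement the audio noise reduction method described in the embodiments of the present disclosure.
In addition, an embodiment of the present disclosure further provides an audio noise reduction device which, as shown in FIG. 4, may include:
a processor 401, a memory 402, an input device 403 and an output device 404. The audio noise reduction device may have one or more processors 401; FIG. 4 takes one processor as an example. In some embodiments of the present disclosure, the processor 401, the memory 402, the input device 403 and the output device 404 may be connected through a bus or in other ways; FIG. 4 takes connection through a bus as an example.
The memory 402 can be used to store software programs and modules, and the processor 401 executes the various functional applications and data processing of the audio noise reduction device by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function, and the like. In addition, the memory 402 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device. The input device 403 can be used to receive input numeric or character information and to generate signal input related to user settings and function control of the audio noise reduction device.
Specifically, in this embodiment, the processor 401 loads the executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application programs stored in the memory 402, thereby realizing the various functions of the above audio noise reduction device.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.
The above are only specific implementations of the present disclosure, enabling those skilled in the art to understand or implement the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present disclosure. Therefore, the present disclosure is not limited to the embodiments described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.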

Claims (13)

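  1. An audio noise reduction method, wherein the method comprises:
    obtaining audio data to be denoised;
    estimating a magnitude time-frequency mask of the audio data to be denoised using a preset real-valued network model, wherein the magnitude time-frequency mask is used to determine a first-order enhanced magnitude spectrum corresponding to the audio data to be denoised;
    estimating a complex time-frequency mask of the audio data to be denoised using a preset complex-valued network model;
    determining, based on the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised and the complex time-frequency mask, denoised result audio data corresponding to the audio data to be denoised.
  2. The method according to claim 1, wherein estimating the complex time-frequency mask of the audio data to be denoised using the preset complex-valued network model comprises:
    determining a complex spectrum to be denoised, wherein the complex spectrum to be denoised comprises a complex spectrum determined based on the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised and the original phase spectrum of the audio data to be denoised, or a complex spectrum determined based on the original spectrum and the original phase spectrum of the audio data to be denoised;
    inputting the complex spectrum to be denoised into the preset complex-valued network model, which, after processing, outputs the complex time-frequency mask corresponding to the audio data to be denoised.
  3. The method according to claim 1 or 2, wherein determining the denoised result audio data corresponding to the audio data to be denoised based on the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised and the complex time-frequency mask comprises:
    determining a magnitude gain and a phase gain based on the complex time-frequency mask;
    determining an enhanced phase spectrum corresponding to the audio data to be denoised based on the phase gain and the original phase spectrum corresponding to the audio data to be denoised;
    and determining a second-order enhanced magnitude spectrum corresponding to the audio data to be denoised based on the magnitude gain and the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised;
    determining the denoised result audio data corresponding to the audio data to be denoised based on the second-order enhanced magnitude spectrum and the enhanced phase spectrum.
  4. The method according to claim 1, wherein the preset real-valued network model and the preset complex-valued network model are used to form a two-stage temporal convolutional network (TCN) model.
  5. The method according to claim 4, wherein, before estimating the magnitude time-frequency mask of the audio data to be denoised using the preset real-valued network model, the method further comprises:
    training the two-stage TCN model using audio training samples whose sampling rate is higher than a preset sampling rate threshold.
  6. The method according to claim 5, wherein, before training the two-stage TCN model using audio training samples whose sampling rate is higher than the preset sampling rate threshold, the method further comprises:
    performing preset data augmentation on the audio training samples to obtain augmented audio training samples;
    correspondingly, training the two-stage TCN model using audio training samples whose sampling rate is higher than the preset sampling rate threshold comprises:
    training the two-stage TCN model using the augmented audio training samples, wherein the sampling rate of the augmented audio training samples is higher than the preset sampling rate threshold.
  7. The method according to claim 6, wherein the preset data augmentation comprises applying high-pass filtering, low-pass filtering, band-pass filtering, setting different volumes and/or equalization to the audio training samples with a preset probability.
  8. The method according to claim 5, wherein training the two-stage TCN model comprises:
    training the two-stage TCN model using the time-domain loss function SISNR.
  9. The method according to claim 1, wherein the magnitude time-frequency mask being used to determine the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised comprises:
    the magnitude time-frequency mask being multiplied by the original magnitude spectrum of the audio data to be denoised to obtain the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised.
  10. An audio noise reduction apparatus, wherein the apparatus comprises:
    an obtaining module, configured to obtain audio data to be denoised;
    a first estimation module, configured to estimate a magnitude time-frequency mask of the audio data to be denoised using a preset real-valued network model, wherein the magnitude time-frequency mask is used to determine a first-order enhanced magnitude spectrum corresponding to the audio data to be denoised;
    a second estimation module, configured to estimate a complex time-frequency mask of the audio data to be denoised using a preset complex-valued network model;
    a determination module, configured to determine, based on the first-order enhanced magnitude spectrum corresponding to the audio data to be denoised and the complex time-frequency mask, denoised result audio data corresponding to the audio data to be denoised.
  11. A computer-readable storage medium, wherein the computer-readable storage medium stores instructions which, when run on a terminal device, cause the terminal device to implement the method according to any one of claims 1-9.
  12. A device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1-9 when executing the computer program.
  13. A computer program product, wherein the computer program product comprises a computer program/instructions which, when executed by a processor, implement the method according to any one of claims 1-9.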
PCT/CN2022/118040 2021-09-24 2022-09-09 Audio noise reduction method, apparatus, device, and storage medium WO2023045779A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111124158.6 2021-09-24
CN202111124158.6A CN115862649A (zh) 2021-09-24 2021-09-24 Audio noise reduction method, apparatus, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023045779A1 true WO2023045779A1 (zh) 2023-03-30

Family

ID=85652626

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/118040 WO2023045779A1 (zh) 2021-09-24 2022-09-09 Audio noise reduction method, apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN115862649A (zh)
WO (1) WO2023045779A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108735213A (zh) * 2018-05-29 2018-11-02 太原理工大学 一种基于相位补偿的语音增强方法及系统
CN110739002A (zh) * 2019-10-16 2020-01-31 中山大学 基于生成对抗网络的复数域语音增强方法、系统及介质
CN110808063A (zh) * 2019-11-29 2020-02-18 北京搜狗科技发展有限公司 一种语音处理方法、装置和用于处理语音的装置
CN111508514A (zh) * 2020-04-10 2020-08-07 江苏科技大学 基于补偿相位谱的单通道语音增强算法
US20210012767A1 (en) * 2020-09-25 2021-01-14 Intel Corporation Real-time dynamic noise reduction using convolutional networks
CN112567458A (zh) * 2018-08-16 2021-03-26 三菱电机株式会社 音频信号处理系统、音频信号处理方法及计算机可读存储介质

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Master's Thesis", 27 March 2020, ZHEJIANG UNIVERSITY, China, article LI, BIN: "Single Channel Speech Enhancement Based on Deep Neural Network", pages: 1 - 60, XP009544831, DOI: 10.27461/d.cnki.gzjdx.2020.003246 *
LI, WANLING, ZHANG QIU-JU: "Speech Enhancement Based on Joint Maximum A Posteriori Probability", COMPUTER SYSTEMS AND APPLICATIONS, ZHONGGUO KEXUEYUAN RUANJIAN YANJIUSUO, CN, vol. 27, no. 12, 1 January 2018 (2018-01-01), CN , pages 163 - 168, XP093053996, ISSN: 1003-3254, DOI: 10.15888/j.cnki.csa.006670 *
ZHENG NAIJUN: "SIGNAL ENHANCEMENT BASED ON COMPLEX-VALUED NEURAL NETWORKS", XIDIAN UNIVERSITY MASTER'S THESES, no. 05, 1 January 2018 (2018-01-01), XP055827314 *

Also Published As

Publication number Publication date
CN115862649A (zh) 2023-03-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22871827

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 18571119

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE