WO2022032608A1 - Audio noise reduction method and apparatus - Google Patents

Audio noise reduction method and apparatus

Info

Publication number
WO2022032608A1
WO2022032608A1 (PCT/CN2020/109052, CN2020109052W)
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
noise reduction
frequency
network model
model
Prior art date
Application number
PCT/CN2020/109052
Other languages
English (en)
French (fr)
Inventor
孙学京
郭红阳
王松
Original Assignee
南京拓灵智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京拓灵智能科技有限公司
Priority to KR1020207026990A (published as KR20200128684A)
Publication of WO2022032608A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/04 Time compression or expansion
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering

Definitions

  • Embodiments of the present invention relate to the technical field of audio noise processing, and in particular, to an audio noise reduction method and device.
  • In most audio- and speech-related applications, the signal picked up by the microphone (usually speech) is contaminated by noise, which can severely degrade speech quality and thus the performance of speech applications. The microphone signal must therefore be noise-reduced before being stored, analyzed, transmitted or played back. Noise suppression reduces stationary and non-stationary noise in speech signals, thereby increasing the signal-to-noise ratio, improving speech intelligibility and reducing listening fatigue.
  • Neural networks have been applied to various scenarios, such as speech enhancement, speech recognition, voiceprint recognition and far-field voice interaction; studies have shown that deep neural networks can improve a system's robustness to different environments.
  • A new generation of artificial-intelligence methods, which use recurrent neural networks (RNNs) to model the temporal information of audio signals or convert the one-dimensional speech signal into a two-dimensional spectrogram modeled with multi-layer convolutional neural networks (CNNs), has made substantial progress.
  • Generative adversarial networks have also been applied to speech enhancement tasks. Compared with traditional methods such as spectral subtraction, minimum mean-square error estimation and Wiener filtering, neural networks can effectively exploit contextual information and are markedly better at handling non-stationary noise.
  • The traditional noise reduction process estimates the noise energy of the input audio signal and uses it for noise suppression to obtain a noise-reduced audio signal, as shown in FIG. 1.
  • Such an algorithm works well on stationary noise, but it requires a certain noise-convergence time, and its noise reduction is not obvious for non-stationary noise or at low signal-to-noise ratios.
  • Neural-network-based noise reduction mainly trains a network on training audio (containing both clean and noisy speech) to obtain different network model parameters; the noisy input speech is then suppressed with the corresponding noise reduction model parameters to obtain noise-reduced audio, as shown in FIG. 2.
  • The traditional noise reduction algorithm has a certain noise-convergence time, and its effect on non-stationary noise and at low signal-to-noise ratios is limited. In some noise environments (such as fan noise), the traditional algorithm also performs poorly.
  • When a neural-network-based noise reduction algorithm processes audio signals with different sampling rates, different network models must be selected according to the sampling rate and frame length, which is very inflexible in practice.
  • A real audio system, especially the call function, usually needs to support different sampling rates and frame lengths in order to accommodate different bandwidths and complexities; fixed sampling rates and frame lengths are very inflexible and hard to extend in real systems, which limits the deployment and implementation of many related products.
  • To this end, the embodiments of the present invention provide an audio noise reduction method and device.
  • The method does not need to train different models for input signals with different sampling rates and frame lengths, and the base layer and extension layer of the model can be switched dynamically at the inference stage.
  • Based on this improved network training model, input audio signals with different frame lengths and sampling rates are noise-reduced, which allows flexible noise reduction for the audio system and solves the inflexibility and poor extensibility of neural-network noise reduction for signals with different sampling rates and frame lengths. The specific technical solutions are as follows:
  • an audio noise reduction method, including: acquiring a pre-sampled original audio signal to be processed; and
  • inputting the original audio signal into a pre-trained layered expansion network model for computation to obtain a noise-reduced audio signal.
  • the training of the layered expansion network model includes:
  • the model parameters of the multiple frequency division network models are combined to obtain a hierarchical extended network model.
  • The corresponding frequency band is obtained according to the frame length and sampling rate of the audio signal to be trained.
  • Combining the model parameters of the multiple frequency-division network models includes: superimposing and combining multiple groups of the model parameters, or using the parameters of the low-frequency network model as input parameters of the high-frequency network model.
  • The frequency band is divided using the auditory-based Mel frequency analysis technique, the Bark domain, or an ERB-scale division technique.
  • Another aspect of the present invention also provides an audio noise reduction device, comprising an acquisition module for acquiring a pre-sampled original audio signal to be processed;
  • a noise reduction module for inputting the original audio signal into a pre-trained hierarchical extended network model for calculation to obtain a noise-reduced audio signal.
  • the noise reduction module includes:
  • a signal acquisition module for acquiring the audio signal to be trained
  • a model parameter calculation module for performing frequency-division processing on the audio signal to be trained at multiple preset sampling rates, inputting the frequency-divided audio signals into the network models of the frequency bands corresponding to the sampling rates for training, and obtaining model parameters of multiple frequency-division network models;
  • The corresponding frequency band is obtained according to the frame length and sampling rate of the audio signal to be trained.
  • The combination module includes: a superposition module for superimposing and combining multiple groups of the model parameters, or an input combination module for using the low-frequency network model parameters as input parameters of the high-frequency network model.
  • The frequency band is divided using the auditory-based Mel frequency analysis technique, the Bark domain, or an ERB-scale division technique.
  • The audio noise reduction method provided in Embodiment 1 of the present invention uses a layered expansion network model to perform noise reduction on input audio signals of different frame lengths and sampling rates, which improves the efficiency of audio noise reduction, allows flexible noise reduction for the audio system, and solves the inflexibility and poor extensibility of neural-network noise reduction for signals with different sampling rates and frame lengths.
  • 1 is a flowchart of a traditional noise reduction method based on noise energy estimation
  • FIG. 3 is a flowchart of the optimized noise reduction method provided in Embodiment 1 of the present invention.
  • FIG. 5 is an exemplary flowchart of the optimized noise reduction method provided in Embodiment 3 of the present invention.
  • the present invention provides an audio noise reduction method.
  • The method includes:
  • the original audio signal is input into a pre-trained layered expansion network model for calculation to obtain a noise-reduced audio signal after noise reduction.
  • the above-mentioned original audio signal may be referred to as input audio for short, and the above-mentioned hierarchical extended network model is a noise processing algorithm, which can perform noise reduction processing on input audio signals with different sampling rates and different frame lengths, also called noise suppression processing.
  • Referring to FIG. 3, a flowchart of the optimized noise reduction method provided in Embodiment 1 of the present invention: this flowchart describes the training of the above-mentioned hierarchical expansion network model in detail, including the steps of:
  • the model parameters of the multiple frequency division network models are combined to obtain a hierarchical extended network model.
  • Specifically, during training of the layered expansion network model, a fast Fourier transform (FFT) is performed according to the frame length of the input audio (for example, a 16 kHz sampling rate with a 256-point frame length), and 257 complex FFT coefficients are obtained; the FFT coefficients are converted into energies through triangular windows and mapped onto a unified set of non-linear sub-bands matching the auditory system, for example 40 Mel bands, and the training audio is then frequency-divided according to the sampling rate.
  • The corresponding network is then called and trained to obtain the model parameters of each frequency-division network model; these model parameters are combined to obtain the model parameters of the layered expansion network model, and finally the layered expansion network model is obtained.
  • There are multiple such frequency-division network models.
  • In the embodiments of the present invention, the multiple groups of network models are denoted network 1, network 2, ..., network N.
  • The low-frequency part usually serves as the base layer of the network and is given more model parameters, because the low-frequency part contains more and richer speech information.
  • The high-frequency part is defined as the extension layer; a network is built for the characteristics of the extension layer and is given fewer model parameters.
  • the corresponding frequency band is obtained according to the frame length of the audio signal to be trained and the sampling rate.
  • the frequency band is divided using Mel frequency analysis technique based on auditory characteristics, Bark domain or ERB scale division technique.
  • In Embodiment 2 of the present invention, combining the model parameters of the multiple frequency-division network models includes superimposing and combining multiple groups of the model parameters; alternatively, in Embodiment 3 of the present invention, the parameters of the low-frequency network model are used as input parameters of the high-frequency network model.
  • FIG. 4 is an exemplary flowchart of the optimized noise reduction method provided by Embodiment 2 of the present invention
  • FIG. 5 is an exemplary flowchart of the optimized noise reduction method provided by Embodiment 3 of the present invention.
  • Embodiment 2 and Embodiment 3 are exemplarily described below by sampling audio signals at 48 kHz.
  • the hierarchical extended network model training is performed for the 48kHz sampled audio signal, the audio signal is divided into s1 and s2, and different network layers are used for training.
  • the frame length will be set to 512 or 1024.
  • Different systems have different frame lengths, and the different frame lengths can be mapped to a unified fixed number of bands by banding, such as the commonly used Melband;
  • the audio signal is subjected to frequency division processing according to the sampling rate to obtain the audio signal s1 (including the number of bands corresponding to the frequency components of 0 to 8 kHz) and the audio signal s2 (including the number of bands corresponding to the frequency components of 8 to 24 kHz);
  • the audio signal s1 is trained by the network 1, and the model parameter p1 is obtained; the audio signal s2 is trained by the network 2, and the model parameter p2 is obtained;
  • model parameters p1 and the model parameters p2 are combined to obtain the model parameters of the 48kHz audio;
  • the frequency-divided audio signals are not limited to two, and the frequency components included in each audio signal are not limited to the second embodiment, and can be set according to specific applications.
  • the frequency domain features can be used as the network input, and the time domain signal can also be used as the network input.
  • the hierarchical extended network model training is carried out for the 48kHz sampled audio signal, the audio signal is divided into s1 and s2, and different network layers are used for training, and the output of network 1 is used as the input of network 2 for training.
  • the frame length will be set to 512 or 1024.
  • Different systems have different frame lengths, and the different frame lengths can be mapped to a unified fixed number of bands by banding, such as the commonly used Melband;
  • the audio signal is subjected to frequency division processing according to the sampling rate to obtain the audio signal s1 (including the number of bands corresponding to the frequency components of 0 to 8 kHz) and the audio signal s2 (including the number of bands corresponding to the frequency components of 8 to 24 kHz);
  • the audio signal s1 is trained by the network 1, and the model parameter p1 is obtained;
  • the audio signal s2 is trained by the network 2, and the model parameter p2 is obtained by combining the model parameter p1 in the training process;
  • model parameters p1 and the model parameters p2 are combined to obtain the model parameters of the 48kHz audio;
  • the frequency-divided audio signals are not limited to two, and the frequency components included in each audio signal are not limited to the third embodiment, and can be set according to specific applications.
  • the frequency domain features can be used as the network input, and the time domain signal can also be used as the network input.
  • The audio noise reduction method and device disclosed in the embodiments of the present invention include acquiring a pre-sampled original audio signal to be processed, and inputting the original audio signal into a pre-trained hierarchical expansion network model for calculation to obtain a noise-reduced audio signal.
  • the invention adopts a layered expansion network model, and performs noise reduction processing according to the sampling rate and frame length of the input audio signal.
  • The method can adapt to system noise reduction at different bandwidths and complexities, solves the inflexibility and poor scalability of noise reduction for signals with different sampling rates and frame lengths, and performs noise reduction according to the various parameters of the input audio signal, which effectively preserves speech quality while improving the robustness of the network.
  • a second aspect of the embodiments of the present invention provides an audio noise reduction device, including an acquisition module for acquiring a pre-sampled original audio signal to be processed;
  • a noise reduction module for inputting the original audio signal into a pre-trained hierarchical extended network model for calculation to obtain a noise-reduced audio signal.
  • the noise reduction module includes:
  • a signal acquisition module for acquiring the audio signal to be trained
  • a model parameter calculation module for performing frequency-division processing on the audio signal to be trained at multiple preset sampling rates, inputting the frequency-divided audio signals into the network models of the frequency bands corresponding to the sampling rates for training, and obtaining model parameters of multiple frequency-division network models;
  • The corresponding frequency band is obtained according to the frame length and sampling rate of the audio signal to be trained.
  • The combination module includes: a superposition module for superimposing and combining multiple groups of the model parameters, or an input combination module for using the low-frequency network model parameters as input parameters of the high-frequency network model.
  • The frequency band is divided using the auditory-based Mel frequency analysis technique, the Bark domain, or an ERB-scale division technique.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

An audio noise reduction method and apparatus, comprising: acquiring a pre-sampled original audio signal to be processed; and inputting the original audio signal into a pre-trained layered extension network model for computation to obtain a noise-reduced audio signal. The audio noise reduction method and apparatus use a layered extension network model and perform noise reduction according to the sampling rate and frame length of the input audio signal. The method can adapt to system noise reduction at different bandwidths and complexities, solves the inflexibility and poor extensibility of noise reduction for signals with different sampling rates and frame lengths, performs noise reduction according to the various parameters of the input audio signal, and effectively preserves audio quality while improving the robustness of the network.

Description

Audio noise reduction method and apparatus
Technical Field
Embodiments of the present invention relate to the technical field of audio noise processing, and in particular to an audio noise reduction method and apparatus.
Background Art
In most audio- and speech-related applications, such as human-machine interfaces, hands-free communication, voice over IP (VoIP), hearing aids, teleconferencing or remote collaboration systems, the signal picked up by the microphone (usually speech) is contaminated by noise. This noise severely degrades speech quality and thus the performance of speech applications, so the microphone signal must be noise-reduced before being stored, analyzed, transmitted or played back. Noise suppression reduces the stationary and non-stationary noise in the speech signal, thereby increasing the signal-to-noise ratio, improving speech intelligibility and reducing listening fatigue.
Traditional noise reduction algorithms lack effective solutions for low signal-to-noise ratios and non-stationary noise. In recent years, with the development of deep learning, neural networks have been applied to many scenarios such as speech enhancement, speech recognition, voiceprint recognition and far-field voice interaction, and studies have shown that deep neural networks can improve a system's robustness to different environments. A new generation of artificial-intelligence methods, which use recurrent neural networks (RNNs) to model the temporal information of audio signals or convert the one-dimensional speech signal into a two-dimensional spectrogram modeled with multi-layer convolutional neural networks (CNNs), has made substantial progress. Generative adversarial networks, a major breakthrough of 2017, have also been applied to speech enhancement. Compared with traditional methods such as spectral subtraction, minimum mean-square error estimation and Wiener filtering, neural networks can effectively exploit contextual information and are markedly better at handling non-stationary noise.
Traditional noise reduction estimates the noise energy of the input audio signal and uses it for noise suppression to obtain a noise-reduced audio signal, as shown in FIG. 1. Such an algorithm performs well on stationary noise, but it requires a certain noise-convergence time, and its noise reduction is not obvious for non-stationary noise or at low signal-to-noise ratios.
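The patent does not spell out the internals of the blocks in FIG. 1; purely as an illustration of the "noise energy estimation followed by suppression" pipeline, the following is a minimal spectral-subtraction-style sketch in which the recursive noise tracker, the gain floor and all names are assumptions rather than the patented method.

```python
import numpy as np

def traditional_denoise(frames, alpha=0.98, floor=0.05):
    """Minimal spectral-subtraction sketch of the FIG. 1 pipeline (illustrative).

    frames: iterable of time-domain frames (1-D numpy arrays).
    The noise power spectrum is tracked with a slow recursive average
    (a stand-in for the "noise energy estimation" block); each frame's
    spectrum is then attenuated and reconstructed by inverse FFT.
    """
    noise_psd = None
    out = []
    for frame in frames:
        spec = np.fft.rfft(frame)
        psd = np.abs(spec) ** 2
        if noise_psd is None:
            noise_psd = psd  # assume the first frame is noise-only
        else:
            noise_psd = alpha * noise_psd + (1 - alpha) * np.minimum(psd, noise_psd)
        gain = np.maximum(1.0 - noise_psd / np.maximum(psd, 1e-12), floor)
        out.append(np.fft.irfft(gain * spec, n=len(frame)))
    return out
```

The slow convergence of such a noise tracker is exactly the "noise convergence time" drawback noted above.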
Neural-network-based noise reduction mainly trains a network on training audio (containing both clean and noisy speech) to obtain different network model parameters; the noisy input speech is then suppressed with the corresponding noise reduction model parameters to obtain noise-reduced audio, as shown in FIG. 2.
Traditional noise reduction algorithms have a certain noise-convergence time, their effect on non-stationary noise and at low signal-to-noise ratios is limited, and in some noise environments (such as fan noise) they perform poorly. When a neural-network-based noise reduction algorithm processes audio signals with different sampling rates, different network models must be selected according to the sampling rate and frame length, which is very inflexible in practice. Real audio systems, especially the call function, usually need to support different sampling rates and frame lengths in order to accommodate different bandwidths and complexities; fixed sampling rates and frame lengths are very inflexible and hard to extend in real systems, which limits the deployment and implementation of many related products.
Summary of the Invention
To this end, embodiments of the present invention provide an audio noise reduction method and apparatus. The method does not need to train different models for input signals with different sampling rates and frame lengths, and the base layer and extension layer of the model can be switched dynamically at the inference stage. Based on this improved network training model, input audio signals with different frame lengths and sampling rates are noise-reduced, which allows flexible noise reduction for the audio system and solves the inflexibility and poor extensibility of neural-network noise reduction for signals with different sampling rates and frame lengths. The specific technical solutions are as follows:
According to a first aspect of the embodiments of the present invention, an audio noise reduction method is provided, including:
acquiring a pre-sampled original audio signal to be processed;
inputting the original audio signal into a pre-trained layered extension network model for computation to obtain a noise-reduced audio signal.
Further, training the layered extension network model includes:
acquiring an audio signal to be trained;
performing frequency-division processing on the audio signal to be trained at multiple preset sampling rates, and inputting the frequency-divided audio signals into the network models of the frequency bands corresponding to the sampling rates for training, to obtain model parameters of multiple frequency-division network models;
combining the model parameters of the multiple frequency-division network models to obtain the layered extension network model.
Further, the corresponding frequency bands are obtained according to the frame length and sampling rate of the audio signal to be trained.
Further, combining the model parameters of the multiple frequency-division network models includes:
superimposing and combining multiple groups of the model parameters, or using the parameters of the low-frequency network model as input parameters of the high-frequency network model.
Further, the frequency bands are divided using the auditory-based Mel frequency analysis technique, the Bark domain, or an ERB-scale division technique.
Another aspect of the present invention further provides an audio noise reduction apparatus, including an acquisition module for acquiring a pre-sampled original audio signal to be processed;
and a noise reduction module for inputting the original audio signal into a pre-trained layered extension network model for computation to obtain a noise-reduced audio signal.
Further, the noise reduction module includes:
a signal acquisition module for acquiring an audio signal to be trained;
a model parameter calculation module for performing frequency-division processing on the audio signal to be trained at multiple preset sampling rates, inputting the frequency-divided audio signals into the network models of the frequency bands corresponding to the sampling rates for training, and obtaining model parameters of multiple frequency-division network models;
and a combination module for combining the model parameters of the multiple frequency-division network models to obtain the layered extension network model.
Further, the corresponding frequency bands are obtained according to the frame length and sampling rate of the audio signal to be trained.
Further, the combination module includes:
a superposition module for superimposing and combining multiple groups of the model parameters, or an input combination module for using the low-frequency network model parameters as input parameters of the high-frequency network model.
Further, the frequency bands are divided using the auditory-based Mel frequency analysis technique, the Bark domain, or an ERB-scale division technique.
The embodiments of the present invention have the following advantages:
The audio noise reduction method provided in Embodiment 1 of the present invention uses a layered extension network model to perform noise reduction on input audio signals of different frame lengths and sampling rates, which improves the efficiency of audio noise reduction, allows flexible noise reduction for the audio system, and solves the inflexibility and poor extensibility of neural-network noise reduction for signals with different sampling rates and frame lengths.
Brief Description of the Drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required for the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely exemplary, and other drawings can be derived from them by those of ordinary skill in the art without creative effort.
The structures, proportions and sizes depicted in this specification are only intended to accompany the disclosure for the understanding and reading of those familiar with the art, and are not intended to limit the conditions under which the present invention can be implemented; they therefore have no substantive technical significance. Any structural modification, change of proportion or adjustment of size shall still fall within the scope of the technical content disclosed by the present invention, provided that it does not affect the effects the present invention can produce or the objectives it can achieve.
FIG. 1 is a flowchart of a traditional noise reduction method based on noise energy estimation;
FIG. 2 is a flowchart of a traditional deep-learning-based neural network noise reduction method;
FIG. 3 is a flowchart of the optimized noise reduction method provided in Embodiment 1 of the present invention;
FIG. 4 is an exemplary flowchart of the optimized noise reduction method provided in Embodiment 2 of the present invention;
FIG. 5 is an exemplary flowchart of the optimized noise reduction method provided in Embodiment 3 of the present invention.
Detailed Description of the Embodiments
The embodiments of the present invention are described below by way of specific examples, and those familiar with the art can readily understand other advantages and effects of the present invention from the contents disclosed in this specification. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
In order to facilitate flexible noise reduction for an audio system and to solve the inflexibility and poor extensibility of neural-network noise reduction for signals with different sampling rates and frame lengths, the present invention provides an audio noise reduction method, which includes:
acquiring a pre-sampled original audio signal to be processed;
inputting the original audio signal into a pre-trained layered extension network model for computation to obtain a noise-reduced audio signal.
The original audio signal may be referred to simply as the input audio, and the layered extension network model is a noise processing algorithm that can perform noise reduction (also called noise suppression) on input audio signals with different sampling rates and frame lengths. Specifically, FIG. 3 is a flowchart of the optimized noise reduction method provided in Embodiment 1 of the present invention; it describes the training of the layered extension network model in detail, including the steps of:
acquiring an audio signal to be trained;
performing frequency-division processing on the audio signal to be trained at multiple preset sampling rates, and inputting the frequency-divided audio signals into the network models of the frequency bands corresponding to the sampling rates for training, to obtain model parameters of multiple frequency-division network models;
combining the model parameters of the multiple frequency-division network models to obtain the layered extension network model.
Specifically, during training of the layered extension network model, a fast Fourier transform (FFT) is performed according to the frame length of the input audio (for example, a 16 kHz sampling rate with a 256-point frame length) to obtain 257 complex FFT coefficients. The FFT coefficients are converted into energies through triangular windows and mapped onto a unified set of non-linear sub-bands matching the auditory system, for example 40 Mel bands. The training audio is then frequency-divided according to the sampling rate, and the corresponding networks are called for training to obtain the model parameters of the frequency-division network models; these model parameters are then combined to obtain the model parameters of the layered extension network model, and finally the layered extension network model is obtained. There are multiple such frequency-division network models; in the embodiments of the present invention the multiple groups of network models are denoted network 1, network 2, ..., network N.
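As a rough illustration of the band-feature step just described (a sketch under stated assumptions, not the patent's implementation): the 257 complex coefficients for a 256-point frame imply a 512-point FFT, for example over a zero-padded or two-frame window, and the hand-rolled triangular Mel filterbank below is illustrative.

```python
import numpy as np

def frame_to_band_features(frame, sample_rate=16000, n_fft=512, n_bands=40):
    """Map one audio frame to Mel-band energies (illustrative sketch)."""
    spec = np.fft.rfft(frame, n=n_fft)        # 257 one-sided complex coefficients
    power = np.abs(spec) ** 2

    # Triangular windows on the Mel scale (hand-rolled to stay self-contained).
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_bands + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)

    bands = np.zeros(n_bands)
    for b in range(n_bands):
        lo, mid, hi = hz_edges[b], hz_edges[b + 1], hz_edges[b + 2]
        rising = (bin_freqs - lo) / max(mid - lo, 1e-9)
        falling = (hi - bin_freqs) / max(hi - mid, 1e-9)
        weights = np.clip(np.minimum(rising, falling), 0.0, None)
        bands[b] = np.sum(weights * power)    # energy of Mel band b
    return bands
```

The resulting band vector can then be split by frequency, for example with the bands covering 0-8 kHz routed to the base-layer network and the remaining bands to the extension layer.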
In model training, the low-frequency part usually serves as the base layer of the network and is given more model parameters, because the low-frequency part contains more and richer speech information. The high-frequency part is defined as the extension layer; a network is built for the characteristics of the extension layer and is given fewer model parameters.
It should be noted that the corresponding frequency bands are obtained according to the frame length and sampling rate of the audio signal to be trained, and the frequency bands are divided using the auditory-based Mel frequency analysis technique, the Bark domain, or an ERB-scale division technique.
In Embodiment 2 of the present invention, combining the model parameters of the multiple frequency-division network models includes superimposing and combining multiple groups of the model parameters; alternatively, in Embodiment 3 of the present invention, the parameters of the low-frequency network model are used as input parameters of the high-frequency network model.
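The following is a minimal sketch of how such a base-layer/extension-layer model could be organized, assuming GRU sub-networks that output per-band gain masks; the band counts, layer sizes and the gain-mask formulation are illustrative assumptions and are not taken from the patent. `cascade=False` corresponds to the superposition-style combination of Embodiment 2, while `cascade=True` feeds the base network's output into the extension network as in Embodiment 3.

```python
import torch
import torch.nn as nn

class LayeredExtensionDenoiser(nn.Module):
    """Sketch of a base-layer (network 1) / extension-layer (network 2) denoiser."""

    def __init__(self, n_low=22, n_high=18, cascade=False):
        super().__init__()
        self.cascade = cascade
        # Base layer: larger, since low frequencies carry richer speech information.
        self.base = nn.GRU(n_low, 128, num_layers=2, batch_first=True)
        self.base_out = nn.Linear(128, n_low)            # per-band gains for s1
        # Extension layer: smaller, for the high-frequency bands.
        ext_in = n_high + (n_low if cascade else 0)
        self.ext = nn.GRU(ext_in, 32, batch_first=True)
        self.ext_out = nn.Linear(32, n_high)             # per-band gains for s2

    def forward(self, low_bands, high_bands=None):
        h_low, _ = self.base(low_bands)
        g_low = torch.sigmoid(self.base_out(h_low))
        if high_bands is None:          # e.g. 16 kHz input: base layer only
            return g_low, None
        ext_in = torch.cat([high_bands, g_low], dim=-1) if self.cascade else high_bands
        h_high, _ = self.ext(ext_in)
        g_high = torch.sigmoid(self.ext_out(h_high))
        return g_low, g_high
```

Giving the base (low-frequency) sub-network more parameters than the extension sub-network mirrors the design choice stated above.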
Referring to FIG. 4 and FIG. 5, FIG. 4 is an exemplary flowchart of the optimized noise reduction method provided in Embodiment 2 of the present invention, and FIG. 5 is an exemplary flowchart of the optimized noise reduction method provided in Embodiment 3 of the present invention.
The methods of Embodiment 2 and Embodiment 3 are described below by way of example for a 48 kHz sampled audio signal.
Example of Embodiment 2
The layered extension network model is trained for the 48 kHz sampled audio signal; the audio signal is divided into s1 and s2, and different network layers are used for training.
Specific scheme:
For the 48 kHz audio signal, the frame length is set to 512 or 1024. Different systems have different frame lengths, and the different frame lengths can be mapped by banding onto a unified, fixed number of bands, for example the commonly used Mel bands;
the audio signal is frequency-divided according to the sampling rate to obtain an audio signal s1 (containing the bands corresponding to the 0-8 kHz frequency components) and an audio signal s2 (containing the bands corresponding to the 8-24 kHz frequency components);
the audio signal s1 is trained with network 1 to obtain model parameters p1, and the audio signal s2 is trained with network 2 to obtain model parameters p2;
the model parameters p1 and the model parameters p2 are combined to obtain the model parameters for 48 kHz audio;
noise suppression is performed on the input 48 kHz audio signal to obtain noise-reduced audio. For 16 kHz input audio, only the model parameters p1 are used. In a specific implementation, the number of frequency-divided audio signals is not limited to two, and the frequency components contained in each audio signal are not limited to those of Embodiment 2; they can be set according to the specific application. During network training, either frequency-domain features or the time-domain signal can be used as the network input.
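As a hedged illustration of the sampling-rate-dependent switching described in this example, reusing the `LayeredExtensionDenoiser` sketched earlier; the band layout and the way the gains are applied are assumptions.

```python
def denoise_bands(model, low_bands, high_bands, sample_rate):
    """Illustrative dynamic switching between the base and extension layers."""
    if sample_rate <= 16000:
        # 16 kHz input: only the base layer (model parameters p1) is used.
        g_low, _ = model(low_bands)
        return g_low * low_bands, None
    # 48 kHz input: the base layer and the extension layer are both used.
    g_low, g_high = model(low_bands, high_bands)
    return g_low * low_bands, g_high * high_bands
```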
Example of Embodiment 3
The layered extension network model is trained for the 48 kHz sampled audio signal; the audio signal is divided into s1 and s2, different network layers are used for training, and the output of network 1 is used as an input of network 2 during training.
For the 48 kHz audio signal, the frame length is set to 512 or 1024. Different systems have different frame lengths, and the different frame lengths can be mapped by banding onto a unified, fixed number of bands, for example the commonly used Mel bands;
the audio signal is frequency-divided according to the sampling rate to obtain an audio signal s1 (containing the bands corresponding to the 0-8 kHz frequency components) and an audio signal s2 (containing the bands corresponding to the 8-24 kHz frequency components);
the audio signal s1 is trained with network 1 to obtain model parameters p1; the audio signal s2 is trained with network 2, combining the model parameters p1 during training, to obtain model parameters p2;
the model parameters p1 and the model parameters p2 are combined to obtain the model parameters for 48 kHz audio;
noise suppression is performed on the input 48 kHz audio signal to obtain noise-reduced audio. For 16 kHz input audio, only the model parameters p1 are used.
In a specific implementation, the number of frequency-divided audio signals is not limited to two, and the frequency components contained in each audio signal are not limited to those of Embodiment 3; they can be set according to the specific application. During network training, either frequency-domain features or the time-domain signal can be used as the network input.
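To make the Embodiment 3 cascade concrete, the following is a two-stage training sketch under stated assumptions: the data loader, the ideal-band-gain targets and the mean-square-error loss are illustrative choices rather than details specified by the patent, and it reuses the `LayeredExtensionDenoiser` sketch above with `cascade=True`.

```python
import torch
import torch.nn as nn

def train_embodiment3(model, loader, epochs=1, lr=1e-3):
    """Train network 1 (base layer) first, then network 2 (extension layer)
    with the base output fed in, mirroring the Embodiment 3 cascade.

    `loader` is assumed to yield (low_bands, high_bands, low_target, high_target)
    batches of noisy band features and ideal band gains.
    """
    mse = nn.MSELoss()

    # Stage 1: train the base layer, yielding model parameters p1.
    opt = torch.optim.Adam(
        list(model.base.parameters()) + list(model.base_out.parameters()), lr=lr)
    for _ in range(epochs):
        for low, _, low_tgt, _ in loader:
            g_low, _ = model(low)
            loss = mse(g_low, low_tgt)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Stage 2: freeze the base layer and train the extension layer, yielding p2.
    for p in model.base.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(
        list(model.ext.parameters()) + list(model.ext_out.parameters()), lr=lr)
    for _ in range(epochs):
        for low, high, _, high_tgt in loader:
            _, g_high = model(low, high)
            loss = mse(g_high, high_tgt)
            opt.zero_grad()
            loss.backward()
            opt.step()
```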
The embodiments of the present invention have the following advantages:
The audio noise reduction method and apparatus disclosed in the embodiments of the present invention include acquiring a pre-sampled original audio signal to be processed, and inputting the original audio signal into a pre-trained layered extension network model for computation to obtain a noise-reduced audio signal. The present invention uses a layered extension network model and performs noise reduction according to the sampling rate and frame length of the input audio signal. The method can adapt to system noise reduction at different bandwidths and complexities, solves the inflexibility and poor extensibility of noise reduction for signals with different sampling rates and frame lengths, performs noise reduction according to the various parameters of the input audio signal, and effectively preserves speech quality while improving the robustness of the network.
According to a second aspect of the embodiments of the present invention, an audio noise reduction apparatus is provided, including an acquisition module for acquiring a pre-sampled original audio signal to be processed;
and a noise reduction module for inputting the original audio signal into a pre-trained layered extension network model for computation to obtain a noise-reduced audio signal.
Further, the noise reduction module includes:
a signal acquisition module for acquiring an audio signal to be trained;
a model parameter calculation module for performing frequency-division processing on the audio signal to be trained at multiple preset sampling rates, inputting the frequency-divided audio signals into the network models of the frequency bands corresponding to the sampling rates for training, and obtaining model parameters of multiple frequency-division network models;
and a combination module for combining the model parameters of the multiple frequency-division network models to obtain the layered extension network model.
Further, the corresponding frequency bands are obtained according to the frame length and sampling rate of the audio signal to be trained.
Further, the combination module includes:
a superposition module for superimposing and combining multiple groups of the model parameters, or an input combination module for using the low-frequency network model parameters as input parameters of the high-frequency network model.
Further, the frequency bands are divided using the auditory-based Mel frequency analysis technique, the Bark domain, or an ERB-scale division technique.
Although the present invention has been described in detail above with general statements and specific embodiments, it will be apparent to those skilled in the art that modifications or improvements can be made on the basis of the present invention. Therefore, such modifications or improvements made without departing from the spirit of the present invention all fall within the scope of protection claimed by the present invention.

Claims (10)

  1. An audio noise reduction method, characterized by comprising:
    acquiring a pre-sampled original audio signal to be processed;
    inputting the original audio signal into a pre-trained layered extension network model for computation to obtain a noise-reduced audio signal.
  2. The method according to claim 1, characterized in that training the layered extension network model comprises:
    acquiring an audio signal to be trained;
    performing frequency-division processing on the audio signal to be trained at multiple preset sampling rates, and inputting the frequency-divided audio signals into the network models of the frequency bands corresponding to the sampling rates for training, to obtain model parameters of multiple frequency-division network models;
    combining the model parameters of the multiple frequency-division network models to obtain the layered extension network model.
  3. The method according to claim 2, characterized in that the corresponding frequency bands are obtained according to the frame length and sampling rate of the audio signal to be trained.
  4. The method according to claim 2, characterized in that combining the model parameters of the multiple frequency-division network models comprises:
    superimposing and combining multiple groups of the model parameters, or using the parameters of the low-frequency network model as input parameters of the high-frequency network model.
  5. The method according to claim 2, characterized in that the frequency bands are divided using the auditory-based Mel frequency analysis technique, the Bark domain, or an ERB-scale division technique.
  6. An audio noise reduction apparatus, characterized by comprising an acquisition module for acquiring a pre-sampled original audio signal to be processed;
    and a noise reduction module for inputting the original audio signal into a pre-trained layered extension network model for computation to obtain a noise-reduced audio signal.
  7. The apparatus according to claim 6, characterized in that the noise reduction module comprises:
    a signal acquisition module for acquiring an audio signal to be trained;
    a model parameter calculation module for performing frequency-division processing on the audio signal to be trained at multiple preset sampling rates, inputting the frequency-divided audio signals into the network models of the frequency bands corresponding to the sampling rates for training, and obtaining model parameters of multiple frequency-division network models;
    and a combination module for combining the model parameters of the multiple frequency-division network models to obtain the layered extension network model.
  8. The apparatus according to claim 7, characterized in that the corresponding frequency bands are obtained according to the frame length and sampling rate of the audio signal to be trained.
  9. The apparatus according to claim 7, characterized in that the combination module comprises:
    a superposition module for superimposing and combining multiple groups of the model parameters, or an input combination module for using the low-frequency network model parameters as input parameters of the high-frequency network model.
  10. The apparatus according to claim 7, characterized in that the frequency bands are divided using the auditory-based Mel frequency analysis technique, the Bark domain, or an ERB-scale division technique.
PCT/CN2020/109052 2020-08-11 2020-08-14 Audio noise reduction method and apparatus WO2022032608A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020207026990A KR20200128684A (ko) 2020-08-11 2020-08-14 Audio noise reduction method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010803652.4 2020-08-11
CN202010803652.4A CN111916103B (zh) 2020-08-11 2020-08-11 Audio noise reduction method and apparatus

Publications (1)

Publication Number Publication Date
WO2022032608A1 true WO2022032608A1 (zh) 2022-02-17

Family

ID=73284160

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/109052 WO2022032608A1 (zh) 2020-08-11 2020-08-14 一种音频降噪方法和装置

Country Status (2)

Country Link
CN (1) CN111916103B (zh)
WO (1) WO2022032608A1 (zh)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101014997A (zh) * 2004-02-18 2007-08-08 皇家飞利浦电子股份有限公司 Method and system for generating training data for an automatic speech recognizer
US20090012785A1 (en) * 2007-07-03 2009-01-08 General Motors Corporation Sampling rate independent speech recognition
CN105513590A (zh) * 2015-11-23 2016-04-20 百度在线网络技术(北京)有限公司 Speech recognition method and apparatus
CN108510979A (zh) * 2017-02-27 2018-09-07 芋头科技(杭州)有限公司 Training method for a mixed-frequency acoustic recognition model and speech recognition method
CN108922560A (zh) * 2018-05-02 2018-11-30 杭州电子科技大学 Urban noise recognition method based on a hybrid deep neural network model
US20200066296A1 (en) * 2018-08-21 2020-02-27 2Hz, Inc Speech Enhancement And Noise Suppression Systems And Methods
CN111223493A (zh) * 2020-01-08 2020-06-02 北京声加科技有限公司 Speech signal noise reduction processing method, microphone and electronic device
CN111261183A (zh) * 2018-12-03 2020-06-09 珠海格力电器股份有限公司 Speech denoising method and apparatus
CN111429930A (zh) * 2020-03-16 2020-07-17 云知声智能科技股份有限公司 Noise reduction model processing method and system based on adaptive sampling rate

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10991379B2 (en) * 2018-06-22 2021-04-27 Babblelabs Llc Data driven audio enhancement

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101014997A (zh) * 2004-02-18 2007-08-08 皇家飞利浦电子股份有限公司 Method and system for generating training data for an automatic speech recognizer
US20090012785A1 (en) * 2007-07-03 2009-01-08 General Motors Corporation Sampling rate independent speech recognition
CN105513590A (zh) * 2015-11-23 2016-04-20 百度在线网络技术(北京)有限公司 Speech recognition method and apparatus
CN108510979A (zh) * 2017-02-27 2018-09-07 芋头科技(杭州)有限公司 Training method for a mixed-frequency acoustic recognition model and speech recognition method
CN108922560A (zh) * 2018-05-02 2018-11-30 杭州电子科技大学 Urban noise recognition method based on a hybrid deep neural network model
US20200066296A1 (en) * 2018-08-21 2020-02-27 2Hz, Inc Speech Enhancement And Noise Suppression Systems And Methods
CN111261183A (zh) * 2018-12-03 2020-06-09 珠海格力电器股份有限公司 Speech denoising method and apparatus
CN111223493A (zh) * 2020-01-08 2020-06-02 北京声加科技有限公司 Speech signal noise reduction processing method, microphone and electronic device
CN111429930A (zh) * 2020-03-16 2020-07-17 云知声智能科技股份有限公司 Noise reduction model processing method and system based on adaptive sampling rate

Also Published As

Publication number Publication date
CN111916103B (zh) 2024-02-20
CN111916103A (zh) 2020-11-10

Similar Documents

Publication Publication Date Title
CN109065067B (zh) Conference terminal speech noise reduction method based on a neural network model
Li et al. On the importance of power compression and phase estimation in monaural speech dereverberation
CN107845389B (zh) Speech enhancement method based on multi-resolution auditory cepstral coefficients and a deep convolutional neural network
WO2021042870A1 (zh) Speech processing method and apparatus, electronic device, and computer-readable storage medium
CN105788607B (zh) Speech enhancement method applied to a dual-microphone array
CN108604452B (zh) Sound signal enhancement device
CN110931031A (zh) Deep-learning speech extraction and noise reduction method fusing bone-vibration sensor and microphone signals
CN111833896A (zh) Speech enhancement method, system, device and storage medium fusing a feedback signal
CN110600050A (zh) Microphone array speech enhancement method and system based on a deep neural network
AU2009203194A1 (en) Noise spectrum tracking in noisy acoustical signals
EP1526510B1 (en) Systems and methods for echo cancellation with arbitrary playback sampling rates
JP5595605B2 (ja) Audio signal restoration device and audio signal restoration method
CN111667844A (zh) Low-complexity speech enhancement device based on a microphone array
US20080219457A1 (en) Enhancement of Speech Intelligibility in a Mobile Communication Device by Controlling the Operation of a Vibrator of a Vibrator in Dependance of the Background Noise
Schröter et al. Low latency speech enhancement for hearing aids using deep filtering
Garg Speech enhancement using long short term memory with trained speech features and adaptive wiener filter
KR101850693B1 (ko) Apparatus and method for bandwidth extension of an earset with an in-ear microphone
WO2022032608A1 (zh) Audio noise reduction method and apparatus
KR20200128684A (ko) Audio noise reduction method and apparatus
WO2020110228A1 (ja) Information processing device, program, and information processing method
Zheng et al. Low-latency monaural speech enhancement with deep filter-bank equalizer
CN114023352A (zh) Speech enhancement method and apparatus based on energy-spectrum depth modulation
JP2024502287A (ja) Speech enhancement method, speech enhancement device, electronic apparatus, and computer program
CN114189781A (zh) Noise reduction method and system for dual-microphone neural-network noise-cancelling earphones
Biradar et al. Implementation of an Active Noise Cancellation Technique using Deep Learning

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 20207026990

Country of ref document: KR

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20949106

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20949106

Country of ref document: EP

Kind code of ref document: A1