WO2020042707A1 - A single-channel real-time noise reduction method based on a convolutional recurrent neural network - Google Patents

A single-channel real-time noise reduction method based on a convolutional recurrent neural network Download PDF

Info

Publication number
WO2020042707A1
WO2020042707A1 (PCT/CN2019/090530)
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
recurrent neural
convolutional
noise reduction
acoustic feature
Prior art date
Application number
PCT/CN2019/090530
Other languages
English (en)
French (fr)
Inventor
谭可
闫永杰
Original Assignee
大象声科(深圳)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 大象声科(深圳)科技有限公司
Publication of WO2020042707A1

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks

Definitions

  • the present disclosure relates to the technical field of computer applications, and in particular, to a single-channel real-time noise reduction method, device, electronic device, and storage medium based on a convolutional recurrent neural network.
  • Speech noise reduction refers to separating the target speech signal from background noise to eliminate or suppress background noise.
  • Single-channel voice is a voice signal generated by recording with only a single microphone.
  • Compared with beamforming-based noise reduction technology, that is, spatial filtering through the appropriate configuration of a microphone array,
  • single-channel voice noise reduction can be applied to a wider range of acoustic scenarios.
  • single-channel speech noise reduction is not only cost-effective, it is also easier to use in practice.
  • single-channel speech separation can be used to enhance the effects of beamforming and associated microphone arrays.
  • recently, single-channel speech noise reduction has been regarded as a kind of supervised learning, transforming the signal processing problem into a supervised learning task.
  • Signal processing methods, represented by traditional speech enhancement, are based on general statistical analysis of background noise and speech, while supervised learning methods are driven by data and can learn automatically from specific training samples. It can be said that the introduction of supervised learning methods has achieved a leap in single-channel speech noise reduction technology.
  • however, in current supervised methods the number of network parameters is large and the models are complex, which affects the real-time performance and noise reduction effect of single-channel speech noise reduction.
  • the present disclosure provides a single-channel real-time noise reduction method, device, and terminal based on a convolutional recurrent neural network.
  • a single-channel real-time noise reduction method based on a convolutional recurrent neural network including:
  • the step of extracting an acoustic feature from the received single-channel sound signal includes:
  • the step of normalizing the spectral amplitude vector to form an acoustic feature includes:
  • the spectral amplitude vectors of the current time frame and past time frames are combined and normalized to form acoustic features.
  • the step of performing normalization processing on the spectral amplitude vector to form an acoustic feature includes:
  • the spectral amplitude vectors of the current time frame, past time frames, and future time frames are combined and normalized to form an acoustic feature.
  • before the step of iteratively processing the acoustic features in a pre-trained convolutional recurrent neural network model to calculate a ratio mask for the acoustic features, the method further includes:
  • the convolutional recurrent neural network is trained on the pre-collected speech training set to construct the convolutional recurrent neural network model.
  • the convolutional neural network is a convolutional encoder-decoder structure
  • the encoder includes a set of convolutional layers and pooling layers; the structure of the decoder mirrors that of the encoder in reverse order, and the output of the encoder is connected to the input of the decoder.
  • the recurrent neural network with long short-term memory includes two stacked long short-term memory layers.
  • the step of combining a convolutional neural network with a recurrent neural network with long short-term memory to obtain a convolutional recurrent neural network includes:
  • the two stacked long short-term memory layers are inserted between the encoder and the decoder of the convolutional neural network to construct the convolutional recurrent neural network.
  • each convolutional layer or pooling layer in the convolutional neural network includes at most 16 kernels, and each long short-term memory layer of the recurrent neural network with long short-term memory includes 64 neurons.
  • the voice training set is composed of background noise collected in daily environments, various types of male and female voices, and voice signals mixed at specific signal-to-noise ratios.
  • a single-channel real-time noise reduction device including:
  • An acoustic feature extraction module for extracting acoustic features from a received single-channel sound signal
  • a ratio mask calculation module configured to iteratively process the acoustic features in a pre-built convolutional recurrent neural network model to calculate the ratio mask of the acoustic features
  • a masking module configured to mask the acoustic feature by using the ratio mask
  • a speech synthesis module configured to synthesize the masked acoustic feature with the phase of the single-channel sound signal to obtain a noise-reduced speech signal.
  • an ideal ratio mask is adopted as the training target of the convolutional recurrent neural network.
  • an electronic device including:
  • at least one processor; and
  • a memory connected in communication with the at least one processor; wherein,
  • the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the method according to the first aspect.
  • a computer-readable storage medium for storing a program, which when executed causes an electronic device to perform the method as described in the first aspect.
  • during single-channel real-time noise reduction, acoustic features are extracted from the received single-channel sound signal and iteratively processed in a pre-trained convolutional recurrent neural network model to calculate a ratio mask for the acoustic features; the ratio mask is then used to mask the acoustic features, and the masked acoustic features are combined with the phase of the single-channel sound signal to obtain the speech signal. Because this scheme uses a pre-trained convolutional recurrent neural network model, it greatly reduces the number of neural network parameters, the amount of data storage, and the demand on system data bandwidth while delivering good noise reduction performance.
  • Fig. 1 is a flowchart illustrating a single-channel real-time noise reduction method based on a convolutional recurrent neural network according to an exemplary embodiment.
  • FIG. 2 is a flowchart of a specific implementation of step S110 in the single-channel real-time noise reduction method based on a convolutional recurrent neural network according to the embodiment of FIG. 1.
  • FIG. 3 is a flowchart of a specific implementation of step S120 in the single-channel real-time noise reduction method based on the convolutional recurrent neural network of FIG. 1.
  • Fig. 4 is a schematic flowchart of a single channel real-time noise reduction according to an exemplary embodiment.
  • FIG. 5 is a schematic diagram of the predicted spectral amplitude when the CRN model is not compressed.
  • FIG. 6 is a schematic diagram of the predicted spectral amplitude after compression of the CRN model.
  • Fig. 7 is a schematic diagram illustrating the comparison of STOI scores for signals processed with an LSTM model, processed with a CRN model, and left unprocessed in a multi-person conversation noise scene, according to an exemplary embodiment.
  • Fig. 8 is a schematic diagram showing the comparison of STOI scores for signals processed with an LSTM model, processed with a CRN model, and left unprocessed in a cafe noise scene, according to an exemplary embodiment.
  • FIG. 9 is a spectrogram of an unprocessed sound signal in a multi-person conversation noise scene at -5 dB SNR (signal-to-noise ratio) according to an exemplary embodiment.
  • FIG. 10 is the clean speech spectrogram corresponding to FIG. 9 according to an exemplary embodiment.
  • FIG. 11 is a spectrogram after noise reduction with the CRN model according to an exemplary embodiment.
  • Fig. 12 is a block diagram of a single-channel real-time noise reduction device according to an exemplary embodiment.
  • FIG. 13 is a block diagram of the acoustic feature extraction module 110 in the single-channel real-time noise reduction device of the embodiment corresponding to FIG. 12.
  • FIG. 14 is a block diagram of the ratio mask calculation module 120 of the embodiment corresponding to FIG. 12.
  • Fig. 1 is a flowchart illustrating a single-channel real-time noise reduction method based on a convolutional recurrent neural network according to an exemplary embodiment.
  • the single-channel real-time noise reduction method based on convolutional recurrent neural network can be used in electronic devices such as smartphones and computers.
  • the single-channel real-time noise reduction method based on a convolutional recurrent neural network may include steps S110, S120, S130, and S140.
  • step S110 an acoustic feature is extracted from the received single-channel sound signal.
  • the single-channel sound signal is the signal to be subjected to real-time noise reduction processing.
  • a single-channel sound signal typically contains speech and non-speech interfering noise.
  • When electronic devices perform single-channel voice real-time noise reduction processing, they can receive single-channel sound signals collected by recording devices such as microphones, receive single-channel sound signals sent by other electronic devices, or receive single-channel sound signals through other methods, which are not enumerated here one by one.
  • Acoustic features are data features that can characterize a single-channel sound signal.
  • an STFT (short-time Fourier transform) can be applied to the single-channel sound signal to extract acoustic features
  • acoustic features can also be extracted from the single-channel sound signal using a wavelet transform.
  • Acoustic features may also be extracted from the received single-channel sound signal in other forms.
  • step S110 may include steps S111, S112, and S113.
  • step S111 the received single-channel sound signal is divided into time frames according to a preset time period.
  • the preset time period is a preset time interval period, and a single channel sound signal is divided into multiple time frames according to the preset time period.
  • the received single-channel sound signal is divided into multiple time frames according to 20 milliseconds per frame, and each two adjacent time frames have an overlap of 10 milliseconds.
  • step S112 a spectrum amplitude vector is extracted from the time frame.
  • step S113 the spectrum amplitude vector is normalized to form an acoustic feature.
  • STFT is applied to each time frame to extract a spectral amplitude vector, and each spectral amplitude vector is subjected to a normalization process to form an acoustic feature.
  • temporal context is a characteristic of the speech signal
  • multiple consecutive frames centered on the current time frame are concatenated into a larger vector to integrate the context information and further improve single-channel speech noise reduction performance.
  • when the spectral amplitude vector is normalized, the spectral amplitude vectors of the current time frame and past time frames are combined and normalized to form an acoustic feature.
  • in applications such as mobile communication and hearing aids, the required degree of real-time operation is high, so future time frames cannot be used.
  • the spectral amplitude vectors of the current time frame and past time frames are combined for normalization processing.
  • the previous 5 frames and the current time frame are spliced into a unified feature vector as the input of the present invention.
  • the number of past time frames can also be fewer than five, which further saves computation time and improves the real-time performance of the application at the cost of some noise reduction performance.
  • the spectrum amplitude vectors of the current time frame, the past time frame, and the future time frame are combined and normalized to form an acoustic feature.
  • future time frames can be used as input.
  • STOI (Short-Time Objective Intelligibility)
  • STOI is an important indicator for evaluating speech noise reduction performance; its typical value lies between 0 and 1 and can be interpreted as the percentage of speech understood.
  • the single-channel sound signal is divided into time frames according to a preset time period in advance; by setting an appropriate time period, the acoustic features extracted from each time frame provide the input for noise reduction.
  • noise reduction performance can be improved by selectively combining the spectral amplitude vectors of the current time frame with those of past time frames and future time frames to form acoustic features.
  • Step S120: Iteratively process the acoustic features in a pre-trained convolutional recurrent neural network model to calculate a ratio mask for the acoustic features.
  • the ratio mask characterizes the relationship between a noisy speech signal and a clean speech signal; it indicates the trade-off between noise suppression and speech retention.
  • ideally, after masking with the ratio mask, the clean speech signal can be recovered from the noisy speech.
  • the convolutional recurrent neural network model is pre-trained.
  • the acoustic feature obtained in step S110 is used as input to the convolutional recurrent neural network model, and iterative operations are performed in the model to calculate the ratio mask for the acoustic feature.
  • an IRM (Ideal Ratio Mask) is used as the target of the iterative operations.
  • the IRM of each time-frequency unit in the spectrogram can be expressed by the following equation: IRM(t, f) = √( S_FFT(t, f)² / ( S_FFT(t, f)² + N_FFT(t, f)² ) )
  • S_FFT(t, f) and N_FFT(t, f) represent the clean speech amplitude spectrum and the noise amplitude spectrum at time frame t and frequency bin f, respectively.
  • the ideal ratio mask is predicted during the supervised training process, and the acoustic features are then masked with the ratio mask to obtain the noise-reduced speech signal.
  • step S130: the acoustic features are masked using the ratio mask.
  • step S140 the masked acoustic feature is synthesized with the phase of the single-channel sound signal to obtain a noise-reduced speech signal.
  • the trained CRN (Convolutional Recurrent Network, i.e., convolutional recurrent neural network) can then be used for speech noise reduction applications.
  • the use of a trained neural network for a specific application is called inference or operation.
  • during inference, each layer of the CRN model processes the noisy signal.
  • the T-F mask is derived from the results of the inference and then used to weight (or mask) the noisy speech amplitudes to produce an enhanced speech signal that is clearer than the original noisy input.
  • the masked spectral amplitude vector is sent to the inverse Fourier transform together with the phase of the single-channel sound signal to derive the speech signal in the corresponding time domain.
  • during single-channel real-time noise reduction, acoustic features are extracted from the received single-channel sound signal and iteratively processed in a pre-trained convolutional recurrent neural network model to calculate the ratio mask of the acoustic features; this ratio mask is then used to mask the acoustic features. The masked acoustic features are then synthesized with the phase of the single-channel sound signal to obtain the speech signal.
  • because a pre-trained convolutional recurrent neural network model is used, this solution greatly reduces the number of neural network parameters, the amount of data storage, and the demand on system data bandwidth; it can greatly improve the real-time performance of single-channel voice noise reduction while achieving good noise reduction performance.
  • Fig. 3 is a flowchart of a specific implementation of step S120 in the single-channel real-time noise reduction method based on a convolutional recurrent neural network according to the embodiment corresponding to Fig. 1.
  • the single-channel real-time noise reduction method based on a convolutional recurrent neural network may include steps S121 and S122.
  • step S121 a convolutional neural network is combined with a recurrent neural network with long and short-term memory to obtain a convolutional recurrent neural network.
  • A CNN (Convolutional Neural Network) is an efficient recognition method that has attracted wide attention in recent years.
  • While studying neurons used for local sensitivity and direction selection in the cerebral cortex of cats, Hubel and Wiesel discovered that their unique network structure could effectively reduce the complexity of feedback neural networks, and neural structures similar to the convolutional neural network were subsequently proposed.
  • CNN has become one of the research hotspots in many scientific fields, especially in the field of pattern classification; since the network avoids complex pre-processing of images and can directly take original images as input, it has been widely used.
  • the neocognitron proposed by K. Fukushima in 1980 was the first implementation of a convolutional neural network; a representative later version is the one proposed by LeCun et al.
  • the convolution operation in a CNN is defined on a two-dimensional structure: each low-level feature in a local receptive field depends only on a subset of the input, such as a topological neighborhood.
  • the topological locality constraints in the convolutional layer make the weight matrix very sparse, so the two layers connected by the convolution operation are only locally connected. Calculating such a matrix multiplication is more convenient and efficient than calculating a dense matrix multiplication, and a smaller number of free parameters is statistically advantageous. In an image with a two-dimensional topology, the same input pattern will appear at different locations, and nearby values are more likely to have stronger dependencies, which is very important for modeling the data.
  • CNN uses weight sharing to reduce the number of parameters to be learned, lowering the complexity of the model and greatly reducing the number of network weights. Compared with the general feed-forward BP algorithm (Error Back Propagation), training speed and accuracy are greatly improved. As a deep learning algorithm, CNN can minimize the overhead of data preprocessing.
  • A recurrent neural network (RNN) with long short-term memory (LSTM) is a type of time-recursive neural network, first published in 1997; owing to its unique design, the LSTM is suited to processing and predicting important events with very long intervals and delays in a time series.
  • LSTMs usually perform better than plain time-recursive neural networks and Hidden Markov Models (HMMs), for example in unsegmented continuous handwriting recognition.
  • In 2009, an artificial neural network model built with LSTM won the ICDAR handwriting recognition competition.
  • LSTM is also commonly used for automatic speech recognition.
  • In 2013, a record 17.7% error rate was reached on the TIMIT natural speech database.
  • LSTM can be used as a complex non-linear unit to construct larger deep neural networks.
  • LSTM is a specific type of RNN that can effectively capture the long-term dependencies of sound signals. Compared with the traditional RNN, LSTM mitigates the vanishing or exploding gradient problems that arise over time during training.
  • the memory cell of the LSTM block has three gates: an input gate, a forget gate, and an output gate. The input gate controls how much current information should be added to the memory cell, the forget gate controls how much previous information should be retained, and the output gate controls whether to output information.
  • the LSTM can be described by a mathematical formula as follows.
  • i_t = σ(W_ii x_t + b_ii + W_hi h_{t-1} + b_hi)
  • f_t = σ(W_if x_t + b_if + W_hf h_{t-1} + b_hf)
  • g_t = tanh(W_ig x_t + b_ig + W_hg h_{t-1} + b_hg)
  • o_t = σ(W_io x_t + b_io + W_ho h_{t-1} + b_ho)
  • c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
  • h_t = o_t ⊙ tanh(c_t)
  • i_t, f_t, and o_t are the outputs of the input gate, forget gate, and output gate, respectively.
  • x_t and h_t represent the input features and hidden activations at time t, respectively.
  • g_t and c_t denote the block input and the memory cell, respectively.
  • σ denotes the sigmoid function, σ(x) = 1/(1 + e^{-x}), and tanh the hyperbolic tangent.
  • the symbol ⊙ indicates element-wise multiplication.
  • the input and forget gates are calculated from the previous activations and the current input, and context-dependent updates are performed on the memory cell based on the input gate and forget gate.
  • When trained for speech denoising, the LSTM stores the relevant context for mask prediction at the current moment.
  • a convolutional recurrent neural network is obtained by combining a convolutional neural network with a recurrent neural network with long and short-term memory.
  • CRN has the characteristics of both CNN and LSTM, so that it can efficiently perform voice denoising, greatly reduce the number of neural network parameters of the CRN network, and effectively reduce the size of the CRN.
  • step S122 a pre-collected speech training set is trained by a convolutional recurrent neural network to construct a convolutional recurrent neural network model.
  • the convolutional recurrent neural network is trained on the pre-collected speech training set, and its network parameters are adjusted so that its output approximates the IRM, thereby constructing the convolutional recurrent neural network model.
  • Fig. 4 is a schematic flowchart of a single channel real-time noise reduction according to an exemplary embodiment.
  • the input is a sound signal and the output is a noise-reduced voice signal.
  • the dashed arrows in the figure represent the steps involved during training, the dotted arrows indicate the steps of the prediction phase, and the solid arrows represent the steps shared by training and prediction.
  • the present invention uses an ideal ratio mask (IRM) as the training target. The IRM is obtained by comparing the STFT of the noisy speech signal with that of the corresponding clean speech signal.
  • the RNN with LSTM estimates the ideal ratio mask of each input noisy utterance, and the mean-squared error (MSE) between the ideal ratio mask and the estimated ratio mask is then calculated. After repeated rounds of iteration, the MSE over the entire training set is minimized, and the training samples are used only once in each iteration.
  • the prediction phase is then entered, that is, the trained convolutional recurrent neural network model is used to directly denoise the input sound signal. Specifically, the trained convolutional recurrent neural network model processes the input sound signal and calculates the ratio mask, then uses the calculated ratio mask to process the input sound signal, and finally re-synthesizes the noise-reduced speech signal.
  • the output of the top deconvolutional layer is passed through a sigmoid-shaped function (see Figure 4) to obtain the predicted ratio mask, which is then compared with the IRM; the comparison generates an MSE error used to adjust the CRN weights.
  • the convolutional neural network is a convolutional encoder-decoder structure, including a convolutional encoder and a corresponding decoder.
  • the encoder includes a set of convolutional layers and pooling layers for extracting high-level features from the input; the structure of the decoder mirrors that of the encoder in reverse order, and it can map the low-resolution features at the encoder output to feature maps of the full input size.
  • the number of kernels remains symmetric: the number of kernels gradually increases in the encoder and gradually decreases in the decoder.
  • a symmetric encoder-decoder architecture ensures that the output has the same shape as the input.
  • the recurrent neural network with long short-term memory includes two stacked long short-term memory layers.
  • a convolutional recurrent neural network is constructed by inserting two stacked long short-term memory layers between the encoder and decoder of the convolutional neural network to model the temporal dynamics of the sound signal.
  • each convolutional layer or pooling layer in the CRN includes a maximum of 16 kernels.
  • each long short-term memory layer of the recurrent neural network with long short-term memory includes 64 neurons.
  • compared with the unprocessed mixture, the compressed CRN model can still greatly improve the STOI score; compared with the uncompressed CRN model, compression greatly reduces the model size and the data storage of the CRN model with little impact on its noise reduction performance.
  • a convolutional recurrent neural network is obtained by combining a convolutional neural network with a recurrent neural network with long short-term memory, so that the convolutional recurrent neural network model inherits the small parameter count of the convolutional neural network; while ensuring good noise reduction performance, it greatly improves the real-time performance of single-channel voice noise reduction.
  • the voice training set of this exemplary embodiment is composed of a large amount of background noise collected in daily environments, various types of male and female voices, and human voices and noise mixed at specific signal-to-noise ratios (SNRs).
  • the speech training set contains a large amount of noise (about 1000 hours) and numerous speech segments, and the entire speech training set spans several hundred hours, ensuring that the CRN model is fully trained.
  • FIG. 7 is a schematic diagram showing, in a multi-person conversation noise scenario, the comparison of STOI scores for signals processed with an LSTM model, processed with a CRN model, and left unprocessed, according to an exemplary embodiment.
  • FIG. 8 shows the corresponding comparison of STOI scores for signals processed with an LSTM model, processed with a CRN model, and left unprocessed in a cafe noise scene.
  • the encoder of the CRN model used in this embodiment has five convolutional layers, the decoder has five deconvolutional layers, and there are two LSTM layers between the encoder and the decoder.
  • compared with the unprocessed noisy sound signal, the CRN-based method proposed in this embodiment greatly improves the STOI score.
  • at -5 dB SNR, the STOI score is improved by about 20 percentage points; at 5 dB SNR, the STOI score is improved by about 10 percentage points.
  • Figures 7 and 8 also show that the performance of this method is consistently better than that of the LSTM-based RNN model, and the STOI improvement is more pronounced at lower SNRs.
  • Figure 9 shows the spectrogram of an unprocessed sound signal in a multi-person conversation noise scenario at -5 dB SNR;
  • Figure 10 shows the spectrogram of the corresponding clean utterance;
  • Figure 11 shows the spectrogram after noise reduction using the CRN model.
  • Figures 9, 10, and 11 show that the noise-reduced speech signal is far closer to clean speech than the noisy sound signal of the multi-person conversation scene.
  • the single-channel noise reduction of the present invention refers to processing a signal collected by a single microphone; compared with beamforming microphone-array noise reduction methods, single-channel noise reduction has broader applicability.
  • the invention adopts a supervised learning method for speech noise reduction, using a convolutional recurrent neural network model to predict the ratio mask of the sound signal.
  • the speech training set used by the convolutional recurrent neural network model proposed by the present invention contains a large amount of noise (about 1000 hours) and numerous speech segments, and the entire speech training set spans several hundred hours, ensuring that the CRN model is fully trained and that single-channel real-time
  • noise reduction can be implemented without depending on future time frames.
  • the convolutional neural network is combined with a recurrent neural network with long short-term memory to obtain a convolutional recurrent neural network.
  • the convolutional recurrent neural network is trained on the pre-collected speech training set to construct a convolutional recurrent neural network model. This model greatly reduces the number of neural network parameters and the amount of data storage; while achieving good noise reduction performance, it greatly improves the real-time performance of single-channel speech noise reduction.
  • the following is a device embodiment of the present disclosure, which can be used to implement the above embodiment of the single-channel real-time noise reduction method based on a convolutional recurrent neural network.
  • Fig. 12 is a block diagram of a single-channel real-time noise reduction device according to an exemplary embodiment.
  • the device includes, but is not limited to, an acoustic feature extraction module 110, a ratio mask calculation module 120, a masking module 130, and a speech synthesis module 140.
  • an acoustic feature extraction module 110 configured to extract acoustic features from a received single-channel sound signal;
  • the ratio mask calculation module 120 is configured to iteratively process the acoustic features in a pre-built convolutional recurrent neural network model to calculate the ratio mask of the acoustic features;
  • a masking module 130 configured to mask the acoustic features using the ratio mask;
  • the speech synthesis module 140 is configured to synthesize the masked acoustic features with the phase of the single-channel sound signal to obtain a noise-reduced speech signal.
  • the acoustic feature extraction module 110 described in FIG. 12 includes, but is not limited to, a time frame division unit 111, a spectral amplitude vector extraction unit 112, and an acoustic feature formation unit 113.
  • the time frame dividing unit 111 is configured to divide the received single-channel sound signal into time frames according to a preset time period
  • a spectrum amplitude vector extraction unit 112 configured to extract a spectrum amplitude vector from the time frame
  • the acoustic feature forming unit 113 is configured to perform normalization processing on the spectral amplitude vector to form an acoustic feature.
  • the ratio mask calculation module 120 described in FIG. 12 includes, but is not limited to, a network combination unit 121 and a network model construction unit 122.
  • a network combining unit 121 configured to combine a convolutional neural network with a recurrent neural network with long and short-term memory to obtain a convolutional recurrent neural network
  • a network model constructing unit 122 is configured to train a pre-collected speech training set through the convolutional recurrent neural network to construct the convolutional recurrent neural network model.
  • the present invention further provides an electronic device that performs all or part of the steps of the single-channel real-time noise reduction method based on a convolutional recurrent neural network shown in any one of the above exemplary embodiments.
  • Electronic equipment includes:
  • a memory connected in communication with the processor; wherein,
  • the memory stores readability instructions that, when executed by the processor, implement the method according to any one of the foregoing exemplary embodiments.
  • a storage medium is also provided, and the storage medium is a computer-readable storage medium, such as a temporary and non-transitory computer-readable storage medium including instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A single-channel real-time noise reduction method based on a convolutional recurrent neural network. The method includes: extracting acoustic features from a received single-channel sound signal (S110); iteratively processing the acoustic features in a pre-trained convolutional recurrent neural network model to calculate a ratio mask for the acoustic features (S120); masking the acoustic features with the ratio mask (S130); and synthesizing the masked acoustic features with the phase of the single-channel sound signal to obtain a speech signal (S140). The method reduces the number of neural network parameters, lowers data storage requirements and the demand on system data bandwidth, achieves good noise reduction performance, and improves the real-time performance of single-channel speech noise reduction.

Description

A single-channel real-time noise reduction method based on a convolutional recurrent neural network

Technical Field
The present disclosure relates to the technical field of computer applications, and in particular to a single-channel real-time noise reduction method, apparatus, electronic device, and storage medium based on a convolutional recurrent neural network.
Background Art
Speech noise reduction refers to separating a target speech signal from background noise so as to eliminate or suppress the background noise. Single-channel speech is a speech signal generated by recording with only a single microphone. Compared with beamforming-based noise reduction technology (i.e., spatial filtering through an appropriate configuration of a microphone array), single-channel speech noise reduction can be applied to a wider range of acoustic scenarios. Single-channel speech noise reduction is not only cost-effective but also easier to use in practice. In addition, single-channel speech separation can be used to enhance the effects of beamforming and related microphone arrays.
Because single-channel speech lacks the spatial information provided by a microphone array as a reference, monaural speech noise reduction is particularly difficult. Recently, single-channel speech noise reduction has been treated as supervised learning, a breakthrough that turns the signal processing problem into a supervised learning task. Signal processing methods, represented by traditional speech enhancement, are based on general statistical analysis of background noise and speech, whereas supervised learning methods are data-driven and can learn automatically from concrete training samples. It is fair to say that the introduction of supervised learning has brought a leap in single-channel speech noise reduction technology. However, in current supervised single-channel noise reduction methods, the number of network parameters is large and the models are complex, which harms both the real-time performance and the noise reduction effect of single-channel speech noise reduction.
Summary of the Invention
To solve the technical problem in the related art that single-channel speech noise reduction requires many network parameters and complex models, the present disclosure provides a single-channel real-time noise reduction method, apparatus, and terminal based on a convolutional recurrent neural network.
In a first aspect, a single-channel real-time noise reduction method based on a convolutional recurrent neural network is provided, including:
extracting acoustic features from a received single-channel sound signal;
iteratively processing the acoustic features in a pre-trained convolutional recurrent neural network model to calculate a ratio mask for the acoustic features;
masking the acoustic features with the ratio mask;
synthesizing the masked acoustic features with the phase of the single-channel sound signal to obtain a noise-reduced speech signal.
Optionally, the step of extracting acoustic features from the received single-channel sound signal includes:
dividing the received single-channel sound signal into time frames according to a preset time period;
extracting a spectral amplitude vector from each time frame;
normalizing the spectral amplitude vectors to form the acoustic features.
Optionally, the step of normalizing the spectral amplitude vectors to form the acoustic features includes:
combining the spectral amplitude vectors of the current time frame and past time frames and normalizing them to form the acoustic features.
Optionally, the step of normalizing the spectral amplitude vectors to form the acoustic features includes:
combining the spectral amplitude vectors of the current time frame, past time frames, and future time frames and normalizing them to form the acoustic features.
Optionally, before the step of iteratively processing the acoustic features in the pre-trained convolutional recurrent neural network model to calculate the ratio mask for the acoustic features, the method further includes:
combining a convolutional neural network with a recurrent neural network with long short-term memory to obtain a convolutional recurrent neural network;
training the convolutional recurrent neural network on a pre-collected speech training set to construct the convolutional recurrent neural network model.
Optionally, the convolutional neural network is a convolutional encoder-decoder structure; the encoder includes a set of convolutional layers and pooling layers, the structure of the decoder mirrors that of the encoder in reverse order, and the output of the encoder is connected to the input of the decoder.
Optionally, the recurrent neural network with long short-term memory includes two stacked long short-term memory layers.
Optionally, the step of combining a convolutional neural network with a recurrent neural network with long short-term memory to obtain a convolutional recurrent neural network includes:
inserting two stacked long short-term memory layers between the encoder and the decoder of the convolutional neural network to construct the convolutional recurrent neural network.
Optionally, each convolutional layer or pooling layer in the convolutional neural network includes at most 16 kernels, and each long short-term memory layer of the recurrent neural network with long short-term memory includes 64 neurons.
Optionally, the speech training set is composed of background noise collected in daily environments, various types of male and female voices, and speech signals mixed at specific signal-to-noise ratios.
In a second aspect, a single-channel real-time noise reduction apparatus is provided, including:
an acoustic feature extraction module configured to extract acoustic features from a received single-channel sound signal;
a ratio mask calculation module configured to iteratively process the acoustic features in a pre-built convolutional recurrent neural network model to calculate a ratio mask for the acoustic features;
a masking module configured to mask the acoustic features with the ratio mask;
a speech synthesis module configured to synthesize the masked acoustic features with the phase of the single-channel sound signal to obtain a noise-reduced speech signal.
Optionally, an ideal ratio mask is adopted as the training target of the convolutional recurrent neural network.
In a third aspect, an electronic device is provided, including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to the first aspect.
In a fourth aspect, a computer-readable storage medium for storing a program is provided, the program, when executed, causing an electronic device to perform the method according to the first aspect.
The technical solutions provided by the embodiments of the present disclosure may include the following beneficial effects:
During single-channel real-time noise reduction, acoustic features are extracted from the received single-channel sound signal and iteratively processed in a pre-trained convolutional recurrent neural network model to calculate a ratio mask for the acoustic features; the ratio mask is then used to mask the acoustic features, and the masked acoustic features are synthesized with the phase of the single-channel sound signal to obtain a speech signal. Because a pre-trained convolutional recurrent neural network model is used, this solution greatly reduces the number of neural network parameters, the amount of data storage, and the demand on system data bandwidth while providing good noise reduction performance.
It should be understood that the above general description and the following detailed description are merely exemplary and do not limit the scope of the present disclosure.
Brief Description of the Drawings
The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present invention and, together with the description, serve to explain the principles of the present invention.
Fig. 1 is a flowchart of a single-channel real-time noise reduction method based on a convolutional recurrent neural network according to an exemplary embodiment.
Fig. 2 is a flowchart of a specific implementation of step S110 in the method of the embodiment corresponding to Fig. 1.
Fig. 3 is a flowchart of a specific implementation of step S120 in the method of Fig. 1.
Fig. 4 is a schematic flowchart of single-channel real-time noise reduction according to an exemplary embodiment.
Fig. 5 is a schematic diagram of the predicted spectral amplitude without compressing the CRN model.
Fig. 6 is a schematic diagram of the predicted spectral amplitude after compressing the CRN model.
Fig. 7 is a schematic diagram of the comparison of STOI scores for signals processed with an LSTM model, processed with a CRN model, and left unprocessed in a multi-person conversation noise scene, according to an exemplary embodiment.
Fig. 8 is a schematic diagram of the comparison of STOI scores for signals processed with an LSTM model, processed with a CRN model, and left unprocessed in a cafe noise scene, according to an exemplary embodiment.
Fig. 9 is a spectrogram of an unprocessed sound signal in a multi-person conversation noise scene at -5 dB SNR (signal-to-noise ratio), according to an exemplary embodiment.
Fig. 10 is the spectrogram of the clean utterance corresponding to Fig. 9, according to an exemplary embodiment.
Fig. 11 is the spectrogram after noise reduction with the CRN model, according to an exemplary embodiment.
Fig. 12 is a block diagram of a single-channel real-time noise reduction apparatus according to an exemplary embodiment.
Fig. 13 is a block diagram of the acoustic feature extraction module 110 in the apparatus of the embodiment corresponding to Fig. 12.
Fig. 14 is a block diagram of the ratio mask calculation module 120 of the embodiment corresponding to Fig. 12.
Detailed Description
Exemplary embodiments are described in detail here, with examples shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present invention as detailed in the appended claims.
Fig. 1 is a flowchart of a single-channel real-time noise reduction method based on a convolutional recurrent neural network according to an exemplary embodiment. The method can be used in electronic devices such as smartphones and computers. As shown in Fig. 1, the method may include steps S110, S120, S130, and S140.
Step S110: extract acoustic features from the received single-channel sound signal.
The single-channel sound signal is the signal to undergo real-time noise reduction processing.
Typically, a single-channel sound signal contains speech and non-speech interfering noise.
When performing single-channel real-time noise reduction, an electronic device may receive a single-channel sound signal captured by a recording device such as a microphone, receive a single-channel sound signal sent by another electronic device, or receive a single-channel sound signal in other ways, which are not enumerated here one by one.
Acoustic features are data features that can characterize the single-channel sound signal.
When extracting acoustic features from the received single-channel sound signal, an STFT (short-time Fourier transform) may be applied to the signal, a wavelet transform may be used, or acoustic features may be extracted from the received single-channel sound signal in other forms.
Optionally, as shown in Fig. 2, step S110 may include steps S111, S112, and S113.
Step S111: divide the received single-channel sound signal into time frames according to a preset time period.
The preset time period is a preset time interval; according to the preset time period, the single-channel sound signal is divided into multiple time frames.
In a specific exemplary embodiment, the received single-channel sound signal is divided into time frames of 20 milliseconds each, with a 10-millisecond overlap between every two adjacent time frames.
Step S112: extract a spectral amplitude vector from each time frame.
Step S113: normalize the spectral amplitude vectors to form acoustic features.
In an exemplary embodiment, the STFT is applied to each time frame to extract a spectral amplitude vector, and each spectral amplitude vector is normalized to form an acoustic feature, as sketched below.
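As a concrete illustration of steps S111 to S113, the following is a minimal sketch of this framing and feature-extraction front end, assuming a 16 kHz sampling rate and NumPy; the Hann window and the per-bin normalization statistics are illustrative assumptions, not values fixed by the disclosure.

```python
import numpy as np

def stft_magnitude(signal, frame_len=320, hop=160):
    """Split a 16 kHz signal into 20 ms frames with 10 ms overlap and
    return per-frame spectral amplitude vectors (steps S111/S112)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (n_frames, 161)

def normalize(mags, eps=1e-8):
    """Normalize the spectral amplitude vectors to form acoustic
    features (step S113); zero-mean, unit-variance per frequency bin
    is one common choice."""
    mean = mags.mean(axis=0, keepdims=True)
    std = mags.std(axis=0, keepdims=True)
    return (mags - mean) / (std + eps)
```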
Optionally, since temporal context is characteristic of speech signals, context information is integrated by concatenating multiple consecutive frames centered on the current time frame into a larger vector, further improving single-channel speech noise reduction performance.
For example, when normalizing the spectral amplitude vectors, the spectral amplitude vectors of the current time frame and past time frames are combined and normalized to form the acoustic features.
In noise reduction applications such as mobile communication and hearing aids, the required degree of real-time operation is high, so future time frames cannot be used. For such real-time applications, the spectral amplitude vectors of the current time frame and past time frames are combined and normalized. Specifically, the previous 5 frames and the current time frame are spliced into one unified feature vector as the input of the present invention. The number of past time frames may also be fewer than 5, which further saves computation time and improves the real-time performance of the application at the cost of some noise reduction performance.
As another example, when normalizing the spectral amplitude vectors, the spectral amplitude vectors of the current time frame, past time frames, and future time frames are combined and normalized to form the acoustic features.
For applications that do not require real-time processing, such as automatic speech recognition (ASR), future time frames can be used as input. Specifically, splicing a total of 7 time frames (the current frame, 1 future frame, and 5 past frames) into one unified feature vector as input raises STOI (Short-Time Objective Intelligibility) by roughly one percentage point compared with the scenario without future frames. STOI is an important metric for evaluating speech noise reduction performance; its typical value lies between 0 and 1 and can be interpreted as the percentage of speech understood. A sketch of this frame splicing follows.
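The splicing described above can be sketched as follows, operating on the normalized magnitude array from the previous sketch; the edge-padding strategy at the start of the utterance is an illustrative assumption.

```python
import numpy as np

def splice_context(features, n_past=5, n_future=0):
    """Concatenate each frame with its n_past previous frames (and
    optionally n_future future frames) into one unified feature vector.
    For real-time use, n_future must be 0."""
    n_frames, dim = features.shape
    padded = np.pad(features, ((n_past, n_future), (0, 0)), mode='edge')
    spliced = [padded[t: t + n_past + 1 + n_future].reshape(-1)
               for t in range(n_frames)]
    return np.stack(spliced)  # (n_frames, (n_past+1+n_future)*dim)
```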
Thus, when extracting acoustic features from the single-channel sound signal, the signal is first divided into time frames according to a preset time period; by setting an appropriate time period, the acoustic features extracted from each time frame supply the input for noise reduction, and noise reduction performance can be improved by selectively combining the spectral amplitude vectors of the current time frame with those of past and future time frames to form the acoustic features.
Step S120: iteratively process the acoustic features in the pre-trained convolutional recurrent neural network model to calculate the ratio mask for the acoustic features.
The ratio mask characterizes the relationship between the noisy speech signal and the clean speech signal; it indicates the trade-off between suppressing noise and preserving speech.
Ideally, after the noisy speech signal is masked with the ratio mask, the clean speech signal can be recovered from the noisy speech.
The convolutional recurrent neural network model is trained in advance.
The acoustic features obtained in step S110 serve as input to the convolutional recurrent neural network model, in which iterative operations are performed to calculate the ratio mask for the acoustic features.
In this step, the IRM (Ideal Ratio Mask) is used as the target of the iterative operations. The IRM of each time-frequency unit in the spectrogram can be expressed by the following equation:
IRM(t, f) = √( S_FFT(t, f)² / ( S_FFT(t, f)² + N_FFT(t, f)² ) )
where S_FFT(t, f) and N_FFT(t, f) denote the clean speech amplitude spectrum and the noise amplitude spectrum at time frame t and frequency bin f, respectively.
By predicting the ideal ratio mask during the supervised training process, the ratio mask can then be applied to mask the acoustic features and obtain the noise-reduced speech signal. A sketch of computing this training target follows.
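As an illustration of the formula, the following sketch computes an IRM training target from parallel clean and noise recordings, assuming the stft_magnitude helper from the earlier sketch; it is not code from the disclosure.

```python
import numpy as np

def ideal_ratio_mask(clean, noise):
    """Compute the IRM for each time-frequency unit from the clean
    speech and noise amplitude spectra:
        IRM(t, f) = sqrt(S^2 / (S^2 + N^2))"""
    s2 = stft_magnitude(clean) ** 2
    n2 = stft_magnitude(noise) ** 2
    return np.sqrt(s2 / (s2 + n2 + 1e-12))  # values in [0, 1]
```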
Step S130: mask the acoustic features with the ratio mask.
Step S140: synthesize the masked acoustic features with the phase of the single-channel sound signal to obtain the noise-reduced speech signal.
Once training is complete, the trained CRN (Convolutional Recurrent Network, i.e., convolutional recurrent neural network) can be used for speech noise reduction applications. Using a trained neural network for a specific application is called inference, or operation. During inference, each layer of the CRN model processes the noisy signal. The T-F mask is derived from the inference result and then used to weight (or mask) the noisy speech amplitudes, producing an enhanced speech signal that is clearer than the original noisy input.
In an exemplary embodiment, the masked spectral amplitude vectors, together with the phase of the single-channel sound signal, are passed to an inverse Fourier transform to derive the speech signal in the corresponding time domain, as sketched below.
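A minimal resynthesis sketch, assuming the 20 ms / 10 ms framing used earlier; the synthesis windowing and overlap-add details are illustrative assumptions rather than the disclosure's exact procedure.

```python
import numpy as np

def resynthesize(noisy_mag, noisy_phase, mask, frame_len=320, hop=160):
    """Apply the estimated ratio mask to the noisy amplitudes (step
    S130), recombine with the noisy phase, then inverse-FFT and
    overlap-add to recover a time-domain waveform (step S140)."""
    enhanced = (mask * noisy_mag) * np.exp(1j * noisy_phase)
    frames = np.fft.irfft(enhanced, n=frame_len, axis=1)
    out = np.zeros((len(frames) - 1) * hop + frame_len)
    for t, frame in enumerate(frames):
        out[t * hop: t * hop + frame_len] += frame * np.hanning(frame_len)
    return out
```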
With the method described above, during single-channel real-time noise reduction, acoustic features are extracted from the received single-channel sound signal and iteratively processed in the pre-trained convolutional recurrent neural network model to calculate a ratio mask, which is then used to mask the acoustic features. The masked acoustic features are synthesized with the phase of the single-channel sound signal to obtain the speech signal. Because this solution uses a pre-trained convolutional recurrent neural network model, it greatly reduces the number of neural network parameters, the amount of data storage, and the demand on system data bandwidth, and can greatly improve the real-time performance of single-channel speech noise reduction while achieving good noise reduction performance.
Fig. 3 is a flowchart of a specific implementation of step S120 in the method of the embodiment corresponding to Fig. 1. As shown in Fig. 3, the method may include steps S121 and S122.
Step S121: combine a convolutional neural network with a recurrent neural network with long short-term memory to obtain a convolutional recurrent neural network.
A CNN (Convolutional Neural Network) is an efficient recognition method that has attracted wide attention in recent years. In the 1960s, while studying neurons used for local sensitivity and direction selection in the cat's cerebral cortex, Hubel and Wiesel found that their unique network structure could effectively reduce the complexity of feedback neural networks, which led to neural structures similar to the convolutional neural network. CNNs have since become a research hotspot in many scientific fields, especially pattern classification; because the network avoids complex image pre-processing and can take raw images directly as input, it has been widely applied. The neocognitron proposed by K. Fukushima in 1980 was the first implemented network of the convolutional neural network family; a representative later version is the one proposed by LeCun et al.
The convolution operation in a CNN is defined on a two-dimensional structure: each low-level feature in a local receptive field depends only on a subset of the input, such as a topological neighborhood. The topological locality constraint makes the weight matrix of a convolutional layer very sparse, so the two layers connected by a convolution are only locally connected. Computing such a matrix multiplication is more convenient and efficient than a dense matrix multiplication, and the smaller number of free parameters is statistically advantageous. In images with two-dimensional topology, the same input pattern appears at different positions, and nearby values are more likely to be strongly dependent, which matters greatly for modeling the data.
A CNN uses weight sharing to reduce the number of parameters to be learned, lowering model complexity and greatly reducing the number of network weights. Compared with the ordinary feed-forward BP algorithm (Error Back Propagation), training speed and accuracy are greatly improved. As a deep learning algorithm, a CNN can also minimize data pre-processing overhead.
A recurrent neural network (RNN) with long short-term memory (LSTM; hereinafter, "recurrent neural network with long short-term memory" is abbreviated to "LSTM") is a type of time-recursive neural network, first published in 1997. Owing to its unique design, the LSTM is well suited to processing and predicting important events separated by very long intervals and delays in a time series.
LSTMs usually outperform plain time-recursive neural networks and hidden Markov models (HMMs), for example in unsegmented continuous handwriting recognition. In 2009, an artificial neural network model built with LSTMs won the ICDAR handwriting recognition competition. LSTMs are also widely used for automatic speech recognition; in 2013 they set a record 17.7% error rate on the TIMIT natural speech database. As a non-linear model, the LSTM can serve as a complex non-linear unit for constructing larger deep neural networks.
The LSTM is a specific type of RNN that can effectively capture the long-term dependencies of sound signals. Compared with the traditional RNN, the LSTM mitigates the vanishing or exploding gradients that arise over time during training. The memory cell of an LSTM block has three gates: an input gate, a forget gate, and an output gate. The input gate controls how much current information is added to the memory cell, the forget gate controls how much previous information is retained, and the output gate controls whether information is output. Specifically, the LSTM can be described by the following equations.
i_t = σ(W_ii x_t + b_ii + W_hi h_{t-1} + b_hi)
f_t = σ(W_if x_t + b_if + W_hf h_{t-1} + b_hf)
g_t = tanh(W_ig x_t + b_ig + W_hg h_{t-1} + b_hg)
o_t = σ(W_io x_t + b_io + W_ho h_{t-1} + b_ho)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)
where i_t, f_t, and o_t are the outputs of the input gate, forget gate, and output gate, respectively; x_t and h_t denote the input features and hidden activations at time t; g_t and c_t denote the block input and the memory cell. σ denotes the sigmoid function, σ(x) = 1/(1 + e^{-x}), and tanh the hyperbolic tangent, tanh(x) = (e^x - e^{-x})/(e^x + e^{-x}). The symbol ⊙ denotes element-wise multiplication. The input and forget gates are computed from the previous activations and the current input, and context-dependent updates are performed on the memory cell according to the input and forget gates.
When trained for speech denoising, the LSTM stores the relevant context for mask prediction at the current moment. A sketch of one LSTM time step follows.
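The gate equations above can be traced with the following single-step sketch in NumPy; the packed weight layout is an illustrative assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following the gate equations above.
    W maps the concatenated [x_t, h_prev] to the four gate
    pre-activations (input, forget, block input, output)."""
    z = W @ np.concatenate([x_t, h_prev]) + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # the three gates
    g = np.tanh(g)                                 # block input
    c_t = f * c_prev + i * g                       # memory cell update
    h_t = o * np.tanh(c_t)                         # hidden activation
    return h_t, c_t
```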
A convolutional recurrent neural network (CRN) is obtained by combining a convolutional neural network with a recurrent neural network with long short-term memory. The CRN has the characteristics of both the CNN and the LSTM, so it can denoise speech efficiently while greatly reducing the number of neural network parameters of the CRN and effectively shrinking its size.
Step S122: train the convolutional recurrent neural network on the pre-collected speech training set to construct the convolutional recurrent neural network model.
After the convolutional neural network is combined with the recurrent neural network with long short-term memory to obtain the convolutional recurrent neural network, the network is trained on the pre-collected speech training set, and its network parameters are adjusted so that its output approximates the IRM, thereby constructing the convolutional recurrent neural network model.
Fig. 4 is a schematic flowchart of single-channel real-time noise reduction according to an exemplary embodiment. As shown in Fig. 4, the input is a sound signal and the output is the noise-reduced speech signal; the dashed arrows in the figure represent steps involved during training, the dotted arrows indicate steps of the prediction phase, and the solid arrows represent steps shared by training and prediction. As a supervised learning method, the present invention uses the ideal ratio mask (IRM) as the training target. The IRM is obtained by comparing the STFT of the noisy speech signal with that of the corresponding clean speech signal. In the training phase, the RNN with LSTM estimates the ideal ratio mask of each input noisy utterance, and the mean-squared error (MSE) between the ideal ratio mask and the estimated ratio mask is then computed. Over repeated rounds of iteration, the MSE over the entire training set is minimized, with each training sample used only once per round. After the training phase, the prediction phase begins: the trained convolutional recurrent neural network model directly denoises the input sound signal. Specifically, the trained model processes the input sound signal and computes the ratio mask, the computed ratio mask is applied to the input sound signal, and finally the noise-reduced speech signal is re-synthesized. A sketch of such a training loop is given below.
The output of the top deconvolutional layer is passed through a sigmoid-shaped function (see Fig. 4) to obtain the predicted ratio mask, which is then compared with the IRM; the comparison produces the MSE error used to adjust the CRN weights.
Optionally, the convolutional neural network is a convolutional encoder-decoder structure, comprising a convolutional encoder and a corresponding decoder. The encoder includes a set of convolutional layers and pooling layers for extracting high-level features from the input; the decoder mirrors the encoder in reverse order and can map the low-resolution features at the encoder output to feature maps of the full input size. The number of kernels remains symmetric: it increases gradually in the encoder and decreases gradually in the decoder. The symmetric encoder-decoder architecture ensures the output has the same shape as the input. To improve information and gradient flow through the network, skip connections are used, connecting the output of each encoder layer to the input of the corresponding decoder layer. Causal convolutions and causal deconvolutions, whose outputs do not depend on future inputs, are applied in the encoder and decoder respectively, yielding a causal system suitable for real-time processing.
Optionally, the recurrent neural network with long short-term memory includes two stacked long short-term memory layers. The convolutional recurrent neural network is constructed by inserting the two stacked long short-term memory layers between the encoder and decoder of the convolutional neural network to model the temporal dynamics of the sound signal.
Optionally, to further reduce the size of the CRN model, the model is compressed: for example, each convolutional layer or pooling layer in the CRN includes at most 16 kernels, and each long short-term memory layer of the recurrent neural network with long short-term memory includes 64 neurons. After compression, the model's STOI performance is only slightly lower than that of the fully trained, uncompressed CRN model: at -5 dB input SNR the STOI score drops by about 4-5%, and at higher input SNRs the drop is even smaller. Fig. 5 shows the predicted spectral amplitude without compression, and Fig. 6 shows the predicted spectral amplitude after compression. A sketch of such a compressed architecture follows.
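The following PyTorch sketch outlines a compressed CRN of the kind described: a five-layer convolutional encoder with at most 16 kernels per layer, two stacked 64-unit LSTM layers between encoder and decoder, a five-layer deconvolutional decoder with skip connections, and a sigmoid output producing the ratio mask. Kernel sizes, strides, the channel progression, and the per-frame processing are illustrative assumptions, not the disclosure's exact architecture.

```python
import torch
import torch.nn as nn

class CRN(nn.Module):
    """Compressed convolutional recurrent network sketch: conv encoder,
    two stacked LSTM layers, deconv decoder with skip connections."""
    def __init__(self, freq_bins=161):
        super().__init__()
        chans = [1, 4, 8, 8, 16, 16]             # at most 16 kernels/layer
        self.enc = nn.ModuleList(
            nn.Conv1d(chans[i], chans[i + 1], kernel_size=3, padding=1)
            for i in range(5))
        self.rnn = nn.LSTM(input_size=16 * freq_bins, hidden_size=64,
                           num_layers=2, batch_first=True)
        self.proj = nn.Linear(64, 16 * freq_bins)
        self.dec = nn.ModuleList(
            nn.ConvTranspose1d(2 * chans[5 - i], chans[4 - i],
                               kernel_size=3, padding=1)
            for i in range(5))

    def forward(self, x):                        # x: (batch, 1, freq_bins)
        skips = []
        for layer in self.enc:                   # encoder with skips
            x = torch.relu(layer(x))
            skips.append(x)
        b = x.shape[0]
        h, _ = self.rnn(x.reshape(b, 1, -1))     # temporal modeling
        x = self.proj(h).reshape(b, 16, -1)
        for i, (layer, skip) in enumerate(zip(self.dec, reversed(skips))):
            x = layer(torch.cat([x, skip], dim=1))
            x = torch.relu(x) if i < 4 else torch.sigmoid(x)
        return x                                 # ratio mask in [0, 1]
```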
In summary, compared with the unprocessed noisy mixture, the compressed CRN model still greatly improves the STOI score; compared with the uncompressed CRN model, compression greatly reduces the model size and the data storage required by the CRN model while having little impact on its noise reduction performance.
With the method described above, combining a convolutional neural network with a recurrent neural network with long short-term memory yields a convolutional recurrent neural network, so the constructed convolutional recurrent neural network model inherits the CNN's small parameter count and, while maintaining good noise reduction performance, greatly improves the real-time performance of single-channel speech noise reduction.
Optionally, to further improve the generalization ability of this solution, i.e., to achieve general-purpose noise reduction not restricted to a particular noise environment or speaker, the speech training set of this exemplary embodiment is composed of a large amount of background noise collected in daily environments, male and female voices of various types, and speech and noise mixed at specific signal-to-noise ratios (SNRs).
In addition, the speech training set contains a large amount of noise (about 1000 hours) and numerous speech segments, and the entire training set spans several hundred hours, ensuring that the CRN model is fully trained.
Fig. 7 shows, in a multi-person conversation noise scene, the comparison of STOI scores for signals processed with an LSTM model, processed with a CRN model, and left unprocessed, according to an exemplary embodiment; Fig. 8 shows the corresponding comparison in a cafe noise scene. The CRN model used in this embodiment has five convolutional layers in the encoder, five deconvolutional layers in the decoder, and two LSTM layers between the encoder and decoder. As shown in Figs. 7 and 8, relative to the unprocessed noisy signal, the CRN-based method proposed in this embodiment greatly improves the STOI score: at -5 dB SNR the STOI score improves by about 20 percentage points, and at 5 dB SNR by about 10 percentage points. Figs. 7 and 8 also show that this method consistently outperforms the LSTM-based RNN method, with the STOI improvement more pronounced at lower SNRs.
To further illustrate the noise reduction results, Fig. 9 shows the spectrogram of an unprocessed sound signal in a multi-person conversation noise scene at -5 dB SNR; Fig. 10 shows the spectrogram of the corresponding clean utterance; and Fig. 11 shows the spectrogram after noise reduction with the CRN model. Figs. 9, 10, and 11 show that the noise-reduced speech signal is far closer to clean speech than the noisy signal of the multi-person conversation scene.
The single-channel noise reduction of the present invention processes the signal collected by a single microphone; compared with beamforming microphone-array noise reduction methods, monaural noise reduction has broader applicability. The present invention adopts a supervised learning method for speech noise reduction, using a convolutional recurrent neural network model to predict the ratio mask of the sound signal. The speech training set used by the proposed convolutional recurrent neural network model contains a large amount of noise (about 1000 hours) and numerous speech segments, with the entire set spanning several hundred hours, ensuring that the CRN model is fully trained and that the realization of single-channel real-time noise reduction does not depend on future time frames. A convolutional neural network is combined with a recurrent neural network with long short-term memory to obtain a convolutional recurrent neural network, which is then trained on the pre-collected speech training set to construct the convolutional recurrent neural network model. This model greatly reduces the number of neural network parameters and the amount of data storage, and greatly improves the real-time performance of single-channel speech noise reduction while achieving good noise reduction performance.
The following are apparatus embodiments of the present disclosure, which can be used to carry out the above embodiments of the single-channel real-time noise reduction method based on a convolutional recurrent neural network. For details not disclosed in the apparatus embodiments, please refer to the method embodiments of the present disclosure.
Fig. 12 is a block diagram of a single-channel real-time noise reduction apparatus according to an exemplary embodiment. The apparatus includes, but is not limited to, an acoustic feature extraction module 110, a ratio mask calculation module 120, a masking module 130, and a speech synthesis module 140.
The acoustic feature extraction module 110 is configured to extract acoustic features from a received single-channel sound signal;
the ratio mask calculation module 120 is configured to iteratively process the acoustic features in a pre-built convolutional recurrent neural network model to calculate a ratio mask for the acoustic features;
the masking module 130 is configured to mask the acoustic features with the ratio mask;
the speech synthesis module 140 is configured to synthesize the masked acoustic features with the phase of the single-channel sound signal to obtain a noise-reduced speech signal.
For the implementation of the functions of each module in the apparatus, see the implementation of the corresponding steps in the single-channel real-time noise reduction method described above; details are not repeated here.
Optionally, as shown in Fig. 13, the acoustic feature extraction module 110 of Fig. 12 includes, but is not limited to, a time frame division unit 111, a spectral amplitude vector extraction unit 112, and an acoustic feature formation unit 113.
The time frame division unit 111 is configured to divide the received single-channel sound signal into time frames according to a preset time period;
the spectral amplitude vector extraction unit 112 is configured to extract a spectral amplitude vector from each time frame;
the acoustic feature formation unit 113 is configured to normalize the spectral amplitude vectors to form acoustic features.
Optionally, as shown in Fig. 14, the ratio mask calculation module 120 of Fig. 12 includes, but is not limited to, a network combination unit 121 and a network model construction unit 122.
The network combination unit 121 is configured to combine a convolutional neural network with a recurrent neural network with long short-term memory to obtain a convolutional recurrent neural network;
the network model construction unit 122 is configured to train the convolutional recurrent neural network on a pre-collected speech training set to construct the convolutional recurrent neural network model.
Optionally, the present invention further provides an electronic device that performs all or part of the steps of the single-channel real-time noise reduction method based on a convolutional recurrent neural network shown in any of the above exemplary embodiments. The electronic device includes:
a processor; and
a memory communicatively connected to the processor; wherein
the memory stores readable instructions that, when executed by the processor, implement the method according to any of the above exemplary embodiments.
The specific manner in which the processor of the terminal performs operations in this embodiment has been described in detail in the embodiments of the single-channel real-time noise reduction method based on a convolutional recurrent neural network, and will not be elaborated here.
In an exemplary embodiment, a storage medium is also provided. The storage medium is a computer-readable storage medium, for example a temporary or non-transitory computer-readable storage medium including instructions.
It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (13)

  1. A single-channel real-time noise reduction method based on a convolutional recurrent neural network, characterized in that the method comprises:
    extracting acoustic features from a received single-channel sound signal;
    iteratively processing the acoustic features in a pre-trained convolutional recurrent neural network model to calculate a ratio mask for the acoustic features;
    masking the acoustic features with the ratio mask;
    synthesizing the masked acoustic features with the phase of the single-channel sound signal to obtain a speech signal.
  2. The method according to claim 1, characterized in that the step of extracting acoustic features from the received single-channel sound signal comprises:
    dividing the received single-channel sound signal into time frames according to a preset time period;
    extracting a spectral amplitude vector from each time frame;
    normalizing the spectral amplitude vectors to form acoustic features.
  3. The method according to claim 2, characterized in that the step of normalizing the spectral amplitude vectors to form acoustic features comprises:
    combining the spectral amplitude vectors of the current time frame and past time frames and normalizing them to form acoustic features.
  4. The method according to claim 2, characterized in that the step of normalizing the spectral amplitude vectors to form acoustic features comprises:
    combining the spectral amplitude vectors of the current time frame, past time frames, and future time frames and normalizing them to form acoustic features.
  5. The method according to claim 1, characterized in that the step of iteratively processing the acoustic features in the pre-trained convolutional recurrent neural network model to calculate the ratio mask for the acoustic features comprises:
    combining a convolutional neural network with a recurrent neural network with long short-term memory to obtain a convolutional recurrent neural network;
    training the convolutional recurrent neural network on a pre-collected speech training set to construct the convolutional recurrent neural network model.
  6. The method according to claim 5, characterized in that the convolutional neural network is a convolutional encoder-decoder structure, the encoder comprises a set of convolutional layers and pooling layers, the structure of the decoder mirrors that of the encoder in reverse order, and the output of the encoder is connected to the input of the decoder.
  7. The method according to claim 5, characterized in that the recurrent neural network with long short-term memory comprises two stacked long short-term memory layers.
  8. The method according to claim 5, characterized in that the step of combining a convolutional neural network with a recurrent neural network with long short-term memory to obtain a convolutional recurrent neural network comprises:
    inserting two stacked long short-term memory layers between the encoder and the decoder of the convolutional neural network to construct the convolutional recurrent neural network.
  9. The method according to claim 5, characterized in that each convolutional layer or pooling layer in the convolutional neural network comprises at most 16 kernels, and each long short-term memory layer of the recurrent neural network with long short-term memory comprises 64 neurons.
  10. The method according to claim 5, characterized in that the speech training set is composed of background noise collected in daily environments, various types of male and female voices, and speech signals mixed at specific signal-to-noise ratios.
  11. A single-channel real-time noise reduction apparatus based on a convolutional recurrent neural network, characterized in that the apparatus comprises:
    an acoustic feature extraction module configured to extract acoustic features from a received single-channel sound signal;
    a ratio mask calculation module configured to iteratively process the acoustic features in a pre-built convolutional recurrent neural network model to calculate a ratio mask for the acoustic features;
    a masking module configured to mask the acoustic features with the ratio mask;
    a speech synthesis module configured to synthesize the masked acoustic features with the phase of the single-channel sound signal to obtain a noise-reduced speech signal.
  12. An electronic device, characterized in that the electronic device comprises:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-10.
  13. A computer-readable storage medium for storing a program, characterized in that the program, when executed, causes an electronic device to perform the method according to any one of claims 1-10.
PCT/CN2019/090530 2018-08-31 2019-06-10 A single-channel real-time noise reduction method based on a convolutional recurrent neural network WO2020042707A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811010712.6A CN109841226B (zh) 2018-08-31 2018-08-31 A single-channel real-time noise reduction method based on a convolutional recurrent neural network
CN201811010712.6 2018-08-31

Publications (1)

Publication Number Publication Date
WO2020042707A1 true WO2020042707A1 (zh) 2020-03-05

Family

ID=66883030

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/090530 WO2020042707A1 (zh) 2018-08-31 2019-06-10 A single-channel real-time noise reduction method based on a convolutional recurrent neural network

Country Status (2)

Country Link
CN (1) CN109841226B (zh)
WO (1) WO2020042707A1 (zh)

Families Citing this family (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109841226B (zh) * 2018-08-31 2020-10-16 大象声科(深圳)科技有限公司 一种基于卷积递归神经网络的单通道实时降噪方法
CN110335622B (zh) * 2019-06-13 2024-03-01 平安科技(深圳)有限公司 音频单音色分离方法、装置、计算机设备及存储介质
CN110321810A (zh) * 2019-06-14 2019-10-11 华南师范大学 单通道信号双路分离方法、装置、存储介质及处理器
CN110334654A (zh) * 2019-07-08 2019-10-15 北京地平线机器人技术研发有限公司 视频预测方法和装置、视频预测模型的训练方法及车辆
CN110503940B (zh) * 2019-07-12 2021-08-31 中国科学院自动化研究所 语音增强方法、装置、存储介质、电子设备
CN110534123B (zh) * 2019-07-22 2022-04-01 中国科学院自动化研究所 语音增强方法、装置、存储介质、电子设备
CN110491404B (zh) * 2019-08-15 2020-12-22 广州华多网络科技有限公司 语音处理方法、装置、终端设备及存储介质
CN110491407B (zh) * 2019-08-15 2021-09-21 广州方硅信息技术有限公司 语音降噪的方法、装置、电子设备及存储介质
CN110379412B (zh) * 2019-09-05 2022-06-17 腾讯科技(深圳)有限公司 语音处理的方法、装置、电子设备及计算机可读存储介质
CN110634502B (zh) * 2019-09-06 2022-02-11 南京邮电大学 基于深度神经网络的单通道语音分离算法
CN110751958A (zh) * 2019-09-25 2020-02-04 电子科技大学 一种基于rced网络的降噪方法
CN110767223B (zh) * 2019-09-30 2022-04-12 大象声科(深圳)科技有限公司 一种单声道鲁棒性的语音关键词实时检测方法
CN110660406A (zh) * 2019-09-30 2020-01-07 大象声科(深圳)科技有限公司 近距离交谈场景下双麦克风移动电话的实时语音降噪方法
WO2021062706A1 (zh) * 2019-09-30 2021-04-08 大象声科(深圳)科技有限公司 近距离交谈场景下双麦克风移动电话的实时语音降噪方法
CN110931031A (zh) * 2019-10-09 2020-03-27 大象声科(深圳)科技有限公司 一种融合骨振动传感器和麦克风信号的深度学习语音提取和降噪方法
CN110806640B (zh) * 2019-10-28 2021-12-28 西北工业大学 光子集成视觉特征成像芯片
CN110808063A (zh) * 2019-11-29 2020-02-18 北京搜狗科技发展有限公司 一种语音处理方法、装置和用于处理语音的装置
CN113053400B (zh) * 2019-12-27 2024-06-07 武汉Tcl集团工业研究院有限公司 音频信号降噪模型的训练方法、音频信号降噪方法及设备
CN113223545A (zh) * 2020-02-05 2021-08-06 字节跳动有限公司 一种语音降噪方法、装置、终端及存储介质
CN113270091B (zh) * 2020-02-14 2024-04-16 声音猎手公司 音频处理系统和方法
CN111326167B (zh) * 2020-03-09 2022-05-13 广州深声科技有限公司 一种基于神经网络的声学特征转换方法
CN111429930B (zh) * 2020-03-16 2023-02-28 云知声智能科技股份有限公司 一种基于自适应采样率的降噪模型处理方法及系统
CN111292759B (zh) * 2020-05-11 2020-07-31 上海亮牛半导体科技有限公司 一种基于神经网络的立体声回声消除方法及系统
CN111754983A (zh) * 2020-05-18 2020-10-09 北京三快在线科技有限公司 一种语音去噪方法、装置、电子设备及存储介质
CN111883091A (zh) * 2020-07-09 2020-11-03 腾讯音乐娱乐科技(深圳)有限公司 音频降噪方法和音频降噪模型的训练方法
CN112466318B (zh) * 2020-10-27 2024-01-19 北京百度网讯科技有限公司 语音处理方法、装置及语音处理模型的生成方法、装置
CN116508099A (zh) * 2020-10-29 2023-07-28 杜比实验室特许公司 基于深度学习的语音增强
CN112565977B (zh) * 2020-11-27 2023-03-07 大象声科(深圳)科技有限公司 高频信号重建模型的训练方法和高频信号重建方法及装置
CN112671419B (zh) * 2020-12-17 2022-05-03 北京邮电大学 一种无线信号重建方法、装置、系统、设备和存储介质
CN112614504A (zh) * 2020-12-22 2021-04-06 平安科技(深圳)有限公司 单声道语音降噪方法、系统、设备及可读存储介质
CN112613431B (zh) * 2020-12-28 2021-06-29 中北大学 一种泄漏气体自动识别方法、系统及装置
CN112992168B (zh) * 2021-02-26 2024-04-19 平安科技(深圳)有限公司 语音降噪器训练方法、装置、计算机设备和存储介质
CN113345463B (zh) * 2021-05-31 2024-03-01 平安科技(深圳)有限公司 基于卷积神经网络的语音增强方法、装置、设备及介质
CN113314140A (zh) * 2021-05-31 2021-08-27 哈尔滨理工大学 一种端到端时域多尺度卷积神经网络的音源分离算法
CN113744748A (zh) * 2021-08-06 2021-12-03 浙江大华技术股份有限公司 一种网络模型的训练方法、回声消除方法及设备
CN114283830A (zh) * 2021-12-17 2022-04-05 南京工程学院 基于深度学习网络的麦克风信号回声消除模型构建方法
CN114464206A (zh) * 2022-04-11 2022-05-10 中国人民解放军空军预警学院 一种单通道盲源分离方法及系统

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160111107A1 (en) * 2014-10-21 2016-04-21 Mitsubishi Electric Research Laboratories, Inc. Method for Enhancing Noisy Speech using Features from an Automatic Speech Recognition System
CN106328122A (zh) * 2016-08-19 2017-01-11 深圳市唯特视科技有限公司 A speech recognition method using a long short-term memory recurrent neural network model
CN106875007A (zh) * 2017-01-25 2017-06-20 上海交通大学 A convolutional LSTM-based end-to-end deep neural network for voice spoofing detection

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013149123A1 (en) * 2012-03-30 2013-10-03 The Ohio State University Monaural speech filter
US20170061978A1 (en) * 2014-11-07 2017-03-02 Shannon Campbell Real-time method for implementing deep neural network based speech separation
CN106847302A (zh) * 2017-02-17 2017-06-13 大连理工大学 A time-domain separation method for single-channel mixed speech based on a convolutional neural network
CN107452389A (zh) * 2017-07-20 2017-12-08 大象声科(深圳)科技有限公司 A general monaural real-time noise reduction method
CN109841226A (zh) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 A single-channel real-time noise reduction method based on a convolutional recurrent neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
D. S. WANG: "A Deep Convolutional Encoder-Decoder Model for Robust Speech Dereverberation", International Conference on Digital Signal Processing (DSP), 25 August 2017 (2017-08-25), XP033246182, ISSN: 2165-3577 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022079264A3 (en) * 2020-10-15 2022-06-02 Dolby International Ab Method and apparatus for neural network based processing of audio using sinusoidal activation
WO2022204630A1 (en) * 2021-03-23 2022-09-29 Qualcomm Incorporated Context-based speech enhancement
US11715480B2 (en) 2021-03-23 2023-08-01 Qualcomm Incorporated Context-based speech enhancement

Also Published As

Publication number Publication date
CN109841226B (zh) 2020-10-16
CN109841226A (zh) 2019-06-04

Similar Documents

Publication Publication Date Title
WO2020042707A1 (zh) A single-channel real-time noise reduction method based on a convolutional recurrent neural network
CN107452389B (zh) 一种通用的单声道实时降噪方法
EP4006898A1 (en) Voice recognition method, device, and computer-readable storage medium
CN109841206B (zh) 一种基于深度学习的回声消除方法
Zhang et al. Deep learning for environmentally robust speech recognition: An overview of recent developments
CN110600017B (zh) 语音处理模型的训练方法、语音识别方法、系统及装置
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
CN110600018B (zh) 语音识别方法及装置、神经网络训练方法及装置
Qian et al. Speech Enhancement Using Bayesian Wavenet.
Chang et al. Temporal modeling using dilated convolution and gating for voice-activity-detection
WO2021042870A1 (zh) 语音处理的方法、装置、电子设备及计算机可读存储介质
CN111653288B (zh) 基于条件变分自编码器的目标人语音增强方法
Shah et al. Time-frequency mask-based speech enhancement using convolutional generative adversarial network
CN111179911B (zh) 目标语音提取方法、装置、设备、介质和联合训练方法
CN108922544B (zh) 通用向量训练方法、语音聚类方法、装置、设备及介质
TW201935464A (zh) 基於記憶性瓶頸特徵的聲紋識別的方法及裝置
WO2019227586A1 (zh) 语音模型训练方法、说话人识别方法、装置、设备及介质
CN111899757B (zh) 针对目标说话人提取的单通道语音分离方法及系统
Shao et al. Bayesian separation with sparsity promotion in perceptual wavelet domain for speech enhancement and hybrid speech recognition
CN110660406A (zh) 近距离交谈场景下双麦克风移动电话的实时语音降噪方法
KR102026226B1 (ko) 딥러닝 기반 Variational Inference 모델을 이용한 신호 단위 특징 추출 방법 및 시스템
Li et al. Densely connected network with time-frequency dilated convolution for speech enhancement
CN113763965A (zh) 一种多重注意力特征融合的说话人识别方法
Tu et al. DNN training based on classic gain function for single-channel speech enhancement and recognition
Soni et al. State-of-the-art analysis of deep learning-based monaural speech source separation techniques

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19855108

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19855108

Country of ref document: EP

Kind code of ref document: A1