CN112489668B - Dereverberation method, device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112489668B
Authority
CN
China
Prior art keywords
signal
dereverberation
sub-band
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011218042.4A
Other languages
Chinese (zh)
Other versions
CN112489668A (en)
Inventor
张国昌
于利标
韩润强
魏建强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011218042.4A
Publication of CN112489668A
Application granted
Publication of CN112489668B


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L 21/0232 - Processing in the frequency domain
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques using neural networks
    • G10L 2021/02082 - Noise filtering, the noise being echo, reverberation of the speech

Abstract

The application discloses a dereverberation method, a dereverberation apparatus, an electronic device, and a storage medium, relating to the fields of speech technology and deep learning. The specific implementation scheme is as follows: decomposing each of at least one collected speech signal into a plurality of subband signals; performing offset processing on each subband signal according to a set time offset to obtain a reference signal of each subband signal; inputting each subband signal of the at least one speech signal and its corresponding reference signal into a dereverberation model to obtain an ideal ratio mask (IRM) corresponding to each subband signal of the at least one speech signal; performing magnitude spectrum enhancement on each subband signal with the corresponding IRM to obtain each dereverberated subband signal of the at least one speech signal; and performing subband synthesis on the dereverberated subband signals of the same speech signal to obtain each dereverberated speech signal. In this way, reverberant sound can be removed well by a band-independent dereverberation model.

Description

Dereverberation method, device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computers, and in particular, to the field of speech technology and deep learning, and more particularly, to a dereverberation method, apparatus, electronic device, and storage medium.
Background
When sound propagates indoors, it is reflected by various indoor obstacles such as ceilings, floors, and walls, forming reverberation; excessive reverberation is one of the factors affecting listening comfort and far-field speech recognition performance. How to better remove reverberant sound is therefore a problem to be solved.
Disclosure of Invention
The application provides a dereverberation method, a dereverberation device, electronic equipment and a storage medium.
According to an aspect of the present application, there is provided a dereverberation method comprising:
decomposing each of at least one collected speech signal into a plurality of subband signals;
performing offset processing on each subband signal according to a set time offset to obtain a reference signal of each subband signal;
inputting each subband signal of the at least one speech signal and its corresponding reference signal into a dereverberation model to obtain an ideal ratio mask (IRM) corresponding to each subband signal of the at least one speech signal;
performing magnitude spectrum enhancement on each subband signal of the at least one speech signal with the corresponding IRM to obtain each dereverberated subband signal of the at least one speech signal;
and performing subband synthesis on the dereverberated subband signals of the same speech signal to obtain each dereverberated speech signal.
According to another aspect of the present application, there is provided a dereverberation apparatus comprising:
a decomposition module configured to decompose each of at least one collected speech signal into a plurality of subband signals;
an offset module configured to perform offset processing on each subband signal according to a set time offset to obtain a reference signal of each subband signal;
an input module configured to input each subband signal of the at least one speech signal and its corresponding reference signal into a dereverberation model to obtain an ideal ratio mask (IRM) corresponding to each subband signal of the at least one speech signal;
a dereverberation module configured to perform magnitude spectrum enhancement on each subband signal with the corresponding IRM to obtain each dereverberated subband signal of the at least one speech signal;
and a synthesis module configured to perform subband synthesis on the dereverberated subband signals of the same speech signal to obtain each dereverberated speech signal.
According to another aspect of the present application, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the dereverberation method described in an embodiment of the above aspect.
According to another aspect of the present application, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the dereverberation method according to the embodiment of the above aspect.
According to another aspect of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the above-described dereverberation method.
Other effects of the above alternative will be described below in connection with specific embodiments.
Drawings
The drawings are for better understanding of the present solution and do not constitute a limitation of the present application. Wherein:
FIG. 1 is a flow chart of a dereverberation method provided according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a process for generating a reverberation signal provided according to one embodiment of the present application;
FIG. 3 is a system block diagram of a dereverberation method provided in accordance with one embodiment of the present application;
FIG. 4 is a flow chart of another dereverberation method provided in accordance with an embodiment of the present disclosure;
FIG. 5 is a flow chart of another dereverberation method provided in accordance with an embodiment of the present application;
FIG. 6 is a flow chart of another dereverberation method provided in accordance with an embodiment of the present application;
FIG. 7 is a spectrogram of direct sound and early reflected sound according to an embodiment of the present application;
FIG. 8 is a spectrogram of a reverberant signal (before processing) according to one embodiment of the present application;
FIG. 9 is a spectral diagram of a weighted prediction error algorithm in the related art;
FIG. 10 is a spectral diagram of a dereverberation method of an embodiment of the present application;
FIG. 11 is a schematic diagram of a dereverberation apparatus provided according to an embodiment of the present disclosure;
fig. 12 is a block diagram of an electronic device of a dereverberation method according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The following describes a dereverberation method, an apparatus, an electronic device, and a storage medium according to embodiments of the present application with reference to the accompanying drawings.
Before describing the dereverberation method of the present application, reverberation itself will be briefly described.
When sound propagates indoors, it is reflected by various indoor obstacles such as ceilings, floors, walls, etc., and reverberation is formed, and excessive reverberation is one of factors affecting listening comfort and far-field speech recognition performance.
Depending on arrival time, a reverberant speech (audio) signal is typically split into three parts: direct sound, early reflected sound, and late reverberant sound. The direct sound is the signal component that travels from the sound source to the microphone along the shortest path without being reflected by obstacles such as ceilings, floors, and walls; early reflected sound is generally defined as the reflected signal components that reach the microphone within 50 milliseconds after the direct sound; the remaining signal components constitute the late reverberant sound. Numerous studies indicate that early reflections help improve speech intelligibility and listening comfort, while late reverberation reduces both. Thus, reducing late reverberation benefits both human subjective hearing and machine perception.
Therefore, the present application provides a dereverberation method in which late reverberant sound can be removed well by a band-independent dereverberation model.
Fig. 1 is a flow chart illustrating a dereverberation method according to an embodiment of the present disclosure.
The dereverberation method of the embodiment of the application can be performed by the dereverberation device provided by the embodiment of the application, and the device can be configured in an electronic device.
As shown in fig. 1, the dereverberation method of the embodiment of the present application includes:
step 101, decomposing at least one collected voice signal sub-band into a plurality of sub-band signals.
The speech signal is a sound signal received while dereverberation is performed.
Speech signals containing direct sound, early reflected sound, and late reverberant sound are collected by a recording device such as a microphone. After collection, the speech signals are decomposed by subband decomposition, converting the time-domain signals collected by the single-channel or multi-channel microphone into the frequency domain. For example, a Fourier-like transform may be used for the time-to-frequency conversion.
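As a concrete illustration, this time-to-frequency subband decomposition can be sketched with a windowed short-time Fourier transform; the frame length, hop size, and Hann window below are illustrative assumptions, not parameters taken from the patent.

```python
import numpy as np

def subband_decompose(y, frame_len=512, hop=256):
    """Decompose a time-domain signal into frame_len // 2 + 1 subband
    signals via a windowed short-time Fourier transform (STFT)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rows: time frames t, columns: frequency bands f
    return np.fft.rfft(frames, axis=1)

fs = 16000
t = np.arange(fs) / fs
y = np.sin(2 * np.pi * 440 * t)   # one second of a dry 440 Hz test tone
Y = subband_decompose(y)
print(Y.shape)                     # (n_frames, 257) complex subband values
```

Each column of `Y` is one subband signal evolving over time; the inverse transform with overlap-add recovers the time-domain signal.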
Step 102, performing offset processing on each sub-band signal according to the set time offset to obtain a reference signal of each sub-band signal.
The offset processing shifts the speech signal collected by the microphone to the right in time to form a reference channel, so that past speech frames are aligned with the current moment, allowing the dereverberation model to use past speech to cancel reverberant sound in the current speech signal.
The right shift adjusts the time argument. Assuming y(t) is the speech signal collected by the microphone, the reference signal obtained by offset processing with a set time offset of 2 ms is r(t) = y(t - 2).
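A minimal sketch of this offset step on one subband's frame sequence (the 2-frame delay below is an illustrative analogue of the 2 ms offset above):

```python
import numpy as np

def make_reference(subband, delay_frames=2):
    """Shift a subband frame sequence right by delay_frames frames so that
    past frames are aligned with the current frame: r[t] = y[t - delay]."""
    r = np.zeros_like(subband)
    r[delay_frames:] = subband[:-delay_frames]
    return r

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # frames of one subband signal
r = make_reference(y)
print(r)                                   # [0. 0. 1. 2. 3.]
```

The leading zeros simply mean no past context is available for the first frames.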
Step 103, inputting each subband signal of the at least one speech signal and the corresponding reference signal into the dereverberation model to obtain the ideal ratio mask (IRM) corresponding to each subband signal of the at least one speech signal.
The dereverberation model processes each subband signal of the at least one speech signal and its corresponding reference signal to obtain the ideal ratio mask (IRM) corresponding to each subband signal. The IRM takes values between 0 and 1 and represents the presence probability of the dereverberated speech signal.
Step 104, performing magnitude spectrum enhancement on each subband signal of the at least one speech signal with the corresponding IRM to obtain each dereverberated subband signal of the at least one speech signal.
The IRM, i.e., the presence probability of the dereverberated speech signal, is multiplied point-wise with the magnitude spectrum of the original speech signal collected by the microphone, yielding the enhanced magnitude spectrum, i.e., each dereverberated subband signal of the at least one speech signal.
Step 105, subband synthesis is performed on each subband signal dereverberated by the same speech signal to obtain each dereverberated speech signal.
The enhanced magnitude spectrum is combined with the phase of the original speech signal collected by the microphone to synthesize the dereverberated time-domain signal, i.e., the dereverberated speech signal.
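Steps 104 and 105 amount to a point-wise magnitude enhancement that reuses the phase of the original microphone signal; a minimal sketch with illustrative numbers:

```python
import numpy as np

def apply_irm(Y, irm):
    """Enhance the magnitude spectrum with the IRM (values in [0, 1])
    while keeping the phase of the original subband signal."""
    return np.abs(Y) * irm * np.exp(1j * np.angle(Y))

Y = np.array([[3 + 4j, 1 - 1j]])   # one frame, two subband signals
irm = np.array([[0.5, 1.0]])       # presence probability per subband
X = apply_irm(Y, irm)

# Magnitudes are scaled by the mask; phases are unchanged.
assert np.allclose(np.abs(X), np.abs(Y) * irm)
assert np.allclose(np.angle(X), np.angle(Y))
```

Feeding `X` to the inverse subband transform (the counterpart of the decomposition in step 101) then yields the dereverberated time-domain signal.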
According to the dereverberation method of the embodiment of the present application, each collected speech signal is decomposed into a plurality of subband signals; each subband signal is offset according to a set time offset to obtain a reference signal for each subband signal; each subband signal of the at least one speech signal and its corresponding reference signal are input into the dereverberation model to obtain the ideal ratio mask (IRM) corresponding to each subband signal; magnitude spectrum enhancement is performed on each subband signal with the corresponding IRM to obtain each dereverberated subband signal; and subband synthesis is performed on the dereverberated subband signals of the same speech signal to obtain each dereverberated speech signal. In this way, reverberation can be removed well by the band-independent dereverberation model.
Those skilled in the art will recognize that the dereverberation algorithm can improve the processing of subsequent models in a speech enhancement system. For example, dereverberation can increase the discrimination between noise and speech, so a noise reduction process following dereverberation can remove more noise.
Currently, the commonly used dereverberation algorithms are mainly divided into two categories: one class is based on signal processing schemes; another class is based on deep learning schemes.
Processing methods based on signal processing include spectral subtraction, blind deconvolution, and the weighted prediction error method. Spectral subtraction estimates the spectrum of the late reverberation signal and subtracts it from the spectrum of the microphone-collected speech signal. Blind deconvolution estimates the inverse of the transfer function from the sound source to the microphone; processing the reverberant signal with this inverse function recovers the reverberation-free speech signal. The weighted prediction error method estimates the reverberation component corresponding to the late room impulse response and subtracts it from the microphone-collected speech signal to obtain the direct sound and early reflections. It follows that signal-processing-based methods often require physical prior assumptions, such as short-time stationarity of speech or the assumption that noise and speech are uncorrelated, and such assumptions are difficult to fully satisfy in practical scenarios, which can lead to distortion of the target speech and insufficient reverberation suppression. In addition, signal-processing-based methods cannot be well matched with subsequent deep learning models, which degrades the performance of those algorithms.
Although processing methods based on deep learning do not require physical prior assumptions, current neural dereverberation models based on RNNs (Recurrent Neural Networks) and CNNs (Convolutional Neural Networks) are computationally very heavy and difficult to run in current real-time systems.
As described above, conventional signal-processing-based methods suffer from limited reverberation cancellation capability, susceptibility to speech distortion, and poor compatibility with other back-end models, while deep-learning-based methods suffer from excessive computational complexity and the inability to run in real time. Therefore, the present application provides a band-independent dereverberation model that combines the low computational complexity of signal processing with the data-driven nature of deep learning models, and that can eliminate the mismatch between signal processing schemes and back-end models through joint training with other back-end deep learning models.
Before describing the dereverberation model of the present application, the actual application scenario of the present application will be described.
It is assumed that the recording device is provided with N microphones. The short-time Fourier transform domain of the speech signals picked up by the microphones is represented as an N-dimensional vector y_{t,f}, where t is the time-frame index and f is the frequency-band index. In a far-field scene, y_{t,f} can be expressed as the following formula (1):
y_{t,f} = x_{t,f}^(early) + x_{t,f}^(tail) = s_{t,f} * h_1(f) + s_{t,f} * h_2(f) (1)
where s_{t,f} is the short-time Fourier transform representation of the reverberation-free speech signal, x_{t,f}^(early) is the early reflected sound plus direct sound, x_{t,f}^(tail) is the late reverberant sound, h_1(f) contains the room impulse response (Room Impulse Response, RIR) peak and the components within 50 milliseconds after the peak, and h_2(f) is the tail component of the RIR after 50 milliseconds. The goal of the dereverberation algorithm is to extract the early reflected sound and direct sound x_{t,f}^(early) from y_{t,f}. Fig. 2 illustrates the process by which the target reverberation-free signal x(n) generates the reverberant signal y(n) through convolution with the RIR.
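A toy numerical check of formula (1): splitting a room impulse response at the early/late boundary and convolving a dry signal with each part reproduces the microphone signal as the sum of the early and late components. The RIR taps and the 3-tap early window below are invented stand-ins for a real peak-plus-50 ms split.

```python
import numpy as np

rir = np.array([0.0, 0.2, 1.0, 0.5, 0.2, 0.1, 0.05])  # toy RIR, peak at index 2
split = 2 + 3                     # peak index + early-window length (stand-in for 50 ms)
h1 = np.where(np.arange(len(rir)) < split, rir, 0.0)  # direct sound + early reflections
h2 = rir - h1                                         # late-reverberation tail

s = np.random.default_rng(0).standard_normal(1600)    # stand-in for dry speech s
y = np.convolve(s, rir)           # microphone signal y = s * h
x_early = np.convolve(s, h1)      # s * h1
x_tail = np.convolve(s, h2)       # s * h2
assert np.allclose(y, x_early + x_tail)               # formula (1) holds
```

Because convolution is linear in the impulse response, the early/late decomposition of the RIR carries over exactly to the observed signal.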
To estimate x_{t,f}^(early), the present application uses a dereverberation model with the structure shown in fig. 3; the dereverberation method of the present application is described below in conjunction with the flow shown in fig. 4. Referring to fig. 3 and 4, the dereverberation method of the embodiment of the present application includes:
step 401, decomposing the collected at least one voice signal sub-band into a plurality of sub-band signals.
Step 402, performing offset processing on each sub-band signal according to the set time offset to obtain a reference signal of each sub-band signal.
The specific process is referred to steps 101 and 102, which are not described in detail in this embodiment.
Step 403, concatenating the m-th subband signal of the N speech signals with the corresponding reference signals to obtain an input signal, where m is a positive integer no greater than M, M is the total number of subband channels into which each speech signal is decomposed, and N is the total number of speech signals.
After the reference channels are formed in step 402, the original speech signals collected by the microphones and the reference signals are concatenated, forming a 2N-channel input that is fed to the subsequent dereverberation model as the input signal.
Step 404, the input signal is input into the dereverberation model to obtain the IRM corresponding to the m-th subband signal of each of the N speech signals.
In one embodiment of the present application, the dereverberation model comprises: a complex convolutional neural network (CNN) layer, a real CNN layer, an RNN layer, and an output layer.
The complex CNN layer extracts features from the complex-valued part of each frame of the input signal to obtain complex audio features indicating signal phase and magnitude. The real CNN layer converts the complex audio features of each frame into real-valued audio features, and a first IRM representation of each frame is predicted from the real audio features of that frame and of its adjacent frames. The recurrent neural network (RNN) layer predicts a second IRM representation of each frame from the first IRM representation of that frame and the first IRM representations of temporally preceding frames. The output layer maps the second IRM representation of each frame to the IRM corresponding to the m-th subband signal of each of the N speech signals.
After the dereverberation model receives the input signal, the complex-valued parts of each frame of the 2N-channel input are fed to the complex CNN layer and then to a complex batch-normalization (BN) layer. The complex BN layer introduces phase information into the dereverberation model so that it can estimate the ideal ratio mask (IRM) more accurately. The convolution kernel of the complex convolution is T1, and the convolution is causal.
A log() operation is applied to the output of the complex BN layer to convert the complex values to real values while compressing the dynamic range of the input. The result is then sent to the real CNN layer, which converts the complex audio features of each frame into real-valued audio features. The convolution kernel of the real convolution is still T1, and this convolution is also causal to preserve the causal property of the system. Using causal convolution layers lets the model retain more past information and strengthens its modeling capability.
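The causal convolutions mentioned above can be illustrated with a one-dimensional sketch: left-padding the input with kernel_size - 1 zeros guarantees that the output at frame t depends only on frames at or before t. The kernel values here are illustrative.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1-D causal convolution: output frame t uses only input frames <= t."""
    pad = np.concatenate([np.zeros(len(kernel) - 1), x])
    return np.array([pad[i:i + len(kernel)] @ kernel[::-1]
                     for i in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([0.5, 0.5])            # averages the current and previous frame
print(causal_conv1d(x, k))          # [0.5 1.5 2.5 3.5]
```

Note that the first output only sees x[0]; no future frame ever leaks into the result, which is what makes the layer usable in a real-time system.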
Then, the first IRM representation of each frame is predicted from the obtained real audio features of that frame and of its adjacent frames, and is input to the RNN layer. The RNN layer predicts the second IRM representation of each frame from the first IRM representation of that frame and the first IRM representations of temporally preceding frames. The RNN layer may use a GRU (Gated Recurrent Unit) or LSTM (Long Short-Term Memory) structure, giving the dereverberation model stronger temporal modeling capability.
The RNN layer outputs the second IRM representation to the output layer, which comprises a fully connected layer and an activation layer. The fully connected layer of each subband has output dimension N (the number of microphones), and the activation layer uses a sigmoid function to map the second IRM representation into the range [0, 1].
In step 405, magnitude spectrum enhancement is performed on each subband signal of the at least one speech signal with the corresponding IRM to obtain each dereverberated subband signal of the at least one speech signal.
In step 406, subband synthesis is performed on the dereverberated subband signals of the same speech signal to obtain each dereverberated speech signal.
The specific process is referred to steps 103-105, which are not described in detail in this embodiment.
According to the dereverberation method of the embodiment of the present application, each collected speech signal is decomposed into a plurality of subband signals; each subband signal is offset according to a set time offset to obtain a reference signal for each subband signal; each subband signal of the at least one speech signal and its corresponding reference signal are input into the dereverberation model to obtain the ideal ratio mask (IRM) corresponding to each subband signal; magnitude spectrum enhancement is performed on each subband signal with the corresponding IRM to obtain each dereverberated subband signal; and subband synthesis is performed on the dereverberated subband signals of the same speech signal to obtain each dereverberated speech signal. In this way, reverberation can be removed well by the band-independent dereverberation model. Moreover, the model combines the low computational complexity of signal processing with the data-driven nature of deep learning models, and can also eliminate the mismatch between signal processing schemes and back-end models through joint training with other back-end deep learning models.
To accelerate and stabilize training, improve dereverberation model performance, and reduce the number of dereverberation model parameters, in one embodiment of the present application, as shown in fig. 5, before the input signal is input into the dereverberation model (i.e., step 404), the method further comprises:
step 501, selecting one RNN layer from a plurality of candidate RNN layers as an RNN layer in a dereverberation model according to the frequency band of the mth subband signal; the plurality of candidate RNN layers respectively have different hidden layer node numbers, and the hidden layer node numbers of the RNN layers in the dereverberation model are matched with the frequency band where the m-th subband signal is located.
That is, m-th subband signals in different frequency bands use RNN layers with different numbers of nodes. For example, since rooms attenuate high-frequency reflected sound faster, low-frequency reverberation tends to be more severe, so the application can use RNN layers with more hidden-layer nodes in the low-frequency bands.
To further reduce the number of dereverberation model parameters while significantly increasing the training speed, in one embodiment of the present application, model parameters are shared between RNN layers to which multiple adjacent frequency bands are matched.
Assuming the M subbands are divided into 4 groups and the model parameters of adjacent frequency bins are shared within each group, the number of distinct parameter sets is reduced to a fraction (on the order of M/4 or fewer) of the per-band total, so the number of dereverberation model parameters is greatly reduced and training speed also increases significantly.
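A hypothetical sketch of the grouping described above: M subbands are mapped to 4 groups, each group shares one RNN parameter set, and lower-frequency groups are given larger hidden sizes. All numbers here are invented for illustration.

```python
M = 64                                   # total number of subbands (illustrative)
hidden_per_group = [128, 96, 64, 32]     # larger RNNs for the low frequencies

def group_of(band):
    """Map a subband index (0 = lowest frequency) to one of 4 groups."""
    return min(band * 4 // M, 3)

# One shared parameter set per group instead of one per subband.
shared_rnn_params = {g: {"hidden_size": h}
                     for g, h in enumerate(hidden_per_group)}
band_to_params = [shared_rnn_params[group_of(m)] for m in range(M)]

assert len({id(p) for p in band_to_params}) == 4    # only 4 distinct sets
assert band_to_params[0]["hidden_size"] > band_to_params[M - 1]["hidden_size"]
```

Sharing parameters this way also means every training batch updates each parameter set with gradients from many subbands, which is one reason training speeds up.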
To accelerate and stabilize training, improve dereverberation model performance, and reduce the number of dereverberation model parameters, in another embodiment of the present application, as shown in fig. 6, before the input signal is input into the dereverberation model (i.e., step 404), the method further comprises:
step 601, generating a plurality of room impulse responses RIRs according to configured sound source parameters; wherein the sound source parameters include one or more combinations of microphone array parameters, sample rate, room size, reverberation time, sound source location, and microphone array location.
In this embodiment, the virtual source (image-source) method is used to generate a database of room impulse responses (RIRs) under various parameter configurations, according to the configured microphone array parameters, sampling rate, room size, reverberation time, sound source position, and microphone array position.
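A true image-source simulation depends on room geometry and source/microphone positions; as a stand-in that only mimics the gross RT60 behaviour (function name and constants are assumptions, not the patent's generator), one might use:

```python
import numpy as np

def toy_rir(fs=16000, rt60=0.5, length_s=1.0, direct_delay_s=0.01, seed=0):
    """Stand-in for an image-source-method RIR generator.

    A real pipeline derives the RIR from room size, source and microphone
    positions; here a direct-path impulse followed by exponentially
    decaying noise mimics an RIR with the requested RT60.
    """
    rng = np.random.default_rng(seed)
    n = int(fs * length_s)
    t = np.arange(n) / fs
    decay = np.exp(-6.908 * t / rt60)      # 60 dB of decay over rt60 seconds
    rir = 0.1 * rng.standard_normal(n) * decay
    rir[int(fs * direct_delay_s)] = 1.0    # direct-path peak
    return rir
```

Varying `rt60`, `fs`, and the delay over the configured ranges yields a database analogous in shape to the one described, though without the spatial realism of the image-source method.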
Step 602, truncating the multiple RIRs respectively to obtain the corresponding truncated RIRs.
In this embodiment, each RIR is truncated 50 ms after its main peak, and the retained portion is treated as the direct sound and early reflections.
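The truncation step can be sketched as follows (a minimal version; the 50 ms window follows the text above, the function name is assumed):

```python
import numpy as np

def truncate_rir(rir, fs, keep_ms=50.0):
    """Keep the direct path and early reflections of an RIR.

    Cuts the response keep_ms milliseconds after its main peak, which is
    how this embodiment separates direct/early energy from late reverb.
    """
    peak = int(np.argmax(np.abs(rir)))
    end = peak + int(fs * keep_ms / 1000.0)
    return rir[:min(end, len(rir))]
```
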
And 603, taking the audio obtained by convolving each RIR before truncation with the standard voice as a training sample, and marking the training sample by adopting the audio obtained by convolving the RIR after truncation with the standard voice.
In this embodiment, the standard speech convolved with each RIR before truncation is used as an input training sample of the dereverberation model, and the standard speech convolved with the corresponding truncated RIR is used as the training target of the dereverberation model, i.e., the label of that training sample.
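Building one (input, target) pair by convolution can be sketched as (function name and the trim-to-clean-length alignment are assumptions; trimming is one common alignment choice):

```python
import numpy as np

def make_training_pair(clean, full_rir, early_rir):
    """Build one (input, target) pair for the dereverberation model.

    input  = clean speech * full RIR       (reverberant sample)
    target = clean speech * truncated RIR  (direct + early reflections)
    Both are trimmed to the clean-signal length so they stay aligned.
    """
    x = np.convolve(clean, full_rir)[:len(clean)]
    y = np.convolve(clean, early_rir)[:len(clean)]
    return x, y
```
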
Step 604, training the dereverberation model using the training samples.
In this embodiment, the scale-invariant signal-to-noise ratio (SI-SNR) is used as the loss function in the training process of the dereverberation model, and it is expressed as the following formula (2):

$\mathrm{SI\text{-}SNR} = 10\log_{10}\dfrac{\lVert s_{target}\rVert^2}{\lVert e_{noise}\rVert^2},\qquad s_{target} = \dfrac{\langle \hat{s},\, s\rangle\, s}{\lVert s\rVert^2},\qquad e_{noise} = \hat{s} - s_{target}\qquad (2)$

wherein $\hat{s}$ is the dereverberated speech signal output by the dereverberation model, $s$ is the target non-reverberant speech signal, $s_{target}$ is the target-speech component of the model output, and $e_{noise}$ is its noise component.
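A NumPy sketch of SI-SNR as in formula (2) (zero-meaning both signals first is the usual convention; the `eps` guard is an added assumption for numerical safety):

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB; its negation serves as a training loss.

    s_target is the projection of the estimate onto the target signal,
    e_noise is the residual; both signals are zero-meaned first.
    """
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    s_target = np.dot(estimate, target) * target / (np.dot(target, target) + eps)
    e_noise = estimate - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) /
                           (np.dot(e_noise, e_noise) + eps))
```

Because of the projection, any rescaling of the estimate leaves the score unchanged, which is what "scale-invariant" means here.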
In order to enable the dereverberation model to stably work for a long time and increase the generalization capability of the dereverberation model, long sentence training can be used, and random equalization and random gain processing can be performed on clean non-reverberant voice.
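The random-gain part of that augmentation might be sketched as follows (the dB range is an assumption, and the random-equalization step is omitted):

```python
import numpy as np

def random_gain(clean, rng, gain_db_range=(-10.0, 10.0)):
    """Apply a random gain to clean non-reverberant speech.

    One of the augmentations mentioned above for generalization; the
    gain range in dB is illustrative, not specified by the patent.
    """
    gain_db = rng.uniform(*gain_db_range)
    return clean * 10.0 ** (gain_db / 20.0)
```
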
The effect achieved by the dereverberation method of the present application is verified in connection with the test results of fig. 7-10.
Fig. 7 is a spectrogram of the direct sound and early reflections according to an embodiment of the present application, fig. 8 is a spectrogram of the reverberant signal (before processing) according to an embodiment of the present application, fig. 9 is a spectrogram produced by the weighted prediction error algorithm of the related art, and fig. 10 is a spectrogram produced by the dereverberation method of an embodiment of the present application. As can be seen from the spectrograms, compared with the weighted prediction error algorithm in the related art, the dereverberation method of the present application eliminates more late reverberation, and the speech signal envelope after the dereverberation processing of the present application is also closer to the direct sound and early reflection signals of fig. 7.
According to the experimental results, the dereverberation method provided by the application can well remove the late reverberation, and the voice distortion is controllable.
In the embodiment of the present application, at least one collected speech signal is decomposed by subband decomposition into a plurality of subband signals; each subband signal is offset according to a set time offset to obtain its reference signal; each subband signal of the at least one speech signal and the corresponding reference signal are input into the dereverberation model to obtain the ideal ratio mask (IRM) corresponding to each subband signal; amplitude spectrum enhancement is performed on each subband signal using the corresponding IRM to obtain the dereverberated subband signals; and subband synthesis is performed on the dereverberated subband signals of the same speech signal to obtain each dereverberated speech signal. In this way, reverberation can be removed well by the band-independent dereverberation model. The model combines the low computational complexity of signal processing with the data-driven advantages of deep learning, and joint training with other back-end deep learning models can eliminate the mismatch between the signal processing scheme and the back-end model. In addition, the present application uses RNN layers of different parameter scales for different frequency bands and shares parameters between adjacent frequency bands, which improves the low-frequency reverberation elimination and generalization capability of the model while reducing the number of model parameters.
In order to implement the above embodiments, the embodiments of the present application further provide a dereverberation apparatus.
Fig. 11 is a schematic structural diagram of a dereverberation device according to an embodiment of the present disclosure.
As shown in fig. 11, the dereverberation apparatus 1100 of the embodiment of the present application includes: decomposition module 1110, offset module 1120, input module 1130, dereverberation module 1140, and synthesis module 1150.
The decomposing module 1110 is configured to decompose the collected at least one voice signal into a plurality of subband signals through subband decomposition;
the offset module 1120 is configured to perform offset processing on each sub-band signal according to a set time offset, so as to obtain a reference signal of each sub-band signal;
an input module 1130, configured to input each sub-band signal of the at least one speech signal and a corresponding reference signal into the dereverberation model, so as to obtain an ideal ratio mask (IRM) corresponding to each sub-band signal of the at least one speech signal;
a dereverberation module 1140, configured to perform amplitude spectrum enhancement on each subband signal of the at least one speech signal using the corresponding IRM, so as to obtain the dereverberated subband signals of the at least one speech signal;
the synthesizing module 1150 is configured to perform subband synthesis on each subband signal dereverberated by the same speech signal to obtain each dereverberated speech signal.
In one possible implementation manner of the embodiment of the present application, the input module 1130 is configured to:
splicing the m-th sub-band signal of the N speech signals with the corresponding reference signal to obtain an input signal; wherein m is a positive integer less than or equal to M, M is the total number of subband signals into which each speech signal is decomposed, and N is the total number of channels of the at least one speech signal;
and inputting the input signal into the dereverberation model to obtain the IRM corresponding to the m-th sub-band signal of each voice signal in the N voice signals.
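A sketch of how the input module could splice the m-th subband signals with their delayed references (the array shapes and the zero-fill convention before the offset are assumptions):

```python
import numpy as np

def build_input(subband_frames, delay_frames):
    """Splice the m-th subband signal with its delayed reference.

    subband_frames: (N, T) array, one row per speech channel. The
    reference is the same signal shifted right by the set time offset,
    and channels plus references are stacked as the model input.
    """
    ref = np.roll(subband_frames, delay_frames, axis=-1)
    ref[..., :delay_frames] = 0.0          # no reference before the offset
    return np.concatenate([subband_frames, ref], axis=0)   # (2N, T)
```
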
In one possible implementation manner of the embodiment of the present application, the dereverberation model includes:
the complex convolution neural network CNN layer is used for extracting the characteristics of the complex parts of each frame signal in the input signal to obtain the audio characteristics of the complex parts of each frame signal, wherein the audio characteristics are used for indicating the signal phase and the signal amplitude;
the real number CNN layer is used for carrying out complex number to real number conversion processing on the audio characteristics of the complex number part of each frame signal to obtain real number audio characteristics of each frame signal; predicting to obtain a first IRM representation of each frame signal according to the real audio characteristics of each frame signal and the real audio characteristics of the corresponding adjacent frame signal;
The RNN layer is used for predicting a second IRM representation of each frame signal according to the first IRM representation of that frame signal and the first IRM representations of the temporally preceding frame signals;
and the output layer is used for mapping and obtaining the IRM corresponding to the m-th sub-band signal of each path of voice signal in the N paths of voice signals according to the second IRM representation of each frame signal.
In one possible implementation of an embodiment of the present application, before the input module 1130 inputs the input signal into the dereverberation model, the input module is further configured to:
selecting one RNN layer from a plurality of candidate RNN layers as an RNN layer in a dereverberation model according to the frequency band of the m-th subband signal;
the plurality of candidate RNN layers respectively have different hidden layer node numbers, and the hidden layer node numbers of the RNN layers in the dereverberation model are matched with the frequency band where the m-th subband signal is located.
In one possible implementation manner of the embodiment of the present application, model parameters are shared between RNN layers matched by a plurality of adjacent frequency bands.
In one possible implementation of an embodiment of the present application, before the input module 1130 inputs the input signal into the dereverberation model, the input module is further configured to:
generating a plurality of room impulse responses RIRs according to the configured sound source parameters; wherein the sound source parameters include one or more combinations of microphone array parameters, sample rate, room size, reverberation time, sound source location, and microphone array location;
Respectively truncating the plurality of RIRs to obtain the corresponding truncated RIRs;
taking the audio obtained by convoluting each RIR before truncation with the standard voice as a training sample, and marking the training sample by adopting the audio obtained by convoluting the RIR after truncation with the standard voice;
and training the dereverberation model by adopting a training sample.
It should be noted that the explanation of the embodiment of the dereverberation method is also applicable to the dereverberation device of this embodiment, and thus will not be repeated here.
According to the dereverberation device of the present application, the decomposition module decomposes the collected at least one speech signal into a plurality of subband signals; the offset module offsets each subband signal according to a set time offset to obtain its reference signal; the input module inputs each subband signal of the at least one speech signal and the corresponding reference signal into the dereverberation model to obtain the ideal ratio mask (IRM) corresponding to each subband signal; the dereverberation module performs amplitude spectrum enhancement on each subband signal using the corresponding IRM to obtain the dereverberated subband signals; and the synthesis module performs subband synthesis on the dereverberated subband signals of the same speech signal to obtain each dereverberated speech signal. In this way, reverberation can be removed well by the band-independent dereverberation model. The model combines the low computational complexity of signal processing with the data-driven advantages of deep learning, and joint training with other back-end deep learning models can eliminate the mismatch between the signal processing scheme and the back-end model. In addition, the present application uses RNN layers of different parameter scales for different frequency bands and shares parameters between adjacent frequency bands, which improves the low-frequency reverberation elimination and generalization capability of the model while reducing the number of model parameters.
To achieve the above embodiments, the present application proposes a computer program product comprising a computer program which, when executed by a processor, implements the dereverberation method of the above embodiments.
According to embodiments of the present application, an electronic device and a readable storage medium are also provided.
As shown in fig. 12, a block diagram of an electronic device is provided for a dereverberation method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 12, the electronic device includes: one or more processors 1201, memory 1202, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executing within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 1201 is illustrated in fig. 12.
Memory 1202 is a non-transitory computer-readable storage medium provided herein. Wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the dereverberation method provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the dereverberation method provided by the present application.
Memory 1202, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules (e.g., decomposition module 1110, offset module 1120, input module 1130, dereverberation module 1140, and synthesis module 1150, shown in fig. 11) corresponding to the dereverberation method in embodiments of the present application. The processor 1201 performs various functional applications of the server and data processing, i.e., implements the dereverberation method in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 1202.
Memory 1202 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created from the use of the dereverberated electronic device, etc. In addition, memory 1202 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 1202 optionally includes memory remotely located with respect to processor 1201, which may be connected to the dereverberated electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the dereverberation method may further include: an input device 1203 and an output device 1204. The processor 1201, the memory 1202, the input device 1203, and the output device 1204 may be connected by a bus or otherwise, for example in fig. 12.
The input device 1203 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the dereverberated electronic device, such as a touch screen, keypad, mouse, trackpad, touchpad, pointer stick, one or more mouse buttons, trackball, joystick, and like input devices. The output device 1204 may include a display apparatus, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibration motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computing programs (also referred to as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the defects of high management difficulty and weak service scalability in traditional physical hosts and VPS (Virtual Private Server) services.
According to the technical scheme, the method and the device for removing reverberation can well remove reverberation through the frequency band independent dereverberation model, and relate to the fields of voice technology and deep learning. And the model has the advantages of low signal processing and calculation complexity and data driving of the deep learning model, and meanwhile, the model can also eliminate the problem of unmatched signal processing schemes and back-end models through combined training with other deep learning models at the back-end. In addition, the method and the device adopt the form that RNN layers with different parameter scales and adjacent frequency band parameters are shared by different frequency bands, so that the low-frequency reverberation elimination amount and generalization capability of the model are improved, and the parameter quantity of the model is reduced.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed in the present application can be achieved, and are not limited herein.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims (10)

1. A method of dereverberation, comprising:
performing subband decomposition on at least one collected voice signal to obtain a plurality of sub-band signals;
performing offset processing on each sub-band signal according to a set time offset to obtain a reference signal of each sub-band signal;
inputting each sub-band signal of the at least one voice signal and a corresponding reference signal into a dereverberation model to obtain an ideal ratio mask (IRM) corresponding to each sub-band signal of the at least one voice signal;
performing amplitude spectrum enhancement on each sub-band signal of the at least one path of voice signal by adopting the corresponding IRM, to obtain each dereverberated sub-band signal of the at least one path of voice signal;
carrying out subband synthesis on each subband signal of the dereverberation of the same path of voice signal to obtain each path of dereverberation voice signal;
the inputting each sub-band signal of the at least one voice signal and the corresponding reference signal into a dereverberation model to obtain an ideal ratio mask (IRM) corresponding to each sub-band signal of the at least one voice signal includes:
splicing the m-th sub-band signal of the N paths of voice signals with the corresponding reference signal to obtain an input signal; wherein m is a positive integer less than or equal to M, M is the total number of sub-band signals into which each path of voice signal is decomposed, and N is the total number of channels of the at least one path of voice signal;
Inputting the input signal into the dereverberation model to obtain IRM corresponding to the m-th sub-band signal of each path of voice signal in the N paths of voice signals;
wherein the dereverberation model comprises:
a complex convolutional neural network CNN layer, which is used for extracting the characteristics of the complex part of each frame signal in the input signal to obtain the audio characteristics of the complex part of each frame signal, wherein the audio characteristics are used for indicating the signal phase and the signal amplitude;
the real number CNN layer is used for carrying out complex number to real number conversion processing on the audio characteristics of the complex number part of each frame signal to obtain real number audio characteristics of each frame signal; predicting to obtain a first IRM representation of each frame signal according to the real audio characteristics of each frame signal and the real audio characteristics of the corresponding adjacent frame signals;
the recurrent neural network RNN layer is used for predicting a second IRM representation of each frame signal according to the first IRM representation of that frame signal and the first IRM representations of the temporally preceding frame signals;
and the output layer is used for mapping and obtaining the IRM corresponding to the m-th sub-band signal of each path of voice signal in the N paths of voice signals according to the second IRM representation of each frame signal.
2. The dereverberation method according to claim 1, wherein before the inputting the input signal into the dereverberation model, further comprising:
selecting one RNN layer from a plurality of candidate RNN layers as the RNN layer in the dereverberation model according to the frequency band of the m-th subband signal;
the plurality of candidate RNN layers respectively have different numbers of hidden layer nodes, and the number of hidden layer nodes of the RNN layer in the dereverberation model is matched with the frequency band where the m-th subband signal is located.
3. The dereverberation method according to claim 2, wherein model parameters are shared between RNN layers to which a plurality of adjacent frequency bands are matched.
4. The dereverberation method according to claim 1, wherein before the inputting the input signal into the dereverberation model, further comprising:
generating a plurality of room impulse responses RIRs according to the configured sound source parameters; wherein the sound source parameters include one or more combinations of microphone array parameters, sampling rate, room size, reverberation time, sound source location, and microphone array location;
respectively truncating the RIRs to obtain the corresponding truncated RIRs;
Taking the audio obtained by convoluting each RIR before truncation with standard voice as a training sample, and marking the training sample by adopting the audio obtained by convoluting the RIR after truncation with the standard voice;
and training the dereverberation model by adopting the training sample.
5. A dereverberation apparatus comprising:
the decomposing module is used for decomposing the collected at least one voice signal into a plurality of sub-band signals through subband decomposition;
the offset module is used for carrying out offset processing on each sub-band signal according to a set time offset to obtain a reference signal of each sub-band signal;
the input module is used for inputting each sub-band signal of the at least one voice signal and a corresponding reference signal into a dereverberation model so as to obtain an ideal ratio mask (IRM) corresponding to each sub-band signal of the at least one voice signal;
the dereverberation module is used for performing amplitude spectrum enhancement on each sub-band signal of the at least one path of voice signal by adopting the corresponding IRM, to obtain each dereverberated sub-band signal of the at least one path of voice signal;
the synthesis module is used for carrying out subband synthesis on each subband signal of the dereverberation of the same path of voice signal so as to obtain each path of dereverberation voice signal;
Wherein, input module is used for:
splicing the m-th sub-band signal of the N paths of voice signals with the corresponding reference signal to obtain an input signal; wherein m is a positive integer less than or equal to M, M is the total number of sub-band signals into which each path of voice signal is decomposed, and N is the total number of channels of the at least one path of voice signal;
inputting the input signal into the dereverberation model to obtain IRM corresponding to the m-th sub-band signal of each path of voice signal in the N paths of voice signals;
wherein the dereverberation model comprises:
a complex convolutional neural network CNN layer, which is used for extracting the characteristics of the complex part of each frame signal in the input signal to obtain the audio characteristics of the complex part of each frame signal, wherein the audio characteristics are used for indicating the signal phase and the signal amplitude;
the real number CNN layer is used for carrying out complex number to real number conversion processing on the audio characteristics of the complex number part of each frame signal to obtain real number audio characteristics of each frame signal; predicting to obtain a first IRM representation of each frame signal according to the real audio characteristics of each frame signal and the real audio characteristics of the corresponding adjacent frame signals;
the RNN layer is used for predicting a second IRM representation of each frame signal according to the first IRM representation of that frame signal and the first IRM representations of the temporally preceding frame signals;
And the output layer is used for mapping and obtaining the IRM corresponding to the m-th sub-band signal of each path of voice signal in the N paths of voice signals according to the second IRM representation of each frame signal.
6. The dereverberation device of claim 5, wherein the input module is further configured to, prior to inputting the input signal into the dereverberation model:
selecting one RNN layer from a plurality of candidate RNN layers as the RNN layer in the dereverberation model according to the frequency band of the m-th subband signal;
the plurality of candidate RNN layers respectively have different numbers of hidden layer nodes, and the number of hidden layer nodes of the RNN layer in the dereverberation model is matched with the frequency band where the m-th subband signal is located.
7. The dereverberation device of claim 6, wherein model parameters are shared between RNN layers matched by a plurality of adjacent frequency bands.
8. The dereverberation device of claim 5, wherein the input module is further configured to, prior to inputting the input signal into the dereverberation model:
generating a plurality of room impulse responses RIRs according to the configured sound source parameters; wherein the sound source parameters include one or more combinations of microphone array parameters, sampling rate, room size, reverberation time, sound source location, and microphone array location;
Respectively truncating the RIRs to obtain the corresponding truncated RIRs;
taking the audio obtained by convoluting each RIR before truncation with standard voice as a training sample, and marking the training sample by adopting the audio obtained by convoluting the RIR after truncation with the standard voice;
and training the dereverberation model by adopting the training sample.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the dereverberation method of any of claims 1-4.
10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the dereverberation method of any of claims 1-4.
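The training-data generation recited in claim 8 can be sketched as follows. This is a minimal illustration, not the patented implementation: a synthetic exponentially decaying noise burst stands in for a simulated RIR (a real pipeline would generate RIRs from the configured room, sound-source, and microphone-array parameters, e.g. with an image-source simulator), and all names and values here (`rt60`, `keep_ms`, the 50 ms early-reflection cutoff) are illustrative assumptions.

```python
import numpy as np

def make_rir(rt60=0.5, fs=16000, length_s=0.8, seed=0):
    """Toy RIR: white noise shaped by an exponential decay set by RT60.
    A real RIR would come from a room-acoustics simulation using the
    configured room size, source position, and microphone positions."""
    rng = np.random.default_rng(seed)
    n = int(length_s * fs)
    t = np.arange(n) / fs
    # 60 dB (amplitude factor 1e-3) of decay over rt60 seconds:
    # exp(-k) = 1e-3  =>  k = 3*ln(10) ~= 6.9078
    decay = np.exp(-6.9078 * t / rt60)
    return rng.standard_normal(n) * decay

def truncate_rir(rir, fs=16000, keep_ms=50):
    """Keep only the direct path and early reflections (first keep_ms)."""
    return rir[: int(keep_ms * 1e-3 * fs)]

fs = 16000
# 1 s of noise stands in for the "standard voice" clean recording.
clean = np.random.default_rng(1).standard_normal(fs)
rir = make_rir(fs=fs)

# Training input: clean speech convolved with the full (pre-truncation) RIR,
# i.e. fully reverberated audio.
x = np.convolve(clean, rir)
# Training label: clean speech convolved with the truncated RIR, so the
# model is taught to remove only the late reverberation while keeping
# the direct path and early reflections.
y = np.convolve(clean, truncate_rir(rir, fs=fs))
```

Labeling with the truncated-RIR convolution (rather than the dry signal) is the design choice claim 8 describes: early reflections are perceptually benign and hard to remove, so the model only has to suppress the late tail.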
CN202011218042.4A 2020-11-04 2020-11-04 Dereverberation method, device, electronic equipment and storage medium Active CN112489668B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011218042.4A CN112489668B (en) 2020-11-04 2020-11-04 Dereverberation method, device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112489668A CN112489668A (en) 2021-03-12
CN112489668B true CN112489668B (en) 2024-02-02

Family

ID=74928110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011218042.4A Active CN112489668B (en) 2020-11-04 2020-11-04 Dereverberation method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112489668B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113257265A (en) * 2021-05-10 2021-08-13 北京有竹居网络技术有限公司 Voice signal dereverberation method and device and electronic equipment
CN113223543B (en) * 2021-06-10 2023-04-28 北京小米移动软件有限公司 Speech enhancement method, device and storage medium
CN114283827B (en) * 2021-08-19 2024-03-29 腾讯科技(深圳)有限公司 Audio dereverberation method, device, equipment and storage medium
CN114299977B (en) * 2021-11-30 2022-11-25 北京百度网讯科技有限公司 Method and device for processing reverberation voice, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105427860A (en) * 2015-11-11 2016-03-23 百度在线网络技术(北京)有限公司 Far field voice recognition method and device
US10283140B1 (en) * 2018-01-12 2019-05-07 Alibaba Group Holding Limited Enhancing audio signals using sub-band deep neural networks
CN109841206A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 A kind of echo cancel method based on deep learning
CN110310652A (en) * 2018-03-25 2019-10-08 深圳市麦吉通科技有限公司 Reverberation suppression method, apparatus for processing audio and computer readable storage medium
CN111161751A (en) * 2019-12-25 2020-05-15 声耕智能科技(西安)研究院有限公司 Distributed microphone pickup system and method under complex scene
CN111489760A (en) * 2020-04-01 2020-08-04 腾讯科技(深圳)有限公司 Speech signal dereverberation processing method, speech signal dereverberation processing device, computer equipment and storage medium
CN111540346A (en) * 2020-05-13 2020-08-14 慧言科技(天津)有限公司 Far-field sound classification method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20230048461A (en) * 2015-08-25 2023-04-11 돌비 레버러토리즈 라이쎈싱 코오포레이션 Audio decoder and decoding method
US10347271B2 (en) * 2015-12-04 2019-07-09 Synaptics Incorporated Semi-supervised system for multichannel source enhancement through configurable unsupervised adaptive transformations and supervised deep neural network


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Survey of Monaural Speech Denoising and Dereverberation; Lan Tian; Peng Chuan; Li Sen; Ye Wenzheng; Li Meng; Hui Guoqiang; Lü Yilan; Qian Yuxin; Liu Qiao; Journal of Computer Research and Development, no. 05; full text *


Similar Documents

Publication Publication Date Title
CN112489668B (en) Dereverberation method, device, electronic equipment and storage medium
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
JP7407580B2 (en) system and method
KR100486736B1 (en) Method and apparatus for blind source separation using two sensors
CN111968658B (en) Speech signal enhancement method, device, electronic equipment and storage medium
US11031028B2 (en) Information processing apparatus, information processing method, and recording medium
WO2015065682A1 (en) Selective audio source enhancement
WO2019113253A1 (en) Voice enhancement in audio signals through modified generalized eigenvalue beamformer
JP7214798B2 (en) AUDIO SIGNAL PROCESSING METHOD, AUDIO SIGNAL PROCESSING DEVICE, ELECTRONIC DEVICE, AND STORAGE MEDIUM
CN112466318B (en) Speech processing method and device and speech processing model generation method and device
CN111933164B (en) Training method and device of voice processing model, electronic equipment and storage medium
CN114974280A (en) Training method of audio noise reduction model, and audio noise reduction method and device
EP4172987A1 (en) Speech enhancement
CN112542176B (en) Signal enhancement method, device and storage medium
CN112466327B (en) Voice processing method and device and electronic equipment
US11902757B2 (en) Techniques for unified acoustic echo suppression using a recurrent neural network
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
JP2022028670A (en) Method, apparatus, electronic device, computer readable storage medium and computer program for determining displayed recognized text
CN112542177B (en) Signal enhancement method, device and storage medium
CN112491449B (en) Acoustic echo cancellation method, acoustic echo cancellation device, electronic equipment and storage medium
Kothapally et al. Monaural Speech Dereverberation using Deformable Convolutional Networks
US20240085935A1 (en) Neuralecho: a self-attentive recurrent neural network for unified acoustic echo suppression, speaker aware speech enhancement and automatic gain control
CN113823311B (en) Voice recognition method and device based on audio enhancement
US20230298612A1 (en) Microphone Array Configuration Invariant, Streaming, Multichannel Neural Enhancement Frontend for Automatic Speech Recognition
CN114299977B (en) Method and device for processing reverberation voice, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant