CN112201267A - Audio processing method and device, electronic equipment and storage medium

Info

Publication number: CN112201267A
Application number: CN202010930871.9A
Authority: CN
Inventors: 李楠; 张晨
Original assignee: Beijing Dajia Internet Information Technology Co Ltd
Current assignee: Beijing Dajia Internet Information Technology Co Ltd
Priority date: 2020-09-07
Filing date: 2020-09-07
Publication date: 2021-01-08

Abstract

The present disclosure relates to an audio processing method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring an audio signal to be processed; acquiring a noise signal included in the audio signal to be processed and a reverberation duration of the audio signal to be processed; determining a signal-to-noise ratio and a noise reduction gain factor of the audio signal to be processed according to the audio signal to be processed and the noise signal, and determining a reverberation signal included in the audio signal to be processed according to the audio signal to be processed and the reverberation duration; and dereverberating the audio signal to be processed according to the signal-to-noise ratio, the noise reduction gain factor, and the reverberation signal to obtain a dereverberated audio signal. Because the signal-to-noise ratio, the noise reduction gain factor, and the reverberation signal are all taken into account during dereverberation, the audio signal to be processed can be dereverberated effectively even when it contains a noise signal.

Description

Audio processing method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of audio technologies, and in particular, to an audio processing method and apparatus, an electronic device, and a storage medium.

Background

Acoustic reverberation is a common physical phenomenon produced by the reflection of sound waves. When the microphone is used for collecting the audio signal, the reverberation causes interference to the audio signal, and the serious reverberation causes the intelligibility of the audio signal to be reduced. Therefore, the dereverberation technology for audio signals has attracted certain attention in audio communication, high-quality voice capturing and playback, and other scenes.

In the related art, a dereverberation method based on WPE (Weighted Prediction Error) is usually adopted to dereverberate an audio signal, but the method has a high dependence on the signal-to-noise ratio of the audio signal, and when noise exists in the audio signal, the convergence of the algorithm is poor, and finally, the dereverberation effect is poor.

Disclosure of Invention

In order to solve the technical problem existing in the related art that the dereverberation effect is poor when noise exists in an audio signal and the audio signal is dereverberated, the present disclosure provides an audio processing method, an apparatus, an electronic device and a storage medium, and the technical scheme of the present disclosure is as follows:

according to a first aspect of the embodiments of the present disclosure, there is provided an audio processing method, including:

acquiring an audio signal to be processed;

acquiring a noise signal included in the audio signal to be processed and reverberation time of the audio signal to be processed;

determining a signal-to-noise ratio and a noise reduction gain factor of the audio signal to be processed according to the audio signal to be processed and the noise signal, and determining a reverberation signal included in the audio signal to be processed according to the audio signal to be processed and the reverberation time length;

and dereverberating the audio signal to be processed according to the signal-to-noise ratio, the noise reduction gain factor and the reverberation signal to obtain an audio signal after dereverberation.

Optionally, the dereverberating the audio signal to be processed according to the signal-to-noise ratio, the noise reduction gain factor, and the reverberation signal to obtain a dereverberated audio signal includes:

for any current frame audio signal of the audio signal to be processed, calculating a first gain factor corresponding to the current frame audio signal through the audio signal of the current frame audio signal, the reverberation signal of the current frame audio signal and a preset minimum dereverberation gain factor;

calculating a second gain factor according to the first gain factor and the noise reduction gain factor corresponding to the current frame audio signal;

smoothing the first gain factor and the second gain factor to obtain a target gain factor;

and performing dereverberation on the current frame audio signal through the target gain factor to obtain a dereverberated audio signal corresponding to the current frame audio signal.

Optionally, when the signal-to-noise ratio of the current frame audio signal is smaller than a preset signal-to-noise ratio and the reverberation duration of the current frame audio signal is greater than a preset reverberation duration,

calculating a second gain factor according to the first gain factor and the noise reduction gain factor corresponding to the current frame audio signal includes:

calculating the dereverberation and denoising scale factor of the current frame audio signal according to the following formula:

wherein gamma is the dereverberation and denoising scale factor of the current frame audio signal, and SNR(n) is the signal-to-noise ratio corresponding to the current frame audio signal;

the second gain factor is calculated according to the following formula:

wherein G_tmp is the second gain factor, G_dereverb(n) is the first gain factor corresponding to the current frame audio signal, and G_denoise(n) is the noise reduction gain factor corresponding to the current frame audio signal.

Optionally, when the signal-to-noise ratio of the current frame audio signal is greater than a preset signal-to-noise ratio, or the reverberation duration of the current frame audio signal is less than a preset reverberation duration, calculating the second gain factor includes:

determining the noise reduction gain factor corresponding to the current frame audio signal as the second gain factor.

Optionally, the determining the reverberation signal of the audio signal to be processed according to the audio signal to be processed and the reverberation time length includes:

for any current frame audio signal of the audio signals to be processed, calculating the energy of the previous frame audio signal of the current frame audio signal after the attenuation of the excitation energy vector;

determining the excitation energy vector of the current frame audio signal according to the maximum value between the energy of the previous frame audio signal after the attenuation of the excitation energy vector and the energy of the current frame audio signal;

and determining the reverberation signal of the current frame audio signal according to the excitation energy vector of the current frame audio signal, the time interval between two adjacent frames of the audio signal to be processed and the reverberation duration corresponding to the current frame audio signal.

Optionally, the calculating, for any current frame of audio signals of the audio signals to be processed, energy after attenuation of an excitation energy vector of an audio signal of a previous frame of the current frame of audio signals includes:

when a current frame audio signal of the audio signal to be processed is a first frame audio signal of the audio signal to be processed, determining the energy of the current frame audio signal after attenuation of an excitation energy vector as 0;

when the current frame audio signal of the audio signal to be processed is not the first frame audio signal of the audio signal to be processed, calculating the energy after the attenuation of the excitation energy vector of the previous frame audio signal of the current frame audio signal according to the following formula:

wherein R(n) is the energy of the previous frame of audio signal of the current frame of audio signal after attenuation of the excitation energy vector, Ra(n-1) is the excitation energy vector of the previous frame of audio signal of the current frame of audio signal, and T₁ is the time interval between two adjacent frames of the audio signal to be processed.

Optionally, the determining the reverberation signal of the current frame audio signal according to the excitation energy vector of the current frame audio signal, the time interval between two adjacent frames of the audio signal to be processed, and the reverberation duration corresponding to the current frame audio signal includes:

calculating a reverberation signal of the current frame audio signal according to the following formula:

M = RT60(n) / T₁;

wherein Sr(n) is the reverberation signal of the current frame audio signal, RT60(n) is the reverberation duration corresponding to the current frame audio signal, M is the number of frames within the reverberation duration corresponding to the current frame audio signal, and Ra(n-M) is the energy after attenuation of the excitation energy vector of the frame M frames before the current frame audio signal.
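As a rough illustration of the three determinations above, the following Python sketch tracks the excitation energy Ra(n) as the maximum of the decayed previous value R(n) and the current frame energy, and computes the frame count M = RT60(n) / T₁. The attenuation formula is not reproduced in this text, so the sketch assumes the conventional RT60 definition (energy falls by 60 dB over one reverberation duration). The function names, the use of a single constant RT60, and the rounding of M to an integer are all assumptions made for illustration.

```python
def frames_in_reverb(rt60_s, t1_s):
    """M = RT60 / T1, rounded to a whole number of frames (rounding is an assumption)."""
    return max(1, round(rt60_s / t1_s))


def track_excitation(energies, rt60_s, t1_s):
    """Return Ra(n) for each frame energy in `energies`.

    Assumption: the per-frame attenuation corresponds to a 60 dB energy
    decay over RT60 seconds; the source text does not show its own formula.
    """
    decay = 10.0 ** (-6.0 * t1_s / rt60_s)
    ra_values = []
    prev_ra = 0.0  # for the first frame the attenuated energy is 0
    for energy in energies:
        attenuated = prev_ra * decay      # R(n)
        prev_ra = max(attenuated, energy) # Ra(n)
        ra_values.append(prev_ra)
    return ra_values
```

With a 10 ms frame interval and a 300 ms reverberation duration, `frames_in_reverb(0.3, 0.01)` gives 30 frames, and an isolated burst of unit energy decays to about one millionth of its value over those 30 frames under the assumed decay law.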

Optionally, the acquiring the audio signal to be processed includes:

acquiring an original audio signal;

carrying out short-time Fourier transform on the original audio signal to obtain a time-frequency domain signal of the original audio signal;

and determining the time-frequency domain signal as the audio signal to be processed.

Optionally, the method further includes:

and removing the noise signal included in the audio signal to be processed by the noise reduction gain factor.

According to a second aspect of the embodiments of the present disclosure, there is provided an audio processing apparatus including:

an audio signal acquisition module configured to perform acquisition of an audio signal to be processed;

a noise signal and reverberation duration acquisition module configured to perform acquisition of a noise signal included in the audio signal to be processed and a reverberation duration of the audio signal to be processed;

a signal-to-noise ratio and reverberation signal determination module configured to determine a signal-to-noise ratio and a noise reduction gain factor of the audio signal to be processed according to the audio signal to be processed and the noise signal, and determine a reverberation signal included in the audio signal to be processed according to the audio signal to be processed and a reverberation time length;

and the dereverberation module is configured to dereverberate the audio signal to be processed according to the signal-to-noise ratio, the noise reduction gain factor and the reverberation signal to obtain a dereverberated audio signal.

Optionally, the dereverberation module includes:

a first gain factor calculation unit configured to calculate, for any current frame audio signal of the audio signals to be processed, a first gain factor corresponding to the current frame audio signal through an audio signal of the current frame audio signal, a reverberation signal of the current frame audio signal, and a preset minimum dereverberation gain factor;

a second gain factor calculation unit configured to calculate a second gain factor according to the first gain factor and a noise reduction gain factor corresponding to the current frame audio signal;

a target gain factor calculation unit configured to perform smoothing processing on the first gain factor and the second gain factor to obtain a target gain factor;

and the dereverberation unit is configured to perform dereverberation on the current frame audio signal through the target gain factor to obtain a dereverberated audio signal corresponding to the current frame audio signal.

the second gain factor calculation unit is specifically configured to perform:

the second gain factor is calculated according to the following formula:

the second gain factor calculation unit is specifically configured to perform:

Optionally, the signal-to-noise ratio and reverberation signal determining module includes:

the energy calculation unit is configured to calculate the energy of an excitation energy vector of an audio signal in the previous frame of the audio signal of the current frame after attenuation for any current frame of the audio signal to be processed;

an excitation energy vector determination unit configured to perform determining an excitation energy vector of the current frame audio signal according to a maximum value between an energy of the previous frame audio signal after attenuation of the excitation energy vector and an energy of the current frame audio signal;

and the reverberation signal determination unit is configured to determine the reverberation signal of the current frame audio signal according to the excitation energy vector of the current frame audio signal, the time interval between two adjacent frames of the audio signal to be processed and the reverberation time length corresponding to the current frame audio signal.

Optionally, the energy calculating unit is specifically configured to perform:

wherein R(n) is the energy after attenuation of the excitation energy vector of the previous frame of audio signal of the current frame of audio signal, Ra(n-1) is the excitation energy vector of the previous frame of audio signal of the current frame of audio signal, and T₁ is the time interval between two adjacent frames of the audio signal to be processed.

Optionally, the reverberation signal determination unit is specifically configured to perform:

M = RT60(n) / T₁;

Optionally, the audio signal obtaining module is specifically configured to perform:

acquiring an original audio signal;

Optionally, the apparatus further comprises:

a denoising module configured to perform denoising of a noise signal included in the audio signal to be processed by the denoising gain factor.

According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to execute the instructions to implement the audio processing method of the first aspect.

According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium having instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the audio processing method of the first aspect.

According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product containing instructions which, when run on a computer, cause the computer to carry out the audio processing method of the first aspect.

According to the technical scheme provided by the embodiment of the disclosure, an audio signal to be processed is obtained; acquiring a noise signal included in an audio signal to be processed and reverberation time of the audio signal to be processed; determining the signal-to-noise ratio and the noise reduction gain factor of the audio signal to be processed according to the audio signal to be processed and the noise signal; determining a reverberation signal of the audio signal to be processed according to the audio signal to be processed and the reverberation time; and removing reverberation of the audio signal to be processed according to the signal-to-noise ratio, the noise reduction gain factor and the reverberation signal to obtain the audio signal after removing reverberation.

Therefore, with the technical scheme provided by the embodiment of the disclosure, the signal-to-noise ratio, the noise reduction gain factor and the reverberation signal are all considered when the audio signal to be processed is dereverberated, so that the audio signal to be processed can be dereverberated well even when it contains a noise signal. Moreover, the interference of reverberation with the audio signal can be stably reduced without distorting the dereverberated audio signal, which improves the quality and intelligibility of the audio signal in real-time communication scenarios.

Drawings

FIG. 1 is a flow diagram illustrating a method of audio processing according to an exemplary embodiment;

FIG. 2 is a flowchart of one implementation of step S14 in the embodiment of FIG. 1;

FIG. 3 is a flowchart illustrating one embodiment of determining a reverberation signal included in an audio signal to be processed according to the audio signal to be processed and a reverberation time duration according to an exemplary embodiment;

FIG. 4 is a schematic diagram illustrating an audio processing procedure in accordance with an exemplary embodiment;

FIG. 5 is a block diagram illustrating an audio processing device according to an exemplary embodiment;

FIG. 6 is a block diagram illustrating an electronic device in accordance with an exemplary embodiment;

FIG. 7 is a block diagram illustrating an audio processing device according to an example embodiment;

FIG. 8 is a block diagram illustrating another audio processing device according to an example embodiment.

Detailed Description

In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.

It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

In order to solve the technical problem in the related art that dereverberation performs poorly when the audio signal contains noise, embodiments of the present disclosure provide an audio processing method, an apparatus, an electronic device and a storage medium.

In a first aspect, an audio processing method provided by an embodiment of the present disclosure is first explained in detail.

As shown in fig. 1, an audio processing method provided in an embodiment of the present disclosure may include the following steps:

In step S11, an audio signal to be processed is acquired.

Specifically, the audio signal to be processed is an audio signal to be dereverberated. In practical applications, the audio signal to be processed usually includes signal components such as a speech signal, a noise signal and a reverberation signal.

In one embodiment, acquiring the audio signal to be processed may include the following steps a1 to a3:

Step a1, an original audio signal is obtained.

Step a2, performing short-time Fourier transform on the original audio signal to obtain a time-frequency domain signal of the original audio signal.

Step a3, determining the time-frequency domain signal as the audio signal to be processed.

In this embodiment, the original audio signal may be converted into a time-frequency domain signal by a short-time Fourier transform, as follows:

X(n) = STFT(x(t))

wherein x(t) is the time-domain audio signal, X(n) is the time-frequency domain audio signal, n is the frame index with 0 < n ≤ N, and N is the total number of frames. It should be noted that, in the embodiment of the present disclosure, the processing is the same for every frequency band, so the frequency-band index is omitted from the notation for frequency-domain signals. After the original audio signal is converted into a time-frequency domain signal, the time-frequency domain signal may be determined as the audio signal to be processed.
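Steps a1 to a3 can be sketched in Python with NumPy as shown below. This is a minimal illustration only: the frame length, hop size, and Hann window are assumed values that the disclosure does not specify, and the function name `stft_frames` is hypothetical.

```python
import numpy as np


def stft_frames(x, frame_len=512, hop=256):
    """Return X with shape (num_frames, frame_len // 2 + 1).

    Row n holds the spectrum of frame n; frame_len, hop and the Hann
    window are illustrative assumptions, not values from the disclosure.
    """
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        x = np.pad(x, (0, frame_len - len(x)))
    window = np.hanning(frame_len)
    num_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(num_frames)]
    )
    return np.fft.rfft(frames, axis=1)


# Example: one second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
X = stft_frames(np.sin(2 * np.pi * 440.0 * t))
```

Each row of `X` plays the role of X(n) for one frame; the later per-frame steps would then be applied independently to every frequency bin of that row.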

In step S12, a noise signal included in the audio signal to be processed and a reverberation time period of the audio signal to be processed are acquired.

Specifically, after the audio signal to be processed is acquired, the noise signal included in the audio signal to be processed may be extracted through methods such as stationary noise estimation based on a time window, noise estimation based on statistics, and the like. Those skilled in the art should understand that, for a specific implementation process of extracting a noise signal based on stationary noise estimation of a time window or based on statistical noise estimation, details of the embodiment of the disclosure are not repeated here. In addition, the embodiment of the present disclosure does not specifically limit the way of extracting the noise signal included in the audio signal to be processed. In addition, in practical applications, only stationary noise signals included in the audio signal to be processed may be extracted, and non-stationary noise signals included in the signal to be processed may not be extracted.

The reverberation duration of the audio signal to be processed may be obtained from the attenuation characteristics of the audio signal to be processed. Specifically, the audio signal to be processed may be input into a pre-trained reverberation duration estimation model. The model extracts the attenuation characteristics of each frame of audio signal included in the audio signal to be processed and, based on those characteristics, outputs RT60(n), the reverberation duration corresponding to each frame of audio signal.

Of course, the reverberation time of the audio signal to be processed may also be obtained through other implementation manners, and the specific implementation manner of obtaining the reverberation time of the audio signal to be processed through the attenuation feature of the audio signal to be processed is not particularly limited in the embodiment of the present disclosure.

In step S13, the signal-to-noise ratio and the noise reduction gain factor of the audio signal to be processed are determined according to the audio signal to be processed and the noise signal, and the reverberation signal included in the audio signal to be processed is determined according to the audio signal to be processed and the reverberation time length.

Specifically, after the audio signal to be processed and the noise signal are obtained, the signal-to-noise ratio and the noise reduction gain factor of each frame of audio signal included in the audio signal to be processed may be estimated by using the audio signal to be processed X(n) and the noise signal Noise(n), so as to obtain the signal-to-noise ratio and the noise reduction gain factor of the audio signal to be processed.

The signal-to-noise ratio of each frame of audio signal included in the audio signal to be processed may be calculated using the following formula:

wherein SNR(n) represents the signal-to-noise ratio corresponding to the nth frame of audio signal, X(n) is the signal voltage corresponding to the nth frame of audio signal, and Noise(n) is the signal voltage corresponding to the noise signal of the nth frame.
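The formula itself is not reproduced in this text. As a hedged illustration, the sketch below assumes the conventional amplitude-based definition of signal-to-noise ratio in decibels, 20·log10 of the ratio of RMS magnitudes, which is consistent with the description of X(n) and Noise(n) as signal voltages and with thresholds expressed in dB. The function name `frame_snr_db` and the small constant `eps` are assumptions.

```python
import numpy as np


def frame_snr_db(x_frame, noise_frame, eps=1e-12):
    """Per-frame SNR in dB, assuming the usual 20*log10 amplitude ratio.

    This is an assumed definition; the disclosure's own formula is not
    shown in the text. eps guards against division by zero.
    """
    signal_rms = np.sqrt(np.mean(np.abs(x_frame) ** 2))
    noise_rms = np.sqrt(np.mean(np.abs(noise_frame) ** 2))
    return 20.0 * np.log10((signal_rms + eps) / (noise_rms + eps))
```

Under this definition, a frame whose magnitude is ten times that of the estimated noise has an SNR of 20 dB.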

The noise reduction gain factor of each frame of audio signal included in the audio signal to be processed may be calculated using the following formula:

The reverberation signal of the audio signal to be processed may also be determined according to the audio signal to be processed and the reverberation duration. Specifically, after the reverberation duration of each frame of audio signal included in the audio signal to be processed is obtained, the reverberation signal of the audio signal to be processed may be obtained according to the audio signal to be processed and the reverberation duration of each frame of audio signal.

For clarity of the description of the scheme, a specific implementation manner of determining the reverberation signal of the audio signal to be processed according to the audio signal to be processed and the reverberation time length will be explained in detail in the following embodiments.

In step S14, the audio signal to be processed is dereverberated according to the signal-to-noise ratio, the noise reduction gain factor, and the reverberation signal, so as to obtain a dereverberated audio signal.

Specifically, after the signal-to-noise ratio, the noise reduction gain factor, and the reverberation signal of the audio signal to be processed are obtained, the reverberation of the audio signal to be processed may be removed according to the signal-to-noise ratio, the noise reduction gain factor, and the reverberation signal, so as to obtain the audio signal after being removed from the reverberation. Therefore, according to the technical scheme provided by the embodiment of the disclosure, when the audio signal to be processed is dereverberated, the signal-to-noise ratio, the noise reduction gain factor and the reverberation signal are considered, so that the audio signal to be processed can be dereverberated well when noise exists in the audio signal to be processed.

For clarity of the description of the scheme, a specific implementation of step S14, dereverberating the audio signal to be processed according to the signal-to-noise ratio, the noise reduction gain factor and the reverberation signal to obtain a dereverberated audio signal will be described in detail in the following embodiments.

According to the technical scheme provided by the embodiment of the disclosure, an audio signal to be processed is obtained; acquiring a noise signal included in an audio signal to be processed and reverberation time of the audio signal to be processed; determining the signal-to-noise ratio and the noise reduction gain factor of the audio signal to be processed according to the audio signal to be processed and the noise signal; determining a reverberation signal of the audio signal to be processed according to the audio signal to be processed and the reverberation time length; and removing reverberation of the audio signal to be processed according to the signal-to-noise ratio, the noise reduction gain factor and the reverberation signal to obtain the audio signal after removing reverberation.

In one embodiment, as shown in FIG. 2, dereverberating the audio signal to be processed according to the signal-to-noise ratio, the noise reduction gain factor, and the reverberation signal to obtain a dereverberated audio signal may include the following steps:

In step S141, for any current frame audio signal of the audio signal to be processed, a first gain factor corresponding to the current frame audio signal is calculated from the energy of the current frame audio signal, the reverberation signal of the current frame audio signal, and a preset minimum dereverberation gain factor.

Specifically, the current frame audio signal may be any one of the audio signals to be processed. The first gain factor corresponding to the current frame audio signal may be calculated according to the following formula:

wherein G_dereverb(n) is the first gain factor corresponding to the current frame audio signal, |X(n)|² is the energy of the current frame audio signal, Sr(n) is the reverberation signal corresponding to the current frame audio signal, and lambda is a preset minimum dereverberation gain factor. The value of lambda can be set according to the maximum amount of dereverberation required; for example, lambda may be 0.1.
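The formula for the first gain factor is not reproduced in this text. Purely as an illustration of how the three named quantities could interact, the sketch below assumes a spectral-subtraction-style gain, 1 - Sr(n) / |X(n)|², floored at lambda so that the gain never falls below the preset minimum. This form is an assumption and may differ from the disclosure's formula; the function and parameter names are hypothetical.

```python
def first_gain(frame_energy, reverb_energy, min_gain=0.1, eps=1e-12):
    """Assumed spectral-subtraction-style dereverberation gain.

    frame_energy  : |X(n)|^2, energy of the current frame
    reverb_energy : Sr(n), estimated reverberation energy
    min_gain      : lambda, the preset minimum dereverberation gain
    """
    raw_gain = 1.0 - reverb_energy / (frame_energy + eps)
    return max(raw_gain, min_gain)
```

With `min_gain` at 0.1, a frame dominated by reverberation is attenuated by at most a factor of ten, which matches the role the text gives lambda as a limit on how much dereverberation is applied.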

In step S142, a second gain factor is calculated according to the first gain factor and the noise reduction gain factor corresponding to the current frame audio signal.

Specifically, after the first gain factor is obtained through calculation, the second gain factor may be calculated according to the first gain factor and the noise reduction gain factor corresponding to the current frame audio signal.

Calculating the second gain factor accurately allows the audio signal to be processed to be dereverberated more effectively in the subsequent steps. In practical applications, there are two cases when calculating the second gain factor.

In the first case, the signal-to-noise ratio of the current frame audio signal is smaller than a preset signal-to-noise ratio, and the reverberation duration of the current frame audio signal is greater than a preset reverberation duration. For example, the preset signal-to-noise ratio may be 20 dB, and the preset reverberation duration may be 300 ms.

At this time, calculating the second gain factor according to the first gain factor and the noise reduction gain factor corresponding to the current frame audio signal may include the following steps:

calculating the dereverberation and de-noise scale factor of the current frame audio signal according to the following formula:

wherein gamma is the dereverberation and denoising scale factor of the current frame audio signal, and SNR(n) is the signal-to-noise ratio corresponding to the current frame audio signal;

the second gain factor is calculated according to the following formula:

wherein G_tmp is the second gain factor, G_dereverb(n) is the first gain factor corresponding to the current frame audio signal, and G_denoise(n) is the noise reduction gain factor corresponding to the current frame audio signal.

In the second case, the signal-to-noise ratio of the current frame audio signal is greater than the preset signal-to-noise ratio, or the reverberation duration of the current frame audio signal is less than the preset reverberation duration. As before, the preset signal-to-noise ratio may be 20 dB, and the preset reverberation duration may be 300 ms.

In this case, calculating the second gain factor according to the first gain factor and the noise reduction gain factor corresponding to the current frame audio signal includes determining the noise reduction gain factor corresponding to the current frame audio signal as the second gain factor.

That is, G_tmp = G_denoise(n).
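The two cases can be summarized as a simple branch. The sketch below uses the example thresholds from the text (20 dB and 300 ms). Because the gamma-weighted combination formula for the first case is not reproduced in this text, it is passed in as a hypothetical `combine` callable rather than guessed; `placeholder_combine` exists only so the example runs and is not the disclosure's formula. The text does not state which case applies when a value equals its threshold exactly; this sketch treats equality as the second case.

```python
SNR_THRESHOLD_DB = 20.0    # example preset signal-to-noise ratio
RT60_THRESHOLD_MS = 300.0  # example preset reverberation duration


def second_gain(g_dereverb, g_denoise, snr_db, rt60_ms, combine):
    """Select G_tmp for the current frame.

    combine(g_dereverb, g_denoise, snr_db) stands in for the
    gamma-weighted formula, which is not shown in the source text.
    """
    if snr_db < SNR_THRESHOLD_DB and rt60_ms > RT60_THRESHOLD_MS:
        # Low SNR and long reverberation: combine both gains.
        return combine(g_dereverb, g_denoise, snr_db)
    # Otherwise the noise reduction gain is used directly.
    return g_denoise


def placeholder_combine(g_dereverb, g_denoise, snr_db):
    """Illustrative stand-in only; NOT the formula from the disclosure."""
    return min(g_dereverb, g_denoise)
```

Passing the combination rule as a parameter keeps the branching logic, which the text does specify, separate from the formula, which it does not.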

In step S143, the first gain factor and the second gain factor are smoothed to obtain a target gain factor.

After the first gain factor and the second gain factor are calculated, the first gain factor and the second gain factor may be smoothed to obtain a target gain factor for final dereverberation.

Specifically, the first gain factor and the second gain factor may be smoothed according to the following formula:

G(n) = smooth * G_dereverb(n-1) + (1 - smooth) * G_tmp

where smooth is a smoothing coefficient whose value may be close to 1, such as 0.9. When n is 1, G_dereverb(n-1) is the initialized gain factor, i.e. G_dereverb(0) = 1.
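The smoothing step is a weighted average of the previous frame's first gain factor and the current second gain factor. A minimal Python sketch follows; the function name `target_gain` is hypothetical, and the default of 0.9 is the example value given in the text.

```python
def target_gain(prev_g_dereverb, g_tmp, smooth=0.9):
    """G(n) = smooth * G_dereverb(n-1) + (1 - smooth) * G_tmp.

    For the first frame, prev_g_dereverb is the initial value
    G_dereverb(0) = 1.
    """
    return smooth * prev_g_dereverb + (1.0 - smooth) * g_tmp
```

For the first frame with G_tmp = 0.5, this gives 0.9 × 1 + 0.1 × 0.5 = 0.95, so a smoothing coefficient close to 1 lets the gain move only gradually away from its previous value.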

In step S144, dereverberation is performed on the current frame audio signal by using the target gain factor, so as to obtain a dereverberated audio signal corresponding to the current frame audio signal.

Specifically, after the target gain factor G(n) corresponding to the current frame audio signal is obtained, dereverberation may be performed on the current frame audio signal through the target gain factor, so as to obtain a dereverberated audio signal corresponding to the current frame audio signal.

The dereverberated audio signal corresponding to the current frame audio signal can be obtained according to the following formula:

Y(n) = G(n) * X(n)

where Y(n) is the dereverberated audio signal corresponding to the current frame audio signal, G(n) is the target gain factor corresponding to the current frame audio signal, and X(n) is the current frame audio signal.
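As a minimal sketch of this step, the following assumes that X(n) is a list of spectral bins (real or complex) and that G(n) is either one scalar for the whole frame or one value per bin. Both representations and the function name are assumptions made for illustration.

```python
def apply_gain(frame, gain):
    """Compute Y(n) = G(n) * X(n) for a single frame.

    frame: list of real or complex bin values, X(n)
    gain:  a scalar, or a list with one gain per bin, G(n)
    """
    if isinstance(gain, (int, float)):
        return [gain * x for x in frame]
    if len(gain) != len(frame):
        raise ValueError("per-bin gain must match the frame length")
    return [g * x for g, x in zip(gain, frame)]
```

Repeating this for every frame and concatenating the results gives the dereverberated version of the whole audio signal to be processed.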

Since the current frame audio signal may be any frame audio signal of the audio signal to be processed, after the dereverberated audio signals corresponding to all the current frame audio signals are obtained, the dereverberated audio signal of the audio signal to be processed may be obtained.

In one embodiment, determining the reverberation signal of the audio signal to be processed according to the audio signal to be processed and the reverberation time length, as shown in fig. 3, may include the following steps:

In step S131, for any current frame audio signal of the audio signal to be processed, the energy after attenuation of the excitation energy vector of the previous frame audio signal of the current frame audio signal is calculated.

Specifically, if the current frame audio signal of the audio signal to be processed is the first frame audio signal of the audio signal to be processed, the current frame audio signal has no previous frame audio signal, and therefore the excitation energy vector of the previous frame audio signal does not exist. In this case, the excitation energy vector of the previous frame audio signal may be initialized, that is, Ra(n) = 0, and the energy after attenuation of the excitation energy vector of the previous frame audio signal may also be 0.

If the current frame audio signal of the audio signal to be processed is not the first frame audio signal of the audio signal to be processed, the current frame audio signal has the previous frame audio signal, and therefore, the excitation energy vector of the previous frame audio signal also exists, and at this time, the energy after the attenuation of the excitation energy vector of the previous frame audio signal is not 0.

As an implementation manner of the embodiment of the present disclosure, calculating, for any current frame audio signal of the audio signal to be processed, the energy after attenuation of the excitation energy vector of the previous frame audio signal of the current frame audio signal may include the following steps b1 and b2:

Step b1, when the current frame audio signal of the audio signal to be processed is the first frame audio signal of the audio signal to be processed, determining the energy after the attenuation of the excitation energy vector of the previous frame audio signal of the current frame audio signal as 0.

As can be seen from the above description, when the current frame audio signal is the first frame audio signal, there is no previous frame audio signal in the current frame audio signal, and therefore, the energy after the attenuation of the excitation energy vector of the previous frame of the current frame audio signal can be determined as 0.

Step b2, when the current frame audio signal of the audio signal to be processed is not the first frame audio signal of the audio signal to be processed, calculating the energy after the attenuation of the excitation energy vector of the previous frame audio signal of the current frame audio signal according to the following formula:

wherein R(n) is the energy after attenuation of the excitation energy vector of the previous frame audio signal of the current frame audio signal, Ra(n-1) is the excitation energy vector of the previous frame audio signal of the current frame audio signal, and T₁ is the time interval between two adjacent frames of the audio signal to be processed.

In step S132, the excitation energy vector of the audio signal of the current frame is determined according to the maximum value between the attenuated energy of the excitation energy vector of the audio signal of the previous frame and the energy of the audio signal of the current frame.

Specifically, after obtaining the energy of the previous frame of audio signal after attenuation of the excitation energy vector and the energy of the current frame of audio signal, comparing the energy of the previous frame of audio signal after attenuation of the excitation energy vector with the energy of the current frame of audio signal, and taking the maximum value to obtain the excitation energy vector of the current frame of audio signal. The specific formula is as follows:

Ra(n) = max(R(n), |X(n)|²)

wherein Ra(n) is the excitation energy vector of the current frame audio signal, R(n) is the energy after attenuation of the excitation energy vector of the previous frame audio signal, and |X(n)|² is the energy of the current frame audio signal.
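The maximum operation can be sketched as below. The sketch assumes that the frame is a list of spectral bins and that the maximum is taken bin by bin, which is one plausible reading of "vector" here. The attenuated energy R(n) is passed in rather than computed, because the attenuation formula is not reproduced in this text. The function name is hypothetical.

```python
def excitation_energy(attenuated_prev, frame):
    """Compute Ra(n) = max(R(n), |X(n)|^2) for each bin.

    attenuated_prev: R(n) per bin, or None for the first frame (treated as 0)
    frame:           X(n), a list of real or complex bin values
    """
    if attenuated_prev is None:
        # First frame: there is no previous frame, so R(n) = 0.
        attenuated_prev = [0.0] * len(frame)
    if len(attenuated_prev) != len(frame):
        raise ValueError("R(n) and X(n) must have the same number of bins")
    return [max(r, abs(x) ** 2) for r, x in zip(attenuated_prev, frame)]
```

For the first frame the result is simply the frame energy. For later frames, a bin keeps the decayed energy of an earlier excitation whenever that exceeds the current energy.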

In step S133, a reverberation signal of the current frame audio signal is determined according to the excitation energy vector of the current frame audio signal, a time interval between two adjacent frames of the audio signal to be processed, and a reverberation duration corresponding to the current frame audio signal.

Specifically, after obtaining the excitation energy vector of the current frame audio signal, the reverberation signal of the current frame audio signal can be obtained through the excitation energy vector of the current frame audio signal, a time interval between two adjacent frames of the audio signal to be processed, and a reverberation duration corresponding to the current frame audio signal.

As an implementation manner of the embodiment of the present disclosure, determining a reverberation signal of a current frame audio signal according to an excitation energy vector of the current frame audio signal, a time interval between two adjacent frames of an audio signal to be processed, and a reverberation duration corresponding to the current frame audio signal may include the following steps:

calculating the reverberation signal of the current frame audio signal according to the following formula:

M = RT60(n)/T₁;

wherein SR(n) is the reverberation signal of the current frame audio signal, RT60(n) is the reverberation duration corresponding to the current frame audio signal, M is the number of frames corresponding to that reverberation duration, and Ra(n-M) is the energy after attenuation of the excitation energy vector of the frame M frames before the current frame audio signal.
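The frame count M and the lookup of Ra(n-M) can be sketched as follows. Rounding M to the nearest integer, clamping it to at least 1, and returning None when fewer than M earlier frames exist are all assumptions, as are the function names. The formula that turns Ra(n-M) into SR(n) is not reproduced in this text and is therefore not implemented.

```python
def reverb_frame_count(rt60_ms, frame_interval_ms):
    """Compute M = RT60(n) / T1 as a whole number of frames (rounding assumed)."""
    if frame_interval_ms <= 0:
        raise ValueError("frame interval T1 must be positive")
    return max(1, round(rt60_ms / frame_interval_ms))


def excitation_m_frames_back(history, m):
    """Return Ra(n-M) from history = [Ra(1), ..., Ra(n)], or None if unavailable."""
    if m >= len(history):
        return None  # fewer than M earlier frames have been seen so far
    return history[-1 - m]
```

For example, a reverberation duration of 300 ms with a 10 ms frame interval gives M = 30 frames.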

Since the current frame audio signal may be any frame audio signal of the audio signal to be processed, after obtaining the reverberation signals of all the current frame audio signals, the reverberation signals included in the audio signal to be processed may be obtained.

Therefore, according to the technical scheme provided by the embodiment, the reverberation time of the audio signal to be processed is estimated, and the excitation energy vector is utilized to accurately and efficiently determine the reverberation signal included in the audio signal to be processed, so that the subsequent dereverberation of the audio signal to be processed is facilitated.

On the basis of the foregoing embodiment, in order to further improve the signal quality of the dereverberated audio signal, in an implementation manner, the audio processing method may further include the following steps:

removing the noise signal included in the audio signal to be processed by means of the noise reduction gain factor.

In this embodiment, not only the audio signal to be processed may be dereverberated, but also the audio signal to be processed may be denoised, so that the signal quality of the processed audio signal is higher.

For clarity of description, the audio processing method provided by the embodiment of the present disclosure will be described in detail with reference to a specific example, as shown in fig. 4.

In practical application, the system may include the following modules: a stationary noise estimation module, a signal-to-noise ratio estimation module, a reverberation time estimation module, a reverberation spectrum estimation module and a reverberation elimination module.

The audio input is an audio signal to be processed collected by the system microphone module, and generally includes signal components such as a speech signal, a noise signal, and a reverberation signal.

First, the audio signal to be processed is input into the stationary noise estimation module, which estimates the stationary noise signal included in the audio signal to be processed.

Second, the signal-to-noise ratio estimation module estimates the signal-to-noise ratio using the noise signal estimate output by the stationary noise estimation module, and at the same time calculates the gain factor for removing noise, namely the noise reduction gain factor.

Third, the audio signal to be processed may be input into the reverberation time estimation module, which estimates the ambient reverberation level of the audio signal to be processed, that is, the reverberation duration RT60 of the audio signal to be processed.

Fourth, reverberation spectrum estimation is performed with the reverberation duration RT60 obtained by the reverberation time estimation module as a reference, yielding the reverberation spectrum of the audio signal to be processed, namely the reverberation signal included in the audio signal to be processed.

Finally, the reverberation signal and the stationary noise signal in the audio signal to be processed are eliminated simultaneously using the estimated reverberation spectrum, the estimated signal-to-noise ratio, the noise reduction gain factor and other information, yielding the dereverberated output audio, namely the processed audio signal.

Therefore, according to the technical scheme provided by the embodiment of the disclosure, the signal-to-noise ratio, the noise reduction gain factor and the reverberation spectrum are all taken into account when the audio signal to be processed is dereverberated, so that the audio signal to be processed can be dereverberated well even when it contains a noise signal, and the noise reduction gain factor is also used to denoise the audio signal to be processed. Moreover, the disturbance of reverberation on the audio signal can be stably reduced while the dereverberated audio signal is not distorted, which improves the quality and intelligibility of the audio signal in real-time communication scenarios.

According to a second aspect of the embodiments of the present disclosure, there is provided an audio processing apparatus, as shown in fig. 5, including:

an audio signal acquisition module 510 configured to perform acquiring an audio signal to be processed;

a noise signal and reverberation duration obtaining module 520 configured to perform obtaining of a noise signal included in the audio signal to be processed and a reverberation duration of the audio signal to be processed;

a signal-to-noise ratio and reverberation signal determination module 530 configured to determine a signal-to-noise ratio and a noise reduction gain factor of the audio signal to be processed according to the audio signal to be processed and the noise signal, and determine a reverberation signal included in the audio signal to be processed according to the audio signal to be processed and the reverberation time length;

a dereverberation module 540 configured to perform dereverberation on the audio signal to be processed according to the signal-to-noise ratio, the noise reduction gain factor, and the reverberation signal, so as to obtain a dereverberated audio signal.

Optionally, the dereverberation module includes:

the second gain factor calculation unit is specifically configured to perform:

the second gain factor is calculated according to the following formula:

Optionally, the energy calculating unit is specifically configured to perform:

M = RT60(n)/T₁;

acquiring an original audio signal;

Optionally, the apparatus further comprises:

According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus, as shown in fig. 6, including:

a processor 610;

a memory 620 for storing the processor-executable instructions;

Fig. 7 is a block diagram illustrating an audio processing device 700 according to an example embodiment. For example, the apparatus 700 may be provided as a server. Referring to fig. 7, apparatus 700 includes a processing component 722 that further includes one or more processors and memory resources, represented by memory 732, for storing instructions, such as applications, that are executable by processing component 722. The application programs stored in memory 732 may include one or more modules that each correspond to a set of instructions. Furthermore, the processing component 722 is configured to execute instructions to perform the audio processing method according to the first aspect.

The apparatus 700 may also include a power component 726 configured to perform power management of the apparatus 700, a wired or wireless network interface 750 configured to connect the apparatus 700 to a network, and an input/output (I/O) interface 758. The apparatus 700 may operate based on an operating system stored in memory 732, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

Fig. 8 is a block diagram illustrating an audio processing device 800 according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast electronic device, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.

Referring to fig. 8, the apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.

The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

The power component 806 provides power to the various components of the device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.

The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 also includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the apparatus 800. The sensor assembly 814 may also detect a change in position of the apparatus 800 or a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The apparatus 800 may access a wireless network based on a communication standard, such as WiFi, an operator network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the audio processing method described in the first aspect.

In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. Alternatively, for example, the storage medium may be a non-transitory computer-readable storage medium, such as a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

The present disclosure is not limited to the precise arrangements described above and shown in the drawings, and various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. An audio processing method, comprising:

acquiring an audio signal to be processed;

2. The method of claim 1, wherein dereverberating the audio signal to be processed according to the signal-to-noise ratio, the noise reduction gain factor, and the reverberation signal to obtain a dereverberated audio signal comprises:

3. The method of claim 2, wherein when the signal-to-noise ratio of the current frame audio signal is less than a preset signal-to-noise ratio, and the reverberation duration of the current frame audio signal is greater than a preset reverberation duration;

the second gain factor is calculated according to the following formula:

4. The method of claim 2, wherein when the signal-to-noise ratio of the current frame audio signal is greater than a preset signal-to-noise ratio, or the reverberation duration of the current frame audio signal is less than a preset reverberation duration;

5. The method of claim 1, wherein the determining the reverberation signal of the audio signal to be processed according to the audio signal to be processed and the reverberation time comprises:

6. The method according to claim 5, wherein said calculating, for any current frame audio signal of the audio signals to be processed, the attenuated energy of the excitation energy vector of the previous frame audio signal of the current frame audio signal comprises:

7. The method of claim 6, wherein the determining the reverberation signal of the current frame audio signal according to the excitation energy vector of the current frame audio signal, the time interval between two adjacent frames of the audio signal to be processed, and the reverberation duration corresponding to the current frame audio signal comprises:

M = RT60(n)/T₁;

8. An audio processing apparatus, comprising:

9. An electronic device, comprising:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to execute the instructions to implement the audio processing method of any of claims 1 to 7.

10. A storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the audio processing method of any of claims 1 to 7.