CN111353258A - Echo suppression method based on coding and decoding neural network, audio device and equipment - Google Patents
- Publication number: CN111353258A (application CN202010084801.6A)
- Authority
- CN
- China
- Prior art keywords: spectrogram, network, anechoic, echo, neural network
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G06—COMPUTING; G06N—Computing arrangements based on specific computational models
- G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks; G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
Abstract
The invention discloses an echo suppression method based on a coding and decoding neural network, together with an audio apparatus and a device. Echo-free audio data are acquired, and simulated echoes are added to them to generate echoed audio data; the anechoic audio data are converted into an anechoic spectrogram and its spectral features are extracted; the echoed audio data are converted into an echoed spectrogram and its spectral features are extracted; a coding and decoding neural network model is built from the spectral features of the anechoic and echoed spectrograms; the model is trained to obtain a prediction model for the de-echoed spectrogram; audio data to be processed are converted into a spectrogram and fed into the prediction model, which outputs a de-echoed spectrogram that is converted back into de-echoed audio data. The method can remove echoes in different scenes, can fit the linear behavior of echoes in different environments, and is highly adaptable.
Description
Technical Field
The invention relates to the technical field of audio and network communication, and in particular to an echo suppression method, apparatus, device, and storage medium based on a coding and decoding neural network.
Background
In current speech recognition systems, suppressing echo in speech is a critical step, and the quality of echo processing directly affects recognition accuracy. This matters especially for teleconferencing, such as video and telephone conferences: the signal transmitted from the far end (the far-end signal) is played through the speaker of a telephone in the conference room, the sound waves are reflected by walls, floors, and ceilings, and both the reflected and the direct waves are picked up by the telephone's microphone and transmitted back as part of the near-end signal, forming an echo. Sound leaking from the handset can likewise be picked up by the microphone and sent to the far end as part of the near-end signal, which also forms an echo. These echoes carry a delay; when the delay exceeds 50 milliseconds with little or no attenuation, the far-end user perceives a distinct echo. Because this echo arises along an acoustic path, it is called acoustic echo.
Traditional echo suppression relies on linear filtering: the gradual-change law of the acoustic echo is identified and a corresponding linear filtering algorithm is derived, which then reduces and cancels the echo in the sound. However, such a method can only model the echo of a fixed scene; when the scene varies and the echo is complex, the suppression effect degrades, so the method lacks robustness.
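For contrast with the learned approach, the adaptive linear filtering described above can be sketched as a normalized-LMS (NLMS) update. The function below is an illustrative stand-in, not the patent's algorithm; all names and parameters are assumptions.

```python
def nlms_step(weights, x_buf, mic_sample, mu=0.5, eps=1e-8):
    """One normalized-LMS update of a linear echo canceller.

    `x_buf` holds the most recent far-end samples, `mic_sample` is the
    microphone signal (near end plus echo); the estimated echo is
    subtracted and the filter taps adapt toward the true echo path.
    """
    echo_est = sum(w * x for w, x in zip(weights, x_buf))
    err = mic_sample - echo_est                      # residual after cancellation
    norm = sum(x * x for x in x_buf) + eps           # normalization term
    weights = [w + mu * err * x / norm for w, x in zip(weights, x_buf)]
    return err, weights

# A one-tap echo path with gain 0.5: the tap converges toward 0.5.
w = [0.0]
for _ in range(40):
    _, w = nlms_step(w, [1.0], 0.5 * 1.0)
```

Such a filter fits one fixed echo path; the patent's point is precisely that it fails when the scene, and hence the path, changes.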
Disclosure of Invention
To solve the above problems, the present invention provides an echo suppression method based on a coding and decoding neural network, together with an audio apparatus, a device, and a storage medium, which can remove echoes in different scenes, fit the linear behavior of echoes in different environments, and offer high adaptability.
To achieve this object, the invention adopts the following technical scheme:
an echo suppression method based on a coding and decoding neural network comprises the following steps:
obtaining anechoic audio data, and adding simulated echo to the anechoic audio data to generate audio data with echo;
converting the anechoic audio data into an anechoic spectrogram, and extracting the frequency spectrum characteristics of the anechoic spectrogram; converting the echo audio data into an echo spectrogram, and extracting the spectral characteristics of the echo spectrogram;
building a coding and decoding neural network model according to the spectral characteristics of the anechoic spectrogram and the spectral characteristics of the echoic spectrogram;
taking the spectral features of the anechoic spectrogram as label data and the spectral features of the echoed spectrogram as input data, and training the coding and decoding neural network model to obtain a prediction model for the de-echoed spectrogram;
and converting the audio data to be processed into a spectrogram, inputting the spectrogram into the prediction model, outputting a de-echoed spectrogram, and converting the de-echoed spectrogram to obtain de-echoed audio data.
Preferably, the anechoic audio data is obtained by recording audio associated with a human voice in a quiet anechoic environment.
Preferably, the anechoic audio data is converted into an anechoic spectrogram through an FFT algorithm by an audio processing library in Python; and converting the spectrum characteristics of the anechoic spectrogram into characteristic data after dimensionality reduction through Embedding processing, and taking the characteristic data as label data of the coding and decoding neural network model.
Preferably, simulated echoes are added to the audio data through an indoor acoustics processing library in Python: the library simulates the spatial dimensions of the environment, and the magnitude of the echoes and their delay time are set to generate the corresponding echoed audio data.
Further, the audio data with echo is converted into a spectrogram with echo through an FFT algorithm by an audio processing library in Python; and converting the spectrum characteristics of the echoed spectrogram into characteristic data after dimensionality reduction through Embedding processing, and taking the characteristic data as input data of the coding and decoding neural network model.
Preferably, the encoding and decoding neural network model further comprises an encoding network and a decoding network, wherein:
the coding network is used for coding the frequency spectrum characteristics of the anechoic spectrogram or the frequency spectrum characteristics of the echoic spectrogram and converting the characteristic dimension of the spectrogram into a two-dimensional matrix structure;
the decoding network is used for decoding the encoded data and converting the decoded data into characteristic dimensions of a spectrogram.
Preferably, the encoding network further comprises:
a preprocessing network (prenet) comprising three fully connected layers, each followed by dropout;
a CBHG network comprising a convolution filter bank, a high-dimensional (highway) network, and a bidirectional GRU network; the convolution filter bank convolves the output of the preprocessing network; after the filter bank, a residual network merges the convolution result with the Embedding-reduced feature data; the merged data is input into the highway network, and finally the output of the highway network is taken as the input of the bidirectional GRU network, which outputs a forward RNN result and a backward RNN result;
an attention network for calculating attention probability values over the forward and backward RNN results.
Preferably, the decoding network further comprises:
a preprocessing network with the same structure as that of the encoding network, which applies a nonlinear transformation to the attention probability values through a fully connected network;
an attention recurrent neural network comprising one layer of 256 gated recurrent units (GRUs), which takes the output of the preprocessing network and the attention probability values as input and passes its output through the GRU units to the decoder recurrent neural network;
a decoder recurrent neural network comprising two layers of residual gated recurrent units, each layer likewise containing 256 GRUs; the output dimension of the residual GRUs is then converted into the spectrogram dimension, either by a dimension conversion or through an RNN.
Preferably, when the prediction model is trained, a loss value is further calculated between the anechoic spectrogram and the predicted de-echoed spectrogram, and the model is iteratively trained on this loss with the Adam optimization algorithm in TensorFlow to obtain the fitted prediction model.
Another object of the present invention is to provide an audio apparatus comprising a memory and a processor, the memory storing instructions; by executing the instructions stored in the memory, the processor causes the audio apparatus to implement the echo suppression method according to any of the above embodiments.
It is a further object of the present invention to provide an apparatus comprising the audio device.
It is still another object of the present invention to provide a computer-readable storage medium, wherein instructions are stored in the computer-readable storage medium, and the instructions are executed by an echo suppressing apparatus to enable the echo suppressing apparatus to implement any one of the above echo suppressing methods.
The invention has the beneficial effects that:
(1) the echo suppression method based on the coding and decoding neural network extracts echo-free audio data, adds simulated echoes to them, feeds both into the coding and decoding network model, and trains it to obtain a prediction model; audio data to be processed are then input into the prediction model, which outputs de-echoed audio data; the method can remove echoes in different scenes, fit the linear behavior of echoes in different environments, and is highly adaptable;
(2) the coding and decoding neural network adopts residual connections: during encoding, the convolution result is merged with the Embedding-reduced feature data through a residual network and the merged data are input into the high-dimensional (highway) network, so deeper network layers can be used without causing vanishing gradients;
(3) the prediction model trained by the coding and decoding neural network generalizes well and can remove echoes in a variety of scenes;
(4) when the prediction model is trained, a loss value is calculated between the anechoic spectrogram and the predicted de-echoed spectrogram and iteratively minimized with the Adam optimization algorithm in TensorFlow, making the fitted prediction model more accurate;
(5) applied to speech recognition, the echo suppression method reduces the influence of echo on recognition, improving accuracy and reducing the probability of misrecognition.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a simplified flowchart of the prediction-model training process of the echo suppression method based on a coding and decoding neural network according to the present invention;
FIG. 2 is a simplified flowchart of the prediction process of the echo suppression method based on a coding and decoding neural network according to the present invention;
FIG. 3 is a network structure diagram of the coding and decoding neural network model of the echo suppression method according to the present invention;
FIG. 4 is a structure diagram of the encoding network of the coding and decoding neural network model according to the present invention;
FIG. 5 is a structure diagram of the decoding network of the coding and decoding neural network model according to the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects of the present invention more clear and obvious, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1 and fig. 2, an echo suppression method based on a codec neural network of the present invention includes the following steps:
obtaining anechoic audio data, and adding simulated echo to the anechoic audio data to generate audio data with echo;
converting the anechoic audio data into an anechoic spectrogram, and extracting the frequency spectrum characteristics of the anechoic spectrogram; converting the echo audio data into an echo spectrogram, and extracting the spectral characteristics of the echo spectrogram;
building a coding and decoding neural network model according to the spectral characteristics of the anechoic spectrogram and the spectral characteristics of the echoic spectrogram;
taking the spectral features of the anechoic spectrogram as label data and the spectral features of the echoed spectrogram as input data, and training the coding and decoding neural network model to obtain a prediction model for the de-echoed spectrogram;
and converting the audio data to be processed into a spectrogram, inputting the spectrogram into the prediction model, outputting a de-echoed spectrogram, and converting the de-echoed spectrogram to obtain de-echoed audio data.
In this embodiment, the anechoic audio data is obtained by recording human-voice audio in a quiet, echo-free environment. Specifically, real recordings of human speech are collected; the content of the recording does not matter, but the voice and its content must be clear. After 10,000 to 20,000 qualifying recordings are collected, the original anechoic audio data serve as the training label data. The simulated echo is produced by passing the anechoic audio data through a linear filter that realizes the echo effect; a linear echo function is involved here, and the filtered anechoic audio yields the echoed audio data.
The anechoic audio data is further converted into an anechoic spectrogram by an FFT algorithm through an audio processing library in Python (the Librosa library), and the converted spectrogram data adopts a two-dimensional matrix structure; the spectral features of the anechoic spectrogram are converted into dimensionality-reduced feature data through Embedding processing and used as the label data of the coding and decoding neural network model. Further, in this embodiment, a simulated echo is added to the anechoic audio data through an indoor acoustics processing library in Python (the pyroomacoustics library), which simulates the spatial dimensions of the environment; the magnitude of the echo and its delay time are set to generate the corresponding echoed audio data.
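As an illustration of the simulation step, the room-acoustics library's effect can be approximated by adding delayed, attenuated copies of the dry signal. The helper below is a hypothetical stand-in for the pyroomacoustics simulation, with the reflection list playing the role of the simulated room; all names and values are assumptions.

```python
def add_simulated_echo(dry, reflections, sample_rate=16000):
    """Add echo to `dry` (a list of float samples).

    `reflections` is a list of (delay_seconds, gain) pairs standing in
    for the simulated room: the gain is the echo magnitude, the delay
    is the echo's delay time.
    """
    max_delay = max((int(d * sample_rate) for d, _ in reflections), default=0)
    out = list(dry) + [0.0] * max_delay      # direct path plus room for the tail
    for delay_s, gain in reflections:
        delay = int(delay_s * sample_rate)
        for i, s in enumerate(dry):
            out[i + delay] += gain * s       # delayed, attenuated copy
    return out

# One reflection, 1-sample delay at 1 kHz, gain 0.5.
echoed = add_simulated_echo([1.0, 0.5, 0.25], [(0.001, 0.5)], sample_rate=1000)
```

The real library additionally models room geometry and absorption; this sketch only shows the structure of the resulting echoed signal.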
Further, the audio data with echo is converted into a spectrogram with echo through an FFT algorithm by an audio processing library (Librosa library) in Python, and the converted spectrogram data adopts a two-dimensional matrix structure; and converting the spectrum characteristics of the echoed spectrogram into characteristic data after dimensionality reduction through Embedding processing, and taking the characteristic data as input data of the coding and decoding neural network model.
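The Librosa conversion described above can be sketched with NumPy alone: frame the waveform, window each frame, and take the magnitude of an FFT per frame. Parameter values here (`n_fft`, `hop`) are illustrative, not the patent's settings; note that the first dimension of the result varies with the audio duration while the second is fixed.

```python
import numpy as np

def magnitude_spectrogram(signal, n_fft=512, hop=128):
    """Frame the signal, apply a Hann window and an FFT per frame, and
    return the magnitude spectrogram as a 2-D array (frames x bins)."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))   # magnitude of each bin
    return np.array(frames)

# One second of a 440 Hz tone at 16 kHz.
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
spec = magnitude_spectrogram(sig)
# spec.shape[0] depends on duration; spec.shape[1] == n_fft // 2 + 1 is fixed.
```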
As shown in fig. 3, the codec neural network model further includes an encoding network (Encoder network) and a decoding network (Decoder network), wherein:
the encoding network (Encoder network) is used for encoding the spectral features of the anechoic spectrogram or the spectral features of the echoic spectrogram and converting the characteristic dimensions of the spectrogram into a two-dimensional matrix structure;
the decoding network (Decoder network) is used for decoding the encoded data and converting the decoded data into the characteristic dimension of the spectrogram.
In the matrix structure, the first dimension differs between audio clips and is determined by the audio duration; the second dimension is fixed, and the larger it is, the more information it describes. Dimensionality reduction shrinks this second dimension, which then describes abstract speech features.
As shown in fig. 4, the encoding network (Encoder network) further includes:
a pre-processing network (prenet network) comprising three fully connected layers, each followed by a dropout operation;
the CBHG network comprises a convolution filter bank, a high-dimensional (highway) network, and a bidirectional GRU network; the convolution filter bank convolves the output of the preprocessing network; after the filter bank, a residual network merges (by sequence addition) the convolution result with the Embedding-reduced feature data; the merged data is input into the highway network (equivalent to a dimension-raising step that restores the reduced low-dimensional features to high-dimensional spectrogram features); finally, the output of the highway network is taken as the input of the bidirectional GRU network, which outputs a forward RNN result and a backward RNN result;
an Attention network (Attention network) for calculating an Attention probability value of the RNN result in the forward direction and the RNN result in the reverse direction.
The convolution filter bank consists, in order, of a convolution layer, a pooling layer, and two one-dimensional convolution layers. The first one-dimensional convolution layer has a filter (kernel) size of 3 and a stride of 1 and uses the ReLU activation function; the second has the same kernel size and stride as the first but uses no activation function.
Each layer of the high-dimensional (highway) network works as follows: the input is fed simultaneously into two single-layer fully connected networks whose activation functions are ReLU and sigmoid respectively. Denoting the input as input, the ReLU output as output1, and the sigmoid output as output2, the output of the highway layer is output = output1 * output2 + input * (1 - output2). In this example, 4 highway layers are used.
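The highway-layer formula above can be checked with a scalar sketch; the weights below are illustrative, not trained values.

```python
import math

def highway_layer(x, w_h, b_h, w_t, b_t):
    """One highway layer on a scalar feature: H(x)*T(x) + x*(1 - T(x)),
    where H uses ReLU (output1) and the gate T uses sigmoid (output2)."""
    h = max(0.0, w_h * x + b_h)                    # ReLU branch  -> output1
    t = 1.0 / (1.0 + math.exp(-(w_t * x + b_t)))   # sigmoid gate -> output2
    return h * t + x * (1.0 - t)
```

With the gate at 0.5 and H(x) = x the layer is an identity; with the gate driven toward 0 the input passes straight through, which is what lets the network stack many layers without vanishing gradients.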
The attention network wraps the created GRU recurrent network with attention through the AttentionWrapper function provided by TensorFlow; the attention mechanism computes the attention required by the different cells of the GRU network and expresses it as probabilities in the range 0-1.
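The probabilities the attention mechanism produces can be illustrated with a plain dot-product-plus-softmax sketch. The patent wraps a GRU network with TensorFlow's AttentionWrapper; the standalone function below only shows why the values land in (0, 1) and sum to 1, and its inputs are hypothetical.

```python
import math

def attention_probabilities(query, keys):
    """Dot-product scores between one decoder query and the encoder
    states, normalised with softmax so each weight lies in (0, 1)
    and the weights sum to 1."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)                                  # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Query most similar to the first encoder state gets the largest weight.
probs = attention_probabilities([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```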
Once the attended Encoder result is obtained through the attention network, the Decoder network is built; it decodes the Encoder's information into the required result information.
As shown in fig. 5, the decoding network (Decoder network) further includes:
a pre-processing network (prenet network) having the same structure as the pre-processing network of the encoding network, and performing a non-linear transformation process on the attention probability value through a full-connection network;
an attention recurrent neural network (Attention-RNN network) comprising one layer of 256 gated recurrent units (GRU units), which takes the output of the preprocessing network and the attention probability values as input and passes its output through the GRU units to the Decoder-RNN network;
a decoder recurrent neural network (Decoder-RNN network) comprising two layers of residual gated recurrent units (residual GRU units), each layer likewise containing 256 GRU units; the decoder input at the first step is a zero matrix, and the output of step t is used as the input of step t+1; the output dimension of the residual GRU units is then converted into the spectrogram dimension, either by a dimension conversion or through an RNN.
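The zero-matrix start and the feeding of step t's output into step t+1 can be sketched as a generic autoregressive loop; `step_fn` below is a dummy stand-in for the residual-GRU stack, not the patent's network.

```python
def greedy_decode(step_fn, state, frame_dim, n_steps):
    """Autoregressive decoding: the first input is a zero frame, and
    the output of step t becomes the input of step t+1."""
    frame = [0.0] * frame_dim          # step-1 input is a zero matrix/frame
    outputs = []
    for _ in range(n_steps):
        frame, state = step_fn(frame, state)   # output of step t...
        outputs.append(frame)                  # ...is fed back as input of t+1
    return outputs

def dummy_step(frame, state):
    """Illustrative cell: add the (constant) state to every value."""
    return [f + state for f in frame], state

frames = greedy_decode(dummy_step, 1.0, 2, 3)
```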
Preferably, when the prediction model is trained, a loss value is calculated between the anechoic spectrogram and the predicted de-echoed spectrogram: the two spectrograms are matrices of the same shape, and the loss is the Euclidean distance computed from the differences of the values at each corresponding position. The loss is iteratively minimized with the Adam optimization algorithm in TensorFlow to obtain a fitted prediction model; this embodiment trains for 20,000 batches, with 64 audio spectrograms fed in per batch.
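The loss described above, a Euclidean distance over same-shaped spectrogram matrices, reduces to the following sketch:

```python
import math

def euclidean_loss(predicted, target):
    """Sum the squared differences at each corresponding position of
    two same-shaped matrices and take the square root."""
    total = 0.0
    for row_p, row_t in zip(predicted, target):
        for p, t in zip(row_p, row_t):
            total += (p - t) ** 2
    return math.sqrt(total)

# A 3-4-5 example: differences of 4 and 3 give a distance of 5.
loss = euclidean_loss([[0.0, 3.0]], [[4.0, 0.0]])
```

In the patent this scalar is the quantity that Adam minimizes over the 20,000 training batches.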
Once the trained and fitted coding and decoding neural network model is obtained, it is taken out as the prediction model. The echoed audio data to be processed is first converted into spectrogram data through the Librosa library and fed into the prediction model, which outputs de-echoed spectrogram data; finally, an open-source vocoder converts the spectrogram back into audio data, which is stored as an audio file.
Another object of the present invention is to provide an audio apparatus, which includes a memory and a processor, wherein the memory stores instructions, and the processor causes an echo suppressing apparatus to implement the echo suppressing method according to any of the above embodiments by executing the instructions stored in the memory. The audio devices include, but are not limited to: power amplifier, audio amplifier, multimedia console, digital sound console, audio sampling card, synthesizer, middle and high frequency audio amplifier, microphone, sound card in the PC, earphone, etc.
It is a further object of the present invention to provide an apparatus comprising the audio device. The device may be a general purpose computer device or a special purpose computer device, which may be a server. In a specific implementation, the device may be a desktop, a laptop, a web server, a Personal Digital Assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, an embedded device, or the like.
One embodiment of the present application also provides a computer storage medium having instructions stored therein; the echo suppression device (which may be a computer device, such as a server) executes the instructions, for example, a processor in the computer device executes the instructions, so that the echo suppression device implements the echo suppression method according to the above embodiment.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the apparatus and device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points. Also, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
While the above shows and describes preferred embodiments of the present invention, it is to be understood that the invention is not limited to the forms disclosed herein; these are not to be construed as excluding other embodiments, and the invention may be used in various other combinations, modifications, and environments, and may be changed within the scope of the inventive concept described herein, in accordance with the above teachings or the skill and knowledge of the relevant art. Modifications and variations made by those skilled in the art without departing from the spirit and scope of the invention shall fall within the scope of the appended claims.
Claims (12)
1. An echo suppression method based on a coding and decoding neural network is characterized by comprising the following steps:
obtaining anechoic audio data, and adding simulated echo to the anechoic audio data to generate audio data with echo;
converting the anechoic audio data into an anechoic spectrogram, and extracting the frequency spectrum characteristics of the anechoic spectrogram; converting the echo audio data into an echo spectrogram, and extracting the spectral characteristics of the echo spectrogram;
building a coding and decoding neural network model according to the spectral characteristics of the anechoic spectrogram and the spectral characteristics of the echoic spectrogram;
taking the spectral features of the anechoic spectrogram as label data and the spectral features of the echoed spectrogram as input data, and training the coding and decoding neural network model to obtain a prediction model for the de-echoed spectrogram;
and converting the audio data to be processed into a spectrogram, inputting the spectrogram into the prediction model, outputting a de-echoed spectrogram, and converting the de-echoed spectrogram to obtain de-echoed audio data.
2. The echo suppression method based on the codec neural network of claim 1, wherein: the anechoic audio data is obtained by recording audio associated with a human voice in a quiet anechoic environment.
3. The echo suppression method based on the codec neural network of claim 2, wherein: converting the anechoic audio data into an anechoic spectrogram through an FFT algorithm by an audio processing library in Python; and converting the spectrum characteristics of the anechoic spectrogram into characteristic data after dimensionality reduction through Embedding processing, and taking the characteristic data as label data of the coding and decoding neural network model.
4. The echo suppression method based on the codec neural network of claim 1, wherein: simulated echoes are added to the audio data through an indoor audio array processing algorithm library in Python, which simulates the spatial dimensions of the environment; the echo level and the echo duration are set to generate the corresponding echoed audio data.
5. The echo suppression method based on the codec neural network of claim 4, wherein: the echoed audio data is converted into an echoed spectrogram using an FFT algorithm from an audio processing library in Python; and the spectral features of the echoed spectrogram are converted into dimensionality-reduced feature data through Embedding processing, the feature data serving as the input data of the coding and decoding neural network model.
6. The echo suppression method based on the codec neural network according to claim 3 or 5, wherein the codec neural network model further includes an encoding network and a decoding network, wherein:
the encoding network is used for encoding the spectral features of the anechoic spectrogram or the spectral features of the echoed spectrogram, converting the feature dimensions of the spectrogram into a two-dimensional matrix structure;
the decoding network is used for decoding the encoded data and converting the decoded data back into the feature dimensions of a spectrogram.
7. The echo suppression method based on the codec neural network of claim 6, wherein the encoding network further comprises:
a preprocessing network comprising three fully connected layers, each fully connected layer applying dropout processing;
a CBHG network comprising a bank of convolutional filters, a highway network, and a bidirectional GRU network, wherein the convolutional filters convolve the output of the preprocessing network; the convolution result is merged, through a residual connection after the convolutional filters, with the Embedding-processed feature data; the merged data is input into the highway network; and the output of the highway network finally serves as the input of the bidirectional GRU network, which outputs a forward RNN result and a backward RNN result;
an attention network for calculating an attention probability value from the forward RNN result and the backward RNN result.
8. The echo suppression method based on the codec neural network of claim 7, wherein the decoding network further comprises:
a preprocessing network having the same structure as that of the encoding network, which applies a nonlinear transformation to the attention probability value through a fully connected network;
an attention recurrent neural network comprising one layer of 256 gated recurrent units (GRUs), wherein the output of the preprocessing network and the attention probability value are taken as input and, after passing through the GRU layer, are output to the decoding recurrent neural network;
the decoding recurrent neural network comprising two layers of residual gated recurrent units, each layer likewise containing 256 gated recurrent units, wherein the output dimension of the residual gated recurrent units is further converted into the dimensions of a spectrogram by dimension conversion or through an RNN network.
9. The echo suppression method based on the codec neural network of claim 1, wherein: when the prediction model is trained, a loss value is further calculated between the predicted spectrogram and the anechoic spectrogram, and the loss value is iteratively minimized through the Adam optimization algorithm in Tensorflow, so as to obtain the fitted prediction model.
10. An audio apparatus, characterized by: comprising a memory having instructions stored therein and a processor that, upon executing the instructions stored in the memory, causes the audio apparatus to implement the echo suppression method according to any one of claims 1 to 9.
11. An apparatus, characterized by: comprising the audio device of claim 10.
12. A computer-readable storage medium having stored therein instructions, execution of which by an echo suppression device causes the echo suppression device to implement the echo suppression method of any one of claims 1 to 9.
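The FFT-based spectrogram conversion in claim 3 can be sketched briefly. The claim only names "an audio processing library in Python" (librosa's `stft` would be one common choice, but that is an assumption); the minimal NumPy version below shows the framed-FFT step itself:

```python
import numpy as np

def stft_spectrogram(audio, frame_len=512, hop=256):
    """Convert a 1-D audio signal into a magnitude spectrogram via framed FFTs."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(audio) - frame_len) // hop
    frames = np.stack([audio[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequency bins (frame_len // 2 + 1 of them)
    return np.abs(np.fft.rfft(frames, axis=1))

# 1 second of a 440 Hz tone at 16 kHz
t = np.arange(16000) / 16000.0
spec = stft_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)
```

The resulting (frames × frequency-bins) matrix is what the claims then compress via Embedding processing into the network's input or label features.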
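Claim 4's echo simulation goes through "an indoor audio array processing algorithm library in Python" (pyroomacoustics is a plausible candidate, but the claim does not name one). To stay self-contained, the sketch below instead builds a toy room impulse response by hand, with a configurable echo level (`decay`) and echo delay, and convolves it with the clean signal:

```python
import numpy as np

def add_simulated_echo(audio, sr=16000, delay_s=0.05, decay=0.4, taps=3):
    """Add echo by convolving with a synthetic room impulse response:
    a unit impulse (direct path) followed by `taps` attenuated reflections."""
    delay = int(delay_s * sr)
    rir = np.zeros(delay * taps + 1)
    rir[0] = 1.0                       # direct path
    for k in range(1, taps + 1):
        rir[k * delay] = decay ** k    # each reflection is quieter
    return np.convolve(audio, rir)[:len(audio)]

clean = np.random.default_rng(0).standard_normal(16000)
echoed = add_simulated_echo(clean)
print(echoed.shape)
```

A real room simulator would derive the impulse response from the room geometry (the "spatial dimensions" the claim mentions) rather than from hand-placed taps.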
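The "attention probability value" of claim 7 is, in Tacotron-style encoder-decoder models (the non-patent citation here), a softmax over alignment scores between a decoder state and the encoder's forward/backward RNN outputs. A minimal sketch with dot-product scoring (the patent does not specify the scoring function, so dot-product is an assumption):

```python
import numpy as np

def attention_probabilities(query, encoder_outputs):
    """Softmax-normalized alignment between one decoder query vector
    and each encoder time step (dot-product scoring assumed)."""
    scores = encoder_outputs @ query   # one score per encoder step, shape (T,)
    scores -= scores.max()             # shift for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()     # probabilities over the T steps

rng = np.random.default_rng(1)
enc = rng.standard_normal((5, 8))      # T=5 encoder steps, 8-dim states
q = rng.standard_normal(8)             # one decoder query
p = attention_probabilities(q, enc)
print(p.sum())
```

The decoder of claim 8 then consumes these probabilities (after the preprocessing network's nonlinear transformation) as part of its GRU input.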
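Claim 9's training step, loss between the predicted and anechoic spectrograms minimized with Adam, would in practice use `tf.keras.optimizers.Adam` in TensorFlow as the claim states. The NumPy sketch below reimplements the Adam update rule (Kingma & Ba) on a toy mean-squared-error objective so the mechanics are visible; the 4×4 "spectrograms" are stand-ins, not the patent's actual features:

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: bias-corrected first/second moment estimates."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)          # bias correction for the warm-up steps
    v_hat = v / (1 - b2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy fit: drive a "predicted spectrogram" toward the anechoic label by MSE.
label = np.ones((4, 4))                # stand-in for anechoic label features
pred = np.zeros((4, 4))                # stand-in for the model's output
m = v = np.zeros_like(pred)
for t in range(1, 2001):
    grad = 2.0 * (pred - label) / pred.size   # d(MSE)/d(pred)
    pred, m, v = adam_step(pred, grad, m, v, t, lr=0.01)
loss = float(np.mean((pred - label) ** 2))
print(loss)
```

In the patent's setting the gradient would flow through the whole encoder-decoder rather than directly into the output, but the per-parameter update is the same.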
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010084801.6A CN111353258A (en) | 2020-02-10 | 2020-02-10 | Echo suppression method based on coding and decoding neural network, audio device and equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111353258A true CN111353258A (en) | 2020-06-30 |
Family
ID=71192219
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010084801.6A Pending CN111353258A (en) | 2020-02-10 | 2020-02-10 | Echo suppression method based on coding and decoding neural network, audio device and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111353258A (en) |
2020
- 2020-02-10 CN CN202010084801.6A patent/CN111353258A/en active Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110476206A (en) * | 2017-03-29 | 2019-11-19 | 谷歌有限责任公司 | End-to-end Text To Speech conversion |
CN107481728A (en) * | 2017-09-29 | 2017-12-15 | 百度在线网络技术(北京)有限公司 | Background sound removing method, device and terminal device |
KR20190085883A (en) * | 2018-01-11 | 2019-07-19 | 네오사피엔스 주식회사 | Method and apparatus for voice translation using a multilingual text-to-speech synthesis model |
CN109841206A (en) * | 2018-08-31 | 2019-06-04 | 大象声科(深圳)科技有限公司 | A kind of echo cancel method based on deep learning |
CN110148419A (en) * | 2019-04-25 | 2019-08-20 | 南京邮电大学 | Speech separating method based on deep learning |
CN110647980A (en) * | 2019-09-18 | 2020-01-03 | 成都理工大学 | Time sequence prediction method based on GRU neural network |
Non-Patent Citations (1)
Title |
---|
Yuxuan Wang et al., "Tacotron: Towards End-to-End Speech Synthesis", 18th Annual Conference of the International Speech Communication Association (Interspeech 2017) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023044961A1 (en) * | 2021-09-23 | 2023-03-30 | 武汉大学 | Multi-feature fusion echo cancellation method and system based on self-attention transform network |
CN116402906A (en) * | 2023-06-08 | 2023-07-07 | 四川省医学科学院·四川省人民医院 | Signal grade coding method and system based on kidney echo |
CN116402906B (en) * | 2023-06-08 | 2023-08-11 | 四川省医学科学院·四川省人民医院 | Signal grade coding method and system based on kidney echo |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110491404B (en) | Voice processing method, device, terminal equipment and storage medium | |
JP7258182B2 (en) | Speech processing method, device, electronic device and computer program | |
CN111755019A (en) | System and method for acoustic echo cancellation using deep multitask recurrent neural networks | |
KR102276951B1 (en) | Output method for artificial intelligence speakers based on emotional values calculated from voice and face | |
CN111353258A (en) | Echo suppression method based on coding and decoding neural network, audio device and equipment | |
CN111710344A (en) | Signal processing method, device, equipment and computer readable storage medium | |
CN114792524B (en) | Audio data processing method, apparatus, program product, computer device and medium | |
CN113870874A (en) | Multi-feature fusion echo cancellation method and system based on self-attention transformation network | |
CN114338623B (en) | Audio processing method, device, equipment and medium | |
CN113299306B (en) | Echo cancellation method, echo cancellation device, electronic equipment and computer-readable storage medium | |
JP2023548707A (en) | Speech enhancement methods, devices, equipment and computer programs | |
CN116737895A (en) | Data processing method and related equipment | |
CN116030823A (en) | Voice signal processing method and device, computer equipment and storage medium | |
CN112687284B (en) | Reverberation suppression method and device for reverberation voice | |
JP2024502287A (en) | Speech enhancement method, speech enhancement device, electronic device, and computer program | |
Romaniuk et al. | Efficient low-latency speech enhancement with mobile audio streaming networks | |
CN112750452A (en) | Voice processing method, device and system, intelligent terminal and electronic equipment | |
CN115798497B (en) | Time delay estimation system and device | |
CN113763978B (en) | Voice signal processing method, device, electronic equipment and storage medium | |
CN117351983B (en) | Transformer-based voice noise reduction method and system | |
CN115565543B (en) | Single-channel voice echo cancellation method and device based on deep neural network | |
WO2024055751A1 (en) | Audio data processing method and apparatus, device, storage medium, and program product | |
CN116248229B (en) | Packet loss compensation method for real-time voice communication | |
US20240096332A1 (en) | Audio signal processing method, audio signal processing apparatus, computer device and storage medium | |
CN110880325B (en) | Identity recognition method and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200630 ||