CN112614502B - Echo cancellation method based on double LSTM neural network - Google Patents
Echo cancellation method based on double LSTM neural network

- Publication number: CN112614502B
- Application number: CN202011455735.5A
- Authority: CN (China)
- Prior art keywords: signal, sound source, sample, neural network, echo
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications

- G10L21/0208—Noise filtering (speech enhancement, e.g. noise reduction or echo cancellation)
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
- G10L25/30—Speech or voice analysis techniques using neural networks
- G10L2021/02082—Noise filtering, the noise being echo or reverberation of the speech
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02163—Only one microphone
- G06N3/045—Combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08—Learning methods
Abstract
The invention relates to the field of audio signal processing and aims to solve the problem of poor echo cancellation performance in the prior art. It provides an echo cancellation method based on a double-LSTM neural network, comprising the following steps: acquiring a first sound source signal to be input to a loudspeaker and a second sound source signal input by a microphone, and extracting a first spectral feature of the first sound source signal and a second spectral feature of the second sound source signal; obtaining an echo estimation signal and a noise estimation signal from the first and second spectral features based on a first LSTM neural network model; extracting a third spectral feature of the echo estimation signal and a fourth spectral feature of the noise estimation signal; obtaining a clean speech signal from the second, third and fourth spectral features based on a second LSTM neural network model; and inputting the clean speech signal to a loudspeaker. The invention can effectively eliminate the echo component of a speech signal and is suitable for smart televisions.
Description
Technical Field
The invention relates to the field of audio signal processing, in particular to an echo cancellation method.
Background
With the advent of the artificial intelligence era, voice technology has become an important interface for human-computer interaction. As Internet of Things technology develops, people want to control smart devices by voice over longer distances and in more complex environments, so traditional near-field voice interaction can no longer meet their needs, and microphone array technology has become the core of far-field interaction.
For today's complex application scenarios, a series of key technologies that effectively improve the speech recognition rate has been developed on the basis of microphone arrays, mainly comprising speech enhancement, sound source localization, dereverberation, echo cancellation and noise suppression. A device with both a speaker and a microphone (such as a smart speaker or a smart television) must cancel the sound it plays in order to capture the talker's voice, and conventional echo cancellation algorithms rely mainly on adaptive signal processing to remove this interference. Everyday scenarios, however, contain many kinds of noise, so noise is a non-negligible factor in echo cancellation: without noise the cancellation works well, but in the presence of environmental noise the performance of the algorithm degrades, and with non-stationary noise in particular the result is far from ideal.
Disclosure of Invention
The invention aims to solve the problem of poor echo cancellation effect in the prior art, and provides an echo cancellation method based on a double-LSTM neural network.
The technical solution adopted by the invention to solve this problem is an echo cancellation method based on a double LSTM neural network, comprising the following steps:
step 1, acquiring a first sound source signal to be input to a loudspeaker and a second sound source signal input by a microphone, and extracting a first spectral feature of the first sound source signal and a second spectral feature of the second sound source signal;
step 2, obtaining an echo estimation signal and a noise estimation signal from the first and second spectral features based on a first LSTM neural network model, wherein the first LSTM neural network model is trained on a first sample sound source signal, a second sample sound source signal, a sample echo signal and a sample noise signal;
step 3, extracting a third spectral feature of the echo estimation signal and a fourth spectral feature of the noise estimation signal;
step 4, obtaining a clean speech signal from the second, third and fourth spectral features based on a second LSTM neural network model, wherein the second LSTM neural network model is trained on a second sample sound source signal, a sample echo signal, a sample noise signal and a clean sample speech signal;
and step 5, inputting the clean speech signal to a loudspeaker.
Further, the first LSTM neural network model includes an echo estimation model and a noise estimation model, the echo estimation model is obtained by training according to a first sample sound source signal, a second sample sound source signal, and a sample echo signal, and the noise estimation model is obtained by training according to the first sample sound source signal, the second sample sound source signal, and a sample noise signal.
The beneficial effects of the invention are as follows: the echo cancellation method based on the double LSTM neural network removes noisy echo signals using LSTM neural network models, eliminates the influence of noise on echo cancellation, and can effectively remove the echo component of a speech signal.
Drawings
Fig. 1 is a schematic flow chart of an echo cancellation method based on a dual LSTM neural network according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a conventional echo cancellation structure;
fig. 3 is another schematic flow chart of the echo cancellation method based on the dual LSTM neural network according to the embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
The invention aims to solve the problem of poor echo cancellation performance in the prior art and provides an echo cancellation method based on a double-LSTM neural network. Its main technical concept is as follows: acquire a first sound source signal to be input to a loudspeaker and a second sound source signal input by a microphone, and extract a first spectral feature of the first sound source signal and a second spectral feature of the second sound source signal; obtain an echo estimation signal and a noise estimation signal from the first and second spectral features based on a first LSTM neural network model, trained on a first sample sound source signal, a second sample sound source signal, a sample echo signal and a sample noise signal; extract a third spectral feature of the echo estimation signal and a fourth spectral feature of the noise estimation signal; obtain a clean speech signal from the second, third and fourth spectral features based on a second LSTM neural network model, trained on a second sample sound source signal, a sample echo signal, a sample noise signal and a clean sample speech signal; finally, input the clean speech signal to a loudspeaker.
Before use, the first and second LSTM neural network models are obtained by pre-training: the first model can be trained on a first sample sound source signal, a second sample sound source signal, a sample echo signal and a sample noise signal, and the second model can be trained on the second sample sound source signal, the sample echo signal, the sample noise signal and a clean sample speech signal. At run time, the first sound source signal to be input to the loudspeaker and the second sound source signal input by the microphone are acquired; the first sound source signal is the far-end signal that enters the echo path, and the second sound source signal is the signal collected by the microphone. The first spectral feature of the first sound source signal and the second spectral feature of the second sound source signal are first input into the first LSTM neural network model to obtain the echo estimation signal and the noise estimation signal for the current environment. The second spectral feature, the third spectral feature of the echo estimation signal and the fourth spectral feature of the noise estimation signal are then input into the second LSTM neural network model to obtain the clean speech signal. Finally, the clean speech signal is input to the loudspeaker, achieving echo cancellation of the sound source signal.
Examples
The echo cancellation method based on the double-LSTM neural network according to the embodiment of the present invention, as shown in FIG. 1, includes the following steps:
step S1, acquiring a first sound source signal to be input to a loudspeaker and a second sound source signal input by a microphone, and extracting a first frequency spectrum characteristic of the first sound source signal and a second frequency spectrum characteristic of the second sound source signal;
a conventional echo cancellation structure is shown in fig. 2: it cancels the echo of the far-end signal to be input to the speaker by means of an adaptive filter. In this embodiment, the far-end signal, that is, the first sound source signal to be input to the speaker, is likewise obtained, together with the second sound source signal input by the microphone, that is, the sound source signal collected by the microphone.
After the first sound source signal and the second sound source signal are obtained, a first spectrum feature corresponding to the first sound source signal and a second spectrum feature corresponding to the second sound source signal are extracted through a feature extraction method.
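The patent does not specify which feature extraction method is used. A common choice for this kind of pipeline is the log-magnitude short-time Fourier transform; the sketch below is an illustrative assumption rather than the patented implementation, and the frame length, hop size and Hann window are arbitrary but typical choices for 16 kHz speech:

```python
import numpy as np

def spectral_features(signal, frame_len=512, hop=256, eps=1e-8):
    """Log-magnitude STFT features: one row per frame, frame_len//2 + 1 bins."""
    window = np.hanning(frame_len)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)   # per-frame FFT of the windowed frame
    return np.log(np.abs(spectrum) + eps)    # log-magnitude spectral feature

# Example: features of the far-end (first) and microphone (second) signals.
fs = 16000
t = np.arange(fs) / fs
first = np.sin(2 * np.pi * 440 * t)                    # stand-in far-end signal
second = 0.3 * first + 0.01 * np.random.randn(fs)      # stand-in microphone signal
F1 = spectral_features(first)                          # first spectral feature
F2 = spectral_features(second)                         # second spectral feature
```

With these parameters a one-second 16 kHz signal yields 61 frames of 257 frequency bins each; any feature with this frame-by-bin layout would serve as LSTM input.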
Step S2, obtaining an echo estimation signal and a noise estimation signal from the first and second spectral features based on a first LSTM neural network model, the first LSTM neural network model being trained on a first sample sound source signal, a second sample sound source signal, a sample echo signal and a sample noise signal;
a Long Short-Term Memory (LSTM) neural network is a variant of the recurrent neural network (RNN) that overcomes the vanishing and exploding gradient problems of traditional RNNs. By introducing a gating mechanism into its memory cell, it can selectively retain contextual memory, reduce the effective network depth and alleviate gradient vanishing.
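To make the gating mechanism concrete, here is a minimal single-step LSTM forward pass in NumPy. The weights are random stand-ins, not trained values; the point is the additive cell update `c_new = f * c + i * g`, which is what lets the network selectively retain context and mitigates vanishing gradients:

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step; gates [input, forget, cell, output] stacked in W, U, b."""
    n = h.shape[0]
    z = W @ x + U @ h + b                    # all four gate pre-activations at once
    i = 1.0 / (1.0 + np.exp(-z[:n]))         # input gate
    f = 1.0 / (1.0 + np.exp(-z[n:2 * n]))    # forget gate: keeps contextual memory
    g = np.tanh(z[2 * n:3 * n])              # candidate cell state
    o = 1.0 / (1.0 + np.exp(-z[3 * n:]))     # output gate
    c_new = f * c + i * g                    # additive update eases gradient flow
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
n_in, n_hid = 8, 16
W = 0.1 * rng.standard_normal((4 * n_hid, n_in))
U = 0.1 * rng.standard_normal((4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for _ in range(5):                           # run a short input sequence
    h, c = lstm_step(rng.standard_normal(n_in), h, c, W, U, b)
```

In the patented method, stacks of such cells (with trained weights) map sequences of spectral-feature frames to the echo and noise estimates.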
Specifically, the first LSTM neural network model is trained in advance on a first sample sound source signal, a second sample sound source signal, a sample echo signal and a sample noise signal. In practice, noise signals in different environments can be collected as sample noise signals; echo signals at different speaker volumes and at different speaker-microphone distances can be collected as sample echo signals; and the first and second sample sound source signals corresponding to these conditions are collected. The initial LSTM neural network model is then trained on these four kinds of samples to obtain the first LSTM neural network model.
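The patent describes the training data only at this level (noise from different environments, echoes at different volumes and speaker-microphone distances). A toy sketch of assembling such sample pairs might look as follows; the gains, delays and white-noise stand-ins are all illustrative assumptions, with gain modeling speaker volume and delay modeling distance:

```python
import numpy as np

rng = np.random.default_rng(1)
fs = 16000

def make_training_pair(clean, far_end, noise, echo_gain, delay):
    """Simulate a microphone capture: near-end speech + delayed, attenuated echo + noise."""
    echo = np.zeros_like(clean)
    echo[delay:] = echo_gain * far_end[:len(clean) - delay]
    mic = clean + echo + noise               # second (microphone) sample signal
    return mic, echo                         # input and echo target for model training

clean = 0.1 * rng.standard_normal(fs)        # stand-in near-end speech
far_end = 0.5 * rng.standard_normal(fs)      # first sample sound source signal
pairs = []
for gain in (0.1, 0.3, 0.6):                 # different speaker volumes
    for delay in (80, 160, 320):             # different distances, in samples
        noise = 0.02 * rng.standard_normal(fs)   # a different noise condition
        pairs.append(make_training_pair(clean, far_end, noise, gain, delay))
```

Real training sets would of course use recorded speech, measured room echoes and real environmental noise rather than synthetic signals.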
In use, the first spectral feature of the currently acquired first sound source signal and the second spectral feature of the second sound source signal are input into the first LSTM neural network model to obtain the echo estimation signal and noise estimation signal corresponding to the current environment.
In this embodiment, the first LSTM neural network model may include an echo estimation model and a noise estimation model. The echo estimation model computes the echo estimation signal and may be trained on a first sample sound source signal, a second sample sound source signal and a sample echo signal; the noise estimation model computes the noise estimation signal and may be trained on a first sample sound source signal, a second sample sound source signal and a sample noise signal.
Step S3, extracting a third spectral feature of the echo estimation signal and a fourth spectral feature of the noise estimation signal;
specifically, corresponding to step S1, the existing feature extraction method may be used to perform feature extraction on the echo estimation signal and the noise estimation signal output by the first LSTM neural network model, so as to obtain a third spectral feature of the echo estimation signal and a fourth spectral feature of the noise estimation signal.
Step S4, obtaining a clean speech signal from the second, third and fourth spectral features based on a second LSTM neural network model, the second LSTM neural network model being trained on a second sample sound source signal, a sample echo signal, a sample noise signal and a clean sample speech signal;
specifically, the second LSTM neural network model is also trained in advance, on a second sample sound source signal, a sample echo signal, a sample noise signal and a clean sample speech signal. In practice, noise signals in different environments can be collected as sample noise signals; echo signals at different speaker volumes and speaker-microphone distances can be collected as sample echo signals; and the corresponding second sample sound source signals, together with clean speech signals from different users, are collected. The initial LSTM neural network model is then trained on these samples to obtain the second LSTM neural network model.
In use, the second spectral feature of the second sound source signal, the third spectral feature of the echo estimation signal and the fourth spectral feature of the noise estimation signal are input into the second LSTM neural network model to obtain the clean speech signal.
And step S5, inputting the clean speech signal to a loudspeaker.
Finally, the clean speech signal output by the second LSTM neural network model is input to a loudspeaker, achieving echo cancellation of the sound source signal.
In summary, as shown in fig. 3, this embodiment inputs the first and second sound source signals into the first LSTM neural network model to obtain the echo estimation signal and the noise estimation signal, extracts the spectral features of those estimates, and then inputs these features together with the spectral feature of the second sound source signal into the second LSTM neural network model to obtain the target signal. The method retains contextual memory, reduces network depth and alleviates gradient vanishing, and it markedly suppresses noisy echo signals.
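The overall dataflow of the two-model method can be sketched as a short driver function. The feature extractor and the two models below are oracle stand-ins closing over known toy signals (the real components are the trained LSTM models the patent describes), so only the wiring of steps S1-S4 is meaningful here:

```python
import numpy as np

def run_pipeline(first_src, second_src, features, model1, model2):
    """Wiring of the method: S1 features, S2 model 1, S3 features of estimates, S4 model 2."""
    f1, f2 = features(first_src), features(second_src)   # step S1
    echo_est, noise_est = model1(f1, f2)                 # step S2: estimation signals
    f3, f4 = features(echo_est), features(noise_est)     # step S3
    return model2(f2, f3, f4)                            # step S4: clean-speech estimate

rng = np.random.default_rng(7)
n = 1024
first = rng.standard_normal(n)             # far-end signal to the loudspeaker
echo = 0.3 * first                         # toy echo path: pure attenuation
near = 0.05 * rng.standard_normal(n)       # near-end component to recover
second = near + echo                       # microphone signal

features = lambda x: np.log(np.abs(np.fft.rfft(x)) + 1e-8)   # toy feature extractor
# Oracle stand-ins for the two trained LSTM models (they ignore their inputs):
model1 = lambda f1, f2: (echo, np.zeros(n))
model2 = lambda f2, f3, f4: np.fft.irfft(np.fft.rfft(second) - np.fft.rfft(echo), n)

clean_est = run_pipeline(first, second, features, model1, model2)
```

Because the stand-in models are oracles, `clean_est` recovers the near-end component exactly; real trained models would only approximate it.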
Claims (2)
1. An echo cancellation method based on a double LSTM neural network, characterized by comprising the following steps:
step 1, acquiring a first sound source signal to be input to a loudspeaker and a second sound source signal input by a microphone, and extracting a first spectral feature of the first sound source signal and a second spectral feature of the second sound source signal;
step 2, obtaining an echo estimation signal and a noise estimation signal from the first and second spectral features based on a first LSTM neural network model, wherein the first LSTM neural network model is trained on a first sample sound source signal, a second sample sound source signal, a sample echo signal and a sample noise signal;
step 3, extracting a third spectral feature of the echo estimation signal and a fourth spectral feature of the noise estimation signal;
step 4, obtaining a clean speech signal from the second, third and fourth spectral features based on a second LSTM neural network model, wherein the second LSTM neural network model is trained on a second sample sound source signal, a sample echo signal, a sample noise signal and a clean sample speech signal;
and step 5, inputting the clean speech signal to a loudspeaker.
2. The method of claim 1, wherein the first LSTM neural network model comprises an echo estimation model trained from a first sample sound source signal, a second sample sound source signal, and a sample echo signal, and a noise estimation model trained from a first sample sound source signal, a second sample sound source signal, and a sample noise signal.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011455735.5A | 2020-12-10 | 2020-12-10 | Echo cancellation method based on double LSTM neural network |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN112614502A | 2021-04-06 |
| CN112614502B | 2022-01-28 |
Family

ID=75233242

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202011455735.5A (granted as CN112614502B, Active) | Echo cancellation method based on double LSTM neural network | 2020-12-10 | 2020-12-10 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN112614502B (en) |

Families Citing this family (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11984110B2 | 2022-03-07 | 2024-05-14 | MediaTek Singapore Pte. Ltd. | Heterogeneous computing for hybrid acoustic echo cancellation |
Citations (4)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111161752A * | 2019-12-31 | 2020-05-15 | GoerTek Inc. | Echo cancellation method and device |
| CN111225317A * | 2020-01-17 | 2020-06-02 | Sichuan Changhong Electric Co., Ltd. | Echo cancellation method |
| US10854186B1 * | 2019-07-22 | 2020-12-01 | Amazon Technologies, Inc. | Processing audio data received from local devices |
| CN112055284A * | 2019-06-05 | 2020-12-08 | Beijing Horizon Robotics Technology R&D Co., Ltd. | Echo cancellation method, neural network training method, apparatus, medium, and device |
Family Cites Families (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP3474280B1 * | 2017-10-19 | 2021-07-07 | Goodix Technology (HK) Company Limited | Signal processor for speech signal enhancement |
| WO2020077232A1 * | 2018-10-12 | 2020-04-16 | Cambridge Cancer Genomics Limited | Methods and systems for nucleic acid variant detection and analysis |

2020-12-10: application CN202011455735.5A filed; granted as patent CN112614502B (Active).
Non-Patent Citations (4)

- Jung-Hee Kim et al., "Attention Wave-U-Net for Acoustic Echo Cancellation", INTERSPEECH 2020, published 2020-10-29.
- Hao Zhang et al., "Deep Learning for Acoustic Echo Cancellation in Noisy and Double-Talk Scenarios", Interspeech 2018, published 2018-09-06.
- Chen Lin, "Research and Implementation of a Real-Time Echo Cancellation Algorithm for Conference Calls" (in Chinese), China Masters' Theses Full-Text Database, Information Science and Technology Series, published 2020-06-15.
- Wang Dongxia et al., "An Echo and Noise Suppression Algorithm Based on a BLSTM Neural Network" (in Chinese), Journal of Signal Processing, published 2020-06-12.
Legal Events

- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant