CN111885275A - Echo cancellation method and device for voice signal, storage medium and electronic device - Google Patents
- Publication number
- CN111885275A CN111885275A CN202010718717.5A CN202010718717A CN111885275A CN 111885275 A CN111885275 A CN 111885275A CN 202010718717 A CN202010718717 A CN 202010718717A CN 111885275 A CN111885275 A CN 111885275A
- Authority
- CN
- China
- Prior art keywords
- signal
- echo
- voice
- far
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M9/00—Arrangements for interconnection not involving centralised switching
- H04M9/08—Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M9/00—Arrangements for interconnection not involving centralised switching
- H04M9/08—Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
- H04M9/082—Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic using echo cancellers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02082—Noise filtering the noise being echo, reverberation of the speech
Abstract
The embodiments of the invention provide a method, a device, a storage medium and an electronic device for eliminating the echo of a voice signal. The method comprises: acquiring a group of voice signals comprising a far-end reference signal and a near-end voice signal collected by a voice collection device, and inputting the far-end reference signal into a target neural network model to obtain a target echo estimation signal; and performing target processing on the target echo estimation signal and the near-end voice signal to obtain a target voice signal, thereby eliminating the linear and nonlinear echo signals contained in the near-end voice signal. This solves the problem in the related art that the echo of a voice signal is difficult to eliminate effectively, achieving the technical effect that the linear and nonlinear echo signals are effectively eliminated while the integrity of the voice signal is kept and the original sound source signal is not damaged.
Description
Technical Field
Embodiments of the invention relate to the field of communications, and in particular to a method and device for eliminating the echo of a voice signal, a storage medium, and an electronic device.
Background
In the related art, sound emitted through the speaker of a voice terminal device is picked up again by the microphone of the same device, producing voice interference, that is, an echo of the voice signal. During echo cancellation, a structurally complex nonlinear processing module is generally used to suppress nonlinear echoes in the voice signal; this entails computationally expensive nonlinear operations and easily damages the spectral structure of the main sound source.
No effective solution has yet been proposed for the technical problem in the related art that the echo of a voice signal is difficult to eliminate effectively.
Disclosure of Invention
Embodiments of the present invention provide a method, an apparatus, a storage medium, and an electronic apparatus for eliminating an echo of a voice signal, so as to at least solve a problem that it is difficult to effectively eliminate the echo of the voice signal in the related art.
According to an embodiment of the present invention, there is provided an echo cancellation method for a voice signal, including: acquiring a first group of voice signals, wherein the first group of voice signals comprise far-end reference signals and near-end voice signals acquired by voice acquisition equipment, and the far-end reference signals are voice signals needing to be played by a loudspeaker; inputting the far-end reference signal into a target neural network model to obtain a target echo estimation signal, wherein the target neural network model is used for determining a near-end echo estimation signal acquired by the voice acquisition equipment after the far-end reference signal is played by the loudspeaker; and performing target processing on the target echo estimation signal and the near-end voice signal to obtain a target voice signal, wherein the target voice signal is a voice signal obtained after a linear echo signal and a nonlinear echo signal contained in the near-end voice signal are eliminated.
According to another embodiment of the present invention, there is provided an echo canceling device for a speech signal, including an acquisition module, an input module and a processing module. The acquisition module is used for acquiring a first group of voice signals, wherein the first group of voice signals comprise a far-end reference signal and a near-end voice signal acquired by voice acquisition equipment, and the far-end reference signal is a voice signal to be played by a loudspeaker. The input module is used for inputting the far-end reference signal into a target neural network model to obtain a target echo estimation signal, wherein the target neural network model is used for determining a near-end echo estimation signal acquired by the voice acquisition equipment after the far-end reference signal is played by the loudspeaker. The processing module is used for performing target processing on the target echo estimation signal and the near-end voice signal to obtain a target voice signal, wherein the target voice signal is a voice signal obtained after the linear echo signal and the nonlinear echo signal contained in the near-end voice signal are eliminated.
According to a further embodiment of the present invention, there is also provided a computer-readable storage medium having a computer program stored thereon, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
According to the method, a group of voice signals comprising a far-end reference signal and a near-end voice signal collected by voice collection equipment is acquired, and the far-end reference signal is input into a target neural network model to obtain a target echo estimation signal; the target echo estimation signal and the near-end voice signal then undergo target processing to obtain a target voice signal, so that the linear and nonlinear echo signals contained in the near-end voice signal are eliminated. This solves the problem in the related art that the echo of a voice signal is difficult to eliminate effectively: the linear and nonlinear echo signals are effectively eliminated, the integrity of the voice signal is kept, and the original sound source signal is not damaged.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a block diagram of a hardware configuration of a mobile terminal of a method for echo cancellation of a voice signal according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for echo cancellation of a speech signal according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a method for echo cancellation of a speech signal according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a linear interpolation method according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a method for echo cancellation of a speech signal according to another embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a neural network for implementing echo cancellation of a speech signal according to an embodiment of the present invention;
FIG. 7 is a flow chart of another echo cancellation method for a speech signal according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating a neural network model training method for performing echo cancellation on a speech signal according to an embodiment of the present invention;
FIG. 9 is a diagram illustrating a neural network model training method for performing echo cancellation on a speech signal according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an echo cancellation device for a speech signal according to an embodiment of the present invention.
Detailed Description
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings in conjunction with the embodiments.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The method embodiments provided in the embodiments of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking the example of the operation on the mobile terminal, fig. 1 is a block diagram of a hardware structure of the mobile terminal of the echo cancellation method for a voice signal according to the embodiment of the present invention. As shown in fig. 1, the mobile terminal may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), and a memory 104 for storing data, wherein the mobile terminal may further include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration, and does not limit the structure of the mobile terminal. For example, the mobile terminal may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program of an application software and a module, such as a computer program corresponding to the echo cancellation method for a speech signal in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
In the present embodiment, a method for echo cancellation of a voice signal running on a mobile terminal, a computer terminal or a similar computing device is provided. Fig. 2 is a flow chart of an alternative echo cancellation method for a voice signal according to an embodiment of the present invention; as shown in fig. 2, the flow includes the following steps:
s202, acquiring a first group of voice signals, wherein the first group of voice signals comprise far-end reference signals and near-end voice signals acquired by voice acquisition equipment, and the far-end reference signals are voice signals needing to be played by a loudspeaker;
s204, inputting the far-end reference signal into a target neural network model to obtain a target echo estimation signal, wherein the target neural network model is used for determining a near-end echo estimation signal acquired by voice acquisition equipment after the far-end reference signal is played by a loudspeaker;
and S206, performing target processing on the target echo estimation signal and the near-end voice signal to obtain a target voice signal, wherein the target voice signal is a voice signal obtained after a linear echo signal and a nonlinear echo signal contained in the near-end voice signal are eliminated.
Optionally, in this embodiment, the far-end reference signal may include, but is not limited to, an original audio signal that needs to be played, and the near-end speech signal may include, but is not limited to, a speech signal collected by a speech collecting device, and specifically may include, but is not limited to, all sounds received by a microphone, for example, a sound of a human speaking and an echo of a sound played by a speaker.
Optionally, in this embodiment, the target neural network model may include, but is not limited to, an RNN recurrent neural network model, a RESNET residual neural network model, and the like, which is only an example and is not limited in any way in this embodiment.
According to the embodiment, a group of voice signals comprising a far-end reference signal and a near-end voice signal collected by voice collection equipment is acquired, and the far-end reference signal is input into a target neural network model to obtain a target echo estimation signal; the target echo estimation signal and the near-end voice signal then undergo target processing to obtain a target voice signal, so that the linear and nonlinear echo signals contained in the near-end voice signal are eliminated. This solves the problem in the related art that the echo of a voice signal is difficult to eliminate effectively: the linear and nonlinear echo signals are effectively eliminated, the integrity of the voice signal is kept, and the original sound source signal is not damaged.
In an alternative embodiment, inputting the far-end reference signal into the target neural network model to obtain the target echo estimation signal includes: inputting a first far-end reference signal acquired in a first data segment into an adaptive filter to obtain a first echo estimation signal, wherein the far-end reference signal comprises the first far-end reference signal; inputting the first far-end reference signal and the first echo estimation signal into a first target neural network model to obtain a second echo estimation signal, wherein the target neural network model comprises the first target neural network model; and subtracting the second echo estimation signal from the near-end signal acquired in the first data segment to obtain a first error signal.
Optionally, in this embodiment, the first target neural network model may include, but is not limited to, a residual neural network model. The first data segment may include, but is not limited to, a segment in which the far-end reference signal is sampled at a preset sampling frequency for a preset sampling time before being played by the speaker. The far-end reference signal may be passed through an adaptive filter to obtain a first echo estimation signal representing an estimate of the near-end echo. After this echo estimate is subtracted from the near-end speech signal, an error signal is obtained that contains both the original sound source signal and a residual echo. After the residual echo is eliminated from the error signal by a multi-stage residual network, the result is fed back, on the one hand, to the adaptive filter for real-time updating and adjustment of its parameters and, on the other hand, to the input of the residual neural network for iterative cancellation of the residual echo.
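As an illustrative sketch only (not part of the claimed embodiment), the adaptive filtering stage described above could be realized with a normalized LMS (NLMS) filter; the function name, filter order and step size below are assumptions chosen for the example:

```python
import numpy as np

def nlms_echo_estimate(x, d, order=64, mu=0.5, eps=1e-8):
    """Estimate the linear echo in the near-end signal d from the
    far-end reference x with a normalized LMS adaptive filter.
    Returns (echo_estimate, error_signal)."""
    w = np.zeros(order)                        # adaptive filter weights
    y = np.zeros(len(d))                       # linear echo estimate
    e = np.zeros(len(d))                       # error (echo-cancelled) signal
    for n in range(order - 1, len(d)):
        x_tap = x[n - order + 1:n + 1][::-1]   # x[n], x[n-1], ..., x[n-order+1]
        y[n] = w @ x_tap
        e[n] = d[n] - y[n]
        # normalized step size keeps adaptation stable under varying input power
        w += mu * e[n] * x_tap / (x_tap @ x_tap + eps)
    return y, e
```

Here `e` plays the role of the error signal that is fed back for weight updates, and `y` is the linear echo estimate that the residual network stages would then refine.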
Optionally, in this embodiment, a neural network is used to cancel the residual nonlinear echo, while adaptive filtering is used to cancel the linear echo. Because the spectrum of the residual echo left after the adaptive filter removes the linear echo differs markedly from the spectrum of the main sound source, and its amplitude is smaller than that of the main sound source, a specially trained neural network can learn the characteristics of the residual echo well, so that the residual echo is removed from the original spectrum without damaging the spectrum of the original main sound source.
Fig. 3 is a schematic diagram of an echo cancellation method for a speech signal according to an embodiment of the present invention. As shown in fig. 3, taking the target neural network model as a residual neural network model as an example, the adaptive filter is connected in parallel with the first target neural network model, so the residual network has two inputs: the far-end reference signal and the echo-cancelled error signal. After feature extraction, the two signals are sent to a first-layer residual network 302 whose identity mapping is replaced by an adaptive filter 304; the input of the adaptive filter is the far-end reference signal, and its output is an estimate of the near-end echo obtained from the far-end reference signal. In other words, the identity mapping may include, but is not limited to, an identity mapping of the near-end linear echo. The residual network can be implemented with one layer or multiple layers according to the application and accuracy requirements; if a multi-layer implementation is used, a standard residual network structure is used from the second layer onward.
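A minimal numeric sketch of this idea (random placeholder weights; the real model's layer sizes and activations are not specified in this document): the usual identity skip connection of a residual block is replaced by an externally supplied signal, here standing in for the adaptive filter's linear echo estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(v):
    return np.maximum(v, 0.0)

def residual_block(features, skip):
    """One residual block in which the identity skip connection is replaced
    by an externally supplied signal (e.g. the adaptive filter's linear echo
    estimate). The two dense layers use random placeholder weights; a
    trained model would learn them."""
    dim = features.shape[0]
    w1 = rng.standard_normal((dim, dim)) * 0.1
    w2 = rng.standard_normal((dim, dim)) * 0.1
    # standard residual form: F(x) + skip, with skip != x here
    return relu(w2 @ relu(w1 @ features)) + skip
```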
Optionally, in this embodiment, the frequency bin gain of the second error signal may be calculated by:
Since the gain coefficients calculated by the network are obtained per frequency band, a calculation accurate to each frequency point (corresponding to the aforementioned sampling points) is required. A schematic diagram of the calculation process (linear interpolation) is shown in fig. 4, and the calculation formula is as follows:

g_k(m) = g_k + (g_{k+1} - g_k) * m / M

wherein g_k(m) represents the gain coefficient of the m-th frequency point of the k-th frequency band, g_k and g_{k+1} are the gain coefficients of the k-th and (k+1)-th frequency bands respectively, m is the index of the frequency point within the k-th frequency band, and M represents the length of the k-th frequency band.
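The band-to-bin gain interpolation described here can be sketched as follows (the function name and the flat extrapolation used for the last band are assumptions for illustration):

```python
import numpy as np

def interpolate_band_gains(band_gains, band_lengths):
    """Expand per-band gain coefficients to per-frequency-point gains using
    linear interpolation, g_k(m) = g_k + (g_{k+1} - g_k) * m / M."""
    bin_gains = []
    for k, m_len in enumerate(band_lengths):
        g_lo = band_gains[k]
        # for the last band there is no g_{k+1}; hold the gain constant
        g_hi = band_gains[k + 1] if k + 1 < len(band_gains) else band_gains[k]
        for m in range(m_len):
            bin_gains.append(g_lo + (g_hi - g_lo) * m / m_len)
    return np.array(bin_gains)
```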
In an optional embodiment, after the second echo estimation signal is subtracted from the near-end signal to obtain the first error signal, the method further comprises: feeding the first error signal back to the adaptive filter and the first target neural network model to update target parameters in the adaptive filter and the first target neural network model, wherein the target parameters are used for eliminating the nonlinear echo signal in the near-end speech signal; inputting a second far-end reference signal acquired in a second data segment into the updated adaptive filter to obtain a third echo estimation signal; inputting the second far-end reference signal and the third echo estimation signal into the updated first target neural network model to obtain a fourth echo estimation signal; and subtracting the fourth echo estimation signal from the near-end signal acquired in the second data segment to obtain a second error signal.
Optionally, in this embodiment, a final near-end echo estimation signal is obtained after one or more stages of residual network calculation and is subtracted from the near-end speech signal to obtain an error signal. The error signal is fed back to the adaptive filter for real-time updating of the weight coefficients, and to the residual network input for residual echo cancellation. After multiple iterations, the adaptive filter obtains an increasingly accurate linear echo estimate and the residual network an increasingly accurate residual echo estimate, so that the error signal gradually tends to contain only the original sound source audio signal, achieving the echo cancellation effect.
In an optional embodiment, inputting the far-end reference signal into a target neural network model to obtain a target echo estimation signal includes: inputting a first far-end reference signal acquired in a first data segment into an adaptive filter to obtain a fifth echo estimation signal, wherein the far-end reference signal comprises the first far-end reference signal, and the fifth echo estimation signal is an estimate of the linear echo signal of the near-end voice signal; subtracting the fifth echo estimation signal from the near-end voice signal to obtain a third error signal; and inputting the third error signal and the first far-end reference signal into a second target neural network model to obtain a fourth error signal, wherein the target neural network model comprises the second target neural network model, and the fourth error signal is the voice signal obtained after the nonlinear echo signal is eliminated from the third error signal.
Optionally, in this embodiment, the far-end reference signal is passed through an adaptive filter to obtain an estimate of the near-end echo; this echo estimate is subtracted from the near-end speech signal to obtain an error signal containing both the original sound source signal and a residual echo, and the error signal is fed back to the adaptive filter for real-time updating and adjustment of its parameters.
Optionally, in this embodiment, the second target neural network model may include, but is not limited to, a recurrent neural network model, and the second preset threshold may be the same as or different from the first preset threshold, and may be preset according to actual needs, or may be obtained according to a related algorithm.
In an optional embodiment, after inputting the third error signal and the first far-end reference signal into a second target neural network model to obtain a fourth error signal, the method further comprises: feeding back the fourth error signal to the adaptive filter and the second target neural network model to update target parameters of the adaptive filter and the second target neural network model, wherein the target parameters are used for eliminating a nonlinear echo signal in the near-end speech signal; inputting a second far-end reference signal acquired in a second data segment into the updated adaptive filter to obtain a sixth echo estimation signal; subtracting the near-end voice signal acquired in the second data segment from the sixth echo estimation signal to obtain a fifth error signal; and inputting the fifth error signal and the second far-end reference signal into the updated second target neural network model to obtain a sixth error signal.
Optionally, in this embodiment, after adaptive filtering the linear echo can be suppressed to a large extent, leaving only the sound source signal (the clean speech signal to be collected) and a residual echo signal. The filtered residual echo has a much smaller amplitude than the sound source signal, and its spectrum differs greatly from the sound source spectrum, so it can be regarded as a special type of noise and eliminated by, but not limited to, a trained neural network. Since the far-end reference signal is correlated with the residual echo to some extent, the inputs to the neural network may be the far-end reference signal together with the adaptively filtered error signal, or only the adaptively filtered error signal.
Fig. 5 is a schematic diagram of an echo cancellation method for a speech signal according to an embodiment of the present invention, and as shown in fig. 5, an adaptive filter is connected in series with a second target neural network model, an estimated signal of a near-end echo is obtained by inputting a far-end reference signal, and then the estimated signal of the near-end echo is subtracted from the near-end speech signal, so as to finally obtain an original sound source signal from which a linear echo and a nonlinear echo are cancelled.
In an optional embodiment, inputting the third error signal and the first far-end reference signal into a second target neural network model to obtain a fourth error signal includes: respectively performing framing and windowing on the third error signal and the first far-end reference signal to obtain a second group of voice signals; performing Fourier transform on the second group of voice signals to obtain a third group of voice signals, wherein the third group of voice signals are represented in the frequency domain; extracting features of the third group of voice signals to obtain a group of feature vectors; inputting the group of feature vectors into the second target neural network model to obtain a fourth group of voice signals, wherein the fourth group of voice signals are voice signals obtained by eliminating nonlinear echoes using the frequency band gain coefficients calculated by the neural network from the group of feature vectors; performing inverse Fourier transform on the fourth group of voice signals to obtain a fifth group of voice signals, wherein the fifth group of voice signals are represented in the time domain; and performing windowing on the fifth group of voice signals to obtain the fourth error signal.
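The framing/windowing, Fourier transform, per-bin gain, inverse transform and synthesis windowing chain can be sketched as follows. The frame length, hop size and the square-root Hann analysis-synthesis window pair are assumptions for illustration, and `gain_fn` stands in for the neural network's gain computation:

```python
import numpy as np

def stft_gain_filter(signal, gain_fn, frame=512, hop=256):
    """Framing/windowing -> FFT -> per-bin gain -> inverse FFT -> windowed
    overlap-add. gain_fn maps a magnitude spectrum to per-bin gains."""
    n = np.arange(frame)
    # square-root periodic Hann: the product of analysis and synthesis
    # windows gives perfect reconstruction at 50% overlap
    win = np.sqrt(0.5 - 0.5 * np.cos(2.0 * np.pi * n / frame))
    out = np.zeros(len(signal))
    for start in range(0, len(signal) - frame + 1, hop):
        seg = signal[start:start + frame] * win               # analysis window
        spec = np.fft.rfft(seg)
        spec = spec * gain_fn(np.abs(spec))                   # echo suppression
        out[start:start + frame] += np.fft.irfft(spec) * win  # synthesis window
    return out
```

With an all-ones gain function the interior of the signal is reconstructed exactly, which is a useful sanity check before plugging in learned gains.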
Optionally, in the present embodiment, the structure of the second target neural network is as shown in fig. 6. It mainly includes three modules: double-end detection 602, residual echo estimation 604, and residual echo cancellation 606. The double-end detection module contains two signal detection submodules: a speech detection module 608 (also known as the near-end signal detection module) and an echo detection module 610 (also known as the far-end signal detection module).
The residual echo cancellation process is illustrated in fig. 7, and the flow includes the following steps:
S702, windowing. The input signals (the speech signal and the reference signal) undergo framing and windowing to eliminate spectral discontinuities at frame boundaries.
S704, short-time Fourier transform. The input signal is transformed to the frequency domain, making frequency-domain features convenient to extract.
S706, feature extraction. For the near-end or far-end signal, 42 features are extracted per frame: 22 Bark-scale frequency-domain features, the first-order and second-order differences of the first 6 Bark features (6 + 6), 6 coefficients related to the pitch characteristics of the audio, 1 pitch period, and 1 spectral dynamics feature.
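The 42-dimensional count in S706 can be verified by adding up the groups listed above; the dictionary below is mere bookkeeping, and the key names are illustrative, not from the patent:

```python
# Per-frame feature layout described in S706 (RNNoise-style grouping);
# the counts come from the text, the key names are assumptions.
feature_layout = {
    "bark_band_features": 22,   # 22 Bark-scale frequency-domain features
    "bark_delta_1st": 6,        # first-order differences of the first 6
    "bark_delta_2nd": 6,        # second-order differences of the first 6
    "pitch_coefficients": 6,    # 6 pitch-related coefficients
    "pitch_period": 1,
    "spectral_dynamics": 1,     # spectral non-stationarity measure
}
total_dims = sum(feature_layout.values())  # 42 features per frame
```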
S708, RNN network. Three recurrent neural networks complete the echo cancellation of the input speech, mainly by calculating a gain coefficient for each Bark frequency band; each frequency band of the input audio is multiplied by the corresponding coefficient, passing the desired speech and cancelling the echo.
S710, band gain interpolation. Because the RNN calculates only 22 Bark-band gain coefficients, an interpolation algorithm is used to obtain a gain coefficient for each frequency bin.
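The interpolation in S710 can be sketched as a linear interpolation from the 22 Bark-band gains to per-bin gains; the band-center positions in the example are illustrative, not the actual Bark band edges:

```python
import numpy as np

def interpolate_band_gains(band_gains, band_centers, n_bins):
    """Expand per-band gains (e.g. 22 Bark bands) to a per-frequency-bin
    gain curve by linear interpolation over bin index."""
    bins = np.arange(n_bins)
    return np.interp(bins, band_centers, band_gains)
```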
S712, inverse Fourier transform. The echo-cancelled frequency-domain audio signal is converted back to time-domain audio.
S714, windowing. A synthesis window corresponding to the analysis window applied before signal processing is applied, so that the two windows together form a complete window pair for reconstruction.
Optionally, in this embodiment, the recurrent neural network distinguishes the near-end and far-end signals by being trained in advance on labeled audio data. According to the near-end speech signal and the loudspeaker reference channel signal (far-end reference signal) input to it, the double-end detection module judges whether only a far-end signal exists, only a near-end signal exists, or both exist. Echo cancellation is performed only when a far-end signal is present. The training process of the double-end detection module is shown in fig. 7. The training data can be manually labeled data or simulated data. The collection process for manually labeled data is as follows: a recording device collects the audio played by the sound source while simultaneously playing audio through its own loudspeaker. The audio recorded by the device is then a superposition of the sound-source audio and the device's own audio. A human listener then judges whether the far-end signal and the near-end signal are present. Taking training with simulated data as an example, the generation and training process may include, but is not limited to, the steps shown in fig. 8:
S802, the reference signal is the audio of the loudspeaker playback channel before it is played out by the loudspeaker. The reference-channel audio is divided into frames, and the audio energy of each frame is calculated. Two thresholds are set and each frame's energy is compared against them: if the energy is greater than threshold 2, the frame is labeled 1; if it is less than threshold 1, the frame is labeled 0; if it lies between threshold 1 and threshold 2, the frame is labeled 0.5. These numbers represent the probability that the frame of audio is present. Meanwhile, a feature vector is calculated for each frame of the reference channel.
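The S802 labeling rule can be sketched as follows; the frame length and the two energy thresholds are illustrative values, since the patent does not specify them:

```python
import numpy as np

def label_frames(audio, frame_len=480, thr_low=1e-4, thr_high=1e-2):
    """Per-frame presence labels as in S802: 1 if frame energy exceeds the
    high threshold, 0 if below the low threshold, 0.5 in between."""
    n_frames = len(audio) // frame_len
    labels = np.empty(n_frames)
    for i in range(n_frames):
        e = np.mean(audio[i * frame_len:(i + 1) * frame_len] ** 2)
        labels[i] = 1.0 if e > thr_high else (0.0 if e < thr_low else 0.5)
    return labels
```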
S804, the real far-end signal is the signal received by the microphone after the reference signal is played by the device's own loudspeaker and propagates through the room. Here, the reference signal is convolved with a room impulse response to produce the far-end signal, simulating the linear echo. The linear echo is passed through an adaptive filter to obtain the linear residual echo, onto which a nonlinear residual echo is superimposed. The nonlinear residual echo can be obtained by recording echoes on real devices and then processing them with a linear filter.
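The S804 simulation can be sketched as below. The patent obtains the nonlinear residual from recordings on real devices; the tanh term here is an assumed stand-in for that component, used only for illustration:

```python
import numpy as np

def simulate_far_end(reference, rir, nonlinear_gain=0.1):
    """Simulate the microphone-side echo as in S804: convolve the
    reference with a room impulse response (linear echo), then add a
    simple memoryless nonlinearity as a stand-in for the nonlinear
    residual echo."""
    linear_echo = np.convolve(reference, rir)[:len(reference)]
    nonlinear_residual = nonlinear_gain * np.tanh(3.0 * reference)
    return linear_echo + nonlinear_residual
```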
S806, the near-end signal is the signal emitted by the sound source and received by the microphone. A real near-end signal is disturbed by ambient noise, so different types of noise signals are superimposed on the clean audio to represent the near-end signal. The feature vector of the near-end signal is calculated after the noise is superimposed, while the labels for the near-end signal are still computed from the clean audio, using the same energy-calculation and threshold-comparison method as for the reference signal.
S808, the generated reference-signal feature vector and the feature vector of the sound-source signal with the residual echo superimposed are combined into a data input vector, and the corresponding generated labels form the label for that frame's input vector. The input vectors and their labels are fed into the double-end detection neural network for training. After training is completed, a neural network module capable of judging whether double-end signals are present is obtained.
In an optional embodiment, before inputting the far-end reference signal into the target neural network model to obtain the target echo estimation signal, the method further includes: determining feature vectors of the far-end reference signal and the near-end speech signal; labeling the far-end reference signal and the near-end voice signal to obtain labeling information corresponding to the feature vector; and inputting the characteristic vector and the labeling information into a neural network model to be trained to obtain the target neural network model.
Optionally, in this embodiment, as shown in fig. 9, the process includes the following steps:
S902, generating a residual echo signal. The reference-channel signal is convolved with a room impulse response to obtain an echo signal, i.e. the far-end signal; the same result can also be achieved by multiplication in the frequency domain. The echo signal is passed through a linear filter to remove the linear echo, and a certain amount of nonlinear residual echo is then superimposed. The nonlinear residual echo can be obtained by recording echoes on real devices and then processing them with a linear filter.
S904, feature extraction. The noise signal and the residual echo signal obtained in step S902 are superimposed onto the clean sound-source signal, and amplitude normalization is applied to prevent amplitude clipping; the feature vector of the mixed signal is then calculated as the input to the neural network.
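The mixing and normalization in S904 can be sketched as follows; the peak target value is an illustrative choice, not from the patent:

```python
import numpy as np

def mix_and_normalize(clean, noise, residual_echo, peak=0.99):
    """S904: superimpose the noise and residual echo on the clean signal,
    then normalize the amplitude so the peak never clips."""
    mixed = clean + noise + residual_echo
    m = np.max(np.abs(mixed))
    return mixed if m == 0 else mixed * (peak / m)
```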
S906, obtaining the labels. The band energies of the clean sound-source audio are calculated, giving Es,k. The residual echo signal obtained in step S902 and a noise signal are superimposed on the clean signal to obtain the microphone-received signal, and its band energies Em,k are calculated. The gain coefficient is then computed as gk = sqrt(Es,k / Em,k); this gain coefficient is the label needed for training the echo cancellation network.
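The S906 label computation can be sketched as below, assuming the gain takes the square-root band-energy-ratio form gk = sqrt(Es,k / Em,k), clipped to [0, 1]; the band edges in the example are illustrative, not the actual Bark band boundaries:

```python
import numpy as np

def gain_labels(clean_spec, mixed_spec, band_edges):
    """Per-band gain labels g_k = sqrt(E_s,k / E_m,k) clipped to [0, 1],
    where E_s,k and E_m,k are the band energies of the clean and the
    echo-plus-noise (microphone) spectra."""
    gains = np.empty(len(band_edges) - 1)
    for k in range(len(gains)):
        lo, hi = band_edges[k], band_edges[k + 1]
        e_s = np.sum(np.abs(clean_spec[lo:hi]) ** 2)
        e_m = np.sum(np.abs(mixed_spec[lo:hi]) ** 2)
        gains[k] = min(1.0, np.sqrt(e_s / (e_m + 1e-12)))
    return gains
```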
S908, training. The input feature vectors and their corresponding labels are fed into the whole echo cancellation network for training; the goal is to give the whole network the ability to calculate the gain coefficients.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
In this embodiment, an echo cancellation device for a speech signal is further provided, and the device is used to implement the foregoing embodiments and preferred embodiments, which have already been described and are not described again. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 10 is a block diagram of an echo cancellation device for a speech signal according to an embodiment of the present invention, as shown in fig. 10, the device includes:
an obtaining module 1002, configured to obtain a first group of voice signals, where the first group of voice signals includes a far-end reference signal and a near-end voice signal acquired by a voice acquisition device, and the far-end reference signal is a voice signal that needs to be played by a speaker;
an input module 1004, configured to input the far-end reference signal into a target neural network model to obtain a target echo estimation signal, where the target neural network model is configured to determine a near-end echo estimation signal acquired by the voice acquisition device after the far-end reference signal is played through the speaker;
a processing module 1006, configured to perform target processing on the target echo estimation signal and the near-end speech signal to obtain a target speech signal, where the target speech signal is a speech signal obtained after the linear echo signal and the nonlinear echo signal contained in the near-end speech signal are removed.
In an alternative embodiment, the input module 1004 includes: a first input unit, configured to input a first far-end reference signal acquired in a first data segment into an adaptive filter to obtain a first echo estimation signal, where the far-end reference signal includes the first far-end reference signal; a first input unit, configured to input the first far-end reference signal and the first echo estimation signal into a first target neural network model to obtain a second echo estimation signal, where the target neural network model includes the first target neural network model; and the first calculation unit is used for subtracting the second echo estimation signal from the near-end signal acquired in the first data segment to obtain a first error signal.
In an optional embodiment, the apparatus is further configured to: after subtracting the near-end signal from the second echo estimation signal to obtain a first error signal, feeding the first error signal back to the adaptive filter and the first target neural network model to update target parameters in the adaptive filter and the first target neural network model, wherein the target parameters are used for eliminating a nonlinear echo signal in the near-end speech signal; inputting a second far-end reference signal acquired in a second data segment into the updated adaptive filter to obtain a third echo estimation signal; inputting the second far-end reference signal and the third echo estimation signal into the updated first target neural network model to obtain a fourth echo estimation signal; and subtracting the fourth echo estimation signal from the near-end signal acquired in the first data segment to obtain a second error signal.
In an alternative embodiment, the input module 1004 includes: a second input unit, configured to input a first far-end reference signal acquired in a first data segment into an adaptive filter to obtain a fifth echo estimation signal, where the far-end reference signal includes the first far-end reference signal, and the fifth echo estimation signal is an estimate of a linear echo signal of the near-end speech signal; the second calculating unit is used for subtracting the near-end voice signal from the fifth echo estimation signal to obtain a third error signal; a second input unit, configured to input the third error signal and the first far-end reference signal into a second target neural network model to obtain a fourth error signal, where the target neural network model includes the second target neural network model, and the fourth error signal is a speech signal after the nonlinear echo signal is eliminated by the third error signal.
In an optional embodiment, the apparatus is further configured to: after the third error signal and the first far-end reference signal are input into a second target neural network model to obtain a fourth error signal, feeding the fourth error signal back to the adaptive filter and the second target neural network model to update target parameters of the adaptive filter and the second target neural network model, wherein the target parameters are used for eliminating a nonlinear echo signal in the near-end speech signal; inputting a second far-end reference signal acquired in a second data segment into the updated adaptive filter to obtain a sixth echo estimation signal; subtracting the near-end voice signal acquired in the second data segment from the sixth echo estimation signal to obtain a fifth error signal; and inputting the fifth error signal and the second far-end reference signal into the updated second target neural network model to obtain a sixth error signal.
In an optional embodiment, the apparatus is further configured to input the third error signal and the first far-end reference signal into a second target neural network model to obtain a fourth error signal by: respectively performing frame division and windowing on the third error signal and the first far-end reference signal to obtain a second group of voice signals; performing Fourier transform on the second group of voice signals to obtain a third group of voice signals, wherein the third group of voice signals are represented based on a frequency domain; extracting the features of the third group of voice signals to obtain a group of feature vectors; inputting the set of feature vectors into the second target neural network model to obtain a fourth set of voice signals, wherein the fourth set of voice signals are voice signals obtained by eliminating nonlinear echoes by using a frequency band gain coefficient calculated by the neural network according to the set of feature vectors; performing inverse Fourier transform on the fourth group of voice signals to obtain a fifth group of voice signals, wherein the fifth group of voice signals are voice signals based on time domain representation; and performing windowing processing on the fifth group of voice signals to obtain the fourth error signal.
In an optional embodiment, the apparatus is further configured to: before inputting the far-end reference signal into a target neural network model to obtain a target echo estimation signal, determining the feature vectors of the far-end reference signal and the near-end speech signal; labeling the far-end reference signal and the near-end voice signal to obtain labeling information corresponding to the feature vector; and inputting the characteristic vector and the labeling information into a neural network model to be trained to obtain the target neural network model.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Embodiments of the present invention also provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program is arranged to perform the steps of any of the above-mentioned method embodiments when executed.
In the present embodiment, the above-mentioned computer-readable storage medium may be configured to store a computer program for executing the steps of:
s1, acquiring a first group of voice signals, wherein the first group of voice signals comprise far-end reference signals and near-end voice signals acquired by voice acquisition equipment, and the far-end reference signals are voice signals needing to be played by a loudspeaker;
s2, inputting the far-end reference signal into a target neural network model to obtain a target echo estimation signal, wherein the target neural network model is used for determining a near-end echo estimation signal acquired by a voice acquisition device after the far-end reference signal is played by a loudspeaker;
s3, performing target processing on the target echo estimation signal and the near-end speech signal to obtain a target speech signal, where the target speech signal is a speech signal obtained by removing a linear echo signal and a nonlinear echo signal contained in the near-end speech signal.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
In an exemplary embodiment, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
In an exemplary embodiment, the processor may be configured to execute the following steps by a computer program:
s1, acquiring a first group of voice signals, wherein the first group of voice signals comprise far-end reference signals and near-end voice signals acquired by voice acquisition equipment, and the far-end reference signals are voice signals needing to be played by a loudspeaker;
s2, inputting the far-end reference signal into a target neural network model to obtain a target echo estimation signal, wherein the target neural network model is used for determining a near-end echo estimation signal acquired by a voice acquisition device after the far-end reference signal is played by a loudspeaker;
s3, performing target processing on the target echo estimation signal and the near-end speech signal to obtain a target speech signal, where the target speech signal is a speech signal obtained by removing a linear echo signal and a nonlinear echo signal contained in the near-end speech signal.
For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary embodiments, and details of this embodiment are not repeated herein.
It will be apparent to those skilled in the art that the various modules or steps of the invention described above may be implemented using a general purpose computing device, they may be centralized on a single computing device or distributed across a network of computing devices, and they may be implemented using program code executable by the computing devices, such that they may be stored in a memory device and executed by the computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into various integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A method for echo cancellation of a speech signal, comprising:
acquiring a first group of voice signals, wherein the first group of voice signals comprise far-end reference signals and near-end voice signals acquired by voice acquisition equipment, and the far-end reference signals are voice signals needing to be played by a loudspeaker;
inputting the far-end reference signal into a target neural network model to obtain a target echo estimation signal, wherein the target neural network model is used for determining a near-end echo estimation signal acquired by the voice acquisition equipment after the far-end reference signal is played by the loudspeaker;
and performing target processing on the target echo estimation signal and the near-end voice signal to obtain a target voice signal, wherein the target voice signal is a voice signal obtained after a linear echo signal and a nonlinear echo signal contained in the near-end voice signal are eliminated.
2. The method of claim 1, wherein inputting the far-end reference signal into a target neural network model to obtain a target echo estimation signal comprises:
inputting a first far-end reference signal acquired in a first data segment into an adaptive filter to obtain a first echo estimation signal, wherein the far-end reference signal comprises the first far-end reference signal;
inputting the first far-end reference signal and a first echo estimation signal into a first target neural network model to obtain a second echo estimation signal, wherein the target neural network model comprises the first target neural network model;
and subtracting the second echo estimation signal from the near-end signal acquired in the first data segment to obtain a first error signal.
3. The method of claim 2, wherein after subtracting the near-end signal from the second echo estimate signal to obtain a first error signal, the method further comprises:
feeding back the first error signal to the adaptive filter and the first target neural network model to update target parameters in the adaptive filter and the first target neural network model, wherein the target parameters are used for eliminating a nonlinear echo signal in the near-end speech signal;
inputting a second far-end reference signal acquired in a second data segment into the updated adaptive filter to obtain a third echo estimation signal;
inputting the second far-end reference signal and the third echo estimation signal into the updated first target neural network model to obtain a fourth echo estimation signal;
and subtracting the fourth echo estimation signal from the near-end signal acquired in the first data segment to obtain a second error signal.
4. The method of claim 1, wherein inputting the far-end reference signal into a target neural network model to obtain a target echo estimation signal comprises:
inputting a first far-end reference signal acquired in a first data segment into an adaptive filter to obtain a fifth echo estimation signal, wherein the far-end reference signal comprises the first far-end reference signal, and the fifth echo estimation signal is an estimation of a linear echo signal of the near-end voice signal;
subtracting the near-end voice signal from the fifth echo estimation signal to obtain a third error signal;
and inputting the third error signal and the first far-end reference signal into a second target neural network model to obtain a fourth error signal, wherein the target neural network model comprises the second target neural network model, and the fourth error signal is a voice signal obtained after the third error signal eliminates the nonlinear echo signal.
5. The method of claim 4, wherein after inputting the third error signal and the first remote reference signal into a second target neural network model to obtain a fourth error signal, the method further comprises:
feeding back the fourth error signal to the adaptive filter and the second target neural network model to update target parameters of the adaptive filter and the second target neural network model, wherein the target parameters are used for eliminating a nonlinear echo signal in the near-end speech signal;
inputting a second far-end reference signal acquired in a second data segment into the updated adaptive filter to obtain a sixth echo estimation signal;
subtracting the near-end voice signal acquired in the second data segment from the sixth echo estimation signal to obtain a fifth error signal;
and inputting the fifth error signal and the second far-end reference signal into the updated second target neural network model to obtain a sixth error signal.
6. The method of claim 4, wherein inputting the third error signal and the first remote reference signal into a second target neural network model to obtain a fourth error signal comprises:
respectively performing frame division and windowing on the third error signal and the first far-end reference signal to obtain a second group of voice signals;
performing Fourier transform on the second group of voice signals to obtain a third group of voice signals, wherein the third group of voice signals are represented based on a frequency domain;
extracting the features of the third group of voice signals to obtain a group of feature vectors;
inputting the set of feature vectors into the second target neural network model to obtain a fourth set of voice signals, wherein the fourth set of voice signals are voice signals obtained by eliminating nonlinear echoes by using a frequency band gain coefficient calculated by the neural network according to the set of feature vectors;
performing inverse Fourier transform on the fourth group of voice signals to obtain a fifth group of voice signals, wherein the fifth group of voice signals are voice signals based on time domain representation;
and performing windowing processing on the fifth group of voice signals to obtain the fourth error signal.
7. The method of claim 1, wherein before inputting the far-end reference signal into a target neural network model to obtain a target echo estimation signal, the method further comprises:
determining feature vectors of the far-end reference signal and the near-end speech signal;
labeling the far-end reference signal and the near-end voice signal to obtain labeling information corresponding to the feature vector;
and inputting the characteristic vector and the labeling information into a neural network model to be trained to obtain the target neural network model.
8. An echo cancellation device for a speech signal, comprising:
the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring a first group of voice signals, the first group of voice signals comprise far-end reference signals and near-end voice signals acquired by voice acquisition equipment, and the far-end reference signals are voice signals needing to be played by a loudspeaker;
the input module is used for inputting the far-end reference signal into a target neural network model to obtain a target echo estimation signal, wherein the target neural network model is used for determining a near-end echo estimation signal acquired by the voice acquisition equipment after the far-end reference signal is played by the loudspeaker;
and the processing module is used for performing target processing on the target echo estimation signal and the near-end voice signal to obtain a target voice signal, wherein the target voice signal is a voice signal obtained after a linear echo signal and a nonlinear echo signal contained in the near-end voice signal are eliminated.
9. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 7 when executed.
10. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010718717.5A CN111885275B (en) | 2020-07-23 | 2020-07-23 | Echo cancellation method and device for voice signal, storage medium and electronic device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010718717.5A CN111885275B (en) | 2020-07-23 | 2020-07-23 | Echo cancellation method and device for voice signal, storage medium and electronic device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111885275A true CN111885275A (en) | 2020-11-03 |
CN111885275B CN111885275B (en) | 2021-11-26 |
Family
ID=73156396
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010718717.5A Active CN111885275B (en) | 2020-07-23 | 2020-07-23 | Echo cancellation method and device for voice signal, storage medium and electronic device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111885275B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2930917A1 (en) * | 2014-04-08 | 2015-10-14 | Luis Weruaga | Method and apparatus for updating filter coefficients of an adaptive echo canceller |
CN110197669A (en) * | 2018-02-27 | 2019-09-03 | 上海富瀚微电子股份有限公司 | Audio signal processing method and device |
CN110232932A (en) * | 2019-05-09 | 2019-09-13 | 平安科技(深圳)有限公司 | Speaker identification method, device, equipment and medium based on residual time-delay network |
US10522167B1 (en) * | 2018-02-13 | 2019-12-31 | Amazon Technologies, Inc. | Multichannel noise cancellation using deep neural network masking |
CN111161752A (en) * | 2019-12-31 | 2020-05-15 | 歌尔股份有限公司 | Echo cancellation method and device |
CN111179957A (en) * | 2020-01-07 | 2020-05-19 | 腾讯科技(深圳)有限公司 | Voice call processing method and related device |
CN111199748A (en) * | 2020-03-12 | 2020-05-26 | 紫光展锐(重庆)科技有限公司 | Echo cancellation method, device, equipment and storage medium |
CN111213359A (en) * | 2017-10-04 | 2020-05-29 | 主动音频有限公司 | Echo canceller and method for echo canceller |
- 2020-07-23: Application CN202010718717.5A filed in CN; granted as CN111885275B (status: Active)
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112562709A (en) * | 2020-11-18 | 2021-03-26 | 珠海全志科技股份有限公司 | Echo cancellation signal processing method and medium |
CN112562709B (en) * | 2020-11-18 | 2024-04-19 | 珠海全志科技股份有限公司 | Echo cancellation signal processing method and medium |
CN112530396A (en) * | 2020-12-23 | 2021-03-19 | 江苏集萃智能集成电路设计技术研究所有限公司 | Feedback echo eliminating method and feedback echo eliminating system using the same |
CN112530396B (en) * | 2020-12-23 | 2024-08-23 | 江苏集萃智能集成电路设计技术研究所有限公司 | Feedback echo cancellation method and feedback echo cancellation system using the same |
CN112786067A (en) * | 2020-12-30 | 2021-05-11 | 西安讯飞超脑信息科技有限公司 | Residual echo probability prediction method, model training method, device and storage device |
CN112786067B (en) * | 2020-12-30 | 2024-04-19 | 西安讯飞超脑信息科技有限公司 | Residual echo probability prediction method, model training method, equipment and storage device |
CN112634933A (en) * | 2021-03-10 | 2021-04-09 | 北京世纪好未来教育科技有限公司 | Echo cancellation method and device, electronic equipment and readable storage medium |
CN112634933B (en) * | 2021-03-10 | 2021-06-22 | 北京世纪好未来教育科技有限公司 | Echo cancellation method and device, electronic equipment and readable storage medium |
CN112687288B (en) * | 2021-03-12 | 2021-12-03 | 北京世纪好未来教育科技有限公司 | Echo cancellation method, echo cancellation device, electronic equipment and readable storage medium |
CN112689056A (en) * | 2021-03-12 | 2021-04-20 | 浙江芯昇电子技术有限公司 | Echo cancellation method and echo cancellation device using same |
CN112687288A (en) * | 2021-03-12 | 2021-04-20 | 北京世纪好未来教育科技有限公司 | Echo cancellation method, echo cancellation device, electronic equipment and readable storage medium |
CN113707166A (en) * | 2021-04-07 | 2021-11-26 | 腾讯科技(深圳)有限公司 | Voice signal processing method, apparatus, computer device and storage medium |
CN113707166B (en) * | 2021-04-07 | 2024-06-07 | 腾讯科技(深圳)有限公司 | Voice signal processing method, device, computer equipment and storage medium |
CN115442485A (en) * | 2021-06-01 | 2022-12-06 | 阿里巴巴新加坡控股有限公司 | Audio signal processing method, device, equipment and storage medium |
CN113436636A (en) * | 2021-06-11 | 2021-09-24 | 深圳波洛斯科技有限公司 | Acoustic echo cancellation method and system based on adaptive filter and neural network |
CN113571077A (en) * | 2021-06-30 | 2021-10-29 | 上海摩软通讯技术有限公司 | Echo cancellation method, terminal device, electronic device, and medium |
CN115472175A (en) * | 2022-08-31 | 2022-12-13 | 海尔优家智能科技(北京)有限公司 | Echo cancellation method and device for audio resource, storage medium and electronic device |
CN115762552A (en) * | 2023-01-10 | 2023-03-07 | 阿里巴巴达摩院(杭州)科技有限公司 | Method for training echo cancellation model, echo cancellation method and corresponding device |
CN117437929A (en) * | 2023-12-21 | 2024-01-23 | 睿云联(厦门)网络通讯技术有限公司 | Real-time echo cancellation method based on neural network |
CN117437929B (en) * | 2023-12-21 | 2024-03-08 | 睿云联(厦门)网络通讯技术有限公司 | Real-time echo cancellation method based on neural network |
Also Published As
Publication number | Publication date |
---|---|
CN111885275B (en) | 2021-11-26 |
Similar Documents
Publication | Title |
---|---|
CN111885275B (en) | Echo cancellation method and device for voice signal, storage medium and electronic device | |
CN109727604B (en) | Frequency domain echo cancellation method for speech recognition front end and computer storage medium | |
CN111161752B (en) | Echo cancellation method and device | |
CN114283795B (en) | Training and recognition method of voice enhancement model, electronic equipment and storage medium | |
US8781137B1 (en) | Wind noise detection and suppression | |
CN107845389A (en) | Speech enhancement method based on multi-resolution auditory cepstral coefficients and deep convolutional neural networks | |
CN111768796B (en) | Acoustic echo cancellation and dereverberation method and device | |
CN111031448B (en) | Echo cancellation method, echo cancellation device, electronic equipment and storage medium | |
CN112820315B (en) | Audio signal processing method, device, computer equipment and storage medium | |
CN107636758A (en) | Acoustic echo cancellation system and method | |
CN112700786B (en) | Speech enhancement method, device, electronic equipment and storage medium | |
CN107017004A (en) | Noise suppression method, audio processing chip, processing module and Bluetooth device | |
CN112634923B (en) | Audio echo cancellation method, device and storage medium based on command scheduling system | |
CN110503967B (en) | Voice enhancement method, device, medium and equipment | |
CN102223456B (en) | Echo signal processing method and apparatus thereof | |
CN112201273B (en) | Noise power spectral density calculation method, system, equipment and medium | |
CN113241085B (en) | Echo cancellation method, device, equipment and readable storage medium | |
CN111883154B (en) | Echo cancellation method and device, computer-readable storage medium, and electronic device | |
CN112259112A (en) | Echo cancellation method combining voiceprint recognition and deep learning | |
CN114792524B (en) | Audio data processing method, apparatus, program product, computer device and medium | |
CN114121031A (en) | Device voice noise reduction method, electronic device, and storage medium | |
CN113744748A (en) | Network model training method, echo cancellation method and device | |
CN113257267B (en) | Method for training interference signal elimination model and method and equipment for eliminating interference signal | |
CN204117590U (en) | Voice collecting denoising device and voice quality assessment system | |
CN111370016B (en) | Echo cancellation method and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |