CN109587362B - Echo suppression processing method and device

Info

Publication number
CN109587362B
Authority
CN
China
Prior art keywords
data
training
sound
echo
input
Prior art date
Legal status
Active
Application number
CN201811584032.5A
Other languages
Chinese (zh)
Other versions
CN109587362A (en)
Inventor
聂颖
郑权
沙露露
聂镭
张峰
Current Assignee
Longma Zhixin Zhuhai Hengqin Technology Co ltd
Original Assignee
Longma Zhixin Zhuhai Hengqin Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Longma Zhixin Zhuhai Hengqin Technology Co ltd
Priority to CN201811584032.5A
Publication of CN109587362A
Application granted
Publication of CN109587362B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M9/00 Arrangements for interconnection not involving centralised switching
    • H04M9/08 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Telephone Function (AREA)

Abstract

The application discloses an echo suppression processing method and device. The method comprises the following steps: acquiring sound data, wherein the sound data comprises first data and second data, the first data being sound data sent from a first end to a second end, and the second data being sound data sent from a third end to a fourth end after the first data is sent to the second end, the first end and the third end being used for recording sound data and the second end and the fourth end being used for playing sound data; processing the sound data according to a wavenet model, wherein the wavenet model is trained in advance and is used for removing or weakening data related to the first data from the second data; and acquiring the output sound data from the wavenet model. The method and device solve the problems of low accuracy and poor echo suppression effect of echo cancellation algorithms in the related art.

Description

Echo suppression processing method and device
Technical Field
The present application relates to the field of echo processing, and in particular, to a processing method and apparatus for echo suppression.
Background
Echoes generated during transmission through a network can be roughly classified into two types: circuit echoes and acoustic echoes. Acoustic echo is unavoidable in a teleconference system and is generated as follows:
as shown in fig. 1, during a teleconference, the speech signal produced by the talker at the A end is converted from sound to an electrical signal by the microphone, transmitted to the B end, and played by the loudspeaker at the B end. The sound played by the loudspeaker at the B end is then picked up by the microphone at the B end, transmitted back to the A end, and played by the loudspeaker at the A end. At this point the talker at the A end hears the words he or she spoke earlier; if the returned speech is delayed by more than 0.1 second relative to when it was spoken, the human ear can distinguish it, and this returned speech is called an acoustic echo.
Further, the above can be described more concretely as follows: the two ends of the call are denoted the near end (near_end) and the far end (far_end). When the near end sends its near-end speech signal (near_speech) to the far end, near_speech is received and played at the far end. When the far end then sends its own speech (far_speech) back to the near end through its microphone, the near_speech played by the far-end loudspeaker is also picked up by that microphone and transmitted to the near end. The near end therefore receives two signals: the far_speech sent from the far end, and the echo generated from its own near_speech. Similarly, because the call forms a closed loop, an echo signal is also received at the far end.
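To make this loop concrete, the following minimal numpy sketch simulates what a microphone at one end records when the other end's playback leaks back in; the function name, delay, and attenuation values are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

def simulate_mic_capture(playback, local_speech, fs=16000,
                         delay_ms=40.0, attenuation=0.6):
    """Toy model of acoustic echo: the signal played by this end's
    loudspeaker (playback) re-enters the microphone with some delay and
    attenuation, on top of the locally produced speech. The delay and
    attenuation are illustrative, not from the patent."""
    delay = int(fs * delay_ms / 1000.0)
    echo = np.zeros_like(local_speech)
    n = min(len(playback), len(local_speech) - delay)
    if n > 0:
        echo[delay:delay + n] = attenuation * playback[:n]
    return local_speech + echo   # what the microphone actually captures
```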
In addition, the above echo phenomenon also occurs in various scenarios, such as the following two scenarios:
First, when a mobile phone is used for a voice or video call while it is also playing music, the music played by the phone's loudspeaker is picked up by the phone's microphone, and this played-back music likewise generates an echo signal.
Second, in a KTV (karaoke) room, the microphone picks up the singer's voice as well as part of the sound emitted by the loudspeakers, which also generates an echo signal.
In practice only far_speech is wanted, not the echo, so the acoustic echo must be cancelled. Conventional echo cancellation is usually implemented by adaptive filtering, which works as follows: a filter with adjustable parameters estimates the impulse response of the echo path to obtain an estimate of the echo, and this estimate is subtracted from the signal received by the microphone to eliminate the echo.
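For reference, the sketch below illustrates this conventional adaptive-filtering approach, using NLMS as one common choice of adaptive algorithm; the filter length, step size, and function name are assumptions made for illustration, and this is the prior-art method being contrasted with, not the method proposed in this application:

```python
import numpy as np

def nlms_echo_cancel(far_ref, mic, filter_len=128, mu=0.5, eps=1e-8):
    """Conventional adaptive-filter echo cancellation (NLMS variant).
    An adjustable FIR filter w estimates the echo-path impulse response;
    the estimated echo is subtracted from the microphone signal.
    Filter length and step size are illustrative."""
    w = np.zeros(filter_len)              # estimated echo-path impulse response
    out = np.zeros_like(mic, dtype=float)
    for n in range(filter_len, len(mic)):
        x = far_ref[n - filter_len:n][::-1]   # most recent reference samples
        y_hat = w @ x                          # estimated echo sample
        e = mic[n] - y_hat                     # residual after subtraction
        w += mu * e * x / (x @ x + eps)        # NLMS weight update
        out[n] = e
    return out
```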
However, prior-art echo cancellation algorithms are prone to errors when estimating the impulse response, and when the two signals are subtracted directly, the musical noise introduced by the subtraction can degrade the echo suppression effect.
No effective solution has yet been proposed for the problems of low accuracy and poor echo suppression effect of echo cancellation algorithms in the related art.
Disclosure of Invention
The application provides a processing method and a processing device for echo suppression, which aim to solve the problems of low accuracy of an echo cancellation algorithm and poor echo suppression effect in the related art.
According to one aspect of the present application, a processing method for echo suppression is provided. The method comprises the following steps: acquiring sound data, wherein the sound data comprises first data and second data, the first data being sound data sent from a first end to a second end, and the second data being sound data sent from a third end to a fourth end after the first data is sent to the second end, the first end and the third end being used for recording sound data and the second end and the fourth end being used for playing sound data; processing the sound data according to a wavenet model, wherein the wavenet model is trained in advance and is used for removing or weakening data related to the first data from the second data; and acquiring the output sound data from the wavenet model.
Optionally, processing the sound data according to the wavenet model includes: acquiring the position where the echo starts in the second data according to the first data, taking the first data as first data to be input, and taking data after the position where the echo starts in the second data as second data to be input; inputting the first data to be input and the second data to be input into a wavenet model, wherein the wavenet model is used for removing data related to the first data to be input from the second data to be input.
Optionally, obtaining the position where the echo starts in the second data according to the first data includes: determining, among the plurality of data points of the second data, the target point having the greatest correlation with the first data; and determining the position where the echo starts in the second data according to this target point, wherein the position corresponding to the target point is the position where the echo starts in the second data.
Optionally, before the first data to be input and the second data to be input are input into the wavenet model, the method further includes: acquiring a plurality of groups of training data, wherein each group of training data comprises: sound data sent from the first end to the second end, and sound data sent from the third end to the fourth end after the first data is sent to the second end; and performing learning training on the wavenet model by using the plurality of groups of training data so as to determine the parameter data in the wavenet model.
Optionally, the plurality of groups of training data have at least one of the following characteristics: the sampling rate of the sound data in each group of training data is 16 kHz; the average duration of each segment of sound data in the plurality of groups of training data is 10 s, with a standard deviation of 1 s; the total duration of the sound data in the plurality of groups of training data is 20 h.
Optionally, performing the learning training on the wavenet model by using the plurality of groups of training data includes: performing learning training on the wavenet model with a stochastic gradient descent method according to the plurality of groups of training data.
Optionally, after obtaining the output sound data from the wavenet model, the method further includes: performing smoothing processing on the output sound data acquired from the wavenet model to obtain the sound data to be played at the fourth end.
According to another aspect of the present application, a processing apparatus for echo suppression is provided. The apparatus includes: a first acquisition unit configured to acquire sound data, wherein the sound data comprises first data and second data, the first data being sound data sent from a first end to a second end, and the second data being sound data sent from a third end to a fourth end after the first data is sent to the second end, the first end and the third end being used for recording sound data and the second end and the fourth end being used for playing sound data; a processing unit configured to process the sound data according to a wavenet model, wherein the wavenet model is pre-trained and is used for removing data related to the first data from the second data; and an output unit configured to acquire the output sound data from the wavenet model.
According to another aspect of the present application, there is provided a storage medium including a stored program, wherein the program executes the echo suppression processing method according to any one of the above.
According to another aspect of the present application, there is provided a processor for executing a program, where the program is executed to perform the echo suppression processing method according to any one of the above.
Through the present application, the following steps are adopted: acquiring sound data, wherein the sound data comprises first data and second data, the first data being sound data sent from a first end to a second end, and the second data being sound data sent from a third end to a fourth end after the first data is sent to the second end, the first end and the third end being used for recording sound data and the second end and the fourth end being used for playing sound data; processing the sound data according to a wavenet model, wherein the wavenet model is trained in advance and is used for removing or weakening data related to the first data from the second data; and acquiring the output sound data from the wavenet model. The problems of low accuracy and poor echo suppression effect of echo cancellation algorithms in the related art are thereby solved.
That is, the wavenet model removes or weakens the data related to the first data from the second data. Compared with conventional methods, the echo suppression processing method of the embodiments of the present application uses a neural network to weaken or remove the echo information in the sound data, thereby accurately removing or weakening the echo information in the sound data sent from the third end to the fourth end after the first data is sent to the second end and improving the intelligibility of the output signal.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:
FIG. 1 is a schematic diagram of double-ended echo generation in the prior art;
fig. 2 is a flowchart of a processing method for echo suppression according to an embodiment of the present application;
FIG. 3a is a schematic diagram of an alternative wavenet model structure provided in accordance with an embodiment of the present application;
FIG. 3b is a schematic diagram of an alternative wavenet model architecture provided in accordance with an embodiment of the present application;
FIG. 3c is a schematic diagram of an alternative wavenet model architecture provided in accordance with an embodiment of the present application;
FIG. 4 is a schematic diagram of an alternative smoothing process and zero padding process provided in accordance with an embodiment of the present application; and
fig. 5 is a schematic diagram of a processing apparatus for echo suppression according to an embodiment of the present application.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description, the claims, and the drawings of this application are used to distinguish between similar elements and are not necessarily intended to describe a particular sequence or chronological order. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an embodiment of the present application, a processing method of echo suppression is provided.
Fig. 2 is a flowchart of a processing method of echo suppression according to an embodiment of the present application. As shown in fig. 2, the method comprises the steps of:
Step S102, acquiring sound data, wherein the sound data comprises first data and second data, the first data being sound data sent from a first end to a second end, and the second data being sound data sent from a third end to a fourth end after the first data is sent to the second end, the first end and the third end being used for recording sound data and the second end and the fourth end being used for playing sound data.
Step S104, processing the sound data according to a wavenet model, wherein the wavenet model is trained in advance and is used for removing or weakening data related to the first data from the second data.
Step S106, acquiring the output sound data from the wavenet model.
The echo suppression processing method provided by the embodiments of the present application acquires sound data, wherein the sound data comprises first data and second data, the first data being sound data sent from a first end to a second end, and the second data being sound data sent from a third end to a fourth end after the first data is sent to the second end, the first end and the third end being used for recording sound data and the second end and the fourth end being used for playing sound data; processes the sound data according to a wavenet model, wherein the wavenet model is trained in advance and is used for removing or weakening data related to the first data from the second data; and acquires the output sound data from the wavenet model, thereby solving the problems of low accuracy and poor echo suppression effect of echo cancellation algorithms in the related art.
That is, the wavenet model removes or weakens the data related to the first data from the second data. Compared with conventional methods, the echo suppression processing method of the embodiments of the present application uses a neural network to weaken or remove the echo information in the sound data, thereby accurately removing or weakening the echo information in the sound data sent from the third end to the fourth end after the first data is sent to the second end and improving the intelligibility of the output signal.
Here, it should be noted that in this application the second end and the third end are located in the same space; that is, after the second end receives the sound data sent by the first end, it plays that sound data in the space where it is located, and the third end then records new sound data in that space, where the newly recorded sound data may include the sound data played by the second end. The following examples illustrate this:
In an optional example, there are two devices that communicate with each other. After the first communication device records sound data through the first end, the sound data is sent to the second communication device, and the second communication device plays the received sound data through the second end. The second communication device then records new sound data through the third end and sends it to the first communication device, and the first communication device plays the received sound data through the fourth end. The sound data played by the first communication device through the fourth end may therefore contain the sound data played by the second communication device through the second end, that is, it may contain the sound data originally recorded by the first communication device through the first end.
In another optional example, there is a microphone device and a playing device. In this case, the first end and the third end in the embodiments of the present application may refer to the same object, and the second end and the fourth end may also refer to the same object; that is, the first end and the third end refer to the microphone device, and the second end and the fourth end refer to the playing device. It should be emphasized that the microphone device and the playing device are located in the same space. The microphone device records first data and transmits it to the playing device, which receives and plays the first data; the microphone device then records second data and transmits it to the playing device, and this second data may include the first data played by the playing device.
The echo suppression processing method is illustrated with the following example: take a speech signal lasting 20 s as the input sound data, with a sampling rate of 16 kHz, where the first 6 s of the input signal are near_speech and the remaining 14 s are far_speech + echo; that is, sampling points 1-96000 are near_speech and sampling points 96001-320000 are far_speech + echo. This sound data is fed into the wavenet network, and the echo in it is suppressed through step S104, yielding the far_speech + echo segment after echo suppression.
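As a quick check of the sample bookkeeping in this example (6 s at 16 kHz = 96,000 samples, 20 s = 320,000 samples), the following minimal sketch splits a placeholder 20 s waveform into the two segments; the zero array is a stand-in, not real recorded data:

```python
import numpy as np

fs = 16000                         # sampling rate used in the example
signal = np.zeros(20 * fs)         # placeholder for the 20 s input waveform
near_speech = signal[:6 * fs]              # sampling points 1..96000
far_plus_echo = signal[6 * fs:20 * fs]     # sampling points 96001..320000
assert len(near_speech) == 96000
assert len(far_plus_echo) == 224000        # the remaining 14 s
```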
Optionally, in the processing method for echo suppression provided in the embodiment of the present application, processing the sound data according to the wavenet model includes: acquiring the position where the echo starts in the second data according to the first data, taking the first data as first data to be input, and taking the data behind the position where the echo starts in the second data as second data to be input; and inputting the first data to be input and the second data to be input into the wavenet model.
That is, before the wavenet model processes the sound data, the sound data needs to be preprocessed, that is, the position where the echo starts in the second data is determined, the first data is used as the first data to be input, and the data after the position where the echo starts in the second data is used as the second data to be input, at this time, the first data to be input and the second data to be input are input into the wavenet model, so that the wavenet model removes the data related to the first data to be input from the second data to be input.
It is worth emphasizing that: it is innovative to decide the input data of the wavenet network according to the position where the echo starts, so as to perform echo cancellation on the input data. That is, determining input data of the wavenet network according to the position where the echo starts so as to perform echo cancellation on the input data is a technical solution that has not been disclosed in the prior art.
Specifically, obtaining the position where the echo starts in the second data according to the first data can be implemented by: determining a point of the plurality of data points of the second data which has the greatest correlation with the first data; and determining the position of the echo start in the second data according to a target point with the maximum correlation with the first data in the plurality of data points of the second data, wherein the position corresponding to the target point is the position of the echo start in the second data.
That is, a correlation function between the sound data sent by the first terminal to the second terminal and the sound data transmitted from the third terminal to the fourth terminal after the first data is transmitted to the second terminal is calculated, and the position where the echo starts in the sound data transmitted from the third terminal to the fourth terminal after the first data is transmitted to the second terminal is determined according to the calculation result, wherein the position corresponding to the point with the maximum correlation is the position where the echo starts.
For example, suppose X represents near_speech (the sound data that generates the echo) and Y represents far_speech + echo (the sound data containing the echo), with X1 = [1, 2, 3] and X2 = [0, 1, 2, 3]. Computing the correlation of X1 and X2 gives the results [0, 0, 3, 8, 14, 8, 3] at the corresponding positions [-3, -2, -1, 0, 1, 2, 3]. The position corresponding to the maximum value 14 is 1; since index positions start from 0 while sampling points are counted from 1, the position where the echo starts is calculated to be sampling point 2.
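The following numpy sketch reproduces the worked example above; zero-padding X1 to the length of X2 is an assumption made here so that the full cross-correlation yields the seven values and the lag range listed in the text:

```python
import numpy as np

# X1 is near_speech (the echo source), X2 is far_speech + echo.
# X1 is zero-padded to X2's length (an assumption) so the full
# cross-correlation covers lags -3..3 as in the example.
x1 = np.array([1, 2, 3, 0])
x2 = np.array([0, 1, 2, 3])

corr = np.correlate(x2, x1, mode="full")     # -> [0, 0, 3, 8, 14, 8, 3]
lags = np.arange(-(len(x1) - 1), len(x2))    # -> [-3, -2, -1, 0, 1, 2, 3]

lag_at_max = int(lags[np.argmax(corr)])      # lag 1, counted from index 0
echo_start_sample = lag_at_max + 1           # sampling point 2, counted from 1
```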
Optionally, in the processing method for echo suppression provided in the embodiment of the present application, before inputting the first data to be input and the second data to be input into the wavenet model, the method further includes: acquiring a plurality of groups of training data, wherein each group of training data comprises: the first end sends the sound data to the second end, and the sound data is sent from the third end to the fourth end after the first data is sent to the second end; and performing learning training on the wavenet model by using the plurality of groups of training data to determine parameter data in the wavenet model.
Specifically, as shown in figs. 3a, 3b, and 3c, when the position where the echo starts is determined to be 17, the wavenet network processes the signal from the 17th sampling point onward, and the sampling points before position 17 are not processed. Taking as an example a wavenet network connected to at most 8 sampling points of near_speech, fig. 3a shows the wavenet suppressing the echo at the 1st sampling point, fig. 3b shows it at the 2nd sampling point, and fig. 3c shows it at the 8th sampling point. The wavenet network structure comprises five layers with dilations 1, 2, 4, 8, and 16; each black line represents a parameter to be trained, "echo_cancel" represents the output suppressed sound data, and "near_speech" and "far_speech + echo" represent the input sound data.
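A minimal PyTorch sketch of such a five-layer dilated convolution stack is given below; the channel width, the tanh activation, the causal padding scheme, and the 1×1 output head are assumptions made for illustration, since the text only specifies the five dilation rates, the two inputs (near_speech and far_speech + echo), and the single "echo_cancel" output:

```python
import torch
import torch.nn as nn

class DilatedEchoSuppressor(nn.Module):
    """Sketch of a wavenet-style stack with dilations 1, 2, 4, 8, 16.
    Channel width, activation, and output head are illustrative
    assumptions, not details specified by the patent."""

    def __init__(self, channels=32):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(2 if i == 0 else channels, channels,
                      kernel_size=2, dilation=d, padding=d)
            for i, d in enumerate([1, 2, 4, 8, 16])
        ])
        self.head = nn.Conv1d(channels, 1, kernel_size=1)   # produces "echo_cancel"

    def forward(self, near, far_plus_echo):
        # near and far_plus_echo: tensors of shape (batch, time)
        x = torch.stack([near, far_plus_echo], dim=1)        # (batch, 2, time)
        t = x.shape[-1]
        for conv in self.convs:
            x = torch.tanh(conv(x))[..., :t]                 # trim padding (causal)
        return self.head(x).squeeze(1)                       # (batch, time)
```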
It should be further noted that the sets of training data for learning and training the wavenet model may be obtained through the following steps: record sound in the data-acquisition configuration shown in fig. 1 to obtain sound data comprising near_speech and far_speech + echo, avoiding other sounds as much as possible during recording; then sample the recorded original sound data to obtain a plurality of sampling points, where the sampling rate may be 8 kHz, 16 kHz, or 48 kHz, and 16 kHz is used here. In the end, 7000 segments of sound data are obtained, with an average duration of 10 s, a standard deviation of duration within 1 s, and a total duration of about 20 hours; these 7000 segments constitute the plurality of groups of training data.
That is, the plurality of groups of training data used for learning and training the wavenet model have at least one of the following characteristics: the sampling rate of the sound data in each group of training data is 16 kHz; the average duration of each segment of sound data is 10 s, with a standard deviation of 1 s; the total duration of the sound data in the plurality of groups of training data is 20 h.
In addition, in the echo suppression processing method provided by the embodiments of the present application, a stochastic gradient descent method is used to train the wavenet model. That is, performing learning training on the wavenet model using the plurality of groups of training data includes: performing learning training on the wavenet model with a stochastic gradient descent method according to the plurality of groups of training data.
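As an illustration of this training step, the sketch below runs stochastic gradient descent over such paired recordings; the MSE loss against a clean far_speech target, the learning rate, the batching, and the checkpoint file name are all assumptions, since the text only states that stochastic gradient descent is used and that the successfully trained model is saved:

```python
import torch

def train_with_sgd(model, loader, lr=1e-3, epochs=10):
    """Train an echo-suppression model (e.g. the dilated stack sketched
    above) with stochastic gradient descent. The loss choice, learning
    rate, and epoch count are illustrative assumptions."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for near, far_plus_echo, clean_far in loader:
            optimizer.zero_grad()
            suppressed = model(near, far_plus_echo)   # echo-suppressed output
            loss = loss_fn(suppressed, clean_far)
            loss.backward()
            optimizer.step()
    torch.save(model.state_dict(), "wavenet_echo_suppressor.pt")  # hypothetical path
    return model
```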
After the wavenet model is trained, the successfully trained wavenet model is saved, so that the sound data is processed by using the successfully trained wavenet model when the step S104 is executed subsequently.
Finally, after obtaining the output sound data from the wavenet model, the method may further include: performing smoothing processing on the output sound data acquired from the wavenet model to obtain the sound data to be played at the fourth end.
That is, the output of the wavenet network model is connected to two 3×1 convolutions so as to smooth the sound data output by the model, making the sound data played at the fourth end closer to what the human ear finds acceptable. There are various smoothing methods; the echo suppression processing method provided by the embodiments of the present application uses average smoothing to ensure the continuity of the smoothed sound data. Meanwhile, to ensure that the length of the sound data does not change before and after smoothing, zero padding is applied to the sound data, as shown in fig. 4.
For example, to smooth the values of 8 sampling points, let the values before smoothing be x1-x8 and the values after smoothing be y1-y8; then y1 = (0 + x1 + x2)/3, y2 = (x1 + x2 + x3)/3, and so on.
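The following sketch reproduces this 3-point average smoothing with zero padding so the output length matches the input; it implements the single moving average of the worked example rather than the two 3×1 convolution layers attached to the network output, and the function name is an assumption:

```python
import numpy as np

def smooth_3point_average(x):
    """3-point average smoothing with zero padding, so that
    y1 = (0 + x1 + x2)/3, y2 = (x1 + x2 + x3)/3, ..., and the
    output has the same length as the input."""
    padded = np.concatenate(([0.0], np.asarray(x, dtype=float), [0.0]))
    return np.convolve(padded, np.ones(3) / 3.0, mode="valid")

x = np.arange(1.0, 9.0)              # toy values x1..x8
y = smooth_3point_average(x)
assert len(y) == len(x)              # zero padding keeps the length unchanged
assert np.isclose(y[0], (0 + x[0] + x[1]) / 3)
```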
In summary, the echo suppression processing method provided in the embodiment of the present application achieves the following technical effects:
1. The echo suppression processing method provided by the embodiments of the present application is end-to-end: only the position where the echo starts needs to be calculated, and the sound data does not require other preprocessing such as windowing, framing, endpoint detection, or feature extraction.
2. Compared with conventional methods, the echo suppression processing method provided by the embodiments of the present application uses a neural network to weaken or remove the echo information in the sound data, improves the intelligibility of the output signal, and accurately and effectively removes or weakens the echo information in the sound data sent from the third end to the fourth end after the first data is sent to the second end.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
The embodiment of the present application further provides a processing apparatus for echo suppression, and it should be noted that the processing apparatus for echo suppression according to the embodiment of the present application may be used to execute the processing method for echo suppression according to the embodiment of the present application. The following describes a processing apparatus for echo suppression according to an embodiment of the present application.
Fig. 5 is a schematic diagram of a processing apparatus for echo suppression according to an embodiment of the present application. As shown in fig. 5, the apparatus includes: a first acquisition unit 51, a processing unit 53 and an output unit 55.
A first obtaining unit 51, configured to obtain sound data, wherein the sound data comprises first data and second data, the first data being sound data sent from a first end to a second end, and the second data being sound data sent from a third end to a fourth end after the first data is sent to the second end, the first end and the third end being used for recording sound data and the second end and the fourth end being used for playing sound data;
a processing unit 53, configured to process the sound data according to a wavenet model, where the wavenet model is pre-trained, and the wavenet model is used to remove data related to the first data from the second data;
and an output unit 55, configured to obtain the output sound data from the wavenet model.
Optionally, in the processing apparatus for echo suppression provided in the embodiment of the present application, the processing unit 53 includes: the acquisition module is used for acquiring the position where the echo starts in the second data according to the first data, taking the first data as first data to be input, and taking the data behind the position where the echo starts in the second data as second data to be input; the input module is used for inputting the first data to be input and the second data to be input into the wavenet model, wherein the wavenet model is used for removing data related to the first data to be input from the second data to be input.
Optionally, in the processing apparatus for echo suppression provided in the embodiment of the present application, the obtaining module includes: a first determining sub-module for determining a point of the plurality of data points of the second data which is most correlated with the first data; and the second determining submodule is used for determining the position of the echo start in the second data according to a target point with the maximum correlation with the first data in the plurality of data points of the second data, wherein the position corresponding to the target point is the position of the echo start in the second data.
Optionally, in the processing apparatus for echo suppression provided in the embodiment of the present application, the apparatus further includes: a second obtaining unit, configured to obtain multiple sets of training data before inputting the first data to be input and the second data to be input into the wavenet model, where each set of training data includes: the first end sends the sound data to the second end, and the sound data is sent from the third end to the fourth end after the first data is sent to the second end; and the training unit is used for performing learning training on the wavenet model by using the plurality of groups of training data so as to determine parameter data in the wavenet model.
Optionally, in the processing apparatus for echo suppression provided in the embodiment of the present application, the multiple sets of training data at least include a feature of any one of: the sampling rate of sound data in each group of training data is 16 KHz; the average time length of each section of sound data in the multiple groups of training data is 10s, and the standard deviation of the time length of each section of sound data is 1 s; the total duration of the sound data in the plurality of sets of training data is 20 h.
Optionally, in the processing apparatus for echo suppression provided in the embodiment of the present application, the training unit includes: and the training module is used for performing learning training on the wavenet model by using a random gradient descent method according to the plurality of groups of training data.
Optionally, in the processing apparatus for echo suppression provided in the embodiments of the present application, the apparatus further includes: a smoothing unit, configured to, after the output sound data is obtained from the wavenet model, perform smoothing processing on that output sound data to obtain the sound data to be played at the fourth end.
In the processing apparatus for echo suppression provided in the embodiments of the present application, the first obtaining unit 51 obtains sound data, wherein the sound data comprises first data and second data, the first data being sound data sent from a first end to a second end, and the second data being sound data sent from a third end to a fourth end after the first data is sent to the second end, the first end and the third end being used for recording sound data and the second end and the fourth end being used for playing sound data; the processing unit 53 processes the sound data according to a wavenet model, wherein the wavenet model is trained in advance and is used for removing data related to the first data from the second data; and the output unit 55 obtains the output sound data from the wavenet model, thereby solving the problems of low accuracy and poor echo suppression effect of echo cancellation algorithms in the related art.
That is, the wavenet model removes or weakens the data related to the first data from the second data. Compared with conventional methods, the echo suppression processing approach of the embodiments of the present application uses a neural network to weaken or remove the echo information in the sound data, thereby accurately removing or weakening the echo information in the sound data sent from the third end to the fourth end after the first data is sent to the second end and improving the intelligibility of the output signal.
The processing device for echo suppression comprises a processor and a memory, wherein the first acquiring unit 51, the processing unit 53, the output unit 55 and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the kernel parameters are adjusted so as to accurately remove or weaken the echo information in the sound data sent from the third end to the fourth end after the first data is sent to the second end and to improve the intelligibility of the output signal.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
An embodiment of the present invention provides a storage medium on which a program is stored, the program implementing a processing method of echo suppression when executed by a processor.
The embodiment of the invention provides a processor, which is used for running a program, wherein the echo suppression processing method is executed when the program runs.
An embodiment of the invention provides a device comprising a processor, a memory, and a program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the following steps: acquiring sound data, wherein the sound data comprises first data and second data, the first data being sound data sent from a first end to a second end, and the second data being sound data sent from a third end to a fourth end after the first data is sent to the second end, the first end and the third end being used for recording sound data and the second end and the fourth end being used for playing sound data; processing the sound data according to a wavenet model, wherein the wavenet model is trained in advance and is used for removing or weakening data related to the first data from the second data; and acquiring the output sound data from the wavenet model.
Optionally, processing the sound data according to the wavenet model includes: acquiring the position where the echo starts in the second data according to the first data, taking the first data as first data to be input, and taking the data behind the position where the echo starts in the second data as second data to be input; inputting the first data to be input and the second data to be input into a wavenet model, wherein the wavenet model is used for removing data related to the first data to be input from the second data to be input.
Optionally, acquiring a position where an echo in the second data starts according to the first data includes: determining a point of the plurality of data points of the second data which has the greatest correlation with the first data; and determining the position of the echo start in the second data according to a target point with the maximum correlation with the first data in the plurality of data points of the second data, wherein the position corresponding to the target point is the position of the echo start in the second data.
Optionally, before inputting the first data to be input and the second data to be input into the wavenet model, the method further includes: acquiring a plurality of groups of training data, wherein each group of training data comprises: the first end sends the sound data to the second end, and the sound data is sent from the third end to the fourth end after the first data is sent to the second end; and performing learning training on the wavenet model by using the plurality of groups of training data to determine parameter data in the wavenet model.
Optionally, the plurality of sets of training data at least include any one of the following features: the sampling rate of sound data in each group of training data is 16 KHz; the average time length of each section of sound data in the multiple groups of training data is 10s, and the standard deviation of the time length of each section of sound data is 1 s; the total duration of the sound data in the plurality of sets of training data is 20 h.
Optionally, the performing of the learning training on the wavenet model by using the plurality of sets of training data includes: and carrying out learning training on the wavenet model by using a random gradient descent method according to the plurality of groups of training data.
Optionally, after obtaining the output sound data from the wavenet model, the method further includes: performing smoothing processing on the output sound data acquired from the wavenet model to obtain the sound data to be played at the fourth end. The device herein may be a server, a PC, a PAD, a mobile phone, or the like.
The present application further provides a computer program product adapted, when executed on a data processing device, to perform a program initializing the following method steps: acquiring sound data, wherein the sound data comprises first data and second data, the first data being sound data sent from a first end to a second end, and the second data being sound data sent from a third end to a fourth end after the first data is sent to the second end, the first end and the third end being used for recording sound data and the second end and the fourth end being used for playing sound data; processing the sound data according to a wavenet model, wherein the wavenet model is trained in advance and is used for removing or weakening data related to the first data from the second data; and acquiring the output sound data from the wavenet model.
Optionally, processing the sound data according to the wavenet model includes: acquiring the position where the echo starts in the second data according to the first data, taking the first data as first data to be input, and taking the data behind the position where the echo starts in the second data as second data to be input; inputting the first data to be input and the second data to be input into a wavenet model, wherein the wavenet model is used for removing data related to the first data to be input from the second data to be input.
Optionally, acquiring a position where an echo in the second data starts according to the first data includes: determining a point of the plurality of data points of the second data which has the greatest correlation with the first data; and determining the position of the echo start in the second data according to a target point with the maximum correlation with the first data in the plurality of data points of the second data, wherein the position corresponding to the target point is the position of the echo start in the second data.
Optionally, before inputting the first data to be input and the second data to be input into the wavenet model, the method further includes: acquiring a plurality of groups of training data, wherein each group of training data comprises: the first end sends the sound data to the second end, and the sound data is sent from the third end to the fourth end after the first data is sent to the second end; and performing learning training on the wavenet model by using the plurality of groups of training data to determine parameter data in the wavenet model.
Optionally, the plurality of sets of training data at least include any one of the following features: the sampling rate of sound data in each group of training data is 16 KHz; the average time length of each section of sound data in the multiple groups of training data is 10s, and the standard deviation of the time length of each section of sound data is 1 s; the total duration of the sound data in the plurality of sets of training data is 20 h.
Optionally, the performing of the learning training on the wavenet model by using the plurality of sets of training data includes: and carrying out learning training on the wavenet model by using a random gradient descent method according to the plurality of groups of training data.
Optionally, after obtaining the output sound data from the wavenet model, the method further includes: performing smoothing processing on the output sound data acquired from the wavenet model to obtain the sound data to be played at the fourth end.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (5)

1. A method for echo suppression, comprising:
acquiring sound data, wherein the sound data comprises first data and second data, the first data being sound data sent from a first end to a second end, and the second data being sound data sent from a third end to a fourth end after the first data is sent to the second end, the first end and the third end being used for recording sound data and the second end and the fourth end being used for playing sound data;
processing the sound data according to a wavenet model, wherein the wavenet model is trained in advance, and is used for removing or weakening data related to the first data from the second data;
acquiring output sound data from the wavenet model;
wherein processing the sound data according to the wavenet model comprises: acquiring the position where the echo starts in the second data according to the first data, taking the first data as first data to be input, and taking data after the position where the echo starts in the second data as second data to be input; inputting the first data to be input and the second data to be input into a wavenet model, wherein the wavenet model is used for removing data related to the first data to be input from the second data to be input;
wherein obtaining a position where an echo in the second data starts according to the first data comprises: determining a point of the plurality of data points of the second data that is most relevant to the first data; determining the position of the beginning of the echo in the second data according to a target point with the maximum correlation with the first data in the plurality of data points of the second data, wherein the position corresponding to the target point is the position of the beginning of the echo in the second data;
wherein, before the first data to be input and the second data to be input are input into the wavenet model, the method further comprises: acquiring a plurality of groups of training data, wherein each group of training data comprises: sound data sent by a first terminal to a second terminal and sound data sent from a third terminal to a fourth terminal after the first data is sent to the second terminal; carrying out learning training on the wavenet model by using the plurality of groups of training data so as to determine parameter data in the wavenet model;
wherein the plurality of sets of training data comprise at least one of the following characteristics: the sampling rate of sound data in each group of training data is 16 KHz; the average time length of each section of sound data in the multiple groups of training data is 10s, and the standard deviation of the time length of each section of sound data is 1 s; the total duration of the sound data in the multiple groups of training data is 20 h;
wherein, using the plurality of sets of training data to perform learning training on the wavenet model comprises: and carrying out learning training on the wavenet model by using a random gradient descent method according to the plurality of groups of training data.
2. The method of claim 1, wherein after obtaining the output sound data from the wavenet model, the method further comprises: and smoothing the sound data acquired and output from the wavenet model to acquire the sound data played at the fourth end.
3. A processing apparatus for echo suppression, comprising:
a first acquisition unit configured to acquire sound data, wherein the sound data comprises first data and second data, the first data being sound data sent from a first end to a second end, and the second data being sound data sent from a third end to a fourth end after the first data is sent to the second end, the first end and the third end being used for recording sound data and the second end and the fourth end being used for playing sound data;
a processing unit, configured to process the sound data according to a wavenet model, where the wavenet model is pre-trained, and the wavenet model is used to remove data related to the first data from the second data;
an output unit, configured to acquire output sound data from the wavenet model;
wherein the processing unit comprises: an acquisition module, configured to acquire the position where the echo starts in the second data according to the first data, take the first data as first data to be input, and take the data after the position where the echo starts in the second data as second data to be input; and an input module, configured to input the first data to be input and the second data to be input into the wavenet model, wherein the wavenet model is used for removing data related to the first data to be input from the second data to be input;
wherein the acquisition module comprises: a first determining sub-module, configured to determine, among the plurality of data points of the second data, a target point having the maximum correlation with the first data; and a second determining sub-module, configured to determine the position where the echo starts in the second data according to the target point, wherein the position corresponding to the target point is the position where the echo starts in the second data;
wherein the apparatus further comprises: a second acquisition unit, configured to acquire a plurality of groups of training data before the first data to be input and the second data to be input are input into the wavenet model, wherein each group of training data comprises sound data sent by a first end to a second end and sound data sent from a third end to a fourth end after that sound data is sent to the second end; and a training unit, configured to perform learning training on the wavenet model by using the plurality of groups of training data to determine parameter data in the wavenet model;
wherein the plurality of groups of training data have at least one of the following characteristics: the sampling rate of the sound data in each group of training data is 16 kHz; the average duration of each piece of sound data in the plurality of groups of training data is 10 s, and the standard deviation of the duration of each piece of sound data is 1 s; the total duration of the sound data in the plurality of groups of training data is 20 h;
wherein the training unit comprises: a training module, configured to perform learning training on the wavenet model by using a stochastic gradient descent method according to the plurality of groups of training data (an illustrative training sketch follows this claim).
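Claims 1 and 3 describe training the wavenet model on groups of paired far-end and near-end recordings by stochastic gradient descent, but do not spell out the network layout or the training target. The sketch below shows one way such a setup could look: a small stack of dilated convolutions conditioned on both signals, trained with SGD against an echo-free near-end target. The class name DilatedDenoiser, the layer sizes, the learning rate, and the toy data generator are all illustrative assumptions, not the patented model.

import torch
import torch.nn as nn

class DilatedDenoiser(nn.Module):
    # Takes the far-end (first) and near-end (second) signals as two input channels
    # and predicts the echo-suppressed near-end signal. Causal padding is omitted
    # here for brevity; a WaveNet-style model would use causal dilated convolutions.
    def __init__(self, channels=32, layers=6):
        super().__init__()
        blocks, in_ch = [], 2
        for i in range(layers):
            d = 2 ** i
            blocks += [nn.Conv1d(in_ch, channels, kernel_size=3, padding=d, dilation=d), nn.Tanh()]
            in_ch = channels
        self.body = nn.Sequential(*blocks)
        self.head = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, first_data, second_data):
        x = torch.stack([first_data, second_data], dim=1)   # (batch, 2, samples)
        return self.head(self.body(x)).squeeze(1)           # (batch, samples)

def toy_groups(num_groups=10, batch=4, samples=1600):
    # Stand-in for the recorded training groups (16 kHz assumed); each group pairs a
    # far-end signal with a near-end signal containing its delayed, attenuated echo.
    for _ in range(num_groups):
        first = torch.randn(batch, samples)
        target = 0.1 * torch.randn(batch, samples)                      # stand-in echo-free near-end speech
        second = target + 0.3 * torch.roll(first, shifts=80, dims=1)    # circular shift as a toy delay
        yield first, second, target

model = DilatedDenoiser()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)     # stochastic gradient descent, as claimed
loss_fn = nn.MSELoss()
for first, second, target in toy_groups():
    optimizer.zero_grad()
    loss = loss_fn(model(first, second), target)
    loss.backward()
    optimizer.step()

How echo-free targets would be obtained is not stated in the claims; the synthetic triples above exist only to make the training loop runnable.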
4. A storage medium, characterized in that the storage medium comprises a stored program, wherein the program, when executed, performs the echo suppression processing method according to any one of claims 1 to 2.
5. A processor, characterized in that the processor is configured to run a program, wherein the program, when run, performs the echo suppression processing method according to any one of claims 1 to 2.
CN201811584032.5A 2018-12-24 2018-12-24 Echo suppression processing method and device Active CN109587362B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811584032.5A CN109587362B (en) 2018-12-24 2018-12-24 Echo suppression processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811584032.5A CN109587362B (en) 2018-12-24 2018-12-24 Echo suppression processing method and device

Publications (2)

Publication Number Publication Date
CN109587362A CN109587362A (en) 2019-04-05
CN109587362B true CN109587362B (en) 2020-06-26

Family

ID=65931001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811584032.5A Active CN109587362B (en) 2018-12-24 2018-12-24 Echo suppression processing method and device

Country Status (1)

Country Link
CN (1) CN109587362B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101321201A (en) * 2007-06-06 2008-12-10 大唐移动通信设备有限公司 Echo elimination device, communication terminal and method for confirming echo delay time
CN104969537A (en) * 2012-12-21 2015-10-07 微软技术许可有限责任公司 Echo suppression

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105979442B (en) * 2016-07-22 2019-12-03 北京地平线机器人技术研发有限公司 Noise suppressing method, device and movable equipment
CN106997767A (en) * 2017-03-24 2017-08-01 百度在线网络技术(北京)有限公司 Method of speech processing and device based on artificial intelligence
CN107068161B (en) * 2017-04-14 2020-07-28 百度在线网络技术(北京)有限公司 Speech noise reduction method and device based on artificial intelligence and computer equipment
US11934935B2 (en) * 2017-05-20 2024-03-19 Deepmind Technologies Limited Feedforward generative neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Wavenet for Speech Denoising; Dario Rethage et al.; 2018 IEEE International Conference on Acoustics, Speech and Signal Processing; 2018-04-20; pp. 5069-5071 *

Also Published As

Publication number Publication date
CN109587362A (en) 2019-04-05

Similar Documents

Publication Publication Date Title
US8842851B2 (en) Audio source localization system and method
CN108141502B (en) Method for reducing acoustic feedback in an acoustic system and audio signal processing device
US20190206415A1 (en) Method for processing voice in interior environment of vehicle and electronic device
WO2018188282A1 (en) Echo cancellation method and device, conference tablet computer, and computer storage medium
CN111951819A (en) Echo cancellation method, device and storage medium
CN106470284B (en) Method, device, system, server and communication device for eliminating acoustic echo
CN110782914B (en) Signal processing method and device, terminal equipment and storage medium
US9343073B1 (en) Robust noise suppression system in adverse echo conditions
WO2009117084A2 (en) System and method for envelope-based acoustic echo cancellation
EP3343949A2 (en) De-reverberation control method and apparatus for device equipped with microphone
CN110503967B (en) Voice enhancement method, device, medium and equipment
CN110956976B (en) Echo cancellation method, device and equipment and readable storage medium
CN113170024A (en) Echo cancellation method, delay estimation method, device, storage medium and equipment
US20220301577A1 (en) Echo cancellation method and apparatus
CN104952450A (en) Far field identification processing method and device
CN111223492A (en) Echo path delay estimation method and device
CN111883154A (en) Echo cancellation method and apparatus, computer-readable storage medium, and electronic apparatus
CN112929506A (en) Audio signal processing method and apparatus, computer storage medium, and electronic device
CN112489680B (en) Evaluation method and device of acoustic echo cancellation algorithm and terminal equipment
JP5782402B2 (en) Voice quality objective evaluation apparatus and method
CN109587362B (en) Echo suppression processing method and device
US20110116644A1 (en) Simulated background noise enabled echo canceller
CN116935872A (en) Residual echo estimation method, device, system, equipment and storage medium
CN115620737A (en) Voice signal processing device, method, electronic equipment and sound amplification system
CN110099183B (en) Audio data processing device and method and call equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 519031 office 1316, No. 1, lianao Road, Hengqin new area, Zhuhai, Guangdong

Patentee after: LONGMA ZHIXIN (ZHUHAI HENGQIN) TECHNOLOGY Co.,Ltd.

Address before: 519000 room 417, building 20, creative Valley, Hengqin new area, Xiangzhou, Zhuhai, Guangdong

Patentee before: LONGMA ZHIXIN (ZHUHAI HENGQIN) TECHNOLOGY Co.,Ltd.

PP01 Preservation of patent right

Effective date of registration: 20240718

Granted publication date: 20200626