CN117012217A - Data processing method, device, equipment, storage medium and program product - Google Patents

Data processing method, device, equipment, storage medium and program product

Info

Publication number
CN117012217A
Authority
CN
China
Prior art keywords
data
moment
echo
sample
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210831478.3A
Other languages
Chinese (zh)
Inventor
江飞
杨栋
曹木勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210831478.3A priority Critical patent/CN117012217A/en
Publication of CN117012217A publication Critical patent/CN117012217A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech

Abstract

The application discloses a data processing method, a device, equipment, a storage medium and a program product, which can be applied to an artificial intelligence scene, and the method comprises the following steps: determining initial elimination data at a first moment according to the play data at the first moment, the acquisition data at the first moment and the echo prediction data at a second moment; inputting the playing data at the first moment, the initial elimination data at the first moment and the echo prediction data at the second moment into a target filtering model for prediction to obtain gain data at the first moment; based on the gain data at the first moment and the initial elimination data at the first moment, adjusting the echo prediction data at the second moment to obtain an echo path at the first moment; and carrying out echo cancellation on the acquired data at the first moment based on the echo path at the first moment and the play data at the first moment to obtain echo cancellation data at the first moment. By adopting the method and the device, the echo cancellation effect and efficiency can be improved.

Description

Data processing method, device, equipment, storage medium and program product
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a data processing method, apparatus, device, storage medium, and program product.
Background
With the large-scale popularization of the mobile internet and of artificial intelligence, more and more voice communication services perform control operations such as echo cancellation on a section of audio data in order to establish a stable voice communication environment, so the echo in the sound needs to be cancelled. Traditional echo cancellation technology needs to estimate an echo path based on specific preset conditions before the echo can be cancelled.
Disclosure of Invention
The embodiment of the application provides a data processing method, a device, equipment, a storage medium and a program product, which can improve the accuracy of echo cancellation and the effect and efficiency of echo cancellation.
In one aspect, an embodiment of the present application provides a data processing method, including:
determining initial elimination data at a first moment according to the play data at the first moment, the acquisition data at the first moment and the echo prediction data at a second moment; the first time is adjacent to the second time, and the second time is smaller than the first time;
Inputting the playing data at the first moment, the initial elimination data at the first moment and the echo prediction data at the second moment into a target filtering model for prediction to obtain gain data at the first moment;
based on the gain data at the first moment and the initial elimination data at the first moment, adjusting the echo prediction data at the second moment to obtain an echo path at the first moment;
and carrying out echo cancellation on the acquired data at the first moment based on the echo path at the first moment and the play data at the first moment to obtain echo cancellation data at the first moment.
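For illustration only, the four steps above can be sketched as a single per-frame update. The sketch below assumes a scalar (single frequency-bin) echo-path model and stands in for the target filtering model with a plain callable; all names and the `gain_fn` signature are hypothetical, not from the application:

```python
import numpy as np

def echo_cancel_step(x_t, d_t, h_prev, gain_fn):
    """One echo-cancellation step at the 'first moment' t.

    x_t     : play data at time t (one frequency-bin value or frame)
    d_t     : acquisition (microphone) data at time t
    h_prev  : echo path from the 'second moment' (t-1)
    gain_fn : stand-in for the target filtering model; maps its
              inputs to gain data K_t (hypothetical signature)
    """
    # Step 1: initial elimination data from the previous echo path
    y_init = h_prev * x_t            # initial predicted echo data
    e_t = d_t - y_init               # initial elimination data at time t

    # Step 2: predict the gain data K_t at time t
    k_t = gain_fn(x_t, e_t, h_prev)

    # Step 3: adjust the previous echo path by the prediction increment
    dh_t = k_t * e_t                 # echo prediction increment at time t
    h_t = h_prev + dh_t              # echo path at time t

    # Step 4: echo cancellation with the updated path
    e_final = d_t - h_t * x_t        # echo cancellation data at time t
    return e_final, h_t
```

With a normalized-LMS-style `gain_fn`, repeated calls converge the path estimate toward the true echo path; the application instead obtains the gain data from a trained target filtering model.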
In one aspect, the embodiment of the application further provides a data processing method, which includes:
determining initial sample elimination data at a first moment according to sample play data at the first moment, sample collection data at the first moment and sample prediction data at a second moment; the first time is adjacent to the second time, and the second time is smaller than the first time;
inputting the sample play data at the first moment, the initial sample elimination data at the first moment and the sample prediction data at the second moment into an initial filtering model for prediction to obtain sample gain data at the first moment;
based on the sample gain data at the first moment and the initial sample elimination data at the first moment, sample prediction data at the second moment are adjusted to obtain a sample path at the first moment;
Performing echo cancellation on the sample acquisition data at the first moment based on the sample path at the first moment and the sample play data at the first moment to obtain sample echo cancellation data at the first moment;
and according to the sample echo cancellation data at the first moment and the sample label corresponding to the sample echo cancellation data at the first moment, carrying out parameter adjustment on the initial filtering model to obtain the target filtering model.
In one aspect, an embodiment of the present application provides a data processing apparatus, including:
the cancellation data determining module is used for determining initial cancellation data at the first moment according to the play data at the first moment, the acquisition data at the first moment and the echo prediction data at the second moment; the first time is adjacent to the second time, and the second time is smaller than the first time;
the target prediction module is used for inputting the play data at the first moment, the initial elimination data at the first moment and the echo prediction data at the second moment into the target filtering model for prediction to obtain gain data at the first moment;
the target adjustment module is used for adjusting the echo prediction data at the second moment based on the gain data at the first moment and the initial elimination data at the first moment to obtain an echo path at the first moment;
And the echo cancellation module is used for performing echo cancellation on the acquired data at the first moment based on the echo path at the first moment and the play data at the first moment to obtain echo cancellation data at the first moment.
Wherein the echo prediction data at the second time comprises an echo path at the second time;
the cancellation data determination module includes:
the first data fusion unit is used for carrying out data fusion processing on the play data at the first moment and the echo path at the second moment to obtain initial predicted echo data at the first moment;
the first elimination unit is used for eliminating the acquired data at the first moment based on the initial predicted echo data at the first moment to obtain the initial elimination data at the first moment.
Wherein the echo prediction data at the second time comprises echo prediction increments at the second time;
the target prediction module comprises:
the splicing unit is used for splicing the play data at the first moment, the initial elimination data at the first moment and the echo prediction increment at the second moment to obtain target combination data for prediction processing;
the prediction unit is used for inputting the hidden state of the target filtering model at the second moment and the target combined data into the target filtering model for prediction processing to obtain gain data at the first moment and the hidden state of the target filtering model at the first moment; the hidden state at the first moment is used for predicting gain data at the third moment; the first time is adjacent to the third time, and the first time is less than the third time.
Wherein the echo prediction data at the second time comprises an echo path at the second time;
the target adjustment module includes:
the increment obtaining unit is used for obtaining the echo prediction increment of the first moment based on the gain data of the first moment and the initial elimination data of the first moment;
and the increment adjustment unit is used for carrying out increment adjustment on the echo path at the second moment based on the echo prediction increment at the first moment to obtain the echo path at the first moment.
Wherein, echo cancellation module includes:
the second data fusion unit is used for carrying out data fusion processing on the echo path at the first moment and the playing data at the first moment to obtain target prediction echo data at the first moment;
and the second elimination unit is used for eliminating the acquired data at the first moment based on the target predicted echo data at the first moment to obtain echo elimination data at the first moment.
Wherein, the data processing device still includes:
the conversion module is used for acquiring the full play data and the full acquisition data, carrying out Fourier transform on the full play data to obtain frequency domain play data corresponding to the full play data, and carrying out Fourier transform on the full acquisition data to obtain frequency domain acquisition data corresponding to the full acquisition data;
The elimination data determining module is specifically configured to obtain playing data at a first moment from the frequency domain playing data, and obtain collected data at the first moment from the frequency domain collected data; determining initial elimination data of the first moment according to the play data of the first moment, the acquisition data of the first moment and the echo path of the second moment;
the echo cancellation module is specifically configured to perform echo cancellation on the acquired data at the first moment based on the echo path at the first moment and the play data at the first moment, so as to obtain frequency domain cancellation data at the first moment; and performing inverse Fourier transform processing on the frequency domain elimination data at the first moment to obtain echo elimination data at the first moment.
In one aspect, an embodiment of the present application provides a data processing apparatus, including:
the sample data determining module is used for determining initial sample elimination data at the first moment according to the sample play data at the first moment, the sample acquisition data at the first moment and the sample prediction data at the second moment; the first time is adjacent to the second time, and the second time is smaller than the first time;
the initial prediction module is used for inputting the sample play data at the first moment, the initial sample elimination data at the first moment and the sample prediction data at the second moment into the initial filtering model for prediction to obtain sample gain data at the first moment;
The sample adjusting module is used for adjusting the sample prediction data at the second moment based on the sample gain data at the first moment and the initial sample elimination data at the first moment to obtain a sample path at the first moment;
the sample elimination module is used for carrying out echo elimination on the sample acquisition data at the first moment based on the sample path at the first moment and the sample play data at the first moment to obtain sample echo elimination data at the first moment;
the model adjustment module is used for carrying out parameter adjustment on the initial filtering model according to the sample echo cancellation data at the first moment and the sample label corresponding to the sample echo cancellation data at the first moment to obtain the target filtering model.
Wherein, the data processing device still includes:
the corpus acquisition module is used for acquiring corpus data, and acquiring full sample playing data and full sample labels from the corpus data; the full sample play data comprises sample play data at a first moment; the full sample tag comprises a sample tag corresponding to sample echo cancellation data at a first moment;
the echo production module is used for simulating acoustic propagation response, generating simulated echo data and acquiring a simulated echo path from the simulated echo data;
The convolution module is used for carrying out convolution processing on the full sample playing data and the simulated echo path to obtain full sample echo data;
the mixing module is used for mixing the full sample echo data with the full sample playing data to obtain full sample acquisition data; the full sample acquisition data includes sample acquisition data at a first time.
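A minimal sketch of the data-construction modules above, assuming 1-D time-domain signals and NumPy convolution; the `mix_ratio` weight and all names are hypothetical additions for illustration:

```python
import numpy as np

def make_sample_collection(play, echo_path, mix_ratio=1.0):
    """Build full-sample acquisition data as described above.

    play      : full sample play data (1-D array)
    echo_path : simulated echo path (impulse response), e.g. taken
                from simulated echo data of an acoustic propagation
                response
    mix_ratio : hypothetical mixing weight (not given in the text)
    """
    # Convolve the play data with the simulated echo path -> echo data
    echo = np.convolve(play, echo_path)[: len(play)]
    # Mix the full sample echo data with the play data -> acquisition data
    collected = echo + mix_ratio * play
    return echo, collected
```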
In one aspect, the application provides a computer device comprising: a processor, a memory, a network interface;
the processor is connected to the memory and the network interface, where the network interface is used to provide a data communication function, the memory is used to store a computer program, and the processor is used to call the computer program to make the computer device execute the method in the embodiment of the present application.
In one aspect, embodiments of the present application provide a computer readable storage medium having a computer program stored therein, the computer program being adapted to be loaded by a processor and to perform a method according to embodiments of the present application.
In one aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium; the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the method in the embodiment of the present application.
In the embodiment of the application, when the computer equipment with the voice communication function acquires the play data at the first moment, the acquisition data at the first moment and the echo prediction data at the second moment, echo prediction can be performed through the target filtering model, and the echo prediction data at the second moment is adjusted based on the gain data at the first moment obtained by the echo prediction, so that the echo path at the first moment is obtained. The echo cancellation data at the first moment may then be obtained from the echo path. Therefore, according to the embodiment of the application, the gain data at the first moment can be calculated through the target filtering model, the echo prediction data at the second moment can be adjusted to obtain the echo path at the first moment, and the echo cancellation data at the first moment can be obtained by prediction from the echo path. Because the target filtering model is self-adaptive, the embodiment of the application has good generalization performance. The combination of the target filtering model and the gain data at the first moment predicts the echo path at the first moment more accurately, so that when the echo path changes suddenly, the echo cancellation data at the first moment can reconverge more rapidly than with adaptive filtering or a neural network model alone, and the echo is thus cancelled rapidly; deviations in the echo path prediction are also reduced. Meanwhile, this combination improves the prediction precision of the model and can reduce the size of the filtering model, so that the method can be applied on computer equipment sensitive to model size and operation speed.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a system architecture according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an acoustic echo cancellation structure according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 4a is a schematic diagram of a scenario involving acoustic echo cancellation provided by an embodiment of the present application;
FIG. 4b is a flowchart illustrating a data processing method according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a scenario involving acoustic echo cancellation provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The embodiment of the application provides a method for performing acoustic echo cancellation, which relates to the field of artificial intelligence. Among these, artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology (voice technology), a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and other directions.
Key technologies of the speech technology (Speech Technology) include automatic speech recognition technology, speech synthesis technology and voiceprint recognition technology. Enabling a computer to listen, see, speak and feel is the development direction of human-computer interaction in the future, and voice has become one of the most promising human-computer interaction modes.
Referring to fig. 1, fig. 1 is a schematic diagram of a system architecture according to an embodiment of the application. As shown in fig. 1, the system may include a computer device 100 and a terminal cluster, and the terminal cluster may include terminal devices 200a, 200b, 200c, …, and 200n. It will be appreciated that the above system may include one or more terminal devices, and the present application does not limit the number of terminal devices. The computer device 100 and the terminal cluster are devices capable of voice communication (e.g., devices including speakers and microphones). In one system usage scenario, the computer device 100 may be in voice communication with any one of the terminal devices (e.g., the terminal device 200a) in the terminal cluster. The voice communication mode may be voice communication (such as mobile phone or landline phone), communication using voice software (such as voice and video functions, teleconference, or audio-video conference), or communication using bluetooth (such as an intercom). In another system usage scenario, the computer device 100 may serve as a database and transit server, and any one of the terminal devices (such as the terminal device 200a) in the terminal cluster may perform voice communication with other terminal devices. Optionally, the computer device 100 and any terminal device in the terminal cluster may run an application client capable of voice communication (such as a social client, an office client, a conference client, or a game client).
The above-mentioned terminal device may be an electronic device, including but not limited to a mobile phone, a tablet computer, a desktop computer, a notebook computer, a palmtop computer, a vehicle-mounted device, an augmented reality/virtual reality (AR/VR) device, a head-mounted display, a smart television, a wearable device, a smart speaker, a digital camera, a camera, or another mobile internet device (MID) with network access capability, or a terminal device in a scene such as a train, a ship, or a flight.
The computer device mentioned in the present application may be a server or a terminal device, or may be a system composed of a server and a terminal device.
Wherein a communication connection may exist between the terminal clusters, for example, a communication connection exists between the terminal device 200a and the terminal device 200b, and a communication connection exists between the terminal device 200a and the terminal device 200 c. Meanwhile, any terminal device in the terminal cluster may have a communication connection with the computer device 100, for example, a communication connection exists between the terminal device 200a and the computer device 100, where the communication connection is not limited to a connection manner, may be directly or indirectly connected through a wired communication manner, may be directly or indirectly connected through a wireless communication manner, and may also be other manners, and the application is not limited herein.
For the convenience of subsequent understanding and description, please refer to fig. 2, fig. 2 is a schematic structural diagram of acoustic echo cancellation according to an embodiment of the present application. As shown in fig. 2, the computer device in the embodiment of the present application may be a device that provides playing data or collecting data, or may be a device that can use echo cancellation data. The embodiment of the application can also be integrally applied to one computer device, and the use scene is not limited. The computer device may be the computer device 100 in the embodiment corresponding to fig. 1, or may be any one of the terminal devices in the terminal cluster, for example, the terminal device 200a, which will not be limited herein.
Wherein, the playing sound signal in the loudspeaker of the computer equipment is collected and processed to obtain the play data x_t at time t, where t is a positive integer. The time may refer to a specific moment, such as a second, or to the minimum sampling unit used when sampling the played sound signal. The sound signal may be the original analog sound information that has not yet been processed by means such as sampling. The sound signal in the microphone of the computer equipment is collected and processed to obtain the collected data d_t at time t. According to the play data x_t at time t, the collected data d_t at time t and the echo path h_{t-1} at time (t-1), operation processing is performed to obtain the initial elimination data e_t at time t. Time (t-1) is the time immediately preceding time t. According to the initial elimination data e_t at time t and the echo prediction increment Δh_{t-1} at time (t-1), target combination data may be obtained, which can be used as input of the target filtering model. The hidden state H_{t-1} at time (t-1) and the target combination data are input into the target filtering model for echo prediction processing to obtain the output of the target filtering model, namely the gain data K_t at time t, together with the hidden state H_t at time t. The hidden state at the current time may be associated with the echo path at the current time, the play data at the current time, the collected data at the current time, and the like. The hidden state H_t at time t can be used by the target filtering model to output the gain data K_{t+1} at time (t+1); time (t+1) is the time following time t.
According to the initial elimination data e_t at time t and the gain data K_t at time t, the echo prediction increment Δh_t at time t can be obtained. The echo prediction increment Δh_t at time t is mixed with the echo path h_{t-1} at time (t-1) to obtain the echo path h_t at time t. According to the echo path h_t at time t and the play data x_t at time t, echo cancellation is performed on the collected data d_t at time t to obtain the echo cancellation data at time t.
It will be appreciated that 201 in FIG. 2 may include the echo prediction data at time (t-1), and the echo prediction data at time (t-1) may include the echo path h_{t-1} at time (t-1) and the echo prediction increment Δh_{t-1} at time (t-1). 202 in FIG. 2 may include the initial elimination data e_t at time t, the target combination data, the target filtering model and the hidden state H_t at time t, the gain data K_t at time t, the echo prediction increment Δh_t at time t, the echo path h_t at time t, and so on; 202 in FIG. 2 may thus be used to represent the echo cancellation process at time t.
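Under the notation of FIG. 2, the per-moment recursion can be written compactly as follows (the model's internal form is not given in the text; F below is a hypothetical symbol for the target filtering model, and the incremental update is the standard form implied by the description):

```latex
\begin{aligned}
e_t &= d_t - h_{t-1}\, x_t && \text{initial elimination data}\\
(K_t,\; H_t) &= \mathcal{F}\bigl(x_t,\; e_t,\; \Delta h_{t-1},\; H_{t-1}\bigr) && \text{target filtering model}\\
\Delta h_t &= K_t\, e_t && \text{echo prediction increment}\\
h_t &= h_{t-1} + \Delta h_t && \text{echo path at time } t\\
\tilde{e}_t &= d_t - h_t\, x_t && \text{echo cancellation data}
\end{aligned}
```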
Further, referring to fig. 3, fig. 3 is a flow chart of a data processing method according to an embodiment of the application. As shown in fig. 3, the method may be performed by a computer device, which may be any one of the terminal devices in the terminal cluster shown in fig. 1, for example, the terminal device 200a, or may be the computer device 100 shown in fig. 1, which is not limited herein. For ease of understanding, embodiments of the present application will be described with the method being performed by a computer device, and the data processing method may include at least the following steps S101 to S104:
step S101, determining initial cancellation data at the first moment according to the play data at the first moment, the acquisition data at the first moment and the echo prediction data at the second moment.
Specifically, the first time is adjacent to the second time, and the second time is smaller than the first time. A time may refer to a particular instant, such as a second or a millisecond. A time may also refer to the smallest sample unit, in which case the sampling distance between the first time and the second time is one smallest sample unit. For example, the play data at the first time may represent one frame in the full play data, and the collection data at the first time may represent one frame in the full collection data. The first time and the second time may be used to represent two adjacent echo cancellation time points when echo cancellation is performed based on the full play data and the full collection data; for example, the output at the second time may be used as part of the input for the echo cancellation process at the first time. For example, if the first time is referred to as time t, the second time may be referred to as time (t-1), where t is a positive integer less than or equal to T, T is a positive integer, and T may be used to represent the data length of the full play data and the full collection data. Alternatively, if the second time is zero, that is, (t-1) is 0, the echo prediction data at the second time may be initialized, and optionally the hidden state at the second time may be initialized as well. Optionally, data acquisition may be performed directly on the speaker at a certain moment to obtain the play data at the first moment; alternatively, the voice signal sent by the computer equipment connected to the loudspeaker may be obtained, which likewise yields the play data at the first moment.
Specifically, the computer device may obtain the play data at the first time, the acquisition data at the first time, and the echo prediction data at the second time. Specifically, the computer device may obtain the full play data and the full collection data required to perform echo cancellation, obtain the play data at the first moment from the full play data, and obtain the collection data at the first moment from the full collection data. Optionally, echo cancellation may be performed based on the frequency domain; specifically, the computer device may obtain the full play data and the full acquisition data, and perform frequency domain conversion on the full play data to obtain frequency domain play data corresponding to the full play data, and perform frequency domain conversion on the full acquisition data to obtain frequency domain acquisition data corresponding to the full acquisition data. Specifically, a Fourier transform (such as a short-time Fourier transform) may be performed on the full play data to obtain the frequency domain play data corresponding to the full play data, and a Fourier transform may be performed on the full acquisition data to obtain the frequency domain acquisition data corresponding to the full acquisition data.
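As an illustrative sketch of this frequency-domain conversion, a short-time Fourier transform can be implemented with plain NumPy; the frame length, hop size, and window choice below are assumptions, not values from the application:

```python
import numpy as np

def stft(signal, frame_len=512, hop=256):
    """Short-time Fourier transform of the full play/collection data.

    Splits the signal into overlapping windowed frames and takes the
    real FFT of each frame; frame_len and hop are illustrative.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # One spectrum per 'moment': rows index time, columns index frequency
    return np.fft.rfft(frames, axis=1)
```

Each row of the result is the frequency-domain play (or collection) data at one moment; the time-domain signal can be recovered with `np.fft.irfft` and overlap-add, matching the inverse Fourier transform processing mentioned in the application.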
Or, M filters may be obtained, where M filters are used to filter signals in different frequency bands, M is a positive integer, and specifically, the computer device may use the M filters to filter the full play data respectively, so as to obtain play filter signals of the full play data in the M frequency bands, and combine the M play filter signals to obtain frequency domain play data corresponding to the full play data; and filtering the full-quantity acquisition data by adopting M filters to obtain acquisition filtering signals of the full-quantity acquisition data in M frequency bands, and combining the M acquisition filtering signals to obtain frequency domain acquisition data corresponding to the full-quantity acquisition data. Alternatively, the computer device may obtain a play frequency domain range of the full play data, obtain an acquisition frequency domain range of the full acquisition data, determine M, and M filters based on the play frequency domain range and the acquisition frequency domain range. Further, the play data at the first moment can be obtained from the play data of the frequency domain, and the collection data at the first moment can be obtained from the collection data of the frequency domain. At this time, both the play data at the first time and the acquisition data at the first time can be regarded as frequency domain signals.
Further, the computer device may acquire the echo prediction data at the second time. Specifically, if the second time is a default time (such as the zero time), that is, the time when echo cancellation starts, it may be considered that no parameters have been generated yet, and the computer device may perform initialization to obtain the echo prediction data at the second time; if the second time is not the default time, the echo prediction data generated in the echo cancellation process at the second time can be obtained directly. For example, the computer device may perform initialization to obtain the echo prediction data at the zero time; based on the echo prediction data at the zero time, the echo prediction data at the next time is predicted, and so on, until the echo prediction data at time T is obtained. When echo cancellation is performed based on the frequency domain, note that the format of the data obtained in each step of the present application (including the output of the model) is determined by the input data: when a time domain signal is used for prediction, the data obtained in each step is time domain data, and when a frequency domain signal is used, the data obtained in each step is frequency domain data. Therefore, when prediction is performed in the frequency domain, the input data (the full play data, the full collection data, and the like) can be directly subjected to frequency domain conversion, and the echo prediction data at each time, including the echo prediction data at the second time, is obtained by prediction in the frequency domain.
Specifically, the echo prediction data at the second time includes the echo path at the second time. Data fusion processing is performed on the play data at the first time and the echo path at the second time to obtain the initial predicted echo data at the first time, and cancellation processing is performed on the collected data at the first time based on the initial predicted echo data at the first time to obtain the initial cancellation data at the first time. Optionally, the data fusion between the play data at the first time and the echo path at the second time may be a convolution: convolving the play data at the first time with the echo path at the second time yields the initial predicted echo data predicted by the echo path at the second time, which is equivalent to predicting the echo data possibly contained in the collected data at the first time. For example, if the first time is time t, the play data at the first time may be expressed as x_t (the play data x_t at time t in fig. 2), and the echo path at the second time may be expressed as ĥ_{t-1}. Convolving the play data x_t with the echo path ĥ_{t-1} yields the initial predicted echo data ŷ_t = ĥ_{t-1} * x_t. The collected data at the first time may be denoted as d_t; then, based on the initial predicted echo data ŷ_t, cancellation processing is performed on the collected data d_t to obtain the initial cancellation data at the first time, e_t = d_t − ŷ_t. Optionally, the data fusion between the play data at the first time and the echo path at the second time may instead be a dot product: feature extraction processing is performed on the play data at the first time and on the echo path at the second time to obtain the play feature at the first time and the echo path feature at the second time, and a dot product between the play feature at the first time and the echo path feature at the second time yields the predicted initial predicted echo data.
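The convolution-and-subtract step above can be sketched for a single frame. The 3-tap path and the sample values are illustrative assumptions; with a finite path, the convolution at one time reduces to a dot product over the most recent play samples:

```python
import numpy as np

def initial_cancel(x_t, h_prev, d_t):
    """Initial cancellation for one time t: predict the echo by filtering
    the play data with the previous echo path h_prev (ŷ_t = ĥ_{t-1} * x_t),
    then subtract the prediction from the collected data d_t.
    x_t holds the len(h_prev) most recent play samples, newest first."""
    y_hat = np.dot(h_prev, x_t)   # initial predicted echo data ŷ_t
    e_t = d_t - y_hat             # initial cancellation data e_t = d_t − ŷ_t
    return y_hat, e_t

h_prev = np.array([0.5, 0.3, 0.1])   # assumed 3-tap echo path ĥ_{t-1}
x_t = np.array([1.0, 2.0, 3.0])      # play data buffer at time t
d_t = 1.6                            # collected sample at time t
y_hat, e_t = initial_cancel(x_t, h_prev, d_t)
```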
Further, for feature extraction of the play data at the first time and the echo path at the second time, a direct vector mapping manner may be adopted: the play data at the first time is directly mapped to the play feature at the first time, and the echo path at the second time is directly mapped to the echo path feature at the second time. With a direct mapping extraction manner, the high correlation between the original play data and the echo path can be preserved during feature extraction.
For ease of understanding, please refer to fig. 4a, which is a schematic diagram of an acoustic echo cancellation scenario provided by an embodiment of the present application. As shown in fig. 4a, the computer device may obtain the full play data 402 sent by the far-end object 401 and play it through the voice playing device 403 (such as a speaker). The voice collecting device 406 (such as a microphone) obtains the full collection data 407, where the full collection data 407 may be generated from the voice signal produced by the near-end object 405 (which may be referred to as near-end voice) together with the interference voice data 404 generated when the voice playing device 403 plays the full play data 402. The far-end object 401 and the near-end object 405 are relative terms: one of the two voice communication parties is the near-end object, and the other party is the corresponding far-end object. For example, if the two voice communication parties are object A and object B, then when object A is regarded as the near-end object, object B is the far-end object relative to object A; when object B is regarded as the near-end object, object A is the far-end object relative to object B. The application can be used in any scenario where voice communication takes place, such as a video call scenario, a scenario of voice communication through an in-game voice function, a teleconference scenario, and the like. The full play data 402 shown in fig. 4a may include the voice signal generated by the far-end object 401 (which may be referred to as far-end voice), or may include both the voice signal of the far-end object 401 and the sound in the environment where the far-end object 401 is located (such as game background sound or background music).
For example, taking the beginning of playing the voice data that needs echo cancellation as the reference (recorded as the zero time), the data played by the voice playing device 403 at time t may be regarded as the play data at time t, where t is a positive integer. The full collection data 407 may be considered to include the collected data d_t shown in fig. 2.
Step S102, the playing data at the first moment, the initial elimination data at the first moment and the echo prediction data at the second moment are input into a target filtering model for prediction, and gain data at the first moment is obtained.
Specifically, the echo prediction data at the second time includes the echo prediction increment at the second time and the echo path at the second time. The target filter model may be a recurrent neural network (Recurrent Neural Network, RNN), a convolutional neural network (Convolutional Neural Network, CNN), or another neural network model (e.g., a Transformer). The recurrent neural network may use gated recurrent units (Gated Recurrent Unit, GRU), long short-term memory (Long Short-Term Memory, LSTM), or other recurrent cells. The gain data at the first time may be understood as the output of the target filtering model, that is, as the echo prediction gain result for the play data at the first time, the initial cancellation data at the first time, and the echo prediction data at the second time.
Specifically, the echo prediction data at the second time includes an echo prediction delta at the second time. And splicing the play data at the first moment, the initial elimination data at the first moment and the echo prediction increment at the second moment to obtain target combination data for prediction processing. Further, the target combination data can be input into a target filtering model for prediction, and gain data at the first moment can be obtained.
Optionally, when the target combined data is obtained, feature extraction processing may be performed on the play data at the first moment to obtain play features at the first moment, and feature extraction processing may be performed on the initial elimination data at the first moment to obtain initial elimination features at the first moment. And performing characteristic splicing processing on the play characteristic at the first moment, the initial elimination characteristic at the first moment and the echo prediction increment at the second moment to obtain target combination data for prediction processing.
Specifically, when predicting the gain data at the first time, the hidden state of the target filtering model at the second time and the target combination data can be input into the target filtering model for prediction processing, so as to obtain the gain data at the first time and the hidden state of the target filtering model at the first time. The hidden state at the first time is used for predicting the gain data at the third time, where the first time is adjacent to the third time and the first time precedes the third time. For example, the echo prediction increment at the second time may be expressed as Δh_{t-1}; splicing the play data x_t at the first time, the initial cancellation data e_t at the first time, and the echo prediction increment Δh_{t-1} at the second time yields the target combination data (x_t, e_t, Δh_{t-1}). The target combination data (x_t, e_t, Δh_{t-1}) is input into the target filtering model for prediction to obtain the gain data K_t at the first time. Optionally, in one possible case, the target filtering parameters in the target filtering model may be obtained, and the target combination data is predicted by using the hidden state at the second time and the target filtering parameters to obtain the gain data at the first time; data fusion processing is performed on the target combination data and the hidden state at the second time to predict the hidden state at the first time. Denoting the gain data at the first time as K_t and the hidden state at the first time as H_t, this process can be written as K_t, H_t = FiltM(x_t, e_t, Δh_{t-1}, H_{t-1} | θ), where FiltM represents the target filtering model and θ represents the target filtering parameters in the target filtering model. The target filtering parameters are obtained through training.
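One FiltM step with a GRU-style recurrent cell can be sketched as follows. The dimensions, the random parameters θ, and the final linear read-out producing K_t are all illustrative assumptions (a trained model would supply θ):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_gain_step(combined, h_prev, params):
    """One step of a GRU-based target filtering model: takes the target
    combination data (x_t, e_t, Δh_{t-1}) concatenated into one vector
    plus the hidden state H_{t-1}; returns the gain K_t and the new
    hidden state H_t. `params` stands in for the trained parameters θ."""
    Wz, Uz, Wr, Ur, Wh, Uh, Wo = params
    z = sigmoid(combined @ Wz + h_prev @ Uz)            # update gate
    r = sigmoid(combined @ Wr + h_prev @ Ur)            # reset gate
    h_tilde = np.tanh(combined @ Wh + (r * h_prev) @ Uh)
    h_t = (1 - z) * h_prev + z * h_tilde                # hidden state H_t
    k_t = h_t @ Wo                                      # gain data K_t
    return k_t, h_t

rng = np.random.default_rng(42)
dim_in, dim_h, dim_out = 9, 8, 3                        # illustrative sizes
params = [rng.standard_normal(s) * 0.1 for s in
          [(dim_in, dim_h), (dim_h, dim_h), (dim_in, dim_h), (dim_h, dim_h),
           (dim_in, dim_h), (dim_h, dim_h), (dim_h, dim_out)]]
combined = rng.standard_normal(dim_in)                  # (x_t, e_t, Δh_{t-1})
h_prev = np.zeros(dim_h)                                # initialized H_{t-1}
k_t, h_t = gru_gain_step(combined, h_prev, params)
```

The returned `h_t` would be carried forward as the hidden state for predicting the gain at the next time, matching the recurrence in the text.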
Alternatively, in one possible structure, the target filtering model may include one or more network layers, such as an activation layer, a linear filtering layer, a fully connected layer, and the like. For example, the target combination data and the hidden state at the second time are input into the target filtering model; the activation layer activates the target combination data to obtain the feature to be filtered; the linear filtering layer performs linear filtering processing on the feature to be filtered to obtain the feature to be fully connected; and the fully connected layer performs fully connected processing on the feature to be fully connected to obtain the gain data at the first time. The number and types of network layers included in the target filtering model are not limited herein; the above is only one possible hierarchical structure. Further optionally, the target filtering parameters may include the hierarchical parameters corresponding to each network layer; the target combination data and the hidden state at the second time are input into the target filtering model, and the target combination data is processed sequentially with the hierarchical parameters of each network layer and the hidden state at the second time until the gain data at the first time is obtained.
For example, taking the ith network layer as an example, i is a positive integer, and the hierarchical hidden state of the (i-1) th network layer and the hierarchical parameters of the ith network layer can be adopted to predict the hierarchical output data of the (i-1) th network layer to obtain the hierarchical output data of the ith network layer; and carrying out state prediction by adopting the hierarchical hidden state of the (i-1) th network layer and the hierarchical output data of the (i-1) th network layer to obtain the hierarchical hidden state of the i-th network layer. And when the i is the same as the number of the network layers included in the target filtering model, determining the hierarchical output data of the ith network layer as gain data at the first moment, and determining the hierarchical hidden state of the ith network layer as the hidden state at the first moment. When i is 1, the hierarchy hidden state of the (i-1) th network layer refers to the hidden state at the second moment, and the hierarchy output data of the (i-1) th network layer refers to the target combination data.
Optionally, the linear filtering layer may include a gated recurrent unit (Gated Recurrent Unit, GRU), which may be simply understood as a threshold selection process: features of the feature to be filtered that do not exceed the threshold are filtered out, and features greater than the threshold are determined as features to be fully connected. Optionally, the linear filtering layer may include a long short-term memory network (Long Short-Term Memory, LSTM); specifically, the LSTM may include p selection decision processors, where p is a positive integer. The p selection decision processors perform selection decisions on the feature to be filtered, and the features of the feature to be filtered that satisfy the decision conditions corresponding to the p selection decision processors are determined as features to be fully connected.
Step S103, adjusting the echo prediction data at the second time based on the gain data at the first time and the initial cancellation data at the first time, to obtain the echo path at the first time.
Specifically, the computer device may calibrate the echo prediction data at the second time according to the calibration data by using the gain data at the first time and the initial cancellation data at the first time as the calibration data, so that the prediction for the echo prediction data is more accurate, and the prediction deviation for the echo prediction data is reduced.
Specifically, the echo prediction data at the second time includes the echo path at the second time. The echo prediction increment at the first time is obtained based on the gain data at the first time and the initial cancellation data at the first time, and incremental adjustment is performed on the echo path at the second time based on the echo prediction increment at the first time to obtain the echo path at the first time. Specifically, data fusion may be performed on the gain data at the first time and the initial cancellation data at the first time to obtain the echo prediction increment at the first time; as shown in fig. 2, assuming the first time is time t, the echo prediction increment can be expressed as Δh_t = K_t · e_t. The echo prediction increment at the first time may be regarded as the change that the predicted echo path at the first time may produce on the basis of the echo path at the second time, that is, the difference between the predicted echo path at the first time and the echo path at the second time. Incremental adjustment of the echo path at the second time by the echo prediction increment at the first time then yields the echo path at the first time, that is, ĥ_t = ĥ_{t-1} + Δh_t.
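The incremental path update can be sketched directly from the two relations Δh_t = K_t · e_t and ĥ_t = ĥ_{t-1} + Δh_t. The 3-tap path and the gain values are illustrative assumptions:

```python
import numpy as np

def update_echo_path(h_prev, k_t, e_t):
    """Incremental echo path adjustment: the echo prediction increment is
    the product of the gain data K_t and the initial cancellation data
    e_t (Δh_t = K_t · e_t), and the new path is ĥ_t = ĥ_{t-1} + Δh_t."""
    delta_h = k_t * e_t
    return h_prev + delta_h, delta_h

h_prev = np.array([0.5, 0.3, 0.1])    # echo path ĥ_{t-1}
k_t = np.array([0.2, 0.1, 0.05])      # gain data at time t (assumed)
e_t = 0.2                             # initial cancellation data at time t
h_t, delta_h = update_echo_path(h_prev, k_t, e_t)
```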
Step S104, echo cancellation is carried out on the acquired data at the first moment based on the echo path at the first moment and the play data at the first moment, so as to obtain echo cancellation data at the first moment.
Specifically, data fusion processing is performed on the echo path at the first time and the play data at the first time to obtain the target predicted echo data at the first time, and cancellation processing is performed on the collected data at the first time based on the target predicted echo data at the first time to obtain the echo cancellation data at the first time. For the data fusion processing of the echo path at the first time and the play data at the first time, reference may be made to the data fusion processing of the echo path at the second time and the play data at the first time. For example, as shown in fig. 2, based on the echo path ĥ_t at time t, the play data x_t at time t, and the collected data d_t at time t, the echo cancellation data is obtained as s_t = d_t − ĥ_t * x_t, where ĥ_t * x_t represents the target predicted echo data at the first time.
Optionally, when performing echo cancellation based on the frequency domain, echo cancellation may be performed on the acquired data at the first time based on the echo path at the first time and the play data at the first time, to obtain frequency domain cancellation data at the first time. And performing time domain conversion on the frequency domain elimination data at the first moment to obtain echo elimination data at the first moment. For example, the frequency domain cancellation data at the first time may be subjected to inverse fourier transform processing to obtain echo cancellation data at the first time. Or, the inverse filters corresponding to the M filters may be obtained, the inverse filters are used to perform inverse filtering processing on the frequency domain cancellation data at the first moment, so as to obtain M restored signals, and the M restored signals are combined to obtain the echo cancellation data at the first moment.
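The time domain conversion step (inverse Fourier transform of the per-frame frequency domain cancellation data) can be sketched with an inverse FFT plus overlap-add. Frame length and hop are illustrative assumptions, and synthesis windowing is omitted for brevity:

```python
import numpy as np

def istft_overlap_add(freq_frames, frame_len=256, hop=128):
    """Convert per-frame frequency-domain cancellation data back to a
    time-domain signal: inverse FFT each frame, then overlap-add the
    frames at their hop positions. A real system would also apply a
    synthesis window so overlapping frames sum to unity gain."""
    n_frames = freq_frames.shape[0]
    out = np.zeros((n_frames - 1) * hop + frame_len)
    for i, spec in enumerate(freq_frames):
        out[i * hop : i * hop + frame_len] += np.fft.irfft(spec, n=frame_len)
    return out

# Toy frequency-domain cancellation data: three constant-valued frames
freq = np.fft.rfft(np.ones((3, 256)) * 0.5, axis=1)
time_sig = istft_overlap_add(freq)
```

In the doubly-overlapped middle region the two 0.5-valued frames sum to 1.0, which is why practical overlap-add schemes normalize by the window overlap.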
That is, the echo cancellation method of the embodiment of the present application may be applied in both the time domain and the frequency domain. Owing to the different characteristics of the two domains, the data magnitude the computer device needs to process for acoustic echo cancellation in the frequency domain is smaller than that in the time domain. For processing in the frequency domain, the embodiment of the present application can therefore reduce the model size of the target filtering model, making it applicable to computer devices sensitive to model size and operation speed; the smaller-magnitude data can also greatly shorten the operation time of echo cancellation and improve its efficiency and performance.
In the embodiment of the application, the gain data at the first moment is calculated through the target filtering model, the echo prediction data at the second moment is adjusted, the echo path at the first moment can be obtained, and the echo cancellation data at the first moment can be obtained according to the prediction of the echo path. The target filtering model has self-adaptability, so that the embodiment of the application has good generalization performance. The combination of the target filtering model and the gain data at the first moment can more accurately predict the echo path at the first moment, so that when the echo path is suddenly changed, the echo cancellation data at the first moment can be reconverged more quickly than the self-adaptive filtering or the neural network model alone, the echo is further rapidly cancelled, and the deviation when the echo path is estimated is weakened. Meanwhile, the combination of the target filtering model and the gain data at the first moment improves the prediction precision of the model, can reduce the model size of the filtering model, and further can be applied to computer equipment sensitive to the model size and the operation speed. Therefore, the application can improve the accuracy of echo cancellation, thereby improving the effect and efficiency of echo cancellation.
Further, referring to fig. 4b, fig. 4b is a flow chart of a data processing method according to an embodiment of the present application. As shown in fig. 4b, the method may be performed by a computer device, which may be any one of the terminal devices in the terminal cluster shown in fig. 1, for example, the terminal device 200a, or may be the computer device 100 shown in fig. 1, which is not limited herein. For ease of understanding, embodiments of the present application will be described with the method being performed by a computer device as an example, the method may include at least the following steps S201-S208:
step S201, acquiring full-volume acquisition data and full-volume play data.
Specifically, for the specific embodiment of this step, reference may be made to the description of acquiring the full collection data and the full play data in step S101 in fig. 3, which is not repeated herein. For example, as shown in fig. 4a, assume that in a game communication scenario the near-end object 405 generates a voice signal, that is, the near-end object 405 speaks a passage into the voice collecting device 406. During the period in which the near-end object 405 generates the voice signal, the voice data collected by the voice collecting device 406 is the full collection data, and the voice data played by the voice playing device 403 is the full play data. In this game communication scenario, the far-end object 401 and the near-end object 405 may be considered game players in voice communication; for example, the near-end object 405 may send out a voice signal such as "assemble at XX, watch out for and dodge the ultimate", and the full play data and the full collection data can be acquired during the period in which the voice signal is sent.
Optionally, echo cancellation may be performed based on the frequency domain, and specifically, the computer device may obtain the full play data and the full acquisition data, and perform frequency domain conversion on the full play data to obtain frequency domain play data corresponding to the full play data; and performing frequency domain conversion on the full-volume acquisition data to obtain frequency domain acquisition data corresponding to the full-volume acquisition data. See in particular the relevant description shown in step S101 of fig. 3.
Alternatively, the current time of echo cancellation processing on the full play data and the full collection data may be denoted as t, where t is a positive integer. Further, t is initialized, specifically, t is set to a default value, e.g., 1, and step S202 is performed.
Step S202, judging whether t > T.
Specifically, T may be obtained; if t ≤ T, step S203 is executed, and if t > T, step S208 is executed. Here T is based on the data length corresponding to the full collection data and the full play data, that is, the judgment determines whether processing of the full play data and the full collection data has been completed. If echo cancellation is performed directly on the full play data and the full collection data, T may represent the data length corresponding to the full collection data and the full play data; if echo cancellation is performed based on the frequency domain, T may represent the data length corresponding to the frequency domain collection data and the frequency domain play data.
Step S203, obtain the playing data at time t, the collecting data at time t and the echo predicting data at time (t-1).
Specifically, for the specific implementation of this step, reference may be made to the description of step S101 in fig. 3 regarding the acquisition of the play data at the first time, the acquisition data at the first time, and the echo prediction data at the second time, which are not described herein. That is, the time t may be regarded as a first time and the time (t-1) may be regarded as a second time. The playing data at the time t can be obtained from the full play data, and the collecting data at the time t can be obtained from the full collecting data. Optionally, if echo cancellation is performed based on the frequency domain, playing data at the time t may be obtained from the playing data of the frequency domain, and collecting data at the time t may be obtained from the collecting data of the frequency domain. Alternatively, if t is 1, echo prediction data at time (t-1) may be initialized.
Step S204, determining initial elimination data at the time t according to the play data at the time t, the acquisition data at the time t and the echo prediction data at the time (t-1).
In step S205, the playing data at time t, the initial cancellation data at time t and the echo prediction data at time (t-1) are input into a target filtering model for prediction to obtain gain data at time t.
Step S206, based on the gain data at time t and the initial cancellation data at time t, the echo prediction data at time (t-1) is adjusted to obtain the echo path at time t.
Specifically, the specific embodiment of step S204 to step S206 may refer to the description of the echo path obtained at the first time in step S101 to step S103 in fig. 3, and will not be described herein.
Step S207, t++.
Specifically, the computer device may perform self-adding processing on t, i.e. t=t+1, and return to step S202 to perform echo cancellation on the play data and the collected data at the next moment.
Step S208, output the echo cancellation data at all times.
Specifically, the computer device may obtain echo cancellation data corresponding to each of T times (i.e., a time corresponding to a default value to a time corresponding to T), and fuse the echo cancellation data corresponding to each of T times to obtain full echo cancellation data. That is, the full-scale echo cancellation data refers to data obtained by echo cancellation of the full-scale playback data and the full-scale acquisition data.
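The per-time loop of steps S201–S208 can be sketched end to end in the time domain. Since the trained target filtering model is not available here, an NLMS-style gain K_t = μ·x_t/‖x_t‖² is substituted for it (an assumption), so this is a plain adaptive-filter baseline illustrating the loop structure, not the learned model itself:

```python
import numpy as np

def echo_cancel_full(play, capture, taps=4, mu=0.5):
    """Loop over times t = 1..T: form the initial cancellation data e_t,
    compute a gain K_t (NLMS stand-in for the target filtering model),
    apply the echo path increment Δh_t = K_t * e_t, and output the echo
    cancellation data for each time."""
    h = np.zeros(taps)                       # initialized echo path ĥ_0
    out = np.zeros(len(capture))
    for t in range(len(capture)):            # while t <= T
        x_t = play[max(0, t - taps + 1) : t + 1][::-1]   # newest first
        x_t = np.pad(x_t, (0, taps - len(x_t)))
        e_t = capture[t] - h @ x_t           # initial cancellation data
        k_t = mu * x_t / (x_t @ x_t + 1e-8)  # gain data (NLMS stand-in)
        h = h + k_t * e_t                    # echo path increment update
        out[t] = capture[t] - h @ x_t        # echo cancellation data
    return out

rng = np.random.default_rng(7)
play = rng.standard_normal(2000)             # full play data
true_path = np.array([0.6, 0.25, 0.1, 0.05]) # assumed true echo path
echo = np.convolve(play, true_path)[: len(play)]
cleaned = echo_cancel_full(play, echo)       # capture is pure echo here
```

After the path estimate converges, the residual in `cleaned` is far smaller than the echo energy, which is the behavior the full loop is designed to achieve.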
In the embodiment of the application, the echo cancellation data at a plurality of times are fused to obtain the set of echo cancellation data at those times, so that the echo cancellation data at a plurality of consecutive times can be spliced, more echo cancellation data characteristics can be obtained, and the feedback effect of echo cancellation is improved. The method improves the prediction precision of the model and can reduce the model size of the filtering model, and thus can be applied to computer devices sensitive to model size and operation speed. Therefore, the application can improve the accuracy of echo cancellation, thereby improving the effect and efficiency of echo cancellation.
Further, in a scenario where echo cancellation is required, after the computer device obtains the full amount of echo cancellation data through the steps shown in fig. 4b, the full amount of echo cancellation data may be sent to the corresponding target device. For example, taking fig. 4a as an example, in a game communication scenario, after echo cancellation is performed by the computer device (i.e., the device associated with the near-end object 405) through the steps shown in fig. 4b, the full amount of echo cancellation data 408 is obtained, and the full amount of echo cancellation data may be sent to the far-end device (i.e., the target device) associated with the far-end object 401. The remote device may play the speech signal via the speech playing means 409, the speech signal played by the speech playing means 409 comprising the full amount of echo cancellation data 408. For example, the computer device receives an echo cancellation request sent by the target device, where the echo cancellation request includes the full play data and the full collect data, and the computer device may perform echo cancellation through the steps shown in fig. 4b to obtain the full echo cancellation data 408, and send the full echo cancellation data 408 to the target device. That is, the present application can be applied to any echo cancellation scene where play data and acquisition data exist.
Further, referring to fig. 5, fig. 5 is a flow chart of a data processing method according to an embodiment of the application. As shown in fig. 5, the method may be performed by a computer device, which may be any one of the terminal devices in the terminal cluster shown in fig. 1, for example, the terminal device 200a, or may be the computer device 100 shown in fig. 1, which is not limited herein. For ease of understanding, embodiments of the present application will be described with the method being performed by a computer device as an example, the method may include at least the following steps S301-S305:
step S301, determining initial sample elimination data at the first moment according to the sample play data at the first moment, the sample collection data at the first moment and the sample prediction data at the second moment.
Specifically, the computer device may obtain the sample play data at the first time, the sample collection data at the first time, and the sample prediction data at the second time, and determine the initial sample cancellation data at the first time accordingly. The first time is adjacent to the second time, and the second time precedes the first time. The sample prediction data at the second time may include the sample path at the second time and the sample prediction increment at the second time. When the second time is the zero time, the sample prediction data at the second time can be initialized, such as the sample prediction increment Δh_0 and the sample path ĥ_0; optionally, the hidden state H_0 at the second time may also be initialized.
The computer device may obtain the sample play data at the first moment from the full sample play data, and obtain the sample collection data at the first moment from the full sample collection data. Alternatively, frequency domain transformation may be performed on the full sample play data to obtain sample frequency domain play data, and the sample play data at the first moment obtained from the sample frequency domain play data; likewise, frequency domain transformation may be performed on the full sample collection data to obtain sample frequency domain collection data, and the sample collection data at the first moment obtained from the sample frequency domain collection data. For example, if the first moment is time t, the sample play data at the first moment can be denoted as x_t, and the sample collection data at the first moment can be denoted as d_t.
The computer device can acquire the full sample play data and the full sample label for model training. Specifically, the computer device may obtain corpus data, where the corpus data may be collected data such as voice data or music data, or may be echo-free data obtained from the internet or a related database. The computer device can randomly intercept a segment of the corpus data, or select a segment according to requirements, as the full sample play data x; likewise, a segment can be randomly intercepted from the corpus data, or selected according to requirements, as the full sample label s. That is, the computer device may obtain corpus data, and obtain the full sample play data and the full sample label from the corpus data. The full sample play data comprises the sample play data at the first moment, and the full sample label comprises the sample tag corresponding to the sample echo cancellation data at the first moment. For example, the full sample play data may be denoted as x = [x[t], x[t−1], …, x[t−N+1]].
Further, the computer device may simulate the acoustic propagation response, generate simulated echo data, and obtain a simulated echo path h from the simulated echo data. For example, the computer device may use data simulated by an algorithm such as the image method as a room impulse response (Room Impulse Response, RIR), and determine the RIR to be the simulated echo data. Alternatively, the computer device may randomly generate noise, and determine the randomly generated noise to be the simulated echo data. Alternatively, the computer device may collect simulated echo data through the voice collection device, and so on. Of course, the computer device may acquire the simulated echo data in any one or more of these ways.
Specifically, a positive integer N may be determined according to the data length of the full sample play data, and the computer device may randomly select a piece of data with a data length of N from the simulated echo data and determine the randomly selected data as the simulated echo path. Alternatively, a piece of data with a data length larger than N may be acquired from the simulated echo data, a data segment with a data length of N intercepted from the acquired data, and the data segment determined as the simulated echo path.
Further, the computer device may perform convolution processing on the full-volume sample playing data x and the simulated echo path h to obtain full-volume sample echo data y, which may be denoted as y=x×h. The computer device may perform a mixing process on the full-scale sample echo data and the full-scale sample play data to obtain full-scale sample acquisition data, which may be denoted as d=s+y. For example, in FIG. 4a, a full volume of sample acquisition data is acquired by the speech acquisition device 406. Wherein the full sample acquisition data comprises sample acquisition data at a first time.
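The data-construction steps above (simulate an echo path h, convolve it with the play data x to get echo data y, then mix with the label s to get collection data d = s + y) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `simulate_rir` and `make_training_pair` are made up here, and the exponentially decaying noise RIR is only a hypothetical stand-in for an image-method simulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_rir(length, decay=0.3, fs=16000):
    # Hypothetical stand-in for an image-method RIR: exponentially decaying noise.
    t = np.arange(length) / fs
    rir = rng.standard_normal(length) * np.exp(-t / decay)
    return rir / np.max(np.abs(rir))

def make_training_pair(s, x, h):
    # Full-sample echo data y = x * h (convolution), truncated to the signal
    # length, then mixed with the full-sample label: d = s + y.
    y = np.convolve(x, h)[: len(x)]
    d = s + y
    return d, y

s = rng.standard_normal(16000)  # full-sample label (near-end signal stand-in)
x = rng.standard_normal(16000)  # full-sample play data
h = simulate_rir(512)           # simulated echo path
d, y = make_training_pair(s, x, h)
```

With this construction the clean label s is known exactly, so the mixed collection data d can supervise the model directly.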
Further, for a specific embodiment of this step, reference may be made to the description of determining the initial cancellation data in step S101 in fig. 3, and no further description is given here.
In step S302, the sample play data at the first moment, the sample cancel data at the first moment and the sample predict data at the second moment are input into the initial filtering model for prediction, so as to obtain the sample gain data at the first moment.
Specifically, the sample prediction data at the second time includes a sample prediction increment at the second time and a sample path at the second time. Further, for a specific embodiment of this step, reference may be made to the description of the acquisition of the gain data in step S102 in fig. 3, and no further description is given here.
Step S303, based on the sample gain data at the first time and the initial sample cancellation data at the first time, adjusts the sample prediction data at the second time to obtain the sample path at the first time.
Specifically, the specific embodiment of this step may refer to the description of acquiring the echo path at the first moment in step S103 in fig. 3, which is not described herein.
Step S304, echo cancellation is performed on the sample collection data at the first moment based on the sample path at the first moment and the sample play data at the first moment, so as to obtain sample echo cancellation data at the first moment.
For a specific implementation of this step, reference may be made to the description of the step S104 for acquiring the echo cancellation data at the first moment in the embodiment corresponding to fig. 3, which will not be repeated here.
Step S305, according to the sample echo cancellation data at the first moment and the sample label corresponding to the sample echo cancellation data at the first moment, parameter adjustment is performed on the initial filtering model to obtain the target filtering model.
Specifically, according to the sample play data x_t at time t, the sample collection data d_t at time t and the sample path ĥ_{t−1} at time t−1, the initial sample elimination data e_t at time t can be obtained through operation processing: e_t = d_t − x_t · ĥ_{t−1}. The sample play data x_t, the initial sample elimination data e_t, the sample prediction increment Δh_{t−1} at time t−1 and the hidden state H_{t−1} at time t−1 are input into the initial filtering model for echo prediction processing, and the initial filtering model outputs the sample gain data K_t at time t and the hidden state H_t at time t; this process can be denoted as (K_t, H_t) = RNN(x_t, e_t, Δh_{t−1}, H_{t−1}; θ), where θ is a parameter of the initial filtering model. According to the initial sample elimination data e_t at time t and the sample gain data K_t at time t, the sample prediction increment Δh_t at time t can be obtained: Δh_t = K_t · e_t. The sample prediction increment Δh_t at time t is mixed with the sample path ĥ_{t−1} at time t−1 to obtain the sample path ĥ_t at time t: ĥ_t = ĥ_{t−1} + Δh_t. According to the sample path ĥ_t at time t, the sample play data x_t at time t and the collection data d_t at time t, the sample echo cancellation data ŝ_t at time t can be obtained: ŝ_t = d_t − x_t · ĥ_t. For a specific implementation manner of acquiring the sample echo cancellation data at time t, refer to the description of the echo cancellation data at the first moment in step S101 to step S104 in the embodiment corresponding to fig. 3, which will not be described herein.
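The recursion described in the preceding paragraph can be sketched as a single update function. This is only a scalar sketch under simplifying assumptions: `filter_step` and `toy_model` are illustrative names, the model interface is assumed, and the real scheme operates on frequency-domain vectors with a trained RNN rather than a fixed scalar gain.

```python
def filter_step(x_t, d_t, h_prev, dh_prev, H_prev, model):
    # e_t = d_t − x_t · ĥ_{t−1}: initial sample elimination data at time t
    e_t = d_t - x_t * h_prev
    # the model returns the gain K_t and the new hidden state H_t (interface assumed)
    K_t, H_t = model(x_t, e_t, dh_prev, H_prev)
    dh_t = K_t * e_t           # sample prediction increment Δh_t = K_t · e_t
    h_t = h_prev + dh_t        # sample path ĥ_t = ĥ_{t−1} + Δh_t
    s_hat_t = d_t - x_t * h_t  # sample echo cancellation data ŝ_t = d_t − x_t · ĥ_t
    return s_hat_t, h_t, dh_t, H_t

# Toy stand-in for the trained model: fixed scalar gain, hidden state unchanged.
toy_model = lambda x, e, dh, H: (0.5, H)
s_hat, h_t, dh_t, H_t = filter_step(1.0, 2.0, h_prev=1.0, dh_prev=0.0,
                                    H_prev=None, model=toy_model)
```

Note how the gain K_t plays the role of a learned Kalman-style gain: it scales the residual e_t into the path correction Δh_t, rather than the path itself being predicted from scratch each step.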
Further, according to the sample echo cancellation data at the first moment and the sample label corresponding to the sample echo cancellation data at the first moment, parameter adjustment is carried out on the initial filtering model, and a target filtering model is obtained.
Specifically, sample echo cancellation data corresponding to T times respectively are obtained, a loss function is generated according to the sample echo cancellation data corresponding to the T times and the full sample label, and parameter adjustment is performed on the initial filtering model based on the loss function to obtain the target filtering model, where the target filtering model comprises trained target filtering parameters. Optionally, the loss function may be generated according to the difference data between the sample echo cancellation data corresponding to the T times respectively and the sample tags in the full sample label corresponding to the T times respectively. The loss function may be generated according to the sum of the differences between the sample echo cancellation data and the sample tag at each moment; the loss function may also be generated according to the mean value, the mean square error, the extremum, etc. of the differences between the sample echo cancellation data and the sample tag at each moment. For example, one possible loss function is shown in formula (1):

L(θ) = Σ_{t=1}^{T} |s_t − ŝ_t|²   (1)
In formula (1), s_t represents the sample tag corresponding to the sample echo cancellation data at time t, ŝ_t represents the sample echo cancellation data at time t, and θ represents the parameters of the filtering model, for example the parameters in the initial filtering model before training or the target filtering parameters in the target filtering model after training.
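A minimal sketch of such a loss, assuming formula (1) is the sum of squared differences between the tags and the cancellation outputs (the exact form in the original filing is not preserved in this extraction); the function name `echo_cancel_loss` is illustrative.

```python
import numpy as np

def echo_cancel_loss(s, s_hat):
    # L(θ) = Σ_t |s_t − ŝ_t|²: squared error between the sample tags s_t and
    # the sample echo cancellation data ŝ_t over all T times.
    s, s_hat = np.asarray(s, dtype=float), np.asarray(s_hat, dtype=float)
    return float(np.sum(np.abs(s - s_hat) ** 2))
```

The mean, mean-square-error, or extremum variants mentioned above would replace `np.sum` with the corresponding reduction.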
Alternatively, the parameter θ of the initial filtering model may be updated by error back-propagation and a gradient descent algorithm, as shown in formula (2):

θ ← θ − μ · ∂L/∂θ   (2)
in the formula (2), μ represents a learning rate of the initial filter model, L represents a loss function, and θ represents a parameter of the filter model.
Further, the initial filtering model is adjusted until it converges, and the converged initial filtering model is determined as the target filtering model. The parameter θ is saved; this parameter can be regarded as the target filtering parameter in the target filtering model and can be used for echo cancellation.
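As a toy illustration of the gradient-descent update of formula (2), the loop below learns a single scalar echo path h directly by descending on the squared cancellation error. This is a deliberate simplification: in the actual scheme θ parameterizes the RNN and the gradient is obtained by back-propagation through the whole recursion, not by this closed-form derivative.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(256)        # play data
h_true = 0.7                        # "true" scalar echo path (toy ground truth)
s = rng.standard_normal(256) * 0.1  # near-end label
d = s + h_true * x                  # collected data = near-end + echo

h, mu = 0.0, 0.05                   # initial parameter and learning rate μ
for _ in range(200):
    e = d - h * x                           # cancellation output for current h
    grad = -2.0 * np.mean(x * (e - s))      # ∂L/∂h for L = mean((e − s)²)
    h = h - mu * grad                       # formula (2): θ ← θ − μ · ∂L/∂θ
```

Because e − s = (h_true − h)·x here, the gradient vanishes exactly when the estimated path matches the true one, so the loop converges to h ≈ 0.7.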
Referring to fig. 6, fig. 6 is a schematic diagram of a scenario of acoustic echo cancellation according to an embodiment of the present application. As shown in fig. 6, the scene may be a screen displayed on a terminal device during use of a game client, or a video picture displayed when an office client uses a teleconference function. The scenario shown in fig. 6 includes a speaker and a microphone that may be used for voice communication. Icon 601 indicates the current state of the speaker: if icon 601 is in the state shown in fig. 6, the speaker is currently on, and if a marking (such as a diagonal or cross line) is added to icon 601, the speaker is currently off. Icon 602 indicates the current state of the microphone: if icon 602 is in the state shown in fig. 6, the microphone is currently on, and if a marking (such as a diagonal or cross line) is added to icon 602, the microphone is currently off.
In the embodiment of the application, the target filtering model with self-adaptability is obtained by training the initial filtering model, so that the embodiment of the application has good generalization performance. The combination of the target filtering model and the gain data at the first moment can more accurately predict the echo path at the first moment, so that when the echo path is suddenly changed, the echo cancellation data at the first moment can be reconverged more quickly than the self-adaptive filtering or the neural network model alone, the echo is further rapidly cancelled, and the deviation when the echo path is estimated is weakened. Meanwhile, the prediction precision of the model is improved, the model size of the filtering model can be reduced, and the embodiment of the application can be further applied to computer equipment sensitive to the model size and the operation speed. Therefore, the application can improve the accuracy of echo cancellation, thereby improving the effect and efficiency of echo cancellation.
Further, referring to fig. 7, fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. The data processing apparatus may be a computer program (comprising program code) running in a computer device, for example application software; the apparatus can be used for executing the corresponding steps in the method provided by the embodiments of the present application. As shown in fig. 7, the data processing apparatus 1 may include: a cancellation data determination module 11, a target prediction module 12, a target adjustment module 13, and an echo cancellation module 14.
The cancellation data determining module 11 is configured to determine initial cancellation data at a first time according to the play data at the first time, the acquired data at the first time, and the echo prediction data at a second time; the first time is adjacent to the second time, and the second time is smaller than the first time;
the target prediction module 12 is configured to input the play data at the first time, the initial cancellation data at the first time, and the echo prediction data at the second time into the target filtering model to perform prediction, so as to obtain gain data at the first time;
the target adjustment module 13 is configured to adjust echo prediction data at a second time based on gain data at a first time and initial cancellation data at the first time, so as to obtain an echo path at the first time;
the echo cancellation module 14 is configured to perform echo cancellation on the acquired data at the first time based on the echo path at the first time and the play data at the first time, so as to obtain echo cancellation data at the first time.
The specific functional implementation manners of the cancellation data determining module 11, the target predicting module 12, the target adjusting module 13, and the echo cancellation module 14 may refer to step S101-step S104 in the corresponding embodiment of fig. 3, and are not described herein.
Referring to fig. 7, the echo prediction data at the second time includes an echo path at the second time;
the cancellation data determination module 11 includes:
a first data fusion unit 111, configured to perform data fusion processing on the play data at the first time and the echo path at the second time, so as to obtain initial predicted echo data at the first time;
the first cancellation unit 112 is configured to perform cancellation processing on the acquired data at the first time based on the initial predicted echo data at the first time, so as to obtain initial cancellation data at the first time.
The specific functional implementation manner of the first data fusion unit 111 and the first cancellation unit 112 may refer to step S101 in the corresponding embodiment of fig. 3, which is not described herein.
Referring to fig. 7, the echo prediction data at the second time includes an echo prediction increment at the second time;
the target prediction module 12 includes:
a splicing unit 121, configured to splice the play data at the first time, the initial cancellation data at the first time, and the echo prediction increment at the second time to obtain target combination data for performing prediction processing;
the prediction unit 122 is configured to input the hidden state of the target filtering model at the second time and the target combined data into the target filtering model for prediction processing, so as to obtain gain data at the first time and the hidden state of the target filtering model at the first time.
The specific functional implementation manner of the stitching unit 121 and the prediction unit 122 may refer to step S102 in the corresponding embodiment of fig. 3, which is not described herein.
Referring to fig. 7, the echo prediction data at the second time includes an echo path at the second time;
the target adjustment module 13 includes:
an increment obtaining unit 131, configured to obtain an echo prediction increment at the first time based on the gain data at the first time and the initial cancellation data at the first time;
the increment adjustment unit 132 is configured to perform increment adjustment on the echo path at the second time based on the echo prediction increment at the first time, so as to obtain the echo path at the first time.
The specific functional implementation manner of the increment acquiring unit 131 and the increment adjusting unit 132 may refer to step S103 in the corresponding embodiment of fig. 3, and will not be described herein.
Referring again to fig. 7, the echo cancellation module 14 includes:
a second data fusion unit 141, configured to perform data fusion processing on the echo path at the first time and the play data at the first time, so as to obtain target predicted echo data at the first time;
the second cancellation unit 142 is configured to perform cancellation processing on the acquired data at the first time based on the target predicted echo data at the first time, so as to obtain echo cancellation data at the first time.
The specific functional implementation manner of the second data fusion unit 141 and the second cancellation unit 142 may refer to step S104 in the corresponding embodiment of fig. 3, which is not described herein.
Referring to fig. 7, the data processing apparatus 1 further includes:
the transformation module 15 is configured to obtain full play data and full collection data, perform Fourier transform on the full play data to obtain frequency domain play data corresponding to the full play data, and perform Fourier transform on the full collection data to obtain frequency domain collection data corresponding to the full collection data;
the cancellation data determining module 11 is specifically configured to obtain play data at a first moment from the play data in the frequency domain, and obtain collected data at the first moment from the collected data in the frequency domain; determining initial elimination data of the first moment according to the play data of the first moment, the acquisition data of the first moment and the echo path of the second moment;
the echo cancellation module 14 is specifically configured to perform echo cancellation on the acquired data at the first moment based on the echo path at the first moment and the play data at the first moment, so as to obtain frequency domain cancellation data at the first moment; and performing inverse Fourier transform processing on the frequency domain elimination data at the first moment to obtain echo elimination data at the first moment.
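The frequency-domain round trip performed by the transformation module and the echo cancellation module (forward Fourier transform of the full signals, per-frame processing, inverse transform of the cancellation result) can be sketched as follows. This is a bare-bones assumption-laden sketch: the function names are made up, and windowing/overlap-add — which a real frequency-domain AEC would use — is omitted for brevity.

```python
import numpy as np

def to_frequency_frames(signal, frame_len=256):
    # Split the full time-domain data into non-overlapping frames and apply
    # a real FFT to each frame (windowing and overlap-add omitted for brevity).
    n = len(signal) // frame_len
    frames = np.reshape(signal[: n * frame_len], (n, frame_len))
    return np.fft.rfft(frames, axis=1)

def from_frequency_frames(spec):
    # Inverse FFT of each frame, concatenated back into a time-domain signal.
    return np.fft.irfft(spec, axis=1).reshape(-1)

x_full = np.random.default_rng(0).standard_normal(1024)  # stand-in for full play data
X = to_frequency_frames(x_full)       # frequency domain play data, one row per frame
x_rec = from_frequency_frames(X)      # inverse transform recovers the signal
```

Echo cancellation would operate on each row of `X` (one time index t per frame) before the inverse transform produces the time-domain cancellation data.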
The specific functional implementation manner of the transformation module 15 may refer to step S104 in the corresponding embodiment of fig. 3, and will not be described herein.
Further, referring to fig. 8, fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. The data processing means may be a computer program (comprising program code) running in a computer device, for example the data processing means is an application software; the device can be used for executing corresponding steps in the method provided by the embodiment of the application. As shown in fig. 8, the data processing apparatus 2 may include: sample data determination module 21, initial prediction module 22, sample adjustment module 23, sample elimination module 24, and model adjustment module 25.
The sample data determining module 21 is configured to determine initial sample elimination data at a first time according to sample play data at the first time, sample collection data at the first time, and sample prediction data at a second time; the first time is adjacent to the second time, and the second time is smaller than the first time;
the initial prediction module 22 is configured to input the sample play data at the first time, the sample cancel data at the first time, and the sample prediction data at the second time into an initial filtering model to perform prediction, so as to obtain sample gain data at the first time;
The sample adjustment module 23 is configured to adjust sample prediction data at a second time based on sample gain data at a first time and initial sample cancellation data at the first time, so as to obtain a sample path at the first time;
the sample cancellation module 24 is configured to perform echo cancellation on the sample collection data at the first moment based on the sample path at the first moment and the sample play data at the first moment, so as to obtain sample echo cancellation data at the first moment;
the model adjustment module 25 is configured to perform parameter adjustment on the initial filtering model according to the sample echo cancellation data at the first time and the sample tag corresponding to the sample echo cancellation data at the first time, so as to obtain a target filtering model.
The specific functional implementation manners of the sample data determining module 21, the initial predicting module 22, the sample adjusting module 23, the sample eliminating module 24 and the model adjusting module 25 may be referred to as step S301-step S305 in the corresponding embodiment of fig. 5, and will not be described herein.
Wherein the data processing device 2 further comprises:
the corpus acquisition module 26 is configured to acquire corpus data, and acquire full sample playing data and full sample labels from the corpus data; the full sample play data comprises sample play data at a first moment; the full sample tag comprises a sample tag corresponding to sample echo cancellation data at a first moment;
An echo production module 27, configured to simulate an acoustic propagation response, generate simulated echo data, and acquire a simulated echo path from the simulated echo data;
a convolution module 28, configured to convolve the full-sample playing data with the simulated echo path to obtain full-sample echo data;
the mixing module 29 is configured to perform a mixing process on the full-sample echo data and the full-sample play data to obtain full-sample acquisition data; the full sample acquisition data includes sample acquisition data at a first time.
The specific functional implementation manners of the corpus obtaining module 26, the echo producing module 27, the convolution module 28 and the mixing module 29 may refer to step S301 in the corresponding embodiment of fig. 5, and are not described herein. In addition, the description of the beneficial effects of the same method is omitted.
Further, referring to fig. 9, fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 9, the computer device 1000 may include: at least one processor 1001, such as a CPU, at least one network interface 1004, a user interface 1003, a memory 1005, at least one communication bus 1002. Wherein the communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a Display (Display), a Keyboard (Keyboard), and the network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), among others. The memory 1005 may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one disk memory. The memory 1005 may also optionally be at least one storage device located remotely from the aforementioned processor 1001. As shown in fig. 9, the memory 1005, which is one type of computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application.
In the computer device 1000 shown in fig. 9, the network interface 1004 may provide network communication functions; while user interface 1003 is primarily used as an interface for providing input to a user; and the processor 1001 may be used to invoke a device control application stored in the memory 1005 to implement:
determining initial elimination data at a first moment according to the play data at the first moment, the acquisition data at the first moment and the echo prediction data at a second moment; the first time is adjacent to the second time, and the second time is smaller than the first time; inputting the playing data at the first moment, the initial elimination data at the first moment and the echo prediction data at the second moment into a target filtering model for prediction to obtain gain data at the first moment; based on the gain data at the first moment and the initial elimination data at the first moment, adjusting the echo prediction data at the second moment to obtain an echo path at the first moment; and carrying out echo cancellation on the acquired data at the first moment based on the echo path at the first moment and the play data at the first moment to obtain echo cancellation data at the first moment.
The processor 1001 may also be used to invoke a device control application stored in the memory 1005 to implement:
Determining initial sample elimination data at a first moment according to sample play data at the first moment, sample collection data at the first moment and sample prediction data at a second moment; the first time is adjacent to the second time, and the second time is smaller than the first time; inputting the sample play data at the first moment, the sample elimination data at the first moment and the sample prediction data at the second moment into an initial filtering model for prediction to obtain sample gain data at the first moment; based on the sample gain data at the first moment and the initial sample elimination data at the first moment, sample prediction data at the second moment are adjusted to obtain a sample path at the first moment; performing echo cancellation on the sample acquisition data at the first moment based on the sample path at the first moment and the sample play data at the first moment to obtain sample echo cancellation data at the first moment; and according to the sample echo cancellation data at the first moment and the sample label corresponding to the sample echo cancellation data at the first moment, carrying out parameter adjustment on the initial filtering model to obtain the target filtering model.
It should be understood that the computer device 1000 described in the embodiments of the present application may perform the description of the data processing method in the embodiments corresponding to fig. 2, 3, 4a, 4b, 5 and 6, the description of the data processing apparatus 1 in the embodiment corresponding to fig. 7, and the description of the data processing apparatus 2 in the embodiment corresponding to fig. 8, which will not be repeated herein. In addition, the description of the beneficial effects of the same method is omitted.
The embodiment of the present application further provides a computer readable storage medium, where a computer program is stored, where the computer program includes program instructions, where the program instructions, when executed by a processor, implement a data processing method provided by each step in fig. 2, fig. 3, fig. 4a, fig. 4b, fig. 5, and fig. 6, and specifically refer to an implementation manner provided by each step in fig. 2, fig. 3, fig. 4a, fig. 4b, fig. 5, and fig. 6, which is not described herein again. In addition, the description of the beneficial effects of the same method is omitted.
The computer readable storage medium may be the data processing apparatus provided in any one of the foregoing embodiments or an internal storage unit of the computer device, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card (flash card) or the like, which are provided on the computer device. Further, the computer-readable storage medium may also include both internal storage units and external storage devices of the computer device. The computer-readable storage medium is used to store the computer program and other programs and data required by the computer device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device can execute the data processing method in the embodiments corresponding to fig. 2, 3, 4a, 4b, 5 and 6, which are not described herein. In addition, the description of the beneficial effects of the same method is omitted.
The term "comprising" and any variations thereof in the description of embodiments of the application and in the claims and drawings is intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, article, or device that comprises a list of steps or elements is not limited to the list of steps or modules but may, in the alternative, include other steps or modules not listed or inherent to such process, method, apparatus, article, or device.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The method and related apparatus provided in the embodiments of the present application are described with reference to the flowchart and/or schematic structural diagrams of the method provided in the embodiments of the present application, and each flow and/or block of the flowchart and/or schematic structural diagrams of the method may be implemented by computer program instructions, and combinations of flows and/or blocks in the flowchart and/or block diagrams. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or structural diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or structures.
The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.

Claims (13)

1. A method of data processing, comprising:
determining initial elimination data of a first moment according to play data of the first moment, acquisition data of the first moment and echo prediction data of a second moment; the first moment is adjacent to the second moment, and the second moment is earlier than the first moment;
inputting the play data at the first moment, the initial elimination data at the first moment and the echo prediction data at the second moment into a target filtering model for prediction to obtain gain data at the first moment;
adjusting the echo prediction data at the second moment based on the gain data at the first moment and the initial elimination data at the first moment, to obtain an echo path at the first moment;
and performing echo cancellation on the acquisition data at the first moment based on the echo path at the first moment and the play data at the first moment, to obtain echo cancellation data at the first moment.
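For illustration only, the per-moment recursion of claim 1 can be sketched as follows for one frequency-domain frame. All names are hypothetical, and the NLMS-style step in `nlms_gain` merely stands in for the learned target filtering model:

```python
import numpy as np

def process_frame(X, D, W_prev, gain_model):
    """X: play data (freq bins), D: acquisition data, W_prev: echo path at t-1."""
    E0 = D - W_prev * X                  # initial elimination data at moment t
    G = gain_model(X, E0, W_prev)        # gain data at moment t (per bin)
    delta = G * E0 * np.conj(X)          # echo prediction increment at moment t
    W = W_prev + delta                   # echo path at moment t
    E = D - W * X                        # echo cancellation data at moment t
    return E, W

def nlms_gain(X, E0, W_prev, mu=0.5, eps=1e-8):
    # Stand-in for the neural gain: a classic normalized step size (assumption).
    return mu / (np.abs(X) ** 2 + eps)
```

Each call consumes the echo path from the previous moment and returns the cancelled frame together with the updated path, matching the ordering of the four claimed steps.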
2. The method of claim 1, wherein the echo prediction data at the second moment comprises an echo path at the second moment;
the determining the initial elimination data of the first moment according to the play data of the first moment, the acquisition data of the first moment and the echo prediction data of the second moment includes:
performing data fusion processing on the play data at the first moment and the echo path at the second moment to obtain initial predicted echo data at the first moment;
and performing elimination processing on the acquisition data at the first moment based on the initial predicted echo data at the first moment to obtain initial elimination data at the first moment.
3. The method of claim 1, wherein the echo prediction data at the second moment comprises an echo prediction increment at the second moment; the inputting the play data at the first moment, the initial elimination data at the first moment and the echo prediction data at the second moment into a target filtering model for prediction to obtain gain data at the first moment comprises:
splicing the play data at the first moment, the initial elimination data at the first moment and the echo prediction increment at the second moment to obtain target combination data for prediction processing;
inputting the hidden state of the target filtering model at the second moment and the target combination data into the target filtering model for prediction processing to obtain gain data at the first moment and the hidden state of the target filtering model at the first moment; the hidden state at the first moment is used for predicting gain data at a third moment; the first moment is adjacent to the third moment, and the first moment is earlier than the third moment.
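A minimal sketch of the stateful prediction in claim 3, assuming a single-layer recurrent cell over real-valued features. `TinyGainRNN` is an illustrative placeholder, not the actual target filtering model, and it emits one scalar gain for brevity where the real model would emit per-bin gain data:

```python
import numpy as np

class TinyGainRNN:
    """Placeholder recurrent cell: splices the three inputs, consumes the
    hidden state from the previous moment, and emits a gain plus the hidden
    state for the next moment. Weights and sizes are illustrative."""
    def __init__(self, in_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.Wx = rng.standard_normal((hidden_dim, in_dim)) * 0.1
        self.Wh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
        self.Wo = rng.standard_normal((1, hidden_dim)) * 0.1

    def step(self, play, init_cancel, increment, h_prev):
        x = np.concatenate([play, init_cancel, increment])  # splicing step
        h = np.tanh(self.Wx @ x + self.Wh @ h_prev)         # hidden state at t
        gain = 1.0 / (1.0 + np.exp(-(self.Wo @ h)))         # gain in (0, 1)
        return gain, h                                      # h feeds moment t+1
```

Carrying `h` from one call to the next is what lets the gain at moment t+1 depend on everything seen up to moment t, as the claim requires.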
4. The method of claim 1, wherein the echo prediction data at the second moment comprises an echo path at the second moment;
the adjusting the echo prediction data at the second time based on the gain data at the first time and the initial cancellation data at the first time to obtain an echo path at the first time includes:
acquiring an echo prediction increment of the first moment based on the gain data of the first moment and the initial elimination data of the first moment;
and performing increment adjustment on the echo path at the second moment based on the echo prediction increment at the first moment to obtain the echo path at the first moment.
5. The method of claim 1, wherein the performing echo cancellation on the acquisition data at the first moment based on the echo path at the first moment and the play data at the first moment to obtain echo cancellation data at the first moment includes:
performing data fusion processing on the echo path at the first moment and the play data at the first moment to obtain target predicted echo data at the first moment;
and performing elimination processing on the acquisition data at the first moment based on the target predicted echo data at the first moment to obtain echo cancellation data at the first moment.
6. The method according to claim 1, further comprising:
acquiring full play data and full acquisition data, performing Fourier transform on the full play data to obtain frequency domain play data corresponding to the full play data, and performing Fourier transform on the full acquisition data to obtain frequency domain acquisition data corresponding to the full acquisition data;
the determining the initial elimination data of the first moment according to the play data of the first moment, the acquisition data of the first moment and the echo prediction data of the second moment includes:
acquiring play data at a first moment from the frequency domain play data, and acquiring acquisition data at the first moment from the frequency domain acquisition data;
determining initial elimination data of a first moment according to play data of the first moment, acquisition data of the first moment and an echo path of a second moment;
the performing echo cancellation on the acquisition data at the first moment based on the echo path at the first moment and the play data at the first moment to obtain echo cancellation data at the first moment includes:
performing echo cancellation on the acquisition data at the first moment based on the echo path at the first moment and the play data at the first moment to obtain frequency domain elimination data at the first moment;
and performing inverse Fourier transform processing on the frequency domain elimination data at the first moment to obtain echo cancellation data at the first moment.
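The frequency-domain wrapper of claim 6 can be sketched as follows, under the assumption of non-overlapping fixed-length frames; `cancel_frame` is a hypothetical callback standing in for the per-moment cancellation of claims 1-5:

```python
import numpy as np

def cancel_echo(play, capture, frame_len, cancel_frame):
    """Fourier-transform each frame, cancel in the frequency domain,
    then inverse-transform back to the time domain."""
    n_frames = min(len(play), len(capture)) // frame_len
    out = np.zeros(n_frames * frame_len)
    state = None                               # carried across moments
    for i in range(n_frames):
        sl = slice(i * frame_len, (i + 1) * frame_len)
        X = np.fft.rfft(play[sl])              # frequency domain play data
        D = np.fft.rfft(capture[sl])           # frequency domain acquisition data
        E, state = cancel_frame(X, D, state)   # frequency domain elimination data
        out[sl] = np.fft.irfft(E, n=frame_len) # echo cancellation data (time domain)
    return out
```

A production system would typically use overlapped, windowed frames; the rectangular non-overlapping framing here is a simplification.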
7. A method of data processing, comprising:
determining initial sample elimination data at a first moment according to sample play data at the first moment, sample acquisition data at the first moment and sample prediction data at a second moment; the first moment is adjacent to the second moment, and the second moment is earlier than the first moment;
inputting the sample play data at the first moment, the initial sample elimination data at the first moment and the sample prediction data at the second moment into an initial filtering model for prediction to obtain sample gain data at the first moment;
adjusting the sample prediction data at the second moment based on the sample gain data at the first moment and the initial sample elimination data at the first moment, to obtain a sample path at the first moment;
performing echo cancellation on the sample acquisition data at the first moment based on the sample path at the first moment and the sample play data at the first moment to obtain sample echo cancellation data at the first moment;
and according to the sample echo cancellation data at the first moment and the sample label corresponding to the sample echo cancellation data at the first moment, carrying out parameter adjustment on the initial filtering model to obtain a target filtering model.
8. The method as recited in claim 7, further comprising:
acquiring corpus data, and acquiring full sample play data and full sample labels from the corpus data; the full sample play data comprises the sample play data at the first moment; the full sample labels comprise the sample label corresponding to the sample echo cancellation data at the first moment;
simulating an acoustic propagation response to generate simulated echo data, and acquiring a simulated echo path from the simulated echo data;
convolving the full sample play data with the simulated echo path to obtain full sample echo data;
mixing the full sample echo data and the full sample play data to obtain full sample acquisition data; the full sample acquisition data includes sample acquisition data for the first time instant.
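The sample construction of claim 8 can be sketched as follows. The exponentially decaying noise burst standing in for the simulated acoustic propagation response is an assumption (real pipelines would use an acoustic simulator such as an image-source method), as is mixing the echo with the labelled near-end signal from the corpus:

```python
import numpy as np

def make_training_pair(play, near_end, rir_len=128, decay=0.02, seed=0):
    """play: full sample play data; near_end: clean corpus signal (the label)."""
    rng = np.random.default_rng(seed)
    # Simulated echo path: exponentially decaying random impulse response.
    rir = rng.standard_normal(rir_len) * np.exp(-decay * np.arange(rir_len))
    echo = np.convolve(play, rir)[: len(play)]   # full sample echo data
    capture = echo + near_end                    # full sample acquisition data
    return capture, echo
```

The resulting acquisition data contains the sample acquisition data at every moment, and subtracting the echo recovers the label exactly, which is what the trained model is asked to approximate.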
9. A data processing apparatus, comprising:
the cancellation data determining module is used for determining initial cancellation data of the first moment according to the play data of the first moment, the acquisition data of the first moment and the echo prediction data of the second moment; the first moment is adjacent to the second moment, and the second moment is earlier than the first moment;
the target prediction module is used for inputting the play data at the first moment, the initial elimination data at the first moment and the echo prediction data at the second moment into a target filtering model for prediction to obtain gain data at the first moment;
the target adjustment module is used for adjusting the echo prediction data at the second moment based on the gain data at the first moment and the initial elimination data at the first moment to obtain an echo path at the first moment;
and the echo cancellation module is used for performing echo cancellation on the acquisition data at the first moment based on the echo path at the first moment and the play data at the first moment to obtain echo cancellation data at the first moment.
10. A data processing apparatus, comprising:
the sample data determining module is used for determining initial sample elimination data at the first moment according to the sample play data at the first moment, the sample acquisition data at the first moment and the sample prediction data at the second moment; the first moment is adjacent to the second moment, and the second moment is earlier than the first moment;
the initial prediction module is used for inputting the sample play data at the first moment, the initial sample elimination data at the first moment and the sample prediction data at the second moment into an initial filtering model for prediction to obtain sample gain data at the first moment;
the sample adjustment module is used for adjusting the sample prediction data at the second moment based on the sample gain data at the first moment and the initial sample elimination data at the first moment to obtain a sample path at the first moment;
the sample elimination module is used for performing echo cancellation on the sample acquisition data at the first moment based on the sample path at the first moment and the sample play data at the first moment to obtain sample echo cancellation data at the first moment;
and the model adjustment module is used for carrying out parameter adjustment on the initial filtering model according to the sample echo cancellation data at the first moment and the sample label corresponding to the sample echo cancellation data at the first moment to obtain a target filtering model.
11. A computer device, comprising: a processor, a memory, and a network interface;
the processor is connected to the memory and the network interface, the network interface is configured to provide data communication functions, the memory is configured to store a computer program, and the processor is configured to invoke the computer program to cause the computer device to perform the method of any of claims 1-6 or the method of any of claims 7-8.
12. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program adapted to be loaded and executed by a processor, to cause a computer device having the processor to perform the method of any of claims 1-6 or the method of any of claims 7-8.
13. A computer program product, characterized in that the computer program product comprises a computer program stored in a computer readable storage medium, the computer program being adapted to be read and executed by a processor to cause a computer device having the processor to perform the steps of the method according to any one of claims 1-6 or the steps of the method according to any one of claims 7-8.
CN202210831478.3A 2022-07-15 2022-07-15 Data processing method, device, equipment, storage medium and program product Pending CN117012217A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210831478.3A CN117012217A (en) 2022-07-15 2022-07-15 Data processing method, device, equipment, storage medium and program product


Publications (1)

Publication Number Publication Date
CN117012217A (en) 2023-11-07

Family

ID=88574983

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210831478.3A Pending CN117012217A (en) 2022-07-15 2022-07-15 Data processing method, device, equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN117012217A (en)

Similar Documents

Publication Publication Date Title
CN111179961B (en) Audio signal processing method and device, electronic equipment and storage medium
CN110808063A (en) Voice processing method and device for processing voice
JP6936298B2 (en) Methods and devices for controlling changes in the mouth shape of 3D virtual portraits
CN107799126A (en) Sound end detecting method and device based on Supervised machine learning
CN109036460A (en) Method of speech processing and device based on multi-model neural network
WO2022012206A1 (en) Audio signal processing method, device, equipment, and storage medium
WO2022094293A1 (en) Deep-learning based speech enhancement
CN111343410A (en) Mute prompt method and device, electronic equipment and storage medium
JP2012134923A (en) Apparatus, method and program for processing sound
CN113763977A (en) Method, apparatus, computing device and storage medium for eliminating echo signal
CN113228162A (en) Context-based speech synthesis
Shankar et al. Efficient two-microphone speech enhancement using basic recurrent neural network cell for hearing and hearing aids
CN115148197A (en) Voice wake-up method, device, storage medium and system
CN113823273A (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
CN114792524B (en) Audio data processing method, apparatus, program product, computer device and medium
CN111353258A (en) Echo suppression method based on coding and decoding neural network, audio device and equipment
CN113168831A (en) Audio pipeline for simultaneous keyword discovery, transcription and real-time communication
CN117012217A (en) Data processing method, device, equipment, storage medium and program product
CN116030823A (en) Voice signal processing method and device, computer equipment and storage medium
CN113763978B (en) Voice signal processing method, device, electronic equipment and storage medium
CN116868265A (en) System and method for data enhancement and speech processing in dynamic acoustic environments
CN116508099A (en) Deep learning-based speech enhancement
CN113113038A (en) Echo cancellation method and device and electronic equipment
CN112750449A (en) Echo cancellation method, device, terminal, server and storage medium
CN117219107B (en) Training method, device, equipment and storage medium of echo cancellation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination