CN113299306B - Echo cancellation method, echo cancellation device, electronic equipment and computer-readable storage medium - Google Patents

Echo cancellation method, echo cancellation device, electronic equipment and computer-readable storage medium

Info

Publication number
CN113299306B
CN113299306B
Authority
CN
China
Prior art keywords
features
spectrogram
scale
end signal
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110847066.4A
Other languages
Chinese (zh)
Other versions
CN113299306A (en)
Inventor
马路
杨嵩
王心恬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd filed Critical Beijing Century TAL Education Technology Co Ltd
Priority to CN202110847066.4A
Publication of CN113299306A
Application granted
Publication of CN113299306B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech

Abstract

The present disclosure provides an echo cancellation method, apparatus, electronic device, and computer-readable storage medium: receiving a near-end mixed signal and a far-end signal of a corresponding reference channel; encoding the near-end mixed signal and the far-end signal respectively to obtain an encoded near-end mixed signal spectrogram and an encoded far-end signal spectrogram, and splicing the two to obtain a spliced spectrogram; extracting multi-scale features from the spliced spectrogram; extracting depth features from the encoded near-end mixed signal spectrogram; calculating the weight of each layer of features of the multi-scale features from the depth features; weighting the corresponding features with the weight of each layer to obtain combined multi-scale features; and obtaining a near-end signal estimate from the combined multi-scale features and the depth features. Echo is thereby effectively cancelled in scenarios such as voice interaction and voice calls.

Description

Echo cancellation method, echo cancellation device, electronic equipment and computer-readable storage medium
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to an echo cancellation method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Echo cancellation was first applied in audio telephony systems. At the two ends of a call, the sound from one end is transmitted over the line and played through the loudspeaker at the other end. The microphone at that end picks up not only the direct sound played by the loudspeaker but also, owing to reflections from the floor, walls, and other objects in the room, various reflected sounds, and this mixture is transmitted back to the speaking end. This is the so-called acoustic echo problem, which disturbs people's conversation and reduces system quality, and is a common problem in communication networks. In an intelligent voice device, the audio played by the device is likewise picked up by the device's own microphone, so the same echo problem exists; if it is not cancelled, it degrades the audio quality, which in turn lowers the speech recognition rate and harms the user experience.
In scenarios such as voice interaction and voice calls, echo cancellation performance directly affects the back-end speech recognition rate and the user's listening experience, and is a key core technology of speech processing.
Disclosure of Invention
According to an aspect of the present disclosure, there is provided an echo cancellation method, including:
receiving a near-end mixed signal and a far-end signal of a corresponding reference channel;
respectively encoding the near-end mixed signal and the far-end signal to obtain an encoded near-end mixed signal spectrogram and an encoded far-end signal spectrogram, and splicing the encoded near-end mixed signal spectrogram and the encoded far-end signal spectrogram to obtain a spliced spectrogram;
extracting multi-scale features according to the spliced spectrogram;
extracting depth features according to the encoded near-end mixed signal spectrogram;
calculating a weight of each layer of features of the multi-scale features according to the depth features;
weighting the corresponding features by using the weight of each layer of features to obtain combined multi-scale features;
and acquiring near-end signal estimation according to the combined multi-scale features and the depth features.
According to another aspect of the present disclosure, there is provided an echo cancellation device including:
the receiving module is used for receiving the near-end mixed signal and a far-end signal of a corresponding reference channel;
the encoding module is used for respectively encoding the near-end mixed signal and the far-end signal to obtain an encoded near-end mixed signal spectrogram and an encoded far-end signal spectrogram, and splicing the encoded near-end mixed signal spectrogram and the encoded far-end signal spectrogram to obtain a spliced spectrogram;
the first extraction module is used for extracting multi-scale features according to the spliced spectrogram;
the second extraction module is used for extracting depth features according to the coded near-end mixed signal spectrogram;
a calculation module for calculating a weight of each layer of features of the multi-scale features according to the depth features;
the weighting module is used for weighting the corresponding features by utilizing the weight of each layer of features to obtain the combined multi-scale features;
and the obtaining module is used for obtaining near-end signal estimation according to the combined multi-scale features and the depth features.
According to another aspect of the present disclosure, there is provided an electronic device including:
a processor; and
a memory for storing a program,
wherein the program comprises instructions which, when executed by the processor, cause the processor to carry out the method according to any one of the above aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to any one of the above aspects.
One or more technical solutions provided in the embodiments of the present disclosure can implement effective echo cancellation in scenarios such as voice interaction and voice call.
Drawings
Further details, features and advantages of the disclosure are disclosed in the following description of exemplary embodiments, taken in conjunction with the accompanying drawings, in which:
fig. 1 shows a flow chart of an echo cancellation method according to an exemplary embodiment of the present disclosure;
FIG. 2 illustrates a schematic diagram of an echo cancellation network structure according to an exemplary embodiment of the present disclosure;
FIG. 3 illustrates an echo cancellation data preparation diagram according to an exemplary embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of a 1-D Conv Block model structure according to an exemplary embodiment of the present disclosure;
fig. 5 shows a schematic block diagram of an echo cancellation device according to an exemplary embodiment of the present disclosure;
FIG. 6 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description. It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
In scenarios such as voice interaction and voice calls, echo cancellation performance directly affects the back-end speech recognition rate and the user's listening experience, and is a key core technology of speech processing. The currently common approach is the WebRTC method: first, a time-delay estimation algorithm aligns the near-end and far-end data; then, an adaptive filter estimates the echo, thereby cancelling the linear echo; finally, nonlinear processing suppresses the residual echo. Although the nonlinear processing can suppress the residual echo to a certain extent, the suppression is limited and some residual echo remains, especially in complex environments; moreover, the filter cannot quickly track changes in the room impulse response, which degrades the final echo cancellation effect and, in turn, the performance of the whole sound-signal processing chain.
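To make the conventional pipeline concrete, below is a minimal sketch of the adaptive-filter stage in Python/NumPy: a normalized LMS (NLMS) filter that estimates the linear echo of the far-end signal in the microphone signal. The filter length, step size, and signal names are illustrative assumptions, not taken from this disclosure.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, filter_len=256, mu=0.5, eps=1e-8):
    """Sketch of an NLMS adaptive filter: estimates the linear echo of
    `far_end` contained in `mic` and returns the residual signal
    (the near-end estimate). All parameters are illustrative."""
    w = np.zeros(filter_len)       # adaptive filter taps
    buf = np.zeros(filter_len)     # most recent far-end samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_end[n]
        echo_hat = w @ buf                        # linear echo estimate
        e = mic[n] - echo_hat                     # residual = near-end estimate
        w += mu * e * buf / (buf @ buf + eps)     # normalized LMS update
        out[n] = e
    return out
```

Such a filter handles only the linear part of the echo, which is why the nonlinear processing stage, with the limitations noted above, is still required.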
In view of the above problems, the present embodiment provides an echo cancellation method that can be used in smart phones, portable tablet computers, and other smart devices (electronic devices) with voice processing capability. Fig. 1 shows a flowchart of an echo cancellation method according to an exemplary embodiment of the present disclosure; as shown in fig. 1, the flow includes the following steps:
step S101, receiving the near-end mixed signal and the far-end signal of the corresponding reference channel. In particular, the near-end mixed signal received by the near-end microphone and the far-end signal of the reference channel may be directly input.
Step S102, the near-end mixed signal and the far-end signal are respectively encoded to obtain an encoded near-end mixed signal spectrogram and an encoded far-end signal spectrogram, and the encoded near-end mixed signal spectrogram and the encoded far-end signal spectrogram are spliced to obtain a spliced spectrogram.
And S103, extracting multi-scale features according to the spliced spectrogram. For example, a dilated time-domain convolutional network may be employed to extract the multi-scale features. Those skilled in the art should understand that the manner of extracting the multi-scale features is not limited to this embodiment; other manners also fall within the scope of this embodiment according to actual needs.
And step S104, extracting depth features according to the coded near-end mixed signal spectrogram.
And step S105, calculating the weight of each layer of features of the multi-scale features according to the depth features, so as to boost the important features in the multi-scale features while suppressing the unimportant ones.
And step S106, weighting the corresponding features by using the weight of each layer of features to obtain the combined multi-scale features, whereby the important features in the multi-scale features are enhanced and the unimportant features are suppressed.
And S107, acquiring near-end signal estimation according to the combined multi-scale features and depth features.
Through the above steps, the method is implemented in an end-to-end manner: multi-scale features of the input audio are extracted and used to perform echo cancellation, giving a strong echo suppression capability with little loss of spectrum.
In some optional embodiments, as shown in fig. 2, the spliced spectrogram may be input to a multi-scale feature extraction module in the canceller module, where the multi-scale feature extraction module is formed by multiple groups of dilated convolutions, each group of dilated convolutions includes multiple convolution blocks, and the multi-scale feature extraction module extracts the multi-scale features of each layer according to the spliced spectrogram.
In some optional embodiments, as shown in fig. 2, the encoded near-end mixed signal spectrogram may be input to a first long-short term memory network in the canceller module, and the depth features may be extracted by the first long-short term memory network according to the encoded near-end mixed signal spectrogram.
Step S105 above involves calculating the weight of each layer of features of the multi-scale features according to the depth features. In some alternative embodiments, as shown in fig. 2, the depth features may be used as the query, each layer of features of the multi-scale features may be used as the key and the value, and the weight of each layer of features of the multi-scale features may be calculated by a multi-head attention mechanism.
Step S106 above involves weighting the corresponding features by the weight of each layer of features to obtain the combined multi-scale features. In some alternative embodiments, as shown in fig. 2, the combined multi-scale features may be obtained by multiplying the weight of each layer of features with the corresponding features and superposing the results through the multi-head attention mechanism.
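As an illustrative sketch of this attention step (assuming PyTorch; the shapes, and the trick of treating the J layer features as the attention sequence at each time step, are my assumptions about one possible realization):

```python
import torch
import torch.nn as nn

B, T, E, J, H = 8, 100, 128, 24, 4     # assumed batch, frames, dims, layers, heads

attn = nn.MultiheadAttention(embed_dim=E, num_heads=H, batch_first=True)

depth_feat = torch.randn(B, T, E)      # query: depth features from the LSTM
multi_scale = torch.randn(B, J, T, E)  # keys/values: one feature map per layer

# Fold time into the batch so attention runs over the J layers at each frame.
q = depth_feat.reshape(B * T, 1, E)                         # (B*T, 1, E)
kv = multi_scale.permute(0, 2, 1, 3).reshape(B * T, J, E)   # (B*T, J, E)

# Each head softmax-weights the J layer features and sums them, i.e.
# "multiply the weight of each layer by its features and superpose".
merged, layer_weights = attn(q, kv, kv)
merged = merged.reshape(B, T, E)       # combined multi-scale features
```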
Step S107 above involves obtaining a near-end signal estimate according to the combined multi-scale feature and the depth feature, and in some alternative embodiments, as shown in fig. 2, the combined multi-scale feature and the depth feature may be spliced and then input to a second long-short term memory network in the canceller module to obtain the near-end signal estimate.
In some alternative embodiments, as shown in fig. 2, the combined multi-scale features and the near-end signal estimate are spliced and input to a classifier, and the classifier determines whether a far-end signal and/or a near-end signal is present.
The data required for training the canceller module is prepared as shown in fig. 3, and the canceller module is trained by the following steps: voices of different persons are selected from a database as a near-end signal sample (near-end) and a far-end signal sample (far-end); the far-end signal sample is processed in turn by a nonlinear processing module (NLP) and a room impulse response (RIR), which respectively simulate the nonlinearity introduced by the loudspeaker and the reverberation introduced by the environment, thereby yielding an echo signal sample (echo); the near-end signal sample and the echo signal sample are superposed, together with a certain amount of noise, to obtain the near-end mixed signal sample (mix) received by the near-end microphone; and the canceller module is trained with the near-end mixed signal sample and the far-end signal sample as its input and the near-end signal sample as the learning target of its minimum mean square error loss function.
In some alternative embodiments, as shown in fig. 3, training of the canceller module continues as follows: the energy of the echo signal sample and the energy of the near-end signal sample are calculated and each compared with a predetermined threshold value to obtain a first value and a second value, which serve as the double-end detection result label. For example, an energy above the predetermined threshold gives "1" and an energy below it gives "0", yielding four double-end detection classes: silence only ("00"), far-end signal only ("01"), near-end signal only ("10"), and signals at both ends ("11"). The near-end mixed signal sample and the far-end signal sample are taken as the input of the canceller module, and the double-end detection result label (class) is taken as the learning target of the cross-entropy loss function of the canceller module.
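A minimal sketch of how such double-end detection labels could be generated from frame energies (the frame length and threshold are illustrative assumptions; the disclosure only specifies a comparison against a predetermined threshold):

```python
import numpy as np

def double_talk_labels(near, echo, frame=512, thresh=1e-4):
    """Per-frame 2-bit labels: high bit = near-end active, low bit =
    far-end (echo) active, giving 0b00 silence, 0b01 far-end only,
    0b10 near-end only, 0b11 both ends present."""
    n_frames = min(len(near), len(echo)) // frame
    labels = np.zeros(n_frames, dtype=np.int64)
    for k in range(n_frames):
        seg = slice(k * frame, (k + 1) * frame)
        near_on = np.mean(near[seg] ** 2) > thresh   # near-end energy test
        echo_on = np.mean(echo[seg] ** 2) > thresh   # echo energy test
        labels[k] = (int(near_on) << 1) | int(echo_on)
    return labels
```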
The network has two training targets. One concerns the accuracy of the near-end signal estimate: the target is to minimize the mean square error (MSE) between the near-end signal estimate and the true near-end signal, defined as follows:

$$L_{\mathrm{MSE}} = \frac{1}{T}\sum_{t=1}^{T}\left(\hat{s}(t) - s(t)\right)^{2}$$

where $\hat{s}$ and $s$ are the estimated near-end speech signal and the true near-end signal, respectively.
Another learning objective is classification, the objective being to minimize the cross-entropy loss function between the estimated classification and the true label classification, namely:
$$L_{\mathrm{CE}} = -\sum_{c=1}^{C} y_{c} \log \hat{y}_{c}$$

where $\hat{y}_{c}$ represents the class distribution probability after Softmax as estimated by the network, $y_{c}$ represents the true distribution probability of the category (i.e., the label distribution), and C denotes the number of categories.
The total network loss function is a weighted combination of the classification cross-entropy loss function and the MSE loss function, namely:

$$L = L_{\mathrm{MSE}} + \lambda \log L_{\mathrm{CE}}$$

where λ is a weight coefficient that balances the two tasks of classification and separation, and the logarithm of the classification cross entropy is taken to keep the two loss functions in the same order of magnitude.
In some optional embodiments, as shown in fig. 2, the near-end signal estimation is input to the mask estimation module to obtain a mask value of each time-frequency point of a pure near-end signal in the near-end mixed signal, the mask value of each time-frequency point is multiplied by the encoded near-end mixed signal spectrogram to obtain a near-end signal spectrogram, and the near-end signal spectrogram is input to the one-dimensional convolutional decoder to obtain a time-domain waveform of the near-end signal.
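A minimal sketch of this mask estimation and masking step, using the PReLU / 1×1 Conv / Sigmoid composition described later in this disclosure (the channel counts are illustrative assumptions):

```python
import torch
import torch.nn as nn

E_ch, F_ch, T = 128, 256, 100   # assumed LSTM-feature channels, encoder channels, frames

mask_estimator = nn.Sequential(
    nn.PReLU(),
    nn.Conv1d(E_ch, F_ch, kernel_size=1),  # 1x1 Conv maps to the encoder dimension
    nn.Sigmoid(),                          # mask value in (0, 1) per time-frequency point
)

near_est = torch.randn(1, E_ch, T)   # near-end signal estimate from the second LSTM
mix_spec = torch.randn(1, F_ch, T)   # encoded near-end mixed signal spectrogram

mask = mask_estimator(near_est)
near_spec = mask * mix_spec          # spectrogram of the near-end signal
```

The decoding of `near_spec` to a time-domain waveform is sketched after the Decoder description below.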
Reference is now made in detail to fig. 2 for a number of complete alternative embodiments.
The echo cancellation network with its main functional modules is shown in fig. 2 and mainly includes 4 modules: an audio encoding module (Encoder), an audio decoding module (Decoder), a canceller module (Canceller), and a classifier module (Classifier).
The audio coding module (Encoder) is a one-dimensional convolution module.
The canceller: comprises layer normalization, a one-dimensional convolution, two LSTM layers, multiple groups of dilated convolution layers, and an attention mechanism module. Each group of dilated convolutions contains X one-dimensional convolution blocks (1-D Conv Block), with the dilation rate of the i-th block increasing exponentially as 2^(i-1) (i = 1, …, X). The amount of zero padding depends on whether the convolution is causal: dilation × (kernel_size - 1) in the causal case, and dilation × (kernel_size - 1)/2 in the non-causal case. The structure of each 1-D Conv Block is shown in FIG. 4. Assume that the features produced by the multi-scale feature extraction are represented as follows:
$$\mathbf{F}_{j} \in \mathbb{R}^{S \times T}, \quad j = 1, \dots, J$$

where S represents the feature dimension of each layer's output, T represents the number of time steps, and J = M × R is the total number of layers, M being the number of layers in each group of stacked dilated convolutions and R the number of repeated groups (each group containing M layers).
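A simplified sketch of collecting the J = M × R per-layer features from stacked dilated convolutions (plain ReLU convolutions stand in for the full 1-D Conv Blocks described below; the channel count and kernel width are assumptions):

```python
import torch
import torch.nn as nn

E_ch, M, R, k = 128, 8, 3, 3   # assumed channels; M layers per group, R groups

blocks = nn.ModuleList([
    nn.Conv1d(E_ch, E_ch, k, dilation=2 ** i, padding=(k - 1) * 2 ** i // 2)
    for _ in range(R) for i in range(M)   # dilation 2^0 .. 2^(M-1), repeated R times
])

x = torch.randn(1, E_ch, 100)  # spliced spectrogram after the bottleneck layer
layer_feats = []               # one feature map per layer, J = M * R in total
for conv in blocks:
    x = torch.relu(conv(x))
    layer_feats.append(x)

multi_scale = torch.stack(layer_feats, dim=1)   # (batch, J, E_ch, T)
```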
The attention mechanism: the similarity between the depth features of the near-end mixed signal extracted by the LSTM and the features extracted by each layer of the multiple groups of dilated convolutions is calculated to obtain the weight of the corresponding layer; the weights are then multiplied by the features of the corresponding layers and the results superposed to obtain the weighted depth features. The attention mechanism adopts the standard multi-head attention, namely:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_{1}, \dots, \mathrm{head}_{H})\, W^{O}$$

$$\mathrm{head}_{h} = \mathrm{Attention}\left(Q W_{h}^{Q},\; K W_{h}^{K},\; V W_{h}^{V}\right)$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{F}}\right) V$$

where Q, K, and V respectively represent the query, key, and value of the attention mechanism; Q is the feature extracted by the LSTM, and K and V are the multi-layer features extracted by the multi-scale feature extraction module; $W_{h}^{Q}$, $W_{h}^{K}$, $W_{h}^{V}$, and $W^{O}$ represent the mapping matrices in the attention mechanism; F represents the dimension used in the attention computation; and H denotes the number of heads of the multi-head attention mechanism.
LSTM layer: the first LSTM extracts the depth features of the near-end mixed signal; the second LSTM calculates the depth feature of the near-end speech signal according to the depth feature of the near-end mixed signal extracted by the first LSTM and the depth feature obtained by the attention mechanism.
The mask module: consists of a PReLU activation function, a one-dimensional convolutional layer (1×1 Conv), and a Sigmoid activation function; it obtains the mask of the near-end speech within the near-end mixed signal from the depth features of the near-end speech signal estimated by the LSTM.
The classifier: consists of a linear layer and a Softmax layer; it estimates, for each time step, the probability that signals are present at the near end and the far end, based on the depth features obtained by the attention mechanism and the near-end speech features obtained by the LSTM.
The Decoder: consists of a transposed convolution network, which deconvolves the input to obtain a time-domain signal.
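A minimal encoder/decoder pair matching these descriptions (the channel count F, kernel size L, and 50% stride are illustrative assumptions):

```python
import torch
import torch.nn as nn

F_ch, L_k, stride = 256, 32, 16   # assumed F channels, kernel L, 50% overlap

encoder = nn.Conv1d(1, F_ch, kernel_size=L_k, stride=stride)           # waveform -> "spectrogram"
decoder = nn.ConvTranspose1d(F_ch, 1, kernel_size=L_k, stride=stride)  # "spectrogram" -> waveform

wave = torch.randn(1, 1, 16000)   # 1 s of 16 kHz audio, shape (batch, 1, samples)
spec = torch.relu(encoder(wave))  # learned two-dimensional representation
recon = decoder(spec)             # deconvolved back to a time-domain signal
```

Applying the mask sketched earlier to `spec` before calling `decoder` yields the near-end waveform.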
The network structure configuration is shown in Table 1, where F represents the number of Encoder output channels; L represents the Encoder convolution kernel size; E is the number of bottleneck-layer output channels; each group of the multi-scale feature extraction contains M 1-D Conv Blocks, and R groups are stacked in total; the classifier has 2 × E input channels and C output channels, that is, the audio is divided into C categories; and the number of Masking output channels is F.
TABLE 1
[Table 1, the network structure configuration, is provided as an image in the original publication; its contents are not reproduced in the text.]
The 1-D Conv Block model structure is shown in FIG. 4. A conventional convolution is split into a pointwise convolution and a depthwise convolution, with a parametric rectified linear unit (PReLU) as the activation function (its expression is given below), and the data is normalized after each convolution. Finally, the output is split into two paths, each undergoing a dimension transformation through a 1×1 Conv: the output branch is superposed with the input as a residual connection so that the network depth can be increased, while the output of the skip branch serves as the output feature of the block and is spliced with the features of the subsequently stacked blocks before being sent to the classifier.
$$\mathrm{PReLU}(x) = \begin{cases} x, & x \geq 0 \\ \alpha x, & x < 0 \end{cases}$$

where α is a learnable parameter.
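A sketch of such a block (depthwise-separable convolution with PReLU, normalization, and residual plus skip outputs; all sizes are assumptions, and GroupNorm with one group stands in for the layer normalization):

```python
import torch
import torch.nn as nn

class ConvBlock1D(nn.Module):
    """1-D Conv Block sketch: pointwise then depthwise convolution,
    PReLU + normalization after each, with a residual branch added
    to the input and a skip branch as the block's output feature."""
    def __init__(self, ch=128, hid=256, k=3, dilation=1):
        super().__init__()
        pad = (k - 1) * dilation // 2                 # non-causal "same" padding
        self.pointwise = nn.Conv1d(ch, hid, 1)        # point-by-point convolution
        self.depthwise = nn.Conv1d(hid, hid, k, dilation=dilation,
                                   padding=pad, groups=hid)
        self.act1, self.act2 = nn.PReLU(), nn.PReLU()
        self.norm1, self.norm2 = nn.GroupNorm(1, hid), nn.GroupNorm(1, hid)
        self.res_out = nn.Conv1d(hid, ch, 1)          # residual branch (1x1 Conv)
        self.skip_out = nn.Conv1d(hid, ch, 1)         # skip branch (1x1 Conv)

    def forward(self, x):
        y = self.norm1(self.act1(self.pointwise(x)))
        y = self.norm2(self.act2(self.depthwise(y)))
        return x + self.res_out(y), self.skip_out(y)  # (residual out, skip out)
```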
To ensure that the separation network is insensitive to the amplitude of the input speech, normalization of the input features is required before performing multi-scale mapping.
In a non-real-time scenario, the layer normalization may adopt global layer normalization, that is: the characteristics are normalized in both the channel and the time domain, and the expression is as follows:
$$\mathrm{gLN}(F) = \frac{F - \mathrm{E}[F]}{\sqrt{\mathrm{Var}[F] + \epsilon}} \odot \gamma + \beta$$

where F represents the features, with E[F] and Var[F] computed over both the channel and time dimensions, γ and β are trainable parameters, and ε denotes the stability factor.
In a real-time scenario, the layer normalization may be cumulative layer normalization, that is, layer normalization is carried out over the continuously input features, with the following expression:
$$\mathrm{cLN}(f_{k}) = \frac{f_{k} - \mathrm{E}[f_{t \leq k}]}{\sqrt{\mathrm{Var}[f_{t \leq k}] + \epsilon}} \odot \gamma + \beta$$

where $f_{k}$ represents the features of the k-th frame, $f_{t \leq k}$ represents the features of the consecutive frames up to and including frame k, γ and β are trainable parameters, and ε denotes the stability factor.
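A sketch of the two normalization variants as I read the formulas above (the ε default and the shapes of γ and β are assumptions):

```python
import torch

def global_layer_norm(feat, gamma, beta, eps=1e-8):
    """gLN: normalize over both channel and time dimensions.
    feat: (B, C, T); gamma, beta: trainable, shape (1, C, 1)."""
    mean = feat.mean(dim=(1, 2), keepdim=True)
    var = feat.var(dim=(1, 2), keepdim=True, unbiased=False)
    return (feat - mean) / torch.sqrt(var + eps) * gamma + beta

def cumulative_layer_norm(feat, gamma, beta, eps=1e-8):
    """cLN: at frame k, normalize with statistics of frames 1..k only,
    so the operation is usable in a real-time (streaming) setting."""
    B, C, T = feat.shape
    csum = feat.cumsum(dim=2).sum(dim=1, keepdim=True)          # running sum
    csum2 = (feat ** 2).cumsum(dim=2).sum(dim=1, keepdim=True)  # running sum of squares
    count = C * torch.arange(1, T + 1, device=feat.device).view(1, 1, T)
    mean = csum / count
    var = csum2 / count - mean ** 2
    return (feat - mean) / torch.sqrt(var + eps) * gamma + beta
```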
The Encoder converts the one-dimensional time-domain input audio into a two-dimensional spectrogram; the near-end mixed signal (mix) and the far-end signal (far-end) are passed through the Encoder to obtain their spectrograms and then sent to the canceller module.

The canceller first normalizes the input amplitude by layer normalization, then compresses the input dimension with a one-dimensional convolution (the bottleneck layer), and finally splices the near-end mixed signal and the far-end signal before sending them to the multi-scale feature extraction module, which extracts features at multiple scales from the input and splices the features extracted at each scale into a multi-scale feature group.

Meanwhile, after the dimension compression, a Long Short-Term Memory network (LSTM) extracts depth features from the near-end mixed signal. These depth features serve as the query of the attention mechanism (Attention): their similarity with each layer of the extracted multi-scale features is computed to obtain the weight of each layer, and each layer's features are then weighted. When computing the attention, the features extracted by the LSTM are used as the query, and the features extracted by each layer of the multi-scale feature extraction module are used as the key and the value. A standard multi-head attention mechanism calculates the weight of each layer of the multi-scale features, and the weights are multiplied by the corresponding features and superposed to obtain the combined multi-scale features.

The combined multi-scale features and the near-end mixed signal features extracted by the LSTM are spliced and sent into another LSTM to obtain the estimate of the near-end signal features. This estimate is spliced with the output of the attention mechanism and sent to the classifier to judge whether signals are present at the near end and the far end.

The near-end estimated features output by the LSTM are sent into the mask estimation module (composed of a PReLU activation function, a one-dimensional convolution (1-D Conv), and a Sigmoid activation function) to obtain the mask value of each time-frequency point of the pure near-end signal within the near-end mixed signal. The mask is multiplied by the encoded spectrogram of the near-end mixed signal to obtain the spectrogram of the near-end signal, which is sent into the Decoder formed by one-dimensional (transposed) convolution to obtain the corresponding time-domain waveform of the near-end signal.
In this embodiment, an echo cancellation device is further provided, and the echo cancellation device is used to implement the foregoing embodiments and preferred embodiments, which have already been described and are not described again. As used hereinafter, the term "module" is a combination of software and/or hardware that can implement a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
The present embodiment provides an echo cancellation device, as shown in fig. 5, including:
a receiving module 51, configured to receive the near-end mixed signal and a far-end signal of a corresponding reference channel;
the encoding module 52 is configured to encode the near-end mixed signal and the far-end signal respectively to obtain an encoded near-end mixed signal spectrogram and an encoded far-end signal spectrogram, and splice the encoded near-end mixed signal spectrogram and the encoded far-end signal spectrogram to obtain a spliced spectrogram;
a first extraction module 53, configured to extract multi-scale features according to the spliced spectrogram;
a second extraction module 54, configured to extract a depth feature according to the encoded near-end mixed signal spectrogram;
a calculating module 55, configured to calculate a weight of each layer feature of the multi-scale features according to the depth features;
the weighting module 56 is configured to perform weighting processing on the corresponding features by using the weight of each layer of features to obtain combined multi-scale features;
an obtaining module 57, configured to obtain a near-end signal estimate according to the combined multi-scale features and the depth features.
The echo cancellation device in this embodiment is presented in the form of functional units, where a unit may be an ASIC circuit, a processor and memory executing one or more software or firmware programs, and/or another device that can provide the above-described functionality.
Further functional descriptions of the modules are the same as those of the corresponding embodiments, and are not repeated herein.
An exemplary embodiment of the present disclosure also provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor; when executed by the at least one processor, the computer program causes the electronic device to perform a method according to an embodiment of the present disclosure.
The disclosed exemplary embodiments also provide a non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform a method according to an embodiment of the present disclosure.
The exemplary embodiments of the present disclosure also provide a computer program product comprising a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform a method according to an embodiment of the present disclosure.
Referring to fig. 6, a block diagram of a structure of an electronic device 600, which may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the electronic device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The calculation unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Various components in the electronic device 600 are connected to the I/O interface 605, including: an input unit 606, an output unit 607, a storage unit 608, and a communication unit 609. The input unit 606 may be any type of device capable of inputting information to the electronic device 600; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device. The output unit 607 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 608 may include, but is not limited to, magnetic or optical disks. The communication unit 609 allows the electronic device 600 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunications networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as a Bluetooth™ device, a WiFi device, a WiMax device, a cellular communication device, and/or the like.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 601 performs the respective methods and processes described above. For example, in some embodiments, the echo cancellation method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 600 via the ROM 602 and/or the communication unit 609. In some embodiments, the computing unit 601 may be configured to perform the echo cancellation method in any other suitable way (e.g. by means of firmware).
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Claims (9)

1. An echo cancellation method, comprising:
receiving a near-end mixed signal and a far-end signal of a corresponding reference channel;
respectively encoding the near-end mixed signal and the far-end signal to obtain an encoded near-end mixed signal spectrogram and an encoded far-end signal spectrogram, and splicing the encoded near-end mixed signal spectrogram and the encoded far-end signal spectrogram to obtain a spliced spectrogram;
extracting multi-scale features according to the spliced spectrogram;
extracting depth features according to the encoded near-end mixed signal spectrogram; extracting depth features according to the encoded near-end mixed signal spectrogram comprises the following steps: inputting the encoded near-end mixed signal spectrogram into a first long-short term memory network in a canceller module, and extracting the depth features by the first long-short term memory network according to the encoded near-end mixed signal spectrogram;
calculating a weight of each layer of features of the multi-scale features according to the depth features; calculating a weight for each layer feature of the multi-scale features from the depth features comprises: taking the depth feature as a query, taking each layer of feature of the multi-scale feature as a key and a value, and calculating the weight of each layer of feature of the multi-scale feature by using a multi-head attention mechanism;
weighting the corresponding features by using the weight of each layer of features to obtain combined multi-scale features; weighting the corresponding features by using the weight of each layer of features to obtain the combined multi-scale features, wherein the step of weighting the corresponding features comprises the following steps: multiplying and superposing the weight of each layer of features and the corresponding features through the multi-head attention mechanism to obtain the combined multi-scale features;
obtaining near-end signal estimation according to the combined multi-scale features and the depth features; obtaining a near-end signal estimate from the merged multi-scale features and the depth features comprises: after the merged multi-scale features and the depth features are spliced, inputting the spliced multi-scale features and the spliced depth features into a second long-short term memory network in a canceller module to obtain the near-end signal estimation;
inputting the near-end signal estimation to a mask estimation module to obtain a mask value of each time frequency point of pure near-end signals in the near-end mixed signals;
multiplying the mask value of each time-frequency point with the encoded near-end mixed signal spectrogram to obtain a near-end signal spectrogram;
and inputting the near-end signal spectrogram into a one-dimensional convolution decoder to obtain a time domain waveform of the near-end signal.
2. The echo cancellation method of claim 1, wherein extracting multi-scale features from the spliced spectrogram comprises:
inputting the spliced spectrogram into a multi-scale feature extraction module in a canceller module; the multi-scale feature extraction module is composed of multiple groups of dilated convolutions, and each group of dilated convolutions comprises a plurality of convolution blocks; and extracting the multi-scale features of each layer by the multi-scale feature extraction module according to the spliced spectrogram.
3. The echo cancellation method of claim 1, wherein obtaining a near-end signal estimate based on the combined multi-scale features and the depth features comprises:
and splicing the combined multi-scale features and the depth features, and inputting the combined multi-scale features and the depth features into a second long-short term memory network in a canceller module to obtain the near-end signal estimation.
4. The echo cancellation method of claim 1, wherein the method further comprises:
splicing the combined multi-scale features and the near-end signal estimation and then inputting the spliced multi-scale features and the near-end signal estimation into a classifier; and judging whether a far-end signal or a near-end signal exists by the classifier.
5. The echo cancellation method of claim 3 or 4, wherein the canceller module is trained by:
selecting voices of different people from a database as a near-end signal sample and a far-end signal sample respectively;
sequentially processing the far-end signal sample by a nonlinear processing module and a room impulse response to obtain an echo signal sample;
superposing the near-end signal sample and the echo signal sample to obtain a near-end mixed signal sample;
and taking the near-end mixed signal sample and the far-end signal sample as the input of a canceller module, taking the near-end signal sample as a learning target of a minimum mean square error loss function of the canceller module, and training the canceller module.
6. The echo cancellation method of claim 5, wherein the canceller module is trained by:
calculating the energy of the echo signal sample and the energy of the near-end signal sample;
respectively comparing the energy of the echo signal sample and the energy of the near-end signal sample with a preset threshold value to obtain a first numerical value and a second numerical value which are used as double-end detection result labels;
and taking the near-end mixed signal sample and the far-end signal sample as the input of a canceller module, and taking the double-end detection result label as a learning target of a cross entropy loss function of the canceller module.
7. An echo cancellation device, comprising:
the receiving module is used for receiving the near-end mixed signal and a far-end signal of a corresponding reference channel;
the encoding module is used for respectively encoding the near-end mixed signal and the far-end signal to obtain an encoded near-end mixed signal spectrogram and an encoded far-end signal spectrogram, and splicing the encoded near-end mixed signal spectrogram and the encoded far-end signal spectrogram to obtain a spliced spectrogram;
the first extraction module is used for extracting multi-scale features according to the spliced spectrogram;
the second extraction module is used for extracting depth features according to the coded near-end mixed signal spectrogram; extracting depth features according to the encoded near-end mixed signal spectrogram comprises the following steps: inputting the encoded near-end mixed signal spectrogram into a first long-short term memory network in a canceller module, and extracting the depth features by the first long-short term memory network according to the encoded near-end mixed signal spectrogram;
a calculation module for calculating a weight of each layer of features of the multi-scale features according to the depth features; calculating a weight for each layer feature of the multi-scale features from the depth features comprises: taking the depth feature as a query, taking each layer of feature of the multi-scale feature as a key and a value, and calculating the weight of each layer of feature of the multi-scale feature by using a multi-head attention mechanism;
the weighting module is used for weighting the corresponding features by utilizing the weight of each layer of features to obtain the combined multi-scale features; weighting the corresponding features by using the weight of each layer of features to obtain the combined multi-scale features, wherein the step of weighting the corresponding features comprises the following steps: multiplying and superposing the weight of each layer of features and the corresponding features through the multi-head attention mechanism to obtain the combined multi-scale features;
an obtaining module, configured to obtain a near-end signal estimate according to the merged multi-scale feature and the depth feature; obtaining a near-end signal estimate from the merged multi-scale features and the depth features comprises: after the merged multi-scale features and the depth features are spliced, inputting the spliced multi-scale features and the spliced depth features into a second long-short term memory network in a canceller module to obtain the near-end signal estimation; inputting the near-end signal estimation to a mask estimation module to obtain a mask value of each time frequency point of pure near-end signals in the near-end mixed signals; multiplying the mask value of each time-frequency point with the encoded near-end mixed signal spectrogram to obtain a near-end signal spectrogram; and inputting the near-end signal spectrogram into a one-dimensional convolution decoder to obtain a time domain waveform of the near-end signal.
8. An electronic device, comprising:
a processor; and
a memory for storing a program,
wherein the program comprises instructions which, when executed by the processor, cause the processor to carry out the method according to any one of claims 1-6.
9. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
CN202110847066.4A 2021-07-27 2021-07-27 Echo cancellation method, echo cancellation device, electronic equipment and computer-readable storage medium Active CN113299306B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110847066.4A CN113299306B (en) 2021-07-27 2021-07-27 Echo cancellation method, echo cancellation device, electronic equipment and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110847066.4A CN113299306B (en) 2021-07-27 2021-07-27 Echo cancellation method, echo cancellation device, electronic equipment and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN113299306A CN113299306A (en) 2021-08-24
CN113299306B true CN113299306B (en) 2021-10-15

Family

ID=77331041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110847066.4A Active CN113299306B (en) 2021-07-27 2021-07-27 Echo cancellation method, echo cancellation device, electronic equipment and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN113299306B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113870874A (en) * 2021-09-23 2021-12-31 武汉大学 Multi-feature fusion echo cancellation method and system based on self-attention transformation network
CN117219107B (en) * 2023-11-08 2024-01-30 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium of echo cancellation model

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7171008B2 (en) * 2002-02-05 2007-01-30 Mh Acoustics, Llc Reducing noise in audio systems
CN109905793A (en) * 2019-02-21 2019-06-18 电信科学技术研究院有限公司 A kind of wind noise suppression method and device
CN110503972A (en) * 2019-08-26 2019-11-26 北京大学深圳研究生院 Sound enhancement method, system, computer equipment and storage medium
CN111862962A (en) * 2020-07-20 2020-10-30 汪秀英 Voice recognition method and system
CN112687288A (en) * 2021-03-12 2021-04-20 北京世纪好未来教育科技有限公司 Echo cancellation method, echo cancellation device, electronic equipment and readable storage medium
CN112989107A (en) * 2021-05-18 2021-06-18 北京世纪好未来教育科技有限公司 Audio classification and separation method and device, electronic equipment and storage medium
CN112989106A (en) * 2021-05-18 2021-06-18 北京世纪好未来教育科技有限公司 Audio classification method, electronic device and storage medium
CN113160839A (en) * 2021-04-16 2021-07-23 电子科技大学 Single-channel speech enhancement method based on adaptive attention mechanism and progressive learning

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7171008B2 (en) * 2002-02-05 2007-01-30 Mh Acoustics, Llc Reducing noise in audio systems
CN109905793A (en) * 2019-02-21 2019-06-18 电信科学技术研究院有限公司 A kind of wind noise suppression method and device
CN110503972A (en) * 2019-08-26 2019-11-26 北京大学深圳研究生院 Sound enhancement method, system, computer equipment and storage medium
CN111862962A (en) * 2020-07-20 2020-10-30 汪秀英 Voice recognition method and system
CN112687288A (en) * 2021-03-12 2021-04-20 北京世纪好未来教育科技有限公司 Echo cancellation method, echo cancellation device, electronic equipment and readable storage medium
CN113160839A (en) * 2021-04-16 2021-07-23 电子科技大学 Single-channel speech enhancement method based on adaptive attention mechanism and progressive learning
CN112989107A (en) * 2021-05-18 2021-06-18 北京世纪好未来教育科技有限公司 Audio classification and separation method and device, electronic equipment and storage medium
CN112989106A (en) * 2021-05-18 2021-06-18 北京世纪好未来教育科技有限公司 Audio classification method, electronic device and storage medium

Also Published As

Publication number Publication date
CN113299306A (en) 2021-08-24

Similar Documents

Publication Publication Date Title
US11894014B2 (en) Audio-visual speech separation
US10373609B2 (en) Voice recognition method and apparatus
US11948552B2 (en) Speech processing method, apparatus, electronic device, and computer-readable storage medium
CN110956957B (en) Training method and system of speech enhancement model
CN110211575B (en) Voice noise adding method and system for data enhancement
CN108417224B (en) Training and recognition method and system of bidirectional neural network model
CN113299306B (en) Echo cancellation method, echo cancellation device, electronic equipment and computer-readable storage medium
CN104766608A (en) Voice control method and voice control device
CN114974280A (en) Training method of audio noise reduction model, and audio noise reduction method and device
CN113241085B (en) Echo cancellation method, device, equipment and readable storage medium
CN113345460B (en) Audio signal processing method, device, equipment and storage medium
US20230317096A1 (en) Audio signal processing method and apparatus, electronic device, and storage medium
CN106033673B (en) A kind of near-end voice signals detection method and device
CN114338623B (en) Audio processing method, device, equipment and medium
CN114495977B (en) Speech translation and model training method, device, electronic equipment and storage medium
WO2020015546A1 (en) Far-field speech recognition method, speech recognition model training method, and server
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
CN111353258A (en) Echo suppression method based on coding and decoding neural network, audio device and equipment
CN112133324A (en) Call state detection method, device, computer system and medium
CN113516995B (en) Sound processing method and device
CN114373473A (en) Simultaneous noise reduction and dereverberation through low-delay deep learning
CN112750469A (en) Method for detecting music in voice, voice communication optimization method and corresponding device
US20240005908A1 (en) Acoustic environment profile estimation
CN113763978B (en) Voice signal processing method, device, electronic equipment and storage medium
EP4350695A1 (en) Apparatus, methods and computer programs for audio signal enhancement using a dataset

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant