CN112289334B - Reverberation elimination method and device - Google Patents

Reverberation elimination method and device Download PDF

Info

Publication number
CN112289334B
CN112289334B CN202011588741.8A
Authority
CN
China
Prior art keywords
spectrogram
voice
reverberation
feature
phase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011588741.8A
Other languages
Chinese (zh)
Other versions
CN112289334A (en)
Inventor
邓峰
姜涛
王晓瑞
李岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202011588741.8A priority Critical patent/CN112289334B/en
Publication of CN112289334A publication Critical patent/CN112289334A/en
Application granted granted Critical
Publication of CN112289334B publication Critical patent/CN112289334B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L21/0232 Processing in the frequency domain
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application relates to the technical field of voice processing, and discloses a reverberation elimination method and device for solving the problem that voice signals with a long reverberation time are difficult to eliminate. The method comprises the following steps: generating a spectrogram and a phase spectrogram corresponding to the audio, wherein each frame on the spectrogram corresponds to one voice signal group; extracting features from each frame to obtain the corresponding voice features, and determining a context association vector for each voice feature; determining a voice masking estimation value for each voice signal on the spectrogram based on each voice feature and the corresponding context association vector, and performing a reverberation elimination operation on the spectrogram according to the voice masking estimation values to obtain a dereverberated spectrogram; and finally obtaining dereverberated audio by using the dereverberated spectrogram and the phase spectrogram. By adding attention to the input voice feature map, each voice feature on the map can be made to be dominated by either clean voice information or reverberant voice information, so that clean voice and reverberant voice can be distinguished.

Description

Reverberation elimination method and device
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method and an apparatus for eliminating reverberation.
Background
Reverberation is a common acoustic phenomenon in daily life. When sound waves propagate indoors, they are repeatedly reflected by obstacles such as walls, ceilings and floors, and a portion of their energy is absorbed at each reflection, so even after the sound source stops, the sound persists for some time before it decays through repeated reflection and absorption. In scenarios such as voice recognition, video recording and hearing aids, this lingering sound makes speech difficult to hear clearly, degrades product performance and harms the user experience.
At present, reverberant audio can be processed with either signal-processing methods or neural-network methods. However, both approaches can only filter out speech components with a short reverberation time; speech components with a long reverberation time are difficult to eliminate effectively, so the dereverberation effect is poor.
In view of the above, a new reverberation cancellation method and apparatus are needed to overcome the above-mentioned drawbacks.
Disclosure of Invention
The embodiment of the application provides a method and a device for eliminating reverberation, which are used for solving the problem that a voice signal with longer reverberation time is difficult to eliminate.
The embodiment of the application provides the following specific technical scheme:
in a first aspect, an embodiment of the present application provides a reverberation cancellation method, including:
performing time-frequency conversion processing on audio to obtain a spectrogram and a phase spectrogram, wherein each frame on the spectrogram corresponds to a voice signal group;
extracting features of each frame to obtain corresponding voice features, and determining context association vectors of the voice features, wherein one context association vector represents the correlation between one voice feature and each voice feature;
determining a voice masking estimation value of each voice signal on the spectrogram according to each voice feature and the corresponding context association vector, and performing reverberation elimination operation on the spectrogram according to each voice masking estimation value to obtain a dereverberated spectrogram, wherein one voice masking estimation value represents the probability of predicting one voice signal to contain reverberation;
and performing time-frequency conversion inverse processing by using the dereverberation spectrogram and the phase spectrogram to obtain dereverberation audio.
Optionally, performing time-frequency conversion on the audio to obtain a spectrogram and a phase spectrogram, including:
obtaining initial voice signals in different frames by performing windowing and framing operation on the audio;
fourier transform is carried out on each initial voice signal to obtain a corresponding spectrogram and a corresponding phase spectrogram;
and splicing the spectrograms according to a time sequence to obtain the spectrogram, and splicing the phase spectrograms of the frames according to a time sequence to obtain the phase spectrogram.
Optionally, when determining the context association vector of each speech feature, the method specifically includes, for any speech feature:
calculating attention weight of any one voice feature to each voice feature;
and generating a context association vector of any one voice feature based on each voice feature and the corresponding attention weight.
Optionally, calculating the attention weight of the arbitrary speech feature to each speech feature includes:
and weighting the query vector of any one voice feature and the key vector of each voice feature to obtain the attention weight of each voice feature.
Optionally, the performing a time-frequency transform inverse process by using the dereverberated spectrogram and the phase spectrogram to obtain dereverberated audio includes:
obtaining a Fourier coefficient without reverberation according to the spectrogram without reverberation and the phase spectrogram;
and performing inverse Fourier transform operation on the de-reverberated spectrogram by using the de-reverberated Fourier coefficient to obtain the de-reverberated audio.
In a second aspect, an embodiment of the present application further provides a reverberation cancellation device, including:
the first processing unit is configured to perform time-frequency conversion processing on audio to obtain a spectrogram and a phase spectrogram, wherein each frame on the spectrogram corresponds to one voice signal group;
the second processing unit is configured to perform feature extraction on each frame to obtain corresponding voice features, and determine context association vectors of each voice feature, wherein one context association vector represents the correlation between one voice feature and each voice feature;
determining a voice masking estimation value of each voice signal on the spectrogram according to each voice feature and the corresponding context association vector, and performing reverberation elimination operation on the spectrogram according to each voice masking estimation value to obtain a dereverberated spectrogram, wherein one voice masking estimation value represents the probability of predicting one voice signal to contain reverberation;
a reverberation elimination unit configured to perform a time-frequency transform inverse process using the dereverberated spectrogram and the phase spectrogram, resulting in a dereverberated audio.
Optionally, the first processing unit is configured to:
obtaining initial voice signals in different frames by performing windowing and framing operation on the audio;
fourier transform is carried out on each initial voice signal to obtain a corresponding spectrogram and a corresponding phase spectrogram;
and splicing the spectrograms according to a time sequence to obtain the spectrogram, and splicing the phase spectrograms of the frames according to a time sequence to obtain the phase spectrogram.
Optionally, the second processing unit is configured to:
calculating attention weight of any one voice feature to each voice feature;
and generating a context association vector of any one voice feature based on each voice feature and the corresponding attention weight.
Optionally, the second processing unit is configured to:
and weighting the query vector of any one voice feature and the key vector of each voice feature to obtain the attention weight of each voice feature.
Optionally, the reverberation cancellation unit is configured to:
obtaining a Fourier coefficient without reverberation according to the spectrogram without reverberation and the phase spectrogram;
and performing inverse Fourier transform operation on the de-reverberated spectrogram by using the de-reverberated Fourier coefficient to obtain the de-reverberated audio.
In a third aspect, an embodiment of the present application further provides a computing device, including:
a memory for storing program instructions;
and the processor is used for calling the program instructions stored in the memory and executing any reverberation elimination method according to the obtained program.
In a fourth aspect, embodiments of the present application further provide a storage medium including computer-readable instructions, which when read and executed by a computing device, cause the computing device to perform any one of the reverberation cancellation methods described above.
The beneficial effects of this application are as follows:
in the embodiment of the application, a spectrogram and a phase spectrogram corresponding to an audio are generated, wherein each frame on the spectrogram corresponds to a voice signal group; extracting the features of each frame to obtain corresponding voice features, and determining context association vectors of the voice features; determining a voice masking estimation value of each voice signal on the spectrogram based on each voice feature and the corresponding context association vector, and performing reverberation elimination operation on the spectrogram according to each voice masking estimation value to obtain a dereverberation spectrogram; and finally, obtaining the audio frequency without reverberation by using the spectrogram and the phase spectrogram without reverberation. The context correlation degree reflects the correlation between one voice characteristic and each voice characteristic, and the voice characteristic on the voice characteristic diagram can be led to be dominated by clean voice information or reverberation voice information by adding attention to the input voice characteristic diagram, so that the clean voice and the reverberation voice are distinguished, and the voice signal with longer reverberation time is effectively screened and eliminated.
Drawings
Fig. 1 is a schematic diagram of an architecture of a reverberation cancellation model provided in an embodiment of the present application;
fig. 2 is a schematic flow chart of reverberation elimination provided by an embodiment of the present application;
FIG. 3a is a time domain diagram provided by an embodiment of the present application;
fig. 3b is a spectrum diagram provided in an embodiment of the present application;
FIG. 3c is a spectrogram provided in an embodiment of the present application;
fig. 4a is a schematic structural diagram of a residual error learning module according to an embodiment of the present application;
fig. 4b is a schematic structural diagram of a convolution module according to an embodiment of the present application;
fig. 4c is a schematic structural diagram of an attention module according to an embodiment of the present application;
fig. 4d is a schematic structural diagram of a deconvolution module according to an embodiment of the present application;
fig. 5 is a spectrogram of dereverberated audio provided by an embodiment of the present application;
fig. 6 is a schematic structural diagram of an apparatus for eliminating reverberation according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an apparatus of a computing device according to an embodiment of the present application.
Detailed Description
In order to solve the problem that it is difficult to eliminate a speech signal with a long reverberation time, a new technical scheme is provided in the embodiment of the present application. The scheme comprises the following steps: generating a spectrogram and a phase spectrogram corresponding to the audio, wherein each frame on the spectrogram corresponds to one voice signal group; extracting the features of each frame to obtain corresponding voice features, and determining context association vectors of the voice features; determining a voice masking estimation value of each voice signal on the spectrogram based on each voice feature and the corresponding context association vector, and performing reverberation elimination operation on the spectrogram according to each voice masking estimation value to obtain a dereverberation spectrogram; and finally, obtaining the audio frequency without reverberation by using the spectrogram and the phase spectrogram without reverberation.
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
In the embodiment of the application, the audio is first subjected to time-frequency conversion to obtain a spectrogram and a phase spectrogram; the trained reverberation elimination model is then used to determine a voice masking estimation value for each voice signal on the spectrogram; a reverberation elimination operation is performed on the input spectrogram by using these voice masking estimation values to obtain a dereverberated spectrogram; and the dereverberated audio is finally obtained from the dereverberated spectrogram and the phase spectrogram. Generally speaking, a voice masking estimation value can be greater than 1, so in order to facilitate convergence when training the reverberation elimination model, the embodiment of the application proposes a compression cost function, formula (1), which maps the target voice masking estimation value into the interval (0, 1). In formula (1), Q denotes the compression range and can take the value 1; C denotes the slope of the compression function and can take the value 0.5; the function takes the target voice masking estimation value, indexed by the time frame t and the frequency f, and outputs its compressed value.
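The exact expression of formula (1) appears only as an image in the original document, so the following Python sketch is a minimal illustration that assumes the commonly used sigmoid-style mask compression, with Q as the compression range and C as the slope described above; the names compress_mask and decompress_mask, and the symbol m for the voice masking estimation value, are illustrative assumptions rather than notation from the patent.

    import numpy as np

    def compress_mask(m, Q=1.0, C=0.5):
        # Map an unbounded mask estimate m(t, f) >= 0 into the interval (0, Q),
        # assuming a sigmoid-style compression with range Q and slope C.
        return Q * (1.0 - np.exp(-C * m)) / (1.0 + np.exp(-C * m))

    def decompress_mask(m_c, Q=1.0, C=0.5, eps=1e-8):
        # Inverse mapping (playing the role of formula (2) further below):
        # recover the mask estimate from its compressed value m_c in (0, Q).
        m_c = np.clip(m_c, eps, Q - eps)  # keep the logarithm argument positive
        return -(1.0 / C) * np.log((Q - m_c) / (Q + m_c))

    # Example: mask estimates are squashed into (0, 1) for training and recovered afterwards.
    m = np.array([0.2, 1.0, 3.0])
    m_c = compress_mask(m)
    print(m_c, decompress_mask(m_c))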
referring to fig. 1, the reverberation elimination model specifically includes a coding module, an attention module and a decoding module, where the coding module is configured to perform feature extraction on a spectrogram to obtain a speech feature corresponding to each frame on the spectrogram; the attention module is used for carrying out weighting processing on each voice feature so as to generate context association vectors of each voice feature, wherein one context association vector represents the correlation between one voice feature and each voice feature; the decoding module is used for determining a voice masking estimation value of each voice signal on the spectrogram according to each voice feature and the corresponding context association vector, and executing reverberation elimination operation on the spectrogram according to each voice masking estimation value to obtain the spectrogram without reverberation. Specifically, referring to fig. 2, a process of the reverberation elimination model eliminating the reverberation included in the audio is introduced.
S201: and performing time-frequency conversion processing on the audio to obtain a spectrogram and a phase spectrogram, wherein each frame on the spectrogram corresponds to one voice signal group.
A time domain diagram makes it possible to observe the shape of the voice signal visually, but the voice signal cannot be described accurately with a limited number of parameters in the time domain. A frequency domain diagram, on the other hand, decomposes a complex voice signal into a superposition of simple voice signals (sinusoidal signals), so the 'structure' of the voice signal can be understood more accurately and more information can be extracted for signal analysis. Therefore, when performing step S201, the audio needs to be subjected to time-frequency domain conversion in the following manner.
(1) The audio is divided into frames according to a preset frame interval.
The voice signal of the audio forms a continuous wave in the time domain (hereinafter referred to as a time domain diagram), as shown in fig. 3a, in which the horizontal axis represents time and the vertical axis represents amplitude. Since a speech signal has short-time stationarity, being non-stationary macroscopically but approximately stationary microscopically, the time domain diagram can be cut, according to the preset frame interval, into a plurality of short time segments, each of which is referred to as a frame, for processing. It should be noted that, in order to ensure the continuity of the synthesized signal, adjacent frames may overlap.
(2) Windowing is applied to each frame. Taking one frame as an example, the frame is multiplied by a window function so that the originally aperiodic speech signal exhibits some characteristics of a periodic function, which facilitates the subsequent Fourier transform operation.
(3) Short-time Fourier transform is performed on each windowed frame, and the result of the short-time Fourier transform is referred to as a frequency spectrum. Although the time domain diagram intuitively shows the amplitude of the speech signal at different time instants, it is difficult to extract much useful information from it for signal analysis. From the short-time Fourier transform formula it can be seen that the wave of one frame of audio is a superposition of several waves with different frequencies; therefore, the short-time Fourier transform can be used to map the audio from the time domain to the frequency domain, yielding the frequency domain diagram shown in fig. 3b and a corresponding phase spectrum, where the horizontal axis of the frequency domain diagram is frequency and its vertical axis is amplitude, and the horizontal axis of the phase spectrum is frequency and its vertical axis is phase, so that the speech signal can be analyzed better.
(4) The frequency spectra of the frames are spliced in time order to obtain the spectrogram, and the phase spectra of the frames are spliced in time order to obtain the phase spectrogram. The spectrogram is shown in fig. 3c, in which the horizontal axis represents time and the vertical axis represents frequency.
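As a concrete illustration of step S201, the following sketch uses librosa to perform the windowed framing and short-time Fourier transform and to split the result into a magnitude spectrogram and a phase spectrogram; the file name, sampling rate, FFT size and hop length are assumptions for the illustration, not values specified by the patent.

    import librosa
    import numpy as np

    # Load the reverberant audio (file name and sampling rate are assumed).
    audio, sr = librosa.load("reverberant.wav", sr=16000)

    # Windowed framing + short-time Fourier transform: each column of `stft`
    # corresponds to one frame, i.e. one voice signal group on the spectrogram.
    n_fft, hop_length = 512, 128  # assumed analysis parameters
    stft = librosa.stft(audio, n_fft=n_fft, hop_length=hop_length, window="hann")

    # The magnitude spectrogram is what the dereverberation model operates on;
    # the phase spectrogram is kept unchanged for the inverse transform in step S205.
    spectrogram = np.abs(stft)            # shape: (n_fft // 2 + 1, num_frames)
    phase_spectrogram = np.angle(stft)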
S202: and extracting the features of each frame to obtain the corresponding voice features.
The spectrogram is input into the encoding module for feature extraction to obtain a voice feature map, which contains the voice features of each frame. Specifically, the encoding module is composed of a residual learning module (Residual Learning Unit) and a convolution module (Convolution Unit). Generally speaking, as the depth of a network increases (that is, as the network contains more parameters), its nonlinear expression capability becomes stronger and it can learn more; in practice, however, once the number of network layers increases beyond a certain point, a deeper network actually classifies worse. Therefore, the embodiment of the application combines a residual learning module with a convolution module to solve this degradation problem of deep networks.
Referring to fig. 4a, the residual learning module contains two paths: one is the path through the main processing layers that performs feature extraction, and the other is a shortcut from the input layer directly to the output layer. From top to bottom, the main processing layers are a dilated convolution layer (Dilated Convolution Layer), a batch normalization layer (Batch Normalization Layer) and an activation function (Leaky ReLU). In order to enlarge the receptive field and let the network obtain wide context information, the embodiment of the application uses dilated convolution layers with dilation rates of (2, 2) and (4, 4) to perform the feature extraction operation on the input spectrogram. Referring to fig. 4b, the convolution module of the embodiment of the application is divided into three layers from top to bottom, which are, in order, a convolution layer (Convolution Layer), a batch normalization layer and an activation function.
As can be seen from fig. 1, the embodiment of the application stacks several residual learning modules and convolution modules, and each convolution layer uses a different convolution kernel and stride. This further enlarges the receptive field, increases the complexity and depth of the network, avoids the information loss that would be caused by using max pooling layers, and improves the prediction accuracy of the reverberation elimination model.
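The encoder components described above (a residual learning module built from a dilated convolution, batch normalization and Leaky ReLU, plus a plain convolution module) can be sketched in PyTorch as follows; the channel counts, kernel sizes, strides and dilation pairs are assumptions, since the patent only states that different kernels and strides are used.

    import torch
    import torch.nn as nn

    class ResidualLearningModule(nn.Module):
        # Dilated convolution -> batch normalization -> Leaky ReLU,
        # plus a shortcut that adds the input directly to the output.
        def __init__(self, channels, dilation=(2, 2)):
            super().__init__()
            self.main = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3,
                          padding=dilation, dilation=dilation),
                nn.BatchNorm2d(channels),
                nn.LeakyReLU(0.2),
            )

        def forward(self, x):
            return x + self.main(x)

    class ConvolutionModule(nn.Module):
        # Convolution -> batch normalization -> Leaky ReLU; the stride controls downsampling.
        def __init__(self, in_ch, out_ch, kernel_size=3, stride=(2, 1)):
            super().__init__()
            self.block = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size, stride=stride, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(0.2),
            )

        def forward(self, x):
            return self.block(x)

    # Example: one encoder stage applied to a batch of magnitude spectrograms
    # shaped (batch, channels, frequency_bins, time_frames).
    x = torch.randn(2, 1, 257, 100)
    stage = nn.Sequential(ConvolutionModule(1, 16), ResidualLearningModule(16, dilation=(4, 4)))
    print(stage(x).shape)  # torch.Size([2, 16, 129, 100])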
S203: determining a context association vector for each of the speech features, wherein a context association vector characterizes a correlation between a speech feature and the respective speech feature.
In cognitive neuroscience, attention is an indispensable and complex cognitive function of humans, meaning the ability of a person to focus on some information while choosing to ignore the rest. In daily life we receive a great deal of sensory input through vision, hearing, touch and so on, yet our brains can still work in an orderly way under this bombardment of external information because they can, intentionally or unintentionally, select a small portion of useful information from the large amount of input for focused processing and ignore the rest. For example, when a person is reading, usually only the few words being read are attended to and processed. Similarly, an attention mechanism can enable a neural network to focus on part of its input, that is, to select particular input features. Where computing power is limited, the attention mechanism is a resource allocation scheme and the primary means of solving the information overload problem, allocating computing resources to the more important tasks.
The voice feature map is input into the attention module, and the context association degree of each voice feature is determined. Referring to fig. 4c, the attention module of the embodiment of the application can be divided, from top to bottom, into the following layers. The first layer contains three weight matrices of the same size; the input voice feature map is multiplied by these three weight matrices respectively to obtain a query matrix Q, a key matrix K and a value matrix V. The second layer applies a Softmax function, which maps the attention weight matrix into the interval (0, 1); the attention weight matrix is obtained from the query matrix Q and the key matrix K. The third layer performs a dot multiplication between the attention weight matrix and the value matrix V to generate a context association vector matrix. The context association vector matrix is then taken as the input of the fourth layer and multiplied by a preset attention coefficient w to obtain the final weighted map. In the process of generating the context association vectors, the value vector of any one voice feature X is weighted and fused with the value vectors of all voice features on the spectrogram, which embodies the correlation between the voice feature X and each voice feature on the spectrogram.
For ease of understanding, the process of generating the context association vector will be described with reference to the speech feature X as an example.
First, attention weights of the speech feature X to the respective speech features are calculated.
Specifically, the query vector of the speech feature X and the key vector of each speech feature are weighted to obtain the attention weight of each speech feature.
Next, a context association vector of the speech feature X is generated based on each speech feature and the corresponding attention weight.
Specifically, the context association vector of the speech feature X is generated by performing weighting processing on the value matrix V with the corresponding attention weights.
In the embodiment of the present application, the purpose of calculating the attention weights and the context association vector is to determine the dependency relationship between the speech feature X and every other speech feature. When the speech feature X is encoded, the neural network will pay attention to another speech feature as long as its context association degree with the speech feature X is high, even if that feature is far away from the speech feature X. In this way the network attends to speech features whether their reverberation time is short or long, and the phenomenon of wrong division or missed division does not occur.
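The attention computation described above (query, key and value projections, a Softmax over the query-key scores, a dot product with the value matrix, and a preset attention coefficient w) can be sketched as follows; the feature dimension, the scaling by the square root of the key dimension and the initialization of w are assumptions made for the illustration.

    import torch
    import torch.nn as nn

    class AttentionModule(nn.Module):
        def __init__(self, dim):
            super().__init__()
            # First layer: three equally sized weight matrices producing Q, K and V.
            self.w_q = nn.Linear(dim, dim, bias=False)
            self.w_k = nn.Linear(dim, dim, bias=False)
            self.w_v = nn.Linear(dim, dim, bias=False)
            # Preset attention coefficient w applied to the context association vectors.
            self.w = nn.Parameter(torch.ones(1))

        def forward(self, features):  # features: (num_frames, dim)
            q, k, v = self.w_q(features), self.w_k(features), self.w_v(features)
            # Second layer: attention weights from Q and K, mapped into (0, 1) by Softmax.
            scores = q @ k.transpose(0, 1) / k.shape[-1] ** 0.5
            weights = torch.softmax(scores, dim=-1)
            # Third layer: dot multiplication with V gives the context association vectors.
            context = weights @ v
            # Fourth layer: scale by the preset attention coefficient w.
            return self.w * context

    # Example: 100 frames of 64-dimensional voice features.
    attention = AttentionModule(64)
    print(attention(torch.randn(100, 64)).shape)  # torch.Size([100, 64])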
S204: and determining a voice masking estimation value of each voice signal on the spectrogram according to each voice feature and the corresponding context association vector, and performing reverberation elimination operation on the spectrogram according to each voice masking estimation value to obtain a dereverberated spectrogram, wherein one voice masking estimation value represents the probability of predicting that one voice signal contains reverberation.
The voice feature map and the corresponding context association vectors are input into the decoding module for decoding, and the voice masking estimation value corresponding to each voice signal on the spectrogram is determined. By adding attention to the input voice feature map, each voice feature on the map can be made to be dominated by either clean voice information or reverberant voice information, so that clean voice and reverberant voice can be distinguished.
Similar to the architecture of the encoding module, the decoding module is composed of a residual learning module and a deconvolution module (Deconvolution Unit). The structure of the residual learning module in the decoding module is shown in fig. 4a; it contains two paths, one through the main processing layers that performs feature extraction and one shortcut from the input layer to the output layer, and the main processing layers from top to bottom are a dilated convolution layer, a batch normalization layer and an activation function. Referring to fig. 4d, the deconvolution module of the embodiment of the application is divided into three layers from top to bottom, which are, in order, a deconvolution layer (Deconvolution Layer), a batch normalization layer and an activation function.
As can be seen from fig. 1, the voice feature map generated at the current level and the decoded feature map output by the next level are used together as the input of the decoding module at the current level, and the decoding module processes them to output a new decoded feature map whose size is consistent with the size of the feature map before the voice feature map at that level was generated. After several consecutive decoding modules, the size of the feature map output by the last decoding module is consistent with the size of the spectrogram input at the beginning, so the decoded feature map output by the last decoding module carries the voice masking estimation value corresponding to each voice signal on the spectrogram. Multiplying the voice masking estimation value corresponding to each voice signal by the spectrogram yields the dereverberated spectrogram.
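A minimal sketch of the deconvolution module and of applying the estimated masks to the spectrogram follows; the layer sizes are assumptions, and voice_mask stands for the decoded feature map of (decompressed) voice masking estimation values produced by the last decoding module.

    import torch
    import torch.nn as nn

    class DeconvolutionModule(nn.Module):
        # Deconvolution -> batch normalization -> Leaky ReLU, mirroring the encoder's convolution module.
        def __init__(self, in_ch, out_ch, kernel_size=3, stride=(2, 1)):
            super().__init__()
            self.block = nn.Sequential(
                nn.ConvTranspose2d(in_ch, out_ch, kernel_size, stride=stride, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(0.2),
            )

        def forward(self, x):
            return self.block(x)

    # Applying the masks: the decoded feature map has the same size as the input
    # spectrogram, and an element-wise multiplication yields the dereverberated spectrogram.
    spectrogram = torch.rand(1, 1, 257, 100)  # magnitude spectrogram (assumed shape)
    voice_mask = torch.rand(1, 1, 257, 100)   # decompressed voice masking estimation values
    dereverberated_spectrogram = voice_mask * spectrogram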
During training of the model, in order to facilitate convergence, the voice masking estimation values are mapped into the interval (0, 1) by the compression cost function, so the compressed values of the voice masking estimation values output by the last decoding module need to be decompressed by using formula (2), which recovers each voice masking estimation value from its compressed value. In formula (2), Q represents the compression range and can take the value 1, and C represents the slope of the compression function and can take the value 0.5.
s205: and performing time-frequency conversion inverse processing by using the dereverberated spectrogram and the phase spectrogram to obtain dereverberated audio.
The spectrogram output in step S204 is a frequency domain representation, and the dereverberated spectrogram must be converted back into a time domain signal through an inverse Fourier transform to obtain the dereverberated audio. Optionally, dereverberated Fourier coefficients are obtained from the dereverberated spectrogram and the phase spectrogram, and an inverse Fourier transform operation is performed using the dereverberated Fourier coefficients to obtain the dereverberated audio. Mapping the voice signal from the frequency domain back to the time domain through the inverse Fourier transform operation yields the dereverberated audio; a spectrogram of the dereverberated audio is shown in fig. 5.
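Continuing the librosa sketch from step S201, step S205 can be illustrated as follows: the dereverberated magnitude spectrogram and the unchanged phase spectrogram are recombined into complex Fourier coefficients and passed through the inverse short-time Fourier transform; the placeholder arrays, hop length and output file name are assumptions for the illustration.

    import numpy as np
    import librosa
    import soundfile as sf

    # `dereverberated_spectrogram` and `phase_spectrogram` stand for the arrays
    # produced in steps S204 and S201 (same shape: frequency bins x time frames).
    dereverberated_spectrogram = np.abs(np.random.randn(257, 100))    # placeholder
    phase_spectrogram = np.random.uniform(-np.pi, np.pi, (257, 100))  # placeholder

    # Recombine magnitude and phase into the dereverberated Fourier coefficients.
    dereverberated_stft = dereverberated_spectrogram * np.exp(1j * phase_spectrogram)

    # The inverse short-time Fourier transform maps the signal back to the time domain.
    dereverberated_audio = librosa.istft(dereverberated_stft, hop_length=128, window="hann")

    sf.write("dereverberated.wav", dereverberated_audio, 16000)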
Based on the same inventive concept, the present embodiment provides a reverberation cancellation device, as shown in fig. 6, including at least a first processing unit 601, a second processing unit 602, and a reverberation cancellation unit 603, wherein,
a first processing unit 601, configured to perform time-frequency conversion processing on an audio to obtain a spectrogram and a phase spectrogram, where each frame on the spectrogram corresponds to a speech signal group;
a second processing unit 602, configured to perform feature extraction on each frame to obtain corresponding speech features, and determine context association vectors of each speech feature, where one context association vector represents a correlation between one speech feature and each speech feature;
determining a voice masking estimation value of each voice signal on the spectrogram according to each voice feature and the corresponding context association vector, and performing reverberation elimination operation on the spectrogram according to each voice masking estimation value to obtain a dereverberated spectrogram, wherein one voice masking estimation value represents the probability of predicting one voice signal to contain reverberation;
a reverberation removal unit 603 configured to perform a time-frequency transform inverse process using the dereverberated spectrogram and the phase spectrogram, resulting in a dereverberated audio.
Optionally, the first processing unit 601 is configured to:
obtaining initial voice signals in different frames by performing windowing and framing operation on the audio;
fourier transform is carried out on each initial voice signal to obtain a corresponding spectrogram and a corresponding phase spectrogram;
and splicing the spectrograms according to a time sequence to obtain the spectrogram, and splicing the phase spectrograms of the frames according to a time sequence to obtain the phase spectrogram.
Optionally, the second processing unit 602 is configured to:
calculating attention weight of any one voice feature to each voice feature;
and generating a context association vector of any one voice feature based on each voice feature and the corresponding attention weight.
Optionally, the second processing unit 602 is configured to:
and weighting the query vector of any one voice feature and the key vector of each voice feature to obtain the attention weight of each voice feature.
Optionally, the reverberation canceling unit 603 is configured to:
obtaining a Fourier coefficient without reverberation according to the spectrogram without reverberation and the phase spectrogram;
and performing inverse Fourier transform operation on the de-reverberated spectrogram by using the de-reverberated Fourier coefficient to obtain the de-reverberated audio.
Based on the same inventive concept, the embodiment of the present application provides a computing device, which is shown in fig. 7 and at least includes a memory 701 and at least one processor 702, where the memory 701 and the processor 702 complete communication with each other through a communication bus;
the memory 701 is used for storing program instructions;
the processor 702 is configured to call program instructions stored in the memory 701 and execute the reverberation cancellation method according to the obtained program.
Based on the same inventive concept, in an embodiment of the present invention, there is provided a storage medium at least including computer readable instructions, which when read and executed by a computer, cause the computer to execute the reverberation cancellation method.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass such modifications and variations.

Claims (10)

1. A reverberation cancellation method, comprising:
performing time-frequency conversion processing on audio to obtain a spectrogram and a phase spectrogram, wherein each frame on the spectrogram corresponds to a voice signal group;
extracting features of each frame to obtain corresponding voice features, and determining a context association vector of each voice feature, wherein the determining of the context association vector of any one voice feature specifically comprises: calculating attention weight of any one voice feature to each voice feature; generating a context association vector of the arbitrary voice feature based on the voice features and the corresponding attention weight; a context association vector characterizing a correlation between a speech feature and each of said speech features;
determining a voice masking estimation value of each voice signal on the spectrogram according to each voice feature and the corresponding context association vector, and performing reverberation elimination operation on the spectrogram according to each voice masking estimation value to obtain a dereverberated spectrogram, wherein one voice masking estimation value represents the probability of predicting one voice signal to contain reverberation;
and performing time-frequency conversion inverse processing by using the dereverberation spectrogram and the phase spectrogram to obtain dereverberation audio.
2. The method of claim 1, wherein the time-frequency transforming the audio to obtain a spectrogram and a phase map comprises:
obtaining initial voice signals in different frames by performing windowing and framing operation on the audio;
fourier transform is carried out on each initial voice signal to obtain a corresponding spectrogram and a corresponding phase spectrogram;
and splicing the spectrograms according to a time sequence to obtain the spectrogram, and splicing the phase spectrograms of the frames according to a time sequence to obtain the phase spectrogram.
3. The method of claim 1, wherein computing the attention weight of the arbitrary one of the speech features to the respective speech feature comprises:
and weighting the query vector of any one voice feature and the key vector of each voice feature to obtain the attention weight of each voice feature.
4. The method of claim 1, wherein performing a time-frequency transform inversion process using the dereverberated spectrogram and the phase spectrogram to obtain dereverberated audio comprises:
obtaining a Fourier coefficient without reverberation according to the spectrogram without reverberation and the phase spectrogram;
and performing inverse Fourier transform operation on the de-reverberated spectrogram by using the de-reverberated Fourier coefficient to obtain the de-reverberated audio.
5. A reverberation cancellation device, comprising:
the first processing unit is configured to perform time-frequency conversion processing on audio to obtain a spectrogram and a phase spectrogram, wherein each frame on the spectrogram corresponds to one voice signal group;
the second processing unit is configured to perform feature extraction on each frame to obtain a corresponding voice feature, and determine a context association vector of each voice feature, where determining the context association vector of any one voice feature specifically includes: calculating attention weight of any one voice feature to each voice feature; generating a context association vector of the arbitrary voice feature based on the voice features and the corresponding attention weight; a context association vector characterizing a correlation between a speech feature and each of said speech features;
determining a voice masking estimation value of each voice signal on the spectrogram according to each voice feature and the corresponding context association vector, and performing reverberation elimination operation on the spectrogram according to each voice masking estimation value to obtain a dereverberated spectrogram, wherein one voice masking estimation value represents the probability of predicting one voice signal to contain reverberation;
a reverberation elimination unit configured to perform a time-frequency transform inverse process using the dereverberated spectrogram and the phase spectrogram, resulting in a dereverberated audio.
6. The apparatus of claim 5, wherein the first processing unit is configured to:
obtaining initial voice signals in different frames by performing windowing and framing operation on the audio;
fourier transform is carried out on each initial voice signal to obtain a corresponding spectrogram and a corresponding phase spectrogram;
and splicing the spectrograms according to a time sequence to obtain the spectrogram, and splicing the phase spectrograms of the frames according to a time sequence to obtain the phase spectrogram.
7. The apparatus of claim 5, wherein the second processing unit is configured to:
and weighting the query vector of any one voice feature and the key vector of each voice feature to obtain the attention weight of each voice feature.
8. The apparatus of claim 5, wherein the reverberation cancellation unit is configured to:
obtaining a Fourier coefficient without reverberation according to the spectrogram without reverberation and the phase spectrogram;
and performing inverse Fourier transform operation on the de-reverberated spectrogram by using the de-reverberated Fourier coefficient to obtain the de-reverberated audio.
9. A computing device, comprising:
a memory for storing program instructions;
a processor for calling program instructions stored in said memory to execute the method of any one of claims 1 to 4 in accordance with the obtained program.
10. A storage medium comprising computer readable instructions which, when read and executed by a computing device, cause the computing device to perform the method of any of claims 1-4.
CN202011588741.8A 2020-12-29 2020-12-29 Reverberation elimination method and device Active CN112289334B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011588741.8A CN112289334B (en) 2020-12-29 2020-12-29 Reverberation elimination method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011588741.8A CN112289334B (en) 2020-12-29 2020-12-29 Reverberation elimination method and device

Publications (2)

Publication Number Publication Date
CN112289334A CN112289334A (en) 2021-01-29
CN112289334B (en) 2021-04-02

Family

ID=74426588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011588741.8A Active CN112289334B (en) 2020-12-29 2020-12-29 Reverberation elimination method and device

Country Status (1)

Country Link
CN (1) CN112289334B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160839B (en) * 2021-04-16 2022-10-14 电子科技大学 Single-channel speech enhancement method based on adaptive attention mechanism and progressive learning
CN114283827B (en) * 2021-08-19 2024-03-29 腾讯科技(深圳)有限公司 Audio dereverberation method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308904A (en) * 2018-10-22 2019-02-05 上海声瀚信息科技有限公司 A kind of array voice enhancement algorithm
CN110503972A (en) * 2019-08-26 2019-11-26 北京大学深圳研究生院 Sound enhancement method, system, computer equipment and storage medium
CN110709924A (en) * 2017-11-22 2020-01-17 谷歌有限责任公司 Audio-visual speech separation
CN110970053A (en) * 2019-12-04 2020-04-07 西北工业大学深圳研究院 Multichannel speaker-independent voice separation method based on deep clustering

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5195652B2 (en) * 2008-06-11 2013-05-08 ソニー株式会社 Signal processing apparatus, signal processing method, and program

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110709924A (en) * 2017-11-22 2020-01-17 谷歌有限责任公司 Audio-visual speech separation
CN109308904A (en) * 2018-10-22 2019-02-05 上海声瀚信息科技有限公司 A kind of array voice enhancement algorithm
CN110503972A (en) * 2019-08-26 2019-11-26 北京大学深圳研究生院 Sound enhancement method, system, computer equipment and storage medium
CN110970053A (en) * 2019-12-04 2020-04-07 西北工业大学深圳研究院 Multichannel speaker-independent voice separation method based on deep clustering

Also Published As

Publication number Publication date
CN112289334A (en) 2021-01-29

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
CN112289334B (en) Reverberation elimination method and device
CN111508519B (en) Method and device for enhancing voice of audio signal
US9390723B1 (en) Efficient dereverberation in networked audio systems
CN114203163A (en) Audio signal processing method and device
CN113454717A (en) Speech recognition apparatus and method
CN105960676B (en) Linear prediction analysis device, method and recording medium
CN112735466B (en) Audio detection method and device
CN112750461A (en) Voice communication optimization method and device, electronic equipment and readable storage medium
CN114338623A (en) Audio processing method, device, equipment, medium and computer program product
CN114333893A (en) Voice processing method and device, electronic equipment and readable medium
CN112151055B (en) Audio processing method and device
CN116612778B (en) Echo and noise suppression method, related device and medium
CN113782044A (en) Voice enhancement method and device
CN116312570A (en) Voice noise reduction method, device, equipment and medium based on voiceprint recognition
US20230116052A1 (en) Array geometry agnostic multi-channel personalized speech enhancement
CN110459235A (en) A kind of reverberation removing method, device, equipment and storage medium
CN114333912B (en) Voice activation detection method, device, electronic equipment and storage medium
CN112802453B (en) Fast adaptive prediction voice fitting method, system, terminal and storage medium
CN117373468A (en) Far-field voice enhancement processing method, far-field voice enhancement processing device, computer equipment and storage medium
CN114333892A (en) Voice processing method and device, electronic equipment and readable medium
CN114333891A (en) Voice processing method and device, electronic equipment and readable medium
Xiang et al. Distributed microphones speech separation by learning spatial information with recurrent neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant