CN112289334B - Reverberation elimination method and device - Google Patents

Reverberation elimination method and device Download PDF

Info

Publication number
CN112289334B
CN112289334B CN202011588741.8A
Authority
CN
China
Prior art keywords
spectrogram
voice
reverberation
feature
phase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011588741.8A
Other languages
Chinese (zh)
Other versions
CN112289334A (en)
Inventor
邓峰
姜涛
王晓瑞
李岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202011588741.8A priority Critical patent/CN112289334B/en
Publication of CN112289334A publication Critical patent/CN112289334A/en
Application granted granted Critical
Publication of CN112289334B publication Critical patent/CN112289334B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L21/0232 Processing in the frequency domain
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application relates to the technical field of voice processing, and discloses a reverberation elimination method and device for solving the problem that voice signals with a long reverberation time are difficult to eliminate. The method comprises the following steps: generating a spectrogram and a phase spectrogram corresponding to the audio, wherein each frame on the spectrogram corresponds to one voice signal group; extracting features from each frame to obtain the corresponding voice features, and determining a context association vector for each voice feature; determining a voice masking estimation value for each voice signal on the spectrogram based on each voice feature and the corresponding context association vector, and performing a reverberation elimination operation on the spectrogram according to the voice masking estimation values to obtain a dereverberated spectrogram; and finally obtaining dereverberated audio by using the dereverberated spectrogram and the phase spectrogram. By adding attention to the input voice feature map, each voice feature on the map can be made to be dominated by either clean voice information or reverberant voice information, so that clean voice and reverberant voice can be distinguished.

Description

Reverberation elimination method and device
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method and an apparatus for eliminating reverberation.
Background
Reverberation is a common acoustic phenomenon in daily life. When sound waves propagate indoors, they are repeatedly reflected by obstacles such as walls, ceilings and floors, and a portion of their energy is absorbed at each reflection, so even after the sound source stops, the sound persists for some time before it decays through repeated reflection and absorption. In scenarios such as voice recognition, video recording and hearing aids, this lingering sound makes speech difficult to hear clearly, degrades product performance and harms the user experience.
At present, reverberant audio can be processed with either signal-processing methods or neural-network methods. However, both approaches can only filter out speech components with a short reverberation time; speech components with a long reverberation time are difficult to eliminate effectively, so the dereverberation effect is poor.
In view of the above, a new reverberation cancellation method and apparatus are needed to overcome the above-mentioned drawbacks.
Disclosure of Invention
The embodiment of the application provides a method and a device for eliminating reverberation, which are used for solving the problem that a voice signal with longer reverberation time is difficult to eliminate.
The embodiment of the application provides the following specific technical scheme:
in a first aspect, an embodiment of the present application provides a reverberation cancellation method, including:
performing time-frequency conversion processing on audio to obtain a spectrogram and a phase spectrogram, wherein each frame on the spectrogram corresponds to a voice signal group;
extracting features of each frame to obtain corresponding voice features, and determining context association vectors of the voice features, wherein one context association vector represents the correlation between one voice feature and each voice feature;
determining a voice masking estimation value of each voice signal on the spectrogram according to each voice feature and the corresponding context association vector, and performing reverberation elimination operation on the spectrogram according to each voice masking estimation value to obtain a dereverberated spectrogram, wherein one voice masking estimation value represents the probability of predicting one voice signal to contain reverberation;
and performing time-frequency conversion inverse processing by using the dereverberation spectrogram and the phase spectrogram to obtain dereverberation audio.
Optionally, performing time-frequency conversion on the audio to obtain a spectrogram and a phase spectrogram, including:
obtaining initial voice signals in different frames by performing windowing and framing operation on the audio;
fourier transform is carried out on each initial voice signal to obtain a corresponding spectrogram and a corresponding phase spectrogram;
and splicing the spectrograms according to a time sequence to obtain the spectrogram, and splicing the phase spectrograms of the frames according to a time sequence to obtain the phase spectrogram.
Optionally, when determining the context association vector of each speech feature, the method specifically includes, for any speech feature:
calculating attention weight of any one voice feature to each voice feature;
and generating a context association vector of any one voice feature based on each voice feature and the corresponding attention weight.
Optionally, calculating the attention weight of the arbitrary speech feature to each speech feature includes:
and weighting the query vector of any one voice feature and the key vector of each voice feature to obtain the attention weight of each voice feature.
Optionally, the performing a time-frequency transform inverse process by using the dereverberated spectrogram and the phase spectrogram to obtain dereverberated audio includes:
obtaining a Fourier coefficient without reverberation according to the spectrogram without reverberation and the phase spectrogram;
and performing inverse Fourier transform operation on the de-reverberated spectrogram by using the de-reverberated Fourier coefficient to obtain the de-reverberated audio.
In a second aspect, an embodiment of the present application further provides a reverberation cancellation device, including:
the first processing unit is configured to perform time-frequency conversion processing on audio to obtain a spectrogram and a phase spectrogram, wherein each frame on the spectrogram corresponds to one voice signal group;
the second processing unit is configured to perform feature extraction on each frame to obtain corresponding voice features, and determine context association vectors of each voice feature, wherein one context association vector represents the correlation between one voice feature and each voice feature;
determining a voice masking estimation value of each voice signal on the spectrogram according to each voice feature and the corresponding context association vector, and performing reverberation elimination operation on the spectrogram according to each voice masking estimation value to obtain a dereverberated spectrogram, wherein one voice masking estimation value represents the probability of predicting one voice signal to contain reverberation;
a reverberation elimination unit configured to perform a time-frequency transform inverse process using the dereverberated spectrogram and the phase spectrogram, resulting in a dereverberated audio.
Optionally, the first processing unit is configured to:
obtaining initial voice signals in different frames by performing windowing and framing operation on the audio;
fourier transform is carried out on each initial voice signal to obtain a corresponding spectrogram and a corresponding phase spectrogram;
and splicing the spectrograms according to a time sequence to obtain the spectrogram, and splicing the phase spectrograms of the frames according to a time sequence to obtain the phase spectrogram.
Optionally, the second processing unit is configured to:
calculating attention weight of any one voice feature to each voice feature;
and generating a context association vector of any one voice feature based on each voice feature and the corresponding attention weight.
Optionally, the second processing unit is configured to:
and weighting the query vector of any one voice feature and the key vector of each voice feature to obtain the attention weight of each voice feature.
Optionally, the reverberation cancellation unit is configured to:
obtaining a Fourier coefficient without reverberation according to the spectrogram without reverberation and the phase spectrogram;
and performing inverse Fourier transform operation on the de-reverberated spectrogram by using the de-reverberated Fourier coefficient to obtain the de-reverberated audio.
In a third aspect, an embodiment of the present application further provides a computing device, including:
a memory for storing program instructions;
and the processor is used for calling the program instructions stored in the memory and executing any reverberation elimination method according to the obtained program.
In a fourth aspect, embodiments of the present application further provide a storage medium including computer-readable instructions, which when read and executed by a computing device, cause the computing device to perform any one of the reverberation cancellation methods described above.
The beneficial effects of this application are as follows:
in the embodiment of the application, a spectrogram and a phase spectrogram corresponding to an audio are generated, wherein each frame on the spectrogram corresponds to a voice signal group; extracting the features of each frame to obtain corresponding voice features, and determining context association vectors of the voice features; determining a voice masking estimation value of each voice signal on the spectrogram based on each voice feature and the corresponding context association vector, and performing reverberation elimination operation on the spectrogram according to each voice masking estimation value to obtain a dereverberation spectrogram; and finally, obtaining the audio frequency without reverberation by using the spectrogram and the phase spectrogram without reverberation. The context correlation degree reflects the correlation between one voice characteristic and each voice characteristic, and the voice characteristic on the voice characteristic diagram can be led to be dominated by clean voice information or reverberation voice information by adding attention to the input voice characteristic diagram, so that the clean voice and the reverberation voice are distinguished, and the voice signal with longer reverberation time is effectively screened and eliminated.
Drawings
Fig. 1 is a schematic diagram of an architecture of a reverberation cancellation model provided in an embodiment of the present application;
fig. 2 is a schematic flow chart of reverberation elimination provided by an embodiment of the present application;
FIG. 3a is a time domain diagram provided by an embodiment of the present application;
fig. 3b is a spectrum diagram provided in an embodiment of the present application;
FIG. 3c is a spectrogram provided in an embodiment of the present application;
fig. 4a is a schematic structural diagram of a residual error learning module according to an embodiment of the present application;
fig. 4b is a schematic structural diagram of a convolution module according to an embodiment of the present application;
fig. 4c is a schematic structural diagram of an attention module according to an embodiment of the present application;
fig. 4d is a schematic structural diagram of a deconvolution module according to an embodiment of the present application;
fig. 5 is a spectrogram of dereverberated audio provided by an embodiment of the present application;
fig. 6 is a schematic structural diagram of an apparatus for eliminating reverberation according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an apparatus of a computing device according to an embodiment of the present application.
Detailed Description
In order to solve the problem that it is difficult to eliminate a speech signal with a long reverberation time, a new technical scheme is provided in the embodiment of the present application. The scheme comprises the following steps: generating a spectrogram and a phase spectrogram corresponding to the audio, wherein each frame on the spectrogram corresponds to one voice signal group; extracting the features of each frame to obtain corresponding voice features, and determining context association vectors of the voice features; determining a voice masking estimation value of each voice signal on the spectrogram based on each voice feature and the corresponding context association vector, and performing reverberation elimination operation on the spectrogram according to each voice masking estimation value to obtain a dereverberation spectrogram; and finally, obtaining the audio frequency without reverberation by using the spectrogram and the phase spectrogram without reverberation.
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
In the embodiment of the application, the audio is first subjected to time-frequency conversion to obtain a spectrogram and a phase spectrogram; the trained reverberation elimination model is then used to determine a voice masking estimation value for each voice signal on the spectrogram; a reverberation elimination operation is performed on the input spectrogram by using these voice masking estimation values to obtain a dereverberated spectrogram; and the dereverberated audio is finally obtained from the dereverberated spectrogram and the phase spectrogram. Generally speaking, a voice masking estimation value can be greater than 1, so in order to facilitate convergence when training the reverberation elimination model, the embodiment of the application proposes a compression cost function, formula (1), which maps the target voice masking estimation value into the interval (0, 1). In formula (1), Q denotes the compression range and can take the value 1; C denotes the slope of the compression function and can take the value 0.5; the function takes the target voice masking estimation value, indexed by the time frame t and the frequency f, and outputs its compressed value.
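The exact expression of formula (1) appears only as an image in the original document, so the following Python sketch is a minimal illustration that assumes the commonly used sigmoid-style mask compression, with Q as the compression range and C as the slope described above; the names compress_mask and decompress_mask, and the symbol m for the voice masking estimation value, are illustrative assumptions rather than notation from the patent.

    import numpy as np

    def compress_mask(m, Q=1.0, C=0.5):
        # Map an unbounded mask estimate m(t, f) >= 0 into the interval (0, Q),
        # assuming a sigmoid-style compression with range Q and slope C.
        return Q * (1.0 - np.exp(-C * m)) / (1.0 + np.exp(-C * m))

    def decompress_mask(m_c, Q=1.0, C=0.5, eps=1e-8):
        # Inverse mapping (playing the role of formula (2) further below):
        # recover the mask estimate from its compressed value m_c in (0, Q).
        m_c = np.clip(m_c, eps, Q - eps)  # keep the logarithm argument positive
        return -(1.0 / C) * np.log((Q - m_c) / (Q + m_c))

    # Example: mask estimates are squashed into (0, 1) for training and recovered afterwards.
    m = np.array([0.2, 1.0, 3.0])
    m_c = compress_mask(m)
    print(m_c, decompress_mask(m_c))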
referring to fig. 1, the reverberation elimination model specifically includes a coding module, an attention module and a decoding module, where the coding module is configured to perform feature extraction on a spectrogram to obtain a speech feature corresponding to each frame on the spectrogram; the attention module is used for carrying out weighting processing on each voice feature so as to generate context association vectors of each voice feature, wherein one context association vector represents the correlation between one voice feature and each voice feature; the decoding module is used for determining a voice masking estimation value of each voice signal on the spectrogram according to each voice feature and the corresponding context association vector, and executing reverberation elimination operation on the spectrogram according to each voice masking estimation value to obtain the spectrogram without reverberation. Specifically, referring to fig. 2, a process of the reverberation elimination model eliminating the reverberation included in the audio is introduced.
S201: and performing time-frequency conversion processing on the audio to obtain a spectrogram and a phase spectrogram, wherein each frame on the spectrogram corresponds to one voice signal group.
A time domain diagram makes it possible to observe the shape of the voice signal visually, but the voice signal cannot be described accurately with a limited number of parameters in the time domain. A frequency domain diagram, on the other hand, decomposes a complex voice signal into a superposition of simple voice signals (sinusoidal signals), so the 'structure' of the voice signal can be understood more accurately and more information can be extracted for signal analysis. Therefore, when performing step S201, the audio needs to be subjected to time-frequency domain conversion in the following manner.
(1) The audio is divided into frames according to a preset frame interval.
The voice signal of the audio forms a continuous wave in the time domain (hereinafter referred to as a time domain diagram), as shown in fig. 3a, in which the horizontal axis represents time and the vertical axis represents amplitude. Since a speech signal has short-time stationarity, being non-stationary macroscopically but approximately stationary microscopically, the time domain diagram can be cut, according to the preset frame interval, into a plurality of short time segments, each of which is referred to as a frame, for processing. It should be noted that, in order to ensure the continuity of the synthesized signal, adjacent frames may overlap.
(2) Windowing is applied to each frame. Taking one frame as an example, the frame is multiplied by a window function so that the originally aperiodic speech signal exhibits some characteristics of a periodic function, which facilitates the subsequent Fourier transform operation.
(3) Short-time Fourier transform is performed on each windowed frame, and the result of the short-time Fourier transform is referred to as a frequency spectrum. Although the time domain diagram intuitively shows the amplitude of the speech signal at different time instants, it is difficult to extract much useful information from it for signal analysis. From the short-time Fourier transform formula it can be seen that the wave of one frame of audio is a superposition of several waves with different frequencies; therefore, the short-time Fourier transform can be used to map the audio from the time domain to the frequency domain, yielding the frequency domain diagram shown in fig. 3b and a corresponding phase spectrum, where the horizontal axis of the frequency domain diagram is frequency and its vertical axis is amplitude, and the horizontal axis of the phase spectrum is frequency and its vertical axis is phase, so that the speech signal can be analyzed better.
(4) The frequency spectra of the frames are spliced in time order to obtain the spectrogram, and the phase spectra of the frames are spliced in time order to obtain the phase spectrogram. The spectrogram is shown in fig. 3c, in which the horizontal axis represents time and the vertical axis represents frequency.
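As a concrete illustration of step S201, the following sketch uses librosa to perform the windowed framing and short-time Fourier transform and to split the result into a magnitude spectrogram and a phase spectrogram; the file name, sampling rate, FFT size and hop length are assumptions for the illustration, not values specified by the patent.

    import librosa
    import numpy as np

    # Load the reverberant audio (file name and sampling rate are assumed).
    audio, sr = librosa.load("reverberant.wav", sr=16000)

    # Windowed framing + short-time Fourier transform: each column of `stft`
    # corresponds to one frame, i.e. one voice signal group on the spectrogram.
    n_fft, hop_length = 512, 128  # assumed analysis parameters
    stft = librosa.stft(audio, n_fft=n_fft, hop_length=hop_length, window="hann")

    # The magnitude spectrogram is what the dereverberation model operates on;
    # the phase spectrogram is kept unchanged for the inverse transform in step S205.
    spectrogram = np.abs(stft)            # shape: (n_fft // 2 + 1, num_frames)
    phase_spectrogram = np.angle(stft)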
S202: and extracting the features of each frame to obtain the corresponding voice features.
The spectrogram is input into the encoding module for feature extraction to obtain a voice feature map, which contains the voice features of each frame. Specifically, the encoding module is composed of a residual learning module (Residual Learning Unit) and a convolution module (Convolution Unit). Generally speaking, as the depth of a network increases (that is, as the network contains more parameters), its nonlinear expression capability becomes stronger and it can learn more; in practice, however, once the number of network layers increases beyond a certain point, a deeper network actually classifies worse. Therefore, the embodiment of the application combines a residual learning module with a convolution module to solve this degradation problem of deep networks.
Referring to fig. 4a, the residual learning module contains two paths: one is the path through the main processing layers that performs feature extraction, and the other is a shortcut from the input layer directly to the output layer. From top to bottom, the main processing layers are a dilated convolution layer (Dilated Convolution Layer), a batch normalization layer (Batch Normalization Layer) and an activation function (Leaky ReLU). In order to enlarge the receptive field and let the network obtain wide context information, the embodiment of the application uses dilated convolution layers with dilation rates of (2, 2) and (4, 4) to perform the feature extraction operation on the input spectrogram. Referring to fig. 4b, the convolution module of the embodiment of the application is divided into three layers from top to bottom, which are, in order, a convolution layer (Convolution Layer), a batch normalization layer and an activation function.
As can be seen from fig. 1, the embodiment of the application stacks several residual learning modules and convolution modules, and each convolution layer uses a different convolution kernel and stride. This further enlarges the receptive field, increases the complexity and depth of the network, avoids the information loss that would be caused by using max pooling layers, and improves the prediction accuracy of the reverberation elimination model.
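The encoder components described above (a residual learning module built from a dilated convolution, batch normalization and Leaky ReLU, plus a plain convolution module) can be sketched in PyTorch as follows; the channel counts, kernel sizes, strides and dilation pairs are assumptions, since the patent only states that different kernels and strides are used.

    import torch
    import torch.nn as nn

    class ResidualLearningModule(nn.Module):
        # Dilated convolution -> batch normalization -> Leaky ReLU,
        # plus a shortcut that adds the input directly to the output.
        def __init__(self, channels, dilation=(2, 2)):
            super().__init__()
            self.main = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3,
                          padding=dilation, dilation=dilation),
                nn.BatchNorm2d(channels),
                nn.LeakyReLU(0.2),
            )

        def forward(self, x):
            return x + self.main(x)

    class ConvolutionModule(nn.Module):
        # Convolution -> batch normalization -> Leaky ReLU; the stride controls downsampling.
        def __init__(self, in_ch, out_ch, kernel_size=3, stride=(2, 1)):
            super().__init__()
            self.block = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size, stride=stride, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(0.2),
            )

        def forward(self, x):
            return self.block(x)

    # Example: one encoder stage applied to a batch of magnitude spectrograms
    # shaped (batch, channels, frequency_bins, time_frames).
    x = torch.randn(2, 1, 257, 100)
    stage = nn.Sequential(ConvolutionModule(1, 16), ResidualLearningModule(16, dilation=(4, 4)))
    print(stage(x).shape)  # torch.Size([2, 16, 129, 100])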
S203: determining a context association vector for each of the speech features, wherein a context association vector characterizes a correlation between a speech feature and the respective speech feature.
In cognitive neuroscience, attention is an indispensable and complex cognitive function of humans, meaning the ability of a person to focus on some information while choosing to ignore the rest. In daily life we receive a great deal of sensory input through vision, hearing, touch and so on, yet our brains can still work in an orderly way under this bombardment of external information because they can, intentionally or unintentionally, select a small portion of useful information from the large amount of input for focused processing and ignore the rest. For example, when a person is reading, usually only the few words being read are attended to and processed. Similarly, an attention mechanism can enable a neural network to focus on part of its input, that is, to select particular input features. Where computing power is limited, the attention mechanism is a resource allocation scheme and the primary means of solving the information overload problem, allocating computing resources to the more important tasks.
The voice feature map is input into the attention module, and the context association degree of each voice feature is determined. Referring to fig. 4c, the attention module of the embodiment of the application can be divided, from top to bottom, into the following layers. The first layer contains three weight matrices of the same size; the input voice feature map is multiplied by these three weight matrices respectively to obtain a query matrix Q, a key matrix K and a value matrix V. The second layer applies a Softmax function, which maps the attention weight matrix into the interval (0, 1); the attention weight matrix is obtained from the query matrix Q and the key matrix K. The third layer performs a dot multiplication between the attention weight matrix and the value matrix V to generate a context association vector matrix. The context association vector matrix is then taken as the input of the fourth layer and multiplied by a preset attention coefficient w to obtain the final weighted map. In the process of generating the context association vectors, the value vector of any one voice feature X is weighted and fused with the value vectors of all voice features on the spectrogram, which embodies the correlation between the voice feature X and each voice feature on the spectrogram.
For ease of understanding, the process of generating the context association vector will be described with reference to the speech feature X as an example.
First, attention weights of the speech feature X to the respective speech features are calculated.
Specifically, the query vector of the speech feature X and the key vector of each speech feature are weighted to obtain the attention weight of each speech feature.
Next, a context association vector of the speech feature X is generated based on each speech feature and the corresponding attention weight.
Specifically, the context association vector of the speech feature X is generated by performing weighting processing on the value matrix V with the corresponding attention weights.
In the embodiment of the present application, the purpose of calculating the attention weights and the context association vector is to determine the dependency relationship between the speech feature X and every other speech feature. When the speech feature X is encoded, the neural network will pay attention to another speech feature as long as its context association degree with the speech feature X is high, even if that feature is far away from the speech feature X. In this way the network attends to speech features whether their reverberation time is short or long, and the phenomenon of wrong division or missed division does not occur.
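The attention computation described above (query, key and value projections, a Softmax over the query-key scores, a dot product with the value matrix, and a preset attention coefficient w) can be sketched as follows; the feature dimension, the scaling by the square root of the key dimension and the initialization of w are assumptions made for the illustration.

    import torch
    import torch.nn as nn

    class AttentionModule(nn.Module):
        def __init__(self, dim):
            super().__init__()
            # First layer: three equally sized weight matrices producing Q, K and V.
            self.w_q = nn.Linear(dim, dim, bias=False)
            self.w_k = nn.Linear(dim, dim, bias=False)
            self.w_v = nn.Linear(dim, dim, bias=False)
            # Preset attention coefficient w applied to the context association vectors.
            self.w = nn.Parameter(torch.ones(1))

        def forward(self, features):  # features: (num_frames, dim)
            q, k, v = self.w_q(features), self.w_k(features), self.w_v(features)
            # Second layer: attention weights from Q and K, mapped into (0, 1) by Softmax.
            scores = q @ k.transpose(0, 1) / k.shape[-1] ** 0.5
            weights = torch.softmax(scores, dim=-1)
            # Third layer: dot multiplication with V gives the context association vectors.
            context = weights @ v
            # Fourth layer: scale by the preset attention coefficient w.
            return self.w * context

    # Example: 100 frames of 64-dimensional voice features.
    attention = AttentionModule(64)
    print(attention(torch.randn(100, 64)).shape)  # torch.Size([100, 64])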
S204: and determining a voice masking estimation value of each voice signal on the spectrogram according to each voice feature and the corresponding context association vector, and performing reverberation elimination operation on the spectrogram according to each voice masking estimation value to obtain a dereverberated spectrogram, wherein one voice masking estimation value represents the probability of predicting that one voice signal contains reverberation.
The voice feature map and the corresponding context association vectors are input into the decoding module for decoding, and the voice masking estimation value corresponding to each voice signal on the spectrogram is determined. By adding attention to the input voice feature map, each voice feature on the map can be made to be dominated by either clean voice information or reverberant voice information, so that clean voice and reverberant voice can be distinguished.
Similar to the architecture of the encoding module, the decoding module is composed of a residual learning module and a deconvolution module (Deconvolution Unit). The structure of the residual learning module in the decoding module is shown in fig. 4a; it contains two paths, one through the main processing layers that performs feature extraction and one shortcut from the input layer to the output layer, and the main processing layers from top to bottom are a dilated convolution layer, a batch normalization layer and an activation function. Referring to fig. 4d, the deconvolution module of the embodiment of the application is divided into three layers from top to bottom, which are, in order, a deconvolution layer (Deconvolution Layer), a batch normalization layer and an activation function.
As can be seen from fig. 1, the voice feature map generated at the current level and the decoded feature map output by the next level are used together as the input of the decoding module at the current level, and the decoding module processes them to output a new decoded feature map whose size is consistent with the size of the feature map before the voice feature map at that level was generated. After several consecutive decoding modules, the size of the feature map output by the last decoding module is consistent with the size of the spectrogram input at the beginning, so the decoded feature map output by the last decoding module carries the voice masking estimation value corresponding to each voice signal on the spectrogram. Multiplying the voice masking estimation value corresponding to each voice signal by the spectrogram yields the dereverberated spectrogram.
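A minimal sketch of the deconvolution module and of applying the estimated masks to the spectrogram follows; the layer sizes are assumptions, and voice_mask stands for the decoded feature map of (decompressed) voice masking estimation values produced by the last decoding module.

    import torch
    import torch.nn as nn

    class DeconvolutionModule(nn.Module):
        # Deconvolution -> batch normalization -> Leaky ReLU, mirroring the encoder's convolution module.
        def __init__(self, in_ch, out_ch, kernel_size=3, stride=(2, 1)):
            super().__init__()
            self.block = nn.Sequential(
                nn.ConvTranspose2d(in_ch, out_ch, kernel_size, stride=stride, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(0.2),
            )

        def forward(self, x):
            return self.block(x)

    # Applying the masks: the decoded feature map has the same size as the input
    # spectrogram, and an element-wise multiplication yields the dereverberated spectrogram.
    spectrogram = torch.rand(1, 1, 257, 100)  # magnitude spectrogram (assumed shape)
    voice_mask = torch.rand(1, 1, 257, 100)   # decompressed voice masking estimation values
    dereverberated_spectrogram = voice_mask * spectrogram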
During training of the model, in order to facilitate convergence, the voice masking estimation values are mapped into the interval (0, 1) by the compression cost function, so the compressed values of the voice masking estimation values output by the last decoding module need to be decompressed by using formula (2), which recovers each voice masking estimation value from its compressed value. In formula (2), Q represents the compression range and can take the value 1, and C represents the slope of the compression function and can take the value 0.5.
s205: and performing time-frequency conversion inverse processing by using the dereverberated spectrogram and the phase spectrogram to obtain dereverberated audio.
The spectrogram output in step S204 is a frequency domain representation, and the dereverberated spectrogram must be converted back into a time domain signal through an inverse Fourier transform to obtain the dereverberated audio. Optionally, dereverberated Fourier coefficients are obtained from the dereverberated spectrogram and the phase spectrogram, and an inverse Fourier transform operation is performed using the dereverberated Fourier coefficients to obtain the dereverberated audio. Mapping the voice signal from the frequency domain back to the time domain through the inverse Fourier transform operation yields the dereverberated audio; a spectrogram of the dereverberated audio is shown in fig. 5.
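Continuing the librosa sketch from step S201, step S205 can be illustrated as follows: the dereverberated magnitude spectrogram and the unchanged phase spectrogram are recombined into complex Fourier coefficients and passed through the inverse short-time Fourier transform; the placeholder arrays, hop length and output file name are assumptions for the illustration.

    import numpy as np
    import librosa
    import soundfile as sf

    # `dereverberated_spectrogram` and `phase_spectrogram` stand for the arrays
    # produced in steps S204 and S201 (same shape: frequency bins x time frames).
    dereverberated_spectrogram = np.abs(np.random.randn(257, 100))    # placeholder
    phase_spectrogram = np.random.uniform(-np.pi, np.pi, (257, 100))  # placeholder

    # Recombine magnitude and phase into the dereverberated Fourier coefficients.
    dereverberated_stft = dereverberated_spectrogram * np.exp(1j * phase_spectrogram)

    # The inverse short-time Fourier transform maps the signal back to the time domain.
    dereverberated_audio = librosa.istft(dereverberated_stft, hop_length=128, window="hann")

    sf.write("dereverberated.wav", dereverberated_audio, 16000)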
Based on the same inventive concept, the present embodiment provides a reverberation cancellation device, as shown in fig. 6, including at least a first processing unit 601, a second processing unit 602, and a reverberation cancellation unit 603, wherein,
a first processing unit 601, configured to perform time-frequency conversion processing on an audio to obtain a spectrogram and a phase spectrogram, where each frame on the spectrogram corresponds to a speech signal group;
a second processing unit 602, configured to perform feature extraction on each frame to obtain corresponding speech features, and determine context association vectors of each speech feature, where one context association vector represents a correlation between one speech feature and each speech feature;
determining a voice masking estimation value of each voice signal on the spectrogram according to each voice feature and the corresponding context association vector, and performing reverberation elimination operation on the spectrogram according to each voice masking estimation value to obtain a dereverberated spectrogram, wherein one voice masking estimation value represents the probability of predicting one voice signal to contain reverberation;
a reverberation removal unit 603 configured to perform a time-frequency transform inverse process using the dereverberated spectrogram and the phase spectrogram, resulting in a dereverberated audio.
Optionally, the first processing unit 601 is configured to:
obtaining initial voice signals in different frames by performing windowing and framing operation on the audio;
fourier transform is carried out on each initial voice signal to obtain a corresponding spectrogram and a corresponding phase spectrogram;
and splicing the spectrograms according to a time sequence to obtain the spectrogram, and splicing the phase spectrograms of the frames according to a time sequence to obtain the phase spectrogram.
Optionally, the second processing unit 602 is configured to:
calculating attention weight of any one voice feature to each voice feature;
and generating a context association vector of any one voice feature based on each voice feature and the corresponding attention weight.
Optionally, the second processing unit 602 is configured to:
and weighting the query vector of any one voice feature and the key vector of each voice feature to obtain the attention weight of each voice feature.
Optionally, the reverberation canceling unit 603 is configured to:
obtaining a Fourier coefficient without reverberation according to the spectrogram without reverberation and the phase spectrogram;
and performing inverse Fourier transform operation on the de-reverberated spectrogram by using the de-reverberated Fourier coefficient to obtain the de-reverberated audio.
Based on the same inventive concept, the embodiment of the present application provides a computing device, which is shown in fig. 7 and at least includes a memory 701 and at least one processor 702, where the memory 701 and the processor 702 complete communication with each other through a communication bus;
the memory 701 is used for storing program instructions;
the processor 702 is configured to call program instructions stored in the memory 701 and execute the reverberation cancellation method according to the obtained program.
Based on the same inventive concept, in an embodiment of the present invention, there is provided a storage medium at least including computer readable instructions, which when read and executed by a computer, cause the computer to execute the reverberation cancellation method.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass such modifications and variations.

Claims (10)

1. A reverberation cancellation method, comprising:
performing time-frequency conversion processing on audio to obtain a spectrogram and a phase spectrogram, wherein each frame on the spectrogram corresponds to a voice signal group;
extracting features of each frame to obtain corresponding voice features, and determining a context association vector of each voice feature, wherein the determining of the context association vector of any one voice feature specifically comprises: calculating attention weight of any one voice feature to each voice feature; generating a context association vector of the arbitrary voice feature based on the voice features and the corresponding attention weight; a context association vector characterizing a correlation between a speech feature and each of said speech features;
determining a voice masking estimation value of each voice signal on the spectrogram according to each voice feature and the corresponding context association vector, and performing reverberation elimination operation on the spectrogram according to each voice masking estimation value to obtain a dereverberated spectrogram, wherein one voice masking estimation value represents the probability of predicting one voice signal to contain reverberation;
and performing time-frequency conversion inverse processing by using the dereverberation spectrogram and the phase spectrogram to obtain dereverberation audio.
2. The method of claim 1, wherein the time-frequency transforming the audio to obtain a spectrogram and a phase map comprises:
obtaining initial voice signals in different frames by performing windowing and framing operation on the audio;
fourier transform is carried out on each initial voice signal to obtain a corresponding spectrogram and a corresponding phase spectrogram;
and splicing the spectrograms according to a time sequence to obtain the spectrogram, and splicing the phase spectrograms of the frames according to a time sequence to obtain the phase spectrogram.
3. The method of claim 1, wherein computing the attention weight of the arbitrary one of the speech features to the respective speech feature comprises:
and weighting the query vector of any one voice feature and the key vector of each voice feature to obtain the attention weight of each voice feature.
4. The method of claim 1, wherein performing a time-frequency transform inversion process using the dereverberated spectrogram and the phase spectrogram to obtain dereverberated audio comprises:
obtaining a Fourier coefficient without reverberation according to the spectrogram without reverberation and the phase spectrogram;
and performing inverse Fourier transform operation on the de-reverberated spectrogram by using the de-reverberated Fourier coefficient to obtain the de-reverberated audio.
5. A reverberation cancellation device, comprising:
the first processing unit is configured to perform time-frequency conversion processing on audio to obtain a spectrogram and a phase spectrogram, wherein each frame on the spectrogram corresponds to one voice signal group;
the second processing unit is configured to perform feature extraction on each frame to obtain a corresponding voice feature, and determine a context association vector of each voice feature, where determining the context association vector of any one voice feature specifically includes: calculating attention weight of any one voice feature to each voice feature; generating a context association vector of the arbitrary voice feature based on the voice features and the corresponding attention weight; a context association vector characterizing a correlation between a speech feature and each of said speech features;
determining a voice masking estimation value of each voice signal on the spectrogram according to each voice feature and the corresponding context association vector, and performing reverberation elimination operation on the spectrogram according to each voice masking estimation value to obtain a dereverberated spectrogram, wherein one voice masking estimation value represents the probability of predicting one voice signal to contain reverberation;
a reverberation elimination unit configured to perform a time-frequency transform inverse process using the dereverberated spectrogram and the phase spectrogram, resulting in a dereverberated audio.
6. The apparatus of claim 5, wherein the first processing unit is configured to:
obtaining initial voice signals in different frames by performing windowing and framing operation on the audio;
fourier transform is carried out on each initial voice signal to obtain a corresponding spectrogram and a corresponding phase spectrogram;
and splicing the spectrograms according to a time sequence to obtain the spectrogram, and splicing the phase spectrograms of the frames according to a time sequence to obtain the phase spectrogram.
7. The apparatus of claim 5, wherein the second processing unit is configured to:
and weighting the query vector of any one voice feature and the key vector of each voice feature to obtain the attention weight of each voice feature.
8. The apparatus of claim 5, wherein the reverberation cancellation unit is configured to:
obtaining a Fourier coefficient without reverberation according to the spectrogram without reverberation and the phase spectrogram;
and performing inverse Fourier transform operation on the de-reverberated spectrogram by using the de-reverberated Fourier coefficient to obtain the de-reverberated audio.
9. A computing device, comprising:
a memory for storing program instructions;
a processor for calling program instructions stored in said memory to execute the method of any one of claims 1 to 4 in accordance with the obtained program.
10. A storage medium comprising computer readable instructions which, when read and executed by a computing device, cause the computing device to perform the method of any of claims 1-4.
CN202011588741.8A 2020-12-29 2020-12-29 Reverberation elimination method and device Active CN112289334B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011588741.8A CN112289334B (en) 2020-12-29 2020-12-29 Reverberation elimination method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011588741.8A CN112289334B (en) 2020-12-29 2020-12-29 Reverberation elimination method and device

Publications (2)

Publication Number Publication Date
CN112289334A CN112289334A (en) 2021-01-29
CN112289334B (en) 2021-04-02

Family

ID=74426588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011588741.8A Active CN112289334B (en) 2020-12-29 2020-12-29 Reverberation elimination method and device

Country Status (1)

Country Link
CN (1) CN112289334B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160839B (en) * 2021-04-16 2022-10-14 电子科技大学 Single-channel speech enhancement method based on adaptive attention mechanism and progressive learning
CN114283827B (en) * 2021-08-19 2024-03-29 腾讯科技(深圳)有限公司 Audio dereverberation method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308904A (en) * 2018-10-22 2019-02-05 上海声瀚信息科技有限公司 A kind of array voice enhancement algorithm
CN110503972A (en) * 2019-08-26 2019-11-26 北京大学深圳研究生院 Sound enhancement method, system, computer equipment and storage medium
CN110709924A (en) * 2017-11-22 2020-01-17 谷歌有限责任公司 Audio-visual speech separation
CN110970053A (en) * 2019-12-04 2020-04-07 西北工业大学深圳研究院 Multichannel speaker-independent voice separation method based on deep clustering

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5195652B2 (en) * 2008-06-11 2013-05-08 ソニー株式会社 Signal processing apparatus, signal processing method, and program

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110709924A (en) * 2017-11-22 2020-01-17 谷歌有限责任公司 Audio-visual speech separation
CN109308904A (en) * 2018-10-22 2019-02-05 上海声瀚信息科技有限公司 A kind of array voice enhancement algorithm
CN110503972A (en) * 2019-08-26 2019-11-26 北京大学深圳研究生院 Sound enhancement method, system, computer equipment and storage medium
CN110970053A (en) * 2019-12-04 2020-04-07 西北工业大学深圳研究院 Multichannel speaker-independent voice separation method based on deep clustering

Also Published As

Publication number Publication date
CN112289334A (en) 2021-01-29

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
CN112289334B (en) Reverberation elimination method and device
CN111508519B (en) Method and device for enhancing voice of audio signal
US9390723B1 (en) Efficient dereverberation in networked audio systems
CN114203163A (en) Audio signal processing method and device
CN113454717A (en) Speech recognition apparatus and method
CN105960676B (en) Linear prediction analysis device, method and recording medium
CN112735466B (en) Audio detection method and device
CN112750461A (en) Voice communication optimization method and device, electronic equipment and readable storage medium
CN114338623A (en) Audio processing method, device, equipment, medium and computer program product
CN114333893A (en) Voice processing method and device, electronic equipment and readable medium
CN112151055B (en) Audio processing method and device
CN116612778B (en) Echo and noise suppression method, related device and medium
CN113782044A (en) Voice enhancement method and device
CN116312570A (en) Voice noise reduction method, device, equipment and medium based on voiceprint recognition
US20230116052A1 (en) Array geometry agnostic multi-channel personalized speech enhancement
CN110459235A (en) A kind of reverberation removing method, device, equipment and storage medium
CN114333912B (en) Voice activation detection method, device, electronic equipment and storage medium
CN112802453B (en) Fast adaptive prediction voice fitting method, system, terminal and storage medium
CN117373468A (en) Far-field voice enhancement processing method, far-field voice enhancement processing device, computer equipment and storage medium
CN114333892A (en) Voice processing method and device, electronic equipment and readable medium
CN114333891A (en) Voice processing method and device, electronic equipment and readable medium
Xiang et al. Distributed microphones speech separation by learning spatial information with recurrent neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant