CN116189703B - Global multi-head attention voice enhancement method - Google Patents

Global multi-head attention voice enhancement method

Info

Publication number
CN116189703B
CN116189703B CN202310447342.7A
Authority
CN
China
Prior art keywords
global multi-head attention
convolution
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310447342.7A
Other languages
Chinese (zh)
Other versions
CN116189703A (en)
Inventor
楚明航
王靖
马瑶瑶
黄玉玲
杨梦涛
范智玮
徐超
吴迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN202310447342.7A priority Critical patent/CN116189703B/en
Publication of CN116189703A publication Critical patent/CN116189703A/en
Application granted granted Critical
Publication of CN116189703B publication Critical patent/CN116189703B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G10L19/02: Speech or audio signals analysis-synthesis techniques for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Mathematical Physics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a global multi-head attention voice enhancement method, which relates to the field of generative adversarial networks. The method comprises: inputting a noisy audio signal into a generator encoder for convolution to obtain a convolution feature map; inputting the convolution feature map into a global multi-head attention layer to obtain a global multi-head attention feature map; inputting the global multi-head attention feature map into the generator encoder to obtain a convolution-global multi-head attention-convolution feature map; superimposing the convolution-global multi-head attention-convolution feature map with random noise z sampled from a Gaussian distribution and inputting the result into a generator decoder to obtain a deconvolution feature map; inputting the deconvolution feature map into a global multi-head attention layer to obtain a decoding-global multi-head attention feature map; and inputting the decoding-global multi-head attention feature map into the generator decoder to obtain an enhanced audio signal. The invention can be used in speech enhancement networks and can capture time dependence.

Description

Global multi-head attention voice enhancement method
Technical Field
The present invention relates to the field of generative adversarial networks, and more particularly to a global multi-head attention speech enhancement method.
Background
In recent years, voice enhancement methods based on generative adversarial networks (GANs) have been proposed, in which end-to-end voice enhancement is achieved by feeding waveforms directly into the network. However, existing speech enhancement GANs rely entirely on convolution operations, which may mask the time dependence of the sequence input.
Disclosure of Invention
The invention aims to provide a global multi-head attention voice enhancement method for solving the problems in the background art.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a global multi-head attention speech enhancement method, comprising the following steps:
step one: acquiring a noisy audio signal;
step two: inputting the noisy audio signal into several convolution layers of a generator encoder to obtain a convolution feature map;
step three: inputting the convolution feature map obtained in step two into a global multi-head attention layer of the generator encoder to obtain a global multi-head attention feature map;
step four: inputting the global multi-head attention feature map obtained in step three into the remaining convolution layers of the generator encoder to obtain a convolution-global multi-head attention-convolution feature map;
step five: superimposing the convolution-global multi-head attention-convolution feature map obtained in step four with random noise z sampled from a Gaussian distribution, and then inputting the result into several deconvolution layers of a generator decoder to obtain a deconvolution feature map;
step six: inputting the deconvolution feature map obtained in step five into a global multi-head attention layer of the generator decoder to obtain a decoding-global multi-head attention feature map;
step seven: inputting the decoding-global multi-head attention feature map obtained in step six into the remaining deconvolution layers of the generator decoder to obtain an enhanced audio signal.
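For illustration only, the following TensorFlow/Keras sketch traces the seven steps above end to end. It is a minimal, assumption-laden outline: the layer counts, kernel sizes, strides and filter numbers are placeholders rather than the patented configuration, and tf.keras.layers.MultiHeadAttention merely stands in for the global multi-head attention layer defined in the following paragraphs.

```python
# Hedged sketch of the seven-step generator flow; sizes are illustrative only.
import tensorflow as tf

def build_generator(signal_len=16384, n_heads=2):
    noisy = tf.keras.Input(shape=(signal_len, 1), name="noisy_audio")       # step one
    x = noisy
    for f in (16, 32, 64):                                                  # step two: first convolutions
        x = tf.keras.layers.Conv1D(f, 31, strides=2, padding="same")(x)
        x = tf.keras.layers.PReLU(shared_axes=[1])(x)
    # step three: stand-in for the global multi-head attention layer
    x = tf.keras.layers.MultiHeadAttention(num_heads=n_heads, key_dim=16)(x, x)
    for f in (128, 256):                                                    # step four: remaining convolutions
        x = tf.keras.layers.Conv1D(f, 31, strides=2, padding="same")(x)
        x = tf.keras.layers.PReLU(shared_axes=[1])(x)
    z = tf.keras.Input(shape=(signal_len // 32, 256), name="z")             # step five: Gaussian code z
    x = tf.keras.layers.Concatenate()([x, z])
    for f in (128, 64):                                                     # step five: first deconvolutions
        x = tf.keras.layers.Conv1DTranspose(f, 31, strides=2, padding="same")(x)
        x = tf.keras.layers.PReLU(shared_axes=[1])(x)
    # step six: stand-in for the decoder-side global multi-head attention layer
    x = tf.keras.layers.MultiHeadAttention(num_heads=n_heads, key_dim=16)(x, x)
    for f in (32, 16, 1):                                                   # step seven: remaining deconvolutions
        x = tf.keras.layers.Conv1DTranspose(f, 31, strides=2, padding="same")(x)
    return tf.keras.Model([noisy, z], x)
```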
Further, step three includes a first preliminary step: obtaining the query matrix Q_n, the key matrix K_n and the value matrix V_n of the convolution feature map, where n = 1, 2, ..., N and N denotes the number of attention heads.
Further, step three also includes a second preliminary step: calculating the weight matrix of the global multi-head attention:
α_n = softmax(Q_n K_n^T)
where n = 1, 2, ..., N, with N denoting the number of heads; softmax denotes the normalized exponential function; T denotes the transpose of a vector or matrix.
Further, the global multi-head attention feature map in step three is calculated as follows:
Conv1d(Cat(α_1 V_1, α_2 V_2, ..., α_N V_N))
where Cat is the concatenation (concat) function of the convolutional neural network and denotes feature fusion, and Conv1d denotes a one-dimensional convolution.
Further:
Q_n = W_{Q_n} F
K_n = W_{K_n} downsample(F)
V_n = W_{V_n} downsample(F)
wherein:
W_{Q_n} denotes the weight matrix of Q_n;
W_{K_n} denotes the weight matrix of K_n;
W_{V_n} denotes the weight matrix of V_n;
F is the convolution feature map output in step two;
downsample denotes the downsampling operation.
Further, step six includes a first preliminary step: obtaining the query matrix Q_n, the key matrix K_n and the value matrix V_n of the deconvolution feature map, where n = 1, 2, ..., N and N denotes the number of attention heads.
Further, step six also includes a second preliminary step: calculating the weight matrix of the global multi-head attention:
α_n = softmax(Q_n K_n^T)
where n = 1, 2, ..., N, with N denoting the number of heads; softmax denotes the normalized exponential function; T denotes the transpose of a vector or matrix.
Further, the decoding-global multi-head attention feature map in step six is calculated as follows:
Conv1d(Cat(α_1 V_1, α_2 V_2, ..., α_N V_N))
where Cat is the concatenation (concat) function of the convolutional neural network and denotes feature fusion, and Conv1d denotes a one-dimensional convolution.
Further:
Q_n = W_{Q_n} D
K_n = W_{K_n} downsample(D)
V_n = W_{V_n} downsample(D)
wherein:
W_{Q_n} denotes the weight matrix of Q_n;
W_{K_n} denotes the weight matrix of K_n;
W_{V_n} denotes the weight matrix of V_n;
D is the deconvolution feature map output in step five;
downsample denotes the downsampling operation.
Compared with the prior art, the invention has the following advantages. Compared with the traditional SASEGAN (self-attention speech enhancement generative adversarial network), the proposed global multi-head attention speech enhancement generative adversarial network (GMASEGAN), which operates on the raw audio signal, has slightly more network parameters but converges faster during training. In subjective evaluation, the proposed GMASEGAN improves on the traditional SEGAN (speech enhancement generative adversarial network) and SASEGAN by 8.81% and 5.78%, respectively. In objective evaluation, the proposed GMASEGAN achieves absolute gains of 0.07, 0.06, 0.09, 0.07, 0.63 and 0.26 over SASEGAN on PESQ, CSIG, CBAK, COVL, SSNR and STOI, respectively. In addition, the proposed GMASEGAN (with N = 4 attention heads) can attenuate the background noise present in the clean speech, and the speech enhancement effect of GMASEGAN may improve further as the number of heads increases. More importantly, the proposed global multi-head attention layer can be used in other speech enhancement networks whose backbone is a convolutional layer, so as to capture time dependence.
Drawings
FIG. 1 is a schematic diagram of the processing steps of the global multi-head attention layer;
FIG. 2 is a schematic diagram of the network structure of the generator and the computation of the global multi-head attention layer in the network;
FIG. 3 is a schematic diagram of the network structure of the discriminator.
Detailed Description
The following describes specific embodiments of the present invention with reference to the drawings.
First, a global multi-head attention layer is described:
the global multi-headed note layer is adapted from a non-local self-note layer. Here, two attention heads are taken as an example. Processing steps of the global multi-head attention layer as shown in fig. 1, fig. 1 is a diagram of processing steps of the global multi-head attention layer (attention head number n=2). Taking two vectors as an example, T represents the transpose of the vector. In order to keep the attention coefficient in the range of 0-1, in
Figure SMS_33
A softmax layer was then added.
Consider the feature map A ∈ R^{T×C} output by a one-dimensional convolutional layer, where T represents the length of the time dimension and C represents the number of channels. A is divided into T vectors, A = [a_1, a_2, ..., a_T], where a_t ∈ R^C is the channel feature vector at each time point. For each attention head n (n = 1, 2 in this example), the query vector q_n, the key vector k_n and the value vector v_n are obtained by
q_n = W_{q_n} a, k_n = W_{k_n} a, v_n = W_{v_n} a,
where W_{q_n}, W_{k_n} and W_{v_n} are weight matrices implemented by 1 x 1 convolutional layers. The attention coefficient α_n is obtained by the following transformation:
α_n = softmax(q_n^T k_n),
where T denotes the transpose of a vector or matrix; the softmax layer is used to ensure that the attention coefficient lies in the range 0-1. The attention feature map b is obtained by the following transformation:
b = Conv1d(Cat(α_1 v_1, α_2 v_2)),
where Cat(∙) denotes connecting multiple matrices together along a particular dimension, and Conv1d denotes the output of a 1 x 1 convolutional layer whose number of channels is the same as that of A.
The network architecture and method of the present invention are described below:
the network structure of the generator and the calculation of the global multi-headed attention layer in the network are shown in fig. 2. The upper half shows the network structure of the generator and the output shape of each layer. The lower part shows how the data is calculated in the global multi-head attention layer after adding the global multi-head attention layer after the third last convolutional layer of the decoder, which is also provided at the encoder position corresponding to the vertical direction in the figure.
The generator, which receives the noise-corrupted audio signal as input, is a fully convolutional encoder-decoder architecture.
A parametric rectified linear unit (PReLU) is employed as the activation function of the generator. A convolutional layer and an activation layer are combined into one convolution block (ConvBlk), which is the basic unit of the generator encoder. The generator encoder consists of 11 ConvBlks whose convolutional layers use a stride of 2, with the number of filters increasing from layer to layer. The output of the encoder is an 8 x 1024 feature map. The latent code z is randomly sampled from a Gaussian distribution, stacked onto the encoder output, and presented to the generator decoder.
The generator decoder mirrors the generator encoder architecture, reversing the encoding process by deconvolution. A deconvolution layer and an activation layer are combined into a deconvolution block (deConvBlk), which is the basic unit of the generator decoder. Skip connections link each ConvBlk in the encoder to its mirrored deConvBlk in the decoder, allowing data from the encoding process to flow into the decoding process. The output of the decoder is the enhanced audio signal.
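As a rough sketch only, the ConvBlk/deConvBlk pairing and the skip connection described above can be written as follows in Keras; the kernel size and stride are assumptions rather than the patented values.

```python
# Hedged sketch of one ConvBlk / deConvBlk pair with a skip connection.
import tensorflow as tf

def conv_blk(x, filters):
    """ConvBlk: strided 1-D convolution followed by a PReLU activation."""
    x = tf.keras.layers.Conv1D(filters, 31, strides=2, padding="same")(x)
    return tf.keras.layers.PReLU(shared_axes=[1])(x)

def deconv_blk(x, skip, filters):
    """deConvBlk: transposed 1-D convolution plus PReLU, fed by the mirrored skip."""
    x = tf.keras.layers.Conv1DTranspose(filters, 31, strides=2, padding="same")(x)
    x = tf.keras.layers.PReLU(shared_axes=[1])(x)
    # skip connection: encoder features flow directly into the decoding process
    return tf.keras.layers.Concatenate()([x, skip])
```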
GMASEGAN couples the global multi-head attention layer described above with the generator (de)ConvBlks. Taking the two-head global multi-head attention layer as an example, the invention couples the global multi-head attention layer with the third-to-last deConvBlk. Let F^l denote the output of the third-to-last deConvBlk, where l = 3 indicates the position of the global multi-head attention layer. The query matrix Q_n is obtained by the following transformation:
Q_n = W_{Q_n} F^l
where W_{Q_n} denotes the weight matrix of Q_n. F^l is downsampled to 512 x 32 before the key matrix is calculated. The key matrix K_n is obtained by the following transformation:
K_n = W_{K_n} downsample(F^l)
where W_{K_n} denotes the weight matrix of K_n. Similarly, F^l is downsampled to 512 x 32 before the value matrix is calculated. The value matrix V_n is obtained by the following transformation:
V_n = W_{V_n} downsample(F^l)
where W_{V_n} denotes the weight matrix of V_n. The attention coefficient matrix α_n is obtained by the following transformation:
α_n = softmax(Q_n K_n^T)
where T denotes the transpose of a vector or matrix. The output F' of the global multi-head attention layer is obtained by the following transformation:
F' = Conv1d(Cat(α_1 V_1, α_2 V_2))
where Cat(∙) denotes connecting multiple matrices together along a particular dimension, and Conv1d denotes the output of a one-dimensional convolutional layer with 32 filters.
The above calculation procedure also applies to the discriminator. As shown in FIG. 3, the discriminator structure is similar to the generator encoder, but it takes a pair of raw audio clips as input. The output of the last discriminator ConvBlk is flattened and presented to a fully connected layer, which serves as the last layer of the discriminator, with a softmax used for classification. The discriminator outputs true or false.
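A hedged Keras sketch of such a discriminator is given below. The filter counts, kernel size and LeakyReLU activation are assumptions (the text does not fix them), while the two-channel pair input, the flattened last ConvBlk output, the fully connected layer and the softmax classification follow the description above.

```python
# Rough sketch of the encoder-like discriminator; sizes are illustrative only.
import tensorflow as tf

def build_discriminator(signal_len=16384):
    # the pair of raw audio clips is presented as a 2-channel input
    pair = tf.keras.Input(shape=(signal_len, 2))
    x = pair
    for f in (16, 32, 64, 128, 256):
        x = tf.keras.layers.Conv1D(f, 31, strides=2, padding="same")(x)
        x = tf.keras.layers.LeakyReLU(0.3)(x)      # assumed activation
    x = tf.keras.layers.Flatten()(x)               # flatten the last ConvBlk output
    out = tf.keras.layers.Dense(2, activation="softmax")(x)  # true / false
    return tf.keras.Model(pair, out)
```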
In general, the global multi-head attention layer may be combined with any (de)ConvBlk, or even with all of them. The placement of the global multi-head attention layer is investigated in the experiments described below. In GMASEGAN, spectral normalization is applied to all (de)ConvBlks of the generator and the discriminator.
The method provided by the invention is evaluated on a data set derived from the Voice Bank corpus, from which 30 speakers are selected: 28 for the training set and 2 for the test set. For the training set, 40 noise conditions are created by combining 10 noise types with SNRs of 15, 10, 5 and 0 dB. Each training speaker has about 10 different sentences under each condition. The test set considers 20 different conditions in total: 5 noise types, each at 4 SNRs (17.5, 12.5, 7.5 and 2.5 dB), all taken from the DEMAND database. Each test speaker has about 20 different sentences under each condition. Because the speakers and conditions of the test set differ from those of the training set, the generalization ability of GMASEGAN can be better evaluated. All audio samples are downsampled from 48 kHz to 16 kHz.
The proposed GMASEGAN is built on the TensorFlow framework, for both training and testing. The network is trained for 100 epochs with a mini-batch size of 50, using RMSprop with a learning rate of 0.0002. Throughout training, the weight λ of the L1 regularization term is set to 100. During training, waveform chunks are extracted with a sliding window of approximately 1 second of speech with 50% overlap. During testing, in contrast, the window is slid over the entire test audio signal without overlap and the results are concatenated. In both the training and testing phases, a high-frequency pre-emphasis filter with a factor of 0.95 is applied to all input samples.
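The signal preparation just described can be sketched as follows; the file name, the exact 16384-sample window (roughly 1 second at 16 kHz) and the use of the librosa package are assumptions made for illustration.

```python
# Hedged sketch of resampling, 0.95 pre-emphasis, ~1 s windows with 50% overlap,
# and the stated RMSprop setting.
import numpy as np
import librosa                      # assumes the librosa package is available
import tensorflow as tf

wav, _ = librosa.load("example.wav", sr=48000)             # hypothetical input file
wav = librosa.resample(wav, orig_sr=48000, target_sr=16000)

def preemphasis(x, coeff=0.95):
    """High-frequency pre-emphasis: y[n] = x[n] - coeff * x[n-1]."""
    return np.append(x[0], x[1:] - coeff * x[:-1])

def sliding_windows(x, win=16384, hop=8192):
    """Roughly 1 s chunks at 16 kHz with 50% overlap, as used for training."""
    return np.stack([x[i:i + win] for i in range(0, len(x) - win + 1, hop)])

chunks = sliding_windows(preemphasis(wav))                 # training waveform blocks
optimizer = tf.keras.optimizers.RMSprop(learning_rate=2e-4)  # learning rate 0.0002
```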
The experiments of the present invention have three objectives. First, the role of the global multi-head attention layer in a speech enhancement network is investigated and quantified, with SEGAN used as the baseline for comparison. Second, how the performance of the generator and the discriminator changes when the global multi-head attention layer is placed at different positions is investigated; the proposed GMASEGAN is evaluated with different (de)ConvBlk index values. The first and second (de)ConvBlks could not be tested because of GPU memory limitations and their large time dimensions (8192 and 4096, respectively). Third, the proposed GMASEGAN and SASEGAN are compared in the frequency and time domains. The number of attention heads is set to N, and the influence of N on speech enhancement is analyzed quantitatively.
The experiment of the invention is mainly divided into the following stages to achieve the objective:
data preparation stage: speech signal samples, including noisy and noiseless speech, are first collected from a plurality of data sets. These samples are pre-processed, such as data enhancement, normalization, framing, etc., for use in subsequent experiments.
Model building: at this stage, the present invention builds the generator and discriminator according to the proposed GMASEGAN architecture. GMASEGAN contains a global multi-headed attention layer for capturing long-range dependencies in speech signals. The invention inserts the global multi-head attention layer into different (de) ConvBlk index values respectively for performance evaluation.
Training and optimizing: the invention inputs the collected data into the generator and discriminator for training. Different penalty functions (e.g., countering penalty and reconstructing penalty) are used to optimize the model parameters. At the same time, the invention will adjust the number of attention heads N to analyze its impact on speech enhancement performance.
Model evaluation stage: after model training is completed, the present invention uses various evaluation metrics (e.g., signal-to-noise ratio, PESQ, etc.) to evaluate the performance of GMASEGAN on speech enhancement tasks. To achieve the first part of the objective, the present invention compares GMASEGAN with SEGAN to quantitatively evaluate the role of the global multi-headed attention layer in a speech enhancement network.
Experimental analysis stage: the invention will analyze the experimental results and compare the performance changes of the generator and discriminator when the global multi-headed attention layer is in different positions. This will help determine the best position of the global multi-headed attention layer in GMASEGAN. Furthermore, the present invention will compare the performance of GMASEGAN and sassegan in the frequency and time domains to highlight the advantages of GMASEGAN proposed.
The result presentation stage: the invention arranges the experimental results and displays the specific processes, methods and results in forms of tables, charts, text descriptions and the like. This will help elucidate the effectiveness of the proposed GMASEGAN in achieving the speech enhancement objective.
1) Subjective evaluation
The quality of the speech under test is obtained by averaging the scores of all listeners and is called the Mean Opinion Score (MOS). Twenty sentences were randomly selected and presented to 10 listeners in random order. The selection process involved listening to some of the provided noisy audio signals and trying to balance the various noise types, since the data set does not specify the amount and type of noise for each audio clip. Most audio samples have a low SNR and a few have a high SNR. The following five versions of each sentence were presented in random order: the clean audio signal, the noisy audio signal, the SASEGAN-enhanced audio signal, and the GMASEGAN-enhanced audio signals for N = 2 and N = 4. Listeners scored the overall quality of each signal from 1 to 5 and were asked to consider both audio signal distortion and noise interference (e.g., 1 = bad: objectionable audio signal distortion; 5 = excellent: very natural speech, no audible noise or degradation). To ensure that listeners did not complete the survey carelessly, 10 severely distorted and incoherent audio excerpts were inserted into the evaluation as attention checks; a listener was removed from the analysis if three or more of these samples were rated 2 or higher. Each signal could be listened to as many times as desired, and listeners were asked to compare the five signals against one another.
2) Objective evaluation
PESQ: perceptual Evaluation of Speech Quality is the objective speech quality assessment algorithm that is currently most relevant to MOS scores. The wideband version is used because ITU-T p.862.2 uses and recommends wideband versions. The value range is-0.5 to 4.5. The higher the value, the better the enhanced speech quality.
CSIG: this is a MOS prediction for speech signal distortion only. The value range is 1 to 5. The higher the value, the less the enhanced speech distortion.
CBAK: this is an aggressive MOS prediction for background noise. The higher the value is, the lower the invasiveness of the background noise is.
COVL: this is a MOS prediction of the overall effect. The value range is 1-5, and the larger the value is, the better the overall voice quality enhancement effect is.
SSNR: note that the segment signal-to-noise ratio is the average of all speech frames. Using this index must first ensure that the pure speech and enhancement signals are aligned in the time domain, which is highly correlated with the subjective auditory perception of the listener. The value range is 1 to infinity. The higher the value, the better the enhanced speech quality.
STOI (%): short-Time Objective Intelligibility is one of the important indicators for speech intelligibility. The value range is 0 to 1. The higher the value, the higher the enhanced speech intelligibility.
The invention studies the effect on speech enhancement of placing the global multi-head attention layer at the (de)ConvBlks of SEGAN. The proposed GMASEGAN is compared with the traditional Wiener filtering method and with deep learning methods. Subjective and objective indicators were used to evaluate the speech enhancement effect, and the main findings are as follows:
1) During training, the proposed GMASEGAN converges significantly faster than SASEGAN. This shows that, although the number of model parameters of the proposed GMASEGAN increases slightly, the convergence speed can be faster. It was also found that the convergence speed increases further as the number of heads increases.
2) For N = 2 and N = 4, the absolute gains of the proposed GMASEGAN in subjective evaluation are 0.22 and 0.30, respectively, compared with SASEGAN. The results show that the proposed GMASEGAN is superior to the baseline SEGAN and SASEGAN in subjective evaluation.
3) The position of the global multi-head attention layer in the network has no obvious relation to the speech enhancement effect. GMASEGAN-Average (N = 2) is 0.14, 0.35, 0.04, 0.07, 0.01 and 0.08 higher in absolute value than SASEGAN-Average on STOI, SSNR, COVL, CBAK, CSIG and PESQ, respectively. GMASEGAN-Average (N = 4) is 0.12, 0.28, 0.03, 0.02 and 0.05 higher in absolute value than GMASEGAN-Average (N = 2) on STOI, SSNR, COVL, CBAK and CSIG, respectively. In terms of objective assessment, the results show that the proposed GMASEGAN is generally superior to SASEGAN and that the enhancement effect may increase as the number of attention heads increases.
4) Compared with the baseline SEGAN, the absolute gains of the proposed GMASEGAN on PESQ, CSIG, CBAK, COVL, SSNR and STOI are 0.23, 0.14, 0.21, 0.19, 1.04 and 0.34, respectively. In addition, a comparison was made with the Wiener filtering method and with deep learning methods; the enhancement of the proposed GMASEGAN is superior to the Wiener filter and to the other deep learning methods.
5) At low SNR (SNR = 2.5 or 7.5), it can be seen intuitively from the spectrograms that the proposed GMASEGAN (N = 4) can reduce noise that SASEGAN cannot handle. At high SNR (SNR = 12.5 or 17.5), the proposed GMASEGAN can not only suppress the additive noise but also attenuate the background noise present in the clean speech signals, since the clean speech signals themselves contain some background noise from recording.
Importantly, the global multi-head attention layer can readily be used in other speech enhancement networks for potential improvement. From a future perspective, the present invention will explore the placement of global multi-head attention layers in other networks to further improve performance. In addition, a larger N will be used to evaluate the enhancement performance of GMASEGAN. The present invention will also train the proposed GMASEGAN with data sets containing more types of noise, making it suitable for a variety of real-world noise environments.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any equivalent substitutions and modifications made by a person skilled in the art within the technical scope disclosed by the present invention, according to the technical solution and the inventive concept thereof, shall fall within the protection scope of the present invention.

Claims (9)

1. A global multi-head attention speech enhancement method, comprising the following steps:
step one: acquiring a noisy audio signal;
step two: inputting the noisy audio signal into several convolution layers of a generator encoder to obtain a convolution feature map;
step three: inputting the convolution feature map obtained in step two into a global multi-head attention layer of the generator encoder to obtain a global multi-head attention feature map;
step four: inputting the global multi-head attention feature map obtained in step three into the other convolution layers of the generator encoder to obtain a convolution-global multi-head attention-convolution feature map;
step five: superimposing the convolution-global multi-head attention-convolution feature map obtained in step four with random noise z sampled from a Gaussian distribution, and then inputting the result into several deconvolution layers of a generator decoder to obtain a deconvolution feature map;
step six: inputting the deconvolution feature map obtained in step five into a global multi-head attention layer of the generator decoder to obtain a decoding-global multi-head attention feature map;
step seven: inputting the decoding-global multi-head attention feature map obtained in step six into the other deconvolution layers of the generator decoder to obtain an enhanced audio signal.
2. The global multi-head attention speech enhancement method according to claim 1, wherein step three comprises a first preliminary step of: obtaining the query matrix Q_n, the key matrix K_n and the value matrix V_n of the convolution feature map, where n = 1, 2, ..., N and N denotes the number of attention heads.
3. The global multi-head attention speech enhancement method according to claim 2, wherein step three further comprises a second preliminary step of: calculating the weight matrix of the global multi-head attention:
α_n = softmax(Q_n K_n^T)
where n = 1, 2, ..., N, with N denoting the number of heads; softmax denotes the normalized exponential function; T denotes the transpose of a vector or matrix.
4. The global multi-head attention speech enhancement method according to claim 3, wherein the global multi-head attention feature map in step three is calculated as follows:
Conv1d(Cat(α_1 V_1, α_2 V_2, ..., α_N V_N))
where Cat is the concatenation (concat) function of the convolutional neural network and denotes feature fusion, and Conv1d denotes a one-dimensional convolution.
5. The global multi-head attention speech enhancement method according to any one of claims 2 to 4, characterized in that:
Q_n = W_{Q_n} F
K_n = W_{K_n} downsample(F)
V_n = W_{V_n} downsample(F)
wherein:
W_{Q_n} denotes the weight matrix of Q_n;
W_{K_n} denotes the weight matrix of K_n;
W_{V_n} denotes the weight matrix of V_n;
F is the convolution feature map output in step two;
downsample denotes the downsampling operation.
6. The global multi-head attention speech enhancement method according to claim 1, wherein step six comprises a first preliminary step of: obtaining the query matrix Q_n, the key matrix K_n and the value matrix V_n of the deconvolution feature map, where n = 1, 2, ..., N and N denotes the number of attention heads.
7. The global multi-head attention speech enhancement method according to claim 6, wherein step six further comprises a second preliminary step of: calculating the weight matrix of the global multi-head attention:
α_n = softmax(Q_n K_n^T)
where n = 1, 2, ..., N, with N denoting the number of heads; softmax denotes the normalized exponential function; T denotes the transpose of a vector or matrix.
8. The global multi-head attention speech enhancement method according to claim 7, wherein the decoding-global multi-head attention feature map in step six is calculated as follows:
Conv1d(Cat(α_1 V_1, α_2 V_2, ..., α_N V_N))
where Cat is the concatenation (concat) function of the convolutional neural network and denotes feature fusion, and Conv1d denotes a one-dimensional convolution.
9. The global multi-head attention speech enhancement method according to any one of claims 6 to 8, characterized in that:
Q_n = W_{Q_n} D
K_n = W_{K_n} downsample(D)
V_n = W_{V_n} downsample(D)
wherein:
W_{Q_n} denotes the weight matrix of Q_n;
W_{K_n} denotes the weight matrix of K_n;
W_{V_n} denotes the weight matrix of V_n;
D is the deconvolution feature map output in step five;
downsample denotes the downsampling operation.
CN202310447342.7A 2023-04-24 2023-04-24 Global multi-head attention voice enhancement method Active CN116189703B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310447342.7A CN116189703B (en) 2023-04-24 2023-04-24 Global multi-head attention voice enhancement method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310447342.7A CN116189703B (en) 2023-04-24 2023-04-24 Global multi-head attention voice enhancement method

Publications (2)

Publication Number Publication Date
CN116189703A CN116189703A (en) 2023-05-30
CN116189703B true CN116189703B (en) 2023-07-14

Family

ID=86452466

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310447342.7A Active CN116189703B (en) 2023-04-24 2023-04-24 Global multi-head attention voice enhancement method

Country Status (1)

Country Link
CN (1) CN116189703B (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110739003B (en) * 2019-10-23 2022-10-28 北京计算机技术及应用研究所 Voice enhancement method based on multi-head self-attention mechanism
US11715480B2 (en) * 2021-03-23 2023-08-01 Qualcomm Incorporated Context-based speech enhancement
CN114141238A (en) * 2021-11-26 2022-03-04 中国人民解放军陆军工程大学 Voice enhancement method fusing Transformer and U-net network
CN114664318A (en) * 2022-03-25 2022-06-24 山东省计算中心(国家超级计算济南中心) Voice enhancement method and system based on generation countermeasure network
CN115700882A (en) * 2022-10-21 2023-02-07 东南大学 Voice enhancement method based on convolution self-attention coding structure
CN115762544A (en) * 2022-11-15 2023-03-07 南京邮电大学 Voice enhancement method based on dynamic convolution and narrow-band former
CN115602152B (en) * 2022-12-14 2023-02-28 成都启英泰伦科技有限公司 Voice enhancement method based on multi-stage attention network

Also Published As

Publication number Publication date
CN116189703A (en) 2023-05-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant