CN116189703B - Global multi-head attention voice enhancement method - Google Patents

Global multi-head attention voice enhancement method

Info

Publication number
CN116189703B
CN116189703B CN202310447342.7A
Authority
CN
China
Prior art keywords
global multi-head attention
convolution
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310447342.7A
Other languages
Chinese (zh)
Other versions
CN116189703A (en)
Inventor
楚明航
王靖
马瑶瑶
黄玉玲
杨梦涛
范智玮
徐超
吴迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN202310447342.7A priority Critical patent/CN116189703B/en
Publication of CN116189703A publication Critical patent/CN116189703A/en
Application granted granted Critical
Publication of CN116189703B publication Critical patent/CN116189703B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G10L19/02: Speech or audio signals analysis-synthesis techniques for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Mathematical Physics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a global multi-head attention voice enhancement method, which relates to the field of generative adversarial networks. The method comprises: inputting a noisy audio signal into a generator encoder for convolution to obtain a convolution feature map; inputting the convolution feature map into a global multi-head attention layer to obtain a global multi-head attention feature map; inputting the global multi-head attention feature map into the generator encoder to obtain a convolution-global multi-head attention-convolution feature map; superimposing the convolution-global multi-head attention-convolution feature map with random noise z sampled from a Gaussian distribution and inputting the result into a generator decoder to obtain a deconvolution feature map; inputting the deconvolution feature map into a global multi-head attention layer to obtain a decoding-global multi-head attention feature map; and inputting the decoding-global multi-head attention feature map into the generator decoder to obtain an enhanced audio signal. The invention can be used in speech enhancement networks and can capture time dependence.

Description

Global multi-head attention voice enhancement method
Technical Field
The present invention relates to the field of generative adversarial networks, and more particularly to a global multi-head attention speech enhancement method.
Background
In recent years, voice enhancement methods based on generative adversarial networks (GANs) have been proposed, in which end-to-end voice enhancement is achieved by feeding waveforms directly into the network. However, existing speech enhancement GANs rely entirely on convolution operations, which may mask the time dependence of the sequence input.
Disclosure of Invention
The invention aims to provide a global multi-head attention voice enhancement method for solving the problems in the background art.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a global multi-head attention speech enhancement method, comprising the following steps:
step one: acquiring a noisy audio signal;
step two: inputting the noisy audio signal into several convolution layers of a generator encoder to obtain a convolution feature map;
step three: inputting the convolution feature map obtained in step two into a global multi-head attention layer of the generator encoder to obtain a global multi-head attention feature map;
step four: inputting the global multi-head attention feature map obtained in step three into the remaining convolution layers of the generator encoder to obtain a convolution-global multi-head attention-convolution feature map;
step five: superimposing the convolution-global multi-head attention-convolution feature map obtained in step four with random noise z sampled from a Gaussian distribution, and then inputting the result into several deconvolution layers of a generator decoder to obtain a deconvolution feature map;
step six: inputting the deconvolution feature map obtained in step five into a global multi-head attention layer of the generator decoder to obtain a decoding-global multi-head attention feature map;
step seven: inputting the decoding-global multi-head attention feature map obtained in step six into the remaining deconvolution layers of the generator decoder to obtain an enhanced audio signal.
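For illustration only, the following TensorFlow/Keras sketch traces the seven steps above end to end. It is a minimal, assumption-laden outline: the layer counts, kernel sizes, strides and filter numbers are placeholders rather than the patented configuration, and tf.keras.layers.MultiHeadAttention merely stands in for the global multi-head attention layer defined in the following paragraphs.

```python
# Hedged sketch of the seven-step generator flow; sizes are illustrative only.
import tensorflow as tf

def build_generator(signal_len=16384, n_heads=2):
    noisy = tf.keras.Input(shape=(signal_len, 1), name="noisy_audio")       # step one
    x = noisy
    for f in (16, 32, 64):                                                  # step two: first convolutions
        x = tf.keras.layers.Conv1D(f, 31, strides=2, padding="same")(x)
        x = tf.keras.layers.PReLU(shared_axes=[1])(x)
    # step three: stand-in for the global multi-head attention layer
    x = tf.keras.layers.MultiHeadAttention(num_heads=n_heads, key_dim=16)(x, x)
    for f in (128, 256):                                                    # step four: remaining convolutions
        x = tf.keras.layers.Conv1D(f, 31, strides=2, padding="same")(x)
        x = tf.keras.layers.PReLU(shared_axes=[1])(x)
    z = tf.keras.Input(shape=(signal_len // 32, 256), name="z")             # step five: Gaussian code z
    x = tf.keras.layers.Concatenate()([x, z])
    for f in (128, 64):                                                     # step five: first deconvolutions
        x = tf.keras.layers.Conv1DTranspose(f, 31, strides=2, padding="same")(x)
        x = tf.keras.layers.PReLU(shared_axes=[1])(x)
    # step six: stand-in for the decoder-side global multi-head attention layer
    x = tf.keras.layers.MultiHeadAttention(num_heads=n_heads, key_dim=16)(x, x)
    for f in (32, 16, 1):                                                   # step seven: remaining deconvolutions
        x = tf.keras.layers.Conv1DTranspose(f, 31, strides=2, padding="same")(x)
    return tf.keras.Model([noisy, z], x)
```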
Further, step three includes a first preliminary step: obtaining the query matrix Q_n, the key matrix K_n and the value matrix V_n of the convolution feature map, where n = 1, 2, ..., N and N denotes the number of attention heads.
Further, step three also includes a second preliminary step: calculating the weight matrix of the global multi-head attention:
α_n = softmax(Q_n K_n^T)
where n = 1, 2, ..., N, with N denoting the number of heads; softmax denotes the normalized exponential function; T denotes the transpose of a vector or matrix.
Further, the global multi-head attention feature map in step three is calculated as follows:
Conv1d(Cat(α_1 V_1, α_2 V_2, ..., α_N V_N))
where Cat is the concatenation (concat) function of the convolutional neural network and denotes feature fusion, and Conv1d denotes a one-dimensional convolution.
Further:
Q_n = W_{Q_n} F
K_n = W_{K_n} downsample(F)
V_n = W_{V_n} downsample(F)
wherein:
W_{Q_n} denotes the weight matrix of Q_n;
W_{K_n} denotes the weight matrix of K_n;
W_{V_n} denotes the weight matrix of V_n;
F is the convolution feature map output in step two;
downsample denotes the downsampling operation.
Further, step six includes a first preliminary step: obtaining the query matrix Q_n, the key matrix K_n and the value matrix V_n of the deconvolution feature map, where n = 1, 2, ..., N and N denotes the number of attention heads.
Further, step six also includes a second preliminary step: calculating the weight matrix of the global multi-head attention:
α_n = softmax(Q_n K_n^T)
where n = 1, 2, ..., N, with N denoting the number of heads; softmax denotes the normalized exponential function; T denotes the transpose of a vector or matrix.
Further, the decoding-global multi-head attention feature map in step six is calculated as follows:
Conv1d(Cat(α_1 V_1, α_2 V_2, ..., α_N V_N))
where Cat is the concatenation (concat) function of the convolutional neural network and denotes feature fusion, and Conv1d denotes a one-dimensional convolution.
Further:
Q_n = W_{Q_n} D
K_n = W_{K_n} downsample(D)
V_n = W_{V_n} downsample(D)
wherein:
W_{Q_n} denotes the weight matrix of Q_n;
W_{K_n} denotes the weight matrix of K_n;
W_{V_n} denotes the weight matrix of V_n;
D is the deconvolution feature map output in step five;
downsample denotes the downsampling operation.
Compared with the prior art, the invention has the following advantages. Compared with the traditional SASEGAN (self-attention speech enhancement generative adversarial network), the proposed global multi-head attention speech enhancement generative adversarial network (GMASEGAN), which operates on the raw audio signal, has slightly more network parameters but converges faster during training. In subjective evaluation, the proposed GMASEGAN improves on the traditional SEGAN (speech enhancement generative adversarial network) and SASEGAN by 8.81% and 5.78%, respectively. In objective evaluation, the proposed GMASEGAN achieves absolute gains of 0.07, 0.06, 0.09, 0.07, 0.63 and 0.26 over SASEGAN on PESQ, CSIG, CBAK, COVL, SSNR and STOI, respectively. In addition, the proposed GMASEGAN (with N = 4 attention heads) can attenuate the background noise present in the clean speech, and the speech enhancement effect of GMASEGAN may improve further as the number of heads increases. More importantly, the proposed global multi-head attention layer can be used in other speech enhancement networks whose backbone is a convolutional layer, so as to capture time dependence.
Drawings
FIG. 1 is a schematic diagram of the processing steps of the global multi-head attention layer;
FIG. 2 is a schematic diagram of the network structure of the generator and the computation of the global multi-head attention layer in the network;
FIG. 3 is a schematic diagram of the network structure of the discriminator.
Detailed Description
The following describes specific embodiments of the present invention with reference to the drawings.
First, a global multi-head attention layer is described:
the global multi-headed note layer is adapted from a non-local self-note layer. Here, two attention heads are taken as an example. Processing steps of the global multi-head attention layer as shown in fig. 1, fig. 1 is a diagram of processing steps of the global multi-head attention layer (attention head number n=2). Taking two vectors as an example, T represents the transpose of the vector. In order to keep the attention coefficient in the range of 0-1, in
Figure SMS_33
A softmax layer was then added.
Consider the feature map A ∈ R^{T×C} output by a one-dimensional convolutional layer, where T represents the length of the time dimension and C represents the number of channels. A is divided into T vectors, A = [a_1, a_2, ..., a_T], where a_t ∈ R^C is the channel feature vector at each time point. For each attention head n (n = 1, 2 in this example), the query vector q_n, the key vector k_n and the value vector v_n are obtained by
q_n = W_{q_n} a, k_n = W_{k_n} a, v_n = W_{v_n} a,
where W_{q_n}, W_{k_n} and W_{v_n} are weight matrices implemented by 1 x 1 convolutional layers. The attention coefficient α_n is obtained by the following transformation:
α_n = softmax(q_n^T k_n),
where T denotes the transpose of a vector or matrix; the softmax layer is used to ensure that the attention coefficient lies in the range 0-1. The attention feature map b is obtained by the following transformation:
b = Conv1d(Cat(α_1 v_1, α_2 v_2)),
where Cat(∙) denotes connecting multiple matrices together along a particular dimension, and Conv1d denotes the output of a 1 x 1 convolutional layer whose number of channels is the same as that of A.
The network architecture and method of the present invention are described below:
the network structure of the generator and the calculation of the global multi-headed attention layer in the network are shown in fig. 2. The upper half shows the network structure of the generator and the output shape of each layer. The lower part shows how the data is calculated in the global multi-head attention layer after adding the global multi-head attention layer after the third last convolutional layer of the decoder, which is also provided at the encoder position corresponding to the vertical direction in the figure.
The generator, which receives the noise-corrupted audio signal as input, is a fully convolutional encoder-decoder architecture.
A parametric rectified linear unit (PReLU) is employed as the activation function of the generator. A convolutional layer and an activation layer are combined into one convolution block (ConvBlk), which is the basic unit of the generator encoder. The generator encoder consists of 11 ConvBlks whose convolutional layers use a stride of 2, with the number of filters increasing from layer to layer. The output of the encoder is an 8 x 1024 feature map. The latent code z is randomly sampled from a Gaussian distribution, stacked onto the encoder output, and presented to the generator decoder.
The generator decoder mirrors the generator encoder architecture, reversing the encoding process by deconvolution. A deconvolution layer and an activation layer are combined into a deconvolution block (deConvBlk), which is the basic unit of the generator decoder. Skip connections link each ConvBlk in the encoder to its mirrored deConvBlk in the decoder, allowing data from the encoding process to flow into the decoding process. The output of the decoder is the enhanced audio signal.
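As a rough sketch only, the ConvBlk/deConvBlk pairing and the skip connection described above can be written as follows in Keras; the kernel size and stride are assumptions rather than the patented values.

```python
# Hedged sketch of one ConvBlk / deConvBlk pair with a skip connection.
import tensorflow as tf

def conv_blk(x, filters):
    """ConvBlk: strided 1-D convolution followed by a PReLU activation."""
    x = tf.keras.layers.Conv1D(filters, 31, strides=2, padding="same")(x)
    return tf.keras.layers.PReLU(shared_axes=[1])(x)

def deconv_blk(x, skip, filters):
    """deConvBlk: transposed 1-D convolution plus PReLU, fed by the mirrored skip."""
    x = tf.keras.layers.Conv1DTranspose(filters, 31, strides=2, padding="same")(x)
    x = tf.keras.layers.PReLU(shared_axes=[1])(x)
    # skip connection: encoder features flow directly into the decoding process
    return tf.keras.layers.Concatenate()([x, skip])
```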
GMASEGAN couples the global multi-head attention layer described above with the generator (de)ConvBlks. Taking the two-head global multi-head attention layer as an example, the invention couples the global multi-head attention layer with the third-to-last deConvBlk. Let F^l denote the output of the third-to-last deConvBlk, where l = 3 indicates the position of the global multi-head attention layer. The query matrix Q_n is obtained by the following transformation:
Q_n = W_{Q_n} F^l
where W_{Q_n} denotes the weight matrix of Q_n. F^l is downsampled to 512 x 32 before the key matrix is calculated. The key matrix K_n is obtained by the following transformation:
K_n = W_{K_n} downsample(F^l)
where W_{K_n} denotes the weight matrix of K_n. Similarly, F^l is downsampled to 512 x 32 before the value matrix is calculated. The value matrix V_n is obtained by the following transformation:
V_n = W_{V_n} downsample(F^l)
where W_{V_n} denotes the weight matrix of V_n. The attention coefficient matrix α_n is obtained by the following transformation:
α_n = softmax(Q_n K_n^T)
where T denotes the transpose of a vector or matrix. The output F' of the global multi-head attention layer is obtained by the following transformation:
F' = Conv1d(Cat(α_1 V_1, α_2 V_2))
where Cat(∙) denotes connecting multiple matrices together along a particular dimension, and Conv1d denotes the output of a one-dimensional convolutional layer with 32 filters.
The above calculation procedure also applies to the discriminator. As shown in FIG. 3, the discriminator structure is similar to the generator encoder, but it takes a pair of raw audio clips as input. The output of the last discriminator ConvBlk is flattened and presented to a fully connected layer, which serves as the last layer of the discriminator, with a softmax used for classification. The discriminator outputs true or false.
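A hedged Keras sketch of such a discriminator is given below. The filter counts, kernel size and LeakyReLU activation are assumptions (the text does not fix them), while the two-channel pair input, the flattened last ConvBlk output, the fully connected layer and the softmax classification follow the description above.

```python
# Rough sketch of the encoder-like discriminator; sizes are illustrative only.
import tensorflow as tf

def build_discriminator(signal_len=16384):
    # the pair of raw audio clips is presented as a 2-channel input
    pair = tf.keras.Input(shape=(signal_len, 2))
    x = pair
    for f in (16, 32, 64, 128, 256):
        x = tf.keras.layers.Conv1D(f, 31, strides=2, padding="same")(x)
        x = tf.keras.layers.LeakyReLU(0.3)(x)      # assumed activation
    x = tf.keras.layers.Flatten()(x)               # flatten the last ConvBlk output
    out = tf.keras.layers.Dense(2, activation="softmax")(x)  # true / false
    return tf.keras.Model(pair, out)
```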
In general, the global multi-head attention layer may be combined with any (de)ConvBlk, or even with all of them. The placement of the global multi-head attention layer is investigated in the experiments described below. In GMASEGAN, spectral normalization is applied to all (de)ConvBlks of the generator and the discriminator.
The method provided by the invention is evaluated on a data set derived from the Voice Bank corpus, from which 30 speakers are selected: 28 for the training set and 2 for the test set. For the training set, 40 noise conditions are created by combining 10 noise types with SNRs of 15, 10, 5 and 0 dB. Each training speaker has about 10 different sentences under each condition. The test set considers 20 different conditions in total: 5 noise types, each at 4 SNRs (17.5, 12.5, 7.5 and 2.5 dB), all taken from the DEMAND database. Each test speaker has about 20 different sentences under each condition. Because the speakers and conditions of the test set differ from those of the training set, the generalization ability of GMASEGAN can be better evaluated. All audio samples are downsampled from 48 kHz to 16 kHz.
The proposed GMASEGAN is built on the TensorFlow framework, for both training and testing. The network is trained for 100 epochs with a mini-batch size of 50, using RMSprop with a learning rate of 0.0002. Throughout training, the weight λ of the L1 regularization term is set to 100. During training, waveform chunks are extracted with a sliding window of approximately 1 second of speech with 50% overlap. During testing, in contrast, the window is slid over the entire test audio signal without overlap and the results are concatenated. In both the training and testing phases, a high-frequency pre-emphasis filter with a factor of 0.95 is applied to all input samples.
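The signal preparation just described can be sketched as follows; the file name, the exact 16384-sample window (roughly 1 second at 16 kHz) and the use of the librosa package are assumptions made for illustration.

```python
# Hedged sketch of resampling, 0.95 pre-emphasis, ~1 s windows with 50% overlap,
# and the stated RMSprop setting.
import numpy as np
import librosa                      # assumes the librosa package is available
import tensorflow as tf

wav, _ = librosa.load("example.wav", sr=48000)             # hypothetical input file
wav = librosa.resample(wav, orig_sr=48000, target_sr=16000)

def preemphasis(x, coeff=0.95):
    """High-frequency pre-emphasis: y[n] = x[n] - coeff * x[n-1]."""
    return np.append(x[0], x[1:] - coeff * x[:-1])

def sliding_windows(x, win=16384, hop=8192):
    """Roughly 1 s chunks at 16 kHz with 50% overlap, as used for training."""
    return np.stack([x[i:i + win] for i in range(0, len(x) - win + 1, hop)])

chunks = sliding_windows(preemphasis(wav))                 # training waveform blocks
optimizer = tf.keras.optimizers.RMSprop(learning_rate=2e-4)  # learning rate 0.0002
```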
The experiments of the present invention have three objectives. First, the role of the global multi-head attention layer in a speech enhancement network is investigated and quantified, with SEGAN used as the baseline for comparison. Second, how the performance of the generator and the discriminator changes when the global multi-head attention layer is placed at different positions is investigated; the proposed GMASEGAN is evaluated with different (de)ConvBlk index values. The first and second (de)ConvBlks could not be tested because of GPU memory limitations and their large time dimensions (8192 and 4096, respectively). Third, the proposed GMASEGAN and SASEGAN are compared in the frequency and time domains. The number of attention heads is set to N, and the influence of N on speech enhancement is analyzed quantitatively.
The experiment of the invention is mainly divided into the following stages to achieve the objective:
data preparation stage: speech signal samples, including noisy and noiseless speech, are first collected from a plurality of data sets. These samples are pre-processed, such as data enhancement, normalization, framing, etc., for use in subsequent experiments.
Model building: at this stage, the present invention builds the generator and discriminator according to the proposed GMASEGAN architecture. GMASEGAN contains a global multi-headed attention layer for capturing long-range dependencies in speech signals. The invention inserts the global multi-head attention layer into different (de) ConvBlk index values respectively for performance evaluation.
Training and optimizing: the invention inputs the collected data into the generator and discriminator for training. Different penalty functions (e.g., countering penalty and reconstructing penalty) are used to optimize the model parameters. At the same time, the invention will adjust the number of attention heads N to analyze its impact on speech enhancement performance.
Model evaluation stage: after model training is completed, the present invention uses various evaluation metrics (e.g., signal-to-noise ratio, PESQ, etc.) to evaluate the performance of GMASEGAN on speech enhancement tasks. To achieve the first part of the objective, the present invention compares GMASEGAN with SEGAN to quantitatively evaluate the role of the global multi-headed attention layer in a speech enhancement network.
Experimental analysis stage: the invention will analyze the experimental results and compare the performance changes of the generator and discriminator when the global multi-headed attention layer is in different positions. This will help determine the best position of the global multi-headed attention layer in GMASEGAN. Furthermore, the present invention will compare the performance of GMASEGAN and sassegan in the frequency and time domains to highlight the advantages of GMASEGAN proposed.
The result presentation stage: the invention arranges the experimental results and displays the specific processes, methods and results in forms of tables, charts, text descriptions and the like. This will help elucidate the effectiveness of the proposed GMASEGAN in achieving the speech enhancement objective.
1) Subjective evaluation
The quality of the speech under test is obtained by averaging the scores of all listeners and is called the Mean Opinion Score (MOS). Twenty sentences were randomly selected and presented to 10 listeners in random order. The selection process involved listening to some of the provided noisy audio signals and trying to balance the various noise types, since the data set does not specify the amount and type of noise for each audio clip. Most audio samples have a low SNR and a few have a high SNR. The following five versions of each sentence were presented in random order: the clean audio signal, the noisy audio signal, the SASEGAN-enhanced audio signal, and the GMASEGAN-enhanced audio signals for N = 2 and N = 4. Listeners scored the overall quality of each signal from 1 to 5 and were asked to consider both audio signal distortion and noise interference (e.g., 1 = bad: objectionable audio signal distortion; 5 = excellent: very natural speech, no audible noise or degradation). To ensure that listeners did not complete the survey carelessly, 10 severely distorted and incoherent audio excerpts were inserted into the evaluation as attention checks; a listener was removed from the analysis if three or more of these samples were rated 2 or higher. Each signal could be listened to as many times as desired, and listeners were asked to compare the five signals against one another.
2) Objective evaluation
PESQ: perceptual Evaluation of Speech Quality is the objective speech quality assessment algorithm that is currently most relevant to MOS scores. The wideband version is used because ITU-T p.862.2 uses and recommends wideband versions. The value range is-0.5 to 4.5. The higher the value, the better the enhanced speech quality.
CSIG: this is a MOS prediction for speech signal distortion only. The value range is 1 to 5. The higher the value, the less the enhanced speech distortion.
CBAK: this is an aggressive MOS prediction for background noise. The higher the value is, the lower the invasiveness of the background noise is.
COVL: this is a MOS prediction of the overall effect. The value range is 1-5, and the larger the value is, the better the overall voice quality enhancement effect is.
SSNR: note that the segment signal-to-noise ratio is the average of all speech frames. Using this index must first ensure that the pure speech and enhancement signals are aligned in the time domain, which is highly correlated with the subjective auditory perception of the listener. The value range is 1 to infinity. The higher the value, the better the enhanced speech quality.
STOI (%): short-Time Objective Intelligibility is one of the important indicators for speech intelligibility. The value range is 0 to 1. The higher the value, the higher the enhanced speech intelligibility.
The invention studies the effect on speech enhancement of placing the global multi-head attention layer at the (de)ConvBlks of SEGAN. The proposed GMASEGAN is compared with the traditional Wiener filtering method and with deep learning methods. Subjective and objective indicators were used to evaluate the speech enhancement effect, and the main findings are as follows:
1) During training, the proposed GMASEGAN converges significantly faster than SASEGAN. This shows that, although the number of model parameters of the proposed GMASEGAN increases slightly, the convergence speed can be faster. It was also found that the convergence speed increases further as the number of heads increases.
2) For N = 2 and N = 4, the absolute gains of the proposed GMASEGAN in subjective evaluation are 0.22 and 0.30, respectively, compared with SASEGAN. The results show that the proposed GMASEGAN is superior to the baseline SEGAN and SASEGAN in subjective evaluation.
3) The position of the global multi-head attention layer in the network has no obvious relation to the speech enhancement effect. GMASEGAN-Average (N = 2) is 0.14, 0.35, 0.04, 0.07, 0.01 and 0.08 higher in absolute value than SASEGAN-Average on STOI, SSNR, COVL, CBAK, CSIG and PESQ, respectively. GMASEGAN-Average (N = 4) is 0.12, 0.28, 0.03, 0.02 and 0.05 higher in absolute value than GMASEGAN-Average (N = 2) on STOI, SSNR, COVL, CBAK and CSIG, respectively. In terms of objective assessment, the results show that the proposed GMASEGAN is generally superior to SASEGAN and that the enhancement effect may increase as the number of attention heads increases.
4) Compared with the baseline SEGAN, the absolute gains of the proposed GMASEGAN on PESQ, CSIG, CBAK, COVL, SSNR and STOI are 0.23, 0.14, 0.21, 0.19, 1.04 and 0.34, respectively. In addition, a comparison was made with the Wiener filtering method and with deep learning methods; the enhancement of the proposed GMASEGAN is superior to the Wiener filter and to the other deep learning methods.
5) At low SNR (SNR = 2.5 or 7.5), it can be seen intuitively from the spectrograms that the proposed GMASEGAN (N = 4) can reduce noise that SASEGAN cannot handle. At high SNR (SNR = 12.5 or 17.5), the proposed GMASEGAN can not only suppress the additive noise but also attenuate the background noise present in the clean speech signals, since the clean speech signals themselves contain some background noise from recording.
Importantly, the global multi-head attention layer can readily be used in other speech enhancement networks for potential improvement. From a future perspective, the present invention will explore the placement of global multi-head attention layers in other networks to further improve performance. In addition, a larger N will be used to evaluate the enhancement performance of GMASEGAN. The present invention will also train the proposed GMASEGAN with data sets containing more types of noise, making it suitable for a variety of real-world noise environments.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any equivalent substitutions and modifications made by a person skilled in the art within the technical scope disclosed by the present invention, according to the technical solution and the inventive concept thereof, shall fall within the protection scope of the present invention.

Claims (9)

1. A global multi-head attention speech enhancement method, comprising the following steps:
step one: acquiring a noisy audio signal;
step two: inputting the noisy audio signal into several convolution layers of a generator encoder to obtain a convolution feature map;
step three: inputting the convolution feature map obtained in step two into a global multi-head attention layer of the generator encoder to obtain a global multi-head attention feature map;
step four: inputting the global multi-head attention feature map obtained in step three into the other convolution layers of the generator encoder to obtain a convolution-global multi-head attention-convolution feature map;
step five: superimposing the convolution-global multi-head attention-convolution feature map obtained in step four with random noise z sampled from a Gaussian distribution, and then inputting the result into several deconvolution layers of a generator decoder to obtain a deconvolution feature map;
step six: inputting the deconvolution feature map obtained in step five into a global multi-head attention layer of the generator decoder to obtain a decoding-global multi-head attention feature map;
step seven: inputting the decoding-global multi-head attention feature map obtained in step six into the other deconvolution layers of the generator decoder to obtain an enhanced audio signal.
2. The global multi-head attention speech enhancement method according to claim 1, wherein step three comprises a first preliminary step of: obtaining the query matrix Q_n, the key matrix K_n and the value matrix V_n of the convolution feature map, where n = 1, 2, ..., N and N denotes the number of attention heads.
3. The global multi-head attention speech enhancement method according to claim 2, wherein step three further comprises a second preliminary step of: calculating the weight matrix of the global multi-head attention:
α_n = softmax(Q_n K_n^T)
where n = 1, 2, ..., N, with N denoting the number of heads; softmax denotes the normalized exponential function; T denotes the transpose of a vector or matrix.
4. The global multi-head attention speech enhancement method according to claim 3, wherein the global multi-head attention feature map in step three is calculated as follows:
Conv1d(Cat(α_1 V_1, α_2 V_2, ..., α_N V_N))
where Cat is the concatenation (concat) function of the convolutional neural network and denotes feature fusion, and Conv1d denotes a one-dimensional convolution.
5. The global multi-head attention speech enhancement method according to any one of claims 2 to 4, characterized in that:
Q_n = W_{Q_n} F
K_n = W_{K_n} downsample(F)
V_n = W_{V_n} downsample(F)
wherein:
W_{Q_n} denotes the weight matrix of Q_n;
W_{K_n} denotes the weight matrix of K_n;
W_{V_n} denotes the weight matrix of V_n;
F is the convolution feature map output in step two;
downsample denotes the downsampling operation.
6. The global multi-head attention speech enhancement method according to claim 1, wherein step six comprises a first preliminary step of: obtaining the query matrix Q_n, the key matrix K_n and the value matrix V_n of the deconvolution feature map, where n = 1, 2, ..., N and N denotes the number of attention heads.
7. The global multi-head attention speech enhancement method according to claim 6, wherein step six further comprises a second preliminary step of: calculating the weight matrix of the global multi-head attention:
α_n = softmax(Q_n K_n^T)
where n = 1, 2, ..., N, with N denoting the number of heads; softmax denotes the normalized exponential function; T denotes the transpose of a vector or matrix.
8. The global multi-head attention speech enhancement method according to claim 7, wherein the decoding-global multi-head attention feature map in step six is calculated as follows:
Conv1d(Cat(α_1 V_1, α_2 V_2, ..., α_N V_N))
where Cat is the concatenation (concat) function of the convolutional neural network and denotes feature fusion, and Conv1d denotes a one-dimensional convolution.
9. The global multi-head attention speech enhancement method according to any one of claims 6 to 8, characterized in that:
Q_n = W_{Q_n} D
K_n = W_{K_n} downsample(D)
V_n = W_{V_n} downsample(D)
wherein:
W_{Q_n} denotes the weight matrix of Q_n;
W_{K_n} denotes the weight matrix of K_n;
W_{V_n} denotes the weight matrix of V_n;
D is the deconvolution feature map output in step five;
downsample denotes the downsampling operation.
CN202310447342.7A 2023-04-24 2023-04-24 Global multi-head attention voice enhancement method Active CN116189703B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310447342.7A CN116189703B (en) 2023-04-24 2023-04-24 Global multi-head attention voice enhancement method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310447342.7A CN116189703B (en) 2023-04-24 2023-04-24 Global multi-head attention voice enhancement method

Publications (2)

Publication Number Publication Date
CN116189703A CN116189703A (en) 2023-05-30
CN116189703B true CN116189703B (en) 2023-07-14

Family

ID=86452466

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310447342.7A Active CN116189703B (en) 2023-04-24 2023-04-24 Global multi-head attention voice enhancement method

Country Status (1)

Country Link
CN (1) CN116189703B (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110739003B (en) * 2019-10-23 2022-10-28 北京计算机技术及应用研究所 Voice enhancement method based on multi-head self-attention mechanism
US11715480B2 (en) * 2021-03-23 2023-08-01 Qualcomm Incorporated Context-based speech enhancement
CN114141238A (en) * 2021-11-26 2022-03-04 中国人民解放军陆军工程大学 Voice enhancement method fusing Transformer and U-net network
CN114664318A (en) * 2022-03-25 2022-06-24 山东省计算中心(国家超级计算济南中心) Voice enhancement method and system based on generation countermeasure network
CN115700882A (en) * 2022-10-21 2023-02-07 东南大学 Voice enhancement method based on convolution self-attention coding structure
CN115762544A (en) * 2022-11-15 2023-03-07 南京邮电大学 Voice enhancement method based on dynamic convolution and narrow-band former
CN115602152B (en) * 2022-12-14 2023-02-28 成都启英泰伦科技有限公司 Voice enhancement method based on multi-stage attention network

Also Published As

Publication number Publication date
CN116189703A (en) 2023-05-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant