CN116189703B - Global multi-head attention voice enhancement method - Google Patents
- Publication number: CN116189703B
- Application number: CN202310447342.7A
- Authority: CN (China)
- Legal status: Active
Classifications
- G10L21/0208: Speech enhancement; noise filtering
- G10L19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy
- G10L19/02: Speech or audio analysis-synthesis using spectral analysis
- G10L25/30: Speech or voice analysis using neural networks
Abstract
The invention discloses a global multi-head attention speech enhancement method in the field of generative adversarial networks. The method comprises: inputting a noisy audio signal into the convolution layers of a generator encoder to obtain a convolution feature map; inputting the convolution feature map into a global multi-head attention layer to obtain a global multi-head attention feature map; inputting that feature map into the remaining convolution layers of the generator encoder to obtain a convolution-global multi-head attention-convolution feature map; superposing this feature map with random noise z sampled from a Gaussian distribution and inputting the result into the deconvolution layers of a generator decoder to obtain a deconvolution feature map; inputting the deconvolution feature map into a global multi-head attention layer to obtain a decoding-global multi-head attention feature map; and inputting that feature map into the remaining deconvolution layers of the generator decoder to obtain the enhanced audio signal. The proposed layer can be used in other speech enhancement networks to capture time dependence.
Description
Technical Field
The present invention relates to the field of generative adversarial networks, and more particularly to a global multi-head attention speech enhancement method.
Background
In recent years, speech enhancement methods based on generative adversarial networks (GANs) have been proposed, achieving end-to-end speech enhancement by feeding raw waveforms directly into the network. However, existing speech enhancement GANs rely entirely on convolution operations, which can mask the time dependence of sequential inputs.
Disclosure of Invention
The invention aims to provide a global multi-head attention speech enhancement method to solve the problems identified in the background art.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a global multi-head attention speech enhancement method comprising the steps of:
step one: acquiring a noisy audio signal;
step two: inputting the noisy audio signal into several convolution layers of a generator encoder to obtain a convolution feature map;
step three: inputting the convolution feature map obtained in step two into the global multi-head attention layer of the generator encoder to obtain a global multi-head attention feature map;
step four: inputting the global multi-head attention feature map obtained in step three into the remaining convolution layers of the generator encoder to obtain a convolution-global multi-head attention-convolution feature map;
step five: superposing the convolution-global multi-head attention-convolution feature map obtained in step four with random noise z sampled from a Gaussian distribution, then inputting the result into several deconvolution layers of a generator decoder to obtain a deconvolution feature map;
step six: inputting the deconvolution feature map obtained in step five into the global multi-head attention layer of the generator decoder to obtain a decoding-global multi-head attention feature map;
step seven: inputting the decoding-global multi-head attention feature map obtained in step six into the remaining deconvolution layers of the generator decoder to obtain the enhanced audio signal.
Further, step three includes a first pre-step: obtaining the query matrix Q_i, the key matrix K_i, and the value matrix V_i of the convolution feature map, where i = 1, 2, ..., N and N denotes the number of attention heads.
Further, step three includes a second pre-step: calculating the weight matrix of the global multi-head attention:
W_i = softmax(K_i^T Q_i), i = 1, 2, ..., N,
where softmax denotes the normalized exponential function and ^T denotes the transpose of a vector or matrix.
Further, the global multi-head attention feature map in step three is calculated as:
B = Conv1d(Cat(V_1 W_1, ..., V_N W_N)),
where Cat is the concatenate function of a convolutional neural network, representing feature fusion, and Conv1d denotes a one-dimensional convolution.
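The multi-head attention computation in step three can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the patented implementation: the projection-weight initialization, the matrix orientation inside the softmax, and the shapes are illustrative, and plain matrix products stand in for the 1 x 1 and one-dimensional convolutions.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_multi_head_attention(A, n_heads=2, seed=0):
    """Sketch of the global multi-head attention feature map.

    A: feature map of shape (T, C) from a 1-D convolution layer.
    Per head: Q, K, V are linear projections of A (standing in for
    1 x 1 convolutions), W = softmax of the key-query product, and the
    per-head outputs are concatenated and fused by one more linear
    projection (standing in for the Conv1d fusion).
    """
    rng = np.random.default_rng(seed)
    T, C = A.shape
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
        Q, K, V = A @ Wq, A @ Wk, A @ Wv       # each (T, C)
        W = softmax(K @ Q.T, axis=0)           # (T, T) attention weights
        heads.append(W.T @ V)                  # (T, C) per-head output
    fuse = rng.standard_normal((n_heads * C, C)) * 0.1
    return np.concatenate(heads, axis=1) @ fuse  # Cat + fusion projection

B = global_multi_head_attention(np.random.default_rng(1).standard_normal((16, 8)))
print(B.shape)  # (16, 8)
```

The fused output keeps the shape of the input feature map, so the layer can be dropped between any two convolution blocks.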
Further, the feature map is downsampled before the key matrix K_i and the value matrix V_i are computed, where Downsample denotes the downsampling operation.
Further, step six includes a first pre-step: obtaining the query matrix Q_i, the key matrix K_i, and the value matrix V_i of the deconvolution feature map, where i = 1, 2, ..., N and N denotes the number of attention heads.
Further, step six includes a second pre-step: calculating the weight matrix of the global multi-head attention:
W_i = softmax(K_i^T Q_i), i = 1, 2, ..., N,
where softmax denotes the normalized exponential function and ^T denotes the transpose of a vector or matrix.
Further, the global multi-head attention feature map in step six is calculated as:
B = Conv1d(Cat(V_1 W_1, ..., V_N W_N)),
where Cat is the concatenate function of a convolutional neural network, representing feature fusion, and Conv1d denotes a one-dimensional convolution.
Here the deconvolution feature map is likewise downsampled before the key and value matrices are computed, where Downsample denotes the downsampling operation.
Compared with the prior art, and in particular with the conventional SASEGAN (self-attention speech enhancement generative adversarial network), the proposed global multi-head attention speech enhancement generative adversarial network (GMASEGAN), which operates on raw audio input, has more network parameters yet converges faster during training. In subjective evaluation, the proposed GMASEGAN improves on the conventional SEGAN (speech enhancement generative adversarial network) and SASEGAN by 8.81% and 5.78%, respectively. In objective evaluation, GMASEGAN gains 0.07, 0.06, 0.09, 0.07, 0.63, and 0.26 in absolute terms over SASEGAN on PESQ, CSIG, CBAK, COVL, SSNR, and STOI, respectively. In addition, the proposed GMASEGAN (with attention head number N = 4) can attenuate background noise present in clean speech, and the enhancement effect may improve further as the number of heads increases. More importantly, the proposed global multi-head attention layer can be used in other speech enhancement networks that take convolution layers as the backbone, in order to capture time dependence.
Drawings
FIG. 1 is a schematic diagram of global multi-headed attention layer processing steps;
FIG. 2 is a schematic diagram of the network architecture of the generator and the computation of the global multi-headed attention layer in the network;
FIG. 3 is a schematic diagram of the network structure of the discriminator.
Detailed Description
The following describes specific embodiments of the present invention with reference to the drawings.
First, a global multi-head attention layer is described:
the global multi-headed note layer is adapted from a non-local self-note layer. Here, two attention heads are taken as an example. Processing steps of the global multi-head attention layer as shown in fig. 1, fig. 1 is a diagram of processing steps of the global multi-head attention layer (attention head number n=2). Taking two vectors as an example, T represents the transpose of the vector. In order to keep the attention coefficient in the range of 0-1, inA softmax layer was then added.
Let A ∈ R^(T×C) denote the feature map output by a one-dimensional convolution layer, where T is the length of the time dimension and C is the number of channels. A is divided into T vectors a_t ∈ R^C, where a_t is the channel feature vector at time point t. The query vector q, the key vector k, and the value vector v are obtained by:
q = W_q a, k = W_k a, v = W_v a,
where W_q, W_k, and W_v denote weight matrices, each implemented by a 1 x 1 convolution layer. The attention coefficient α is obtained by the conversion:
α = softmax(k^T q),
where ^T denotes the transpose of a vector or matrix; the softmax layer keeps the attention coefficients in the range 0-1. The attention feature map b is obtained by the transformation:
b = Conv1x1(Cat(v_1 α_1, v_2 α_2)),
where Cat(·) denotes connecting multiple matrices together along a particular dimension, and Conv1x1 denotes a 1 x 1 convolution layer whose number of output channels equals that of A.
The network architecture and method of the present invention are described below:
the network structure of the generator and the calculation of the global multi-head attention layer are shown in fig. 2. The upper half shows the network structure of the generator and the output shape of each layer. The lower half shows how data is processed in the global multi-head attention layer when it is added after the third-to-last convolution layer of the decoder; a matching layer is placed at the mirrored encoder position, shown vertically aligned in the figure.
The generator is a fully convolutional encoder-decoder architecture that receives the noise-corrupted audio signal as input.
A parametric rectified linear unit (PReLU) is employed as the activation function of the generator. A convolution layer and an activation layer are combined into one convolution block (ConvBlk), which is the basic unit of the generator encoder. The generator encoder consists of 11 ConvBlks, where each convolution layer uses a stride of 2 and the number of filters increases layer by layer. The output of the encoder is an 8 x 1024 feature map. The latent code z is randomly sampled from a Gaussian distribution, stacked onto the encoder output, and presented to the generator decoder.
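The encoder's shape arithmetic can be verified with a short sketch: eleven stride-2 blocks take a 16384-sample input down to a time dimension of 8. The input length and the per-layer filter counts (which follow the SEGAN convention) are assumptions here; the text fixes only the 8 x 1024 encoder output.

```python
# Shape walk-through of the generator encoder: 11 stride-2 conv blocks
# compress a 16384-sample waveform to an 8 x 1024 feature map.
# Filter counts are an assumed SEGAN-style schedule, not taken from the text.
filters = [16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 1024]
t, c = 16384, 1
for f in filters:
    t, c = t // 2, f        # each ConvBlk halves the time dimension
    print(f"ConvBlk: ({t}, {c})")
assert (t, c) == (8, 1024)  # matches the stated encoder output shape
```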
The generator decoder mirrors the generator encoder architecture, reversing the encoding process by deconvolution. A deconvolution layer and an activation layer are combined into one deconvolution block (deConvBlk), which is the basic unit of the generator decoder. Skip connections link each ConvBlk in the encoder to its mirrored deConvBlk in the decoder, allowing data from the encoding process to flow into the decoding process. The output of the decoder is the enhanced audio signal.
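The skip connections can be illustrated with a small concatenation sketch; the shapes are illustrative, and channel-axis concatenation (rather than addition), in line with SEGAN-style generators, is an assumption.

```python
import numpy as np

# Skip connections: each encoder ConvBlk output is concatenated with the
# features at its mirrored decoder deConvBlk, so the deconvolution sees
# both the upsampled features and the raw encoder detail at that scale.
enc_feat = np.zeros((256, 64))   # saved from an encoder ConvBlk, shape (T, C)
dec_feat = np.ones((256, 64))    # upsampled decoder features at the mirror point
merged = np.concatenate([enc_feat, dec_feat], axis=1)
print(merged.shape)  # (256, 128) -- channel count doubles at each skip
```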
GMASEGAN couples the global multi-head attention layer described above with the generator (de)ConvBlks. Taking the two-head global multi-head attention layer as an example, the invention couples it with the third-to-last deConvBlk. Let F denote the output of the third-to-last deConvBlk; l = 3 denotes the position of the global multi-head attention layer. The query matrix Q_i is obtained by the conversion:
Q_i = W_Qi F,
where W_Qi denotes the weight matrix of Q_i. F is downsampled to 512 x 32 before the key matrix is calculated. The key matrix K_i is obtained by the conversion:
K_i = W_Ki Downsample(F),
where W_Ki denotes the weight matrix of K_i. Similarly, F is downsampled to 512 x 32 before the value matrix is calculated. The value matrix V_i is obtained by the conversion:
V_i = W_Vi Downsample(F),
where W_Vi denotes the weight matrix of V_i. The attention coefficient matrix α_i is obtained by the transformation:
α_i = softmax(K_i^T Q_i),
where ^T denotes the transpose of a vector or matrix. The output F' of the global multi-head attention layer is obtained by the conversion:
F' = Conv1d(Cat(V_1 α_1, V_2 α_2)),
where Cat(·) denotes connecting multiple matrices together along a particular dimension, and Conv1d denotes the output of a one-dimensional convolution layer with 32 filters.
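Downsampling the keys and values while the queries keep full time resolution shrinks the attention matrix. A minimal NumPy sketch, assuming a full-resolution length of 2048 and average pooling as the downsampling operation (both assumptions; the text fixes only the 512 x 32 downsampled shape):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
F = rng.standard_normal((2048, 32))      # deConvBlk output (T x C); T assumed
Q = F                                    # queries keep full time resolution
ds = F.reshape(512, 4, 32).mean(axis=1)  # average pooling stands in for the
                                         # patent's Downsample, giving 512 x 32
K, V = ds, ds                            # keys/values from the downsampled map
alpha = softmax(Q @ K.T, axis=1)         # 2048 x 512 instead of 2048 x 2048
out = alpha @ V                          # 2048 x 32, same shape as F
print(out.shape, alpha.shape)            # (2048, 32) (2048, 512)
```

The attention matrix shrinks by the downsampling factor on one side, which is what makes the global (full-sequence) attention affordable at waveform scales.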
The above calculation procedure also applies to the discriminator. As shown in fig. 3, the discriminator structure is similar to the generator encoder, but it takes a pair of raw audio clips as input. The output of the last discriminator ConvBlk is flattened and presented to a fully connected layer, which serves as the last layer of the discriminator, with softmax used for classification. The discriminator outputs true or false.
In general, the global multi-head attention layer can be combined with any number of (de)ConvBlks, or even all of them. The placement of the global multi-head attention layer is studied in the experiments below. In GMASEGAN, all (de)ConvBlks of the generator and discriminator are spectrally normalized.
The proposed method is evaluated on a data set derived from the Voice Bank corpus, from which 30 speakers are selected: 28 for the training set and 2 for the test set. For the training set, 40 noise conditions are introduced: 10 noise types at four SNRs (15, 10, 5, and 0 dB). Each training speaker has about 10 different sentences under each condition. The test set covers 20 different conditions in total: 5 noise types, each at 4 SNRs (17.5, 12.5, 7.5, and 2.5 dB), all taken from the Demand database. Each test speaker has about 20 different sentences under each condition. As mentioned above, using different speakers and conditions for the test and training sets allows a better evaluation of GMASEGAN's generalization ability. All audio samples are downsampled from 48 kHz to 16 kHz.
The proposed GMASEGAN is built on the TensorFlow framework for both training and testing. The network is trained for 100 epochs with a mini-batch size of 50, RMSprop, and a learning rate of 0.0002. Throughout training, the weight λ of the L1 regularization term is set to 100. During training, waveform chunks are extracted with a sliding window of approximately 1 second of speech (50% overlap). In the test phase, by contrast, the window slides over the whole test audio signal without overlap and the results are concatenated. In both the training and test phases, a high-frequency pre-emphasis filter with a coefficient of 0.95 is applied to all input samples.
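The waveform pre-processing described here (50%-overlap windows of about 1 second, plus a 0.95 pre-emphasis filter) can be sketched as follows; the 16384-sample window length is an assumption consistent with roughly 1 second at 16 kHz.

```python
import numpy as np

def preemphasis(x, coeff=0.95):
    """High-frequency pre-emphasis filter: y[t] = x[t] - coeff * x[t-1]."""
    return np.concatenate([x[:1], x[1:] - coeff * x[:-1]])

def extract_windows(x, win=16384, overlap=0.5):
    """Slice a waveform into fixed-size training chunks.

    win=16384 (~1 s at 16 kHz) is an assumed value; the text states
    only "approximately 1 second" with 50% overlap.
    """
    hop = int(win * (1 - overlap))
    return [x[i:i + win] for i in range(0, len(x) - win + 1, hop)]

x = np.random.default_rng(0).standard_normal(16000 * 3)  # 3 s of audio
chunks = extract_windows(preemphasis(x))
print(len(chunks), len(chunks[0]))  # 4 16384
```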
The experiments have three objectives. First, to investigate and quantify the role of the global multi-head attention layer in a speech enhancement network, with SEGAN as the comparison baseline. Second, to investigate how the performance of the generator and discriminator changes when the global multi-head attention layer is placed at different positions; the proposed GMASEGAN is evaluated with different (de)ConvBlk index values. The first and second (de)ConvBlks could not be tested due to GPU memory limitations and their high time dimensions (8192 and 4096, respectively). Third, to compare the proposed GMASEGAN with SASEGAN in the frequency and time domains; the number of attention heads is set to N, and the influence of N on speech enhancement is analyzed quantitatively.
The experiment of the invention is mainly divided into the following stages to achieve the objective:
data preparation stage: speech signal samples, including noisy and noiseless speech, are first collected from a plurality of data sets. These samples are pre-processed, such as data enhancement, normalization, framing, etc., for use in subsequent experiments.
Model building: at this stage, the present invention builds the generator and discriminator according to the proposed GMASEGAN architecture. GMASEGAN contains a global multi-headed attention layer for capturing long-range dependencies in speech signals. The invention inserts the global multi-head attention layer into different (de) ConvBlk index values respectively for performance evaluation.
Training and optimizing: the invention inputs the collected data into the generator and discriminator for training. Different penalty functions (e.g., countering penalty and reconstructing penalty) are used to optimize the model parameters. At the same time, the invention will adjust the number of attention heads N to analyze its impact on speech enhancement performance.
Model evaluation stage: after model training is completed, the present invention uses various evaluation metrics (e.g., signal-to-noise ratio, PESQ, etc.) to evaluate the performance of GMASEGAN on speech enhancement tasks. To achieve the first part of the objective, the present invention compares GMASEGAN with SEGAN to quantitatively evaluate the role of the global multi-headed attention layer in a speech enhancement network.
Experimental analysis stage: the invention will analyze the experimental results and compare the performance changes of the generator and discriminator when the global multi-headed attention layer is in different positions. This will help determine the best position of the global multi-headed attention layer in GMASEGAN. Furthermore, the present invention will compare the performance of GMASEGAN and sassegan in the frequency and time domains to highlight the advantages of GMASEGAN proposed.
The result presentation stage: the invention arranges the experimental results and displays the specific processes, methods and results in forms of tables, charts, text descriptions and the like. This will help elucidate the effectiveness of the proposed GMASEGAN in achieving the speech enhancement objective.
1) Subjective index
The quality of the speech under test is obtained by averaging the scores of all listeners, yielding the Mean Opinion Score (MOS). 20 sentences were randomly selected and presented to 10 listeners in random order. The selection involved listening to some of the provided noisy audio signals and trying to balance the noise types, since the data set does not specify the amount and type of noise for each audio clip. Most audio samples have a low SNR and a few have a high SNR. Five versions of each sentence are presented in random order: the clean audio signal, the noisy audio signal, the SASEGAN-enhanced audio signal, and the GMASEGAN-enhanced audio signals for N = 2 and N = 4. Listeners score the overall quality of each signal on a scale of 1 to 5, attending to both audio signal distortion and noise interference (e.g., 1 = bad: objectionable audio signal distortion; 5 = excellent: very natural speech with no audible noise or degradation). To ensure that listeners do not complete the survey carelessly, 10 severely distorted and incoherent audio clips are inserted into the evaluation set as attention checks. If three or more of these samples are rated 2 or higher, the listener is removed from the analysis. Each signal may be listened to as many times as desired, and listeners are asked to focus on comparing the five versions of each sentence.
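The MOS averaging with the attention-check exclusion can be sketched as below. The ratings are made-up illustrative data, and treating a trap rating of 2 or higher as a failure is an interpretation of the text rather than a stated rule.

```python
# MOS with attention-check exclusion: listeners who rate three or more
# of the severely distorted trap samples at 2 or higher are dropped
# before averaging. All numbers are illustrative.
def mean_opinion_score(scores, trap_scores, trap_threshold=2, max_fails=3):
    """scores: {listener: [1-5 ratings]}; trap_scores: {listener: trap ratings}."""
    kept = {
        name: vals for name, vals in scores.items()
        if sum(s >= trap_threshold for s in trap_scores[name]) < max_fails
    }
    pooled = [s for vals in kept.values() for s in vals]
    return sum(pooled) / len(pooled)

scores = {"l1": [4, 5, 4], "l2": [2, 2, 1], "l3": [5, 4, 4]}
traps = {"l1": [1, 1, 2], "l2": [3, 4, 2], "l3": [1, 1, 1]}  # l2 fails 3 traps
print(round(mean_opinion_score(scores, traps), 2))  # 4.33 (l2 excluded)
```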
2) Objective evaluation
PESQ: perceptual Evaluation of Speech Quality is the objective speech quality assessment algorithm that is currently most relevant to MOS scores. The wideband version is used because ITU-T p.862.2 uses and recommends wideband versions. The value range is-0.5 to 4.5. The higher the value, the better the enhanced speech quality.
CSIG: this is a MOS prediction for speech signal distortion only. The value range is 1 to 5. The higher the value, the less the enhanced speech distortion.
CBAK: a MOS prediction of the intrusiveness of background noise. The higher the value, the less intrusive the background noise.
COVL: this is a MOS prediction of the overall effect. The value range is 1-5, and the larger the value is, the better the overall voice quality enhancement effect is.
SSNR: note that the segment signal-to-noise ratio is the average of all speech frames. Using this index must first ensure that the pure speech and enhancement signals are aligned in the time domain, which is highly correlated with the subjective auditory perception of the listener. The value range is 1 to infinity. The higher the value, the better the enhanced speech quality.
STOI (%): short-Time Objective Intelligibility is one of the important indicators for speech intelligibility. The value range is 0 to 1. The higher the value, the higher the enhanced speech intelligibility.
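A segmental SNR of the kind listed above can be sketched as follows. The frame length and the clamping of per-frame values to [-10, 35] dB are common conventions assumed here, not taken from the text; the signals must be time-aligned, as noted.

```python
import numpy as np

def ssnr(clean, enhanced, frame=256, eps=1e-10, lo=-10.0, hi=35.0):
    """Segmental SNR: per-frame SNR in dB, averaged over frames.

    Clamping each frame to [lo, hi] dB is a common convention (an
    assumption here); clean and enhanced must be time-aligned.
    """
    n = (len(clean) // frame) * frame
    c = clean[:n].reshape(-1, frame)
    e = enhanced[:n].reshape(-1, frame)
    noise = c - e
    snr = 10 * np.log10((c ** 2).sum(axis=1) / ((noise ** 2).sum(axis=1) + eps) + eps)
    return float(np.clip(snr, lo, hi).mean())

rng = np.random.default_rng(0)
clean = rng.standard_normal(4096)
# A less noisy enhancement should score a higher SSNR:
print(ssnr(clean, clean + 0.1 * rng.standard_normal(4096)) >
      ssnr(clean, clean + 0.5 * rng.standard_normal(4096)))  # True
```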
The invention studies the effect on speech enhancement of placing a global multi-head attention layer on the ConvBlks of SEGAN. The proposed GMASEGAN is compared with the traditional Wiener filter method and with deep learning methods. Subjective and objective indicators were used to evaluate the speech enhancement effect, and the main findings are as follows:
1) During training, the proposed GMASEGAN converges significantly faster than SASEGAN, showing that convergence can be faster even though the number of model parameters of the proposed GMASEGAN increases slightly. Convergence speed increases further as the number of heads grows.
2) When N = 2 and N = 4, the absolute gains of the proposed GMASEGAN in subjective evaluation over SASEGAN are 0.22 and 0.30, respectively. The results show that the proposed GMASEGAN is superior to the baseline SEGAN and to SASEGAN in subjective evaluation.
3) The position of the global multi-head attention layer in the network has no obvious relation to the speech enhancement effect. GMASEGAN-Average (N = 2) exceeds SASEGAN-Average by absolute margins of 0.14, 0.35, 0.04, 0.07, 0.01, and 0.08 on STOI, SSNR, COVL, CBAK, CSIG, and PESQ, respectively. GMASEGAN-Average (N = 4) exceeds GMASEGAN-Average (N = 2) by absolute margins of 0.12, 0.28, 0.03, 0.02, and 0.05 on STOI, SSNR, COVL, CBAK, and CSIG, respectively. In terms of objective assessment, the results show that the proposed GMASEGAN is generally superior to SASEGAN, and that the enhancement effect may increase with the number of attention heads.
4) Compared with the baseline SEGAN, the absolute gains of the proposed GMASEGAN on PESQ, CSIG, CBAK, COVL, SSNR, and STOI are 0.23, 0.14, 0.21, 0.19, 1.04, and 0.34, respectively. In addition, a comparison was made with the Wiener filtering method and with deep learning methods; the proposed GMASEGAN outperforms both.
5) At low SNR (SNR = 2.5 or 7.5), the spectrogram shows intuitively that the proposed GMASEGAN (N = 4) can reduce noise that SASEGAN cannot handle. At high SNR (SNR = 12.5 or 17.5), the proposed GMASEGAN not only suppresses additive noise but also attenuates the background noise present in the clean speech signals, since the clean recordings themselves contain some background noise.
Importantly, the global multi-head attention layer is readily applicable to other speech enhancement networks for potential improvement. In future work, the invention will explore the placement of global multi-head attention layers in other networks to further enhance performance, employ larger N to evaluate the enhancement performance of GMASEGAN, and train the proposed GMASEGAN with data sets containing more noise types, making it suitable for a variety of real-world noise environments.
The foregoing is only a preferred embodiment of the present invention, but the scope of the invention is not limited thereto; any equivalent substitutions and modifications made by a person skilled in the art within the technical scope disclosed herein, according to the technical solution and inventive concept of the invention, shall fall within the protection scope of the present invention.
Claims (9)
1. A global multi-head attention speech enhancement method, comprising the steps of:
step one: acquiring a noisy audio signal;
step two: inputting the noisy audio signal into several convolution layers of a generator encoder to obtain a convolution feature map;
step three: inputting the convolution feature map obtained in step two into a global multi-head attention layer of the generator encoder to obtain a global multi-head attention feature map;
step four: inputting the global multi-head attention feature map obtained in step three into the other convolution layers of the generator encoder to obtain a convolution-global multi-head attention-convolution feature map;
step five: superposing the convolution-global multi-head attention-convolution feature map obtained in step four with random noise z sampled from a Gaussian distribution, then inputting the result into several deconvolution layers of a generator decoder to obtain a deconvolution feature map;
step six: inputting the deconvolution feature map obtained in step five into a global multi-head attention layer of the generator decoder to obtain a decoding-global multi-head attention feature map;
step seven: inputting the decoding-global multi-head attention feature map obtained in step six into the other deconvolution layers of the generator decoder to obtain the enhanced audio signal.
3. The global multi-headed speech enhancement method according to claim 2, wherein step three further comprises the second pre-step of: calculating a weight matrix of the global multi-head attention:
4. The global multi-head attention speech enhancement method according to claim 3, wherein the global multi-head attention feature map in step three is calculated as follows:
5. The global multi-head attention speech enhancement method according to any one of claims 2 to 4, characterized in that:
the down-sampling operation is denoted by downsampling.
7. The global multi-head attention speech enhancement method according to claim 6, wherein step six further comprises a second preparatory step: calculating the weight matrix of the global multi-head attention:
8. The global multi-head attention speech enhancement method according to claim 7, wherein the global multi-head attention feature map in step six is calculated as follows:
9. The global multi-head attention speech enhancement method according to any one of claims 6 to 8, characterized in that:
the down-sampling operation is denoted by downsampling.
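The seven steps of claim 1 can be sketched end to end at the level of tensor shapes. The sketch below is an illustrative walk-through, not the patented implementation: the layer counts, kernel sizes, stride, single-head attention placeholder, and channel-wise concatenation of the noise z are all assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, stride=2):
    """Valid strided 1-D convolution with ReLU. x: (T, Cin), w: (k, Cin, Cout)."""
    k = w.shape[0]
    steps = (x.shape[0] - k) // stride + 1
    out = np.stack([np.tensordot(x[i*stride:i*stride+k], w, axes=([0, 1], [0, 1]))
                    for i in range(steps)])
    return np.maximum(out, 0.0)

def deconv1d(x, w, stride=2):
    """Transposed 1-D convolution (deconvolution). x: (T, Cin), w: (k, Cin, Cout)."""
    k, _, cout = w.shape
    out = np.zeros(((x.shape[0] - 1) * stride + k, cout))
    for i in range(x.shape[0]):
        out[i*stride:i*stride+k] += np.tensordot(x[i], w, axes=([0], [1]))
    return out

def attention(x):
    """Single-head global self-attention placeholder: softmax(x x^T / sqrt(d)) x."""
    s = x @ x.T / np.sqrt(x.shape[1])
    s -= s.max(axis=-1, keepdims=True)          # numerical stability
    a = np.exp(s)
    a /= a.sum(axis=-1, keepdims=True)          # rows sum to 1
    return a @ x

noisy = rng.standard_normal((64, 1))            # step 1: noisy audio signal
w1 = rng.standard_normal((4, 1, 8)) * 0.1       # first encoder conv layers
w2 = rng.standard_normal((4, 8, 16)) * 0.1      # remaining encoder conv layers
h = conv1d(noisy, w1)                           # step 2: convolution feature map
h = attention(h)                                # step 3: global attention (encoder)
h = conv1d(h, w2)                               # step 4: conv-attention-conv map
z = rng.standard_normal(h.shape)                # random noise z ~ N(0, 1)
h = np.concatenate([h, z], axis=1)              # step 5: superpose with z (assumed concat)
d1 = rng.standard_normal((4, 32, 8)) * 0.1
h = deconv1d(h, d1)                             # step 5: first decoder deconv layers
h = attention(h)                                # step 6: global attention (decoder)
d2 = rng.standard_normal((4, 8, 1)) * 0.1
enhanced = deconv1d(h, d2)                      # step 7: enhanced audio signal
```

Without padding, the output length (62 samples here) differs slightly from the input (64); a real encoder-decoder would pad so that the enhanced signal matches the input length exactly.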
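Claims 3 and 7 both recite computing "the weight matrix of the global multi-head attention", but the formulas themselves are omitted from the text above. The sketch below therefore shows standard scaled dot-product multi-head attention as one plausible reading; the projection matrices Wq, Wk, Wv and their random initialisation are assumptions for illustration only.

```python
import numpy as np

def global_multihead_attention(x, num_heads=4, seed=0):
    """Scaled dot-product multi-head self-attention over a whole feature map.

    x: (T, C) feature map, T time steps, C channels (C divisible by num_heads).
    Returns (output, weights), where weights has shape (num_heads, T, T) and
    each row of each head's weight matrix is a softmax distribution over T.
    """
    T, C = x.shape
    d = C // num_heads                                        # per-head dimension
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))

    # Project and split into heads: (T, C) -> (num_heads, T, d)
    Q = (x @ Wq).reshape(T, num_heads, d).transpose(1, 0, 2)
    K = (x @ Wk).reshape(T, num_heads, d).transpose(1, 0, 2)
    V = (x @ Wv).reshape(T, num_heads, d).transpose(1, 0, 2)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)            # (num_heads, T, T)
    scores -= scores.max(axis=-1, keepdims=True)              # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)            # row-wise softmax

    out = (weights @ V).transpose(1, 0, 2).reshape(T, C)      # merge heads back
    return out, weights
```

Because every time step attends to every other step, the weight matrix is "global" over the utterance, in contrast to convolution layers whose receptive field is local.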
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310447342.7A CN116189703B (en) | 2023-04-24 | 2023-04-24 | Global multi-head attention voice enhancement method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116189703A (en) | 2023-05-30
CN116189703B (en) | 2023-07-14
Family
ID=86452466
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310447342.7A Active CN116189703B (en) | 2023-04-24 | 2023-04-24 | Global multi-head attention voice enhancement method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116189703B (en) |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110739003B (en) * | 2019-10-23 | 2022-10-28 | 北京计算机技术及应用研究所 | Voice enhancement method based on multi-head self-attention mechanism |
US11715480B2 (en) * | 2021-03-23 | 2023-08-01 | Qualcomm Incorporated | Context-based speech enhancement |
CN114141238A (en) * | 2021-11-26 | 2022-03-04 | 中国人民解放军陆军工程大学 | Voice enhancement method fusing Transformer and U-net network |
CN114664318A (en) * | 2022-03-25 | 2022-06-24 | 山东省计算中心(国家超级计算济南中心) | Voice enhancement method and system based on generation countermeasure network |
CN115700882A (en) * | 2022-10-21 | 2023-02-07 | 东南大学 | Voice enhancement method based on convolution self-attention coding structure |
CN115762544A (en) * | 2022-11-15 | 2023-03-07 | 南京邮电大学 | Voice enhancement method based on dynamic convolution and narrow-band former |
CN115602152B (en) * | 2022-12-14 | 2023-02-28 | 成都启英泰伦科技有限公司 | Voice enhancement method based on multi-stage attention network |
- 2023-04-24: CN application CN202310447342.7A granted as patent CN116189703B (active)
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Su et al. | HiFi-GAN: High-fidelity denoising and dereverberation based on speech deep features in adversarial networks | |
CN108447495B (en) | Deep learning voice enhancement method based on comprehensive feature set | |
CN110619885B (en) | Method for generating confrontation network voice enhancement based on deep complete convolution neural network | |
Kong et al. | Speech denoising in the waveform domain with self-attention | |
CN110246510B (en) | End-to-end voice enhancement method based on RefineNet | |
CN110428849B (en) | Voice enhancement method based on generation countermeasure network | |
Su et al. | Bandwidth extension is all you need | |
CN112802491B (en) | Voice enhancement method for generating confrontation network based on time-frequency domain | |
CN113744749B (en) | Speech enhancement method and system based on psychoacoustic domain weighting loss function | |
Strauss et al. | A flow-based neural network for time domain speech enhancement | |
Geng et al. | End-to-end speech enhancement based on discrete cosine transform | |
Braun et al. | Effect of noise suppression losses on speech distortion and ASR performance | |
Zhu et al. | FLGCNN: A novel fully convolutional neural network for end-to-end monaural speech enhancement with utterance-based objective functions | |
Qian et al. | Combining equalization and estimation for bandwidth extension of narrowband speech | |
Zhang et al. | Multiple vowels repair based on pitch extraction and line spectrum pair feature for voice disorder | |
Zhang et al. | FB-MSTCN: A full-band single-channel speech enhancement method based on multi-scale temporal convolutional network | |
Lee et al. | DeFT-AN: Dense frequency-time attentive network for multichannel speech enhancement | |
CN114360571A (en) | Reference-based speech enhancement method | |
Yoneyama et al. | Nonparallel high-quality audio super resolution with domain adaptation and resampling CycleGANs | |
CN116189703B (en) | Global multi-head attention voice enhancement method | |
Hepsiba et al. | Enhancement of single channel speech quality and intelligibility in multiple noise conditions using wiener filter and deep CNN | |
Xu et al. | Selector-enhancer: learning dynamic selection of local and non-local attention operation for speech enhancement | |
Strauss et al. | Improved normalizing flow-based speech enhancement using an all-pole gammatone filterbank for conditional input representation | |
Rani et al. | Significance of phase in DNN based speech enhancement algorithms | |
CN116013343A (en) | Speech enhancement method, electronic device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |