CN117711417A - Voice quality enhancement method and system based on frequency domain self-attention network - Google Patents

Voice quality enhancement method and system based on frequency domain self-attention network

Info

Publication number
CN117711417A
CN117711417A (application CN202410163875.7A)
Authority
CN
China
Prior art keywords
layer
voice
frequency domain
frequency response
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410163875.7A
Other languages
Chinese (zh)
Other versions
CN117711417B
Inventor
袁程浩
归子涵
刘瑨玮
杨光义
贺威
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202410163875.7A
Publication of CN117711417A
Application granted
Publication of CN117711417B
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a voice quality enhancement method and system based on a frequency domain self-attention network. Original speech is input and preprocessed; the processed frequency response is then fed into a frequency domain self-attention network; finally, the network output is post-processed to obtain the speech enhancement signal. The frequency domain self-attention network comprises a position coding module and N identical basic unit modules, where N is determined by the desired network depth; the position coding module comprises a position coding layer, and each basic unit module comprises a multi-attention-head layer, a residual connection and layer normalization layer, and a feedforward layer. The invention removes noise from speech signals and is of practical significance for voice communication.

Description

Voice quality enhancement method and system based on frequency domain self-attention network
Technical Field
The invention belongs to the technical field of voice quality processing, relates to a voice quality enhancement method and a voice quality enhancement system, and particularly relates to a voice quality enhancement method and a voice quality enhancement system based on a frequency domain self-attention network.
Background
Numerous practical theories and algorithms developed in the field of digital signal processing in the mid-1960s, such as the Fast Fourier Transform (FFT) and various digital filters, form the theoretical and technical basis of digital speech signal processing. Since the late 1970s, linear predictive coding (LPC) has been used for information compression and feature extraction of speech signals and has become a very important tool in speech signal processing. A significant development in speech signal processing in the 1980s was the introduction of Hidden Markov Models (HMMs) to describe the speech signal process. Since the 1990s, speech signal acquisition and analysis techniques have achieved a number of breakthrough research advances in practical applications.
In fields such as business, education, and medical care where remote work is required, there is great demand for teleconferencing systems, and the voice quality of such systems is critical; removing noise to a large extent therefore has a decisive effect on voice quality improvement. In full duplex communication, these problems become more challenging when echo interferes in double-talk (DT) scenarios. A solution that can jointly address acoustic echo, noise, and reverberation is thus critical to achieving seamless communication.
In recent years, with advances in science and technology, research on artificial neural networks (ANNs) has progressed rapidly; speech signal processing has both driven this development and benefited from it, with many of its results embodied in speech-processing technologies. Joint AEC (acoustic echo cancellation) and NS (noise suppression) approaches have been developed to simplify the communication pipeline while providing good AEC and NS performance. For example, MTFAA-Net is a neural network for joint AEC and NS based on multi-scale time-frequency processing and streaming axial attention. However, MTFAA-Net still relies on classical AEC components.
However, current deep-learning-based approaches still model speech noise imperfectly. At the same time, real-time noise reduction capability is very important for voice communication: to improve the user's communication experience, the time complexity of the algorithm must be reduced so that noise reduction can run in real time.
Disclosure of Invention
In order to solve the problem of low real-time performance of the voice quality enhancement method in the prior art, the invention provides a voice quality enhancement method and a voice quality enhancement system based on a frequency domain self-attention network, which can be applied to enhancing the voice quality in the fields of business, military and the like.
The technical scheme adopted by the method is as follows: a method for enhancing speech quality based on a frequency domain self-attention network, comprising the steps of:
step S1, inputting original voice and preprocessing to obtain the frequency response of voice data;
step S2, inputting the processed frequency response into a frequency domain self-attention network to obtain a frequency response with enhanced voice quality;
the frequency domain self-attention network comprises a position coding module and N identical basic unit modules;
the position coding module comprises a position coding layer for adding position information to the processed frequency response; the basic unit module comprises a multi-attention head layer, a residual error connection and layer normalization layer and a feedforward layer;
and S3, carrying out post-processing on the frequency response after voice quality enhancement to obtain a final voice enhancement signal.
Further, in step S1, the input original speech is preprocessed by Fourier transform, normalization and a dimension-lifting operation. The Fourier transform obtains the frequency response of the input speech data with a fast Fourier transform function, comprising an amplitude response characteristic and a phase response characteristic; the normalization applies min-max normalization to the amplitude response characteristic and scales the phase response characteristic into the interval 0 to 2π; the dimension-lifting operation clips the one-dimensional frequency domain sequence into a number of sequences of a certain length and stacks them by columns into a two-dimensional matrix.
Further, in step S2, the position coding function in the position coding module is:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

where PE denotes the position code, pos denotes the position of an element in the sequence, d_model denotes the model dimension, 2i denotes the even dimensions, and 2i+1 denotes the odd dimensions.
Further, in step S2, the position-encoded frequency response X is input to a multi-attention-head layer consisting of several parallel attention heads, where each attention head contains three trainable weight matrices W^Q, W^K and W^V used to obtain the query Q, key K and value V:

Q = X W^Q,  K = X W^K,  V = X W^V

After obtaining the matrices Q, K and V, the output of the multi-attention-head layer is calculated as:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

where d_k is the number of columns of the K matrix (i.e. the vector dimension), T denotes the transpose operation, and softmax is the normalization function;
The output of the multi-attention-head layer and the position-encoded frequency response are input together to a residual connection layer, which eases the training of deep networks; layer normalization is then applied to the output of the residual connection layer; the layer normalization result is input to the feedforward layer so that the final output matrix dimension matches the input dimension; finally, residual connection and layer normalization are applied to the feedforward layer result to obtain the final frequency response.
Further, the normalization function is calculated as:

softmax(z_i) = exp(z_i − max(z)) / Σ_j exp(z_j − max(z))

where max(z) is the maximum value of the vector z;

The residual connection layer consists of 2 convolution layers, with the specific formula:

Y = F(X) + X

where Y is the output of the residual connection layer, F(X) is the output of the 2nd convolution layer in the residual connection layer, and X is the input of the residual connection layer;

The feedforward layer comprises two fully connected layers; the first layer uses a ReLU activation function and the second layer uses no activation function:

FFN(X) = max(0, X W_1 + b_1) W_2 + b_2

where X is the input, W_1 and W_2 are the parameters of the two fully connected layers, and b_1 and b_2 are the biases of the two fully connected layers.
Further, in step S2, the frequency domain self-attention network is a trained frequency domain self-attention network; the training process comprises the following substeps:
step SS1, using a VOICEBANK data set containing original voice and clean voice;
and step SS2, preprocessing the data set, inputting the preprocessed data set into a frequency domain self-attention network for training, and continuously optimizing model parameters through a back propagation algorithm to achieve a better voice enhancement effect.
Further, in step SS2, the preprocessing comprises Fourier transform, normalization and a dimension-lifting operation. The input original speech is first Fourier-transformed to obtain the frequency response; the frequency response is then normalized; finally, the dimension-lifting operation is applied to the normalized frequency response to obtain the computation matrix. During training, a mean square error loss function is used, and training continues until the network converges, i.e. until the training loss curve remains stable and no longer decreases.
Further, in step S3, the post-processing comprises taking positive values, a dimension reduction operation and an inverse Fourier transform. Taking positive values keeps the positive part of the network output; the dimension reduction operation splices the result in sequence back into a one-dimensional sequence, yielding the one-dimensional frequency response after voice quality enhancement; the inverse Fourier transform obtains the quality-enhanced speech signal with an inverse fast Fourier transform function.
The invention also provides a voice quality enhancement system based on the frequency domain self-attention network, which comprises the following units:
the preprocessing unit is used for inputting original voice and preprocessing the original voice to obtain the frequency response of voice data;
the voice quality enhancement unit is used for inputting the processed frequency response into the frequency domain self-attention network to obtain the frequency response after voice quality enhancement;
the frequency domain self-attention network comprises a position coding module and N identical basic unit modules;
the position coding module comprises a position coding layer for adding position information to the processed frequency response; the basic unit module comprises a multi-attention head layer, a residual error connection and layer normalization layer and a feedforward layer;
and the post-processing unit is used for carrying out post-processing on the frequency response after voice quality enhancement to obtain a final voice enhancement signal.
The invention uses a frequency domain self-attention network to enhance original voice quality. The technique combines frequency domain analysis and deep learning: a fast Fourier transform first obtains the frequency response of the original speech signal, which contains both valid speech components and invalid noise. A dimension-lifting operation then turns the frequency response into a two-dimensional matrix suitable for network input. The frequency domain self-attention network model extracts features from the dimension-lifted frequency response. Finally, a dimension reduction operation is applied to the output signal, and the noise in the speech signal is thereby removed. Compared with traditional voice quality enhancement methods, the method is stable, independent, fast and efficient; it can greatly improve the accuracy and efficiency of voice quality enhancement and provides a strong guarantee for voice communication in fields such as business and the military.
Drawings
The following drawings are used, together with the specific embodiments, to further illustrate the technical solutions herein. For a person skilled in the art, other figures and applications of the present invention can be derived from these drawings without inventive effort.
FIG. 1 is a flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a diagram of a frequency domain self-attention network architecture in accordance with an embodiment of the present invention;
FIG. 3 is a flow chart of a frequency domain self-attention network training process according to an embodiment of the present invention;
FIG. 4 is a diagram of a dimension up operation structure in accordance with an embodiment of the present invention;
fig. 5 is a diagram illustrating a dimension reduction operation structure according to an embodiment of the present invention.
Detailed Description
To facilitate understanding and practice of the invention by those of ordinary skill in the art, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the embodiments described here serve only to illustrate and explain the invention and are not intended to limit it.
The present embodiment takes a given voice data set to be tested as an example, and further describes the present invention. Referring to fig. 1, the method for enhancing voice quality based on the frequency domain self-attention network provided in this embodiment includes the following steps:
step S1: inputting and preprocessing original voice signals in a given data set to be tested;
in one embodiment, the input original speech is preprocessed, including fourier transform, normalization and dimension-lifting operation, the fourier transform is to obtain frequency response of the input speech data by using a fast fourier transform function, including amplitude response characteristic and phase response characteristic, the normalization is to normalize the amplitude response characteristic and the phase response characteristic by using maximum and minimum values, and dimension-convert the phase response characteristic into a length interval of 0 to 2 pi, the dimension-lifting operation is to clip a frequency domain signal which is a one-dimensional sequence into 200 sequences with length of 512, and stack the sequences into a 512×200 two-dimensional matrix by columns.
Step S2: inputting the processed frequency response into a frequency domain self-attention network to obtain a frequency response with enhanced voice quality;
please refer to fig. 2, wherein the frequency-entering domain self-attention network includes a position coding module and N identical basic unit modules; the position coding module comprises a position coding layer; the basic unit module comprises a multi-attention head layer, a residual error connection and layer normalization layer and a feedforward layer; the N identical base unit modules, where N is determined by the desired network depth.
In one embodiment, the position coding module adds sequence position information to the processed frequency response; the specific position coding function is:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

where PE denotes the position code, pos denotes the position of an element in the sequence, d_model denotes the model dimension, 2i denotes the even dimensions and 2i+1 the odd dimensions (i.e. 0 ≤ 2i ≤ d_model and 0 ≤ 2i+1 ≤ d_model).
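A minimal NumPy sketch of this sinusoidal position code (the standard Transformer formulation, which the position coding function above follows; d_model is assumed even):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position code: sin on even dimensions 2i,
    cos on odd dimensions 2i+1."""
    pos = np.arange(seq_len)[:, None]          # positions 0..seq_len-1
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices 2i
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # even dimensions
    pe[:, 1::2] = np.cos(angle)                # odd dimensions
    return pe
```

In use, the code matrix is simply added element-wise to the input frequency-response matrix before the first basic unit module.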
In one embodiment, the basic unit module inputs the position-encoded frequency response X to a multi-attention-head layer consisting of 8 parallel attention heads, where each attention head contains three trainable weight matrices W^Q, W^K and W^V used to obtain Q (query), K (key) and V (value):

Q = X W^Q,  K = X W^K,  V = X W^V

After obtaining the matrices Q, K and V, the output of the multi-attention-head layer can be calculated as:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

where d_k is the number of columns of the K matrix (i.e. the vector dimension), T denotes the transpose operation, and softmax is the normalization function, calculated as:

softmax(z_i) = exp(z_i − max(z)) / Σ_j exp(z_j − max(z))

where max(z) is the maximum element of z, and z_i and z_j are elements of z;
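One attention head of this layer can be sketched as follows (a single head only; in the embodiment 8 such heads run in parallel and their outputs are combined — that combination step is omitted here):

```python
import numpy as np

def attention_head(X, Wq, Wk, Wv):
    """One attention head: Q = X Wq, K = X Wk, V = X Wv, then
    softmax(Q K^T / sqrt(d_k)) V using the max-subtracted softmax
    given in the text for numerical stability."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]                                  # columns of K
    scores = Q @ K.T / np.sqrt(d_k)
    # row-wise softmax with max subtraction (numerically stable)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V
```

Subtracting the row maximum before exponentiation leaves the softmax result unchanged but prevents overflow for large scores.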
the method comprises the steps of inputting the output of a multi-attention head layer and the frequency response after position coding into a residual connection layer, wherein the residual connection layer consists of 2 convolution layers and is used for solving the problem of multi-layer network training, and the network pays attention to a current difference part, and the specific formula is as follows:
wherein,for the output of the residual connection layer, < >>Output of the 2 nd convolution layer in the residual connection layer,>as the input of the residual connection layer, adopting a residual structure, so that a network is enabled to 'short circuit' certain layers when the depth is deeper to prevent the network from degradation;
Layer normalization is then applied to the output of the residual connection layer, normalizing the inputs of the neurons in each layer to a consistent distribution so as to accelerate convergence;
The layer normalization result is input to a feedforward layer so that the final output matrix dimension matches the input dimension. The feedforward layer consists of two fully connected layers; the first uses a ReLU activation function and the second uses no activation function:

FFN(X) = max(0, X W_1 + b_1) W_2 + b_2

where X is the input, W_1 and W_2 are the parameters of the two fully connected layers, and b_1 and b_2 are their biases;
and finally, carrying out residual connection and layer normalization on the feedforward layer result to obtain a final frequency response.
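The tail of the basic unit module — feedforward layer, residual connection, and layer normalization — can be sketched as follows. This is a simplified illustration: the learnable scale/shift parameters usually present in layer normalization are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each row to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Two fully connected layers: ReLU on the first, none on the second."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def basic_unit_tail(x, W1, b1, W2, b2):
    """Feedforward layer with residual connection, then layer norm;
    W2 maps back to the input width so output dims match input dims."""
    return layer_norm(x + feed_forward(x, W1, b1, W2, b2))
```

Because W2 projects back to the input width, the residual addition is well-defined and the final output matrix dimension stays consistent with the input, as required by the text.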
Referring to fig. 3, in one embodiment, the frequency domain self-attention network is a trained frequency domain self-attention network; the training process comprises the following substeps:
step SS1: using a VOICEBANK dataset comprising original speech and clean speech;
In one embodiment, the VOICEBANK dataset, a speech denoising dataset commonly used in deep learning research, is used as the benchmark.
Step SS2: preprocessing a data set, inputting the preprocessed data set into a frequency domain self-attention network model for training, and continuously optimizing model parameters through a back propagation algorithm to achieve a better voice enhancement effect;
In one embodiment, the preprocessing comprises Fourier transform, normalization and a dimension-lifting operation. The input original speech is first Fourier-transformed to obtain the frequency response; the frequency response is then normalized; finally, the dimension-lifting operation is applied to the normalized frequency response to obtain the computation matrix. During training, a mean square error loss function is used, and training continues until the network converges, i.e. until the training loss curve remains stable and no longer decreases; the model with the best speech denoising and enhancement effect is taken as the final result.
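The MSE-loss/back-propagation mechanics can be illustrated with a toy gradient-descent step. Note the assumption: a linear stand-in model pred = X @ W replaces the actual self-attention network here, purely to show the loss and update rule; the learning rate and function names are illustrative.

```python
import numpy as np

def mse_loss(pred, target):
    """Mean square error loss used during training."""
    return float(np.mean((pred - target) ** 2))

def train_step(W, X, Y, lr=0.1):
    """One gradient-descent step for the linear stand-in model
    pred = X @ W; grad is the analytic d(MSE)/dW."""
    pred = X @ W
    grad = 2.0 * X.T @ (pred - Y) / X.shape[0]
    return W - lr * grad, mse_loss(pred, Y)
```

Iterating `train_step` until the loss curve flattens mirrors the "train until the loss no longer decreases" convergence criterion described above.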
Please refer to fig. 4: the dimension-lifting operation clips the one-dimensional frequency domain sequence into 200 sequences of length 512 and stacks them by columns into a 512×200 two-dimensional matrix.
Step S3: outputting a signal and performing post-processing on the output signal to obtain a voice enhancement signal;
In one embodiment, the output signal is post-processed by taking positive values, a dimension reduction operation and an inverse Fourier transform. Taking positive values keeps the positive part of the network output; the dimension reduction operation splices the rectified network output back, in sequence, into a one-dimensional sequence of length 512×200, yielding the one-dimensional frequency response after voice quality enhancement; the inverse Fourier transform obtains the quality-enhanced speech with an inverse fast Fourier transform function.
Please refer to fig. 5: the dimension reduction operation splices the 512×200 two-dimensional matrix into a one-dimensional sequence of length 512×200 = 102400.
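The post-processing can be sketched as the inverse of the earlier preprocessing sketch. The phase-recombination detail is an assumption (the patent text does not spell out how phase is restored before the inverse FFT), and all names are illustrative:

```python
import numpy as np

def postprocess(matrix, phase):
    """Sketch of the post-processing: rectify the network output
    (take positive values), splice the 512x200 matrix column by column
    back into a length-102400 sequence (dimension reduction), recombine
    with the stored phase, and apply the inverse fast Fourier transform."""
    mag = np.maximum(matrix, 0.0)        # take positive values
    mag_1d = mag.T.reshape(-1)           # dimension reduction: 512x200 -> 102400
    spectrum = mag_1d * np.exp(1j * phase)
    return np.fft.ifft(spectrum).real    # quality-enhanced time-domain signal
```

The column-wise splice here undoes the column-wise stacking used in the dimension-lifting sketch, so the two operations round-trip consistently.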
In one implementation, the performance of the embodiment is assessed by objective evaluation of the model's speech denoising and enhancement performance; the specific indexes used are billions of floating point operations per second (GFLOPs), memory requirement (Memory) and execution time (Time), with:

GFLOPs = FLOPs / (Time × 10^9)

where FLOPs is the model's computation amount in floating point operations, Time is the model execution time in seconds, and GFLOPs is a computational performance indicator representing the number of billions of floating point operations performed per second.
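The metric above is a one-line computation; a small sketch (function name illustrative):

```python
def gflops(flops, time_seconds):
    """Billions of floating point operations per second:
    GFLOPs = FLOPs / (Time * 1e9)."""
    return flops / (time_seconds * 1e9)
```

For example, a model that needs 2×10^9 floating point operations and runs in one second achieves 2 GFLOPs.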
In one embodiment, model performance is evaluated on the VOICEBANK dataset; the experimental results are shown in Table 1, with the best value for each index in bold. Compared with current mainstream speech denoising and enhancement methods, the proposed method performs excellently.
TABLE 1. Comparison of the execution efficiency of speech noise reduction and enhancement methods on the VOICEBANK dataset
Where I represents computation on the CPU and B represents computation on the GPU. The compared algorithms Conv-TasNet, Demucs, DPRNN and Two-Step TDCN are described in: Luo Y, Mesgarani N. Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(8): 1256-1266; Défossez A, Usunier N, Bottou L, et al. Demucs: Deep extractor for music sources with extra unlabeled data remixed. arXiv preprint arXiv:1909.01174, 2019; Tzinis E, Venkataramani S, Wang Z, et al. Two-step sound source separation: Training on learned latent targets. ICASSP 2020 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020: 31-35.
The embodiment of the invention also provides a voice quality enhancement system based on the frequency domain self-attention network, which comprises the following units:
the preprocessing unit is used for inputting original voice and preprocessing the original voice to obtain the frequency response of voice data;
the voice quality enhancement unit is used for inputting the processed frequency response into the frequency domain self-attention network to obtain the frequency response after voice quality enhancement;
the frequency domain self-attention network comprises a position coding module and N identical basic unit modules;
the position coding module comprises a position coding layer for adding position information to the processed frequency response; the basic unit module comprises a multi-attention head layer, a residual error connection and layer normalization layer and a feedforward layer;
and the post-processing unit is used for carrying out post-processing on the frequency response after voice quality enhancement to obtain a final voice enhancement signal.
The specific implementation of each unit corresponds to the steps described above and is not repeated here.
The embodiment of the invention also provides a voice enhancement device based on the frequency domain self-attention network, which comprises:
one or more processors;
and a storage means for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the frequency domain self-attention network based speech quality enhancement method.
The invention can realize the removal of noise in the voice signal. Compared with the traditional voice quality enhancement method, the method has the advantages of stability, independence, rapidness, high efficiency and the like, can greatly improve the accuracy and efficiency of voice quality enhancement, provides powerful guarantee for the aspects of voice communication in the fields of business, military and the like, and has better popularization and application prospects.
It should be understood that the foregoing detailed description of preferred embodiments is not to be construed as limiting the scope of protection of the invention; those of ordinary skill in the art may make substitutions or modifications without departing from the scope of the invention as defined by the appended claims, and such substitutions and modifications fall within the protection scope of the invention.

Claims (9)

1. A method for enhancing speech quality based on a frequency domain self-attention network, comprising the steps of:
step S1, inputting original voice and preprocessing to obtain the frequency response of voice data;
step S2, inputting the processed frequency response into a frequency domain self-attention network to obtain a frequency response with enhanced voice quality;
the frequency domain self-attention network comprises a position coding module and N identical basic unit modules;
the position coding module comprises a position coding layer for adding position information to the processed frequency response; the basic unit module comprises a multi-attention head layer, a residual error connection and layer normalization layer and a feedforward layer;
and S3, carrying out post-processing on the frequency response after voice quality enhancement to obtain a final voice enhancement signal.
2. The method for enhancing speech quality based on a frequency domain self-attention network of claim 1, wherein: in step S1, the input original speech is preprocessed by Fourier transform, normalization and a dimension-lifting operation; the Fourier transform obtains the frequency response of the input speech data with a fast Fourier transform function, comprising an amplitude response characteristic and a phase response characteristic; the normalization applies min-max normalization to the amplitude response characteristic and scales the phase response characteristic into the interval 0 to 2π; and the dimension-lifting operation clips the one-dimensional frequency domain sequence into a number of sequences of a certain length and stacks them by columns into a two-dimensional matrix.
3. The method for enhancing speech quality based on a frequency domain self-attention network of claim 1, wherein: in step S2, the position coding function in the position coding module is:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

where PE denotes the position code, pos denotes the position of an element in the sequence, d_model denotes the model dimension, 2i denotes the even dimensions, and 2i+1 denotes the odd dimensions.
4. The method for enhancing speech quality based on a frequency domain self-attention network of claim 1, wherein: in step S2, the position-encoded frequency response X is input to a multi-attention-head layer consisting of several parallel attention heads, where each attention head contains three trainable weight matrices W^Q, W^K and W^V used to obtain the query Q, key K and value V:

Q = X W^Q,  K = X W^K,  V = X W^V

After obtaining the matrices Q, K and V, the output of the multi-attention-head layer is calculated as:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

where d_k is the number of columns of the K matrix (i.e. the vector dimension), T denotes the transpose operation, and softmax is the normalization function;
The output of the multi-attention-head layer and the position-encoded frequency response are input together to a residual connection layer, which eases the training of deep networks; layer normalization is then applied to the output of the residual connection layer; the layer normalization result is input to the feedforward layer so that the final output matrix dimension matches the input dimension; finally, residual connection and layer normalization are applied to the feedforward layer result to obtain the final frequency response.
5. The method for enhancing speech quality based on a frequency domain self-attention network of claim 4, wherein: the calculation formula of the normalization function is as follows:
wherein,for vector +.>Is the maximum value of (2);
the residual connection layer consists of 2 convolution layers, and the specific formula is as follows:
wherein,for the output of the residual connection layer, < >>Output of the 2 nd convolution layer in the residual connection layer,>' is the input of the residual connection layer;
The feed-forward layer comprises two fully connected layers, wherein the first layer uses a ReLU activation function and the second layer uses no activation function; specifically:

$$\mathrm{FFN}(x) = \mathrm{ReLU}(xW_1 + b_1)\,W_2 + b_2$$

where $x$ is the input, $W_1$ and $W_2$ are the weight parameters of the two fully connected layers, and $b_1$ and $b_2$ are the biases of the two fully connected layers.
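The feed-forward layer, residual connection and layer normalization of the basic unit can be sketched as below. The dimensions, zero-initialized biases, and the epsilon in the layer-norm denominator are illustrative assumptions, not values from the patent:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Two fully connected layers: ReLU after the first, no activation on the second.
    h = np.maximum(0.0, x @ W1 + b1)   # first layer + ReLU
    return h @ W2 + b2                 # second layer restores the input width

rng = np.random.default_rng(1)
d, d_ff = 16, 64                        # assumed model and hidden dimensions
x = rng.standard_normal((8, d))
W1, b1 = rng.standard_normal((d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d)), np.zeros(d)

# Residual connection followed by layer normalization, as in each basic unit.
y = x + feed_forward(x, W1, b1, W2, b2)
y = (y - y.mean(-1, keepdims=True)) / (y.std(-1, keepdims=True) + 1e-6)
print(y.shape)  # (8, 16): dimension unchanged, as the claim requires
```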
6. The method for enhancing speech quality based on a frequency domain self-attention network of claim 1, wherein: in step S2, the frequency domain self-attention network is a trained frequency domain self-attention network; the training process comprises the following substeps:
step SS1, using the VoiceBank dataset, which contains original (noisy) voice and the corresponding clean voice;
and step SS2, preprocessing the dataset, inputting the preprocessed data into the frequency domain self-attention network for training, and continuously optimizing the model parameters through the back-propagation algorithm to achieve a better voice enhancement effect.
7. The method for enhancing speech quality based on a frequency domain self-attention network of claim 6, wherein: in step SS2, the preprocessing comprises Fourier transform, normalization and dimension-raising operations; firstly, a Fourier transform is applied to the input original voice to obtain the frequency response; then, the frequency response is normalized; finally, a dimension-raising operation is applied to the normalized frequency response to obtain the calculation matrix; during training, a mean-square-error loss function is adopted, and training continues until the network converges, i.e. the training loss curve remains stable and no longer decreases.
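The preprocessing chain (Fourier transform, normalization, dimension-raising into a matrix) might look as follows. The frame length of 256, the max-value normalization, and the toy 440 Hz input are illustrative assumptions; the patent does not specify these concrete values:

```python
import numpy as np

def preprocess(wave, frame_len=256):
    # FFT -> magnitude normalization -> reshape ("dimension-raising") into a matrix.
    spec = np.fft.fft(wave)                  # frequency response of the voice
    mag = np.abs(spec)
    mag = mag / (mag.max() + 1e-12)          # normalize magnitudes to [0, 1]
    n = len(mag) // frame_len * frame_len    # drop the tail that does not fill a row
    return mag[:n].reshape(-1, frame_len)    # (rows, frame_len) calculation matrix

# Toy input: a pure 440 Hz tone at a 16 kHz sampling rate (assumed values).
wave = np.sin(2 * np.pi * 440 * np.arange(4096) / 16000)
X = preprocess(wave)
print(X.shape)  # (16, 256)
```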
8. The method for enhancing speech quality based on a frequency domain self-attention network of claim 1, wherein in step S3, said post-processing comprises taking positive values, dimension reduction and inverse Fourier transform: taking positive values means keeping the positive part of the network output result; the dimension reduction operation splices the positive-valued result, in order, into a one-dimensional sequence, so as to obtain the one-dimensional frequency response with enhanced voice quality; and the inverse Fourier transform uses an inverse fast Fourier transform function to obtain the quality-enhanced voice signal.
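The three post-processing steps can be sketched in a few lines. The (16, 256) output shape and the use of the real part after the inverse FFT are illustrative assumptions:

```python
import numpy as np

def postprocess(net_out):
    # 1) Take positive values of the network output.
    pos = np.maximum(net_out, 0.0)
    # 2) Dimension reduction: splice the rows, in order, into a 1-D frequency response.
    freq = pos.reshape(-1)
    # 3) Inverse fast Fourier transform back to a time-domain voice signal.
    return np.real(np.fft.ifft(freq))

net_out = np.random.default_rng(2).standard_normal((16, 256))  # stand-in for the network output
signal = postprocess(net_out)
print(signal.shape)  # (4096,)
```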
9. A speech quality enhancement system based on a frequency domain self-attention network, comprising the following elements:
the preprocessing unit is used for inputting original voice and preprocessing the original voice to obtain the frequency response of voice data;
the voice quality enhancement unit is used for inputting the processed frequency response into the frequency domain self-attention network to obtain the frequency response after voice quality enhancement;
the frequency domain self-attention network comprises a position coding module and N identical basic unit modules;
the position coding module comprises a position coding layer for adding position information to the processed frequency response; the basic unit module comprises a multi-attention head layer, a residual error connection and layer normalization layer and a feedforward layer;
and the post-processing unit is used for carrying out post-processing on the frequency response after voice quality enhancement to obtain a final voice enhancement signal.
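The position coding module adds position information to each row of the frequency response. The patent does not specify the encoding; a standard sinusoidal position code is assumed here purely for illustration, with assumed dimensions:

```python
import numpy as np

def positional_encoding(T, d):
    # Sinusoidal position code (an assumption; the patent only states that
    # position information is added): sine on even columns, cosine on odd ones.
    pos = np.arange(T)[:, None]
    i = np.arange(d)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

pe = positional_encoding(8, 16)  # added element-wise to the (8, 16) frequency-response matrix
print(pe.shape)                  # (8, 16)
```

Because the code is simply added to the input matrix, the N identical basic unit modules that follow see both the spectral values and their positions.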
CN202410163875.7A 2024-02-05 Voice quality enhancement method and system based on frequency domain self-attention network Active CN117711417B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410163875.7A CN117711417B (en) 2024-02-05 Voice quality enhancement method and system based on frequency domain self-attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410163875.7A CN117711417B (en) 2024-02-05 Voice quality enhancement method and system based on frequency domain self-attention network

Publications (2)

Publication Number Publication Date
CN117711417A true CN117711417A (en) 2024-03-15
CN117711417B CN117711417B (en) 2024-04-30


Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180350346A1 (en) * 2017-06-05 2018-12-06 Baidu Online Network Technology (Beijing) Co., Ltd. Speech recognition method based on artificial intelligence and terminal
CN109492232A (en) * 2018-10-22 2019-03-19 Inner Mongolia University of Technology A Mongolian-Chinese machine translation method based on Transformer with enhanced semantic feature information
CN111696567A (en) * 2020-06-12 2020-09-22 苏州思必驰信息科技有限公司 Noise estimation method and system for far-field call
CN112767959A (en) * 2020-12-31 2021-05-07 恒安嘉新(北京)科技股份公司 Voice enhancement method, device, equipment and medium
KR102287499B1 (en) * 2020-09-15 2021-08-09 주식회사 에이아이더뉴트리진 Method and apparatus for synthesizing speech reflecting phonemic rhythm
CN114283795A (en) * 2021-12-24 2022-04-05 思必驰科技股份有限公司 Training and recognition method of voice enhancement model, electronic equipment and storage medium
CN114678033A (en) * 2022-03-15 2022-06-28 南京邮电大学 Speech enhancement algorithm based on multi-head attention mechanism only comprising encoder
US20220286775A1 (en) * 2021-03-05 2022-09-08 Honda Motor Co., Ltd. Acoustic processing device, acoustic processing method, and storage medium
US20220291328A1 (en) * 2015-07-17 2022-09-15 Muhammed Zahid Ozturk Method, apparatus, and system for speech enhancement and separation based on audio and radio signals
US20220310108A1 (en) * 2021-03-23 2022-09-29 Qualcomm Incorporated Context-based speech enhancement
US20230075891A1 (en) * 2021-03-11 2023-03-09 Tencent Technology (Shenzhen) Company Limited Speech synthesis method and apparatus, and readable storage medium
WO2023044961A1 (en) * 2021-09-23 2023-03-30 武汉大学 Multi-feature fusion echo cancellation method and system based on self-attention transform network
CN115881157A (en) * 2021-09-29 2023-03-31 北京三星通信技术研究有限公司 Audio signal processing method and related equipment
CN116013344A (en) * 2022-12-17 2023-04-25 西安交通大学 Speech enhancement method under multiple noise environments
US20230162758A1 (en) * 2021-11-19 2023-05-25 Massachusetts Institute Of Technology Systems and methods for speech enhancement using attention masking and end to end neural networks
US20230223011A1 (en) * 2022-01-10 2023-07-13 Intone Inc. Real time correction of accent in speech audio signals
US20230253003A1 (en) * 2020-11-27 2023-08-10 Beijing Sogou Technology Development Co., Ltd. Speech processing method and speech processing apparatus
CN116798410A (en) * 2023-07-31 2023-09-22 易方信息科技股份有限公司 Speech recognition method, system, equipment and medium with enhanced local features
CN116884426A (en) * 2023-07-11 2023-10-13 武汉大学 Voice enhancement method, device and equipment based on DFSMN model
US20230395087A1 (en) * 2020-10-16 2023-12-07 Google Llc Machine Learning for Microphone Style Transfer

Patent Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220291328A1 (en) * 2015-07-17 2022-09-15 Muhammed Zahid Ozturk Method, apparatus, and system for speech enhancement and separation based on audio and radio signals
US20180350346A1 (en) * 2017-06-05 2018-12-06 Baidu Online Network Technology (Beijing) Co., Ltd. Speech recognition method based on artificial intelligence and terminal
CN109492232A (en) * 2018-10-22 2019-03-19 Inner Mongolia University of Technology A Mongolian-Chinese machine translation method based on Transformer with enhanced semantic feature information
CN111696567A (en) * 2020-06-12 2020-09-22 苏州思必驰信息科技有限公司 Noise estimation method and system for far-field call
KR102287499B1 (en) * 2020-09-15 2021-08-09 주식회사 에이아이더뉴트리진 Method and apparatus for synthesizing speech reflecting phonemic rhythm
US20230395087A1 (en) * 2020-10-16 2023-12-07 Google Llc Machine Learning for Microphone Style Transfer
US20230253003A1 (en) * 2020-11-27 2023-08-10 Beijing Sogou Technology Development Co., Ltd. Speech processing method and speech processing apparatus
CN112767959A (en) * 2020-12-31 2021-05-07 恒安嘉新(北京)科技股份公司 Voice enhancement method, device, equipment and medium
US20220286775A1 (en) * 2021-03-05 2022-09-08 Honda Motor Co., Ltd. Acoustic processing device, acoustic processing method, and storage medium
US20230075891A1 (en) * 2021-03-11 2023-03-09 Tencent Technology (Shenzhen) Company Limited Speech synthesis method and apparatus, and readable storage medium
US20220310108A1 (en) * 2021-03-23 2022-09-29 Qualcomm Incorporated Context-based speech enhancement
CN117043861A (en) * 2021-03-23 2023-11-10 高通股份有限公司 Context-based speech enhancement
WO2023044961A1 (en) * 2021-09-23 2023-03-30 武汉大学 Multi-feature fusion echo cancellation method and system based on self-attention transform network
CN115881157A (en) * 2021-09-29 2023-03-31 北京三星通信技术研究有限公司 Audio signal processing method and related equipment
US20230162758A1 (en) * 2021-11-19 2023-05-25 Massachusetts Institute Of Technology Systems and methods for speech enhancement using attention masking and end to end neural networks
CN114283795A (en) * 2021-12-24 2022-04-05 思必驰科技股份有限公司 Training and recognition method of voice enhancement model, electronic equipment and storage medium
US20230223011A1 (en) * 2022-01-10 2023-07-13 Intone Inc. Real time correction of accent in speech audio signals
CN114678033A (en) * 2022-03-15 2022-06-28 南京邮电大学 Speech enhancement algorithm based on multi-head attention mechanism only comprising encoder
CN116013344A (en) * 2022-12-17 2023-04-25 西安交通大学 Speech enhancement method under multiple noise environments
CN116884426A (en) * 2023-07-11 2023-10-13 武汉大学 Voice enhancement method, device and equipment based on DFSMN model
CN116798410A (en) * 2023-07-31 2023-09-22 易方信息科技股份有限公司 Speech recognition method, system, equipment and medium with enhanced local features

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Yan; JIA Hairong; JI Huifang; WANG Weimei: "Speech enhancement algorithm based on feature jointly optimized deep belief network", Computer Engineering and Applications (计算机工程与应用), no. 09, 8 December 2018 (2018-12-08) *

Similar Documents

Publication Publication Date Title
Qian et al. Very deep convolutional neural networks for noise robust speech recognition
Lu et al. Conditional diffusion probabilistic model for speech enhancement
WO2021042870A1 (en) Speech processing method and apparatus, electronic device, and computer-readable storage medium
CN110739003B (en) Voice enhancement method based on multi-head self-attention mechanism
Yen et al. Cold diffusion for speech enhancement
CN110060657B (en) SN-based many-to-many speaker conversion method
Zezario et al. Self-supervised denoising autoencoder with linear regression decoder for speech enhancement
CN112435652A (en) Voice keyword recognition system and method based on graph convolution neural network
Hasannezhad et al. PACDNN: A phase-aware composite deep neural network for speech enhancement
CN117059103A (en) Acceleration method of voice recognition fine tuning task based on low-rank matrix approximation
CN114373451A (en) End-to-end Chinese speech recognition method
CN111429893A (en) Many-to-many speaker conversion method based on Transitive STARGAN
CN115101085A (en) Multi-speaker time-domain voice separation method for enhancing external attention through convolution
Ju et al. Tea-pse 3.0: Tencent-ethereal-audio-lab personalized speech enhancement system for icassp 2023 dns-challenge
Jiang et al. Speaker attractor network: Generalizing speech separation to unseen numbers of sources
CN117711417B (en) Voice quality enhancement method and system based on frequency domain self-attention network
Li et al. Deeplabv3+ vision transformer for visual bird sound denoising
Tang et al. Acoustic modeling with densely connected residual network for multichannel speech recognition
Raj et al. Multilayered convolutional neural network-based auto-CODEC for audio signal denoising using mel-frequency cepstral coefficients
Shahnawazuddin et al. Sparse coding over redundant dictionaries for fast adaptation of speech recognition system
Li et al. A Convolutional Neural Network with Non-Local Module for Speech Enhancement.
CN117711417A (en) Voice quality enhancement method and system based on frequency domain self-attention network
CN113707172B (en) Single-channel voice separation method, system and computer equipment of sparse orthogonal network
Li et al. A fast convolutional self-attention based speech dereverberation method for robust speech recognition
Zhang et al. Enhanced-Deep-Residual-Shrinkage-Network-Based Voiceprint Recognition in the Electric Industry

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant