CN117711417A - Voice quality enhancement method and system based on frequency domain self-attention network - Google Patents

Voice quality enhancement method and system based on frequency domain self-attention network

Info

Publication number
CN117711417A
CN117711417A (application CN202410163875.7A)
Authority
CN
China
Prior art keywords
layer
voice
frequency domain
frequency response
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410163875.7A
Other languages
Chinese (zh)
Other versions
CN117711417B
Inventor
袁程浩
归子涵
刘瑨玮
杨光义
贺威
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202410163875.7A
Publication of CN117711417A
Application granted
Publication of CN117711417B
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a voice quality enhancement method and system based on a frequency domain self-attention network. Original speech is input and preprocessed; the processed frequency response is then fed into a frequency domain self-attention network; finally, the network output is post-processed to obtain the speech enhancement signal. The frequency domain self-attention network comprises a position coding module and N identical basic unit modules, where N is determined by the desired network depth; the position coding module comprises a position coding layer, and each basic unit module comprises a multi-attention-head layer, a residual connection and layer normalization layer, and a feedforward layer. The invention removes noise from speech signals and is of practical significance for voice communication.

Description

Voice quality enhancement method and system based on frequency domain self-attention network
Technical Field
The invention belongs to the technical field of voice quality processing, relates to a voice quality enhancement method and a voice quality enhancement system, and particularly relates to a voice quality enhancement method and a voice quality enhancement system based on a frequency domain self-attention network.
Background
Numerous practical theories and algorithms developed in the field of digital signal processing in the mid-1960s, such as the Fast Fourier Transform (FFT) and various digital filters, form the theoretical and technical basis of digital speech signal processing. Since the late 1970s, linear predictive coding (LPC) has been used for information compression and feature extraction of speech signals and has become a very important tool in speech signal processing. A significant development in speech signal processing in the 1980s was the introduction of Hidden Markov Models (HMMs) to describe the speech signal process. Since the 1990s, speech signal acquisition and analysis techniques have achieved a number of breakthrough research advances in practical applications.
In fields such as business, education, and medical care where remote work is required, there is great demand for teleconferencing systems, and the voice quality of such systems is critical; removing noise to a large extent therefore has a decisive effect on voice quality improvement. In full duplex communication, these problems become more challenging when echo interferes in double-talk (DT) scenarios. A solution that can jointly address acoustic echo, noise, and reverberation is thus critical to achieving seamless communication.
In recent years, with advances in science and technology, research on artificial neural networks (ANNs) has progressed rapidly; speech signal processing has both driven this development and benefited from it, with many of its results embodied in speech-processing technologies. Joint AEC (acoustic echo cancellation) and NS (noise suppression) approaches have been developed to simplify the communication pipeline while providing good AEC and NS performance. For example, MTFAA-Net is a neural network for joint AEC and NS based on multi-scale time-frequency processing and streaming axial attention. However, MTFAA-Net still relies on classical AEC components.
However, current deep-learning-based approaches still model speech noise imperfectly. At the same time, real-time noise reduction capability is very important for voice communication: to improve the user's communication experience, the time complexity of the algorithm must be reduced so that noise reduction can run in real time.
Disclosure of Invention
In order to solve the problem of low real-time performance of the voice quality enhancement method in the prior art, the invention provides a voice quality enhancement method and a voice quality enhancement system based on a frequency domain self-attention network, which can be applied to enhancing the voice quality in the fields of business, military and the like.
The technical scheme adopted by the method is as follows: a method for enhancing speech quality based on a frequency domain self-attention network, comprising the steps of:
step S1, inputting original voice and preprocessing to obtain the frequency response of voice data;
step S2, inputting the processed frequency response into a frequency domain self-attention network to obtain a frequency response with enhanced voice quality;
the frequency domain self-attention network comprises a position coding module and N identical basic unit modules;
the position coding module comprises a position coding layer for adding position information to the processed frequency response; the basic unit module comprises a multi-attention head layer, a residual error connection and layer normalization layer and a feedforward layer;
and S3, carrying out post-processing on the frequency response after voice quality enhancement to obtain a final voice enhancement signal.
Further, in step S1, the input original speech is preprocessed by Fourier transform, normalization and a dimension-lifting operation. The Fourier transform obtains the frequency response of the input speech data with a fast Fourier transform function, comprising an amplitude response characteristic and a phase response characteristic; the normalization applies min-max normalization to the amplitude response characteristic and scales the phase response characteristic into the interval 0 to 2π; the dimension-lifting operation clips the one-dimensional frequency domain sequence into a number of sequences of a certain length and stacks them by columns into a two-dimensional matrix.
Further, in step S2, the position coding function in the position coding module is:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

where PE denotes the position code, pos denotes the position of an element in the sequence, d_model denotes the model dimension, 2i denotes the even dimensions, and 2i+1 denotes the odd dimensions.
Further, in step S2, the position-encoded frequency response X is input to a multi-attention-head layer consisting of several parallel attention heads, where each attention head contains three trainable weight matrices W^Q, W^K and W^V used to obtain the query Q, key K and value V:

Q = X W^Q,  K = X W^K,  V = X W^V

After obtaining the matrices Q, K and V, the output of the multi-attention-head layer is calculated as:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

where d_k is the number of columns of the K matrix (i.e. the vector dimension), T denotes the transpose operation, and softmax is the normalization function;
The output of the multi-attention-head layer and the position-encoded frequency response are input together to a residual connection layer, which eases the training of deep networks; layer normalization is then applied to the output of the residual connection layer; the layer normalization result is input to the feedforward layer so that the final output matrix dimension matches the input dimension; finally, residual connection and layer normalization are applied to the feedforward layer result to obtain the final frequency response.
Further, the normalization function is calculated as:

softmax(z_i) = exp(z_i − max(z)) / Σ_j exp(z_j − max(z))

where max(z) is the maximum value of the vector z;

The residual connection layer consists of 2 convolution layers, with the specific formula:

Y = F(X) + X

where Y is the output of the residual connection layer, F(X) is the output of the 2nd convolution layer in the residual connection layer, and X is the input of the residual connection layer;

The feedforward layer comprises two fully connected layers; the first layer uses a ReLU activation function and the second layer uses no activation function:

FFN(X) = max(0, X W_1 + b_1) W_2 + b_2

where X is the input, W_1 and W_2 are the parameters of the two fully connected layers, and b_1 and b_2 are the biases of the two fully connected layers.
Further, in step S2, the frequency domain self-attention network is a trained frequency domain self-attention network; the training process comprises the following substeps:
step SS1, using a VOICEBANK data set containing original voice and clean voice;
and step SS2, preprocessing the data set, inputting the preprocessed data set into a frequency domain self-attention network for training, and continuously optimizing model parameters through a back propagation algorithm to achieve a better voice enhancement effect.
Further, in step SS2, the preprocessing comprises Fourier transform, normalization and a dimension-lifting operation. The input original speech is first Fourier-transformed to obtain the frequency response; the frequency response is then normalized; finally, the dimension-lifting operation is applied to the normalized frequency response to obtain the computation matrix. During training, a mean square error loss function is used, and training continues until the network converges, i.e. until the training loss curve remains stable and no longer decreases.
Further, in step S3, the post-processing comprises taking positive values, a dimension reduction operation and an inverse Fourier transform. Taking positive values keeps the positive part of the network output; the dimension reduction operation splices the result in sequence back into a one-dimensional sequence, yielding the one-dimensional frequency response after voice quality enhancement; the inverse Fourier transform obtains the quality-enhanced speech signal with an inverse fast Fourier transform function.
The invention also provides a voice quality enhancement system based on the frequency domain self-attention network, which comprises the following units:
the preprocessing unit is used for inputting original voice and preprocessing the original voice to obtain the frequency response of voice data;
the voice quality enhancement unit is used for inputting the processed frequency response into the frequency domain self-attention network to obtain the frequency response after voice quality enhancement;
the frequency domain self-attention network comprises a position coding module and N identical basic unit modules;
the position coding module comprises a position coding layer for adding position information to the processed frequency response; the basic unit module comprises a multi-attention head layer, a residual error connection and layer normalization layer and a feedforward layer;
and the post-processing unit is used for carrying out post-processing on the frequency response after voice quality enhancement to obtain a final voice enhancement signal.
The invention uses a frequency domain self-attention network to enhance original voice quality. The technique combines frequency domain analysis and deep learning: a fast Fourier transform first obtains the frequency response of the original speech signal, which contains both valid speech components and invalid noise. A dimension-lifting operation then turns the frequency response into a two-dimensional matrix suitable for network input. The frequency domain self-attention network model extracts features from the dimension-lifted frequency response. Finally, a dimension reduction operation is applied to the output signal, and the noise in the speech signal is thereby removed. Compared with traditional voice quality enhancement methods, the method is stable, independent, fast and efficient; it can greatly improve the accuracy and efficiency of voice quality enhancement and provides a strong guarantee for voice communication in fields such as business and the military.
Drawings
The following drawings are used, together with the specific embodiments, to further illustrate the technical solutions herein. For a person skilled in the art, other figures and applications of the present invention can be derived from these drawings without inventive effort.
FIG. 1 is a flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a diagram of a frequency domain self-attention network architecture in accordance with an embodiment of the present invention;
FIG. 3 is a flow chart of a frequency domain self-attention network training process according to an embodiment of the present invention;
FIG. 4 is a diagram of a dimension up operation structure in accordance with an embodiment of the present invention;
fig. 5 is a diagram illustrating a dimension reduction operation structure according to an embodiment of the present invention.
Detailed Description
To facilitate understanding and practice of the invention by those of ordinary skill in the art, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the embodiments described here serve only to illustrate and explain the invention and are not intended to limit it.
The present embodiment takes a given voice data set to be tested as an example, and further describes the present invention. Referring to fig. 1, the method for enhancing voice quality based on the frequency domain self-attention network provided in this embodiment includes the following steps:
step S1: inputting and preprocessing original voice signals in a given data set to be tested;
in one embodiment, the input original speech is preprocessed, including fourier transform, normalization and dimension-lifting operation, the fourier transform is to obtain frequency response of the input speech data by using a fast fourier transform function, including amplitude response characteristic and phase response characteristic, the normalization is to normalize the amplitude response characteristic and the phase response characteristic by using maximum and minimum values, and dimension-convert the phase response characteristic into a length interval of 0 to 2 pi, the dimension-lifting operation is to clip a frequency domain signal which is a one-dimensional sequence into 200 sequences with length of 512, and stack the sequences into a 512×200 two-dimensional matrix by columns.
Step S2: inputting the processed frequency response into a frequency domain self-attention network to obtain a frequency response with enhanced voice quality;
please refer to fig. 2, wherein the frequency-entering domain self-attention network includes a position coding module and N identical basic unit modules; the position coding module comprises a position coding layer; the basic unit module comprises a multi-attention head layer, a residual error connection and layer normalization layer and a feedforward layer; the N identical base unit modules, where N is determined by the desired network depth.
In one embodiment, the position coding module adds sequence position information to the processed frequency response; the specific position coding function is:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

where PE denotes the position code, pos denotes the position of an element in the sequence, d_model denotes the model dimension, 2i denotes the even dimensions and 2i+1 the odd dimensions (i.e. 0 ≤ 2i ≤ d_model and 0 ≤ 2i+1 ≤ d_model).
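A minimal NumPy sketch of this sinusoidal position code (the standard Transformer formulation, which the position coding function above follows; d_model is assumed even):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position code: sin on even dimensions 2i,
    cos on odd dimensions 2i+1."""
    pos = np.arange(seq_len)[:, None]          # positions 0..seq_len-1
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices 2i
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # even dimensions
    pe[:, 1::2] = np.cos(angle)                # odd dimensions
    return pe
```

In use, the code matrix is simply added element-wise to the input frequency-response matrix before the first basic unit module.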
In one embodiment, the basic unit module inputs the position-encoded frequency response X to a multi-attention-head layer consisting of 8 parallel attention heads, where each attention head contains three trainable weight matrices W^Q, W^K and W^V used to obtain Q (query), K (key) and V (value):

Q = X W^Q,  K = X W^K,  V = X W^V

After obtaining the matrices Q, K and V, the output of the multi-attention-head layer can be calculated as:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

where d_k is the number of columns of the K matrix (i.e. the vector dimension), T denotes the transpose operation, and softmax is the normalization function, calculated as:

softmax(z_i) = exp(z_i − max(z)) / Σ_j exp(z_j − max(z))

where max(z) is the maximum element of z, and z_i and z_j are elements of z;
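One attention head of this layer can be sketched as follows (a single head only; in the embodiment 8 such heads run in parallel and their outputs are combined — that combination step is omitted here):

```python
import numpy as np

def attention_head(X, Wq, Wk, Wv):
    """One attention head: Q = X Wq, K = X Wk, V = X Wv, then
    softmax(Q K^T / sqrt(d_k)) V using the max-subtracted softmax
    given in the text for numerical stability."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]                                  # columns of K
    scores = Q @ K.T / np.sqrt(d_k)
    # row-wise softmax with max subtraction (numerically stable)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V
```

Subtracting the row maximum before exponentiation leaves the softmax result unchanged but prevents overflow for large scores.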
the method comprises the steps of inputting the output of a multi-attention head layer and the frequency response after position coding into a residual connection layer, wherein the residual connection layer consists of 2 convolution layers and is used for solving the problem of multi-layer network training, and the network pays attention to a current difference part, and the specific formula is as follows:
wherein,for the output of the residual connection layer, < >>Output of the 2 nd convolution layer in the residual connection layer,>as the input of the residual connection layer, adopting a residual structure, so that a network is enabled to 'short circuit' certain layers when the depth is deeper to prevent the network from degradation;
Layer normalization is then applied to the output of the residual connection layer, normalizing the inputs of the neurons in each layer to a consistent distribution so as to accelerate convergence;
The layer normalization result is input to a feedforward layer so that the final output matrix dimension matches the input dimension. The feedforward layer consists of two fully connected layers; the first uses a ReLU activation function and the second uses no activation function:

FFN(X) = max(0, X W_1 + b_1) W_2 + b_2

where X is the input, W_1 and W_2 are the parameters of the two fully connected layers, and b_1 and b_2 are their biases;
and finally, carrying out residual connection and layer normalization on the feedforward layer result to obtain a final frequency response.
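The tail of the basic unit module — feedforward layer, residual connection, and layer normalization — can be sketched as follows. This is a simplified illustration: the learnable scale/shift parameters usually present in layer normalization are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each row to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Two fully connected layers: ReLU on the first, none on the second."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def basic_unit_tail(x, W1, b1, W2, b2):
    """Feedforward layer with residual connection, then layer norm;
    W2 maps back to the input width so output dims match input dims."""
    return layer_norm(x + feed_forward(x, W1, b1, W2, b2))
```

Because W2 projects back to the input width, the residual addition is well-defined and the final output matrix dimension stays consistent with the input, as required by the text.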
Referring to fig. 3, in one embodiment, the frequency domain self-attention network is a trained frequency domain self-attention network; the training process comprises the following substeps:
step SS1: using a VOICEBANK dataset comprising original speech and clean speech;
In one embodiment, the VOICEBANK dataset, a speech denoising dataset commonly used in deep learning research, is used as the benchmark.
Step SS2: preprocessing a data set, inputting the preprocessed data set into a frequency domain self-attention network model for training, and continuously optimizing model parameters through a back propagation algorithm to achieve a better voice enhancement effect;
In one embodiment, the preprocessing comprises Fourier transform, normalization and a dimension-lifting operation. The input original speech is first Fourier-transformed to obtain the frequency response; the frequency response is then normalized; finally, the dimension-lifting operation is applied to the normalized frequency response to obtain the computation matrix. During training, a mean square error loss function is used, and training continues until the network converges, i.e. until the training loss curve remains stable and no longer decreases; the model with the best speech denoising and enhancement effect is taken as the final result.
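The MSE-loss/back-propagation mechanics can be illustrated with a toy gradient-descent step. Note the assumption: a linear stand-in model pred = X @ W replaces the actual self-attention network here, purely to show the loss and update rule; the learning rate and function names are illustrative.

```python
import numpy as np

def mse_loss(pred, target):
    """Mean square error loss used during training."""
    return float(np.mean((pred - target) ** 2))

def train_step(W, X, Y, lr=0.1):
    """One gradient-descent step for the linear stand-in model
    pred = X @ W; grad is the analytic d(MSE)/dW."""
    pred = X @ W
    grad = 2.0 * X.T @ (pred - Y) / X.shape[0]
    return W - lr * grad, mse_loss(pred, Y)
```

Iterating `train_step` until the loss curve flattens mirrors the "train until the loss no longer decreases" convergence criterion described above.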
Please refer to fig. 4: the dimension-lifting operation clips the one-dimensional frequency domain sequence into 200 sequences of length 512 and stacks them by columns into a 512×200 two-dimensional matrix.
Step S3: outputting a signal and performing post-processing on the output signal to obtain a voice enhancement signal;
In one embodiment, the output signal is post-processed by taking positive values, a dimension reduction operation and an inverse Fourier transform. Taking positive values keeps the positive part of the network output; the dimension reduction operation splices the rectified network output back, in sequence, into a one-dimensional sequence of length 512×200, yielding the one-dimensional frequency response after voice quality enhancement; the inverse Fourier transform obtains the quality-enhanced speech with an inverse fast Fourier transform function.
Please refer to fig. 5: the dimension reduction operation splices the 512×200 two-dimensional matrix into a one-dimensional sequence of length 512×200 = 102400.
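The post-processing can be sketched as the inverse of the earlier preprocessing sketch. The phase-recombination detail is an assumption (the patent text does not spell out how phase is restored before the inverse FFT), and all names are illustrative:

```python
import numpy as np

def postprocess(matrix, phase):
    """Sketch of the post-processing: rectify the network output
    (take positive values), splice the 512x200 matrix column by column
    back into a length-102400 sequence (dimension reduction), recombine
    with the stored phase, and apply the inverse fast Fourier transform."""
    mag = np.maximum(matrix, 0.0)        # take positive values
    mag_1d = mag.T.reshape(-1)           # dimension reduction: 512x200 -> 102400
    spectrum = mag_1d * np.exp(1j * phase)
    return np.fft.ifft(spectrum).real    # quality-enhanced time-domain signal
```

The column-wise splice here undoes the column-wise stacking used in the dimension-lifting sketch, so the two operations round-trip consistently.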
In one implementation, the performance of the embodiment is assessed by objective evaluation of the model's speech denoising and enhancement performance; the specific indexes used are billions of floating point operations per second (GFLOPs), memory requirement (Memory) and execution time (Time), with:

GFLOPs = FLOPs / (Time × 10^9)

where FLOPs is the model's computation amount in floating point operations, Time is the model execution time in seconds, and GFLOPs is a computational performance indicator representing the number of billions of floating point operations performed per second.
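The metric above is a one-line computation; a small sketch (function name illustrative):

```python
def gflops(flops, time_seconds):
    """Billions of floating point operations per second:
    GFLOPs = FLOPs / (Time * 1e9)."""
    return flops / (time_seconds * 1e9)
```

For example, a model that needs 2×10^9 floating point operations and runs in one second achieves 2 GFLOPs.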
In one embodiment, model performance is evaluated on the VOICEBANK dataset; the experimental results are shown in Table 1, with the best value for each index in bold. Compared with current mainstream speech denoising and enhancement methods, the proposed method performs excellently.
TABLE 1. Comparison of the execution efficiency of speech noise reduction and enhancement methods on the VOICEBANK dataset
Where I represents computation on the CPU and B represents computation on the GPU. The compared algorithms Conv-TasNet, Demucs, DPRNN and Two-Step TDCN are described in: Luo Y, Mesgarani N. Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(8): 1256-1266; Défossez A, Usunier N, Bottou L, et al. Demucs: Deep extractor for music sources with extra unlabeled data remixed. arXiv preprint arXiv:1909.01174, 2019; Tzinis E, Venkataramani S, Wang Z, et al. Two-step sound source separation: Training on learned latent targets. ICASSP 2020 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020: 31-35.
The embodiment of the invention also provides a voice quality enhancement system based on the frequency domain self-attention network, which comprises the following units:
the preprocessing unit is used for inputting original voice and preprocessing the original voice to obtain the frequency response of voice data;
the voice quality enhancement unit is used for inputting the processed frequency response into the frequency domain self-attention network to obtain the frequency response after voice quality enhancement;
the frequency domain self-attention network comprises a position coding module and N identical basic unit modules;
the position coding module comprises a position coding layer for adding position information to the processed frequency response; the basic unit module comprises a multi-attention head layer, a residual error connection and layer normalization layer and a feedforward layer;
and the post-processing unit is used for carrying out post-processing on the frequency response after voice quality enhancement to obtain a final voice enhancement signal.
The specific implementation of each unit corresponds to the steps described above and is not repeated here.
The embodiment of the invention also provides a voice enhancement device based on the frequency domain self-attention network, which comprises:
one or more processors;
and a storage means for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the frequency domain self-attention network based speech quality enhancement method.
The invention can realize the removal of noise in the voice signal. Compared with the traditional voice quality enhancement method, the method has the advantages of stability, independence, rapidness, high efficiency and the like, can greatly improve the accuracy and efficiency of voice quality enhancement, provides powerful guarantee for the aspects of voice communication in the fields of business, military and the like, and has better popularization and application prospects.
It should be understood that the foregoing detailed description of preferred embodiments is not to be construed as limiting the scope of protection of the invention; those of ordinary skill in the art may make substitutions or modifications without departing from the scope of the invention as defined by the appended claims, and such substitutions and modifications fall within the protection scope of the invention.

Claims (9)

1. A method for enhancing speech quality based on a frequency domain self-attention network, comprising the steps of:
step S1, inputting original voice and preprocessing to obtain the frequency response of voice data;
step S2, inputting the processed frequency response into a frequency domain self-attention network to obtain a frequency response with enhanced voice quality;
the frequency domain self-attention network comprises a position coding module and N identical basic unit modules;
the position coding module comprises a position coding layer for adding position information to the processed frequency response; the basic unit module comprises a multi-attention head layer, a residual error connection and layer normalization layer and a feedforward layer;
and S3, carrying out post-processing on the frequency response after voice quality enhancement to obtain a final voice enhancement signal.
2. The method for enhancing speech quality based on a frequency domain self-attention network of claim 1, wherein: in step S1, the input original speech is preprocessed by Fourier transform, normalization and a dimension-lifting operation; the Fourier transform obtains the frequency response of the input speech data with a fast Fourier transform function, comprising an amplitude response characteristic and a phase response characteristic; the normalization applies min-max normalization to the amplitude response characteristic and scales the phase response characteristic into the interval 0 to 2π; and the dimension-lifting operation clips the one-dimensional frequency domain sequence into a number of sequences of a certain length and stacks them by columns into a two-dimensional matrix.
3. The method for enhancing speech quality based on a frequency domain self-attention network of claim 1, wherein: in step S2, the position coding function in the position coding module is:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

where PE denotes the position code, pos denotes the position of an element in the sequence, d_model denotes the model dimension, 2i denotes the even dimensions, and 2i+1 denotes the odd dimensions.
4. The method for enhancing speech quality based on a frequency domain self-attention network of claim 1, wherein: in step S2, the position-encoded frequency response X is input to a multi-attention-head layer consisting of several parallel attention heads, where each attention head contains three trainable weight matrices W^Q, W^K and W^V used to obtain the query Q, key K and value V:

Q = X W^Q,  K = X W^K,  V = X W^V

After obtaining the matrices Q, K and V, the output of the multi-attention-head layer is calculated as:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

where d_k is the number of columns of the K matrix (i.e. the vector dimension), T denotes the transpose operation, and softmax is the normalization function;
The output of the multi-attention-head layer and the position-encoded frequency response are input together to a residual connection layer, which eases the training of deep networks; layer normalization is then applied to the output of the residual connection layer; the layer normalization result is input to the feedforward layer so that the final output matrix dimension matches the input dimension; finally, residual connection and layer normalization are applied to the feedforward layer result to obtain the final frequency response.
5. The method for enhancing speech quality based on a frequency domain self-attention network of claim 4, wherein: the calculation formula of the normalization function is as follows:
wherein,for vector +.>Is the maximum value of (2);
the residual connection layer consists of 2 convolution layers, and the specific formula is as follows:
wherein,for the output of the residual connection layer, < >>Output of the 2 nd convolution layer in the residual connection layer,>' is the input of the residual connection layer;
The feed-forward layer comprises two fully connected layers, wherein the first layer uses a ReLU activation function and the second layer uses no activation function; specifically:

$$\mathrm{FFN}(x) = \mathrm{ReLU}(xW_1 + b_1)\,W_2 + b_2$$

where $x$ is the input, $W_1$ and $W_2$ are the weight parameters of the two fully connected layers, and $b_1$ and $b_2$ are the biases of the two fully connected layers.
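The feed-forward layer, residual connection and layer normalization of the basic unit can be sketched as below. The dimensions, zero-initialized biases, and the epsilon in the layer-norm denominator are illustrative assumptions, not values from the patent:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Two fully connected layers: ReLU after the first, no activation on the second.
    h = np.maximum(0.0, x @ W1 + b1)   # first layer + ReLU
    return h @ W2 + b2                 # second layer restores the input width

rng = np.random.default_rng(1)
d, d_ff = 16, 64                        # assumed model and hidden dimensions
x = rng.standard_normal((8, d))
W1, b1 = rng.standard_normal((d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d)), np.zeros(d)

# Residual connection followed by layer normalization, as in each basic unit.
y = x + feed_forward(x, W1, b1, W2, b2)
y = (y - y.mean(-1, keepdims=True)) / (y.std(-1, keepdims=True) + 1e-6)
print(y.shape)  # (8, 16): dimension unchanged, as the claim requires
```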
6. The method for enhancing speech quality based on a frequency domain self-attention network of claim 1, wherein: in step S2, the frequency domain self-attention network is a trained frequency domain self-attention network; the training process comprises the following substeps:
step SS1, using the VoiceBank dataset, which contains original (noisy) voice and the corresponding clean voice;
and step SS2, preprocessing the dataset, inputting the preprocessed data into the frequency domain self-attention network for training, and continuously optimizing the model parameters through the back-propagation algorithm to achieve a better voice enhancement effect.
7. The method for enhancing speech quality based on a frequency domain self-attention network of claim 6, wherein: in step SS2, the preprocessing comprises Fourier transform, normalization and dimension-raising operations; firstly, a Fourier transform is applied to the input original voice to obtain the frequency response; then, the frequency response is normalized; finally, a dimension-raising operation is applied to the normalized frequency response to obtain the calculation matrix; during training, a mean-square-error loss function is adopted, and training continues until the network converges, i.e. the training loss curve remains stable and no longer decreases.
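The preprocessing chain (Fourier transform, normalization, dimension-raising into a matrix) might look as follows. The frame length of 256, the max-value normalization, and the toy 440 Hz input are illustrative assumptions; the patent does not specify these concrete values:

```python
import numpy as np

def preprocess(wave, frame_len=256):
    # FFT -> magnitude normalization -> reshape ("dimension-raising") into a matrix.
    spec = np.fft.fft(wave)                  # frequency response of the voice
    mag = np.abs(spec)
    mag = mag / (mag.max() + 1e-12)          # normalize magnitudes to [0, 1]
    n = len(mag) // frame_len * frame_len    # drop the tail that does not fill a row
    return mag[:n].reshape(-1, frame_len)    # (rows, frame_len) calculation matrix

# Toy input: a pure 440 Hz tone at a 16 kHz sampling rate (assumed values).
wave = np.sin(2 * np.pi * 440 * np.arange(4096) / 16000)
X = preprocess(wave)
print(X.shape)  # (16, 256)
```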
8. The method for enhancing speech quality based on a frequency domain self-attention network of claim 1, wherein in step S3, said post-processing comprises taking positive values, dimension reduction and inverse Fourier transform: taking positive values means keeping the positive part of the network output result; the dimension reduction operation splices the positive-valued result, in order, into a one-dimensional sequence, so as to obtain the one-dimensional frequency response with enhanced voice quality; and the inverse Fourier transform uses an inverse fast Fourier transform function to obtain the quality-enhanced voice signal.
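The three post-processing steps can be sketched in a few lines. The (16, 256) output shape and the use of the real part after the inverse FFT are illustrative assumptions:

```python
import numpy as np

def postprocess(net_out):
    # 1) Take positive values of the network output.
    pos = np.maximum(net_out, 0.0)
    # 2) Dimension reduction: splice the rows, in order, into a 1-D frequency response.
    freq = pos.reshape(-1)
    # 3) Inverse fast Fourier transform back to a time-domain voice signal.
    return np.real(np.fft.ifft(freq))

net_out = np.random.default_rng(2).standard_normal((16, 256))  # stand-in for the network output
signal = postprocess(net_out)
print(signal.shape)  # (4096,)
```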
9. A speech quality enhancement system based on a frequency domain self-attention network, comprising the following elements:
the preprocessing unit is used for inputting original voice and preprocessing the original voice to obtain the frequency response of voice data;
the voice quality enhancement unit is used for inputting the processed frequency response into the frequency domain self-attention network to obtain the frequency response after voice quality enhancement;
the frequency domain self-attention network comprises a position coding module and N identical basic unit modules;
the position coding module comprises a position coding layer for adding position information to the processed frequency response; the basic unit module comprises a multi-attention head layer, a residual error connection and layer normalization layer and a feedforward layer;
and the post-processing unit is used for carrying out post-processing on the frequency response after voice quality enhancement to obtain a final voice enhancement signal.
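The position coding module adds position information to each row of the frequency response. The patent does not specify the encoding; a standard sinusoidal position code is assumed here purely for illustration, with assumed dimensions:

```python
import numpy as np

def positional_encoding(T, d):
    # Sinusoidal position code (an assumption; the patent only states that
    # position information is added): sine on even columns, cosine on odd ones.
    pos = np.arange(T)[:, None]
    i = np.arange(d)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

pe = positional_encoding(8, 16)  # added element-wise to the (8, 16) frequency-response matrix
print(pe.shape)                  # (8, 16)
```

Because the code is simply added to the input matrix, the N identical basic unit modules that follow see both the spectral values and their positions.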
CN202410163875.7A 2024-02-05 Voice quality enhancement method and system based on frequency domain self-attention network Active CN117711417B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410163875.7A CN117711417B (en) 2024-02-05 Voice quality enhancement method and system based on frequency domain self-attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410163875.7A CN117711417B (en) 2024-02-05 Voice quality enhancement method and system based on frequency domain self-attention network

Publications (2)

Publication Number Publication Date
CN117711417A true CN117711417A (en) 2024-03-15
CN117711417B CN117711417B (en) 2024-04-30


Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180350346A1 (en) * 2017-06-05 2018-12-06 Baidu Online Network Technology (Beijing) Co., Ltd. Speech recognition method based on artificial intelligence and terminal
CN109492232A (en) * 2018-10-22 2019-03-19 Inner Mongolia University of Technology A Mongolian-Chinese machine translation method based on Transformer with enhanced semantic feature information
CN111696567A (en) * 2020-06-12 2020-09-22 苏州思必驰信息科技有限公司 Noise estimation method and system for far-field call
CN112767959A (en) * 2020-12-31 2021-05-07 恒安嘉新(北京)科技股份公司 Voice enhancement method, device, equipment and medium
KR102287499B1 (en) * 2020-09-15 2021-08-09 주식회사 에이아이더뉴트리진 Method and apparatus for synthesizing speech reflecting phonemic rhythm
CN114283795A (en) * 2021-12-24 2022-04-05 思必驰科技股份有限公司 Training and recognition method of voice enhancement model, electronic equipment and storage medium
CN114678033A (en) * 2022-03-15 2022-06-28 南京邮电大学 Speech enhancement algorithm based on multi-head attention mechanism only comprising encoder
US20220286775A1 (en) * 2021-03-05 2022-09-08 Honda Motor Co., Ltd. Acoustic processing device, acoustic processing method, and storage medium
US20220291328A1 (en) * 2015-07-17 2022-09-15 Muhammed Zahid Ozturk Method, apparatus, and system for speech enhancement and separation based on audio and radio signals
US20220310108A1 (en) * 2021-03-23 2022-09-29 Qualcomm Incorporated Context-based speech enhancement
US20230075891A1 (en) * 2021-03-11 2023-03-09 Tencent Technology (Shenzhen) Company Limited Speech synthesis method and apparatus, and readable storage medium
WO2023044961A1 (en) * 2021-09-23 2023-03-30 武汉大学 Multi-feature fusion echo cancellation method and system based on self-attention transform network
CN115881157A (en) * 2021-09-29 2023-03-31 北京三星通信技术研究有限公司 Audio signal processing method and related equipment
CN116013344A (en) * 2022-12-17 2023-04-25 西安交通大学 Speech enhancement method under multiple noise environments
US20230162758A1 (en) * 2021-11-19 2023-05-25 Massachusetts Institute Of Technology Systems and methods for speech enhancement using attention masking and end to end neural networks
US20230223011A1 (en) * 2022-01-10 2023-07-13 Intone Inc. Real time correction of accent in speech audio signals
US20230253003A1 (en) * 2020-11-27 2023-08-10 Beijing Sogou Technology Development Co., Ltd. Speech processing method and speech processing apparatus
CN116798410A (en) * 2023-07-31 2023-09-22 易方信息科技股份有限公司 Speech recognition method, system, equipment and medium with enhanced local features
CN116884426A (en) * 2023-07-11 2023-10-13 武汉大学 Voice enhancement method, device and equipment based on DFSMN model
US20230395087A1 (en) * 2020-10-16 2023-12-07 Google Llc Machine Learning for Microphone Style Transfer

Patent Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220291328A1 (en) * 2015-07-17 2022-09-15 Muhammed Zahid Ozturk Method, apparatus, and system for speech enhancement and separation based on audio and radio signals
US20180350346A1 (en) * 2017-06-05 2018-12-06 Baidu Online Network Technology (Beijing) Co., Ltd. Speech recognition method based on artificial intelligence and terminal
CN109492232A (en) * 2018-10-22 2019-03-19 Inner Mongolia University of Technology A Mongolian-Chinese machine translation method based on Transformer with enhanced semantic feature information
CN111696567A (en) * 2020-06-12 2020-09-22 苏州思必驰信息科技有限公司 Noise estimation method and system for far-field call
KR102287499B1 (en) * 2020-09-15 2021-08-09 주식회사 에이아이더뉴트리진 Method and apparatus for synthesizing speech reflecting phonemic rhythm
US20230395087A1 (en) * 2020-10-16 2023-12-07 Google Llc Machine Learning for Microphone Style Transfer
US20230253003A1 (en) * 2020-11-27 2023-08-10 Beijing Sogou Technology Development Co., Ltd. Speech processing method and speech processing apparatus
CN112767959A (en) * 2020-12-31 2021-05-07 恒安嘉新(北京)科技股份公司 Voice enhancement method, device, equipment and medium
US20220286775A1 (en) * 2021-03-05 2022-09-08 Honda Motor Co., Ltd. Acoustic processing device, acoustic processing method, and storage medium
US20230075891A1 (en) * 2021-03-11 2023-03-09 Tencent Technology (Shenzhen) Company Limited Speech synthesis method and apparatus, and readable storage medium
US20220310108A1 (en) * 2021-03-23 2022-09-29 Qualcomm Incorporated Context-based speech enhancement
CN117043861A (en) * 2021-03-23 2023-11-10 高通股份有限公司 Context-based speech enhancement
WO2023044961A1 (en) * 2021-09-23 2023-03-30 武汉大学 Multi-feature fusion echo cancellation method and system based on self-attention transform network
CN115881157A (en) * 2021-09-29 2023-03-31 北京三星通信技术研究有限公司 Audio signal processing method and related equipment
US20230162758A1 (en) * 2021-11-19 2023-05-25 Massachusetts Institute Of Technology Systems and methods for speech enhancement using attention masking and end to end neural networks
CN114283795A (en) * 2021-12-24 2022-04-05 思必驰科技股份有限公司 Training and recognition method of voice enhancement model, electronic equipment and storage medium
US20230223011A1 (en) * 2022-01-10 2023-07-13 Intone Inc. Real time correction of accent in speech audio signals
CN114678033A (en) * 2022-03-15 2022-06-28 南京邮电大学 Speech enhancement algorithm based on multi-head attention mechanism only comprising encoder
CN116013344A (en) * 2022-12-17 2023-04-25 西安交通大学 Speech enhancement method under multiple noise environments
CN116884426A (en) * 2023-07-11 2023-10-13 武汉大学 Voice enhancement method, device and equipment based on DFSMN model
CN116798410A (en) * 2023-07-31 2023-09-22 易方信息科技股份有限公司 Speech recognition method, system, equipment and medium with enhanced local features

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Yan; JIA Hairong; JI Huifang; WANG Weimei: "Speech enhancement algorithm based on feature jointly optimized deep belief network", Computer Engineering and Applications (计算机工程与应用), no. 09, 8 December 2018 (2018-12-08) *

Similar Documents

Publication Publication Date Title
Qian et al. Very deep convolutional neural networks for noise robust speech recognition
Lu et al. Conditional diffusion probabilistic model for speech enhancement
WO2021042870A1 (en) Speech processing method and apparatus, electronic device, and computer-readable storage medium
CN110739003B (en) Voice enhancement method based on multi-head self-attention mechanism
Yen et al. Cold diffusion for speech enhancement
CN110060657B (en) SN-based many-to-many speaker conversion method
Zezario et al. Self-supervised denoising autoencoder with linear regression decoder for speech enhancement
CN112435652A (en) Voice keyword recognition system and method based on graph convolution neural network
Hasannezhad et al. PACDNN: A phase-aware composite deep neural network for speech enhancement
CN117059103A (en) Acceleration method of voice recognition fine tuning task based on low-rank matrix approximation
CN114373451A (en) End-to-end Chinese speech recognition method
CN111429893A (en) Many-to-many speaker conversion method based on Transitive STARGAN
CN115101085A (en) Multi-speaker time-domain voice separation method for enhancing external attention through convolution
Ju et al. Tea-pse 3.0: Tencent-ethereal-audio-lab personalized speech enhancement system for icassp 2023 dns-challenge
Jiang et al. Speaker attractor network: Generalizing speech separation to unseen numbers of sources
CN117711417B (en) Voice quality enhancement method and system based on frequency domain self-attention network
Li et al. Deeplabv3+ vision transformer for visual bird sound denoising
Tang et al. Acoustic modeling with densely connected residual network for multichannel speech recognition
Raj et al. Multilayered convolutional neural network-based auto-CODEC for audio signal denoising using mel-frequency cepstral coefficients
Shahnawazuddin et al. Sparse coding over redundant dictionaries for fast adaptation of speech recognition system
Li et al. A Convolutional Neural Network with Non-Local Module for Speech Enhancement.
CN117711417A (en) Voice quality enhancement method and system based on frequency domain self-attention network
CN113707172B (en) Single-channel voice separation method, system and computer equipment of sparse orthogonal network
Li et al. A fast convolutional self-attention based speech dereverberation method for robust speech recognition
Zhang et al. Enhanced-Deep-Residual-Shrinkage-Network-Based Voiceprint Recognition in the Electric Industry

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant