CN114023346A - Speech enhancement method and device with separable recurrent attention - Google Patents
- Publication number
- CN114023346A (application number CN202111285653.5A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention relates to a speech enhancement method with separable recurrent attention, which comprises the following steps. Step 1: input the speech signal into a front-end network unit for Fourier transform, and output a first amplitude stream signal and a first phase stream signal. Step 2: input the first amplitude stream signal and the first phase stream signal into an attention network unit for noise reduction, and output a second amplitude stream signal and a second phase stream signal; the attention network unit comprises multiple stages of polar-coordinate attention modules connected in series, and each stage comprises an amplitude attention module, a phase self-adjustment module and a phase cross-adjustment module. Step 3: input the second amplitude stream signal and the second phase stream signal into a back-end network unit for inverse Fourier transform, and output an enhanced speech signal. The invention has a small computational cost and effectively ensures the speech noise-reduction effect.
Description
Technical Field
The invention relates to a speech enhancement method and device with separable recurrent attention.
Background
Noise reduction at the speech-recognition front end, vocal extraction in audio and video production, speech purification in speech synthesis, and similar tasks all involve noise-reduction enhancement of speech signals. Existing speech noise reduction mainly takes the following forms:
SEGAN: uses UNet as the basic structure for noise reduction and adopts adversarial generation technology to bring the generated sound close to the human voice. Its drawbacks are a simple model structure, incomplete removal of complex noise, and susceptibility to mode collapse.
WAVENET: its drawbacks are a huge model, complex training, an extremely low speed (about 10 minutes of processing per 1 minute of speech), misaligned phase, and difficulty in distinguishing the human voice from musical noise containing harmonics.
TasNet: performs noise reduction with a TCN as the basic structure and adopts dilated convolution to enlarge the receptive field. Its drawbacks are that the completeness of the representation space is not guaranteed, the frequency resolution of the model is poor, and noise in segments where speech and noise overlap is not completely removed.
T-GSA: performs noise reduction with a Transformer as the basic structure and adopts a Gaussian function to constrain the receptive field locally. Its drawback is the huge computational complexity: the processing time grows as O(N²) with the speech length N.
PHASEN: the noise-reduction method most closely related to the present invention. It performs noise reduction with the TSB as the basic structure and adopts a frequency transformation block for harmonic enhancement. Although this method has a small computational cost and achieves a fairly good noise-reduction effect, its drawback is that it uses only a fixed receptive field and can therefore model only fixed harmonic correlations. In practice, deciding whether the current sound is speech or noise, and whether the current harmonic is a true or a pseudo harmonic, sometimes requires looking far ahead and behind and weighing the surrounding context; hence the noise-reduction effect is not ideal.
Disclosure of Invention
The invention aims to provide a speech enhancement method and device with separable recurrent attention that have a small computational cost and effectively ensure the speech noise-reduction effect.
Based on the same inventive concept, the invention has three independent technical schemes:
1. A speech enhancement method with separable recurrent attention, comprising:
Step 1: input the speech signal into a front-end network unit for Fourier transform, and output a first amplitude stream signal and a first phase stream signal;
Step 2: input the first amplitude stream signal and the first phase stream signal into an attention network unit for noise reduction, and output a second amplitude stream signal and a second phase stream signal; the attention network unit comprises multiple stages of polar-coordinate attention modules connected in series, each stage comprising an amplitude attention module, a phase self-adjustment module and a phase cross-adjustment module, and the amplitude attention module further comprising two channel permutation transformation modules, two time-frequency separable recurrent network modules and an independent identically-distributed convolution module;
Step 3: input the second amplitude stream signal and the second phase stream signal into a back-end network unit for inverse Fourier transform, and output an enhanced speech signal.
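The three steps above can be sketched end-to-end as follows. This is a minimal illustration, not the patented network: the STFT uses an assumed Hann window and hop size, and the attention network unit is stubbed out with an identity function, since no trainable weights are disclosed here.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Short-time Fourier transform with a Hann window (sketch)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.stack([np.fft.rfft(f) for f in frames])  # (T, F) complex

def istft(spec, n_fft=512, hop=128):
    """Inverse STFT via windowed overlap-add with squared-window normalization."""
    win = np.hanning(n_fft)
    out = np.zeros((len(spec) - 1) * hop + n_fft)
    norm = np.zeros_like(out)
    for t, frame in enumerate(spec):
        out[t * hop:t * hop + n_fft] += np.fft.irfft(frame, n_fft) * win
        norm[t * hop:t * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def enhance(x, attention=lambda mag, phase: (mag, phase)):
    """Steps 1-3: STFT -> amplitude/phase streams -> attention -> iSTFT."""
    spec = stft(x)
    mag, phase = np.abs(spec), np.angle(spec)  # first amplitude/phase streams
    mag, phase = attention(mag, phase)         # attention network unit (identity stub)
    return istft(mag * np.exp(1j * phase))     # second streams -> enhanced signal
```

With the identity stub, the interior of the signal is reconstructed exactly, which confirms that the amplitude/phase split and recombination are lossless.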
Further, in step 2, each stage of the polar-coordinate attention module is configured to perform the following steps:
Step 2.1: process the input amplitude stream signal with the amplitude attention module, and input the adjusted amplitude stream signal into the phase cross-adjustment module;
Step 2.2: process the input phase stream signal with the phase self-adjustment module, and input the resulting self-adjusted phase stream signal into the phase cross-adjustment module;
Step 2.3: cross-adjust the self-adjusted phase stream signal with the phase cross-adjustment module, based on the adjusted amplitude stream signal, and output an adjusted phase stream signal;
Step 2.4: output the adjusted amplitude stream signal and the adjusted phase stream signal.
Further, the phase self-adjustment module is formed by one or more layers of two-dimensional convolution;
the phase cross-adjustment module comprises one or more amplitude-aware phase transformations, each of which adjusts the phase stream signal using the amplitude stream signal according to the following formula:
Po = Conv(Ao) ∘ Pi
where Conv denotes convolution, ∘ denotes the element-wise (dot) product, Pi denotes the input phase stream signal, Po denotes the output phase stream signal, and Ao denotes the adjusted amplitude stream signal that serves as the cross-adjustment input.
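A sketch of this transformation, with the convolution reduced to a 1×1 (per-bin channel-mixing) kernel; the weight matrix W and all tensor shapes are assumptions for illustration:

```python
import numpy as np

def amplitude_aware_phase_transform(A_o, P_i, W):
    """Po = Conv(Ao) o Pi: project the amplitude stream through a 1x1
    convolution (channel mixing with kernel W), then modulate the phase
    stream by element-wise product.
    A_o: (Ca, F, T) amplitude stream; P_i: (Cp, F, T) phase stream;
    W: (Cp, Ca) hypothetical 1x1 kernel."""
    gain = np.einsum('pc,cft->pft', W, A_o)  # Conv(Ao) as per-bin channel mixing
    return gain * P_i                        # dot (element-wise) product
```

With W set to the identity, the phase stream is simply rescaled bin-by-bin by the amplitude stream, which is the degenerate case of the amplitude-aware adjustment.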
Further, step 2.1 comprises the following steps:
Step 2.1.1: input the amplitude stream signal into a first channel permutation transformation module and a first time-frequency separable recurrent network module respectively, obtaining a first permutation transformation signal and a first recurrent signal;
Step 2.1.2: input the first permutation transformation signal into a second channel permutation transformation module and output a second permutation transformation signal; multiply the first permutation transformation signal and the first recurrent signal, input the product into the second time-frequency separable recurrent network module, and output a second recurrent signal;
Step 2.1.3: concatenate the second permutation transformation signal and the second recurrent signal, input the result into the independent identically-distributed convolution module, and output the adjusted amplitude stream signal.
Further, the time-frequency separable recurrent network module adopts one of the following recurrence modes: time-only recurrence, frequency-only recurrence, time recurrence followed by frequency recurrence, frequency recurrence followed by time recurrence, or parallel time and frequency recurrence;
each recurrence is one of a forward recurrence, a backward recurrence and a bidirectional recurrence.
Further, the independent identically-distributed convolution module is composed of a distribution normalization layer, a two-dimensional convolution layer and a GELU layer.
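A minimal sketch of such a block, assuming a single-channel map, whole-map normalization, and the tanh approximation of GELU (all assumptions not fixed by the claim):

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def iid_conv_block(x, kernel, eps=1e-5):
    """Distribution normalization -> 2-D convolution -> GELU (sketch).
    x: (F, T) single-channel map; kernel: (kf, kt) odd-sized 2-D kernel."""
    x = (x - x.mean()) / np.sqrt(x.var() + eps)  # distribution normalization
    kf, kt = kernel.shape
    pf, pt = kf // 2, kt // 2
    xp = np.pad(x, ((pf, pf), (pt, pt)))         # "same" zero padding
    out = np.zeros_like(x)
    for i in range(x.shape[0]):                  # naive 2-D convolution
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kf, j:j + kt] * kernel)
    return gelu(out)
```

A delta kernel (all zeros except the center) turns the block into normalization followed by GELU, a convenient sanity check.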
Further, the front-end network unit comprises a short-time Fourier transform module, an amplitude convolution module and a phase convolution module:
the short-time Fourier transform module transforms the speech signal into short-time Fourier coefficients;
the amplitude convolution module performs amplitude convolution on the signal output by the short-time Fourier transform module and outputs the first amplitude stream signal;
the phase convolution module performs phase convolution on the signal output by the short-time Fourier transform module and outputs the first phase stream signal.
Further, the back-end network unit comprises an amplitude mask generator, a phase mask generator, a Fourier coefficient generator and an inverse short-time Fourier transform module:
the amplitude mask generator generates a single-channel amplitude signal from the second amplitude stream signal;
the phase mask generator generates a two-channel phase signal from the second phase stream signal;
the Fourier coefficient generator generates Fourier coefficients from the single-channel amplitude signal and the two-channel phase signal;
the inverse short-time Fourier transform module outputs the enhanced speech signal from the generated Fourier coefficients.
2. A speech enhancement method with separable recurrent attention, comprising:
Step 1: input the speech signal into a front-end network unit for Fourier transform, and output a first amplitude stream signal;
Step 2: input the first amplitude stream signal into an attention network unit for noise reduction, and output a second amplitude stream signal; the attention network unit comprises multiple stages of polar-coordinate attention modules connected in series, each stage comprising an amplitude attention module, a phase self-adjustment module and a phase cross-adjustment module, and the amplitude attention module further comprising either two time-frequency separable recurrent network modules and an independent identically-distributed convolution module, or two channel permutation transformation modules, two time-frequency separable recurrent network modules and an independent identically-distributed convolution module;
Step 3: perform inverse Fourier transform on the second amplitude stream signal through a back-end network unit, and output an enhanced speech signal.
3. A speech enhancement apparatus with separable recurrent attention, comprising:
a front-end network unit for performing Fourier transform on an input speech signal and outputting a first amplitude stream signal and a first phase stream signal;
an attention network unit for performing noise reduction on the first amplitude stream signal and the first phase stream signal and outputting a second amplitude stream signal and a second phase stream signal; and
a back-end network unit for performing inverse Fourier transform on the second amplitude stream signal and the second phase stream signal and outputting an enhanced speech signal;
wherein the attention network unit comprises multiple stages of polar-coordinate attention modules connected in series, each stage comprising an amplitude attention module, a phase self-adjustment module and a phase cross-adjustment module, and the amplitude attention module further comprising two channel permutation transformation modules, two time-frequency separable recurrent network modules and an independent identically-distributed convolution module.
The invention has the following beneficial effects:
the method comprises the steps that a voice signal is input to a preposed network unit to be subjected to Fourier transform, and a first amplitude flow signal and a first phase flow signal are output; inputting the first amplitude stream signal and the first phase stream signal into an attention network unit for noise reduction, and outputting a second amplitude stream signal and a second phase stream signal; the attention network unit is formed by connecting a plurality of stages of polar coordinate attention modules in series, and each stage of polar coordinate attention module is formed by three modules of amplitude attention, phase self-adjustment and phase other adjustment; and inputting the second amplitude stream signal and the second phase stream signal to a post-network unit for inverse Fourier transform, and outputting an enhanced voice signal. The invention adopts the attention network unit, and the attention network unit is based on the separable design idea and adopts the expanded recurrent neural network structure, so the receptive field is not fixed any more, and more complex harmonic correlation is modeled. Compared with the conventional PHASEN, the structural parameters of the invention are reduced by two orders of magnitude, the calculated amount is smaller, and the speech noise reduction effect is better compared with the conventional model including PHASEN on the aspect of 6 international evaluation indexes.
In each stage of the polar-coordinate attention module of the invention, step 2.1 processes the input amplitude stream signal with the amplitude attention module and inputs the adjusted amplitude stream signal into the phase cross-adjustment module; step 2.2 processes the input phase stream signal with the phase self-adjustment module and inputs the resulting self-adjusted phase stream signal into the phase cross-adjustment module; step 2.3 cross-adjusts the self-adjusted phase stream signal with the phase cross-adjustment module, based on the adjusted amplitude stream signal, and outputs an adjusted phase stream signal; step 2.4 outputs the adjusted amplitude stream signal and the adjusted phase stream signal. The phase self-adjustment module is formed by one or more layers of two-dimensional convolution; the phase cross-adjustment module comprises one or more amplitude-aware phase transformations, each of which adjusts the phase using the amplitude stream output. The amplitude attention module comprises channel permutation transformation modules, time-frequency separable recurrent network modules and an independent identically-distributed convolution module.
Step 2.1 comprises the following steps. Step 2.1.1: input the amplitude stream signal into a first channel permutation transformation module and a first time-frequency separable recurrent network module respectively, obtaining a first permutation transformation signal and a first recurrent signal. Step 2.1.2: input the first permutation transformation signal into a second channel permutation transformation module and output a second permutation transformation signal; multiply the first permutation transformation signal and the first recurrent signal, input the product into the second time-frequency separable recurrent network module, and output a second recurrent signal. Step 2.1.3: concatenate the second permutation transformation signal and the second recurrent signal, input the result into the independent identically-distributed convolution module, and output the adjusted amplitude stream signal. Through this recurrent-network structural design of the attention network unit, the invention further ensures a better speech noise-reduction effect.
Drawings
FIG. 1 is a flow diagram of the speech enhancement method with separable recurrent attention according to the present invention;
FIG. 2 is a flow diagram of the front-end network unit of the present invention;
FIG. 3 is a flow diagram of the back-end network unit of the present invention;
FIG. 4 is a flow diagram of the attention network unit of the present invention;
FIG. 5 is a flow diagram of a polar-coordinate attention module of the attention network unit of the present invention;
FIG. 6 is a flow diagram of the amplitude attention module of the polar-coordinate attention module of the attention network unit of the present invention.
Detailed Description
The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made by these embodiments are within the scope of the present invention.
Embodiment 1:
A speech enhancement method with separable recurrent attention.
As shown in FIG. 1, the speech enhancement method with separable recurrent attention of the present invention involves a front-end network unit, an attention network unit and a back-end network unit, and comprises the following steps:
Step 1: input the speech signal into the front-end network unit for Fourier transform, and output a first amplitude stream signal and a first phase stream signal;
Step 2: input the first amplitude stream signal and the first phase stream signal into the attention network unit for noise reduction, and output a second amplitude stream signal and a second phase stream signal; the attention network unit comprises multiple stages of polar-coordinate attention modules connected in series, each stage comprising an amplitude attention module, a phase self-adjustment module and a phase cross-adjustment module, and the amplitude attention module further comprising either two time-frequency separable recurrent network modules and an independent identically-distributed convolution module, or two channel permutation transformation modules, two time-frequency separable recurrent network modules and an independent identically-distributed convolution module; in this embodiment, the two channel permutation transformation modules are included.
Step 3: input the second amplitude stream signal and the second phase stream signal into the back-end network unit for inverse Fourier transform, and output an enhanced speech signal.
As shown in FIG. 2, the front-end network unit includes a short-time Fourier transform module for transforming the speech signal into short-time Fourier coefficients, an amplitude convolution module and a phase convolution module. The amplitude convolution module performs amplitude convolution on the signal output by the short-time Fourier transform module and outputs the first amplitude stream signal; the phase convolution module performs phase convolution on the signal output by the short-time Fourier transform module and outputs the first phase stream signal. The amplitude convolution comprises a 1 × 1 convolution and a GELU activation. The phase convolution comprises an n × n convolution without activation; note that no activation may be used here, otherwise the performance degradation is significant.
As shown in FIG. 3, the back-end network unit comprises an amplitude mask generator, a phase mask generator, a Fourier coefficient generator and an inverse short-time Fourier transform module. The amplitude mask generator generates a single-channel amplitude signal from the second amplitude stream signal; the phase mask generator generates a two-channel phase signal from the second phase stream signal; the Fourier coefficient generator generates Fourier coefficients from the single-channel amplitude signal and the two-channel phase signal; and the inverse short-time Fourier transform module outputs the enhanced speech signal from the generated Fourier coefficients. The amplitude mask generator is composed of multiple layers of two-dimensional convolution whose last layer outputs 1 channel; a layer normalization function and a GELU activation function may optionally be inserted between convolution layers, and the last convolution layer is followed by a Sigmoid activation function. The phase mask generator is composed of multiple layers of two-dimensional convolution whose last layer outputs 2 channels; there is no layer normalization or activation function between convolution layers, and the last convolution layer is followed by an amplitude normalization that makes the sum of squares of the amplitudes of the 2 channels at each time-frequency point equal to 1 (i.e. the signal carries only phase information and no amplitude information).
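The amplitude normalization at the end of the phase mask generator can be sketched as follows; the (2, F, T) channel-first layout is an assumption:

```python
import numpy as np

def phase_mask_normalize(p, eps=1e-8):
    """Normalize the two channels of each time-frequency bin to unit L2 norm,
    so the bin carries only phase information (a (cos, sin) pair) and no
    amplitude information.
    p: (2, F, T) two-channel phase map."""
    norm = np.sqrt(np.sum(p ** 2, axis=0, keepdims=True))
    return p / np.maximum(norm, eps)  # eps guards all-zero bins
```

After this step the two channels at each bin behave as (cos θ, sin θ), which is what lets the Fourier coefficient generator combine them with the single-channel amplitude signal.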
As shown in FIG. 4, the attention network unit is composed of multiple stages of polar-coordinate attention modules connected in series. As shown in FIG. 5, each stage consists of three modules: amplitude attention, phase self-adjustment and phase cross-adjustment. Each stage of the polar-coordinate attention module performs the following steps:
Step 2.1: process the input amplitude stream signal with the amplitude attention module, and input the adjusted amplitude stream signal into the phase cross-adjustment module;
Step 2.2: process the input phase stream signal with the phase self-adjustment module, and input the resulting self-adjusted phase stream signal into the phase cross-adjustment module;
Step 2.3: cross-adjust the self-adjusted phase stream signal with the phase cross-adjustment module, based on the adjusted amplitude stream signal, and output an adjusted phase stream signal;
Step 2.4: output the adjusted amplitude stream signal and the adjusted phase stream signal.
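The data flow of steps 2.1-2.4 can be sketched with the three sub-modules as interchangeable callables; their interfaces are assumptions, and the toy stand-ins in the usage line below are purely illustrative:

```python
def polar_attention_stage(a_in, p_in, amp_attn, phase_self, phase_cross):
    """One polar-coordinate attention stage (steps 2.1-2.4)."""
    a_out = amp_attn(a_in)              # step 2.1: adjust the amplitude stream
    p_self = phase_self(p_in)           # step 2.2: self-adjust the phase stream
    p_out = phase_cross(a_out, p_self)  # step 2.3: cross-adjust using the amplitude
    return a_out, p_out                 # step 2.4: output both adjusted streams
```

For example, with the toy modules `amp_attn = lambda a: a + 1`, `phase_self = lambda p: 2 * p` and `phase_cross = lambda a, p: a + p`, inputs (1, 3) flow through as a_out = 2, p_self = 6, p_out = 8, matching the stage order above; chaining stages in series reproduces the attention network unit.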
The phase self-adjustment module is formed by one or more layers of two-dimensional convolution;
the phase cross-adjustment module comprises one or more amplitude-aware phase transformations, each of which adjusts the phase stream signal using the amplitude stream signal according to the following formula:
Po = Conv(Ao) ∘ Pi
where Conv denotes convolution, ∘ denotes the element-wise (dot) product, Pi denotes the input phase stream signal, Po denotes the output phase stream signal, and Ao denotes the adjusted amplitude stream signal that serves as the cross-adjustment input.
As shown in FIG. 6, the amplitude attention module includes channel permutation transformation modules, time-frequency separable recurrent network modules and an independent identically-distributed convolution module, and step 2.1 comprises the following steps:
Step 2.1.1: input the amplitude stream signal into a first channel permutation transformation module and a first time-frequency separable recurrent network module respectively, obtaining a first permutation transformation signal and a first recurrent signal;
Step 2.1.2: input the first permutation transformation signal into a second channel permutation transformation module and output a second permutation transformation signal; multiply the first permutation transformation signal and the first recurrent signal (point-to-point multiplication of the signal matrices), input the product into the second time-frequency separable recurrent network module, and output a second recurrent signal;
Step 2.1.3: concatenate (along the channel dimension) the second permutation transformation signal and the second recurrent signal, input the result into the independent identically-distributed convolution module, and output the adjusted amplitude stream signal.
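Steps 2.1.1-2.1.3 describe a two-branch flow, a permutation branch and a recurrent branch, joined by a point-wise product and a channel concatenation. The sketch below passes the five sub-blocks in as callables; their interfaces, and the (C, F, T) layout, are assumptions:

```python
import numpy as np

def amplitude_attention(a, permute1, loop1, permute2, loop2, iid_conv):
    """Data flow of steps 2.1.1-2.1.3. a: (C, F, T) input amplitude stream."""
    p1 = permute1(a)             # step 2.1.1: first channel-permutation signal
    r1 = loop1(a)                # step 2.1.1: first recurrent signal
    p2 = permute2(p1)            # step 2.1.2: second channel-permutation signal
    r2 = loop2(p1 * r1)          # step 2.1.2: point-wise product, second recurrence
    cat = np.concatenate([p2, r2], axis=0)  # step 2.1.3: channel concatenation
    return iid_conv(cat)         # step 2.1.3: i.i.d. convolution module
```

With identity stand-ins for all five sub-blocks, the output simply doubles the channel count of the input, which makes the routing easy to verify.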
The channel permutation transformation may adopt an identity transformation, a reordering transformation, a convolution transformation, or a combination of the three. The independent identically-distributed convolution is composed of a distribution normalization layer, a two-dimensional convolution layer and a GELU layer. The GELU layer may also be replaced by an activation function such as ReLU, PReLU, ELU, Sigmoid or Softplus.
The first and second time-frequency separable recurrent network modules may adopt a time-only recurrence, a frequency-only recurrence, a time recurrence followed by a frequency recurrence, a frequency recurrence followed by a time recurrence, or parallel time and frequency recurrences, and each recurrence may be a forward, backward or bidirectional recurrence.
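One of the listed variants, a forward time recurrence followed by a forward frequency recurrence, can be sketched with a minimal tanh recurrent cell. The cell, the weight shapes and the (F, T, C) layout are all assumptions; the patent allows any recurrent cell and any of the five orderings:

```python
import numpy as np

def rnn_scan(xs, Wx, Wh, h0=None):
    """Minimal tanh recurrent cell scanned over the leading axis of xs."""
    h = np.zeros(Wh.shape[0]) if h0 is None else h0
    out = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
        out.append(h)
    return np.stack(out)

def time_then_frequency_loop(x, Wx_t, Wh_t, Wx_f, Wh_f):
    """'Time recurrence then frequency recurrence' variant.
    x: (F, T, C) feature map; the two recurrences share no weights."""
    F, T, _ = x.shape
    # forward time recurrence, one scan per frequency bin
    h = np.stack([rnn_scan(x[f], Wx_t, Wh_t) for f in range(F)])           # (F, T, H)
    # forward frequency recurrence, one scan per time step
    h = np.stack([rnn_scan(h[:, t], Wx_f, Wh_f) for t in range(T)], axis=1)  # (F, T, H)
    return h
```

A backward recurrence would scan the reversed axis, and a bidirectional recurrence would concatenate both scans along the hidden dimension; the separable structure keeps each scan one-dimensional, which is what keeps the receptive field flexible without a quadratic attention cost.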
The time-only recurrence, the frequency-only recurrence, the time-then-frequency recurrence, the frequency-then-time recurrence and the parallel time-frequency recurrence are each realized by a corresponding recurrence formula (given as images in the original description), in which ⊕ denotes channel-dimension data concatenation, Cell denotes an arbitrary recurrent cell structure, h(b,f,t) denotes the hidden state at the f-th frequency and the t-th time of the b-th speech segment, c(b,f,t) denotes the corresponding cell state, and x(b,f,t) denotes the corresponding input value.
example two:
voice enhancement method capable of separating circulatory attention
The method comprises the following steps:
step 1: inputting the voice signal into a preposed network unit for Fourier transform, and outputting a first amplitude flow signal;
step 2: inputting the first amplitude flow signal into an attention network unit for noise reduction, and outputting a second amplitude flow signal; the attention network unit comprises a plurality of stages of polar coordinate attention modules which are connected in series, wherein each stage of polar coordinate attention module comprises an amplitude attention module, a phase self-adjusting module and a phase adjusting module, and the amplitude attention module further comprises two time-frequency separable cyclic network modules and an independent same-distribution convolution module; or comprises two channel permutation and transformation modules, two time-frequency separable circulation network modules and an independent same-distribution convolution module;
and step 3: and based on the second amplitude flow signal, performing inverse Fourier transform through a post-network unit, and outputting an enhanced voice signal.
The difference between the second embodiment and the first embodiment is that the noise reduction is performed only on the first amplitude stream signal, and the noise reduction is not performed on the phase stream signal, which is slightly weaker than the first embodiment. The rest of the working principle is the same as that of the first embodiment.
Embodiment 3:
A speech enhancement device with separable recurrent attention.
The device comprises:
a front-end network unit for performing Fourier transform on an input speech signal and outputting a first amplitude stream signal and a first phase stream signal;
an attention network unit for performing noise reduction on the first amplitude stream signal and the first phase stream signal and outputting a second amplitude stream signal and a second phase stream signal; and
a back-end network unit for performing inverse Fourier transform on the second amplitude stream signal and the second phase stream signal and outputting an enhanced speech signal;
wherein the attention network unit comprises multiple stages of polar-coordinate attention modules connected in series, each stage comprising an amplitude attention module, a phase self-adjustment module and a phase cross-adjustment module, and the amplitude attention module further comprising two channel permutation transformation modules, two time-frequency separable recurrent network modules and an independent identically-distributed convolution module.
The detailed description above is only a specific description of possible embodiments of the present invention and is not intended to limit the scope of the invention; equivalent embodiments or modifications made without departing from the technical spirit of the present invention shall fall within its scope.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Claims (10)
1. A speech enhancement method with separable recurrent attention, comprising:
step 1: inputting a speech signal to a pre-network unit for Fourier transform, and outputting a first amplitude-stream signal and a first phase-stream signal;
step 2: inputting the first amplitude-stream signal and the first phase-stream signal to an attention network unit for noise reduction, and outputting a second amplitude-stream signal and a second phase-stream signal; the attention network unit comprising multiple stages of polar-coordinate attention modules connected in series, each stage comprising an amplitude attention module, a phase self-adjustment module and a phase cross-adjustment module, wherein the amplitude attention module further comprises either two time-frequency separable recurrent network modules and an independent identically-distributed convolution module, or two channel permutation-transform modules, two time-frequency separable recurrent network modules and an independent identically-distributed convolution module;
step 3: inputting the second amplitude-stream signal and the second phase-stream signal to a post-network unit for inverse Fourier transform, and outputting an enhanced speech signal.
2. The speech enhancement method with separable recurrent attention according to claim 1, wherein in step 2 each stage of the polar-coordinate attention module performs the following steps:
step 2.1: processing the input amplitude-stream signal with the amplitude attention module, and feeding the adjusted amplitude-stream signal to the phase cross-adjustment module;
step 2.2: processing the input phase-stream signal with the phase self-adjustment module, and feeding the resulting self-adjusted phase-stream signal to the phase cross-adjustment module;
step 2.3: adjusting the self-adjusted phase-stream signal with the phase cross-adjustment module, based on the adjusted amplitude-stream signal, and outputting an adjusted phase-stream signal;
step 2.4: outputting the adjusted amplitude-stream signal and the adjusted phase-stream signal.
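The dataflow of steps 2.1–2.4 can be sketched with toy callables standing in for the three sub-modules; the `tanh` gate and the element-wise product are illustrative assumptions, not the claimed networks:

```python
import numpy as np

def polar_attention_stage(mag, phase, amp_attn, phase_self, phase_cross):
    # steps 2.1 - 2.4 of one polar-coordinate attention stage
    mag_adj = amp_attn(mag)                          # 2.1 amplitude attention
    phase_selfadj = phase_self(phase)                # 2.2 phase self-adjustment
    phase_adj = phase_cross(mag_adj, phase_selfadj)  # 2.3 phase cross-adjustment
    return mag_adj, phase_adj                        # 2.4 output both streams

# hypothetical stand-ins for the three sub-modules
amp_attn = np.tanh
phase_self = lambda p: p
phase_cross = lambda a, p: a * p     # element-wise gating, as in claim 3

mag, phase = np.ones((4, 5)), np.full((4, 5), 0.5)
for _ in range(3):                   # several stages connected in series
    mag, phase = polar_attention_stage(mag, phase, amp_attn, phase_self, phase_cross)
```

Chaining the stage in a loop mirrors the "multiple stages connected in series" of the attention network unit.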
3. The speech enhancement method with separable recurrent attention according to claim 2, wherein:
the phase self-adjustment module is formed by one or more layers of two-dimensional convolution;
the phase cross-adjustment module comprises one or more amplitude-aware phase transforms, each of which adjusts the phase-stream signal using the amplitude-stream signal according to the formula
P_o = Conv(A_o) ⊙ P_i
where Conv denotes convolution, ⊙ denotes the element-wise (Hadamard) product, P_i denotes the phase-stream signal input to the phase cross-adjustment module, P_o denotes the phase-stream output, and A_o denotes the adjusted amplitude-stream signal serving as the other input to the phase cross-adjustment module.
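A worked instance of the transform, using a naive zero-padded 2-D convolution with an averaging kernel as a hypothetical stand-in for the learned Conv (with a symmetric kernel, convolution and correlation coincide, so no kernel flip is needed):

```python
import numpy as np

def conv2d_same(x, k):
    # naive zero-padded 'same' 2-D convolution; the kernel is symmetric here,
    # so this correlation-style loop equals a true convolution
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def amplitude_aware_phase(A_o, P_i, kernel):
    # P_o = Conv(A_o) ⊙ P_i: the adjusted amplitude stream gates the phase stream
    return conv2d_same(A_o, kernel) * P_i

A_o = np.ones((3, 3))                 # toy adjusted amplitude stream
P_i = np.full((3, 3), 2.0)            # toy phase-stream input
k = np.full((3, 3), 1.0 / 9.0)        # averaging kernel as a hypothetical Conv
P_o = amplitude_aware_phase(A_o, P_i, k)
```

At the center the averaged amplitude is 1, so the phase value passes through unchanged; at the corners zero padding attenuates it.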
4. The speech enhancement method with separable recurrent attention according to claim 2, wherein, in the case where the amplitude attention module comprises two channel permutation-transform modules, two time-frequency separable recurrent network modules and an independent identically-distributed convolution module, said step 2.1 comprises:
step 2.1.1: inputting the input amplitude-stream signal to a first channel permutation-transform module and to a first time-frequency separable recurrent network module, obtaining a first permutation-transform signal and a first recurrent signal respectively;
step 2.1.2: inputting the first permutation-transform signal to a second channel permutation-transform module and outputting a second permutation-transform signal; multiplying the first permutation-transform signal by the first recurrent signal element-wise, inputting the product to a second time-frequency separable recurrent network module, and outputting a second recurrent signal;
step 2.1.3: concatenating the second permutation-transform signal and the second recurrent signal, inputting the result to the independent identically-distributed convolution module, and outputting the adjusted amplitude-stream signal.
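Steps 2.1.1–2.1.3 amount to a two-branch dataflow, a permutation branch and a gated recurrent branch, fused by the final convolution. A sketch with hypothetical stand-in callables (the reversal, `tanh` and channel-mean below are illustrative only):

```python
import numpy as np

def amplitude_attention(x, perm1, rnn1, perm2, rnn2, iid_conv):
    # steps 2.1.1 - 2.1.3 of the two-branch amplitude attention module
    p1, r1 = perm1(x), rnn1(x)        # 2.1.1: two parallel branches
    p2 = perm2(p1)                    # 2.1.2: second permutation-transform
    r2 = rnn2(p1 * r1)                #        gated second recurrence
    fused = np.concatenate([p2, r2], axis=0)   # 2.1.3: concatenate channels
    return iid_conv(fused)

x = np.ones((2, 4, 5))                # toy (channel, time, freq) amplitude stream
out = amplitude_attention(
    x,
    perm1=lambda t: t[::-1],          # hypothetical channel permutation
    rnn1=np.tanh,                     # stand-in for a recurrent module
    perm2=lambda t: t[::-1],
    rnn2=np.tanh,
    iid_conv=lambda t: t.mean(axis=0, keepdims=True))  # stand-in fusion conv
```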
5. The speech enhancement method with separable recurrent attention according to claim 1, wherein:
the time-frequency separable recurrent network module adopts one of the following recurrence modes: recurrence over time only, recurrence over frequency only, recurrence over time then over frequency, recurrence over frequency then over time, and parallel recurrence over time and frequency;
each recurrence is one of a forward recurrence, a backward recurrence and a bidirectional recurrence.
6. The speech enhancement method with separable recurrent attention according to claim 1, wherein the independent identically-distributed convolution module consists of a distribution normalization layer, a two-dimensional convolution layer and a GELU activation layer.
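A sketch of the three-layer module; global mean/variance normalization is assumed for the "distribution normalization" layer, and the GELU uses the common tanh approximation:

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def iid_conv_module(x, kernel, eps=1e-5):
    # distribution normalization -> two-dimensional convolution -> GELU
    x = (x - x.mean()) / (x.std() + eps)     # assumed global normalization
    kh, kw = kernel.shape
    H, W = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.array([[np.sum(x[i:i + kh, j:j + kw] * kernel)   # 'valid' conv
                   for j in range(W)] for i in range(H)])
    return gelu(y)

rng = np.random.default_rng(1)
y = iid_conv_module(rng.standard_normal((5, 5)), np.ones((2, 2)))
```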
7. The speech enhancement method with separable recurrent attention according to claim 1, wherein the pre-network unit comprises a short-time Fourier transform module, an amplitude convolution module and a phase convolution module, wherein:
the short-time Fourier transform module is configured to transform the speech signal into short-time Fourier coefficients;
the amplitude convolution module is configured to perform amplitude convolution on the output of the short-time Fourier transform module and output the first amplitude-stream signal;
the phase convolution module is configured to perform phase convolution on the output of the short-time Fourier transform module and output the first phase-stream signal.
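The pre-network split can be sketched as an STFT followed by taking magnitudes for the amplitude stream and (cos, sin) pairs for a two-channel phase stream; the learned convolution modules of the claim are omitted, so this shows only the stream layout:

```python
import numpy as np

def pre_network(x, n_fft=128, hop=64):
    # STFT, then split into an amplitude stream and a two-channel phase stream
    win = np.hanning(n_fft)
    frames = np.array([x[i:i + n_fft] * win
                       for i in range(0, len(x) - n_fft + 1, hop)])
    spec = np.fft.rfft(frames, axis=-1)                      # (T, F) complex
    amp_stream = np.abs(spec)                                # amplitude stream
    phase = np.angle(spec)
    phase_stream = np.stack([np.cos(phase), np.sin(phase)])  # (2, T, F)
    return amp_stream, phase_stream

x = np.sin(2 * np.pi * 1000 * np.arange(1024) / 16000)
amp, ph = pre_network(x)   # amp: (15, 65), ph: (2, 15, 65)
```

Encoding the phase as a unit-norm (cos, sin) pair avoids the 2π wrap-around discontinuity of a raw angle stream.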
8. The speech enhancement method with separable recurrent attention according to claim 1, wherein the post-network unit comprises an amplitude mask generator, a phase mask generator, a Fourier coefficient generator and an inverse short-time Fourier transform module, wherein:
the amplitude mask generator is configured to generate a single-channel amplitude signal from the second amplitude-stream signal;
the phase mask generator is configured to generate a two-channel phase signal from the second phase-stream signal;
the Fourier coefficient generator is configured to generate Fourier coefficients from the single-channel amplitude signal and the two-channel phase signal;
the inverse short-time Fourier transform module is configured to output the enhanced speech signal from the generated Fourier coefficients.
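A sketch of this back half; the mask generators are reduced to pass-through, and renormalizing the two-channel phase pair before forming the Fourier coefficients is an assumption:

```python
import numpy as np

def post_network(amp_stream, phase_stream, n_fft=128, hop=64):
    # rebuild Fourier coefficients from the two streams, then inverse STFT
    cos_p, sin_p = phase_stream                  # two-channel phase signal
    norm = np.sqrt(cos_p ** 2 + sin_p ** 2) + 1e-8   # renormalize the pair
    coeff = amp_stream * (cos_p + 1j * sin_p) / norm  # Fourier coefficients
    frames = np.fft.irfft(coeff, n=n_fft, axis=-1)
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    wsum = np.zeros_like(out)
    for t, f in enumerate(frames):               # weighted overlap-add
        out[t * hop:t * hop + n_fft] += f * win
        wsum[t * hop:t * hop + n_fft] += win ** 2
    return out / np.maximum(wsum, 1e-8)

amp = np.ones((5, 65))
ph = np.stack([np.ones((5, 65)), np.zeros((5, 65))])  # phase = 0 everywhere
y = post_network(amp, ph)
```

Because the phase pair is renormalized, only its direction matters; scaling both channels leaves the output unchanged.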
9. A speech enhancement method with separable recurrent attention, comprising:
step 1: inputting a speech signal to a pre-network unit for Fourier transform, and outputting a first amplitude-stream signal;
step 2: inputting the first amplitude-stream signal to an attention network unit for noise reduction, and outputting a second amplitude-stream signal; the attention network unit comprising multiple stages of polar-coordinate attention modules connected in series, each stage comprising an amplitude attention module, a phase self-adjustment module and a phase cross-adjustment module, wherein the amplitude attention module further comprises either two time-frequency separable recurrent network modules and an independent identically-distributed convolution module, or two channel permutation-transform modules, two time-frequency separable recurrent network modules and an independent identically-distributed convolution module;
step 3: based on the second amplitude-stream signal, performing an inverse Fourier transform through a post-network unit and outputting an enhanced speech signal.
10. A speech enhancement device with separable recurrent attention, comprising:
a pre-network unit, configured to perform a Fourier transform on an input speech signal and output a first amplitude-stream signal and a first phase-stream signal;
an attention network unit, configured to perform noise reduction on the first amplitude-stream signal and the first phase-stream signal, and output a second amplitude-stream signal and a second phase-stream signal; and
a post-network unit, configured to perform an inverse Fourier transform on the second amplitude-stream signal and the second phase-stream signal, and output an enhanced speech signal;
wherein the attention network unit comprises multiple stages of polar-coordinate attention modules connected in series, each stage comprising an amplitude attention module, a phase self-adjustment module and a phase cross-adjustment module, and the amplitude attention module further comprises two channel permutation-transform modules, two time-frequency separable recurrent network modules and an independent identically-distributed convolution module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111285653.5A CN114023346B (en) | 2021-11-01 | 2021-11-01 | Voice enhancement method and device capable of separating circulating attention |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114023346A true CN114023346A (en) | 2022-02-08 |
CN114023346B CN114023346B (en) | 2024-05-31 |
Family
ID=80059604
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111285653.5A Active CN114023346B (en) | 2021-11-01 | 2021-11-01 | Voice enhancement method and device capable of separating circulating attention |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114023346B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4754449A (en) * | 1986-07-02 | 1988-06-28 | Hughes Aircraft Company | Wide bandwidth device for demodulating frequency division multiplexed signals |
WO2011026247A1 (en) * | 2009-09-04 | 2011-03-10 | Svox Ag | Speech enhancement techniques on the power spectrum |
EP2905774A1 (en) * | 2014-02-11 | 2015-08-12 | JoboMusic GmbH | Method for synthesizing a digital audio signal |
US20210012767A1 (en) * | 2020-09-25 | 2021-01-14 | Intel Corporation | Real-time dynamic noise reduction using convolutional networks |
US20210035590A1 (en) * | 2019-08-02 | 2021-02-04 | Audioshake, Inc. | Deep learning segmentation of audio using magnitude spectrogram |
CN113241092A (en) * | 2021-06-15 | 2021-08-10 | 新疆大学 | Sound source separation method based on double-attention mechanism and multi-stage hybrid convolution network |
Non-Patent Citations (1)
Title |
---|
Yan Zhaoyu; Wang Jing: "Single-channel speech enhancement algorithm combining a deep convolutional recurrent network with a time-frequency attention mechanism", Journal of Signal Processing, no. 06, 25 June 2020 (2020-06-25) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116092501A (en) * | 2023-03-14 | 2023-05-09 | 澳克多普有限公司 | Speech enhancement method, speech recognition method, speaker recognition method and speaker recognition system |
CN116092501B (en) * | 2023-03-14 | 2023-07-25 | 深圳市玮欧科技有限公司 | Speech enhancement method, speech recognition method, speaker recognition method and speaker recognition system |
Also Published As
Publication number | Publication date |
---|---|
CN114023346B (en) | 2024-05-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | Deep audio priors emerge from harmonic convolutional networks | |
Venkataramani et al. | Adaptive front-ends for end-to-end source separation | |
CN114141238A (en) | Voice enhancement method fusing Transformer and U-net network | |
US11393443B2 (en) | Apparatuses and methods for creating noise environment noisy data and eliminating noise | |
CN114023346A (en) | Voice enhancement method and device capable of separating circulatory attention | |
Wang et al. | A path signature approach for speech emotion recognition | |
Du et al. | A joint framework of denoising autoencoder and generative vocoder for monaural speech enhancement | |
CN112151071A (en) | Speech emotion recognition method based on mixed wavelet packet feature deep learning | |
Jindal et al. | SpeechMix-Augmenting Deep Sound Recognition Using Hidden Space Interpolations. | |
Lim et al. | Harmonic and percussive source separation using a convolutional auto encoder | |
Vuong et al. | Learnable spectro-temporal receptive fields for robust voice type discrimination | |
Takeuchi et al. | Invertible DNN-based nonlinear time-frequency transform for speech enhancement | |
Zhang et al. | Temporal Transformer Networks for Acoustic Scene Classification. | |
Narayanan et al. | Cross-attention conformer for context modeling in speech enhancement for ASR | |
Li et al. | Data augmentation method for underwater acoustic target recognition based on underwater acoustic channel modeling and transfer learning | |
Xu et al. | U-former: Improving monaural speech enhancement with multi-head self and cross attention | |
Dey et al. | Single channel blind source separation based on variational mode decomposition and PCA | |
CN113593588A (en) | Multi-singer singing voice synthesis method and system based on generation countermeasure network | |
Wang et al. | Low pass filtering and bandwidth extension for robust anti-spoofing countermeasure against codec variabilities | |
CN116682444A (en) | Single-channel voice enhancement method based on waveform spectrum fusion network | |
CN116469404A (en) | Audio-visual cross-mode fusion voice separation method | |
Le et al. | Personalized speech enhancement combining band-split rnn and speaker attentive module | |
Wang et al. | Unsupervised improvement of audio-text cross-modal representations | |
CN111028857B (en) | Method and system for reducing noise of multichannel audio-video conference based on deep learning | |
US9478223B2 (en) | Method and apparatus for down-mixing multi-channel audio |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||