CN114023346A - Speech enhancement method and device with separable recurrent attention - Google Patents

Speech enhancement method and device with separable recurrent attention

Info

Publication number
CN114023346A
Authority
CN
China
Prior art keywords
signal
amplitude
phase
module
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111285653.5A
Other languages
Chinese (zh)
Other versions
CN114023346B (en
Inventor
柯登峰
张劲松
解焱陆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING LANGUAGE AND CULTURE UNIVERSITY
Original Assignee
BEIJING LANGUAGE AND CULTURE UNIVERSITY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING LANGUAGE AND CULTURE UNIVERSITY
Priority to CN202111285653.5A
Publication of CN114023346A
Application granted
Publication of CN114023346B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The invention relates to a speech enhancement method with separable recurrent attention, comprising the following steps. Step 1: inputting a speech signal into a front network unit for Fourier transform, and outputting a first amplitude stream signal and a first phase stream signal. Step 2: inputting the first amplitude stream signal and the first phase stream signal into an attention network unit for noise reduction, and outputting a second amplitude stream signal and a second phase stream signal; the attention network unit comprises multiple stages of polar coordinate attention modules connected in series, and each stage comprises an amplitude attention module, a phase self-adjustment module and a phase other-adjustment module. Step 3: inputting the second amplitude stream signal and the second phase stream signal into a post-network unit for inverse Fourier transform, and outputting an enhanced speech signal. The invention requires little computation and effectively ensures the speech noise reduction effect.

Description

Speech enhancement method and device with separable recurrent attention
Technical Field
The invention relates to a speech enhancement method and device with separable recurrent attention.
Background
Noise reduction at the front end of speech recognition, vocal extraction in audio and video production, and voice purification in speech synthesis all involve noise-reducing enhancement of speech signals. Existing speech noise reduction mainly includes the following approaches:
SEGAN: uses a U-Net as the basic structure for noise reduction and adopts generative adversarial training so that the generated sound is close to the human voice. Its drawbacks are a simple model structure, incomplete removal of complex noise, and a tendency toward mode collapse.
WAVENET: its drawbacks are a huge model, complex training, extremely low speed (about 10 minutes of processing time for every 1 minute of speech), misaligned phase, and difficulty in distinguishing the human voice from harmonic music noise.
TasNet: uses a TCN as the basic structure for noise reduction and enlarges the receptive field with dilated convolution. Its drawbacks are that the completeness of the space is not guaranteed, the frequency resolution of the model is poor, and noise is not fully removed in segments where speech and noise occur simultaneously.
T-GSA: uses a Transformer as the basic structure for noise reduction and locally constrains the receptive field with a Gaussian function. Its drawback is huge computational complexity: the processing time grows as $O(N^2)$ with the speech length $N$.
PHASEN: this is the noise reduction method most closely related to the present invention. It uses the TSB as the basic structure for noise reduction and performs harmonic enhancement with a frequency transformation block. Although it requires little computation and achieves a fairly good noise reduction effect, it uses only a fixed receptive field and can model only fixed harmonic correlations. In practice, it is sometimes necessary to look far across time and frequency, considering the surrounding context as a whole, to determine whether the current sound is speech or noise and whether the current harmonic is a true or a pseudo harmonic. Its speech noise reduction effect is therefore not ideal.
Disclosure of Invention
The invention aims to provide a speech enhancement method and device with separable recurrent attention, which require little computation and effectively ensure the speech noise reduction effect.
Based on the same inventive concept, the invention has three independent technical schemes:
1. A speech enhancement method with separable recurrent attention, comprising:
step 1: inputting the speech signal into a front network unit for Fourier transform, and outputting a first amplitude stream signal and a first phase stream signal;
step 2: inputting the first amplitude stream signal and the first phase stream signal into an attention network unit for noise reduction, and outputting a second amplitude stream signal and a second phase stream signal; the attention network unit comprises multiple stages of polar coordinate attention modules connected in series, wherein each stage of polar coordinate attention module comprises an amplitude attention module, a phase self-adjustment module and a phase other-adjustment module, and the amplitude attention module further comprises two channel permutation transformation modules, two time-frequency separable recurrent network modules and an independent and identically distributed convolution module;
step 3: inputting the second amplitude stream signal and the second phase stream signal into a post-network unit for inverse Fourier transform, and outputting an enhanced speech signal.
Further, in step 2, each stage of polar coordinate attention module is configured to perform the following steps:
step 2.1: processing the input amplitude stream signal with the amplitude attention module, and inputting the adjusted amplitude stream signal into the phase other-adjustment module;
step 2.2: processing the input phase stream signal with the phase self-adjustment module, and inputting the resulting self-adjusted phase stream signal into the phase other-adjustment module;
step 2.3: based on the adjusted amplitude stream signal, performing other-adjustment on the self-adjusted phase stream signal with the phase other-adjustment module, and outputting an adjusted phase stream signal;
step 2.4: outputting the adjusted amplitude stream signal and the adjusted phase stream signal.
Further, the phase self-adjustment module consists of one or more layers of two-dimensional convolution;
the phase other-adjustment module comprises one or more amplitude-aware phase transformations, each of which adjusts the phase stream signal using the amplitude stream signal according to the following formula:
$P_o = \mathrm{Conv}(A_o) \odot P_i$
where Conv denotes convolution, $\odot$ denotes element-wise multiplication, $P_i$ denotes the phase stream signal input to the phase other-adjustment, $P_o$ denotes the phase stream output, and $A_o$ denotes the amplitude stream output, which serves as the other input to the phase other-adjustment.
Further, step 2.1 comprises the following steps:
step 2.1.1: inputting the input amplitude stream signal into a first channel permutation transformation module and a first time-frequency separable recurrent network module respectively, to obtain a first permutation transformation signal and a first recurrent signal;
step 2.1.2: inputting the first permutation transformation signal into a second channel permutation transformation module, and outputting a second permutation transformation signal; multiplying the first permutation transformation signal by the first recurrent signal, inputting the product into the second time-frequency separable recurrent network module, and outputting a second recurrent signal;
step 2.1.3: concatenating the second permutation transformation signal and the second recurrent signal, inputting the concatenated signal into the independent and identically distributed convolution module, and outputting the adjusted amplitude stream signal.
Further, the time-frequency separable recurrent network module adopts one of the following recurrence modes: time-only recurrence, frequency-only recurrence, time recurrence followed by frequency recurrence, frequency recurrence followed by time recurrence, or parallel time and frequency recurrence;
the recurrence is one of a forward recurrence, a backward recurrence and a bidirectional recurrence.
Further, the independent and identically distributed convolution module is composed of a distribution normalization layer, a two-dimensional convolution layer and a GELU layer.
Further, the front network unit comprises a short-time Fourier transform module, an amplitude convolution module and a phase convolution module,
the short-time Fourier transform module is used for transforming the voice signal into short-time Fourier coefficients;
the amplitude convolution module is used for performing amplitude convolution on the signal output by the short-time Fourier transform module and outputting a first amplitude stream signal;
and the phase convolution module is used for performing phase convolution on the signal output by the short-time Fourier transform module and outputting a first phase stream signal.
Further, the post-network unit comprises an amplitude mask generator, a phase mask generator, a Fourier coefficient generator and an inverse short-time Fourier transform module,
the amplitude mask generator is used for generating a single-channel amplitude signal from the second amplitude stream signal;
the phase mask generator is used for generating a two-channel phase signal from the second phase stream signal;
the Fourier coefficient generator is used for generating Fourier coefficients from the single-channel amplitude signal and the two-channel phase signal;
and the inverse short-time Fourier transform module is used for outputting the enhanced speech signal from the generated Fourier coefficients.
2. A speech enhancement method with separable recurrent attention, comprising:
step 1: inputting the speech signal into a front network unit for Fourier transform, and outputting a first amplitude stream signal;
step 2: inputting the first amplitude stream signal into an attention network unit for noise reduction, and outputting a second amplitude stream signal; the attention network unit comprises multiple stages of polar coordinate attention modules connected in series, wherein each stage of polar coordinate attention module comprises an amplitude attention module, a phase self-adjustment module and a phase other-adjustment module, and the amplitude attention module further comprises either two time-frequency separable recurrent network modules and an independent and identically distributed convolution module, or two channel permutation transformation modules, two time-frequency separable recurrent network modules and an independent and identically distributed convolution module;
step 3: performing inverse Fourier transform on the second amplitude stream signal through a post-network unit, and outputting an enhanced speech signal.
3. A speech enhancement device with separable recurrent attention, comprising:
a front network unit, configured to perform Fourier transform on an input speech signal and output a first amplitude stream signal and a first phase stream signal;
an attention network unit, configured to perform noise reduction on the first amplitude stream signal and the first phase stream signal, and output a second amplitude stream signal and a second phase stream signal; and
a post-network unit, configured to perform inverse Fourier transform on the second amplitude stream signal and the second phase stream signal and output an enhanced speech signal;
wherein the attention network unit comprises multiple stages of polar coordinate attention modules connected in series, each stage of polar coordinate attention module comprises an amplitude attention module, a phase self-adjustment module and a phase other-adjustment module, and the amplitude attention module further comprises two channel permutation transformation modules, two time-frequency separable recurrent network modules and an independent and identically distributed convolution module.
The invention has the following beneficial effects:
In the invention, a speech signal is input to a front network unit for Fourier transform, and a first amplitude stream signal and a first phase stream signal are output; the first amplitude stream signal and the first phase stream signal are input into an attention network unit for noise reduction, and a second amplitude stream signal and a second phase stream signal are output; the attention network unit is formed by multiple stages of polar coordinate attention modules connected in series, and each stage consists of three modules: amplitude attention, phase self-adjustment and phase other-adjustment; the second amplitude stream signal and the second phase stream signal are input to a post-network unit for inverse Fourier transform, and an enhanced speech signal is output. Because the attention network unit follows a separable design and adopts an expanded recurrent neural network structure, the receptive field is no longer fixed and more complex harmonic correlations can be modeled. Compared with the existing PHASEN, the number of parameters of the invention is reduced by two orders of magnitude and the amount of computation is smaller, and on six international evaluation metrics the speech noise reduction effect is better than that of existing models, including PHASEN.
Each stage of polar coordinate attention module of the invention performs the following steps. Step 2.1: processing the input amplitude stream signal with the amplitude attention module, and inputting the adjusted amplitude stream signal into the phase other-adjustment module. Step 2.2: processing the input phase stream signal with the phase self-adjustment module, and inputting the resulting self-adjusted phase stream signal into the phase other-adjustment module. Step 2.3: based on the adjusted amplitude stream signal, performing other-adjustment on the self-adjusted phase stream signal with the phase other-adjustment module, and outputting an adjusted phase stream signal. Step 2.4: outputting the adjusted amplitude stream signal and the adjusted phase stream signal. The phase self-adjustment module consists of one or more layers of two-dimensional convolution; the phase other-adjustment module comprises one or more amplitude-aware phase transformations, each of which adjusts the phase using the amplitude stream output. The amplitude attention module comprises channel permutation transformation modules, time-frequency separable recurrent network modules and an independent and identically distributed convolution module.
Step 2.1 comprises the following steps. Step 2.1.1: inputting the input amplitude stream signal into a first channel permutation transformation module and a first time-frequency separable recurrent network module respectively, to obtain a first permutation transformation signal and a first recurrent signal. Step 2.1.2: inputting the first permutation transformation signal into a second channel permutation transformation module, and outputting a second permutation transformation signal; superimposing the first permutation transformation signal and the first recurrent signal, inputting the result into the second time-frequency separable recurrent network module, and outputting a second recurrent signal. Step 2.1.3: superimposing the second permutation transformation signal and the second recurrent signal, inputting the result into the independent and identically distributed convolution module, and outputting the adjusted amplitude stream signal. The recurrent neural network structure of the attention network unit further ensures that the invention achieves a better speech noise reduction effect.
Drawings
FIG. 1 is a flow diagram of the speech enhancement method with separable recurrent attention according to the present invention;
FIG. 2 is a flow diagram of the front network unit of the present invention;
FIG. 3 is a flow diagram of the post-network unit of the present invention;
FIG. 4 is a flow diagram of the attention network unit of the present invention;
FIG. 5 is a flow diagram of a polar coordinate attention module of the attention network unit of the present invention;
FIG. 6 is a flow diagram of the amplitude attention module of the polar coordinate attention module of the attention network unit of the present invention.
Detailed Description
The present invention is described in detail below with reference to the embodiments shown in the drawings. These embodiments do not limit the present invention, and functional, methodological or structural equivalents or substitutions made by those skilled in the art on the basis of these embodiments fall within the scope of the present invention.
Embodiment one:
Speech enhancement method with separable recurrent attention
As shown in fig. 1, the speech enhancement method with separable recurrent attention of the present invention uses a front network unit, an attention network unit and a post-network unit, and includes the following steps:
step 1: inputting the speech signal into the front network unit for Fourier transform, and outputting a first amplitude stream signal and a first phase stream signal;
step 2: inputting the first amplitude stream signal and the first phase stream signal into the attention network unit for noise reduction, and outputting a second amplitude stream signal and a second phase stream signal; the attention network unit comprises multiple stages of polar coordinate attention modules connected in series, wherein each stage of polar coordinate attention module comprises an amplitude attention module, a phase self-adjustment module and a phase other-adjustment module, and the amplitude attention module further comprises either two time-frequency separable recurrent network modules and an independent and identically distributed convolution module, or two channel permutation transformation modules, two time-frequency separable recurrent network modules and an independent and identically distributed convolution module; in this embodiment, the two channel permutation transformation modules are included.
step 3: inputting the second amplitude stream signal and the second phase stream signal into the post-network unit for inverse Fourier transform, and outputting an enhanced speech signal.
As shown in fig. 2, the front network unit includes a short-time Fourier transform module, an amplitude convolution module and a phase convolution module. The short-time Fourier transform module transforms the speech signal into short-time Fourier coefficients; the amplitude convolution module performs amplitude convolution on the signal output by the short-time Fourier transform module and outputs a first amplitude stream signal; and the phase convolution module performs phase convolution on the signal output by the short-time Fourier transform module and outputs a first phase stream signal. The amplitude convolution comprises a 1 × 1 convolution and a GELU activation. The phase convolution comprises an n × n convolution without activation. Note that no activation may be used on the phase convolution; otherwise performance degrades significantly.
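The front network unit described above can be illustrated with a minimal NumPy sketch. It assumes that the short-time Fourier coefficients are fed to both convolutions as a two-channel real/imaginary tensor, which the text does not specify. The helper names (`stft`, `gelu`, `conv2d_same`, `front_network`), the frame sizes, the tanh approximation of GELU, the choice n = 3, and the random weights standing in for trained ones are likewise illustrative assumptions.

```python
import numpy as np

def stft(signal, n_fft=8, hop=4):
    """Short-time Fourier coefficients, shape (freq_bins, frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def conv2d_same(x, w):
    """Zero-padded 2-D convolution (cross-correlation convention).
    x: (C_in, F, T); w: (C_out, C_in, k, k) with odd k."""
    k = w.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    n_f, n_t = x.shape[1:]
    out = np.zeros((w.shape[0], n_f, n_t))
    for i in range(k):
        for j in range(k):
            out += np.einsum("oc,cft->oft", w[:, :, i, j],
                             xp[:, i:i + n_f, j:j + n_t])
    return out

def front_network(signal, w_amp, w_phase):
    spec = stft(signal)
    x = np.stack([spec.real, spec.imag])       # assumed 2-channel input
    amp_stream = gelu(conv2d_same(x, w_amp))   # 1 x 1 convolution + GELU
    phase_stream = conv2d_same(x, w_phase)     # n x n convolution, no activation
    return amp_stream, phase_stream

rng = np.random.default_rng(0)
signal = rng.standard_normal(32)
w_amp = rng.standard_normal((4, 2, 1, 1))      # hypothetical weights
w_phase = rng.standard_normal((3, 2, 3, 3))    # hypothetical weights, n = 3
amp_stream, phase_stream = front_network(signal, w_amp, w_phase)
print(amp_stream.shape, phase_stream.shape)
```

Because the phase branch has no activation, it stays linear in the input signal in this sketch, whereas the amplitude branch does not.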
As shown in fig. 3, the post-network unit comprises an amplitude mask generator, a phase mask generator, a Fourier coefficient generator and an inverse short-time Fourier transform module. The amplitude mask generator generates a single-channel amplitude signal from the second amplitude stream signal; the phase mask generator generates a two-channel phase signal from the second phase stream signal; the Fourier coefficient generator generates Fourier coefficients from the single-channel amplitude signal and the two-channel phase signal; and the inverse short-time Fourier transform module outputs the enhanced speech signal from the generated Fourier coefficients. The amplitude mask generator consists of several layers of two-dimensional convolution, and the last convolution layer outputs 1 channel; a layer normalization function and a GELU activation function may optionally be inserted between every two convolution layers, and a Sigmoid activation function follows the last convolution layer. The phase mask generator consists of several layers of two-dimensional convolution, and the last convolution layer outputs 2 channels; there is no layer normalization and no activation function between the convolution layers, and the last convolution layer is followed by amplitude normalization, so that the sum of squares of the 2 channels at each time-frequency point is 1 (that is, the output carries only phase information and no amplitude information).
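The final activations of the two mask generators and the Fourier coefficient generator can be sketched as follows. The sketch assumes that the single-channel amplitude signal acts as a mask multiplied onto the noisy amplitude spectrum and that the two phase channels are read as the real and imaginary parts of a unit-modulus phase factor; the text does not state either point explicitly. The function names and the random inputs standing in for the convolution outputs are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def normalize_phase(raw_phase, eps=1e-8):
    """Amplitude normalization: scale the 2 channels at each time-frequency
    point so that their squares sum to 1. raw_phase: (2, F, T)."""
    norm = np.sqrt(np.sum(raw_phase ** 2, axis=0, keepdims=True))
    return raw_phase / np.maximum(norm, eps)

def fourier_coefficients(noisy_amplitude, mask_logits, raw_phase):
    """Combine a single-channel amplitude mask with a two-channel phase."""
    mask = sigmoid(mask_logits)               # values in (0, 1), shape (F, T)
    phase = normalize_phase(raw_phase)        # unit-norm pairs, shape (2, F, T)
    # Assumption: channel 0 is the real part, channel 1 the imaginary part.
    return noisy_amplitude * mask * (phase[0] + 1j * phase[1])

rng = np.random.default_rng(0)
noisy_amplitude = np.abs(rng.standard_normal((5, 7)))
mask_logits = rng.standard_normal((5, 7))     # stand-in for the last conv output
raw_phase = rng.standard_normal((2, 5, 7))    # stand-in for the last conv output
coeffs = fourier_coefficients(noisy_amplitude, mask_logits, raw_phase)
print(coeffs.shape, coeffs.dtype)
```

The resulting complex array would then be passed to an inverse short-time Fourier transform to obtain the enhanced waveform.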
As shown in fig. 4, the attention network unit consists of multiple stages of polar coordinate attention modules connected in series. As shown in fig. 5, each stage of polar coordinate attention module consists of three modules: amplitude attention, phase self-adjustment and phase other-adjustment. Each stage of polar coordinate attention module performs the following steps:
step 2.1: processing the input amplitude stream signal with the amplitude attention module, and inputting the adjusted amplitude stream signal into the phase other-adjustment module;
step 2.2: processing the input phase stream signal with the phase self-adjustment module, and inputting the resulting self-adjusted phase stream signal into the phase other-adjustment module;
step 2.3: based on the adjusted amplitude stream signal, performing other-adjustment on the self-adjusted phase stream signal with the phase other-adjustment module, and outputting an adjusted phase stream signal;
step 2.4: outputting the adjusted amplitude stream signal and the adjusted phase stream signal.
The phase self-adjustment module consists of one or more layers of two-dimensional convolution;
the phase other-adjustment module comprises one or more amplitude-aware phase transformations, each of which adjusts the phase stream signal using the amplitude stream signal according to the following formula:
$P_o = \mathrm{Conv}(A_o) \odot P_i$
where Conv denotes convolution, $\odot$ denotes element-wise multiplication, $P_i$ denotes the phase stream signal input to the phase other-adjustment, $P_o$ denotes the phase stream output, and $A_o$ denotes the amplitude stream output, which serves as the other input to the phase other-adjustment.
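The amplitude-aware phase transformation and the wiring of steps 2.1 to 2.4 can be sketched as below. This is a minimal illustration, assuming a 1 × 1 convolution for Conv (the kernel size is not given in the text) and using identity functions as placeholders for the amplitude attention and phase self-adjustment modules; all function and variable names are hypothetical.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1 x 1 convolution as channel mixing. x: (C_in, F, T), weight: (C_out, C_in)."""
    return np.einsum("oc,cft->oft", weight, x) + bias[:, None, None]

def amplitude_aware_phase_transform(a_o, p_i, weight, bias):
    """P_o = Conv(A_o) * P_i, with * taken element-wise."""
    return conv1x1(a_o, weight, bias) * p_i

def polar_attention_stage(a_in, p_in, amplitude_attention, phase_self_adjust,
                          weight, bias):
    a_o = amplitude_attention(a_in)                                    # step 2.1
    p_self = phase_self_adjust(p_in)                                   # step 2.2
    p_o = amplitude_aware_phase_transform(a_o, p_self, weight, bias)   # step 2.3
    return a_o, p_o                                                    # step 2.4

rng = np.random.default_rng(0)
a_in = rng.standard_normal((4, 5, 7))   # amplitude stream: 4 channels
p_in = rng.standard_normal((3, 5, 7))   # phase stream: 3 channels
weight = rng.standard_normal((3, 4))    # hypothetical weights mapping 4 -> 3 channels
bias = np.zeros(3)

identity = lambda x: x                  # placeholder sub-modules
a_o, p_o = polar_attention_stage(a_in, p_in, identity, identity, weight, bias)
print(a_o.shape, p_o.shape)
```

The convolution maps the amplitude channels to the number of phase channels so that the element-wise product is well defined.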
As shown in fig. 6, the amplitude attention module includes channel permutation transformation modules, time-frequency separable recurrent network modules and an independent and identically distributed convolution module, and step 2.1 includes the following steps:
step 2.1.1: inputting the input amplitude stream signal into a first channel permutation transformation module and a first time-frequency separable recurrent network module respectively, to obtain a first permutation transformation signal and a first recurrent signal;
step 2.1.2: inputting the first permutation transformation signal into a second channel permutation transformation module, and outputting a second permutation transformation signal; multiplying the first permutation transformation signal by the first recurrent signal (element-wise multiplication of the signal matrices), inputting the product into the second time-frequency separable recurrent network module, and outputting a second recurrent signal;
step 2.1.3: concatenating the second permutation transformation signal and the second recurrent signal along the channel dimension, inputting the concatenated signal into the independent and identically distributed convolution module, and outputting the adjusted amplitude stream signal.
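The data flow of steps 2.1.1 to 2.1.3 can be sketched as follows. Only the wiring follows the text; the sub-modules are simple stand-ins chosen for illustration (a channel reversal for the permutation transformation, a cumulative mean over time for the recurrent network, and a sum of the two channel halves for the final convolution), and every name is hypothetical.

```python
import numpy as np

def amplitude_attention(a_in, perm1, perm2, rnn1, rnn2, iid_conv):
    p1 = perm1(a_in)                            # step 2.1.1: first permutation signal
    r1 = rnn1(a_in)                             # step 2.1.1: first recurrent signal
    p2 = perm2(p1)                              # step 2.1.2: second permutation signal
    r2 = rnn2(p1 * r1)                          # step 2.1.2: element-wise product -> RNN
    merged = np.concatenate([p2, r2], axis=0)   # step 2.1.3: channel concatenation
    return iid_conv(merged)                     # step 2.1.3: final convolution block

# Stand-in sub-modules (assumptions for illustration only).
def reverse_channels(x):
    return x[::-1]

def cumulative_mean_over_time(x):
    return np.cumsum(x, axis=2) / np.arange(1, x.shape[2] + 1)

def sum_channel_halves(x):
    half = x.shape[0] // 2
    return x[:half] + x[half:]

rng = np.random.default_rng(0)
a_in = rng.standard_normal((4, 5, 7))           # (channels, frequency, time)
a_out = amplitude_attention(a_in, reverse_channels, reverse_channels,
                            cumulative_mean_over_time, cumulative_mean_over_time,
                            sum_channel_halves)
print(a_out.shape)
```

The concatenation doubles the channel count, so the final block must reduce it again if the stage is to be stacked in series.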
The channel permutation transformation may be an identity transformation, a reordering transformation or a convolution transformation, or a combination of the three. The independent and identically distributed convolution consists of a distribution normalization layer, a two-dimensional convolution layer and a GELU layer. The GELU layer may also be replaced by another activation function such as ReLU, PReLU, ELU, sigmoid or softplus.
The first and second time-frequency separable recurrent network modules may use time-only recurrence, frequency-only recurrence, time recurrence followed by frequency recurrence, frequency recurrence followed by time recurrence, or parallel time and frequency recurrence, and each recurrence may be forward, backward or bidirectional.
Time-only recurrence is implemented by the following formulas.
Forward time-only recurrence:
[formula image BDA0003331597040000101 not reproduced]
Backward time-only recurrence:
[formula image BDA0003331597040000102 not reproduced]
Bidirectional time-only recurrence:
[formula image BDA0003331597040000111 not reproduced]
where the symbol shown in [image BDA0003331597040000112 not reproduced] denotes concatenation along the channel dimension, Cell denotes an arbitrary recurrent cell structure, $h_{b,f,t}$ denotes the hidden state at the f-th frequency and t-th time of the b-th speech segment, $c_{b,f,t}$ denotes the cell state at the f-th frequency and t-th time of the b-th speech segment, and $x_{b,f,t}$ denotes the input value at the f-th frequency and t-th time of the b-th speech segment.
Frequency-only recurrence is implemented by the following formulas.
Forward frequency-only recurrence:
[formula image BDA0003331597040000113 not reproduced]
Backward frequency-only recurrence:
[formula image BDA0003331597040000114 not reproduced]
Bidirectional frequency-only recurrence:
[formula image BDA0003331597040000115 not reproduced]
Time recurrence followed by frequency recurrence is implemented by the following formulas.
Forward time-then-frequency recurrence:
[formula image BDA0003331597040000116 not reproduced]
Backward time-then-frequency recurrence:
[formula image BDA0003331597040000117 not reproduced]
Bidirectional time-then-frequency recurrence:
[formula image BDA0003331597040000118 not reproduced]
Frequency recurrence followed by time recurrence is implemented by the following formulas.
Forward frequency-then-time recurrence:
[formula image BDA0003331597040000119 not reproduced]
Backward frequency-then-time recurrence:
[formula image BDA00033315970400001110 not reproduced]
Bidirectional frequency-then-time recurrence:
[formula image BDA00033315970400001111 not reproduced]
Parallel time and frequency recurrence is implemented by the following formulas.
Forward parallel recurrence:
[formula image BDA00033315970400001112 not reproduced]
Backward parallel recurrence:
[formula image BDA00033315970400001113 not reproduced]
Bidirectional parallel recurrence:
[formula image BDA00033315970400001114 not reproduced]
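Since the formula images are not available in this text, the recurrence modes listed above can only be illustrated under assumptions. The sketch below uses a plain tanh recurrent cell in place of the arbitrary Cell, runs it along the time axis or the frequency axis while treating the other axis as a batch, and combines bidirectional and parallel outputs by channel concatenation. The concatenation for the parallel mode, the cell itself, and all names are assumptions, not the patent's exact formulas.

```python
import numpy as np

def make_cell(c_in, c_hid, seed):
    """Hypothetical tanh recurrent cell with fixed random weights."""
    rng = np.random.default_rng(seed)
    w_x = 0.1 * rng.standard_normal((c_hid, c_in))
    w_h = 0.1 * rng.standard_normal((c_hid, c_hid))
    def cell(x, h):                       # x: (N, c_in), h: (N, c_hid)
        return np.tanh(x @ w_x.T + h @ w_h.T)
    return cell

def recur(x, cell, c_hid, axis, reverse=False):
    """Run the cell along one axis of x: (C, F, T).
    axis=2 recurs over time, axis=1 over frequency; the other axis is a batch."""
    seq = np.moveaxis(x, axis, 0)         # (L, C, N)
    if reverse:
        seq = seq[::-1]
    h = np.zeros((seq.shape[2], c_hid))
    outs = []
    for step in seq:                      # step: (C, N)
        h = cell(step.T, h)
        outs.append(h)
    out = np.stack(outs)                  # (L, N, c_hid)
    if reverse:
        out = out[::-1]                   # restore original positions
    out = np.moveaxis(out, 2, 0)          # (c_hid, L, N)
    return out if axis == 1 else np.swapaxes(out, 1, 2)

def bidirectional(x, fwd_cell, bwd_cell, c_hid, axis):
    return np.concatenate([recur(x, fwd_cell, c_hid, axis),
                           recur(x, bwd_cell, c_hid, axis, reverse=True)], axis=0)

def time_then_frequency(x, time_cell, freq_cell, c_hid):
    return recur(recur(x, time_cell, c_hid, axis=2), freq_cell, c_hid, axis=1)

def parallel(x, time_cell, freq_cell, c_hid):
    return np.concatenate([recur(x, time_cell, c_hid, axis=2),
                           recur(x, freq_cell, c_hid, axis=1)], axis=0)

x = np.random.default_rng(0).standard_normal((3, 5, 6))   # (C, F, T)
time_cell = make_cell(3, 4, seed=1)
time_cell_bwd = make_cell(3, 4, seed=2)
freq_cell = make_cell(3, 4, seed=3)
freq_cell_hid = make_cell(4, 4, seed=4)   # consumes the time recurrence output

print(recur(x, time_cell, 4, axis=2).shape)
print(bidirectional(x, time_cell, time_cell_bwd, 4, axis=2).shape)
print(time_then_frequency(x, time_cell, freq_cell_hid, 4).shape)
print(parallel(x, time_cell, freq_cell, 4).shape)
```

Running the recurrence along one axis at a time is what makes the design separable: each pass costs time linear in the length of that axis, and chaining or concatenating a time pass and a frequency pass lets information travel across the whole time-frequency plane.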
Embodiment two:
Speech enhancement method with separable recurrent attention
The method comprises the following steps:
step 1: inputting the speech signal into a front network unit for Fourier transform, and outputting a first amplitude stream signal;
step 2: inputting the first amplitude stream signal into an attention network unit for noise reduction, and outputting a second amplitude stream signal; the attention network unit comprises multiple stages of polar coordinate attention modules connected in series, wherein each stage of polar coordinate attention module comprises an amplitude attention module, a phase self-adjustment module and a phase other-adjustment module, and the amplitude attention module further comprises either two time-frequency separable recurrent network modules and an independent and identically distributed convolution module, or two channel permutation transformation modules, two time-frequency separable recurrent network modules and an independent and identically distributed convolution module;
step 3: based on the second amplitude stream signal, performing inverse Fourier transform through a post-network unit, and outputting an enhanced speech signal.
The second embodiment differs from the first in that noise reduction is performed only on the first amplitude stream signal and not on the phase stream signal, so its noise reduction effect is slightly weaker than that of the first embodiment. The rest of the working principle is the same as in the first embodiment.
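An amplitude-only variant of this kind can be illustrated with a short sketch. It assumes that the phase of the noisy short-time Fourier coefficients is reused unchanged when the enhanced coefficients are formed; the text only states that the phase stream is not denoised. The function name and the random inputs are hypothetical.

```python
import numpy as np

def apply_amplitude_mask(noisy_spec, mask):
    """Scale the amplitude by the mask and keep the noisy phase (assumed)."""
    amplitude = np.abs(noisy_spec)
    phase_factor = np.exp(1j * np.angle(noisy_spec))   # unit-modulus noisy phase
    return mask * amplitude * phase_factor

rng = np.random.default_rng(0)
noisy_spec = rng.standard_normal((5, 7)) + 1j * rng.standard_normal((5, 7))
mask = rng.uniform(0.1, 1.0, size=(5, 7))   # stand-in for the amplitude mask
enhanced = apply_amplitude_mask(noisy_spec, mask)
print(enhanced.shape)
```

With a real-valued mask this is equivalent to multiplying the complex coefficients by the mask, which is why any phase error in the noisy input carries over to the output.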
Embodiment three:
A separable recurrent attention speech enhancement device
The device comprises:
a pre-network unit, configured to perform a Fourier transform on an input speech signal and output a first amplitude stream signal and a first phase stream signal;
an attention network unit, configured to perform noise reduction on the first amplitude stream signal and the first phase stream signal, and output a second amplitude stream signal and a second phase stream signal; and
a post-network unit, configured to perform an inverse Fourier transform on the second amplitude stream signal and the second phase stream signal and output an enhanced speech signal;
wherein the attention network unit comprises multiple stages of polar coordinate attention modules connected in series, each stage of polar coordinate attention module comprises an amplitude attention module, a phase self-adjusting module and a phase other-adjusting module, and the amplitude attention module comprises two channel permutation transformation modules, two time-frequency separable recurrent network modules and an independent and identically distributed (i.i.d.) convolution module.
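The series connection of the polar coordinate attention stages can be sketched as a simple loop in which each stage consumes and returns an (amplitude, phase) pair. The function names and the toy stage are hypothetical; a real stage would be a trained network.

```python
def run_attention_network(amplitude, phase, stages):
    """Pass the two streams through the stages one after another."""
    for stage in stages:
        amplitude, phase = stage(amplitude, phase)
    return amplitude, phase

def toy_stage(amplitude, phase):
    # Hypothetical stage: halve the amplitude stream, copy the phase stream.
    return [0.5 * a for a in amplitude], list(phase)

a2, p2 = run_attention_network([4.0, 8.0], [0.1, 0.2], [toy_stage] * 3)
```

The output of the last stage corresponds to the second amplitude stream signal and the second phase stream signal that are handed to the post-network unit.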
The detailed description above only describes feasible embodiments of the present invention and is not intended to limit its scope of protection; equivalent embodiments or modifications that do not depart from the technical spirit of the present invention shall fall within the scope of protection of the present invention.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (10)

1. A separable recurrent attention speech enhancement method, comprising:
step 1: inputting the speech signal into a pre-network unit for Fourier transform, and outputting a first amplitude stream signal and a first phase stream signal;
step 2: inputting the first amplitude stream signal and the first phase stream signal into an attention network unit for noise reduction, and outputting a second amplitude stream signal and a second phase stream signal; wherein the attention network unit comprises multiple stages of polar coordinate attention modules connected in series, each stage of polar coordinate attention module comprises an amplitude attention module, a phase self-adjusting module and a phase other-adjusting module, and the amplitude attention module comprises either two time-frequency separable recurrent network modules and an independent and identically distributed (i.i.d.) convolution module, or two channel permutation transformation modules, two time-frequency separable recurrent network modules and an i.i.d. convolution module;
step 3: inputting the second amplitude stream signal and the second phase stream signal into a post-network unit for inverse Fourier transform, and outputting an enhanced speech signal.
2. The separable recurrent attention speech enhancement method according to claim 1, characterized in that in step 2, each stage of polar coordinate attention module is used for executing the following steps:
step 2.1: processing the input amplitude stream signal with the amplitude attention module, and inputting the adjusted amplitude stream signal into the phase other-adjusting module;
step 2.2: processing the input phase stream signal with the phase self-adjusting module, and inputting the resulting self-adjusted phase stream signal into the phase other-adjusting module;
step 2.3: performing other-adjustment on the self-adjusted phase stream signal with the phase other-adjusting module based on the adjusted amplitude stream signal, and outputting an adjusted phase stream signal;
step 2.4: outputting the adjusted amplitude stream signal and the adjusted phase stream signal.
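The data flow of steps 2.1 to 2.4 can be sketched as follows. The three sub-modules are passed in as callables; the lambdas used in the example are toy stand-ins chosen only to make the flow traceable and are not the trained modules.

```python
def polar_attention_stage(amp_in, phase_in, amplitude_attention,
                          phase_self_adjust, phase_other_adjust):
    """One stage: the amplitude branch also steers the phase branch."""
    amp_out = amplitude_attention(amp_in)                  # step 2.1
    phase_self = phase_self_adjust(phase_in)               # step 2.2
    phase_out = phase_other_adjust(amp_out, phase_self)    # step 2.3
    return amp_out, phase_out                              # step 2.4

# Toy stand-ins (assumptions) that operate on flat lists of floats.
amp, ph = polar_attention_stage(
    [1.0, 2.0], [0.5, -0.5],
    amplitude_attention=lambda a: [2.0 * x for x in a],
    phase_self_adjust=lambda p: [x + 1.0 for x in p],
    phase_other_adjust=lambda a, p: [x * y for x, y in zip(a, p)],
)
```

Note the asymmetry: the amplitude stream never depends on the phase stream inside a stage, whereas the phase stream output depends on both.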
3. The separable recurrent attention speech enhancement method according to claim 2, characterized in that:
the phase self-adjusting module consists of one or more two-dimensional convolution layers;
the phase other-adjusting module comprises one or more amplitude-aware phase transformations, each of which adjusts the phase stream signal using an amplitude stream signal according to the following formula:
$$P_o = \mathrm{Conv}(A_o) \odot P_i$$
where $\mathrm{Conv}$ denotes convolution, $\odot$ denotes element-wise (dot) multiplication, $P_i$ denotes the phase stream input to the phase other-adjustment, $P_o$ denotes the phase stream output, and $A_o$ denotes the amplitude stream output that serves as input to the phase other-adjustment.
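A minimal single-channel sketch of the amplitude-aware phase transformation follows. It assumes an odd-sized kernel, zero padding that preserves the input shape, and the cross-correlation form of convolution used by deep learning libraries; none of these details is specified in the text, and the kernels below are illustrative rather than learned.

```python
def conv2d_same(x, kernel):
    """Single-channel 2-D convolution with zero padding (shape preserved)."""
    rows, cols = len(x), len(x[0])
    kr, kc = len(kernel), len(kernel[0])
    pr, pc = kr // 2, kc // 2
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for u in range(kr):
                for v in range(kc):
                    r, c = i + u - pr, j + v - pc
                    if 0 <= r < rows and 0 <= c < cols:   # zero padding
                        acc += kernel[u][v] * x[r][c]
            out[i][j] = acc
    return out

def amplitude_aware_phase_transform(a_o, p_i, kernel):
    """P_o = Conv(A_o) (element-wise product) P_i."""
    gate = conv2d_same(a_o, kernel)
    return [[g * p for g, p in zip(g_row, p_row)]
            for g_row, p_row in zip(gate, p_i)]

a_o = [[1.0, 2.0], [3.0, 4.0]]
p_i = [[0.5, -1.0], [2.0, 0.0]]
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
p_o = amplitude_aware_phase_transform(a_o, p_i, identity)
```

With the identity kernel the gate equals the amplitude stream itself, so the transformation reduces to an element-wise product of amplitude and phase streams.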
4. The separable recurrent attention speech enhancement method according to claim 2, characterized in that, in the case where the amplitude attention module comprises two channel permutation transformation modules, two time-frequency separable recurrent network modules and an independent and identically distributed (i.i.d.) convolution module, step 2.1 comprises the following steps:
step 2.1.1: inputting the input amplitude stream signal into a first channel permutation transformation module and a first time-frequency separable recurrent network module, respectively, to obtain a first permutation transformation signal and a first recurrent signal;
step 2.1.2: inputting the first permutation transformation signal into a second channel permutation transformation module, and outputting a second permutation transformation signal; multiplying the first permutation transformation signal by the first recurrent signal, inputting the product into a second time-frequency separable recurrent network module, and outputting a second recurrent signal;
step 2.1.3: concatenating the second permutation transformation signal and the second recurrent signal, inputting the concatenated signal into the i.i.d. convolution module, and outputting the adjusted amplitude stream signal.
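The wiring of steps 2.1.1 to 2.1.3 can be sketched with placeholder callables. The multiplication is assumed to be element-wise and the concatenation is modelled as joining two flat lists; the toy modules in the example are assumptions chosen so that the result can be traced by hand.

```python
from itertools import accumulate

def amplitude_attention(amp_in, perm1, perm2, rnn1, rnn2, iid_conv):
    """Two-branch amplitude attention: permutation branch and recurrent branch."""
    t1 = perm1(amp_in)                         # step 2.1.1: first permutation signal
    r1 = rnn1(amp_in)                          # step 2.1.1: first recurrent signal
    t2 = perm2(t1)                             # step 2.1.2: second permutation signal
    gated = [a * b for a, b in zip(t1, r1)]    # step 2.1.2: element-wise product
    r2 = rnn2(gated)                           # step 2.1.2: second recurrent signal
    return iid_conv(t2 + r2)                   # step 2.1.3: concatenate, then convolve

# Toy stand-ins (assumptions): reversal as "permutation", running sum as "recurrence".
result = amplitude_attention(
    [1.0, 2.0, 3.0],
    perm1=lambda x: x[::-1],
    perm2=lambda x: x[::-1],
    rnn1=lambda x: list(accumulate(x)),
    rnn2=lambda x: list(x),
    iid_conv=lambda x: list(x),
)
```

The first recurrent signal acts as a gate on the first permutation signal before the second recurrent module, which is where the attention-like behaviour of the module comes from.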
5. The separable recurrent attention speech enhancement method according to claim 1, characterized in that:
the time-frequency separable recurrent network module adopts one of the following recurrence modes: recurrence over time only, recurrence over frequency only, recurrence over time followed by recurrence over frequency, recurrence over frequency followed by recurrence over time, and recurrence over time and frequency in parallel;
the recurrence is one of a forward recurrence, a backward recurrence and a bidirectional recurrence.
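The five recurrence modes and three directions can be sketched with a toy scalar recurrent cell applied along one axis of a time-frequency grid at a time. The cell, its weights, the summation used to merge the two bidirectional passes and the summation used for the parallel mode are all assumptions made for illustration; the text does not specify them.

```python
import math

def scan(seq, direction="forward", w_in=1.0, w_rec=0.5):
    """Toy cell h_t = tanh(w_in * x_t + w_rec * h_{t-1}) over a 1-D sequence."""
    def one_way(xs):
        h, out = 0.0, []
        for x in xs:
            h = math.tanh(w_in * x + w_rec * h)
            out.append(h)
        return out
    if direction == "forward":
        return one_way(seq)
    if direction == "backward":
        return one_way(seq[::-1])[::-1]
    if direction == "bidirectional":
        fwd, bwd = one_way(seq), one_way(seq[::-1])[::-1]
        return [f + b for f, b in zip(fwd, bwd)]   # merge by sum (assumption)
    raise ValueError(f"unknown direction: {direction}")

def along_freq(grid, direction="forward"):
    """grid[t][f]: recur over frequency, independently for each time frame."""
    return [scan(row, direction) for row in grid]

def along_time(grid, direction="forward"):
    """grid[t][f]: recur over time, independently for each frequency bin."""
    columns = [scan(list(col), direction) for col in zip(*grid)]
    return [list(row) for row in zip(*columns)]

def separable_recurrence(grid, mode, direction="forward"):
    if mode == "time":
        return along_time(grid, direction)
    if mode == "freq":
        return along_freq(grid, direction)
    if mode == "time_then_freq":
        return along_freq(along_time(grid, direction), direction)
    if mode == "freq_then_time":
        return along_time(along_freq(grid, direction), direction)
    if mode == "parallel":
        t, f = along_time(grid, direction), along_freq(grid, direction)
        return [[a + b for a, b in zip(rt, rf)] for rt, rf in zip(t, f)]
    raise ValueError(f"unknown mode: {mode}")

grid = [[1.0, 0.0], [0.0, 1.0]]   # grid[t][f]
out = separable_recurrence(grid, "time_then_freq", "bidirectional")
```

Running the recurrence along one axis at a time, rather than over the full two-dimensional grid, is what makes the module "separable": each pass is a set of independent one-dimensional scans.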
6. The separable recurrent attention speech enhancement method according to claim 1, characterized in that the independent and identically distributed (i.i.d.) convolution module consists of a distribution normalization layer, a two-dimensional convolution layer and a GELU layer.
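The three-layer structure can be sketched for a single channel as follows. The type of normalization and the kernel size are not specified in the text, so this sketch assumes standardization to zero mean and unit variance over the whole grid and a 1x1 convolution (one scalar weight and bias); only the GELU definition is standard.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def standardize(grid, eps=1e-5):
    """Assumed normalization: zero mean, unit variance over all elements."""
    values = [v for row in grid for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * scale for v in row] for row in grid]

def iid_conv_block(grid, weight=1.0, bias=0.0):
    """Normalization, then a 1x1 single-channel convolution, then GELU."""
    normed = standardize(grid)
    conv = [[weight * v + bias for v in row] for row in normed]
    return [[gelu(v) for v in row] for row in conv]

y = iid_conv_block([[1.0, 3.0], [5.0, 7.0]])
```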
7. The separable recurrent attention speech enhancement method according to claim 1, characterized in that the pre-network unit comprises a short-time Fourier transform module, an amplitude convolution module and a phase convolution module, wherein
the short-time Fourier transform module is used for transforming the speech signal into short-time Fourier coefficients;
the amplitude convolution module is used for performing amplitude convolution on the signal output by the short-time Fourier transform module and outputting the first amplitude stream signal;
the phase convolution module is used for performing phase convolution on the signal output by the short-time Fourier transform module and outputting the first phase stream signal.
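The first half of the pre-network unit can be sketched as a naive short-time Fourier transform followed by a split into magnitude and phase. The frame length, hop size and Hann window are assumptions, and the learned amplitude and phase convolution modules are omitted; the two outputs below are only the raw inputs that such convolutions would receive.

```python
import cmath
import math

def stft(signal, frame_len=8, hop=4):
    """Naive STFT: Hann-windowed frames and a one-sided DFT per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        frames.append([
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ])
    return frames

def split_polar(frames):
    """Separate complex coefficients into magnitude and phase angle grids."""
    amplitude = [[abs(c) for c in frame] for frame in frames]
    phase = [[cmath.phase(c) for c in frame] for frame in frames]
    return amplitude, phase

# Test tone with a period of 4 samples, i.e. DFT bin 2 for 8-sample frames.
tone = [math.cos(2 * math.pi * 2 * n / 8) for n in range(16)]
amp, ph = split_polar(stft(tone))
```

For this tone every frame has its largest magnitude in bin 2, which gives a quick sanity check of the transform before any learned layer is attached.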
8. The separable recurrent attention speech enhancement method according to claim 1, characterized in that the post-network unit comprises an amplitude mask generator, a phase mask generator, a Fourier coefficient generator and an inverse short-time Fourier transform module, wherein
the amplitude mask generator is used for generating a single-channel amplitude signal from the second amplitude stream signal;
the phase mask generator is used for generating a two-channel phase signal from the second phase stream signal;
the Fourier coefficient generator is used for generating Fourier coefficients from the single-channel amplitude signal and the two-channel phase signal;
the inverse short-time Fourier transform module is used for outputting the enhanced speech signal according to the generated Fourier coefficients.
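One plausible reading of the Fourier coefficient generator is sketched below: the two phase channels are interpreted as unnormalized cosine and sine components, normalized to unit length, and scaled by the single-channel amplitude. This interpretation of the two channels is an assumption, the function name is hypothetical, and the inverse short-time Fourier transform is omitted.

```python
import math

def fourier_coefficients(amplitude, phase_2ch):
    """Combine one amplitude channel with (cos-like, sin-like) phase pairs."""
    coeffs = []
    for a, (c, s) in zip(amplitude, phase_2ch):
        norm = math.hypot(c, s) or 1.0   # guard against a zero-length pair
        coeffs.append(complex(a * c / norm, a * s / norm))
    return coeffs

coeffs = fourier_coefficients([1.0, 1.0], [(3.0, 4.0), (0.0, -2.0)])
```

Representing phase as a normalized two-channel pair avoids the wrap-around discontinuity of a raw angle, which is a common motivation for this kind of output layer.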
9. A separable recurrent attention speech enhancement method, comprising:
step 1: inputting the speech signal into a pre-network unit for Fourier transform, and outputting a first amplitude stream signal;
step 2: inputting the first amplitude stream signal into an attention network unit for noise reduction, and outputting a second amplitude stream signal; wherein the attention network unit comprises multiple stages of polar coordinate attention modules connected in series, each stage of polar coordinate attention module comprises an amplitude attention module, a phase self-adjusting module and a phase other-adjusting module, and the amplitude attention module comprises either two time-frequency separable recurrent network modules and an independent and identically distributed (i.i.d.) convolution module, or two channel permutation transformation modules, two time-frequency separable recurrent network modules and an i.i.d. convolution module;
step 3: performing, based on the second amplitude stream signal, an inverse Fourier transform through a post-network unit, and outputting an enhanced speech signal.
10. A separable recurrent attention speech enhancement device, comprising:
a pre-network unit, configured to perform a Fourier transform on an input speech signal and output a first amplitude stream signal and a first phase stream signal;
an attention network unit, configured to perform noise reduction on the first amplitude stream signal and the first phase stream signal, and output a second amplitude stream signal and a second phase stream signal; and
a post-network unit, configured to perform an inverse Fourier transform on the second amplitude stream signal and the second phase stream signal and output an enhanced speech signal;
wherein the attention network unit comprises multiple stages of polar coordinate attention modules connected in series, each stage of polar coordinate attention module comprises an amplitude attention module, a phase self-adjusting module and a phase other-adjusting module, and the amplitude attention module comprises two channel permutation transformation modules, two time-frequency separable recurrent network modules and an independent and identically distributed (i.i.d.) convolution module.
CN202111285653.5A 2021-11-01 2021-11-01 Voice enhancement method and device capable of separating circulating attention Active CN114023346B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111285653.5A CN114023346B (en) 2021-11-01 2021-11-01 Voice enhancement method and device capable of separating circulating attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111285653.5A CN114023346B (en) 2021-11-01 2021-11-01 Voice enhancement method and device capable of separating circulating attention

Publications (2)

Publication Number Publication Date
CN114023346A (en) 2022-02-08
CN114023346B (en) 2024-05-31

Family

ID=80059604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111285653.5A Active CN114023346B (en) 2021-11-01 2021-11-01 Voice enhancement method and device capable of separating circulating attention

Country Status (1)

Country Link
CN (1) CN114023346B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4754449A (en) * 1986-07-02 1988-06-28 Hughes Aircraft Company Wide bandwidth device for demodulating frequency division multiplexed signals
WO2011026247A1 (en) * 2009-09-04 2011-03-10 Svox Ag Speech enhancement techniques on the power spectrum
EP2905774A1 (en) * 2014-02-11 2015-08-12 JoboMusic GmbH Method for synthesizing a digital audio signal
US20210035590A1 (en) * 2019-08-02 2021-02-04 Audioshake, Inc. Deep learning segmentation of audio using magnitude spectrogram
US20210012767A1 (en) * 2020-09-25 2021-01-14 Intel Corporation Real-time dynamic noise reduction using convolutional networks
CN113241092A (en) * 2021-06-15 2021-08-10 新疆大学 Sound source separation method based on double-attention mechanism and multi-stage hybrid convolution network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
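闫昭宇; 王晶: "Single-channel speech enhancement algorithm combining a deep convolutional recurrent network and a time-frequency attention mechanism", 信号处理 (Journal of Signal Processing), no. 06, 25 June 2020 (2020-06-25) *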

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116092501A (en) * 2023-03-14 2023-05-09 澳克多普有限公司 Speech enhancement method, speech recognition method, speaker recognition method and speaker recognition system
CN116092501B (en) * 2023-03-14 2023-07-25 深圳市玮欧科技有限公司 Speech enhancement method, speech recognition method, speaker recognition method and speaker recognition system

Also Published As

Publication number Publication date
CN114023346B (en) 2024-05-31

Similar Documents

Publication Publication Date Title
Zhang et al. Deep audio priors emerge from harmonic convolutional networks
Venkataramani et al. Adaptive front-ends for end-to-end source separation
CN114141238A (en) Voice enhancement method fusing Transformer and U-net network
US11393443B2 (en) Apparatuses and methods for creating noise environment noisy data and eliminating noise
CN114023346A (en) Voice enhancement method and device capable of separating circulatory attention
Wang et al. A path signature approach for speech emotion recognition
Du et al. A joint framework of denoising autoencoder and generative vocoder for monaural speech enhancement
CN112151071A (en) Speech emotion recognition method based on mixed wavelet packet feature deep learning
Jindal et al. SpeechMix-Augmenting Deep Sound Recognition Using Hidden Space Interpolations.
Lim et al. Harmonic and percussive source separation using a convolutional auto encoder
Vuong et al. Learnable spectro-temporal receptive fields for robust voice type discrimination
Takeuchi et al. Invertible DNN-based nonlinear time-frequency transform for speech enhancement
Zhang et al. Temporal Transformer Networks for Acoustic Scene Classification.
Narayanan et al. Cross-attention conformer for context modeling in speech enhancement for ASR
Li et al. Data augmentation method for underwater acoustic target recognition based on underwater acoustic channel modeling and transfer learning
Xu et al. U-former: Improving monaural speech enhancement with multi-head self and cross attention
Dey et al. Single channel blind source separation based on variational mode decomposition and PCA
CN113593588A (en) Multi-singer singing voice synthesis method and system based on generation countermeasure network
Wang et al. Low pass filtering and bandwidth extension for robust anti-spoofing countermeasure against codec variabilities
CN116682444A (en) Single-channel voice enhancement method based on waveform spectrum fusion network
CN116469404A (en) Audio-visual cross-mode fusion voice separation method
Le et al. Personalized speech enhancement combining band-split rnn and speaker attentive module
Wang et al. Unsupervised improvement of audio-text cross-modal representations
CN111028857B (en) Method and system for reducing noise of multichannel audio-video conference based on deep learning
US9478223B2 (en) Method and apparatus for down-mixing multi-channel audio

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant