CN114023346A - Speech enhancement method and device with separable recurrent attention - Google Patents
- Publication number
- CN114023346A (application number CN202111285653.5A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention relates to a speech enhancement method with separable recurrent attention, which comprises the following steps. Step 1: input the speech signal into a front-end network unit for Fourier transform, and output a first amplitude stream signal and a first phase stream signal. Step 2: input the first amplitude stream signal and the first phase stream signal into an attention network unit for noise reduction, and output a second amplitude stream signal and a second phase stream signal; the attention network unit comprises multiple stages of polar-coordinate attention modules connected in series, and each stage comprises an amplitude attention module, a phase self-adjustment module and a phase cross-adjustment module. Step 3: input the second amplitude stream signal and the second phase stream signal into a back-end network unit for inverse Fourier transform, and output an enhanced speech signal. The invention has a small computational cost and effectively ensures the speech noise-reduction effect.
Description
Technical Field
The invention relates to a speech enhancement method and device with separable recurrent attention.
Background
Noise reduction at the speech-recognition front end, vocal extraction in audio and video production, speech purification in speech synthesis, and similar tasks all involve noise-reduction enhancement of speech signals. Existing speech noise reduction mainly takes the following forms:
SEGAN: uses UNet as the basic structure for noise reduction and adopts adversarial generation technology to bring the generated sound close to the human voice. Its drawbacks are a simple model structure, incomplete removal of complex noise, and susceptibility to mode collapse.
WAVENET: its drawbacks are a huge model, complex training, an extremely low speed (about 10 minutes of processing per 1 minute of speech), misaligned phase, and difficulty in distinguishing the human voice from musical noise containing harmonics.
TasNet: performs noise reduction with a TCN as the basic structure and adopts dilated convolution to enlarge the receptive field. Its drawbacks are that the completeness of the representation space is not guaranteed, the frequency resolution of the model is poor, and noise in segments where speech and noise overlap is not completely removed.
T-GSA: performs noise reduction with a Transformer as the basic structure and adopts a Gaussian function to constrain the receptive field locally. Its drawback is the huge computational complexity: the processing time grows as O(N²) with the speech length N.
PHASEN: the noise-reduction method most closely related to the present invention. It performs noise reduction with the TSB as the basic structure and adopts a frequency transformation block for harmonic enhancement. Although this method has a small computational cost and achieves a fairly good noise-reduction effect, its drawback is that it uses only a fixed receptive field and can therefore model only fixed harmonic correlations. In practice, deciding whether the current sound is speech or noise, and whether the current harmonic is a true or a pseudo harmonic, sometimes requires looking far ahead and behind and weighing the surrounding context; hence the noise-reduction effect is not ideal.
Disclosure of Invention
The invention aims to provide a speech enhancement method and device with separable recurrent attention that have a small computational cost and effectively ensure the speech noise-reduction effect.
Based on the same inventive concept, the invention has three independent technical schemes:
1. A speech enhancement method with separable recurrent attention, comprising:
Step 1: input the speech signal into a front-end network unit for Fourier transform, and output a first amplitude stream signal and a first phase stream signal;
Step 2: input the first amplitude stream signal and the first phase stream signal into an attention network unit for noise reduction, and output a second amplitude stream signal and a second phase stream signal; the attention network unit comprises multiple stages of polar-coordinate attention modules connected in series, each stage comprising an amplitude attention module, a phase self-adjustment module and a phase cross-adjustment module, and the amplitude attention module further comprising two channel permutation transformation modules, two time-frequency separable recurrent network modules and an independent identically-distributed convolution module;
Step 3: input the second amplitude stream signal and the second phase stream signal into a back-end network unit for inverse Fourier transform, and output an enhanced speech signal.
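The three steps above can be sketched end-to-end as follows. This is a minimal illustration, not the patented network: the STFT uses an assumed Hann window and hop size, and the attention network unit is stubbed out with an identity function, since no trainable weights are disclosed here.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Short-time Fourier transform with a Hann window (sketch)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.stack([np.fft.rfft(f) for f in frames])  # (T, F) complex

def istft(spec, n_fft=512, hop=128):
    """Inverse STFT via windowed overlap-add with squared-window normalization."""
    win = np.hanning(n_fft)
    out = np.zeros((len(spec) - 1) * hop + n_fft)
    norm = np.zeros_like(out)
    for t, frame in enumerate(spec):
        out[t * hop:t * hop + n_fft] += np.fft.irfft(frame, n_fft) * win
        norm[t * hop:t * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def enhance(x, attention=lambda mag, phase: (mag, phase)):
    """Steps 1-3: STFT -> amplitude/phase streams -> attention -> iSTFT."""
    spec = stft(x)
    mag, phase = np.abs(spec), np.angle(spec)  # first amplitude/phase streams
    mag, phase = attention(mag, phase)         # attention network unit (identity stub)
    return istft(mag * np.exp(1j * phase))     # second streams -> enhanced signal
```

With the identity stub, the interior of the signal is reconstructed exactly, which confirms that the amplitude/phase split and recombination are lossless.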
Further, in step 2, each stage of the polar-coordinate attention module is configured to perform the following steps:
Step 2.1: process the input amplitude stream signal with the amplitude attention module, and input the adjusted amplitude stream signal into the phase cross-adjustment module;
Step 2.2: process the input phase stream signal with the phase self-adjustment module, and input the resulting self-adjusted phase stream signal into the phase cross-adjustment module;
Step 2.3: cross-adjust the self-adjusted phase stream signal with the phase cross-adjustment module, based on the adjusted amplitude stream signal, and output an adjusted phase stream signal;
Step 2.4: output the adjusted amplitude stream signal and the adjusted phase stream signal.
Further, the phase self-adjustment module is formed by one or more layers of two-dimensional convolution;
the phase cross-adjustment module comprises one or more amplitude-aware phase transformations, each of which adjusts the phase stream signal using the amplitude stream signal according to the following formula:
Po = Conv(Ao) ∘ Pi
where Conv denotes convolution, ∘ denotes the element-wise (dot) product, Pi denotes the input phase stream signal, Po denotes the output phase stream signal, and Ao denotes the adjusted amplitude stream signal that serves as the cross-adjustment input.
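A sketch of this transformation, with the convolution reduced to a 1×1 (per-bin channel-mixing) kernel; the weight matrix W and all tensor shapes are assumptions for illustration:

```python
import numpy as np

def amplitude_aware_phase_transform(A_o, P_i, W):
    """Po = Conv(Ao) o Pi: project the amplitude stream through a 1x1
    convolution (channel mixing with kernel W), then modulate the phase
    stream by element-wise product.
    A_o: (Ca, F, T) amplitude stream; P_i: (Cp, F, T) phase stream;
    W: (Cp, Ca) hypothetical 1x1 kernel."""
    gain = np.einsum('pc,cft->pft', W, A_o)  # Conv(Ao) as per-bin channel mixing
    return gain * P_i                        # dot (element-wise) product
```

With W set to the identity, the phase stream is simply rescaled bin-by-bin by the amplitude stream, which is the degenerate case of the amplitude-aware adjustment.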
Further, step 2.1 comprises the following steps:
Step 2.1.1: input the amplitude stream signal into a first channel permutation transformation module and a first time-frequency separable recurrent network module respectively, obtaining a first permutation transformation signal and a first recurrent signal;
Step 2.1.2: input the first permutation transformation signal into a second channel permutation transformation module and output a second permutation transformation signal; multiply the first permutation transformation signal and the first recurrent signal, input the product into the second time-frequency separable recurrent network module, and output a second recurrent signal;
Step 2.1.3: concatenate the second permutation transformation signal and the second recurrent signal, input the result into the independent identically-distributed convolution module, and output the adjusted amplitude stream signal.
Further, the time-frequency separable recurrent network module adopts one of the following recurrence modes: time-only recurrence, frequency-only recurrence, time recurrence followed by frequency recurrence, frequency recurrence followed by time recurrence, or parallel time and frequency recurrence;
each recurrence is one of a forward recurrence, a backward recurrence and a bidirectional recurrence.
Further, the independent identically-distributed convolution module is composed of a distribution normalization layer, a two-dimensional convolution layer and a GELU layer.
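A minimal sketch of such a block, assuming a single-channel map, whole-map normalization, and the tanh approximation of GELU (all assumptions not fixed by the claim):

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def iid_conv_block(x, kernel, eps=1e-5):
    """Distribution normalization -> 2-D convolution -> GELU (sketch).
    x: (F, T) single-channel map; kernel: (kf, kt) odd-sized 2-D kernel."""
    x = (x - x.mean()) / np.sqrt(x.var() + eps)  # distribution normalization
    kf, kt = kernel.shape
    pf, pt = kf // 2, kt // 2
    xp = np.pad(x, ((pf, pf), (pt, pt)))         # "same" zero padding
    out = np.zeros_like(x)
    for i in range(x.shape[0]):                  # naive 2-D convolution
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kf, j:j + kt] * kernel)
    return gelu(out)
```

A delta kernel (all zeros except the center) turns the block into normalization followed by GELU, a convenient sanity check.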
Further, the front-end network unit comprises a short-time Fourier transform module, an amplitude convolution module and a phase convolution module:
the short-time Fourier transform module transforms the speech signal into short-time Fourier coefficients;
the amplitude convolution module performs amplitude convolution on the signal output by the short-time Fourier transform module and outputs the first amplitude stream signal;
the phase convolution module performs phase convolution on the signal output by the short-time Fourier transform module and outputs the first phase stream signal.
Further, the back-end network unit comprises an amplitude mask generator, a phase mask generator, a Fourier coefficient generator and an inverse short-time Fourier transform module:
the amplitude mask generator generates a single-channel amplitude signal from the second amplitude stream signal;
the phase mask generator generates a two-channel phase signal from the second phase stream signal;
the Fourier coefficient generator generates Fourier coefficients from the single-channel amplitude signal and the two-channel phase signal;
the inverse short-time Fourier transform module outputs the enhanced speech signal from the generated Fourier coefficients.
2. A speech enhancement method with separable recurrent attention, comprising:
Step 1: input the speech signal into a front-end network unit for Fourier transform, and output a first amplitude stream signal;
Step 2: input the first amplitude stream signal into an attention network unit for noise reduction, and output a second amplitude stream signal; the attention network unit comprises multiple stages of polar-coordinate attention modules connected in series, each stage comprising an amplitude attention module, a phase self-adjustment module and a phase cross-adjustment module, and the amplitude attention module further comprising either two time-frequency separable recurrent network modules and an independent identically-distributed convolution module, or two channel permutation transformation modules, two time-frequency separable recurrent network modules and an independent identically-distributed convolution module;
Step 3: perform inverse Fourier transform on the second amplitude stream signal through a back-end network unit, and output an enhanced speech signal.
3. A speech enhancement apparatus with separable recurrent attention, comprising:
a front-end network unit for performing Fourier transform on an input speech signal and outputting a first amplitude stream signal and a first phase stream signal;
an attention network unit for performing noise reduction on the first amplitude stream signal and the first phase stream signal and outputting a second amplitude stream signal and a second phase stream signal; and
a back-end network unit for performing inverse Fourier transform on the second amplitude stream signal and the second phase stream signal and outputting an enhanced speech signal;
wherein the attention network unit comprises multiple stages of polar-coordinate attention modules connected in series, each stage comprising an amplitude attention module, a phase self-adjustment module and a phase cross-adjustment module, and the amplitude attention module further comprising two channel permutation transformation modules, two time-frequency separable recurrent network modules and an independent identically-distributed convolution module.
The invention has the following beneficial effects:
the method comprises the steps that a voice signal is input to a preposed network unit to be subjected to Fourier transform, and a first amplitude flow signal and a first phase flow signal are output; inputting the first amplitude stream signal and the first phase stream signal into an attention network unit for noise reduction, and outputting a second amplitude stream signal and a second phase stream signal; the attention network unit is formed by connecting a plurality of stages of polar coordinate attention modules in series, and each stage of polar coordinate attention module is formed by three modules of amplitude attention, phase self-adjustment and phase other adjustment; and inputting the second amplitude stream signal and the second phase stream signal to a post-network unit for inverse Fourier transform, and outputting an enhanced voice signal. The invention adopts the attention network unit, and the attention network unit is based on the separable design idea and adopts the expanded recurrent neural network structure, so the receptive field is not fixed any more, and more complex harmonic correlation is modeled. Compared with the conventional PHASEN, the structural parameters of the invention are reduced by two orders of magnitude, the calculated amount is smaller, and the speech noise reduction effect is better compared with the conventional model including PHASEN on the aspect of 6 international evaluation indexes.
In each stage of the polar-coordinate attention module of the invention, step 2.1 processes the input amplitude stream signal with the amplitude attention module and inputs the adjusted amplitude stream signal into the phase cross-adjustment module; step 2.2 processes the input phase stream signal with the phase self-adjustment module and inputs the resulting self-adjusted phase stream signal into the phase cross-adjustment module; step 2.3 cross-adjusts the self-adjusted phase stream signal with the phase cross-adjustment module, based on the adjusted amplitude stream signal, and outputs an adjusted phase stream signal; step 2.4 outputs the adjusted amplitude stream signal and the adjusted phase stream signal. The phase self-adjustment module is formed by one or more layers of two-dimensional convolution; the phase cross-adjustment module comprises one or more amplitude-aware phase transformations, each of which adjusts the phase using the amplitude stream output. The amplitude attention module comprises channel permutation transformation modules, time-frequency separable recurrent network modules and an independent identically-distributed convolution module.
Step 2.1 comprises the following steps. Step 2.1.1: input the amplitude stream signal into a first channel permutation transformation module and a first time-frequency separable recurrent network module respectively, obtaining a first permutation transformation signal and a first recurrent signal. Step 2.1.2: input the first permutation transformation signal into a second channel permutation transformation module and output a second permutation transformation signal; multiply the first permutation transformation signal and the first recurrent signal, input the product into the second time-frequency separable recurrent network module, and output a second recurrent signal. Step 2.1.3: concatenate the second permutation transformation signal and the second recurrent signal, input the result into the independent identically-distributed convolution module, and output the adjusted amplitude stream signal. Through this recurrent-network structural design of the attention network unit, the invention further ensures a better speech noise-reduction effect.
Drawings
FIG. 1 is a flow diagram of the speech enhancement method with separable recurrent attention according to the present invention;
FIG. 2 is a flow diagram of the front-end network unit of the present invention;
FIG. 3 is a flow diagram of the back-end network unit of the present invention;
FIG. 4 is a flow diagram of the attention network unit of the present invention;
FIG. 5 is a flow diagram of a polar-coordinate attention module of the attention network unit of the present invention;
FIG. 6 is a flow diagram of the amplitude attention module of the polar-coordinate attention module of the attention network unit of the present invention.
Detailed Description
The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made by these embodiments are within the scope of the present invention.
Embodiment 1:
A speech enhancement method with separable recurrent attention.
As shown in FIG. 1, the speech enhancement method with separable recurrent attention of the present invention involves a front-end network unit, an attention network unit and a back-end network unit, and comprises the following steps:
Step 1: input the speech signal into the front-end network unit for Fourier transform, and output a first amplitude stream signal and a first phase stream signal;
Step 2: input the first amplitude stream signal and the first phase stream signal into the attention network unit for noise reduction, and output a second amplitude stream signal and a second phase stream signal; the attention network unit comprises multiple stages of polar-coordinate attention modules connected in series, each stage comprising an amplitude attention module, a phase self-adjustment module and a phase cross-adjustment module, and the amplitude attention module further comprising either two time-frequency separable recurrent network modules and an independent identically-distributed convolution module, or two channel permutation transformation modules, two time-frequency separable recurrent network modules and an independent identically-distributed convolution module; in this embodiment, the two channel permutation transformation modules are included.
Step 3: input the second amplitude stream signal and the second phase stream signal into the back-end network unit for inverse Fourier transform, and output an enhanced speech signal.
As shown in FIG. 2, the front-end network unit includes a short-time Fourier transform module for transforming the speech signal into short-time Fourier coefficients, an amplitude convolution module and a phase convolution module. The amplitude convolution module performs amplitude convolution on the signal output by the short-time Fourier transform module and outputs the first amplitude stream signal; the phase convolution module performs phase convolution on the signal output by the short-time Fourier transform module and outputs the first phase stream signal. The amplitude convolution comprises a 1 × 1 convolution and a GELU activation. The phase convolution comprises an n × n convolution without activation; note that no activation may be used here, otherwise the performance degradation is significant.
As shown in FIG. 3, the back-end network unit comprises an amplitude mask generator, a phase mask generator, a Fourier coefficient generator and an inverse short-time Fourier transform module. The amplitude mask generator generates a single-channel amplitude signal from the second amplitude stream signal; the phase mask generator generates a two-channel phase signal from the second phase stream signal; the Fourier coefficient generator generates Fourier coefficients from the single-channel amplitude signal and the two-channel phase signal; and the inverse short-time Fourier transform module outputs the enhanced speech signal from the generated Fourier coefficients. The amplitude mask generator is composed of multiple layers of two-dimensional convolution whose last layer outputs 1 channel; a layer normalization function and a GELU activation function may optionally be inserted between convolution layers, and the last convolution layer is followed by a Sigmoid activation function. The phase mask generator is composed of multiple layers of two-dimensional convolution whose last layer outputs 2 channels; there is no layer normalization or activation function between convolution layers, and the last convolution layer is followed by an amplitude normalization that makes the sum of squares of the amplitudes of the 2 channels at each time-frequency point equal to 1 (i.e. the signal carries only phase information and no amplitude information).
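The amplitude normalization at the end of the phase mask generator can be sketched as follows; the (2, F, T) channel-first layout is an assumption:

```python
import numpy as np

def phase_mask_normalize(p, eps=1e-8):
    """Normalize the two channels of each time-frequency bin to unit L2 norm,
    so the bin carries only phase information (a (cos, sin) pair) and no
    amplitude information.
    p: (2, F, T) two-channel phase map."""
    norm = np.sqrt(np.sum(p ** 2, axis=0, keepdims=True))
    return p / np.maximum(norm, eps)  # eps guards all-zero bins
```

After this step the two channels at each bin behave as (cos θ, sin θ), which is what lets the Fourier coefficient generator combine them with the single-channel amplitude signal.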
As shown in FIG. 4, the attention network unit is composed of multiple stages of polar-coordinate attention modules connected in series. As shown in FIG. 5, each stage consists of three modules: amplitude attention, phase self-adjustment and phase cross-adjustment. Each stage of the polar-coordinate attention module performs the following steps:
Step 2.1: process the input amplitude stream signal with the amplitude attention module, and input the adjusted amplitude stream signal into the phase cross-adjustment module;
Step 2.2: process the input phase stream signal with the phase self-adjustment module, and input the resulting self-adjusted phase stream signal into the phase cross-adjustment module;
Step 2.3: cross-adjust the self-adjusted phase stream signal with the phase cross-adjustment module, based on the adjusted amplitude stream signal, and output an adjusted phase stream signal;
Step 2.4: output the adjusted amplitude stream signal and the adjusted phase stream signal.
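The data flow of steps 2.1-2.4 can be sketched with the three sub-modules as interchangeable callables; their interfaces are assumptions, and the toy stand-ins in the usage line below are purely illustrative:

```python
def polar_attention_stage(a_in, p_in, amp_attn, phase_self, phase_cross):
    """One polar-coordinate attention stage (steps 2.1-2.4)."""
    a_out = amp_attn(a_in)              # step 2.1: adjust the amplitude stream
    p_self = phase_self(p_in)           # step 2.2: self-adjust the phase stream
    p_out = phase_cross(a_out, p_self)  # step 2.3: cross-adjust using the amplitude
    return a_out, p_out                 # step 2.4: output both adjusted streams
```

For example, with the toy modules `amp_attn = lambda a: a + 1`, `phase_self = lambda p: 2 * p` and `phase_cross = lambda a, p: a + p`, inputs (1, 3) flow through as a_out = 2, p_self = 6, p_out = 8, matching the stage order above; chaining stages in series reproduces the attention network unit.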
The phase self-adjustment module is formed by one or more layers of two-dimensional convolution;
the phase cross-adjustment module comprises one or more amplitude-aware phase transformations, each of which adjusts the phase stream signal using the amplitude stream signal according to the following formula:
Po = Conv(Ao) ∘ Pi
where Conv denotes convolution, ∘ denotes the element-wise (dot) product, Pi denotes the input phase stream signal, Po denotes the output phase stream signal, and Ao denotes the adjusted amplitude stream signal that serves as the cross-adjustment input.
As shown in FIG. 6, the amplitude attention module includes channel permutation transformation modules, time-frequency separable recurrent network modules and an independent identically-distributed convolution module, and step 2.1 comprises the following steps:
Step 2.1.1: input the amplitude stream signal into a first channel permutation transformation module and a first time-frequency separable recurrent network module respectively, obtaining a first permutation transformation signal and a first recurrent signal;
Step 2.1.2: input the first permutation transformation signal into a second channel permutation transformation module and output a second permutation transformation signal; multiply the first permutation transformation signal and the first recurrent signal (point-to-point multiplication of the signal matrices), input the product into the second time-frequency separable recurrent network module, and output a second recurrent signal;
Step 2.1.3: concatenate (along the channel dimension) the second permutation transformation signal and the second recurrent signal, input the result into the independent identically-distributed convolution module, and output the adjusted amplitude stream signal.
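Steps 2.1.1-2.1.3 describe a two-branch flow, a permutation branch and a recurrent branch, joined by a point-wise product and a channel concatenation. The sketch below passes the five sub-blocks in as callables; their interfaces, and the (C, F, T) layout, are assumptions:

```python
import numpy as np

def amplitude_attention(a, permute1, loop1, permute2, loop2, iid_conv):
    """Data flow of steps 2.1.1-2.1.3. a: (C, F, T) input amplitude stream."""
    p1 = permute1(a)             # step 2.1.1: first channel-permutation signal
    r1 = loop1(a)                # step 2.1.1: first recurrent signal
    p2 = permute2(p1)            # step 2.1.2: second channel-permutation signal
    r2 = loop2(p1 * r1)          # step 2.1.2: point-wise product, second recurrence
    cat = np.concatenate([p2, r2], axis=0)  # step 2.1.3: channel concatenation
    return iid_conv(cat)         # step 2.1.3: i.i.d. convolution module
```

With identity stand-ins for all five sub-blocks, the output simply doubles the channel count of the input, which makes the routing easy to verify.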
The channel permutation transformation may adopt an identity transformation, a reordering transformation, a convolution transformation, or a combination of the three. The independent identically-distributed convolution is composed of a distribution normalization layer, a two-dimensional convolution layer and a GELU layer. The GELU layer may also be replaced by an activation function such as ReLU, PReLU, ELU, Sigmoid or Softplus.
The first and second time-frequency separable recurrent network modules may adopt a time-only recurrence, a frequency-only recurrence, a time recurrence followed by a frequency recurrence, a frequency recurrence followed by a time recurrence, or parallel time and frequency recurrences, and each recurrence may be a forward, backward or bidirectional recurrence.
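One of the listed variants, a forward time recurrence followed by a forward frequency recurrence, can be sketched with a minimal tanh recurrent cell. The cell, the weight shapes and the (F, T, C) layout are all assumptions; the patent allows any recurrent cell and any of the five orderings:

```python
import numpy as np

def rnn_scan(xs, Wx, Wh, h0=None):
    """Minimal tanh recurrent cell scanned over the leading axis of xs."""
    h = np.zeros(Wh.shape[0]) if h0 is None else h0
    out = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
        out.append(h)
    return np.stack(out)

def time_then_frequency_loop(x, Wx_t, Wh_t, Wx_f, Wh_f):
    """'Time recurrence then frequency recurrence' variant.
    x: (F, T, C) feature map; the two recurrences share no weights."""
    F, T, _ = x.shape
    # forward time recurrence, one scan per frequency bin
    h = np.stack([rnn_scan(x[f], Wx_t, Wh_t) for f in range(F)])           # (F, T, H)
    # forward frequency recurrence, one scan per time step
    h = np.stack([rnn_scan(h[:, t], Wx_f, Wh_f) for t in range(T)], axis=1)  # (F, T, H)
    return h
```

A backward recurrence would scan the reversed axis, and a bidirectional recurrence would concatenate both scans along the hidden dimension; the separable structure keeps each scan one-dimensional, which is what keeps the receptive field flexible without a quadratic attention cost.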
The time-only recurrence, the frequency-only recurrence, the time-then-frequency recurrence, the frequency-then-time recurrence and the parallel time-frequency recurrence are each realized by a corresponding recurrence formula (given as images in the original description), in which ⊕ denotes channel-dimension data concatenation, Cell denotes an arbitrary recurrent cell structure, h(b,f,t) denotes the hidden state at the f-th frequency and the t-th time of the b-th speech segment, c(b,f,t) denotes the corresponding cell state, and x(b,f,t) denotes the corresponding input value.
example two:
voice enhancement method capable of separating circulatory attention
The method comprises the following steps:
step 1: inputting the voice signal into a preposed network unit for Fourier transform, and outputting a first amplitude flow signal;
step 2: inputting the first amplitude flow signal into an attention network unit for noise reduction, and outputting a second amplitude flow signal; the attention network unit comprises a plurality of stages of polar coordinate attention modules which are connected in series, wherein each stage of polar coordinate attention module comprises an amplitude attention module, a phase self-adjusting module and a phase adjusting module, and the amplitude attention module further comprises two time-frequency separable cyclic network modules and an independent same-distribution convolution module; or comprises two channel permutation and transformation modules, two time-frequency separable circulation network modules and an independent same-distribution convolution module;
and step 3: and based on the second amplitude flow signal, performing inverse Fourier transform through a post-network unit, and outputting an enhanced voice signal.
The difference between the second embodiment and the first embodiment is that the noise reduction is performed only on the first amplitude stream signal, and the noise reduction is not performed on the phase stream signal, which is slightly weaker than the first embodiment. The rest of the working principle is the same as that of the first embodiment.
Embodiment 3:
A speech enhancement device with separable recurrent attention.
The device comprises:
a front-end network unit for performing Fourier transform on an input speech signal and outputting a first amplitude stream signal and a first phase stream signal;
an attention network unit for performing noise reduction on the first amplitude stream signal and the first phase stream signal and outputting a second amplitude stream signal and a second phase stream signal; and
a back-end network unit for performing inverse Fourier transform on the second amplitude stream signal and the second phase stream signal and outputting an enhanced speech signal;
wherein the attention network unit comprises multiple stages of polar-coordinate attention modules connected in series, each stage comprising an amplitude attention module, a phase self-adjustment module and a phase cross-adjustment module, and the amplitude attention module further comprising two channel permutation transformation modules, two time-frequency separable recurrent network modules and an independent identically-distributed convolution module.
The detailed description above is only a specific description of possible embodiments of the present invention and is not intended to limit the scope of the invention; equivalent embodiments or modifications made without departing from the technical spirit of the present invention shall fall within its scope.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Claims (10)
1. A speech enhancement method with separable recurrent attention, comprising:
step 1: inputting a speech signal to a pre-network unit for Fourier transform, and outputting a first amplitude-stream signal and a first phase-stream signal;
step 2: inputting the first amplitude-stream signal and the first phase-stream signal to an attention network unit for noise reduction, and outputting a second amplitude-stream signal and a second phase-stream signal; the attention network unit comprising multiple stages of polar-coordinate attention modules connected in series, each stage comprising an amplitude attention module, a phase self-adjustment module and a phase cross-adjustment module, wherein the amplitude attention module further comprises either two time-frequency separable recurrent network modules and an independent identically-distributed convolution module, or two channel permutation-transform modules, two time-frequency separable recurrent network modules and an independent identically-distributed convolution module;
step 3: inputting the second amplitude-stream signal and the second phase-stream signal to a post-network unit for inverse Fourier transform, and outputting an enhanced speech signal.
2. The speech enhancement method with separable recurrent attention according to claim 1, wherein in step 2 each stage of the polar-coordinate attention module performs the following steps:
step 2.1: processing the input amplitude-stream signal with the amplitude attention module, and feeding the adjusted amplitude-stream signal to the phase cross-adjustment module;
step 2.2: processing the input phase-stream signal with the phase self-adjustment module, and feeding the resulting self-adjusted phase-stream signal to the phase cross-adjustment module;
step 2.3: adjusting the self-adjusted phase-stream signal with the phase cross-adjustment module, based on the adjusted amplitude-stream signal, and outputting an adjusted phase-stream signal;
step 2.4: outputting the adjusted amplitude-stream signal and the adjusted phase-stream signal.
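The dataflow of steps 2.1–2.4 can be sketched with toy callables standing in for the three sub-modules; the `tanh` gate and the element-wise product are illustrative assumptions, not the claimed networks:

```python
import numpy as np

def polar_attention_stage(mag, phase, amp_attn, phase_self, phase_cross):
    # steps 2.1 - 2.4 of one polar-coordinate attention stage
    mag_adj = amp_attn(mag)                          # 2.1 amplitude attention
    phase_selfadj = phase_self(phase)                # 2.2 phase self-adjustment
    phase_adj = phase_cross(mag_adj, phase_selfadj)  # 2.3 phase cross-adjustment
    return mag_adj, phase_adj                        # 2.4 output both streams

# hypothetical stand-ins for the three sub-modules
amp_attn = np.tanh
phase_self = lambda p: p
phase_cross = lambda a, p: a * p     # element-wise gating, as in claim 3

mag, phase = np.ones((4, 5)), np.full((4, 5), 0.5)
for _ in range(3):                   # several stages connected in series
    mag, phase = polar_attention_stage(mag, phase, amp_attn, phase_self, phase_cross)
```

Chaining the stage in a loop mirrors the "multiple stages connected in series" of the attention network unit.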
3. The speech enhancement method with separable recurrent attention according to claim 2, wherein:
the phase self-adjustment module is formed by one or more layers of two-dimensional convolution;
the phase cross-adjustment module comprises one or more amplitude-aware phase transforms, each of which adjusts the phase-stream signal using the amplitude-stream signal according to the formula
P_o = Conv(A_o) ⊙ P_i
where Conv denotes convolution, ⊙ denotes the element-wise (Hadamard) product, P_i denotes the phase-stream signal input to the phase cross-adjustment module, P_o denotes the phase-stream output, and A_o denotes the adjusted amplitude-stream signal serving as the other input to the phase cross-adjustment module.
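A worked instance of the transform, using a naive zero-padded 2-D convolution with an averaging kernel as a hypothetical stand-in for the learned Conv (with a symmetric kernel, convolution and correlation coincide, so no kernel flip is needed):

```python
import numpy as np

def conv2d_same(x, k):
    # naive zero-padded 'same' 2-D convolution; the kernel is symmetric here,
    # so this correlation-style loop equals a true convolution
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def amplitude_aware_phase(A_o, P_i, kernel):
    # P_o = Conv(A_o) ⊙ P_i: the adjusted amplitude stream gates the phase stream
    return conv2d_same(A_o, kernel) * P_i

A_o = np.ones((3, 3))                 # toy adjusted amplitude stream
P_i = np.full((3, 3), 2.0)            # toy phase-stream input
k = np.full((3, 3), 1.0 / 9.0)        # averaging kernel as a hypothetical Conv
P_o = amplitude_aware_phase(A_o, P_i, k)
```

At the center the averaged amplitude is 1, so the phase value passes through unchanged; at the corners zero padding attenuates it.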
4. The speech enhancement method with separable recurrent attention according to claim 2, wherein, in the case where the amplitude attention module comprises two channel permutation-transform modules, two time-frequency separable recurrent network modules and an independent identically-distributed convolution module, said step 2.1 comprises:
step 2.1.1: inputting the input amplitude-stream signal to a first channel permutation-transform module and to a first time-frequency separable recurrent network module, obtaining a first permutation-transform signal and a first recurrent signal respectively;
step 2.1.2: inputting the first permutation-transform signal to a second channel permutation-transform module and outputting a second permutation-transform signal; multiplying the first permutation-transform signal by the first recurrent signal element-wise, inputting the product to a second time-frequency separable recurrent network module, and outputting a second recurrent signal;
step 2.1.3: concatenating the second permutation-transform signal and the second recurrent signal, inputting the result to the independent identically-distributed convolution module, and outputting the adjusted amplitude-stream signal.
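Steps 2.1.1–2.1.3 amount to a two-branch dataflow, a permutation branch and a gated recurrent branch, fused by the final convolution. A sketch with hypothetical stand-in callables (the reversal, `tanh` and channel-mean below are illustrative only):

```python
import numpy as np

def amplitude_attention(x, perm1, rnn1, perm2, rnn2, iid_conv):
    # steps 2.1.1 - 2.1.3 of the two-branch amplitude attention module
    p1, r1 = perm1(x), rnn1(x)        # 2.1.1: two parallel branches
    p2 = perm2(p1)                    # 2.1.2: second permutation-transform
    r2 = rnn2(p1 * r1)                #        gated second recurrence
    fused = np.concatenate([p2, r2], axis=0)   # 2.1.3: concatenate channels
    return iid_conv(fused)

x = np.ones((2, 4, 5))                # toy (channel, time, freq) amplitude stream
out = amplitude_attention(
    x,
    perm1=lambda t: t[::-1],          # hypothetical channel permutation
    rnn1=np.tanh,                     # stand-in for a recurrent module
    perm2=lambda t: t[::-1],
    rnn2=np.tanh,
    iid_conv=lambda t: t.mean(axis=0, keepdims=True))  # stand-in fusion conv
```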
5. The speech enhancement method with separable recurrent attention according to claim 1, wherein:
the time-frequency separable recurrent network module adopts one of the following recurrence modes: recurrence over time only, recurrence over frequency only, recurrence over time then over frequency, recurrence over frequency then over time, and parallel recurrence over time and frequency;
each recurrence is one of a forward recurrence, a backward recurrence and a bidirectional recurrence.
6. The speech enhancement method with separable recurrent attention according to claim 1, wherein the independent identically-distributed convolution module consists of a distribution normalization layer, a two-dimensional convolution layer and a GELU activation layer.
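A sketch of the three-layer module; global mean/variance normalization is assumed for the "distribution normalization" layer, and the GELU uses the common tanh approximation:

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def iid_conv_module(x, kernel, eps=1e-5):
    # distribution normalization -> two-dimensional convolution -> GELU
    x = (x - x.mean()) / (x.std() + eps)     # assumed global normalization
    kh, kw = kernel.shape
    H, W = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.array([[np.sum(x[i:i + kh, j:j + kw] * kernel)   # 'valid' conv
                   for j in range(W)] for i in range(H)])
    return gelu(y)

rng = np.random.default_rng(1)
y = iid_conv_module(rng.standard_normal((5, 5)), np.ones((2, 2)))
```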
7. The speech enhancement method with separable recurrent attention according to claim 1, wherein the pre-network unit comprises a short-time Fourier transform module, an amplitude convolution module and a phase convolution module, wherein:
the short-time Fourier transform module is configured to transform the speech signal into short-time Fourier coefficients;
the amplitude convolution module is configured to perform amplitude convolution on the output of the short-time Fourier transform module and output the first amplitude-stream signal;
the phase convolution module is configured to perform phase convolution on the output of the short-time Fourier transform module and output the first phase-stream signal.
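The pre-network split can be sketched as an STFT followed by taking magnitudes for the amplitude stream and (cos, sin) pairs for a two-channel phase stream; the learned convolution modules of the claim are omitted, so this shows only the stream layout:

```python
import numpy as np

def pre_network(x, n_fft=128, hop=64):
    # STFT, then split into an amplitude stream and a two-channel phase stream
    win = np.hanning(n_fft)
    frames = np.array([x[i:i + n_fft] * win
                       for i in range(0, len(x) - n_fft + 1, hop)])
    spec = np.fft.rfft(frames, axis=-1)                      # (T, F) complex
    amp_stream = np.abs(spec)                                # amplitude stream
    phase = np.angle(spec)
    phase_stream = np.stack([np.cos(phase), np.sin(phase)])  # (2, T, F)
    return amp_stream, phase_stream

x = np.sin(2 * np.pi * 1000 * np.arange(1024) / 16000)
amp, ph = pre_network(x)   # amp: (15, 65), ph: (2, 15, 65)
```

Encoding the phase as a unit-norm (cos, sin) pair avoids the 2π wrap-around discontinuity of a raw angle stream.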
8. The speech enhancement method with separable recurrent attention according to claim 1, wherein the post-network unit comprises an amplitude mask generator, a phase mask generator, a Fourier coefficient generator and an inverse short-time Fourier transform module, wherein:
the amplitude mask generator is configured to generate a single-channel amplitude signal from the second amplitude-stream signal;
the phase mask generator is configured to generate a two-channel phase signal from the second phase-stream signal;
the Fourier coefficient generator is configured to generate Fourier coefficients from the single-channel amplitude signal and the two-channel phase signal;
the inverse short-time Fourier transform module is configured to output the enhanced speech signal from the generated Fourier coefficients.
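A sketch of this back half; the mask generators are reduced to pass-through, and renormalizing the two-channel phase pair before forming the Fourier coefficients is an assumption:

```python
import numpy as np

def post_network(amp_stream, phase_stream, n_fft=128, hop=64):
    # rebuild Fourier coefficients from the two streams, then inverse STFT
    cos_p, sin_p = phase_stream                  # two-channel phase signal
    norm = np.sqrt(cos_p ** 2 + sin_p ** 2) + 1e-8   # renormalize the pair
    coeff = amp_stream * (cos_p + 1j * sin_p) / norm  # Fourier coefficients
    frames = np.fft.irfft(coeff, n=n_fft, axis=-1)
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    wsum = np.zeros_like(out)
    for t, f in enumerate(frames):               # weighted overlap-add
        out[t * hop:t * hop + n_fft] += f * win
        wsum[t * hop:t * hop + n_fft] += win ** 2
    return out / np.maximum(wsum, 1e-8)

amp = np.ones((5, 65))
ph = np.stack([np.ones((5, 65)), np.zeros((5, 65))])  # phase = 0 everywhere
y = post_network(amp, ph)
```

Because the phase pair is renormalized, only its direction matters; scaling both channels leaves the output unchanged.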
9. A speech enhancement method with separable recurrent attention, comprising:
step 1: inputting a speech signal to a pre-network unit for Fourier transform, and outputting a first amplitude-stream signal;
step 2: inputting the first amplitude-stream signal to an attention network unit for noise reduction, and outputting a second amplitude-stream signal; the attention network unit comprising multiple stages of polar-coordinate attention modules connected in series, each stage comprising an amplitude attention module, a phase self-adjustment module and a phase cross-adjustment module, wherein the amplitude attention module further comprises either two time-frequency separable recurrent network modules and an independent identically-distributed convolution module, or two channel permutation-transform modules, two time-frequency separable recurrent network modules and an independent identically-distributed convolution module;
step 3: based on the second amplitude-stream signal, performing an inverse Fourier transform through a post-network unit and outputting an enhanced speech signal.
10. A speech enhancement device with separable recurrent attention, comprising:
a pre-network unit, configured to perform a Fourier transform on an input speech signal and output a first amplitude-stream signal and a first phase-stream signal;
an attention network unit, configured to perform noise reduction on the first amplitude-stream signal and the first phase-stream signal, and output a second amplitude-stream signal and a second phase-stream signal; and
a post-network unit, configured to perform an inverse Fourier transform on the second amplitude-stream signal and the second phase-stream signal, and output an enhanced speech signal;
wherein the attention network unit comprises multiple stages of polar-coordinate attention modules connected in series, each stage comprising an amplitude attention module, a phase self-adjustment module and a phase cross-adjustment module, and the amplitude attention module further comprises two channel permutation-transform modules, two time-frequency separable recurrent network modules and an independent identically-distributed convolution module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111285653.5A CN114023346B (en) | 2021-11-01 | 2021-11-01 | Voice enhancement method and device capable of separating circulating attention |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114023346A true CN114023346A (en) | 2022-02-08 |
CN114023346B CN114023346B (en) | 2024-05-31 |
Family
ID=80059604
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111285653.5A Active CN114023346B (en) | 2021-11-01 | 2021-11-01 | Voice enhancement method and device capable of separating circulating attention |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114023346B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4754449A (en) * | 1986-07-02 | 1988-06-28 | Hughes Aircraft Company | Wide bandwidth device for demodulating frequency division multiplexed signals |
WO2011026247A1 (en) * | 2009-09-04 | 2011-03-10 | Svox Ag | Speech enhancement techniques on the power spectrum |
EP2905774A1 (en) * | 2014-02-11 | 2015-08-12 | JoboMusic GmbH | Method for synthesizing a digital audio signal |
US20210012767A1 (en) * | 2020-09-25 | 2021-01-14 | Intel Corporation | Real-time dynamic noise reduction using convolutional networks |
US20210035590A1 (en) * | 2019-08-02 | 2021-02-04 | Audioshake, Inc. | Deep learning segmentation of audio using magnitude spectrogram |
CN113241092A (en) * | 2021-06-15 | 2021-08-10 | 新疆大学 | Sound source separation method based on double-attention mechanism and multi-stage hybrid convolution network |
Non-Patent Citations (1)
Title |
---|
Yan Zhaoyu; Wang Jing: "Single-channel speech enhancement algorithm combining a deep convolutional recurrent network with a time-frequency attention mechanism", Journal of Signal Processing, no. 06, 25 June 2020 (2020-06-25) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116092501A (en) * | 2023-03-14 | 2023-05-09 | 澳克多普有限公司 | Speech enhancement method, speech recognition method, speaker recognition method and speaker recognition system |
CN116092501B (en) * | 2023-03-14 | 2023-07-25 | 深圳市玮欧科技有限公司 | Speech enhancement method, speech recognition method, speaker recognition method and speaker recognition system |
Also Published As
Publication number | Publication date |
---|---|
CN114023346B (en) | 2024-05-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | Deep audio priors emerge from harmonic convolutional networks | |
Venkataramani et al. | Adaptive front-ends for end-to-end source separation | |
CN114141238A (en) | Voice enhancement method fusing Transformer and U-net network | |
US11393443B2 (en) | Apparatuses and methods for creating noise environment noisy data and eliminating noise | |
CN114023346A (en) | Voice enhancement method and device capable of separating circulatory attention | |
Wang et al. | A path signature approach for speech emotion recognition | |
Du et al. | A joint framework of denoising autoencoder and generative vocoder for monaural speech enhancement | |
CN112151071A (en) | Speech emotion recognition method based on mixed wavelet packet feature deep learning | |
Jindal et al. | SpeechMix-Augmenting Deep Sound Recognition Using Hidden Space Interpolations. | |
Lim et al. | Harmonic and percussive source separation using a convolutional auto encoder | |
Vuong et al. | Learnable spectro-temporal receptive fields for robust voice type discrimination | |
Takeuchi et al. | Invertible DNN-based nonlinear time-frequency transform for speech enhancement | |
Zhang et al. | Temporal Transformer Networks for Acoustic Scene Classification. | |
Narayanan et al. | Cross-attention conformer for context modeling in speech enhancement for ASR | |
Li et al. | Data augmentation method for underwater acoustic target recognition based on underwater acoustic channel modeling and transfer learning | |
Xu et al. | U-former: Improving monaural speech enhancement with multi-head self and cross attention | |
Dey et al. | Single channel blind source separation based on variational mode decomposition and PCA | |
CN113593588A (en) | Multi-singer singing voice synthesis method and system based on generation countermeasure network | |
Wang et al. | Low pass filtering and bandwidth extension for robust anti-spoofing countermeasure against codec variabilities | |
CN116682444A (en) | Single-channel voice enhancement method based on waveform spectrum fusion network | |
CN116469404A (en) | Audio-visual cross-mode fusion voice separation method | |
Le et al. | Personalized speech enhancement combining band-split rnn and speaker attentive module | |
Wang et al. | Unsupervised improvement of audio-text cross-modal representations | |
CN111028857B (en) | Method and system for reducing noise of multichannel audio-video conference based on deep learning | |
US9478223B2 (en) | Method and apparatus for down-mixing multi-channel audio |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||