CN117558288A - Training method, device, equipment and storage medium for a single-channel speech enhancement model


Info

Publication number
CN117558288A
Authority
CN
China
Prior art keywords
phase, spectrum, amplitude, frequency, time
Prior art date
Legal status
Granted
Application number
CN202311511028.7A
Other languages
Chinese (zh)
Other versions
CN117558288B (en)
Inventor
杨柳
毛忌
翁士龙
周昱彬
Current Assignee
Guangzhou University
Original Assignee
Guangzhou University
Priority date
Filing date
Publication date
Application filed by Guangzhou University
Priority to CN202311511028.7A
Publication of CN117558288A
Application granted
Publication of CN117558288B
Status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L21/0232 Processing in the frequency domain
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a training method, apparatus, device and storage medium for a single-channel speech enhancement model, wherein the method comprises the following steps: step 1, generating noisy spectra; step 2, calculating an initial amplitude spectrum and an initial phase spectrum, and converting the initial phase spectrum into a differential square phase spectrum; step 3, inputting the initial amplitude spectrum sequentially into an amplitude encoder and an amplitude module to obtain a decomposed time-frequency attention feature, and inputting the differential square phase spectrum into a phase encoder and a phase module to obtain a differential square phase feature; step 4, letting the decomposed time-frequency attention feature and the differential square phase feature interact to obtain an interacted time-frequency feature and an interacted phase feature; step 5, inputting the interacted time-frequency feature into a mask decoder to obtain an enhanced amplitude spectrum, inputting the interacted phase feature into a phase decoder to obtain an enhanced phase spectrum, and calculating the total loss to update the model parameters; and step 6, iterating over several epochs, repeating steps 2 to 5 batch by batch within each epoch.

Description

Training method, device, equipment and storage medium for a single-channel speech enhancement model
Technical Field
The present invention relates to the field of speech enhancement technologies, and in particular, to a training method, apparatus, device, and storage medium for a single-channel speech enhancement model.
Background
In real-world scenes, speech signals are often corrupted by surrounding noise. Such interference degrades devices and applications such as hearing aids, mobile-phone communication and online video conferencing, and speech enhancement technology addresses the problem of speech signals being contaminated during communication.
In the related art, single-channel speech enhancement based on deep learning models is currently the mainstream approach and falls into two families: enhancement in the time-frequency domain and enhancement in the time domain. Time-frequency-domain enhancement is widely used because of its interpretability, but it still has the following shortcomings: the phase spectrum exhibits a random, complex pattern from which a neural network model can hardly extract effective high-level features; and existing models focus on capturing the long-range context dependencies of the speech signal while ignoring the time-frequency distribution information of the speech spectrum.
Analyzing the state of development of the field, the prior art therefore lacks a single-channel speech enhancement model that produces a phase representation with regular structure and texture and that captures the time-frequency distribution information of the speech spectrum.
Disclosure of Invention
The invention aims to provide a training method, apparatus, device and storage medium for a single-channel speech enhancement model, so as to solve the above problems in the prior art.
According to a first aspect of the embodiments of the present invention, there is provided a training method for a single-channel speech enhancement model, including:
step 1, generating a set of noisy spectra for training;
step 2, acquiring a noisy spectrum, calculating the initial amplitude spectrum and initial phase spectrum corresponding to the noisy spectrum, and converting the initial phase spectrum into a differential square phase spectrum;
step 3, inputting the initial amplitude spectrum sequentially into the amplitude encoder and amplitude module of the amplitude branch to obtain a decomposed time-frequency attention feature, and inputting the differential square phase spectrum into the phase encoder and phase module of the phase branch to obtain a differential square phase feature;
step 4, obtaining an interacted time-frequency feature and an interacted phase feature through interaction computation between the decomposed time-frequency attention feature and the differential square phase feature;
step 5, inputting the interacted time-frequency feature into the mask decoder of the amplitude branch to obtain an enhanced amplitude spectrum, inputting the interacted phase feature into the phase decoder of the phase branch to obtain an enhanced phase spectrum, calculating the total loss from the enhanced amplitude spectrum and the enhanced phase spectrum, and updating the model parameters;
and step 6, performing several epochs of training, repeating steps 2 to 5 batch by batch within each epoch, and obtaining the trained single-channel speech enhancement model once the epochs are finished.
According to a second aspect of the embodiments of the present invention, there is provided a training apparatus for a single-channel speech enhancement model, including:
a training set generation module for generating a set of noisy spectra for training;
a phase spectrum generation module for acquiring the noisy spectrum, calculating the initial amplitude spectrum and initial phase spectrum corresponding to the noisy spectrum, and converting the initial phase spectrum into a differential square phase spectrum;
a dual-branch encoding module for sequentially inputting the initial amplitude spectrum into the amplitude encoder and amplitude module of the amplitude branch to obtain a decomposed time-frequency attention feature, and inputting the differential square phase spectrum into the phase encoder and phase module of the phase branch to obtain a differential square phase feature;
an interaction calculation module for obtaining an interacted time-frequency feature and an interacted phase feature through interaction computation between the decomposed time-frequency attention feature and the differential square phase feature;
a dual-branch decoding module for inputting the interacted time-frequency feature into the mask decoder of the amplitude branch to obtain an enhanced amplitude spectrum, inputting the interacted phase feature into the phase decoder of the phase branch to obtain an enhanced phase spectrum, calculating the total loss from the enhanced amplitude and phase spectra, and updating the model parameters;
and a training iteration module for performing several epochs of training, repeatedly invoking the steps in the phase spectrum generation module, the dual-branch encoding module, the interaction calculation module and the dual-branch decoding module batch by batch in each epoch for iterative training, and obtaining a trained single-channel speech enhancement model after the epochs are finished.
According to a third aspect of the embodiments of the present invention, there is provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when executed by the processor, the program implements the steps of the training method for the single-channel speech enhancement model provided in the first aspect of the present disclosure.
According to a fourth aspect of the embodiments of the present invention, there is provided a computer-readable storage medium on which a program for implementing information transfer is stored; when executed by a processor, the program implements the steps of the training method for the single-channel speech enhancement model provided in the first aspect of the present disclosure.
The technical scheme provided by the embodiments of the invention has the following beneficial effects: an amplitude-phase parallel dual-branch structure handles the single-channel speech enhancement problem; the amplitude module in the amplitude branch captures effective time-frequency distribution information, and before phase-branch processing the randomly complex initial phase spectrum is converted into a differential square phase spectrum, yielding phase information with more salient structure and texture.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
For a clearer description of the solutions in one or more embodiments of this specification or in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are clearly only some of the embodiments in this specification; from them, a person skilled in the art can obtain other drawings without inventive effort.
FIG. 1 is a flow chart of a training method of a single-channel speech enhancement model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a differential square phase spectrum of an embodiment of the present invention;
FIG. 3 is a schematic diagram of an exploded time-frequency attention block of an embodiment of the present invention;
FIG. 4 is a schematic diagram of an iterative variation of total loss for an embodiment of the present invention;
FIG. 5 is a schematic diagram of a dual-branch model structure according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a training device for a single channel speech enhancement model according to an embodiment of the present invention;
fig. 7 is a schematic diagram of an electronic device according to an embodiment of the invention.
Detailed Description
To help those skilled in the art better understand the technical solutions in one or more embodiments of this specification, the technical solutions will be described clearly and completely below with reference to the drawings in one or more embodiments of this specification. The described embodiments are clearly only some, not all, of the embodiments of this specification. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without inventive effort shall fall within the scope of the present disclosure.
Method embodiment
According to an embodiment of the present invention, a training method for a single-channel speech enhancement model is provided. FIG. 1 is a flowchart of the training method according to the embodiment of the invention; as shown in FIG. 1, the method specifically includes the following steps.
In step S110, a set of noisy spectra for training is generated. Specifically:
In this embodiment the VoiceBank-DEMAND dataset is used. All audio in the dataset is downsampled to 16 kHz noisy audio, which is split into a training set and a test set; the set of noisy spectra for training is generated by applying a short-time Fourier transform to the 16 kHz noisy audio in the training set.
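A minimal sketch of this step in Python with PyTorch/torchaudio follows; the FFT size, hop length and Hann window are illustrative choices that the text does not specify:

import torch
import torchaudio

def make_noisy_spectrum(wav_path, n_fft=400, hop_length=100):
    wav, sr = torchaudio.load(wav_path)                    # (channels, samples)
    wav = torchaudio.functional.resample(wav, sr, 16000)   # downsample to 16 kHz
    window = torch.hann_window(n_fft)
    # Complex noisy spectrum of shape (channels, F, T)
    return torch.stft(wav, n_fft=n_fft, hop_length=hop_length,
                      window=window, return_complex=True)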
In step S120, a noisy spectrum is acquired, the initial amplitude spectrum and initial phase spectrum corresponding to it are calculated, and the initial phase spectrum is converted into a differential square phase spectrum. Specifically:
The initial amplitude spectrum is obtained as the absolute value of the noisy spectrum, and the initial phase spectrum as the phase angle (arctangent) of the noisy spectrum;
The initial phase spectrum is unwrapped along the frequency axis; within each time frame, the phases at adjacent frequency bins (the lower and the higher frequency point) are differenced to obtain the differential phase, and the differential phase is squared to obtain the differential square phase spectrum. FIG. 2 is a schematic diagram of the differential square phase spectrum in the embodiment of the invention, comparing it with the initial phase spectrum and the unwrapped phase spectrum.
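A minimal NumPy sketch of this conversion follows; reading the "unfolding" as phase unwrapping and the difference as being taken between adjacent frequency bins within each frame is an interpretation of the text, not a statement of the exact implementation:

import numpy as np

def differential_square_phase(spec):
    """spec: complex noisy spectrum of shape (F, T)."""
    magnitude = np.abs(spec)              # initial amplitude spectrum
    phase = np.angle(spec)                # initial phase spectrum (arctangent)
    unwrapped = np.unwrap(phase, axis=0)  # unfold along the frequency axis
    diff = np.diff(unwrapped, axis=0)     # difference of adjacent frequency bins
    return magnitude, diff ** 2           # differential square phase spectrum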
In step S130, the initial amplitude spectrum is sequentially input into the amplitude encoder and amplitude module of the amplitude branch to obtain the decomposed time-frequency attention feature, and the differential square phase spectrum is input into the phase encoder and phase module of the phase branch to obtain the differential square phase feature. Specifically:
The initial amplitude spectrum is input into an amplitude encoder consisting of two convolution layers: the first comprises a dilated convolution with dilation rate 1 and kernel (7, 1), batch normalization and a ReLU; the second comprises a dilated convolution with dilation rate 2 and kernel (1, 7), batch normalization and a ReLU. The encoder extracts a high-level amplitude feature; its structure is shown in Table 1:
TABLE 1 Structure of the amplitude encoder

Network layer | Output feature map size
Input layer   | (1, F, T)
Conv2D (1×7)  | (96, F, T)
Conv2D (7×1)  | (96, F, T)
Output layer  | (96, F, T)
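A sketch of the amplitude encoder in PyTorch is given below. The kernel order follows the prose (Table 1 lists the two kernels in the opposite order), the 96 channels come from Table 1, and "same" padding is assumed so that F and T are preserved:

import torch.nn as nn

class MagnitudeEncoder(nn.Module):
    """Two dilated Conv2D layers, each followed by batch normalization and ReLU."""
    def __init__(self, channels=96):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=(7, 1), dilation=1, padding=(3, 0)),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=(1, 7), dilation=2, padding=(0, 6)),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )

    def forward(self, mag):      # mag: (B, 1, F, T)
        return self.net(mag)     # (B, 96, F, T)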
The high-level amplitude feature is input into an amplitude module composed of two decomposed time-frequency attention blocks and three convolution layers, the two attention blocks sitting at the front and the rear of the three convolution layers. The amplitude module decomposes the high-level amplitude feature into a time channel attention vector t along the time axis and a frequency channel attention vector f along the frequency axis. Each decomposed time-frequency attention block contains an average pooling layer followed by two one-dimensional dilated convolution layers with dilation rates 1 and 2, kernel size 1, and Sigmoid and ReLU activation functions. The vectors t and f are taken from the output of the last decomposed time-frequency attention block. FIG. 3 is a schematic diagram of the decomposed time-frequency attention block in the embodiment of the invention.
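One possible reading of the decomposed time-frequency attention block is sketched below, including the outer product described in the next paragraph. The pooling axes, the activation order (ReLU then Sigmoid, the usual squeeze-excitation pattern) and the vector shapes are assumptions where the text is ambiguous:

import torch.nn as nn

class DTFABlock(nn.Module):
    """Average pooling followed by two kernel-size-1 dilated Conv1d layers per axis."""
    def __init__(self, channels=96):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=1, dilation=1),
                nn.ReLU(),
                # dilation has no effect on a kernel-size-1 convolution
                nn.Conv1d(channels, 1, kernel_size=1, dilation=2),
                nn.Sigmoid(),
            )
        self.time_branch = branch()
        self.freq_branch = branch()

    def forward(self, x):                    # x: (B, C, F, T)
        t = self.time_branch(x.mean(dim=2))  # pool over frequency -> (B, 1, T)
        f = self.freq_branch(x.mean(dim=3))  # pool over time      -> (B, 1, F)
        return f.transpose(1, 2) @ t         # outer product, TFA: (B, F, T)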
The outer product of the time channel attention vector t and the frequency channel attention vector f gives the time-frequency attention matrix TFA. The mask-division method then computes a mask for each interval of the TFA matrix, i.e. a set of masks is generated from TFA under different thresholds; each mask is multiplied by the values at the corresponding positions of the TFA matrix to obtain higher-order decomposed time-frequency attention features, which are concatenated and convolved to yield the Decomposed Time-Frequency Attention (DTFA) feature matrix;
the mask of each interval corresponding to the time-frequency attention moment array is calculated by a mask dividing method specifically comprises the following steps: setting a decomposition parameter c=1/n of a division section, wherein n represents the total length of the division, n is 500 in the embodiment, and n is 20 in fig. 3, and the first division section is (0, c), the (i+1) th section is larger than the (i) th section by c, obtaining a plurality of sections for dividing the time-frequency attention matrix according to the decomposition parameter, and calculating information values of the sections according to the decomposition parameter; if the information value is larger than the value of the corresponding interval of the time-frequency attention moment array, setting the value of the mask to 0, otherwise setting the value of the mask to 1, and carrying out mask calculation through a formula 1:
wherein m is i Represents the ith mask, m i (j, k) represents the (j, k) th element of the ith mask, TFA represents the original time-frequency attention matrix, TFA (j, k) represents the (j, k) th element of the original time-frequency attention matrix.
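A sketch of the mask-division step of Equation 1 follows; taking the information value v_i to be the upper bound i·c of the i-th interval is an assumption, since the text does not define it explicitly:

import torch

def decompose_tfa(tfa, n=500):
    """tfa: time-frequency attention matrix of shape (B, F, T)."""
    c = 1.0 / n
    masked = []
    for i in range(1, n + 1):
        v_i = i * c                          # assumed information value of interval i
        mask = (tfa >= v_i).float()          # m_i(j,k) = 0 where v_i > TFA(j,k)
        masked.append(mask * tfa)            # keep TFA values that pass the threshold
    # The patent then concatenates these maps and fuses them with a convolution
    # to form the decomposed time-frequency attention feature.
    return torch.stack(masked, dim=1)        # (B, n, F, T)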
The differential square phase spectrum is input into a phase encoder consisting of two convolution layers: the first comprises a dilated convolution with dilation rate 1 and kernel (5, 3), batch normalization and a ReLU, and the second comprises a dilated convolution with dilation rate 2 and kernel (25, 1), batch normalization and a ReLU. The phase encoder extracts a high-level phase feature; its structure is shown in Table 2:
TABLE 2 Structure of the phase encoder

Network layer | Output feature map size
Input layer   | (1, F, T)
Conv2D (5×3)  | (48, F, T)
Conv2D (25×1) | (48, F, T)
Conv2D (3×5)  | (48, F, T)
Output layer  | (48, F, T)
The high-level phase feature is input into a phase module composed of three convolution layers, whose kernels are (5, 3), (25, 1) and (3, 5) in turn and which use no activation function, to obtain the differential square phase feature.
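A sketch of the phase module, assuming "same" padding and the 48 channels of Table 2:

import torch.nn as nn

class PhaseModule(nn.Module):
    """Three Conv2D layers with kernels (5,3), (25,1), (3,5) and no activation."""
    def __init__(self, channels=48):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=(5, 3), padding=(2, 1)),
            nn.Conv2d(channels, channels, kernel_size=(25, 1), padding=(12, 0)),
            nn.Conv2d(channels, channels, kernel_size=(3, 5), padding=(1, 2)),
        )

    def forward(self, x):    # x: (B, 48, F, T)
        return self.net(x)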
In step S140, the interacted time-frequency feature and the interacted phase feature are obtained through interaction computations between the decomposed time-frequency attention feature and the differential square phase feature. Specifically:
Three interaction computations are performed using the three corresponding pairs of amplitude and phase modules. The decomposed time-frequency attention feature is gated by the differential square phase feature through Equation 2, yielding three interacted time-frequency features, and the differential square phase feature is gated by the decomposed time-frequency attention feature through Equation 3, yielding three interacted phase features:

\tilde{B}_m = B_m \odot \delta(\mathrm{Conv}(B_p))    (Equation 2)
\tilde{B}_p = B_p \odot \delta(\mathrm{Conv}(B_m))    (Equation 3)

where \tilde{B}_m denotes the interacted time-frequency feature, \tilde{B}_p the interacted phase feature, B_m the decomposed time-frequency attention feature (the output of the amplitude module), B_p the differential square phase feature (the output of the phase module), \delta(\cdot) the Tanh activation function, \mathrm{Conv}(\cdot) a two-dimensional convolution with a (1, 1) kernel that changes the channel count, and \odot the element-wise (dot) product.
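The interaction of Equations 2 and 3 can be sketched as follows; the channel counts are taken from Tables 1 and 2, and the gating form is reconstructed from the definitions above:

import torch.nn as nn

class Interaction(nn.Module):
    """One magnitude <-> phase exchange: each branch is gated by the Tanh of a
    1x1 convolution of the other branch."""
    def __init__(self, mag_ch=96, pha_ch=48):
        super().__init__()
        self.p2m = nn.Conv2d(pha_ch, mag_ch, kernel_size=1)  # phase -> magnitude
        self.m2p = nn.Conv2d(mag_ch, pha_ch, kernel_size=1)  # magnitude -> phase
        self.act = nn.Tanh()

    def forward(self, b_m, b_p):
        m_out = b_m * self.act(self.p2m(b_p))  # Equation 2
        p_out = b_p * self.act(self.m2p(b_m))  # Equation 3
        return m_out, p_out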
In step S150, the interacted time-frequency feature produced by each amplitude-module/phase-module pair is input into the mask decoder of the amplitude branch to obtain the enhanced amplitude spectrum, the interacted phase feature is input into the phase decoder of the phase branch to obtain the enhanced phase spectrum, the total loss is calculated from the enhanced amplitude spectrum and the enhanced phase spectrum, and the model parameters are updated. Specifically:
The interacted time-frequency feature is input into a mask decoder consisting of a convolution layer, a bGRU layer and a fully connected layer to obtain an estimated mask, and the Hadamard product of the estimated mask and the noisy spectrum gives the enhanced amplitude spectrum. The structure of the mask decoder is shown in Table 3:
TABLE 3 Structure of the mask decoder
The interacted phase feature is input into a phase decoder consisting of a single channel-conversion convolution layer, through which the enhanced phase spectrum is obtained. The structure of the phase decoder is shown in Table 4:
TABLE 4 Structure of the phase decoder

Network layer | Output feature map size
Input layer   | (48, F, T)
Conv2D (1×1)  | (1, F, T)
Output layer  | (1, F, T)
The enhanced amplitude spectrum and the enhanced phase spectrum are combined into a complete spectrum, and an inverse short-time Fourier transform of the complete spectrum yields the enhanced speech signal.
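A sketch of this reconstruction step; the STFT parameters must match those used when the noisy spectra were generated, and the values below repeat the illustrative choices from step S110:

import torch

def reconstruct_waveform(enh_mag, enh_phase, n_fft=400, hop_length=100):
    spec = torch.polar(enh_mag, enh_phase)   # magnitude * exp(j * phase), (F, T)
    window = torch.hann_window(n_fft)
    return torch.istft(spec, n_fft=n_fft, hop_length=hop_length, window=window)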
The root mean square error between the enhanced amplitude spectrum and the initial amplitude spectrum gives a first loss L_m, and the root mean square error between the enhanced phase spectrum and the initial phase spectrum gives a second loss L_p; the total training loss is defined as 0.5 × L_m + 0.5 × L_p. An Adam optimizer is used during training, with the initial learning rate set to 0.0005.
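A sketch of the total loss. The text compares against the "initial" spectra; for the training to be meaningful the references are presumably the spectra of the clean speech, which is assumed here:

import torch

def total_loss(enh_mag, ref_mag, enh_phase, ref_phase):
    l_m = torch.sqrt(torch.mean((enh_mag - ref_mag) ** 2))      # first loss L_m
    l_p = torch.sqrt(torch.mean((enh_phase - ref_phase) ** 2))  # second loss L_p
    return 0.5 * l_m + 0.5 * l_p                                # total loss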
In step S160, several epochs are performed; steps S120 to S150 are repeated batch by batch within each epoch for iterative training, and the trained single-channel speech enhancement model is obtained once the epochs are finished. Specifically:
In each epoch, the pre-divided batches are fed into the model in turn, each batch continuing to update the model produced by the previous batch; within each batch, the training procedure of steps S120 to S150 is cycled several times. The trained single-channel speech enhancement model is obtained when the last epoch ends. In this embodiment 30 epochs are used; FIG. 4 is a schematic diagram of the iterative change of the total loss, showing how the total loss evolves over the 30 epochs.
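The outer training loop can be sketched as follows, reusing the total_loss function above; the per-batch data layout and the model's call signature are assumptions:

import torch

def train(model, batches, epochs=30, lr=5e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):                                # epochs of training
        for noisy_mag, dsp, ref_mag, ref_phase in batches:     # pre-divided batches
            enh_mag, enh_phase = model(noisy_mag, dsp)         # steps S120 to S150
            loss = total_loss(enh_mag, ref_mag, enh_phase, ref_phase)
            opt.zero_grad()
            loss.backward()
            opt.step()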
The network structure and parameters of the trained single-channel speech enhancement model are then used to enhance noisy speech. Table 5 compares the embodiment of the invention with other methods on several objective metrics:
TABLE 5 Objective evaluation of current mainstream models and the invention

Method           | PESQ | CSIG | CBAK | COVL
Noisy            | 1.97 | 3.34 | 2.44 | 2.63
SEGAN            | 2.16 | 3.48 | 2.94 | 2.80
MetricGAN        | 2.86 | 3.99 | 3.18 | 3.42
TSTNN            | 2.96 | 4.33 | 3.53 | 3.67
PHASEN           | 2.99 | 4.21 | 3.55 | 3.62
DEMUCS           | 3.07 | 4.31 | 3.40 | 3.63
Ours (w/o DSP)   | 3.05 | 4.31 | 3.30 | 3.70
Ours (w/o DTFAB) | 3.03 | 4.28 | 3.29 | 3.68
Ours (full)      | 3.12 | 4.33 | 3.31 | 3.74
As Table 5 shows, compared with the current mainstream speech enhancement methods, the method of the invention clearly improves PESQ and the three composite metrics CSIG, CBAK and COVL, where PESQ is the Perceptual Evaluation of Speech Quality, CSIG predicts the distortion of the speech signal, CBAK predicts the intrusiveness of the background noise, and COVL predicts the overall speech quality.
The above technical solution of the embodiments of the present invention is further illustrated below with reference to the drawings.
FIG. 5 is a schematic diagram of the dual-branch model structure according to an embodiment of the present invention. As shown in FIG. 5, an amplitude-phase parallel dual-branch structure addresses the single-channel speech enhancement problem: an amplitude branch comprising the amplitude encoder, amplitude modules and mask decoder, and a phase branch comprising the phase encoder, phase modules and phase decoder. The number of amplitude modules MB (Magnitude Block) equals the number of phase modules PB (Phase Block), and the two decomposed time-frequency attention blocks in each amplitude module MB are the DTFABs.
In summary, aiming at the problems identified above, the technical scheme of the embodiment of the invention provides a training method for a single-channel speech enhancement model that adopts an amplitude-phase parallel dual-branch structure. A custom decomposed time-frequency attention block in the amplitude module extracts features along the frequency and time axes, which helps capture important time-frequency distribution information. Before phase-branch processing, the phases at the low and high frequency points of adjacent time frames in the phase spectrum are differenced and the differential phase is squared, converting the randomly complex initial phase spectrum into a differential square phase spectrum and yielding phase information with more salient structure and texture. When the decomposed time-frequency attention feature is computed, mask computation strengthens the time-frequency attention and highlights the strong regions of the time-frequency distribution. After training, a single-channel speech enhancement model that improves the signal-to-noise ratio is obtained.
Device embodiment
According to an embodiment of the present invention, a training device for a single-channel speech enhancement model is provided. FIG. 6 is a schematic diagram of the training device according to the embodiment of the invention; as shown in FIG. 6, the device specifically includes:
a training set generation module 60 for generating a set of noisy spectra for training;
a phase spectrum generation module 62, configured to acquire the noisy spectrum, calculate the initial amplitude spectrum and initial phase spectrum corresponding to it, and convert the initial phase spectrum into a differential square phase spectrum. It is specifically configured to:
obtain the initial amplitude spectrum as the absolute value of the noisy spectrum, and the initial phase spectrum as the phase angle (arctangent) of the noisy spectrum;
and unwrap the initial phase spectrum along the frequency axis, difference the phases at the adjacent low and high frequency points within each time frame to obtain the differential phase, and square the differential phase to obtain the differential square phase spectrum.
The dual-branch encoding module 64 is configured to sequentially input the initial amplitude spectrum into the amplitude encoder and amplitude module of the amplitude branch to obtain the decomposed time-frequency attention feature, and to input the differential square phase spectrum into the phase encoder and phase module of the phase branch to obtain the differential square phase feature. It is specifically configured to:
input the initial amplitude spectrum into an amplitude encoder consisting of two convolution layers and extract a high-level amplitude feature; input the high-level amplitude feature into an amplitude module consisting of two decomposed time-frequency attention blocks and three convolution layers; decompose the high-level amplitude feature into a time channel attention vector and a frequency channel attention vector; take the outer product of the two vectors to obtain a time-frequency attention matrix; calculate a mask for each interval of the time-frequency attention matrix by the mask-division method; and multiply the masks by the values at the corresponding positions of the time-frequency attention matrix to obtain the decomposed time-frequency attention feature;
input the differential square phase spectrum into a phase encoder consisting of two convolution layers and extract a high-level phase feature; and input the high-level phase feature into a phase module consisting of three convolution layers to obtain the differential square phase feature.
The mask-division method computes the mask for each interval of the time-frequency attention matrix as follows: decomposition parameters for the division intervals are set, a plurality of intervals partitioning the time-frequency attention matrix are obtained from the decomposition parameters, and an information value is calculated for each interval from the decomposition parameters; if the information value is larger than the value at the corresponding position of the time-frequency attention matrix, the mask value is set to 0, and otherwise to 1.
The interaction calculation module 66 is configured to obtain the interacted time-frequency feature and the interacted phase feature through interaction computation between the decomposed time-frequency attention feature and the differential square phase feature. It is specifically configured to:
perform three interaction computations using the three corresponding pairs of amplitude and phase modules; gate the decomposed time-frequency attention feature with the differential square phase feature through formula 1 to obtain the interacted time-frequency feature, and gate the differential square phase feature with the decomposed time-frequency attention feature through formula 2 to obtain the interacted phase feature:

\tilde{B}_m = B_m \odot \delta(\mathrm{Conv}(B_p))    (formula 1)
\tilde{B}_p = B_p \odot \delta(\mathrm{Conv}(B_m))    (formula 2)

where \tilde{B}_m denotes the interacted time-frequency feature, \tilde{B}_p the interacted phase feature, B_m the decomposed time-frequency attention feature, B_p the differential square phase feature, \delta(\cdot) the Tanh activation function, \mathrm{Conv}(\cdot) a two-dimensional convolution with a (1, 1) kernel that changes the channel count, and \odot the element-wise (dot) product.
The dual-branch decoding module 68 is configured to input the interacted time-frequency feature into the mask decoder of the amplitude branch to obtain the enhanced amplitude spectrum, input the interacted phase feature into the phase decoder of the phase branch to obtain the enhanced phase spectrum, calculate the total loss from the enhanced amplitude and phase spectra, and update the model parameters. It is specifically configured to:
input the interacted time-frequency feature into a mask decoder consisting of a convolution layer, a bGRU layer and a fully connected layer to obtain an estimated mask, and take the Hadamard product of the estimated mask and the noisy spectrum to obtain the enhanced amplitude spectrum;
input the interacted phase feature into a phase decoder consisting of a channel-conversion convolution layer to obtain the enhanced phase spectrum;
and calculate the root mean square error between the enhanced amplitude spectrum and the initial amplitude spectrum as the first loss, calculate the root mean square error between the enhanced phase spectrum and the initial phase spectrum as the second loss, and calculate the total loss from the first and second losses.
The training iteration module 610 is configured to perform several epochs of training, repeatedly invoking the steps of the phase spectrum generation module 62, the dual-branch encoding module 64, the interaction calculation module 66 and the dual-branch decoding module 68 batch by batch in each epoch for iterative training, and to obtain the trained single-channel speech enhancement model after the epochs are finished.
In summary, in view of the existing problems, the above technical solution of the invention provides a training device for a single-channel speech enhancement model that adopts an amplitude-phase parallel dual-branch structure: an amplitude branch comprising an amplitude encoder, an amplitude module and a mask decoder, and a phase branch comprising a phase encoder, a phase module and a phase decoder. A custom decomposed time-frequency attention block inside the amplitude module extracts features along the frequency and time axes, which helps capture important time-frequency distribution information. Before phase-branch processing, the phases at the low and high frequency points of adjacent time frames in the phase spectrum are differenced and the differential phase is squared, converting the randomly complex initial phase spectrum into a differential square phase spectrum with more salient structure and texture. When the decomposed time-frequency attention feature is computed, mask computation strengthens the time-frequency attention and highlights the strong regions of the time-frequency distribution. After training, a single-channel speech enhancement model that improves the signal-to-noise ratio is obtained.
Electronic device embodiment
Fig. 7 is a schematic diagram of an electronic device according to an embodiment of the invention. The electronic device 700 may include at least one processor 710 and memory 720. Processor 710 may execute instructions stored in memory 720. The processor 710 is communicatively coupled to the memory 720 via a data bus. In addition to memory 720, processor 710 may also be communicatively coupled to input device 730, output device 740, and communication device 750 via a data bus.
The processor 710 may be any conventional processor, such as a commercially available CPU. It may also include, for example, a graphics processing unit (GPU), a field-programmable gate array (FPGA), a system on chip (SoC), an application-specific integrated circuit (ASIC), or a combination thereof.
The memory 720 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
In the embodiment of the present disclosure, the memory 720 stores executable instructions, and the processor 710 may read the executable instructions from the memory 720 and execute the instructions to implement all or part of the steps of the training method of the single channel speech enhancement model in any of the above exemplary embodiments.
Computer-readable storage medium embodiments
In addition to the methods and apparatus described above, exemplary embodiments of the present disclosure may also be a computer program product or a computer-readable storage medium storing the computer program product, the computer program product including computer program instructions executable by a processor to implement all or part of the steps described in the training method of the single-channel speech enhancement model of any of the exemplary embodiments described above.
The computer program product may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages, and scripting languages (e.g., python). The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server.
A computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the readable storage medium include: an electrical connection having one or more wires, a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic or optical disk, or any suitable combination of the foregoing.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (10)

1. A method for training a single-channel speech enhancement model, comprising:
step 1, generating a set of noisy spectrums for training;
step 2, acquiring the noisy spectrum, calculating an initial amplitude spectrum and an initial phase spectrum corresponding to the noisy spectrum, and converting the initial phase spectrum into a differential square phase spectrum;
step 3, inputting the initial amplitude spectrum sequentially into an amplitude encoder and an amplitude module of an amplitude branch to obtain a decomposed time-frequency attention feature, and inputting the differential square phase spectrum into a phase encoder and a phase module of a phase branch to obtain a differential square phase feature;
step 4, obtaining an interacted time-frequency feature and an interacted phase feature through interaction computation between the decomposed time-frequency attention feature and the differential square phase feature;
step 5, inputting the interacted time-frequency feature into a mask decoder of the amplitude branch to obtain an enhanced amplitude spectrum, inputting the interacted phase feature into a phase decoder of the phase branch to obtain an enhanced phase spectrum, calculating a total loss from the enhanced amplitude spectrum and the enhanced phase spectrum, and updating model parameters;
and step 6, performing several epochs of training, repeating steps 2 to 5 batch by batch in each epoch for iterative training, and obtaining a trained single-channel speech enhancement model after the epochs are finished.
2. The method according to claim 1, wherein calculating the initial amplitude spectrum and initial phase spectrum corresponding to the noisy spectrum and converting the initial phase spectrum into a differential square phase spectrum specifically comprises:
obtaining the initial amplitude spectrum by taking the absolute value of the noisy spectrum, and obtaining the initial phase spectrum by taking the arctangent (phase angle) of the noisy spectrum;
and unwrapping the initial phase spectrum along the frequency axis to obtain an unwrapped phase spectrum, differencing the phases at the adjacent low and high frequency points within each time frame of the unwrapped phase spectrum to obtain a differential phase, and squaring the differential phase to obtain the differential square phase spectrum.
3. The method according to claim 1, wherein sequentially inputting the initial amplitude spectrum into the amplitude encoder and amplitude module of the amplitude branch to obtain the decomposed time-frequency attention feature, and inputting the differential square phase spectrum into the phase encoder and phase module of the phase branch to obtain the differential square phase feature, specifically comprises:
inputting the initial amplitude spectrum into an amplitude encoder consisting of two convolution layers, and extracting a high-level amplitude feature through the amplitude encoder; inputting the high-level amplitude feature into the amplitude module consisting of two decomposed time-frequency attention blocks and three convolution layers; decomposing the high-level amplitude feature, through the amplitude module, into a time channel attention vector and a frequency channel attention vector; taking the outer product of the time channel attention vector and the frequency channel attention vector to obtain a time-frequency attention matrix; calculating a mask for each interval of the time-frequency attention matrix by a mask-division method; and multiplying the masks by the values at the corresponding positions of the time-frequency attention matrix to obtain the decomposed time-frequency attention feature;
inputting the differential square phase spectrum into a phase encoder consisting of two convolution layers, and extracting a high-level phase feature through the phase encoder; and inputting the high-level phase feature into a phase module consisting of three convolution layers to obtain the differential square phase feature.
4. The method of claim 3, wherein calculating the mask for each interval of the time-frequency attention matrix by the mask-division method specifically comprises:
setting a decomposition parameter for the division intervals, obtaining from the decomposition parameter a plurality of intervals that partition the time-frequency attention matrix, and calculating an information value for each interval from the decomposition parameter; and if the information value is larger than the value at the corresponding position of the time-frequency attention matrix, setting the value of the mask to 0, and otherwise setting it to 1.
5. The method according to claim 1, wherein obtaining the interacted time-frequency feature and the interacted phase feature through interaction computation between the decomposed time-frequency attention feature and the differential square phase feature specifically comprises:
performing three interaction computations using three corresponding pairs of amplitude and phase modules; gating the decomposed time-frequency attention feature with the differential square phase feature through formula 1 to obtain the interacted time-frequency feature, and gating the differential square phase feature with the decomposed time-frequency attention feature through formula 2 to obtain the interacted phase feature:

\tilde{B}_m = B_m \odot \delta(\mathrm{Conv}(B_p))    (formula 1)
\tilde{B}_p = B_p \odot \delta(\mathrm{Conv}(B_m))    (formula 2)

where \tilde{B}_m denotes the interacted time-frequency feature, \tilde{B}_p the interacted phase feature, B_m the decomposed time-frequency attention feature, B_p the differential square phase feature, \delta(\cdot) the Tanh activation function, \mathrm{Conv}(\cdot) a two-dimensional convolution with a (1, 1) kernel that changes the channel count, and \odot the element-wise (dot) product.
6. The method according to claim 1, wherein inputting the interacted time-frequency feature into the mask decoder of the amplitude branch to obtain the enhanced amplitude spectrum, and inputting the interacted phase feature into the phase decoder of the phase branch to obtain the enhanced phase spectrum, specifically comprises:
inputting the interacted time-frequency feature into a mask decoder consisting of a convolution layer, a bGRU layer and a fully connected layer to obtain an estimated mask, and taking the Hadamard product of the estimated mask and the noisy spectrum to obtain the enhanced amplitude spectrum;
and inputting the interacted phase feature into a phase decoder consisting of a channel-conversion convolution layer to obtain the enhanced phase spectrum.
7. The method according to claim 1, wherein said calculating a total loss from said enhanced amplitude spectrum and said enhanced phase spectrum comprises in particular:
the root mean square error of the enhanced amplitude spectrum and the initial amplitude spectrum is calculated to obtain a first loss, the root mean square error of the enhanced phase spectrum and the initial phase spectrum is calculated to obtain a second loss, and the total loss is calculated according to the first loss and the second loss.
8. A training device for a single channel speech enhancement model, comprising:
a training set generation module, configured to generate a set of noisy spectra for training;
a phase spectrum generation module, configured to acquire the noisy spectrum, calculate an initial amplitude spectrum and an initial phase spectrum corresponding to the noisy spectrum, and convert the initial phase spectrum into a differential square phase spectrum;
a dual-branch encoding module, configured to sequentially input the initial amplitude spectrum into an amplitude encoder and an amplitude module of an amplitude branch to obtain a decomposed time-frequency attention feature, and to input the differential square phase spectrum into a phase encoder and a phase module of a phase branch to obtain a differential square phase feature;
an interaction calculation module, configured to obtain an interacted time-frequency feature and an interacted phase feature through interaction computation between the decomposed time-frequency attention feature and the differential square phase feature;
a dual-branch decoding module, configured to input the interacted time-frequency feature into a mask decoder of the amplitude branch to obtain an enhanced amplitude spectrum, input the interacted phase feature into a phase decoder of the phase branch to obtain an enhanced phase spectrum, calculate a total loss from the enhanced amplitude spectrum and the enhanced phase spectrum, and update model parameters;
and a training iteration module, configured to perform several epochs of training, repeatedly invoking the steps in the phase spectrum generation module, the dual-branch encoding module, the interaction calculation module and the dual-branch decoding module batch by batch in each epoch for iterative training, and to obtain a trained single-channel speech enhancement model after the epochs are finished.
9. An electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the training method for the single-channel speech enhancement model according to any one of claims 1 to 7.
10. A computer-readable storage medium, wherein a program for implementing information transfer is stored on the computer-readable storage medium, and the program, when executed by a processor, implements the steps of the training method for the single-channel speech enhancement model according to any one of claims 1 to 7.
CN202311511028.7A 2023-11-13 2023-11-13 Training method, device, equipment and storage medium of single-channel voice enhancement model Active CN117558288B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311511028.7A CN117558288B (en) 2023-11-13 2023-11-13 Training method, device, equipment and storage medium of single-channel voice enhancement model


Publications (2)

Publication Number Publication Date
CN117558288A (en) 2024-02-13
CN117558288B (en) 2024-10-15

Family

ID=89819808

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311511028.7A Active CN117558288B (en) 2023-11-13 2023-11-13 Training method, device, equipment and storage medium of single-channel voice enhancement model

Country Status (1)

Country Link
CN (1) CN117558288B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118194875A (en) * 2024-04-08 2024-06-14 杭州华亭科技有限公司 Intelligent voice service management system and method driven by natural language understanding


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR1351610A (en) * 1962-01-05 1964-02-07 Emi Ltd Improvements to devices for pattern recognition
WO1991020145A1 (en) * 1990-06-15 1991-12-26 Sundstrand Data Control, Inc. Signal acquisition
WO1999016050A1 (en) * 1997-09-23 1999-04-01 Voxware, Inc. Scalable and embedded codec for speech and audio signals
CA2393768A1 (en) * 2000-01-28 2001-08-02 Excel Corporation Method and apparatus for tenderizing meat
CN1347076A (en) * 2000-09-22 2002-05-01 松下电器产业株式会社 Interval change method and device
US20210012767A1 (en) * 2020-09-25 2021-01-14 Intel Corporation Real-time dynamic noise reduction using convolutional networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YAN Zhaoyu; WANG Jing: "Single-channel speech enhancement algorithm combining a deep convolutional recurrent network and a time-frequency attention mechanism", Signal Processing, no. 06, 25 June 2020 (2020-06-25) *


Also Published As

Publication number Publication date
CN117558288B (en) 2024-10-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant