CN113096675B - Audio style unification method based on a generative adversarial network - Google Patents

Audio style unification method based on a generative adversarial network

Info

Publication number
CN113096675B
Authority
CN
China
Prior art keywords
audio
network
style
spectrum
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110351514.1A
Other languages
Chinese (zh)
Other versions
CN113096675A (en)
Inventor
欧阳童洁
杨志军
谢晖泷
胡天林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University
Priority to CN202110351514.1A
Publication of CN113096675A
Application granted
Publication of CN113096675B
Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L21/013 Adapting to target pitch
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis, using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L19/18 Vocoders using multiple modes
    • G10L19/20 Vocoders using multiple modes using sound class specific coding, hybrid encoders or object based coding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L21/013 Adapting to target pitch
    • G10L2021/0135 Voice conversion or morphing

Abstract

The invention discloses an audio style unification method based on a generative adversarial network, which comprises the following steps. Step 1: acquire an initial data set and a noise data set. Step 2: preprocess the initial data set and the noise data set to generate noise-mixed audio and style template audio, and determine the training data set and the test data set associated with them. Step 3: build a generative network model and train a generator network G for unifying audio styles; the noise-mixed audio and the style template audio are input, and target-style audio and a target-style spectrum are output. Step 4: build a discrimination network model and train a discriminator network D to measure the similarity between the target-style spectrum output by the generator and the style template spectrum. Step 5: construct a loss function model and train the generative adversarial network. With this scheme, the audio style unification method based on a generative adversarial network can adjust the style of other input audio according to the audio style selected by the user.

Description

Audio style unification method based on a generative adversarial network
Technical Field
The invention relates to the technical field of deep learning, and in particular to an audio style unification method based on a generative adversarial network.
Background
Audio style unification refers to adding the characteristics of a particular speaker, such as timbre and paralanguage (emotion and intonation), to synthesized audio; it is also called speech style transfer. Research on speech style transfer not only advances the theory of speech signal processing but also promotes the fusion of theory and applications across fields, and therefore occupies an important position.
Speech style transfer technology has been developed for decades, and with the progress of voice conversion technology it has achieved many results. Chu and Lv proposed a method for male-female voice conversion based on the time-domain pitch-synchronous overlap-add technique, combining the PSOLA algorithm with a speech sinusoidal model [1]. Desai et al. proposed using a BP neural network to realize voice conversion [2]. Benefiting from the development of deep learning, researchers have revised earlier models; for example, Sun, Lifa, et al. used long short-term memory (LSTM) networks to realize voice conversion [3]. To further improve voice conversion quality, Chris Donahue et al. proposed WaveGAN, based on a deep convolutional generative adversarial network, for adversarial audio synthesis [4]; however, because the speech signal is processed directly and simply into a spectrogram, the experimental effect is not ideal.
References
[1] Chu M, Lv S N. A synthesis method combining the PSOLA algorithm and a speech sinusoidal model [C] // Proceedings of the 5th National Conference on Human-Machine Speech Communication, 1998.
[2] Desai S, Raghavendra E V, Yegnanarayana B, et al. Voice conversion using Artificial Neural Networks [J]. 2009.
[3] Greff K, Srivastava R K, Koutnik J, et al. LSTM: A Search Space Odyssey [J]. IEEE Transactions on Neural Networks & Learning Systems, 2016, 28(10): 2222-2232.
[4] Donahue C, McAuley J, Puckette M. Adversarial Audio Synthesis [J]. 2018.
Disclosure of Invention
In view of the above, the present invention aims to provide an audio style unification method based on a generative adversarial network, which requires little manual intervention, is easy to automate, is reliable and convenient to implement, and processes audio rapidly.
In order to achieve the technical purpose, the invention adopts the following technical scheme:
a method of audio style unification based on a generative countermeasure network, comprising:
S01, acquiring an initial data set and a noise data set;
S02, preprocessing the initial data set and the noise data set according to preset conditions to generate noise-mixed audio;
S03, acquiring style template audio;
S04, constructing a generative network model and training a generator network G for unifying audio styles, wherein after the noise-mixed audio and the style template audio are input into the generator network G, target-style audio and a target-style spectrum are output;
S05, acquiring the style template spectrum corresponding to the style template audio;
S06, constructing a discrimination network model and training a discriminator network D for measuring the similarity between the target-style spectrum output by the generator network G and the style template spectrum, wherein after the target-style spectrum and the style template spectrum are input into the discriminator network D, the discriminator network D discriminates between them and outputs a probability score mapped to [0, 1];
S07, constructing a loss function model connected to the generative network model and the discrimination network model, wherein the degree of information loss is calculated through the generator network G of the generative network model, the degree of style loss is evaluated through the discriminator network D of the discrimination network model, and the generative adversarial network is obtained by training;
S08, performing unified audio style conversion on the audio to be converted through the generative adversarial network, and outputting the style-converted audio.
As a possible implementation, the initial data set includes the collection of clean audio in the Tsinghua University Chinese speech data set THCHS;
the noise data set includes the collection of 3 types of noise audio in the Tsinghua University Chinese speech data set THCHS.
As a possible implementation, the style template spectrum is the spectrum obtained by applying the forward Fourier transform to the style template audio.
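As an illustration of how such a spectrum can be computed, the short sketch below applies a short-time Fourier transform to one 4-second clip sampled at 16.384 kHz. The patent does not state the transform parameters; the FFT size of 512 and hop length of 128 are assumptions chosen so that the result has the 257 x 513 size referred to later in this description, and the file name is a placeholder.

import numpy as np
import librosa

SR = 16384        # 16.384 kHz, the resampling rate used in step S02
SEG = SR * 4      # 4-second segment length
N_FFT = 512       # assumed FFT size: 512 // 2 + 1 = 257 frequency bins
HOP = 128         # assumed hop length: 1 + 65536 // 128 = 513 frames

def audio_to_spectrum(path):
    """Load one clip, pad or trim it to 4 seconds, and return a 257 x 513 magnitude spectrum."""
    y, _ = librosa.load(path, sr=SR, mono=True)
    if len(y) < SEG:
        y = np.pad(y, (0, SEG - len(y)))
    y = y[:SEG]
    return np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP))

# Hypothetical usage:
# x_spec = audio_to_spectrum("style_template.wav")
# print(x_spec.shape)  # (257, 513)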
As a preferred embodiment, in step S02, the initial data set and the noise data set are preprocessed according to the preset conditions to generate the noise-mixed audio as follows:
S021, resampling the initial data set and the noise data set to 16.384 kHz and dividing each of them into segments of 4 seconds;
S022, generating the noise-mixed audio according to a preset formula, wherein the formula for generating the noise-mixed audio is:
Z = C + N * r
wherein C represents a segment of audio in the resampled and segmented initial data set; N represents a segment of audio in the resampled and segmented noise data set; r represents a random number in [0.1, 0.3]; and Z represents a segment of the generated noise-mixed audio.
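A minimal sketch of this mixing step, under the assumption that the clean and noise clips have already been resampled to 16.384 kHz and cut to the same 4-second length, is shown below; the file names are placeholders.

import numpy as np
import librosa

SR = 16384       # 16.384 kHz
SEG = SR * 4     # 4-second segment length

def load_segment(path):
    """Load one file at 16.384 kHz and return its first 4-second segment."""
    y, _ = librosa.load(path, sr=SR, mono=True)
    if len(y) < SEG:
        y = np.pad(y, (0, SEG - len(y)))
    return y[:SEG]

def mix_noise(clean, noise):
    """Z = C + N * r with r drawn uniformly from [0.1, 0.3]."""
    r = np.random.uniform(0.1, 0.3)
    return clean + noise * r

# Hypothetical file names standing in for clips from the clean and noise collections:
# c = load_segment("clean_clip.wav")
# n = load_segment("noise_clip.wav")
# z = mix_noise(c, n)   # one segment of noise-mixed audio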
Preferably, the style template audio is randomly extracted from the initial data set after resampling and segmentation or extracted from a pre-constructed style template audio library.
As a preferred implementation manner, preferably, 85% of audio units in the noise mixed audio and the style template audio are also randomly extracted as training data sets, and the remaining 15% are used as test data sets; the training data set and the test data set are used for training or testing of the generator network G and/or the arbiter network D.
As a preferred embodiment, the generator network G comprises a noise-mixed audio encoder, a style template audio encoder and a decoder;
the generator network G has two inputs and two outputs: one input receives the spectrum of the noise-mixed audio after the forward Fourier transform, with size 257 x 513 x 1, and the other input receives the style template spectrum, with size 257 x 513 x 1; one output produces the target-style spectrum, with size 257 x 513 x 1, which is fed into the discriminator network D for comparison, and the other output produces the audio obtained by applying the inverse Fourier transform to the target-style spectrum, i.e. the target-style audio;
the noise-mixed audio encoder includes 8 encoder units; each encoder unit has a 3 x 3 convolution kernel, a stride of 2 and a ReLU activation function, and the numbers of convolution kernels of the encoder units are 16, 32, 64, 128, 256, 512, 1024 and 2048 in sequence; the first encoder unit receives the spectrum of the noise-mixed audio after the forward Fourier transform, with size 257 x 513 x 1, the input feature of each subsequent encoder unit is the output feature of the previous encoder unit, and the output scale of the last encoder unit is 2 x 3 x 2048;
the style template audio encoder includes 8 encoder units; each encoder unit has a 3 x 3 convolution kernel, a stride of 2 and a ReLU activation function, and the numbers of convolution kernels of the encoder units are 16, 32, 64, 128, 256, 512, 1024 and 2048 in sequence; the first encoder unit receives the style template spectrum, with size 257 x 513 x 1, the input feature of each subsequent encoder unit is the output feature of the previous encoder unit, and the output scale of the last encoder unit is 2 x 3 x 2048;
the decoder includes 8 decoder units; each decoder unit has a 3 x 3 deconvolution kernel, a stride of 2 and a ReLU activation function, and the numbers of deconvolution kernels of the decoder units are 1024, 512, 256, 128, 64, 32, 16 and 8 in sequence; the first decoder unit receives the result of tensor concatenation of the output features of the noise-mixed audio encoder and the output features of the style template audio encoder, the input feature of each subsequent decoder unit is the output feature of the previous decoder unit, and the output scale of the last decoder unit is 257 x 513 x 1.
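The layout above can be sketched in PyTorch as follows. This is a minimal sketch, not the patented implementation: it follows the simpler variant described here, in which the two encoder outputs are concatenated once at the bottleneck (the detailed embodiment later additionally concatenates same-resolution encoder features into the decoder, U-Net style), and the convolution padding, output padding and final one-channel projection are assumptions chosen so that the stated 2 x 3 x 2048 and 257 x 513 x 1 sizes are reproduced.

import torch
import torch.nn as nn

ENC_CH = [16, 32, 64, 128, 256, 512, 1024, 2048]
DEC_CH = [1024, 512, 256, 128, 64, 32, 16, 8]

def make_encoder(in_ch):
    """8 encoder units: 3 x 3 convolution, stride 2, ReLU (257 x 513 x 1 -> 2 x 3 x 2048)."""
    layers, prev = [], in_ch
    for ch in ENC_CH:
        layers += [nn.Conv2d(prev, ch, kernel_size=3, stride=2, padding=1), nn.ReLU()]
        prev = ch
    return nn.Sequential(*layers)

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.noise_enc = make_encoder(1)   # branch for the noise-mixed audio spectrum
        self.style_enc = make_encoder(1)   # branch for the style template spectrum
        dec, prev = [], 2 * ENC_CH[-1]     # bottleneck concatenation: 2 x 3 x 4096
        for ch in DEC_CH:                  # 8 decoder units: 3 x 3 deconvolution, stride 2, ReLU
            dec += [nn.ConvTranspose2d(prev, ch, kernel_size=3, stride=2, padding=1), nn.ReLU()]
            prev = ch
        dec += [nn.Conv2d(prev, 1, kernel_size=1)]   # assumed projection back to one channel
        self.decoder = nn.Sequential(*dec)

    def forward(self, z_spec, x_spec):
        """z_spec, x_spec: tensors of shape (batch, 1, 257, 513); returns a target-style spectrum of the same shape."""
        h = torch.cat([self.noise_enc(z_spec), self.style_enc(x_spec)], dim=1)
        return self.decoder(h)

# g = Generator()
# out = g(torch.randn(1, 1, 257, 513), torch.randn(1, 1, 257, 513))
# print(out.shape)  # torch.Size([1, 1, 257, 513])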
As a preferred embodiment, the discriminator network D comprises 6 convolutional layers and 5 fully connected layers;
the discriminator network D has two inputs and one output: one input receives the target-style spectrum output by the generator network G, with size 257 x 513 x 1, and the other input receives the style template spectrum, with size 257 x 513 x 1; the output gives the similarity between the target-style spectrum and the style template spectrum, expressed as a probability score between [0, 1];
before entering the convolutional layers, the target-style spectrum and the style template spectrum input to the discriminator network D are concatenated as tensors into a 257 x 513 x 2 feature, which is fed to the convolutional layers; each convolutional layer has a 3 x 3 convolution kernel and a stride of 2, batch normalization (BatchNorm) is applied before the convolution, the activation function is ReLU, and the numbers of channels of the convolutional layers are 32, 64, 128, 256, 512 and 1024 in sequence; the input of the first convolutional layer is the result of the tensor concatenation of the target-style spectrum and the style template spectrum, the input feature of each subsequent convolutional layer is the output feature of the previous convolutional layer, and the output scale of the last convolutional layer is 5 x 9 x 1024;
the numbers of neurons of the fully connected layers are 46080, 1024, 256, 64 and 1 in sequence, wherein the last layer uses sigmoid as the activation function and the other layers use ReLU; the input of the fully connected layers is the flattened output of the last convolutional layer, and the output of the fully connected layers is the similarity between the target-style spectrum and the style template spectrum, expressed as a probability score between [0, 1].
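A corresponding PyTorch sketch of the discriminator is shown below. The convolution padding is an assumption chosen so that the stated 5 x 9 x 1024 output (46080 values after flattening) is reproduced, and since the 46080-unit flattened input is counted as the first of the 5 fully connected layers, the sketch contains 4 weight matrices.

import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """6 convolutional layers (BatchNorm, then 3 x 3 convolution with stride 2, then ReLU) followed by the fully connected head."""

    def __init__(self):
        super().__init__()
        conv, prev = [], 2                    # input: the two concatenated spectra, 257 x 513 x 2
        for ch in [32, 64, 128, 256, 512, 1024]:
            conv += [nn.BatchNorm2d(prev),    # batch normalization before the convolution
                     nn.Conv2d(prev, ch, kernel_size=3, stride=2, padding=1),
                     nn.ReLU()]
            prev = ch
        self.conv = nn.Sequential(*conv)      # output: 5 x 9 x 1024
        self.fc = nn.Sequential(
            nn.Flatten(),                     # 5 * 9 * 1024 = 46080
            nn.Linear(46080, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),   # probability score in [0, 1]
        )

    def forward(self, target_spec, style_spec):
        pair = torch.cat([target_spec, style_spec], dim=1)   # (batch, 2, 257, 513)
        return self.fc(self.conv(pair))

# d = Discriminator()
# score = d(torch.randn(1, 1, 257, 513), torch.randn(1, 1, 257, 513))
# print(float(score))  # a value between 0 and 1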
As a preferred implementation, before the generative adversarial network performs unified audio style conversion on the audio to be style-converted, the network parameters of the generative adversarial network are further optimized to obtain the parameters giving the best network performance.
As a preferred implementation, the loss function model is constructed and connected to the generative network model and the discrimination network model, the degree of information loss is calculated through the generator network G of the generative network model, the degree of style loss is evaluated through the discriminator network D of the discrimination network model, and the generative adversarial network is then obtained by training, specifically as follows:
(1) The loss function L_D of the discriminator network D is defined as:
L_D = (D(c, x) - 1)^2 + (D(G(z, x), x))^2    (1)
(2) The loss function L_G of the generator network G consists of two parts: one part is L_GD, the output of the discriminator network D, and the other part is the difference between the target-style output of the generator network G and the corresponding audio of the initial data set, wherein
L_GD = D(G(z, x), x)    (2)
In formulas (1), (2), (3) and (4), n is the number of matrix elements in the target-style spectrum output by the generator network G; c is the spectrum obtained by the forward Fourier transform of a segment of audio in the initial data set; z is the spectrum of the noise-mixed audio after the forward Fourier transform; x is the spectrum of the style template audio after the forward Fourier transform; and k is a hyperparameter used to control the weights of the two loss parts;
(3) The generator network G is optimized with the Adam algorithm at a learning rate of 0.001, and the discriminator network D is optimized with the Adam algorithm at a learning rate of 0.0001, so that the parameters giving the best performance of the generative adversarial network are obtained by optimizing its parameters.
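Formulas (1) and (2) translate directly into code. Formulas (3) and (4), which define the difference term and the combined generator loss, are not reproduced in the text above, so the sketch below assumes a mean squared difference over the n spectrum elements weighted by k; that assumed form, the helper names and the default value of k are not taken from the patent.

import torch

def discriminator_loss(D, G, c, z, x):
    """Formula (1): L_D = (D(c, x) - 1)^2 + (D(G(z, x), x))^2, averaged over the batch."""
    return ((D(c, x) - 1.0) ** 2 + D(G(z, x), x) ** 2).mean()

def generator_loss(D, G, c, z, x, k=1.0):
    """Formula (2) plus an assumed reconstruction term.

    L_GD = D(G(z, x), x) is formula (2); the difference between the generator
    output and the clean spectrum c is assumed here to be the mean squared
    difference over the n spectrum elements, weighted by the hyperparameter k.
    """
    fake = G(z, x)
    l_gd = D(fake, x).mean()
    l_diff = torch.mean((fake - c) ** 2)   # assumed stand-in for formulas (3) and (4)
    return l_gd + k * l_diff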
By adopting the above technical scheme, the invention has the following beneficial effects compared with the prior art: a network based on the generative adversarial idea is provided, in which the discriminator network supervises the training of the generator network, so that the style of the noise-mixed audio is finally unified with the style of the style template audio; the generator network model adopts a fully convolutional encoder-decoder structure and can perform the unification rapidly; training the network reduces manual intervention and makes automation easy, and the styles of other input audio can be conveniently adjusted according to the audio style selected by the user.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a partial schematic system flow diagram of an aspect of the present invention;
FIG. 2 is a structural diagram of the generator network of the present invention;
FIG. 3 is a structural diagram of the discriminator network of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is specifically noted that the following examples are only for illustrating the present invention, but do not limit the scope of the present invention. Likewise, the following examples are only some, but not all, of the examples of the present invention, and all other examples, which a person of ordinary skill in the art would obtain without making any inventive effort, are within the scope of the present invention.
In this embodiment, the collection of clean audio in the Tsinghua University Chinese speech data set THCHS is taken as the initial data set and used as the experimental data set;
the collection of 3 kinds of noise audio in the Tsinghua University Chinese speech data set THCHS is taken as the noise data set.
The style template audio is randomly extracted from the initial data set after resampling and segmentation or extracted from a pre-constructed style template audio library.
In this embodiment, a system block diagram of the implementation steps of the audio style unification method based on a generative adversarial network is shown in FIG. 1, and the implementation steps are as follows:
1. An experimental data set and a noise data set are obtained. The experimental data set is the collection of clean audio in the Tsinghua University Chinese speech data set (THCHS); the noise data set is the collection of 3 types of noise audio in the Tsinghua University Chinese speech data set (THCHS).
2. The experimental data set and the noise data set are preprocessed to generate noise-mixed audio and style template audio and to determine the associated training data set and test data set.
The method comprises the following steps:
(2.1) Resample the experimental data set and the noise data set to 16.384 kHz and divide each of them into 4-second segments.
(2.2) Generate the noise-mixed audio according to the formula Z = C + N * r, wherein C represents a segment of audio in the resampled and segmented experimental data set; N represents a segment of audio in the resampled and segmented noise data set; r represents a random number in [0.1, 0.3]; and Z represents a segment of the generated noise-mixed audio. The style template audio is randomly extracted from the resampled and segmented experimental data set.
(2.3) Randomly extract 85% of the noise-mixed audio and the style template audio as the training data set, and use the remaining 15% as the test data set.
3. Construct the generative network model and train the generator network G for unifying audio styles; the noise-mixed audio and the style template audio are input, and the target-style audio and the target-style spectrum are output.
As shown in FIG. 2, this step specifically includes:
(3.1) The generator network G consists of a noise-mixed audio encoder, a style template audio encoder and a decoder. The generator network G has two inputs and two outputs. One input is the spectrum of the noise-mixed audio after the forward Fourier transform, with size 257 x 513 x 1; the other is the spectrum of the style template audio after the forward Fourier transform, with size 257 x 513 x 1. One output is the target-style spectrum, with size 257 x 513 x 1, which is fed into the discrimination network for comparison; the other is the audio obtained by the inverse Fourier transform of the target-style spectrum.
(3.2) The noise-mixed audio encoder consists of 8 encoder units. Each encoder unit has a 3 x 3 convolution kernel, a stride of 2 and a ReLU activation function, and the numbers of convolution kernels of the encoder units are 16, 32, 64, 128, 256, 512, 1024 and 2048 in turn. The input of the first encoder unit is the spectrum of the noise-mixed audio after the forward Fourier transform, with size 257 x 513 x 1; the input of each subsequent encoder unit is the output feature of the previous encoder unit, and the output scale of the last encoder unit is 2 x 3 x 2048.
(3.3) The style template audio encoder consists of 8 encoder units. Each encoder unit has a 3 x 3 convolution kernel, a stride of 2 and a ReLU activation function, and the numbers of convolution kernels of the encoder units are 16, 32, 64, 128, 256, 512, 1024 and 2048 in turn. The input of the first encoder unit is the spectrum of the style template audio after the forward Fourier transform, with size 257 x 513 x 1; the input of each subsequent encoder unit is the output feature of the previous encoder unit, and the output scale of the last encoder unit is 2 x 3 x 2048.
(3.4) The decoder consists of 8 decoder units. Each decoder unit has a 3 x 3 deconvolution kernel, a stride of 2 and a ReLU activation function, and the numbers of deconvolution kernels of the decoder units are 1024, 512, 256, 128, 64, 32, 16 and 8 in turn. The input of the first decoder unit is the result of the tensor concatenation of the outputs of the noise-mixed audio encoder and the style template audio encoder; the input of each subsequent decoder unit is the result of the tensor concatenation of the output features of the previous decoder unit and the output of the encoder unit of the same size in the noise-mixed audio encoder, and the output scale of the last decoder unit is 257 x 513 x 1.
4. Construct the discrimination network model and train the discriminator network D to measure the similarity between the target-style spectrum output by the generator and the style template spectrum; the target-style spectrum output by the generator and the style template spectrum are input and discriminated, and a probability score mapped to [0, 1] is output.
As shown in FIG. 3, this step specifically includes:
(4.1) The discriminator network D consists of 6 convolutional layers and 5 fully connected layers. The discriminator network D has two inputs and one output. One input is the target-style spectrum output by the generator, with size 257 x 513 x 1; the other is the spectrum of the style template audio after the forward Fourier transform, with size 257 x 513 x 1. The output is a probability score between [0, 1].
(4.2) Before the convolutional layers, the target-style spectrum output by the generator and the spectrum of the style template audio after the forward Fourier transform are concatenated as tensors to form an input of size 257 x 513 x 2, which is fed to the convolutional layers. Each convolutional layer has a 3 x 3 convolution kernel and a stride of 2, batch normalization (BatchNorm) is applied before the convolution, the activation function is ReLU, and the numbers of channels of the convolutional layers are 32, 64, 128, 256, 512 and 1024 in turn. The input of the first convolutional layer is the result of the tensor concatenation; the input of each subsequent convolutional layer is the output feature of the previous convolutional layer, and the output scale of the last convolutional layer is 5 x 9 x 1024.
(4.3) The numbers of neurons of the fully connected layers are 46080, 1024, 256, 64 and 1 in turn, with the last layer using sigmoid as the activation function and the other layers using ReLU. The input of the fully connected layers is the flattened output of the last convolutional layer, and the output of the fully connected layers is a probability score between [0, 1] measuring the similarity between the target-style spectrum output by the generator and the style template spectrum.
5. Construct the loss function model; the loss function consists of two parts: one part is generated by the generator network G and measures the degree of information loss, and the other part is generated by the discriminator network D and evaluates the degree of style loss. Then train the generative adversarial network and find the parameters giving the best network performance by optimizing the network parameters.
The method specifically comprises the following steps:
(5.1) The loss function L_D of the discriminator network D is defined as:
L_D = (D(c, x) - 1)^2 + (D(G(z, x), x))^2    (1)
(5.2) The loss function L_G of the generator network G consists of two parts: one part is the output L_GD of the discriminator, and the other part is the difference between the output of the generator and the audio of the experimental data set, wherein
L_GD = D(G(z, x), x)    (2)
In formulas (1), (2), (3) and (4), n is the number of matrix elements in the target-style spectrum output by the generator network G; c is the spectrum obtained by the forward Fourier transform of a segment of audio in the experimental data set; z is the spectrum of the noise-mixed audio after the forward Fourier transform; x is the spectrum of the style template audio after the forward Fourier transform; k is a hyperparameter used to control the weights of the two loss parts.
(5.3) The generator network G is optimized with the Adam algorithm at a learning rate of 0.001; the discriminator network D is optimized with the Adam algorithm at a learning rate of 0.0001, and the parameters giving the best network performance are found by optimizing the network parameters.
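The alternating optimization of (5.1)-(5.3) can be sketched as follows: Adam with a learning rate of 0.001 for the generator and 0.0001 for the discriminator. Only the loss forms (1) and (2), the optimizer choice and the learning rates come from the text; the k-weighted difference term, the batch iterator and the Generator and Discriminator classes (sketched earlier) are assumptions or placeholders.

import torch

def make_optimizers(G, D):
    """Adam optimizers as described in (5.3): learning rate 0.001 for G and 0.0001 for D."""
    return (torch.optim.Adam(G.parameters(), lr=1e-3),
            torch.optim.Adam(D.parameters(), lr=1e-4))

def train_step(G, D, opt_g, opt_d, c, z, x, k=1.0):
    """One alternating update on a batch of spectra: c clean, z noise-mixed, x style template."""
    # Discriminator update, formula (1).
    opt_d.zero_grad()
    loss_d = ((D(c, x) - 1.0) ** 2 + D(G(z, x).detach(), x) ** 2).mean()
    loss_d.backward()
    opt_d.step()

    # Generator update, formula (2) plus the assumed k-weighted difference term.
    opt_g.zero_grad()
    fake = G(z, x)
    loss_g = D(fake, x).mean() + k * torch.mean((fake - c) ** 2)
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()

# Hypothetical training loop over batches of spectra:
# G, D = Generator(), Discriminator()
# opt_g, opt_d = make_optimizers(G, D)
# for c, z, x in training_batches:
#     train_step(G, D, opt_g, opt_d, c, z, x)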
6. The generative adversarial network with the optimal parameters performs unified audio style conversion on the audio to be converted and outputs the style-converted audio.
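A sketch of this conversion step is shown below. The analysis parameters match the earlier spectrum sketch (an assumption), phase is recovered with Griffin-Lim because the text does not say how phase is handled in the inverse Fourier transform, and the trained generator and the file names are placeholders.

import numpy as np
import librosa
import torch

SR, N_FFT, HOP = 16384, 512, 128   # assumed analysis parameters, matching the earlier sketch

def convert(generator, input_wav, template_wav):
    """Convert one 4-second clip to the style of the template and return a waveform."""
    def magnitude(path):
        y, _ = librosa.load(path, sr=SR, mono=True, duration=4.0)
        y = np.pad(y, (0, max(0, SR * 4 - len(y))))
        return np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP))   # 257 x 513

    z = torch.from_numpy(magnitude(input_wav)).float()[None, None]     # (1, 1, 257, 513)
    x = torch.from_numpy(magnitude(template_wav)).float()[None, None]
    with torch.no_grad():
        target = generator(z, x)[0, 0].numpy()                         # target-style spectrum
    # Griffin-Lim phase recovery stands in for the inverse Fourier transform of the text.
    return librosa.griffinlim(np.maximum(target, 0.0), n_fft=N_FFT, hop_length=HOP)

# Hypothetical usage with a trained generator checkpoint:
# G = Generator(); G.load_state_dict(torch.load("generator.pt")); G.eval()
# wav = convert(G, "to_convert.wav", "style_template.wav")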
The foregoing description is only a partial embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent devices or equivalent processes using the descriptions and the drawings of the present invention or directly or indirectly applied to other related technical fields are included in the scope of the present invention.

Claims (10)

1. An audio style unification method based on a generative adversarial network, comprising:
acquiring an initial data set and a noise data set;
preprocessing the initial data set and the noise data set according to preset conditions to generate noise-mixed audio;
acquiring style template audio;
constructing a generative network model and training a generator network G for unifying audio styles, wherein after the noise-mixed audio and the style template audio are input into the generator network G, target-style audio and a target-style spectrum are output;
acquiring the style template spectrum corresponding to the style template audio;
constructing a discrimination network model and training a discriminator network D for measuring the similarity between the target-style spectrum output by the generator network G and the style template spectrum, wherein after the target-style spectrum and the style template spectrum are input into the discriminator network D, the discriminator network D discriminates between them and outputs a probability score mapped to [0, 1];
constructing a loss function model connected to the generative network model and the discrimination network model, wherein the degree of information loss is calculated through the generator network G of the generative network model, the degree of style loss is evaluated through the discriminator network D of the discrimination network model, and the generative adversarial network is obtained by training;
and performing unified audio style conversion on the audio to be style-converted through the generative adversarial network, and outputting the style-converted audio.
2. The audio style unification method based on a generative adversarial network according to claim 1, wherein the initial data set comprises the collection of clean audio in the Tsinghua University Chinese speech data set THCHS;
the noise data set comprises the collection of 3 types of noise audio in the Tsinghua University Chinese speech data set THCHS.
3. The audio style unification method based on a generative adversarial network according to claim 1, wherein the style template spectrum is the spectrum of the style template audio after the forward Fourier transform.
4. The audio style unification method based on a generative adversarial network according to claim 3, wherein the initial data set and the noise data set are preprocessed according to the preset conditions to generate the noise-mixed audio as follows:
resampling the initial data set and the noise data set to 16.384 kHz and dividing each of them into segments of 4 seconds;
generating the noise-mixed audio according to a preset formula, wherein the formula for generating the noise-mixed audio is:
Z = C + N * r
wherein C represents a segment of audio in the resampled and segmented initial data set; N represents a segment of audio in the resampled and segmented noise data set; r represents a random number in [0.1, 0.3]; and Z represents a segment of the generated noise-mixed audio.
5. The audio style unification method based on a generative adversarial network according to claim 4, wherein the style template audio is randomly extracted from the resampled and segmented initial data set or extracted from a pre-constructed style template audio library.
6. The audio style unification method based on a generative adversarial network according to claim 5, wherein 85% of the audio units in the noise-mixed audio and the style template audio are randomly extracted as the training data set, and the remaining 15% are used as the test data set; the training data set and the test data set are used for training and testing the generator network G and/or the discriminator network D.
7. The audio style unification method based on a generative adversarial network according to any one of claims 4 to 6, wherein the generator network G includes a noise-mixed audio encoder, a style template audio encoder and a decoder;
the generator network G has two inputs and two outputs: one input receives the spectrum of the noise-mixed audio after the forward Fourier transform, with size 257 x 513 x 1, and the other input receives the style template spectrum, with size 257 x 513 x 1; one output produces the target-style spectrum, with size 257 x 513 x 1, which is fed into the discriminator network D for comparison, and the other output produces the audio obtained by applying the inverse Fourier transform to the target-style spectrum, i.e. the target-style audio;
the noise-mixed audio encoder includes 8 encoder units; each encoder unit has a 3 x 3 convolution kernel, a stride of 2 and a ReLU activation function, and the numbers of convolution kernels of the encoder units are 16, 32, 64, 128, 256, 512, 1024 and 2048 in sequence; the first encoder unit receives the spectrum of the noise-mixed audio after the forward Fourier transform, with size 257 x 513 x 1, the input feature of each subsequent encoder unit is the output feature of the previous encoder unit, and the output scale of the last encoder unit is 2 x 3 x 2048;
the style template audio encoder includes 8 encoder units; each encoder unit has a 3 x 3 convolution kernel, a stride of 2 and a ReLU activation function, and the numbers of convolution kernels of the encoder units are 16, 32, 64, 128, 256, 512, 1024 and 2048 in sequence; the first encoder unit receives the style template spectrum, with size 257 x 513 x 1, the input feature of each subsequent encoder unit is the output feature of the previous encoder unit, and the output scale of the last encoder unit is 2 x 3 x 2048;
the decoder includes 8 decoder units; each decoder unit has a 3 x 3 deconvolution kernel, a stride of 2 and a ReLU activation function, and the numbers of deconvolution kernels of the decoder units are 1024, 512, 256, 128, 64, 32, 16 and 8 in sequence; the first decoder unit receives the result of tensor concatenation of the output features of the noise-mixed audio encoder and the output features of the style template audio encoder, the input feature of each subsequent decoder unit is the output feature of the previous decoder unit, and the output scale of the last decoder unit is 257 x 513 x 1.
8. The audio style unification method based on a generative adversarial network according to claim 7, wherein the discriminator network D comprises 6 convolutional layers and 5 fully connected layers;
the discriminator network D has two inputs and one output: one input receives the target-style spectrum output by the generator network G, with size 257 x 513 x 1, and the other input receives the style template spectrum, with size 257 x 513 x 1; the output gives the similarity between the target-style spectrum and the style template spectrum, expressed as a probability score between [0, 1];
before entering the convolutional layers, the target-style spectrum and the style template spectrum input to the discriminator network D are concatenated as tensors into a 257 x 513 x 2 feature, which is fed to the convolutional layers; each convolutional layer has a 3 x 3 convolution kernel and a stride of 2, batch normalization (BatchNorm) is applied before the convolution, the activation function is ReLU, and the numbers of channels of the convolutional layers are 32, 64, 128, 256, 512 and 1024 in sequence; the input of the first convolutional layer is the result of the tensor concatenation of the target-style spectrum and the style template spectrum, the input feature of each subsequent convolutional layer is the output feature of the previous convolutional layer, and the output scale of the last convolutional layer is 5 x 9 x 1024;
the numbers of neurons of the fully connected layers are 46080, 1024, 256, 64 and 1 in sequence, wherein the last layer uses sigmoid as the activation function and the other layers use ReLU; the input of the fully connected layers is the flattened output of the last convolutional layer, and the output of the fully connected layers is the similarity between the target-style spectrum and the style template spectrum, expressed as a probability score between [0, 1].
9. The audio style unification method based on a generative adversarial network according to claim 8, wherein before the generative adversarial network performs unified audio style conversion on the audio to be style-converted, the network parameters of the generative adversarial network are further optimized to obtain the parameters giving the best network performance.
10. The audio style unification method based on a generative adversarial network according to claim 9, wherein the loss function model is constructed and connected to the generative network model and the discrimination network model, the degree of information loss is calculated through the generator network G of the generative network model, the degree of style loss is evaluated through the discriminator network D of the discrimination network model, and the generative adversarial network is then obtained by training, specifically as follows:
(1) The loss function L_D of the discriminator network D is defined as:
L_D = (D(c, x) - 1)^2 + (D(G(z, x), x))^2    (1)
(2) The loss function L_G of the generator network G consists of two parts: one part is L_GD, the output of the discriminator network D, and the other part is the difference between the target-style output of the generator network G and the corresponding audio of the initial data set, wherein
L_GD = D(G(z, x), x)    (2)
In formulas (1), (2), (3) and (4), n is the number of matrix elements in the target-style spectrum output by the generator network G; c is the spectrum obtained by the forward Fourier transform of a segment of audio in the initial data set; z is the spectrum of the noise-mixed audio after the forward Fourier transform; x is the spectrum of the style template audio after the forward Fourier transform; and k is a hyperparameter used to control the weights of the two loss parts;
(3) The generator network G is optimized with the Adam algorithm at a learning rate of 0.001, and the discriminator network D is optimized with the Adam algorithm at a learning rate of 0.0001, so that the parameters giving the best performance of the generative adversarial network are obtained by optimizing its parameters.
CN202110351514.1A 2021-03-31 2021-03-31 Audio style unification method based on a generative adversarial network Active CN113096675B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110351514.1A CN113096675B (en) 2021-03-31 2021-03-31 Audio style unification method based on a generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110351514.1A CN113096675B (en) 2021-03-31 2021-03-31 Audio style unification method based on a generative adversarial network

Publications (2)

Publication Number Publication Date
CN113096675A CN113096675A (en) 2021-07-09
CN113096675B (en) 2024-04-23

Family

ID=76672582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110351514.1A Active CN113096675B (en) 2021-03-31 2021-03-31 Audio style unification method based on a generative adversarial network

Country Status (1)

Country Link
CN (1) CN113096675B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114299969A (en) * 2021-08-19 2022-04-08 腾讯科技(深圳)有限公司 Audio synthesis method, apparatus, device and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110473154A (en) * 2019-07-31 2019-11-19 西安理工大学 A kind of image de-noising method based on generation confrontation network
CN110992252A (en) * 2019-11-29 2020-04-10 北京航空航天大学合肥创新研究院 Image multi-format conversion method based on latent variable feature generation
CN111816156A (en) * 2020-06-02 2020-10-23 南京邮电大学 Many-to-many voice conversion method and system based on speaker style feature modeling
CN112216257A (en) * 2020-09-29 2021-01-12 南方科技大学 Music style migration method, model training method, device and storage medium
CN112466316A (en) * 2020-12-10 2021-03-09 青海民族大学 Zero-sample voice conversion system based on generation countermeasure network
CN112562728A (en) * 2020-11-13 2021-03-26 百果园技术(新加坡)有限公司 Training method for generating confrontation network, and audio style migration method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106847294B (en) * 2017-01-17 2018-11-30 百度在线网络技术(北京)有限公司 Audio-frequency processing method and device based on artificial intelligence
US11854562B2 (en) * 2019-05-14 2023-12-26 International Business Machines Corporation High-quality non-parallel many-to-many voice conversion

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110473154A (en) * 2019-07-31 2019-11-19 西安理工大学 A kind of image de-noising method based on generation confrontation network
CN110992252A (en) * 2019-11-29 2020-04-10 北京航空航天大学合肥创新研究院 Image multi-format conversion method based on latent variable feature generation
CN111816156A (en) * 2020-06-02 2020-10-23 南京邮电大学 Many-to-many voice conversion method and system based on speaker style feature modeling
CN112216257A (en) * 2020-09-29 2021-01-12 南方科技大学 Music style migration method, model training method, device and storage medium
CN112562728A (en) * 2020-11-13 2021-03-26 百果园技术(新加坡)有限公司 Training method for generating confrontation network, and audio style migration method and device
CN112466316A (en) * 2020-12-10 2021-03-09 青海民族大学 Zero-sample voice conversion system based on generation countermeasure network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Music style conversion method for music with vocals based on CQT and Mel spectrogram; Ye Hongliang; Zhu Wanning; Hong Lei; Computer Science; 2021-12-31 (S1); full text *

Also Published As

Publication number Publication date
CN113096675A (en) 2021-07-09

Similar Documents

Publication Publication Date Title
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
US11908455B2 (en) Speech separation model training method and apparatus, storage medium and computer device
CN110600047B (en) Perceptual STARGAN-based multi-to-multi speaker conversion method
CN106952649A (en) Method for distinguishing speek person based on convolutional neural networks and spectrogram
WO2021128256A1 (en) Voice conversion method, apparatus and device, and storage medium
US20220253700A1 (en) Audio signal time sequence processing method, apparatus and system based on neural network, and computer-readable storage medium
CN109378010A (en) Training method, the speech de-noising method and device of neural network model
CN111179905A (en) Rapid dubbing generation method and device
CN110853656B (en) Audio tampering identification method based on improved neural network
CN110189766B (en) Voice style transfer method based on neural network
JP2021026130A (en) Information processing device, information processing method, recognition model and program
CN115294970B (en) Voice conversion method, device and storage medium for pathological voice
CN110047501A (en) Multi-to-multi phonetics transfer method based on beta-VAE
CN113096675B (en) Audio style unification method based on generation type countermeasure network
JP7124373B2 (en) LEARNING DEVICE, SOUND GENERATOR, METHOD AND PROGRAM
CN111724806A (en) Double-visual-angle single-channel voice separation method based on deep neural network
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
Ariff et al. Study of adam and adamax optimizers on alexnet architecture for voice biometric authentication system
Ong et al. Speech emotion recognition with light gradient boosting decision trees machine
CN111488486B (en) Electronic music classification method and system based on multi-sound-source separation
CN111860246A (en) Deep convolutional neural network-oriented data expansion method for heart sound signal classification
CN113707172B (en) Single-channel voice separation method, system and computer equipment of sparse orthogonal network
Choi et al. Adversarial speaker-consistency learning using untranscribed speech data for zero-shot multi-speaker text-to-speech
CN113112969B (en) Buddhism music notation method, device, equipment and medium based on neural network
Wan et al. Deep neural network based Chinese dialect classification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant