CN113096675B - Audio style unification method based on a generative adversarial network - Google Patents

Audio style unification method based on a generative adversarial network

Info

Publication number
CN113096675B
Authority
CN
China
Prior art keywords
audio
network
style
spectrum
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110351514.1A
Other languages
Chinese (zh)
Other versions
CN113096675A (en)
Inventor
欧阳童洁
杨志军
谢晖泷
胡天林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University
Priority to CN202110351514.1A
Publication of CN113096675A
Application granted
Publication of CN113096675B
Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L21/013 Adapting to target pitch
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis, using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L19/18 Vocoders using multiple modes
    • G10L19/20 Vocoders using multiple modes using sound class specific coding, hybrid encoders or object based coding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L21/013 Adapting to target pitch
    • G10L2021/0135 Voice conversion or morphing

Abstract

The invention discloses an audio style unification method based on a generative adversarial network, which comprises the following steps. Step 1: acquire an initial data set and a noise data set. Step 2: preprocess the initial data set and the noise data set to generate noise-mixed audio and style template audio, and determine the training data set and the test data set associated with them. Step 3: build a generative network model and train a generator network G for unifying audio styles; the noise-mixed audio and the style template audio are input, and target-style audio and a target-style spectrum are output. Step 4: build a discrimination network model and train a discriminator network D to measure the similarity between the target-style spectrum output by the generator and the style template spectrum. Step 5: construct a loss function model and train the generative adversarial network. With this scheme, the audio style unification method based on a generative adversarial network can adjust the style of other input audio according to the audio style selected by the user.

Description

Audio style unification method based on a generative adversarial network
Technical Field
The invention relates to the technical field of deep learning, and in particular to an audio style unification method based on a generative adversarial network.
Background
Audio style unification refers to adding the characteristics of a particular speaker, such as timbre and paralanguage (emotion and intonation), to synthesized audio; it is also called speech style transfer. Research on speech style transfer not only advances the theory of speech signal processing but also promotes the fusion of theory and applications across fields, and therefore occupies an important position.
Speech style transfer technology has been developed for decades, and with the progress of voice conversion technology it has achieved many results. Chu and Lv proposed a method for male-female voice conversion based on the time-domain pitch-synchronous overlap-add technique, combining the PSOLA algorithm with a speech sinusoidal model [1]. Desai et al. proposed using a BP neural network to realize voice conversion [2]. Benefiting from the development of deep learning, researchers have revised earlier models; for example, Sun, Lifa, et al. used long short-term memory (LSTM) networks to realize voice conversion [3]. To further improve voice conversion quality, Chris Donahue et al. proposed WaveGAN, based on a deep convolutional generative adversarial network, for adversarial audio synthesis [4]; however, because the speech signal is processed directly and simply into a spectrogram, the experimental effect is not ideal.
References
[1] Chu M, Lv S N. A synthesis method combining the PSOLA algorithm and a speech sinusoidal model [C] // Proceedings of the 5th National Conference on Human-Machine Speech Communication, 1998.
[2] Desai S, Raghavendra E V, Yegnanarayana B, et al. Voice conversion using Artificial Neural Networks [J]. 2009.
[3] Greff K, Srivastava R K, Koutnik J, et al. LSTM: A Search Space Odyssey [J]. IEEE Transactions on Neural Networks & Learning Systems, 2016, 28(10): 2222-2232.
[4] Donahue C, McAuley J, Puckette M. Adversarial Audio Synthesis [J]. 2018.
Disclosure of Invention
In view of the above, the present invention aims to provide an audio style unification method based on a generative adversarial network, which requires little manual intervention, is easy to automate, is reliable and convenient to implement, and processes audio rapidly.
In order to achieve the technical purpose, the invention adopts the following technical scheme:
a method of audio style unification based on a generative countermeasure network, comprising:
S01, acquiring an initial data set and a noise data set;
S02, preprocessing the initial data set and the noise data set according to preset conditions to generate noise-mixed audio;
S03, acquiring style template audio;
S04, constructing a generative network model and training a generator network G for unifying audio styles, wherein after the noise-mixed audio and the style template audio are input into the generator network G, target-style audio and a target-style spectrum are output;
S05, acquiring the style template spectrum corresponding to the style template audio;
S06, constructing a discrimination network model and training a discriminator network D for measuring the similarity between the target-style spectrum output by the generator network G and the style template spectrum, wherein after the target-style spectrum and the style template spectrum are input into the discriminator network D, the discriminator network D discriminates between them and outputs a probability score mapped to [0, 1];
S07, constructing a loss function model connected to the generative network model and the discrimination network model, wherein the degree of information loss is calculated through the generator network G of the generative network model, the degree of style loss is evaluated through the discriminator network D of the discrimination network model, and the generative adversarial network is obtained by training;
S08, performing unified audio style conversion on the audio to be converted through the generative adversarial network, and outputting the style-converted audio.
As a possible implementation, the initial data set includes the collection of clean audio in the Tsinghua University Chinese speech data set THCHS;
the noise data set includes the collection of 3 types of noise audio in the Tsinghua University Chinese speech data set THCHS.
As a possible implementation, the style template spectrum is the spectrum obtained by applying the forward Fourier transform to the style template audio.
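As an illustration of how such a spectrum can be computed, the short sketch below applies a short-time Fourier transform to one 4-second clip sampled at 16.384 kHz. The patent does not state the transform parameters; the FFT size of 512 and hop length of 128 are assumptions chosen so that the result has the 257 x 513 size referred to later in this description, and the file name is a placeholder.

import numpy as np
import librosa

SR = 16384        # 16.384 kHz, the resampling rate used in step S02
SEG = SR * 4      # 4-second segment length
N_FFT = 512       # assumed FFT size: 512 // 2 + 1 = 257 frequency bins
HOP = 128         # assumed hop length: 1 + 65536 // 128 = 513 frames

def audio_to_spectrum(path):
    """Load one clip, pad or trim it to 4 seconds, and return a 257 x 513 magnitude spectrum."""
    y, _ = librosa.load(path, sr=SR, mono=True)
    if len(y) < SEG:
        y = np.pad(y, (0, SEG - len(y)))
    y = y[:SEG]
    return np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP))

# Hypothetical usage:
# x_spec = audio_to_spectrum("style_template.wav")
# print(x_spec.shape)  # (257, 513)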
As a preferred embodiment, in step S02, the initial data set and the noise data set are preprocessed according to the preset conditions to generate the noise-mixed audio as follows:
S021, resampling the initial data set and the noise data set to 16.384 kHz and dividing each of them into segments of 4 seconds;
S022, generating the noise-mixed audio according to a preset formula, wherein the formula for generating the noise-mixed audio is:
Z = C + N * r
wherein C represents a segment of audio in the resampled and segmented initial data set; N represents a segment of audio in the resampled and segmented noise data set; r represents a random number in [0.1, 0.3]; and Z represents a segment of the generated noise-mixed audio.
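A minimal sketch of this mixing step, under the assumption that the clean and noise clips have already been resampled to 16.384 kHz and cut to the same 4-second length, is shown below; the file names are placeholders.

import numpy as np
import librosa

SR = 16384       # 16.384 kHz
SEG = SR * 4     # 4-second segment length

def load_segment(path):
    """Load one file at 16.384 kHz and return its first 4-second segment."""
    y, _ = librosa.load(path, sr=SR, mono=True)
    if len(y) < SEG:
        y = np.pad(y, (0, SEG - len(y)))
    return y[:SEG]

def mix_noise(clean, noise):
    """Z = C + N * r with r drawn uniformly from [0.1, 0.3]."""
    r = np.random.uniform(0.1, 0.3)
    return clean + noise * r

# Hypothetical file names standing in for clips from the clean and noise collections:
# c = load_segment("clean_clip.wav")
# n = load_segment("noise_clip.wav")
# z = mix_noise(c, n)   # one segment of noise-mixed audio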
Preferably, the style template audio is randomly extracted from the initial data set after resampling and segmentation or extracted from a pre-constructed style template audio library.
As a preferred implementation manner, preferably, 85% of audio units in the noise mixed audio and the style template audio are also randomly extracted as training data sets, and the remaining 15% are used as test data sets; the training data set and the test data set are used for training or testing of the generator network G and/or the arbiter network D.
As a preferred embodiment, the generator network G comprises a noise-mixed audio encoder, a style template audio encoder and a decoder;
the generator network G has two inputs and two outputs: one input receives the spectrum of the noise-mixed audio after the forward Fourier transform, with size 257 x 513 x 1, and the other input receives the style template spectrum, with size 257 x 513 x 1; one output produces the target-style spectrum, with size 257 x 513 x 1, which is fed into the discriminator network D for comparison, and the other output produces the audio obtained by applying the inverse Fourier transform to the target-style spectrum, i.e. the target-style audio;
the noise-mixed audio encoder includes 8 encoder units; each encoder unit has a 3 x 3 convolution kernel, a stride of 2 and a ReLU activation function, and the numbers of convolution kernels of the encoder units are 16, 32, 64, 128, 256, 512, 1024 and 2048 in sequence; the first encoder unit receives the spectrum of the noise-mixed audio after the forward Fourier transform, with size 257 x 513 x 1, the input feature of each subsequent encoder unit is the output feature of the previous encoder unit, and the output scale of the last encoder unit is 2 x 3 x 2048;
the style template audio encoder includes 8 encoder units; each encoder unit has a 3 x 3 convolution kernel, a stride of 2 and a ReLU activation function, and the numbers of convolution kernels of the encoder units are 16, 32, 64, 128, 256, 512, 1024 and 2048 in sequence; the first encoder unit receives the style template spectrum, with size 257 x 513 x 1, the input feature of each subsequent encoder unit is the output feature of the previous encoder unit, and the output scale of the last encoder unit is 2 x 3 x 2048;
the decoder includes 8 decoder units; each decoder unit has a 3 x 3 deconvolution kernel, a stride of 2 and a ReLU activation function, and the numbers of deconvolution kernels of the decoder units are 1024, 512, 256, 128, 64, 32, 16 and 8 in sequence; the first decoder unit receives the result of tensor concatenation of the output features of the noise-mixed audio encoder and the output features of the style template audio encoder, the input feature of each subsequent decoder unit is the output feature of the previous decoder unit, and the output scale of the last decoder unit is 257 x 513 x 1.
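The layout above can be sketched in PyTorch as follows. This is a minimal sketch, not the patented implementation: it follows the simpler variant described here, in which the two encoder outputs are concatenated once at the bottleneck (the detailed embodiment later additionally concatenates same-resolution encoder features into the decoder, U-Net style), and the convolution padding, output padding and final one-channel projection are assumptions chosen so that the stated 2 x 3 x 2048 and 257 x 513 x 1 sizes are reproduced.

import torch
import torch.nn as nn

ENC_CH = [16, 32, 64, 128, 256, 512, 1024, 2048]
DEC_CH = [1024, 512, 256, 128, 64, 32, 16, 8]

def make_encoder(in_ch):
    """8 encoder units: 3 x 3 convolution, stride 2, ReLU (257 x 513 x 1 -> 2 x 3 x 2048)."""
    layers, prev = [], in_ch
    for ch in ENC_CH:
        layers += [nn.Conv2d(prev, ch, kernel_size=3, stride=2, padding=1), nn.ReLU()]
        prev = ch
    return nn.Sequential(*layers)

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.noise_enc = make_encoder(1)   # branch for the noise-mixed audio spectrum
        self.style_enc = make_encoder(1)   # branch for the style template spectrum
        dec, prev = [], 2 * ENC_CH[-1]     # bottleneck concatenation: 2 x 3 x 4096
        for ch in DEC_CH:                  # 8 decoder units: 3 x 3 deconvolution, stride 2, ReLU
            dec += [nn.ConvTranspose2d(prev, ch, kernel_size=3, stride=2, padding=1), nn.ReLU()]
            prev = ch
        dec += [nn.Conv2d(prev, 1, kernel_size=1)]   # assumed projection back to one channel
        self.decoder = nn.Sequential(*dec)

    def forward(self, z_spec, x_spec):
        """z_spec, x_spec: tensors of shape (batch, 1, 257, 513); returns a target-style spectrum of the same shape."""
        h = torch.cat([self.noise_enc(z_spec), self.style_enc(x_spec)], dim=1)
        return self.decoder(h)

# g = Generator()
# out = g(torch.randn(1, 1, 257, 513), torch.randn(1, 1, 257, 513))
# print(out.shape)  # torch.Size([1, 1, 257, 513])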
As a preferred embodiment, the discriminator network D comprises 6 convolutional layers and 5 fully connected layers;
the discriminator network D has two inputs and one output: one input receives the target-style spectrum output by the generator network G, with size 257 x 513 x 1, and the other input receives the style template spectrum, with size 257 x 513 x 1; the output gives the similarity between the target-style spectrum and the style template spectrum, expressed as a probability score between [0, 1];
before entering the convolutional layers, the target-style spectrum and the style template spectrum input to the discriminator network D are concatenated as tensors into a 257 x 513 x 2 feature, which is fed to the convolutional layers; each convolutional layer has a 3 x 3 convolution kernel and a stride of 2, batch normalization (BatchNorm) is applied before the convolution, the activation function is ReLU, and the numbers of channels of the convolutional layers are 32, 64, 128, 256, 512 and 1024 in sequence; the input of the first convolutional layer is the result of the tensor concatenation of the target-style spectrum and the style template spectrum, the input feature of each subsequent convolutional layer is the output feature of the previous convolutional layer, and the output scale of the last convolutional layer is 5 x 9 x 1024;
the numbers of neurons of the fully connected layers are 46080, 1024, 256, 64 and 1 in sequence, wherein the last layer uses sigmoid as the activation function and the other layers use ReLU; the input of the fully connected layers is the flattened output of the last convolutional layer, and the output of the fully connected layers is the similarity between the target-style spectrum and the style template spectrum, expressed as a probability score between [0, 1].
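A corresponding PyTorch sketch of the discriminator is shown below. The convolution padding is an assumption chosen so that the stated 5 x 9 x 1024 output (46080 values after flattening) is reproduced, and since the 46080-unit flattened input is counted as the first of the 5 fully connected layers, the sketch contains 4 weight matrices.

import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """6 convolutional layers (BatchNorm, then 3 x 3 convolution with stride 2, then ReLU) followed by the fully connected head."""

    def __init__(self):
        super().__init__()
        conv, prev = [], 2                    # input: the two concatenated spectra, 257 x 513 x 2
        for ch in [32, 64, 128, 256, 512, 1024]:
            conv += [nn.BatchNorm2d(prev),    # batch normalization before the convolution
                     nn.Conv2d(prev, ch, kernel_size=3, stride=2, padding=1),
                     nn.ReLU()]
            prev = ch
        self.conv = nn.Sequential(*conv)      # output: 5 x 9 x 1024
        self.fc = nn.Sequential(
            nn.Flatten(),                     # 5 * 9 * 1024 = 46080
            nn.Linear(46080, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),   # probability score in [0, 1]
        )

    def forward(self, target_spec, style_spec):
        pair = torch.cat([target_spec, style_spec], dim=1)   # (batch, 2, 257, 513)
        return self.fc(self.conv(pair))

# d = Discriminator()
# score = d(torch.randn(1, 1, 257, 513), torch.randn(1, 1, 257, 513))
# print(float(score))  # a value between 0 and 1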
As a preferred implementation, before the generative adversarial network performs unified audio style conversion on the audio to be style-converted, the network parameters of the generative adversarial network are further optimized to obtain the parameters giving the best network performance.
As a preferred implementation, the loss function model is constructed and connected to the generative network model and the discrimination network model, the degree of information loss is calculated through the generator network G of the generative network model, the degree of style loss is evaluated through the discriminator network D of the discrimination network model, and the generative adversarial network is then obtained by training, specifically as follows:
(1) The loss function L_D of the discriminator network D is defined as:
L_D = (D(c, x) - 1)^2 + (D(G(z, x), x))^2    (1)
(2) The loss function L_G of the generator network G consists of two parts: one part is L_GD, the output of the discriminator network D, and the other part is the difference between the target-style output of the generator network G and the corresponding audio of the initial data set, wherein
L_GD = D(G(z, x), x)    (2)
In formulas (1), (2), (3) and (4), n is the number of matrix elements in the target-style spectrum output by the generator network G; c is the spectrum obtained by the forward Fourier transform of a segment of audio in the initial data set; z is the spectrum of the noise-mixed audio after the forward Fourier transform; x is the spectrum of the style template audio after the forward Fourier transform; and k is a hyperparameter used to control the weights of the two loss parts;
(3) The generator network G is optimized with the Adam algorithm at a learning rate of 0.001, and the discriminator network D is optimized with the Adam algorithm at a learning rate of 0.0001, so that the parameters giving the best performance of the generative adversarial network are obtained by optimizing its parameters.
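Formulas (1) and (2) translate directly into code. Formulas (3) and (4), which define the difference term and the combined generator loss, are not reproduced in the text above, so the sketch below assumes a mean squared difference over the n spectrum elements weighted by k; that assumed form, the helper names and the default value of k are not taken from the patent.

import torch

def discriminator_loss(D, G, c, z, x):
    """Formula (1): L_D = (D(c, x) - 1)^2 + (D(G(z, x), x))^2, averaged over the batch."""
    return ((D(c, x) - 1.0) ** 2 + D(G(z, x), x) ** 2).mean()

def generator_loss(D, G, c, z, x, k=1.0):
    """Formula (2) plus an assumed reconstruction term.

    L_GD = D(G(z, x), x) is formula (2); the difference between the generator
    output and the clean spectrum c is assumed here to be the mean squared
    difference over the n spectrum elements, weighted by the hyperparameter k.
    """
    fake = G(z, x)
    l_gd = D(fake, x).mean()
    l_diff = torch.mean((fake - c) ** 2)   # assumed stand-in for formulas (3) and (4)
    return l_gd + k * l_diff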
By adopting the above technical scheme, the invention has the following beneficial effects compared with the prior art: a network based on the generative adversarial idea is provided, in which the discriminator network supervises the training of the generator network, so that the style of the noise-mixed audio is finally unified with the style of the style template audio; the generator network model adopts a fully convolutional encoder-decoder structure and can perform the unification rapidly; training the network reduces manual intervention and makes automation easy, and the styles of other input audio can be conveniently adjusted according to the audio style selected by the user.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a partial schematic system flow diagram of an aspect of the present invention;
FIG. 2 is a structural diagram of the generator network of the present invention;
FIG. 3 is a structural diagram of the discriminator network of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is specifically noted that the following examples are only for illustrating the present invention, but do not limit the scope of the present invention. Likewise, the following examples are only some, but not all, of the examples of the present invention, and all other examples, which a person of ordinary skill in the art would obtain without making any inventive effort, are within the scope of the present invention.
In this embodiment, the collection of clean audio in the Tsinghua University Chinese speech data set THCHS is taken as the initial data set and used as the experimental data set;
the collection of 3 kinds of noise audio in the Tsinghua University Chinese speech data set THCHS is taken as the noise data set.
The style template audio is randomly extracted from the initial data set after resampling and segmentation or extracted from a pre-constructed style template audio library.
In this embodiment, a system block diagram of the implementation steps of the audio style unification method based on a generative adversarial network is shown in FIG. 1, and the implementation steps are as follows:
1. An experimental data set and a noise data set are obtained. The experimental data set is the collection of clean audio in the Tsinghua University Chinese speech data set (THCHS); the noise data set is the collection of 3 types of noise audio in the Tsinghua University Chinese speech data set (THCHS).
2. The experimental data set and the noise data set are preprocessed to generate noise-mixed audio and style template audio and to determine the associated training data set and test data set.
The method comprises the following steps:
(2.1) Resample the experimental data set and the noise data set to 16.384 kHz and divide each of them into 4-second segments.
(2.2) Generate the noise-mixed audio according to the formula Z = C + N * r, wherein C represents a segment of audio in the resampled and segmented experimental data set; N represents a segment of audio in the resampled and segmented noise data set; r represents a random number in [0.1, 0.3]; and Z represents a segment of the generated noise-mixed audio. The style template audio is randomly extracted from the resampled and segmented experimental data set.
(2.3) Randomly extract 85% of the noise-mixed audio and the style template audio as the training data set, and use the remaining 15% as the test data set.
3. Construct the generative network model and train the generator network G for unifying audio styles; the noise-mixed audio and the style template audio are input, and the target-style audio and the target-style spectrum are output.
As shown in FIG. 2, this step specifically includes:
(3.1) The generator network G consists of a noise-mixed audio encoder, a style template audio encoder and a decoder. The generator network G has two inputs and two outputs. One input is the spectrum of the noise-mixed audio after the forward Fourier transform, with size 257 x 513 x 1; the other is the spectrum of the style template audio after the forward Fourier transform, with size 257 x 513 x 1. One output is the target-style spectrum, with size 257 x 513 x 1, which is fed into the discrimination network for comparison; the other is the audio obtained by the inverse Fourier transform of the target-style spectrum.
(3.2) The noise-mixed audio encoder consists of 8 encoder units. Each encoder unit has a 3 x 3 convolution kernel, a stride of 2 and a ReLU activation function, and the numbers of convolution kernels of the encoder units are 16, 32, 64, 128, 256, 512, 1024 and 2048 in turn. The input of the first encoder unit is the spectrum of the noise-mixed audio after the forward Fourier transform, with size 257 x 513 x 1; the input of each subsequent encoder unit is the output feature of the previous encoder unit, and the output scale of the last encoder unit is 2 x 3 x 2048.
(3.3) The style template audio encoder consists of 8 encoder units. Each encoder unit has a 3 x 3 convolution kernel, a stride of 2 and a ReLU activation function, and the numbers of convolution kernels of the encoder units are 16, 32, 64, 128, 256, 512, 1024 and 2048 in turn. The input of the first encoder unit is the spectrum of the style template audio after the forward Fourier transform, with size 257 x 513 x 1; the input of each subsequent encoder unit is the output feature of the previous encoder unit, and the output scale of the last encoder unit is 2 x 3 x 2048.
(3.4) The decoder consists of 8 decoder units. Each decoder unit has a 3 x 3 deconvolution kernel, a stride of 2 and a ReLU activation function, and the numbers of deconvolution kernels of the decoder units are 1024, 512, 256, 128, 64, 32, 16 and 8 in turn. The input of the first decoder unit is the result of the tensor concatenation of the outputs of the noise-mixed audio encoder and the style template audio encoder; the input of each subsequent decoder unit is the result of the tensor concatenation of the output features of the previous decoder unit and the output of the encoder unit of the same size in the noise-mixed audio encoder, and the output scale of the last decoder unit is 257 x 513 x 1.
4. Construct the discrimination network model and train the discriminator network D to measure the similarity between the target-style spectrum output by the generator and the style template spectrum; the target-style spectrum output by the generator and the style template spectrum are input and discriminated, and a probability score mapped to [0, 1] is output.
As shown in FIG. 3, this step specifically includes:
(4.1) The discriminator network D consists of 6 convolutional layers and 5 fully connected layers. The discriminator network D has two inputs and one output. One input is the target-style spectrum output by the generator, with size 257 x 513 x 1; the other is the spectrum of the style template audio after the forward Fourier transform, with size 257 x 513 x 1. The output is a probability score between [0, 1].
(4.2) Before the convolutional layers, the target-style spectrum output by the generator and the spectrum of the style template audio after the forward Fourier transform are concatenated as tensors to form an input of size 257 x 513 x 2, which is fed to the convolutional layers. Each convolutional layer has a 3 x 3 convolution kernel and a stride of 2, batch normalization (BatchNorm) is applied before the convolution, the activation function is ReLU, and the numbers of channels of the convolutional layers are 32, 64, 128, 256, 512 and 1024 in turn. The input of the first convolutional layer is the result of the tensor concatenation; the input of each subsequent convolutional layer is the output feature of the previous convolutional layer, and the output scale of the last convolutional layer is 5 x 9 x 1024.
(4.3) The numbers of neurons of the fully connected layers are 46080, 1024, 256, 64 and 1 in turn, with the last layer using sigmoid as the activation function and the other layers using ReLU. The input of the fully connected layers is the flattened output of the last convolutional layer, and the output of the fully connected layers is a probability score between [0, 1] measuring the similarity between the target-style spectrum output by the generator and the style template spectrum.
5. Construct the loss function model; the loss function consists of two parts: one part is generated by the generator network G and measures the degree of information loss, and the other part is generated by the discriminator network D and evaluates the degree of style loss. Then train the generative adversarial network and find the parameters giving the best network performance by optimizing the network parameters.
The method specifically comprises the following steps:
(5.1) The loss function L_D of the discriminator network D is defined as:
L_D = (D(c, x) - 1)^2 + (D(G(z, x), x))^2    (1)
(5.2) The loss function L_G of the generator network G consists of two parts: one part is the output L_GD of the discriminator, and the other part is the difference between the output of the generator and the audio of the experimental data set, wherein
L_GD = D(G(z, x), x)    (2)
In formulas (1), (2), (3) and (4), n is the number of matrix elements in the target-style spectrum output by the generator network G; c is the spectrum obtained by the forward Fourier transform of a segment of audio in the experimental data set; z is the spectrum of the noise-mixed audio after the forward Fourier transform; x is the spectrum of the style template audio after the forward Fourier transform; k is a hyperparameter used to control the weights of the two loss parts.
(5.3) The generator network G is optimized with the Adam algorithm at a learning rate of 0.001; the discriminator network D is optimized with the Adam algorithm at a learning rate of 0.0001, and the parameters giving the best network performance are found by optimizing the network parameters.
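The alternating optimization of (5.1)-(5.3) can be sketched as follows: Adam with a learning rate of 0.001 for the generator and 0.0001 for the discriminator. Only the loss forms (1) and (2), the optimizer choice and the learning rates come from the text; the k-weighted difference term, the batch iterator and the Generator and Discriminator classes (sketched earlier) are assumptions or placeholders.

import torch

def make_optimizers(G, D):
    """Adam optimizers as described in (5.3): learning rate 0.001 for G and 0.0001 for D."""
    return (torch.optim.Adam(G.parameters(), lr=1e-3),
            torch.optim.Adam(D.parameters(), lr=1e-4))

def train_step(G, D, opt_g, opt_d, c, z, x, k=1.0):
    """One alternating update on a batch of spectra: c clean, z noise-mixed, x style template."""
    # Discriminator update, formula (1).
    opt_d.zero_grad()
    loss_d = ((D(c, x) - 1.0) ** 2 + D(G(z, x).detach(), x) ** 2).mean()
    loss_d.backward()
    opt_d.step()

    # Generator update, formula (2) plus the assumed k-weighted difference term.
    opt_g.zero_grad()
    fake = G(z, x)
    loss_g = D(fake, x).mean() + k * torch.mean((fake - c) ** 2)
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()

# Hypothetical training loop over batches of spectra:
# G, D = Generator(), Discriminator()
# opt_g, opt_d = make_optimizers(G, D)
# for c, z, x in training_batches:
#     train_step(G, D, opt_g, opt_d, c, z, x)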
6. The generative adversarial network with the optimal parameters performs unified audio style conversion on the audio to be converted and outputs the style-converted audio.
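A sketch of this conversion step is shown below. The analysis parameters match the earlier spectrum sketch (an assumption), phase is recovered with Griffin-Lim because the text does not say how phase is handled in the inverse Fourier transform, and the trained generator and the file names are placeholders.

import numpy as np
import librosa
import torch

SR, N_FFT, HOP = 16384, 512, 128   # assumed analysis parameters, matching the earlier sketch

def convert(generator, input_wav, template_wav):
    """Convert one 4-second clip to the style of the template and return a waveform."""
    def magnitude(path):
        y, _ = librosa.load(path, sr=SR, mono=True, duration=4.0)
        y = np.pad(y, (0, max(0, SR * 4 - len(y))))
        return np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP))   # 257 x 513

    z = torch.from_numpy(magnitude(input_wav)).float()[None, None]     # (1, 1, 257, 513)
    x = torch.from_numpy(magnitude(template_wav)).float()[None, None]
    with torch.no_grad():
        target = generator(z, x)[0, 0].numpy()                         # target-style spectrum
    # Griffin-Lim phase recovery stands in for the inverse Fourier transform of the text.
    return librosa.griffinlim(np.maximum(target, 0.0), n_fft=N_FFT, hop_length=HOP)

# Hypothetical usage with a trained generator checkpoint:
# G = Generator(); G.load_state_dict(torch.load("generator.pt")); G.eval()
# wav = convert(G, "to_convert.wav", "style_template.wav")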
The foregoing description is only a partial embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent devices or equivalent processes using the descriptions and the drawings of the present invention or directly or indirectly applied to other related technical fields are included in the scope of the present invention.

Claims (10)

1. An audio style unification method based on a generative adversarial network, comprising:
acquiring an initial data set and a noise data set;
preprocessing the initial data set and the noise data set according to preset conditions to generate noise-mixed audio;
acquiring style template audio;
constructing a generative network model and training a generator network G for unifying audio styles, wherein after the noise-mixed audio and the style template audio are input into the generator network G, target-style audio and a target-style spectrum are output;
acquiring the style template spectrum corresponding to the style template audio;
constructing a discrimination network model and training a discriminator network D for measuring the similarity between the target-style spectrum output by the generator network G and the style template spectrum, wherein after the target-style spectrum and the style template spectrum are input into the discriminator network D, the discriminator network D discriminates between them and outputs a probability score mapped to [0, 1];
constructing a loss function model connected to the generative network model and the discrimination network model, wherein the degree of information loss is calculated through the generator network G of the generative network model, the degree of style loss is evaluated through the discriminator network D of the discrimination network model, and the generative adversarial network is obtained by training;
and performing unified audio style conversion on the audio to be style-converted through the generative adversarial network, and outputting the style-converted audio.
2. The audio style unification method based on a generative adversarial network according to claim 1, wherein the initial data set comprises the collection of clean audio in the Tsinghua University Chinese speech data set THCHS;
the noise data set comprises the collection of 3 types of noise audio in the Tsinghua University Chinese speech data set THCHS.
3. The audio style unification method based on a generative adversarial network according to claim 1, wherein the style template spectrum is the spectrum of the style template audio after the forward Fourier transform.
4. The audio style unification method based on a generative adversarial network according to claim 3, wherein the initial data set and the noise data set are preprocessed according to the preset conditions to generate the noise-mixed audio as follows:
resampling the initial data set and the noise data set to 16.384 kHz and dividing each of them into segments of 4 seconds;
generating the noise-mixed audio according to a preset formula, wherein the formula for generating the noise-mixed audio is:
Z = C + N * r
wherein C represents a segment of audio in the resampled and segmented initial data set; N represents a segment of audio in the resampled and segmented noise data set; r represents a random number in [0.1, 0.3]; and Z represents a segment of the generated noise-mixed audio.
5. The audio style unification method based on a generative adversarial network according to claim 4, wherein the style template audio is randomly extracted from the resampled and segmented initial data set or extracted from a pre-constructed style template audio library.
6. The audio style unification method based on a generative adversarial network according to claim 5, wherein 85% of the audio units in the noise-mixed audio and the style template audio are randomly extracted as the training data set, and the remaining 15% are used as the test data set; the training data set and the test data set are used for training and testing the generator network G and/or the discriminator network D.
7. The audio style unification method based on a generative adversarial network according to any one of claims 4 to 6, wherein the generator network G includes a noise-mixed audio encoder, a style template audio encoder and a decoder;
the generator network G has two inputs and two outputs: one input receives the spectrum of the noise-mixed audio after the forward Fourier transform, with size 257 x 513 x 1, and the other input receives the style template spectrum, with size 257 x 513 x 1; one output produces the target-style spectrum, with size 257 x 513 x 1, which is fed into the discriminator network D for comparison, and the other output produces the audio obtained by applying the inverse Fourier transform to the target-style spectrum, i.e. the target-style audio;
the noise-mixed audio encoder includes 8 encoder units; each encoder unit has a 3 x 3 convolution kernel, a stride of 2 and a ReLU activation function, and the numbers of convolution kernels of the encoder units are 16, 32, 64, 128, 256, 512, 1024 and 2048 in sequence; the first encoder unit receives the spectrum of the noise-mixed audio after the forward Fourier transform, with size 257 x 513 x 1, the input feature of each subsequent encoder unit is the output feature of the previous encoder unit, and the output scale of the last encoder unit is 2 x 3 x 2048;
the style template audio encoder includes 8 encoder units; each encoder unit has a 3 x 3 convolution kernel, a stride of 2 and a ReLU activation function, and the numbers of convolution kernels of the encoder units are 16, 32, 64, 128, 256, 512, 1024 and 2048 in sequence; the first encoder unit receives the style template spectrum, with size 257 x 513 x 1, the input feature of each subsequent encoder unit is the output feature of the previous encoder unit, and the output scale of the last encoder unit is 2 x 3 x 2048;
the decoder includes 8 decoder units; each decoder unit has a 3 x 3 deconvolution kernel, a stride of 2 and a ReLU activation function, and the numbers of deconvolution kernels of the decoder units are 1024, 512, 256, 128, 64, 32, 16 and 8 in sequence; the first decoder unit receives the result of tensor concatenation of the output features of the noise-mixed audio encoder and the output features of the style template audio encoder, the input feature of each subsequent decoder unit is the output feature of the previous decoder unit, and the output scale of the last decoder unit is 257 x 513 x 1.
8. The audio style unification method based on a generative adversarial network according to claim 7, wherein the discriminator network D comprises 6 convolutional layers and 5 fully connected layers;
the discriminator network D has two inputs and one output: one input receives the target-style spectrum output by the generator network G, with size 257 x 513 x 1, and the other input receives the style template spectrum, with size 257 x 513 x 1; the output gives the similarity between the target-style spectrum and the style template spectrum, expressed as a probability score between [0, 1];
before entering the convolutional layers, the target-style spectrum and the style template spectrum input to the discriminator network D are concatenated as tensors into a 257 x 513 x 2 feature, which is fed to the convolutional layers; each convolutional layer has a 3 x 3 convolution kernel and a stride of 2, batch normalization (BatchNorm) is applied before the convolution, the activation function is ReLU, and the numbers of channels of the convolutional layers are 32, 64, 128, 256, 512 and 1024 in sequence; the input of the first convolutional layer is the result of the tensor concatenation of the target-style spectrum and the style template spectrum, the input feature of each subsequent convolutional layer is the output feature of the previous convolutional layer, and the output scale of the last convolutional layer is 5 x 9 x 1024;
the numbers of neurons of the fully connected layers are 46080, 1024, 256, 64 and 1 in sequence, wherein the last layer uses sigmoid as the activation function and the other layers use ReLU; the input of the fully connected layers is the flattened output of the last convolutional layer, and the output of the fully connected layers is the similarity between the target-style spectrum and the style template spectrum, expressed as a probability score between [0, 1].
9. The audio style unification method based on a generative adversarial network according to claim 8, wherein before the generative adversarial network performs unified audio style conversion on the audio to be style-converted, the network parameters of the generative adversarial network are further optimized to obtain the parameters giving the best network performance.
10. The audio style unification method based on a generative adversarial network according to claim 9, wherein the loss function model is constructed and connected to the generative network model and the discrimination network model, the degree of information loss is calculated through the generator network G of the generative network model, the degree of style loss is evaluated through the discriminator network D of the discrimination network model, and the generative adversarial network is then obtained by training, specifically as follows:
(1) The loss function L_D of the discriminator network D is defined as:
L_D = (D(c, x) - 1)^2 + (D(G(z, x), x))^2    (1)
(2) The loss function L_G of the generator network G consists of two parts: one part is L_GD, the output of the discriminator network D, and the other part is the difference between the target-style output of the generator network G and the corresponding audio of the initial data set, wherein
L_GD = D(G(z, x), x)    (2)
In formulas (1), (2), (3) and (4), n is the number of matrix elements in the target-style spectrum output by the generator network G; c is the spectrum obtained by the forward Fourier transform of a segment of audio in the initial data set; z is the spectrum of the noise-mixed audio after the forward Fourier transform; x is the spectrum of the style template audio after the forward Fourier transform; and k is a hyperparameter used to control the weights of the two loss parts;
(3) The generator network G is optimized with the Adam algorithm at a learning rate of 0.001, and the discriminator network D is optimized with the Adam algorithm at a learning rate of 0.0001, so that the parameters giving the best performance of the generative adversarial network are obtained by optimizing its parameters.
CN202110351514.1A 2021-03-31 2021-03-31 Audio style unification method based on a generative adversarial network Active CN113096675B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110351514.1A CN113096675B (en) 2021-03-31 2021-03-31 Audio style unification method based on a generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110351514.1A CN113096675B (en) 2021-03-31 2021-03-31 Audio style unification method based on a generative adversarial network

Publications (2)

Publication Number Publication Date
CN113096675A CN113096675A (en) 2021-07-09
CN113096675B (en) 2024-04-23

Family

ID=76672582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110351514.1A Active CN113096675B (en) 2021-03-31 2021-03-31 Audio style unification method based on a generative adversarial network

Country Status (1)

Country Link
CN (1) CN113096675B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114299969A (en) * 2021-08-19 2022-04-08 腾讯科技(深圳)有限公司 Audio synthesis method, apparatus, device and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110473154A (en) * 2019-07-31 2019-11-19 西安理工大学 A kind of image de-noising method based on generation confrontation network
CN110992252A (en) * 2019-11-29 2020-04-10 北京航空航天大学合肥创新研究院 Image multi-format conversion method based on latent variable feature generation
CN111816156A (en) * 2020-06-02 2020-10-23 南京邮电大学 Many-to-many voice conversion method and system based on speaker style feature modeling
CN112216257A (en) * 2020-09-29 2021-01-12 南方科技大学 Music style migration method, model training method, device and storage medium
CN112466316A (en) * 2020-12-10 2021-03-09 青海民族大学 Zero-sample voice conversion system based on generation countermeasure network
CN112562728A (en) * 2020-11-13 2021-03-26 百果园技术(新加坡)有限公司 Training method for generating confrontation network, and audio style migration method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106847294B (en) * 2017-01-17 2018-11-30 百度在线网络技术(北京)有限公司 Audio-frequency processing method and device based on artificial intelligence
US11854562B2 (en) * 2019-05-14 2023-12-26 International Business Machines Corporation High-quality non-parallel many-to-many voice conversion

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110473154A (en) * 2019-07-31 2019-11-19 西安理工大学 A kind of image de-noising method based on generation confrontation network
CN110992252A (en) * 2019-11-29 2020-04-10 北京航空航天大学合肥创新研究院 Image multi-format conversion method based on latent variable feature generation
CN111816156A (en) * 2020-06-02 2020-10-23 南京邮电大学 Many-to-many voice conversion method and system based on speaker style feature modeling
CN112216257A (en) * 2020-09-29 2021-01-12 南方科技大学 Music style migration method, model training method, device and storage medium
CN112562728A (en) * 2020-11-13 2021-03-26 百果园技术(新加坡)有限公司 Training method for generating confrontation network, and audio style migration method and device
CN112466316A (en) * 2020-12-10 2021-03-09 青海民族大学 Zero-sample voice conversion system based on generation countermeasure network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Music style conversion method for music with vocals based on CQT and Mel spectrogram; Ye Hongliang; Zhu Wanning; Hong Lei; Computer Science; 2021-12-31 (S1); full text *

Also Published As

Publication number Publication date
CN113096675A (en) 2021-07-09

Similar Documents

Publication Publication Date Title
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
US11908455B2 (en) Speech separation model training method and apparatus, storage medium and computer device
CN110600047B (en) Perceptual STARGAN-based multi-to-multi speaker conversion method
CN106952649A (en) Method for distinguishing speek person based on convolutional neural networks and spectrogram
WO2021128256A1 (en) Voice conversion method, apparatus and device, and storage medium
US20220253700A1 (en) Audio signal time sequence processing method, apparatus and system based on neural network, and computer-readable storage medium
CN109378010A (en) Training method, the speech de-noising method and device of neural network model
CN111179905A (en) Rapid dubbing generation method and device
CN110853656B (en) Audio tampering identification method based on improved neural network
CN110189766B (en) Voice style transfer method based on neural network
JP2021026130A (en) Information processing device, information processing method, recognition model and program
CN115294970B (en) Voice conversion method, device and storage medium for pathological voice
CN110047501A (en) Multi-to-multi phonetics transfer method based on beta-VAE
CN113096675B (en) Audio style unification method based on generation type countermeasure network
JP7124373B2 (en) LEARNING DEVICE, SOUND GENERATOR, METHOD AND PROGRAM
CN111724806A (en) Double-visual-angle single-channel voice separation method based on deep neural network
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
Ariff et al. Study of adam and adamax optimizers on alexnet architecture for voice biometric authentication system
Ong et al. Speech emotion recognition with light gradient boosting decision trees machine
CN111488486B (en) Electronic music classification method and system based on multi-sound-source separation
CN111860246A (en) Deep convolutional neural network-oriented data expansion method for heart sound signal classification
CN113707172B (en) Single-channel voice separation method, system and computer equipment of sparse orthogonal network
Choi et al. Adversarial speaker-consistency learning using untranscribed speech data for zero-shot multi-speaker text-to-speech
CN113112969B (en) Buddhism music notation method, device, equipment and medium based on neural network
Wan et al. Deep neural network based Chinese dialect classification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant