CN117649839B - Personalized speech synthesis method based on low-rank adaptation - Google Patents

Personalized speech synthesis method based on low-rank adaptation

Info

Publication number
CN117649839B
CN117649839B (Application CN202410120426.4A)
Authority
CN
China
Prior art keywords
audio
probability distribution
decoder
training
low
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410120426.4A
Other languages
Chinese (zh)
Other versions
CN117649839A (en)
Inventor
汤杰辉
刘学亮
蔡驿晨
张金炎
叶雨露
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202410120426.4A priority Critical patent/CN117649839B/en
Publication of CN117649839A publication Critical patent/CN117649839A/en
Application granted granted Critical
Publication of CN117649839B publication Critical patent/CN117649839B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to the technical field of speech synthesis and discloses a personalized speech synthesis method based on low-rank adaptation, comprising the following steps: acquiring an audio data set containing a plurality of audio files; constructing and training a basic synthesis model; constructing and training a low-rank adaptation network; and performing inference. Through low-rank adaptation, the invention quickly trains a personalized decoder to meet customization requirements; at the same time, an F0 predictor is added to extract rich pitch features, which are used both in training the decoder and in generating the sampling points of the posterior distribution, so that the generated audio better matches the original voice.

Description

Personalized speech synthesis method based on low-rank adaptation
Technical Field
The invention relates to the technical field of speech synthesis, and in particular to a personalized speech synthesis method based on low-rank adaptation.
Background
With the continuous development of artificial intelligence and natural language processing, speech synthesis has become a key component of human-machine interaction. However, conventional speech synthesis methods are usually limited in the quality and the degree of personalization of the generated speech. The speech synthesis models adopted in the prior art do not adapt to different individuals, so the synthesized speech lacks uniqueness.
Disclosure of Invention
To solve the above technical problem, the invention provides a personalized speech synthesis method based on low-rank adaptation. A low-rank adaptation structure is added and its parameters are embedded into the decoder of a basic synthesis model, so that some layers of the original basic synthesis model are replaced by the low-rank adaptation structure, and synthesis fine-tuning can be carried out faster and in a more personalized way. By fine-tuning the parameters of the low-rank adaptation structure, the new model can change the generation style of the original model, achieving more accurate speech synthesis and improving the degree of personalization and the quality of the synthesized speech.
In order to solve the technical problems, the invention adopts the following technical scheme:
A personalized speech synthesis method based on low-rank adaptation comprises the following steps:
Step one, acquiring an audio data set with a plurality of audio files;
step two, constructing a basic synthesis model and training, which specifically comprises the following steps:
The basic synthesis model converts input text $c$ into synthesized audio and comprises a posterior encoder, a normalizing flow, a decoder, a discriminator, a text encoder and a multi-period stochastic duration predictor;
a text encoder consisting of multiple stacked Transformer blocks maps the text $c$ to a prior hidden variable $h_{text}$, which is then projected to the prior distribution $p_\theta(z \mid c)$;
the audio file is converted into a mel spectrum $x_{mel}$, and the mel spectrum is then converted into a linear spectrum $x_{lin}$; pitch information is extracted from the audio file;
the posterior encoder processes the linear spectrum $x_{lin}$ to generate the posterior distribution $q_\phi(z \mid x_{lin})$, and upsampling is performed to obtain the sampling points $z$ of the posterior distribution, where $z$ denotes the posterior hidden variable;
the normalizing flow $f_\theta$ maps the sampling points $z$ of the posterior distribution to a more complex distribution, and the alignment $A$ between the posterior distribution $q_\phi(z \mid x_{lin})$ and the prior distribution $p_\theta(z \mid c)$ is obtained by forced alignment; the alignment $A$ represents the pronunciation duration of each phoneme;
based on the prior hidden variable $h_{text}$ and the alignment $A$, the multi-period stochastic duration predictor outputs a logarithmic representation of the phoneme duration;
the sampling points $z$ of the posterior distribution and the pitch information are input into the decoder to obtain the synthesized audio;
the discriminator adopts the network structure of a generative adversarial network to classify the synthesized audio as real or fake;
the basic synthesis model is trained through a reconstruction loss computed on the mel spectrum, a KL divergence measuring the distance between the posterior distribution and the prior distribution, the prediction loss of the multi-period stochastic duration predictor, the least-squares loss of the adversarial training of the discriminator's generative adversarial network, and a feature matching loss applied to the decoder;
Step three, constructing a low-rank adaptation network and training it:
a low-rank adaptation network is constructed based on the decoder; the decoder comprises convolution layers and multi-receptive-field fusion modules; the weight matrices in the decoder of the trained basic synthesis model are low-rank decomposed and updated, specifically:
the weight matrices of the convolution layers and the multi-receptive-field fusion modules are rearranged into two-dimensional weight matrices; each two-dimensional weight matrix is decomposed by singular value decomposition into $W = U \Sigma V^{T}$, where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix containing the singular values; the diagonal matrix $\Sigma$ is truncated, keeping the first $M$ largest singular values, $M$ being a set value; a new weight matrix is constructed from the truncated diagonal matrix $\Sigma_M$ and the orthogonal matrices $U_M$ and $V_M$ and replaces the original weight matrix;
the low-rank adaptation network is trained with the audio data set and its parameters are adjusted;
Step four, inference:
text is input into the trained basic synthesis model, which obtains the logarithmic representation of the phoneme duration and the prior distribution $p_\theta(z \mid c)$ from the text encoder and the multi-period stochastic duration predictor; the sampling points $z$ of the posterior distribution are obtained through the inverse transformation of the normalizing flow; the sampling points $z$ of the posterior distribution are then input into the trained low-rank adaptation network to generate the synthesized audio.
Further, the first step specifically includes the following steps:
S11, collecting a voice audio file;
S12, separating the accompaniment, reverberation and harmony in the voice audio file through audio separation to obtain a dry vocal audio file;
S13, slicing the dry vocal audio file to obtain sliced audio files;
S14, performing a loudness matching operation on the sliced audio files, matching all the audio in the sliced audio files to the same target loudness, and resampling to the same sampling rate to obtain the audio data set.
Further, in the second step, converting the audio file into a mel spectrum specifically comprises:
normalizing the audio files in the audio data set, and obtaining the mel spectrum $x_{mel}$ by applying a short-time Fourier transform to the normalized audio files.
Extracting the pitch information from the audio file specifically comprises:
extracting the pitch information of the audio file through an F0 predictor; the pitch information includes pitch and vocal-tract information.
Further, in the second step, inputting the sampling points $z$ of the posterior distribution and the pitch information into the decoder to obtain the synthesized audio specifically comprises:
for the sampling points $z$ of the posterior distribution, the decoder first performs a preprocessing convolution that gradually increases the number of channels and obtains an initial feature map; the pitch information is converted into an embedding vector $e_{pitch}$ and added to the initial feature map through a one-dimensional convolution; transposed convolutions in a plurality of upsampling layers then gradually increase the width of the feature map while reducing the number of channels, gradually restoring the feature map to its original size, and feature fusion is performed by a multi-receptive-field fusion module after each upsampling layer; the multi-receptive-field fusion module connects several one-dimensional convolutions of the same size in the manner of a residual module so as to capture features under different receptive fields; the fused feature map passes through a further one-dimensional convolution that reduces the number of channels, the synthesized audio is generated through an activation function and a weight transformation, and the synthesized audio is scaled to the range $[-1, 1]$ by a Tanh function.
Further, in the second step, training the basic synthesis model through the reconstruction loss computed on the mel spectrum, the KL divergence measuring the distance between the posterior distribution and the prior distribution, the prediction loss of the multi-period stochastic duration predictor, the least-squares loss of the adversarial training of the discriminator's generative adversarial network, and the feature matching loss applied to the decoder specifically comprises:
the loss function $L_{total}$ of the basic synthesis model is as follows:

$$L_{total} = L_{recon} + L_{kl} + L_{dur} + L_{adv}(G) + L_{fm}(G)$$

where $L_{recon}$ is the reconstruction loss:

$$L_{recon} = \lVert x_{mel} - \hat{x}_{mel} \rVert_1$$

$\hat{x}_{mel}$ is the mel spectrum generated during training, and $\lVert \cdot \rVert_1$ denotes the L1 norm;
$L_{kl}$ is the KL divergence used to measure the distance between the posterior distribution and the prior distribution:

$$L_{kl} = \log q_\phi(z \mid x_{lin}) - \log p_\theta(z \mid c, A), \qquad z \sim q_\phi(z \mid x_{lin})$$

where $q_\phi(z \mid x_{lin})$ is the posterior distribution of the hidden variable $z$ given the linear spectrum $x_{lin}$, and $p_\theta(z \mid c, A)$ is the prior distribution of the prior hidden variable given the input text $c$ and the alignment $A$;
$L_{dur}$ is the prediction loss of the multi-period stochastic duration predictor:

$$L_{dur} = -\mathbb{E}_{q_\phi(u,\nu \mid d,\, h_{text})}\left[\log \frac{p_\theta(d-u,\,\nu \mid h_{text})}{q_\phi(u,\nu \mid d,\, h_{text})}\right]$$

where $d$ is the phoneme duration sequence derived from the alignment $A$, and $u$ and $\nu$ are auxiliary random variables used for variational dequantization and variational data augmentation;
$L_{adv}(D)$ and $L_{adv}(G)$ are the least-squares losses of the adversarial training of the discriminator's generative adversarial network:

$$L_{adv}(D) = \mathbb{E}_{(y,z)}\left[(D(y)-1)^2 + D(G(z))^2\right]$$

$$L_{adv}(G) = \mathbb{E}_{z}\left[(D(G(z))-1)^2\right]$$

where $D$ denotes the discriminator, $G(z)$ is the synthesized audio that the decoder $G$ ultimately generates from the sampling points $z$, $y$ is the corresponding target audio, and $\mathbb{E}_{(y,z)}$ and $\mathbb{E}_{z}$ denote the expectations over the pairs $(y, z)$ and over the sampling points $z$, respectively;
$L_{fm}(G)$ is the feature matching loss applied to the decoder:

$$L_{fm}(G) = \mathbb{E}_{(y,z)}\left[\sum_{l=1}^{T}\frac{1}{N_l}\lVert D^{l}(y) - D^{l}(G(z)) \rVert_1\right]$$

where $T$ denotes the number of layers of the discriminator, $D^{l}$ denotes the output feature map of the $l$-th discriminator layer, and $N_l$ denotes the number of features in that map.
Compared with the prior art, the invention has the beneficial technical effects that:
Because of their huge model size and complex parameters, current mainstream speech synthesis methods usually train on a user's audio with a pre-trained decoder and discriminator. Although this transfers quickly between different speakers, the personalized audio features are learned insufficiently, so the approach cannot satisfy the need for personalized expression across different users.
Compared with existing speech synthesis methods, the invention adds a low-rank adaptation structure, so a personalized decoder can be trained quickly and a model that generates a specific voice can be trained with a small amount of data, meeting customization requirements. At the same time, an F0 predictor is added to extract rich pitch features, which are used in training the decoder and in generating the sampling points of the posterior distribution, so that the generated audio better matches the original voice.
Drawings
FIG. 1 is a schematic diagram of a training process of a basic synthetic model in an embodiment of the invention;
FIG. 2 is a schematic diagram of an inference process in an embodiment of the invention;
FIG. 3 is a schematic diagram of low rank adaptation structure optimization in an embodiment of the present invention;
FIG. 4 is a schematic diagram of an inference process in an embodiment of the invention;
FIG. 5 is a schematic diagram of processing an audio file according to an embodiment of the present invention.
Detailed Description
A preferred embodiment of the present invention will be described in detail with reference to the accompanying drawings.
The invention aims to provide a personalized speech synthesis method based on low-rank adaptation, so as to improve the degree of personalization and the quality of speech synthesis and make the synthesized speech more lifelike and personalized.
1. Audio data set creation
The method specifically comprises the following steps:
S11, collecting the audio file.
S12, for the collected audio files, accompaniment, reverberation and harmony are sequentially separated by utilizing an audio separation technology, and relatively clean dry sound is obtained.
S13, slicing the dry audio file to obtain a plurality of sliced audio files with the duration of 3 seconds to 8 seconds.
S14, performing a loudness matching operation on the obtained short sliced audio files, matching them all to the same target loudness, and resampling to 44100 Hz mono; finally, the audio data set is obtained.
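The preprocessing pipeline described above (silence-based slicing, loudness matching, resampling to 44.1 kHz mono) can be illustrated with the following minimal sketch using librosa and pyloudnorm. The target loudness value, the silence threshold and the function names are illustrative assumptions, not values or an implementation taken from the patent.

```python
import librosa
import pyloudnorm as pyln
import soundfile as sf

TARGET_SR = 44100      # mono, 44.1 kHz as described above
TARGET_LUFS = -23.0    # assumed target loudness; the patent does not state a value

def preprocess(dry_vocal_path: str, out_prefix: str) -> None:
    # Load the separated dry vocal, resampled to the target rate, mono
    y, _ = librosa.load(dry_vocal_path, sr=TARGET_SR, mono=True)

    # Slice on silence; keep segments roughly 3-8 s long
    intervals = librosa.effects.split(y, top_db=30)
    meter = pyln.Meter(TARGET_SR)
    idx = 0
    for start, end in intervals:
        seg = y[start:end]
        dur = len(seg) / TARGET_SR
        if not (3.0 <= dur <= 8.0):
            continue
        # Loudness matching: measure integrated loudness, normalize to the target
        loudness = meter.integrated_loudness(seg)
        seg = pyln.normalize.loudness(seg, loudness, TARGET_LUFS)
        sf.write(f"{out_prefix}_{idx:04d}.wav", seg, TARGET_SR)
        idx += 1
```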
2. Basic synthetic model training
The basic synthesis model converts generic text into audio. It is a highly expressive speech synthesis model that combines variational inference, normalizing flows and adversarial training; the overall framework is a conditional VAE consisting mainly of a text encoder, a normalizing flow, a posterior encoder, a multi-period stochastic duration predictor, a decoder and a discriminator.
As shown in FIG. 5, the audio files in the audio data set are normalized and then converted into a mel spectrum $x_{mel}$ by applying a short-time Fourier transform to the normalized audio. The mel spectrum $x_{mel}$ is used to compute the reconstruction loss $L_{recon}$:

$$L_{recon} = \lVert x_{mel} - \hat{x}_{mel} \rVert_1$$

where $\hat{x}_{mel}$ is the mel spectrum generated during training and is used to guide model training. In order to provide higher-resolution information to the posterior encoder, the mel spectrum is further converted into a linear spectrum $x_{lin}$.
An F0 predictor is then selected according to the training data, and the pitch information of the audio files is extracted. This pitch information typically includes pitch and vocal-tract information. The pitch information is used in the subsequent posterior encoder, influencing the generation of the posterior distribution, and can also be used for training the decoder.
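A compact sketch of this feature extraction step is shown below: a linear (magnitude) spectrum and a mel spectrum from the short-time Fourier transform, plus an F0 track. The patent does not name a specific F0 predictor, so pYIN is used here purely as a stand-in, and the FFT size, hop length and mel-band count are assumed values; the sketch also computes the linear spectrum directly from the STFT, which is the usual order of operations.

```python
import librosa
import numpy as np

N_FFT, HOP, N_MELS = 1024, 256, 80   # assumed analysis settings

def extract_features(y: np.ndarray, sr: int = 44100):
    # Linear (magnitude) spectrum from the short-time Fourier transform
    x_lin = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP))
    # Mel spectrum used for the reconstruction loss
    x_mel = librosa.feature.melspectrogram(S=x_lin ** 2, sr=sr, n_mels=N_MELS)
    # Pitch (F0) track; pYIN stands in for the F0 predictor mentioned above
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"),
        sr=sr, frame_length=N_FFT, hop_length=HOP)
    return x_mel, x_lin, f0
```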
The text encoder consists of multiple stacked Transformer blocks. For the input text $c$ it generates the prior hidden variable $h_{text}$. First, each word of the text $c$ is mapped by an embedding layer into a continuous-valued embedding vector, so that the basic synthesis model can understand the semantics of the text. The embedding vectors are adjusted and transposed to suit the following self-attention and feed-forward operations. A mask tensor is generated for the variable-length text to mark the valid positions. Each Transformer block contains a self-attention layer and a feed-forward network. In the self-attention layer, the text vectors attend to themselves, the attention mask ensures that only valid positions are attended to, and the attention-weighted sum of the output projections is added back to the input as a residual. The feed-forward network consists of two equal-length convolution layers; after the activation function and dropout, it is connected to the original text vector with a residual connection and normalized. These operations are iterated over the multiple Transformer blocks, each time taking the valid positions into account to accommodate variable-length input. Finally $h_{text}$ is projected to obtain the prior distribution $p_\theta(z \mid c)$, expressed as:

$$p_\theta(z \mid c) = \mathcal{N}\left(z;\; \mu_\theta(c),\; \sigma_\theta(c)\right)$$

where $\mathcal{N}$ denotes a Gaussian distribution, $\mu_\theta(c)$ is the mean of the prior hidden variable given the text condition, and $\sigma_\theta(c)$ is the standard deviation of the prior hidden variable given the text condition.
The posterior encoder processes the linear spectrum $x_{lin}$; its goal is to generate the high-dimensional posterior distribution $q_\phi(z \mid x_{lin})$ of the posterior hidden variable $z$, expressed as:

$$q_\phi(z \mid x_{lin}) = \mathcal{N}\left(z;\; \mu_\phi(x_{lin}),\; \sigma_\phi(x_{lin})\right)$$

First, features of the linear spectrum $x_{lin}$ are initially extracted by a one-dimensional convolution layer. Then, through a series of residual modules, multi-layer nonlinear transformation and feature learning are performed on the input data. Pitch information is also introduced to personalize the encoding. Finally, a one-dimensional convolution layer projects the features into a tensor containing the posterior hidden variable $z$, yielding the posterior distribution $q_\phi(z \mid x_{lin})$, which is upsampled to obtain the sampling points $z$ of the posterior distribution.
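The final projection step just described — producing the mean and standard deviation of $q_\phi(z \mid x_{lin})$ and drawing the sampling points $z$ — can be pictured with the short PyTorch sketch below. The channel sizes and the use of a log-standard-deviation parameterization are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PosteriorProjection(nn.Module):
    """Sketch of the posterior encoder's final 1-D conv projection:
    hidden features -> (mu, log_sigma) -> reparameterized latent z."""
    def __init__(self, hidden_channels: int = 192, latent_channels: int = 192):
        super().__init__()
        # One 1-D conv outputs both the mean and the log-std of q(z | x_lin)
        self.proj = nn.Conv1d(hidden_channels, 2 * latent_channels, kernel_size=1)

    def forward(self, h: torch.Tensor):
        stats = self.proj(h)                    # (B, 2*C, T)
        mu, log_sigma = stats.chunk(2, dim=1)   # split into mean and log-std
        z = mu + torch.randn_like(mu) * torch.exp(log_sigma)  # sample z
        return z, mu, log_sigma
```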
The normalizing flow $f_\theta$ is responsible for mapping the sampling points $z$ of the posterior distribution to a more complex distribution $f_\theta(z)$, where $f_\theta$ denotes the transformation of the normalizing flow. The normalizing flow is a stack of four affine coupling layers, each containing four residual modules. The residual modules keep enlarging the receptive field by continuously increasing the dilation coefficient of the one-dimensional dilated convolutions. A gated feature fusion method is also adopted to obtain a more complex data representation: the input is split, one part passes through a sigmoid activation and the other through a tanh activation, and the two parts are multiplied; this helps the model better understand the different features of the input data. Dropout is also applied in each coupling layer to reduce the risk of overfitting and improve the generalization of the model. Together, these coupling layers generate the information needed to synthesize audio through reversible transformations while preserving the invertibility of the data.
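The gated feature fusion used inside each coupling layer corresponds to the following sketch; splitting along the channel dimension is an assumption about the layout, but the tanh/sigmoid gating itself matches the description above.

```python
import torch

def gated_fusion(x: torch.Tensor) -> torch.Tensor:
    """Split the channels in two and fuse with a tanh/sigmoid gate,
    as described for the coupling layers of the normalizing flow."""
    a, b = x.chunk(2, dim=1)           # (B, C/2, T) each
    return torch.tanh(a) * torch.sigmoid(b)
```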
The alignment $A$ between the posterior distribution $q_\phi(z \mid x_{lin})$ and the prior distribution $p_\theta(z \mid c)$ is obtained by forced alignment. The alignment $A$ is a strictly monotonic attention matrix of size $|c| \times |z|$ that represents the pronunciation duration of each phoneme.
The prior hidden variable $h_{text}$ output by the text encoder and the alignment $A$ are fed to the multi-period stochastic duration predictor, which outputs a logarithmic representation of the phoneme duration. The prior hidden variable $h_{text}$ first passes through a preprocessing one-dimensional convolution, whose purpose is to enrich the feature representation by increasing the number of channels and enlarging the receptive field; this convolution changes only the number of channels, not the width or height, and performs pure feature conversion. Next, the prior hidden variable enters a dilated depthwise-separable convolution (DDSConv) module. DDSConv is a composite convolution module containing several convolution layers; each layer uses a different dilation coefficient to increase its receptive field, and layer normalization and a GELU activation follow each convolution layer to improve parameter efficiency and feature extraction. The prior hidden variable then passes through another one-dimensional convolution, the post-processing one-dimensional convolution, which further increases the number of channels and enlarges the receptive field to improve the data representation. Finally, the processed tensor enters a neural spline flow, the module used to generate the logarithm of the phoneme duration. The neural spline flow consists of four coupling layers, each of which maps the input data to a more complex distribution; this step performs an advanced transformation of the data to finally obtain the logarithmic representation of the phoneme duration. The prediction loss $L_{dur}$ of the multi-period stochastic duration predictor is:

$$L_{dur} = -\mathbb{E}_{q_\phi(u,\nu \mid d,\, h_{text})}\left[\log \frac{p_\theta(d-u,\,\nu \mid h_{text})}{q_\phi(u,\nu \mid d,\, h_{text})}\right]$$

where $d$ is the phoneme duration sequence input to the multi-period stochastic duration predictor, and two auxiliary random variables $u$ and $\nu$, with the same temporal resolution and dimension as $d$, are introduced for variational dequantization and variational data augmentation respectively. The support of $u$ is restricted to $[0,1)$ so that $d-u$ becomes a sequence of positive real numbers, and $\nu$ and $d$ are concatenated channel-wise to form a higher-dimensional latent representation. The noise driving the neural spline flow follows the standard normal distribution $\mathcal{N}(0,1)$.
For the sampling points $z$ of the posterior distribution generated by the encoding, the decoder first performs a preprocessing convolution that gradually increases the number of channels and obtains an initial feature map. The pitch information is then converted into an embedding vector $e_{pitch}$ and added to the feature map through a one-dimensional convolution. Transposed convolutions in several upsampling layers then gradually increase the width of the feature map while reducing the number of channels, gradually restoring the feature map to its original size. After each upsampling layer, feature fusion is performed by a multi-receptive-field fusion module, which connects several one-dimensional convolutions of the same size in the manner of a residual module so as to capture features under different receptive fields. The final feature map passes through one more one-dimensional convolution that reduces the number of channels, the synthesized audio is generated through an activation function and a weight transformation, and the synthesized audio is scaled to the range $[-1, 1]$ by a Tanh function.
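One way to read the multi-receptive-field fusion module — several same-size one-dimensional convolutions with different dilations combined through residual connections — is sketched below in PyTorch. The dilation values, channel count and activation are assumptions, not parameters stated in the patent.

```python
import torch
import torch.nn as nn

class MultiReceptiveFieldFusion(nn.Module):
    """Residual 1-D conv branches with growing dilation, so each branch sees a
    different receptive field; branch outputs are accumulated via residual sums."""
    def __init__(self, channels: int = 256, kernel_size: int = 3,
                 dilations=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size,
                      dilation=d, padding=(kernel_size - 1) * d // 2)
            for d in dilations
        ])
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for conv in self.convs:
            x = x + conv(self.act(x))   # residual connection per branch
        return x
```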
The discriminator adopts the network structure of a generative adversarial network; it processes the synthesized audio generated by the decoder and extracts features and patterns from it. The discriminator network comprises several sub-discriminators, each composed of multiple multi-period convolution layers, where each period corresponds to a different time scale so that the audio data can be analysed at multiple scales in a targeted way. These convolutions use conditional weight normalization to keep the model stable. A Leaky ReLU activation function is applied between convolutions, introducing nonlinearity and making the discriminator more sensitive to variations in different audio characteristics. Through these convolution and activation operations and a multi-layer perceptron, the main task of the discriminator is to classify the audio data as real or fake.
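The period-wise analysis of each sub-discriminator can be pictured as reshaping the one-dimensional waveform into a two-dimensional grid whose width equals the period, so that convolutions inspect the signal at that specific time scale. The sketch below shows only this reshaping step; padding choices are assumptions.

```python
import torch
import torch.nn.functional as F

def fold_by_period(wav: torch.Tensor, period: int) -> torch.Tensor:
    """Reshape a (B, 1, T) waveform into (B, 1, T//period, period) so that a
    sub-discriminator can analyse the audio at one specific time scale."""
    b, c, t = wav.shape
    if t % period != 0:                       # right-pad so the period divides T
        pad = period - (t % period)
        wav = F.pad(wav, (0, pad), mode="reflect")
        t = t + pad
    return wav.view(b, c, t // period, period)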
The loss function $L_{total}$ of the whole basic synthesis model is:

$$L_{total} = L_{recon} + L_{kl} + L_{dur} + L_{adv}(G) + L_{fm}(G)$$

$L_{kl}$ is the KL divergence used to measure the distance between the posterior distribution and the prior distribution:

$$L_{kl} = \log q_\phi(z \mid x_{lin}) - \log p_\theta(z \mid c, A), \qquad z \sim q_\phi(z \mid x_{lin})$$

where $q_\phi(z \mid x_{lin})$ is the posterior distribution of the hidden variable $z$ given the linear spectrum $x_{lin}$, and $p_\theta(z \mid c, A)$ is the prior distribution of the prior hidden variable given the condition $c$ and the alignment $A$.
In addition, $L_{adv}(D)$ and $L_{adv}(G)$ are the least-squares loss functions used for adversarial training, expressed as:

$$L_{adv}(D) = \mathbb{E}_{(y,z)}\left[(D(y)-1)^2 + D(G(z))^2\right]$$

$$L_{adv}(G) = \mathbb{E}_{z}\left[(D(G(z))-1)^2\right]$$

where $D$ denotes the discriminator, $G$ denotes the decoder, and $G(z)$ is the synthesized audio that the decoder ultimately generates.
Finally, the feature matching loss applied specifically to the decoder is:

$$L_{fm}(G) = \mathbb{E}_{(y,z)}\left[\sum_{l=1}^{T}\frac{1}{N_l}\lVert D^{l}(y) - D^{l}(G(z)) \rVert_1\right]$$

where $T$ denotes the number of layers of the discriminator, $D^{l}$ denotes the output feature map of the $l$-th discriminator layer, and $N_l$ denotes the number of features in that map. The feature matching loss can be regarded as a reconstruction loss that constrains the outputs of the intermediate layers of the discriminator.
An Adam optimizer is used to optimize the basic synthesis model, with 16 audio files per batch, a learning rate of 0.0001, the hyperparameter "betas" set to [0.8, 0.99] and "eps" set to 1e-09. These hyperparameters control the behaviour of the Adam optimizer during training, which adjusts the weights of the basic synthesis model to minimize the loss function. Adam is a widely used adaptive-learning-rate optimization algorithm that combines momentum with per-parameter learning-rate adaptation; its main idea is to adapt the learning rate of each parameter according to its gradient, improving convergence speed and stability.
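With the hyperparameters listed above, the optimizer setup corresponds to the small sketch below; the function name and the parameter-group argument are placeholders, not names from the patent.

```python
import torch

def build_optimizer(model_params):
    # Hyperparameters from the text above: lr 1e-4, betas [0.8, 0.99], eps 1e-9
    # (batch size of 16 audio files is handled by the data loader, not shown here)
    return torch.optim.Adam(model_params, lr=1e-4, betas=(0.8, 0.99), eps=1e-9)
```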
The training process of the invention for the basic synthetic model is shown in figure 1.
3. Low rank adaptive network training
The low-rank adaptation network is trained on top of the basic synthesis model. Using the low-rank adaptation method, part of the weights of the pre-trained basic synthesis model are frozen and a trainable rank-decomposition matrix is injected into each layer of the decoder architecture, so that the low-rank adaptation replaces the original pre-trained decoder. By quickly fine-tuning the parameters, the network learns the linear spectrum $x_{lin}$ in a personalized way, thereby enabling personalized speech synthesis for an individual.
First, a pre-trained decoder is obtained as the base model, and the weight matrices of some of its layers are low-rank decomposed for adaptation. The weight matrices of the convolution layers and multi-receptive-field fusion modules are optionally rearranged into two-dimensional weight matrices, where one dimension represents the output channels and convolution kernel height and the other dimension represents the input channels and convolution kernel width. Singular value decomposition (SVD) is applied to each two-dimensional weight matrix, decomposing it into three matrices:

$$W = U \Sigma V^{T}$$

where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix containing the singular values. The diagonal matrix $\Sigma$ is then truncated, keeping the first $M$ largest singular values to reduce the rank; $M$ is chosen according to the task requirements and the available computing power. A new weight matrix is reconstructed from the truncated matrices $U_M$, $\Sigma_M$ and $V_M$. The new weight matrix usually loses some information, but its rank is lower and it has fewer parameters, which facilitates subsequent training. The new weight matrix replaces the original one, reducing the number of parameters and supporting fast fine-tuning to generate personalized speech. Let the original weight matrix be $W_0$; the reconstructed weight matrix $W$ is:

$$W = W_0 + \Delta W = W_0 + BA$$

where the rank of $\Delta W = BA$ is $M$. During the training of the low-rank adaptation network, $W_0$ is fixed and only $B$ and $A$ are trainable parameters. In the forward pass, $W_0$ and $BA$ are multiplied by the same input $x$ and the results are added:

$$h = W_0 x + B A x$$

During inference, the low-rank adaptation network introduces almost no extra inference latency, since only $W = W_0 + BA$ needs to be computed.
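A minimal PyTorch sketch of such a low-rank adapted layer is given below. It is one plausible reading of the scheme above: the trainable matrices $B$ and $A$ are seeded from the truncated SVD factors of the pretrained weight, while the pretrained weight $W_0$ stays frozen; the class name, the initialization choice and the merging helper are illustrative assumptions rather than the patent's reference implementation.

```python
import torch
import torch.nn as nn

class LowRankAdaptedLinear(nn.Module):
    """Sketch: frozen base weight W0 plus a trainable rank-M update B @ A.
    B and A are initialised from a truncated SVD of the pretrained weight."""
    def __init__(self, pretrained_weight: torch.Tensor, rank_m: int):
        super().__init__()
        # Frozen pretrained weight W0 (shape: out_dim x in_dim)
        self.w0 = nn.Parameter(pretrained_weight.clone(), requires_grad=False)
        # Truncated SVD: keep the M largest singular values, W ~ U_M S_M V_M^T
        u, s, vh = torch.linalg.svd(pretrained_weight, full_matrices=False)
        self.b = nn.Parameter(u[:, :rank_m] * s[:rank_m])   # (out_dim, M)
        self.a = nn.Parameter(vh[:rank_m, :])                # (M, in_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + B A x; only B and A receive gradients
        return x @ self.w0.T + (x @ self.a.T) @ self.b.T

    def merged_weight(self) -> torch.Tensor:
        # For inference the update can be folded back: W = W0 + B A
        return self.w0 + self.b @ self.a
```

In this reading, only `self.b` and `self.a` are updated during the personalized fine-tuning, and at inference time the merged weight $W = W_0 + BA$ can be computed once so that no extra latency is introduced.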
By selecting a suitable singular value truncation threshold and training the low-rank adaptation network on the speaker's audio in place of the pre-trained decoder, the method can be trained quickly for different application scenarios to generate personalized audio and meet customization requirements.
4. Personalized audio output
The prepared sliced audio files are placed in the speaker folder, and a corresponding configuration file is generated according to the selected pre-trained text encoder, decoder and discriminator.
A short-time Fourier transform is applied to the sliced audio files to obtain the mel spectrum $x_{mel}$ and the linear spectrum $x_{lin}$ required for training the basic synthesis model. The posterior encoder, decoder, normalizing flow and discriminator of the basic synthesis model are trained with the linear spectrum $x_{lin}$; the pitch information of the sliced audio files is extracted by the F0 predictor and added to the feature map obtained in the decoder, so that the finally generated audio better matches the original voice and the personalized expression is enriched.
Text is input into the basic synthesis model, which obtains the logarithmic representation of the phoneme duration and the prior distribution $p_\theta(z \mid c)$ from the text encoder and the multi-period stochastic duration predictor; the sampling points $z$ of the posterior distribution are then obtained through the inverse transformation of the normalizing flow.
With the original decoder kept unchanged, only the parameters of the low-rank adaptation network need to be adjusted, so a more personalized speech synthesis decoder can be obtained by rapid training; after the sampling points $z$ of the posterior distribution are fed in, an audio file that better matches the speaker's characteristics is generated by the low-rank adaptation network, see FIG. 3.
The reasoning process of the present invention is shown in fig. 2 and 4.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Furthermore, it should be understood that although the present description is organized by embodiments, each embodiment does not necessarily contain only a single independent technical solution; the description is written in this way only for clarity, and the embodiments may be combined as appropriate to form other implementations that will be understood by those skilled in the art.

Claims (4)

1. A personalized speech synthesis method based on low-rank adaptation comprises the following steps:
Step one, acquiring an audio data set with a plurality of audio files;
step two, constructing a basic synthesis model and training it, which specifically comprises: the basic synthesis model converts input text $c$ into synthesized audio and comprises a posterior encoder, a normalizing flow, a decoder, a discriminator, a text encoder and a multi-period stochastic duration predictor; a text encoder consisting of multiple stacked Transformer blocks maps the text $c$ to a prior hidden variable $h_{text}$, which is projected to the prior distribution $p_\theta(z \mid c)$; the audio file is converted into a mel spectrum $x_{mel}$, and the mel spectrum is then converted into a linear spectrum $x_{lin}$; pitch information is extracted from the audio file; the posterior encoder processes the linear spectrum $x_{lin}$ to generate the posterior distribution $q_\phi(z \mid x_{lin})$, and upsampling is performed to obtain the sampling points $z$ of the posterior distribution, where $z$ denotes the posterior hidden variable; the normalizing flow $f_\theta$ maps the sampling points $z$ of the posterior distribution to a more complex distribution, and the alignment $A$ between the posterior distribution $q_\phi(z \mid x_{lin})$ and the prior distribution $p_\theta(z \mid c)$ is obtained by forced alignment; the alignment $A$ represents the pronunciation duration of each phoneme; based on the prior hidden variable $h_{text}$ and the alignment $A$, the multi-period stochastic duration predictor outputs a logarithmic representation of the phoneme duration; the sampling points $z$ of the posterior distribution and the pitch information are input into the decoder to obtain the synthesized audio; the discriminator adopts the network structure of a generative adversarial network to classify the synthesized audio as real or fake; the basic synthesis model is trained through a reconstruction loss computed on the mel spectrum, a KL divergence measuring the distance between the posterior distribution and the prior distribution, the prediction loss of the multi-period stochastic duration predictor, the least-squares loss of the adversarial training of the discriminator's generative adversarial network, and a feature matching loss applied to the decoder;
inputting the sampling points $z$ of the posterior distribution and the pitch information into the decoder to obtain the synthesized audio specifically comprises:
for the sampling points $z$ of the posterior distribution, the decoder first performs a preprocessing convolution that gradually increases the number of channels and obtains an initial feature map; the pitch information is converted into an embedding vector $e_{pitch}$ and added to the initial feature map through a one-dimensional convolution; transposed convolutions in a plurality of upsampling layers then gradually increase the width of the feature map while reducing the number of channels, gradually restoring the feature map to its original size, and feature fusion is performed by a multi-receptive-field fusion module after each upsampling layer; the multi-receptive-field fusion module connects several one-dimensional convolutions of the same size in the manner of a residual module so as to capture features under different receptive fields; the fused feature map passes through a further one-dimensional convolution that reduces the number of channels, the synthesized audio is generated through an activation function and a weight transformation, and the synthesized audio is scaled to the range $[-1, 1]$ by a Tanh function;
step three, constructing a low-rank adaptation network and training it: a low-rank adaptation network is constructed based on the decoder; the decoder comprises convolution layers and multi-receptive-field fusion modules; the weight matrices in the decoder of the trained basic synthesis model are low-rank decomposed and updated, specifically: the weight matrices of the convolution layers and the multi-receptive-field fusion modules are rearranged into two-dimensional weight matrices; each two-dimensional weight matrix is decomposed by singular value decomposition into $W = U \Sigma V^{T}$, where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix containing the singular values; the diagonal matrix $\Sigma$ is truncated, keeping the first $M$ largest singular values, $M$ being a set value; a new weight matrix is constructed from the truncated diagonal matrix $\Sigma_M$ and the orthogonal matrices $U_M$ and $V_M$ and replaces the original weight matrix; the low-rank adaptation network is trained with the audio data set and its parameters are adjusted;
step four, inference: text is input into the trained basic synthesis model, which obtains the logarithmic representation of the phoneme duration and the prior distribution $p_\theta(z \mid c)$ from the text encoder and the multi-period stochastic duration predictor; the sampling points $z$ of the posterior distribution are obtained through the inverse transformation of the normalizing flow; the sampling points $z$ of the posterior distribution are then input into the trained low-rank adaptation network to generate the synthesized audio.
2. The method for synthesizing personalized speech based on low-rank adaptation according to claim 1, wherein the first step specifically comprises the steps of:
S11, collecting a voice audio file;
S12, separating the accompaniment, reverberation and harmony in the voice audio file through audio separation to obtain a dry vocal audio file;
S13, slicing the dry vocal audio file to obtain sliced audio files;
S14, performing a loudness matching operation on the sliced audio files, matching all the audio in the sliced audio files to the same target loudness, and resampling to the same sampling rate to obtain the audio data set.
3. The low-rank adaptation-based personalized speech synthesis method according to claim 1, wherein in the second step, converting the audio file into a mel spectrum specifically comprises:
normalizing the audio files in the audio data set, and obtaining the mel spectrum $x_{mel}$ by applying a short-time Fourier transform to the normalized audio files;
and extracting the pitch information from the audio file specifically comprises:
extracting the pitch information of the audio file through an F0 predictor; the pitch information includes pitch and vocal-tract information.
4. The low-rank adaptation-based personalized speech synthesis method according to claim 1, wherein in the second step, training the basic synthesis model through the reconstruction loss computed on the mel spectrum, the KL divergence measuring the distance between the posterior distribution and the prior distribution, the prediction loss of the multi-period stochastic duration predictor, the least-squares loss of the adversarial training of the discriminator's generative adversarial network, and the feature matching loss applied to the decoder specifically comprises:
the loss function $L_{total}$ of the basic synthesis model is as follows:

$$L_{total} = L_{recon} + L_{kl} + L_{dur} + L_{adv}(G) + L_{fm}(G)$$

where $L_{recon}$ is the reconstruction loss:

$$L_{recon} = \lVert x_{mel} - \hat{x}_{mel} \rVert_1$$

$\hat{x}_{mel}$ is the mel spectrum generated during training, and $\lVert \cdot \rVert_1$ denotes the L1 norm;
$L_{kl}$ is the KL divergence used to measure the distance between the posterior distribution and the prior distribution:

$$L_{kl} = \log q_\phi(z \mid x_{lin}) - \log p_\theta(z \mid c, A), \qquad z \sim q_\phi(z \mid x_{lin})$$

where $q_\phi(z \mid x_{lin})$ is the posterior distribution of the hidden variable $z$ given the linear spectrum $x_{lin}$, and $p_\theta(z \mid c, A)$ is the prior distribution of the prior hidden variable given the input text $c$ and the alignment $A$;
$L_{dur}$ is the prediction loss of the multi-period stochastic duration predictor:

$$L_{dur} = -\mathbb{E}_{q_\phi(u,\nu \mid d,\, h_{text})}\left[\log \frac{p_\theta(d-u,\,\nu \mid h_{text})}{q_\phi(u,\nu \mid d,\, h_{text})}\right]$$

where $d$ is the phoneme duration sequence derived from the alignment $A$, and $u$ and $\nu$ are auxiliary random variables used for variational dequantization and variational data augmentation;
$L_{adv}(D)$ and $L_{adv}(G)$ are the least-squares losses of the adversarial training of the discriminator's generative adversarial network:

$$L_{adv}(D) = \mathbb{E}_{(y,z)}\left[(D(y)-1)^2 + D(G(z))^2\right]$$

$$L_{adv}(G) = \mathbb{E}_{z}\left[(D(G(z))-1)^2\right]$$

where $D$ denotes the discriminator, $G(z)$ is the synthesized audio that the decoder $G$ ultimately generates from the sampling points $z$, $y$ is the corresponding target audio, and $\mathbb{E}_{(y,z)}$ and $\mathbb{E}_{z}$ denote the expectations over the pairs $(y, z)$ and over the sampling points $z$, respectively;
$L_{fm}(G)$ is the feature matching loss applied to the decoder:

$$L_{fm}(G) = \mathbb{E}_{(y,z)}\left[\sum_{l=1}^{T}\frac{1}{N_l}\lVert D^{l}(y) - D^{l}(G(z)) \rVert_1\right]$$

where $T$ denotes the number of layers of the discriminator, $D^{l}$ denotes the output feature map of the $l$-th discriminator layer, and $N_l$ denotes the number of features in that map.
CN202410120426.4A 2024-01-29 2024-01-29 Personalized speech synthesis method based on low-rank adaptation Active CN117649839B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410120426.4A CN117649839B (en) 2024-01-29 2024-01-29 Personalized speech synthesis method based on low-rank adaptation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410120426.4A CN117649839B (en) 2024-01-29 2024-01-29 Personalized speech synthesis method based on low-rank adaptation

Publications (2)

Publication Number Publication Date
CN117649839A CN117649839A (en) 2024-03-05
CN117649839B true CN117649839B (en) 2024-04-19

Family

ID=90046314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410120426.4A Active CN117649839B (en) 2024-01-29 2024-01-29 Personalized speech synthesis method based on low-rank adaptation

Country Status (1)

Country Link
CN (1) CN117649839B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021123792A1 (en) * 2019-12-20 2021-06-24 Sonantic Limited A Text-to-Speech Synthesis Method and System, a Method of Training a Text-to-Speech Synthesis System, and a Method of Calculating an Expressivity Score
WO2021234967A1 (en) * 2020-05-22 2021-11-25 日本電信電話株式会社 Speech waveform generation model training device, speech synthesis device, method for the same, and program
CN114267329A (en) * 2021-12-24 2022-04-01 厦门大学 Multi-speaker speech synthesis method based on probability generation and non-autoregressive model
CN116229932A (en) * 2022-12-08 2023-06-06 维音数码(上海)有限公司 Voice cloning method and system based on cross-domain consistency loss
CN116452983A (en) * 2023-06-12 2023-07-18 合肥工业大学 Quick discovering method for land landform change based on unmanned aerial vehicle aerial image

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2024510679A (en) * 2021-03-22 2024-03-08 Google LLC Unsupervised parallel Tacotron non-autoregressive and controllable text-to-speech

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech; Jaehyeon Kim; International Conference on Machine Learning; 2021-12-31; pp. 1-8, Fig. 1 *
Linear networks based speaker adaptation for speech synthesis; Zhiying Huang et al.; arXiv; 2018-03-05; pp. 1-3 *
NaturalSpeech: End-to-End Text-to-Speech Synthesis with Human-Level Quality; Xu Tan et al.; IEEE Transactions on Pattern Analysis and Machine Intelligence; 2024-01-19; pp. 1-5 *
Text-to-Speech with Model Compression on Edge Devices; Wai-Wan Koc et al.; 2021 22nd Asia-Pacific Network Operations and Management Symposium; 2021-09-10; pp. 114-118 *
Research on key technologies of deep learning model inference based on multi-device collaboration; Chen Nan; China Master's Theses Full-text Database; 2022-12-15; p. 5 *
A speaker clustering algorithm combining two distance measures; Chen Mingtong et al.; Journal of Chinese Computer Systems; 2015-10-31; pp. 2369-2371 *

Also Published As

Publication number Publication date
CN117649839A (en) 2024-03-05

Similar Documents

Publication Publication Date Title
CN109671442B (en) Many-to-many speaker conversion method based on STARGAN and x vectors
CN110136731B (en) Cavity causal convolution generation confrontation network end-to-end bone conduction voice blind enhancement method
CN110060690B (en) Many-to-many speaker conversion method based on STARGAN and ResNet
CN110060701B (en) Many-to-many voice conversion method based on VAWGAN-AC
CN111816156B (en) Multi-to-multi voice conversion method and system based on speaker style feature modeling
Chen et al. A deep generative architecture for postfiltering in statistical parametric speech synthesis
CN109599091B (en) Star-WAN-GP and x-vector based many-to-many speaker conversion method
CN110600047A (en) Perceptual STARGAN-based many-to-many speaker conversion method
CN113314140A (en) Sound source separation algorithm of end-to-end time domain multi-scale convolutional neural network
CN110060657B (en) SN-based many-to-many speaker conversion method
CN114141238A (en) Voice enhancement method fusing Transformer and U-net network
CN110060691B (en) Many-to-many voice conversion method based on i-vector and VARSGAN
Pascual et al. Time-domain speech enhancement using generative adversarial networks
CN111429893A (en) Many-to-many speaker conversion method based on Transitive STARGAN
Sun et al. A model compression method with matrix product operators for speech enhancement
CN114141237A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN115101085A (en) Multi-speaker time-domain voice separation method for enhancing external attention through convolution
Le Moine et al. Towards end-to-end F0 voice conversion based on Dual-GAN with convolutional wavelet kernels
KR20200088263A (en) Method and system of text to multiple speech
Cheng et al. DNN-based speech enhancement with self-attention on feature dimension
CN113593588B (en) Multi-singer singing voice synthesis method and system based on generation of countermeasure network
CN110600046A (en) Many-to-many speaker conversion method based on improved STARGAN and x vectors
CN117095669A (en) Emotion voice synthesis method, system, equipment and medium based on variation automatic coding
Zhao et al. Research on voice cloning with a few samples
Lee et al. HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant