CN114220456A - Method and device for generating speech synthesis model and electronic equipment


Info

Publication number
CN114220456A
Authority
CN
China
Prior art keywords
convolution
matrix
synthesis model
speech synthesis
vector
Prior art date
Legal status
Pending
Application number
CN202111437392.4A
Other languages
Chinese (zh)
Inventor
李婉
李健
武卫东
陈明
Current Assignee
Beijing Sinovoice Technology Co Ltd
Original Assignee
Beijing Sinovoice Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sinovoice Technology Co Ltd filed Critical Beijing Sinovoice Technology Co Ltd
Priority to CN202111437392.4A
Publication of CN114220456A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination


Abstract

An embodiment of the present application provides a method and a device for generating a speech synthesis model, and an electronic device. The method comprises: obtaining a plurality of text samples; training on the text samples to obtain a first speech synthesis model, wherein the first speech synthesis model comprises a plurality of convolution processing modules and each convolution processing module comprises a plurality of parallel convolutional layers; when i takes each integer from 1 to n, determining one convolutional layer equivalent to the plurality of parallel convolutional layers included in the ith convolution processing module as the ith target convolutional layer; replacing the plurality of parallel convolutional layers included in the ith convolution processing module of the first speech synthesis model with the ith target convolutional layer; and when the plurality of parallel convolutional layers included in the nth convolution processing module have been replaced with the nth target convolutional layer, obtaining a second speech synthesis model. The second speech synthesis model obtained through this equivalent transformation into target convolutional layers saves model memory when the model is applied at a later stage.

Description

Method and device for generating speech synthesis model and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for generating a speech synthesis model, and an electronic device.
Background
Currently, acoustic-model synthesis schemes generally focus, structurally, on combinations of multi-path parallel structures and Attention structures. The advantage of this combined design is that fast convergence can be obtained during training, and, because the Attention structure models information over the whole training sample, the synthesized result is smooth, delicate, and highly human-like.
For example, in the acoustic-model synthesis scheme of speech synthesis technology, the FastSpeech model adopts a large number of Attention structures and multi-path parallel structures, which help optimize the gradient updates of the model in the training stage, avoid the vanishing-gradient problem for bottom-layer parameters, and accelerate model convergence. However, when a model with a multi-path parallel structure is applied at a later stage, memory is allocated separately for each path's input, which increases the memory consumption of the system.
Therefore, in the prior art, although a model combining Attention and multi-path parallel structures can converge quickly in the training phase, it occupies a large amount of memory in later application.
Disclosure of Invention
The embodiments of the present application provide a method and a device for generating a speech synthesis model, and an electronic device, aiming to solve the problem that a model combining Attention and multi-path parallel structures occupies a large amount of memory in later application.
In a first aspect, an embodiment of the present application provides a method for generating a speech synthesis model, where the method includes:
obtaining a plurality of text samples;
training the text sample to obtain a first speech synthesis model, wherein the first speech synthesis model comprises a plurality of convolution processing modules, and each convolution processing module comprises a plurality of parallel convolution layers;
when i takes each integer from 1 to n, determining one convolution layer equivalent to a plurality of parallel convolution layers included in the ith convolution processing module as the ith target convolution layer, wherein n is the number of the convolution processing modules;
replacing a plurality of parallel convolutional layers included in the ith convolution processing module of the first speech synthesis model with an ith target convolutional layer;
and when the plurality of parallel convolution layers included by the nth convolution processing module are replaced by the nth target convolution layer, obtaining a second speech synthesis model.
In a second aspect, an embodiment of the present application provides an apparatus for generating a speech synthesis model, where the apparatus includes:
the sample acquisition module is used for acquiring a plurality of text samples;
the model acquisition module is used for training the text sample to acquire a first speech synthesis model, wherein the first speech synthesis model comprises a plurality of convolution processing modules, and each convolution processing module comprises a plurality of parallel convolution layers;
an equivalent transformation module, configured to determine, as an ith target convolutional layer, one convolutional layer that is equivalent to a plurality of parallel convolutional layers included in an ith convolutional processing module when i takes each integer from 1 to n, where n is the number of the convolutional processing modules;
a replacing module, configured to replace the multiple parallel convolutional layers included in the ith convolution processing module of the first speech synthesis model with an ith target convolutional layer;
and the model generation module is used for obtaining a second speech synthesis model when the plurality of parallel convolutional layers included by the nth convolutional processing module are replaced by the nth target convolutional layer.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of the method for generating a speech synthesis model described above.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method for generating a speech synthesis model described above.
In an embodiment of the present application, a plurality of text samples can be obtained; a first speech synthesis model is obtained by training on the text samples, wherein the first speech synthesis model comprises a plurality of convolution processing modules and each convolution processing module comprises a plurality of parallel convolutional layers; when i takes each integer from 1 to n, one convolutional layer equivalent to the plurality of parallel convolutional layers included in the ith convolution processing module is determined as the ith target convolutional layer, where n is the number of convolution processing modules; the plurality of parallel convolutional layers included in the ith convolution processing module of the first speech synthesis model are replaced with the ith target convolutional layer; and when the plurality of parallel convolutional layers included in the nth convolution processing module have been replaced with the nth target convolutional layer, a second speech synthesis model is obtained.
The first speech synthesis model adopts a multi-path model structure in the training stage. In the generation stage, the plurality of parallel convolutional layers included in each convolution processing module of the first speech synthesis model are equivalently transformed, without changing the input or output, into one target convolutional layer, yielding the second speech synthesis model; that is, the first speech synthesis model with a multi-path model structure is transformed into a second speech synthesis model with a single-path model structure. Therefore, in the embodiment of the application, the excellent performance of the multi-path model structure is kept in the training stage, helping the model converge quickly, while the lightweight single-path model structure is obtained for the deployment stage, thereby solving the problem that a model combining Attention and multi-path parallel structures occupies a large amount of memory in later application.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments of the present application will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart illustrating steps of a method for generating a speech synthesis model according to an embodiment of the present application;
FIG. 2 is an architecture diagram of a FastSpeech model in the prior art, provided in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an encoder composed of FFT blocks in the prior art, provided in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a Length Regulator in the prior art, provided in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a Duration Predictor in the prior art according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a first speech synthesis model architecture provided by an embodiment of the present application;
FIG. 7 is a flow chart of convolution kernel equivalent transformation provided by an embodiment of the present application;
FIG. 8 is a schematic flow chart diagram illustrating a model training phase provided by an embodiment of the present application;
FIG. 9 is a schematic flow chart diagram of a model application phase provided by an embodiment of the present application;
FIG. 10 is a block diagram of a speech synthesis model generation apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
To facilitate understanding of the method for generating a speech synthesis model according to the embodiment of the present application, the following description is made for the related art:
currently, in an acoustic model synthesis scheme of a Speech synthesis technology, a Fast Speech synthesis (Fast Speech) system architecture is shown in fig. 2. Namely, the Fast Speech system architecture comprises an Encoder (Encoder) composed of a rain removing network (PReNet) and a Fast Fourier transform module (FFT block) shown in FIG. 3, a Length prediction model (Length Regular) shown in FIG. 4, and a Decoder (Decoder) also composed of the FFT block. The input of the Encoder is the sum of a feature vector and a position Encoding vector (Positional Encoding) of the same Phoneme in a text Phoneme sequence to be processed, which is obtained through a preprocessing structure (phonememe Embedding); length Regular is used to predict the frame Length used by each phoneme; the Decoder outputs acoustic characteristics to the input Encoder and Length Regular processing results through a neural network module (Linear Layer).
Wherein, the training process of the Fast Speech system architecture model is as follows steps F1 to F5:
step F1: the text sample is processed by the front end to obtain a Phoneme sequence (Phoneme), and then the Phoneme sequence enters a preprocessing structure of an Encoder for sequence integration.
Step F2: the preprocessed phoneme features are added to the position encoding information (Positional Encoding) and input to the Encoder for encoding. The Encoder's infrastructure consists of a stack of N multi-path parallel attention modules (i.e., N FFT blocks). As shown in FIG. 3, the phoneme features are self-encoded by each Multi-Head Attention module (Self-Attention) and then shaped and normalized by a feed-forward network (FFN), a one-dimensional convolution (Conv1D), and a residual connection and normalization module (Add & Norm).
Step F3: the Length Regulator receives the output of the Encoder, predicts the duration of each input phoneme, and expands the encoded output into a new matrix according to the phoneme durations as the input of the Decoder. That is, as shown in fig. 4, the output of the Encoder (Hpho) passes through a Duration Predictor as shown in fig. 5, which predicts the duration of each input phoneme, where "D = [2, 2, 3, 1]" are the phoneme durations; the Length Regulator (LR) then expands the output into a new matrix as the input of the Decoder (Hmel), where "a = 1.0" is the matrix expansion coefficient (alpha).
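To make this expansion concrete, here is a minimal Python sketch of the Length Regulator behavior described above; the function name, tensor sizes, and the use of PyTorch are illustrative assumptions, not code from FastSpeech or this application.

```python
import torch

def length_regulate(hpho: torch.Tensor, durations, alpha: float = 1.0) -> torch.Tensor:
    # Repeat each phoneme's encoded vector according to its predicted
    # duration, scaled by the expansion coefficient alpha.
    reps = torch.tensor([max(1, round(d * alpha)) for d in durations])
    return torch.repeat_interleave(hpho, reps, dim=0)

hpho = torch.randn(4, 256)                    # Hpho: 4 encoded phonemes
hmel = length_regulate(hpho, [2, 2, 3, 1])    # D = [2, 2, 3, 1], alpha = 1.0
print(hmel.shape)                             # torch.Size([8, 256]): Hmel frames
```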
Step F4: the Decoder is stacked from N Multi-Head Attention modules. The matrix encoding information output by the Length Regulator is combined with the same position encoding and input into the Multi-Head Attention structure for self-encoding. Finally, the acoustic features decoded by the Decoder are post-processed (PostNet) to obtain optimized, smooth acoustic features.
Step F5: the acoustic features are passed through a vocoder to obtain the target speaker's voice.
In addition, a large number of Attention structures and multi-path parallel structures are adopted in the FastSpeech model. The advantage of the multi-path parallel structure is that, exploiting the ease of differentiating residuals, it helps optimize the gradient updates of the model in the training stage and avoids the vanishing-gradient problem for bottom-layer parameters, thereby accelerating model convergence. The disadvantage is that memory is allocated separately for each path's input, increasing the memory consumption of the system. Likewise, the Attention structure models the whole input sequence, so the context information is richer and more accurate, but according to a space-complexity analysis the memory consumed is O(n^2), where n is the sequence length. Therefore, although the FastSpeech model converges quickly in the training phase, it occupies a large amount of memory in later application.
In the embodiment of the application, a plurality of text samples can be obtained; a first speech synthesis model is obtained by training on the text samples, wherein the first speech synthesis model comprises a plurality of convolution processing modules and each convolution processing module comprises a plurality of parallel convolutional layers; when i takes each integer from 1 to n, one convolutional layer equivalent to the plurality of parallel convolutional layers included in the ith convolution processing module is determined as the ith target convolutional layer, where n is the number of convolution processing modules; the plurality of parallel convolutional layers included in the ith convolution processing module of the first speech synthesis model are replaced with the ith target convolutional layer; and when the plurality of parallel convolutional layers included in the nth convolution processing module have been replaced with the nth target convolutional layer, a second speech synthesis model is obtained.
That is, in the embodiment of the present application, the first speech synthesis model adopts a multi-path model structure in the training stage, and in the generation stage the plurality of parallel convolutional layers included in each convolution processing module of the first speech synthesis model are equivalently transformed, without changing the input or output, into one target convolutional layer, yielding the second speech synthesis model; in other words, the first speech synthesis model with a multi-path model structure is transformed into a second speech synthesis model with a single-path model structure. Therefore, in the embodiment of the present application, the second speech synthesis model not only retains the benefit of the multi-path model structure in the training phase, which helps the model converge quickly, but also has the advantage of a lightweight single-path model structure in the deployment phase, thereby solving the problem that a model combining Attention and multi-path parallel structures occupies a large amount of memory in later application.
In order to make the above objects, features and advantages of the present application more comprehensible, a method of generating a speech synthesis model according to an embodiment of the present application is described in detail below with reference to the accompanying drawings and detailed description.
Referring to fig. 1, a flowchart illustrating steps of a method for generating a speech synthesis model in an embodiment of the present application is shown, and the method may include the following steps 101 to 105.
Step 101: a plurality of text samples are obtained.
For example, a text sample may be obtained through keyboard input, picture recognition, and the like. With keyboard input, the text sample is the text generated by a sequence of key presses; with picture recognition, the text sample is the text recognized from a picture.
Step 102: and training the text sample to obtain a first speech synthesis model, wherein the first speech synthesis model comprises a plurality of convolution processing modules, and each convolution processing module comprises a plurality of parallel convolution layers.
In addition, the first speech synthesis model is a model that outputs the speech acoustic features of a text to be processed. Therefore, the input of the first speech synthesis model is the text to be processed, and the output is the speech acoustic features corresponding to that text.
Step 103: and when i takes each integer from 1 to n, determining one convolution layer equivalent to a plurality of parallel convolution layers included in the ith convolution processing module as the ith target convolution layer, wherein n is the number of the convolution processing modules.
Therefore, in the embodiment of the present application, each convolution processing module of the first speech synthesis model has an equivalent target convolution layer.
In addition, determining one convolutional layer equivalent to the plurality of parallel convolutional layers included in the ith convolution processing module as the ith target convolutional layer means that the output obtained after the model input is processed by the plurality of parallel convolutional layers of the ith convolution processing module is the same as the output obtained after the same input is processed by the ith target convolutional layer.
Therefore, the plurality of parallel convolutional layers included in the ith convolution processing module are equivalent to the ith target convolutional layer; that is, processing the same input with either the plurality of parallel convolutional layers of the ith convolution processing module or the ith target convolutional layer yields the same output, as the sketch below illustrates.
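Since step 103 hinges on this equivalence, a small numerical check is useful. The following PyTorch sketch is only an illustration under assumed shapes (one 3-tap branch, one 1-tap branch, and an identity branch, patterned on the examples given later in this description); it is not the application's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C, T = 8, 50                                        # channels, sequence length
conv3 = nn.Conv1d(C, C, kernel_size=3, padding=1)   # one parallel branch
conv1 = nn.Conv1d(C, C, kernel_size=1, padding=0)   # another parallel branch

x = torch.randn(1, C, T)
# Multi-path output: 3-tap branch + 1-tap branch + identity branch.
y_multi = conv3(x) + conv1(x) + x

# Build the single equivalent ("target") convolutional layer.
merged = nn.Conv1d(C, C, kernel_size=3, padding=1)
k1_padded = F.pad(conv1.weight, (1, 1))             # zero-pad 1 tap to 3 taps
identity_kernel = torch.zeros(C, C, 3)
for c in range(C):
    identity_kernel[c, c, 1] = 1.0                  # center tap passes input through
with torch.no_grad():
    merged.weight.copy_(conv3.weight + k1_padded + identity_kernel)
    merged.bias.copy_(conv3.bias + conv1.bias)

y_single = merged(x)
print(torch.allclose(y_multi, y_single, atol=1e-5))  # True: same output, one path
```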
Step 104: and replacing the plurality of parallel convolutional layers included in the ith convolution processing module of the first speech synthesis model by the ith target convolutional layer.
After the model training is completed, the plurality of parallel convolutional layers in the ith convolution processing module included in the obtained first speech synthesis model can be equivalently transformed into the ith target convolutional layer.
Step 105: and when the plurality of parallel convolution layers included by the nth convolution processing module are replaced by the nth target convolution layer, obtaining a second speech synthesis model.
In this embodiment, a second speech synthesis model can be obtained by replacing a plurality of parallel convolutional layers included in a plurality of convolutional processing modules in a first speech synthesis model with a target convolutional layer.
The second speech synthesis model is likewise a model that outputs the speech acoustic features of a text to be processed: its input is the text to be processed and its output is the corresponding speech acoustic features. Therefore, in the embodiment of the application, the text to be processed can be input into the second speech synthesis model in the practical application stage to obtain its speech acoustic features.
In addition, when the same input is fed to the first speech synthesis model and the second speech synthesis model respectively, in the first speech synthesis model it is processed by the plurality of parallel convolutional layers included in each convolution processing module, and the system allocates memory each time it is processed by one of the parallel convolutional layers; in the second speech synthesis model, it only needs to be processed by the target convolutional layer equivalent to those parallel convolutional layers, so memory only needs to be allocated for the target convolutional layer. By this comparison, the second speech synthesis model saves system memory relative to the first speech synthesis model.
As can be seen from the foregoing steps 101 to 105, in the embodiment of the present application, a plurality of text samples can be obtained; a first speech synthesis model is obtained by training on the text samples, wherein the first speech synthesis model comprises a plurality of convolution processing modules and each convolution processing module comprises a plurality of parallel convolutional layers; when i takes each integer from 1 to n, one convolutional layer equivalent to the plurality of parallel convolutional layers included in the ith convolution processing module is determined as the ith target convolutional layer, where n is the number of convolution processing modules; the plurality of parallel convolutional layers included in the ith convolution processing module of the first speech synthesis model are replaced with the ith target convolutional layer; and when the plurality of parallel convolutional layers included in the nth convolution processing module have been replaced with the nth target convolutional layer, a second speech synthesis model is obtained.
The first speech synthesis model adopts a multi-path model structure in the training stage. In the generation stage, the plurality of parallel convolutional layers included in each convolution processing module of the first speech synthesis model are equivalently transformed, without changing the input or output, into one target convolutional layer, yielding the second speech synthesis model; that is, the first speech synthesis model with a multi-path model structure is transformed into a second speech synthesis model with a single-path model structure. Therefore, in the embodiment of the present application, the second speech synthesis model not only retains the benefit of the multi-path model structure in the training phase, which helps the model converge quickly, but also has the advantage of a lightweight single-path model structure in the deployment phase, thereby solving the problem that a model combining Attention and multi-path parallel structures occupies a large amount of memory in later application.
Optionally, after obtaining the second speech synthesis model, the method further includes:
acquiring a text to be processed;
acquiring a phoneme sequence of the text to be processed;
acquiring a feature vector of each phoneme in the phoneme sequence and a position coding vector of each phoneme in the phoneme sequence;
and inputting the feature vector and the position coding vector into the second speech synthesis model, and outputting the speech acoustic features of the text to be processed.
A phoneme is the minimum speech unit divided according to the natural attributes of speech; it is analyzed according to the articulatory actions within a syllable, one action forming one phoneme. For example, the Chinese syllable "ā" (啊) has one phoneme, "ài" (爱) has two phonemes, and "dài" (代) has three phonemes. A phoneme sequence is the ordered arrangement of the phonemes corresponding to each Chinese syllable.
For example, if the obtained text to be processed is "obtaining speaker voice" (in Chinese), its phoneme sequence is "huo qu shuo hua ren yu yin". A feature vector is then obtained for each phoneme in the sequence, along with its position encoding vector; for instance, the feature vector corresponding to "q" is obtained together with the position encoding vector of "q" in the sequence, and since "q" is the 4th phoneme of the sequence, the position encoding vector corresponding to the 4th phoneme is obtained. Then the feature vector and position encoding vector of each phoneme in the sequence are added, the summed vectors are input into the second speech synthesis model, and the speech acoustic features of the text to be processed are output.
In addition, passing the speech acoustic features output by the second speech synthesis model through a vocoder yields the synthesized speech.
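As a minimal sketch of how the model input might be formed, the snippet below assumes the standard sinusoidal positional encoding from the Transformer/FastSpeech literature, since the application does not spell out the encoding formula; the vocabulary size, dimension, and phoneme ids are toy values.

```python
import torch
import torch.nn as nn

def sinusoidal_positions(seq_len: int, dim: int) -> torch.Tensor:
    # Standard sinusoidal position encoding: sin on even dims, cos on odd dims.
    pos = torch.arange(seq_len).unsqueeze(1).float()
    i = torch.arange(0, dim, 2).float()
    angles = pos / torch.pow(10000.0, i / dim)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

vocab_size, dim = 100, 256                         # illustrative sizes
embed = nn.Embedding(vocab_size, dim)              # phoneme id -> feature vector
phoneme_ids = torch.tensor([[7, 21, 21, 3, 42]])   # toy phoneme id sequence
# Sum of feature vector and position encoding vector: the "first vectors".
first_vectors = embed(phoneme_ids) + sinusoidal_positions(5, dim)
```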
Optionally, the first speech synthesis model includes an encoder, a duration prediction module, and a decoder, where the encoder includes at least one convolution processing module, and the decoder includes at least one convolution processing module.
The encoder may be composed of a PreNet and re-parameterizable convolution modules (Rep blocks), and the decoder may be composed of Rep blocks. Thus, as shown in fig. 6, in the embodiment of the present application, the overall architecture of the first speech synthesis model may include an Encoder composed of a PreNet and Rep blocks, a Length Regulator, and a Decoder also composed of Rep blocks.
In addition, the input of the Encoder is the sum of the feature vector of each phoneme in the phoneme sequence of the text to be processed, obtained through Phoneme Embedding, and the corresponding position encoding vector (Positional Encoding); the Length Regulator is used to predict the number of frames used by each phoneme; and the Decoder passes the results of the Encoder and Length Regulator processing through a Linear Layer to output acoustic features.
Optionally, each convolutional layer includes a convolution kernel; the convolution kernel of the ith target convolutional layer is the sum of the convolution kernels of the plurality of convolutional layers included in the ith convolution processing module and the ith identity matrix, where the ith identity matrix has the same number of rows and columns as the convolution kernel with the fewest elements in the ith convolution processing module.
For example, one of the convolution processing modules of the first speech synthesis model includes two parallel convolutional layers whose convolution kernels are the first preset convolution kernel 701 and the second preset convolution kernel 702 shown in fig. 7. The first preset convolution kernel is a 3 × 3 matrix and the second preset convolution kernel is a 1 × 1 matrix, so a zero-padding operation is applied to the second preset convolution kernel and to the 1 × 1 identity matrix 703 to form 3 × 3 matrices; the first preset convolution kernel, the zero-padded second preset convolution kernel, and the zero-padded identity matrix are then added to obtain a new 3 × 3 matrix, which is the target convolution kernel 704 equivalent to the first and second preset convolution kernels.
Alternatively, if the first preset convolution kernel is a 3 × 3 matrix and the second preset convolution kernel is also a 3 × 3 matrix, then the identity matrix is a 3 × 3 matrix, and the first preset convolution kernel, the second preset convolution kernel, and the 3 × 3 identity matrix are added to obtain a new 3 × 3 matrix, which is the target convolution kernel equivalent to the two preset convolution kernels.
That is, after the same vector is processed by the first preset convolution kernel and the second preset convolution kernel respectively, the result of adding the resulting matrices and the identity matrix 703 is the same as the result of processing the vector with the target convolution kernel 704 alone.
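The kernel arithmetic of fig. 7 can be written out directly. In the sketch below the numeric entries are invented for illustration; only the shapes and the zero-padding follow the figure.

```python
import torch
import torch.nn.functional as F

k3 = torch.tensor([[1., 0., 2.],
                   [0., 3., 0.],
                   [4., 0., 5.]])         # first preset convolution kernel (701)
k1 = torch.tensor([[2.]])                 # second preset convolution kernel (702)
eye1 = torch.tensor([[1.]])               # 1 × 1 identity matrix (703)

# Zero-pad the 1 × 1 tensors to 3 × 3 so all three summands share one shape.
k1_p  = F.pad(k1,  (1, 1, 1, 1))
eye_p = F.pad(eye1, (1, 1, 1, 1))

target = k3 + k1_p + eye_p                # target convolution kernel (704)
print(target)                             # center element: 3 + 2 + 1 = 6
```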
Optionally, training the text sample to obtain a first speech synthesis model, including:
acquiring a phoneme sequence of the text sample;
acquiring a feature vector of each phoneme in the phoneme sequence and a position coding vector of each phoneme in the phoneme sequence;
adding the feature vectors and the position coding vectors of the same phoneme to obtain a plurality of first vectors;
processing the first vector of the same text sample by adopting a predetermined parameter of a third speech synthesis model to obtain a speech acoustic characteristic corresponding to each text sample;
modifying the parameters of the third speech synthesis model according to the obtained speech acoustic characteristics to obtain the parameters of a fourth speech synthesis model;
and processing the first vector of the same text sample by adopting the parameters of the fourth speech synthesis model until the acoustic features of the obtained speech corresponding to the text sample meet a preset condition, and determining the speech synthesis model meeting the preset condition as the first speech synthesis model.
And in the case that the first speech synthesis model comprises an encoder, a duration prediction module and a decoder, the parameters of the third speech synthesis model and the parameters of the fourth speech synthesis model comprise parameters of the encoder, parameters of the duration prediction module and parameters of the decoder.
In addition, each text sample corresponds to a real speech recording, so the speech synthesized from the acoustic features output by each speech synthesis model can be compared with the real speech corresponding to the text sample to obtain their similarity, and acoustic features whose similarity is greater than a preset similarity can be marked with a preset mark. The preset condition may therefore be that the number of speech acoustic features carrying the preset mark reaches a preset number. That is, when the number of marked speech acoustic features reaches the preset number, the accuracy with which the speech synthesis model outputs speech acoustic features has met the requirement, i.e., the speech synthesis model has reached the convergence condition.
Alternatively, the speech acoustic features output by each speech synthesis model can be compared with the actual speech acoustic features of each text sample to compute a loss value for the model. The preset condition may then be that the computed loss value is smaller than a preset loss value. That is, when the computed loss value is smaller than the preset loss value, the accuracy with which the speech synthesis model outputs speech acoustic features has met the requirement, i.e., the speech synthesis model has reached the convergence condition.
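As a hedged sketch of this second convergence criterion, the loop below trains until the average loss between predicted and real acoustic features falls below a preset loss value; the MSE criterion, optimizer settings, and data loader are placeholder assumptions, not prescribed by the application.

```python
import torch
import torch.nn as nn

def train_until_converged(model, loader, preset_loss=0.01, max_epochs=1000):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.MSELoss()          # compare predicted vs. real acoustic features
    for epoch in range(max_epochs):
        total, count = 0.0, 0
        for first_vectors, real_features in loader:
            pred = model(first_vectors)
            loss = criterion(pred, real_features)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()          # "modify the parameters" of the model
            total, count = total + loss.item(), count + 1
        if total / count < preset_loss:
            return model              # preset condition met: converged
    return model
```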
Optionally, the third speech synthesis model includes an encoder, a duration prediction module, and a decoder;
a process for processing one of the first vectors of a text sample using predetermined parameters of the third speech synthesis model, comprising:
inputting a second vector to an encoder of the third speech synthesis model, and outputting an encoded vector, wherein the second vector is one of the first vectors of one of the text samples;
inputting the coding vector to a duration prediction module of the third speech synthesis model, and outputting a phoneme duration matrix which expands the coding vector according to the duration of the phoneme corresponding to the second vector;
and inputting the phoneme duration matrix into a decoder of the third speech synthesis model, and outputting the speech acoustic features of the phonemes corresponding to the second vector.
Therefore, the input of the Encoder is the sum of the feature vector and the position encoding vector of each phoneme in the phoneme sequence of the text to be processed; the Length Regulator is used to predict the number of frames used by each phoneme; and the Decoder outputs acoustic features from the results of the Encoder and Length Regulator processing.
For example, the phoneme sequence of one text sample is "huo qu shuo hua ren yu yin". Each phoneme in the sequence is mapped to a first vector, and all the first vectors are input into the third speech synthesis model. The encoder of the third speech synthesis model processes each first vector and outputs the corresponding encoding vector; the duration prediction module of the third speech synthesis model expands each encoding vector into a phoneme duration matrix; and finally the decoder processes each phoneme duration matrix and outputs the speech acoustic features of the phoneme corresponding to each first vector.
Optionally, the encoder of the third speech synthesis model includes a first convolution processing module and a second convolution processing module, where the first convolution processing module includes a first convolution layer and a second convolution layer, the first convolution layer includes a first preset convolution kernel, the second convolution layer includes a second preset convolution kernel, the second convolution processing module includes a third convolution layer and a fourth convolution layer, the third convolution layer includes a third preset convolution kernel, and the fourth convolution layer includes a fourth preset convolution kernel;
the inputting a second vector to an encoder of the third speech synthesis model, outputting an encoded vector, comprising:
processing the second vector through the first preset convolution kernel to obtain a first matrix, and processing the second vector through the second preset convolution kernel to obtain a second matrix;
adding the first matrix, the second matrix and the first identity matrix to obtain a third matrix;
normalizing the third matrix to obtain a third vector;
processing the third vector by the third preset convolution kernel to obtain a fifth matrix, and processing the third vector by the fourth preset convolution kernel to obtain a sixth matrix;
adding the fifth matrix, the sixth matrix and the second identity matrix to obtain a seventh matrix;
carrying out normalization processing on the seventh matrix to obtain the coding vector;
The number of rows and columns of the first identity matrix is the same as that of the convolution kernel with the fewest elements in the first convolution processing module, and the number of rows and columns of the second identity matrix is the same as that of the convolution kernel with the fewest elements in the second convolution processing module.
For example, in the training phase shown in fig. 8, the second vector is processed by the first preset convolution kernel (a 3 × 3 convolution kernel) to obtain a 3 × 3 first matrix, and by the second preset convolution kernel (a 1 × 1 convolution kernel) to obtain a 1 × 1 second matrix. The second matrix and the first identity matrix (the 1 × 1 identity matrix) are each expanded to 3 × 3 matrices by a zero-padding operation and then added to the first matrix, yielding a 3 × 3 third matrix. The third matrix is normalized by a normalization module (Batch Norm) to obtain the third vector. The third vector is processed by the third preset convolution kernel (a 3 × 3 convolution kernel) to obtain a 3 × 3 fifth matrix, and by the fourth preset convolution kernel (a 1 × 1 convolution kernel) to obtain a 1 × 1 sixth matrix. The sixth matrix and the second identity matrix (the 1 × 1 identity matrix) are expanded to 3 × 3 matrices by zero padding and added to the fifth matrix, yielding a 3 × 3 seventh matrix. The seventh matrix is normalized by Batch Norm to output the encoding vector corresponding to the second vector.
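The two chained convolution processing modules just described can be expressed compactly as a training-time module. The sketch below is one plausible realization under the shapes used in this example (the identity-matrix branch is rendered as an identity connection); it is not the application's source code.

```python
import torch
import torch.nn as nn

class RepBlock(nn.Module):
    """Training-time module: two parallel convolutions plus an identity branch,
    summed and then batch normalized."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv1d(channels, channels, 3, padding=1)  # 3-tap branch
        self.conv1 = nn.Conv1d(channels, channels, 1)             # 1-tap branch
        self.norm = nn.BatchNorm1d(channels)                      # Batch Norm

    def forward(self, x):
        # Sum of both convolution branches and the identity branch,
        # described in the text as adding the identity matrix.
        return self.norm(self.conv3(x) + self.conv1(x) + x)

# The encoder of this example stacks two such modules, as in fig. 8.
encoder = nn.Sequential(RepBlock(256), RepBlock(256))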
Optionally, the decoder of the third speech synthesis model includes a third convolution processing module and a fourth convolution processing module, the third convolution processing module includes a fifth convolution layer and a sixth convolution layer, the fifth convolution layer includes a fifth preset convolution kernel, the sixth convolution layer includes a sixth preset convolution kernel, the fourth convolution processing module includes a seventh convolution layer and an eighth convolution layer, the seventh convolution layer includes a seventh preset convolution kernel, and the eighth convolution layer includes an eighth preset convolution kernel;
the inputting the phoneme duration matrix into a decoder of the third speech synthesis model and outputting the speech acoustic features of the phonemes corresponding to the second vector includes:
processing the phoneme duration matrix through the fifth preset convolution kernel to obtain an eighth matrix, and processing the phoneme duration matrix through the sixth preset convolution kernel to obtain a ninth matrix;
adding the eighth matrix, the ninth matrix and the third identity matrix to obtain a tenth matrix;
normalizing the tenth matrix to obtain a fourth vector;
processing the fourth vector by the seventh preset convolution kernel to obtain an eleventh matrix, and processing the fourth vector by the eighth preset convolution kernel to obtain a twelfth matrix;
adding the eleventh matrix, the twelfth matrix and the fourth identity matrix to obtain a thirteenth matrix;
performing normalization processing on the thirteenth matrix to obtain the speech acoustic characteristics of the phoneme corresponding to the second vector;
The number of rows and columns of the third identity matrix is the same as that of the convolution kernel with the fewest elements in the third convolution processing module, and the number of rows and columns of the fourth identity matrix is the same as that of the convolution kernel with the fewest elements in the fourth convolution processing module.
For example, in the training stage shown in fig. 8, the phoneme duration matrix is processed by the fifth preset convolution kernel (a 3 × 3 convolution kernel) to obtain a 3 × 3 eighth matrix, and by the sixth preset convolution kernel (a 1 × 1 convolution kernel) to obtain a 1 × 1 ninth matrix. The ninth matrix and the third identity matrix (the 1 × 1 identity matrix) are expanded to 3 × 3 matrices by a zero-padding operation and added to the eighth matrix, yielding a 3 × 3 tenth matrix. The tenth matrix is normalized by Batch Norm to obtain the fourth vector. The fourth vector is processed by the seventh preset convolution kernel (a 3 × 3 convolution kernel) to obtain a 3 × 3 eleventh matrix, and by the eighth preset convolution kernel (a 1 × 1 convolution kernel) to obtain a 1 × 1 twelfth matrix. The twelfth matrix and the fourth identity matrix (the 1 × 1 identity matrix) are expanded to 3 × 3 matrices by zero padding and added to the eleventh matrix, yielding a 3 × 3 thirteenth matrix. The thirteenth matrix is normalized by Batch Norm to output the speech acoustic features of the phoneme corresponding to the second vector.
In addition, after the training of the first speech synthesis model is completed, the convolution kernels of each convolution processing module in the first speech synthesis model can be converted into one equivalent convolution kernel, and the equivalent convolution kernels obtained through this conversion are the convolution kernels in the second speech synthesis model.
For example, suppose that in the second speech synthesis model the encoder includes a 3 × 3 first target convolution kernel and a 3 × 3 second target convolution kernel, and the decoder includes a 3 × 3 third target convolution kernel and a 3 × 3 fourth target convolution kernel. Then, as shown in fig. 9, in the actual application stage of the second speech synthesis model, the sum of the feature vector and the position encoding vector of each phoneme in the phoneme sequence of the text to be processed is input into the second speech synthesis model and processed by the first target convolution kernel in the encoder to obtain a 3 × 3 first target matrix, which is normalized by Batch Norm to obtain a first target vector. The first target vector is processed by the second target convolution kernel to obtain a 3 × 3 second target matrix, which is normalized by Batch Norm to output the encoding vector. The encoding vector is input into the duration prediction module of the second speech synthesis model, which outputs a phoneme duration matrix that expands the encoding vector according to the phoneme durations. Further, the phoneme duration matrix is processed by the third target convolution kernel in the decoder to obtain a 3 × 3 third target matrix, which is normalized by Batch Norm to obtain a third target vector; and the third target vector is processed by the fourth target convolution kernel in the decoder to obtain a 3 × 3 fourth target matrix, which is normalized by Batch Norm to output the speech acoustic features of the corresponding phonemes.
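For contrast with the training-time module sketched earlier, the application-stage counterpart below keeps a single target convolution per module, so memory is allocated for only one path; again, the module layout is an illustrative assumption rather than the application's implementation.

```python
import torch.nn as nn

class MergedBlock(nn.Module):
    """Application-stage module: one target convolution followed by Batch Norm."""
    def __init__(self, channels: int):
        super().__init__()
        self.target_conv = nn.Conv1d(channels, channels, 3, padding=1)
        self.norm = nn.BatchNorm1d(channels)

    def forward(self, x):
        return self.norm(self.target_conv(x))   # single path: one allocation

# Second speech synthesis model sketch: two merged blocks in the encoder and
# two in the decoder, matching the four target kernels of this example.
encoder = nn.Sequential(MergedBlock(256), MergedBlock(256))
decoder = nn.Sequential(MergedBlock(256), MergedBlock(256))
```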
In summary, the specific implementation of the method for generating a speech synthesis model according to the embodiment of the present application can be as follows:
step H1: obtaining a plurality of text samples, and obtaining a corresponding phoneme sequence from one text sample through front-end processing.
Step H2: and adding the feature vector and the position coding vector of the same phoneme in the phoneme sequence to obtain a plurality of first vectors. Steps H3-H5 are performed during the training phase of the first speech synthesis model. Step H6 is performed during the generation phase of the second speech synthesis model.
Step H3: and processing the first vector of the same text sample by adopting the predetermined parameter of the third speech synthesis model to obtain the speech acoustic characteristics corresponding to each text sample.
Step H4: and modifying the parameters of the third speech synthesis model according to the obtained speech acoustic characteristics corresponding to each text sample to obtain the parameters of the fourth speech synthesis model.
Step H5: and processing the first vector of the same text sample by adopting the parameter of the fourth speech synthesis model until the acoustic characteristics of the speech corresponding to the obtained text sample meet the preset condition, and determining the speech synthesis model meeting the preset condition as the first speech synthesis model.
Step H6: equivalently transforming the plurality of parallel convolutional layers included in each convolution processing module of the first speech synthesis model into one target convolutional layer, to obtain a second speech synthesis model.
After the second speech synthesis model is generated, in practical application, the text to be processed is processed to obtain its phoneme sequence; the sum of the feature vector and the position encoding vector of each phoneme in the sequence is computed to obtain a plurality of input vectors; and the input vectors are then input into the second speech synthesis model, which outputs the speech acoustic features of the text to be processed.
As can be seen from the above description, in the embodiment of the present application, the first speech synthesis model adopts a multi-path model structure in the training stage, and in the generation stage the plurality of parallel convolutional layers included in each convolution processing module of the first speech synthesis model are equivalently transformed, without changing the input or output, into one target convolutional layer, yielding the second speech synthesis model; in other words, the first speech synthesis model with a multi-path model structure is transformed into a second speech synthesis model with a single-path model structure. Therefore, the second speech synthesis model not only retains the benefit of the multi-path model structure in the training phase, which helps the model converge quickly, but also has the advantage of a lightweight single-path model structure in the deployment phase, thereby solving the problem that a model combining Attention and multi-path parallel structures occupies a large amount of memory in later application.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts, but those skilled in the art will recognize that the embodiments of the application are not limited by the order of the acts described, as some steps may be performed in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are preferred embodiments, and the acts involved are not necessarily required by the embodiments of the application.
Referring to fig. 10, a block diagram of a device for generating a speech synthesis model in an embodiment of the present application is shown, where the device 1000 for generating a speech synthesis model may include the following modules:
a sample obtaining module 1001 configured to obtain a plurality of text samples;
a model obtaining module 1002, configured to train the text sample to obtain a first speech synthesis model, where the first speech synthesis model includes multiple convolution processing modules, and each convolution processing module includes multiple parallel convolution layers;
an equivalent transformation module 1003, configured to determine, as an ith target convolutional layer, one convolutional layer that is equivalent to the plurality of parallel convolutional layers included in the ith convolutional processing module when i is an integer from 1 to n, where n is the number of the convolutional processing modules;
a replacing module 1004, configured to replace the multiple parallel convolutional layers included in the ith convolution processing module of the first speech synthesis model with an ith target convolutional layer;
a model generating module 1005, configured to obtain a second speech synthesis model when the plurality of parallel convolutional layers included in the nth convolutional processing module are replaced with the nth target convolutional layer.
Optionally, the apparatus 1000 for generating a speech synthesis model further includes:
the first acquisition module is used for acquiring a text to be processed;
the second acquisition module is used for acquiring a phoneme sequence of the text to be processed;
a third obtaining module, configured to obtain a feature vector of each phoneme in the phoneme sequence and a position coding vector of each phoneme in the phoneme sequence;
and the output module is used for inputting the feature vector and the position coding vector into the second speech synthesis model and outputting the speech acoustic features of the text to be processed.
Optionally, the first speech synthesis model includes an encoder, a duration prediction module, and a decoder, where the encoder includes at least one convolution processing module, and the decoder includes at least one convolution processing module.
Optionally, each convolutional layer includes a convolution kernel; the convolution kernel of the ith target convolutional layer is the sum of the convolution kernels of the plurality of convolutional layers included in the ith convolution processing module and the ith identity matrix, where the ith identity matrix has the same number of rows and columns as the convolution kernel with the fewest elements in the ith convolution processing module.
Optionally, the model obtaining module 1002 includes:
the first obtaining submodule is used for obtaining a phoneme sequence of the text sample;
the second obtaining submodule is used for obtaining a feature vector of each phoneme in the phoneme sequence and a position coding vector of each phoneme in the phoneme sequence;
the first vector acquisition submodule is used for adding the feature vector and the position coding vector of the same phoneme to obtain a plurality of first vectors;
the third obtaining submodule is used for processing the first vector of the same text sample by adopting a predetermined parameter of a third speech synthesis model to obtain the speech acoustic characteristics corresponding to each text sample;
the fourth obtaining submodule is used for modifying the parameters of the third speech synthesis model according to the obtained speech acoustic characteristics to obtain the parameters of a fourth speech synthesis model;
and the determining submodule is used for processing the first vector of the same text sample by adopting the parameters of the fourth speech synthesis model until the obtained speech acoustic characteristics corresponding to the text sample meet the preset conditions, and determining the speech synthesis model meeting the preset conditions as the first speech synthesis model.
Optionally, the third speech synthesis model includes an encoder, a duration prediction module, and a decoder;
the third obtaining sub-module includes:
a second output unit, configured to input a second vector to the encoder of the third speech synthesis model, and output an encoded vector, where the second vector is one of the first vectors of one of the text samples;
a third output unit, configured to input the coding vector to a duration prediction module of the third speech synthesis model, and output a phoneme duration matrix that extends the coding vector according to the duration of the phoneme corresponding to the second vector;
and a fourth output unit, configured to input the phoneme duration matrix to a decoder of the third speech synthesis model, and output the speech acoustic features of the phoneme corresponding to the second vector.
Optionally, the encoder of the third speech synthesis model includes a first convolution processing module and a second convolution processing module, where the first convolution processing module includes a first convolution layer and a second convolution layer, the first convolution layer includes a first preset convolution kernel, the second convolution layer includes a second preset convolution kernel, the second convolution processing module includes a third convolution layer and a fourth convolution layer, the third convolution layer includes a third preset convolution kernel, and the fourth convolution layer includes a fourth preset convolution kernel;
the second output unit is specifically configured to:
processing the second vector through the first preset convolution kernel to obtain a first matrix, and processing the second vector through the second preset convolution kernel to obtain a second matrix;
adding the first matrix, the second matrix and the first identity matrix to obtain a third matrix;
normalizing the third matrix to obtain a third vector;
processing the third vector by the third preset convolution kernel to obtain a fifth matrix, and processing the third vector by the fourth preset convolution kernel to obtain a sixth matrix;
adding the fifth matrix, the sixth matrix and the second identity matrix to obtain a seventh matrix;
carrying out normalization processing on the seventh matrix to obtain the coding vector;
the number of rows and columns of the first identity matrix is the same as the number of rows and columns of the convolution kernel with the fewest elements in the first convolution processing module, and the number of rows and columns of the second identity matrix is the same as the number of rows and columns of the convolution kernel with the fewest elements in the second convolution processing module.
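In the forward pass, adding the identity matrix corresponds to an identity (skip) branch alongside the two convolutions. The following is a minimal sketch of the two encoder modules, assuming 1-D convolutions with 'same' padding and layer normalization (the patent does not name the normalization type):

import torch
import torch.nn as nn

def conv_module(x, conv_a, conv_b, norm):
    # two parallel convolutions plus an identity branch, then normalization
    return norm((conv_a(x) + conv_b(x) + x).transpose(1, 2)).transpose(1, 2)

dim, length = 8, 5
second_vector = torch.randn(1, dim, length)
third_vector = conv_module(second_vector,
                           nn.Conv1d(dim, dim, 3, padding=1),   # first preset kernel
                           nn.Conv1d(dim, dim, 1),              # second preset kernel
                           nn.LayerNorm(dim))
coding_vector = conv_module(third_vector,
                            nn.Conv1d(dim, dim, 3, padding=1),  # third preset kernel
                            nn.Conv1d(dim, dim, 1),             # fourth preset kernel
                            nn.LayerNorm(dim))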
Optionally, the decoder of the third speech synthesis model includes a third convolution processing module and a fourth convolution processing module, the third convolution processing module includes a fifth convolution layer and a sixth convolution layer, the fifth convolution layer includes a fifth preset convolution kernel, the sixth convolution layer includes a sixth preset convolution kernel, the fourth convolution processing module includes a seventh convolution layer and an eighth convolution layer, the seventh convolution layer includes a seventh preset convolution kernel, and the eighth convolution layer includes an eighth preset convolution kernel;
the fourth output unit is specifically configured to:
processing the phoneme duration matrix through the fifth preset convolution kernel to obtain an eighth matrix, and processing the phoneme duration matrix through the sixth preset convolution kernel to obtain a ninth matrix;
adding the eighth matrix, the ninth matrix and the third identity matrix to obtain a tenth matrix;
normalizing the tenth matrix to obtain a fourth vector;
processing the fourth vector by the seventh preset convolution kernel to obtain an eleventh matrix, and processing the fourth vector by the eighth preset convolution kernel to obtain a twelfth matrix;
adding the eleventh matrix, the twelfth matrix and the fourth identity matrix to obtain a thirteenth matrix;
performing normalization processing on the thirteenth matrix to obtain the speech acoustic features of the phoneme corresponding to the second vector;
the number of rows and columns of the third identity matrix is the same as the number of rows and columns of the convolution kernel with the fewest elements in the third convolution processing module, and the number of rows and columns of the fourth identity matrix is the same as the number of rows and columns of the convolution kernel with the fewest elements in the fourth convolution processing module.
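The decoder applies the same two-module pattern to the phoneme duration matrix; the following self-contained sketch assumes the same illustrative sizes and normalization as the encoder sketch above:

import torch
import torch.nn as nn
import torch.nn.functional as F

dim, frames = 8, 12
duration_matrix = torch.randn(1, dim, frames)          # the phoneme duration matrix
c5, c6 = nn.Conv1d(dim, dim, 3, padding=1), nn.Conv1d(dim, dim, 1)
c7, c8 = nn.Conv1d(dim, dim, 3, padding=1), nn.Conv1d(dim, dim, 1)
ln = lambda t: F.layer_norm(t, t.shape[1:])            # assumed normalization
fourth_vector = ln(c5(duration_matrix) + c6(duration_matrix) + duration_matrix)
acoustic_features = ln(c7(fourth_vector) + c8(fourth_vector) + fourth_vector)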
As can be seen from the above, in the embodiments of the present application, a plurality of text samples can be obtained; the text samples are used for training to obtain a first speech synthesis model, where the first speech synthesis model includes a plurality of convolution processing modules and each convolution processing module includes a plurality of parallel convolutional layers; for each integer i from 1 to n, one convolutional layer equivalent to the plurality of parallel convolutional layers included in the i-th convolution processing module is determined as the i-th target convolutional layer, where n is the number of convolution processing modules; the plurality of parallel convolutional layers included in the i-th convolution processing module of the first speech synthesis model are replaced with the i-th target convolutional layer; and when the plurality of parallel convolutional layers included in the n-th convolution processing module have been replaced with the n-th target convolutional layer, a second speech synthesis model is obtained.
The first speech synthesis model adopts a multi-path model structure in the training stage; in the generation stage, without changing inputs or outputs, the plurality of parallel convolutional layers included in each convolution processing module of the first speech synthesis model are equivalently transformed, by an identity transformation, into a single target convolutional layer to obtain the second speech synthesis model. That is, the first speech synthesis model with a multi-path structure is transformed into the second speech synthesis model with a single-path structure. Therefore, in the embodiments of the present application, the second speech synthesis model retains the strength of the multi-path structure in the training stage, which helps the model converge quickly, while also having the lightweight advantage of a single-path structure in the deployment stage, thereby solving the problem that models combining Attention with multi-path parallel structures occupy a large amount of memory in later applications.
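The equivalence claimed here can be checked numerically. The sketch below, under assumed shapes (equal input and output channels, width-3 and width-1 branches, 'same' padding, no bias), folds both branch kernels and an identity kernel into a single kernel and confirms the output is unchanged:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
ch, length = 4, 9
x = torch.randn(1, ch, length)
k3 = torch.randn(ch, ch, 3)                  # width-3 branch kernel
k1 = torch.randn(ch, ch, 1)                  # width-1 branch kernel

multi = F.conv1d(x, k3, padding=1) + F.conv1d(x, k1) + x   # multi-path forward

ident = torch.zeros(ch, ch, 1)
ident[torch.arange(ch), torch.arange(ch), 0] = 1.0   # identity kernel, sized like the smallest branch
merged = k3.clone()
merged[:, :, 1:2] += k1 + ident              # fold the width-1 kernels into the center
single = F.conv1d(x, merged, padding=1)      # single-path forward

print(torch.allclose(multi, single, atol=1e-5))   # True

Because convolution is linear in its kernel, summing the kernels before convolving gives the same result as summing the branch outputs, which is why the single-path model reproduces the multi-path model's outputs exactly.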
As the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding parts of the description of the method embodiments.
An embodiment of the present application further provides an electronic device, including:
one or more processors; and one or more machine readable media having instructions stored thereon that, when executed by the one or more processors, cause the electronic device to perform methods as described herein.
Embodiments of the present application also provide one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause the processors to perform the methods of embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one of skill in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The method and device for generating a speech synthesis model provided by the present application have been introduced in detail above. Specific examples are used herein to explain the principles and implementation of the present application, and the description of the above embodiments is only intended to help in understanding the method and its core idea. Meanwhile, for those skilled in the art, there may be variations in the specific embodiments and the scope of application according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A method for generating a speech synthesis model, the method comprising:
obtaining a plurality of text samples;
training on the text samples to obtain a first speech synthesis model, wherein the first speech synthesis model comprises a plurality of convolution processing modules, and each convolution processing module comprises a plurality of parallel convolutional layers;
when i takes each integer from 1 to n, determining one convolution layer equivalent to a plurality of parallel convolution layers included in the ith convolution processing module as the ith target convolution layer, wherein n is the number of the convolution processing modules;
replacing a plurality of parallel convolutional layers included in the ith convolution processing module of the first speech synthesis model with an ith target convolutional layer;
and when the plurality of parallel convolutional layers included in the nth convolution processing module have been replaced with the nth target convolutional layer, obtaining a second speech synthesis model.
2. The method of claim 1, wherein after obtaining the second speech synthesis model, the method further comprises:
acquiring a text to be processed;
acquiring a phoneme sequence of the text to be processed;
acquiring a feature vector of each phoneme in the phoneme sequence and a position coding vector of each phoneme in the phoneme sequence;
and inputting the feature vector and the position coding vector into the second speech synthesis model, and outputting the speech acoustic features of the text to be processed.
3. The method of claim 1, wherein the first speech synthesis model comprises an encoder, a duration prediction module, and a decoder, wherein the encoder comprises at least one convolution processing module and the decoder comprises at least one convolution processing module.
4. The method of claim 1, wherein each convolutional layer comprises a convolution kernel, and wherein the convolution kernel of the i-th target convolutional layer is the sum of the convolution kernels of the plurality of convolutional layers comprised in the i-th convolution processing module and an i-th identity matrix, wherein the number of rows and columns of the i-th identity matrix is the same as the number of rows and columns of the convolution kernel with the fewest elements in the i-th convolution processing module.
5. The method of claim 1, wherein training the text sample to obtain a first speech synthesis model comprises:
acquiring a phoneme sequence of the text sample;
acquiring a feature vector of each phoneme in the phoneme sequence and a position coding vector of each phoneme in the phoneme sequence;
adding the feature vectors and the position coding vectors of the same phoneme to obtain a plurality of first vectors;
processing the first vectors of the same text sample using predetermined parameters of a third speech synthesis model to obtain the speech acoustic features corresponding to each text sample;
modifying the parameters of the third speech synthesis model according to the obtained speech acoustic features to obtain the parameters of a fourth speech synthesis model;
and processing the first vectors of the same text sample using the parameters of the fourth speech synthesis model until the speech acoustic features obtained for the text samples meet a preset condition, and determining the speech synthesis model that meets the preset condition as the first speech synthesis model.
6. The method of claim 5, wherein the third speech synthesis model comprises an encoder, a duration prediction module, and a decoder;
a process for processing one of the first vectors of a text sample using predetermined parameters of the third speech synthesis model, comprising:
inputting a second vector to an encoder of the third speech synthesis model, and outputting an encoded vector, wherein the second vector is one of the first vectors of one of the text samples;
inputting the coding vector to a duration prediction module of the third speech synthesis model, and outputting a phoneme duration matrix which expands the coding vector according to the duration of the phoneme corresponding to the second vector;
and inputting the phoneme duration matrix into a decoder of the third speech synthesis model, and outputting the speech acoustic features of the phonemes corresponding to the second vector.
7. The method of claim 6, wherein the encoder of the third speech synthesis model comprises a first convolution processing module and a second convolution processing module, the first convolution processing module comprises a first convolution layer and a second convolution layer, the first convolution layer comprises a first preset convolution kernel, the second convolution layer comprises a second preset convolution kernel, the second convolution processing module comprises a third convolution layer and a fourth convolution layer, the third convolution layer comprises a third preset convolution kernel, and the fourth convolution layer comprises a fourth preset convolution kernel;
the inputting a second vector to an encoder of the third speech synthesis model, outputting an encoded vector, comprising:
processing the second vector through the first preset convolution kernel to obtain a first matrix, and processing the second vector through the second preset convolution kernel to obtain a second matrix;
adding the first matrix, the second matrix and the first identity matrix to obtain a third matrix;
normalizing the third matrix to obtain a third vector;
processing the third vector by the third preset convolution kernel to obtain a fifth matrix, and processing the third vector by the fourth preset convolution kernel to obtain a sixth matrix;
adding the fifth matrix, the sixth matrix and the second identity matrix to obtain a seventh matrix;
carrying out normalization processing on the seventh matrix to obtain the coding vector;
the number of rows and columns of the first identity matrix is the same as the number of rows and columns of the convolution kernel with the fewest elements in the first convolution processing module, and the number of rows and columns of the second identity matrix is the same as the number of rows and columns of the convolution kernel with the fewest elements in the second convolution processing module.
8. The method of claim 6, wherein the decoder of the third speech synthesis model comprises a third convolution processing module and a fourth convolution processing module, the third convolution processing module comprising a fifth convolution layer and a sixth convolution layer, the fifth convolution layer comprising a fifth preset convolution kernel, the sixth convolution layer comprising a sixth preset convolution kernel, the fourth convolution processing module comprising a seventh convolution layer and an eighth convolution layer, the seventh convolution layer comprising a seventh preset convolution kernel, the eighth convolution layer comprising an eighth preset convolution kernel;
the inputting the phoneme duration matrix into a decoder of the third speech synthesis model and outputting the speech acoustic features of the phonemes corresponding to the second vector includes:
processing the phoneme duration matrix through the fifth preset convolution kernel to obtain an eighth matrix, and processing the phoneme duration matrix through the sixth preset convolution kernel to obtain a ninth matrix;
adding the eighth matrix, the ninth matrix and the third identity matrix to obtain a tenth matrix;
normalizing the tenth matrix to obtain a fourth vector;
processing the fourth vector by the seventh preset convolution kernel to obtain an eleventh matrix, and processing the fourth vector by the eighth preset convolution kernel to obtain a twelfth matrix;
adding the eleventh matrix, the twelfth matrix and the fourth identity matrix to obtain a thirteenth matrix;
performing normalization processing on the thirteenth matrix to obtain the speech acoustic features of the phoneme corresponding to the second vector;
the number of rows and columns of the third identity matrix is the same as the number of rows and columns of the convolution kernel with the fewest elements in the third convolution processing module, and the number of rows and columns of the fourth identity matrix is the same as the number of rows and columns of the convolution kernel with the fewest elements in the fourth convolution processing module.
9. An apparatus for generating a speech synthesis model, the apparatus comprising:
the sample acquisition module is used for acquiring a plurality of text samples;
the model acquisition module is used for training on the text samples to obtain a first speech synthesis model, wherein the first speech synthesis model comprises a plurality of convolution processing modules, and each convolution processing module comprises a plurality of parallel convolutional layers;
an equivalent transformation module, configured to determine, as an ith target convolutional layer, one convolutional layer that is equivalent to a plurality of parallel convolutional layers included in an ith convolutional processing module when i takes each integer from 1 to n, where n is the number of the convolutional processing modules;
a replacing module, configured to replace the multiple parallel convolutional layers included in the ith convolution processing module of the first speech synthesis model with an ith target convolutional layer;
and the model generation module is used for obtaining a second speech synthesis model when the plurality of parallel convolutional layers included in the nth convolution processing module have been replaced with the nth target convolutional layer.
10. An electronic device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the method of generating a speech synthesis model according to any one of claims 1 to 8.
CN202111437392.4A 2021-11-29 2021-11-29 Method and device for generating speech synthesis model and electronic equipment Pending CN114220456A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111437392.4A CN114220456A (en) 2021-11-29 2021-11-29 Method and device for generating speech synthesis model and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111437392.4A CN114220456A (en) 2021-11-29 2021-11-29 Method and device for generating speech synthesis model and electronic equipment

Publications (1)

Publication Number Publication Date
CN114220456A (en) 2022-03-22

Family

ID=80698864

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111437392.4A Pending CN114220456A (en) 2021-11-29 2021-11-29 Method and device for generating speech synthesis model and electronic equipment

Country Status (1)

Country Link
CN (1) CN114220456A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115330759A (en) * 2022-10-12 2022-11-11 浙江霖研精密科技有限公司 Method and device for calculating distance loss based on Hausdorff distance
CN115330759B (en) * 2022-10-12 2023-03-10 浙江霖研精密科技有限公司 Method and device for calculating distance loss based on Hausdorff distance

Similar Documents

Publication Publication Date Title
Liu et al. Diffsinger: Singing voice synthesis via shallow diffusion mechanism
US11017761B2 (en) Parallel neural text-to-speech
US11705107B2 (en) Real-time neural text-to-speech
Arık et al. Deep voice: Real-time neural text-to-speech
CN111429889B (en) Method, apparatus, device and computer readable storage medium for real-time speech recognition based on truncated attention
Kameoka et al. ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion
US11210475B2 (en) Enhanced attention mechanisms
JP2022531414A (en) End-to-end automatic speech recognition of digit strings
CN111415667A (en) Stream-type end-to-end speech recognition model training and decoding method
US20120101817A1 (en) System and method for generating models for use in automatic speech recognition
CN107077842A (en) System and method for phonetic transcription
Kameoka et al. Many-to-many voice transformer network
CN112242144A (en) Voice recognition decoding method, device and equipment based on streaming attention model and computer readable storage medium
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
CN114678032B (en) Training method, voice conversion method and device and electronic equipment
CN117043857A (en) Method, apparatus and computer program product for English pronunciation assessment
CN113470622B (en) Conversion method and device capable of converting any voice into multiple voices
CN114220456A (en) Method and device for generating speech synthesis model and electronic equipment
Sun et al. Reconstructing dual learning for neural voice conversion using relatively few samples
CN113782042A (en) Speech synthesis method, vocoder training method, device, equipment and medium
CN113488029A (en) Non-autoregressive speech recognition training decoding method and system based on parameter sharing
Vanhoucke et al. Mixtures of inverse covariances
CN113327578B (en) Acoustic model training method and device, terminal equipment and storage medium
CN116312476A (en) Speech synthesis method and device, storage medium and electronic equipment
CN113593534B (en) Method and device for multi-accent speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination