CN113096634A - Speech synthesis method, apparatus, server and storage medium - Google Patents

Speech synthesis method, apparatus, server and storage medium

Info

Publication number
CN113096634A
Authority
CN
China
Prior art keywords
voice
target
feature vector
speech synthesis
voice data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110342399.1A
Other languages
Chinese (zh)
Other versions
CN113096634B (en)
Inventor
孙奥兰
王健宗
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202110342399.1A
Publication of CN113096634A
Application granted
Publication of CN113096634B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L2013/021 Overlap-add techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to voice processing in artificial intelligence, and provides a voice synthesis method, apparatus, server and storage medium. The method comprises the following steps: calling a speech synthesis model to be trained; inputting a voice sample into a reference encoder for encoding so as to extract a prosody feature vector and a tone feature vector of the voice data; inputting the prosody feature vector, the tone feature vector and a text feature vector into the embedding layer for a superposition operation to obtain a target feature vector; inputting the target feature vector into a decoder for decoding to obtain a predicted Mel frequency spectrum of the voice data; adjusting model parameters of the speech synthesis model according to the predicted Mel frequency spectrum and the real Mel frequency spectrum of the voice data until the speech synthesis model converges; inputting a target voice emotion label and a target identity label of the voice to be synthesized into the converged speech synthesis model to obtain a Mel frequency spectrum; and generating target voice information according to the Mel frequency spectrum. The method and apparatus improve the efficiency of voice synthesis.

Description

Speech synthesis method, apparatus, server and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech synthesis method, apparatus, server, and storage medium.
Background
Voice is one of the carriers of text content and can convey information effectively in daily life, so voice interaction technology has long been an object of attention. For example, the voice interaction process of a customer service system involves a large number of speech synthesis scenarios; from intelligent customer service to intelligent dubbing of short videos or audio books, these are long voice interaction processes, so current speech synthesis mainly seeks to improve the user's perceptual experience. Most existing speech synthesis products on the market adopt speech synthesis models whose training samples involve elements such as different scenes, characters and emotions; the number of training samples required is very large, and the efficiency of realizing speech synthesis is low. Therefore, how to improve the efficiency of speech synthesis has become an urgent problem to be solved.
Disclosure of Invention
The present application is directed to a method, an apparatus, a server and a storage medium for speech synthesis, and aims to improve the efficiency of speech synthesis.
In a first aspect, the present application provides a speech synthesis method, including:
acquiring a voice sample, wherein the voice sample comprises voice data of a user, a voice emotion label corresponding to the voice data and an identity label of the user;
calling a speech synthesis model to be trained, wherein the speech synthesis model comprises a reference encoder, an embedded layer and a decoder;
inputting the voice sample into the reference encoder for encoding processing so as to extract prosody feature vectors and timbre feature vectors of the voice data, wherein the prosody feature vectors are obtained by encoding the voice data according to the voice emotion tags, and the timbre feature vectors are obtained by encoding the voice data according to the identity tags;
inputting the prosodic feature vectors, the tone feature vectors and the text feature vectors corresponding to the voice data into the embedding layer for superposition operation to obtain target feature vectors;
inputting the target feature vector into the decoder for decoding processing to obtain a predicted Mel frequency spectrum of the voice data;
acquiring a real Mel frequency spectrum of the voice data, and adjusting model parameters of the voice synthesis model according to the predicted Mel frequency spectrum and the real Mel frequency spectrum until the voice synthesis model is converged;
acquiring a target voice emotion label and a target identity label of voice to be synthesized, and inputting the target voice emotion label and the target identity label into the converged voice synthesis model to obtain a Mel frequency spectrum of the voice to be synthesized;
and generating target voice information according to the Mel frequency spectrum of the voice to be synthesized.
In a second aspect, the present application further provides a speech synthesis apparatus, comprising:
the acquisition module is used for acquiring a voice sample, wherein the voice sample comprises voice data of a user, a voice emotion tag corresponding to the voice data and an identity tag of the user;
the calling module is used for calling a speech synthesis model to be trained, wherein the speech synthesis model comprises a reference encoder, an embedded layer and a decoder;
the encoding module is used for inputting the voice sample into the reference encoder for encoding processing so as to extract a prosody feature vector and a timbre feature vector of the voice data, wherein the prosody feature vector is obtained by encoding the voice data according to the voice emotion tag, and the timbre feature vector is obtained by encoding the voice data according to the identity tag;
the superposition module is used for inputting the prosody feature vector, the tone feature vector and the text feature vector corresponding to the voice data into the embedding layer for superposition operation to obtain a target feature vector;
the decoding module is used for inputting the target feature vector into the decoder for decoding processing so as to obtain a predicted Mel frequency spectrum of the voice data;
the acquisition module is further configured to acquire a real mel frequency spectrum of the voice data;
the adjusting module is used for adjusting the model parameters of the speech synthesis model according to the predicted Mel frequency spectrum and the real Mel frequency spectrum until the speech synthesis model is converged;
the acquisition module is also used for acquiring a target voice emotion label and a target identity label of the voice to be synthesized;
the input module is used for inputting the target voice emotion label and the target identity label into the converged voice synthesis model to obtain a Mel frequency spectrum of the voice to be synthesized;
and the generating module is used for generating target voice information according to the Mel frequency spectrum of the voice to be synthesized.
In a third aspect, the present application also provides a server comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the speech synthesis method as described above.
In a fourth aspect, the present application further provides a computer-readable storage medium having a computer program stored thereon, where the computer program, when executed by a processor, implements the steps of the speech synthesis method as described above.
The application provides a speech synthesis method, apparatus, server and storage medium. The speech synthesis method comprises: obtaining a voice sample, wherein the voice sample comprises voice data of a user, a voice emotion label corresponding to the voice data and an identity label of the user; calling a speech synthesis model to be trained, wherein the speech synthesis model comprises a reference encoder, an embedded layer and a decoder; inputting the voice sample into the reference encoder for encoding so as to extract a prosody feature vector and a tone feature vector of the voice data; inputting the prosody feature vector, the tone feature vector and the text feature vector corresponding to the voice data into the embedding layer for a superposition operation to obtain a target feature vector; inputting the target feature vector into the decoder for decoding to obtain a predicted Mel frequency spectrum of the voice data; acquiring a real Mel frequency spectrum of the voice data, and adjusting model parameters of the speech synthesis model according to the predicted Mel frequency spectrum and the real Mel frequency spectrum until the speech synthesis model converges; acquiring a target voice emotion label and a target identity label of the voice to be synthesized, and inputting them into the converged speech synthesis model to obtain a Mel frequency spectrum of the voice to be synthesized; and generating target voice information according to the Mel frequency spectrum of the voice to be synthesized. By embedding the target feature vector of the voice sample, the number of training samples required in the model training process can be effectively reduced, the speech synthesis model can converge quickly, and no reference speech needs to be input during speech synthesis, which reduces data processing and thus improves the efficiency of speech synthesis.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flowchart illustrating steps of a speech synthesis method according to an embodiment of the present application;
FIG. 2 is a flow diagram illustrating sub-steps of the speech synthesis method of FIG. 1;
fig. 3 is a schematic block diagram of a speech synthesis apparatus according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of sub-modules of the speech synthesis apparatus of FIG. 3;
fig. 5 is a block diagram schematically illustrating a structure of a server according to an embodiment of the present disclosure.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation. In addition, although the division of the functional blocks is made in the device diagram, in some cases, it may be divided in blocks different from those in the device diagram.
The embodiment of the application provides a voice synthesis method, a voice synthesis device, a server and a storage medium. The speech synthesis method can be applied to a server, wherein the server stores a speech synthesis model, and the speech synthesis model comprises a reference encoder, an embedded layer and a decoder. The server may be a single server or a server cluster including a plurality of servers.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating steps of a speech synthesis method according to an embodiment of the present application.
As shown in fig. 1, the speech synthesis method includes steps S101 to S108.
Step S101, a voice sample is obtained, wherein the voice sample comprises voice data of a user, a voice emotion label corresponding to the voice data and an identity label of the user.
The server obtains voice data serving as a training sample; the voice data carries a corresponding voice emotion label and the identity label of the user. The voice data of the user, the voice emotion label corresponding to the voice data and the identity label of the user together form a voice sample, and the number of voice samples may be one or more. The voice sample is input into the speech synthesis model stored in the server, so that the voice data in the voice sample, the voice emotion label corresponding to the voice data and the identity label of the user can be used to train the speech synthesis model.
In one embodiment, obtaining a speech sample comprises: the method comprises the steps of obtaining a plurality of first voice samples and a plurality of second voice samples, wherein the first voice samples comprise first voice data and voice emotion labels corresponding to the first voice data, and the second voice samples comprise second voice data and identity labels corresponding to the second voice data; training a first preset classifier through a plurality of first voice samples to obtain a trained speech emotion classifier, and training a second preset classifier through a plurality of second voice samples to obtain a trained user identity classifier; acquiring target voice data of a user, determining a voice emotion label corresponding to the target voice data through a speech emotion classifier, and determining an identity label corresponding to the target voice data through a user identity classifier; and marking the voice emotion label and the identity label on the target voice data to obtain a voice sample.
It should be noted that after the speech emotion classifier and the user identity classifier are trained on the plurality of first voice samples and the plurality of second voice samples, the voice emotion labels and identity labels of a large amount of unlabelled voice data can be obtained through these two classifiers. A voice sample set with a large data volume can thus be obtained, which improves the efficiency of obtaining voice samples, avoids spending a large amount of manpower and time on sample labelling, and can greatly improve the training efficiency of the speech synthesis model.
Illustratively, the plurality of first voice samples cover, for example, 5 emotion labels, with more than 300 pieces of voice data for each emotion label, and the plurality of second voice samples cover, for example, 5 identity labels, with more than 400 pieces of voice data for each identity label. The classification effect of the speech emotion classifier and the user identity classifier is evaluated with a cross-entropy loss, and the trained speech emotion classifier and the trained user identity classifier are obtained when the cross-entropy loss is less than or equal to a set value. Then, reference voice data expressed with different emotions by a plurality of users is acquired, the target voice data is classified by the trained speech emotion classifier and the trained user identity classifier to obtain the voice emotion label corresponding to each piece of target voice data and the identity label of the user, and the voice emotion label and the identity label are marked on the target voice data, so that a plurality of voice samples are conveniently obtained.
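The following is a minimal sketch, not part of the original disclosure, of how the two labelling classifiers could be trained with a cross-entropy loss, assuming PyTorch and fixed-size Mel-spectrogram inputs; all class counts, shapes and module names are illustrative assumptions.

```python
# Hypothetical sketch of the two labelling classifiers (speech emotion /
# user identity); all names, shapes and hyper-parameters are illustrative.
import torch
import torch.nn as nn

class SpeechClassifier(nn.Module):
    def __init__(self, n_mels=80, n_frames=200, n_classes=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                        # (batch, n_mels * n_frames)
            nn.Linear(n_mels * n_frames, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, mel):                      # mel: (batch, n_mels, n_frames)
        return self.net(mel)                     # logits: (batch, n_classes)

def train_step(model, mel_batch, labels, optimizer, loss_fn):
    # one update; training stops when the loss drops to the set value
    optimizer.zero_grad()
    loss = loss_fn(model(mel_batch), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

emotion_clf = SpeechClassifier(n_classes=5)      # e.g. 5 emotion labels
identity_clf = SpeechClassifier(n_classes=5)     # e.g. 5 identity labels
loss_fn = nn.CrossEntropyLoss()                  # cross-entropy loss, as above
opt = torch.optim.Adam(emotion_clf.parameters(), lr=1e-3)

mel = torch.randn(8, 80, 200)                    # dummy batch of mel spectra
labels = torch.randint(0, 5, (8,))               # dummy emotion labels
print(train_step(emotion_clf, mel, labels, opt, loss_fn))
```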
And step S102, calling a speech synthesis model to be trained, wherein the speech synthesis model comprises a reference encoder, an embedded layer and a decoder.
The speech synthesis model can be stored in the server in advance. The reference encoder is used for encoding the voice data in the voice sample so as to extract a prosody feature vector and a tone feature vector of the voice data; the embedding layer is used for performing a superposition operation on the prosody feature vector and tone feature vector output by the reference encoder and the text feature vector corresponding to the voice data, to obtain a target feature vector; and the decoder is used for decoding the target feature vector to obtain a predicted Mel frequency spectrum of the voice data.
In one embodiment, the reference encoder comprises a first reference encoder and a second reference encoder, both of which may be composed of a convolutional neural network and a recurrent neural network; the embedding layer comprises a first embedding layer and a second embedding layer, each of which may be a convolutional neural network; and the decoder comprises one attention RNN layer with 256 GRU units and two layers of residual GRUs. The decoder can accurately decode the target feature vector to obtain the predicted Mel frequency spectrum of the voice data.
Step S103, inputting the voice sample into a reference encoder for encoding processing so as to extract prosody feature vectors and tone feature vectors of the voice data.
The prosody feature vector is obtained by encoding the voice data according to the voice emotion label, and the tone feature vector is obtained by encoding the voice data according to the identity label. The reference encoder is composed of, for example, a convolutional neural network and a recurrent neural network; after obtaining the voice emotion label and the identity label of the voice data, the reference encoder can extract the prosody feature vector from the voice data according to the voice emotion label and extract the tone (timbre) feature vector from the voice data according to the identity label. The prosody feature vector contains prosodic information of the user's voice data, such as pauses and stress. The tone feature vector contains timbre information of the user's voice data, and timbre differs between different users.
In an embodiment, the reference encoder comprises a first reference encoder and a second reference encoder. The first reference encoder is used for extracting prosody feature vectors from the voice data according to the voice emotion labels, and the second reference encoder is used for extracting tone feature vectors from the voice data according to the identity labels.
Illustratively, the first reference encoder and the second reference encoder are composed of convolution stacks, RNNs and attention modules. The voice data is batch-normalized and passed through 6 two-dimensional convolution stacks with ReLU as the activation function, which may use, for example, 64, 128 and 256 output channels, so that the prosody feature vector and the tone feature vector in the voice data can be extracted quickly and accurately.
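A minimal sketch of one such reference encoder, assuming PyTorch: a 6-layer stride-2 two-dimensional convolution stack with batch normalization and ReLU, followed by a GRU that produces a fixed-length feature vector. The channel progression (64/64/128/128/256/256) and all other hyper-parameters are assumptions for illustration, not values fixed by the disclosure.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    def __init__(self, n_mels=80, out_dim=128,
                 channels=(64, 64, 128, 128, 256, 256)):
        super().__init__()
        layers, in_ch = [], 1
        for ch in channels:
            layers += [
                nn.Conv2d(in_ch, ch, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(ch),
                nn.ReLU(),
            ]
            in_ch = ch
        self.convs = nn.Sequential(*layers)
        h = n_mels
        for _ in channels:              # each stride-2 conv halves (ceil) the mel axis
            h = (h - 1) // 2 + 1
        self.gru = nn.GRU(channels[-1] * h, out_dim, batch_first=True)

    def forward(self, mel):                     # mel: (batch, n_mels, n_frames)
        x = self.convs(mel.unsqueeze(1))        # (batch, C, H', T')
        x = x.permute(0, 3, 1, 2).flatten(2)    # (batch, T', C * H')
        _, h = self.gru(x)                      # h: (1, batch, out_dim)
        return h.squeeze(0)                     # fixed-length feature vector

prosody_encoder = ReferenceEncoder()            # driven by the voice emotion label
tone_encoder = ReferenceEncoder()               # driven by the identity label
vec = prosody_encoder(torch.randn(2, 80, 128))
print(vec.shape)                                # torch.Size([2, 128])
```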
And step S104, inputting the prosodic feature vectors, the tone feature vectors and the text feature vectors corresponding to the voice data into the embedding layer for superposition operation to obtain target feature vectors.
And inputting the prosodic feature vectors, the tone feature vectors and the text feature vectors corresponding to the voice data into the embedding layer, so that the embedding layer can conveniently perform superposition operation on the prosodic feature vectors, the tone feature vectors and the text feature vectors to obtain the target feature vectors. The speech synthesis model is trained through the target characteristic vector, the target characteristic vector can be embedded into the speech synthesis model, the requirement for the number of training samples of the speech synthesis model is reduced, and therefore the training efficiency of the speech synthesis model is improved.
In one embodiment, as shown in FIG. 2, the embedding layers include a first embedding layer and a second embedding layer; step S104 includes: substeps S1041 to substep S1042.
And a substep S1041 of inputting the prosody feature vector and the tone feature vector into the first embedding layer for combination to obtain a combined feature vector.
It should be noted that combining the prosodic feature vectors and the timbre feature vectors includes, for example, stitching the prosodic feature vectors and the timbre feature vectors. The splicing method comprises row splicing or column splicing, and can be flexibly applied and transposed according to actual conditions. The target feature vector obtained by superposing the combined feature vector and the text feature vector is used for training the voice synthesis model, and the target feature vector can be embedded into the voice synthesis model, so that the requirement on the number of training samples of the voice synthesis model is reduced, and the training efficiency of the voice synthesis model is improved.
Illustratively, the prosodic feature vector is a matrix vector of k × m, the timbre feature vector is a matrix vector of e × m, the matrix vectors of (k + e) × m are obtained by splicing, and the matrix vector of (k + e) × m is used as the combined feature vector. For another example, the prosodic feature vector is a matrix vector of m × k, and the timbre feature vector is a matrix vector of m × e, and a combined feature vector of m × k + e is obtained by concatenation. Illustratively, the prosodic feature vector is a matrix vector of k × n, the timbre feature vector is a matrix vector of e × n, the matrix vectors of (k + e) × n are obtained by splicing, and the matrix vector of (k + e) × n is used as the combined feature vector.
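A tiny illustration of the row and column splicing just described, assuming PyTorch; k, e and m take arbitrary example values.

```python
# Row and column splicing of the prosody and tone feature matrices,
# using the k x m / e x m shapes from the example above; values are arbitrary.
import torch

k, e, m = 4, 3, 8
prosody = torch.randn(k, m)                        # k x m prosody feature matrix
tone = torch.randn(e, m)                           # e x m tone feature matrix

row_spliced = torch.cat([prosody, tone], dim=0)    # (k + e) x m
print(row_spliced.shape)                           # torch.Size([7, 8])

prosody_t = torch.randn(m, k)                      # m x k variant
tone_t = torch.randn(m, e)                         # m x e variant
col_spliced = torch.cat([prosody_t, tone_t], dim=1)  # m x (k + e)
print(col_spliced.shape)                           # torch.Size([8, 7])
```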
In an embodiment, before obtaining the target feature vector, the method further includes: the combined feature vector and/or the text feature vector is adjusted such that the combined feature vector coincides with a matrix size of the text feature vector. And then, inputting the combined feature vectors with the consistent matrix sizes and the text feature vectors into the second embedding layer for superposition to obtain target feature vectors. It should be noted that the combined feature vector and the text feature vector acquired by the second embedding layer are a matrix vector, and the matrix size of the combined feature vector may be different from the matrix size of the text feature vector, so that direct superposition cannot be performed.
In one embodiment, adjusting the combined feature vector and/or the text feature vector comprises: determining the size of a target matrix to be adjusted; acquiring a first matrix size of the combined feature vector, and determining a first matrix position to be adjusted of the combined feature vector according to the target matrix size and the first matrix size; filling the first matrix position to be adjusted with a preset identifier; and/or acquiring a second matrix size of the text feature vector, and determining a second matrix position to be adjusted of the text feature vector according to the target matrix size and the second matrix size; and filling the second matrix position to be adjusted with the preset identifier.
Further, the target matrix size may be the matrix size of a preset output matrix of the second embedded layer, the target matrix size being greater than or equal to the first matrix size and the second matrix size, and the preset identifier may be 0. The combined feature vector and/or the text feature vector is adjusted by dimension expansion so that the matrix sizes of the combined feature vector and the text feature vector both equal the matrix size of the preset output matrix, which makes the adjusted combined feature vector and the adjusted text feature vector convenient to superimpose.
Further, the target matrix size may be flexibly set by a user, for example, a matrix size with a larger dimension is selected from the matrix sizes of the combined feature vectors or the text feature vectors as the target matrix size, if the matrix size of the text feature vector is selected as the target matrix size, a matrix position to be adjusted (a vacant position of the matrix position) of the combined feature vector is determined according to the target matrix size, the combined feature vector is adjusted, and the matrix position to be adjusted of the combined feature vector is filled by element 0, so that the matrix size of the combined feature vector is the target matrix size.
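A hypothetical sketch of the dimension-expansion (zero-filling) adjustment described above, assuming PyTorch and taking the text feature matrix's size as the target size; the shapes are illustrative.

```python
# Zero-fill (dimension expansion) of the combined feature matrix so that it
# matches the text feature matrix, then element-wise superposition.
import torch
import torch.nn.functional as F

combined = torch.randn(7, 8)      # combined (prosody + tone) feature matrix
text = torch.randn(10, 8)         # text feature matrix, chosen as the target size

pad_rows = text.shape[0] - combined.shape[0]
pad_cols = text.shape[1] - combined.shape[1]
# F.pad for a 2-D tensor takes (left, right, top, bottom)
combined_padded = F.pad(combined, (0, pad_cols, 0, pad_rows), value=0.0)

target = combined_padded + text   # sizes now match, so superposition is possible
print(combined_padded.shape, target.shape)   # torch.Size([10, 8]) twice
```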
And a substep S1042 of inputting the combined feature vector and the text feature vector of the text information corresponding to the voice data into the second embedding layer for superposition to obtain a target feature vector.
The text information corresponds to the voice data of the user statement, can be obtained through a voice recognition model or carried in a voice sample, and the feature information in the text information is extracted to obtain a text feature vector. Adjusting the combined feature vector and/or the text feature vector to enable the matrix size of the combined feature vector to be consistent with that of the text feature vector; and inputting the combined feature vectors with consistent matrix sizes and the text feature vectors into the second embedding layer for superposition to obtain target feature vectors. Because the matrix size of the adjusted combined feature vector is consistent with that of the text feature vector, the second embedding layer can quickly superpose the combined feature vector and the text feature vector corresponding to the voice data to obtain the target feature vector.
In one embodiment, the target matrix size is consistent with the matrix size of the preset output matrix of the second embedded layer, and the target feature vector can be output directly. In some embodiments, if the target matrix size is not consistent with the matrix size of the preset output matrix of the second embedded layer, the target feature vector needs to be adjusted so that its size is consistent with the matrix size of the preset output matrix of the second embedded layer, yielding an updated target feature vector, which is then output.
It should be noted that the superposition of the adjusted combined feature vector and the adjusted text feature vector can be understood as embedding the combined feature vector and the text feature vector into the speech synthesis model. The combined feature vector is generated randomly at first, but the combined feature vector and the target feature vector become more and more accurate as model training progresses, so the number of training samples and the model construction period required during model training can be effectively reduced.
In one embodiment, the prosodic feature vectors, the tone feature vectors and the text feature vectors corresponding to the voice data are input into the embedding layer for superposition operation to adjust the prosodic feature vectors, the tone feature vectors and/or the text feature vectors so that the matrix sizes of the prosodic feature vectors, the tone feature vectors and the text feature vectors are consistent; and superposing the adjusted prosody feature vector, tone feature vector and text feature vector to obtain a target feature vector. It should be noted that the prosody feature vectors, the tone feature vectors and/or the text feature vectors are adjusted to make the matrix sizes of the prosody feature vectors, the tone feature vectors and/or the text feature vectors consistent, so that the embedding layer superimposes the combined feature vectors and the text feature vectors corresponding to the voice data to obtain the target feature vectors quickly. For the specific implementation process, reference may be made to the foregoing embodiments, which are not described in detail herein.
Step S105, inputting the target feature vector into a decoder for decoding processing, so as to obtain a predicted mel spectrum of the voice data.
The decoder comprises, for example, one attention RNN layer with 256 GRU units and two layers of residual GRUs. The decoder can accurately decode the target feature vector to obtain the predicted Mel frequency spectrum of the voice data.
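A purely structural sketch of such a decoder, assuming PyTorch: a 256-unit attention RNN followed by two residual GRU layers and a linear projection to Mel frames. The attention mechanism itself is simplified away here, so this illustrates only the layer arrangement, not a full Tacotron-style decoder.

```python
# Structural sketch only: 256-unit "attention RNN" GRU plus two residual GRU
# layers and a projection to mel frames; the attention mechanism is omitted.
import torch
import torch.nn as nn

class DecoderSketch(nn.Module):
    def __init__(self, in_dim=128, hidden=256, n_mels=80):
        super().__init__()
        self.attn_rnn = nn.GRU(in_dim, hidden, batch_first=True)
        self.res_rnn1 = nn.GRU(hidden, hidden, batch_first=True)
        self.res_rnn2 = nn.GRU(hidden, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)

    def forward(self, target_vec, n_frames=100):
        # repeat the target feature vector over time as a stand-in for
        # attention-weighted context (a simplification)
        x = target_vec.unsqueeze(1).repeat(1, n_frames, 1)
        x, _ = self.attn_rnn(x)
        r1, _ = self.res_rnn1(x)
        x = x + r1                    # first residual connection
        r2, _ = self.res_rnn2(x)
        x = x + r2                    # second residual connection
        return self.to_mel(x)         # predicted mel frames: (batch, n_frames, n_mels)

decoder = DecoderSketch()
mel_pred = decoder(torch.randn(2, 128))
print(mel_pred.shape)                 # torch.Size([2, 100, 80])
```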
And S106, acquiring a real Mel frequency spectrum of the voice data, and adjusting model parameters of the voice synthesis model according to the predicted Mel frequency spectrum and the real Mel frequency spectrum until the voice synthesis model is converged.
The predicted Mel frequency spectrum is obtained by the prediction of the model during training, while the real Mel frequency spectrum is obtained by applying a Mel filtering transformation to the spectrogram of the voice data: for example, the voice data is Fourier-transformed to obtain its spectrogram, and the spectrogram is passed through a Mel-scale filter bank to obtain the real Mel frequency spectrum. The speech synthesis model is, for example, a Tacotron model; its model parameters are adjusted according to the predicted Mel frequency spectrum and the real Mel frequency spectrum until the speech synthesis model converges. Because few training samples are required during model training, the speech synthesis model can be trained quickly, and the converged speech synthesis model can perform speech synthesis quickly and accurately.
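A short sketch of how a real Mel frequency spectrum can be computed from a waveform, assuming librosa; the sampling rate, FFT size, hop length and number of Mel bands are illustrative assumptions.

```python
# Real mel spectrum from a waveform: STFT followed by a mel-scale filter bank.
import numpy as np
import librosa

sr = 22050
waveform = np.random.randn(sr * 2).astype(np.float32)   # stand-in for 2 s of speech

mel = librosa.feature.melspectrogram(
    y=waveform, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = librosa.power_to_db(mel)    # common log compression of the mel power spectrum
print(mel.shape)                      # (80, n_frames)
```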
In one embodiment, a model loss value of the speech synthesis model is calculated according to the Mel frequency spectrum and the real Mel frequency spectrum; updating model parameters of the voice synthesis model based on the model loss value, and performing iterative training on the voice synthesis model with the updated model parameters according to a plurality of voice samples; and when the speech synthesis model for updating the model parameters is determined to be in a convergence state, obtaining the trained speech synthesis model.
It is determined whether the number of iterations of the speech synthesis model has reached a preset number of iterations; if so, the speech synthesis model is determined to be in a convergence state, and if not, the speech synthesis model is determined not to be in a convergence state. Alternatively, it is determined whether the training time of the speech synthesis model is greater than or equal to a preset training time; if so, the speech synthesis model is determined to be in a convergence state, and if the training time is less than the preset training time, the speech synthesis model is determined not to be in a convergence state. The preset training time and the preset number of iterations can be set flexibly by the user and are not specifically limited in the embodiments of the present application.
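A minimal training-loop sketch combining the loss-based parameter update and the two convergence criteria (preset number of iterations or preset training time), assuming PyTorch; the model here is a dummy stand-in, and the mean-squared-error loss is an assumption, since the disclosure only states that a model loss value is computed.

```python
# Dummy training loop: spectral loss, parameter update, and the two
# convergence checks (iteration count or elapsed time) described above.
import time
import torch
import torch.nn as nn

model = nn.Linear(128, 80)                     # stand-in for the synthesis model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                         # assumed form of the model loss
MAX_ITERS, MAX_SECONDS = 1000, 3600.0          # preset iteration count / training time
start = time.time()

for step in range(10**6):
    target_vec = torch.randn(4, 128)           # dummy target feature vectors
    real_mel = torch.randn(4, 80)              # dummy real mel frames
    loss = loss_fn(model(target_vec), real_mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step >= MAX_ITERS or time.time() - start >= MAX_SECONDS:
        break                                  # convergence criterion reached
print(step, loss.item())
```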
Further, if the speech synthesis model is determined not to be in the convergence state, the speech synthesis model continues to be trained according to the speech sample until the updated speech synthesis model converges.
In an embodiment, after the speech synthesis model converges, the association information between the speech emotion tag and the prosody feature vector is recorded, the association information between the identity tag and the tone feature vector is recorded, and the association information between the speech emotion tag, the identity tag and the combined feature vector is recorded, so that the speech synthesis model can be called and inferred quickly in the subsequent application process, and the speech synthesis efficiency is improved.
By embedding the target characteristic vectors of the voice samples, the number of training samples and the time period for constructing the model required in the model training process can be effectively reduced, and therefore the training efficiency of the voice synthesis model is improved.
For example, if a speech synthesis model involves a speech texts, b character roles, c emotions and d scenes, the required training data set contains a × b × c × d customized recordings, which entails a huge workload, a large amount of capital and time, and results in a long training period and low efficiency for the speech synthesis model. By applying the technical solution of the embodiments of the present application, for the same a speech texts, b character roles and c emotions, the required training data set contains (a + b + c) × d customized recordings: the embedding layers take the place of simple physical labels, so a recordings suffice to train the basic speech model, b recordings are needed to train the embedding layer for the character label, and c recordings are needed to train the embedding layer for the speech emotion label. The data requirement for training samples and the model construction period are therefore greatly reduced, and the training efficiency of the speech synthesis model is improved.
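A worked numerical comparison of the two data requirements above, with illustrative values for a, b, c and d.

```python
# Illustrative numbers: a speech texts, b character roles, c emotions, d scenes.
a, b, c, d = 100, 5, 6, 3
without_embedding = a * b * c * d     # one recording per full combination
with_embedding = (a + b + c) * d      # embedding layers replace physical labels
print(without_embedding, with_embedding)   # 9000 vs 333
```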
And S107, acquiring a target voice emotion label of the voice to be synthesized and an identity label of a target user, and inputting the target voice emotion label and the target identity label into the converged voice synthesis model to obtain a Mel frequency spectrum of the voice to be synthesized.
The target voice emotion label and the identity label of the target user can be specified by the user. The target voice emotion label includes, for example, anger, sadness, fear, happiness, surprise or disgust, and the identity label includes an identifier that can represent the user, such as a name or an ID number.
It should be noted that the information fed into the speech synthesis model includes the target voice emotion label and the identity label of the target user but need not include reference speech for the voice to be synthesized, so fast inference can be performed when the speech synthesis model is applied, which improves speech synthesis efficiency.
In one embodiment, a target voice emotion tag and a target identity tag are input into a reference encoder to be processed so as to extract a target prosody feature vector corresponding to the target voice emotion tag and a target tone feature vector corresponding to the target identity tag; inputting the target prosody feature vector, the target tone feature vector and the target text feature vector corresponding to the voice to be synthesized into the embedding layer for superposition operation to obtain a candidate feature vector; and inputting the candidate characteristic vectors into a decoder for decoding to obtain the Mel frequency spectrum of the voice to be synthesized.
The target prosody feature vectors are extracted according to the target voice emotion labels, and the target tone feature vectors are extracted according to the target identity labels. The prosodic feature vectors and the tone feature vectors are embedded in the voice synthesis model, and the reference encoder can determine corresponding target prosodic feature vectors according to the voice emotion labels and determine corresponding target tone feature vectors according to the identity labels. Different from the prior art, the method and the device do not need to input reference voice, do not need to extract prosody feature vectors and tone feature vectors from the reference voice, and can improve the voice synthesis efficiency.
In an embodiment, the reference encoder comprises a first reference encoder and a second reference encoder. The first reference encoder is used for determining the associated target prosodic feature vector according to the speech emotion label, and the second reference encoder is used for determining the associated target timbre feature vector according to the identity label.
Illustratively, the user selects "anger" as the target speech emotion tag and "Xiaoliu" as the target user's identity tag, and inputs the target speech emotion tag and the target user's identity tag into the speech synthesis model, such that a reference encoder in the speech synthesis model determines the prosodic feature vector associated with "anger" and determines the timbre feature vector associated with "Xiaoliu".
Inputting the target prosody feature vector, the target tone feature vector and the target text feature vector corresponding to the voice to be synthesized into the embedding layer for the superposition operation comprises: splicing the target prosody feature vector and the target tone feature vector to obtain a spliced feature vector, and inputting the spliced feature vector and the target text feature vector corresponding to the voice to be synthesized into the embedding layer for superposition to obtain the candidate feature vector. The splicing can be row splicing or column splicing and can be applied and transposed flexibly according to the actual situation.
Specifically, the text feature vector of the text information corresponding to the voice data is obtained, and the size of the target matrix to be adjusted is determined; the spliced feature vector and/or the text feature vector is adjusted according to the target matrix size so that their matrix sizes are consistent; and the adjusted spliced feature vector and text feature vector are superposed to obtain the candidate feature vector. For the specific implementation process, reference may be made to the foregoing embodiments, and details are not repeated here.
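A hypothetical inference-time sketch, assuming PyTorch: the association information recorded after convergence is represented as simple label-to-vector lookup tables, so synthesis needs only the two labels and a text feature vector, with no reference speech. All names, shapes and the stand-in for the embedding layer and decoder are assumptions.

```python
# Hypothetical inference path: labels are mapped to the feature vectors that
# were associated with them after convergence; no reference speech is needed.
import torch

# association information recorded after convergence (illustrative values)
prosody_by_emotion = {"anger": torch.randn(128), "happiness": torch.randn(128)}
tone_by_identity = {"Xiaoliu": torch.randn(128), "user_007": torch.randn(128)}

def synthesize_mel(emotion_label, identity_label, text_vec, embed_and_decode):
    prosody = prosody_by_emotion[emotion_label]      # target prosody feature vector
    tone = tone_by_identity[identity_label]          # target tone feature vector
    spliced = torch.cat([prosody, tone], dim=0)      # row splicing
    return embed_and_decode(spliced, text_vec)       # embedding layer + decoder

# trivial stand-in for the embedding layer and decoder
dummy_embed_and_decode = lambda spliced, text_vec: torch.randn(80, 200)
mel = synthesize_mel("anger", "Xiaoliu", torch.randn(256), dummy_embed_and_decode)
print(mel.shape)                                     # torch.Size([80, 200])
```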
And step S108, generating target voice information according to the Mel frequency spectrum of the voice to be synthesized.
The Mel frequency spectrum of the voice to be synthesized is input into a vocoder, such as WaveNet or WaveRNN, to obtain the target voice information; the Mel frequency spectrum of the voice to be synthesized is thereby converted into a playable wav file, completing the speech synthesis process. The target voice information corresponds to the target voice emotion label and the identity label of the target user and can express the target emotion in the target user's voice.
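A final-step sketch, assuming NumPy and SciPy: a placeholder vocoder function stands in for WaveNet/WaveRNN (whose real APIs are not shown here), and the resulting waveform is written out as a playable wav file.

```python
# A placeholder vocoder stands in for WaveNet/WaveRNN; the mel spectrum is
# converted to a waveform and written out as a playable wav file.
import numpy as np
from scipy.io import wavfile

def vocoder(mel, hop_length=256):          # placeholder for WaveNet / WaveRNN
    n_frames = mel.shape[1]
    return np.random.uniform(-0.1, 0.1, n_frames * hop_length).astype(np.float32)

mel = np.random.rand(80, 200).astype(np.float32)    # mel spectrum of the speech
audio = vocoder(mel)
wavfile.write("target_voice.wav", 22050, audio)     # playable target voice file
```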
In the speech synthesis method provided by the above embodiment, a voice sample is obtained, where the voice sample includes voice data of the user, a voice emotion label corresponding to the voice data, and an identity label of the user; a speech synthesis model to be trained is called, where the speech synthesis model includes a reference encoder, an embedded layer and a decoder; the voice sample is input into the reference encoder for encoding so as to extract a prosody feature vector and a tone feature vector of the voice data; the prosody feature vector, the tone feature vector and the text feature vector corresponding to the voice data are input into the embedding layer for a superposition operation to obtain a target feature vector; the target feature vector is input into the decoder for decoding to obtain a predicted Mel frequency spectrum of the voice data; a real Mel frequency spectrum of the voice data is acquired, and model parameters of the speech synthesis model are adjusted according to the predicted Mel frequency spectrum and the real Mel frequency spectrum until the speech synthesis model converges; a target voice emotion label and a target identity label of the voice to be synthesized are acquired and input into the converged speech synthesis model to obtain a Mel frequency spectrum of the voice to be synthesized; and target voice information is generated according to the Mel frequency spectrum of the voice to be synthesized. By embedding the target feature vector of the voice sample, the number of training samples required in the model training process can be effectively reduced and the speech synthesis model can converge quickly; since no reference speech needs to be input during speech synthesis, inference can be performed quickly when the model is applied, which effectively improves speech synthesis efficiency.
Referring to fig. 3, fig. 3 is a schematic block diagram of a speech synthesis apparatus according to an embodiment of the present application.
As shown in fig. 3, the speech synthesis apparatus 200 includes: an obtaining module 201, a calling module 202, an encoding module 203, a superposition module 204, a decoding module 205, an adjusting module 206, an input module 207, and a generating module 208, wherein:
an obtaining module 201, configured to obtain a voice sample, where the voice sample includes voice data of a user, a voice emotion tag corresponding to the voice data, and an identity tag of the user;
a calling module 202, configured to call a speech synthesis model to be trained, where the speech synthesis model includes a reference encoder, an embedded layer, and a decoder;
the encoding module 203 is configured to input the voice sample into the reference encoder for encoding processing, so as to extract a prosody feature vector and a timbre feature vector of the voice data, where the prosody feature vector is obtained by encoding the voice data according to the voice emotion tag, and the timbre feature vector is obtained by encoding the voice data according to the identity tag;
a superposition module 204, configured to input the prosodic feature vector, the tone feature vector, and the text feature vector corresponding to the voice data into the embedding layer for superposition operation, so as to obtain a target feature vector;
a decoding module 205, configured to input the target feature vector into the decoder for decoding processing, so as to obtain a predicted mel spectrum of the speech data;
the obtaining module 201 is further configured to obtain a real mel spectrum of the voice data;
an adjusting module 206, configured to adjust model parameters of the speech synthesis model according to the predicted mel spectrum and the real mel spectrum until the speech synthesis model converges;
the obtaining module 201 is further configured to obtain a target voice emotion tag and a target identity tag of a voice to be synthesized;
an input module 207, configured to input the target speech emotion tag and the target identity tag into the converged speech synthesis model, so as to obtain a mel frequency spectrum of the speech to be synthesized;
and the generating module 208 is configured to generate the target speech information according to the mel spectrum of the speech to be synthesized.
In one embodiment, the embedding layer includes a first embedding layer and a second embedding layer. As shown in fig. 4, the overlay module 204 includes:
the combining submodule 2041 is configured to input the prosodic feature vector and the timbre feature vector into the first embedding layer for combining, so as to obtain a combined feature vector.
The superposition submodule 2042 is configured to input the combined feature vector and the text feature vector of the text information corresponding to the voice data into the second embedding layer to be superposed, so as to obtain a target feature vector.
In one embodiment, the overlay module 204 is further configured to:
adjusting the combined feature vector and/or the text feature vector so that the combined feature vector is consistent with the matrix size of the text feature vector;
the inputting the combined feature vector and the text feature vector into the second embedding layer for superposition to obtain a target feature vector includes:
and inputting the combined feature vectors with consistent matrix sizes and the text feature vectors into the second embedding layer for superposition to obtain target feature vectors.
In one embodiment, the overlay module 204 is further configured to:
determining the size of a target matrix to be adjusted;
acquiring a first matrix size of the combined feature vector, and determining a first matrix position to be adjusted of the combined feature vector according to the target matrix size and the first matrix size;
filling the first matrix position to be adjusted through a preset identifier; and/or
Acquiring a second matrix size of the text feature vector, and determining a second matrix position to be adjusted of the text feature vector according to the target matrix size and the second matrix size;
and filling the second matrix position to be adjusted through a preset identifier.
In one embodiment, the adjustment module 206 is further configured to:
calculating a model loss value of the speech synthesis model according to the Mel frequency spectrum and the real Mel frequency spectrum;
updating model parameters of the voice synthesis model based on the model loss value, and performing iterative training on the voice synthesis model with updated model parameters according to a plurality of voice samples;
and when the speech synthesis model with updated model parameters is determined to be in a convergence state, obtaining the trained speech synthesis model.
In one embodiment, the obtaining module 201 is further configured to:
the method comprises the steps of obtaining a plurality of first voice samples and a plurality of second voice samples, wherein the first voice samples comprise first voice data and voice emotion labels corresponding to the first voice data, and the second voice samples comprise second voice data and identity labels corresponding to the second voice data;
training a first preset classifier through the plurality of first voice samples to obtain a trained speech emotion classifier, and training a second preset classifier through the plurality of second voice samples to obtain a trained user identity classifier;
acquiring target voice data of a user, determining a voice emotion label corresponding to the target voice data through the speech emotion classifier, and determining an identity label corresponding to the target voice data through the user identity classifier;
and marking the voice emotion label and the identity label on the target voice data to obtain the voice sample.
In one embodiment, the input module 207 is further configured to:
inputting the target voice emotion label and the target identity label into the reference encoder for processing to obtain a target prosody feature vector corresponding to the target voice emotion label and a target tone feature vector corresponding to the target identity label;
inputting the target prosody feature vector, the target tone feature vector and a target text feature vector corresponding to the voice to be synthesized into the embedding layer for superposition operation to obtain a candidate feature vector;
and inputting the candidate characteristic vectors into the decoder for decoding to obtain the Mel frequency spectrum of the voice to be synthesized.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the apparatus and the modules and units described above may refer to the corresponding processes in the foregoing speech synthesis method embodiment, and are not described herein again.
The apparatus provided by the above embodiments may be implemented in the form of a computer program, which may run on a server as shown in fig. 5.
Referring to fig. 5, fig. 5 is a schematic block diagram of a server according to an embodiment of the present disclosure.
As shown in fig. 5, the server includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform any of the speech synthesis methods.
The processor is used for providing calculation and control capacity and supporting the operation of the whole server.
The internal memory provides an environment for the execution of a computer program on a non-volatile storage medium, which when executed by a processor, causes the processor to perform any of the speech synthesis methods.
The network interface is used for network communication, such as sending assigned tasks and the like. Those skilled in the art will appreciate that the architecture shown in fig. 5 is a block diagram of only a portion of the architecture associated with the subject application, and does not constitute a limitation on the servers to which the subject application applies, as a particular server may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
It should be understood that the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
acquiring a voice sample, wherein the voice sample comprises voice data of a user, a voice emotion label corresponding to the voice data and an identity label of the user;
calling a speech synthesis model to be trained, wherein the speech synthesis model comprises a reference encoder, an embedded layer and a decoder;
inputting the voice sample into the reference encoder for encoding processing so as to extract prosody feature vectors and timbre feature vectors of the voice data, wherein the prosody feature vectors are obtained by encoding the voice data according to the voice emotion tags, and the timbre feature vectors are obtained by encoding the voice data according to the identity tags;
inputting the prosodic feature vectors, the tone feature vectors and the text feature vectors corresponding to the voice data into the embedding layer for superposition operation to obtain target feature vectors;
inputting the target feature vector into the decoder for decoding processing to obtain a predicted Mel frequency spectrum of the voice data;
acquiring a real Mel frequency spectrum of the voice data, and adjusting model parameters of the voice synthesis model according to the predicted Mel frequency spectrum and the real Mel frequency spectrum until the voice synthesis model is converged;
acquiring a target voice emotion label and a target identity label of voice to be synthesized, and inputting the target voice emotion label and the target identity label into the converged voice synthesis model to obtain a Mel frequency spectrum of the voice to be synthesized;
and generating target voice information according to the Mel frequency spectrum of the voice to be synthesized.
In one embodiment, the embedding layers include a first embedding layer and a second embedding layer; when the processor inputs the prosodic feature vectors, the tone feature vectors and the text feature vectors corresponding to the voice data into the embedding layer for superposition operation to obtain target feature vectors, the processor is used for realizing:
inputting the prosodic feature vectors and the tone feature vectors into the first embedding layer for combination to obtain combined feature vectors;
and inputting the combined feature vector and the text feature vector of the text information corresponding to the voice data into the second embedding layer for superposition to obtain a target feature vector.
In one embodiment, before implementing the inputting of the combined feature vector and the text feature vector into the second embedding layer for superposition to obtain a target feature vector, the processor is further configured to implement:
adjusting the combined feature vector and/or the text feature vector so that the combined feature vector is consistent with the matrix size of the text feature vector;
the inputting the combined feature vector and the text feature vector into the second embedding layer for superposition to obtain a target feature vector includes:
and inputting the combined feature vectors with consistent matrix sizes and the text feature vectors into the second embedding layer for superposition to obtain target feature vectors.
In one embodiment, the processor, when implementing the adjusting the combined feature vector and/or the text feature vector, is configured to implement:
determining the size of a target matrix to be adjusted;
acquiring a first matrix size of the combined feature vector, and determining a first matrix position to be adjusted of the combined feature vector according to the target matrix size and the first matrix size;
filling the first matrix position to be adjusted through a preset identifier; and/or
acquiring a second matrix size of the text feature vector, and determining a second matrix position to be adjusted of the text feature vector according to the target matrix size and the second matrix size;
and filling the second matrix position to be adjusted through a preset identifier.
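A minimal sketch of this padding step is given below, assuming the "preset identifier" is a constant PAD value and that the matrices differ only in their number of rows; neither assumption is fixed by the embodiment.

import torch
import torch.nn.functional as F

PAD = 0.0  # assumed preset identifier used to fill the positions to be adjusted

def pad_to(matrix, target_rows):
    # Fill a (rows, dim) feature matrix with PAD rows until it has target_rows rows.
    missing = target_rows - matrix.size(0)
    if missing <= 0:
        return matrix
    # F.pad pads the last dimension first: (left, right, top, bottom)
    return F.pad(matrix, (0, 0, 0, missing), value=PAD)

combined = torch.randn(30, 128)    # combined feature vector (prosody + tone)
text = torch.randn(42, 128)        # text feature vector of the same utterance

target_rows = max(combined.size(0), text.size(0))   # target matrix size to adjust to
combined = pad_to(combined, target_rows)
text = pad_to(text, target_rows)
superposed = combined + text       # matrix sizes now agree, so superposition is defined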
In one embodiment, the processor, when implementing the adjusting model parameters of the speech synthesis model according to the predicted Mel frequency spectrum and the real Mel frequency spectrum until the speech synthesis model converges, is configured to implement:
calculating a model loss value of the speech synthesis model according to the predicted Mel frequency spectrum and the real Mel frequency spectrum;
updating model parameters of the speech synthesis model based on the model loss value, and performing iterative training on the speech synthesis model with the updated model parameters according to a plurality of voice samples;
and when the speech synthesis model with the updated model parameters is determined to be in a convergence state, obtaining the trained speech synthesis model.
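The training procedure can be sketched as follows, reusing the model sketched earlier and assuming a mean-squared-error loss between the predicted and real Mel frequency spectra together with the Adam optimizer; the embodiment does not name a particular loss function, optimizer, convergence criterion or data layout, so all of these are placeholders.

import torch
import torch.nn.functional as F

def train(model, samples, epochs=10, lr=1e-3, tol=1e-4):
    # samples: iterable of (mel, emotion_id, speaker_id, text_features, real_mel)
    # tuples; this data layout is an assumption of the sketch.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    previous = float("inf")
    for _ in range(epochs):
        total = 0.0
        for mel, emotion_id, speaker_id, text_features, real_mel in samples:
            predicted_mel = model(emotion_id, speaker_id, text_features, mel=mel)
            loss = F.mse_loss(predicted_mel, real_mel)   # model loss value
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                             # update the model parameters
            total += loss.item()
        if abs(previous - total) < tol:                  # crude convergence check
            break
        previous = total
    return model                                         # trained speech synthesis model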
In one embodiment, the processor, when implementing the obtaining of the speech sample, is configured to implement:
acquiring a plurality of first voice samples and a plurality of second voice samples, wherein each first voice sample comprises first voice data and a voice emotion label corresponding to the first voice data, and each second voice sample comprises second voice data and an identity label corresponding to the second voice data;
training a first preset classifier through the plurality of first voice samples to obtain a trained speech emotion classifier, and training a second preset classifier through the plurality of second voice samples to obtain a trained user identity classifier;
acquiring target voice data of a user, determining a voice emotion label corresponding to the target voice data through the speech emotion classifier, and determining an identity label corresponding to the target voice data through the user identity classifier;
and marking the voice emotion label and the identity label on the target voice data to obtain the voice sample.
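One possible realization of this labelling step is sketched below. The use of scikit-learn SVM classifiers over precomputed acoustic feature vectors is an assumption of the sketch; the embodiment only requires one trained speech emotion classifier and one trained user identity classifier, of unspecified type.

import numpy as np
from sklearn.svm import SVC

def build_labelled_sample(first_samples, second_samples, target_features):
    # first_samples: (acoustic_features, emotion_label) pairs
    # second_samples: (acoustic_features, identity_label) pairs
    # target_features: acoustic features of the target voice data to be labelled
    X1, y1 = zip(*first_samples)
    emotion_clf = SVC().fit(np.stack(X1), y1)        # trained speech emotion classifier

    X2, y2 = zip(*second_samples)
    identity_clf = SVC().fit(np.stack(X2), y2)       # trained user identity classifier

    emotion = emotion_clf.predict(target_features[None, :])[0]
    identity = identity_clf.predict(target_features[None, :])[0]
    # mark the voice emotion label and the identity label on the target voice data
    return {"features": target_features, "emotion": emotion, "identity": identity}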
In one embodiment, the processor, when implementing the inputting of the target voice emotion label and the target identity label into the converged speech synthesis model to obtain the Mel frequency spectrum of the voice to be synthesized, is configured to implement:
inputting the target voice emotion label and the target identity label into the reference encoder for processing to obtain a target prosody feature vector corresponding to the target voice emotion label and a target tone feature vector corresponding to the target identity label;
inputting the target prosody feature vector, the target tone feature vector and a target text feature vector corresponding to the voice to be synthesized into the embedding layer for superposition operation to obtain a candidate feature vector;
and inputting the candidate characteristic vectors into the decoder for decoding to obtain the Mel frequency spectrum of the voice to be synthesized.
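An inference sketch matching this embodiment is shown below, reusing the SpeechSynthesisModel sketched earlier. Only the target labels and the text features of the voice to be synthesized are supplied; librosa's Griffin-Lim based mel_to_audio is used as a stand-in for the final waveform generation, which the embodiment leaves open (a neural vocoder would serve equally well).

import torch
import librosa

@torch.no_grad()
def synthesize(model, target_emotion_id, target_identity_id, text_features, sr=22050):
    model.eval()
    # Only the target labels and the text features are needed after convergence.
    mel = model(target_emotion_id, target_identity_id, text_features)  # (1, frames, n_mels)
    mel = mel.squeeze(0).T.clamp(min=1e-5).numpy()                     # (n_mels, frames), non-negative
    return librosa.feature.inverse.mel_to_audio(mel, sr=sr)            # target voice waveform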
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working process of the server described above, reference may be made to the corresponding process in the foregoing embodiments of the speech synthesis method, which is not repeated here.
Embodiments of the present application further provide a computer-readable storage medium storing a computer program. The computer program includes program instructions, and for the method implemented when the program instructions are executed, reference may be made to the embodiments of the speech synthesis method of the present application.
The computer-readable storage medium may be an internal storage unit of the server described in the foregoing embodiments, for example, a hard disk or a memory of the server. The computer-readable storage medium may also be an external storage device of the server, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the server.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments. While the present application has been described with reference to specific embodiments, its protection scope is not limited thereto, and any equivalent modification or substitution that can readily be conceived by those skilled in the art within the technical scope disclosed herein shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of speech synthesis, comprising:
acquiring a voice sample, wherein the voice sample comprises voice data of a user, a voice emotion label corresponding to the voice data and an identity label of the user;
calling a speech synthesis model to be trained, wherein the speech synthesis model comprises a reference encoder, an embedding layer and a decoder;
inputting the voice sample into the reference encoder for encoding processing so as to extract prosody feature vectors and tone feature vectors of the voice data, wherein the prosody feature vectors are obtained by encoding the voice data according to the voice emotion label, and the tone feature vectors are obtained by encoding the voice data according to the identity label;
inputting the prosody feature vectors, the tone feature vectors and the text feature vectors corresponding to the voice data into the embedding layer for superposition operation to obtain target feature vectors;
inputting the target feature vector into the decoder for decoding processing to obtain a predicted Mel frequency spectrum of the voice data;
acquiring a real Mel frequency spectrum of the voice data, and adjusting model parameters of the speech synthesis model according to the predicted Mel frequency spectrum and the real Mel frequency spectrum until the speech synthesis model converges;
acquiring a target voice emotion label and a target identity label of the voice to be synthesized, and inputting the target voice emotion label and the target identity label into the converged speech synthesis model to obtain a Mel frequency spectrum of the voice to be synthesized;
and generating target voice information according to the Mel frequency spectrum of the voice to be synthesized.
2. The speech synthesis method of claim 1, wherein the embedding layer comprises a first embedding layer and a second embedding layer; the inputting the prosody feature vectors, the tone feature vectors and the text feature vectors corresponding to the voice data into the embedding layer for superposition operation to obtain target feature vectors comprises:
inputting the prosody feature vectors and the tone feature vectors into the first embedding layer for combination to obtain combined feature vectors;
and inputting the combined feature vector and the text feature vector of the text information corresponding to the voice data into the second embedding layer for superposition to obtain a target feature vector.
3. The speech synthesis method of claim 2, wherein before inputting the combined feature vector and the text feature vector into the second embedding layer for superposition to obtain a target feature vector, the method further comprises:
adjusting the combined feature vector and/or the text feature vector so that the matrix size of the combined feature vector is consistent with the matrix size of the text feature vector;
the inputting the combined feature vector and the text feature vector into the second embedding layer for superposition to obtain a target feature vector includes:
and inputting the combined feature vectors with consistent matrix sizes and the text feature vectors into the second embedding layer for superposition to obtain target feature vectors.
4. A speech synthesis method according to claim 3, wherein said adjusting said combined feature vector and/or said text feature vector comprises:
determining the size of a target matrix to be adjusted;
acquiring a first matrix size of the combined feature vector, and determining a first matrix position to be adjusted of the combined feature vector according to the target matrix size and the first matrix size;
filling the first matrix position to be adjusted through a preset identifier; and/or
acquiring a second matrix size of the text feature vector, and determining a second matrix position to be adjusted of the text feature vector according to the target matrix size and the second matrix size;
and filling the second matrix position to be adjusted through a preset identifier.
5. The speech synthesis method of claim 1, wherein the adjusting model parameters of the speech synthesis model according to the predicted Mel frequency spectrum and the real Mel frequency spectrum until the speech synthesis model converges comprises:
calculating a model loss value of the speech synthesis model according to the predicted Mel frequency spectrum and the real Mel frequency spectrum;
updating model parameters of the speech synthesis model based on the model loss value, and performing iterative training on the speech synthesis model with the updated model parameters according to a plurality of voice samples;
and when the speech synthesis model with the updated model parameters is determined to be in a convergence state, obtaining the trained speech synthesis model.
6. The speech synthesis method of any one of claims 1-5, wherein the obtaining a speech sample comprises:
obtaining a plurality of first voice samples and a plurality of second voice samples, wherein each first voice sample comprises first voice data and a voice emotion label corresponding to the first voice data, and each second voice sample comprises second voice data and an identity label corresponding to the second voice data;
training a first preset classifier through the plurality of first voice samples to obtain a trained speech emotion classifier, and training a second preset classifier through the plurality of second voice samples to obtain a trained user identity classifier;
acquiring target voice data of a user, determining a voice emotion label corresponding to the target voice data through the speech emotion classifier, and determining an identity label corresponding to the target voice data through the user identity classifier;
and marking the voice emotion label and the identity label on the target voice data to obtain the voice sample.
7. The speech synthesis method of any one of claims 1-5, wherein the inputting the target voice emotion label and the target identity label into the converged speech synthesis model to obtain the Mel frequency spectrum of the voice to be synthesized comprises:
inputting the target voice emotion label and the target identity label into the reference encoder for processing to obtain a target prosody feature vector corresponding to the target voice emotion label and a target tone feature vector corresponding to the target identity label;
inputting the target prosody feature vector, the target tone feature vector and a target text feature vector corresponding to the voice to be synthesized into the embedding layer for superposition operation to obtain a candidate feature vector;
and inputting the candidate characteristic vectors into the decoder for decoding to obtain the Mel frequency spectrum of the voice to be synthesized.
8. A speech synthesis apparatus, characterized in that the speech synthesis apparatus comprises:
the acquisition module is used for acquiring a voice sample, wherein the voice sample comprises voice data of a user, a voice emotion label corresponding to the voice data and an identity label of the user;
the calling module is used for calling a speech synthesis model to be trained, wherein the speech synthesis model comprises a reference encoder, an embedding layer and a decoder;
the encoding module is used for inputting the voice sample into the reference encoder for encoding processing so as to extract a prosody feature vector and a tone feature vector of the voice data, wherein the prosody feature vector is obtained by encoding the voice data according to the voice emotion label, and the tone feature vector is obtained by encoding the voice data according to the identity label;
the superposition module is used for inputting the prosody feature vector, the tone feature vector and the text feature vector corresponding to the voice data into the embedding layer for superposition operation to obtain a target feature vector;
the decoding module is used for inputting the target feature vector into the decoder for decoding processing so as to obtain a predicted Mel frequency spectrum of the voice data;
the acquisition module is further configured to acquire a real mel frequency spectrum of the voice data;
the adjusting module is used for adjusting the model parameters of the speech synthesis model according to the predicted Mel frequency spectrum and the real Mel frequency spectrum until the speech synthesis model is converged;
the acquisition module is also used for acquiring a target voice emotion label and a target identity label of the voice to be synthesized;
the input module is used for inputting the target voice emotion label and the target identity label into the converged speech synthesis model to obtain a Mel frequency spectrum of the voice to be synthesized;
and the generating module is used for generating target voice information according to the Mel frequency spectrum of the voice to be synthesized.
9. A server, characterized in that the server comprises a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the speech synthesis method according to any one of claims 1 to 7.
10. A computer-readable storage medium, having stored thereon a computer program, wherein the computer program, when being executed by a processor, carries out the steps of the speech synthesis method according to any one of claims 1 to 7.
CN202110342399.1A 2021-03-30 2021-03-30 Speech synthesis method, device, server and storage medium Active CN113096634B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110342399.1A CN113096634B (en) 2021-03-30 2021-03-30 Speech synthesis method, device, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110342399.1A CN113096634B (en) 2021-03-30 2021-03-30 Speech synthesis method, device, server and storage medium

Publications (2)

Publication Number Publication Date
CN113096634A true CN113096634A (en) 2021-07-09
CN113096634B CN113096634B (en) 2024-03-01

Family

ID=76671458

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110342399.1A Active CN113096634B (en) 2021-03-30 2021-03-30 Speech synthesis method, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN113096634B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06138894A (en) * 1992-10-27 1994-05-20 Sony Corp Device and method for voice synthesis
WO2020027619A1 (en) * 2018-08-02 2020-02-06 네오사피엔스 주식회사 Method, device, and computer readable storage medium for text-to-speech synthesis using machine learning on basis of sequential prosody feature
CN110288973A (en) * 2019-05-20 2019-09-27 平安科技(深圳)有限公司 Phoneme synthesizing method, device, equipment and computer readable storage medium
CN110379409A (en) * 2019-06-14 2019-10-25 平安科技(深圳)有限公司 Phoneme synthesizing method, system, terminal device and readable storage medium storing program for executing
CN111292720A (en) * 2020-02-07 2020-06-16 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN112086086A (en) * 2020-10-22 2020-12-15 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and computer readable storage medium
CN112309366A (en) * 2020-11-03 2021-02-02 北京有竹居网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112562634A (en) * 2020-12-02 2021-03-26 平安科技(深圳)有限公司 Multi-style audio synthesis method, device, equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113744713A (en) * 2021-08-12 2021-12-03 北京百度网讯科技有限公司 Speech synthesis method and training method of speech synthesis model
CN113658577A (en) * 2021-08-16 2021-11-16 腾讯音乐娱乐科技(深圳)有限公司 Speech synthesis model training method, audio generation method, device and medium
WO2023051155A1 (en) * 2021-09-30 2023-04-06 华为技术有限公司 Voice processing and training methods and electronic device

Also Published As

Publication number Publication date
CN113096634B (en) 2024-03-01

Similar Documents

Publication Publication Date Title
CN111667814B (en) Multilingual speech synthesis method and device
CN113096634B (en) Speech synthesis method, device, server and storage medium
CN110211563B (en) Chinese speech synthesis method, device and storage medium for scenes and emotion
CN112786009A (en) Speech synthesis method, apparatus, device and storage medium
CN110148400A (en) The pronunciation recognition methods of type, the training method of model, device and equipment
CN112435656A (en) Model training method, voice recognition method, device, equipment and storage medium
CN111223476B (en) Method and device for extracting voice feature vector, computer equipment and storage medium
CN114245203B (en) Video editing method, device, equipment and medium based on script
CN111816158A (en) Voice synthesis method and device and storage medium
CN109376363A (en) A kind of real-time voice interpretation method and device based on earphone
CN114267329A (en) Multi-speaker speech synthesis method based on probability generation and non-autoregressive model
CN113178200B (en) Voice conversion method, device, server and storage medium
CN113450765A (en) Speech synthesis method, apparatus, device and storage medium
CN113327575B (en) Speech synthesis method, device, computer equipment and storage medium
CN112735377B (en) Speech synthesis method, device, terminal equipment and storage medium
Lee et al. Many-to-many unsupervised speech conversion from nonparallel corpora
CN115424605B (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN114708848A (en) Method and device for acquiring size of audio and video file
CN114299989A (en) Voice filtering method and device, electronic equipment and storage medium
WO2021231050A1 (en) Automatic audio content generation
CN114566140A (en) Speech synthesis model training method, speech synthesis method, equipment and product
CN116074574A (en) Video processing method, device, equipment and storage medium
CN113724690A (en) PPG feature output method, target audio output method and device
AU2021103917A4 (en) A system and method of classification for speech reorganization based on recurrent neural network.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant