CN113658577A - Speech synthesis model training method, audio generation method, device and medium


Info

Publication number: CN113658577A
Application number: CN202110937782.1A
Authority: CN (China)
Legal status: Pending
Prior art keywords: text, vector, sample, training, sentence
Other languages: Chinese (zh)
Inventors: 徐东, 陈洲旋
Current assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Original assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

The application discloses a speech synthesis model training method, an audio generation method, a device and a medium. The training method includes: acquiring a training sample set; inputting it into a speech synthesis model; extracting the text content feature vector and the expression mode feature vector of a text sample; extracting the speech feature vector of the corresponding speech sample and determining the corresponding style vector; determining a predicted Mel spectrum of the text sample based on the style vector, the text content feature vector and the expression mode feature vector; determining the Mel spectrum loss from the predicted Mel spectrum and the real Mel spectrum of the speech sample, and the style vector loss from the style vector and the label information; and determining a comprehensive training loss based on the Mel spectrum loss and the style vector loss, obtaining a trained speech synthesis model and a trained style vector once the comprehensive training loss converges. The scheme improves the distinguishing effect of the trained speech synthesis model on different expression modes, thereby improving the naturalness of the synthesized speech and the user experience.

Description

Speech synthesis model training method, audio generation method, device and medium
Technical Field
The present application relates to the field of speech synthesis technologies, and in particular, to a speech synthesis model training method, an audio generation method, an apparatus, and a medium.
Background
With the development of deep neural network technology, increasingly powerful acoustic models and vocoders have emerged in the field of speech synthesis; the former generate Mel spectra from text sequences, and the latter generate high-quality speech from Mel spectra. At present, however, it is difficult for existing model training to distinguish well between different expression modes, such as voice-over and dialogue, so the naturalness of the synthesized speech is low and the user experience is poor. In summary, in the process of implementing the present invention, the inventors found that the prior art has at least the following problem: the speech synthesis model obtained by training has difficulty distinguishing different expression modes, the naturalness of the synthesized speech is poor, and the user experience is poor.
Disclosure of Invention
In view of this, an object of the present application is to provide a speech synthesis model training method, an audio generation method, a device and a medium that can improve the distinguishing effect of the trained speech synthesis model on different expression modes, thereby improving the naturalness of the synthesized speech and the user experience. The specific scheme is as follows:
in a first aspect, the present application provides a method for training a speech synthesis model, including:
acquiring a training sample set; the training sample set comprises a text sample, a voice sample corresponding to the text sample and label information, and the label information comprises an expression mode label;
inputting the training sample set to a speech synthesis model;
extracting character content characteristic vectors and expression mode characteristic vectors of the text samples;
extracting a voice feature vector of a voice sample corresponding to the text sample, and determining a style vector corresponding to the voice feature vector through a multi-head attention mechanism;
determining a predicted Mel frequency spectrum corresponding to the text sample based on the style vector, the character content feature vector and the expression mode feature vector;
determining a mel-frequency spectrum loss by using the predicted mel-frequency spectrum and a real mel-frequency spectrum corresponding to the voice sample, and determining a style vector loss by using the style vector and the label information;
determining a comprehensive training loss based on the mel-frequency spectrum loss and the style vector loss;
and when the comprehensive training loss converges, determining the current speech synthesis model as a trained speech synthesis model and determining the current style vector as a trained style vector.
Optionally, the obtaining a training sample set includes:
obtaining a long sentence text sample, a single sentence text sample, a voice sample corresponding to the long sentence text sample, a voice sample corresponding to the single sentence text sample and label information to obtain a training sample set;
the long sentence text sample is a text sample containing a plurality of single sentence texts and pause marking information between two adjacent single sentence texts.
Optionally, the obtaining long sentence text samples and single sentence text samples includes:
splitting an original text into single sentence texts by using preset punctuations;
determining the symbol type of the ending punctuation of the single sentence text;
performing word segmentation and part-of-speech tagging on the single sentence text to obtain the word segmentation and part-of-speech of the single sentence text;
marking the pause grade of the participles in the single sentence text based on the participles and the parts of speech, and marking the pause grade of the tail of the single sentence text based on the symbol type to obtain a single sentence text sample;
and splicing the single-sentence text samples sentence by sentence, judging in the splicing process whether the number of characters of the current spliced sentence reaches a preset character-number threshold, splicing the next single-sentence text sample to be spliced onto the spliced sentence if the threshold is not reached, and, if the threshold is reached, taking the current spliced sentence as a long-sentence text sample and starting to splice the next spliced sentence, until a splicing end condition is met.
Optionally, after splitting the original text into a single-sentence text with a preset punctuation mark, the method further includes:
rejecting the single sentence text without the first target character; wherein the first target character comprises Chinese characters, numbers and letters;
removing second target characters in the residual single sentence texts; wherein the second target character is a character that does not contain valid information.
Optionally, acquiring the tag information includes:
and judging the property of the quotation marks in the text sample, and if the property is representing dialogue, determining that the expression mode label of the text within the quotation marks is the dialogue type.
Optionally, the determining the nature of the quotation marks in the single sentence text includes:
judging whether a colon exists before the quotation marks, and if so, judging that the property of the quotation marks is representing dialogue;
or, judging whether designated characters exist before the quotation marks, and if so, judging that the property of the quotation marks is representing dialogue; the designated characters are characters indicating that the text following them is of the dialogue type;
or, analyzing the part of speech of the text within the quotation marks, and judging that the property of the quotation marks is representing dialogue if the text within the quotation marks includes a verb.
Optionally, the determining a predicted mel frequency spectrum corresponding to the text sample based on the style vector, the text content feature vector, and the expression mode feature vector includes:
splicing the style vector, the character content characteristic vector and the expression mode characteristic vector based on the weight parameter corresponding to the style vector, the weight parameter corresponding to the character content characteristic vector and the weight parameter corresponding to the expression mode characteristic vector to obtain a spliced vector;
and determining a predicted Mel frequency spectrum corresponding to the splicing vector based on an attention mechanism.
Optionally, the determining a comprehensive training loss based on the mel-frequency spectrum loss and the style vector loss includes:
and performing weighted calculation on the Mel spectrum loss and the style vector loss based on the weight parameters corresponding to the Mel spectrum loss and the weight parameters corresponding to the style vector loss to obtain the comprehensive training loss.
In a second aspect, the present application discloses an audio generation method, comprising:
acquiring a target text of a voice to be synthesized and target label information of the target text; wherein the target tag information comprises an expression mode tag;
inputting the target text and the target label information into the trained speech synthesis model;
extracting a text content feature vector of the target text, and extracting an expression mode feature vector of the target text based on the expression mode label;
determining a target style vector based on the target label information and a trained style vector corresponding to the trained speech synthesis model;
determining a target prediction Mel frequency spectrum corresponding to the target text based on the target style vector, the character content feature vector and the expression mode feature vector;
and synthesizing corresponding predicted voice by using the target predicted Mel frequency spectrum.
In a third aspect, the present application discloses an electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the aforementioned speech synthesis model training method and/or the aforementioned audio generation method.
In a fourth aspect, the present application discloses a computer-readable storage medium for storing a computer program which, when executed by a processor, implements the aforementioned speech synthesis model training method and/or the aforementioned audio generation method.
Therefore, in the present application, a training sample set is first acquired, where the training sample set includes a text sample, a speech sample corresponding to the text sample and label information, and the label information includes an expression mode label. The training sample set is then input into a speech synthesis model; the text content feature vector and the expression mode feature vector of the text sample are extracted; the speech feature vector of the speech sample corresponding to the text sample is extracted, and the style vector corresponding to the speech feature vector is determined through a multi-head attention mechanism. The predicted Mel spectrum corresponding to the text sample is then determined based on the style vector, the text content feature vector and the expression mode feature vector; the Mel spectrum loss is determined using the predicted Mel spectrum and the real Mel spectrum corresponding to the speech sample, and the style vector loss is determined using the style vector and the label information. A comprehensive training loss is determined based on the Mel spectrum loss and the style vector loss, and when the comprehensive training loss converges, the current speech synthesis model is determined as the trained speech synthesis model and the current style vector as the trained style vector. That is, the speech synthesis model is trained using the text samples, the speech samples and the label information including the expression mode labels; during training, the text content feature vector and the expression mode feature vector are extracted, the predicted Mel spectrum is determined using the style vector corresponding to the speech sample together with the text content feature vector and the expression mode feature vector, the losses are then determined, and when the loss converges, the trained speech synthesis model is obtained. In this way, the distinguishing effect of the trained speech synthesis model on different expression modes is improved, thereby improving the naturalness of the synthesized speech and the user experience.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description are merely embodiments of the present application; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is a schematic diagram of a system framework for a speech synthesis model training scheme as disclosed herein;
FIG. 2 is a flow chart of a speech synthesis model training method disclosed herein;
FIG. 3 is a schematic diagram of a specific speech synthesis model training method disclosed in the present application;
FIG. 4 is a flow diagram of a particular speech synthesis model training method disclosed herein;
FIG. 5 is a flow chart of a specific training sample set acquisition process disclosed herein;
FIG. 6 is a flow diagram of a particular speech synthesis model training method disclosed herein;
FIG. 7 is a diagram illustrating a prediction of a particular speech synthesis model disclosed herein;
FIG. 8 is a schematic diagram of a speech synthesis model training apparatus according to the present disclosure;
fig. 9 is a block diagram of an electronic device disclosed in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, in the field of speech synthesis, it is difficult for existing model training to distinguish well between different expression modes, such as voice-over and dialogue, so the naturalness of the synthesized speech is low and the user experience is poor. Therefore, the present application provides a speech synthesis model training scheme that can improve the distinguishing effect of the trained speech synthesis model on different expression modes, thereby improving the naturalness of the synthesized speech and the user experience.
In the speech synthesis model training scheme of the present application, the system frame diagram adopted may be shown in fig. 1, and specifically may include: the system comprises a background server and a plurality of user terminals which are in communication connection with the background server. The user side includes, but is not limited to, a tablet computer, a notebook computer, a smart phone, and a Personal Computer (PC), and is not limited herein. The background server can be a cloud server or a non-cloud server.
In the present application, the steps executed by the background server include: acquiring a training sample set, where the training sample set includes a text sample, a speech sample corresponding to the text sample and label information, and the label information includes an expression mode label; inputting the training sample set into a speech synthesis model; extracting the text content feature vector and the expression mode feature vector of the text sample; extracting the speech feature vector of the speech sample corresponding to the text sample, and determining the style vector corresponding to the speech feature vector through a multi-head attention mechanism; determining the predicted Mel spectrum corresponding to the text sample based on the style vector, the text content feature vector and the expression mode feature vector; determining the Mel spectrum loss using the predicted Mel spectrum and the real Mel spectrum corresponding to the speech sample, and determining the style vector loss using the style vector and the label information; determining a comprehensive training loss based on the Mel spectrum loss and the style vector loss; and when the comprehensive training loss converges, determining the current speech synthesis model as the trained speech synthesis model and the current style vector as the trained style vector.
The user side is used for transmitting the text content that the user specifies for speech synthesis to the background server, so that when the background server obtains the text content, it determines the predicted Mel spectrum of the text content using the trained speech synthesis model and the trained style vector, synthesizes the corresponding speech, and transmits the speech to the user side for playing.
Referring to fig. 2, an embodiment of the present application discloses a method for training a speech synthesis model, including:
step S11: acquiring a training sample set; the training sample set comprises a text sample, a voice sample corresponding to the text sample and label information, and the label information comprises an expression mode label.
In a specific embodiment, the expression mode tag may be of the dialogue type, that is, text content in the text sample whose expression mode is dialogue is labelled to obtain an expression mode label of the dialogue type. Furthermore, text content in the text sample whose expression mode is non-dialogue can be labelled to obtain an expression mode label of the voice-over type, or the voice-over may be left unlabelled. For example, 1 may be used to label text content whose expression mode is dialogue and 0 may be used to label text content whose expression mode is non-dialogue, so that 1 serves as the dialogue-type expression mode label.
In addition, in a specific embodiment, the tag information may further include a speaker tag, an emotion tag, a speech rate tag, and the like.
In a specific embodiment, the specific process of acquiring the tag information includes:
and judging the property of the quotation marks in the text sample; if the property is representing dialogue, determining that the expression mode label of the text within the quotation marks is the dialogue type.
Further, in a specific embodiment, it may be determined whether a colon exists before the quotation marks, and if so, the property of the quotation marks is judged to be representing dialogue; or, it may be determined whether designated characters exist before the quotation marks, and if so, the property of the quotation marks is judged to be representing dialogue, where the designated characters are characters indicating that the text following them is of the dialogue type; or, the part of speech of the text within the quotation marks may be analyzed, and if the text within the quotation marks includes a verb, the property of the quotation marks is judged to be representing dialogue.
It should be noted that, besides representing dialogue, quotation marks can mark emphasis, special appellations, and so on. In general, quotation marks representing dialogue are preceded by some pause before the speech starts, while quotation marks representing emphasis are stressed with an accent and generally are not followed by a long pause. It is therefore desirable to determine as accurately as possible whether quotation marks represent speech. This embodiment can make the judgment in the three ways above, but the specific ways of judging quotation marks include but are not limited to these three. In a specific embodiment, it is first determined whether a colon is present before the quotation marks; if so, the property of the quotation marks is judged to be representing dialogue. If not, it is further determined whether the characters before the quotation marks are characters that clearly indicate the dialogue type, such as speech verbs meaning "said", "stated" or "told"; if so, the property of the quotation marks is judged to be representing dialogue. If not, the parts of speech within the quotation marks are analyzed: if only nouns are present, the quotation marks are judged to indicate emphasis; if a verb is present, they are judged to be of the dialogue type.
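As a concrete illustration of the three checks above, the following is a minimal sketch that applies them to one single-sentence text. The helper name, the regular expression and the speech-verb list are illustrative assumptions and not part of the disclosed scheme; the optional pos_tags argument stands in for the part-of-speech result of the quoted text.

    import re

    # Illustrative speech verbs meaning "said", "asked", "answered", "shouted", "told".
    SPEECH_VERBS = ("道", "说", "问", "答", "喊", "叫", "告诉")

    def quote_is_dialogue(sentence: str, pos_tags=None) -> bool:
        """Return True if the quoted span in `sentence` likely represents dialogue."""
        match = re.search(r'[“"](.+?)[”"]', sentence)
        if not match:
            return False
        before = sentence[:match.start()].rstrip()

        # Check 1: a colon immediately before the quotation marks.
        if before.endswith(("：", ":")):
            return True
        # Check 2: a designated speech verb just before the quotation marks.
        if before.endswith(SPEECH_VERBS):
            return True
        # Check 3: part-of-speech analysis of the quoted text; a verb inside the
        # quotes is treated as dialogue, otherwise the quotes indicate emphasis.
        if pos_tags is not None:
            return any(flag.startswith("v") for _, flag in pos_tags)
        return False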
Step S12: inputting the training sample set to a speech synthesis model.
Step S13: and extracting the character content characteristic vector and the expression mode characteristic vector of the text sample.
That is, the text feature vectors extracted by the speech synthesis model in the encoding stage include the text content feature vector and the expression mode feature vector. The text content feature vector represents the content information of the text, that is, what textual information is actually expressed; the expression mode feature vector represents the expression mode of the text, that is, whether the text is expressed as voice-over or as dialogue.
Step S14: and extracting the voice feature vector of the voice sample corresponding to the text sample, and determining the style vector corresponding to the voice feature vector through a multi-head attention mechanism.
In a specific implementation manner, tokens of the speech feature vector in different information dimensions are acquired through a multi-head attention mechanism, and then weighting calculation is performed on each token to obtain a style vector corresponding to the speech feature vector.
It should be noted that the speech feature vector contains various types of information about the speech sample. The tokens of the speech feature vector in different dimensions obtained by the multi-head attention mechanism are equivalent to branch vectors of the speech in each dimension, representing information such as pauses, timbre, semantics and emotion, and the tokens can be combined by weighting to obtain the style vector.
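A minimal sketch of such a style-token layer in PyTorch is given below. The module name, the dimensions and the number of tokens and heads are illustrative assumptions; the patent only specifies that a multi-head attention mechanism weights a set of tokens into a style vector.

    import torch
    import torch.nn as nn

    class StyleTokenLayer(nn.Module):
        """GST-style sketch: a bank of learnable style tokens is attended over by the
        reference (speech) embedding; the attention weights combine the tokens into a
        single style vector."""

        def __init__(self, ref_dim=128, token_dim=256, num_tokens=10, num_heads=4):
            super().__init__()
            self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
            self.query_proj = nn.Linear(ref_dim, token_dim)
            self.attn = nn.MultiheadAttention(embed_dim=token_dim, num_heads=num_heads,
                                              kdim=token_dim, vdim=token_dim,
                                              batch_first=True)

        def forward(self, ref_embedding):                       # (batch, ref_dim)
            query = self.query_proj(ref_embedding).unsqueeze(1)  # (batch, 1, token_dim)
            keys = torch.tanh(self.tokens).unsqueeze(0).expand(
                ref_embedding.size(0), -1, -1)                   # (batch, num_tokens, token_dim)
            style, _ = self.attn(query, keys, keys)              # weighted token combination
            return style.squeeze(1)                              # (batch, token_dim)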
Step S15: and determining a predicted Mel frequency spectrum corresponding to the text sample based on the style vector, the character content feature vector and the expression mode feature vector.
In a specific embodiment, the style vector, the text content feature vector, and the expression manner feature vector may be spliced based on a weight parameter corresponding to the style vector, a weight parameter corresponding to the text content feature vector, and a weight parameter corresponding to the expression manner feature vector to obtain a spliced vector; and determining a predicted Mel frequency spectrum corresponding to the splicing vector based on an attention mechanism.
The weight parameters corresponding to the style vectors, the text content feature vectors and the expression mode feature vectors are learnable parameters and are updated in the training process.
Of course, in some embodiments, the style vector, the text content feature vector and the expression mode feature vector may also be spliced directly, without the weight parameters, to obtain the spliced vector.
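A minimal sketch of the weighted splicing described above follows. The use of scalar learnable weights and the tensor shapes are assumptions; the patent only states that the weight parameters are learnable and updated during training.

    import torch
    import torch.nn as nn

    class ConditionFusion(nn.Module):
        """Splice the style vector, text-content sequence and expression-mode sequence
        with learnable weights before the attention/decoder stage."""

        def __init__(self):
            super().__init__()
            self.w_style = nn.Parameter(torch.tensor(1.0))
            self.w_content = nn.Parameter(torch.tensor(1.0))
            self.w_expr = nn.Parameter(torch.tensor(1.0))

        def forward(self, style_vec, content_seq, expr_seq):
            # content_seq / expr_seq: (batch, time, dim); style_vec: (batch, dim)
            style_seq = style_vec.unsqueeze(1).expand(-1, content_seq.size(1), -1)
            return torch.cat([self.w_content * content_seq,
                              self.w_expr * expr_seq,
                              self.w_style * style_seq], dim=-1)   # spliced vector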
Step S16: and determining a Mel frequency spectrum loss by using the predicted Mel frequency spectrum and a real Mel frequency spectrum corresponding to the voice sample, and determining a style vector loss by using the style vector and the label information.
In particular embodiments, the model parameters may be updated using the Mel spectrum loss and the style vector loss.
Step S17: determining a composite training loss based on the Mel spectral loss and the style vector loss.
In a specific embodiment, the mel-frequency spectrum loss and the style vector loss may be weighted and calculated based on a weight parameter corresponding to the mel-frequency spectrum loss and a weight parameter corresponding to the style vector loss, so as to obtain a comprehensive training loss.
The weight parameter corresponding to the mel-frequency spectrum loss and the weight parameter corresponding to the style vector loss may be parameters configured in advance according to experience, or may be learnable parameters.
In another specific embodiment, the mel-frequency spectrum loss and the style vector loss can be directly added to obtain the comprehensive training loss.
It should be noted that the losses used to determine the comprehensive training loss include but are not limited to the Mel spectrum loss and the style vector loss, and may include other losses calculated according to actual requirements.
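A minimal sketch of the weighted combination of the two losses is shown below. The choice of L1 for the Mel spectrum loss and of cross-entropy over a classifier output for the style vector loss are assumptions; the patent only specifies a weighted sum (or a direct addition) of the two losses.

    import torch.nn.functional as F

    def comprehensive_loss(pred_mel, real_mel, style_logits, expr_labels,
                           w_mel=1.0, w_style=1.0):
        """Weighted combination of the Mel spectrum loss and the style vector loss."""
        mel_loss = F.l1_loss(pred_mel, real_mel)                # predicted vs. real Mel spectrum
        style_loss = F.cross_entropy(style_logits, expr_labels) # style vector vs. label information
        return w_mel * mel_loss + w_style * style_loss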
Step S18: when the comprehensive training loss converges, determining the current speech synthesis model as the trained speech synthesis model and determining the current style vector as the trained style vector.
For example, referring to fig. 3, an embodiment of the present application discloses a specific speech synthesis model training method. The speech synthesis model comprises a speech encoder, a GST (Global Style Token) module, a text encoder, an expression mode encoder, an attention mechanism and a decoder. First, the text content feature vector and the expression mode feature vector of the text are extracted by the text encoder and the expression mode encoder respectively; the speech feature vector of the speech is extracted by the speech encoder, the tokens of the speech vector in different dimensions are obtained through the multi-head attention mechanism of the GST module, and the token results are combined by weighting to obtain the style vector. The accuracy of the style vector is then evaluated by determining the style vector loss from the label information and the current style vector. The effect of the predicted Mel spectrum is evaluated by extracting the Mel spectrum of the input training speech as the real Mel spectrum and computing the difference between the Mel spectrum predicted by the model and the real Mel spectrum to obtain the Mel spectrum loss. The style vector loss and the Mel spectrum loss are then fed back to the speech synthesis model to adjust the weight parameters during training, until the predicted effect is close to the real effect. The predicted Mel spectrum refers to the result produced by the decoder after it is conditioned on the input text vectors and the style vector, and represents the output of the acoustic model; that is, the predicted Mel spectrum is the Mel spectrum determined based on the style vector, the text content feature vector and the expression mode feature vector.
It should be noted that a GST module is added for stylization on top of the Tacotron2 model for end-to-end speech synthesis, so that the synthesized speech has better prosodic expression. During training, text vectors such as phonemes and tones are extracted from the text of a large amount of paired text/speech data, prosody vectors are extracted from the speech, style vectors are obtained by learning the prosody vectors through the multi-head attention mechanism, and the style vectors are spliced with the text vectors and then fed into the attention mechanism model. After training is finished, the GST module has extracted the global style characteristics of the audio in the data set, such as prosodic pause information, and stored them in the style vectors; when speech synthesis is carried out, the obtained style vectors can be used for synthesis. A prosodic pause is the pause duration of characters, words and sentences in the text after the text is vocalized. On the basis of the Tacotron2 model and the GST module, an expression mode encoder is introduced during training to distinguish voice-over from dialogue, so that the model can model voice-over and dialogue in the training stage, distinguish between them, better conform to daily expression habits and synthesize more natural speech.
Therefore, in the embodiment of the present application, a training sample set is first acquired, where the training sample set includes a text sample, a speech sample corresponding to the text sample and label information, and the label information includes an expression mode label. The training sample set is then input into a speech synthesis model; the text content feature vector and the expression mode feature vector of the text sample are extracted; the speech feature vector of the speech sample corresponding to the text sample is extracted, and the style vector corresponding to the speech feature vector is determined through a multi-head attention mechanism. The predicted Mel spectrum corresponding to the text sample is then determined based on the style vector, the text content feature vector and the expression mode feature vector; the Mel spectrum loss is determined using the predicted Mel spectrum and the real Mel spectrum corresponding to the speech sample, and the style vector loss is determined using the style vector and the label information. A comprehensive training loss is determined based on the Mel spectrum loss and the style vector loss, and when the comprehensive training loss converges, the current speech synthesis model is determined as the trained speech synthesis model and the current style vector as the trained style vector. That is, the speech synthesis model is trained using the text samples, the speech samples and the label information including the expression mode labels; during training, the text content feature vector and the expression mode feature vector are extracted, the predicted Mel spectrum is determined using the style vector corresponding to the speech sample together with the text content feature vector and the expression mode feature vector, the losses are then determined, and when the loss converges, the trained speech synthesis model is obtained. In this way, the distinguishing effect of the trained speech synthesis model on different expression modes is improved, thereby improving the naturalness of the synthesized speech and the user experience.
Referring to fig. 4, an embodiment of the present application discloses a specific speech synthesis model training method, including:
step S21: the method comprises the steps of obtaining a long sentence text sample, a single sentence text sample, a voice sample corresponding to the long sentence text sample, a voice sample corresponding to the single sentence text sample and label information to obtain a training sample set.
The long sentence text sample is a text sample containing a plurality of single sentence texts and pause marking information between two adjacent single sentence texts.
Referring to fig. 5, an embodiment of the present application discloses a specific training sample set obtaining flowchart, and in a specific implementation, a specific process of obtaining a long sentence text sample and a single sentence text sample includes:
step S31: splitting an original text into single sentence texts by using preset punctuations.
The original text is unprocessed text, and the text types include but are not limited to novel text, information, conversation, and the like, and can be long text or short text. The preset punctuation marks may include commas, sentences, exclamation marks, ellipses, etc.
That is, in the embodiment of the present application, the original text can be split sentence by sentence using the punctuation marks between sentences as separators. For example, the original text "Hello everyone, please give us your kind guidance." is split into the single-sentence texts "Hello everyone" and "please give us your kind guidance."
In a specific embodiment, the single sentence text without the first target character may be eliminated; wherein the first target character comprises Chinese characters, numbers and letters.
Further, second target characters in the residual single sentence texts are removed; wherein the second target character is a character that does not contain valid information.
That is, in the embodiment of the present application, invalid characters can be removed. After the split single-sentence texts are obtained, it is determined whether each single-sentence text contains a first target character, and texts that contain no first target character are rejected; for example, a single-sentence text containing only characters such as "@(|" is rejected.
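A minimal sketch of these two cleaning steps follows. The regular expressions defining which characters are kept are assumptions; the patent only specifies Chinese characters, digits and letters as first target characters and characters carrying no valid information as second target characters.

    import re

    # A sentence is kept only if it contains a Chinese character, digit or letter.
    HAS_CONTENT = re.compile(r"[\u4e00-\u9fa5A-Za-z0-9]")
    # Characters outside this set are treated as carrying no valid information.
    INVALID_CHARS = re.compile(r'[^\u4e00-\u9fa5A-Za-z0-9，。！？：；“”、…,.!?:;" ]')

    def clean_sentences(sentences):
        kept = [s for s in sentences if HAS_CONTENT.search(s)]   # reject texts without first target characters
        return [INVALID_CHARS.sub("", s) for s in kept]           # remove second target characters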
In addition, in this embodiment of the application, a specific process of acquiring the tag information may include: and judging the property of the quotation marks in the single sentence text, and if the property is the representation dialogue, determining that the representation mode labels of the text in the quotation marks are the dialogue type.
That is, the embodiment of the application can judge the quotation mark property in the single-sentence text after the single-sentence text is split and the invalid text and the invalid characters are removed. For a specific determination manner, reference may be made to the disclosure of the foregoing embodiments, which is not described herein again.
Step S32: and determining the symbol type of the ending punctuation of the single sentence text.
It should be noted that the ending punctuation of a single sentence text is usually a punctuation between two adjacent single sentences, and the determination of the type of the ending punctuation can be used for inter-sentence pause processing.
Step S33: and performing word segmentation and part-of-speech tagging on the single sentence text to obtain the word segmentation and part-of-speech of the single sentence text.
It should be noted that word segmentation separates the words and phrases in each sentence, and part-of-speech tagging predicts the part of speech of each word. For example, "你吃饭了吗？" ("Have you eaten?") is segmented into "你 / 吃饭 / 了 / 吗 / ？", where "吃饭" ("eat") is a single word; the part of speech of each segmented word is then obtained through part-of-speech tagging, for example "你" ("you") is a pronoun. In a specific embodiment, a word segmentation tool such as jieba can be used to obtain the segmentation of a single-sentence text, including the words and phrases in the sentence and their parts of speech.
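A minimal sketch of this step using the jieba toolkit mentioned above (the handling of the returned word/flag pairs reflects common jieba usage and is an assumption about the concrete tooling):

    import jieba.posseg as pseg

    def segment_with_pos(sentence: str):
        """Return [(word, part_of_speech_flag), ...] for a single-sentence text."""
        # Each pair carries the word and its part-of-speech flag,
        # e.g. "v" for verbs, "n" for nouns, "r" for pronouns.
        return [(pair.word, pair.flag) for pair in pseg.cut(sentence)]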
Step S34: and marking the pause grade of the participles in the single sentence text based on the participles and the parts of speech, and marking the pause grade of the tail of the single sentence text based on the symbol type to obtain a single sentence text sample.
In a specific embodiment, the pause levels can be divided according to pause length, for example into 4 levels denoted #1, #2, #3 and #4. For the segmentation results, prosodic words, prosodic phrases and intonation phrases of a single-sentence text are labelled "#1", "#2" and "#3" respectively. By judging the punctuation marks between sentences, the text is given the corresponding pause level; the higher the level, the longer the pause. If a comma lies between two sentences, the end of the first sentence is labelled "#3"; if a full stop lies between two sentences, the end of the first sentence is labelled "#4". Through this labelling, controllable pauses between the sentences of the text are achieved, producing the effect of alternating long and short pauses.
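The sentence-end labelling can be sketched as follows. This is a minimal sketch: the punctuation sets and the default level are assumptions, and the intra-sentence "#1"/"#2" labels driven by segmentation and parts of speech are omitted.

    # Punctuation marks treated as long vs. short inter-sentence pauses (assumed sets).
    LONG_PAUSE = {"。", "！", "？", "…"}
    SHORT_PAUSE = {"，", "；", "：", "、"}

    def label_sentence_end(sentence: str, end_punct: str) -> str:
        """Append the pause level for the end of a single-sentence text."""
        if end_punct in LONG_PAUSE:
            return sentence + "#4"     # full stop between sentences: long pause
        return sentence + "#3"         # comma or other short pause between sentences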
Step S35: splicing the single-sentence text samples sentence by sentence; in the splicing process, judging whether the number of characters of the current spliced sentence reaches a preset character-number threshold; if not, splicing the next single-sentence text sample to be spliced onto the spliced sentence; if so, taking the current spliced sentence as a long-sentence text sample and starting to splice the next spliced sentence, until the splicing end condition is met.
The splicing end condition may be that all single-sentence text samples are spliced, or the number of long-sentence text samples reaches a preset number, and sufficient samples are obtained, and then the splicing is ended.
Therefore, long sentence text samples with the number of characters close to the preset character number threshold are obtained, and compared with a single sentence text, the spliced long sentence text samples have prosody information of inter-sentence pause, and the pause between sentences can be better reflected.
In addition, in a specific implementation manner, if the number of characters of a single-sentence text is greater than a preset threshold, that is, the single-sentence text is too long, the single-sentence text may be discarded or split.
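A minimal sketch of the splicing procedure of step S35 is shown below (the threshold value and the handling of the final partial sentence are illustrative assumptions):

    def splice_long_sentences(single_samples, max_chars=100):
        """Splice single-sentence samples into long-sentence samples up to a character threshold."""
        long_samples, current = [], ""
        for sample in single_samples:
            if len(current) < max_chars:
                current += sample                # threshold not reached: keep splicing
            else:
                long_samples.append(current)     # threshold reached: emit a long-sentence sample
                current = sample                 # start the next spliced sentence
        if current:
            long_samples.append(current)
        return long_samples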
Step S22: inputting the training sample set to a speech synthesis model.
Step S23: and extracting the character content characteristic vector and the expression mode characteristic vector of the text sample.
Step S24: and extracting the voice feature vector of the voice sample corresponding to the text sample, and determining the style vector corresponding to the voice feature vector through a multi-head attention mechanism.
Step S25: and determining a predicted Mel frequency spectrum corresponding to the text sample based on the style vector, the character content feature vector and the expression mode feature vector.
Step S26: and determining a Mel frequency spectrum loss by using the predicted Mel frequency spectrum and a real Mel frequency spectrum corresponding to the voice sample, and determining a style vector loss by using the style vector and the label information.
Step S27: determining a composite training loss based on the Mel spectral loss and the style vector loss.
For specific implementation of the above steps S23 to S27, reference may be made to the disclosure of the foregoing embodiments, and details are not repeated here.
Step S28: and judging whether the comprehensive training loss is converged.
Step S29: if yes, determining the current speech synthesis model as the trained speech synthesis model, and determining the current style vector as the trained style vector.
Otherwise, the speech synthesis model is updated with the comprehensive training loss, further text samples, speech samples and label information are determined from the training sample set, and the above steps S23 to S28 are performed again.
That is, in the embodiment of the present application, a speech synthesis model is trained using a training sample set, and a comprehensive training loss is determined in a training process, and when the comprehensive training loss converges, a current speech synthesis model is determined as a trained speech synthesis model, and a current style vector is determined as a trained style vector.
Therefore, the training sample set obtained in the embodiment of the present application includes long-sentence text samples, single-sentence text samples, speech samples corresponding to the long-sentence text samples and speech samples corresponding to the single-sentence text samples, containing rich prosodic pause information both within single sentences and between single sentences, so that the model can better learn prosodic pause information and the performance of the model is improved.
Referring to fig. 6, an embodiment of the present application discloses a specific audio generation method, including:
step S41: acquiring a target text of a voice to be synthesized and target label information of the target text; wherein the target tag information includes a presentation tag.
The process of acquiring the expression mode tag may refer to the content disclosed in the foregoing embodiments, and is not described herein again. The target tag information may further include an emotion tag, a speech rate tag, and the like.
Step S42: and inputting the target text and the target label information into the trained speech synthesis model disclosed by the embodiment of the application.
Step S43: extracting a text content feature vector of the target text, and extracting an expression mode feature vector of the target text based on the expression mode label;
step S44: determining a target style vector based on the target label information and a trained style vector corresponding to the trained speech synthesis model;
step S45: and determining a target prediction Mel frequency spectrum corresponding to the target text based on the target style vector, the character content feature vector and the expression mode feature vector.
In a specific implementation manner, the text content feature vector, the expression mode feature vector and the target style vector may be spliced to obtain a spliced vector; and determining a target prediction Mel frequency spectrum corresponding to the splicing vector based on an attention mechanism.
For example, referring to fig. 7, the embodiment of the present application discloses a specific prediction diagram of a speech synthesis model.
A target text and target label information are input; the text encoder extracts the text content feature vector; the expression mode encoder extracts the expression mode feature vector based on the expression mode label; the target style vector is determined using the trained style vector and the target label information; the target style vector, the text content feature vector and the expression mode feature vector are spliced to obtain a spliced vector; and the predicted Mel spectrum corresponding to the spliced vector is obtained through the attention mechanism and the decoder.
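This prediction path can be sketched as follows. This is a schematic sketch only: the model attribute names and the label dictionary key are hypothetical and are not defined by the patent.

    import torch

    @torch.no_grad()
    def synthesize_mel(model, target_text, target_labels):
        """Sketch of the prediction path of fig. 7 using hypothetical sub-modules."""
        content = model.text_encoder(target_text)                      # text content feature vector
        expr = model.expression_encoder(target_labels["expression"])   # expression mode feature vector
        style = model.select_style_vector(target_labels)               # target style vector from trained style vector
        spliced = model.fuse(style, content, expr)                     # splicing, cf. ConditionFusion above
        return model.decode(spliced)                                   # attention + decoder -> predicted Mel spectrum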
It should be noted that, for model training, in a specific embodiment, training sample sets corresponding to a plurality of speakers may be obtained, and the training sample set of each speaker may be input into its own speech synthesis model for training, yielding a trained speech synthesis model for each speaker. In that case, when audio is generated, the target text and the speaker information input by the user are obtained, the corresponding trained speech synthesis model is determined according to the speaker information, the target label information of the target text (including the expression mode label and the like) is determined, and the target text and the target label information, without a speaker label, are input into the trained speech synthesis model to generate the speech corresponding to the speaker information.
In another specific embodiment, a training sample set corresponding to a plurality of speakers may be obtained, and the training sample sets corresponding to the plurality of speakers are input to the same trained speech synthesis model for training, so as to obtain a trained speech synthesis model corresponding to each speaker. Since the same model is trained using training sample sets corresponding to multiple speakers, the training sample sets include speaker labels. When the audio is generated, target text and speaker information input by a user are obtained, target label information is determined, the target label information comprises a speaker label determined by using the speaker information, and the target label information and the target text are input into a trained voice synthesis model to generate voice corresponding to the speaker information.
Of course, in some embodiments, the trained speech synthesis model may generate a style vector based on the target label information and determine the target predicted Mel spectrum based on the generated style vector, the text content feature vector and the expression mode feature vector, without using the trained style vector.
step S46: and synthesizing corresponding predicted voice by using the target predicted Mel frequency spectrum.
In particular embodiments, the corresponding predicted speech may be synthesized by a phase prediction or neural network vocoder.
In the phase prediction approach, which includes but is not limited to the Griffin-Lim signal estimation algorithm, phase information is predicted from the input spectrum (an amplitude spectrum without phase information), and the difference between the spectrum obtained by the inverse Fourier transform with the predicted phase and the input spectrum is iteratively reduced to obtain the final predicted speech signal. The neural network vocoder approach uses a deep neural network to model the relationship between the spectrum and the speech for prediction; the output speech has high quality, but the algorithm complexity is high.
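A minimal sketch of the phase-prediction route using librosa's Griffin-Lim implementation is shown below (the parameter values are illustrative, and a neural network vocoder would replace this step when higher quality is required):

    import numpy as np
    import librosa

    def mel_to_audio_griffin_lim(mel_spec: np.ndarray, sr=22050, n_fft=1024,
                                 hop_length=256, n_iter=60) -> np.ndarray:
        """Invert a Mel spectrogram to a waveform via Griffin-Lim phase estimation."""
        linear = librosa.feature.inverse.mel_to_stft(mel_spec, sr=sr, n_fft=n_fft)
        return librosa.griffinlim(linear, n_iter=n_iter,
                                  hop_length=hop_length, win_length=n_fft)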
The following describes a technical solution of the present application, taking a certain speech synthesis APP as an example.
The background server of the APP first obtains a large number of novel texts and information texts as original texts and splits the original texts into single-sentence texts using preset punctuation marks, rejecting invalid texts and removing invalid characters from the valid texts. It then judges the property of the quotation marks in the remaining single-sentence texts; if the quotation marks represent dialogue, the expression mode label of the text within the quotation marks is determined to be the dialogue type. The symbol type of the ending punctuation of each remaining valid single-sentence text is determined, word segmentation and part-of-speech tagging are performed on the single-sentence texts, the pause levels of the segmented words in each single-sentence text are labelled based on the segmentation and the parts of speech, and the pause level at the end of each single-sentence text is labelled based on the symbol type, yielding single-sentence text samples. The single-sentence text samples are then spliced sentence by sentence: during splicing it is judged whether the number of characters of the current spliced sentence reaches a preset character-number threshold; if not, the next single-sentence text sample to be spliced is spliced onto the spliced sentence; if so, the current spliced sentence is taken as a long-sentence text sample and splicing of the next spliced sentence begins, until the splicing end condition is met. In this way the long-sentence text samples, the single-sentence text samples and the expression mode labels are obtained, the corresponding speech samples are further obtained, and the training sample set is obtained. The training sample set is input into a speech synthesis model; the text content feature vector and the expression mode feature vector of the text sample are extracted; the speech feature vector of the speech sample corresponding to the text sample is extracted, and the style vector corresponding to the speech feature vector is determined through a multi-head attention mechanism; the predicted Mel spectrum corresponding to the text sample is determined based on the style vector, the text content feature vector and the expression mode feature vector; the Mel spectrum loss is determined using the predicted Mel spectrum and the real Mel spectrum corresponding to the speech sample, and the style vector loss is determined using the style vector and the label information; a comprehensive training loss is determined based on the Mel spectrum loss and the style vector loss, and when the comprehensive training loss converges, the current speech synthesis model is determined as the trained speech synthesis model and the current style vector as the trained style vector.
The user side installs the speech synthesis APP. The user transmits the text content requiring speech synthesis to the background server through the APP; when the background server obtains the text content, it determines the predicted Mel spectrum of the text content using the trained speech synthesis model and the trained style vector, synthesizes the speech, and transmits the speech to the user side for playing.
Referring to fig. 8, an embodiment of the present application discloses a speech synthesis model training apparatus, including:
a training sample set obtaining module 11, configured to obtain a training sample set; the training sample set comprises a text sample, a voice sample corresponding to the text sample and label information, and the label information comprises an expression mode label;
a training sample set input module 12, configured to input the training sample set to a speech synthesis model;
the text feature extraction module 13 is configured to extract a text content feature vector and an expression mode feature vector of the text sample;
a voice feature extraction module 14, configured to extract a voice feature vector of a voice sample corresponding to the text sample;
the style vector determining module 15 is configured to determine a style vector corresponding to the speech feature vector through a multi-head attention mechanism;
a predicted mel-frequency spectrum determining module 16, configured to determine a predicted mel-frequency spectrum corresponding to the text sample based on the style vector, the text content feature vector, and the expression manner feature vector;
a loss determining module 17, configured to determine a mel-frequency spectrum loss by using the predicted mel-frequency spectrum and a real mel-frequency spectrum corresponding to the voice sample, and determine a style vector loss by using the style vector and the label information; determining a synthetic training loss based on the mel-frequency spectrum loss and the style vector loss;
a trained model determining module 18, configured to determine the current speech synthesis model as the trained speech synthesis model and determine the current style vector as the trained style vector when the comprehensive training loss converges.
Therefore, in the embodiment of the present application, a training sample set is first acquired, where the training sample set includes a text sample, a speech sample corresponding to the text sample and label information, and the label information includes an expression mode label. The training sample set is then input into a speech synthesis model; the text content feature vector and the expression mode feature vector of the text sample are extracted; the speech feature vector of the speech sample corresponding to the text sample is extracted, and the style vector corresponding to the speech feature vector is determined through a multi-head attention mechanism. The predicted Mel spectrum corresponding to the text sample is then determined based on the style vector, the text content feature vector and the expression mode feature vector; the Mel spectrum loss is determined using the predicted Mel spectrum and the real Mel spectrum corresponding to the speech sample, and the style vector loss is determined using the style vector and the label information. A comprehensive training loss is determined based on the Mel spectrum loss and the style vector loss, and when the comprehensive training loss converges, the current speech synthesis model is determined as the trained speech synthesis model and the current style vector as the trained style vector. That is, the speech synthesis model is trained using the text samples, the speech samples and the label information including the expression mode labels; during training, the text content feature vector and the expression mode feature vector are extracted, the predicted Mel spectrum is determined using the style vector corresponding to the speech sample together with the text content feature vector and the expression mode feature vector, the losses are then determined, and when the loss converges, the trained speech synthesis model is obtained. In this way, the distinguishing effect of the trained speech synthesis model on different expression modes is improved, thereby improving the naturalness of the synthesized speech and the user experience.
The training sample set obtaining module 11 is specifically configured to obtain a long sentence text sample, a single sentence text sample, a voice sample corresponding to the long sentence text sample, a voice sample corresponding to the single sentence text sample, and label information, so as to obtain a training sample set; the long sentence text sample is a text sample containing a plurality of single sentence texts and pause marking information between two adjacent single sentence texts.
In a specific embodiment, the training sample set obtaining module 11 includes:
the single sentence sample acquisition submodel is used for splitting an original text into single sentence texts by using preset punctuations; determining the symbol type of the ending punctuation of the single sentence text; performing word segmentation and part-of-speech tagging on the single sentence text to obtain the word segmentation and part-of-speech of the single sentence text; marking the pause grade of the participles in the single sentence text based on the participles and the parts of speech, and marking the pause grade of the tail of the single sentence text based on the symbol type to obtain a single sentence text sample;
and the long sentence sample acquisition submodule is used for splicing the single-sentence text samples one by one, judging in the splicing process whether the number of characters of the current spliced sentence reaches a preset character-number threshold, splicing the next single-sentence text sample to be spliced onto the spliced sentence if the threshold is not reached, and, if the threshold is reached, taking the current spliced sentence as a long-sentence text sample and starting to splice the next spliced sentence, until the splicing end condition is met.
The device further comprises:
the invalid text removing module is used for removing single sentence texts that do not contain a first target character; wherein the first target character comprises Chinese characters, numbers and letters;
and the invalid character removing module is used for removing second target characters from the remaining single sentence texts; wherein the second target character is a character that does not contain valid information. A short sketch of these two filtering steps is given below.
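As a minimal sketch of the two filtering modules, the character classes used below for the "first target characters" (characters that must be present) and the "second target characters" (characters carrying no valid information) are assumptions for illustration only.

```python
import re

HAS_TARGET = re.compile(r"[\u4e00-\u9fa5A-Za-z0-9]")   # Chinese characters, letters, digits
NO_INFO = re.compile(r"[#*_~^<>{}\[\]\\|]")            # assumed no-information characters

def clean(sentences):
    kept = [s for s in sentences if HAS_TARGET.search(s)]   # drop sentences lacking a target character
    return [NO_INFO.sub("", s) for s in kept]               # strip no-information characters from the rest

print(clean(["……", "你好，世界#"]))   # ['你好，世界']
```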
In a specific embodiment, the training sample set obtaining module 11 includes:
and the tag information acquisition submodule is used for judging the property of quotation marks in the text samples, and if the property represents dialogue, determining that the expression mode tag of the text inside the quotation marks is the dialogue type.
Further, in a specific embodiment, the tag information acquisition submodule is specifically configured to:
judge whether a colon exists before the quotation marks, and if so, judge that the property of the quotation marks represents dialogue;
or judge whether appointed characters exist before the quotation marks, and if so, judge that the property of the quotation marks represents dialogue; wherein the appointed characters are characters indicating that the characters following them are of the dialogue type;
or analyze the parts of speech of the text inside the quotation marks, and if the text inside the quotation marks contains a verb, judge that the property of the quotation marks represents dialogue. An illustrative sketch of these three heuristics is given below.
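The following sketch illustrates the three heuristics. The list of "appointed characters" (speech verbs), the use of jieba for the part-of-speech analysis, and the quote characters handled are all assumptions for illustration, not choices specified by this application.

```python
SAY_WORDS = ("说", "道", "问", "答", "喊", "回答")   # assumed appointed characters

def quoted_is_dialogue(text, start, end):
    """Judge whether the text between the quote positions start/end represents dialogue."""
    before = text[:start].rstrip()
    quoted = text[start + 1:end]
    if before.endswith(("：", ":")):                 # rule 1: a colon precedes the quotation marks
        return True
    if before.endswith(SAY_WORDS):                   # rule 2: an appointed character precedes them
        return True
    try:                                             # rule 3: the quoted text contains a verb
        import jieba.posseg as pseg
        return any(flag.startswith("v") for _, flag in pseg.cut(quoted))
    except ImportError:
        return False

sentence = '他说：“明天见。”'
print(quoted_is_dialogue(sentence, sentence.index("“"), sentence.index("”")))   # True (rule 1)
```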
The predictive mel-frequency spectrum determining module 16 is specifically configured to splice the style vector, the text content feature vector and the expression mode feature vector based on a weight parameter corresponding to the style vector, a weight parameter corresponding to the text content feature vector and a weight parameter corresponding to the expression mode feature vector to obtain a spliced vector; and determining a predicted Mel frequency spectrum corresponding to the splicing vector based on an attention mechanism.
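A minimal sketch, under assumed weights and dimensions, of the weighted splicing described above; the attention-based decoding of the spliced vector into a Mel frequency spectrum is not shown.

```python
import torch

def splice(style_vec, content_vec, expr_vec, w_style=1.0, w_content=1.0, w_expr=0.5):
    """Scale each vector by its weight parameter and concatenate the results."""
    return torch.cat([w_style * style_vec, w_content * content_vec, w_expr * expr_vec], dim=-1)

spliced = splice(torch.randn(1, 256), torch.randn(1, 256), torch.randn(1, 256))
print(spliced.shape)   # torch.Size([1, 768]); an attention-based decoder would consume this
```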
Further, the embodiment of the application also provides an electronic device. Fig. 9 is a schematic structural diagram of an electronic device 20 according to an exemplary embodiment, and nothing in the figure should be taken as a limitation on the scope of use of the present application.
The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25, and a communication bus 26. The memory 22 is used for storing a computer program, which is loaded and executed by the processor 21 to implement the relevant steps in the speech synthesis model training method and/or the audio generation method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in this embodiment may specifically be a server.
In this embodiment, the power supply 23 is configured to provide a working voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and a communication protocol followed by the communication interface is any communication protocol applicable to the technical solution of the present application, and is not specifically limited herein; the input/output interface 25 is configured to obtain external input data or output data to the outside, and a specific interface type thereof may be selected according to specific application requirements, which is not specifically limited herein.
In addition, the memory 22 is used as a carrier for resource storage, and may be a read-only memory, a random access memory, a magnetic disk or an optical disk, etc., and the resources stored thereon may include the operating system 221, the computer program 222, the training data 223, etc., and the storage manner may be a transient storage or a permanent storage.
The operating system 221 is used for managing and controlling each hardware device and the computer program 222 on the electronic device 20, so that the processor 21 can operate on and process the training data 223 in the memory 22, and it may be Windows Server, Netware, Unix, Linux, and the like. In addition to the computer program for performing the speech synthesis model training method and/or the audio generation method executed by the electronic device 20 disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs for performing other specific tasks.
Further, an embodiment of the present application also discloses a storage medium, in which a computer program is stored, and when the computer program is loaded and executed by a processor, the steps of the speech synthesis model training method and/or the audio generation method disclosed in any of the foregoing embodiments are implemented.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The speech synthesis model training method, the audio generation method, the device and the medium provided by the present application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present application, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as a limitation on the present application.

Claims (11)

1. A method for training a speech synthesis model, comprising:
acquiring a training sample set; the training sample set comprises a text sample, a voice sample corresponding to the text sample and label information, and the label information comprises an expression mode label;
inputting the training sample set to a speech synthesis model;
extracting a text content feature vector and an expression mode feature vector of the text sample;
extracting a voice feature vector of a voice sample corresponding to the text sample, and determining a style vector corresponding to the voice feature vector through a multi-head attention mechanism;
determining a predicted Mel frequency spectrum corresponding to the text sample based on the style vector, the text content feature vector and the expression mode feature vector;
determining a Mel frequency spectrum loss by using the predicted Mel frequency spectrum and a real Mel frequency spectrum corresponding to the voice sample, and determining a style vector loss by using the style vector and the label information;
determining a comprehensive training loss based on the Mel frequency spectrum loss and the style vector loss;
and when the comprehensive training loss is converged, determining the current speech synthesis model as a trained speech synthesis model and determining the current style vector as a trained style vector.
2. The method of training a speech synthesis model according to claim 1, wherein the acquiring a training sample set comprises:
obtaining a long sentence text sample, a single sentence text sample, a voice sample corresponding to the long sentence text sample, a voice sample corresponding to the single sentence text sample and label information to obtain a training sample set;
the long sentence text sample is a text sample containing a plurality of single sentence texts and pause marking information between two adjacent single sentence texts.
3. The method for training a speech synthesis model according to claim 2, wherein the obtaining long sentence text samples and single sentence text samples comprises:
splitting an original text into single sentence texts by using preset punctuations;
determining the symbol type of the ending punctuation of the single sentence text;
performing word segmentation and part-of-speech tagging on the single sentence text to obtain the participles and parts of speech of the single sentence text;
marking the pause grade of the participles in the single sentence text based on the participles and the parts of speech, and marking the pause grade of the tail of the single sentence text based on the symbol type to obtain a single sentence text sample;
and splicing the single-sentence text samples sentence by sentence; in the splicing process, judging whether the number of characters of the current spliced sentence reaches a preset character number threshold; if not, splicing the current single-sentence text sample to be spliced to the spliced sentence; if so, taking the current spliced sentence as a long-sentence text sample and starting to splice the next spliced sentence, until a splicing ending condition is met.
4. The method for training a speech synthesis model according to claim 3, wherein the splitting of the original text into single-sentence texts with the preset punctuation marks further comprises:
rejecting single sentence texts that do not contain a first target character; wherein the first target character comprises Chinese characters, numbers and letters;
and removing second target characters from the remaining single sentence texts; wherein the second target character is a character that does not contain valid information.
5. The method of training a speech synthesis model according to claim 1, wherein obtaining label information comprises:
judging the property of quotation marks in the text sample, and if the property represents dialogue, determining that the expression mode label of the text inside the quotation marks is the dialogue type.
6. The method of claim 5, wherein the judging the property of quotation marks in the text sample comprises:
judging whether a colon exists before the quotation marks, and if so, judging that the property of the quotation marks represents dialogue;
or judging whether appointed characters exist before the quotation marks, and if so, judging that the property of the quotation marks represents dialogue; wherein the appointed characters are characters indicating that the characters following them are of the dialogue type;
or analyzing the parts of speech of the text inside the quotation marks, and if the text inside the quotation marks contains a verb, judging that the property of the quotation marks represents dialogue.
7. The method of claim 1, wherein the determining a predicted Mel frequency spectrum corresponding to the text sample based on the style vector, the text content feature vector and the expression mode feature vector comprises:
splicing the style vector, the text content feature vector and the expression mode feature vector based on a weight parameter corresponding to the style vector, a weight parameter corresponding to the text content feature vector and a weight parameter corresponding to the expression mode feature vector, to obtain a spliced vector;
and determining a predicted Mel frequency spectrum corresponding to the splicing vector based on an attention mechanism.
8. The method of training a speech synthesis model according to claim 1, wherein the determining a comprehensive training loss based on the Mel frequency spectrum loss and the style vector loss comprises:
performing weighted calculation on the Mel frequency spectrum loss and the style vector loss based on a weight parameter corresponding to the Mel frequency spectrum loss and a weight parameter corresponding to the style vector loss, to obtain the comprehensive training loss.
9. A method of audio generation, comprising:
acquiring a target text of a voice to be synthesized and target label information of the target text; wherein the target tag information comprises an expression mode tag;
inputting the target text and the target label information into the trained speech synthesis model obtained by the speech synthesis model training method of any one of claims 1 to 8;
extracting a text content feature vector of the target text, and extracting an expression mode feature vector of the target text based on the expression mode label;
determining a target style vector based on the target label information and a trained style vector corresponding to the trained speech synthesis model;
determining a target predicted Mel frequency spectrum corresponding to the target text based on the target style vector, the text content feature vector and the expression mode feature vector;
and synthesizing corresponding predicted voice by using the target predicted Mel frequency spectrum.
10. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the speech synthesis model training method of any one of claims 1 to 8 and/or the audio generation method of claim 9.
11. A computer-readable storage medium for storing a computer program which, when executed by a processor, implements the speech synthesis model training method of any one of claims 1 to 8 and/or the audio generation method of claim 9.
CN202110937782.1A 2021-08-16 2021-08-16 Speech synthesis model training method, audio generation method, device and medium Pending CN113658577A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110937782.1A CN113658577A (en) 2021-08-16 2021-08-16 Speech synthesis model training method, audio generation method, device and medium

Publications (1)

Publication Number Publication Date
CN113658577A true CN113658577A (en) 2021-11-16

Family

ID=78491083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110937782.1A Pending CN113658577A (en) 2021-08-16 2021-08-16 Speech synthesis model training method, audio generation method, device and medium

Country Status (1)

Country Link
CN (1) CN113658577A (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20080011859A (en) * 2006-08-01 2008-02-11 한국전자통신연구원 Method for predicting sentence-final intonation and text-to-speech system and method based on the same
US20080243508A1 (en) * 2007-03-28 2008-10-02 Kabushiki Kaisha Toshiba Prosody-pattern generating apparatus, speech synthesizing apparatus, and computer program product and method thereof
WO2019214145A1 (en) * 2018-05-10 2019-11-14 平安科技(深圳)有限公司 Text sentiment analyzing method, apparatus and storage medium
CN109065019A (en) * 2018-08-27 2018-12-21 北京光年无限科技有限公司 A kind of narration data processing method and system towards intelligent robot
KR20200063301A (en) * 2018-11-19 2020-06-05 (주)유니즈소프트 Game sound playing system using deep learning voice based on ai
US20200395008A1 (en) * 2019-06-15 2020-12-17 Very Important Puppets Inc. Personality-Based Conversational Agents and Pragmatic Model, and Related Interfaces and Commercial Models
CN110534089A (en) * 2019-07-10 2019-12-03 西安交通大学 A kind of Chinese speech synthesis method based on phoneme and rhythm structure
CN111667811A (en) * 2020-06-15 2020-09-15 北京百度网讯科技有限公司 Speech synthesis method, apparatus, device and medium
CN112309366A (en) * 2020-11-03 2021-02-02 北京有竹居网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112863483A (en) * 2021-01-05 2021-05-28 杭州一知智能科技有限公司 Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
CN112908294A (en) * 2021-01-14 2021-06-04 杭州倒映有声科技有限公司 Speech synthesis method and speech synthesis system
CN113096634A (en) * 2021-03-30 2021-07-09 平安科技(深圳)有限公司 Speech synthesis method, apparatus, server and storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114783405A (en) * 2022-05-12 2022-07-22 马上消费金融股份有限公司 Voice synthesis method and device, electronic equipment and storage medium
CN114783405B (en) * 2022-05-12 2023-09-12 马上消费金融股份有限公司 Speech synthesis method, device, electronic equipment and storage medium
CN114822495A (en) * 2022-06-29 2022-07-29 杭州同花顺数据开发有限公司 Acoustic model training method and device and speech synthesis method
CN115662435A (en) * 2022-10-24 2023-01-31 福建网龙计算机网络信息技术有限公司 Virtual teacher simulation voice generation method and terminal
US11727915B1 (en) 2022-10-24 2023-08-15 Fujian TQ Digital Inc. Method and terminal for generating simulated voice of virtual teacher
CN115547296A (en) * 2022-11-29 2022-12-30 零犀(北京)科技有限公司 Voice synthesis method and device, electronic equipment and storage medium
CN115547296B (en) * 2022-11-29 2023-03-10 零犀(北京)科技有限公司 Voice synthesis method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination