CN113658577B - Speech synthesis model training method, audio generation method, equipment and medium - Google Patents


Info

Publication number
CN113658577B
Authority
CN
China
Prior art keywords
text
vector
sample
training
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110937782.1A
Other languages
Chinese (zh)
Other versions
CN113658577A (en)
Inventor
徐东
陈洲旋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202110937782.1A
Publication of CN113658577A
Application granted
Publication of CN113658577B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a speech synthesis model training method, an audio generation method, a device and a medium, including the following steps: acquiring a training sample set; inputting the training sample set to a speech synthesis model; extracting a text content feature vector and an expression mode feature vector of a text sample; extracting a voice feature vector of a voice sample and determining a corresponding style vector; determining a predicted mel spectrum of the text sample based on the style vector, the text content feature vector and the expression mode feature vector; determining a mel spectrum loss using the predicted mel spectrum and the real mel spectrum of the voice sample, and determining a style vector loss using the style vector and the label information; and determining a comprehensive training loss based on the mel spectrum loss and the style vector loss, and obtaining a trained speech synthesis model and a trained style vector when the comprehensive training loss converges. The ability of the trained speech synthesis model to distinguish between different expression modes can thus be improved, thereby improving the naturalness of the synthesized speech and the user experience.

Description

Speech synthesis model training method, audio generation method, equipment and medium
Technical Field
The present application relates to the field of speech synthesis technologies, and in particular, to a speech synthesis model training method, an audio generation method, a device, and a medium.
Background
With the development of deep neural network technology, more and more powerful acoustic models and vocoders have appeared in the field of speech synthesis; the former generate mel spectrums from text sequences, and the latter generate high-quality speech from the mel spectrums. At present, in the field of speech synthesis, it is difficult for existing model training to distinguish well between different expression modes, such as narration (aside) and dialogue, so the naturalness of the synthesized speech is low and the user experience is poor. In summary, in the process of implementing the present invention, the inventors found that the speech synthesis model obtained by training in the prior art at least has the problems that different expression modes are difficult to distinguish, the naturalness of the synthesized speech is low, and the user experience is poor.
Disclosure of Invention
Accordingly, the present application aims to provide a speech synthesis model training method, an audio generation method, a device and a medium, which can improve the ability of the trained speech synthesis model to distinguish between different expression modes, thereby improving the naturalness of the synthesized speech and the user experience. The specific scheme is as follows:
In a first aspect, the present application provides a method for training a speech synthesis model, including:
Acquiring a training sample set; the training sample set comprises a text sample, a voice sample corresponding to the text sample and label information, wherein the label information comprises an expression mode label;
inputting the training sample set to a speech synthesis model;
Extracting a text content feature vector and an expression mode feature vector of the text sample;
Extracting a voice characteristic vector of a voice sample corresponding to the text sample, and determining a style vector corresponding to the voice characteristic vector through a multi-head attention mechanism;
Determining a predicted mel frequency spectrum corresponding to the text sample based on the style vector, the text content feature vector and the expression mode feature vector;
Determining a mel spectrum loss by using the predicted mel spectrum and a real mel spectrum corresponding to the voice sample, and determining a style vector loss by using the style vector and the tag information;
determining a composite training loss based on the mel spectrum loss and the style vector loss;
when the comprehensive training loss converges, determining the current speech synthesis model as a trained speech synthesis model and determining the current style vector as a trained style vector.
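The steps above can be read as a single optimization loop. The snippet below is a minimal PyTorch-style sketch of one training step under these steps; the model interface (text_encoder, expression_encoder, speech_encoder, gst, decoder), the batch field names and the loss weights are illustrative assumptions rather than the patent's actual implementation.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, batch, w_mel=1.0, w_style=1.0):
    # Extract the text content feature vector and the expression mode feature vector.
    text_vec = model.text_encoder(batch["text_ids"])
    expr_vec = model.expression_encoder(batch["expr_labels"])

    # Extract the voice feature vector and derive the style vector via multi-head attention.
    speech_vec = model.speech_encoder(batch["ref_mel"])
    style_vec, style_logits = model.gst(speech_vec)

    # Predict the mel spectrum from the style, text content and expression mode vectors.
    mel_pred = model.decoder(style_vec, text_vec, expr_vec)

    # Mel spectrum loss against the real mel spectrum; style vector loss against the labels.
    mel_loss = F.l1_loss(mel_pred, batch["mel_target"])
    style_loss = F.cross_entropy(style_logits, batch["style_labels"])

    # Comprehensive training loss; training stops once this converges.
    loss = w_mel * mel_loss + w_style * style_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```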
Optionally, the acquiring a training sample set includes:
Obtaining a long sentence text sample, a single sentence text sample, a voice sample corresponding to the long sentence text sample, a voice sample corresponding to the single sentence text sample and label information, and obtaining a training sample set;
The long sentence text sample is a text sample containing a plurality of single sentence texts and pause labeling information between two adjacent single sentence texts.
Optionally, the obtaining long sentence text samples and single sentence text samples includes:
Splitting the original text into single sentence text by using a preset punctuation mark;
determining the symbol type of the ending punctuation symbol of the single sentence text;
performing word segmentation and part-of-speech tagging on the single sentence text to obtain word segmentation and part-of-speech tagging of the single sentence text;
Labeling the pause level of the segmentation in the single sentence text based on the segmentation and the part of speech, and labeling the pause level of the end of the single sentence text based on the symbol type, so as to obtain a single sentence text sample;
And splicing the single sentence text samples sentence by sentence, and judging in the splicing process whether the number of characters of the current spliced sentence reaches a preset character-number threshold; if not, splicing the next single sentence text sample to be spliced onto the spliced sentence until the number of characters of the current spliced sentence reaches the preset character-number threshold, then taking the current spliced sentence as a long sentence text sample and starting to splice the next spliced sentence, until the splicing end condition is met.
Optionally, after splitting the original text into single sentence text with the preset punctuation mark, the method further includes:
Rejecting single sentence texts that do not contain a first target character; wherein the first target characters comprise Chinese characters, numbers and letters;
Rejecting second target characters in the remaining single sentence texts; the second target characters are characters that do not carry effective information.
Optionally, acquiring tag information includes:
Judging the property of quotation marks in the text sample, and if the property is representing dialogue, determining that the expression mode label of the text within the quotation marks is the dialogue type.
Optionally, the judging the property of the quotation marks in the sentence text includes:
Judging whether a colon exists before the quotation marks, and if so, judging that the property of the quotation marks is representing dialogue;
Or judging whether designated characters exist before the quotation marks, and if so, judging that the property of the quotation marks is representing dialogue; the designated characters are characters indicating that the text following them is dialogue;
Or analyzing the parts of speech of the text within the quotation marks, and judging that the property of the quotation marks is representing dialogue if the text within the quotation marks includes a verb.
Optionally, the determining the predicted mel spectrum corresponding to the text sample based on the style vector, the text content feature vector and the expression feature vector includes:
Based on the weight parameter corresponding to the style vector, the weight parameter corresponding to the text content feature vector and the weight parameter corresponding to the expression mode feature vector, splicing the style vector, the text content feature vector and the expression mode feature vector to obtain a spliced vector;
and determining a predicted mel frequency spectrum corresponding to the spliced vector based on an attention mechanism.
Optionally, the determining the comprehensive training loss based on the mel spectrum loss and the style vector loss includes:
and carrying out weighted calculation on the Mel spectrum loss and the style vector loss based on the weight parameters corresponding to the Mel spectrum loss and the weight parameters corresponding to the style vector loss to obtain comprehensive training loss.
In a second aspect, the present application discloses an audio generation method, including:
Acquiring a target text of a voice to be synthesized and target label information of the target text; the target label information comprises an expression mode label;
inputting the target text and the target tag information into the trained voice synthesis model;
extracting a text content feature vector of the target text, and extracting an expression mode feature vector of the target text based on the expression mode label;
Determining a target style vector based on the target label information and the trained style vector corresponding to the trained speech synthesis model;
Determining a target predicted Mel frequency spectrum corresponding to the target text based on the target style vector, the text content feature vector and the expression mode feature vector;
and synthesizing corresponding predicted voice by utilizing the target predicted Mel frequency spectrum.
In a third aspect, the present application discloses an electronic device, comprising:
a memory for storing a computer program;
And a processor, configured to execute the computer program to implement the foregoing speech synthesis model training method and/or the foregoing audio generation method.
In a fourth aspect, the present application discloses a computer readable storage medium storing a computer program which, when executed by a processor, implements the foregoing speech synthesis model training method and/or the foregoing audio generation method.
Therefore, in the present application, a training sample set is first acquired, where the training sample set comprises text samples, voice samples corresponding to the text samples and label information, and the label information comprises expression mode labels; the training sample set is input to a speech synthesis model; text content feature vectors and expression mode feature vectors of the text samples are extracted; voice feature vectors of the voice samples corresponding to the text samples are extracted, and style vectors corresponding to the voice feature vectors are determined through a multi-head attention mechanism; predicted mel spectrums corresponding to the text samples are then determined based on the style vectors, the text content feature vectors and the expression mode feature vectors; a mel spectrum loss is determined using the predicted mel spectrums and the real mel spectrums corresponding to the voice samples, and a style vector loss is determined using the style vectors and the label information; a comprehensive training loss is determined based on the mel spectrum loss and the style vector loss; and when the comprehensive training loss converges, the current speech synthesis model is determined as the trained speech synthesis model and the current style vector is determined as the trained style vector. The present application trains the speech synthesis model using the label information comprising the expression mode labels, the text samples and the voice samples; in the training process, the text content feature vector and the expression mode feature vector are extracted, the predicted mel spectrum is determined using the style vector corresponding to the voice sample together with the text content feature vector and the expression mode feature vector, and the losses are then determined; when the losses converge, the trained speech synthesis model is obtained. In this way, the expression mode features are taken into account in the training process, so the ability of the trained speech synthesis model to distinguish between different expression modes can be improved, thereby improving the naturalness of the synthesized speech and the user experience.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a system framework to which the disclosed speech synthesis model training scheme is applied;
FIG. 2 is a flow chart of a method for training a speech synthesis model according to the present disclosure;
FIG. 3 is a schematic diagram of a specific speech synthesis model training method disclosed in the present application;
FIG. 4 is a flowchart of a specific speech synthesis model training method disclosed in the present application;
FIG. 5 is a flowchart of a specific training sample set acquisition process according to the present disclosure;
FIG. 6 is a flowchart of a specific speech synthesis model training method disclosed in the present application;
FIG. 7 is a schematic diagram of a specific speech synthesis model prediction according to the present disclosure;
FIG. 8 is a schematic diagram of a speech synthesis model training device according to the present disclosure;
Fig. 9 is a block diagram of an electronic device according to the present disclosure.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
At present, in the field of speech synthesis, it is difficult for existing model training to distinguish well between different expression modes, such as narration (aside) and dialogue, so the naturalness of the synthesized speech is low and the user experience is poor. Therefore, the present application provides a speech synthesis model training scheme, which can improve the ability of the trained speech synthesis model to distinguish between different expression modes, thereby improving the naturalness of the synthesized speech and the user experience.
In the speech synthesis model training scheme of the present application, the system framework used may be as shown in FIG. 1, and may specifically include a background server and a plurality of user terminals communicatively connected to the background server. The user terminals include, but are not limited to, tablet computers, notebook computers, smart phones and personal computers (PCs). The background server may be a cloud server or a non-cloud server.
In the present application, the steps executed by the background server include: acquiring a training sample set, where the training sample set comprises a text sample, a voice sample corresponding to the text sample and label information, and the label information comprises an expression mode label; inputting the training sample set to a speech synthesis model; extracting a text content feature vector and an expression mode feature vector of the text sample; extracting a voice feature vector of the voice sample corresponding to the text sample, and determining a style vector corresponding to the voice feature vector through a multi-head attention mechanism; determining a predicted mel spectrum corresponding to the text sample based on the style vector, the text content feature vector and the expression mode feature vector; determining a mel spectrum loss using the predicted mel spectrum and the real mel spectrum corresponding to the voice sample, and determining a style vector loss using the style vector and the label information; determining a comprehensive training loss based on the mel spectrum loss and the style vector loss; and when the comprehensive training loss converges, determining the current speech synthesis model as the trained speech synthesis model and determining the current style vector as the trained style vector.
The user side is used for transmitting text content which is designated by a user and needs to be subjected to voice synthesis to the background server, so that when the background server acquires the text content, a trained voice synthesis model and a trained style vector are utilized to determine a predicted Mel frequency spectrum of the text content, and then voice is synthesized, and the voice is transmitted to the user side to be played.
Referring to fig. 2, the embodiment of the application discloses a training method for a speech synthesis model, which comprises the following steps:
Step S11: acquiring a training sample set; the training sample set comprises a text sample, a voice sample corresponding to the text sample and label information, and the label information comprises an expression mode label.
In a specific embodiment, the expression mode label may be of the dialogue type; that is, text content in the text sample that is expressed as dialogue is labeled, so as to obtain an expression mode label of the dialogue type. Further, text content in the text sample that is expressed as non-dialogue may be labeled to obtain an expression mode label of the narration type, or the narration may be left unlabeled. For example, text content whose expression mode is dialogue may be labeled with 1 as a dialogue-type label, and text content whose expression mode is non-dialogue may be labeled with 0 as a narration-type label; that is, in the embodiment of the present application, the expression mode of non-dialogue text content may be determined as narration. Of course, the narration may also be left unlabeled and only text content whose expression mode is dialogue labeled, so that it can still be distinguished whether the expression mode of a piece of text content is narration or dialogue.
In addition, in a specific embodiment, the tag information may further include a speaker tag, an emotion tag, a speech rate tag, and the like.
In a specific embodiment, the specific process of acquiring the tag information includes:
Judging the property of quotation marks in the text sample, and if the property is representing dialogue, determining that the expression mode label of the text within the quotation marks is the dialogue type.
Further, in a specific embodiment, it may be determined whether a colon exists before the quotation marks, and if so, the property of the quotation marks is judged to be representing dialogue; or it may be determined whether designated characters exist before the quotation marks, and if so, the property of the quotation marks is judged to be representing dialogue, where the designated characters are characters indicating that the text following them is dialogue; or the parts of speech of the text within the quotation marks may be analyzed, and the property of the quotation marks is judged to be representing dialogue if the text within the quotation marks includes a verb.
It should be noted that, besides representing dialogue, the property of quotation marks may also be emphasis, a special term, and so on. Generally, for quotation marks representing dialogue there is a short pause before the utterance starts, whereas quotation marks indicating emphasis are not followed by a long pause. Therefore, it is desirable to determine as accurately as possible whether quotation marks represent an utterance. This embodiment may make the judgment in the above three ways, but the specific ways of judging the property of quotation marks include, but are not limited to, these three. In a specific embodiment, it may first be determined whether a colon exists before the quotation marks; if so, the quotation marks are judged to represent dialogue. If not, it is determined whether the text before the quotation marks contains a designated character of the speech-verb type, such as characters meaning "say", "state" or "tell"; if so, the quotation marks are judged to represent dialogue. If not, the property of the quotation marks is judged by analyzing the parts of speech of the text within the quotation marks: if only nouns are present, the quotation marks are judged to indicate emphasis, and if a verb is present, they are judged to be of the dialogue type.
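As a rough illustration of these three checks, the helper below applies them in the order just described; the speech-verb list, the use of jieba part-of-speech tags (verb tags start with "v") and the function shape are assumptions of this sketch, not the patent's exact rules.

```python
import jieba.posseg as pseg  # pip install jieba

# Assumed list of "designated characters" that typically introduce dialogue.
SPEECH_VERBS = ("说", "道", "问", "答", "告诉")

def quote_is_dialogue(before_quote: str, inside_quote: str) -> bool:
    before = before_quote.rstrip()
    # Check 1: a colon right before the opening quotation mark.
    if before.endswith(("：", ":")):
        return True
    # Check 2: a designated speech verb right before the quotation mark.
    if before.endswith(SPEECH_VERBS):
        return True
    # Check 3: the quoted text contains a verb (jieba verb tags start with "v");
    # quoted text containing only nouns is treated as emphasis, not dialogue.
    return any(flag.startswith("v") for _, flag in pseg.cut(inside_quote))
```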
Step S12: the training sample set is input to a speech synthesis model.
Step S13: and extracting the character content characteristic vector and the expression characteristic vector of the text sample.
That is, the text feature vectors extracted by the speech synthesis model in the encoding stage include a text content feature vector and an expression mode feature vector; the text content feature vector represents the content information of the text, i.e., what textual information is specifically expressed, and the expression mode feature vector represents the expression mode of the text, i.e., whether it is expressed in the form of narration or in the form of dialogue.
Step S14: and extracting a voice characteristic vector of the voice sample corresponding to the text sample, and determining a style vector corresponding to the voice characteristic vector through a multi-head attention mechanism.
In a specific embodiment, tokens of the voice feature vector in different information dimensions are obtained through a multi-head attention mechanism, and then weighting calculation is carried out on each token to obtain a style vector corresponding to the voice feature vector.
It should be noted that the speech feature vector contains various kinds of information about the voice sample. The tokens of the speech feature vector in different dimensions, obtained through the multi-head attention mechanism, are equivalent to branch vectors of the speech in various dimensions; they represent information such as pause, timbre, semantics and emotion, and can be combined by weighting to obtain the style vector.
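A minimal sketch of this token weighting is given below: the utterance-level speech feature vector queries a bank of learnable style tokens through multi-head attention, and the attention-weighted combination of the tokens is taken as the style vector. The token count, dimensions and the returned per-token weights are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StyleTokenLayer(nn.Module):
    def __init__(self, dim=256, num_tokens=10, num_heads=4):
        super().__init__()
        # Learnable style tokens; after training each token tends to capture one
        # prosodic aspect (pause, timbre, emotion, ...).
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, speech_vec):
        # speech_vec: (batch, dim) utterance-level embedding from the speech encoder.
        query = speech_vec.unsqueeze(1)                                   # (batch, 1, dim)
        keys = self.tokens.unsqueeze(0).expand(speech_vec.size(0), -1, -1)
        style, weights = self.attn(query, keys, keys)                     # weighted token combination
        return style.squeeze(1), weights.squeeze(1)                       # style vector, token weights
```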
Step S15: and determining a predicted Mel frequency spectrum corresponding to the text sample based on the style vector, the text content feature vector and the expression mode feature vector.
In a specific embodiment, the style vector, the text content feature vector and the expression feature vector can be spliced based on the weight parameter corresponding to the style vector, the weight parameter corresponding to the text content feature vector and the weight parameter corresponding to the expression feature vector to obtain a spliced vector; and determining a predicted mel frequency spectrum corresponding to the spliced vector based on an attention mechanism.
The weight parameters corresponding to the style vectors, the weight parameters corresponding to the character content feature vectors and the weight parameters corresponding to the expression mode feature vectors are all learnable parameters, and are updated in the training process.
Of course, in some embodiments, the text content feature vector and the expression mode feature vector may also be directly spliced with the style vector, without the weight parameters, to obtain the spliced vector.
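A minimal sketch of the weighted splicing described in the preceding step is shown below, with one learnable scalar weight per vector; treating the weights as scalars, broadcasting the utterance-level style vector over the text time axis, and the tensor shapes are all assumptions of this sketch.

```python
import torch
import torch.nn as nn

class WeightedSplice(nn.Module):
    def __init__(self):
        super().__init__()
        # Learnable weight parameters, updated during training.
        self.w_style = nn.Parameter(torch.tensor(1.0))
        self.w_text = nn.Parameter(torch.tensor(1.0))
        self.w_expr = nn.Parameter(torch.tensor(1.0))

    def forward(self, style_vec, text_vec, expr_vec):
        # text_vec, expr_vec: (batch, time, dim); style_vec: (batch, dim).
        steps = text_vec.size(1)
        style_seq = (self.w_style * style_vec).unsqueeze(1).expand(-1, steps, -1)
        # The spliced vector is then fed to the attention mechanism / decoder.
        return torch.cat([style_seq, self.w_text * text_vec, self.w_expr * expr_vec], dim=-1)
```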
Step S16: determining a mel spectrum loss by using the predicted mel spectrum and a real mel spectrum corresponding to the speech sample, and determining a style vector loss by using the style vector and the tag information.
In particular embodiments, the model parameters may be updated with the mel spectrum loss and the style vector loss.
Step S17: determining a composite training loss based on the mel spectrum loss and the style vector loss.
In a specific embodiment, the mel spectrum loss and the style vector loss may be weighted based on a weight parameter corresponding to the mel spectrum loss and a weight parameter corresponding to the style vector loss, so as to obtain a comprehensive training loss.
The weight parameters corresponding to the mel spectrum loss and the weight parameters corresponding to the style vector loss may be parameters configured in advance according to experience, or may be learnable parameters.
In another specific embodiment, the mel-spectrum loss and the style vector loss may be directly added to obtain a composite training loss.
It should be noted that the losses used to determine the comprehensive training loss include, but are not limited to, the mel spectrum loss and the style vector loss; other losses calculated according to actual requirements may also be included.
Step S18: when the comprehensive training loss converges, determining the current speech synthesis model as a trained speech synthesis model and determining the current style vector as a trained style vector.
For example, referring to FIG. 3, an embodiment of the present application discloses a specific speech synthesis model training method. The speech synthesis model includes a speech encoder, a GST (Global Style Token) module, a text encoder, an expression mode encoder, an attention mechanism and a decoder. First, the text content feature vector and the expression mode feature vector of the text are extracted through the text encoder and the expression mode encoder respectively; the speech feature vector of the speech is extracted through the speech encoder, tokens of the speech vector in different dimensions are obtained through the multi-head attention mechanism of the GST module, and the token results are then combined by weighting to obtain the style vector. Then the accuracy of the style vector is evaluated: the style vector loss is determined from the label information and the current style vector. The predicted mel spectrum effect is also evaluated: a mel spectrum is extracted from the input training speech as the real mel spectrum, and the difference between the mel spectrum predicted by the model and the real mel spectrum is calculated to obtain the mel spectrum loss. The style vector loss and the mel spectrum loss are fed back to the speech synthesis model to adjust the weight parameters during model training until the predicted effect is nearly consistent with the real effect. The predicted mel spectrum is the processing result obtained by the decoder under the conditional control of the input text vector and the style vector, and represents the output of the acoustic model; that is, the predicted mel spectrum is the mel spectrum determined based on the style vector, the text content feature vector and the expression mode feature vector.
It should be noted that the GST module is added on the basis of the end-to-end Tacotron speech synthesis model so that the synthesized speech has more prosodic expressiveness. During training, with a large amount of paired text/speech data, text vectors such as phonemes/tones are extracted from the text and prosody vectors are extracted from the speech; a style vector is learned from the prosody vectors using a multi-head attention mechanism, and the style vector is spliced with the text vectors and then fed into the attention mechanism model. After training, the GST module has extracted global style features of the audio in the data set, such as prosodic pause information, and stored them in the style vector, so the obtained style vector can be used when performing speech synthesis. A prosodic pause is the pause duration of characters, words and sentences in the text after the text is vocalized. On the basis of the Tacotron model and the GST module, the embodiment of the present application further introduces the expression mode encoder for distinguishing narration from dialogue during training, so that the model can model narration and dialogue in the training stage and distinguish between them, which better conforms to everyday expression habits and makes the synthesized speech more natural.
Therefore, in the present application, a training sample set is first acquired, where the training sample set comprises text samples, voice samples corresponding to the text samples and label information, and the label information comprises expression mode labels; the training sample set is input to a speech synthesis model; text content feature vectors and expression mode feature vectors of the text samples are extracted; voice feature vectors of the voice samples corresponding to the text samples are extracted, and style vectors corresponding to the voice feature vectors are determined through a multi-head attention mechanism; predicted mel spectrums corresponding to the text samples are then determined based on the style vectors, the text content feature vectors and the expression mode feature vectors; a mel spectrum loss is determined using the predicted mel spectrums and the real mel spectrums corresponding to the voice samples, and a style vector loss is determined using the style vectors and the label information; a comprehensive training loss is determined based on the mel spectrum loss and the style vector loss; and when the comprehensive training loss converges, the current speech synthesis model is determined as the trained speech synthesis model and the current style vector is determined as the trained style vector. The present application trains the speech synthesis model using the label information comprising the expression mode labels, the text samples and the voice samples; in the training process, the text content feature vector and the expression mode feature vector are extracted, the predicted mel spectrum is determined using the style vector corresponding to the voice sample together with the text content feature vector and the expression mode feature vector, and the losses are then determined; when the losses converge, the trained speech synthesis model is obtained. In this way, the expression mode features are taken into account in the training process, so the ability of the trained speech synthesis model to distinguish between different expression modes can be improved, thereby improving the naturalness of the synthesized speech and the user experience.
Referring to fig. 4, the embodiment of the application discloses a specific speech synthesis model training method, which comprises the following steps:
Step S21: and acquiring a long sentence text sample, a single sentence text sample, a voice sample corresponding to the long sentence text sample, a voice sample corresponding to the single sentence text sample and label information, and obtaining a training sample set.
The long sentence text sample is a text sample containing a plurality of single sentence texts and pause labeling information between two adjacent single sentence texts.
Referring to fig. 5, an embodiment of the present application discloses a specific training sample set obtaining flowchart, and in a specific implementation manner, a specific process of obtaining a long sentence text sample and a single sentence text sample includes:
step S31: splitting the original text into single sentence text by using a preset punctuation mark.
The original text is unprocessed text; its type includes, but is not limited to, novels, news information, dialogues and the like, and it may be long or short. The preset punctuation marks may include commas, periods, exclamation marks, ellipses, and the like.
That is, in the embodiment of the present application, the original text may be split using the punctuation marks between sentences as separators, so that each fragment ending with such a punctuation mark becomes a single sentence text; for example, a greeting consisting of two clauses separated by a comma is split at the comma into two single sentence texts.
In a specific embodiment, the single sentence text without the first target character may be culled; wherein the first target character comprises Chinese characters, numbers and letters.
Further, rejecting the second target characters in the rest of the single sentence text; the second target character is a character which does not contain effective information.
That is, invalid characters can be rejected. After the split single sentence texts are obtained, it is determined whether each single sentence text contains a first target character; texts that do not contain any first target character (for example, a fragment consisting only of punctuation or other symbols) are rejected. Invalid characters in the remaining single sentence texts, i.e., characters that carry no information useful for synthesizing speech, are then rejected. After the invalid texts and the invalid characters in the remaining single sentence texts are rejected, subsequent processing is performed on the current single sentence texts, so that the influence of invalid texts and invalid characters can be avoided.
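The two rejection steps can be illustrated with simple regular expressions; the exact character classes (CJK range, kept punctuation) are assumptions of this sketch.

```python
import re

# "First target characters": Chinese characters, digits and letters.
FIRST_TARGET = re.compile(r"[\u4e00-\u9fffA-Za-z0-9]")
# "Second target characters": anything that is neither a first target character
# nor common punctuation, i.e. characters carrying no effective information.
SECOND_TARGET = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9，。！？：；、“”…,.!?:; ]")

def clean_single_sentences(sentences):
    kept = [s for s in sentences if FIRST_TARGET.search(s)]   # reject texts with no first target character
    return [SECOND_TARGET.sub("", s) for s in kept]           # strip second target characters
```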
In addition, in the embodiment of the present application, the specific process of acquiring the tag information may include: judging the property of the quotation marks in the single sentence text, and if the property is the representation dialogue, determining that the expression mode label of the text in the quotation marks is the dialogue type.
That is, the embodiment of the application can judge the quotation mark property in the single sentence text after splitting the single sentence text and eliminating the invalid text and the invalid characters. The specific judging manner may refer to the disclosure of the foregoing embodiment, and will not be described herein.
Step S32: and determining the symbol type of the ending punctuation symbol of the single sentence text.
It should be noted that the end punctuation of a single sentence text is typically the punctuation between two adjacent single sentences, and the symbol type of the end punctuation is determined to be useful for inter-sentence pause processing.
Step S33: and performing word segmentation and part-of-speech tagging on the single sentence text to obtain the word segmentation and part-of-speech of the single sentence text.
It should be noted that word segmentation refers to separating the words and phrases in each sentence, and part-of-speech tagging refers to predicting the part of speech of each word. For example, the sentence "你吃饭了吗？" ("Have you eaten?") is segmented into "你 / 吃饭 / 了 / 吗 / ？", where "吃饭" ("eat") is a single word; part-of-speech tagging then gives the part of speech of each word or phrase after segmentation, for example "你" ("you") is a pronoun. In a specific embodiment, a word segmentation tool such as "jieba" may be used to obtain the segmentation of the single sentence text, including the words and phrases in the sentence and their parts of speech.
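Segmentation and part-of-speech tagging with jieba follow its standard documented usage, as sketched below; the example sentence is illustrative.

```python
import jieba.posseg as pseg  # pip install jieba

for word, flag in pseg.cut("你吃饭了吗？"):
    # Prints each word or phrase with its part-of-speech tag
    # (in jieba's tag set pronouns are tagged "r", verbs "v", and so on).
    print(word, flag)
```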
Step S34: and labeling the pause level of the segmentation in the single sentence text based on the segmentation and the part of speech, and labeling the pause level of the end of the single sentence text based on the symbol type, so as to obtain a single sentence text sample.
In a specific embodiment, pause levels may be divided according to pause length; for example, four levels represented by #1, #2, #3 and #4. For the segmentation of a single sentence text, prosodic words, prosodic phrases and intonation phrases are labeled "#1", "#2" and "#3" respectively. By judging the punctuation marks between sentences, a corresponding pause level is given to the text: the higher the level, the longer the pause. A comma pause is shorter and a period pause is longer, so when there is a comma between two sentences the end of the first sentence is labeled "#3", and when there is a period between two sentences the end of the first sentence is labeled "#4". With this labeling scheme, the inter-sentence pauses of the text are controllable, achieving the effect that pauses can be made longer or shorter as required.
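A minimal sketch of the end-of-sentence pause labelling is given below; the mapping from punctuation to "#3"/"#4" follows the comma/period example above, and the punctuation sets themselves are assumptions.

```python
SHORT_PAUSE = ("，", ",", "、", "；", ";")            # comma-like endings: shorter pause
LONG_PAUSE = ("。", "！", "？", "…", ".", "!", "?")   # period-like endings: longer pause

def label_sentence_end(sentence: str) -> str:
    if sentence.endswith(LONG_PAUSE):
        return sentence + "#4"
    if sentence.endswith(SHORT_PAUSE):
        return sentence + "#3"
    return sentence
```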
Step S35: And splicing the single sentence text samples sentence by sentence, and judging in the splicing process whether the number of characters of the current spliced sentence reaches a preset character-number threshold; if not, splicing the next single sentence text sample to be spliced onto the spliced sentence until the number of characters of the current spliced sentence reaches the preset character-number threshold, then taking the current spliced sentence as a long sentence text sample and starting to splice the next spliced sentence, until the splicing end condition is met.
The splicing end condition can be that all single sentence text samples are spliced, or the number of long sentence text samples reaches a preset number, so that sufficient samples are obtained, and the splicing is ended.
In this way, long sentence text samples whose number of characters is close to the preset character-number threshold are obtained. Compared with single sentence texts, the spliced long sentence text samples carry prosodic information about the pauses between sentences, so inter-sentence pauses can be better reflected.
In addition, in a specific embodiment, if the number of characters in the single sentence text is greater than a preset threshold, that is, the single sentence text is too long, the single sentence text may be discarded or split.
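Putting the splicing rule together, a minimal sketch could look as follows; the character-number threshold and the handling of overlong single sentences are assumptions consistent with the description above.

```python
def build_long_samples(single_samples, max_chars=100):
    long_samples, current = [], ""
    for sample in single_samples:
        if len(sample) > max_chars:
            continue                          # overlong single sentence: discard (or split) it
        current += sample                     # splice the next single sentence sample
        if len(current) >= max_chars:         # character count reached the threshold
            long_samples.append(current)      # take the spliced sentence as a long sentence sample
            current = ""                      # start splicing the next spliced sentence
    if current:
        long_samples.append(current)          # leftover spliced sentence at the end condition
    return long_samples
```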
Step S22: the training sample set is input to a speech synthesis model.
Step S23: and extracting the character content characteristic vector and the expression characteristic vector of the text sample.
Step S24: and extracting a voice characteristic vector of the voice sample corresponding to the text sample, and determining a style vector corresponding to the voice characteristic vector through a multi-head attention mechanism.
Step S25: and determining a predicted Mel frequency spectrum corresponding to the text sample based on the style vector, the text content feature vector and the expression mode feature vector.
Step S26: determining a mel spectrum loss by using the predicted mel spectrum and a real mel spectrum corresponding to the speech sample, and determining a style vector loss by using the style vector and the tag information.
Step S27: and determining a comprehensive training loss based on the mel spectrum loss and the style vector loss.
The specific implementation manner of the above steps S23 to S27 may refer to the disclosure of the foregoing embodiment, and will not be described herein.
Step S28: and judging whether the comprehensive training loss is converged or not.
Step S29: if yes, determining the current speech synthesis model as a trained speech synthesis model, and determining the current style vector as a trained style vector.
Otherwise, the speech synthesis model is updated with the comprehensive training loss, and additional text samples, speech samples, and tag information are determined from the training sample set, and the above steps S23 to S28 are performed.
That is, the embodiment of the application trains the speech synthesis model by using the training sample set, determines the comprehensive training loss in the training process, and determines the current speech synthesis model as the trained speech synthesis model and the current style vector as the trained style vector when the comprehensive training loss converges.
Therefore, the training sample set obtained in the embodiment of the present application includes long sentence text samples, single sentence text samples, the voice samples corresponding to the long sentence text samples and the voice samples corresponding to the single sentence text samples; it therefore contains rich prosodic pause information both within single sentences and between single sentences, so that the model can better learn prosodic pause information and the performance of the model is improved.
Referring to fig. 6, the embodiment of the application discloses a specific audio generation method, which comprises the following steps:
Step S41: Acquiring a target text of a voice to be synthesized and target label information of the target text; the target label information comprises an expression mode label.
The process of obtaining the expression label may refer to the disclosure of the foregoing embodiment, and will not be described herein. The target tag information may also include emotion tags, speed tags, and the like.
Step S42: and inputting the target text and the target label information into the trained voice synthesis model disclosed by the embodiment of the application.
Step S43: extracting text content feature vectors of the target text, and extracting expression feature vectors of the target text based on the expression labels;
Step S44: determining a target style vector based on the target label information and the trained style vector corresponding to the trained speech synthesis model;
Step S45: and determining a target predicted Mel frequency spectrum corresponding to the target text based on the target style vector, the text content feature vector and the expression mode feature vector.
In a specific embodiment, the text content feature vector, the expression mode feature vector and the target style vector may be spliced to obtain a spliced vector; and determining a target prediction Mel frequency spectrum corresponding to the spliced vector based on an attention mechanism.
For example, referring to fig. 7, an embodiment of the present application discloses a specific speech synthesis model prediction schematic.
The target text and the target label information are input; the text encoder extracts the text content feature vector, and the expression mode encoder extracts the expression mode feature vector based on the expression mode label; the target style vector is determined using the trained style vector and the target label information; the target style vector, the text content feature vector and the expression mode feature vector are spliced to obtain a spliced vector; and the predicted mel spectrum corresponding to the spliced vector is obtained through the attention mechanism and the decoder.
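A minimal sketch of this prediction flow, reusing the training-time modules, is given below; the module and argument names, and looking the target style vector up from the trained style vectors by label, are assumptions of this sketch.

```python
import torch

@torch.no_grad()
def synthesize_mel(model, trained_style_vectors, text_ids, expr_labels, target_tags):
    text_vec = model.text_encoder(text_ids)                  # text content feature vector
    expr_vec = model.expression_encoder(expr_labels)         # expression mode feature vector
    style_vec = trained_style_vectors[target_tags]           # target style vector from the trained style vectors
    spliced = model.splice(style_vec, text_vec, expr_vec)    # spliced vector
    return model.decoder(spliced)                            # target predicted mel spectrum
```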
It should be noted that, for model training, in a specific embodiment, training sample sets corresponding to multiple speakers may be obtained; and respectively inputting the corresponding speech synthesis model to train by using the training sample set corresponding to each speaker to obtain a trained speech synthesis model corresponding to each speaker. Therefore, when the audio is generated, the target text and the speaker information input by the user are acquired, the corresponding trained voice synthesis model is determined according to the speaker information, the target label information of the target text is determined, the target label information comprises a expression mode label and the like, the speaker label is not needed, and the target text and the target label information are input into the trained voice synthesis model so as to generate the voice corresponding to the speaker information.
In another specific embodiment, a training sample set corresponding to a plurality of speakers may be obtained, and the training sample set corresponding to the plurality of speakers is input to the same post-training speech synthesis model for training, so as to obtain a post-training speech synthesis model corresponding to each speaker. Since the same model is trained using training sample sets corresponding to a plurality of speakers, the training sample sets include speaker tags. And when the audio is generated, acquiring target text and speaker information input by a user, determining target tag information, wherein the target tag information comprises a speaker tag determined by using the speaker information, and inputting the target tag information and the target text into a trained speech synthesis model so as to generate speech corresponding to the speaker information.
Of course, in some embodiments, the trained speech synthesis model may also be used without the trained style vector: a style vector is generated based on the target label information, and the target predicted mel spectrum is determined based on the generated style vector, the text content feature vector and the expression mode feature vector.
Step S46: and synthesizing corresponding predicted voice by utilizing the target predicted Mel frequency spectrum.
In particular embodiments, the corresponding predicted speech may be synthesized by phase prediction or a neural network vocoder.
Methods of phase prediction include, but are not limited to, the Griffin-Lim algorithm (Griffin-Lim signal estimation algorithm), which predicts phase information from an input spectrum (an amplitude spectrum without phase information) and then iteratively reduces the difference between the spectrum of the inverse Fourier transform corresponding to the predicted phase and the input spectrum, so as to obtain the final predicted speech signal. In the neural network vocoder scheme, the relationship between the spectrum and the speech is established through a deep neural network for prediction; the output speech quality is higher, but the algorithm complexity is also higher.
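As one example of the phase-prediction route, librosa ships a Griffin-Lim implementation; the sketch below inverts a mel spectrogram with it. The sample rate, FFT size and iteration count are assumptions, and a neural vocoder would replace this step when higher quality is needed.

```python
import numpy as np
import librosa
import soundfile as sf

def mel_to_wav(mel: np.ndarray, sr: int = 22050, n_fft: int = 1024, n_iter: int = 60) -> np.ndarray:
    # Map the mel spectrogram back to a linear magnitude spectrogram, then
    # estimate the missing phase iteratively with Griffin-Lim.
    linear = librosa.feature.inverse.mel_to_stft(mel, sr=sr, n_fft=n_fft)
    return librosa.griffinlim(linear, n_iter=n_iter)

# Usage (assuming `predicted_mel` is the model output as a NumPy array):
# sf.write("predicted.wav", mel_to_wav(predicted_mel), 22050)
```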
The following describes the technical scheme of the present application by taking a certain speech synthesis APP as an example.
The background server of the APP obtains a large number of novel texts and news texts as original texts, splits the original texts into single sentence texts using the preset punctuation marks, and rejects invalid texts and the invalid characters in the valid texts. It then judges the property of the quotation marks in the remaining single sentence texts; if the property is representing dialogue, the expression mode label of the text within the quotation marks is determined as the dialogue type. The symbol type of the ending punctuation mark of each remaining valid single sentence text is determined, word segmentation and part-of-speech tagging are performed on the remaining single sentence texts, the pause levels of the segmented words in the single sentence texts are labeled based on the segmentation and the parts of speech, and the pause level at the end of each single sentence text is labeled based on the symbol type, so as to obtain single sentence text samples. Further, the single sentence text samples are spliced sentence by sentence; during splicing it is judged whether the number of characters of the current spliced sentence reaches the preset character-number threshold, and if not, the next single sentence text sample to be spliced is spliced onto the spliced sentence until the number of characters of the current spliced sentence reaches the preset character-number threshold; the current spliced sentence is then taken as a long sentence text sample and splicing of the next spliced sentence is started, until the splicing end condition is met. The voice samples corresponding to the text samples are obtained, and the training sample set is thus obtained. The training sample set is input to the speech synthesis model; the text content feature vector and the expression mode feature vector of the text sample are extracted; the voice feature vector of the voice sample corresponding to the text sample is extracted, and the style vector corresponding to the voice feature vector is determined through the multi-head attention mechanism; the predicted mel spectrum corresponding to the text sample is determined based on the style vector, the text content feature vector and the expression mode feature vector; the mel spectrum loss is determined using the predicted mel spectrum and the real mel spectrum corresponding to the voice sample, and the style vector loss is determined using the style vector and the label information; the comprehensive training loss is determined based on the mel spectrum loss and the style vector loss; and when the comprehensive training loss converges, the current speech synthesis model is determined as the trained speech synthesis model and the current style vector is determined as the trained style vector.
The user side installs the voice synthesis APP, the user transmits text content needing to be subjected to voice synthesis to a background server through the APP, and when the background server acquires the text content, the background server determines a predicted Mel frequency spectrum of the text content by utilizing a trained voice synthesis model and a trained style vector, so as to synthesize voice and transmit the voice to the user side for playing.
Referring to fig. 8, an embodiment of the present application discloses a speech synthesis model training device, including:
a training sample set acquisition module 11, configured to acquire a training sample set; the training sample set comprises a text sample, a voice sample corresponding to the text sample and label information, wherein the label information comprises an expression mode label;
a training sample set input module 12 for inputting the training sample set into a speech synthesis model;
a text feature extraction module 13, configured to extract a text content feature vector and an expression mode feature vector of the text sample;
a voice feature extraction module 14, configured to extract a voice feature vector of a voice sample corresponding to the text sample;
a style vector determining module 15, configured to determine, by using a multi-head attention mechanism, a style vector corresponding to the speech feature vector;
A predicted mel-frequency spectrum determining module 16, configured to determine a predicted mel-frequency spectrum corresponding to the text sample based on the style vector, the text content feature vector, and the expression feature vector;
A loss determination module 17, configured to determine a mel spectrum loss using the predicted mel spectrum and a real mel spectrum corresponding to the speech sample, and determine a style vector loss using the style vector and the tag information; determining a composite training loss based on the mel spectrum loss and the style vector loss;
The trained model determining module 18 is configured to determine the current speech synthesis model as a trained speech synthesis model and determine the current style vector as a trained style vector when the comprehensive training loss converges.
Therefore, in the present application, a training sample set is first acquired, where the training sample set comprises text samples, voice samples corresponding to the text samples and label information, and the label information comprises expression mode labels; the training sample set is input to a speech synthesis model; text content feature vectors and expression mode feature vectors of the text samples are extracted; voice feature vectors of the voice samples corresponding to the text samples are extracted, and style vectors corresponding to the voice feature vectors are determined through a multi-head attention mechanism; predicted mel spectrums corresponding to the text samples are then determined based on the style vectors, the text content feature vectors and the expression mode feature vectors; a mel spectrum loss is determined using the predicted mel spectrums and the real mel spectrums corresponding to the voice samples, and a style vector loss is determined using the style vectors and the label information; a comprehensive training loss is determined based on the mel spectrum loss and the style vector loss; and when the comprehensive training loss converges, the current speech synthesis model is determined as the trained speech synthesis model and the current style vector is determined as the trained style vector. The present application trains the speech synthesis model using the label information comprising the expression mode labels, the text samples and the voice samples; in the training process, the text content feature vector and the expression mode feature vector are extracted, the predicted mel spectrum is determined using the style vector corresponding to the voice sample together with the text content feature vector and the expression mode feature vector, and the losses are then determined; when the losses converge, the trained speech synthesis model is obtained. In this way, the expression mode features are taken into account in the training process, so the ability of the trained speech synthesis model to distinguish between different expression modes can be improved, thereby improving the naturalness of the synthesized speech and the user experience.
The training sample set obtaining module 11 is specifically configured to obtain a long sentence text sample, a single sentence text sample, a voice sample corresponding to the long sentence text sample, a voice sample corresponding to the single sentence text sample, and tag information, so as to obtain a training sample set; the long sentence text sample is a text sample containing a plurality of single sentence texts and pause labeling information between two adjacent single sentence texts.
In a specific embodiment, the training sample set acquisition module 11 includes:
A single sentence sample obtaining sub-module, configured to split the original text into single sentence texts using preset punctuation marks; determine the symbol type of the ending punctuation mark of each single sentence text; perform word segmentation and part-of-speech tagging on the single sentence text to obtain the word segments of the single sentence text and their parts of speech; and label the pause level of each word segment in the single sentence text based on the word segments and their parts of speech, and label the pause level at the end of the single sentence text based on the symbol type, so as to obtain a single sentence text sample;
A long sentence sample acquisition sub-module, configured to splice the single sentence text samples sentence by sentence; during splicing, judge whether the number of characters of the currently spliced sentence reaches a preset character number threshold, and if not, splice the next single sentence text sample onto the spliced sentence until the number of characters of the currently spliced sentence reaches the preset character number threshold; and take the currently spliced sentence as a long sentence text sample and start splicing the next spliced sentence, until the splicing end condition is met.
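A minimal sketch of the splitting and splicing procedure performed by these two sub-modules is given below. The punctuation set and the character number threshold are assumptions for illustration, and the pause-level and part-of-speech labeling steps are omitted.

import re

SPLIT_PUNCT = "。！？；"      # assumed preset punctuation marks
CHAR_THRESHOLD = 100          # assumed preset character number threshold

def split_into_single_sentences(original_text):
    # Split the original text into single sentence texts, keeping the ending punctuation mark.
    parts = re.split(f"(?<=[{SPLIT_PUNCT}])", original_text)
    return [p for p in parts if p.strip()]

def splice_long_sentence_samples(single_sentence_samples):
    # Splice single sentence samples one by one; once the spliced sentence reaches the
    # character threshold, keep it as a long sentence text sample and start the next one.
    long_samples, current = [], ""
    for sentence in single_sentence_samples:
        current += sentence
        if len(current) >= CHAR_THRESHOLD:
            long_samples.append(current)
            current = ""
    if current:                # splicing end condition: no single sentence samples remain
        long_samples.append(current)
    return long_samples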
The apparatus further comprises:
An invalid text eliminating module, configured to eliminate single sentence texts that do not contain a first target character; wherein the first target characters comprise Chinese characters, digits and letters;
An invalid character eliminating module, configured to eliminate second target characters from the remaining single sentence texts; wherein a second target character is a character that does not contain effective information.
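The two elimination steps can be sketched as follows; the regular expressions that define the first target characters and the retained character set are assumptions for illustration.

import re

# Assumed first target characters: Chinese characters, digits and letters.
FIRST_TARGET = re.compile(r"[\u4e00-\u9fa5A-Za-z0-9]")
# Assumed set of characters carrying effective information; anything else is treated
# as a second target character and removed.
KEEP_CHARS = re.compile(r"[\u4e00-\u9fa5A-Za-z0-9，。！？；：、,.!?;:\s]")

def eliminate_invalid_texts(single_sentences):
    # Discard single sentence texts that contain no first target character.
    return [s for s in single_sentences if FIRST_TARGET.search(s)]

def eliminate_invalid_chars(single_sentences):
    # Remove second target characters from the remaining single sentence texts.
    return ["".join(ch for ch in s if KEEP_CHARS.match(ch)) for s in single_sentences]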
In a specific embodiment, the training sample set acquisition module 11 includes:
A tag information acquisition sub-module, configured to judge the property of quotation marks in the text sample, and if the property indicates dialogue, determine that the expression mode label of the text within the quotation marks is the dialogue type.
Further, in a specific embodiment, the tag information acquisition sub-module is specifically configured to:
judge whether a colon exists before the quotation marks, and if so, judge that the property of the quotation marks indicates dialogue;
or judge whether designated characters exist before the quotation marks, and if so, judge that the property of the quotation marks indicates dialogue; wherein the designated characters are characters indicating that the text following them is of the dialogue type;
or analyze the part of speech of the text within the quotation marks, and if the text within the quotation marks includes a verb, judge that the property of the quotation marks indicates dialogue.
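The three judgment strategies can be sketched as follows; the list of designated characters and the choice of jieba as the part-of-speech tagger are assumptions for illustration.

import re
import jieba.posseg as pseg   # assumed part-of-speech tagger

# Assumed designated characters indicating that the text following them is dialogue.
DESIGNATED = ("说", "道", "问", "答", "喊")

def quote_is_dialogue(text, quote_start):
    # Judge whether the quotation starting at index quote_start represents dialogue.
    before = text[:quote_start].rstrip()
    match = re.match(r"“([^”]*)”", text[quote_start:])
    quoted = match.group(1) if match else ""
    # Strategy 1: a colon immediately before the quotation marks.
    if before.endswith(("：", ":")):
        return True
    # Strategy 2: a designated character immediately before the quotation marks.
    if before.endswith(DESIGNATED):
        return True
    # Strategy 3: the quoted text contains a verb (jieba marks verb flags with 'v').
    return any(w.flag.startswith("v") for w in pseg.cut(quoted))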
The predicted mel spectrum determining module 16 is specifically configured to splice the style vector, the text content feature vector and the expression mode feature vector based on the weight parameter corresponding to the style vector, the weight parameter corresponding to the text content feature vector and the weight parameter corresponding to the expression mode feature vector, so as to obtain a spliced vector; and to determine the predicted mel spectrum corresponding to the spliced vector based on an attention mechanism.
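A minimal sketch of this weighted splicing is shown below; the learnable scalar weights, the tensor shapes and the broadcasting of the style vector over the time axis are assumptions, and the attention-based decoding of the spliced vector into a mel spectrum is not shown.

import torch
import torch.nn as nn

class WeightedSplice(nn.Module):
    # Scale each feature vector by its own weight parameter and concatenate the
    # results into a single spliced vector.
    def __init__(self):
        super().__init__()
        self.w_style = nn.Parameter(torch.tensor(1.0))
        self.w_text = nn.Parameter(torch.tensor(1.0))
        self.w_expr = nn.Parameter(torch.tensor(1.0))

    def forward(self, style_vec, text_content_vec, expression_vec):
        # style_vec: (batch, d_style); text_content_vec: (batch, seq, d_text);
        # expression_vec: (batch, seq, d_expr). The style vector is repeated over time.
        style_seq = style_vec.unsqueeze(1).expand(-1, text_content_vec.size(1), -1)
        return torch.cat([self.w_style * style_seq,
                          self.w_text * text_content_vec,
                          self.w_expr * expression_vec], dim=-1)

The spliced vector would then be passed to an attention-based decoder that predicts the mel spectrum frame by frame.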
Further, an embodiment of the present application also provides an electronic device. Fig. 9 is a block diagram of an electronic device 20 according to an exemplary embodiment, and the contents of the figure should not be construed as limiting the scope of use of the present application in any way.
Fig. 9 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present application. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input output interface 25, and a communication bus 26. Wherein the memory 22 is configured to store a computer program that is loaded and executed by the processor 21 to implement relevant steps in the speech synthesis model training method and/or the audio generation method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in the present embodiment may be a server.
In this embodiment, the power supply 23 is configured to provide an operating voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and the communication protocol it follows may be any communication protocol applicable to the technical solution of the present application, which is not specifically limited herein; the input/output interface 25 is used for acquiring external input data or outputting data to the outside, and its specific interface type may be selected according to the specific application requirements, which is not limited herein.
The memory 22 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, and the resources stored thereon may include an operating system 221, a computer program 222, training data 223, and the like, and the storage may be temporary storage or permanent storage.
The operating system 221 is used for managing and controlling the hardware devices on the electronic device 20 and the computer program 222, so that the processor 21 can operate on and process the training data 223 in the memory 22; it may be Windows Server, NetWare, Unix, Linux, or the like. In addition to the computer program for performing the speech synthesis model training method and/or the audio generation method disclosed in any of the foregoing embodiments and executed by the electronic device 20, the computer program 222 may further include computer programs for performing other specific tasks.
Further, the embodiment of the application also discloses a storage medium, wherein the storage medium stores a computer program, and the computer program realizes the steps of the speech synthesis model training method and/or the audio generation method disclosed in any one of the previous embodiments when being loaded and executed by a processor.
In this specification, the embodiments are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and for the same or similar parts among the embodiments, reference may be made to one another. Since the apparatus disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief, and for relevant details, reference may be made to the description of the method part.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing describes in detail a speech synthesis model training method, an audio generation method, a device and a medium provided by the present application, and specific examples are used herein to illustrate the principles and embodiments of the present application; the above description of the embodiments is only intended to help understand the method and core ideas of the present application. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope in accordance with the ideas of the present application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (11)

1. A method for training a speech synthesis model, comprising:
Acquiring a training sample set; wherein the training sample set comprises a text sample, a voice sample corresponding to the text sample, and tag information, the tag information comprises an expression mode label, and the expression mode label is used for distinguishing whether the expression mode of the text content is dialogue or narration;
inputting the training sample set to a speech synthesis model;
Extracting a text content feature vector and an expression mode feature vector of the text sample;
Extracting a voice characteristic vector of a voice sample corresponding to the text sample, and determining a style vector corresponding to the voice characteristic vector through a multi-head attention mechanism;
Determining a predicted mel frequency spectrum corresponding to the text sample based on the style vector, the text content feature vector and the expression mode feature vector;
Determining a mel spectrum loss by using the predicted mel spectrum and a real mel spectrum corresponding to the voice sample, and determining a style vector loss by using the style vector and the tag information;
determining a composite training loss based on the mel spectrum loss and the style vector loss;
when the composite training loss converges, determining the current speech synthesis model as a trained speech synthesis model and determining the current style vector as a trained style vector.
2. The method of claim 1, wherein the obtaining a training sample set comprises:
Obtaining a long sentence text sample, a single sentence text sample, a voice sample corresponding to the long sentence text sample, a voice sample corresponding to the single sentence text sample, and tag information, so as to obtain the training sample set;
The long sentence text sample is a text sample containing a plurality of single sentence texts and pause labeling information between two adjacent single sentence texts.
3. The method for training a speech synthesis model according to claim 2, wherein the obtaining long sentence text samples and single sentence text samples comprises:
Splitting the original text into single sentence texts by using preset punctuation marks;
determining the symbol type of the ending punctuation symbol of the single sentence text;
performing word segmentation and part-of-speech tagging on the single sentence text to obtain the word segments of the single sentence text and their parts of speech;
labeling the pause level of each word segment in the single sentence text based on the word segments and their parts of speech, and labeling the pause level at the end of the single sentence text based on the symbol type, so as to obtain a single sentence text sample;
and splicing the single sentence text samples sentence by sentence; during splicing, judging whether the number of characters of the currently spliced sentence reaches a preset character number threshold, and if not, splicing the next single sentence text sample onto the spliced sentence until the number of characters of the currently spliced sentence reaches the preset character number threshold; taking the currently spliced sentence as a long sentence text sample and starting to splice the next spliced sentence, until a splicing end condition is met.
4. The method for training a speech synthesis model according to claim 3, wherein after splitting the original text into single sentence text with a preset punctuation mark, further comprising:
eliminating single sentence texts that do not contain a first target character; wherein the first target characters comprise Chinese characters, digits and letters;
and eliminating second target characters from the remaining single sentence texts; wherein a second target character is a character that does not contain effective information.
5. The speech synthesis model training method of claim 1, wherein obtaining tag information comprises:
Judging the property of quotation marks in the text sample, and if the property indicates dialogue, determining that the expression mode label of the text within the quotation marks is the dialogue type.
6. The method of claim 5, wherein said determining the nature of the quotation marks in the text sample comprises:
judging whether a colon exists before the quotation marks, and if so, judging that the property of the quotation marks indicates dialogue;
or judging whether designated characters exist before the quotation marks, and if so, judging that the property of the quotation marks indicates dialogue; wherein the designated characters are characters indicating that the text following them is of the dialogue type;
or analyzing the part of speech of the text within the quotation marks, and if the text within the quotation marks includes a verb, judging that the property of the quotation marks indicates dialogue.
7. The method of claim 1, wherein determining the predicted mel spectrum corresponding to the text sample based on the style vector, the text content feature vector, and the expression feature vector comprises:
Based on the weight parameter corresponding to the style vector, the weight parameter corresponding to the text content feature vector and the weight parameter corresponding to the expression mode feature vector, splicing the style vector, the text content feature vector and the expression mode feature vector to obtain a spliced vector;
and determining a predicted mel frequency spectrum corresponding to the spliced vector based on an attention mechanism.
8. The method of claim 1, wherein said determining a composite training loss based on said mel-spectrum loss and said style vector loss comprises:
and performing a weighted calculation on the mel spectrum loss and the style vector loss based on the weight parameter corresponding to the mel spectrum loss and the weight parameter corresponding to the style vector loss, so as to obtain the composite training loss.
9. An audio generation method, comprising:
Acquiring a target text of speech to be synthesized and target tag information of the target text; wherein the target tag information comprises an expression mode label, and the expression mode label is used for distinguishing whether the expression mode of the text content is dialogue or narration;
inputting the target text and the target tag information into the trained speech synthesis model according to any one of claims 1 to 8;
extracting a text content feature vector of the target text, and extracting an expression mode feature vector of the target text based on the expression mode label;
determining a target style vector based on the target tag information and the trained style vector corresponding to the trained speech synthesis model;
determining a target predicted mel spectrum corresponding to the target text based on the target style vector, the text content feature vector and the expression mode feature vector;
and synthesizing the corresponding predicted speech by using the target predicted mel spectrum.
10. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the speech synthesis model training method according to any one of claims 1 to 8 and/or the audio generation method according to claim 9.
11. A computer readable storage medium for storing a computer program which, when executed by a processor, implements the speech synthesis model training method of any one of claims 1 to 8 and/or the audio generation method of claim 9.
CN202110937782.1A 2021-08-16 2021-08-16 Speech synthesis model training method, audio generation method, equipment and medium Active CN113658577B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110937782.1A CN113658577B (en) 2021-08-16 2021-08-16 Speech synthesis model training method, audio generation method, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110937782.1A CN113658577B (en) 2021-08-16 2021-08-16 Speech synthesis model training method, audio generation method, equipment and medium

Publications (2)

Publication Number Publication Date
CN113658577A CN113658577A (en) 2021-11-16
CN113658577B (en) 2024-06-14

Family

ID=78491083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110937782.1A Active CN113658577B (en) 2021-08-16 2021-08-16 Speech synthesis model training method, audio generation method, equipment and medium

Country Status (1)

Country Link
CN (1) CN113658577B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114783405B (en) * 2022-05-12 2023-09-12 马上消费金融股份有限公司 Speech synthesis method, device, electronic equipment and storage medium
CN114822495B (en) * 2022-06-29 2022-10-14 杭州同花顺数据开发有限公司 Acoustic model training method and device and speech synthesis method
CN115662435B (en) 2022-10-24 2023-04-28 福建网龙计算机网络信息技术有限公司 Virtual teacher simulation voice generation method and terminal
CN115547296B (en) * 2022-11-29 2023-03-10 零犀(北京)科技有限公司 Voice synthesis method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065019A (en) * 2018-08-27 2018-12-21 北京光年无限科技有限公司 A kind of narration data processing method and system towards intelligent robot
CN113096634A (en) * 2021-03-30 2021-07-09 平安科技(深圳)有限公司 Speech synthesis method, apparatus, server and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100806287B1 (en) * 2006-08-01 2008-02-22 한국전자통신연구원 Method for predicting sentence-final intonation and Text-to-Speech System and method based on the same
JP4455610B2 (en) * 2007-03-28 2010-04-21 株式会社東芝 Prosody pattern generation device, speech synthesizer, program, and prosody pattern generation method
CN108717406B (en) * 2018-05-10 2021-08-24 平安科技(深圳)有限公司 Text emotion analysis method and device and storage medium
KR20200063301A (en) * 2018-11-19 2020-06-05 (주)유니즈소프트 Game sound playing system using deep learning voice based on ai
US20200395008A1 (en) * 2019-06-15 2020-12-17 Very Important Puppets Inc. Personality-Based Conversational Agents and Pragmatic Model, and Related Interfaces and Commercial Models
CN110534089B (en) * 2019-07-10 2022-04-22 西安交通大学 Chinese speech synthesis method based on phoneme and prosodic structure
CN111667811B (en) * 2020-06-15 2021-09-07 北京百度网讯科技有限公司 Speech synthesis method, apparatus, device and medium
CN112309366B (en) * 2020-11-03 2022-06-14 北京有竹居网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112863483B (en) * 2021-01-05 2022-11-08 杭州一知智能科技有限公司 Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
CN112908294B (en) * 2021-01-14 2024-04-05 杭州倒映有声科技有限公司 Speech synthesis method and speech synthesis system

Also Published As

Publication number Publication date
CN113658577A (en) 2021-11-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant