CN113658577A - Speech synthesis model training method, audio generation method, device and medium


Info

Publication number: CN113658577A
Application number: CN202110937782.1A
Authority: CN (China)
Legal status: Pending
Prior art keywords: text, vector, sample, training, sentence
Other languages: Chinese (zh)
Inventors: 徐东, 陈洲旋
Current assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Original assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

The application discloses a speech synthesis model training method, an audio generation method, a device and a medium. The training method includes: acquiring a training sample set; inputting it into a speech synthesis model; extracting the text content feature vector and the expression mode feature vector of a text sample; extracting the speech feature vector of the corresponding speech sample and determining the corresponding style vector; determining a predicted Mel spectrum of the text sample based on the style vector, the text content feature vector and the expression mode feature vector; determining the Mel spectrum loss from the predicted Mel spectrum and the real Mel spectrum of the speech sample, and the style vector loss from the style vector and the label information; and determining a comprehensive training loss based on the Mel spectrum loss and the style vector loss, obtaining a trained speech synthesis model and a trained style vector once the comprehensive training loss converges. The scheme improves the distinguishing effect of the trained speech synthesis model on different expression modes, thereby improving the naturalness of the synthesized speech and the user experience.

Description

Speech synthesis model training method, audio generation method, device and medium
Technical Field
The present application relates to the field of speech synthesis technologies, and in particular, to a speech synthesis model training method, an audio generation method, an apparatus, and a medium.
Background
With the development of deep neural network technology, increasingly powerful acoustic models and vocoders have emerged in the field of speech synthesis; the former generate Mel spectra from text sequences, and the latter generate high-quality speech from Mel spectra. At present, however, it is difficult for existing model training to distinguish well between different expression modes, such as voice-over and dialogue, so the naturalness of the synthesized speech is low and the user experience is poor. In summary, in the process of implementing the present invention, the inventors found that the prior art has at least the following problem: the speech synthesis model obtained by training has difficulty distinguishing different expression modes, the naturalness of the synthesized speech is poor, and the user experience is poor.
Disclosure of Invention
In view of this, an object of the present application is to provide a speech synthesis model training method, an audio generation method, a device and a medium that can improve the distinguishing effect of the trained speech synthesis model on different expression modes, thereby improving the naturalness of the synthesized speech and the user experience. The specific scheme is as follows:
in a first aspect, the present application provides a method for training a speech synthesis model, including:
acquiring a training sample set; the training sample set comprises a text sample, a voice sample corresponding to the text sample and label information, and the label information comprises an expression mode label;
inputting the training sample set to a speech synthesis model;
extracting character content characteristic vectors and expression mode characteristic vectors of the text samples;
extracting a voice feature vector of a voice sample corresponding to the text sample, and determining a style vector corresponding to the voice feature vector through a multi-head attention mechanism;
determining a predicted Mel frequency spectrum corresponding to the text sample based on the style vector, the character content feature vector and the expression mode feature vector;
determining a mel-frequency spectrum loss by using the predicted mel-frequency spectrum and a real mel-frequency spectrum corresponding to the voice sample, and determining a style vector loss by using the style vector and the label information;
determining a comprehensive training loss based on the mel-frequency spectrum loss and the style vector loss;
and when the comprehensive training loss converges, determining the current speech synthesis model as a trained speech synthesis model and determining the current style vector as a trained style vector.
Optionally, the obtaining a training sample set includes:
obtaining a long sentence text sample, a single sentence text sample, a voice sample corresponding to the long sentence text sample, a voice sample corresponding to the single sentence text sample and label information to obtain a training sample set;
the long sentence text sample is a text sample containing a plurality of single sentence texts and pause marking information between two adjacent single sentence texts.
Optionally, the obtaining long sentence text samples and single sentence text samples includes:
splitting an original text into single sentence texts by using preset punctuations;
determining the symbol type of the ending punctuation of the single sentence text;
performing word segmentation and part-of-speech tagging on the single sentence text to obtain the word segmentation and part-of-speech of the single sentence text;
marking the pause grade of the participles in the single sentence text based on the participles and the parts of speech, and marking the pause grade of the tail of the single sentence text based on the symbol type to obtain a single sentence text sample;
and splicing the single-sentence text samples sentence by sentence, judging in the splicing process whether the number of characters of the current spliced sentence reaches a preset character-number threshold, splicing the next single-sentence text sample to be spliced onto the spliced sentence if the threshold is not reached, and, if the threshold is reached, taking the current spliced sentence as a long-sentence text sample and starting to splice the next spliced sentence, until a splicing end condition is met.
Optionally, after splitting the original text into a single-sentence text with a preset punctuation mark, the method further includes:
rejecting the single sentence text without the first target character; wherein the first target character comprises Chinese characters, numbers and letters;
removing second target characters in the residual single sentence texts; wherein the second target character is a character that does not contain valid information.
Optionally, acquiring the tag information includes:
and judging the property of the quotation marks in the text sample, and if the property is representing dialogue, determining that the expression mode label of the text within the quotation marks is the dialogue type.
Optionally, the determining the nature of the quotation marks in the single sentence text includes:
judging whether a colon exists before the quotation marks, and if so, judging that the property of the quotation marks is representing dialogue;
or, judging whether designated characters exist before the quotation marks, and if so, judging that the property of the quotation marks is representing dialogue; the designated characters are characters indicating that the text following them is of the dialogue type;
or, analyzing the part of speech of the text within the quotation marks, and judging that the property of the quotation marks is representing dialogue if the text within the quotation marks includes a verb.
Optionally, the determining a predicted mel frequency spectrum corresponding to the text sample based on the style vector, the text content feature vector, and the expression mode feature vector includes:
splicing the style vector, the character content characteristic vector and the expression mode characteristic vector based on the weight parameter corresponding to the style vector, the weight parameter corresponding to the character content characteristic vector and the weight parameter corresponding to the expression mode characteristic vector to obtain a spliced vector;
and determining a predicted Mel frequency spectrum corresponding to the splicing vector based on an attention mechanism.
Optionally, the determining a comprehensive training loss based on the mel-frequency spectrum loss and the style vector loss includes:
and performing weighted calculation on the Mel spectrum loss and the style vector loss based on the weight parameters corresponding to the Mel spectrum loss and the weight parameters corresponding to the style vector loss to obtain the comprehensive training loss.
In a second aspect, the present application discloses an audio generation method, comprising:
acquiring a target text of a voice to be synthesized and target label information of the target text; wherein the target tag information comprises an expression mode tag;
inputting the target text and the target label information into the trained speech synthesis model;
extracting a text content feature vector of the target text, and extracting an expression mode feature vector of the target text based on the expression mode label;
determining a target style vector based on the target label information and a trained style vector corresponding to the trained speech synthesis model;
determining a target prediction Mel frequency spectrum corresponding to the target text based on the target style vector, the character content feature vector and the expression mode feature vector;
and synthesizing corresponding predicted voice by using the target predicted Mel frequency spectrum.
In a third aspect, the present application discloses an electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the aforementioned speech synthesis model training method and/or the aforementioned audio generation method.
In a fourth aspect, the present application discloses a computer-readable storage medium for storing a computer program which, when executed by a processor, implements the aforementioned speech synthesis model training method and/or the aforementioned audio generation method.
Therefore, in the present application, a training sample set is first acquired, where the training sample set includes a text sample, a speech sample corresponding to the text sample and label information, and the label information includes an expression mode label. The training sample set is then input into a speech synthesis model; the text content feature vector and the expression mode feature vector of the text sample are extracted; the speech feature vector of the speech sample corresponding to the text sample is extracted, and the style vector corresponding to the speech feature vector is determined through a multi-head attention mechanism. The predicted Mel spectrum corresponding to the text sample is then determined based on the style vector, the text content feature vector and the expression mode feature vector; the Mel spectrum loss is determined using the predicted Mel spectrum and the real Mel spectrum corresponding to the speech sample, and the style vector loss is determined using the style vector and the label information. A comprehensive training loss is determined based on the Mel spectrum loss and the style vector loss, and when the comprehensive training loss converges, the current speech synthesis model is determined as the trained speech synthesis model and the current style vector as the trained style vector. That is, the speech synthesis model is trained using the text samples, the speech samples and the label information including the expression mode labels; during training, the text content feature vector and the expression mode feature vector are extracted, the predicted Mel spectrum is determined using the style vector corresponding to the speech sample together with the text content feature vector and the expression mode feature vector, the losses are then determined, and when the loss converges, the trained speech synthesis model is obtained. In this way, the distinguishing effect of the trained speech synthesis model on different expression modes is improved, thereby improving the naturalness of the synthesized speech and the user experience.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description are merely embodiments of the present application; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is a schematic diagram of a system framework for a speech synthesis model training scheme as disclosed herein;
FIG. 2 is a flow chart of a speech synthesis model training method disclosed herein;
FIG. 3 is a schematic diagram of a specific speech synthesis model training method disclosed in the present application;
FIG. 4 is a flow diagram of a particular speech synthesis model training method disclosed herein;
FIG. 5 is a flow chart of a specific training sample set acquisition process disclosed herein;
FIG. 6 is a flow diagram of a particular speech synthesis model training method disclosed herein;
FIG. 7 is a diagram illustrating a prediction of a particular speech synthesis model disclosed herein;
FIG. 8 is a schematic diagram of a speech synthesis model training apparatus according to the present disclosure;
fig. 9 is a block diagram of an electronic device disclosed in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, in the field of speech synthesis, it is difficult for existing model training to distinguish well between different expression modes, such as voice-over and dialogue, so the naturalness of the synthesized speech is low and the user experience is poor. Therefore, the present application provides a speech synthesis model training scheme that can improve the distinguishing effect of the trained speech synthesis model on different expression modes, thereby improving the naturalness of the synthesized speech and the user experience.
In the speech synthesis model training scheme of the present application, the system frame diagram adopted may be shown in fig. 1, and specifically may include: the system comprises a background server and a plurality of user terminals which are in communication connection with the background server. The user side includes, but is not limited to, a tablet computer, a notebook computer, a smart phone, and a Personal Computer (PC), and is not limited herein. The background server can be a cloud server or a non-cloud server.
In the present application, the steps executed by the background server include: acquiring a training sample set, where the training sample set includes a text sample, a speech sample corresponding to the text sample and label information, and the label information includes an expression mode label; inputting the training sample set into a speech synthesis model; extracting the text content feature vector and the expression mode feature vector of the text sample; extracting the speech feature vector of the speech sample corresponding to the text sample, and determining the style vector corresponding to the speech feature vector through a multi-head attention mechanism; determining the predicted Mel spectrum corresponding to the text sample based on the style vector, the text content feature vector and the expression mode feature vector; determining the Mel spectrum loss using the predicted Mel spectrum and the real Mel spectrum corresponding to the speech sample, and determining the style vector loss using the style vector and the label information; determining a comprehensive training loss based on the Mel spectrum loss and the style vector loss; and when the comprehensive training loss converges, determining the current speech synthesis model as the trained speech synthesis model and the current style vector as the trained style vector.
The user side is used for transmitting the text content that the user specifies for speech synthesis to the background server, so that when the background server obtains the text content, it determines the predicted Mel spectrum of the text content using the trained speech synthesis model and the trained style vector, synthesizes the corresponding speech, and transmits the speech to the user side for playing.
Referring to fig. 2, an embodiment of the present application discloses a method for training a speech synthesis model, including:
step S11: acquiring a training sample set; the training sample set comprises a text sample, a voice sample corresponding to the text sample and label information, and the label information comprises an expression mode label.
In a specific embodiment, the expression mode tag may be of the dialogue type, that is, text content in the text sample whose expression mode is dialogue is labelled to obtain an expression mode label of the dialogue type. Furthermore, text content in the text sample whose expression mode is non-dialogue can be labelled to obtain an expression mode label of the voice-over type, or the voice-over may be left unlabelled. For example, 1 may be used to label text content whose expression mode is dialogue and 0 may be used to label text content whose expression mode is non-dialogue, so that 1 serves as the dialogue-type expression mode label.
In addition, in a specific embodiment, the tag information may further include a speaker tag, an emotion tag, a speech rate tag, and the like.
In a specific embodiment, the specific process of acquiring the tag information includes:
and judging the property of the quotation marks in the text sample; if the property is representing dialogue, determining that the expression mode label of the text within the quotation marks is the dialogue type.
Further, in a specific embodiment, it may be determined whether a colon exists before the quotation marks, and if so, the property of the quotation marks is judged to be representing dialogue; or, it may be determined whether designated characters exist before the quotation marks, and if so, the property of the quotation marks is judged to be representing dialogue, where the designated characters are characters indicating that the text following them is of the dialogue type; or, the part of speech of the text within the quotation marks may be analyzed, and if the text within the quotation marks includes a verb, the property of the quotation marks is judged to be representing dialogue.
It should be noted that, besides representing dialogue, quotation marks can mark emphasis, special appellations, and so on. In general, quotation marks representing dialogue are preceded by some pause before the speech starts, while quotation marks representing emphasis are stressed with an accent and generally are not followed by a long pause. It is therefore desirable to determine as accurately as possible whether quotation marks represent speech. This embodiment can make the judgment in the three ways above, but the specific ways of judging quotation marks include but are not limited to these three. In a specific embodiment, it is first determined whether a colon is present before the quotation marks; if so, the property of the quotation marks is judged to be representing dialogue. If not, it is further determined whether the characters before the quotation marks are characters that clearly indicate the dialogue type, such as speech verbs meaning "said", "stated" or "told"; if so, the property of the quotation marks is judged to be representing dialogue. If not, the parts of speech within the quotation marks are analyzed: if only nouns are present, the quotation marks are judged to indicate emphasis; if a verb is present, they are judged to be of the dialogue type.
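As a concrete illustration of the three checks above, the following is a minimal sketch that applies them to one single-sentence text. The helper name, the regular expression and the speech-verb list are illustrative assumptions and not part of the disclosed scheme; the optional pos_tags argument stands in for the part-of-speech result of the quoted text.

    import re

    # Illustrative speech verbs meaning "said", "asked", "answered", "shouted", "told".
    SPEECH_VERBS = ("道", "说", "问", "答", "喊", "叫", "告诉")

    def quote_is_dialogue(sentence: str, pos_tags=None) -> bool:
        """Return True if the quoted span in `sentence` likely represents dialogue."""
        match = re.search(r'[“"](.+?)[”"]', sentence)
        if not match:
            return False
        before = sentence[:match.start()].rstrip()

        # Check 1: a colon immediately before the quotation marks.
        if before.endswith(("：", ":")):
            return True
        # Check 2: a designated speech verb just before the quotation marks.
        if before.endswith(SPEECH_VERBS):
            return True
        # Check 3: part-of-speech analysis of the quoted text; a verb inside the
        # quotes is treated as dialogue, otherwise the quotes indicate emphasis.
        if pos_tags is not None:
            return any(flag.startswith("v") for _, flag in pos_tags)
        return False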
Step S12: inputting the training sample set to a speech synthesis model.
Step S13: and extracting the character content characteristic vector and the expression mode characteristic vector of the text sample.
That is, the text feature vectors extracted by the speech synthesis model in the encoding stage include the text content feature vector and the expression mode feature vector. The text content feature vector represents the content information of the text, that is, what textual information is actually expressed; the expression mode feature vector represents the expression mode of the text, that is, whether the text is expressed as voice-over or as dialogue.
Step S14: and extracting the voice feature vector of the voice sample corresponding to the text sample, and determining the style vector corresponding to the voice feature vector through a multi-head attention mechanism.
In a specific implementation manner, tokens of the speech feature vector in different information dimensions are acquired through a multi-head attention mechanism, and then weighting calculation is performed on each token to obtain a style vector corresponding to the speech feature vector.
It should be noted that the speech feature vector contains various types of information about the speech sample. The tokens of the speech feature vector in different dimensions obtained by the multi-head attention mechanism are equivalent to branch vectors of the speech in each dimension, representing information such as pauses, timbre, semantics and emotion, and the tokens can be combined by weighting to obtain the style vector.
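A minimal sketch of such a style-token layer in PyTorch is given below. The module name, the dimensions and the number of tokens and heads are illustrative assumptions; the patent only specifies that a multi-head attention mechanism weights a set of tokens into a style vector.

    import torch
    import torch.nn as nn

    class StyleTokenLayer(nn.Module):
        """GST-style sketch: a bank of learnable style tokens is attended over by the
        reference (speech) embedding; the attention weights combine the tokens into a
        single style vector."""

        def __init__(self, ref_dim=128, token_dim=256, num_tokens=10, num_heads=4):
            super().__init__()
            self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
            self.query_proj = nn.Linear(ref_dim, token_dim)
            self.attn = nn.MultiheadAttention(embed_dim=token_dim, num_heads=num_heads,
                                              kdim=token_dim, vdim=token_dim,
                                              batch_first=True)

        def forward(self, ref_embedding):                       # (batch, ref_dim)
            query = self.query_proj(ref_embedding).unsqueeze(1)  # (batch, 1, token_dim)
            keys = torch.tanh(self.tokens).unsqueeze(0).expand(
                ref_embedding.size(0), -1, -1)                   # (batch, num_tokens, token_dim)
            style, _ = self.attn(query, keys, keys)              # weighted token combination
            return style.squeeze(1)                              # (batch, token_dim)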
Step S15: and determining a predicted Mel frequency spectrum corresponding to the text sample based on the style vector, the character content feature vector and the expression mode feature vector.
In a specific embodiment, the style vector, the text content feature vector, and the expression manner feature vector may be spliced based on a weight parameter corresponding to the style vector, a weight parameter corresponding to the text content feature vector, and a weight parameter corresponding to the expression manner feature vector to obtain a spliced vector; and determining a predicted Mel frequency spectrum corresponding to the splicing vector based on an attention mechanism.
The weight parameters corresponding to the style vectors, the text content feature vectors and the expression mode feature vectors are learnable parameters and are updated in the training process.
Of course, in some embodiments, the style vector, the text content feature vector and the expression mode feature vector may also be spliced directly, without the weight parameters, to obtain the spliced vector.
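A minimal sketch of the weighted splicing described above follows. The use of scalar learnable weights and the tensor shapes are assumptions; the patent only states that the weight parameters are learnable and updated during training.

    import torch
    import torch.nn as nn

    class ConditionFusion(nn.Module):
        """Splice the style vector, text-content sequence and expression-mode sequence
        with learnable weights before the attention/decoder stage."""

        def __init__(self):
            super().__init__()
            self.w_style = nn.Parameter(torch.tensor(1.0))
            self.w_content = nn.Parameter(torch.tensor(1.0))
            self.w_expr = nn.Parameter(torch.tensor(1.0))

        def forward(self, style_vec, content_seq, expr_seq):
            # content_seq / expr_seq: (batch, time, dim); style_vec: (batch, dim)
            style_seq = style_vec.unsqueeze(1).expand(-1, content_seq.size(1), -1)
            return torch.cat([self.w_content * content_seq,
                              self.w_expr * expr_seq,
                              self.w_style * style_seq], dim=-1)   # spliced vector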
Step S16: and determining a Mel frequency spectrum loss by using the predicted Mel frequency spectrum and a real Mel frequency spectrum corresponding to the voice sample, and determining a style vector loss by using the style vector and the label information.
In particular embodiments, the model parameters may be updated using the Mel spectrum loss and the style vector loss.
Step S17: determining a composite training loss based on the Mel spectral loss and the style vector loss.
In a specific embodiment, the mel-frequency spectrum loss and the style vector loss may be weighted and calculated based on a weight parameter corresponding to the mel-frequency spectrum loss and a weight parameter corresponding to the style vector loss, so as to obtain a comprehensive training loss.
The weight parameter corresponding to the mel-frequency spectrum loss and the weight parameter corresponding to the style vector loss may be parameters configured in advance according to experience, or may be learnable parameters.
In another specific embodiment, the mel-frequency spectrum loss and the style vector loss can be directly added to obtain the comprehensive training loss.
It should be noted that the losses used to determine the comprehensive training loss include but are not limited to the Mel spectrum loss and the style vector loss, and may include other losses calculated according to actual requirements.
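A minimal sketch of the weighted combination of the two losses is shown below. The choice of L1 for the Mel spectrum loss and of cross-entropy over a classifier output for the style vector loss are assumptions; the patent only specifies a weighted sum (or a direct addition) of the two losses.

    import torch.nn.functional as F

    def comprehensive_loss(pred_mel, real_mel, style_logits, expr_labels,
                           w_mel=1.0, w_style=1.0):
        """Weighted combination of the Mel spectrum loss and the style vector loss."""
        mel_loss = F.l1_loss(pred_mel, real_mel)                # predicted vs. real Mel spectrum
        style_loss = F.cross_entropy(style_logits, expr_labels) # style vector vs. label information
        return w_mel * mel_loss + w_style * style_loss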
Step S18: when the comprehensive training loss converges, determining the current speech synthesis model as the trained speech synthesis model and determining the current style vector as the trained style vector.
For example, referring to fig. 3, an embodiment of the present application discloses a specific speech synthesis model training method. The speech synthesis model comprises a speech encoder, a GST (Global Style Token) module, a text encoder, an expression mode encoder, an attention mechanism and a decoder. First, the text content feature vector and the expression mode feature vector of the text are extracted by the text encoder and the expression mode encoder respectively; the speech feature vector of the speech is extracted by the speech encoder, the tokens of the speech vector in different dimensions are obtained through the multi-head attention mechanism of the GST module, and the token results are combined by weighting to obtain the style vector. The accuracy of the style vector is then evaluated by determining the style vector loss from the label information and the current style vector. The effect of the predicted Mel spectrum is evaluated by extracting the Mel spectrum of the input training speech as the real Mel spectrum and computing the difference between the Mel spectrum predicted by the model and the real Mel spectrum to obtain the Mel spectrum loss. The style vector loss and the Mel spectrum loss are then fed back to the speech synthesis model to adjust the weight parameters during training, until the predicted effect is close to the real effect. The predicted Mel spectrum refers to the result produced by the decoder after it is conditioned on the input text vectors and the style vector, and represents the output of the acoustic model; that is, the predicted Mel spectrum is the Mel spectrum determined based on the style vector, the text content feature vector and the expression mode feature vector.
It should be noted that a GST module is added for stylization on top of the Tacotron2 model for end-to-end speech synthesis, so that the synthesized speech has better prosodic expression. During training, text vectors such as phonemes and tones are extracted from the text of a large amount of paired text/speech data, prosody vectors are extracted from the speech, style vectors are obtained by learning the prosody vectors through the multi-head attention mechanism, and the style vectors are spliced with the text vectors and then fed into the attention mechanism model. After training is finished, the GST module has extracted the global style characteristics of the audio in the data set, such as prosodic pause information, and stored them in the style vectors; when speech synthesis is carried out, the obtained style vectors can be used for synthesis. A prosodic pause is the pause duration of characters, words and sentences in the text after the text is vocalized. On the basis of the Tacotron2 model and the GST module, an expression mode encoder is introduced during training to distinguish voice-over from dialogue, so that the model can model voice-over and dialogue in the training stage, distinguish between them, better conform to daily expression habits and synthesize more natural speech.
Therefore, in the embodiment of the present application, a training sample set is first acquired, where the training sample set includes a text sample, a speech sample corresponding to the text sample and label information, and the label information includes an expression mode label. The training sample set is then input into a speech synthesis model; the text content feature vector and the expression mode feature vector of the text sample are extracted; the speech feature vector of the speech sample corresponding to the text sample is extracted, and the style vector corresponding to the speech feature vector is determined through a multi-head attention mechanism. The predicted Mel spectrum corresponding to the text sample is then determined based on the style vector, the text content feature vector and the expression mode feature vector; the Mel spectrum loss is determined using the predicted Mel spectrum and the real Mel spectrum corresponding to the speech sample, and the style vector loss is determined using the style vector and the label information. A comprehensive training loss is determined based on the Mel spectrum loss and the style vector loss, and when the comprehensive training loss converges, the current speech synthesis model is determined as the trained speech synthesis model and the current style vector as the trained style vector. That is, the speech synthesis model is trained using the text samples, the speech samples and the label information including the expression mode labels; during training, the text content feature vector and the expression mode feature vector are extracted, the predicted Mel spectrum is determined using the style vector corresponding to the speech sample together with the text content feature vector and the expression mode feature vector, the losses are then determined, and when the loss converges, the trained speech synthesis model is obtained. In this way, the distinguishing effect of the trained speech synthesis model on different expression modes is improved, thereby improving the naturalness of the synthesized speech and the user experience.
Referring to fig. 4, an embodiment of the present application discloses a specific speech synthesis model training method, including:
step S21: the method comprises the steps of obtaining a long sentence text sample, a single sentence text sample, a voice sample corresponding to the long sentence text sample, a voice sample corresponding to the single sentence text sample and label information to obtain a training sample set.
The long sentence text sample is a text sample containing a plurality of single sentence texts and pause marking information between two adjacent single sentence texts.
Referring to fig. 5, an embodiment of the present application discloses a specific training sample set obtaining flowchart, and in a specific implementation, a specific process of obtaining a long sentence text sample and a single sentence text sample includes:
step S31: splitting an original text into single sentence texts by using preset punctuations.
The original text is unprocessed text, and the text types include but are not limited to novel text, information, conversation, and the like, and can be long text or short text. The preset punctuation marks may include commas, sentences, exclamation marks, ellipses, etc.
That is, in the embodiment of the present application, the original text can be split sentence by sentence using the punctuation marks between sentences as separators. For example, the original text "Hello everyone, please give us your kind guidance." is split into the single-sentence texts "Hello everyone" and "please give us your kind guidance."
In a specific embodiment, the single sentence text without the first target character may be eliminated; wherein the first target character comprises Chinese characters, numbers and letters.
Further, second target characters in the residual single sentence texts are removed; wherein the second target character is a character that does not contain valid information.
That is, in the embodiment of the present application, invalid characters can be removed. After the split single-sentence texts are obtained, it is determined whether each single-sentence text contains a first target character, and texts that contain no first target character are rejected; for example, a single-sentence text containing only characters such as "@(|" is rejected.
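A minimal sketch of these two cleaning steps follows. The regular expressions defining which characters are kept are assumptions; the patent only specifies Chinese characters, digits and letters as first target characters and characters carrying no valid information as second target characters.

    import re

    # A sentence is kept only if it contains a Chinese character, digit or letter.
    HAS_CONTENT = re.compile(r"[\u4e00-\u9fa5A-Za-z0-9]")
    # Characters outside this set are treated as carrying no valid information.
    INVALID_CHARS = re.compile(r'[^\u4e00-\u9fa5A-Za-z0-9，。！？：；“”、…,.!?:;" ]')

    def clean_sentences(sentences):
        kept = [s for s in sentences if HAS_CONTENT.search(s)]   # reject texts without first target characters
        return [INVALID_CHARS.sub("", s) for s in kept]           # remove second target characters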
In addition, in this embodiment of the application, a specific process of acquiring the tag information may include: and judging the property of the quotation marks in the single sentence text, and if the property is the representation dialogue, determining that the representation mode labels of the text in the quotation marks are the dialogue type.
That is, the embodiment of the application can judge the quotation mark property in the single-sentence text after the single-sentence text is split and the invalid text and the invalid characters are removed. For a specific determination manner, reference may be made to the disclosure of the foregoing embodiments, which is not described herein again.
Step S32: and determining the symbol type of the ending punctuation of the single sentence text.
It should be noted that the ending punctuation of a single sentence text is usually a punctuation between two adjacent single sentences, and the determination of the type of the ending punctuation can be used for inter-sentence pause processing.
Step S33: and performing word segmentation and part-of-speech tagging on the single sentence text to obtain the word segmentation and part-of-speech of the single sentence text.
It should be noted that word segmentation separates the words and phrases in each sentence, and part-of-speech tagging predicts the part of speech of each word. For example, "你吃饭了吗？" ("Have you eaten?") is segmented into "你 / 吃饭 / 了 / 吗 / ？", where "吃饭" ("eat") is a single word; the part of speech of each segmented word is then obtained through part-of-speech tagging, for example "你" ("you") is a pronoun. In a specific embodiment, a word segmentation tool such as jieba can be used to obtain the segmentation of a single-sentence text, including the words and phrases in the sentence and their parts of speech.
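A minimal sketch of this step using the jieba toolkit mentioned above (the handling of the returned word/flag pairs reflects common jieba usage and is an assumption about the concrete tooling):

    import jieba.posseg as pseg

    def segment_with_pos(sentence: str):
        """Return [(word, part_of_speech_flag), ...] for a single-sentence text."""
        # Each pair carries the word and its part-of-speech flag,
        # e.g. "v" for verbs, "n" for nouns, "r" for pronouns.
        return [(pair.word, pair.flag) for pair in pseg.cut(sentence)]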
Step S34: and marking the pause grade of the participles in the single sentence text based on the participles and the parts of speech, and marking the pause grade of the tail of the single sentence text based on the symbol type to obtain a single sentence text sample.
In a specific embodiment, the pause levels can be divided according to pause length, for example into 4 levels denoted #1, #2, #3 and #4. For the segmentation results, prosodic words, prosodic phrases and intonation phrases of a single-sentence text are labelled "#1", "#2" and "#3" respectively. By judging the punctuation marks between sentences, the text is given the corresponding pause level; the higher the level, the longer the pause. If a comma lies between two sentences, the end of the first sentence is labelled "#3"; if a full stop lies between two sentences, the end of the first sentence is labelled "#4". Through this labelling, controllable pauses between the sentences of the text are achieved, producing the effect of alternating long and short pauses.
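The sentence-end labelling can be sketched as follows. This is a minimal sketch: the punctuation sets and the default level are assumptions, and the intra-sentence "#1"/"#2" labels driven by segmentation and parts of speech are omitted.

    # Punctuation marks treated as long vs. short inter-sentence pauses (assumed sets).
    LONG_PAUSE = {"。", "！", "？", "…"}
    SHORT_PAUSE = {"，", "；", "：", "、"}

    def label_sentence_end(sentence: str, end_punct: str) -> str:
        """Append the pause level for the end of a single-sentence text."""
        if end_punct in LONG_PAUSE:
            return sentence + "#4"     # full stop between sentences: long pause
        return sentence + "#3"         # comma or other short pause between sentences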
Step S35: splicing the single-sentence text samples sentence by sentence; in the splicing process, judging whether the number of characters of the current spliced sentence reaches a preset character-number threshold; if not, splicing the next single-sentence text sample to be spliced onto the spliced sentence; if so, taking the current spliced sentence as a long-sentence text sample and starting to splice the next spliced sentence, until the splicing end condition is met.
The splicing end condition may be that all single-sentence text samples are spliced, or the number of long-sentence text samples reaches a preset number, and sufficient samples are obtained, and then the splicing is ended.
Therefore, long sentence text samples with the number of characters close to the preset character number threshold are obtained, and compared with a single sentence text, the spliced long sentence text samples have prosody information of inter-sentence pause, and the pause between sentences can be better reflected.
In addition, in a specific implementation manner, if the number of characters of a single-sentence text is greater than a preset threshold, that is, the single-sentence text is too long, the single-sentence text may be discarded or split.
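A minimal sketch of the splicing procedure of step S35 is shown below (the threshold value and the handling of the final partial sentence are illustrative assumptions):

    def splice_long_sentences(single_samples, max_chars=100):
        """Splice single-sentence samples into long-sentence samples up to a character threshold."""
        long_samples, current = [], ""
        for sample in single_samples:
            if len(current) < max_chars:
                current += sample                # threshold not reached: keep splicing
            else:
                long_samples.append(current)     # threshold reached: emit a long-sentence sample
                current = sample                 # start the next spliced sentence
        if current:
            long_samples.append(current)
        return long_samples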
Step S22: inputting the training sample set to a speech synthesis model.
Step S23: and extracting the character content characteristic vector and the expression mode characteristic vector of the text sample.
Step S24: and extracting the voice feature vector of the voice sample corresponding to the text sample, and determining the style vector corresponding to the voice feature vector through a multi-head attention mechanism.
Step S25: and determining a predicted Mel frequency spectrum corresponding to the text sample based on the style vector, the character content feature vector and the expression mode feature vector.
Step S26: and determining a Mel frequency spectrum loss by using the predicted Mel frequency spectrum and a real Mel frequency spectrum corresponding to the voice sample, and determining a style vector loss by using the style vector and the label information.
Step S27: determining a composite training loss based on the Mel spectral loss and the style vector loss.
For specific implementation of the above steps S23 to S27, reference may be made to the disclosure of the foregoing embodiments, and details are not repeated here.
Step S28: and judging whether the comprehensive training loss is converged.
Step S29: if yes, determining the current speech synthesis model as the trained speech synthesis model, and determining the current style vector as the trained style vector.
Otherwise, the speech synthesis model is updated with the comprehensive training loss, further text samples, speech samples and label information are determined from the training sample set, and the above steps S23 to S28 are performed again.
That is, in the embodiment of the present application, a speech synthesis model is trained using a training sample set, and a comprehensive training loss is determined in a training process, and when the comprehensive training loss converges, a current speech synthesis model is determined as a trained speech synthesis model, and a current style vector is determined as a trained style vector.
Therefore, the training sample set obtained in the embodiment of the present application includes long-sentence text samples, single-sentence text samples, speech samples corresponding to the long-sentence text samples and speech samples corresponding to the single-sentence text samples, containing rich prosodic pause information both within single sentences and between single sentences, so that the model can better learn prosodic pause information and the performance of the model is improved.
Referring to fig. 6, an embodiment of the present application discloses a specific audio generation method, including:
step S41: acquiring a target text of a voice to be synthesized and target label information of the target text; wherein the target tag information includes a presentation tag.
The process of acquiring the expression mode tag may refer to the content disclosed in the foregoing embodiments, and is not described herein again. The target tag information may further include an emotion tag, a speech rate tag, and the like.
Step S42: and inputting the target text and the target label information into the trained speech synthesis model disclosed by the embodiment of the application.
Step S43: extracting a text content feature vector of the target text, and extracting an expression mode feature vector of the target text based on the expression mode label;
step S44: determining a target style vector based on the target label information and a trained style vector corresponding to the trained speech synthesis model;
step S45: and determining a target prediction Mel frequency spectrum corresponding to the target text based on the target style vector, the character content feature vector and the expression mode feature vector.
In a specific implementation manner, the text content feature vector, the expression mode feature vector and the target style vector may be spliced to obtain a spliced vector; and determining a target prediction Mel frequency spectrum corresponding to the splicing vector based on an attention mechanism.
For example, referring to fig. 7, the embodiment of the present application discloses a specific prediction diagram of a speech synthesis model.
A target text and target label information are input; the text encoder extracts the text content feature vector; the expression mode encoder extracts the expression mode feature vector based on the expression mode label; the target style vector is determined using the trained style vector and the target label information; the target style vector, the text content feature vector and the expression mode feature vector are spliced to obtain a spliced vector; and the predicted Mel spectrum corresponding to the spliced vector is obtained through the attention mechanism and the decoder.
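This prediction path can be sketched as follows. This is a schematic sketch only: the model attribute names and the label dictionary key are hypothetical and are not defined by the patent.

    import torch

    @torch.no_grad()
    def synthesize_mel(model, target_text, target_labels):
        """Sketch of the prediction path of fig. 7 using hypothetical sub-modules."""
        content = model.text_encoder(target_text)                      # text content feature vector
        expr = model.expression_encoder(target_labels["expression"])   # expression mode feature vector
        style = model.select_style_vector(target_labels)               # target style vector from trained style vector
        spliced = model.fuse(style, content, expr)                     # splicing, cf. ConditionFusion above
        return model.decode(spliced)                                   # attention + decoder -> predicted Mel spectrum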
It should be noted that, for model training, in a specific embodiment, training sample sets corresponding to a plurality of speakers may be obtained, and the training sample set of each speaker may be input into its own speech synthesis model for training, yielding a trained speech synthesis model for each speaker. In that case, when audio is generated, the target text and the speaker information input by the user are obtained, the corresponding trained speech synthesis model is determined according to the speaker information, the target label information of the target text (including the expression mode label and the like) is determined, and the target text and the target label information, without a speaker label, are input into the trained speech synthesis model to generate the speech corresponding to the speaker information.
In another specific embodiment, a training sample set corresponding to a plurality of speakers may be obtained, and the training sample sets corresponding to the plurality of speakers are input to the same trained speech synthesis model for training, so as to obtain a trained speech synthesis model corresponding to each speaker. Since the same model is trained using training sample sets corresponding to multiple speakers, the training sample sets include speaker labels. When the audio is generated, target text and speaker information input by a user are obtained, target label information is determined, the target label information comprises a speaker label determined by using the speaker information, and the target label information and the target text are input into a trained voice synthesis model to generate voice corresponding to the speaker information.
Of course, in some embodiments, the trained speech synthesis model may generate a style vector based on the target label information and determine the target predicted Mel spectrum based on the generated style vector, the text content feature vector and the expression mode feature vector, without using the trained style vector.
step S46: and synthesizing corresponding predicted voice by using the target predicted Mel frequency spectrum.
In particular embodiments, the corresponding predicted speech may be synthesized by a phase prediction or neural network vocoder.
In the phase prediction approach, which includes but is not limited to the Griffin-Lim signal estimation algorithm, phase information is predicted from the input spectrum (an amplitude spectrum without phase information), and the difference between the spectrum obtained by the inverse Fourier transform with the predicted phase and the input spectrum is iteratively reduced to obtain the final predicted speech signal. The neural network vocoder approach uses a deep neural network to model the relationship between the spectrum and the speech for prediction; the output speech has high quality, but the algorithm complexity is high.
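A minimal sketch of the phase-prediction route using librosa's Griffin-Lim implementation is shown below (the parameter values are illustrative, and a neural network vocoder would replace this step when higher quality is required):

    import numpy as np
    import librosa

    def mel_to_audio_griffin_lim(mel_spec: np.ndarray, sr=22050, n_fft=1024,
                                 hop_length=256, n_iter=60) -> np.ndarray:
        """Invert a Mel spectrogram to a waveform via Griffin-Lim phase estimation."""
        linear = librosa.feature.inverse.mel_to_stft(mel_spec, sr=sr, n_fft=n_fft)
        return librosa.griffinlim(linear, n_iter=n_iter,
                                  hop_length=hop_length, win_length=n_fft)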
The following describes a technical solution of the present application, taking a certain speech synthesis APP as an example.
The background server of the APP first obtains a large number of novel texts and information texts as original texts and splits the original texts into single-sentence texts using preset punctuation marks, rejecting invalid texts and removing invalid characters from the valid texts. It then judges the property of the quotation marks in the remaining single-sentence texts; if the quotation marks represent dialogue, the expression mode label of the text within the quotation marks is determined to be the dialogue type. The symbol type of the ending punctuation of each remaining valid single-sentence text is determined, word segmentation and part-of-speech tagging are performed on the single-sentence texts, the pause levels of the segmented words in each single-sentence text are labelled based on the segmentation and the parts of speech, and the pause level at the end of each single-sentence text is labelled based on the symbol type, yielding single-sentence text samples. The single-sentence text samples are then spliced sentence by sentence: during splicing it is judged whether the number of characters of the current spliced sentence reaches a preset character-number threshold; if not, the next single-sentence text sample to be spliced is spliced onto the spliced sentence; if so, the current spliced sentence is taken as a long-sentence text sample and splicing of the next spliced sentence begins, until the splicing end condition is met. In this way the long-sentence text samples, the single-sentence text samples and the expression mode labels are obtained, the corresponding speech samples are further obtained, and the training sample set is obtained. The training sample set is input into a speech synthesis model; the text content feature vector and the expression mode feature vector of the text sample are extracted; the speech feature vector of the speech sample corresponding to the text sample is extracted, and the style vector corresponding to the speech feature vector is determined through a multi-head attention mechanism; the predicted Mel spectrum corresponding to the text sample is determined based on the style vector, the text content feature vector and the expression mode feature vector; the Mel spectrum loss is determined using the predicted Mel spectrum and the real Mel spectrum corresponding to the speech sample, and the style vector loss is determined using the style vector and the label information; a comprehensive training loss is determined based on the Mel spectrum loss and the style vector loss, and when the comprehensive training loss converges, the current speech synthesis model is determined as the trained speech synthesis model and the current style vector as the trained style vector.
The user side installs the speech synthesis APP. The user transmits the text content requiring speech synthesis to the background server through the APP; when the background server obtains the text content, it determines the predicted Mel spectrum of the text content using the trained speech synthesis model and the trained style vector, synthesizes the speech, and transmits the speech to the user side for playing.
Referring to fig. 8, an embodiment of the present application discloses a speech synthesis model training apparatus, including:
a training sample set obtaining module 11, configured to obtain a training sample set; the training sample set comprises a text sample, a voice sample corresponding to the text sample and label information, and the label information comprises an expression mode label;
a training sample set input module 12, configured to input the training sample set to a speech synthesis model;
the text feature extraction module 13 is configured to extract a text content feature vector and an expression mode feature vector of the text sample;
a voice feature extraction module 14, configured to extract a voice feature vector of a voice sample corresponding to the text sample;
the style vector determining module 15 is configured to determine a style vector corresponding to the speech feature vector through a multi-head attention mechanism;
a predicted mel-frequency spectrum determining module 16, configured to determine a predicted mel-frequency spectrum corresponding to the text sample based on the style vector, the text content feature vector, and the expression manner feature vector;
a loss determining module 17, configured to determine a mel-frequency spectrum loss by using the predicted mel-frequency spectrum and a real mel-frequency spectrum corresponding to the voice sample, and determine a style vector loss by using the style vector and the label information; determining a synthetic training loss based on the mel-frequency spectrum loss and the style vector loss;
a trained model determining module 18, configured to determine the current speech synthesis model as the trained speech synthesis model and determine the current style vector as the trained style vector when the comprehensive training loss converges.
Therefore, in the embodiment of the present application, a training sample set is first acquired, where the training sample set includes a text sample, a speech sample corresponding to the text sample and label information, and the label information includes an expression mode label. The training sample set is then input into a speech synthesis model; the text content feature vector and the expression mode feature vector of the text sample are extracted; the speech feature vector of the speech sample corresponding to the text sample is extracted, and the style vector corresponding to the speech feature vector is determined through a multi-head attention mechanism. The predicted Mel spectrum corresponding to the text sample is then determined based on the style vector, the text content feature vector and the expression mode feature vector; the Mel spectrum loss is determined using the predicted Mel spectrum and the real Mel spectrum corresponding to the speech sample, and the style vector loss is determined using the style vector and the label information. A comprehensive training loss is determined based on the Mel spectrum loss and the style vector loss, and when the comprehensive training loss converges, the current speech synthesis model is determined as the trained speech synthesis model and the current style vector as the trained style vector. That is, the speech synthesis model is trained using the text samples, the speech samples and the label information including the expression mode labels; during training, the text content feature vector and the expression mode feature vector are extracted, the predicted Mel spectrum is determined using the style vector corresponding to the speech sample together with the text content feature vector and the expression mode feature vector, the losses are then determined, and when the loss converges, the trained speech synthesis model is obtained. In this way, the distinguishing effect of the trained speech synthesis model on different expression modes is improved, thereby improving the naturalness of the synthesized speech and the user experience.
The training sample set obtaining module 11 is specifically configured to obtain a long sentence text sample, a single sentence text sample, a voice sample corresponding to the long sentence text sample, a voice sample corresponding to the single sentence text sample, and label information, so as to obtain a training sample set; the long sentence text sample is a text sample containing a plurality of single sentence texts and pause marking information between two adjacent single sentence texts.
In a specific embodiment, the training sample set obtaining module 11 includes:
the single sentence sample acquisition submodel is used for splitting an original text into single sentence texts by using preset punctuations; determining the symbol type of the ending punctuation of the single sentence text; performing word segmentation and part-of-speech tagging on the single sentence text to obtain the word segmentation and part-of-speech of the single sentence text; marking the pause grade of the participles in the single sentence text based on the participles and the parts of speech, and marking the pause grade of the tail of the single sentence text based on the symbol type to obtain a single sentence text sample;
and the long sentence sample acquisition submodule is used for splicing the single-sentence text samples one by one, judging in the splicing process whether the number of characters of the current spliced sentence reaches a preset character-number threshold, splicing the next single-sentence text sample to be spliced onto the spliced sentence if the threshold is not reached, and, if the threshold is reached, taking the current spliced sentence as a long-sentence text sample and starting to splice the next spliced sentence, until the splicing end condition is met.
The device further comprises:
the invalid text removing module is used for removing single sentence texts that do not contain a first target character; wherein the first target character comprises Chinese characters, numbers and letters;
and the invalid character removing module is used for removing second target characters from the remaining single sentence texts; wherein the second target character is a character that does not contain valid information. A short sketch of these two filtering steps is given below.
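As a minimal sketch of the two filtering modules, the character classes used below for the "first target characters" (characters that must be present) and the "second target characters" (characters carrying no valid information) are assumptions for illustration only.

```python
import re

HAS_TARGET = re.compile(r"[\u4e00-\u9fa5A-Za-z0-9]")   # Chinese characters, letters, digits
NO_INFO = re.compile(r"[#*_~^<>{}\[\]\\|]")            # assumed no-information characters

def clean(sentences):
    kept = [s for s in sentences if HAS_TARGET.search(s)]   # drop sentences lacking a target character
    return [NO_INFO.sub("", s) for s in kept]               # strip no-information characters from the rest

print(clean(["……", "你好，世界#"]))   # ['你好，世界']
```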
In a specific embodiment, the training sample set obtaining module 11 includes:
and the tag information acquisition submodule is used for judging the property of quotation marks in the text samples, and if the property represents dialogue, determining that the expression mode tag of the text inside the quotation marks is the dialogue type.
Further, in a specific embodiment, the tag information acquisition submodule is specifically configured to:
judge whether a colon exists before the quotation marks, and if so, judge that the property of the quotation marks represents dialogue;
or judge whether appointed characters exist before the quotation marks, and if so, judge that the property of the quotation marks represents dialogue; wherein the appointed characters are characters indicating that the characters following them are of the dialogue type;
or analyze the parts of speech of the text inside the quotation marks, and if the text inside the quotation marks contains a verb, judge that the property of the quotation marks represents dialogue. An illustrative sketch of these three heuristics is given below.
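The following sketch illustrates the three heuristics. The list of "appointed characters" (speech verbs), the use of jieba for the part-of-speech analysis, and the quote characters handled are all assumptions for illustration, not choices specified by this application.

```python
SAY_WORDS = ("说", "道", "问", "答", "喊", "回答")   # assumed appointed characters

def quoted_is_dialogue(text, start, end):
    """Judge whether the text between the quote positions start/end represents dialogue."""
    before = text[:start].rstrip()
    quoted = text[start + 1:end]
    if before.endswith(("：", ":")):                 # rule 1: a colon precedes the quotation marks
        return True
    if before.endswith(SAY_WORDS):                   # rule 2: an appointed character precedes them
        return True
    try:                                             # rule 3: the quoted text contains a verb
        import jieba.posseg as pseg
        return any(flag.startswith("v") for _, flag in pseg.cut(quoted))
    except ImportError:
        return False

sentence = '他说：“明天见。”'
print(quoted_is_dialogue(sentence, sentence.index("“"), sentence.index("”")))   # True (rule 1)
```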
The predictive mel-frequency spectrum determining module 16 is specifically configured to splice the style vector, the text content feature vector and the expression mode feature vector based on a weight parameter corresponding to the style vector, a weight parameter corresponding to the text content feature vector and a weight parameter corresponding to the expression mode feature vector to obtain a spliced vector; and determining a predicted Mel frequency spectrum corresponding to the splicing vector based on an attention mechanism.
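A minimal sketch, under assumed weights and dimensions, of the weighted splicing described above; the attention-based decoding of the spliced vector into a Mel frequency spectrum is not shown.

```python
import torch

def splice(style_vec, content_vec, expr_vec, w_style=1.0, w_content=1.0, w_expr=0.5):
    """Scale each vector by its weight parameter and concatenate the results."""
    return torch.cat([w_style * style_vec, w_content * content_vec, w_expr * expr_vec], dim=-1)

spliced = splice(torch.randn(1, 256), torch.randn(1, 256), torch.randn(1, 256))
print(spliced.shape)   # torch.Size([1, 768]); an attention-based decoder would consume this
```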
Further, the embodiment of the application also provides an electronic device. Fig. 9 is a schematic structural diagram of an electronic device 20 according to an exemplary embodiment, and nothing in the figure should be taken as a limitation on the scope of use of the present application.
The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25, and a communication bus 26. The memory 22 is used for storing a computer program, which is loaded and executed by the processor 21 to implement the relevant steps in the speech synthesis model training method and/or the audio generation method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in this embodiment may specifically be a server.
In this embodiment, the power supply 23 is configured to provide a working voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and a communication protocol followed by the communication interface is any communication protocol applicable to the technical solution of the present application, and is not specifically limited herein; the input/output interface 25 is configured to obtain external input data or output data to the outside, and a specific interface type thereof may be selected according to specific application requirements, which is not specifically limited herein.
In addition, the memory 22 is used as a carrier for resource storage, and may be a read-only memory, a random access memory, a magnetic disk or an optical disk, etc., and the resources stored thereon may include the operating system 221, the computer program 222, the training data 223, etc., and the storage manner may be a transient storage or a permanent storage.
The operating system 221 is used for managing and controlling each hardware device and the computer program 222 on the electronic device 20, so that the processor 21 can operate on and process the training data 223 in the memory 22, and it may be Windows Server, Netware, Unix, Linux, and the like. In addition to the computer program for performing the speech synthesis model training method and/or the audio generation method executed by the electronic device 20 disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs for performing other specific tasks.
Further, an embodiment of the present application also discloses a storage medium, in which a computer program is stored, and when the computer program is loaded and executed by a processor, the steps of the speech synthesis model training method and/or the audio generation method disclosed in any of the foregoing embodiments are implemented.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The speech synthesis model training method, the audio generation method, the device and the medium provided by the present application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present application, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as a limitation on the present application.

Claims (11)

1. A method for training a speech synthesis model, comprising:
acquiring a training sample set; the training sample set comprises a text sample, a voice sample corresponding to the text sample and label information, and the label information comprises an expression mode label;
inputting the training sample set to a speech synthesis model;
extracting a text content feature vector and an expression mode feature vector of the text sample;
extracting a voice feature vector of a voice sample corresponding to the text sample, and determining a style vector corresponding to the voice feature vector through a multi-head attention mechanism;
determining a predicted Mel frequency spectrum corresponding to the text sample based on the style vector, the text content feature vector and the expression mode feature vector;
determining a Mel frequency spectrum loss by using the predicted Mel frequency spectrum and a real Mel frequency spectrum corresponding to the voice sample, and determining a style vector loss by using the style vector and the label information;
determining a comprehensive training loss based on the Mel frequency spectrum loss and the style vector loss;
and when the comprehensive training loss is converged, determining the current speech synthesis model as a trained speech synthesis model and determining the current style vector as a trained style vector.
2. The method of training a speech synthesis model according to claim 1, wherein the acquiring a training sample set comprises:
obtaining a long sentence text sample, a single sentence text sample, a voice sample corresponding to the long sentence text sample, a voice sample corresponding to the single sentence text sample and label information to obtain a training sample set;
the long sentence text sample is a text sample containing a plurality of single sentence texts and pause marking information between two adjacent single sentence texts.
3. The method for training a speech synthesis model according to claim 2, wherein the obtaining long sentence text samples and single sentence text samples comprises:
splitting an original text into single sentence texts by using preset punctuations;
determining the symbol type of the ending punctuation of the single sentence text;
performing word segmentation and part-of-speech tagging on the single sentence text to obtain the participles and parts of speech of the single sentence text;
marking the pause grade of the participles in the single sentence text based on the participles and the parts of speech, and marking the pause grade of the tail of the single sentence text based on the symbol type to obtain a single sentence text sample;
and splicing the single-sentence text samples sentence by sentence; in the splicing process, judging whether the number of characters of the current spliced sentence reaches a preset character number threshold; if not, splicing the current single-sentence text sample to be spliced to the spliced sentence; if so, taking the current spliced sentence as a long-sentence text sample and starting to splice the next spliced sentence, until a splicing ending condition is met.
4. The method for training a speech synthesis model according to claim 3, wherein the splitting of the original text into single-sentence texts with the preset punctuation marks further comprises:
rejecting single sentence texts that do not contain a first target character; wherein the first target character comprises Chinese characters, numbers and letters;
and removing second target characters from the remaining single sentence texts; wherein the second target character is a character that does not contain valid information.
5. The method of training a speech synthesis model according to claim 1, wherein obtaining label information comprises:
judging the property of quotation marks in the text sample, and if the property represents dialogue, determining that the expression mode label of the text inside the quotation marks is the dialogue type.
6. The method of claim 5, wherein the judging the property of quotation marks in the text sample comprises:
judging whether a colon exists before the quotation marks, and if so, judging that the property of the quotation marks represents dialogue;
or judging whether appointed characters exist before the quotation marks, and if so, judging that the property of the quotation marks represents dialogue; wherein the appointed characters are characters indicating that the characters following them are of the dialogue type;
or analyzing the parts of speech of the text inside the quotation marks, and if the text inside the quotation marks contains a verb, judging that the property of the quotation marks represents dialogue.
7. The method of claim 1, wherein the determining a predicted Mel frequency spectrum corresponding to the text sample based on the style vector, the text content feature vector and the expression mode feature vector comprises:
splicing the style vector, the text content feature vector and the expression mode feature vector based on a weight parameter corresponding to the style vector, a weight parameter corresponding to the text content feature vector and a weight parameter corresponding to the expression mode feature vector, to obtain a spliced vector;
and determining a predicted Mel frequency spectrum corresponding to the splicing vector based on an attention mechanism.
8. The method of training a speech synthesis model according to claim 1, wherein the determining a comprehensive training loss based on the Mel frequency spectrum loss and the style vector loss comprises:
performing weighted calculation on the Mel frequency spectrum loss and the style vector loss based on a weight parameter corresponding to the Mel frequency spectrum loss and a weight parameter corresponding to the style vector loss, to obtain the comprehensive training loss.
9. A method of audio generation, comprising:
acquiring a target text of a voice to be synthesized and target label information of the target text; wherein the target tag information comprises an expression mode tag;
inputting the target text and the target label information into the trained speech synthesis model obtained by the speech synthesis model training method of any one of claims 1 to 8;
extracting a text content feature vector of the target text, and extracting an expression mode feature vector of the target text based on the expression mode label;
determining a target style vector based on the target label information and a trained style vector corresponding to the trained speech synthesis model;
determining a target predicted Mel frequency spectrum corresponding to the target text based on the target style vector, the text content feature vector and the expression mode feature vector;
and synthesizing corresponding predicted voice by using the target predicted Mel frequency spectrum.
10. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the speech synthesis model training method of any one of claims 1 to 8 and/or the audio generation method of claim 9.
11. A computer-readable storage medium for storing a computer program which, when executed by a processor, implements the speech synthesis model training method of any one of claims 1 to 8 and/or the audio generation method of claim 9.
CN202110937782.1A 2021-08-16 2021-08-16 Speech synthesis model training method, audio generation method, device and medium Pending CN113658577A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110937782.1A CN113658577A (en) 2021-08-16 2021-08-16 Speech synthesis model training method, audio generation method, device and medium

Publications (1)

Publication Number Publication Date
CN113658577A true CN113658577A (en) 2021-11-16

Family

ID=78491083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110937782.1A Pending CN113658577A (en) 2021-08-16 2021-08-16 Speech synthesis model training method, audio generation method, device and medium

Country Status (1)

Country Link
CN (1) CN113658577A (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20080011859A (en) * 2006-08-01 2008-02-11 한국전자통신연구원 Method for predicting sentence-final intonation and text-to-speech system and method based on the same
US20080243508A1 (en) * 2007-03-28 2008-10-02 Kabushiki Kaisha Toshiba Prosody-pattern generating apparatus, speech synthesizing apparatus, and computer program product and method thereof
WO2019214145A1 (en) * 2018-05-10 2019-11-14 平安科技(深圳)有限公司 Text sentiment analyzing method, apparatus and storage medium
CN109065019A (en) * 2018-08-27 2018-12-21 北京光年无限科技有限公司 A kind of narration data processing method and system towards intelligent robot
KR20200063301A (en) * 2018-11-19 2020-06-05 (주)유니즈소프트 Game sound playing system using deep learning voice based on ai
US20200395008A1 (en) * 2019-06-15 2020-12-17 Very Important Puppets Inc. Personality-Based Conversational Agents and Pragmatic Model, and Related Interfaces and Commercial Models
CN110534089A (en) * 2019-07-10 2019-12-03 西安交通大学 A kind of Chinese speech synthesis method based on phoneme and rhythm structure
CN111667811A (en) * 2020-06-15 2020-09-15 北京百度网讯科技有限公司 Speech synthesis method, apparatus, device and medium
CN112309366A (en) * 2020-11-03 2021-02-02 北京有竹居网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112863483A (en) * 2021-01-05 2021-05-28 杭州一知智能科技有限公司 Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
CN112908294A (en) * 2021-01-14 2021-06-04 杭州倒映有声科技有限公司 Speech synthesis method and speech synthesis system
CN113096634A (en) * 2021-03-30 2021-07-09 平安科技(深圳)有限公司 Speech synthesis method, apparatus, server and storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114783405A (en) * 2022-05-12 2022-07-22 马上消费金融股份有限公司 Voice synthesis method and device, electronic equipment and storage medium
CN114783405B (en) * 2022-05-12 2023-09-12 马上消费金融股份有限公司 Speech synthesis method, device, electronic equipment and storage medium
CN114822495A (en) * 2022-06-29 2022-07-29 杭州同花顺数据开发有限公司 Acoustic model training method and device and speech synthesis method
CN115662435A (en) * 2022-10-24 2023-01-31 福建网龙计算机网络信息技术有限公司 Virtual teacher simulation voice generation method and terminal
US11727915B1 (en) 2022-10-24 2023-08-15 Fujian TQ Digital Inc. Method and terminal for generating simulated voice of virtual teacher
CN115547296A (en) * 2022-11-29 2022-12-30 零犀(北京)科技有限公司 Voice synthesis method and device, electronic equipment and storage medium
CN115547296B (en) * 2022-11-29 2023-03-10 零犀(北京)科技有限公司 Voice synthesis method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination