CN113948061A - Speech synthesis method, system, speech synthesis model and training method thereof - Google Patents

Speech synthesis method, system, speech synthesis model and training method thereof

Info

Publication number
CN113948061A
CN113948061A
Authority
CN
China
Prior art keywords
text
audio
training
bottleneck
acquiring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111205560.7A
Other languages
Chinese (zh)
Inventor
司马华鹏
毛志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suqian Silicon Based Intelligent Technology Co ltd
Original Assignee
Suqian Silicon Based Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suqian Silicon Based Intelligent Technology Co ltd filed Critical Suqian Silicon Based Intelligent Technology Co ltd
Priority to CN202111205560.7A
Publication of CN113948061A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a speech synthesis method, a speech synthesis system, a speech synthesis model and a training method thereof. The method includes: obtaining a target text and a first bottleneck characteristic of the target text; acquiring a reference audio library, wherein the reference audio library comprises a plurality of reference audios; acquiring a reference text corresponding to each reference audio in the reference audio library, and acquiring a second bottleneck characteristic of each reference text; calculating the similarity between the first bottleneck characteristic and the second bottleneck characteristic of each reference text, and determining the reference text corresponding to the second bottleneck characteristic with the highest similarity to the first bottleneck characteristic as a text template; determining the reference audio corresponding to the text template as an audio template; and inputting the audio template and the target text into a pre-trained speech synthesis model to synthesize speech with deep emotion-level features.

Description

Speech synthesis method, system, speech synthesis model and training method thereof
Technical Field
The present application relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method, a speech synthesis system, a speech synthesis model, and a training method thereof.
Background
Speech synthesis, also known as text-to-speech, is mainly used to convert text into speech and to make the synthesized speech as intelligible and natural as possible. In recent years, with the progress of speech synthesis technology, synthesized speech has come ever closer to real human speech in terms of sound quality and naturalness. However, human speech carries a variety of speaking styles and is rich in emotional color. Therefore, how to synthesize speech with a distinctive style and emotional color is key to the further development of speech synthesis technology.
In order to solve the above problem, speech with different styles or emotions can be synthesized by embedding style or emotion information in the speech synthesis stage. Such synthesis allows a wide variety of styles or emotions to be embedded: for example, speech can be synthesized in different style categories such as recitation or chat according to a user's selection, and the synthesized speech can carry different emotions such as happiness, sadness or anger.
However, the same emotion category may be further divided into a plurality of emotion levels. Taking the "happy" emotion category as an example, it may be further divided into levels such as "pleased," "joyful" and "overjoyed." Style or emotion embedding at the speech synthesis stage alone cannot realize speech synthesis at these emotion levels, which is unfavorable for the user experience.
Disclosure of Invention
The application provides a speech synthesis method, a speech synthesis system, a speech synthesis model and a training method thereof, which aim to solve the problem that speech with deep emotion levels cannot be synthesized in the prior art, and to improve the user experience.
In a first aspect, the present application provides a speech synthesis method, including:
acquiring a target text and a first bottleneck characteristic of the target text;
acquiring a reference audio library, wherein the reference audio library comprises a plurality of reference audios;
acquiring a reference text corresponding to each reference audio in the reference audio library, and acquiring a second bottleneck characteristic of each reference text;
calculating the similarity of the first bottleneck characteristic and a second bottleneck characteristic of each reference text, and determining the reference text corresponding to the second bottleneck characteristic with the highest similarity with the first bottleneck characteristic as a text template;
determining the reference audio corresponding to the text template as an audio template;
and inputting the audio template and the target text into a pre-trained speech synthesis model to obtain a synthesized audio.
Optionally, obtaining a first bottleneck characteristic corresponding to the target text includes:
acquiring text data related to emotion;
establishing an emotion encoding network model according to the text data, wherein the emotion encoding network model is used for acquiring emotion characteristics of an input text;
analyzing the target text according to the emotion coding network model, acquiring the emotion characteristics of the target text, and determining the emotion characteristics of the target text as first bottleneck characteristics.
Optionally, the obtaining of the second bottleneck feature corresponding to each of the reference texts includes:
analyzing each reference text according to the emotion coding network model, acquiring the emotion characteristics of each reference text, and determining the emotion characteristics of the reference texts as second bottleneck characteristics.
Optionally, obtaining a first bottleneck characteristic corresponding to the target text includes:
acquiring text data related to styles;
establishing a style coding network model according to the text data, wherein the style coding network model is used for acquiring style characteristics of an input text;
analyzing the target text according to the style coding network model, acquiring style characteristics of the target text, and determining the style characteristics of the target text as first bottleneck characteristics.
Optionally, the obtaining of the second bottleneck feature corresponding to each of the reference texts includes:
analyzing each reference text according to the style coding network model, acquiring the style characteristics of each reference text, and determining the style characteristics of the reference texts as second bottleneck characteristics.
In a second aspect, the present application provides a speech synthesis system configured to:
acquiring a target text and a first bottleneck characteristic of the target text;
acquiring a reference audio library, wherein the reference audio library comprises a plurality of reference audios;
acquiring a reference text corresponding to each reference audio in the reference audio library, and acquiring a second bottleneck characteristic of each reference text;
calculating the similarity of the first bottleneck characteristic and a second bottleneck characteristic of each reference text, and determining the reference text corresponding to the second bottleneck characteristic with the highest similarity with the first bottleneck characteristic as a text template;
determining the reference audio corresponding to the text template as an audio template;
and inputting the audio template and the target text into a pre-trained speech synthesis model to obtain a synthesized audio.
In a third aspect, the present application provides a speech synthesis model applied to the above method and system, including an encoder module, a feature extraction module, a duration prediction module, a duration sampling module, a fundamental frequency prediction module, a decoder module, and a vocoder module, wherein:
the encoder module is used for acquiring a text sequence of an input target text, wherein the text sequence of the target text is a phoneme set of the target text, and converting the text sequence into a corresponding text code;
the feature extraction module is used for acquiring a third bottleneck feature of the audio template according to the input audio template, wherein the third bottleneck feature at least comprises one of emotional feature and style feature of the audio template;
the duration prediction module is used for acquiring the predicted duration of the text code according to the text code and the third bottleneck characteristic, wherein the predicted duration of the text code is the pronunciation duration corresponding to each frame of the text code obtained through prediction;
the duration sampling module is used for performing upsampling processing on the text code according to the output of the feature extraction module and the duration prediction module to obtain the text code subjected to the upsampling processing and a third bottleneck feature subjected to the upsampling processing;
the fundamental frequency prediction module is used for predicting the fundamental frequency characteristic of the text code according to the input text code subjected to the upsampling processing and the third bottleneck characteristic subjected to the upsampling processing;
the decoder module is used for acquiring the audio features of the audio to be synthesized according to the text codes subjected to the upsampling, the third bottleneck features subjected to the upsampling and the fundamental frequency features;
the vocoder module is used for obtaining the synthesized audio according to the audio characteristics of the audio to be synthesized.
In a fourth aspect, the present application provides a method for training a speech synthesis model, which is applied to the method and system described above, and includes:
acquiring a training material, wherein the training material comprises a training audio and a training text corresponding to the training audio, and the training text is a text with one or more of emotional characteristics or style characteristics;
analyzing the training audio to obtain the audio features of the training audio;
analyzing the training text to obtain text features of the training text, wherein the text features of the training text comprise a collection of each phoneme of the training text, an emotional feature collection of the training text and a style feature collection of the training text;
matching the audio features of the training audio with the text features of the training text according to the training audio and the training text to obtain pronunciation duration information of the training text, wherein the pronunciation duration information is a duration set corresponding to each phoneme in the training text;
inputting the audio features, the text features and the pronunciation duration information to the speech synthesis model to train the speech synthesis model.
Optionally, the audio features include at least one of mel-frequency spectrum features and mel-frequency cepstrum features of the training audio.
Optionally, matching, according to the training audio and the training text, the audio features of the training audio with the text features of the training text to obtain the pronunciation duration information of the training text, where the pronunciation duration information is a duration set corresponding to each phoneme in the training text, includes:
and establishing a corresponding relation between each phoneme in the training text and the audio features of the training audio, so that each phoneme corresponds to the audio features of a plurality of training audios.
According to the above technical solution, the application provides a speech synthesis method, a speech synthesis system, a speech synthesis model and a training method thereof. The method includes: obtaining a target text and a first bottleneck characteristic of the target text; acquiring a reference audio library, wherein the reference audio library comprises a plurality of reference audios; acquiring a reference text corresponding to each reference audio in the reference audio library, and acquiring a second bottleneck characteristic of each reference text; calculating the similarity between the first bottleneck characteristic and the second bottleneck characteristic of each reference text, and determining the reference text corresponding to the second bottleneck characteristic with the highest similarity to the first bottleneck characteristic as a text template; determining the reference audio corresponding to the text template as an audio template; and inputting the audio template and the target text into a pre-trained speech synthesis model to synthesize speech with deep emotion-level features.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings needed in the embodiments are briefly described below; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flow chart of a speech synthesis method provided herein;
FIG. 2 is a speech synthesis model provided herein;
fig. 3 is a method for training a speech synthesis model according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all embodiments. Other embodiments based on the embodiments of the present application and obtained by a person of ordinary skill in the art without any creative effort belong to the protection scope of the present application.
To make the purpose and embodiments of the present application clearer, the following will clearly and completely describe the exemplary embodiments of the present application with reference to the attached drawings in the exemplary embodiments of the present application, and it is obvious that the described exemplary embodiments are only a part of the embodiments of the present application, and not all of the embodiments.
It should be noted that the brief descriptions of the terms in the present application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.
The terms "first," "second," "third," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between similar or analogous objects or entities and not necessarily for describing a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements expressly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The term "module" refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and/or software code that is capable of performing the functionality associated with that element.
In recent years, with the progress of speech synthesis technology, synthesized speech has come ever closer to real human speech in terms of sound quality and naturalness. However, human speech carries a variety of speaking styles and is rich in emotional color. Therefore, how to synthesize speech with a distinctive style and emotional color is key to the further development of speech synthesis technology. The application provides a speech synthesis method, a speech synthesis system, a speech synthesis model and a training method thereof, which can synthesize speech with deep emotion levels and improve the user experience.
Referring to fig. 1, a flow chart of a speech synthesis method provided by the present application is shown in fig. 1, and the method includes the following steps:
s110: and acquiring a target text and a first bottleneck characteristic of the target text.
In some embodiments, the target text may be an electronic book, a chapter, a paragraph or a sentence of an electronic book, or another type of text, such as a news article, an article from a public platform, a short message record, or a chat record from an instant-messaging application.
In some embodiments, the first bottleneck characteristic of the target text comprises at least one of an emotional characteristic to be expressed by the content of the target text and a style characteristic that can be embodied by the content of the target text.
In some embodiments, the emotional features to be expressed by the content of the target text may be obtained using an emotion encoding network model. To establish the emotion encoding network model, existing text data is crawled from public networks using a web crawler. Each piece of text data is analyzed by reading the words related to emotional content in it and is labeled manually; for example, the analyzed text data may be labeled with emotion categories such as happiness, anger, sadness, joy, disgust, shock and fear. A convolutional neural network of the emotion encoding network model is then built from the labeled text data and trained with back-propagation until a convergence condition is reached, which completes the training of the convolutional neural network of the emotion encoding network model. The trained convolutional neural network of the emotion encoding network model can extract the emotional features of an input text.
In some embodiments, the first bottleneck characteristic of the target text may be a style feature embodied by the content of the target text, and this style feature may be obtained using a style coding network model. To establish the style coding network model, existing text data is crawled from public networks using a web crawler. The text data is analyzed by reading its content to obtain its style features and is labeled manually; for example, the analyzed text data may be labeled with style categories such as chat and recitation. The style coding network model is then built from the labeled text data and trained with back-propagation until a convergence condition is reached, which completes the training of the style coding network model. The trained style coding network model can extract the style features of an input text.
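For illustration only, the following is a minimal sketch of how such an emotion or style coding network could be organized so that the activation of its penultimate layer is used as the bottleneck feature; the class name, layer sizes, vocabulary size and number of label categories are assumptions made for the example and are not specified by the description.
```python
# Minimal sketch (PyTorch) of an emotion/style coding network whose penultimate
# layer serves as the bottleneck feature. Vocabulary size, layer sizes and the
# number of label categories are illustrative assumptions.
import torch
import torch.nn as nn

class TextBottleneckEncoder(nn.Module):
    def __init__(self, vocab_size=8000, embed_dim=128, bottleneck_dim=64, num_classes=7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool1d(1)            # collapse the time axis
        self.bottleneck = nn.Linear(256, bottleneck_dim)
        self.classifier = nn.Linear(bottleneck_dim, num_classes)  # e.g. 7 emotion categories

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)      # (batch, embed_dim, seq_len)
        x = self.pool(self.conv(x)).squeeze(-1)        # (batch, 256)
        bottleneck = torch.tanh(self.bottleneck(x))    # emotion/style bottleneck feature
        logits = self.classifier(bottleneck)           # only used while training with labels
        return bottleneck, logits
```
During training the classifier head is optimized with cross-entropy against the manually labeled categories; at inference the logits are discarded and the bottleneck vector is used as the first or second bottleneck feature.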
S120: a reference audio library is obtained, the reference audio library comprising a number of reference audios.
In some embodiments, the reference audio may be drawn from the corpus used to train the speech synthesis model, and the reference audio making up the reference audio library should be audio with strong emotional or stylistic features.
S130: and acquiring a reference text corresponding to each reference audio in the reference audio library, and acquiring a second bottleneck characteristic of each reference text.
In some embodiments, the second bottleneck characteristic of a reference text comprises at least one of an emotional feature to be expressed by the content of the reference text and a style feature embodied by the content of the reference text.
In some embodiments, the emotional characteristics to be expressed by the content of the reference text can be obtained by using the emotion encoding network model.
In some embodiments, the style characteristics that can be embodied by the content of the reference text can be obtained by using the style-coded network model described above.
S140: and calculating the similarity of the first bottleneck characteristic and a second bottleneck characteristic of each reference text, and determining the reference text corresponding to the second bottleneck characteristic with the highest similarity with the first bottleneck characteristic as a text template.
In some embodiments, the first bottleneck feature and the second bottleneck features may both be features related to emotion categories, and an emotion category may be further divided into a plurality of emotion levels.
More specifically, the first bottleneck feature of the target text may belong to the same emotion category as the second bottleneck features of the reference texts. For example, suppose the first and second bottleneck features all belong to the emotion category "happy," which is further divided into emotion levels such as "pleased," "joyful" and "overjoyed"; different levels correspond to different degrees of "happy," and the similarity between the "pleased" level and the "joyful" level is significantly greater than the similarity between the "pleased" level and the "overjoyed" level. Therefore, if the first bottleneck feature of the target text corresponds to the "pleased" level, reference text 1 corresponds to the "joyful" level, and reference text 2 corresponds to the "overjoyed" level, then, since the similarity between the emotion level of the target text and that of reference text 1 is greater than the similarity between the emotion level of the target text and that of reference text 2, reference text 1 is selected as the text template.
Further, the first bottleneck feature of the target text may belong to a different emotion category from the second bottleneck features of the reference texts. For example, the first bottleneck feature belongs to the emotion category "happy" while the second bottleneck features belong to the emotion category "angry," which is further divided into emotion levels such as "irritated," "indignant" and "furious." If the first bottleneck feature of the target text corresponds to the "pleased" level, reference text 1 corresponds to the "indignant" level, and reference text 2 corresponds to the "furious" level, then, since the similarity between the emotion level of the target text and that of reference text 1 is greater than the similarity between the emotion level of the target text and that of reference text 2, reference text 1 is selected as the text template.
Further, the first bottleneck feature of the target text may belong to the same emotion category as the second bottleneck features of some reference texts and to a different emotion category from those of the remaining reference texts. For example, the first bottleneck feature belongs to the emotion category "happy," part of the second bottleneck features belong to "happy" and the rest belong to "angry." If the first bottleneck feature of the target text corresponds to the "pleased" level, reference text 1 corresponds to the "joyful" level, reference text 2 corresponds to the "overjoyed" level, reference text 3 corresponds to the "indignant" level, and reference text 4 corresponds to the "furious" level, then, since the similarity between the emotion level of the target text and that of reference text 1 is greater than the similarity between the emotion level of the target text and the emotion levels of the other reference texts, reference text 1 is selected as the text template.
In some embodiments, the first bottleneck feature and the second bottleneck features may instead be features related to style categories. In that case, the similarity between the first bottleneck feature and the second bottleneck feature of each reference text is calculated in the same way as for the emotion-related features described above, and the details are not repeated here.
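As an illustration of S140 and S150, the following sketch selects the text template and the corresponding audio template by comparing bottleneck features with cosine similarity; the choice of cosine similarity and the layout of the reference audio library are assumptions made for this example.
```python
# Illustrative sketch of S140/S150: selecting the text template and audio template
# by similarity between bottleneck features. Cosine similarity and the dictionary
# layout of the reference audio library are assumptions.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_audio_template(first_bottleneck, reference_library):
    """reference_library: list of dicts with keys 'text', 'audio' and 'bottleneck'."""
    best = max(reference_library,
               key=lambda ref: cosine_similarity(first_bottleneck, ref["bottleneck"]))
    return best["text"], best["audio"]  # text template, audio template
```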
S150: determining the reference audio corresponding to the text template as an audio template;
s160: and inputting the audio template and the target text into a pre-trained speech synthesis model to obtain a synthesized audio.
More specifically, referring to fig. 2, the speech synthesis model provided by the present application includes an encoder module, a feature extraction module, a duration prediction module, a duration sampling module, a fundamental frequency prediction module, a decoder module and a vocoder module, wherein:
the encoder module is used for acquiring a text sequence of an input target text and converting the text sequence into a corresponding text code.
In some embodiments, the text sequence of the target text is a phone set of the target text, and the encoder module may convert the target text into an abstract text encoding recognizable by the speech synthesis model for use by other modules;
the feature extraction module is used for acquiring a third bottleneck feature of the audio template according to the input audio template;
in some embodiments, the third bottleneck feature comprises at least one of an emotional feature and a style feature of the audio template.
And the duration prediction module is used for acquiring the predicted duration of the text code according to the text code and the third bottleneck characteristic.
In some embodiments, the predicted duration of the text encoding is a pronunciation duration corresponding to each frame of the text encoding obtained through prediction.
And the duration sampling module is used for performing upsampling processing on the text code according to the output of the feature extraction module and the duration prediction module to obtain the text code subjected to the upsampling processing and a third bottleneck feature subjected to the upsampling processing.
And the fundamental frequency prediction module is used for predicting the fundamental frequency characteristic of the text code according to the input text code subjected to the upsampling processing and the third bottleneck characteristic subjected to the upsampling processing.
And the decoder module is used for acquiring the audio features of the audio to be synthesized according to the text codes subjected to the upsampling processing, the third bottleneck features subjected to the upsampling processing and the fundamental frequency features.
The vocoder module is used for obtaining the synthesized audio according to the audio characteristics of the audio to be synthesized.
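The data flow through the seven modules can be summarized with the following structural sketch; each sub-module is treated as an opaque, already-trained network with the inputs and outputs stated above, since the description does not specify their internal architectures.
```python
# Structural sketch of how the seven modules are wired together. Sub-module
# internals are not specified by the description and are therefore omitted.
import torch.nn as nn

class SpeechSynthesisModel(nn.Module):
    def __init__(self, encoder, feature_extractor, duration_predictor,
                 duration_sampler, f0_predictor, decoder, vocoder):
        super().__init__()
        self.encoder = encoder                      # phoneme sequence -> text code
        self.feature_extractor = feature_extractor  # audio template -> third bottleneck feature
        self.duration_predictor = duration_predictor
        self.duration_sampler = duration_sampler    # up-samples per predicted durations
        self.f0_predictor = f0_predictor
        self.decoder = decoder                      # -> audio features (e.g. mel spectrum)
        self.vocoder = vocoder                      # audio features -> waveform

    def forward(self, phoneme_ids, audio_template):
        text_code = self.encoder(phoneme_ids)
        bottleneck = self.feature_extractor(audio_template)
        durations = self.duration_predictor(text_code, bottleneck)
        text_up, bottleneck_up = self.duration_sampler(text_code, bottleneck, durations)
        f0 = self.f0_predictor(text_up, bottleneck_up)
        audio_features = self.decoder(text_up, bottleneck_up, f0)
        return self.vocoder(audio_features)
```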
Further, exemplary implementations and usage of the speech synthesis method provided in the present application are described below in concrete application scenarios.
In some embodiments, the approach provided herein may synthesize audio with emotional features, as illustrated in exemplary embodiment 1 below.
This embodiment takes user A and user B chatting through social software as an example; an end-to-end code sketch follows the numbered steps below.
(1) The user A inputs a text A to the user B through social software;
(2) analyzing the text A, specifically, acquiring a first bottleneck characteristic of the text A, namely an emotional characteristic of the text A, through an emotional coding network model;
(3) acquiring a reference text corresponding to each reference audio from a reference audio library, and a second bottleneck characteristic corresponding to each reference text, namely the emotional characteristic of the reference text;
(4) respectively calculating the similarity between the emotional feature of each reference text and the emotional feature of the text A, selecting the reference text with the highest similarity with the emotional feature of the text A, and determining the reference text as a text template;
(5) acquiring a reference audio of the text template, and determining the reference audio as an audio template;
(6) inputting the audio template and the text A into a pre-trained speech synthesis model, wherein the speech synthesis model comprises an encoder module, a feature extraction module, a duration prediction module, a duration sampling module, a fundamental frequency prediction module, a decoder module and a vocoder module;
(7) the encoder module obtains the text sequence of the input text A, converts the text sequence into a corresponding text code, and inputs the text code to the duration prediction module;
(8) inputting an audio template into a feature extraction module, wherein the feature extraction module acquires the emotional features of the audio template and inputs the emotional features of the audio template into a duration prediction module;
(9) the duration prediction module acquires the predicted duration of the text code according to the input text code and the emotional characteristics of the audio template, and inputs the predicted duration of the text code to the duration sampling module;
(10) the duration sampling module performs upsampling processing on the text code according to the output of the feature extraction module and the duration prediction module, outputs the text code subjected to the upsampling processing and the third bottleneck feature subjected to the upsampling processing, and inputs the output result to the fundamental frequency prediction module;
(11) the fundamental frequency prediction module predicts the fundamental frequency features of the text code according to the input text code subjected to the upsampling processing and the emotional features of the audio template subjected to the upsampling processing, obtains the text code subjected to the upsampling processing, the predicted duration and the fundamental frequency features, and inputs the output result to the decoder module;
(12) the decoder module acquires the audio features of the audio to be synthesized according to the text code subjected to the up-sampling processing, the third bottleneck feature subjected to the up-sampling processing and the fundamental frequency feature, and inputs the audio features of the audio to be synthesized into the vocoder module;
(13) the vocoder module obtains synthesized audio according to the input audio characteristics of the audio to be synthesized.
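Tying the pieces together, steps (1) to (13) above can be sketched as a single function; emotion_encoder, reference_library, synthesis_model, select_audio_template and the g2p front end refer to the hypothetical components sketched earlier in this description, not to an existing API.
```python
# End-to-end sketch of steps (1)-(13) above for the chat scenario. All component
# names are placeholders for the hypothetical modules sketched earlier.
def synthesize_chat_message(text_a, emotion_encoder, reference_library,
                            synthesis_model, g2p):
    # steps (1)-(2): first bottleneck feature of text A from the emotion coding network
    first_bottleneck, _ = emotion_encoder(g2p.to_token_ids(text_a))
    # steps (3)-(5): pick the most similar reference text and take its reference audio
    _, audio_template = select_audio_template(first_bottleneck, reference_library)
    # steps (6)-(13): run the pre-trained speech synthesis model
    phoneme_ids = g2p.to_phoneme_ids(text_a)  # text sequence (phoneme set) of text A
    return synthesis_model(phoneme_ids, audio_template)
```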
In some embodiments, the approach provided herein may synthesize audio with style features, as illustrated in exemplary embodiment 2 below.
This embodiment likewise takes user A and user B chatting through social software as an example.
(1) The user A inputs a text A to the user B through social software;
(2) analyzing the text A, specifically, obtaining a first bottleneck characteristic of the text A, namely the style characteristic of the text A, through a style coding network model;
(3) acquiring a reference text corresponding to each reference audio from a reference audio library, and a second bottleneck characteristic corresponding to each reference text, namely a style characteristic of the reference text;
(4) respectively calculating the similarity between the style features of each reference text and the style features of the text A, selecting the reference text with the highest similarity with the style features of the text A, and determining the reference text as a text template;
(5) acquiring a reference audio of the text template, and determining the reference audio as an audio template;
(6) inputting the audio template and the text A into a pre-trained speech synthesis model, wherein the speech synthesis model comprises an encoder module, a feature extraction module, a duration prediction module, a duration sampling module, a fundamental frequency prediction module, a decoder module and a vocoder module;
(7) the encoder module obtains the text sequence of the input text A, converts the text sequence into a corresponding text code, and inputs the text code to the duration prediction module;
(8) inputting an audio template into a feature extraction module, wherein the feature extraction module acquires style features of the audio template and inputs the style features of the audio template into a duration prediction module;
(9) the duration prediction module acquires the predicted duration of the text code according to the input text code and the style characteristics of the audio template, and inputs the predicted duration of the text code to the duration sampling module;
(10) the duration sampling module performs upsampling processing on the text code according to the output of the feature extraction module and the duration prediction module, outputs the text code subjected to the upsampling processing and the third bottleneck feature subjected to the upsampling processing, and inputs the output result to the fundamental frequency prediction module;
(11) the fundamental frequency prediction module predicts the fundamental frequency features of the text code according to the input text code subjected to the upsampling processing and the style features of the audio template subjected to the upsampling processing, obtains the text code subjected to the upsampling processing, the predicted duration and the fundamental frequency features, and inputs the output result to the decoder module;
(12) the decoder module acquires the audio features of the audio to be synthesized according to the text code subjected to the up-sampling processing, the third bottleneck feature subjected to the up-sampling processing and the fundamental frequency feature, and inputs the audio features of the audio to be synthesized into the vocoder module;
(13) the vocoder module obtains synthesized audio according to the input audio characteristics of the audio to be synthesized.
In some embodiments, the present application may also be applied to converting text into audio, as illustrated in exemplary embodiment 3 below, which takes a novel that a user wishes to listen to as an example.
(1) Acquiring the text content of the novel, analyzing the text content of the novel, and specifically acquiring a first bottleneck characteristic of the text, namely an emotional characteristic of the text, through an emotional coding network model;
(2) acquiring a reference text corresponding to each reference audio from a reference audio library, and a second bottleneck characteristic corresponding to each reference text, namely the emotional characteristic of the reference text;
(3) respectively calculating the similarity of the emotional features of each reference text and the emotional features of the novel text, selecting the reference text with the highest similarity with the emotional features of the novel text, and determining the reference text as a text template;
(4) acquiring a reference audio of the text template, and determining the reference audio as an audio template;
(5) inputting the audio template and the novel text into a pre-trained speech synthesis model, wherein the speech synthesis model comprises an encoder module, a feature extraction module, a duration prediction module, a duration sampling module, a fundamental frequency prediction module, a decoder module and a vocoder module;
(6) the encoder module obtains the text sequence of the input novel text, converts the text sequence into a corresponding text code, and inputs the text code to the duration prediction module;
(7) inputting an audio template into a feature extraction module, wherein the feature extraction module acquires the emotional features of the audio template and inputs the emotional features of the audio template into a duration prediction module;
(8) the duration prediction module acquires the predicted duration of the text code according to the input text code and the emotional characteristics of the audio template, and inputs the predicted duration of the text code to the duration sampling module;
(9) the duration sampling module performs upsampling processing on the text code according to the output of the feature extraction module and the duration prediction module, outputs the text code subjected to the upsampling processing and the third bottleneck feature subjected to the upsampling processing, and inputs the output result to the fundamental frequency prediction module;
(10) the fundamental frequency prediction module predicts the fundamental frequency features of the text code according to the input text code subjected to the upsampling processing and the emotional features of the audio template subjected to the upsampling processing, obtains the text code subjected to the upsampling processing, the predicted duration and the fundamental frequency features, and inputs the output result to the decoder module;
(11) the decoder module acquires the audio features of the audio to be synthesized according to the text code subjected to the up-sampling processing, the third bottleneck feature subjected to the up-sampling processing and the fundamental frequency feature, and inputs the audio features of the audio to be synthesized into the vocoder module;
(12) the vocoder module obtains synthesized audio according to the input audio characteristics of the audio to be synthesized.
In some embodiments, the present application may also convert the target text into audio with style features, as illustrated in exemplary embodiment 4 below, which again takes a novel that a user wishes to listen to as an example.
(1) Acquiring the text content of a novel, analyzing the text content of the novel, and specifically acquiring a first bottleneck characteristic of the text, namely the style characteristic of the text, through a style coding network model;
(2) acquiring a reference text corresponding to each reference audio from a reference audio library, and a second bottleneck characteristic corresponding to each reference text, namely a style characteristic of the reference text;
(3) respectively calculating the similarity of the style features of each reference text and the style features of the novel text, selecting the reference text with the highest similarity with the style features of the novel text, and determining the reference text as a text template;
(4) acquiring a reference audio of the text template, and determining the reference audio as an audio template;
(5) inputting the audio template and the novel text into a pre-trained speech synthesis model, wherein the speech synthesis model comprises an encoder module, a feature extraction module, a duration prediction module, a duration sampling module, a fundamental frequency prediction module, a decoder module and a vocoder module;
(6) the encoder module obtains the text sequence of the input novel text, converts the text sequence into a corresponding text code, and inputs the text code to the duration prediction module;
(7) inputting an audio template into a feature extraction module, wherein the feature extraction module acquires style features of the audio template and inputs the style features of the audio template into a duration prediction module;
(8) the duration prediction module acquires the predicted duration of the text code according to the input text code and the style characteristics of the audio template, and inputs the predicted duration of the text code to the duration sampling module;
(9) the duration sampling module performs upsampling processing on the text code according to the output of the feature extraction module and the duration prediction module, outputs the text code subjected to the upsampling processing and the third bottleneck feature subjected to the upsampling processing, and inputs the output result to the fundamental frequency prediction module;
(10) the fundamental frequency prediction module predicts the fundamental frequency features of the text code according to the input text code subjected to the upsampling processing and the style features of the audio template subjected to the upsampling processing, obtains the text code subjected to the upsampling processing, the predicted duration and the fundamental frequency features, and inputs the output result to the decoder module;
(11) the decoder module acquires the audio features of the audio to be synthesized according to the text code subjected to the up-sampling processing, the third bottleneck feature subjected to the up-sampling processing and the fundamental frequency feature, and inputs the audio features of the audio to be synthesized into the vocoder module;
(12) the vocoder module obtains synthesized audio according to the input audio characteristics of the audio to be synthesized.
Referring to fig. 3, the present application provides a method for training a speech synthesis model, by which the speech synthesis model described above can be trained and optimized. The method includes:
s310, acquiring a training material, wherein the training material comprises a training audio and a training text corresponding to the training audio;
in some embodiments, the training audio may be a set of emotional voice data with different emotional tendencies recorded by a professional sound recorder, or may be a set of a large number of emotional voice data with different emotional tendencies directly crawled on the internet.
S320, analyzing the training audio to obtain the audio characteristics of the training audio;
In some embodiments, the training audio may be subjected to a short-time Fourier transform; the resulting spectrum is then mapped to the mel scale with triangular filter banks, the logarithm is taken, and a discrete cosine transform is applied, to obtain the audio features of the training audio, where the audio features include at least one of the mel-spectrum features and the mel-cepstral features of the training audio.
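A possible realization of this feature extraction with the librosa library is sketched below; the sampling rate, FFT size, hop length and number of mel bands are illustrative assumptions rather than values given in the description.
```python
# One possible realization of S320 with librosa; frame parameters are assumptions.
import librosa
import numpy as np

def extract_audio_features(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = np.log(mel + 1e-6)                                        # mel-spectrum feature
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=13)  # mel-cepstrum feature
    return log_mel, mfcc
```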
S330, analyzing the training text to obtain the text characteristics of the training text;
In some embodiments, the training text is sequentially subjected to sentence structure analysis, text regularization, word segmentation, part-of-speech prediction, prosody prediction, phoneme conversion and the like to obtain the text features of the training text, where the text features include a set of the phonemes of the training text, a set of the emotional features of the training text, and a set of the style features of the training text.
Here, sentence structure analysis is used to divide the training text into a set of individual sentences; optionally, it can be implemented with a neural-network-based model. Text regularization converts non-Chinese symbols in the training text, such as punctuation or digits, into their spoken expression in the Chinese context; for example, regularizing the training text "2.1" yields the training text "two point one", and this example is not limiting. Optionally, text regularization can be implemented with a neural-network-based model. Word segmentation splits the sentences of the training text according to their semantics, keeping the characters of one word together; optionally, word segmentation can be implemented with a neural-network-based model. Part-of-speech prediction predicts the part of speech of each word in the segmented training text; parts of speech include, but are not limited to, nouns, verbs, adjectives, quantifiers, pronouns, adverbs, prepositions, conjunctions, particles, interjections and onomatopoeia. Optionally, part-of-speech prediction can be implemented with a neural-network-based model. Prosody prediction predicts the prosody of each word after part-of-speech prediction, that is, the tonal pattern (level and oblique tones) and prosodic rules of each word; optionally, prosody prediction can be implemented with a neural-network-based model. Phoneme conversion converts the text after prosody prediction into the corresponding phonemes; for example, if the text to be converted is "good" (Chinese "hao", third tone), the result of phoneme conversion is "h, a, o, 3", which contains 3 phonemes and the tone mark, and this example is not limiting. Optionally, phoneme conversion can be implemented with a neural-network-based model.
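The text analysis of S330 can be sketched as the following chain of sub-steps; each frontend helper stands for one of the neural-network-based models mentioned above, and their names and interfaces are assumed for the example.
```python
# Sketch of the S330 text front end. Every frontend.* helper is a placeholder for
# one of the neural-network-based models mentioned in the description.
def extract_text_features(training_text, frontend):
    features = []
    for sentence in frontend.sentence_split(training_text):   # sentence structure analysis
        sentence = frontend.normalize_text(sentence)           # e.g. "2.1" -> "two point one"
        words = frontend.segment(sentence)                     # word segmentation
        pos = frontend.predict_pos(words)                      # part-of-speech prediction
        prosody = frontend.predict_prosody(words, pos)         # prosody prediction
        phonemes = frontend.to_phonemes(words, prosody)        # e.g. "good" -> ["h", "a", "o", "3"]
        features.append({"phonemes": phonemes, "pos": pos, "prosody": prosody})
    return features
```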
S340, matching the audio features of the training audio with the text features of the training text according to the training audio and the training text to obtain the pronunciation duration information of the training text, wherein the pronunciation duration information is a duration set corresponding to each phoneme in the training text;
in some embodiments, according to the training audio and the training text, aligning the audio features of the training audio with the text features of the training text by using an alignment tool to obtain pronunciation duration information of the training text.
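A minimal sketch of turning such an alignment into the per-phoneme duration information follows; it assumes the alignment tool outputs (phoneme, start time, end time) triples in seconds, and the hop length and sampling rate are illustrative.
```python
# Minimal sketch of S340: converting a phoneme-level alignment into per-phoneme
# frame durations. The alignment format, hop length and sampling rate are assumptions.
def durations_from_alignment(alignment, hop_length=256, sr=22050):
    frames_per_second = sr / hop_length
    return [(phoneme, int(round((end - start) * frames_per_second)))
            for phoneme, start, end in alignment]
```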
And S350, inputting the audio features, the text features and the pronunciation duration information into the speech synthesis model to train the speech synthesis model.
In some embodiments, the present application further provides a speech synthesis system configured to:
acquiring a target text and a first bottleneck characteristic of the target text;
acquiring a reference audio library, wherein the reference audio library comprises a plurality of reference audios;
acquiring a reference text corresponding to each reference audio in the reference audio library, and acquiring a second bottleneck characteristic of each reference text;
calculating the similarity of the first bottleneck characteristic and a second bottleneck characteristic of each reference text, and determining the reference text corresponding to the second bottleneck characteristic with the highest similarity with the first bottleneck characteristic as a text template;
determining the reference audio corresponding to the text template as an audio template;
and inputting the audio template and the target text into a pre-trained speech synthesis model to obtain a synthesized audio.
According to the above technical solution, the application provides a speech synthesis method, a speech synthesis system, a speech synthesis model and a training method thereof. The method includes: obtaining a target text and a first bottleneck characteristic of the target text; acquiring a reference audio library, wherein the reference audio library comprises a plurality of reference audios; acquiring a reference text corresponding to each reference audio in the reference audio library, and acquiring a second bottleneck characteristic of each reference text; calculating the similarity between the first bottleneck characteristic and the second bottleneck characteristic of each reference text, and determining the reference text corresponding to the second bottleneck characteristic with the highest similarity to the first bottleneck characteristic as a text template; determining the reference audio corresponding to the text template as an audio template; and inputting the audio template and the target text into a pre-trained speech synthesis model to synthesize speech with deep emotion-level features.
In specific implementation, the present invention further provides a computer storage medium, where the computer storage medium may store a program, and the program may include some or all of the steps in the embodiments of the speech synthesis method provided by the present invention when executed. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM) or a Random Access Memory (RAM).
Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented as software plus a required general purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The embodiments provided in the present application are only a few examples of the general concept of the present application, and do not limit the scope of the present application. Any other embodiments extended according to the scheme of the present application without inventive efforts will be within the scope of protection of the present application for a person skilled in the art.

Claims (10)

1. A method of speech synthesis, comprising:
acquiring a target text and a first bottleneck characteristic of the target text;
acquiring a reference audio library, wherein the reference audio library comprises a plurality of reference audios;
acquiring a reference text corresponding to each reference audio in the reference audio library, and acquiring a second bottleneck characteristic of each reference text;
calculating the similarity of the first bottleneck characteristic and a second bottleneck characteristic of each reference text, and determining the reference text corresponding to the second bottleneck characteristic with the highest similarity with the first bottleneck characteristic as a text template;
determining the reference audio corresponding to the text template as an audio template;
and inputting the audio template and the target text into a pre-trained speech synthesis model to obtain a synthesized audio.
2. The method of claim 1, wherein obtaining a first bottleneck feature corresponding to the target text comprises:
acquiring text data related to emotion;
establishing an emotion encoding network model according to the text data, wherein the emotion encoding network model is used for acquiring emotion characteristics of an input text;
analyzing the target text according to the emotion coding network model, acquiring the emotion characteristics of the target text, and determining the emotion characteristics of the target text as first bottleneck characteristics.
3. The method according to claim 2, wherein obtaining a second bottleneck characteristic corresponding to each of the reference texts comprises:
analyzing each reference text according to the emotion coding network model, acquiring the emotion characteristics of each reference text, and determining the emotion characteristics of the reference texts as second bottleneck characteristics.
4. The method of claim 1, wherein obtaining a first bottleneck feature corresponding to the target text comprises:
acquiring text data related to styles;
establishing a style coding network model according to the text data, wherein the style coding network model is used for acquiring style characteristics of an input text;
analyzing the target text according to the style coding network model, acquiring style characteristics of the target text, and determining the style characteristics of the target text as first bottleneck characteristics.
5. The method of claim 4, wherein obtaining a second bottleneck characteristic corresponding to each of the reference texts comprises:
analyzing each reference text according to the style coding network model, acquiring the style characteristics of each reference text, and determining the style characteristics of the reference texts as second bottleneck characteristics.
6. A speech synthesis system, characterized in that the system is configured to:
acquiring a target text and a first bottleneck feature of the target text;
acquiring a reference audio library, wherein the reference audio library comprises a plurality of reference audios;
acquiring a reference text corresponding to each reference audio in the reference audio library, and acquiring a second bottleneck feature of each reference text;
calculating the similarity between the first bottleneck feature and the second bottleneck feature of each reference text, and determining the reference text whose second bottleneck feature has the highest similarity to the first bottleneck feature as a text template;
determining the reference audio corresponding to the text template as an audio template;
and inputting the audio template and the target text into a pre-trained speech synthesis model to obtain a synthesized audio.
7. A speech synthesis model for use in the method of any one of claims 1-5 and the system of claim 6, comprising an encoder module, a feature extraction module, a duration prediction module, a duration sampling module, a pitch prediction module, a decoder module, and a vocoder module, wherein:
the encoder module is used for acquiring a text sequence of an input target text, the text sequence being the phoneme set of the target text, and for converting the text sequence into a corresponding text code;
the feature extraction module is used for acquiring a third bottleneck feature of an input audio template, wherein the third bottleneck feature comprises at least one of an emotional feature and a style feature of the audio template;
the duration prediction module is used for acquiring a predicted duration of the text code according to the text code and the third bottleneck feature, the predicted duration being the predicted pronunciation duration corresponding to each frame of the text code;
the duration sampling module is used for upsampling the text code according to the outputs of the feature extraction module and the duration prediction module, to obtain an upsampled text code and an upsampled third bottleneck feature;
the pitch prediction module is used for predicting a fundamental frequency feature of the text code according to the upsampled text code and the upsampled third bottleneck feature;
the decoder module is used for acquiring audio features of the audio to be synthesized according to the upsampled text code, the upsampled third bottleneck feature and the fundamental frequency feature;
the vocoder module is used for obtaining the synthesized audio according to the audio features of the audio to be synthesized.
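Illustrative sketch (not part of the claims): claim 7 does not specify how the duration sampling module performs the upsampling. A common realization in non-autoregressive speech synthesis is a length regulator that repeats each phoneme-level vector by its predicted number of frames; the sketch below assumes that approach, and the tensor shapes and names are illustrative.

```python
import torch

def length_regulate(text_code: torch.Tensor,
                    bottleneck: torch.Tensor,
                    durations: torch.Tensor):
    """Upsample phoneme-level features to frame level.

    text_code:  (num_phonemes, encoder_dim) output of the encoder module
    bottleneck: (num_phonemes, bottleneck_dim) third bottleneck feature,
                broadcast to phoneme level by the feature extraction module
    durations:  (num_phonemes,) integer frame counts from the duration predictor
    """
    # Repeat each phoneme-level vector 'duration' times along the time axis.
    upsampled_text = torch.repeat_interleave(text_code, durations, dim=0)
    upsampled_bottleneck = torch.repeat_interleave(bottleneck, durations, dim=0)
    return upsampled_text, upsampled_bottleneck   # both (num_frames, dim)
```

The resulting frame-level tensors would then feed the pitch prediction module and the decoder module, whose output audio features in turn drive the vocoder module.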
8. A method for training a speech synthesis model, applied to the method of any one of claims 1 to 5 and the system of claim 6, comprising:
acquiring a training material, wherein the training material comprises a training audio and a training text corresponding to the training audio, the training text being a text having emotional features and/or style features;
analyzing the training audio to obtain audio features of the training audio;
analyzing the training text to obtain text features of the training text, wherein the text features of the training text comprise the set of phonemes of the training text, the set of emotional features of the training text and the set of style features of the training text;
matching the audio features of the training audio with the text features of the training text according to the training audio and the training text to obtain pronunciation duration information of the training text, wherein the pronunciation duration information is the set of durations corresponding to the phonemes of the training text;
inputting the audio features, the text features and the pronunciation duration information to the speech synthesis model to train the speech synthesis model.
9. The training method of claim 8, wherein the audio features comprise at least one of mel-frequency spectral features and mel-frequency cepstral features of the training audio.
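Illustrative sketch (not part of the claims): claim 9 leaves the exact audio features open. The snippet below extracts a log-mel spectrogram and mel-frequency cepstral coefficients with the librosa library; the sampling rate and frame parameters (n_fft, hop_length, n_mels, n_mfcc) are assumed values, not taken from the patent.

```python
import librosa
import numpy as np

def extract_audio_features(wav_path: str, sr: int = 22050):
    # Load the training audio at a fixed sampling rate.
    audio, sr = librosa.load(wav_path, sr=sr)

    # Mel spectrogram (80 mel bands per frame), converted to log scale.
    mel = librosa.feature.melspectrogram(y=audio, sr=sr,
                                         n_fft=1024, hop_length=256, n_mels=80)
    log_mel = librosa.power_to_db(mel, ref=np.max)

    # Mel-frequency cepstral coefficients (13 coefficients per frame).
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, hop_length=256)

    return log_mel.T, mfcc.T   # both shaped (num_frames, num_features)
```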
10. The training method according to claim 8, wherein matching audio features of the training audio with text features of the training text according to the training audio and the training text to obtain pronunciation duration information of the training text, the pronunciation duration information being a duration set corresponding to each phoneme in the training text, comprises:
and establishing a correspondence between each phoneme in the training text and the audio features of the training audio, so that each phoneme corresponds to the audio features of a plurality of frames of the training audio.
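Illustrative sketch (not part of the claims): the claims do not state how the phoneme-to-frame correspondence is established; in practice it is commonly obtained from a forced aligner. Assuming such an aligner has already produced (phoneme, start, end) intervals in seconds, the helper below converts them into per-phoneme frame counts consistent with the hop length used for the audio features; the sample rate and hop length are assumptions.

```python
def durations_from_alignment(intervals, sr=22050, hop_length=256):
    """intervals: list of (phoneme, start_sec, end_sec) from a forced aligner.
    Returns a list of (phoneme, n_frames) so that each phoneme is associated
    with the audio-feature frames it spans."""
    frames_per_second = sr / hop_length
    durations = []
    for phoneme, start, end in intervals:
        n_frames = max(1, round((end - start) * frames_per_second))
        durations.append((phoneme, n_frames))
    return durations

# Example: a phoneme spanning 0.10 s to 0.22 s covers about 10 feature frames.
print(durations_from_alignment([("a1", 0.10, 0.22)]))
```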
CN202111205560.7A 2021-10-15 2021-10-15 Speech synthesis method, system, speech synthesis model and training method thereof Pending CN113948061A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111205560.7A CN113948061A (en) 2021-10-15 2021-10-15 Speech synthesis method, system, speech synthesis model and training method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111205560.7A CN113948061A (en) 2021-10-15 2021-10-15 Speech synthesis method, system, speech synthesis model and training method thereof

Publications (1)

Publication Number Publication Date
CN113948061A true CN113948061A (en) 2022-01-18

Family

ID=79331057

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111205560.7A Pending CN113948061A (en) 2021-10-15 2021-10-15 Speech synthesis method, system, speech synthesis model and training method thereof

Country Status (1)

Country Link
CN (1) CN113948061A (en)

Similar Documents

Publication Publication Date Title
KR102582291B1 (en) Emotion information-based voice synthesis method and device
JP2022107032A (en) Text-to-speech synthesis method using machine learning, device and computer-readable storage medium
JP7228998B2 (en) speech synthesizer and program
KR101160193B1 (en) Affect and Voice Compounding Apparatus and Method therefor
CN112786007A (en) Speech synthesis method, device, readable medium and electronic equipment
US20230298564A1 (en) Speech synthesis method and apparatus, device, and storage medium
CN113658577A (en) Speech synthesis model training method, audio generation method, device and medium
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
Gabdrakhmanov et al. Ruslan: Russian spoken language corpus for speech synthesis
CN110930975B (en) Method and device for outputting information
CN113903326A (en) Speech synthesis method, apparatus, device and storage medium
CN116312463A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116386594A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN113539239B (en) Voice conversion method and device, storage medium and electronic equipment
CN114708848A (en) Method and device for acquiring size of audio and video file
CN113948061A (en) Speech synthesis method, system, speech synthesis model and training method thereof
CN114333903A (en) Voice conversion method and device, electronic equipment and storage medium
CN113948062A (en) Data conversion method and computer storage medium
Jamtsho et al. OCR and speech recognition system using machine learning
CN113450756A (en) Training method of voice synthesis model and voice synthesis method
Houidhek et al. Evaluation of speech unit modelling for HMM-based speech synthesis for Arabic
Carson-Berndsen Multilingual time maps: portable phonotactic models for speech technology
CN115457931B (en) Speech synthesis method, device, equipment and storage medium
Ajayi et al. Indigenuous Vocabulary Reformulation for Continuousyorùbá Speech Recognition In M-Commerce Using Acoustic Nudging-Based Gaussian Mixture Model
Hirose Modeling of fundamental frequency contours for HMM-based speech synthesis: Representation of fundamental frequency contours for statistical speech synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination