CN112309367A - Speech synthesis method, speech synthesis device, storage medium and electronic equipment

Info

Publication number: CN112309367A
Application number: CN202011211115.7A
Authority: CN
Inventors: 徐晨畅; 潘俊杰
Original assignee: Beijing Youzhuju Network Technology Co Ltd
Current assignee: Beijing Youzhuju Network Technology Co Ltd
Priority date: 2020-11-03
Filing date: 2020-11-03
Publication date: 2021-02-02
Anticipated expiration: 2040-11-03
Also published as: CN112309367B

Abstract

The present disclosure relates to a speech synthesis method, apparatus, storage medium, and electronic device that can perform speech synthesis on a text in which no accent words are marked, so that the synthesized speech carries accents. The speech synthesis method comprises: acquiring a text to be synthesized; and inputting the text to be synthesized into a speech synthesis model to obtain audio information corresponding to the text to be synthesized, wherein the speech synthesis model is trained on sample texts marked with accent words and on sample audio corresponding to the sample texts. The speech synthesis model is used for determining accent words in the text to be synthesized and generating audio information corresponding to the text to be synthesized according to the accent words.

Description

Speech synthesis method, speech synthesis device, storage medium and electronic equipment

Technical Field

The present disclosure relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method, apparatus, storage medium, and electronic device.

Background

Speech synthesis, also known as text-to-speech (TTS), is a technology that converts arbitrary input text into corresponding speech. Conventional speech synthesis systems typically include two modules, a front end and a back end. The front-end module analyzes the input text and extracts the linguistic information needed by the back-end module. The back-end module then generates a speech waveform from the front-end analysis result.

However, speech synthesis methods in the related art generally do not consider accents, so the synthesized speech has no accents, sounds flat, and lacks expressiveness. Alternatively, such methods randomly select words in the input text to accent, which places accents incorrectly in the synthesized speech and fails to yield a good accented synthesis result.

Disclosure of Invention

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

In a first aspect, the present disclosure provides a method of speech synthesis, the method comprising:

acquiring a text to be synthesized;

inputting the text to be synthesized into a speech synthesis model to obtain audio information corresponding to the text to be synthesized, wherein the speech synthesis model is obtained by training a sample text marked with accent words and a sample audio corresponding to the sample text;

the speech synthesis model is used for determining accent words in the text to be synthesized and generating audio information corresponding to the text to be synthesized according to the accent words.

In a second aspect, the present disclosure provides a speech synthesis apparatus, the apparatus comprising:

the acquisition module is used for acquiring a text to be synthesized;

the synthesis module is used for inputting the text to be synthesized into a speech synthesis model to obtain audio information corresponding to the text to be synthesized, wherein the speech synthesis model is obtained by training a sample text marked with accent words and a sample audio corresponding to the sample text;

the speech synthesis model is used for identifying accent words in the text to be synthesized and generating audio information corresponding to the text to be synthesized according to the identified accent words.

In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the first aspect.

In a fourth aspect, the present disclosure provides an electronic device comprising:

a storage device having a computer program stored thereon;

processing means for executing the computer program in the storage means to carry out the steps of the method of the first aspect.

With the above technical solution, the speech synthesis model can determine the accent words in the text to be synthesized and generate audio information that includes accented pronunciation according to those accent words. The accent words in the text to be synthesized therefore do not need to be marked manually, which reduces the manual effort consumed in speech synthesis and improves its efficiency and degree of automation. Moreover, because the speech synthesis model is trained on a large number of sample texts marked with accent words, the accuracy of the accented pronunciation in the generated audio information can be ensured to a certain extent, in contrast to the related-art approach of adding accents at random.

Additional features and advantages of the disclosure will be set forth in the detailed description which follows.

Drawings

The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:

FIG. 1 is a flow diagram illustrating a method of speech synthesis according to an exemplary embodiment of the present disclosure;

FIG. 2 is a schematic diagram illustrating a speech synthesis model in a method of speech synthesis according to an exemplary embodiment of the present disclosure;

FIG. 3 is a schematic diagram illustrating a speech synthesis model in a method of speech synthesis according to another exemplary embodiment of the present disclosure;

FIG. 4 is a block diagram illustrating a speech synthesis apparatus according to an exemplary embodiment of the present disclosure;

FIG. 5 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description. It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units. It is further noted that references to "a", "an", and "the" modifications in the present disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.

The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.

As mentioned in the background, speech synthesis methods in the related art usually do not consider accents, so the synthesized speech has no accents, sounds flat, and lacks expressiveness. Alternatively, such methods randomly select words in the input text to accent, which places accents incorrectly in the synthesized speech and fails to yield a good accented synthesis result.

In view of the above, the present disclosure provides a speech synthesis method, apparatus, storage medium, and electronic device that synthesize speech in a new way: a speech synthesis model predicts the accents in the text to be synthesized, and the predicted accents are used to control accented pronunciation in the synthesized speech. The synthesized speech thus includes accented pronunciation, and the accuracy of that pronunciation can be improved to a certain extent.

FIG. 1 is a flow chart illustrating a method of speech synthesis according to an exemplary embodiment of the present disclosure. Referring to FIG. 1, the speech synthesis method includes:

Step 101, acquiring a text to be synthesized.

Step 102, inputting a text to be synthesized into a speech synthesis model to obtain audio information corresponding to the text to be synthesized, wherein the speech synthesis model is obtained by training a sample text marked with accent words and a sample audio corresponding to the sample text.

The speech synthesis model is used for determining accent words in the text to be synthesized and generating audio information corresponding to the text to be synthesized according to the accent words.

In this way, the speech synthesis model can determine the accent words in the text to be synthesized and generate audio information that includes accented pronunciation according to those accent words, without the accent words in the text to be synthesized being marked manually. This reduces the manual effort consumed in speech synthesis and improves its efficiency and degree of automation. Moreover, because the speech synthesis model is trained on a large number of sample texts marked with accent words, the accuracy of the accented pronunciation in the generated audio information can be ensured to a certain extent, in contrast to the related-art approach of adding accents at random.

In order to make the speech synthesis method provided by the present disclosure more understandable to those skilled in the art, the above steps are exemplified in detail below.

First, a training process of the speech synthesis model is explained.

For example, a plurality of sample texts for training and the sample audio corresponding to each sample text may be obtained in advance, where each sample text is labeled with accent words, that is, with the words that require accented pronunciation. In a possible manner, the accent words labeled in a sample text may be determined as follows. First, a plurality of sample texts are obtained, each of which includes accent words marked with initial accent marks. Then, for each accent word marked with an initial accent mark: if the accent word is marked as an accent word in every sample text, a target accent mark is added to the accent word; if the accent word is marked as an accent word in at least two sample texts, a target accent mark is added to the accent word only when the fundamental frequency of the accent word is greater than a preset fundamental frequency threshold and the energy of the accent word is greater than a preset energy threshold. Finally, for each sample text, the accent words to which target accent marks have been added are determined as the accent words in that sample text.

The plurality of sample texts may be sample texts that have the same content and were given initial accent marks by different users, or may be a plurality of texts that have different contents, where each text of the same content was given initial accent marks by different users, and so on, which is not limited in the embodiments of the present disclosure. It should be understood that, to improve the accuracy of the result, the plurality of sample texts preferably cover different contents, with each text of the same content given initial accent marks by different users.

For example, the time boundary information of each word of the sample text within the sample audio may first be obtained through an automatic alignment model, which yields the time boundary information of each word and each prosodic phrase in the sample text. Then, working from the aligned sample audio and sample text, multiple users may mark accent words at the prosodic phrase level by combining auditory impression, the waveform, the spectrum, and the semantic information of the sample text, so as to obtain a plurality of sample texts with initial accent marks. A prosodic phrase is an intermediate rhythmic chunk between a prosodic word and an intonation phrase. A prosodic word is a group of closely related syllables that are usually pronounced together in actual speech. An intonation phrase is formed by connecting several prosodic phrases according to a certain intonation pattern and generally corresponds to a syntactic sentence. In the embodiments of the present disclosure, the initial accent marks in the sample text may correspond to prosodic phrases, yielding prosodic-phrase-level initial accent marks, so that the accented pronunciation better conforms to ordinary pronunciation habits.

Alternatively, in other possible cases, the initial accent marks in the sample text may correspond to a single word or a single character, so as to obtain word-level or character-level accents, and so on, which may be selected as required in a specific implementation.

After the plurality of sample texts with initial accent marks are obtained, the initial accent marks in them may be integrated. Specifically, for each accent word marked with an initial accent mark, if the accent word is marked as an accent word in every sample text, the accent marking result is relatively reliable, so a target accent mark may be added to the accent word. If the accent word is marked as an accent word in at least two sample texts but is not marked as accented in the others, the accent marking result may deviate to some degree. In this case, a further check may be performed to improve the accuracy of the result. Specifically, considering that accented pronunciation in audio has a higher fundamental frequency and higher energy than unaccented pronunciation, the target accent mark may be added to the accent word when the fundamental frequency of the accent word is greater than a preset fundamental frequency threshold and the energy of the accent word is greater than a preset energy threshold. The preset fundamental frequency threshold and the preset energy threshold may be set according to actual conditions, which is not limited in the embodiments of the present disclosure.

It should be understood that, in other possible cases, if an accent word is marked in one sample text but is not marked as accented in any of the other sample texts, the word is unlikely to be accented, so the target accent mark may be withheld from it.

In this way, the initial accent marks in the sample texts can be screened to obtain sample texts with target accent marks, so that, for each sample text, the accent words with target accent marks can be determined as the accent words in that sample text, which makes the accent labeling information in the sample texts more accurate.
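The screening rule described above can be illustrated with a minimal Python sketch. The function name, the use of word strings as keys, and the per-word fundamental frequency and energy dictionaries are all assumptions made for illustration; the disclosure does not specify how these values are stored or measured.

```python
def integrate_accent_marks(annotations, f0, energy, f0_threshold, energy_threshold):
    """Return the set of words that receive a target accent mark.

    annotations: list of sets, one per annotated copy of the sample text,
                 each holding the words given an initial accent mark.
    f0, energy:  dicts mapping a word to its fundamental frequency and
                 energy in the sample audio (hypothetical inputs).
    """
    candidates = set().union(*annotations)
    targets = set()
    for word in candidates:
        votes = sum(word in marked for marked in annotations)
        if votes == len(annotations):
            # Marked in every copy: accept directly.
            targets.add(word)
        elif votes >= 2:
            # Partial agreement: accept only if both acoustic checks pass.
            if f0[word] > f0_threshold and energy[word] > energy_threshold:
                targets.add(word)
        # Marked in a single copy only: no target accent mark is added.
    return targets
```

In this sketch a word marked in only one copy is always dropped, which corresponds to the case where the target accent mark is withheld.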

After the sample texts labeled with accent words are obtained, the speech synthesis model may be trained according to the plurality of sample texts labeled with accent words and the sample audio corresponding to each of them. In a possible manner, the speech synthesis model includes an accent recognition module, and the speech synthesis model may be obtained through the following training steps: converting the sample text corresponding to the sample audio into a sample phoneme sequence, and vectorizing the sample phoneme sequence to obtain a sample phoneme vector; vectorizing the sample accent label to obtain a sample accent label vector; determining a target sample phoneme vector according to the sample phoneme vector and the sample accent label vector; determining a sample Mel spectrum according to the target sample phoneme vector; calculating a first loss function according to the sample Mel spectrum and the actual Mel spectrum corresponding to the sample audio; inputting the word vector sequence corresponding to the sample text into the accent recognition module, and calculating a second loss function according to the output of the accent recognition module and the sample accent label; and finally, adjusting the parameters of the speech synthesis model through the first loss function and the second loss function.

That is, the speech synthesis model may be understood as including an accent recognition module and a speech synthesis submodule. The accent recognition module is trained on the word vector sequences and sample accent labels corresponding to the sample texts, and the speech synthesis submodule is trained on the plurality of sample texts labeled with accent words and the sample audio corresponding to each sample text.

For example, referring to FIG. 2, the accent recognition module may take the word vector sequence corresponding to the sample text as input and output predicted word-level accent labels. A weighted second loss function may then be calculated from the sample accent labels corresponding to the sample text and the accent labels output by the accent recognition module. The word vector sequence may be obtained by segmenting the sample text into words and converting each segmented word into a vector. The sample accent label may be obtained by segmenting the sample text into words and then representing the accent labeling information of each segmented word by 0 or 1 according to the marked accent words, for example with 0 indicating that the word is not marked as accented and 1 indicating that it is. The second loss function may be, for example, a CE loss function (cross-entropy loss), which is not limited by the embodiments of the present disclosure. It should be appreciated that, because accented words are sparse relative to unaccented words, accented labels may be given a greater weight and unaccented labels a smaller weight when the second loss function is calculated.
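As a minimal sketch of the 0/1 word-level labels and the class-weighted cross-entropy described above, the following Python code may be considered. The function names, the default weight values, and the normalization by the sum of weights are assumptions chosen for illustration; the disclosure only states that accented labels may receive a greater weight.

```python
import math


def word_accent_labels(words, accent_words):
    """Encode marked accent words as 1 and all other segmented words as 0."""
    return [1 if w in accent_words else 0 for w in words]


def weighted_binary_cross_entropy(probs, labels, accent_weight=5.0, non_accent_weight=1.0):
    """Weighted mean binary cross-entropy over word-level accent predictions.

    probs:  predicted probability that each word is accented.
    labels: 0/1 sample accent labels, one per segmented word.
    The two weights are illustrative values, not taken from the disclosure.
    """
    eps = 1e-12  # guards against log(0)
    total, weight_sum = 0.0, 0.0
    for p, y in zip(probs, labels):
        w = accent_weight if y == 1 else non_accent_weight
        total += -w * (y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
        weight_sum += w
    return total / weight_sum
```

With a larger weight on the accented class, missing a true accent is penalized more heavily than raising a false alarm on an unaccented word, which counteracts the sparsity of accented labels.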

While the accent recognition module is trained, the speech synthesis submodule may be trained on the plurality of sample texts labeled with accent words and the sample audio corresponding to each sample text. Specifically, the sample text corresponding to the sample audio may be converted into a sample phoneme sequence, and the sample phoneme sequence may then be vectorized to obtain a sample phoneme vector. It should be understood that phonemes are the smallest phonetic units divided according to the natural attributes of speech and fall into two broad categories, vowels and consonants. For Chinese, phonemes include initials (an initial is the consonant that precedes a final and forms a complete syllable together with it) and finals (i.e., vowels). For English, phonemes include vowels and consonants. In the embodiments of the present disclosure, vectorizing the sample phoneme sequence to obtain the sample phoneme vector allows speech with phoneme-level accents to be synthesized in the subsequent process, so that accents in the synthesized speech are controllable at the phoneme level, which further improves the accuracy of the accented pronunciation in the synthesized speech. The process of vectorizing the sample phoneme sequence to obtain the sample phoneme vector is similar to vector conversion methods in the related art and is not described again here.

For example, the sample accent label may be generated as follows: a phoneme sequence corresponding to the sample text is determined, and accents are then labeled in that phoneme sequence according to the accent words labeled in the sample text, yielding a phoneme-level sample accent label corresponding to the sample text. The sample accent label is then vectorized to obtain a sample accent label vector. Vectorizing the sample accent label is similar to vector conversion methods in the related art and is not described again here.
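The expansion from word-level accent marks to a phoneme-level accent label can be sketched as follows. The lexicon dictionary is a hypothetical stand-in for whatever text-to-phoneme front end is used, and the function name is an assumption.

```python
def phoneme_accent_labels(words, lexicon, accent_words):
    """Expand word-level accent marks to one 0/1 label per phoneme.

    lexicon maps each word to its phoneme list; it is a hypothetical
    stand-in for a grapheme-to-phoneme front end.
    """
    phonemes, labels = [], []
    for word in words:
        word_phonemes = lexicon[word]
        phonemes.extend(word_phonemes)
        # Every phoneme of an accent word inherits the accent label 1.
        labels.extend([1 if word in accent_words else 0] * len(word_phonemes))
    return phonemes, labels
```

The two returned lists have equal length, so each phoneme can later be paired with its own accent label.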

After the sample phoneme vector and the sample accent label vector are obtained, the target sample phoneme vector may be obtained by concatenating the two vectors rather than by adding them. Because the sample phoneme vector and the sample accent label vector represent two mutually independent pieces of information, concatenation avoids destroying the independence of their contents and helps ensure the accuracy of the output of the speech synthesis model.
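A minimal sketch of per-phoneme concatenation, using plain Python lists in place of tensors, is given below. The function name and the toy dimensions are assumptions.

```python
def concat_accent(phoneme_vectors, accent_vectors):
    """Concatenate each phoneme vector with its accent label vector."""
    if len(phoneme_vectors) != len(accent_vectors):
        raise ValueError("one accent label vector is required per phoneme")
    # list + list concatenates, so the output dimension is the sum of both.
    return [list(p) + list(a) for p, a in zip(phoneme_vectors, accent_vectors)]
```

Concatenation keeps the phoneme dimensions and the accent dimensions in separate positions of the resulting vector, whereas element-wise addition would mix them into the same positions.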

After the target sample phoneme vector is obtained, a sample Mel frequency spectrum may be determined from the target sample phoneme vector. In one possible approach, the target sample phoneme vector may be input into the encoder, and then the vector output from the encoder is input into the decoder to obtain a sample mel-frequency spectrum; the encoder is used for determining pronunciation information of each phoneme in a phoneme sequence corresponding to the input vector, and the decoder is used for performing conversion processing according to the pronunciation information of each phoneme corresponding to the input vector to obtain a Mel frequency spectrum corresponding to each phoneme. Or, the frame-level vector corresponding to the vector output by the encoder may be determined by an automatic alignment model, and then the frame-level vector is input to the decoder to obtain a sample mel spectrum, where the automatic alignment model is used to correspond the pronunciation information of the phoneme level in the sample text corresponding to the target sample phoneme vector to the frame time of each phoneme in the sample audio corresponding to the target sample phoneme vector, so as to improve the model training effect and further improve the accuracy of the accent pronunciation in the model synthesized speech.

For example, the speech synthesis model may be an end-to-end Tacotron speech synthesis model; accordingly, the encoder may be the encoder in the Tacotron model and the decoder may be the decoder in the Tacotron model. For example, as shown in FIG. 2, in the training stage of the speech synthesis model, after a vectorized phoneme sequence (e.g., a sample phoneme vector) and a vectorized accent label (e.g., a sample accent label vector) are concatenated to obtain the target sample phoneme vector, the target sample phoneme vector may be input to the encoder to obtain pronunciation information for each phoneme in the corresponding phoneme sequence. For example, if the phoneme sequence includes the phoneme "jin", the model needs to know that this phoneme is pronounced as in the Chinese word for "today". Then, phoneme-level to frame-level alignment may be performed through an automatic alignment model to obtain a frame-level target sample vector corresponding to the vector output by the encoder. The target sample vector may then be input to the decoder, which performs conversion according to the pronunciation information of each phoneme in the corresponding phoneme sequence to obtain the sample Mel spectrum corresponding to each phoneme.
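One common way to turn phoneme-level encoder outputs into frame-level vectors is to repeat each vector for the number of frames its phoneme spans. The sketch below assumes that the per-phoneme frame counts come from an external alignment model; the disclosure does not state that its automatic alignment model works this way, so this is an illustration only.

```python
def expand_to_frames(encoder_outputs, durations):
    """Repeat each phoneme-level vector for the number of frames it spans.

    durations[i] is the frame count of phoneme i, assumed to be supplied
    by an external alignment model.
    """
    if len(encoder_outputs) != len(durations):
        raise ValueError("one duration is required per phoneme")
    frames = []
    for vector, n_frames in zip(encoder_outputs, durations):
        # Copy the vector so that frames do not share one list object.
        frames.extend(list(vector) for _ in range(n_frames))
    return frames
```

The output length equals the total number of frames, which matches the time axis of the Mel spectrum that the decoder is trained to produce.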

In another possible approach, note that when vector concatenation is placed before the encoder, the accent recognition module must finish its computation before the encoder can run, which affects processing speed. Therefore, to increase processing speed, referring to FIG. 3, vector concatenation may instead be placed after the encoder so that the accent recognition module and the encoder can compute in parallel. That is, in the training stage of the speech synthesis model, the sample phoneme vector may be input into the encoder, and the target sample phoneme vector may then be determined according to the vector output by the encoder and the sample accent label vector. Accordingly, determining the sample Mel spectrum from the target sample phoneme vector may be: inputting the target sample phoneme vector into the decoder to obtain the sample Mel spectrum. In this way, the accent recognition module and the encoder can compute in parallel, which increases the computation speed of the speech synthesis model and improves speech synthesis efficiency.

Referring to FIG. 2 or FIG. 3, after the sample Mel spectrum is obtained, the first loss function may be calculated according to the sample Mel spectrum and the actual Mel spectrum corresponding to the sample audio, and the parameters of the speech synthesis model may be adjusted according to the first loss function and the second loss function corresponding to the accent recognition module. For example, a CE loss function (i.e., the second loss function) is calculated from the accent label output by the accent recognition module and the sample accent label, an MSE loss function (i.e., the first loss function) is calculated from the sample Mel spectrum and the actual Mel spectrum, and the parameters of the speech synthesis model are then adjusted according to the CE loss function and the MSE loss function. In addition, an Adam optimizer may be used for model optimization during training, so as to ensure the accuracy of the output of the trained speech synthesis model.

In a possible approach, adjusting the parameters of the speech synthesis model through the first loss function and the second loss function may be: performing a weighted summation of the first loss function and the second loss function, using weight values that change adaptively during training, to obtain a target loss function, and then adjusting the parameters of the speech synthesis model according to the target loss function.

For example, the weight value may change adaptively according to the order-of-magnitude difference between the first loss function and the second loss function in each training iteration. For example, in the first training iteration, the first loss function evaluates to 10 and the second loss function to 0.1, a difference of 2 orders of magnitude, so the weight values may be chosen so that the results of the first loss function and the second loss function have the same order of magnitude (for example, both are 1). In the second training iteration, the first loss function evaluates to 1 and the second loss function to 0.1, a difference of 1 order of magnitude, so the weight values may be chosen so that the first loss function is unchanged and the result of the second loss function becomes 1, and so on. Alternatively, the weight value may change adaptively according to how much the values of the first loss function and the second loss function change in each training iteration, and so on, which is not limited in the present disclosure.

Once the first loss function and the second loss function have been weighted and summed with the adaptively changing weight values to obtain the target loss function, the parameters of the speech synthesis model may be adjusted according to the target loss function. In the above example, the CE loss function and the MSE loss function may be weighted and summed in this way to obtain the target loss function, and the parameters of the speech synthesis model are then adjusted according to the target loss function to train the speech synthesis model.
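One possible reading of the order-of-magnitude weighting is sketched below. It is an assumption that only the second loss is rescaled and that the weight is a power of ten; the disclosure also allows both losses to be rescaled, for example so that both equal 1. The function names are hypothetical.

```python
import math


def magnitude_matching_weight(first_loss, second_loss):
    """Power-of-ten weight that brings the second loss to the first loss's order of magnitude."""
    if first_loss <= 0 or second_loss <= 0:
        return 1.0  # no meaningful order of magnitude; fall back to equal weighting
    gap = math.floor(math.log10(first_loss)) - math.floor(math.log10(second_loss))
    return 10.0 ** gap


def target_loss(first_loss, second_loss):
    """Weighted sum of the two losses, recomputed at every training iteration."""
    return first_loss + magnitude_matching_weight(first_loss, second_loss) * second_loss
```

Because the weight is recomputed from the current loss values at each iteration, it adapts as the two losses shrink at different rates during training.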

After the speech synthesis model is obtained through the training in the above manner, the speech synthesis model can be used for performing speech synthesis on the text to be synthesized which is not marked with the accent words. Specifically, the speech synthesis model may determine an accent word in the text to be synthesized, and then generate audio information corresponding to the text to be synthesized according to the accent word. In a possible approach, in case the speech synthesis model comprises an accent recognition module, accent words in the text to be synthesized may be recognized by the accent recognition module. In other possible modes, the data difference between the training text of the accent recognition module and the text to be synthesized is considered, and the accuracy of the result output by the accent recognition module can be influenced, so that in order to improve the accuracy of accent pronunciation in the synthesized speech, the accent recognition function can be combined with a mode of manually marking accents. That is, the speech synthesis model determines that the accented words in the text to be synthesized may be: whether the text to be synthesized is marked with the accent words or not is determined, if the text to be synthesized is not marked with the accent words, the accent words in the text to be synthesized are identified through an accent identification module in a speech synthesis model, and if the text to be synthesized is marked with the accent words, the accent words in the text to be synthesized are determined according to marking information corresponding to the accent words.

In this way, when accent words have not been manually marked, the accent recognition module recognizes them so that accented speech can be synthesized; when accent words have been manually marked, accented speech is synthesized according to the manual marks. The method therefore better meets the requirements of different speech synthesis scenarios.
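The decision between manual marks and the accent recognition module can be sketched as follows. The representation of the marking information (one 0/1 flag per word, or `None` when the text is unmarked) and the names `determine_accent_words` and `toy_recognizer` are hypothetical; the toy recognizer merely stands in for the trained module.

```python
from typing import Callable, List, Optional, Sequence


def determine_accent_words(
    words: Sequence[str],
    marked_accents: Optional[Sequence[int]],
    recognizer: Callable[[Sequence[str]], List[int]],
) -> List[int]:
    """Return one 0/1 flag per word, where 1 denotes an accent word."""
    if marked_accents is not None:
        # The text carries manual marks: use the marking information directly.
        if len(marked_accents) != len(words):
            raise ValueError("marking must provide one flag per word")
        return list(marked_accents)
    # The text is unmarked: fall back to the accent recognition module.
    return recognizer(words)


def toy_recognizer(words: Sequence[str]) -> List[int]:
    """Stand-in for the trained module: flags only the longest word."""
    longest = max(range(len(words)), key=lambda i: len(words[i]))
    return [1 if i == longest else 0 for i in range(len(words))]
```

Passing the recognizer as a callable keeps the branching logic independent of how the accent recognition module is actually implemented.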

After determining the accent words in the text to be synthesized, the audio information corresponding to the text to be synthesized may be generated according to the accent words. In a possible manner, the speech synthesis model may generate audio information corresponding to the text to be synthesized by: determining a phoneme sequence corresponding to a text to be synthesized, determining an accent label at a phoneme level according to the accent word, and generating audio information corresponding to the text to be synthesized according to the phoneme sequence and the accent label.

It should be understood that, because the speech synthesis model determines the phoneme sequence corresponding to the text to be synthesized, accented speech can subsequently be synthesized at the phoneme level. The accents in the synthesized speech are thus controllable at the phoneme level, which further improves the accuracy of accent pronunciation in the synthesized speech.

Simultaneously with or after determining the phoneme sequence corresponding to the text to be synthesized, the phoneme-level accent label may be determined according to the recognized accent words. For example, the accent label may be a sequence of 0s and 1s, where 0 indicates that the corresponding phoneme in the text to be synthesized is not accented and 1 indicates that the corresponding phoneme is accented. In a specific application, the phoneme sequence corresponding to the text to be synthesized may be determined first, and accent labeling is then performed on the phoneme sequence according to the recognized accent words, so as to obtain the phoneme-level accent label corresponding to the text to be synthesized.
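As an illustration of the 0/1 labeling, the sketch below expands word-level accent flags to the phoneme level. It assumes that every phoneme of an accent word receives the label 1, and it uses a hypothetical pronunciation lexicon; neither detail is specified in the disclosure.

```python
from typing import Dict, List, Sequence, Tuple


def phoneme_level_accent_labels(
    words: Sequence[str],
    word_accents: Sequence[int],
    lexicon: Dict[str, List[str]],
) -> Tuple[List[str], List[int]]:
    """Build the phoneme sequence and a parallel 0/1 accent label sequence."""
    if len(words) != len(word_accents):
        raise ValueError("one accent flag is required per word")
    phonemes: List[str] = []
    labels: List[int] = []
    for word, accent in zip(words, word_accents):
        word_phonemes = lexicon[word]  # hypothetical word-to-phoneme lookup
        phonemes.extend(word_phonemes)
        # Assumption: each phoneme inherits the accent flag of its word.
        labels.extend([accent] * len(word_phonemes))
    return phonemes, labels
```

The two returned sequences have equal length, so they can be vectorized position by position in the following steps.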

After obtaining the phoneme sequence and the accent label corresponding to the text to be synthesized, the audio information corresponding to the text to be synthesized may be generated according to the phoneme sequence and the accent label. In a possible mode, the speech synthesis model may first perform vectorization on a phoneme sequence corresponding to the text to be synthesized to obtain a phoneme vector, perform vectorization on the accent label to obtain an accent label vector, then determine a target phoneme vector according to the phoneme vector and the accent label vector, determine a mel spectrum according to the target phoneme vector, and finally input the mel spectrum into the vocoder to obtain audio information corresponding to the text to be synthesized.

It should be understood that the process of vectorizing the phoneme sequence corresponding to the text to be synthesized to obtain the phoneme vector and the process of vectorizing the accent label corresponding to the text to be synthesized to obtain the accent label vector are similar to the vector conversion method in the related art, and are not described herein again.

For example, because the phoneme vector and the accent label vector represent two mutually independent pieces of information, the target phoneme vector may be obtained by splicing (concatenating) the two vectors rather than adding them. This avoids destroying the independence of their content and helps ensure the accuracy of the subsequent speech synthesis result.
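The difference between splicing and adding can be seen in a small sketch. The helper names `splice` and `add` and the two-dimensional toy vectors are assumptions for illustration; real embeddings would be learned and much larger.

```python
from typing import List, Sequence

Vector = List[float]


def splice(phoneme_vecs: Sequence[Vector], accent_vecs: Sequence[Vector]) -> List[Vector]:
    """Concatenate each phoneme vector with its accent label vector."""
    if len(phoneme_vecs) != len(accent_vecs):
        raise ValueError("sequences must have the same length")
    return [list(p) + list(a) for p, a in zip(phoneme_vecs, accent_vecs)]


def add(phoneme_vecs: Sequence[Vector], accent_vecs: Sequence[Vector]) -> List[Vector]:
    """Element-wise sum, shown only for comparison with splicing."""
    if len(phoneme_vecs) != len(accent_vecs):
        raise ValueError("sequences must have the same length")
    return [[x + y for x, y in zip(p, a)] for p, a in zip(phoneme_vecs, accent_vecs)]


# Two different (phoneme, accent) pairs that become indistinguishable when added.
pair_a = ([[1.0, 0.0]], [[0.0, 1.0]])
pair_b = ([[0.0, 1.0]], [[1.0, 0.0]])
same_after_add = add(*pair_a) == add(*pair_b)
same_after_splice = splice(*pair_a) == splice(*pair_b)
```

Here `same_after_add` is true while `same_after_splice` is false: addition merges the two sources of information into shared dimensions, whereas splicing keeps them in separate dimensions.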

After the target phoneme vector is obtained, a mel spectrum may be determined from the target phoneme vector. For example, the target phoneme vector may be input into an encoder, and the vector output by the encoder is input into a decoder to obtain the corresponding mel spectrum, where the encoder is configured to determine pronunciation information of each phoneme in the phoneme sequence corresponding to the input vector, and the decoder is configured to perform conversion processing according to the pronunciation information of each phoneme corresponding to the input vector to obtain the mel spectrum corresponding to each phoneme.

For example, referring to the speech synthesis model shown in fig. 2, in the application stage of the speech synthesis model, the phoneme sequence corresponding to the text to be synthesized may be vectorized to obtain a vectorized phoneme sequence (i.e., a phoneme vector). The speech synthesis model may segment the input text to be synthesized at the prosodic phrase level to obtain a prosodic phrase sequence, generate a word vector sequence from the prosodic phrase sequence, and input the word vector sequence into the accent recognition module so that the accent recognition module determines the accent label of the text to be synthesized. The accent label is then vectorized to obtain a vectorized accent label (i.e., an accent label vector). Next, a target phoneme vector may be determined according to the phoneme vector and the accent label vector, and the target phoneme vector is input to the encoder to obtain pronunciation information of each phoneme in the phoneme sequence corresponding to the target phoneme vector; for example, for the phoneme "jin", the pronunciation of the phoneme may be obtained as "today". The pronunciation information may then be input into the decoder, which performs conversion processing according to the pronunciation information of each phoneme in the phoneme sequence corresponding to the target phoneme vector to obtain the mel spectrum corresponding to each phoneme.

Alternatively, in other possible implementations, to increase the processing speed, the concatenation may be arranged after the encoder, as in the speech synthesis model shown in fig. 3. In this case, determining the target phoneme vector according to the phoneme vector and the accent label vector may be: inputting the phoneme vector into the encoder, and determining the target phoneme vector according to the vector output by the encoder and the accent label vector. Accordingly, determining the mel spectrum according to the target phoneme vector may be: inputting the target phoneme vector into the decoder to obtain the mel spectrum.
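The two placements of the concatenation (before the encoder as in fig. 2, after the encoder as in fig. 3) can be contrasted in a short sketch. The encoder and decoder here are toy stand-ins, and all function names are hypothetical; only the data flow is meant to reflect the description.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Sequence[Vector]], List[Vector]]


def splice(left: Sequence[Vector], right: Sequence[Vector]) -> List[Vector]:
    """Concatenate two vector sequences position by position."""
    if len(left) != len(right):
        raise ValueError("sequences must have the same length")
    return [list(a) + list(b) for a, b in zip(left, right)]


def mel_with_splice_before_encoder(phoneme_vecs, accent_vecs,
                                   encoder: Layer, decoder: Layer) -> List[Vector]:
    # Fig. 2 style: the encoder already sees the accent information.
    target = splice(phoneme_vecs, accent_vecs)
    return decoder(encoder(target))


def mel_with_splice_after_encoder(phoneme_vecs, accent_vecs,
                                  encoder: Layer, decoder: Layer) -> List[Vector]:
    # Fig. 3 style: the encoder processes only the narrower phoneme vectors.
    target = splice(encoder(phoneme_vecs), accent_vecs)
    return decoder(target)


def toy_encoder(vectors: Sequence[Vector]) -> List[Vector]:
    """Stand-in encoder: doubles every component."""
    return [[2.0 * x for x in v] for v in vectors]


def toy_decoder(vectors: Sequence[Vector]) -> List[Vector]:
    """Stand-in decoder: reduces each vector to a single-value 'frame'."""
    return [[sum(v)] for v in vectors]
```

In the second variant the encoder input has fewer dimensions, which is consistent with the stated goal of increasing processing speed, although the actual gain depends on the real encoder architecture.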

After the mel spectrum is determined from the target phoneme vector, it may be input to the vocoder to obtain the audio information corresponding to the text to be synthesized. It should be understood that the embodiments of the present disclosure do not limit the type of vocoder; that is, accented audio information can be obtained by inputting the mel spectrum into any vocoder. Because the accents in the audio information are determined by the accent recognition module, this addresses the problem in the related art that synthesized speech has no accents, or has incorrect accent pronunciation caused by randomly assigned accents, and improves the accuracy of accent pronunciation in the synthesized speech.

Based on the same inventive concept, the embodiments of the present disclosure also provide a speech synthesis apparatus, which may be implemented as part or all of an electronic device through software, hardware, or a combination of both. Referring to fig. 4, the speech synthesis apparatus 400 includes:

an obtaining module 401, configured to obtain a text to be synthesized;

a synthesis module 402, configured to input the text to be synthesized into a speech synthesis model to obtain audio information corresponding to the text to be synthesized, where the speech synthesis model is obtained by training on sample text labeled with accent words and sample audio corresponding to the sample text;

the speech synthesis model is used for determining accent words in the text to be synthesized and generating audio information corresponding to the text to be synthesized according to the identified accent words.

Optionally, the speech synthesis model generates audio information corresponding to the text to be synthesized through the following modules:

the first determining submodule is used for determining a phoneme sequence corresponding to the text to be synthesized;

the second determining submodule is used for determining the accent label at the phoneme level according to the accent words;

and the generating submodule is used for generating audio information corresponding to the text to be synthesized according to the phoneme sequence and the accent labels.

Optionally, the generation submodule is configured to:

vectorizing the phoneme sequence corresponding to the text to be synthesized to obtain a phoneme vector, and vectorizing the accent label to obtain an accent label vector;

determining a target phoneme vector according to the phoneme vector and the accent label vector;

determining a mel frequency spectrum according to the target phoneme vector;

and inputting the Mel frequency spectrum into a vocoder to obtain audio information corresponding to the text to be synthesized.

Optionally, the generation submodule is configured to:

and inputting the target phoneme vector into an encoder, and inputting the vector output by the encoder into a decoder to obtain a corresponding Mel frequency spectrum, wherein the encoder is used for determining pronunciation information of each phoneme in a phoneme sequence corresponding to the input vector, and the decoder is used for performing conversion processing according to the pronunciation information of each phoneme corresponding to the input vector to obtain the Mel frequency spectrum corresponding to each phoneme.

Optionally, the generation submodule is configured to:

inputting the phoneme vector into an encoder, and determining the target phoneme vector according to the vector output by the encoder and the accent label vector;

inputting the target phoneme vector into a decoder to obtain the Mel frequency spectrum;

the encoder is configured to determine pronunciation information of each phoneme in a phoneme sequence corresponding to the input vector, and the decoder is configured to perform conversion processing according to the pronunciation information of each phoneme corresponding to the input vector to obtain a mel spectrum corresponding to each phoneme.

Optionally, the speech synthesis model comprises an accent recognition module, and the apparatus 400 further comprises the following modules for training the speech synthesis model:

the first calculation module is used for converting a sample text corresponding to the sample audio into a sample phoneme sequence and vectorizing the sample phoneme sequence to obtain a sample phoneme vector; vectorizing the sample accent label to obtain a sample accent label vector; splicing the sample phoneme vector and the sample accent label vector to obtain a target sample phoneme vector; determining a sample mel spectrum according to the target sample phoneme vector; and calculating a first loss function according to the sample mel spectrum and an actual mel spectrum corresponding to the sample audio;

the second calculation module is used for inputting the word vector sequence corresponding to the sample text into the accent recognition module and calculating a second loss function according to the output result of the accent recognition module and the sample accent label;

and the adjusting module is used for adjusting the parameters of the speech synthesis model through the first loss function and the second loss function.
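The two loss computations performed by the calculation modules can be sketched as below. This assumes that the first loss is a mean squared error over mel spectrum values and that the second loss is a binary cross-entropy over per-position accent probabilities; the function names and the list-based data layout are hypothetical.

```python
import math
from typing import Sequence


def mse_loss(sample_mel: Sequence[Sequence[float]],
             actual_mel: Sequence[Sequence[float]]) -> float:
    """First loss: mean squared error between predicted and actual mel frames."""
    total, count = 0.0, 0
    for predicted_frame, actual_frame in zip(sample_mel, actual_mel):
        for p, a in zip(predicted_frame, actual_frame):
            total += (p - a) ** 2
            count += 1
    if count == 0:
        raise ValueError("mel spectra must not be empty")
    return total / count


def binary_ce_loss(probabilities: Sequence[float],
                   accent_labels: Sequence[int],
                   eps: float = 1e-12) -> float:
    """Second loss: cross-entropy between recognizer outputs and 0/1 accent labels."""
    if len(probabilities) != len(accent_labels) or not accent_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, y in zip(probabilities, accent_labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(accent_labels)
```

Both values are plain scalars, so the adjusting module can combine them in whatever way the training procedure prescribes.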

Optionally, the adjusting module is configured to perform weighted summation on the first loss function and the second loss function through a weight value that changes adaptively in a training process to obtain a target loss function, and adjust a parameter of the speech synthesis model according to the target loss function.

Optionally, the speech synthesis model determines the accent words in the text to be synthesized by:

a third determining submodule, configured to determine whether the text to be synthesized is marked with an accent word;

the recognition sub-module is used for recognizing the accent words in the text to be synthesized through the accent recognition module in the speech synthesis model when the accent words are not marked in the text to be synthesized;

and the fourth determining submodule is used for determining the accent words in the text to be synthesized according to the marking information corresponding to the accent words when the accent words are marked in the text to be synthesized.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Based on the same inventive concept, the disclosed embodiments also provide a computer readable medium, on which a computer program is stored, which when executed by a processing device, implements the steps of any of the above-mentioned speech synthesis methods.

Based on the same inventive concept, an embodiment of the present disclosure further provides an electronic device, including:

a storage device having a computer program stored thereon;

processing means for executing the computer program in the storage means to implement the steps of any of the above-described speech synthesis methods.

Referring now to FIG. 5, a block diagram of an electronic device 500 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 5, electronic device 500 may include a processing means (e.g., central processing unit, graphics processor, etc.) 501 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the electronic apparatus 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.

Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates an electronic device 500 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing device 501.

It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

In some embodiments, communication may be performed using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.

The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a text to be synthesized; and input the text to be synthesized into a speech synthesis model to obtain audio information corresponding to the text to be synthesized; wherein the speech synthesis model is obtained by training on a sample text marked with accent words and a sample audio corresponding to the sample text, and the speech synthesis model is used for determining accent words in the text to be synthesized and generating audio information corresponding to the text to be synthesized according to the accent words.

Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present disclosure may be implemented by software or hardware. In some cases, the name of a module does not constitute a limitation on the module itself.

The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Example 1 provides a speech synthesis method according to one or more embodiments of the present disclosure, the method including:

acquiring a text to be synthesized;

Example 2 provides the method of example 1, the speech synthesis model generating audio information corresponding to the text to be synthesized by:

determining a phoneme sequence corresponding to the text to be synthesized;

determining an accent label of a phoneme level according to the accent word;

and generating audio information corresponding to the text to be synthesized according to the phoneme sequence and the accent label.

Example 3 provides the method of example 2, wherein generating audio information corresponding to the text to be synthesized according to the phoneme sequence and the accent labels includes:

determining a mel frequency spectrum according to the target phoneme vector;

Example 4 provides the method of example 3, the determining a mel spectrum from the target phoneme vector, comprising:

Example 5 provides the method of example 3, wherein determining a target phoneme vector from the phoneme vector and the accent label vector comprises:

inputting the phoneme vector into an encoder, and determining a target phoneme vector according to the vector output by the encoder and the accent label vector;

the determining a mel frequency spectrum according to the target phoneme vector comprises:

Example 6 provides the method of any one of examples 1 to 5, the speech synthesis model including an accent recognition module, the speech synthesis model being trained by training steps comprising:

converting a sample text corresponding to the sample audio into a sample phoneme sequence, and vectorizing the sample phoneme sequence to obtain a sample phoneme vector; vectorizing the sample accent label to obtain a sample accent label vector; determining a target sample phoneme vector according to the sample phoneme vector and the sample accent label vector; determining a sample mel spectrum according to the target sample phoneme vector; and calculating a first loss function according to the sample mel spectrum and an actual mel spectrum corresponding to the sample audio;

inputting the word vector sequence corresponding to the sample text into the accent recognition module, and calculating a second loss function according to an output result of the accent recognition module and the sample accent label;

adjusting parameters of the speech synthesis model by the first loss function and the second loss function.

Example 7 provides the method of example 6, the adjusting parameters of the speech synthesis model by the first loss function and the second loss function, comprising:

weighting and summing the first loss function and the second loss function through a weight value which changes in a self-adaptive mode in the training process to obtain a target loss function;

and adjusting parameters of the speech synthesis model according to the target loss function.

Example 8 provides the method of example 6, wherein determining the accent words in the text to be synthesized comprises:

determining whether the text to be synthesized is marked with accent words;

if the text to be synthesized is not marked with the accent words, identifying the accent words in the text to be synthesized through the accent recognition module in the speech synthesis model;

and if the text to be synthesized is marked with the accent words, determining the accent words in the text to be synthesized according to the marking information corresponding to the accent words.

Example 9 provides, in accordance with one or more embodiments of the present disclosure, a speech synthesis apparatus, the apparatus comprising:

the acquisition module is used for acquiring a text to be synthesized;

Example 10 provides the apparatus of example 9, the speech synthesis model generates audio information corresponding to the text to be synthesized by:

Example 11 provides the apparatus of example 10, the generation submodule to:

determining a mel frequency spectrum according to the target phoneme vector;

Example 12 provides the apparatus of example 11, the generation submodule to:

Example 13 provides the apparatus of example 11, the generation submodule to:

Example 14 provides the apparatus of any one of examples 9 to 13, the speech synthesis model including an accent recognition module, further including the following modules for training the speech synthesis model:

Example 15 provides the apparatus of example 14, wherein the adjusting module is configured to perform weighted summation on the first loss function and the second loss function through a weight value adaptively changing in a training process to obtain a target loss function, and adjust parameters of the speech synthesis model according to the target loss function.

Example 16 provides the apparatus of example 14, the speech synthesis model to determine the accented words in the text to be synthesized by:

Example 17 provides a computer-readable medium on which is stored a computer program that, when executed by a processing apparatus, implements the steps of the speech synthesis method of any one of examples 1 to 8, in accordance with one or more embodiments of the present disclosure.

Example 18 provides, in accordance with one or more embodiments of the present disclosure, an electronic device, comprising:

a storage device having a computer program stored thereon;

processing means for executing the computer program in the storage means to implement the steps of the speech synthesis method in any one of examples 1 to 8.

The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments in which any combination of the features described above or their equivalents does not depart from the spirit of the disclosure. For example, a technical solution formed by replacing the above features with features having similar functions disclosed in (but not limited to) this disclosure is also encompassed.

Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims

1. A method of speech synthesis, the method comprising:

acquiring a text to be synthesized;

2. The method of claim 1, wherein the speech synthesis model generates the audio information corresponding to the text to be synthesized by:

determining a phoneme sequence corresponding to the text to be synthesized;

determining an accent label of a phoneme level according to the accent word;

3. The method according to claim 2, wherein the generating audio information corresponding to the text to be synthesized according to the phoneme sequence and the accent labels comprises:

determining a mel frequency spectrum according to the target phoneme vector;

4. The method of claim 3, wherein determining the Mel frequency spectrum from the target phoneme vector comprises:

5. The method of claim 3, wherein determining a target phoneme vector based on the phoneme vector and the accent label vector comprises:

6. The method according to any of claims 1-5, wherein the speech synthesis model comprises an accent recognition module, and wherein the speech synthesis model is trained by the following training steps:

7. The method of claim 6, wherein said adjusting parameters of the speech synthesis model by the first loss function and the second loss function comprises:

8. The method of claim 6, wherein the determining the accent words in the text to be synthesized comprises:

determining whether the text to be synthesized is marked with accent words;

if the text to be synthesized is not marked with the accent words, identifying the accent words in the text to be synthesized through the accent identification module;

9. A speech synthesis apparatus, characterized in that the apparatus comprises:

the acquisition module is used for acquiring a text to be synthesized;

10. A computer-readable medium, on which a computer program is stored, characterized in that the program, when being executed by processing means, carries out the steps of the method of any one of claims 1 to 8.

11. An electronic device, comprising:

a storage device having a computer program stored thereon;

processing means for executing the computer program in the storage means to carry out the steps of the method according to any one of claims 1 to 8.