CN114155829A - Speech synthesis method, speech synthesis device, readable storage medium and electronic equipment - Google Patents


Info

Publication number: CN114155829A
Application number: CN202111473868.XA
Authority: CN (China)
Prior art keywords: tone, text, information, synthesized, training
Legal status: Pending (the listed status is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 梅晓, 何爽爽, 吉伶俐, 马泽君
Current assignee: Beijing Youzhuju Network Technology Co Ltd
Original assignee: Beijing Youzhuju Network Technology Co Ltd
Application filed by: Beijing Youzhuju Network Technology Co Ltd
Priority application: CN202111473868.XA

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/06 Elementary speech units used in speech synthesisers; Concatenation rules


Abstract

The present disclosure relates to a speech synthesis method, a speech synthesis apparatus, a readable storage medium, and an electronic device. The method includes: determining tone labeling information and prosody labeling information of a text to be synthesized, where the tone labeling information includes the tone type of each single-word interjection in the text to be synthesized and the tone type is determined based on tone information and pitch information; determining a phoneme sequence corresponding to the text to be synthesized; and generating synthesized audio according to the tone labeling information, the prosody labeling information, and the phoneme sequence. Because the tone type of a single-word interjection abstracts the pitch curve of the interjection as actually uttered, the classification follows current linguistic research, is more accurate, and preserves within-class similarity and between-class differences, which ensures the naturalness and human-likeness of the synthesized audio. Moreover, because the tone type of each single-word interjection is specified during synthesis, the interjection tone of the synthesized audio is controllable and the pragmatic effect the interjection is meant to express can be controlled directly, making the synthesized audio more realistic and natural.

Description

Speech synthesis method, speech synthesis device, readable storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a speech synthesis method, apparatus, readable storage medium, and electronic device.
Background
Speech synthesis technology converts arbitrary text into corresponding audio and generally comprises two parts: analyzing the text to obtain linguistic information, and generating the sound waveform from the analysis result. Related-art systems usually lack an accurate description of the tonal characteristics of single-word interjections, so the tone of an interjection in the synthesized audio cannot be controlled effectively, and the synthesized audio sounds unrealistic, unnatural, and lacking in expressiveness.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a speech synthesis method, including:
determining tone labeling information and prosody labeling information of a text to be synthesized, where the tone labeling information includes the tone type of each single-word interjection in the text to be synthesized and the tone type is determined based on tone information and pitch information;
determining a phoneme sequence corresponding to the text to be synthesized; and
generating synthesized audio corresponding to the text to be synthesized according to the tone labeling information, the prosody labeling information, and the phoneme sequence.
In a second aspect, the present disclosure provides a speech synthesis apparatus comprising:
a first determining module, configured to determine tone labeling information and prosody labeling information of a text to be synthesized, where the tone labeling information includes the tone type of each single-word interjection in the text to be synthesized and the tone type is determined based on tone information and pitch information;
a second determining module, configured to determine a phoneme sequence corresponding to the text to be synthesized; and
a generating module, configured to generate synthesized audio corresponding to the text to be synthesized according to the tone labeling information and the prosody labeling information determined by the first determining module and the phoneme sequence determined by the second determining module.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method provided by the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to implement the steps of the method provided by the first aspect of the present disclosure.
In the above technical solution, tone labeling information and prosody labeling information of a text to be synthesized are determined, where the tone labeling information includes the tone type of each single-word interjection in the text and the tone type is determined based on tone information and pitch information; a phoneme sequence corresponding to the text to be synthesized is determined; and synthesized audio corresponding to the text is generated according to the tone labeling information, the prosody labeling information, and the phoneme sequence. Because the tone type of a single-word interjection abstracts the pitch curve of the interjection as actually uttered, the classification follows current linguistic research, is more accurate, and preserves within-class similarity and between-class differences; it therefore describes the tonal characteristics of single-word interjections accurately, which ensures the naturalness and human-likeness of the synthesized audio. Moreover, specifying the tone type of each single-word interjection during synthesis makes the interjection tone of the synthesized audio controllable and gives direct control over the acoustic characteristics of the speech, so that the pragmatic effect the interjection is meant to express can be controlled directly and the synthesized audio is more realistic and natural.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow diagram illustrating a method of speech synthesis according to an example embodiment.
FIG. 2 is a flow diagram illustrating a method for tone labeling model training in accordance with an exemplary embodiment.
FIG. 3 is a schematic diagram illustrating an annotation page in accordance with an exemplary embodiment.
Fig. 4 is a flowchart illustrating a method of generating synthetic audio corresponding to a text to be synthesized according to tone labeling information, prosody labeling information, and a phoneme sequence, according to an example embodiment.
FIG. 5 is a flow diagram illustrating a method of training a speech synthesis model according to an example embodiment.
FIG. 6 is a block diagram illustrating a speech synthesis apparatus according to an example embodiment.
FIG. 7 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
As described in the Background, speech synthesis in the related art lacks an accurate description of the tonal characteristics of single-word interjections, so the tone of an interjection in the synthesized audio cannot be controlled effectively, and the synthesized audio sounds unrealistic, unnatural, and lacking in expressiveness. In particular, statistics-based speech synthesis essentially performs statistical modeling over audio and its labels, so the quality of the synthesized audio is closely tied to how complete, reasonable, and detailed the labeling scheme is. At present there are two common ways to label interjection tones: labeling them all uniformly as the neutral tone, or labeling them with the closest of the four lexical tones. Neither describes the tonal characteristics of single-word interjections accurately, so the synthesized interjection tones sound odd and do not match natural spoken language, leaving the synthesized audio unrealistic, unnatural, and lacking in expressiveness.
In view of the above, the present disclosure provides a speech synthesis method, apparatus, readable storage medium and electronic device.
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It should also be noted that the modifiers "a", "an", and "the" in this disclosure are illustrative rather than limiting; those skilled in the art will understand them to mean "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
FIG. 1 is a flow diagram illustrating a method of speech synthesis according to an example embodiment. As shown in fig. 1, the method may include the following S101 to S103.
In S101, tone labeling information and prosody labeling information of a text to be synthesized are determined.
In the present disclosure, the tone labeling information reflects tone-related content. It includes the tone type of each single-word interjection in the text to be synthesized, and it may also include the tone types of the other words in the text. A single-word interjection is an interjection consisting of a single character, for example "ah", "hmph", or "oh".
The tone type of a single-word interjection may be determined based on tone information and pitch information. The tone information may include the four contours level, rising, zigzag (falling-rising), and falling; the pitch information may include high and low. Correspondingly, the tone type of a single-word interjection may be one of high level, low level, high rising, high falling, low falling, and zigzag.
The tone type of a word other than a single-word interjection may be one of level, rising, zigzag, and falling. Both label sets are summarized in the sketch below.
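One way to hold these two label sets in code is a pair of enumerations. This is a minimal illustrative sketch, not part of the patent; the member names are assumptions.

from enum import Enum

class ToneType(Enum):
    """Tone types for single-word interjections: a pitch level (high/low)
    combined with a contour (level, rising, falling, falling-rising)."""
    HIGH_LEVEL = "high_level"
    LOW_LEVEL = "low_level"
    HIGH_RISING = "high_rising"
    HIGH_FALLING = "high_falling"
    LOW_FALLING = "low_falling"
    ZIGZAG = "zigzag"  # falling-rising contour

class WordTone(Enum):
    """Tone types for ordinary (non-interjection) words."""
    LEVEL = "level"
    RISING = "rising"
    ZIGZAG = "zigzag"
    FALLING = "falling"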
The prosody labeling information reflects prosody-related content and includes prosodic boundary information.
In S102, a phoneme sequence corresponding to the text to be synthesized is determined.
In the present disclosure, the phoneme sequence corresponding to the text to be synthesized may be obtained with a Grapheme-to-Phoneme (G2P) model.
For example, the G2P model may use a Recurrent Neural Network (RNN) and a Long Short-Term Memory (LSTM) network to perform the grapheme-to-phoneme conversion, as sketched below.
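As a concrete stand-in for such a G2P front end, the following sketch uses the pypinyin package to map each character to a toned pinyin syllable. This is an assumption for illustration only: the patent's G2P model is a trained RNN/LSTM network, and a production front end would emit finer-grained initial/final phonemes and resolve polyphonic characters from context.

from pypinyin import lazy_pinyin, Style  # assumes the pypinyin package is installed

def text_to_phonemes(text: str) -> list:
    """Toy grapheme-to-phoneme conversion: one toned pinyin syllable per character."""
    return lazy_pinyin(text, style=Style.TONE3)

print(text_to_phonemes("今天天气真好啊"))
# e.g. ['jin1', 'tian1', 'tian1', 'qi4', 'zhen1', 'hao3', 'a']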
In S103, synthesized audio corresponding to the text to be synthesized is generated according to the tone labeling information, the prosody labeling information, and the phoneme sequence.
In the above technical solution, tone labeling information and prosody labeling information of a text to be synthesized are determined, where the tone labeling information includes the tone type of each single-word interjection in the text and the tone type is determined based on tone information and pitch information; a phoneme sequence corresponding to the text to be synthesized is determined; and synthesized audio corresponding to the text is generated according to the tone labeling information, the prosody labeling information, and the phoneme sequence. Because the tone type of a single-word interjection abstracts the pitch curve of the interjection as actually uttered, the classification follows current linguistic research, is more accurate, and preserves within-class similarity and between-class differences; it therefore describes the tonal characteristics of single-word interjections accurately, which ensures the naturalness and human-likeness of the synthesized audio. Moreover, specifying the tone type of each single-word interjection during synthesis makes the interjection tone of the synthesized audio controllable and gives direct control over the acoustic characteristics of the speech, so that the pragmatic effect the interjection is meant to express can be controlled directly and the synthesized audio is more realistic and natural.
To make the speech synthesis method provided by the present disclosure easier to understand for those skilled in the art, the above steps are described in detail below by way of example.
First, the tone labeling information used in the present disclosure is explained. As discussed above, the tone type of a single-word interjection may be one of high level, low level, high rising, high falling, low falling, and zigzag.
For the high level tone, the pitch target is high and the fundamental-frequency (F0) curve is essentially flat.
For the low level tone, the pitch target is low and the F0 curve is essentially flat.
For the high rising tone, the pitch target rises and the F0 curve rises continuously.
For the high falling tone, the pitch target falls and the F0 curve falls continuously.
For the low falling tone, the pitch target first falls and then levels off, so the F0 curve falls continuously and gradually flattens out at a low level.
For the zigzag tone, the F0 curve first falls continuously, then rises, and then falls again.
Next, the prosody labeling information used in the present disclosure is explained.
A prosodic boundary (break index, abbreviated BRK), also called a break index, describes how information in the speech stream is organized and segmented. Accordingly, prosodic boundary information indicates the word-boundary positions in the text where a break occurs; these word boundaries include prosodic phrase boundaries and intonation phrase boundaries.
A prosodic phrase boundary, i.e., a minor phrase boundary, corresponds to a prosodic phrase. Perceptually it is a brief break that gives a sense of informational segmentation without an actual pause. As a rule of thumb, the transient silence at a normal speech rate is typically less than 70 ms, and at a slow speech rate (80-120 words/min) it is typically between 60 ms and 170 ms.
An intonation phrase boundary, i.e., a major phrase boundary, corresponds to an intonation phrase and carries a clearly perceptible pause. The words at a major phrase boundary are usually lengthened, so they can carry rich intonational events; for example, the pitch can glide twice, realizing both a pitch accent and a boundary tone. As a rule of thumb, the silence is greater than 70 ms at a normal speech rate and greater than 170 ms at a slow speech rate (80-120 words/min).
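These rule-of-thumb durations can be turned into a simple heuristic for pre-segmenting boundary candidates before manual review. The sketch below is illustrative only; the thresholds come from the figures above, and the function name is an assumption.

def boundary_type(silence_ms: float, words_per_min: float) -> str:
    """Classify a word boundary from its silent pause, using the rule-of-thumb
    cut-offs above: ~70 ms at a normal speech rate, ~170 ms at a slow rate
    (80-120 words/min). A boundary with no pause is treated as a prosodic
    word boundary."""
    threshold = 170.0 if 80 <= words_per_min <= 120 else 70.0
    if silence_ms <= 0:
        return "prosodic word boundary"
    if silence_ms > threshold:
        return "intonation phrase boundary"
    return "prosodic phrase boundary"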
In addition, the prosodic boundary information may include sentence boundaries, which allow the boundary tone at the end of a sentence to reach the highest or lowest point of the speaker's pitch range, better matching the behavior of a real speaker.
Optionally, the prosodic boundary information may further include prosodic word boundaries, which describe word boundaries.
A prosodic word is a stress domain: it contains one stressed syllable and shows obvious coarticulation within the domain. A prosodic word boundary is only weakly perceptible as a break, with no pause and no obvious sense of informational segmentation; in general there is neither lengthening nor pausing at the boundary.
Based on this prosody labeling information, the text can be prosodically annotated, and a more natural speech synthesis result can be obtained from the prosodic information. Moreover, the description above standardizes the annotation scheme: it defines the scope and domain of each label and sets acoustic cues and thresholds for labeling, which improves annotation consistency.
A specific embodiment of determining the tone labeling information of the text to be synthesized in S101 is described in detail below. This can be implemented in various ways. In one embodiment, the tone labeling information of the text to be synthesized is determined by manual annotation: a tone labeling operation on the text to be synthesized is received, and the tone labeling information of the text is generated according to that operation.
That is, an annotator can perform the tone labeling operation directly on the text to be synthesized, annotating into the text the tonal features they expect to hear in the synthesized audio. For how to label the various tonal features, refer to the description of the tone labeling information above.
For example, a correspondence between the tone types of each Chinese interjection and pragmatic intents (as shown in Table 1 below) may be established in advance to assist the annotator. A pragmatic intent characterizes the pragmatic effect an interjection is meant to express, e.g., pleading, requesting, questioning, surprise, disappointment, or pondering.
TABLE 1 Correspondence between the tone types of Chinese single-word interjections and their pragmatic intents (tone-type columns: high level, high rising, high falling, zigzag, low falling, low level)
"ah": pleading, requesting; question, surprise; exclamation, outburst; discourse linking, filler; pondering
"hmph": threat, coercion; discontent
"mm": response, acknowledgement; surprise; recognition, acceptance; disappointment, distress
……
"oh": realization; surprise; agreement, confirmation; doubt; agreement while pondering; pondering
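A correspondence of this kind can be kept as a simple lookup that an annotation tool shows next to each interjection. The sketch below is illustrative only: the keys and intent strings are hypothetical placeholders in the spirit of Table 1, not a transcription of it.

# Hypothetical excerpt in the spirit of Table 1; entries are placeholders,
# not an authoritative transcription of the patent's table.
PRAGMATIC_INTENT = {
    ("a",   "high_level"):   "plea / request",
    ("a",   "high_rising"):  "question / surprise",
    ("a",   "low_level"):    "pondering",
    ("hng", "high_falling"): "threat / discontent",
}

def suggest_intents(interjection: str) -> dict:
    """Collect the tone-type -> pragmatic-intent suggestions for one
    interjection, to show to the annotator alongside the text."""
    return {tone: intent for (word, tone), intent in PRAGMATIC_INTENT.items()
            if word == interjection}

print(suggest_intents("a"))  # e.g. {'high_level': 'plea / request', ...}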
In another embodiment, the tone labeling information of the text to be synthesized can be obtained with a pre-trained tone labeling model: the text to be synthesized is input into the pre-trained tone labeling model, which outputs the tone labeling information of the text.
The tone labeling model can be trained on second training texts that carry tone labeling information. Specifically, the tone labeling model can be trained through S201 and S202 shown in fig. 2.
In S201, a second training text and the tone labeling information of the second training text are obtained.
In S202, model training is performed with the second training text as the input of the tone labeling model and the tone labeling information of the second training text as its target output, to obtain the tone labeling model.
In this disclosure, the second training text may be text transcribed from real recorded speech. For such speech, an annotator first labels the tone type of each single-word interjection in the text by listening to the recording, producing the tone labeling information. This labeling relies mainly on the annotator's auditory perception; the annotator may also consult the correspondence between interjection tone types and pragmatic intents (Table 1).
In this way, tone labeling information that matches the actual pronunciation of the second training text is obtained. Training the model with the second training text as input and its tone labeling information as the target output therefore yields the tone labeling model, and the tone labeling information of a text to be synthesized can then be obtained automatically by feeding the text into this model, without real-time manual annotation. A possible form of such a model is sketched below.
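The patent does not fix an architecture for the tone labeling model. One plausible form, shown here purely as a sketch, is a character-level sequence tagger trained exactly as in S201/S202: the second training text is the input and its tone labels are the target. The hyperparameters and the tag inventory size are assumptions.

import torch
import torch.nn as nn

class ToneTagger(nn.Module):
    """Character-level BiLSTM tagger: one tone tag per input character."""
    def __init__(self, vocab_size: int, num_tags: int, emb: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb, padding_idx=0)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(self.embed(char_ids))
        return self.out(h)  # (batch, seq_len, num_tags)

model = ToneTagger(vocab_size=6000, num_tags=8)  # 6 tone types + "no tone label" + padding, assumed
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def tagger_train_step(char_ids, tag_ids):
    """One S202-style step: second training text in, its tone labels as target."""
    optimizer.zero_grad()
    logits = model(char_ids)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), tag_ids.reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()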
A specific embodiment of determining the prosody labeling information of the text to be synthesized in S101 is described below. This can also be implemented in various ways. In one embodiment, the prosody labeling information is determined by manual annotation: a prosody labeling operation on the text to be synthesized is received, and the prosody labeling information of the text is generated according to that operation.
That is, an annotator can perform the prosody labeling operation directly on the text to be synthesized, annotating into the text the prosodic features they expect to hear in the synthesized audio. For how to label the various prosodic features, refer to the description of the prosody labeling information above.
In another embodiment, the prosody labeling information of the text to be synthesized can be obtained with a pre-trained prosody labeling model: the text to be synthesized is input into the pre-trained prosody labeling model, which outputs the prosody labeling information of the text.
The prosody labeling model can be trained on third training texts that carry prosody labeling information. Specifically, the prosody labeling model can be trained as follows:
First, a third training text and the prosody labeling information of the third training text are obtained.
Then, model training is performed with the third training text as the input of the prosody labeling model and the prosody labeling information of the third training text as its target output, to obtain the prosody labeling model.
In the present disclosure, the third training text may be text transcribed from real recorded speech. For such speech, an annotator first places prosodic labels at the appropriate positions in the text by listening to the recording, producing the prosody labeling information. This labeling relies mainly on the annotator's auditory perception. In addition, the labeling process may follow the guidance below:
To label prosodic boundaries, the audio is played and the annotator judges at which word boundaries the speech stream shows an obvious pause or break, then marks prosodic phrase boundaries and intonation phrase boundaries (sentence boundaries can be marked automatically by machine and need no manual labeling).
Illustratively, the annotator can label the text to be annotated through a dedicated annotation page, which can be as shown in fig. 3. From top to bottom, layer 1 is the sentence layer (i.e., the third training text); layer 2 is the Chinese-character layer and includes the leading and trailing silent segments "sil" and the mid-sentence pauses "sp"; layer 3 is the phoneme layer and carries tone information; and layer 4 is the prosodic-boundary annotation layer. Layers 1-3 are interval tiers, while layer 4, the layer to be annotated, is a point tier: labels are placed as points, and the time of each labeled point must coincide exactly with the right boundary of the corresponding word. In fig. 3, 1 denotes a prosodic word boundary, 2 a prosodic phrase boundary, 3 an intonation phrase boundary, and 4 a sentence boundary.
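The four annotation layers can be represented directly in code as three interval tiers and one point tier, in the spirit of a Praat TextGrid. The sketch below is a minimal data-structure illustration; the class and field names are assumptions.

from dataclasses import dataclass, field

@dataclass
class Interval:
    start: float  # seconds
    end: float
    label: str    # a character, "sil", "sp", or a toned phoneme

@dataclass
class BoundaryPoint:
    time: float   # must coincide with the right boundary of the word it marks
    label: str    # "1" prosodic word, "2" prosodic phrase, "3" intonation phrase, "4" sentence

@dataclass
class AnnotationDoc:
    """Mirror of the annotation page: sentence, character and phoneme interval
    tiers plus one point tier for prosodic boundaries."""
    sentence_tier: list = field(default_factory=list)   # layer 1, Interval entries
    character_tier: list = field(default_factory=list)  # layer 2, with "sil"/"sp"
    phoneme_tier: list = field(default_factory=list)    # layer 3, with tone info
    boundary_tier: list = field(default_factory=list)   # layer 4, BoundaryPoint entries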
In this way, prosody labeling information that matches the actual pronunciation of the third training text is obtained. Training the model with the third training text as input and its prosody labeling information as the target output therefore yields the prosody labeling model, and the prosody labeling information of a text to be synthesized can then be obtained automatically by feeding the text into this model, without real-time manual annotation.
A specific embodiment of generating, in S103, the synthesized audio corresponding to the text to be synthesized according to the tone labeling information, the prosody labeling information, and the phoneme sequence is described in detail below. Specifically, this can be realized by S1031 to S1034 shown in fig. 4.
In S1031, a phoneme-level tone label is determined according to the tone labeling information of the text to be synthesized.
As described above, the tone labeling information may include the tone type of each single-word interjection in the text to be synthesized, and the tone type may be one of high level, low level, high rising, high falling, low falling, and zigzag. Accordingly, a tone label is one of high level, low level, high rising, high falling, low falling, and zigzag.
The tone labeling information generally labels tone types at the word level in the text to be synthesized. To make subsequent speech synthesis easier and to ensure that the tone labels correspond one-to-one with the phonemes of the text, phoneme-level tone labels are further derived from the tone labeling information of the text to be synthesized.
The rule for deriving the tone labels is that all phonemes of the same single-word interjection share the same tone label; that is, the tone label of each phoneme of a single-word interjection is the tone type of that interjection, as in the sketch below.
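A minimal sketch of this expansion follows. The data structures (a per-word tone annotation and a per-word phoneme list) and the substitute label "N1" for unannotated words are illustrative assumptions, not the patent's notation.

def phoneme_level_tone_labels(words, word_tone, word_phonemes, no_label="N1"):
    """Expand word-level tone annotations to the phoneme level: every phoneme
    of an annotated single-word interjection inherits that word's tone type;
    phonemes of unannotated words get the substitute label."""
    labels = []
    for index, word in enumerate(words):
        tone = word_tone.get(index, no_label)
        labels.extend([tone] * len(word_phonemes[index]))
    return labels

print(phoneme_level_tone_labels(
    ["好", "啊"],
    {1: "high_falling"},       # word index 1 ("啊") carries a tone annotation
    [["h", "ao3"], ["a"]]))
# ['N1', 'N1', 'high_falling']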
In S1032, a prosody label at a phoneme level is determined according to prosody labeling information of the text to be synthesized.
As mentioned above, the prosody labeling information includes prosodic boundary information; accordingly, the prosody labels are prosodic boundary labels.
The prosody labeling information generally annotates particular positions in the text, for example marking a certain position as a prosodic phrase boundary. To make subsequent speech synthesis easier and to ensure that the prosody labels correspond one-to-one with the phonemes of the text, phoneme-level prosody labels are further derived from the prosody labeling information of the text to be synthesized.
The rule for deriving the prosody labels is to generate label content from the annotation at each phoneme position that carries a prosodic annotation, and to use a specified substitute label at phoneme positions that carry none. For example, for the phoneme sequence {a1, a2, a3, a4, a5, a6}, suppose the prosody labeling information contains prosodic boundary information in which a2 is labeled as a prosodic phrase boundary and a5 as an intonation phrase boundary, and suppose a prosodic phrase boundary is encoded as 3, an intonation phrase boundary as 4, and "no label" as N2; the resulting prosodic boundary labels are then {N2, 3, N2, N2, 4, N2}.
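The worked example above can be reproduced with a few lines of code. The boundary codes 3 and 4 and the substitute label "N2" follow the example; everything else is an assumed illustration.

def phoneme_level_prosody_labels(phonemes, boundary_at, no_label="N2"):
    """Phoneme positions carrying a prosodic annotation keep its code; all
    other positions get the specified substitute label."""
    return [boundary_at.get(p, no_label) for p in phonemes]

labels = phoneme_level_prosody_labels(
    ["a1", "a2", "a3", "a4", "a5", "a6"],
    {"a2": "3", "a5": "4"})  # 3 = prosodic phrase boundary, 4 = intonation phrase boundary
print(labels)  # ['N2', '3', 'N2', 'N2', '4', 'N2']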
In S1033, acoustic feature information corresponding to the text to be synthesized is generated by a pre-trained speech synthesis model from the tone labels, the prosody labels, and the phoneme sequence.
In S1034, the acoustic feature information is fed to a vocoder to generate the synthesized audio corresponding to the text to be synthesized.
Specifically, the tone labels, the prosody labels, and the phoneme sequence are input into the pre-trained speech synthesis model to obtain the acoustic feature information corresponding to the text to be synthesized. Illustratively, the acoustic feature information may be a mel-frequency spectrogram, a linear spectrogram, or the like.
After the acoustic feature information corresponding to the text to be synthesized is obtained from the speech synthesis model, it can be input into a vocoder (e.g., a WaveNet vocoder or a Griffin-Lim vocoder) for waveform generation, yielding the synthesized audio corresponding to the text to be synthesized.
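Putting S1033 and S1034 together, the inference pipeline looks roughly like the sketch below. The names acoustic_model and vocoder, their call signatures, and the sample rate are assumptions for illustration; the patent only requires a pre-trained speech synthesis model followed by a vocoder such as WaveNet or Griffin-Lim.

import numpy as np

def synthesize(phonemes, tone_labels, prosody_labels, acoustic_model, vocoder,
               sample_rate=24000):
    """S1033: labels + phonemes -> acoustic features; S1034: features -> waveform."""
    assert len(phonemes) == len(tone_labels) == len(prosody_labels)
    mel = acoustic_model(phonemes=phonemes,
                         tone_labels=tone_labels,
                         prosody_labels=prosody_labels)  # e.g. a mel-spectrogram
    waveform = vocoder(mel)                              # 1-D float audio samples
    return np.asarray(waveform, dtype=np.float32), sample_rate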
In this implementation, the tone information and the prosody information are refined to the phoneme level, which gives finer-grained control and yields more appropriate and accurate tone and prosody in the synthesized audio.
In addition, the speech synthesis model may include an encoding network, an attention network, and a decoding network. The encoding network produces a representation sequence from the tone labels, the prosody labels, and the phoneme sequence; the attention network generates a fixed-length semantic representation from that representation sequence; and the decoding network produces the acoustic feature information corresponding to the text to be synthesized from the semantic representation.
Specifically, the encoding network may include an embedding layer, a pre-processing network (pre-net) sub-model, and a CBHG sub-model (Convolution Bank + Highway network + bidirectional Gated recurrent unit, i.e., a stack of convolutional layers, a highway network, and a bidirectional recurrent neural network). First, the tone labels, the prosody labels, and the phoneme sequence are each vectorized by the embedding layer, and the vectorized tone labels, prosody labels, and phoneme sequence are concatenated in a preset order to obtain a target sequence, for example in the order vectorized phoneme sequence, vectorized tone labels, vectorized prosody labels. Next, the target sequence is passed through the pre-net sub-model for a non-linear transformation, which improves the convergence and generalization of the speech synthesis model. Finally, the CBHG sub-model maps the transformed target sequence to the corresponding representation sequence.
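A minimal PyTorch sketch of this encoding network is given below. The embedding sizes, the pre-net widths, and the drastically simplified CBHG stand-in (a single convolution plus a bidirectional GRU) are assumptions, not the patent's specification; the concatenation order follows the example in the text.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Embed the phoneme, tone-label and prosody-label sequences, concatenate
    them in a preset order, apply a pre-net (non-linear transform), then a
    simplified CBHG stand-in to obtain the representation sequence."""
    def __init__(self, n_phonemes, n_tones, n_prosody, dim=128):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, dim)
        self.tone_emb = nn.Embedding(n_tones, dim)
        self.prosody_emb = nn.Embedding(n_prosody, dim)
        self.pre_net = nn.Sequential(
            nn.Linear(3 * dim, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.5))
        self.conv = nn.Conv1d(128, 128, kernel_size=3, padding=1)  # CBHG stand-in
        self.rnn = nn.GRU(128, 128, batch_first=True, bidirectional=True)

    def forward(self, phonemes, tones, prosody):
        x = torch.cat([self.phoneme_emb(phonemes),
                       self.tone_emb(tones),
                       self.prosody_emb(prosody)], dim=-1)  # splice in a preset order
        x = self.pre_net(x)
        x = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        out, _ = self.rnn(x)  # representation sequence for the attention network
        return out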
In addition, the speech synthesis model is trained on a first training text that carries prosody labeling information and tone labeling information, together with training acoustic feature information corresponding to the first training text. Specifically, this can be realized through S501 to S504 shown in fig. 5.
In S501, a training phoneme sequence corresponding to the first training text is determined.
In S502, a training tone label at a phoneme level is determined according to the tone labeling information of the first training text.
In S503, a training prosody label at a phoneme level is determined according to the prosody label information of the first training text.
In S504, model training is performed with the training tone labels, the training prosody labels, and the training phoneme sequence as the input of the speech synthesis model and the training acoustic feature information as its target output, to obtain the speech synthesis model.
In this disclosure, the training phoneme sequence corresponding to the first training text may be determined in the same way as the phoneme sequence of the text to be synthesized in S102; the phoneme-level training tone labels may be determined in the same way as the phoneme-level tone labels in S1031; and the phoneme-level training prosody labels may be determined in the same way as the phoneme-level prosody labels in S1032. These are not repeated here.
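A single training step of S504 might then look like the sketch below, assuming an encoder/attention/decoder network such as the one sketched earlier that maps the three phoneme-level input sequences to a mel-spectrogram. The L1 reconstruction loss is a common choice and an assumption here, not the patent's specification.

import torch
import torch.nn.functional as F

def synthesis_train_step(model, optimizer, phonemes, tone_labels, prosody_labels,
                         target_mel):
    """S504: inputs are the training tone labels, training prosody labels and
    training phoneme sequence; the target output is the training acoustic
    feature information extracted from the recording of the first training text."""
    optimizer.zero_grad()
    predicted_mel = model(phonemes, tone_labels, prosody_labels)
    loss = F.l1_loss(predicted_mel, target_mel)
    loss.backward()
    optimizer.step()
    return loss.item()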
FIG. 6 is a block diagram illustrating a speech synthesis apparatus according to an example embodiment. As shown in fig. 6, the apparatus 600 includes:
a first determining module 601, configured to determine tone labeling information and prosody labeling information of a text to be synthesized, where the tone labeling information includes the tone type of each single-word interjection in the text to be synthesized and the tone type is determined based on tone information and pitch information;
a second determining module 602, configured to determine a phoneme sequence corresponding to the text to be synthesized;
a generating module 603, configured to generate a synthesized audio corresponding to the text to be synthesized according to the tone labeling information and the prosody labeling information determined by the first determining module 601 and the phoneme sequence determined by the second determining module 602.
In the above technical solution, tone labeling information and prosody labeling information of a text to be synthesized are determined, where the tone labeling information includes the tone type of each single-word interjection in the text and the tone type is determined based on tone information and pitch information; a phoneme sequence corresponding to the text to be synthesized is determined; and synthesized audio corresponding to the text is generated according to the tone labeling information, the prosody labeling information, and the phoneme sequence. Because the tone type of a single-word interjection abstracts the pitch curve of the interjection as actually uttered, the classification follows current linguistic research, is more accurate, and preserves within-class similarity and between-class differences; it therefore describes the tonal characteristics of single-word interjections accurately, which ensures the naturalness and human-likeness of the synthesized audio. Moreover, specifying the tone type of each single-word interjection during synthesis makes the interjection tone of the synthesized audio controllable and gives direct control over the acoustic characteristics of the speech, so that the pragmatic effect the interjection is meant to express can be controlled directly and the synthesized audio is more realistic and natural.
Optionally, the tone type is one of a high level, a low level, a high rising tone, a high falling tone, a low falling tone, and a zigzag tone.
Optionally, the generating module 603 includes:
a first determining sub-module, configured to determine phoneme-level tone labels according to the tone labeling information of the text to be synthesized;
a second determining sub-module, configured to determine phoneme-level prosody labels according to the prosody labeling information of the text to be synthesized;
a generating sub-module, configured to generate, by a pre-trained speech synthesis model, acoustic feature information corresponding to the text to be synthesized according to the tone labels, the prosody labels, and the phoneme sequence; and
a synthesis sub-module, configured to synthesize the acoustic feature information with a vocoder to generate the synthesized audio corresponding to the text to be synthesized.
Optionally, the speech synthesis model includes an encoding network, an attention network, and a decoding network, where the encoding network is configured to obtain a representation sequence corresponding to the tone label, the prosody label, and the phoneme sequence, and the attention network is configured to generate a fixed-length semantic representation according to the representation sequence; and the decoding network is used for obtaining acoustic characteristic information corresponding to the text to be synthesized according to the semantic representation.
Optionally, the speech synthesis model is trained on a first training text that carries prosody labeling information and tone labeling information, together with training acoustic feature information corresponding to the first training text.
Optionally, the speech synthesis model is obtained by training through a first model training device. Specifically, the first model training device includes:
a third determining module, configured to determine a training phoneme sequence corresponding to the first training text;
a fourth determining module, configured to determine phoneme-level training tone labels according to the tone labeling information of the first training text;
a fifth determining module, configured to determine phoneme-level training prosody labels according to the prosody labeling information of the first training text; and
a first training module, configured to perform model training with the training tone labels, the training prosody labels, and the training phoneme sequence as the input of the speech synthesis model and the training acoustic feature information as its target output, to obtain the speech synthesis model.
Optionally, the first determining module 601 is configured to input the text to be synthesized into a pre-trained tone labeling model, so as to obtain tone labeling information of the text to be synthesized.
Optionally, the tone labeling model is trained by a second model training device, which includes:
an acquisition module, configured to acquire a second training text and the tone labeling information of the second training text; and
a second training module, configured to perform model training with the second training text as the input of the tone labeling model and the tone labeling information of the second training text as its target output, to obtain the tone labeling model.
Optionally, the prosody labeling information includes prosodic boundary information.
The first model training device may be integrated into the speech synthesis apparatus 600 or be independent of it, and likewise the second model training device may be integrated into the speech synthesis apparatus 600 or be independent of it; the present disclosure does not specifically limit this.
The present disclosure also provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, implements the steps of the above-mentioned speech synthesis method provided by the present disclosure.
Referring now to fig. 7, a schematic diagram of an electronic device (e.g., a terminal device or server) 700 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 7, electronic device 700 may include a processing means (e.g., central processing unit, graphics processor, etc.) 701 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from storage 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the electronic apparatus 700 are also stored. The processing device 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 700 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication means 709, or may be installed from the storage means 708, or may be installed from the ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected by digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: determine tone labeling information and prosody labeling information of a text to be synthesized, where the tone labeling information includes the tone type of each single-word interjection in the text to be synthesized and the tone type is determined based on tone information and pitch information; determine a phoneme sequence corresponding to the text to be synthesized; and generate synthesized audio corresponding to the text to be synthesized according to the tone labeling information, the prosody labeling information, and the phoneme sequence.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. In some cases, the name of a module does not constitute a limitation on the module itself; for example, the first determining module may also be described as a "module that determines the tone labeling information and prosody labeling information of the text to be synthesized".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides, according to one or more embodiments of the present disclosure, a speech synthesis method, including: determining tone labeling information and prosody labeling information of a text to be synthesized, where the tone labeling information includes the tone type of each single-word interjection in the text to be synthesized and the tone type is determined based on tone information and pitch information; determining a phoneme sequence corresponding to the text to be synthesized; and generating synthesized audio corresponding to the text to be synthesized according to the tone labeling information, the prosody labeling information, and the phoneme sequence.
Example 2 provides the method of example 1, where the tone type is one of high level, low level, high rising, high falling, low falling, and zigzag.
Example 3 provides the method of example 1, where generating the synthesized audio corresponding to the text to be synthesized according to the tone labeling information, the prosody labeling information, and the phoneme sequence includes: determining phoneme-level tone labels according to the tone labeling information of the text to be synthesized; determining phoneme-level prosody labels according to the prosody labeling information of the text to be synthesized; generating, by a pre-trained speech synthesis model, acoustic feature information corresponding to the text to be synthesized according to the tone labels, the prosody labels, and the phoneme sequence; and synthesizing the acoustic feature information with a vocoder to generate the synthesized audio corresponding to the text to be synthesized.
Example 4 provides the method of example 3, the speech synthesis model including an encoding network, an attention network, and a decoding network, wherein the encoding network is configured to obtain a representation sequence corresponding to the tone tags, the prosody tags, and the phoneme sequences, and the attention network is configured to generate fixed-length semantic representations from the representation sequence; and the decoding network is used for obtaining acoustic characteristic information corresponding to the text to be synthesized according to the semantic representation.
Example 5 provides the method of example 3, where the speech synthesis model is trained on a first training text carrying prosody labeling information and tone labeling information, together with training acoustic feature information corresponding to the first training text.
Example 6 provides the method of example 5, the speech synthesis model being trained in the following manner: determining a training phoneme sequence corresponding to the first training text; determining a training tone label at a phoneme level according to tone marking information of the first training text; determining a training prosody label at a phoneme level according to prosody labeling information of the first training text; and performing model training by taking the training tone label, the training prosody label and the training phoneme sequence as the input of the speech synthesis model and taking the training acoustic feature information as the target output of the speech synthesis model to obtain the speech synthesis model.
Example 7 provides the method of any one of examples 1-6, wherein determining the tone annotation information of the text to be synthesized comprises: inputting the text to be synthesized into a pre-trained tone annotation model to obtain the tone annotation information of the text to be synthesized.
Example 8 provides the method of example 7, wherein the tone annotation model is trained by: acquiring a second training text and tone annotation information of the second training text; and performing model training by taking the second training text as the input of the tone annotation model and taking the tone annotation information of the second training text as the target output of the tone annotation model, so as to obtain the tone annotation model.
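One way to realise the tone annotation model of example 8 is as a per-token sequence labeller, trained with the second training text as input and its tone annotation (one class per token) as the target output; the sketch below, including all names and sizes, is an assumption for illustration rather than the architecture claimed.

```python
import torch
import torch.nn as nn

class ToneAnnotator(nn.Module):
    """Toy sequence-labelling network: one tone-type class predicted per token."""
    def __init__(self, vocab_size, n_classes, d_model=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * d_model, n_classes)

    def forward(self, token_ids):
        h, _ = self.rnn(self.emb(token_ids))
        return self.out(h)  # (B, T, n_classes): per-token tone-type logits

def train_annotator(model, dataset, epochs=5, lr=1e-3):
    """Supervised training: second training text in, its tone annotation as target."""
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for token_ids, tag_ids in dataset:
            logits = model(token_ids)
            loss = loss_fn(logits.reshape(-1, logits.size(-1)), tag_ids.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```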
Example 9 provides, according to one or more embodiments of the present disclosure, the method of any one of examples 1-6, wherein the prosody annotation information comprises prosodic boundary information.
Example 10 provides, in accordance with one or more embodiments of the present disclosure, a speech synthesis apparatus comprising: a first determining module configured to determine tone annotation information and prosody annotation information of a text to be synthesized, wherein the tone annotation information comprises tone types of single-word interjections in the text to be synthesized, the tone types being determined based on tone information and pitch information; a second determining module configured to determine a phoneme sequence corresponding to the text to be synthesized; and a generating module configured to generate synthetic audio corresponding to the text to be synthesized according to the tone annotation information and the prosody annotation information determined by the first determining module and the phoneme sequence determined by the second determining module.
Example 11 provides, in accordance with one or more embodiments of the present disclosure, a computer-readable medium having stored thereon a computer program that, when executed by a processing apparatus, carries out the steps of the method of any one of examples 1-9.
Example 12 provides, in accordance with one or more embodiments of the present disclosure, an electronic device comprising: a storage device having a computer program stored thereon; and a processing device configured to execute the computer program in the storage device to carry out the steps of the method of any one of examples 1-9.
The foregoing description is merely illustrative of the preferred embodiments of the present disclosure and of the technical principles employed. Those skilled in the art will appreciate that the scope of the disclosure is not limited to technical solutions formed by the particular combination of features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by interchanging the above features with (but not limited to) features having similar functions disclosed herein.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments relating to the method and is not repeated here.

Claims (12)

1. A method of speech synthesis, comprising:
determining tone annotation information and prosody annotation information of a text to be synthesized, wherein the tone annotation information comprises tone types of single-word interjections in the text to be synthesized, the tone types being determined based on tone information and pitch information;
determining a phoneme sequence corresponding to the text to be synthesized;
and generating synthetic audio corresponding to the text to be synthesized according to the tone annotation information, the prosody annotation information, and the phoneme sequence.
2. The method of claim 1, wherein the tone type is one of high level, low level, high rising, high falling, low falling, and rising-falling.
3. The method according to claim 1, wherein the generating synthetic audio corresponding to the text to be synthesized according to the tone annotation information, the prosody annotation information, and the phoneme sequence comprises:
determining a phoneme-level tone label according to the tone annotation information of the text to be synthesized;
determining a phoneme-level prosody label according to the prosody annotation information of the text to be synthesized;
generating acoustic feature information corresponding to the text to be synthesized by using a pre-trained speech synthesis model according to the tone label, the prosody label, and the phoneme sequence;
and synthesizing the acoustic feature information by using a vocoder to generate the synthetic audio corresponding to the text to be synthesized.
4. The method of claim 3, wherein the speech synthesis model comprises an encoding network, an attention network, and a decoding network, the encoding network being configured to obtain a representation sequence corresponding to the tone label, the prosody label, and the phoneme sequence, the attention network being configured to generate a fixed-length semantic representation from the representation sequence, and the decoding network being configured to obtain the acoustic feature information corresponding to the text to be synthesized according to the semantic representation.
5. The method of claim 3, wherein the speech synthesis model is trained based on a first training text carrying tone annotation information and prosody annotation information, and on training acoustic feature information corresponding to the first training text.
6. The method of claim 5, wherein the speech synthesis model is trained by:
determining a training phoneme sequence corresponding to the first training text;
determining a phoneme-level training tone label according to the tone annotation information of the first training text;
determining a phoneme-level training prosody label according to the prosody annotation information of the first training text;
and performing model training by taking the training tone label, the training prosody label, and the training phoneme sequence as the input of the speech synthesis model and taking the training acoustic feature information as the target output of the speech synthesis model, so as to obtain the speech synthesis model.
7. The method according to any one of claims 1-6, wherein determining the tone annotation information of the text to be synthesized comprises:
inputting the text to be synthesized into a pre-trained tone annotation model to obtain the tone annotation information of the text to be synthesized.
8. The method of claim 7, wherein the tone annotation model is trained by:
acquiring a second training text and tone annotation information of the second training text;
and performing model training by taking the second training text as the input of the tone annotation model and taking the tone annotation information of the second training text as the target output of the tone annotation model, so as to obtain the tone annotation model.
9. The method according to any one of claims 1-6, wherein the prosody annotation information comprises prosodic boundary information.
10. A speech synthesis apparatus, comprising:
a first determining module configured to determine tone annotation information and prosody annotation information of a text to be synthesized, wherein the tone annotation information comprises tone types of single-word interjections in the text to be synthesized, the tone types being determined based on tone information and pitch information;
a second determining module configured to determine a phoneme sequence corresponding to the text to be synthesized;
and a generating module configured to generate synthetic audio corresponding to the text to be synthesized according to the tone annotation information and the prosody annotation information determined by the first determining module and the phoneme sequence determined by the second determining module.
11. A computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processing apparatus, carries out the steps of the method according to any one of claims 1-9.
12. An electronic device, comprising:
a storage device having a computer program stored thereon;
and a processing device configured to execute the computer program in the storage device to carry out the steps of the method according to any one of claims 1-9.
CN202111473868.XA 2021-11-30 2021-11-30 Speech synthesis method, speech synthesis device, readable storage medium and electronic equipment Pending CN114155829A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111473868.XA CN114155829A (en) 2021-11-30 2021-11-30 Speech synthesis method, speech synthesis device, readable storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111473868.XA CN114155829A (en) 2021-11-30 2021-11-30 Speech synthesis method, speech synthesis device, readable storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN114155829A (en) 2022-03-08

Family

ID=80452398

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111473868.XA Pending CN114155829A (en) 2021-11-30 2021-11-30 Speech synthesis method, speech synthesis device, readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114155829A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117174074A (en) * 2023-11-01 2023-12-05 浙江同花顺智能科技有限公司 Speech synthesis method, device, equipment and storage medium


Similar Documents

Publication Title
CN112309366B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112786006B (en) Speech synthesis method, synthesis model training method, device, medium and equipment
CN111899719A (en) Method, apparatus, device and medium for generating audio
CN111369971B (en) Speech synthesis method, device, storage medium and electronic equipment
CN110197655B (en) Method and apparatus for synthesizing speech
KR20210103002A (en) Speech synthesis method and apparatus based on emotion information
CN111583900A (en) Song synthesis method and device, readable medium and electronic equipment
CN112331176B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112786011B (en) Speech synthesis method, synthesis model training method, device, medium and equipment
CN116034424A (en) Two-stage speech prosody migration
CN115485766A (en) Speech synthesis prosody using BERT models
CN112786007A (en) Speech synthesis method, device, readable medium and electronic equipment
CN113421550A (en) Speech synthesis method, device, readable medium and electronic equipment
CN112927674B (en) Voice style migration method and device, readable medium and electronic equipment
CN112309367B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN113327580A (en) Speech synthesis method, device, readable medium and electronic equipment
WO2021212954A1 (en) Method and apparatus for synthesizing emotional speech of specific speaker with extremely few resources
CN112802446B (en) Audio synthesis method and device, electronic equipment and computer readable storage medium
CN113808571B (en) Speech synthesis method, speech synthesis device, electronic device and storage medium
WO2023160553A1 (en) Speech synthesis method and apparatus, and computer-readable medium and electronic device
CN114255738A (en) Speech synthesis method, apparatus, medium, and electronic device
CN114155829A (en) Speech synthesis method, speech synthesis device, readable storage medium and electronic equipment
CN112785667A (en) Video generation method, device, medium and electronic equipment
CN116189652A (en) Speech synthesis method and device, readable medium and electronic equipment
CN116129859A (en) Prosody labeling method, acoustic model training method, voice synthesis method and voice synthesis device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination