CN112365880A - Speech synthesis method, speech synthesis device, electronic equipment and storage medium - Google Patents

Speech synthesis method, speech synthesis device, electronic equipment and storage medium

Info

Publication number
CN112365880A
Authority
CN
China
Prior art keywords
model
text
prosody
initial
speech synthesis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011224114.6A
Other languages
Chinese (zh)
Other versions
CN112365880B (en)
Inventor
张君腾
孙涛
王文富
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011224114.6A priority Critical patent/CN112365880B/en
Publication of CN112365880A publication Critical patent/CN112365880A/en
Application granted granted Critical
Publication of CN112365880B publication Critical patent/CN112365880B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/086 - Detection of language
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The application discloses a speech synthesis method, a speech synthesis apparatus, an electronic device and a storage medium, and relates to the technical fields of speech technology and deep learning. The specific implementation scheme is as follows: a text to be synthesized and an identifier of the speaker corresponding to the text are acquired; the text and the speaker identifier are input into a prosody prediction model to obtain the prosodic features of the text; and the text, the speaker identifier and the prosodic features are input into a speech synthesis model to obtain the speech corresponding to the text to be synthesized. In this way, coupling between the prosodic features and the text during speech synthesis is avoided, so that a speaker of one language can be used to synthesize speech for text in another language without combining the prosodic features of two languages, which improves the speech synthesis effect and the fidelity of the synthesized speech.

Description

Speech synthesis method, speech synthesis device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, specifically to the fields of speech technology and deep learning, and in particular to a speech synthesis method and apparatus, an electronic device, and a storage medium.
Background
Current speech synthesis technology records a high-quality speech library, obtains the text and the speaker corresponding to each utterance in the library, and trains a neural network model that takes the text and the speaker as input and outputs speech, thereby obtaining an acoustic model.
The speech also contains prosodic features associated with the speaker. In a speech synthesis system supporting multiple languages, the text of each language is language dependent; for example, English text and Chinese text are independent of each other. Moreover, when a speech library is recorded, each speaker typically speaks only one language. As a result, during training of the neural network model, the speaker-related prosodic features become strongly coupled to the language, and therefore to the text of that language. In cross-lingual speech synthesis, when a speaker of one language is used to synthesize speech for text in another language, the synthesized speech has low fidelity and a poor synthesis effect.
Disclosure of Invention
The disclosure provides a speech synthesis method, a speech synthesis apparatus, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided a speech synthesis method including: acquiring a text to be synthesized and an identifier of a speaker corresponding to the text; inputting the text and the speaker identification into a prosody prediction model to obtain prosody characteristics of the text; and inputting the text, the speaker identification and the prosody feature into a speech synthesis model to obtain the speech corresponding to the text to be synthesized.
According to another aspect of the present disclosure, there is provided a speech synthesis apparatus including: an acquisition module, used for acquiring a text to be synthesized and an identifier of the speaker corresponding to the text; a first input module, used for inputting the text and the speaker identifier into a prosody prediction model to obtain the prosodic features of the text; and a second input module, used for inputting the text, the speaker identifier and the prosodic features into a speech synthesis model to acquire the speech corresponding to the text to be synthesized.
According to a third aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a speech synthesis method as described above.
According to a fourth aspect, a non-transitory computer-readable storage medium is presented having stored thereon computer instructions for causing the computer to perform the speech synthesis method as described above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present application;
FIG. 2 is a schematic diagram of a prosody prediction model;
FIG. 3 is a schematic diagram of the structure of a speech synthesis model;
FIG. 4 is a schematic illustration according to a second embodiment of the present application;
FIG. 5 is a schematic diagram of a prosody extraction model;
FIG. 6 is a schematic illustration according to a third embodiment of the present application;
FIG. 7 is a schematic illustration according to a fourth embodiment of the present application;
fig. 8 is a block diagram of an electronic device for implementing a speech synthesis method according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
A speech synthesis method, apparatus, electronic device, and storage medium according to embodiments of the present application are described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram according to a first embodiment of the present application. It should be noted that the execution subject of the embodiment of the present application is a speech synthesis apparatus, and the speech synthesis apparatus may specifically be a hardware device, or software in a hardware device, or the like.
As shown in fig. 1, the specific implementation process of the speech synthesis method is as follows:
step 101, obtaining a text to be synthesized and an identification of a speaker corresponding to the text.
In the embodiment of the present application, the text to be synthesized may be any text in any language. The language may be, for example, Chinese or English, and the text may be, for example, news text, entertainment text or chat text. The speaker corresponding to the text is the speaker to whom the speech to be synthesized from the text belongs. For example, if the speech of speaker A is to be synthesized, the speaker corresponding to the text is speaker A; if the speech of speaker B is to be synthesized, the speaker corresponding to the text is speaker B.
And 102, inputting the text and the identification of the speaker into a prosody prediction model to obtain prosody characteristics of the text.
In the embodiment of the present application, each speaker corresponds to a timbre feature and a style feature. Different speakers have different timbre features and different style features. The prosodic features of a text are related both to the text and to the style features of the speaker. Therefore, the prosody prediction model obtains the prosodic features of the text as follows: a first linguistic feature encoding module in the prosody prediction model obtains the linguistic features of the text; a style feature encoding module in the prosody prediction model obtains the style features of the speaker; a first concatenation module in the prosody prediction model concatenates the linguistic features and the style features; and a first decoding module in the prosody prediction model obtains the prosodic features of the text from the concatenated features.
Fig. 2 is a schematic structural diagram of the prosody prediction model. In Fig. 2, the inputs of the prosody prediction model may be the text (Input Text) and the speaker identifier (Speaker Id). The first linguistic feature encoding module may consist of a text vectorization layer (Character Embedding), convolutional layers (Conv) and a bidirectional LSTM (Bidirectional LSTM), and obtains the linguistic features (Linguistic Feature) of the text. The style feature encoding module may be a speaker vectorization layer (Speaker Embedding), which obtains the style features (Style Feature) of the speaker. The first concatenation module (Concat) concatenates the linguistic features and the style features.
The first decoding module (Decoder) may be an autoregressive module comprising a preprocessing layer (Pre-Net), a unidirectional LSTM and a linear projection layer (Linear Projection), and it outputs the prosodic features (Prosody Feature) of the text. The prosodic features of the character at the previous time step are processed by the preprocessing layer to obtain the processed prosodic features of the previous step; these are concatenated with the concatenated features of the current time step and fed through the unidirectional LSTM module and the linear projection layer in sequence to obtain the prosodic features of the character at the current time step in the text.
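For illustration, the structure described above could be sketched as follows. This is a minimal PyTorch sketch; all layer sizes, dimensions and class names are illustrative assumptions and are not specified by the present application.

```python
# Minimal sketch of the prosody prediction model of Fig. 2 (assumed dimensions).
import torch
import torch.nn as nn

class ProsodyPredictor(nn.Module):
    def __init__(self, vocab_size, num_speakers, char_dim=256, style_dim=64, prosody_dim=32):
        super().__init__()
        # First linguistic feature encoder: Character Embedding + Conv + bidirectional LSTM
        self.char_embedding = nn.Embedding(vocab_size, char_dim)
        self.convs = nn.Sequential(
            nn.Conv1d(char_dim, char_dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(char_dim, char_dim, kernel_size=5, padding=2), nn.ReLU())
        self.blstm = nn.LSTM(char_dim, char_dim // 2, bidirectional=True, batch_first=True)
        # Style feature encoder: Speaker Embedding
        self.style_embedding = nn.Embedding(num_speakers, style_dim)
        # Autoregressive decoder: Pre-Net + unidirectional LSTM + Linear Projection
        self.pre_net = nn.Sequential(nn.Linear(prosody_dim, 64), nn.ReLU())
        self.decoder_lstm = nn.LSTMCell(char_dim + style_dim + 64, 256)
        self.projection = nn.Linear(256, prosody_dim)

    def forward(self, text_ids, speaker_id):
        # Linguistic features of the text
        x = self.char_embedding(text_ids)                      # (B, T, char_dim)
        x = self.convs(x.transpose(1, 2)).transpose(1, 2)      # (B, T, char_dim)
        linguistic, _ = self.blstm(x)                          # (B, T, char_dim)
        # Style features of the speaker, broadcast over all characters
        style = self.style_embedding(speaker_id).unsqueeze(1).expand(-1, x.size(1), -1)
        concat = torch.cat([linguistic, style], dim=-1)        # first concatenation module
        # Autoregressive decoding: previous prosody -> Pre-Net -> LSTM -> projection
        B, T, _ = concat.shape
        prev = concat.new_zeros(B, self.projection.out_features)
        h = c = concat.new_zeros(B, 256)
        outputs = []
        for t in range(T):
            step_in = torch.cat([concat[:, t], self.pre_net(prev)], dim=-1)
            h, c = self.decoder_lstm(step_in, (h, c))
            prev = self.projection(h)                          # prosody of the current character
            outputs.append(prev)
        return torch.stack(outputs, dim=1)                     # (B, T, prosody_dim)
```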
In the embodiment of the present application, by arranging the first linguistic feature encoding module and the style feature encoding module in the prosody prediction model, the prosodic features are kept from being coupled to the text during speech synthesis: the style features of the speaker are determined only from the speaker identifier, and the prosodic features of the text are then determined by combining the style features with the text. The prosodic features are thus treated as an independent feature rather than being coupled to the speaker and the text during speech synthesis. Consequently, when a speaker of one language is used to synthesize speech for text in another language, only one set of prosodic features is involved and two sets of prosodic features are not combined at the same time, which improves the speech synthesis effect and the fidelity of the synthesized speech.
Step 103, inputting the text, the speaker identification and the prosody feature into a speech synthesis model, and acquiring the speech corresponding to the text to be synthesized.
In the embodiment of the present application, the speech synthesis model obtains the speech corresponding to the text to be synthesized as follows: a second linguistic feature encoding module in the speech synthesis model obtains the linguistic features of the text; a timbre feature encoding module in the speech synthesis model obtains the timbre features of the speaker; a second concatenation module in the speech synthesis model concatenates the linguistic features, the timbre features and the prosodic features; and a second decoding module in the speech synthesis model obtains the speech corresponding to the text from the concatenated features.
Fig. 3 is a schematic structural diagram of the speech synthesis model. In Fig. 3, the inputs of the speech synthesis model may be the text (Input Text), the speaker identifier (Speaker Id) and the prosodic features (Prosody Feature). The second linguistic feature encoding module may consist of a text vectorization layer (Character Embedding), convolutional layers (Conv) and a bidirectional LSTM (Bidirectional LSTM), and obtains the linguistic features (Linguistic Feature) of the text. The timbre feature encoding module may be a speaker vectorization layer (Speaker Embedding), which obtains the timbre features (Speaker Feature) of the speaker. The second concatenation module (Concat) concatenates the linguistic, timbre and prosodic features.
The second decoding module (Decoder) may be an autoregressive module comprising an attention mechanism module (Attention), a unidirectional LSTM, a linear projection layer (Linear Projection), a preprocessing layer (Pre-Net) and a residual prediction layer (Post-Net). The linear projection result of the character at the previous time step is processed by the preprocessing layer to obtain the processed linear projection result of the previous step. The concatenated features of the current time step are provided to the attention mechanism module. The features output by the attention mechanism module are concatenated with the processed linear projection result of the previous step and fed into the unidirectional LSTM. The output of the unidirectional LSTM is concatenated with the features output by the attention mechanism module and passed through the linear projection layer to obtain the linear projection result of the current step. The linear projection result of the current step is processed by the residual prediction layer, and the output of the residual prediction layer is summed with the linear projection result of the current step to obtain the acoustic features predicted for the character at the current time step; the speech corresponding to the text is then synthesized by combining the acoustic features predicted for the characters in the text. Because the acoustic features of the current character are predicted in combination with the linear projection result of the previous character, the accuracy of speech synthesis can be further improved.
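For illustration, one decoding step of the second decoding module could be sketched as follows. This is a minimal sketch; the layer sizes and the simple dot-product attention are illustrative assumptions, since the present application does not specify the attention variant or dimensions.

```python
# Minimal sketch of one step of the Fig. 3 decoder (Attention + LSTM + Linear Projection
# + Pre-Net + Post-Net), with assumed dimensions and dot-product attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SynthesisDecoder(nn.Module):
    def __init__(self, enc_dim=384, mel_dim=80, lstm_dim=512):
        super().__init__()
        self.pre_net = nn.Sequential(nn.Linear(mel_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 128), nn.ReLU())
        self.attn_query = nn.Linear(lstm_dim, enc_dim)
        self.decoder_lstm = nn.LSTMCell(enc_dim + 128, lstm_dim)
        self.projection = nn.Linear(lstm_dim + enc_dim, mel_dim)
        self.post_net = nn.Sequential(                      # residual prediction layer
            nn.Conv1d(mel_dim, 256, kernel_size=5, padding=2), nn.Tanh(),
            nn.Conv1d(256, mel_dim, kernel_size=5, padding=2))

    def step(self, memory, prev_mel, state):
        """One autoregressive step. `memory` holds the concatenated linguistic,
        timbre and prosody features of all characters, shape (B, T, enc_dim)."""
        h, c = state
        # Pre-Net on the previous step's linear projection result
        prev = self.pre_net(prev_mel)
        # Attention over the concatenated encoder features
        scores = torch.bmm(memory, self.attn_query(h).unsqueeze(-1)).squeeze(-1)
        context = torch.bmm(F.softmax(scores, dim=-1).unsqueeze(1), memory).squeeze(1)
        # Unidirectional LSTM over [attention context ; processed previous output]
        h, c = self.decoder_lstm(torch.cat([context, prev], dim=-1), (h, c))
        # Linear projection over [LSTM output ; attention context]
        mel = self.projection(torch.cat([h, context], dim=-1))
        return mel, (h, c)

    def refine(self, mel_frames):
        """Post-Net residual, summed with the projection results as described above."""
        x = mel_frames.transpose(1, 2)
        return mel_frames + self.post_net(x).transpose(1, 2)
```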
In the embodiment of the present application, by using the second linguistic feature encoding module and the timbre feature encoding module in the speech synthesis model and by taking the prosodic features as an input, coupling between the prosodic features and the text during speech synthesis is avoided, and speech can be synthesized by combining the prosodic features, the text and the timbre features of the speaker. When a speaker of one language is used to synthesize speech for text in another language, only one set of prosodic features needs to be combined and two sets of prosodic features are not combined at the same time, which improves the speech synthesis effect and the fidelity of the synthesized speech.
In summary, the text to be synthesized and the identifier of the speaker corresponding to the text are acquired; the text and the speaker identifier are input into the prosody prediction model to obtain the prosodic features of the text; and the text, the speaker identifier and the prosodic features are input into the speech synthesis model to obtain the speech corresponding to the text to be synthesized. Coupling between the prosodic features and the text during speech synthesis is thereby avoided, so that a speaker of one language can be used to synthesize speech for text in another language without combining two sets of prosodic features, which improves the speech synthesis effect and the fidelity of the synthesized speech.
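For illustration, the two-stage inference flow summarized above could be sketched as follows, assuming the ProsodyPredictor sketched earlier, a synthesis model wrapping the structure of Fig. 3, and a separate vocoder; these names, the tensor shapes and the use of a neural vocoder are illustrative assumptions rather than components defined by the present application.

```python
# Minimal end-to-end inference sketch of the two-stage pipeline (assumed interfaces).
import torch

@torch.no_grad()
def synthesize(text_ids: torch.Tensor, speaker_id: torch.Tensor,
               prosody_model, synthesis_model, vocoder):
    """text_ids: (1, T) character ids; speaker_id: (1,) id of the target speaker."""
    # Step 102: predict prosodic features from the text and the speaker identifier.
    prosody = prosody_model(text_ids, speaker_id)          # (1, T, prosody_dim)
    # Step 103: synthesize acoustic features from text, speaker id and prosody.
    mel = synthesis_model(text_ids, speaker_id, prosody)   # (1, frames, mel_dim)
    # Convert the acoustic features to a waveform with a vocoder.
    return vocoder(mel)
```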
To improve the accuracy of the prosody prediction model and the speech synthesis model, the speech synthesis apparatus may train the prosody prediction model and the speech synthesis model. As shown in fig. 4, fig. 4 is a schematic diagram according to a second embodiment of the present application. On the basis of the embodiment shown in fig. 1, the method may further include the following steps.
Step 401, constructing an initial joint model, wherein the initial joint model includes: an initial prosody extraction model, an initial speech synthesis model and an initial prosody prediction model; the output of the initial prosody extraction model is connected to the input of the initial speech synthesis model; and the output of the initial prosody extraction model is connected to the output of the initial prosody prediction model.
In an embodiment of the present application, the input of the initial speech synthesis model may be a text sample, a speaker identifier, and the prosodic features extracted from the corresponding speech sample. The prosodic feature extraction is performed by the initial prosody extraction model on the speech sample corresponding to the text sample.
In the embodiment of the present application, the input of the initial prosody prediction model may be a text sample and a speaker identifier, and the output is the predicted prosodic features. A loss function may be determined by combining the predicted prosodic features with the prosodic features extracted from the speech sample by the initial prosody extraction model, and the coefficients of the initial prosody prediction model are adjusted accordingly.
In the embodiment of the present application, the prosody extraction model extracts the prosodic features as follows: a speech acoustic feature processing module in the prosody extraction model processes the acoustic features in the speech sample; and an attention mechanism module in the prosody extraction model determines the prosodic features of the text sample by combining the processed acoustic features with the linguistic features in the text sample.
A schematic structural diagram of the prosody extraction model is shown in Fig. 5. In Fig. 5, the input of the prosody extraction model may be the acoustic features (Mel spectrum) of the speech sample corresponding to the text sample. The speech acoustic feature processing module may consist of convolutional layers (Conv) and a bidirectional GRU (Bidirectional GRU), and obtains the processed acoustic features. An attention mechanism module (Attention) determines the prosodic features (Prosody Feature) of the text sample by combining the processed acoustic features with the linguistic features (Linguistic Feature) of the text sample. The extracted prosodic features include the prosodic features of each character in the text sample, which improves the accuracy of the extracted prosodic features.
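For illustration, the structure of Fig. 5 could be sketched as follows; the layer sizes and the dot-product attention that aligns characters with acoustic frames are illustrative assumptions.

```python
# Minimal sketch of the prosody extraction model of Fig. 5: Conv + bidirectional GRU
# over the Mel spectrum, then attention from linguistic features to acoustic frames,
# yielding one prosody vector per character (assumed dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProsodyExtractor(nn.Module):
    def __init__(self, mel_dim=80, ling_dim=256, hidden=128, prosody_dim=32):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(mel_dim, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU())
        self.bgru = nn.GRU(hidden, hidden // 2, bidirectional=True, batch_first=True)
        self.query = nn.Linear(ling_dim, hidden)   # project linguistic features to queries
        self.out = nn.Linear(hidden, prosody_dim)

    def forward(self, mel, linguistic):
        """mel: (B, frames, mel_dim); linguistic: (B, chars, ling_dim)."""
        # Speech acoustic feature processing module
        x = self.convs(mel.transpose(1, 2)).transpose(1, 2)       # (B, frames, hidden)
        acoustic, _ = self.bgru(x)                                 # (B, frames, hidden)
        # Attention: each character attends over all acoustic frames
        scores = torch.bmm(self.query(linguistic), acoustic.transpose(1, 2))
        context = torch.bmm(F.softmax(scores, dim=-1), acoustic)  # (B, chars, hidden)
        return self.out(context)                                   # prosody per character
```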
Step 402, obtaining training data, wherein the training data comprises: text samples, and corresponding speech samples and speaker identification.
Step 403, training the initial joint model with the text samples, the corresponding speech samples and the speaker identifiers to obtain a trained joint model.
In the embodiment of the present application, in a first implementation scenario, the speech synthesis apparatus may use the text samples and the corresponding speech samples and speaker identifiers to simultaneously train the initial prosody extraction model, the initial speech synthesis model and the initial prosody prediction model in the initial joint model.
Specifically, for a text sample and its corresponding speech sample and speaker identifier, the speech sample may be input into the initial prosody extraction model in the initial joint model to extract the prosodic features of the speech sample; the extracted prosodic features, the text sample and the speaker identifier are input into the initial speech synthesis model to obtain an output speech result; and the text sample and the speaker identifier are input into the initial prosody prediction model to output a prosodic feature prediction result. Then, the coefficients of the initial prosody prediction model are adjusted by combining the prosodic feature prediction result with the extracted prosodic features, and the coefficients of the initial prosody extraction model and the initial speech synthesis model are adjusted by combining the speech result output by the initial speech synthesis model with the speech sample, thereby performing the training. This improves the training speed while ensuring training accuracy.
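For illustration, one joint training step in this first scenario could be sketched as follows; the loss functions, the optimizer and the decision to detach the extracted prosody when computing the prediction loss are illustrative assumptions, not details specified by the present application.

```python
# Minimal sketch of one joint training step (first scenario, assumed losses/optimizer).
import torch
import torch.nn.functional as F

def joint_training_step(batch, extractor, synthesizer, predictor, optimizer):
    text_ids, speaker_id, mel_target, linguistic = batch
    # Extract prosody from the ground-truth speech sample
    prosody_ref = extractor(mel_target, linguistic)
    # Synthesize speech from text, speaker id and the extracted prosody
    mel_pred = synthesizer(text_ids, speaker_id, prosody_ref)
    synthesis_loss = F.l1_loss(mel_pred, mel_target)
    # Predict prosody from text and speaker id only, and match the extracted prosody
    # (detaching so this loss only adjusts the predictor is an assumption)
    prosody_pred = predictor(text_ids, speaker_id)
    prosody_loss = F.mse_loss(prosody_pred, prosody_ref.detach())
    # Adjust the coefficients of all three sub-models in one step
    loss = synthesis_loss + prosody_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```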
In the embodiment of the present application, in a second implementation scenario, the speech synthesis apparatus may first train the initial prosody extraction model and the initial speech synthesis model in the initial joint model with the text samples, the corresponding speech samples and the speaker identifiers, and then train the initial prosody prediction model in the initial joint model with the text samples, the corresponding speaker identifiers and the trained prosody extraction model.
Specifically, for a text sample and its corresponding speech sample and speaker identifier, the speech sample may be input into the initial prosody extraction model in the initial joint model to extract the prosodic features of the speech sample; the extracted prosodic features, the text sample and the speaker identifier are input into the initial speech synthesis model to obtain an output speech result; and then the coefficients of the initial prosody extraction model and the initial speech synthesis model are adjusted by combining the speech result output by the initial speech synthesis model with the speech sample, thereby training the initial prosody extraction model and the initial speech synthesis model.
After the initial prosody extraction model and the initial speech synthesis model have been trained with each text sample and its corresponding speech sample and speaker identifier, for a text sample and its corresponding speech sample and speaker identifier, the speech sample may be input into the trained prosody extraction model to extract the prosodic features of the speech sample, and the text sample and the speaker identifier are input into the initial prosody prediction model to output a prosodic feature prediction result. The coefficients of the initial prosody prediction model are then adjusted by combining the prosodic feature prediction result with the extracted prosodic features, thereby training the initial prosody prediction model and improving the accuracy of the prosody prediction model and the speech synthesis model.
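For illustration, this second, two-stage scenario could be sketched as follows; the epoch counts, losses and optimizers are illustrative assumptions.

```python
# Minimal sketch of the second scenario: stage 1 trains the extractor and synthesizer,
# stage 2 freezes the trained extractor and trains only the prosody predictor.
import torch
import torch.nn.functional as F

def train_two_stage(loader, extractor, synthesizer, predictor,
                    stage1_epochs=10, stage2_epochs=10):
    # Stage 1: prosody extraction model + speech synthesis model
    opt1 = torch.optim.Adam(list(extractor.parameters()) + list(synthesizer.parameters()))
    for _ in range(stage1_epochs):
        for text_ids, speaker_id, mel_target, linguistic in loader:
            prosody_ref = extractor(mel_target, linguistic)
            mel_pred = synthesizer(text_ids, speaker_id, prosody_ref)
            loss = F.l1_loss(mel_pred, mel_target)
            opt1.zero_grad(); loss.backward(); opt1.step()
    # Stage 2: prosody prediction model, supervised by the frozen, trained extractor
    extractor.eval()
    opt2 = torch.optim.Adam(predictor.parameters())
    for _ in range(stage2_epochs):
        for text_ids, speaker_id, mel_target, linguistic in loader:
            with torch.no_grad():
                prosody_ref = extractor(mel_target, linguistic)
            loss = F.mse_loss(predictor(text_ids, speaker_id), prosody_ref)
            opt2.zero_grad(); loss.backward(); opt2.step()
```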
Step 404, obtaining the prosody prediction model and the speech synthesis model from the trained joint model.
In summary, an initial joint model is constructed, which includes an initial prosody extraction model, an initial speech synthesis model and an initial prosody prediction model, where the output of the initial prosody extraction model is connected to the input of the initial speech synthesis model and the output of the initial prosody extraction model is connected to the output of the initial prosody prediction model; training data is acquired, including text samples and the corresponding speech samples and speaker identifiers; the initial joint model is trained with the text samples, the corresponding speech samples and the speaker identifiers to obtain a trained joint model; and the prosody prediction model and the speech synthesis model are obtained from the trained joint model. A highly accurate prosody prediction model and speech synthesis model can thus be obtained through training, improving the accuracy of speech synthesis.
In order to implement the foregoing embodiments, an apparatus for speech synthesis is also provided in the embodiments of the present application.
Fig. 6 is a schematic diagram according to a third embodiment of the present application. As shown in fig. 6, the speech synthesis apparatus 600 includes: an acquisition module 610, a first input module 620, and a second input module 630.
The obtaining module 610 is configured to obtain a text to be synthesized and an identifier of a speaker corresponding to the text;
a first input module 620, configured to input the text and the speaker identifier into a prosody prediction model, so as to obtain prosody features of the text;
the second input module 630 is configured to input the text, the speaker identifier, and the prosody feature into a speech synthesis model, and obtain speech corresponding to the text to be synthesized.
As a possible implementation manner of the embodiment of the present application, the prosody prediction model obtains the prosody features of the text in a manner that a first linguistic feature coding module in the prosody prediction model is adopted to obtain the linguistic features of the text; obtaining the style characteristics of the speaker by adopting a style characteristic coding module in the prosody prediction model; splicing the linguistic features and the style features by adopting a first splicing module in the prosody prediction model; and acquiring the prosodic features of the text by adopting a first decoding module in the prosodic prediction model and combining the spliced features.
As a possible implementation manner of the embodiment of the present application, the manner in which the speech synthesis model obtains the speech corresponding to the text to be synthesized is that a second linguistic feature coding module in the speech synthesis model is adopted to obtain the linguistic feature of the text; acquiring the tone characteristic of the speaker by adopting a tone characteristic coding module in the speech synthesis model; splicing the linguistic feature, the timbre feature and the prosody feature by adopting a second splicing module in the voice synthesis model; and acquiring the voice corresponding to the text by adopting a second decoding module in the voice synthesis model and combining the spliced characteristics.
As a possible implementation manner of the embodiment of the present application, referring to fig. 7 in combination, the speech synthesis apparatus 700 includes: an acquisition module 710, a first input module 720, a second input module 730, a construction module 740, and a training module 750;
the details of the obtaining module 710, the first input module 720 and the second input module 730 refer to the descriptions of the obtaining module 610, the first input module 620 and the second input module 630 in the embodiment shown in fig. 6, and are not described here.
The building module 740 is configured to build an initial joint model, where the initial joint model includes: an initial prosody extraction model, an initial speech synthesis model and an initial prosody prediction model; the output of the initial prosody extraction model is connected with the input of the initial speech synthesis model; the output of the initial prosody extraction model is connected with the output of the initial prosody prediction model;
the obtaining module 710 is further configured to obtain training data, where the training data includes: text samples, and corresponding speech samples and speaker identification;
the training module 750 is configured to train the initial combined model by using the text sample, and the corresponding voice sample and speaker identification to obtain a trained combined model;
the obtaining module 710 is further configured to obtain the prosody prediction model and the speech synthesis model in the trained joint model.
As a possible implementation manner of the embodiment of the present application, the training module is specifically configured to use the text sample, and the corresponding voice sample and speaker identifier to simultaneously train the initial prosody extraction model, the initial voice synthesis model, and the initial prosody prediction model in the initial joint model.
As a possible implementation of the embodiments of the present application, the training module is specifically configured to,
training the initial prosody extraction model and the initial speech synthesis model in the initial joint model by adopting the text sample, the corresponding speech sample and the speaker identification;
and training the initial prosody prediction model in the initial combined model by adopting the text sample, the corresponding speaker identification and the trained prosody extraction model.
As a possible implementation manner of the embodiment of the present application, the prosody extraction model extracts the prosody features by,
processing the acoustic features in the voice sample by adopting a voice acoustic feature processing module in the prosody extraction model;
and determining the prosodic features of the text sample by adopting an attention mechanism module in the prosody extraction model and combining the processed acoustic features and the linguistic features in the text sample.
As a possible implementation manner of the embodiment of the present application, the first decoding module is an autoregressive network module;
the autoregressive network module is used for predicting the prosodic feature of the current character in the text by combining the prosodic feature of the character before the current character in the text, the current character and the style feature of the speaker.
The speech synthesis apparatus of the embodiment of the present application acquires the text to be synthesized and the identifier of the speaker corresponding to the text; inputs the text and the speaker identifier into the prosody prediction model to obtain the prosodic features of the text; and inputs the text, the speaker identifier and the prosodic features into the speech synthesis model to obtain the speech corresponding to the text to be synthesized. Coupling between the prosodic features and the text during speech synthesis is thereby avoided, so that a speaker of one language can be used to synthesize speech for text in another language without combining two sets of prosodic features at the same time, which improves the speech synthesis effect and the fidelity of the synthesized speech.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 8 is a block diagram of an electronic device according to the speech synthesis method of the embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 8, the electronic apparatus includes: one or more processors 801, a memory 802, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 8 takes one processor 801 as an example.
The memory 802 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the speech synthesis methods provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the speech synthesis method provided by the present application.
Memory 802 serves as a non-transitory computer readable storage medium that may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the speech synthesis methods in embodiments of the present application (e.g., acquisition module 610, first input module 620, and second input module 630 shown in fig. 6; or, for example, acquisition module 710, first input module 720, second input module 730, construction module 740, and training module 750 shown in fig. 7). The processor 801 executes various functional applications of the server and data processing by running non-transitory software programs, instructions, and modules stored in the memory 802, that is, implements the speech synthesis method in the above-described method embodiments.
The memory 802 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the stored data area may store data created from use of the electronic device for speech synthesis, and the like. Further, the memory 802 may include high speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 802 optionally includes memory located remotely from the processor 801, which may be connected to the speech synthesizing electronics through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the speech synthesis method may further include: an input device 803 and an output device 804. The processor 801, the memory 802, the input device 803, and the output device 804 may be connected by a bus or other means, and are exemplified by a bus in fig. 8.
The input device 803 may receive input numeric or character information and generate key signal inputs related to user settings and function controls of the speech-synthesized electronic apparatus, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer, one or more mouse buttons, a track ball, a joystick, or other input device. The output devices 804 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system and overcomes the drawbacks of high management difficulty and weak service scalability in traditional physical hosts and VPS ("Virtual Private Server") services.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved, which is not limited herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (18)

1. A method of speech synthesis comprising:
acquiring a text to be synthesized and an identifier of a speaker corresponding to the text;
inputting the text and the speaker identification into a prosody prediction model to obtain prosody characteristics of the text;
and inputting the text, the speaker identification and the prosody feature into a speech synthesis model to obtain the speech corresponding to the text to be synthesized.
2. The speech synthesis method according to claim 1, wherein the prosodic prediction model obtains prosodic features of the text in a manner,
acquiring the linguistic characteristics of the text by adopting a first linguistic characteristic coding module in the prosody prediction model;
obtaining the style characteristics of the speaker by adopting a style characteristic coding module in the prosody prediction model;
splicing the linguistic features and the style features by adopting a first splicing module in the prosody prediction model;
and acquiring the prosodic features of the text by adopting a first decoding module in the prosodic prediction model and combining the spliced features.
3. The speech synthesis method according to claim 1, wherein the speech synthesis model obtains the speech corresponding to the text to be synthesized in a manner,
acquiring the linguistic characteristics of the text by adopting a second linguistic characteristic coding module in the speech synthesis model;
acquiring the tone characteristic of the speaker by adopting a tone characteristic coding module in the speech synthesis model;
splicing the linguistic feature, the timbre feature and the prosody feature by adopting a second splicing module in the voice synthesis model;
and acquiring the voice corresponding to the text by adopting a second decoding module in the voice synthesis model and combining the spliced characteristics.
4. The speech synthesis method of claim 1, wherein before inputting the text and the speaker's identification into a prosodic prediction model to obtain prosodic features of the text, further comprising:
constructing an initial joint model, wherein the initial joint model comprises: an initial prosody extraction model, an initial speech synthesis model and an initial prosody prediction model; the output of the initial prosody extraction model is connected with the input of the initial speech synthesis model; the output of the initial prosody extraction model is connected with the output of the initial prosody prediction model;
obtaining training data, wherein the training data comprises: text samples, and corresponding speech samples and speaker identification;
training the initial combined model by adopting the text sample, the corresponding voice sample and the speaker identification to obtain a trained combined model;
and acquiring the prosody prediction model and the speech synthesis model in the trained combined model.
5. The speech synthesis method of claim 4, wherein said training the initial joint model using the text samples and the corresponding speech samples and speaker identification to obtain a trained joint model comprises:
and simultaneously training the initial prosody extraction model, the initial speech synthesis model and the initial prosody prediction model in the initial joint model by adopting the text sample and the corresponding speech sample and speaker identification.
6. The speech synthesis method of claim 4, wherein said training the initial joint model using the text samples and the corresponding speech samples and speaker identification to obtain a trained joint model comprises:
training the initial prosody extraction model and the initial speech synthesis model in the initial joint model by adopting the text sample, the corresponding speech sample and the speaker identification;
and training the initial prosody prediction model in the initial combined model by adopting the text sample, the corresponding speaker identification and the trained prosody extraction model.
7. The speech synthesis method according to claim 4, wherein the prosody extraction model extracts prosodic features in a manner such that,
processing the acoustic features in the voice sample by adopting a voice acoustic feature processing module in the prosody extraction model;
and determining the prosodic features of the text sample by adopting an attention mechanism module in the prosody extraction model and combining the processed acoustic features and the linguistic features in the text sample.
8. The speech synthesis method according to claim 2, wherein the first decoding module is an autoregressive network module;
the autoregressive network module is used for predicting the prosodic feature of the current character in the text by combining the prosodic feature of the character before the current character in the text, the current character and the style feature of the speaker.
9. A speech synthesis apparatus comprising:
an acquisition module, used for acquiring a text to be synthesized and an identifier of a speaker corresponding to the text;
the first input module is used for inputting the text and the speaker identification into a prosody prediction model to obtain prosody characteristics of the text;
and the second input module is used for inputting the text, the speaker identification and the prosody feature into a speech synthesis model and acquiring the speech corresponding to the text to be synthesized.
10. The speech synthesis apparatus according to claim 9, wherein the prosodic prediction model obtains prosodic features of the text in such a manner that,
acquiring the linguistic characteristics of the text by adopting a first linguistic characteristic coding module in the prosody prediction model;
obtaining the style characteristics of the speaker by adopting a style characteristic coding module in the prosody prediction model;
splicing the linguistic features and the style features by adopting a first splicing module in the prosody prediction model;
and acquiring the prosodic features of the text by adopting a first decoding module in the prosodic prediction model and combining the spliced features.
11. The speech synthesis apparatus according to claim 9, wherein the speech synthesis model obtains the speech corresponding to the text to be synthesized in such a manner that,
acquiring the linguistic characteristics of the text by adopting a second linguistic characteristic coding module in the speech synthesis model;
acquiring the tone characteristic of the speaker by adopting a tone characteristic coding module in the speech synthesis model;
splicing the linguistic feature, the timbre feature and the prosody feature by adopting a second splicing module in the voice synthesis model;
and acquiring the voice corresponding to the text by adopting a second decoding module in the voice synthesis model and combining the spliced characteristics.
12. The speech synthesis apparatus of claim 9, further comprising: a construction module and a training module;
the building module is configured to build an initial joint model, where the initial joint model includes: an initial prosody extraction model, an initial speech synthesis model and an initial prosody prediction model; the output of the initial prosody extraction model is connected with the input of the initial speech synthesis model; the output of the initial prosody extraction model is connected with the output of the initial prosody prediction model;
the obtaining module is further configured to obtain training data, where the training data includes: text samples, and corresponding speech samples and speaker identification;
the training module is used for training the initial combined model by adopting the text sample, the corresponding voice sample and the speaker identification to obtain a trained combined model;
the obtaining module is further configured to obtain the prosody prediction model and the speech synthesis model in the trained joint model.
13. The speech synthesis apparatus of claim 12, wherein the training module is specifically configured to simultaneously train the initial prosody extraction model, the initial speech synthesis model, and the initial prosody prediction model in the initial joint model using the text samples and corresponding speech samples and speaker identifications.
14. The speech synthesis apparatus of claim 12, wherein the training module is specifically configured to,
training the initial prosody extraction model and the initial speech synthesis model in the initial joint model by adopting the text sample, the corresponding speech sample and the speaker identification;
and training the initial prosody prediction model in the initial combined model by adopting the text sample, the corresponding speaker identification and the trained prosody extraction model.
15. The speech synthesis apparatus according to claim 12, wherein the prosody extraction model extracts prosodic features in such a manner that,
processing the acoustic features in the voice sample by adopting a voice acoustic feature processing module in the prosody extraction model;
and determining the prosodic features of the text sample by adopting an attention mechanism module in the prosody extraction model and combining the processed acoustic features and the linguistic features in the text sample.
16. The speech synthesis apparatus of claim 10, wherein the first decoding module is an autoregressive network module;
the autoregressive network module is used for predicting the prosodic feature of the current character in the text by combining the prosodic feature of the character before the current character in the text, the current character and the style feature of the speaker.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-8.
CN202011224114.6A 2020-11-05 2020-11-05 Speech synthesis method, device, electronic equipment and storage medium Active CN112365880B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011224114.6A CN112365880B (en) 2020-11-05 2020-11-05 Speech synthesis method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011224114.6A CN112365880B (en) 2020-11-05 2020-11-05 Speech synthesis method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112365880A true CN112365880A (en) 2021-02-12
CN112365880B CN112365880B (en) 2024-03-26

Family

ID=74509479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011224114.6A Active CN112365880B (en) 2020-11-05 2020-11-05 Speech synthesis method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112365880B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113012681A (en) * 2021-02-18 2021-06-22 深圳前海微众银行股份有限公司 Awakening voice synthesis method based on awakening voice model and application awakening method
CN113035169A (en) * 2021-03-12 2021-06-25 北京帝派智能科技有限公司 Voice synthesis method and system capable of training personalized tone library on line
CN113129862A (en) * 2021-04-22 2021-07-16 合肥工业大学 World-tacontron-based voice synthesis method and system and server
CN113205793A (en) * 2021-04-30 2021-08-03 北京有竹居网络技术有限公司 Audio generation method and device, storage medium and electronic equipment
CN113808571A (en) * 2021-08-17 2021-12-17 北京百度网讯科技有限公司 Speech synthesis method, speech synthesis device, electronic device and storage medium
CN113838452A (en) * 2021-08-17 2021-12-24 北京百度网讯科技有限公司 Speech synthesis method, apparatus, device and computer storage medium
CN114005428A (en) * 2021-12-31 2022-02-01 科大讯飞股份有限公司 Speech synthesis method, apparatus, electronic device, storage medium, and program product
CN114038484A (en) * 2021-12-16 2022-02-11 游密科技(深圳)有限公司 Voice data processing method and device, computer equipment and storage medium
CN114373445A (en) * 2021-12-23 2022-04-19 北京百度网讯科技有限公司 Voice generation method and device, electronic equipment and storage medium
KR20230026241A (en) * 2021-08-17 2023-02-24 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Voice processing method and device, equipment and computer storage medium
WO2023160553A1 (en) * 2022-02-25 2023-08-31 北京有竹居网络技术有限公司 Speech synthesis method and apparatus, and computer-readable medium and electronic device


Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5905972A (en) * 1996-09-30 1999-05-18 Microsoft Corporation Prosodic databases holding fundamental frequency templates for use in speech synthesis
US20020069061A1 (en) * 1998-10-28 2002-06-06 Ann K. Syrdal Method and system for recorded word concatenation
JP2007127994A (en) * 2005-11-07 2007-05-24 Canon Inc Voice synthesizing method, voice synthesizer, and program
CN104916284A (en) * 2015-06-10 2015-09-16 百度在线网络技术(北京)有限公司 Prosody and acoustics joint modeling method and device for voice synthesis system
CN107464554A (en) * 2017-09-28 2017-12-12 百度在线网络技术(北京)有限公司 Speech synthesis model generation method and device
CN107992485A (en) * 2017-11-27 2018-05-04 北京搜狗科技发展有限公司 Simultaneous interpretation method and device
CN107705783A (en) * 2017-11-27 2018-02-16 北京搜狗科技发展有限公司 Speech synthesis method and device
WO2019139431A1 (en) * 2018-01-11 2019-07-18 네오사피엔스 주식회사 Speech translation method and system using multilingual text-to-speech synthesis model
CN111587455A (en) * 2018-01-11 2020-08-25 新智株式会社 Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium
CN110599998A (en) * 2018-05-25 2019-12-20 阿里巴巴集团控股有限公司 Voice data generation method and device
CN109036375A (en) * 2018-07-25 2018-12-18 腾讯科技(深圳)有限公司 Speech synthesis method, model training method, device and computer equipment
CN110264992A (en) * 2019-06-11 2019-09-20 百度在线网络技术(北京)有限公司 Speech synthesis processing method, device, equipment and storage medium
CN110782871A (en) * 2019-10-30 2020-02-11 百度在线网络技术(北京)有限公司 Prosodic pause prediction method and device, and electronic equipment
CN111508469A (en) * 2020-04-26 2020-08-07 北京声智科技有限公司 Text-to-speech conversion method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU Hang; LING Zhenhua; GUO Wu; DAI Lirong: "Improved Adaptation Method for Cross-Lingual Speech Synthesis Models", Pattern Recognition and Artificial Intelligence, no. 04 *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113012681A (en) * 2021-02-18 2021-06-22 深圳前海微众银行股份有限公司 Wake-up speech synthesis method based on a wake-up speech model, and application wake-up method
CN113035169A (en) * 2021-03-12 2021-06-25 北京帝派智能科技有限公司 Speech synthesis method and system capable of training a personalized timbre library online
CN113129862A (en) * 2021-04-22 2021-07-16 合肥工业大学 World-tacotron-based speech synthesis method, system and server
CN113129862B (en) * 2021-04-22 2024-03-12 合肥工业大学 Voice synthesis method, system and server based on world-tacotron
CN113205793B (en) * 2021-04-30 2022-05-31 北京有竹居网络技术有限公司 Audio generation method and device, storage medium and electronic equipment
CN113205793A (en) * 2021-04-30 2021-08-03 北京有竹居网络技术有限公司 Audio generation method and device, storage medium and electronic equipment
CN113838452A (en) * 2021-08-17 2021-12-24 北京百度网讯科技有限公司 Speech synthesis method, apparatus, device and computer storage medium
KR102611003B1 (en) * 2021-08-17 2023-12-06 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Voice processing method and device, equipment and computer storage medium
CN113808571A (en) * 2021-08-17 2021-12-17 北京百度网讯科技有限公司 Speech synthesis method, speech synthesis device, electronic device and storage medium
CN113808571B (en) * 2021-08-17 2022-05-27 北京百度网讯科技有限公司 Speech synthesis method, speech synthesis device, electronic device and storage medium
KR102619408B1 (en) * 2021-08-17 2023-12-29 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Voice synthesizing method, device, electronic equipment and storage medium
KR20220083987A (en) * 2021-08-17 2022-06-21 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Voice synthesizing method, device, electronic equipment and storage medium
CN113838452B (en) * 2021-08-17 2022-08-23 北京百度网讯科技有限公司 Speech synthesis method, apparatus, device and computer storage medium
KR20230026242A (en) * 2021-08-17 2023-02-24 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Voice synthesis method and device, equipment and computer storage medium
KR20230026241A (en) * 2021-08-17 2023-02-24 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Voice processing method and device, equipment and computer storage medium
KR102611024B1 (en) * 2021-08-17 2023-12-06 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Voice synthesis method and device, equipment and computer storage medium
CN114038484A (en) * 2021-12-16 2022-02-11 游密科技(深圳)有限公司 Voice data processing method and device, computer equipment and storage medium
CN114038484B (en) * 2021-12-16 2024-01-30 游密科技(深圳)有限公司 Voice data processing method, device, computer equipment and storage medium
CN114373445A (en) * 2021-12-23 2022-04-19 北京百度网讯科技有限公司 Voice generation method and device, electronic equipment and storage medium
CN114005428A (en) * 2021-12-31 2022-02-01 科大讯飞股份有限公司 Speech synthesis method, apparatus, electronic device, storage medium, and program product
WO2023160553A1 (en) * 2022-02-25 2023-08-31 北京有竹居网络技术有限公司 Speech synthesis method and apparatus, and computer-readable medium and electronic device

Also Published As

Publication number Publication date
CN112365880B (en) 2024-03-26

Similar Documents

Publication Publication Date Title
CN112365880B (en) Speech synthesis method, device, electronic equipment and storage medium
US11769480B2 (en) Method and apparatus for training model, method and apparatus for synthesizing speech, device and storage medium
CN110473516B (en) Voice synthesis method and device and electronic equipment
US11769482B2 (en) Method and apparatus of synthesizing speech, method and apparatus of training speech synthesis model, electronic device, and storage medium
KR20210106397A (en) Voice conversion method, electronic device, and storage medium
CN111754978B (en) Prosodic hierarchy labeling method, device, equipment and storage medium
CN110619867B (en) Training method and device of speech synthesis model, electronic equipment and storage medium
CN112270920A (en) Voice synthesis method and device, electronic equipment and readable storage medium
CN110797005B (en) Prosody prediction method, apparatus, device, and medium
CN112509552B (en) Speech synthesis method, device, electronic equipment and storage medium
CN112466275A (en) Voice conversion and corresponding model training method, device, equipment and storage medium
CN112365877A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN112365882A (en) Speech synthesis method, model training method, device, equipment and storage medium
CN110782871B (en) Prosodic pause prediction method and device, and electronic equipment
US20220076657A1 (en) Method of registering attribute in speech synthesis model, apparatus of registering attribute in speech synthesis model, electronic device, and medium
CN112365879A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN110580904A (en) Method and device for controlling a mini program through voice, electronic equipment and storage medium
US20230178067A1 (en) Method of training speech synthesis model and method of synthesizing speech
US20220068265A1 (en) Method for displaying streaming speech recognition result, electronic device, and storage medium
CN111127191A (en) Risk assessment method and device
CN110767212B (en) Voice processing method and device and electronic equipment
CN112365875A (en) Voice synthesis method, device, vocoder and electronic equipment
CN112650844A (en) Tracking method and device of conversation state, electronic equipment and storage medium
CN112289305A (en) Prosody prediction method, device, equipment and storage medium
JP7204861B2 (en) Recognition method, device, electronic device and storage medium for mixed Chinese and English speech

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant