CN112365880A - Speech synthesis method, speech synthesis device, electronic equipment and storage medium - Google Patents

Speech synthesis method, speech synthesis device, electronic equipment and storage medium

Info

Publication number
CN112365880A
Authority
CN
China
Prior art keywords
model
text
prosody
initial
speech synthesis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011224114.6A
Other languages
Chinese (zh)
Other versions
CN112365880B (en)
Inventor
张君腾
孙涛
王文富
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011224114.6A priority Critical patent/CN112365880B/en
Publication of CN112365880A publication Critical patent/CN112365880A/en
Application granted granted Critical
Publication of CN112365880B publication Critical patent/CN112365880B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/086 - Detection of language
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The application discloses a speech synthesis method, a speech synthesis apparatus, an electronic device and a storage medium, and relates to the technical fields of speech technology and deep learning. The specific implementation scheme is as follows: a text to be synthesized and an identifier of the speaker corresponding to the text are acquired; the text and the speaker identifier are input into a prosody prediction model to obtain the prosodic features of the text; and the text, the speaker identifier and the prosodic features are input into a speech synthesis model to obtain the speech corresponding to the text to be synthesized. In this way, coupling between the prosodic features and the text during speech synthesis is avoided, so that a speaker of one language can be used to synthesize speech for text in another language without combining the prosodic features of two languages, which improves the speech synthesis effect and the fidelity of the synthesized speech.

Description

Speech synthesis method, speech synthesis device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, specifically to the fields of speech technology and deep learning, and in particular to a speech synthesis method and apparatus, an electronic device, and a storage medium.
Background
Current speech synthesis technology records a high-quality speech library, obtains the text and the speaker corresponding to each utterance in the library, and trains a neural network model that takes the text and the speaker as input and outputs speech, thereby obtaining an acoustic model.
The speech also contains prosodic features associated with the speaker. In a speech synthesis system supporting multiple languages, the text of each language is language dependent; for example, English text and Chinese text are independent of each other. Moreover, when a speech library is recorded, each speaker typically speaks only one language. As a result, during training of the neural network model, the speaker-related prosodic features become strongly coupled to the language, and therefore to the text of that language. In cross-lingual speech synthesis, when a speaker of one language is used to synthesize speech for text in another language, the synthesized speech has low fidelity and a poor synthesis effect.
Disclosure of Invention
The disclosure provides a speech synthesis method, a speech synthesis apparatus, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided a speech synthesis method including: acquiring a text to be synthesized and an identifier of a speaker corresponding to the text; inputting the text and the speaker identification into a prosody prediction model to obtain prosody characteristics of the text; and inputting the text, the speaker identification and the prosody feature into a speech synthesis model to obtain the speech corresponding to the text to be synthesized.
According to another aspect of the present disclosure, there is provided a speech synthesis apparatus including: an acquisition module, used for acquiring a text to be synthesized and an identifier of the speaker corresponding to the text; a first input module, used for inputting the text and the speaker identifier into a prosody prediction model to obtain the prosodic features of the text; and a second input module, used for inputting the text, the speaker identifier and the prosodic features into a speech synthesis model to acquire the speech corresponding to the text to be synthesized.
According to a third aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a speech synthesis method as described above.
According to a fourth aspect, a non-transitory computer-readable storage medium is presented having stored thereon computer instructions for causing the computer to perform the speech synthesis method as described above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present application;
FIG. 2 is a schematic diagram of a prosody prediction model;
FIG. 3 is a schematic diagram of the structure of a speech synthesis model;
FIG. 4 is a schematic illustration according to a second embodiment of the present application;
FIG. 5 is a schematic diagram of a prosody extraction model;
FIG. 6 is a schematic illustration according to a third embodiment of the present application;
FIG. 7 is a schematic illustration according to a fourth embodiment of the present application;
fig. 8 is a block diagram of an electronic device for implementing a speech synthesis method according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
A speech synthesis method, apparatus, electronic device, and storage medium according to embodiments of the present application are described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram according to a first embodiment of the present application. It should be noted that the execution subject of the embodiment of the present application is a speech synthesis apparatus, and the speech synthesis apparatus may specifically be a hardware device, or software in a hardware device, or the like.
As shown in fig. 1, the specific implementation process of the speech synthesis method is as follows:
step 101, obtaining a text to be synthesized and an identification of a speaker corresponding to the text.
In the embodiment of the present application, the text to be synthesized may be any text in any language. The language may be, for example, Chinese or English, and the text may be, for example, news text, entertainment text or chat text. The speaker corresponding to the text is the speaker to whom the speech to be synthesized from the text belongs. For example, if the speech of speaker A is to be synthesized, the speaker corresponding to the text is speaker A; if the speech of speaker B is to be synthesized, the speaker corresponding to the text is speaker B.
And 102, inputting the text and the identification of the speaker into a prosody prediction model to obtain prosody characteristics of the text.
In the embodiment of the present application, each speaker corresponds to a timbre feature and a style feature. Different speakers have different timbre features and different style features. The prosodic features of a text are related both to the text and to the style features of the speaker. Therefore, the prosody prediction model obtains the prosodic features of the text as follows: a first linguistic feature encoding module in the prosody prediction model obtains the linguistic features of the text; a style feature encoding module in the prosody prediction model obtains the style features of the speaker; a first concatenation module in the prosody prediction model concatenates the linguistic features and the style features; and a first decoding module in the prosody prediction model obtains the prosodic features of the text from the concatenated features.
Fig. 2 is a schematic structural diagram of the prosody prediction model. In Fig. 2, the inputs of the prosody prediction model may be the text (Input Text) and the speaker identifier (Speaker Id). The first linguistic feature encoding module may consist of a text vectorization layer (Character Embedding), convolutional layers (Conv) and a bidirectional LSTM (Bidirectional LSTM), and obtains the linguistic features (Linguistic Feature) of the text. The style feature encoding module may be a speaker vectorization layer (Speaker Embedding), which obtains the style features (Style Feature) of the speaker. The first concatenation module (Concat) concatenates the linguistic features and the style features.
The first decoding module (Decoder) may be an autoregressive module comprising a preprocessing layer (Pre-Net), a unidirectional LSTM and a linear projection layer (Linear Projection), and it outputs the prosodic features (Prosody Feature) of the text. The prosodic features of the character at the previous time step are processed by the preprocessing layer to obtain the processed prosodic features of the previous step; these are concatenated with the concatenated features of the current time step and fed through the unidirectional LSTM module and the linear projection layer in sequence to obtain the prosodic features of the character at the current time step in the text.
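For illustration, the structure described above could be sketched as follows. This is a minimal PyTorch sketch; all layer sizes, dimensions and class names are illustrative assumptions and are not specified by the present application.

```python
# Minimal sketch of the prosody prediction model of Fig. 2 (assumed dimensions).
import torch
import torch.nn as nn

class ProsodyPredictor(nn.Module):
    def __init__(self, vocab_size, num_speakers, char_dim=256, style_dim=64, prosody_dim=32):
        super().__init__()
        # First linguistic feature encoder: Character Embedding + Conv + bidirectional LSTM
        self.char_embedding = nn.Embedding(vocab_size, char_dim)
        self.convs = nn.Sequential(
            nn.Conv1d(char_dim, char_dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(char_dim, char_dim, kernel_size=5, padding=2), nn.ReLU())
        self.blstm = nn.LSTM(char_dim, char_dim // 2, bidirectional=True, batch_first=True)
        # Style feature encoder: Speaker Embedding
        self.style_embedding = nn.Embedding(num_speakers, style_dim)
        # Autoregressive decoder: Pre-Net + unidirectional LSTM + Linear Projection
        self.pre_net = nn.Sequential(nn.Linear(prosody_dim, 64), nn.ReLU())
        self.decoder_lstm = nn.LSTMCell(char_dim + style_dim + 64, 256)
        self.projection = nn.Linear(256, prosody_dim)

    def forward(self, text_ids, speaker_id):
        # Linguistic features of the text
        x = self.char_embedding(text_ids)                      # (B, T, char_dim)
        x = self.convs(x.transpose(1, 2)).transpose(1, 2)      # (B, T, char_dim)
        linguistic, _ = self.blstm(x)                          # (B, T, char_dim)
        # Style features of the speaker, broadcast over all characters
        style = self.style_embedding(speaker_id).unsqueeze(1).expand(-1, x.size(1), -1)
        concat = torch.cat([linguistic, style], dim=-1)        # first concatenation module
        # Autoregressive decoding: previous prosody -> Pre-Net -> LSTM -> projection
        B, T, _ = concat.shape
        prev = concat.new_zeros(B, self.projection.out_features)
        h = c = concat.new_zeros(B, 256)
        outputs = []
        for t in range(T):
            step_in = torch.cat([concat[:, t], self.pre_net(prev)], dim=-1)
            h, c = self.decoder_lstm(step_in, (h, c))
            prev = self.projection(h)                          # prosody of the current character
            outputs.append(prev)
        return torch.stack(outputs, dim=1)                     # (B, T, prosody_dim)
```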
In the embodiment of the present application, by arranging the first linguistic feature encoding module and the style feature encoding module in the prosody prediction model, the prosodic features are kept from being coupled to the text during speech synthesis: the style features of the speaker are determined only from the speaker identifier, and the prosodic features of the text are then determined by combining the style features with the text. The prosodic features are thus treated as an independent feature rather than being coupled to the speaker and the text during speech synthesis. Consequently, when a speaker of one language is used to synthesize speech for text in another language, only one set of prosodic features is involved and two sets of prosodic features are not combined at the same time, which improves the speech synthesis effect and the fidelity of the synthesized speech.
Step 103, inputting the text, the speaker identification and the prosody feature into a speech synthesis model, and acquiring the speech corresponding to the text to be synthesized.
In the embodiment of the present application, the speech synthesis model obtains the speech corresponding to the text to be synthesized as follows: a second linguistic feature encoding module in the speech synthesis model obtains the linguistic features of the text; a timbre feature encoding module in the speech synthesis model obtains the timbre features of the speaker; a second concatenation module in the speech synthesis model concatenates the linguistic features, the timbre features and the prosodic features; and a second decoding module in the speech synthesis model obtains the speech corresponding to the text from the concatenated features.
Fig. 3 is a schematic structural diagram of the speech synthesis model. In Fig. 3, the inputs of the speech synthesis model may be the text (Input Text), the speaker identifier (Speaker Id) and the prosodic features (Prosody Feature). The second linguistic feature encoding module may consist of a text vectorization layer (Character Embedding), convolutional layers (Conv) and a bidirectional LSTM (Bidirectional LSTM), and obtains the linguistic features (Linguistic Feature) of the text. The timbre feature encoding module may be a speaker vectorization layer (Speaker Embedding), which obtains the timbre features (Speaker Feature) of the speaker. The second concatenation module (Concat) concatenates the linguistic, timbre and prosodic features.
The second decoding module (Decoder) may be an autoregressive module comprising an attention mechanism module (Attention), a unidirectional LSTM, a linear projection layer (Linear Projection), a preprocessing layer (Pre-Net) and a residual prediction layer (Post-Net). The linear projection result of the character at the previous time step is processed by the preprocessing layer to obtain the processed linear projection result of the previous step. The concatenated features of the current time step are provided to the attention mechanism module. The features output by the attention mechanism module are concatenated with the processed linear projection result of the previous step and fed into the unidirectional LSTM. The output of the unidirectional LSTM is concatenated with the features output by the attention mechanism module and passed through the linear projection layer to obtain the linear projection result of the current step. The linear projection result of the current step is processed by the residual prediction layer, and the output of the residual prediction layer is summed with the linear projection result of the current step to obtain the acoustic features predicted for the character at the current time step; the speech corresponding to the text is then synthesized by combining the acoustic features predicted for the characters in the text. Because the acoustic features of the current character are predicted in combination with the linear projection result of the previous character, the accuracy of speech synthesis can be further improved.
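For illustration, one decoding step of the second decoding module could be sketched as follows. This is a minimal sketch; the layer sizes and the simple dot-product attention are illustrative assumptions, since the present application does not specify the attention variant or dimensions.

```python
# Minimal sketch of one step of the Fig. 3 decoder (Attention + LSTM + Linear Projection
# + Pre-Net + Post-Net), with assumed dimensions and dot-product attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SynthesisDecoder(nn.Module):
    def __init__(self, enc_dim=384, mel_dim=80, lstm_dim=512):
        super().__init__()
        self.pre_net = nn.Sequential(nn.Linear(mel_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 128), nn.ReLU())
        self.attn_query = nn.Linear(lstm_dim, enc_dim)
        self.decoder_lstm = nn.LSTMCell(enc_dim + 128, lstm_dim)
        self.projection = nn.Linear(lstm_dim + enc_dim, mel_dim)
        self.post_net = nn.Sequential(                      # residual prediction layer
            nn.Conv1d(mel_dim, 256, kernel_size=5, padding=2), nn.Tanh(),
            nn.Conv1d(256, mel_dim, kernel_size=5, padding=2))

    def step(self, memory, prev_mel, state):
        """One autoregressive step. `memory` holds the concatenated linguistic,
        timbre and prosody features of all characters, shape (B, T, enc_dim)."""
        h, c = state
        # Pre-Net on the previous step's linear projection result
        prev = self.pre_net(prev_mel)
        # Attention over the concatenated encoder features
        scores = torch.bmm(memory, self.attn_query(h).unsqueeze(-1)).squeeze(-1)
        context = torch.bmm(F.softmax(scores, dim=-1).unsqueeze(1), memory).squeeze(1)
        # Unidirectional LSTM over [attention context ; processed previous output]
        h, c = self.decoder_lstm(torch.cat([context, prev], dim=-1), (h, c))
        # Linear projection over [LSTM output ; attention context]
        mel = self.projection(torch.cat([h, context], dim=-1))
        return mel, (h, c)

    def refine(self, mel_frames):
        """Post-Net residual, summed with the projection results as described above."""
        x = mel_frames.transpose(1, 2)
        return mel_frames + self.post_net(x).transpose(1, 2)
```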
In the embodiment of the present application, by using the second linguistic feature encoding module and the timbre feature encoding module in the speech synthesis model and by taking the prosodic features as an input, coupling between the prosodic features and the text during speech synthesis is avoided, and speech can be synthesized by combining the prosodic features, the text and the timbre features of the speaker. When a speaker of one language is used to synthesize speech for text in another language, only one set of prosodic features needs to be combined and two sets of prosodic features are not combined at the same time, which improves the speech synthesis effect and the fidelity of the synthesized speech.
In summary, the text to be synthesized and the identifier of the speaker corresponding to the text are acquired; the text and the speaker identifier are input into the prosody prediction model to obtain the prosodic features of the text; and the text, the speaker identifier and the prosodic features are input into the speech synthesis model to obtain the speech corresponding to the text to be synthesized. Coupling between the prosodic features and the text during speech synthesis is thereby avoided, so that a speaker of one language can be used to synthesize speech for text in another language without combining two sets of prosodic features, which improves the speech synthesis effect and the fidelity of the synthesized speech.
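For illustration, the two-stage inference flow summarized above could be sketched as follows, assuming the ProsodyPredictor sketched earlier, a synthesis model wrapping the structure of Fig. 3, and a separate vocoder; these names, the tensor shapes and the use of a neural vocoder are illustrative assumptions rather than components defined by the present application.

```python
# Minimal end-to-end inference sketch of the two-stage pipeline (assumed interfaces).
import torch

@torch.no_grad()
def synthesize(text_ids: torch.Tensor, speaker_id: torch.Tensor,
               prosody_model, synthesis_model, vocoder):
    """text_ids: (1, T) character ids; speaker_id: (1,) id of the target speaker."""
    # Step 102: predict prosodic features from the text and the speaker identifier.
    prosody = prosody_model(text_ids, speaker_id)          # (1, T, prosody_dim)
    # Step 103: synthesize acoustic features from text, speaker id and prosody.
    mel = synthesis_model(text_ids, speaker_id, prosody)   # (1, frames, mel_dim)
    # Convert the acoustic features to a waveform with a vocoder.
    return vocoder(mel)
```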
To improve the accuracy of the prosody prediction model and the speech synthesis model, the speech synthesis apparatus may train the prosody prediction model and the speech synthesis model. As shown in fig. 4, fig. 4 is a schematic diagram according to a second embodiment of the present application. On the basis of the embodiment shown in fig. 1, the method may further include the following steps.
Step 401, constructing an initial joint model, wherein the initial joint model includes: an initial prosody extraction model, an initial speech synthesis model and an initial prosody prediction model; the output of the initial prosody extraction model is connected to the input of the initial speech synthesis model; and the output of the initial prosody extraction model is connected to the output of the initial prosody prediction model.
In an embodiment of the present application, the input of the initial speech synthesis model may be a text sample, a speaker identifier, and the prosodic features extracted from the corresponding speech sample. The prosodic feature extraction is performed by the initial prosody extraction model on the speech sample corresponding to the text sample.
In the embodiment of the present application, the input of the initial prosody prediction model may be a text sample and a speaker identifier, and the output is the predicted prosodic features. A loss function may be determined by combining the predicted prosodic features with the prosodic features extracted from the speech sample by the initial prosody extraction model, and the coefficients of the initial prosody prediction model are adjusted accordingly.
In the embodiment of the present application, the prosody extraction model extracts the prosodic features as follows: a speech acoustic feature processing module in the prosody extraction model processes the acoustic features in the speech sample; and an attention mechanism module in the prosody extraction model determines the prosodic features of the text sample by combining the processed acoustic features with the linguistic features in the text sample.
A schematic structural diagram of the prosody extraction model is shown in Fig. 5. In Fig. 5, the input of the prosody extraction model may be the acoustic features (Mel spectrum) of the speech sample corresponding to the text sample. The speech acoustic feature processing module may consist of convolutional layers (Conv) and a bidirectional GRU (Bidirectional GRU), and obtains the processed acoustic features. An attention mechanism module (Attention) determines the prosodic features (Prosody Feature) of the text sample by combining the processed acoustic features with the linguistic features (Linguistic Feature) of the text sample. The extracted prosodic features include the prosodic features of each character in the text sample, which improves the accuracy of the extracted prosodic features.
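For illustration, the structure of Fig. 5 could be sketched as follows; the layer sizes and the dot-product attention that aligns characters with acoustic frames are illustrative assumptions.

```python
# Minimal sketch of the prosody extraction model of Fig. 5: Conv + bidirectional GRU
# over the Mel spectrum, then attention from linguistic features to acoustic frames,
# yielding one prosody vector per character (assumed dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProsodyExtractor(nn.Module):
    def __init__(self, mel_dim=80, ling_dim=256, hidden=128, prosody_dim=32):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(mel_dim, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU())
        self.bgru = nn.GRU(hidden, hidden // 2, bidirectional=True, batch_first=True)
        self.query = nn.Linear(ling_dim, hidden)   # project linguistic features to queries
        self.out = nn.Linear(hidden, prosody_dim)

    def forward(self, mel, linguistic):
        """mel: (B, frames, mel_dim); linguistic: (B, chars, ling_dim)."""
        # Speech acoustic feature processing module
        x = self.convs(mel.transpose(1, 2)).transpose(1, 2)       # (B, frames, hidden)
        acoustic, _ = self.bgru(x)                                 # (B, frames, hidden)
        # Attention: each character attends over all acoustic frames
        scores = torch.bmm(self.query(linguistic), acoustic.transpose(1, 2))
        context = torch.bmm(F.softmax(scores, dim=-1), acoustic)  # (B, chars, hidden)
        return self.out(context)                                   # prosody per character
```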
Step 402, obtaining training data, wherein the training data comprises: text samples, and corresponding speech samples and speaker identification.
Step 403, training the initial joint model with the text samples, the corresponding speech samples and the speaker identifiers to obtain a trained joint model.
In the embodiment of the present application, in a first implementation scenario, the speech synthesis apparatus may use the text samples and the corresponding speech samples and speaker identifiers to simultaneously train the initial prosody extraction model, the initial speech synthesis model and the initial prosody prediction model in the initial joint model.
Specifically, for a text sample and its corresponding speech sample and speaker identifier, the speech sample may be input into the initial prosody extraction model in the initial joint model to extract the prosodic features of the speech sample; the extracted prosodic features, the text sample and the speaker identifier are input into the initial speech synthesis model to obtain an output speech result; and the text sample and the speaker identifier are input into the initial prosody prediction model to output a prosodic feature prediction result. Then, the coefficients of the initial prosody prediction model are adjusted by combining the prosodic feature prediction result with the extracted prosodic features, and the coefficients of the initial prosody extraction model and the initial speech synthesis model are adjusted by combining the speech result output by the initial speech synthesis model with the speech sample, thereby performing the training. This improves the training speed while ensuring training accuracy.
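For illustration, one joint training step in this first scenario could be sketched as follows; the loss functions, the optimizer and the decision to detach the extracted prosody when computing the prediction loss are illustrative assumptions, not details specified by the present application.

```python
# Minimal sketch of one joint training step (first scenario, assumed losses/optimizer).
import torch
import torch.nn.functional as F

def joint_training_step(batch, extractor, synthesizer, predictor, optimizer):
    text_ids, speaker_id, mel_target, linguistic = batch
    # Extract prosody from the ground-truth speech sample
    prosody_ref = extractor(mel_target, linguistic)
    # Synthesize speech from text, speaker id and the extracted prosody
    mel_pred = synthesizer(text_ids, speaker_id, prosody_ref)
    synthesis_loss = F.l1_loss(mel_pred, mel_target)
    # Predict prosody from text and speaker id only, and match the extracted prosody
    # (detaching so this loss only adjusts the predictor is an assumption)
    prosody_pred = predictor(text_ids, speaker_id)
    prosody_loss = F.mse_loss(prosody_pred, prosody_ref.detach())
    # Adjust the coefficients of all three sub-models in one step
    loss = synthesis_loss + prosody_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```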
In the embodiment of the present application, in a second implementation scenario, the speech synthesis apparatus may first train the initial prosody extraction model and the initial speech synthesis model in the initial joint model with the text samples, the corresponding speech samples and the speaker identifiers, and then train the initial prosody prediction model in the initial joint model with the text samples, the corresponding speaker identifiers and the trained prosody extraction model.
Specifically, for a text sample and its corresponding speech sample and speaker identifier, the speech sample may be input into the initial prosody extraction model in the initial joint model to extract the prosodic features of the speech sample; the extracted prosodic features, the text sample and the speaker identifier are input into the initial speech synthesis model to obtain an output speech result; and then the coefficients of the initial prosody extraction model and the initial speech synthesis model are adjusted by combining the speech result output by the initial speech synthesis model with the speech sample, thereby training the initial prosody extraction model and the initial speech synthesis model.
After the initial prosody extraction model and the initial speech synthesis model have been trained with each text sample and its corresponding speech sample and speaker identifier, for a text sample and its corresponding speech sample and speaker identifier, the speech sample may be input into the trained prosody extraction model to extract the prosodic features of the speech sample, and the text sample and the speaker identifier are input into the initial prosody prediction model to output a prosodic feature prediction result. The coefficients of the initial prosody prediction model are then adjusted by combining the prosodic feature prediction result with the extracted prosodic features, thereby training the initial prosody prediction model and improving the accuracy of the prosody prediction model and the speech synthesis model.
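For illustration, this second, two-stage scenario could be sketched as follows; the epoch counts, losses and optimizers are illustrative assumptions.

```python
# Minimal sketch of the second scenario: stage 1 trains the extractor and synthesizer,
# stage 2 freezes the trained extractor and trains only the prosody predictor.
import torch
import torch.nn.functional as F

def train_two_stage(loader, extractor, synthesizer, predictor,
                    stage1_epochs=10, stage2_epochs=10):
    # Stage 1: prosody extraction model + speech synthesis model
    opt1 = torch.optim.Adam(list(extractor.parameters()) + list(synthesizer.parameters()))
    for _ in range(stage1_epochs):
        for text_ids, speaker_id, mel_target, linguistic in loader:
            prosody_ref = extractor(mel_target, linguistic)
            mel_pred = synthesizer(text_ids, speaker_id, prosody_ref)
            loss = F.l1_loss(mel_pred, mel_target)
            opt1.zero_grad(); loss.backward(); opt1.step()
    # Stage 2: prosody prediction model, supervised by the frozen, trained extractor
    extractor.eval()
    opt2 = torch.optim.Adam(predictor.parameters())
    for _ in range(stage2_epochs):
        for text_ids, speaker_id, mel_target, linguistic in loader:
            with torch.no_grad():
                prosody_ref = extractor(mel_target, linguistic)
            loss = F.mse_loss(predictor(text_ids, speaker_id), prosody_ref)
            opt2.zero_grad(); loss.backward(); opt2.step()
```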
Step 404, obtaining the prosody prediction model and the speech synthesis model from the trained joint model.
In summary, an initial joint model is constructed, which includes an initial prosody extraction model, an initial speech synthesis model and an initial prosody prediction model, where the output of the initial prosody extraction model is connected to the input of the initial speech synthesis model and the output of the initial prosody extraction model is connected to the output of the initial prosody prediction model; training data is acquired, including text samples and the corresponding speech samples and speaker identifiers; the initial joint model is trained with the text samples, the corresponding speech samples and the speaker identifiers to obtain a trained joint model; and the prosody prediction model and the speech synthesis model are obtained from the trained joint model. A highly accurate prosody prediction model and speech synthesis model can thus be obtained through training, improving the accuracy of speech synthesis.
In order to implement the foregoing embodiments, an apparatus for speech synthesis is also provided in the embodiments of the present application.
Fig. 6 is a schematic diagram according to a third embodiment of the present application. As shown in fig. 6, the speech synthesis apparatus 600 includes: an acquisition module 610, a first input module 620, and a second input module 630.
The obtaining module 610 is configured to obtain a text to be synthesized and an identifier of a speaker corresponding to the text;
a first input module 620, configured to input the text and the speaker identifier into a prosody prediction model, so as to obtain prosody features of the text;
the second input module 630 is configured to input the text, the speaker identifier, and the prosody feature into a speech synthesis model, and obtain speech corresponding to the text to be synthesized.
As a possible implementation manner of the embodiment of the present application, the prosody prediction model obtains the prosody features of the text in a manner that a first linguistic feature coding module in the prosody prediction model is adopted to obtain the linguistic features of the text; obtaining the style characteristics of the speaker by adopting a style characteristic coding module in the prosody prediction model; splicing the linguistic features and the style features by adopting a first splicing module in the prosody prediction model; and acquiring the prosodic features of the text by adopting a first decoding module in the prosodic prediction model and combining the spliced features.
As a possible implementation manner of the embodiment of the present application, the manner in which the speech synthesis model obtains the speech corresponding to the text to be synthesized is that a second linguistic feature coding module in the speech synthesis model is adopted to obtain the linguistic feature of the text; acquiring the tone characteristic of the speaker by adopting a tone characteristic coding module in the speech synthesis model; splicing the linguistic feature, the timbre feature and the prosody feature by adopting a second splicing module in the voice synthesis model; and acquiring the voice corresponding to the text by adopting a second decoding module in the voice synthesis model and combining the spliced characteristics.
As a possible implementation manner of the embodiment of the present application, referring to fig. 7 in combination, the speech synthesis apparatus 700 includes: an acquisition module 710, a first input module 720, a second input module 730, a construction module 740, and a training module 750;
the details of the obtaining module 710, the first input module 720 and the second input module 730 refer to the descriptions of the obtaining module 610, the first input module 620 and the second input module 630 in the embodiment shown in fig. 6, and are not described here.
The building module 740 is configured to build an initial joint model, where the initial joint model includes: an initial prosody extraction model, an initial speech synthesis model and an initial prosody prediction model; the output of the initial prosody extraction model is connected with the input of the initial speech synthesis model; the output of the initial prosody extraction model is connected with the output of the initial prosody prediction model;
the obtaining module 710 is further configured to obtain training data, where the training data includes: text samples, and corresponding speech samples and speaker identification;
the training module 750 is configured to train the initial combined model by using the text sample, and the corresponding voice sample and speaker identification to obtain a trained combined model;
the obtaining module 710 is further configured to obtain the prosody prediction model and the speech synthesis model in the trained joint model.
As a possible implementation manner of the embodiment of the present application, the training module is specifically configured to use the text sample, and the corresponding voice sample and speaker identifier to simultaneously train the initial prosody extraction model, the initial voice synthesis model, and the initial prosody prediction model in the initial joint model.
As a possible implementation of the embodiments of the present application, the training module is specifically configured to,
training the initial prosody extraction model and the initial speech synthesis model in the initial joint model by adopting the text sample, the corresponding speech sample and the speaker identification;
and training the initial prosody prediction model in the initial combined model by adopting the text sample, the corresponding speaker identification and the trained prosody extraction model.
As a possible implementation manner of the embodiment of the present application, the prosody extraction model extracts the prosody features by,
processing the acoustic features in the voice sample by adopting a voice acoustic feature processing module in the prosody extraction model;
and determining the prosodic features of the text sample by adopting an attention mechanism module in the prosody extraction model and combining the processed acoustic features and the linguistic features in the text sample.
As a possible implementation manner of the embodiment of the present application, the first decoding module is an autoregressive network module;
the autoregressive network module is used for predicting the prosodic feature of the current character in the text by combining the prosodic feature of the character before the current character in the text, the current character and the style feature of the speaker.
The speech synthesis apparatus of the embodiment of the present application acquires the text to be synthesized and the identifier of the speaker corresponding to the text; inputs the text and the speaker identifier into the prosody prediction model to obtain the prosodic features of the text; and inputs the text, the speaker identifier and the prosodic features into the speech synthesis model to obtain the speech corresponding to the text to be synthesized. Coupling between the prosodic features and the text during speech synthesis is thereby avoided, so that a speaker of one language can be used to synthesize speech for text in another language without combining two sets of prosodic features at the same time, which improves the speech synthesis effect and the fidelity of the synthesized speech.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 8 is a block diagram of an electronic device according to the speech synthesis method of the embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 8, the electronic apparatus includes: one or more processors 801, a memory 802, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 8 takes one processor 801 as an example.
The memory 802 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the speech synthesis methods provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the speech synthesis method provided by the present application.
Memory 802 serves as a non-transitory computer readable storage medium that may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the speech synthesis methods in embodiments of the present application (e.g., acquisition module 610, first input module 620, and second input module 630 shown in fig. 6; or, for example, acquisition module 710, first input module 720, second input module 730, construction module 740, and training module 750 shown in fig. 7). The processor 801 executes various functional applications of the server and data processing by running non-transitory software programs, instructions, and modules stored in the memory 802, that is, implements the speech synthesis method in the above-described method embodiments.
The memory 802 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the stored data area may store data created from use of the electronic device for speech synthesis, and the like. Further, the memory 802 may include high speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 802 optionally includes memory located remotely from the processor 801, which may be connected to the speech synthesizing electronics through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the speech synthesis method may further include: an input device 803 and an output device 804. The processor 801, the memory 802, the input device 803, and the output device 804 may be connected by a bus or other means, and are exemplified by a bus in fig. 8.
The input device 803 may receive input numeric or character information and generate key signal inputs related to user settings and function controls of the speech-synthesized electronic apparatus, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer, one or more mouse buttons, a track ball, a joystick, or other input device. The output devices 804 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system and overcomes the drawbacks of high management difficulty and weak service scalability in traditional physical hosts and VPS ("Virtual Private Server") services.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved, which is not limited herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (18)

1. A method of speech synthesis comprising:
acquiring a text to be synthesized and an identifier of a speaker corresponding to the text;
inputting the text and the speaker identification into a prosody prediction model to obtain prosody characteristics of the text;
and inputting the text, the speaker identification and the prosody feature into a speech synthesis model to obtain the speech corresponding to the text to be synthesized.
2. The speech synthesis method according to claim 1, wherein the prosodic prediction model obtains prosodic features of the text in a manner,
acquiring the linguistic characteristics of the text by adopting a first linguistic characteristic coding module in the prosody prediction model;
obtaining the style characteristics of the speaker by adopting a style characteristic coding module in the prosody prediction model;
splicing the linguistic features and the style features by adopting a first splicing module in the prosody prediction model;
and acquiring the prosodic features of the text by adopting a first decoding module in the prosodic prediction model and combining the spliced features.
3. The speech synthesis method according to claim 1, wherein the speech synthesis model obtains the speech corresponding to the text to be synthesized in a manner,
acquiring the linguistic characteristics of the text by adopting a second linguistic characteristic coding module in the speech synthesis model;
acquiring the tone characteristic of the speaker by adopting a tone characteristic coding module in the speech synthesis model;
splicing the linguistic feature, the timbre feature and the prosody feature by adopting a second splicing module in the voice synthesis model;
and acquiring the voice corresponding to the text by adopting a second decoding module in the voice synthesis model and combining the spliced characteristics.
4. The speech synthesis method of claim 1, wherein before inputting the text and the speaker's identification into a prosodic prediction model to obtain prosodic features of the text, further comprising:
constructing an initial joint model, wherein the initial joint model comprises: an initial prosody extraction model, an initial speech synthesis model and an initial prosody prediction model; the output of the initial prosody extraction model is connected with the input of the initial speech synthesis model; the output of the initial prosody extraction model is connected with the output of the initial prosody prediction model;
obtaining training data, wherein the training data comprises: text samples, and corresponding speech samples and speaker identification;
training the initial combined model by adopting the text sample, the corresponding voice sample and the speaker identification to obtain a trained combined model;
and acquiring the prosody prediction model and the speech synthesis model in the trained combined model.
5. The speech synthesis method of claim 4, wherein said training the initial joint model using the text samples and the corresponding speech samples and speaker identification to obtain a trained joint model comprises:
and simultaneously training the initial prosody extraction model, the initial speech synthesis model and the initial prosody prediction model in the initial joint model by adopting the text sample and the corresponding speech sample and speaker identification.
6. The speech synthesis method of claim 4, wherein said training the initial joint model using the text samples and the corresponding speech samples and speaker identification to obtain a trained joint model comprises:
training the initial prosody extraction model and the initial speech synthesis model in the initial joint model by adopting the text sample, the corresponding speech sample and the speaker identification;
and training the initial prosody prediction model in the initial combined model by adopting the text sample, the corresponding speaker identification and the trained prosody extraction model.
7. The speech synthesis method according to claim 4, wherein the prosody extraction model extracts prosodic features in a manner such that,
processing the acoustic features in the voice sample by adopting a voice acoustic feature processing module in the prosody extraction model;
and determining the prosodic features of the text sample by adopting an attention mechanism module in the prosody extraction model and combining the processed acoustic features and the linguistic features in the text sample.
8. The speech synthesis method according to claim 2, wherein the first decoding module is an autoregressive network module;
the autoregressive network module is used for predicting the prosodic feature of the current character in the text by combining the prosodic feature of the character before the current character in the text, the current character and the style feature of the speaker.
9. A speech synthesis apparatus comprising:
an acquisition module, used for acquiring a text to be synthesized and an identifier of a speaker corresponding to the text;
the first input module is used for inputting the text and the speaker identification into a prosody prediction model to obtain prosody characteristics of the text;
and the second input module is used for inputting the text, the speaker identification and the prosody feature into a speech synthesis model and acquiring the speech corresponding to the text to be synthesized.
10. The speech synthesis apparatus according to claim 9, wherein the prosodic prediction model obtains prosodic features of the text in such a manner that,
acquiring the linguistic characteristics of the text by adopting a first linguistic characteristic coding module in the prosody prediction model;
obtaining the style characteristics of the speaker by adopting a style characteristic coding module in the prosody prediction model;
splicing the linguistic features and the style features by adopting a first splicing module in the prosody prediction model;
and acquiring the prosodic features of the text by adopting a first decoding module in the prosodic prediction model and combining the spliced features.
11. The speech synthesis apparatus according to claim 9, wherein the speech synthesis model obtains the speech corresponding to the text to be synthesized in such a manner that,
acquiring the linguistic characteristics of the text by adopting a second linguistic characteristic coding module in the speech synthesis model;
acquiring the tone characteristic of the speaker by adopting a tone characteristic coding module in the speech synthesis model;
splicing the linguistic feature, the timbre feature and the prosody feature by adopting a second splicing module in the voice synthesis model;
and acquiring the voice corresponding to the text by adopting a second decoding module in the voice synthesis model and combining the spliced characteristics.
12. The speech synthesis apparatus of claim 9, further comprising: a construction module and a training module;
the building module is configured to build an initial joint model, where the initial joint model includes: an initial prosody extraction model, an initial speech synthesis model and an initial prosody prediction model; the output of the initial prosody extraction model is connected with the input of the initial speech synthesis model; the output of the initial prosody extraction model is connected with the output of the initial prosody prediction model;
the obtaining module is further configured to obtain training data, where the training data includes: text samples, and corresponding speech samples and speaker identification;
the training module is used for training the initial combined model by adopting the text sample, the corresponding voice sample and the speaker identification to obtain a trained combined model;
the obtaining module is further configured to obtain the prosody prediction model and the speech synthesis model in the trained joint model.
13. The speech synthesis apparatus of claim 12, wherein the training module is specifically configured to simultaneously train the initial prosody extraction model, the initial speech synthesis model, and the initial prosody prediction model in the initial joint model using the text samples and corresponding speech samples and speaker identifications.
14. The speech synthesis apparatus of claim 12, wherein the training module is specifically configured to,
training the initial prosody extraction model and the initial speech synthesis model in the initial joint model by adopting the text sample, the corresponding speech sample and the speaker identification;
and training the initial prosody prediction model in the initial combined model by adopting the text sample, the corresponding speaker identification and the trained prosody extraction model.
15. The speech synthesis apparatus according to claim 12, wherein the prosody extraction model extracts prosodic features in such a manner that,
processing the acoustic features in the voice sample by adopting a voice acoustic feature processing module in the prosody extraction model;
and determining the prosodic features of the text sample by adopting an attention mechanism module in the prosody extraction model and combining the processed acoustic features and the linguistic features in the text sample.
16. The speech synthesis apparatus of claim 10, wherein the first decoding module is an autoregressive network module;
the autoregressive network module is used for predicting the prosodic feature of the current character in the text by combining the prosodic feature of the character before the current character in the text, the current character and the style feature of the speaker.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-8.
CN202011224114.6A 2020-11-05 2020-11-05 Speech synthesis method, device, electronic equipment and storage medium Active CN112365880B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011224114.6A CN112365880B (en) 2020-11-05 2020-11-05 Speech synthesis method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011224114.6A CN112365880B (en) 2020-11-05 2020-11-05 Speech synthesis method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112365880A true CN112365880A (en) 2021-02-12
CN112365880B CN112365880B (en) 2024-03-26

Family

ID=74509479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011224114.6A Active CN112365880B (en) 2020-11-05 2020-11-05 Speech synthesis method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112365880B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113012681A (en) * 2021-02-18 2021-06-22 深圳前海微众银行股份有限公司 Awakening voice synthesis method based on awakening voice model and application awakening method
CN113035169A (en) * 2021-03-12 2021-06-25 北京帝派智能科技有限公司 Voice synthesis method and system capable of training personalized tone library on line
CN113129862A (en) * 2021-04-22 2021-07-16 合肥工业大学 World-tacontron-based voice synthesis method and system and server
CN113205793A (en) * 2021-04-30 2021-08-03 北京有竹居网络技术有限公司 Audio generation method and device, storage medium and electronic equipment
CN113808571A (en) * 2021-08-17 2021-12-17 北京百度网讯科技有限公司 Speech synthesis method, speech synthesis device, electronic device and storage medium
CN113838452A (en) * 2021-08-17 2021-12-24 北京百度网讯科技有限公司 Speech synthesis method, apparatus, device and computer storage medium
CN114005428A (en) * 2021-12-31 2022-02-01 科大讯飞股份有限公司 Speech synthesis method, apparatus, electronic device, storage medium, and program product
CN114038484A (en) * 2021-12-16 2022-02-11 游密科技(深圳)有限公司 Voice data processing method and device, computer equipment and storage medium
CN114373445A (en) * 2021-12-23 2022-04-19 北京百度网讯科技有限公司 Voice generation method and device, electronic equipment and storage medium
KR20230026241A (en) * 2021-08-17 2023-02-24 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Voice processing method and device, equipment and computer storage medium
WO2023160553A1 (en) * 2022-02-25 2023-08-31 北京有竹居网络技术有限公司 Speech synthesis method and apparatus, and computer-readable medium and electronic device


Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5905972A (en) * 1996-09-30 1999-05-18 Microsoft Corporation Prosodic databases holding fundamental frequency templates for use in speech synthesis
US20020069061A1 (en) * 1998-10-28 2002-06-06 Ann K. Syrdal Method and system for recorded word concatenation
JP2007127994A (en) * 2005-11-07 2007-05-24 Canon Inc Voice synthesizing method, voice synthesizer, and program
CN104916284A (en) * 2015-06-10 2015-09-16 百度在线网络技术(北京)有限公司 Prosody and acoustics joint modeling method and device for voice synthesis system
CN107464554A (en) * 2017-09-28 2017-12-12 百度在线网络技术(北京)有限公司 Speech synthesis model generation method and device
CN107992485A (en) * 2017-11-27 2018-05-04 北京搜狗科技发展有限公司 Simultaneous interpretation method and device
CN107705783A (en) * 2017-11-27 2018-02-16 北京搜狗科技发展有限公司 Speech synthesis method and device
WO2019139431A1 (en) * 2018-01-11 2019-07-18 네오사피엔스 주식회사 Speech translation method and system using multilingual text-to-speech synthesis model
CN111587455A (en) * 2018-01-11 2020-08-25 新智株式会社 Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium
CN110599998A (en) * 2018-05-25 2019-12-20 阿里巴巴集团控股有限公司 Voice data generation method and device
CN109036375A (en) * 2018-07-25 2018-12-18 腾讯科技(深圳)有限公司 Speech synthesis method, model training method, device and computer equipment
CN110264992A (en) * 2019-06-11 2019-09-20 百度在线网络技术(北京)有限公司 Speech synthesis processing method, device, equipment and storage medium
CN110782871A (en) * 2019-10-30 2020-02-11 百度在线网络技术(北京)有限公司 Prosodic pause prediction method and device, and electronic equipment
CN111508469A (en) * 2020-04-26 2020-08-07 北京声智科技有限公司 Text-to-speech conversion method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU Hang; LING Zhenhua; GUO Wu; DAI Lirong: "Improved Adaptation Method for Cross-Lingual Speech Synthesis Models", Pattern Recognition and Artificial Intelligence, no. 04 *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113012681A (en) * 2021-02-18 2021-06-22 深圳前海微众银行股份有限公司 Wake-up speech synthesis method based on a wake-up speech model, and application wake-up method
CN113035169A (en) * 2021-03-12 2021-06-25 北京帝派智能科技有限公司 Speech synthesis method and system capable of training a personalized timbre library online
CN113129862A (en) * 2021-04-22 2021-07-16 合肥工业大学 World-tacotron-based speech synthesis method, system and server
CN113129862B (en) * 2021-04-22 2024-03-12 合肥工业大学 Voice synthesis method, system and server based on world-tacotron
CN113205793B (en) * 2021-04-30 2022-05-31 北京有竹居网络技术有限公司 Audio generation method and device, storage medium and electronic equipment
CN113205793A (en) * 2021-04-30 2021-08-03 北京有竹居网络技术有限公司 Audio generation method and device, storage medium and electronic equipment
CN113838452A (en) * 2021-08-17 2021-12-24 北京百度网讯科技有限公司 Speech synthesis method, apparatus, device and computer storage medium
KR102611003B1 (en) * 2021-08-17 2023-12-06 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Voice processing method and device, equipment and computer storage medium
CN113808571A (en) * 2021-08-17 2021-12-17 北京百度网讯科技有限公司 Speech synthesis method, speech synthesis device, electronic device and storage medium
CN113808571B (en) * 2021-08-17 2022-05-27 北京百度网讯科技有限公司 Speech synthesis method, speech synthesis device, electronic device and storage medium
KR102619408B1 (en) * 2021-08-17 2023-12-29 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Voice synthesizing method, device, electronic equipment and storage medium
KR20220083987A (en) * 2021-08-17 2022-06-21 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Voice synthesizing method, device, electronic equipment and storage medium
CN113838452B (en) * 2021-08-17 2022-08-23 北京百度网讯科技有限公司 Speech synthesis method, apparatus, device and computer storage medium
KR20230026242A (en) * 2021-08-17 2023-02-24 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Voice synthesis method and device, equipment and computer storage medium
KR20230026241A (en) * 2021-08-17 2023-02-24 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Voice processing method and device, equipment and computer storage medium
KR102611024B1 (en) * 2021-08-17 2023-12-06 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Voice synthesis method and device, equipment and computer storage medium
CN114038484A (en) * 2021-12-16 2022-02-11 游密科技(深圳)有限公司 Voice data processing method and device, computer equipment and storage medium
CN114038484B (en) * 2021-12-16 2024-01-30 游密科技(深圳)有限公司 Voice data processing method, device, computer equipment and storage medium
CN114373445A (en) * 2021-12-23 2022-04-19 北京百度网讯科技有限公司 Voice generation method and device, electronic equipment and storage medium
CN114005428A (en) * 2021-12-31 2022-02-01 科大讯飞股份有限公司 Speech synthesis method, apparatus, electronic device, storage medium, and program product
WO2023160553A1 (en) * 2022-02-25 2023-08-31 北京有竹居网络技术有限公司 Speech synthesis method and apparatus, and computer-readable medium and electronic device

Also Published As

Publication number Publication date
CN112365880B (en) 2024-03-26

Similar Documents

Publication Publication Date Title
CN112365880B (en) Speech synthesis method, device, electronic equipment and storage medium
US11769480B2 (en) Method and apparatus for training model, method and apparatus for synthesizing speech, device and storage medium
CN110473516B (en) Voice synthesis method and device and electronic equipment
US11769482B2 (en) Method and apparatus of synthesizing speech, method and apparatus of training speech synthesis model, electronic device, and storage medium
KR20210106397A (en) Voice conversion method, electronic device, and storage medium
CN111754978B (en) Prosodic hierarchy labeling method, device, equipment and storage medium
CN110619867B (en) Training method and device of speech synthesis model, electronic equipment and storage medium
CN112270920A (en) Voice synthesis method and device, electronic equipment and readable storage medium
CN110797005B (en) Prosody prediction method, apparatus, device, and medium
CN112509552B (en) Speech synthesis method, device, electronic equipment and storage medium
CN112466275A (en) Voice conversion and corresponding model training method, device, equipment and storage medium
CN112365877A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN112365882A (en) Speech synthesis method, model training method, device, equipment and storage medium
CN110782871B (en) Prosodic pause prediction method and device, and electronic equipment
US20220076657A1 (en) Method of registering attribute in speech synthesis model, apparatus of registering attribute in speech synthesis model, electronic device, and medium
CN112365879A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN110580904A (en) Method and device for controlling a mini program through voice, electronic equipment and storage medium
US20230178067A1 (en) Method of training speech synthesis model and method of synthesizing speech
US20220068265A1 (en) Method for displaying streaming speech recognition result, electronic device, and storage medium
CN111127191A (en) Risk assessment method and device
CN110767212B (en) Voice processing method and device and electronic equipment
CN112365875A (en) Voice synthesis method, device, vocoder and electronic equipment
CN112650844A (en) Tracking method and device of conversation state, electronic equipment and storage medium
CN112289305A (en) Prosody prediction method, device, equipment and storage medium
JP7204861B2 (en) Recognition method, device, electronic device and storage medium for mixed Chinese and English speech

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant