CN114187890A - Voice synthesis method and device, computer readable storage medium and terminal equipment - Google Patents

Voice synthesis method and device, computer readable storage medium and terminal equipment

Info

Publication number
CN114187890A
CN114187890A (application CN202111676421.2A)
Authority
CN
China
Prior art keywords
training
vocoder
acoustic model
acoustic
joint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111676421.2A
Other languages
Chinese (zh)
Inventor
黄东延
丁万
梁景俊
杨志勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ubtech Robotics Corp
Original Assignee
Ubtech Robotics Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ubtech Robotics Corp
Priority to CN202111676421.2A
Publication of CN114187890A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction using predictive techniques
    • G10L 19/16 Vocoder architecture

Landscapes

  • Engineering & Computer Science
  • Computational Linguistics
  • Health & Medical Sciences
  • Audiology, Speech & Language Pathology
  • Human Computer Interaction
  • Physics & Mathematics
  • Acoustics & Sound
  • Multimedia
  • Signal Processing
  • Artificial Intelligence
  • Telephonic Communication Services

Abstract

The present application belongs to the field of speech processing technologies, and in particular relates to a speech synthesis method and apparatus, a computer-readable storage medium, and a terminal device. The method comprises the following steps: acquiring a target text to be subjected to speech synthesis, and performing text analysis on the target text to obtain the joint features of the target text; performing feature mapping on the joint features by using a preset acoustic model to obtain acoustic features corresponding to the target text; and performing speech synthesis on the acoustic features by using a preset vocoder to obtain target speech corresponding to the target text. The acoustic model and the vocoder are obtained in advance through joint training on a preset training data set. In the present application, the acoustic model and the vocoder are not trained separately but obtained through joint training, so that the final speech synthesis effect can be improved as a whole.

Description

Voice synthesis method and device, computer readable storage medium and terminal equipment
Technical Field
The present application belongs to the field of speech processing technologies, and in particular, to a speech synthesis method, apparatus, computer-readable storage medium, and terminal device.
Background
Text-To-Speech (TTS) technology converts arbitrary text into speech. With the rapid development of the mobile internet and artificial intelligence, speech synthesis is widely applied in scenarios such as voice assistants, intelligent robots, text reading, and news broadcasting. A typical speech synthesis system may include several subsystems, such as an acoustic model and a vocoder; in the prior art these are generally trained and optimized separately, which results in a poor overall speech synthesis effect.
Disclosure of Invention
In view of this, embodiments of the present application provide a speech synthesis method and apparatus, a computer-readable storage medium, and a terminal device, so as to solve the problem that the overall speech synthesis effect of existing speech synthesis methods is poor.
A first aspect of an embodiment of the present application provides a speech synthesis method, which may include:
acquiring a target text to be subjected to speech synthesis, and performing text analysis on the target text to obtain the joint features of the target text;
performing feature mapping on the joint features by using a preset acoustic model to obtain acoustic features corresponding to the target text;
performing speech synthesis on the acoustic features by using a preset vocoder to obtain target speech corresponding to the target text; the acoustic model and the vocoder are obtained in advance through joint training on a preset training data set.
In a specific implementation manner of the first aspect, the joint training process of the acoustic model and the vocoder may include:
performing initial training on an original acoustic model by using the training data set to obtain an initially trained acoustic model;
performing initial training on an original vocoder by using the training data set to obtain a vocoder after the initial training;
and performing joint training on the acoustic model after the initial training and the vocoder after the initial training by using the training data set to obtain the acoustic model after the joint training and the vocoder after the joint training.
In a specific implementation manner of the first aspect, the jointly training the initially trained acoustic model and the initially trained vocoder by using the training data set to obtain a jointly trained acoustic model and a jointly trained vocoder may include:
using the training data set to perform coarse tuning on the acoustic model after initial training and the vocoder after initial training to obtain the acoustic model after coarse tuning and the vocoder after coarse tuning;
and fixing parameters of the acoustic model, and fine-tuning the acoustic model after the coarse tuning and the vocoder after the coarse tuning by using the training data set to obtain the acoustic model after the joint training and the vocoder after the joint training.
In a specific implementation manner of the first aspect, the performing speech synthesis on the acoustic features by using a preset vocoder to obtain the target speech corresponding to the target text may include:
dividing the acoustic features according to syllables to obtain the acoustic features of each syllable;
performing parallel processing on the acoustic features of each syllable by using two or more vocoders to obtain speech segments;
and generating the target speech according to the speech segments.
In a specific implementation manner of the first aspect, the generating the target speech according to the speech segments may include:
splicing the speech segments in sequence to obtain a spliced speech;
and smoothing the spliced speech to obtain the target speech.
In a specific implementation manner of the first aspect, the speech synthesis method may further include:
in the training process, acquiring data respectively from two or more processing processes through a preset text iterator;
generating training items corresponding to the acquired data, and adding the generated training items to a preset queue;
and when the training items in the queue meet a preset condition, generating a training batch corresponding to the training items in the queue.
A second aspect of an embodiment of the present application provides a speech synthesis apparatus, which may include:
the text analysis module is used for acquiring a target text to be subjected to speech synthesis and performing text analysis on the target text to obtain the joint features of the target text;
the acoustic model processing module is used for performing feature mapping on the joint features by using a preset acoustic model to obtain acoustic features corresponding to the target text;
a vocoder processing module, configured to perform speech synthesis on the acoustic features by using a preset vocoder, so as to obtain a target speech corresponding to the target text; the acoustic model and the vocoder are obtained in advance through joint training on a preset training data set.
In a specific implementation manner of the second aspect, the speech synthesis apparatus may further include:
the acoustic model initial training module is used for carrying out initial training on an original acoustic model by using the training data set to obtain an initially trained acoustic model;
the initial training module of the vocoder is used for carrying out initial training on an original vocoder by using the training data set to obtain the vocoder after the initial training;
and the joint training module is used for performing joint training on the acoustic model after the initial training and the vocoder after the initial training by using the training data set to obtain the acoustic model after the joint training and the vocoder after the joint training.
In a specific implementation manner of the second aspect, the joint training module may include:
a first joint training unit, configured to perform coarse tuning on the initially trained acoustic model and the initially trained vocoder by using the training data set, to obtain a coarsely tuned acoustic model and a coarsely tuned vocoder;
and the second joint training unit is used for fixing the parameters of the acoustic model, and finely tuning the acoustic model after the coarse tuning and the vocoder after the coarse tuning by using the training data set to obtain the acoustic model after the joint training and the vocoder after the joint training.
In a specific implementation manner of the second aspect, the vocoder processing module may include:
the syllable dividing unit is used for dividing the acoustic features according to syllables to obtain the acoustic features of each syllable;
the parallel processing unit is used for performing parallel processing on the acoustic features of each syllable by using two or more vocoders to obtain speech segments;
and the target speech generating unit is used for generating the target speech according to the speech segments.
In a specific implementation manner of the second aspect, the target speech generating unit may include:
the speech splicing unit is used for sequentially splicing the speech segments to obtain a spliced speech;
and the speech smoothing unit is used for smoothing the spliced speech to obtain the target speech.
In a specific implementation manner of the second aspect, the speech synthesis apparatus may further include:
the text iteration module is used for acquiring data respectively from two or more processing processes through a preset text iterator in the training process;
the training entry generating module is used for generating training entries corresponding to the acquired data and adding the generated training entries into a preset queue;
and the training batch generation module is used for generating a training batch corresponding to the training entries in the queue when the training entries in the queue meet a preset condition.
A third aspect of embodiments of the present application provides a computer-readable storage medium storing a computer program, which when executed by a processor implements the steps of any one of the above-mentioned speech synthesis methods.
A fourth aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of any one of the above-mentioned speech synthesis methods when executing the computer program.
A fifth aspect of embodiments of the present application provides a computer program product, which, when run on a terminal device, causes the terminal device to perform the steps of any of the speech synthesis methods described above.
Compared with the prior art, the embodiments of the present application have the following advantages: a target text to be subjected to speech synthesis is acquired, and text analysis is performed on the target text to obtain the joint features of the target text; feature mapping is performed on the joint features by using a preset acoustic model to obtain acoustic features corresponding to the target text; and speech synthesis is performed on the acoustic features by using a preset vocoder to obtain target speech corresponding to the target text. The acoustic model and the vocoder are obtained in advance through joint training on a preset training data set. In the embodiments of the present application, the acoustic model and the vocoder are not trained separately but obtained through joint training, so that the final speech synthesis effect can be improved as a whole.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of an embodiment of a speech synthesis method in an embodiment of the present application;
FIG. 2 is a schematic flow diagram of a joint training process for acoustic models and vocoders;
FIG. 3 is a schematic diagram of parallel processing of acoustic features of individual syllables using multiple vocoders;
FIG. 4 is a schematic diagram of the parallel processing of the two lines of single data items and training batches;
FIG. 5 is a block diagram of an embodiment of a speech synthesis apparatus according to an embodiment of the present application;
fig. 6 is a schematic block diagram of a terminal device in an embodiment of the present application.
Detailed Description
In order to make the objects, features and advantages of the present invention more apparent and understandable, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the embodiments described below are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
In addition, in the description of the present application, the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
Referring to fig. 1, an embodiment of a speech synthesis method in an embodiment of the present application may include:
step S101, a target text to be subjected to voice synthesis is obtained, and text analysis is carried out on the target text to obtain the joint characteristics of the target text.
In a specific implementation of the embodiments of the present application, the joint features may include word segmentation features, polyphonic features, prosodic features, and duration features. A word segmentation feature is a phrase-level feature obtained by classifying the words that make up the target text, for example noun, verb, preposition, or adjective. A polyphonic feature marks a character or word in the target text that has multiple pronunciations, where the correct pronunciation depends on the usage context and serves to distinguish part of speech and word meaning. Prosodic features describe the prosodic structure of the language and are closely related to other linguistic structures such as syntax, morphological structure, and information structure; they are typical of natural language and common across different languages, for example pitch declination, stress, and pauses. The duration feature is the time length corresponding to the phoneme text features included in the target text and to the text features of the target text.
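For illustration only, the joint features described above could be organized per text unit roughly as in the following Python sketch; the field names and types are assumptions and do not come from the patent:

```python
# A minimal sketch (all names are illustrative) of how the four kinds of joint
# features could be bundled per text unit by the front-end text analysis.
from dataclasses import dataclass
from typing import List

@dataclass
class JointFeature:
    unit: str            # text unit, e.g. a character or syllable
    pos_tag: str         # word segmentation / part-of-speech feature, e.g. "noun"
    pronunciation: str   # resolved pronunciation for polyphonic characters
    prosody_break: int   # prosodic boundary level, e.g. 0 = none, 3 = phrase pause
    duration_ms: float   # annotated or predicted duration of the unit

def analyze_text(target_text: str) -> List[JointFeature]:
    """Hypothetical front-end text analysis returning joint features."""
    # A real implementation would run word segmentation, polyphone
    # disambiguation, prosody prediction and duration modelling here.
    raise NotImplementedError
```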
In the embodiment of the application, joint features reflecting both the text and the sound are used as the input of the acoustic model, so that a more accurate speech generation result can be obtained than with the prior-art approach of using a single phoneme sequence as the input of the acoustic model.
Before text analysis is carried out on the target text, the target text may be preprocessed, so as to avoid deviations in the output text features caused by minor factors such as formatting problems.
In one particular implementation of the embodiments of the present application, the preprocessing may include regularization. Regularization normalizes the target text and converts its language units into a preset form, for example unifying the case of English letters; punctuation marks may also be removed as required, so as to avoid deviations in the output text features caused by problems such as text format. In another specific implementation of the embodiment of the present application, the preprocessing may further include converting numbers, symbols, and similar text in the target text into Chinese, so as to facilitate subsequent text analysis and reduce errors in feature extraction.
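The snippet below is a minimal sketch of such preprocessing, assuming digit-by-digit number reading and simple punctuation stripping; a production front end would also handle full number readings, symbols, and dates:

```python
import re

# Simplified preprocessing sketch: case normalization, digit-to-Chinese
# conversion (digits read one by one), and optional punctuation removal.
_DIGIT_TO_CN = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
                "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}

def normalize_text(text: str, strip_punct: bool = True) -> str:
    text = text.strip().lower()                                   # unify English letter case
    text = "".join(_DIGIT_TO_CN.get(ch, ch) for ch in text)       # digits -> Chinese
    if strip_punct:
        text = re.sub(r"[^\w\u4e00-\u9fff]+", " ", text)          # drop punctuation marks
    return re.sub(r"\s+", " ", text).strip()
```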
Step S102, performing feature mapping on the joint features by using a preset acoustic model to obtain the acoustic features corresponding to the target text.
In the embodiment of the present application, any acoustic model in the prior art may be selected according to the actual situation. Preferably, FastSpeech is used as the acoustic model, but the word embedding weight matrix (Embedding Lookup Table) in front of the encoder is removed: the front-end joint features are used directly in place of word embeddings and fed into the multi-head attention encoder for self-attention computation, so as to capture the context dependencies of the sequence and model the text sequence. Because the input features contain both linguistic information and prosodic information, each text unit obtains a high-quality, high-dimensional representation in the hidden vectors output by the encoder, which helps the subsequent duration prediction output more accurate pronunciation durations and makes the rhythm of the synthesized speech better match the characteristics of real pronunciation. The multi-head attention module further mines the contextual dependencies of the front-end text-prosody joint features, helps the duration model output a reasonable prosodic rhythm, and avoids anomalies such as inserted or dropped characters. After the joint features are processed by the acoustic model, acoustic features such as the Mel spectrum (Mel-Spectrum) corresponding to the target text can be obtained.
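As a rough illustration of the described structure (joint features projected directly into the encoder in place of a word embedding lookup, followed by multi-head self-attention, duration prediction, and length regulation), a compressed PyTorch sketch might look as follows; all dimensions and layer counts are assumed for the example and are not taken from the patent:

```python
import torch
import torch.nn as nn

class JointFeatureAcousticModel(nn.Module):
    """FastSpeech-style sketch: joint features replace the embedding lookup."""
    def __init__(self, feat_dim=64, d_model=256, n_heads=4, n_layers=4, n_mels=80):
        super().__init__()
        self.in_proj = nn.Linear(feat_dim, d_model)   # replaces the embedding lookup table
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.duration_predictor = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(),
                                                nn.Linear(128, 1))
        self.mel_proj = nn.Linear(d_model, n_mels)

    def forward(self, joint_feats):                   # (batch, text_len, feat_dim)
        hidden = self.encoder(self.in_proj(joint_feats))      # multi-head self-attention
        log_durations = self.duration_predictor(hidden).squeeze(-1)
        frames = torch.clamp(torch.exp(log_durations).round().long(), min=1)
        # Length regulation: repeat each text unit's hidden state by its duration.
        expanded = [h.repeat_interleave(f, dim=0) for h, f in zip(hidden, frames)]
        mels = [self.mel_proj(e) for e in expanded]   # per-utterance mel-spectrograms
        return mels, log_durations
```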
Step S103, performing speech synthesis on the acoustic features by using a preset vocoder to obtain the target speech corresponding to the target text.
In the embodiment of the present application, any one of the vocoders in the prior art may be selected according to actual situations. Here, WaveRNN is preferably used as a vocoder to restore acoustic features to a speech waveform, thereby obtaining a target speech.
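The patent names WaveRNN; the following is only a heavily simplified, WaveRNN-inspired sketch of the autoregressive sample-by-sample generation idea, not the actual WaveRNN architecture (which uses a coarse/fine dual softmax and sparse GRUs); all sizes are assumptions:

```python
import torch
import torch.nn as nn

class TinyRNNVocoder(nn.Module):
    """Toy autoregressive vocoder: predicts 8-bit sample classes from mel frames."""
    def __init__(self, n_mels=80, hidden=256, n_classes=256, hop_length=256):
        super().__init__()
        self.hop_length = hop_length
        self.rnn = nn.GRU(n_mels + 1, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    @torch.no_grad()
    def generate(self, mel):                                   # mel: (frames, n_mels)
        cond = mel.repeat_interleave(self.hop_length, dim=0)   # frame rate -> sample rate
        sample, state, out = torch.zeros(1), None, []
        for t in range(cond.size(0)):
            step = torch.cat([cond[t], sample]).view(1, 1, -1) # condition + previous sample
            h, state = self.rnn(step, state)
            probs = torch.softmax(self.out(h[0, 0]), dim=-1)
            cls = torch.multinomial(probs, 1)
            sample = (cls.float() / 127.5) - 1.0               # back to [-1, 1] (mu-law omitted)
            out.append(sample)
        return torch.cat(out)                                  # raw waveform samples
```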
Unlike the prior art, the acoustic model and the vocoder in the present application are not separately trained, but are obtained by joint training, which may include the process shown in fig. 2:
step S201, performing initial training on the original acoustic model by using a training data set to obtain the acoustic model after the initial training.
The initial training process of the acoustic model is similar to the individual training of the acoustic model in the prior art, and reference may be made to any acoustic model training method in the prior art, which is not described herein again.
Step S202, the original vocoder is initially trained by using a training data set to obtain the initially trained vocoder.
The initial training process of the vocoder is similar to the prior art of training the vocoder alone, and reference may be made to any one of the prior art methods for training the vocoder, which are not described herein again.
Step S203, performing joint training on the acoustic model after the initial training and the vocoder after the initial training by using the training data set to obtain the acoustic model after the joint training and the vocoder after the joint training.
In the embodiment of the present application, the joint training can be divided into a coarse tuning stage and a fine-tuning stage.
In the coarse tuning stage, the acoustic model after the initial training and the vocoder after the initial training can be coarsely tuned by using the training data set, that is, the acoustic model and the vocoder are trained together as a whole, so as to obtain the acoustic model after the coarse tuning and the vocoder after the coarse tuning.
In the fine-tuning (finetune) stage, the parameters of the acoustic model can be fixed, and the acoustic model after the coarse tuning and the vocoder after the coarse tuning can be fine-tuned by using the training data set; that is, the parameters of the acoustic model are kept unchanged during training and only the parameters of the vocoder are adjusted, so as to obtain the acoustic model after the joint training and the vocoder after the joint training.
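A sketch of this two-stage joint training, under the assumption of simplified model interfaces and placeholder loss and data-loader objects, could look like this:

```python
import torch

def joint_train(acoustic_model, vocoder, loader, loss_fn, coarse_epochs=10, fine_epochs=5):
    # Coarse stage: acoustic model + vocoder optimized together as one model.
    params = list(acoustic_model.parameters()) + list(vocoder.parameters())
    opt = torch.optim.Adam(params, lr=1e-4)
    for _ in range(coarse_epochs):
        for joint_feats, target_wave in loader:
            wave = vocoder(acoustic_model(joint_feats))
            loss = loss_fn(wave, target_wave)
            opt.zero_grad(); loss.backward(); opt.step()

    # Fine-tuning stage: fix the acoustic model's parameters, adjust only the vocoder.
    for p in acoustic_model.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(vocoder.parameters(), lr=1e-5)
    for _ in range(fine_epochs):
        for joint_feats, target_wave in loader:
            with torch.no_grad():
                mel = acoustic_model(joint_feats)          # frozen acoustic model
            loss = loss_fn(vocoder(mel), target_wave)
            opt.zero_grad(); loss.backward(); opt.step()
    return acoustic_model, vocoder
```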
In summary, in the embodiments of the present application, a target text to be subjected to speech synthesis is acquired, and text analysis is performed on the target text to obtain the joint features of the target text; feature mapping is performed on the joint features by using a preset acoustic model to obtain acoustic features corresponding to the target text; and speech synthesis is performed on the acoustic features by using a preset vocoder to obtain target speech corresponding to the target text. The acoustic model and the vocoder are obtained in advance through joint training on a preset training data set. In the embodiments of the present application, the acoustic model and the vocoder are not trained separately but obtained through joint training, so that the final speech synthesis effect can be improved as a whole.
In a specific implementation of the embodiment of the present application, in step S103, two or more vocoders may be used simultaneously for speech synthesis, so as to improve the efficiency of speech synthesis.
As shown in fig. 3, the acoustic features may be divided according to syllables to obtain the acoustic features of each syllable, then the acoustic features of each syllable are processed in parallel by two or more vocoders to obtain speech segments, and finally the target speech is generated from the speech segments.
In the process of generating the target speech from the speech segments, the speech segments may be spliced in sequence to obtain a spliced speech, and the spliced speech may then be smoothed, that is, the discontinuities between syllables are smoothed with a sliding window, so as to obtain a smoother target speech.
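A possible sketch of this syllable-parallel synthesis and splice smoothing is shown below; `run_vocoder`, the syllable-boundary format, and the cross-fade length are illustrative assumptions, since the patent does not specify how the sliding-window smoothing is implemented:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def synthesize_parallel(acoustic_feats, syllable_bounds, run_vocoder,
                        n_workers=4, fade_samples=128):
    # Split the acoustic features at syllable boundaries and vocode the chunks in parallel.
    chunks = [acoustic_feats[s:e] for s, e in syllable_bounds]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        segments = list(pool.map(run_vocoder, chunks))         # one speech segment per syllable

    # Splice in order, cross-fading at each joint to smooth the discontinuities.
    out = np.array(segments[0], dtype=float)
    fade_in = np.linspace(0.0, 1.0, fade_samples)
    for seg in segments[1:]:
        seg = np.asarray(seg, dtype=float)
        overlap = min(fade_samples, len(out), len(seg))
        ramp = fade_in[:overlap]
        out[-overlap:] = out[-overlap:] * (1.0 - ramp) + seg[:overlap] * ramp
        out = np.concatenate([out, seg[overlap:]])
    return out                                                 # smoothed target waveform
```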
Since the overall model structure requires multiple deep neural network models, in order to speed up development, let the models share as much code as possible, and select the optimal model, in a specific implementation of the embodiment of the present application, the two lines of single-data processing and training-batch (Batch) processing may be carried out in parallel.
Specifically, as shown in fig. 4, in the training process, data may be acquired respectively from two or more processing processes by a preset text iterator. On the single-data line, after operations such as word segmentation and feature extraction are performed on each piece of data, a training Item corresponding to the acquired data is generated and added to a preset queue. On the training-batch line, when the training items in the queue meet the preset condition (i.e., the queue is full), the training items in the queue are padded (Padding), thereby generating a training batch corresponding to the training items in the queue. Because multiple processes are used, the two lines proceed synchronously, which greatly speeds up preprocessing and training and facilitates iterative development of the model.
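The following sketch illustrates the two parallel lines with worker processes, a shared queue, and padding into batches; the item layout and `extract_features` are assumptions made for the example:

```python
import numpy as np
from multiprocessing import Process, Queue

def item_worker(text_iter, item_queue, extract_features):
    # Single-data line: each worker turns one raw text at a time into a training item.
    for raw_text in text_iter:
        item_queue.put(extract_features(raw_text))

def batch_line(item_queue, batch_size=16):
    # Training-batch line: wait until a full batch of items is available, then pad.
    while True:
        items = [item_queue.get() for _ in range(batch_size)]
        max_len = max(len(it) for it in items)
        batch = np.stack([np.pad(it, (0, max_len - len(it))) for it in items])
        yield batch                                            # padded training batch

def start_pipeline(text_shards, extract_features, n_workers=2):
    q = Queue(maxsize=64)
    for shard in text_shards[:n_workers]:                      # two or more worker processes
        Process(target=item_worker, args=(shard, q, extract_features), daemon=True).start()
    return batch_line(q)
```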
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 5 is a block diagram of an embodiment of a speech synthesis apparatus according to an embodiment of the present application, which corresponds to a speech synthesis method according to the foregoing embodiment.
In this embodiment, a speech synthesis apparatus may include:
the text analysis module 501 is configured to obtain a target text to be subjected to speech synthesis, and perform text analysis on the target text to obtain the joint features of the target text;
an acoustic model processing module 502, configured to perform feature mapping on the joint features by using a preset acoustic model to obtain acoustic features corresponding to the target text;
a vocoder processing module 503, configured to perform speech synthesis on the acoustic features by using a preset vocoder, so as to obtain a target speech corresponding to the target text; the acoustic model and the vocoder are obtained in advance through joint training on a preset training data set.
In a specific implementation manner of the embodiment of the present application, the speech synthesis apparatus may further include:
the acoustic model initial training module is used for carrying out initial training on an original acoustic model by using the training data set to obtain an initially trained acoustic model;
the initial training module of the vocoder is used for carrying out initial training on an original vocoder by using the training data set to obtain the vocoder after the initial training;
and the joint training module is used for performing joint training on the acoustic model after the initial training and the vocoder after the initial training by using the training data set to obtain the acoustic model after the joint training and the vocoder after the joint training.
In a specific implementation manner of the embodiment of the present application, the joint training module may include:
a first joint training unit, configured to perform coarse tuning on the initially trained acoustic model and the initially trained vocoder by using the training data set, to obtain a coarsely tuned acoustic model and a coarsely tuned vocoder;
and the second joint training unit is used for fixing the parameters of the acoustic model, and finely tuning the acoustic model after the coarse tuning and the vocoder after the coarse tuning by using the training data set to obtain the acoustic model after the joint training and the vocoder after the joint training.
In a specific implementation manner of the embodiment of the present application, the vocoder processing module may include:
the syllable dividing unit is used for dividing the acoustic features according to syllables to obtain the acoustic features of each syllable;
the parallel processing unit is used for performing parallel processing on the acoustic features of each syllable by using two or more vocoders to obtain speech segments;
and the target speech generating unit is used for generating the target speech according to the speech segments.
In a specific implementation manner of the embodiment of the present application, the target speech generating unit may include:
the speech splicing unit is used for sequentially splicing the speech segments to obtain a spliced speech;
and the speech smoothing unit is used for smoothing the spliced speech to obtain the target speech.
In a specific implementation manner of the embodiment of the present application, the speech synthesis apparatus may further include:
the text iteration module is used for acquiring data respectively from two or more processing processes through a preset text iterator in the training process;
the training entry generating module is used for generating training entries corresponding to the acquired data and adding the generated training entries into a preset queue;
and the training batch generation module is used for generating a training batch corresponding to the training entries in the queue when the training entries in the queue meet a preset condition.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, modules and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Fig. 6 shows a schematic block diagram of a terminal device provided in an embodiment of the present application, and only shows a part related to the embodiment of the present application for convenience of description.
As shown in fig. 6, the terminal device 6 of this embodiment includes: a processor 60, a memory 61 and a computer program 62 stored in said memory 61 and executable on said processor 60. The processor 60, when executing the computer program 62, implements the steps in the above-described respective speech synthesis method embodiments, such as the steps S101 to S103 shown in fig. 1. Alternatively, the processor 60, when executing the computer program 62, implements the functions of each module/unit in the above-mentioned device embodiments, for example, the functions of the modules 501 to 503 shown in fig. 5.
Illustratively, the computer program 62 may be partitioned into one or more modules/units that are stored in the memory 61 and executed by the processor 60 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 62 in the terminal device 6.
The terminal device 6 may be a mobile phone, a tablet computer, a desktop computer, a notebook computer, a palmtop computer, a robot, or another computing device. It will be understood by those skilled in the art that fig. 6 is only an example of the terminal device 6 and does not constitute a limitation on it; the terminal device 6 may include more or fewer components than those shown, combine some components, or use different components, and may further include, for example, an input-output device, a network access device, a bus, and the like.
The Processor 60 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or a memory of the terminal device 6. The memory 61 may also be an external storage device of the terminal device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 6. Further, the memory 61 may also include both an internal storage unit and an external storage device of the terminal device 6. The memory 61 is used for storing the computer programs and other programs and data required by the terminal device 6. The memory 61 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow in the method of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and can realize the steps of the embodiments of the methods described above when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable storage medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable storage medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable storage media that does not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method of speech synthesis, comprising:
acquiring a target text to be subjected to speech synthesis, and performing text analysis on the target text to obtain the joint features of the target text;
performing feature mapping on the joint features by using a preset acoustic model to obtain acoustic features corresponding to the target text;
performing speech synthesis on the acoustic features by using a preset vocoder to obtain target speech corresponding to the target text; the acoustic model and the vocoder are obtained in advance through joint training on a preset training data set.
2. The method of claim 1, wherein the joint training process of the acoustic model and the vocoder comprises:
performing initial training on an original acoustic model by using the training data set to obtain an initially trained acoustic model;
performing initial training on an original vocoder by using the training data set to obtain a vocoder after the initial training;
and performing joint training on the acoustic model after the initial training and the vocoder after the initial training by using the training data set to obtain the acoustic model after the joint training and the vocoder after the joint training.
3. The method of claim 2, wherein the jointly training the initially trained acoustic model and the initially trained vocoder using the training data set to obtain a jointly trained acoustic model and a jointly trained vocoder comprises:
using the training data set to perform coarse tuning on the acoustic model after initial training and the vocoder after initial training to obtain the acoustic model after coarse tuning and the vocoder after coarse tuning;
and fixing parameters of the acoustic model, and fine-tuning the acoustic model after the coarse tuning and the vocoder after the coarse tuning by using the training data set to obtain the acoustic model after the joint training and the vocoder after the joint training.
4. The method of claim 1, wherein the performing speech synthesis on the acoustic features by using a preset vocoder to obtain a target speech corresponding to the target text comprises:
dividing the acoustic features according to syllables to obtain the acoustic features of each syllable;
performing parallel processing on the acoustic features of each syllable by using two or more vocoders to obtain speech segments;
and generating the target speech according to the speech segments.
5. The speech synthesis method of claim 4, wherein the generating the target speech according to the speech segments comprises:
splicing the speech segments in sequence to obtain a spliced speech;
and smoothing the spliced speech to obtain the target speech.
6. The speech synthesis method according to any one of claims 1 to 5, further comprising:
in the training process, acquiring data respectively from two or more processing processes through a preset text iterator;
generating training items corresponding to the acquired data, and adding the generated training items to a preset queue;
and when the training items in the queue meet a preset condition, generating a training batch corresponding to the training items in the queue.
7. A speech synthesis apparatus, comprising:
the text analysis module is used for acquiring a target text to be subjected to speech synthesis and performing text analysis on the target text to obtain the joint features of the target text;
the acoustic model processing module is used for performing feature mapping on the joint features by using a preset acoustic model to obtain acoustic features corresponding to the target text;
a vocoder processing module, configured to perform speech synthesis on the acoustic features by using a preset vocoder, so as to obtain a target speech corresponding to the target text; the acoustic model and the vocoder are obtained in advance through joint training on a preset training data set.
8. The speech synthesis apparatus according to claim 7, characterized in that the speech synthesis apparatus further comprises:
the acoustic model initial training module is used for carrying out initial training on an original acoustic model by using the training data set to obtain an initially trained acoustic model;
the initial training module of the vocoder is used for carrying out initial training on an original vocoder by using the training data set to obtain the vocoder after the initial training;
and the joint training module is used for performing joint training on the acoustic model after the initial training and the vocoder after the initial training by using the training data set to obtain the acoustic model after the joint training and the vocoder after the joint training.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the speech synthesis method according to one of claims 1 to 6.
10. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the speech synthesis method according to any one of claims 1 to 6 when executing the computer program.
CN202111676421.2A 2021-12-31 2021-12-31 Voice synthesis method and device, computer readable storage medium and terminal equipment Pending CN114187890A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111676421.2A CN114187890A (en) 2021-12-31 2021-12-31 Voice synthesis method and device, computer readable storage medium and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111676421.2A CN114187890A (en) 2021-12-31 2021-12-31 Voice synthesis method and device, computer readable storage medium and terminal equipment

Publications (1)

Publication Number Publication Date
CN114187890A true CN114187890A (en) 2022-03-15

Family

ID=80606607

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111676421.2A Pending CN114187890A (en) 2021-12-31 2021-12-31 Voice synthesis method and device, computer readable storage medium and terminal equipment

Country Status (1)

Country Link
CN (1) CN114187890A (en)

Similar Documents

Publication Publication Date Title
CN109523989B (en) Speech synthesis method, speech synthesis device, storage medium, and electronic apparatus
US8126717B1 (en) System and method for predicting prosodic parameters
US11450313B2 (en) Determining phonetic relationships
CN109686361B (en) Speech synthesis method, device, computing equipment and computer storage medium
WO2022156544A1 (en) Speech synthesis method and apparatus, and readable medium and electronic device
CN111798832A (en) Speech synthesis method, apparatus and computer-readable storage medium
CN110197655B (en) Method and apparatus for synthesizing speech
CN111696521B (en) Training method of voice cloning model, readable storage medium and voice cloning method
CN113707125B (en) Training method and device for multi-language speech synthesis model
US20090157408A1 (en) Speech synthesizing method and apparatus
CN112927674B (en) Voice style migration method and device, readable medium and electronic equipment
CN113327574B (en) Speech synthesis method, device, computer equipment and storage medium
CN112309367B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN116580698A (en) Speech synthesis method, device, computer equipment and medium based on artificial intelligence
CN116386594A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN114187890A (en) Voice synthesis method and device, computer readable storage medium and terminal equipment
CN114492418A (en) Text conversion method and related device
CN113450756A (en) Training method of voice synthesis model and voice synthesis method
CN112951204B (en) Speech synthesis method and device
CN117953853A (en) Word stock generation method, word stock generation device, electronic device and readable storage medium
CN115273800A (en) Artificial intelligence based Chinese and phonetic synthesis method, device, equipment and medium
CN116052636A (en) Chinese speech synthesis method, device, terminal and storage medium
CN114758644A (en) Voice synthesis method and device, computer readable storage medium and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination