CN115240635A - Speech synthesis method, apparatus, medium, and electronic device - Google Patents

Speech synthesis method, apparatus, medium, and electronic device

Info

Publication number
CN115240635A
Authority
CN
China
Prior art keywords
model
text
training
speech
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210872082.3A
Other languages
Chinese (zh)
Inventor
汤本来
李忠豪
马泽君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202210872082.3A priority Critical patent/CN115240635A/en
Publication of CN115240635A publication Critical patent/CN115240635A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure relates to a speech synthesis method, apparatus, medium, and electronic device. The method includes: acquiring text features corresponding to a text to be synthesized; and inputting the text features into a pre-trained speech synthesis model to obtain target speech synthesized from the text to be synthesized, wherein the speech synthesis model synthesizes the target speech by directly converting the text features into speech waveform points corresponding to the target speech. Because no intermediate feature needs to be predicted when synthesizing speech from text, the size of the speech synthesis model is reduced, which makes it easier to deploy in various terminals for offline use. Moreover, since there are no intermediate features, information loss in the intermediate layers of the model is reduced and the information gap between the synthesized target speech and the text features is narrowed, so the speech synthesis quality of the model is improved to a certain extent and the user's experience of offline speech synthesis is improved accordingly.

Description

Speech synthesis method, apparatus, medium, and electronic device
Technical Field
The present disclosure relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method, apparatus, medium, and electronic device.
Background
Current offline speech synthesis technology is limited by factors such as device-side performance and outdated synthesis methods, so its MOS (Mean Opinion Score, a subjective quality rating) is relatively low and the resulting speech quality is poor. With the increasing popularity of intelligent terminals such as smartphones and smartwatches, the demand for high-quality offline speech synthesis keeps growing, and how to achieve better offline speech synthesis on a device is a problem that urgently needs to be solved.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a speech synthesis method, the method comprising: acquiring text features corresponding to a text to be synthesized; and inputting the text features into a pre-trained speech synthesis model to obtain target speech synthesized from the text to be synthesized; wherein the speech synthesis model synthesizes the target speech by directly converting the text features into speech waveform points corresponding to the target speech.
In a second aspect, the present disclosure also provides a speech synthesis apparatus, the apparatus comprising: an acquisition module for acquiring text features corresponding to a text to be synthesized; and a speech synthesis module for inputting the text features into a pre-trained speech synthesis model to obtain target speech synthesized from the text to be synthesized; wherein the speech synthesis model synthesizes the target speech by directly converting the text features into speech waveform points corresponding to the target speech.
In a third aspect, the present disclosure also provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the first aspect.
In a fourth aspect, the present disclosure also provides an electronic device, including: a storage device having at least one computer program stored thereon; at least one processing device for executing the at least one computer program in the storage device to implement the steps of the method in the first aspect.
Through the above technical solution, the pre-trained speech synthesis model can directly predict, from the text features of the text to be synthesized, the corresponding speech waveform points that serve as the target speech, without predicting any intermediate feature. This reduces the size of the speech synthesis model and makes it easier to deploy in various terminals for offline use. In addition, because there are no intermediate features, information loss in the intermediate layers of the model is reduced and the information gap between the synthesized target speech and the text features is narrowed, so the speech synthesis quality of the model is improved to a certain extent and the user's experience of offline speech synthesis is improved accordingly.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and components are not necessarily drawn to scale. In the drawings:
fig. 1 is a flow chart illustrating a method of speech synthesis according to an exemplary embodiment of the present disclosure.
Fig. 2 is a flowchart illustrating a speech synthesis method according to yet another exemplary embodiment of the present disclosure.
Fig. 3 is a flowchart illustrating a speech synthesis method according to yet another exemplary embodiment of the present disclosure.
Fig. 4 is a block diagram illustrating a structure of a speech synthesis apparatus according to an exemplary embodiment of the present disclosure.
Fig. 5 is a block diagram illustrating a structure of a speech synthesis apparatus according to still another exemplary embodiment of the present disclosure.
Fig. 6 is a schematic structural diagram illustrating an electronic device suitable for use in implementing embodiments of the present disclosure, according to an exemplary embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It should be noted that the modifiers "a", "an", and "the" used in this disclosure are illustrative rather than restrictive, and those skilled in the art should understand them as meaning "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
All actions of acquiring signals, information, or data in the present disclosure are performed in compliance with the applicable data protection laws and policies of the relevant jurisdiction and with the authorization of the owner of the corresponding device.
It should be understood that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed, in an appropriate manner and in accordance with the relevant laws and regulations, of the type, scope of use, and usage scenarios of the personal information involved in the present disclosure, and the user's authorization should be obtained.
For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly inform the user that the requested operation will require acquiring and using the user's personal information. In this way, the user can autonomously choose, according to the prompt information, whether to provide personal information to the software or hardware, such as an electronic device, application program, server, or storage medium, that performs the operations of the technical solution of the present disclosure.
As an optional but non-limiting implementation, in response to receiving an active request from the user, the prompt information may be sent to the user in the form of, for example, a pop-up window, in which the prompt information may be presented as text. In addition, the pop-up window may carry a selection control by which the user chooses to "agree" or "disagree" to provide personal information to the electronic device.
It is understood that the above notification and user authorization process is only illustrative and not limiting, and other ways of satisfying relevant laws and regulations may be applied to the implementation of the present disclosure.
Meanwhile, it is understood that the data involved in the present technical solution (including but not limited to the data itself, the acquisition or use of the data) should comply with the requirements of the corresponding laws and regulations and the related regulations.
Fig. 1 is a flow chart illustrating a method of speech synthesis according to an exemplary embodiment of the present disclosure. As shown in fig. 1, the method includes step 101 and step 102.
In step 101, a text feature corresponding to a text to be synthesized is obtained.
In step 102, inputting the text features into a pre-trained speech synthesis model to obtain target speech synthesized according to the text to be synthesized; wherein the speech synthesis model synthesizes the target speech by directly converting the text features into speech waveform points corresponding to the target speech.
The text to be synthesized may be determined and obtained in any manner. For example, it may be text actively input by a user, text corresponding to an option selected from a plurality of preset options, or text recognized in a picture, among others.
The text features corresponding to the text to be synthesized may be obtained by extracting them with a pre-trained text feature extraction model. The text feature extraction model may be trained by conventional training methods, which are not particularly limited in this disclosure. The text features obtained from the text to be synthesized may be phonetic features related to the pronunciation, prosody, and the like of the text to be synthesized.
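As a purely illustrative sketch of such a front end (the class name TextFeatureExtractor, the phoneme vocabulary size, and the layer sizes below are assumptions for illustration, not structures defined by this disclosure), a pre-trained extractor might map phoneme identifiers of the text to per-phoneme pronunciation and prosody features:

```python
import torch
import torch.nn as nn

class TextFeatureExtractor(nn.Module):
    """Hypothetical pre-trained front end: phoneme ids -> pronunciation/prosody features."""
    def __init__(self, num_phonemes=100, feat_dim=256):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, feat_dim)
        self.encoder = nn.GRU(feat_dim, feat_dim, batch_first=True)

    def forward(self, phoneme_ids):               # (batch, seq_len) integer phoneme ids
        x = self.embed(phoneme_ids)                # (batch, seq_len, feat_dim)
        feats, _ = self.encoder(x)                 # one feature vector per phoneme
        return feats
```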
In this embodiment, the specific configuration of the speech synthesis model is not limited, as long as the speech waveform points can be synthesized directly from the text features corresponding to the text to be synthesized. The speech waveform points are the speech signal corresponding to the target speech, and they can be played directly.
That is, the speech synthesis model in this embodiment does not need to predict any intermediate feature. For example, it does not first predict the acoustic features (e.g., a mel spectrum) corresponding to the text features and then synthesize the target speech from those acoustic features; instead, it predicts the speech waveform points directly from the text features, so what the speech synthesis model represents is the correspondence between text features and speech waveform points. Compared with schemes that first predict acoustic features and then perform speech synthesis, this improves speech synthesis efficiency and greatly reduces the model size, so the requirements for terminal deployment are lower and offline use in various terminals is facilitated.
In addition, compared with the traditional parametric speech synthesis methods used in offline speech synthesis, the speech synthesis model in this embodiment does not need to separately predict speech parameters of the acoustic features, such as the fundamental frequency (F0) and the phase spectrum, from the text features; it directly predicts the speech waveform points corresponding to the text features, so its speech synthesis quality is substantially better than that of traditional parametric speech synthesis.
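As a minimal sketch only (the function name and the torchaudio-based saving step below are illustrative assumptions, not part of this disclosure), the inference path described above is a single forward pass from text features to directly playable waveform points, with no mel spectrogram or other intermediate feature produced:

```python
import torch
import torchaudio

def offline_synthesize(text_features, speech_synthesis_model,
                       sample_rate=24000, out_path="target_speech.wav"):
    """Hypothetical offline inference: text features -> speech waveform points."""
    speech_synthesis_model.eval()
    with torch.no_grad():
        waveform = speech_synthesis_model(text_features)    # (1, num_samples), values in [-1, 1]
    torchaudio.save(out_path, waveform.cpu(), sample_rate)   # directly playable audio
    return waveform
```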
Through the above technical solution, the pre-trained speech synthesis model can directly predict, from the text features of the text to be synthesized, the corresponding speech waveform points that serve as the target speech, without predicting any intermediate feature. This reduces the size of the speech synthesis model and makes it easier to deploy in various terminals for offline use. In addition, because there are no intermediate features, information loss in the intermediate layers of the model is reduced and the information gap between the synthesized target speech and the text features is narrowed, so the speech synthesis quality of the model is improved to a certain extent and the user's experience of offline speech synthesis is improved accordingly.
Fig. 2 is a flowchart illustrating a speech synthesis method according to yet another exemplary embodiment of the present disclosure. As shown in fig. 2, the method further comprises step 201.
In step 201, the speech synthesis model is obtained by jointly modeling an acoustic model and a vocoder, where the acoustic model is a model that generates acoustic features from text features and the vocoder is a module that generates speech waveform points from acoustic features.
That is, before the text features corresponding to the text to be synthesized are obtained, the speech synthesis model may be constructed by the method shown in step 201. The acoustic features generated by the acoustic model may be, for example, the mel spectrum described above, and the vocoder may be a neural network vocoder, such as a WaveNet vocoder or any other vocoder capable of generating speech waveform points from acoustic features. By jointly modeling the acoustic model and the vocoder, the part of the acoustic model's network structure that converts text features into acoustic features and the part of the neural network vocoder's structure that generates speech waveform points from acoustic features can be used together to build the speech synthesis model, thereby achieving the goal of generating speech waveform points directly from text features.
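One way such a jointly modeled network could look is sketched below; the module names, layer sizes, and upsampling factors are illustrative assumptions, not the structure required by this disclosure. The encoder plays the role of the acoustic model's text-to-hidden mapping and the upsampler plays the role of the vocoder's waveform generator, but the two are combined into a single network that maps text features straight to waveform points:

```python
import torch
import torch.nn as nn

class JointTextToWave(nn.Module):
    """Hypothetical joint model: acoustic-model-style encoder + vocoder-style upsampler."""
    def __init__(self, feat_dim=256, hidden=256, upsample_factors=(8, 8, 4)):
        super().__init__()
        # Encoder: the part that would otherwise live inside the acoustic model.
        self.encoder = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )
        # Upsampler: the part that would otherwise live inside the vocoder.
        layers, ch = [], hidden
        for f in upsample_factors:
            layers += [nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * f, stride=f, padding=f // 2),
                       nn.ReLU()]
            ch //= 2
        layers.append(nn.Conv1d(ch, 1, kernel_size=7, padding=3))
        self.upsampler = nn.Sequential(*layers)

    def forward(self, text_features, return_hidden=False):    # (batch, seq_len, feat_dim)
        hidden = self.encoder(text_features.transpose(1, 2))  # (batch, hidden, seq_len)
        wave = torch.tanh(self.upsampler(hidden)).squeeze(1)  # (batch, num_samples)
        return (wave, hidden) if return_hidden else wave
```

With the illustrative upsampling factors (8, 8, 4), each text-feature frame expands into 256 waveform samples; in a real system the factors would be chosen to match the frame rate of the text features and the audio sample rate.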
It should be noted that the present application does not limit the construction of the speech synthesis model to the method shown in step 201; step 201 is only one example of a feasible model construction method, and those skilled in the art may construct the model by any method, as long as the resulting speech synthesis model can directly predict the speech waveform points to be synthesized from the text features.
Moreover, since the speech synthesis model is trained in advance, the model construction method shown in step 201 of fig. 2 may be performed by an execution subject different from that of the method shown in fig. 1. For example, step 101 and step 102 shown in fig. 1 may be executed by the user terminal, so as to meet the user's need for offline speech synthesis in the terminal, while step 201 shown in fig. 2 may be executed in advance by a server.
In one possible embodiment, the present application further includes the following model training steps, which are not shown in fig. 2: jointly modeling an acoustic model and a vocoder to obtain an initial neural network model, and determining a training model, where the training model and a partial network structure of the initial neural network model together form a complete acoustic model; acquiring training samples, where each training sample includes text features corresponding to a training text, acoustic features corresponding to the training text, and speech waveform points corresponding to the training text; training the initial neural network model by multi-task training, where the multi-task training includes a first task and a second task, the first task being: taking the text features corresponding to the training text as the input of the initial neural network model and the speech waveform points corresponding to the training text as the expected output of the initial neural network model; and the second task being: taking the intermediate-layer features of the initial neural network model as the input of the training model and the acoustic features corresponding to the training text as the expected output of the training model; and determining the speech synthesis model from the trained initial neural network model.
The initial neural network model may be any artificial neural network (ANN) that includes hidden layers. When the initial neural network model is obtained by jointly modeling the acoustic model and the vocoder, it inevitably contains a partial network structure that originally belonged to the acoustic model. Therefore, this partial network structure and the training model together form a complete acoustic model, and performing multi-task training with this combined acoustic model and the initial neural network model allows the model parameters to be shared, so that the acoustic model further improves the training effect of the initial neural network model.
The training samples required during training may be obtained in various ways. Specifically, each training sample may include text features corresponding to a training text, acoustic features corresponding to the training text, and speech waveform points corresponding to the training text. The labeled text features, acoustic features, and speech waveform points may be obtained, via a wired or wireless connection, from local storage or from a communicatively connected electronic device; they may be labeled manually in real time or labeled automatically; or they may first be labeled automatically and then supplemented and corrected manually to fix labeling errors. This is not specifically limited in the present application.
By training, in a multi-task manner, the initial neural network model obtained by jointly modeling the acoustic model and the vocoder, a speech synthesis model that directly predicts the corresponding speech waveform points from the text features of a text can be obtained, so that the characters of any acquired text to be synthesized can be converted into audio with a target accent and a target timbre. In addition, the accent and timbre of the acoustic features and speech waveform points in the training samples determine the accent and timbre of the speech synthesized by the trained initial neural network. Therefore, when training the initial neural network model, the accent and timbre of the acoustic features and speech waveform points corresponding to the training text can be selected according to actual requirements, or multiple initial neural network models corresponding to different timbres and/or accents can be trained, so as to provide users with more speech synthesis options and improve their speech synthesis experience.
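Under the same illustrative assumptions as the sketch above (the JointTextToWave network with a return_hidden flag, and a separate mel-predicting head standing in for the "training model"), one multi-task training step might look roughly as follows; the L1 losses, the loss weight, and the data layout are illustrative choices only, not requirements of this disclosure:

```python
import torch
import torch.nn as nn

def multitask_train_step(model, training_model, batch, optimizer, lambda_acoustic=1.0):
    """One hypothetical multi-task step.

    model          : joint network, forward(text_features, return_hidden=True) -> (waveform, hidden)
    training_model : head mapping intermediate-layer features to acoustic features (e.g. mel frames)
    batch          : dict with 'text_features', 'waveform', and 'acoustic' tensors for one batch
    """
    optimizer.zero_grad()
    pred_wave, hidden = model(batch["text_features"], return_hidden=True)

    # First task: text features in, speech waveform points as the expected output.
    waveform_loss = nn.functional.l1_loss(pred_wave, batch["waveform"])

    # Second task: intermediate-layer features in, acoustic features as the expected output,
    # so that the encoder plus the training model behave like a complete acoustic model.
    pred_acoustic = training_model(hidden.transpose(1, 2))          # (batch, frames, n_mels)
    acoustic_loss = nn.functional.l1_loss(pred_acoustic, batch["acoustic"])

    loss = waveform_loss + lambda_acoustic * acoustic_loss
    loss.backward()
    optimizer.step()
    return waveform_loss.item(), acoustic_loss.item()
```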
In one possible embodiment, after the model training process described above, model distillation and model pruning operations can be performed to further reduce the size of the speech synthesis model. The model distillation operation may include: acquiring a pre-trained guide model, where the guide model is built and trained by the same modeling method and training method as the trained initial neural network model, and its model depth and number of model nodes are both larger than those of the trained initial neural network model; taking the trained initial neural network model as a target model and training it, under the guidance of the guide model, by means of model distillation; and determining the speech synthesis model from the trained target model. The guide model may likewise be obtained by training a neural network model constructed by jointly modeling an acoustic model and a vocoder, or its construction and training may differ somewhat from those of the initial neural network model; in either case, the training target of both the guide model and the trained initial neural network model should be to directly predict the speech waveform points of the target speech from the text features of the text to be synthesized, and the guide model should exceed the trained initial neural network model in both network depth and number of nodes, so that it can guide the training of the latter. The model pruning operation may include: pruning the trained target model to reduce its model parameters, and determining the pruned target model as the speech synthesis model. The specific pruning method may be any conventional model pruning method and is not particularly limited in this disclosure.
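A rough, non-authoritative sketch of these two operations is given below under the same assumptions as earlier; the distillation loss weighting and the use of magnitude-based pruning from torch.nn.utils.prune are illustrative choices, not the methods prescribed by this disclosure:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def distill_step(guide_model, target_model, text_features, target_wave, optimizer, alpha=0.5):
    """Hypothetical distillation step: the larger guide model supervises the smaller target model."""
    optimizer.zero_grad()
    with torch.no_grad():
        guide_wave = guide_model(text_features)                       # output of the guide (teacher) model
    student_wave = target_model(text_features)
    loss = (alpha * nn.functional.l1_loss(student_wave, guide_wave)            # follow the guide model
            + (1 - alpha) * nn.functional.l1_loss(student_wave, target_wave))  # follow the training data
    loss.backward()
    optimizer.step()
    return loss.item()

def prune_model(target_model, amount=0.3):
    """Hypothetical pruning: drop the smallest-magnitude weights, then make the pruning permanent."""
    for module in target_model.modules():
        if isinstance(module, (nn.Conv1d, nn.ConvTranspose1d, nn.Linear)):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")        # fold the pruning mask into the weight tensor
    return target_model
```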
Fig. 3 is a flowchart illustrating a speech synthesis method according to yet another exemplary embodiment of the present disclosure. As shown in fig. 3, the method further comprises step 301 and step 302.
In step 301, a target timbre is acquired.
In step 302, the text features are input into a pre-trained speech synthesis model corresponding to the target timbre, so as to obtain the target speech synthesized from the text to be synthesized.
As described above, multiple initial neural network models corresponding to different timbres and/or accents may be trained, and speech synthesis models corresponding to the different timbres and/or accents may be determined from them, so as to provide users with more speech synthesis options. Therefore, when performing speech synthesis, the user can select the target timbre as needed, which improves the user's speech synthesis experience.
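As a small, purely illustrative snippet (the timbre identifiers, file names, and loading scheme below are hypothetical), selecting a target timbre can amount to loading the speech synthesis model trained for that timbre and running the same direct text-feature-to-waveform inference:

```python
import torch

# Hypothetical mapping from target timbre to its separately trained speech synthesis model file.
TIMBRE_MODELS = {
    "female_standard": "tts_female_standard.pt",
    "male_standard": "tts_male_standard.pt",
}

def synthesize_with_timbre(text_features, target_timbre):
    model = torch.load(TIMBRE_MODELS[target_timbre], map_location="cpu")  # model saved as a whole object
    model.eval()
    with torch.no_grad():
        return model(text_features)           # speech waveform points for the requested timbre
```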
Fig. 4 is a block diagram illustrating the structure of a speech synthesis apparatus according to an exemplary embodiment of the present disclosure. As shown in fig. 4, the apparatus includes: an acquisition module 10 configured to acquire text features corresponding to a text to be synthesized; and a speech synthesis module 20 configured to input the text features into a pre-trained speech synthesis model to obtain target speech synthesized from the text to be synthesized; wherein the speech synthesis model synthesizes the target speech by directly converting the text features into speech waveform points corresponding to the target speech.
Through the above technical solution, the pre-trained speech synthesis model can directly predict, from the text features of the text to be synthesized, the corresponding speech waveform points that serve as the target speech, without predicting any intermediate feature. This reduces the size of the speech synthesis model and makes it easier to deploy in various terminals for offline use. In addition, because there are no intermediate features, information loss in the intermediate layers of the model is reduced and the information gap between the synthesized target speech and the text features is narrowed, so the speech synthesis quality of the model is improved to a certain extent and the user's experience of offline speech synthesis is improved accordingly.
In one possible implementation, the acquisition module 10 is further configured to extract the text features corresponding to the text to be synthesized with a pre-trained text feature extraction model.
Fig. 5 is a block diagram illustrating the structure of a speech synthesis apparatus according to still another exemplary embodiment of the present disclosure. As shown in fig. 5, the apparatus further includes: a model training module 30 configured to jointly model an acoustic model and a vocoder to obtain the speech synthesis model, where the acoustic model is a model that generates acoustic features from text features and the vocoder is a module that generates speech waveform points from acoustic features.
In one possible implementation, the model training module 30 is further configured to: jointly model an acoustic model and a vocoder to obtain an initial neural network model, and determine a training model, where the training model and a partial network structure of the initial neural network model together form a complete acoustic model; acquire training samples, where each training sample includes text features corresponding to a training text, acoustic features corresponding to the training text, and speech waveform points corresponding to the training text; train the initial neural network model by multi-task training, where the multi-task training includes a first task and a second task, the first task being: taking the text features corresponding to the training text as the input of the initial neural network model and the speech waveform points corresponding to the training text as the expected output of the initial neural network model; and the second task being: taking the intermediate-layer features of the initial neural network model as the input of the training model and the acoustic features corresponding to the training text as the expected output of the training model; and determine the speech synthesis model from the trained initial neural network model.
In one possible implementation, the model training module 30 is further configured to: obtain a pre-trained guide model, where the guide model is built and trained by the same modeling method and training method as the trained initial neural network model, and its model depth and number of model nodes are both larger than those of the trained initial neural network model; take the trained initial neural network model as a target model and train it, under the guidance of the guide model, by means of model distillation; and determine the speech synthesis model from the trained target model.
In one possible implementation, the model training module 30 is further configured to: prune the trained target model to reduce its model parameters; and determine the pruned target model as the speech synthesis model.
In one possible implementation, the acquisition module 10 is further configured to acquire a target timbre, and the speech synthesis module 20 is further configured to input the text features into a pre-trained speech synthesis model corresponding to the target timbre, so as to obtain target speech synthesized from the text to be synthesized.
Referring now to FIG. 6, a block diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, electronic device 600 may include a processing device (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage device 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or installed from the storage means 608, or installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire text features corresponding to a text to be synthesized; and input the text features into a pre-trained speech synthesis model to obtain target speech synthesized from the text to be synthesized; wherein the speech synthesis model synthesizes the target speech by directly converting the text features into speech waveform points corresponding to the target speech.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or by hardware. In some cases, the name of a module does not limit the module itself; for example, the acquisition module may also be described as "a module that acquires text features corresponding to a text to be synthesized".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides a speech synthesis method according to one or more embodiments of the present disclosure, the method including: acquiring text features corresponding to a text to be synthesized; and inputting the text features into a pre-trained speech synthesis model to obtain target speech synthesized from the text to be synthesized; wherein the speech synthesis model synthesizes the target speech by directly converting the text features into speech waveform points corresponding to the target speech.
Example 2 provides the method of example 1, wherein the obtaining of the text features corresponding to the text to be synthesized includes: extracting the text features corresponding to the text to be synthesized according to a pre-trained text feature extraction model.
Example 3 provides the method of example 1, and further includes, before obtaining the text features corresponding to the text to be synthesized: jointly modeling an acoustic model and a vocoder to obtain the speech synthesis model, wherein the acoustic model is a model for generating acoustic features from text features, and the vocoder is a module for generating speech waveform points from acoustic features.
Example 4 provides the method of example 3, wherein jointly modeling from an acoustic model and a vocoder to derive the speech synthesis model further comprises: jointly modeling according to an acoustic model and a vocoder to obtain an initial neural network model, and determining a training model, wherein the training model and a part of network structures in the initial neural network model form a complete acoustic model; acquiring a training sample, wherein the training sample comprises a text feature corresponding to a training text, an acoustic feature corresponding to the training text and a voice wave point corresponding to the training text; training the initial neural network model by means of multi-task training, wherein the multi-task training comprises a first task and a second task, and the first task comprises: taking the text features corresponding to the training texts as the input of the initial neural network model, and taking the speech wave points corresponding to the training texts as the expected output of the initial neural network model; the second task includes: acquiring intermediate layer characteristics of the initial neural network model as input of the training model, and taking acoustic characteristics corresponding to the training text as output of the training model; and determining the voice synthesis model according to the trained initial neural network model.
Example 5 provides the method of example 4, the determining the speech synthesis model from the trained initial neural network model including: obtaining a pre-trained guide model, wherein the guide model and the trained initial neural network model have the same modeling method and training method, and the model depth and model nodes of the guide model are larger than those of the trained initial neural network model; taking the trained initial neural network model as a target model, and guiding and training the target model in a model distillation mode according to the guide model; and determining the speech synthesis model according to the trained target model.
Example 6 provides the method of example 5, wherein the determining the speech synthesis model from the trained target model comprises: pruning the trained target model to reduce model parameters in the trained target model; and determining the target model subjected to pruning as the voice synthesis model.
Example 7 provides the method of example 1, in accordance with one or more embodiments of the present disclosure, further comprising: acquiring a target timbre; and wherein the inputting of the text features into the pre-trained speech synthesis model to obtain target speech synthesized from the text to be synthesized comprises: inputting the text features into a pre-trained speech synthesis model corresponding to the target timbre, so as to obtain the target speech synthesized from the text to be synthesized.
Example 8 provides, in accordance with one or more embodiments of the present disclosure, a speech synthesis apparatus, the apparatus comprising: an acquisition module for acquiring text features corresponding to a text to be synthesized; and a speech synthesis module for inputting the text features into a pre-trained speech synthesis model to obtain target speech synthesized from the text to be synthesized; wherein the speech synthesis model synthesizes the target speech by directly converting the text features into speech waveform points corresponding to the target speech.
Example 9 provides the apparatus of example 8, wherein the acquisition module 10 is further configured to extract the text features corresponding to the text to be synthesized according to a pre-trained text feature extraction model.
Example 10 provides the apparatus of example 8, in accordance with one or more embodiments of the present disclosure, the apparatus further comprising: a model training module 30 for jointly modeling an acoustic model and a vocoder to obtain the speech synthesis model, wherein the acoustic model is a model for generating acoustic features from text features, and the vocoder is a module for generating speech waveform points from the acoustic features.
Example 11 provides the apparatus of example 10, the model training module 30 further to: jointly modeling according to an acoustic model and a vocoder to obtain an initial neural network model, and determining a training model, wherein the training model and a part of network structures in the initial neural network model form a complete acoustic model; acquiring a training sample, wherein the training sample comprises text features corresponding to a training text, acoustic features corresponding to the training text and voice waveform points corresponding to the training text; training the initial neural network model by means of multi-task training, wherein the multi-task training comprises a first task and a second task, and the first task comprises the following steps: taking the text features corresponding to the training texts as the input of the initial neural network model, and taking the speech wave points corresponding to the training texts as the expected output of the initial neural network model; the second task includes: acquiring intermediate layer characteristics of the initial neural network model as input of the training model, and taking acoustic characteristics corresponding to the training text as output of the training model; and determining the speech synthesis model according to the trained initial neural network model.
Example 12 provides the apparatus of example 11, the model training module 30 further to: obtaining a pre-trained guide model, wherein the guide model and the trained initial neural network model have the same modeling method and training method, and the model depth and model nodes of the guide model are larger than those of the trained initial neural network model; taking the trained initial neural network model as a target model, and guiding and training the target model in a model distillation mode according to the guide model; and determining the voice synthesis model according to the trained target model.
Example 13 provides the apparatus of example 12, the model training module 30 further to: pruning the trained target model to reduce model parameters in the trained target model; and determining the target model subjected to pruning as the speech synthesis model.
Example 14 provides the apparatus of example 8, in accordance with one or more embodiments of the present disclosure, wherein the acquisition module 10 is further configured to acquire a target timbre, and the speech synthesis module 20 is further configured to input the text features into a pre-trained speech synthesis model corresponding to the target timbre, so as to obtain target speech synthesized from the text to be synthesized.
Example 15 provides, in accordance with one or more embodiments of the present disclosure, a computer-readable medium having stored thereon a computer program that, when executed by a processing device, performs the steps of the method of any of examples 1-7.
Example 16 provides, in accordance with one or more embodiments of the present disclosure, an electronic device, comprising: a storage device having at least one computer program stored thereon; at least one processing device for executing the at least one computer program in the storage device to implement the steps of the method of any of examples 1-7.
The foregoing description is merely a description of the preferred embodiments of the present disclosure and of the technical principles employed. Those skilled in the art will appreciate that the scope of the disclosure is not limited to technical solutions formed by the particular combination of features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with features having similar functions disclosed in (but not limited to) the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.

Claims (10)

1. A method of speech synthesis, the method comprising:
acquiring text characteristics corresponding to a text to be synthesized;
inputting the text features into a pre-trained speech synthesis model to obtain target speech synthesized according to the text to be synthesized;
wherein the speech synthesis model synthesizes the target speech by directly converting the text features into speech waveform points corresponding to the target speech.
2. The method according to claim 1, wherein the obtaining text features corresponding to the text to be synthesized comprises:
extracting the text features corresponding to the text to be synthesized according to a text feature extraction model obtained by pre-training.
3. The method according to claim 1, wherein before obtaining the text features corresponding to the text to be synthesized, the method further comprises:
jointly modeling according to an acoustic model and a vocoder to obtain the speech synthesis model, wherein the acoustic model is a model for generating acoustic features according to the text features, and the vocoder is a module for generating speech waveform points according to the acoustic features.
4. The method of claim 3, wherein jointly modeling from an acoustic model and a vocoder to derive the speech synthesis model further comprises:
jointly modeling according to an acoustic model and a vocoder to obtain an initial neural network model, and determining a training model, wherein the training model and a partial network structure in the initial neural network model form a complete acoustic model;
acquiring a training sample, wherein the training sample comprises text features corresponding to a training text, acoustic features corresponding to the training text, and speech waveform points corresponding to the training text;
training the initial neural network model by means of multi-task training, wherein the multi-task training comprises a first task and a second task, the first task comprising: taking the text features corresponding to the training text as input of the initial neural network model, and taking the speech waveform points corresponding to the training text as expected output of the initial neural network model; and the second task comprising: acquiring intermediate-layer features of the initial neural network model as input of the training model, and taking the acoustic features corresponding to the training text as output of the training model;
and determining the speech synthesis model according to the trained initial neural network model.
5. The method of claim 4, wherein the determining the speech synthesis model from the trained initial neural network model comprises:
obtaining a pre-trained guide model, wherein the guide model and the trained initial neural network model have the same modeling method and training method, and the model depth and model nodes of the guide model are larger than those of the trained initial neural network model;
taking the trained initial neural network model as a target model, and guiding and training the target model in a model distillation mode according to the guide model;
and determining the speech synthesis model according to the trained target model.
6. The method of claim 5, wherein the determining the speech synthesis model from the trained target model comprises:
pruning the trained target model to reduce model parameters in the trained target model;
and determining the target model subjected to pruning as the speech synthesis model.
7. The method of claim 1, further comprising:
acquiring a target timbre;
wherein the inputting of the text features into the pre-trained speech synthesis model to obtain target speech synthesized according to the text to be synthesized comprises:
inputting the text features into a pre-trained speech synthesis model corresponding to the target timbre, so as to obtain the target speech synthesized according to the text to be synthesized.
8. A speech synthesis apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring text characteristics corresponding to the text to be synthesized;
a speech synthesis module for inputting the text features into a pre-trained speech synthesis model to obtain target speech synthesized according to the text to be synthesized;
wherein the speech synthesis model synthesizes the target speech by directly converting the text features into speech waveform points corresponding to the target speech.
9. A computer-readable medium, on which a computer program is stored, characterized in that the program, when being executed by processing means, carries out the steps of the method of any one of claims 1 to 7.
10. An electronic device, comprising:
a storage device having at least one computer program stored thereon;
at least one processing device for executing the at least one computer program in the storage device to implement the steps of the method of any one of claims 1-7.
CN202210872082.3A 2022-07-22 2022-07-22 Speech synthesis method, apparatus, medium, and electronic device Pending CN115240635A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210872082.3A CN115240635A (en) 2022-07-22 2022-07-22 Speech synthesis method, apparatus, medium, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210872082.3A CN115240635A (en) 2022-07-22 2022-07-22 Speech synthesis method, apparatus, medium, and electronic device

Publications (1)

Publication Number Publication Date
CN115240635A 2022-10-25

Family

ID=83674949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210872082.3A Pending CN115240635A (en) 2022-07-22 2022-07-22 Speech synthesis method, apparatus, medium, and electronic device

Country Status (1)

Country Link
CN (1) CN115240635A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination