CN112289297A - Speech synthesis method, device and system - Google Patents
- Publication number: CN112289297A
- Application number: CN201910675961.5A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion)
Classifications
- G10L — Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding, including:
- G10L13/02 — Methods for producing synthetic speech; speech synthesisers
- G10L13/04 — Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme-to-phoneme translation, prosody generation or stress or intonation determination
- G10L25/24 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/30 — Speech or voice analysis techniques characterised by the analysis technique, using neural networks
Abstract
The application discloses a speech synthesis method, device and system. The method comprises: obtaining a second acoustic feature set based on a text to be processed and a first acoustic feature set, wherein the first acoustic feature set comprises acoustic features obtained by decoding during the synthesis of at least one historical speech frame, and the second acoustic feature set comprises acoustic features obtained by decoding during the synthesis of the current speech frame; and performing speech synthesis processing based on at least the second acoustic feature set to obtain the current speech frame. The method and device solve the technical problem in the prior art that using a single historical speech frame to generate the current speech frame leads to a poor speech synthesis effect.
Description
Technical Field
The present application relates to the field of speech processing, and in particular, to a speech synthesis method, apparatus, and system.
Background
Speech synthesis is a technology for generating artificial speech by mechanical and electronic means. Text To Speech (TTS) is one such technology: it converts text information, generated by a computer or input from an external source, into an audible sound signal and outputs it.
In a conventional Neural TTS system, usually only one historical speech frame is used to guide the generation of the current speech frame. For low-frequency speech (e.g., a low-frequency male voice), however, a single historical speech frame does not provide enough information to guide the generation of the current speech frame, so the speech synthesized from the text is of poor quality.
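For contrast, the following is a simplified, assumed sketch (Python/NumPy; the decoder_step callable and the 80-dimensional frame size are illustrative, not taken from this application) of the conventional autoregressive decoding just described, in which only the single previous frame conditions each step:

```python
import numpy as np

def conventional_decode(decoder_step, num_frames: int, feat_dim: int = 80) -> np.ndarray:
    """decoder_step(prev_frame) -> next_frame; the previous frame is the only history used."""
    prev = np.zeros(feat_dim, dtype=np.float32)   # initial "go" frame
    frames = []
    for _ in range(num_frames):
        prev = decoder_step(prev)                 # a single historical frame guides each step
        frames.append(prev)
    return np.stack(frames, axis=0)               # (num_frames, feat_dim)
```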
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of the present application provide a speech synthesis method, apparatus and system, to at least solve the technical problem in the prior art that using a single historical speech frame to generate the current speech frame leads to a poor speech synthesis effect.
According to one aspect of the embodiments of the present application, there is provided a speech synthesis method including: obtaining a second acoustic feature set based on a text to be processed and a first acoustic feature set, wherein the first acoustic feature set comprises acoustic features obtained by decoding during the synthesis of at least one historical speech frame, and the second acoustic feature set comprises acoustic features obtained by decoding during the synthesis of the current speech frame; and performing speech synthesis processing based on at least the second acoustic feature set to obtain the current speech frame.
According to another aspect of the embodiments of the present application, there is also provided a speech synthesis apparatus including: an acquisition module configured to obtain a second acoustic feature set based on a text to be processed and a first acoustic feature set, wherein the first acoustic feature set comprises acoustic features obtained by decoding during the synthesis of at least one historical speech frame, and the second acoustic feature set comprises acoustic features obtained by decoding during the synthesis of the current speech frame; and a synthesis module configured to perform speech synthesis processing based on at least the second acoustic feature set to obtain the current speech frame.
According to another aspect of the embodiments of the present application, there is also provided a storage medium including a stored program, wherein when the program runs, a device on which the storage medium is located is controlled to execute the above-mentioned speech synthesis method.
According to another aspect of the embodiments of the present application, there is also provided a processor for executing a program, wherein the program, when running, performs the above speech synthesis method.
According to another aspect of the embodiments of the present application, there is also provided a speaker device configured to perform the following processing steps: obtaining a second acoustic feature set based on a text to be processed and a first acoustic feature set, wherein the first acoustic feature set comprises acoustic features obtained by decoding during the synthesis of at least one historical speech frame, and the second acoustic feature set comprises acoustic features obtained by decoding during the synthesis of the current speech frame; and performing speech synthesis processing based on at least the second acoustic feature set to obtain the current speech frame.
According to another aspect of the embodiments of the present application, there is also provided a speech synthesis system including: a processor; and a memory coupled to the processor and configured to provide the processor with instructions for the following processing steps: obtaining a second acoustic feature set based on a text to be processed and a first acoustic feature set, wherein the first acoustic feature set comprises acoustic features obtained by decoding during the synthesis of at least one historical speech frame, and the second acoustic feature set comprises acoustic features obtained by decoding during the synthesis of the current speech frame; and performing speech synthesis processing based on at least the second acoustic feature set to obtain the current speech frame.
According to another aspect of the embodiments of the present application, there is also provided a speech synthesis system including: a pre-processing module configured to pre-process a text to be processed to obtain a pre-processed text; an encoding module configured to encode the pre-processed text and input the encoding result to a decoding module through an attention mechanism; the decoding module, configured to decode the encoding result to obtain a plurality of acoustic features corresponding to the text to be processed, where the plurality of acoustic features include a first acoustic feature set obtained by decoding during the synthesis of at least one historical speech frame and a second acoustic feature set obtained by decoding during the synthesis of the current speech frame; and a post-processing module configured to perform speech synthesis processing based on at least the second acoustic feature set to obtain the current speech frame.
In the embodiments of the present application, speech synthesis is performed using a plurality of acoustic features: a second acoustic feature set is obtained based on the text to be processed and a first acoustic feature set, and speech synthesis processing is then performed based on at least the second acoustic feature set to obtain the current speech frame, where the first acoustic feature set comprises acoustic features obtained by decoding during the synthesis of at least one historical speech frame, and the second acoustic feature set comprises acoustic features obtained by decoding during the synthesis of the current speech frame.
In this process, at least one historical speech frame is used to guide the generation of the current speech frame, which improves the sound quality and naturalness of the synthesized speech of low-frequency speakers. The scheme provided by the application therefore achieves the purpose of synthesizing speech, realizes the technical effect of improving the speech synthesis effect, and solves the technical problem in the prior art that using a single historical speech frame to generate the current speech frame leads to a poor speech synthesis effect.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a block diagram of a hardware configuration of a computer terminal according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of speech synthesis according to an embodiment of the present application;
FIG. 3 is a flow diagram of an alternative method of speech synthesis according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a speech synthesis apparatus according to an embodiment of the present application;
FIG. 5 is a block diagram of a computer terminal according to an embodiment of the present application; and
FIG. 6 is a schematic diagram of a speech synthesis system according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some of the terms appearing in the description of the embodiments of the present application are explained as follows:
Neural TTS: a speech synthesis method that uses an end-to-end model in the acoustic modeling module to convert text into a speech spectrum.
Example 1
There is also provided, in accordance with an embodiment of the present application, a speech synthesis method embodiment, it being noted that the steps illustrated in the flowchart of the figure may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases, the steps illustrated or described may be performed in an order different than here.
The method provided in the first embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Fig. 1 shows a hardware configuration block diagram of a computer terminal (or mobile device) for implementing the speech synthesis method. As shown in fig. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, ..., 102n; the processors 102 may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device 106 for communication functions. In addition, the computer terminal may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components than shown in fig. 1, or have a different configuration from that shown in fig. 1.
It should be noted that the one or more processors 102 and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuitry may be a single stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computer terminal 10 (or mobile device). In the embodiments of the present application, the data processing circuitry acts as a kind of processor control (for example, selection of a variable resistance termination path connected to the interface).
The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the speech synthesis method in the embodiment of the present application, and the processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, so as to implement the speech synthesis method. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 can be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).
It should be noted here that, in some alternative embodiments, the computer device (or mobile device) shown in fig. 1 may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should also be noted that fig. 1 is merely one specific example and is intended to illustrate the types of components that may be present in the computer device (or mobile device) described above.
In the above operating environment, the present application provides a speech synthesis method as shown in fig. 2. The method can be applied to speech synthesis scenarios in which the speaker's audio is in a low frequency band. Fig. 2 is a flowchart of a speech synthesis method according to the first embodiment of the present application, and as shown in fig. 2, the method includes the following steps:
Step S202: obtaining a second acoustic feature set based on a text to be processed and a first acoustic feature set, wherein the first acoustic feature set comprises acoustic features obtained by decoding during the synthesis of at least one historical speech frame, and the second acoustic feature set comprises acoustic features obtained by decoding during the synthesis of the current speech frame.
In step S202, the text to be processed is text that needs to be converted into speech, and the speech synthesis system can acquire it. Optionally, a transcriber may listen to a piece of speech repeatedly and write down its content to obtain the text to be processed, which is then stored in a preset storage location from which the speech synthesis system retrieves it. Optionally, the speech synthesis system may instead perform automatic speech-to-text conversion on the speech to obtain the text to be processed, store it in the preset storage location, and retrieve it from there.
In an optional embodiment, fig. 3 shows a flowchart of a speech synthesis method provided by the present application. As shown in fig. 3, the speech synthesis system first inputs the text to be processed into a pre-processing network for pre-processing, inputs the pre-processed text into an encoder for encoding, and feeds the encoding result into a decoder through an attention mechanism for decoding, so as to obtain the acoustic features corresponding to the text to be processed. The speech synthesis system processes these acoustic features to obtain the first acoustic feature set. Optionally, the pre-processing network may pre-process the text to be processed using natural language processing methods. In addition, before pre-processing the text, the pre-processing network may first detect the language of the text to be processed and then pre-process it with the pre-processing method corresponding to that language, so that the decoder outputs more accurate acoustic features and the speech synthesis effect is further improved.
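As an illustration only, the following PyTorch sketch shows one possible pre-net/encoder/attention/decoder flow of this kind. All module names, layer sizes and the single-head attention are assumptions made for the example and are not the exact architecture disclosed in this application:

```python
import torch
import torch.nn as nn

class PreNet(nn.Module):
    """Pre-processing network applied to the (embedded) text before encoding."""
    def __init__(self, in_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())

    def forward(self, x):
        return self.net(x)

class TextEncoder(nn.Module):
    """Encodes the pre-processed text into a memory the decoder attends over."""
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                       # x: (B, T_text, hidden)
        memory, _ = self.rnn(x)                 # (B, T_text, 2 * hidden)
        return memory

class FrameDecoder(nn.Module):
    """One decoding step: attend over the text memory, then predict one frame of acoustic features."""
    def __init__(self, memory_dim: int = 512, hidden: int = 256, feat_dim: int = 80):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, num_heads=1, kdim=memory_dim,
                                          vdim=memory_dim, batch_first=True)
        self.cell = nn.GRUCell(hidden * 2, hidden)
        self.proj = nn.Linear(hidden, feat_dim)

    def forward(self, prev_feat, state, memory):
        # prev_feat: (B, hidden), pre-processed acoustic features of earlier frame(s)
        context, _ = self.attn(prev_feat.unsqueeze(1), memory, memory)  # (B, 1, hidden)
        state = self.cell(torch.cat([prev_feat, context.squeeze(1)], dim=-1), state)
        return self.proj(state), state          # acoustic features of the current frame
```

Running the decoder step by step and feeding its own (delayed) outputs back in, as described further below, yields one acoustic feature vector per speech frame.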
It should be noted that, in step S202, the second acoustic feature set includes a first component and a second component, where the first component includes a Mel-spectrum acoustic feature, and the second component includes at least one of the following: a Mel-cepstrum acoustic feature, a fundamental-frequency acoustic feature, and a voiced/unvoiced acoustic feature. The second component is complementary to the first component and can better guide the generation of a linear spectrum. Here, the Mel-cepstrum acoustic feature may be expressed as Mel-generalized cepstral coefficients, which may be replaced by Line Spectral Pair (LSP) parameters or Linear Prediction Coefficients (LPC). It is easy to see that introducing acoustic parameters such as the Mel-cepstrum, fundamental-frequency and voiced/unvoiced acoustic features on top of the conventional Neural TTS scheme can improve the stability of the synthesized speech.
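For illustration only, one possible in-memory layout of such an acoustic feature set is sketched below (Python/NumPy; the dimensions shown, e.g. an 80-bin Mel spectrum and 25 Mel-generalized cepstral coefficients per frame, are assumptions, not values taken from this application):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AcousticFeatureSet:
    mel_spectrum: np.ndarray   # first component, e.g. shape (T, 80)
    mel_cepstrum: np.ndarray   # second component, e.g. shape (T, 25) MGC coefficients
    f0: np.ndarray             # second component, shape (T,) fundamental frequency per frame
    voiced_flag: np.ndarray    # second component, shape (T,) 1.0 = voiced, 0.0 = unvoiced

    def concat(self) -> np.ndarray:
        """Stack all components frame-wise, as one possible decoder output layout."""
        return np.concatenate(
            [self.mel_spectrum, self.mel_cepstrum,
             self.f0[:, None], self.voiced_flag[:, None]], axis=-1)
```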
Step S204: performing speech synthesis processing based on at least the second acoustic feature set to obtain the current speech frame.
Optionally, after obtaining the second acoustic feature set, the speech synthesis system processes it through a post-processing network to obtain a linear spectrum, and then reconstructs the current speech frame from the linear spectrum through the Griffin-Lim algorithm, as shown in fig. 3. The post-processing network is a neural network that post-processes the acoustic feature set; the post-processing corresponds to the pre-processing, and in the field of speech synthesis it may include processing such as language modeling, decoding and error handling of the speech.
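As a minimal sketch of that last stage, the following uses librosa's Griffin-Lim implementation to recover a waveform from a predicted linear (magnitude) spectrogram; the frame parameters (n_fft, hop length, iteration count) are assumptions for the example:

```python
import numpy as np
import librosa

def linear_spectrum_to_wave(linear_mag: np.ndarray,
                            n_fft: int = 1024,
                            hop_length: int = 256,
                            n_iter: int = 60) -> np.ndarray:
    """linear_mag: (1 + n_fft // 2, T) magnitude spectrogram from the post-processing network."""
    # Griffin-Lim iteratively estimates a phase consistent with the magnitudes,
    # then inverts the STFT to obtain the time-domain speech.
    return librosa.griffinlim(linear_mag, n_iter=n_iter,
                              hop_length=hop_length, win_length=n_fft)
```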
Based on the schemes defined in steps S202 to S204, speech synthesis is performed using a plurality of acoustic features: a second acoustic feature set is obtained based on the text to be processed and a first acoustic feature set, and speech synthesis processing is then performed based on at least the second acoustic feature set to obtain the current speech frame, where the first acoustic feature set includes acoustic features obtained by decoding during the synthesis of at least one historical speech frame, and the second acoustic feature set includes acoustic features obtained by decoding during the synthesis of the current speech frame.
It is easy to see that guiding the generation of the current speech frame with at least one historical speech frame can improve the sound quality and naturalness of the synthesized speech of low-frequency speakers. The scheme provided by the application therefore achieves the purpose of synthesizing speech, realizes the technical effect of improving the speech synthesis effect, and solves the technical problem in the prior art that using a single historical speech frame to generate the current speech frame leads to a poor speech synthesis effect.
In an optional embodiment, before obtaining the second acoustic feature set based on the text to be processed and the first acoustic feature set, the speech synthesis system acquires the first acoustic feature set through a first neural network, where the first neural network is configured to perform delay processing on the first acoustic feature set, so that the first acoustic feature set becomes a reference factor of a synthesis process of a current speech frame.
Alternatively, the first neural network may be the delay network shown in fig. 3. The speech synthesis system inputs the first acoustic feature set into the delay network for delaying, then inputs the delayed first acoustic feature set into the pre-processing network for pre-processing, and finally inputs the pre-processed first acoustic feature set into the decoder for decoding. By delaying the first acoustic feature set, the speech synthesis system can better predict the parameters used in the synthesis of the next speech frame.
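One way such a delay network could be realised is as a simple history buffer that keeps the acoustic features decoded for the last few frames and hands the stacked history to the pre-processing network; this sketch is an assumption about the mechanism, not the patented implementation, and the history length is illustrative:

```python
from collections import deque
import numpy as np

class DelayBuffer:
    """Keeps the acoustic features decoded for the last `num_history` speech frames."""
    def __init__(self, num_history: int = 4, feat_dim: int = 80):
        self.buffer = deque([np.zeros(feat_dim, dtype=np.float32)
                             for _ in range(num_history)], maxlen=num_history)

    def push(self, decoded_frame: np.ndarray) -> None:
        """Store the features decoded for the frame that was just synthesised."""
        self.buffer.append(decoded_frame)

    def history(self) -> np.ndarray:
        """Return the delayed (first) acoustic feature set, shape (num_history, feat_dim)."""
        return np.stack(list(self.buffer), axis=0)
```

At each synthesis step the decoder would consume history() as a reference factor and, once the current frame is decoded, its features would be pushed back into the buffer so that they in turn guide the next frame.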
Further, after obtaining the second acoustic feature set based on the text to be processed and the first acoustic feature set, the speech synthesis system inputs the second acoustic feature set to the first neural network, so that the second acoustic feature set is integrated as a reference factor of the synthesis process of the next speech frame. The process is the same as the processing method for the first acoustic feature set, and is not described herein again.
It should be noted that delaying the acoustic feature sets (including the first acoustic feature set and the second acoustic feature set) through the delay network introduces historical speech frame information from multiple moments. Compared with the prior art, in which only one frame of historical information is introduced, the scheme provided by the application can improve the MOS (Mean Opinion Score) by 0.5, thereby effectively improving the speech synthesis effect.
Further, after obtaining the second acoustic feature set, the speech synthesis system performs speech synthesis processing using the second acoustic feature set to obtain the current speech frame. Specifically, the speech synthesis system performs multi-task learning on the acoustic features contained in the second acoustic feature set to obtain the input parameters of a second neural network, outputs a linear spectrum through the second neural network, and finally reconstructs the linear spectrum to obtain the current speech frame.
In another optional embodiment, the speech synthesis system may instead perform multi-task learning on the acoustic features contained in the first acoustic feature set and the second acoustic feature set to obtain the input parameters of the second neural network, output a linear spectrum through the second neural network, and finally reconstruct the linear spectrum to obtain the current speech frame.
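The multi-task learning mentioned above could, for example, take the form of a joint loss over every component of the acoustic feature set plus the linear spectrum, so that each task regularises the others; the loss terms, weights and tensor keys below are assumptions for illustration only:

```python
import torch
import torch.nn.functional as F

def multitask_loss(pred: dict, target: dict, weights: dict = None) -> torch.Tensor:
    """pred/target: dicts of tensors keyed 'mel', 'mgc', 'f0', 'vuv' and 'linear'; 'vuv' targets are float 0/1."""
    weights = weights or {'mel': 1.0, 'mgc': 1.0, 'f0': 1.0, 'vuv': 1.0, 'linear': 1.0}
    return (weights['mel'] * F.l1_loss(pred['mel'], target['mel'])              # Mel spectrum
            + weights['mgc'] * F.l1_loss(pred['mgc'], target['mgc'])            # Mel cepstrum
            + weights['f0'] * F.l1_loss(pred['f0'], target['f0'])               # fundamental frequency
            + weights['vuv'] * F.binary_cross_entropy_with_logits(              # voiced/unvoiced flag
                  pred['vuv'], target['vuv'])
            + weights['linear'] * F.l1_loss(pred['linear'], target['linear']))  # post-net linear spectrum
```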
Optionally, the second neural network may be the post-processing network. As shown in fig. 3, the speech synthesis system inputs the second acoustic feature set into the post-processing network, obtains a linear spectrum after post-processing, reconstructs the linear spectrum through the Griffin-Lim algorithm, and finally obtains the current speech frame from the linear spectrum.
In summary, the scheme provided by the application improves the output of the Neural TTS decoder by adding acoustic features such as the Mel cepstrum, fundamental frequency and voiced/unvoiced flags, which are complementary to the conventional Mel spectrum and can better guide the generation of the linear spectrum. In addition, the scheme feeds the decoder output, after it passes through the delay network, back in as decoder input, so that historical information is better utilized, the current speech frame is predicted more accurately, and the sound quality and naturalness of the synthesized speech of low-frequency speakers are improved. Finally, by introducing acoustic features such as the Mel cepstrum, fundamental frequency and voiced/unvoiced flags on top of the conventional Neural TTS scheme and adopting a multi-task learning method, the stability of the synthesized speech can be further improved.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Through the above description of the embodiments, those skilled in the art can clearly understand that the speech synthesis method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
Example 2
According to an embodiment of the present application, there is also provided a speech synthesis apparatus for implementing the speech synthesis method, as shown in fig. 4, the apparatus 40 includes: an acquisition module 401 and a synthesis module 403.
The acquisition module 401 is configured to obtain a second acoustic feature set based on a text to be processed and a first acoustic feature set, where the first acoustic feature set includes acoustic features obtained by decoding during the synthesis of at least one historical speech frame, and the second acoustic feature set includes acoustic features obtained by decoding during the synthesis of the current speech frame; and the synthesis module 403 is configured to perform speech synthesis processing based on at least the second acoustic feature set to obtain the current speech frame.
It should be noted here that the acquiring module 401 and the synthesizing module 403 correspond to steps S202 to S204 in embodiment 1, and the two modules are the same as the corresponding steps in the implementation example and application scenario, but are not limited to the disclosure of the first embodiment. It should be noted that the modules described above as part of the apparatus may be run in the computer terminal 10 provided in the first embodiment.
Optionally, the second acoustic feature set includes a first component and a second component, where the first component includes a Mel-spectrum acoustic feature, and the second component includes at least one of the following: a Mel-cepstrum acoustic feature, a fundamental-frequency acoustic feature, and a voiced/unvoiced acoustic feature.
In an alternative embodiment, the speech synthesis apparatus further comprises: a first obtaining module. The first obtaining module is used for obtaining a first acoustic feature set through a first neural network before obtaining a second acoustic feature set based on a text to be processed and the first acoustic feature set, wherein the first neural network is used for carrying out delay processing on the first acoustic feature set so that the first acoustic feature set becomes a reference factor of a synthesis process of a current speech frame.
In an alternative embodiment, the speech synthesis apparatus further comprises: and an input module. The input module is used for inputting the second acoustic feature set to the first neural network after the second acoustic feature set is obtained based on the text to be processed and the first acoustic feature set, so that the second acoustic feature set is integrated as a reference factor of the synthesis process of the next speech frame.
In an alternative embodiment, the synthesis module comprises: the device comprises a first processing module and a second processing module. The first processing module is used for performing multi-task learning on the acoustic features contained in the second acoustic feature set to obtain input parameters of the second neural network and outputting a linear spectrum through the second neural network; and the second processing module is used for reconstructing the linear spectrum to obtain the current voice frame.
In an alternative embodiment, the synthesis module comprises: a third processing module and a fourth processing module. The third processing module is used for performing multi-task learning on the acoustic features contained in the first acoustic feature set and the second acoustic feature set to obtain input parameters of the second neural network, and outputting a linear spectrum through the second neural network; and the fourth processing module is used for reconstructing the linear spectrum to obtain the current voice frame.
Optionally, the speech synthesis system is applied to a speech synthesis scene in which the audio of the speaker is in a low frequency band.
Example 3
According to an embodiment of the present application, there is also provided a speech synthesis system for implementing the speech synthesis method, the system including: a processor and a memory.
The memory is connected to the processor and is configured to provide the processor with instructions for the following processing steps: obtaining a second acoustic feature set based on a text to be processed and a first acoustic feature set, wherein the first acoustic feature set comprises acoustic features obtained by decoding during the synthesis of at least one historical speech frame, and the second acoustic feature set comprises acoustic features obtained by decoding during the synthesis of the current speech frame; and performing speech synthesis processing based on at least the second acoustic feature set to obtain the current speech frame.
In this way, speech synthesis is performed using a plurality of acoustic features: a second acoustic feature set is obtained based on the text to be processed and a first acoustic feature set, and speech synthesis processing is then performed based on at least the second acoustic feature set to obtain the current speech frame, where the first acoustic feature set comprises the acoustic features obtained by decoding during the synthesis of at least one historical speech frame, and the second acoustic feature set comprises the acoustic features obtained by decoding during the synthesis of the current speech frame.
It is easy to see that guiding the generation of the current speech frame with a plurality of historical speech frames can improve the sound quality and naturalness of the synthesized speech of low-frequency speakers. The scheme provided by the application therefore achieves the purpose of synthesizing speech, realizes the technical effect of improving the speech synthesis effect, and solves the technical problem in the prior art that using a single historical speech frame to generate the current speech frame leads to a poor speech synthesis effect.
Example 4
The embodiment of the application can provide a computer terminal, and the computer terminal can be any one computer terminal device in a computer terminal group. Optionally, in this embodiment, the computer terminal may also be replaced with a terminal device such as a mobile terminal.
Optionally, in this embodiment, the computer terminal may be located in at least one network device of a plurality of network devices of a computer network.
In this embodiment, the computer terminal may execute the program code of the following steps of the speech synthesis method: obtaining a second acoustic feature set based on a text to be processed and a first acoustic feature set, wherein the first acoustic feature set comprises acoustic features obtained by decoding during the synthesis of at least one historical speech frame, and the second acoustic feature set comprises acoustic features obtained by decoding during the synthesis of the current speech frame; and performing speech synthesis processing based on at least the second acoustic feature set to obtain the current speech frame.
Optionally, fig. 5 is a block diagram of a computer terminal according to an embodiment of the present application. As shown in fig. 5, the computer terminal 10 may include: one or more processors 502 (only one of which is shown), memory 504, and a peripheral interface 506.
The memory may be configured to store software programs and modules, such as program instructions/modules corresponding to the speech synthesis method and apparatus in the embodiments of the present application, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, so as to implement the speech synthesis method. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memories may further include a memory located remotely from the processor, which may be connected to the terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor can call the information and application programs stored in the memory through the transmission device to perform the following steps: obtaining a second acoustic feature set based on a text to be processed and a first acoustic feature set, wherein the first acoustic feature set comprises acoustic features obtained by decoding during the synthesis of at least one historical speech frame, and the second acoustic feature set comprises acoustic features obtained by decoding during the synthesis of the current speech frame; and performing speech synthesis processing based on at least the second acoustic feature set to obtain the current speech frame.
Optionally, the processor may further execute the program code of the following steps: before a second acoustic feature set is obtained based on a text to be processed and the first acoustic feature set, the first acoustic feature set is obtained through a first neural network, wherein the first neural network is used for carrying out delay processing on the first acoustic feature set, so that the first acoustic feature set becomes a reference factor of a synthesis process of a current speech frame.
Optionally, the processor may further execute the program code of the following steps: after obtaining a second acoustic feature set based on the text to be processed and the first acoustic feature set, inputting the second acoustic feature set into the first neural network, so that the second acoustic feature set serves as a reference factor in the synthesis process of the next speech frame.
Optionally, the processor may further execute the program code of the following steps: performing multi-task learning on the acoustic features contained in the second acoustic feature set to obtain input parameters of a second neural network, and outputting a linear spectrum through the second neural network; and reconstructing the linear spectrum to obtain the current voice frame.
Optionally, the processor may further execute the program code of the following steps: performing multi-task learning on the acoustic features contained in the first acoustic feature set and the second acoustic feature set to obtain input parameters of a second neural network, and outputting a linear spectrum through the second neural network; and reconstructing the linear spectrum to obtain the current voice frame.
It can be understood by those skilled in the art that the structure shown in fig. 5 is only an illustration, and the computer terminal may also be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 5 does not limit the structure of the above electronic device. For example, the computer terminal 10 may also include more or fewer components (e.g., a network interface, a display device, etc.) than shown in fig. 5, or have a different configuration from that shown in fig. 5.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
Example 5
Embodiments of the present application also provide a storage medium. Optionally, in this embodiment, the storage medium may be configured to store a program code executed by the speech synthesis method provided in the first embodiment.
Optionally, in this embodiment, the storage medium may be located in any one of computer terminals in a computer terminal group in a computer network, or in any one of mobile terminals in a mobile terminal group.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: obtaining a second acoustic feature set based on a text to be processed and a first acoustic feature set, wherein the first acoustic feature set comprises acoustic features obtained by decoding during the synthesis of at least one historical speech frame, and the second acoustic feature set comprises acoustic features obtained by decoding during the synthesis of the current speech frame; and performing speech synthesis processing based on at least the second acoustic feature set to obtain the current speech frame.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: before a second acoustic feature set is obtained based on a text to be processed and the first acoustic feature set, the first acoustic feature set is obtained through a first neural network, wherein the first neural network is used for carrying out delay processing on the first acoustic feature set, so that the first acoustic feature set becomes a reference factor of a synthesis process of a current speech frame.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: after obtaining a second acoustic feature set based on the text to be processed and the first acoustic feature set, inputting the second acoustic feature set into the first neural network, so that the second acoustic feature set serves as a reference factor in the synthesis process of the next speech frame.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: performing multi-task learning on the acoustic features contained in the second acoustic feature set to obtain input parameters of a second neural network, and outputting a linear spectrum through the second neural network; and reconstructing the linear spectrum to obtain the current voice frame.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: performing multi-task learning on the acoustic features contained in the first acoustic feature set and the second acoustic feature set to obtain input parameters of a second neural network, and outputting a linear spectrum through the second neural network; and reconstructing the linear spectrum to obtain the current voice frame.
Example 6
According to an embodiment of the present application, there is also provided a speaker device for implementing the above speech synthesis method, the speaker device being configured to perform the following processing steps:
obtaining a second acoustic feature set based on a text to be processed and a first acoustic feature set, wherein the first acoustic feature set comprises acoustic features obtained by decoding during the synthesis of at least one historical speech frame, and the second acoustic feature set comprises acoustic features obtained by decoding during the synthesis of the current speech frame;
and performing speech synthesis processing based on at least the second acoustic feature set to obtain the current speech frame.
Optionally, the second acoustic feature set includes a first component and a second component, where the first component includes a Mel-spectrum acoustic feature, and the second component includes at least one of the following: a Mel-cepstrum acoustic feature, a fundamental-frequency acoustic feature, and a voiced/unvoiced acoustic feature.
In this way, speech synthesis is performed using a plurality of acoustic features: a second acoustic feature set is obtained based on the text to be processed and a first acoustic feature set, and speech synthesis processing is then performed based on at least the second acoustic feature set to obtain the current speech frame, where the first acoustic feature set comprises the acoustic features obtained by decoding during the synthesis of at least one historical speech frame, and the second acoustic feature set comprises the acoustic features obtained by decoding during the synthesis of the current speech frame.
It is easy to see that guiding the generation of the current speech frame with a plurality of historical speech frames can improve the sound quality and naturalness of the synthesized speech of low-frequency speakers. The scheme provided by the application therefore achieves the purpose of synthesizing speech, realizes the technical effect of improving the speech synthesis effect, and solves the technical problem in the prior art that using a single historical speech frame to generate the current speech frame leads to a poor speech synthesis effect.
In an optional embodiment, the speaker device further acquires the first acoustic feature set through a first neural network, where the first neural network is configured to perform delay processing on the first acoustic feature set, so that the first acoustic feature set becomes a reference factor in the synthesis process of the current speech frame.
In addition, after obtaining the second acoustic feature set based on the text to be processed and the first acoustic feature set, the speaker device inputs the second acoustic feature set into the first neural network, so that the second acoustic feature set serves as a reference factor in the synthesis process of the next speech frame.
In an optional embodiment, the speaker device may perform multi-task learning on the acoustic features contained in the second acoustic feature set to obtain the input parameters of a second neural network, output a linear spectrum through the second neural network, and finally reconstruct the linear spectrum to obtain the current speech frame.
In another optional embodiment, the speaker device may instead perform multi-task learning on the acoustic features contained in the first acoustic feature set and the second acoustic feature set to obtain the input parameters of the second neural network, output a linear spectrum through the second neural network, and finally reconstruct the linear spectrum to obtain the current speech frame.
Example 7
According to an embodiment of the present application, there is also provided a speech synthesis system for implementing the speech synthesis method, as shown in fig. 6, the system including: a pre-processing module 601, an encoding module 603, a decoding module 605, and a post-processing module 607.
The pre-processing module 601 is configured to pre-process a text to be processed to obtain a pre-processed text; the encoding module 603 is configured to encode the pre-processed text and input the encoding result to the decoding module through an attention mechanism; the decoding module 605 is configured to decode the encoding result to obtain a plurality of acoustic features corresponding to the text to be processed, where the plurality of acoustic features include a first acoustic feature set obtained by decoding during the synthesis of at least one historical speech frame and a second acoustic feature set obtained by decoding during the synthesis of the current speech frame; and the post-processing module 607 is configured to perform speech synthesis processing based on at least the second acoustic feature set to obtain the current speech frame.
Optionally, the second acoustic feature set includes a first component and a second component, where the first component includes a Mel-spectrum acoustic feature, and the second component includes at least one of the following: a Mel-cepstrum acoustic feature, a fundamental-frequency acoustic feature, and a voiced/unvoiced acoustic feature. The second component is complementary to the first component and can better guide the generation of a linear spectrum. It is easy to see that introducing acoustic parameters such as the Mel-cepstrum, fundamental-frequency and voiced/unvoiced acoustic features on top of the conventional Neural TTS scheme can improve the stability of the synthesized speech.
In an optional embodiment, the system is described with reference to the flowchart shown in fig. 3. The speech synthesis system first inputs the text to be processed into a pre-processing network (i.e., the pre-processing module) for pre-processing, inputs the pre-processed text into an encoder (i.e., the encoding module) for encoding, and feeds the encoding result into a decoder (i.e., the decoding module) through an attention mechanism for decoding, so as to obtain the acoustic features corresponding to the text to be processed. The speech synthesis system processes these acoustic features to obtain the first acoustic feature set. Optionally, the pre-processing network may pre-process the text to be processed using natural language processing methods. In addition, before pre-processing the text, the pre-processing network may first detect the language of the text to be processed and then pre-process it with the pre-processing method corresponding to that language, so that the decoder outputs more accurate acoustic features and the speech synthesis effect is further improved.
After obtaining the second acoustic feature set, the speech synthesis system processes it through a post-processing network (i.e., the post-processing module) to obtain a linear spectrum, and then reconstructs the current speech frame from the linear spectrum through the Griffin-Lim algorithm, as shown in fig. 3. The post-processing network is a neural network that post-processes the acoustic feature set; the post-processing corresponds to the pre-processing, and in the field of speech synthesis it may include processing such as language modeling, decoding and error handling of the speech.
In an optional embodiment, before obtaining the second acoustic feature set based on the text to be processed and the first acoustic feature set, the speech synthesis system acquires the first acoustic feature set through a first neural network, where the first neural network is configured to perform delay processing on the first acoustic feature set, so that the first acoustic feature set becomes a reference factor of a synthesis process of a current speech frame. Alternatively, the first neural network may be a delay network as in fig. 3.
Further, after obtaining the second acoustic feature set based on the text to be processed and the first acoustic feature set, the speech synthesis system inputs the second acoustic feature set into the first neural network, so that the second acoustic feature set serves as a reference factor in the synthesis process of the next speech frame. After the second acoustic feature set is obtained, the speech synthesis system performs multi-task learning on the acoustic features contained in the second acoustic feature set, or on the acoustic features contained in both the first and the second acoustic feature sets, to obtain the input parameters of a second neural network, outputs a linear spectrum through the second neural network, and finally reconstructs the linear spectrum to obtain the current speech frame.
Optionally, the second neural network may be the post-processing network. As shown in FIG. 3, the speech synthesis system inputs the second acoustic feature set into the post-processing network to obtain a linear spectrum, and then reconstructs the linear spectrum with the Griffin-Lim algorithm to obtain the current speech frame.
As can be seen from the above, the scheme performs speech synthesis using a plurality of acoustic features: a second acoustic feature set is obtained based on the text to be processed and a first acoustic feature set, and speech synthesis processing is then performed based on at least the second acoustic feature set to obtain the current speech frame, where the first acoustic feature set contains the acoustic features decoded during the synthesis of at least one historical speech frame and the second acoustic feature set contains the acoustic features decoded during the synthesis of the current speech frame.
It is easy to note that using at least one historical speech frame to guide the generation of the current speech frame improves the sound quality and naturalness of the synthesized speech for speakers whose audio lies in the low frequency band. The scheme provided by the present application therefore achieves the purpose of synthesizing speech, realizes the technical effect of improving the speech synthesis effect, and solves the technical problem of the poor speech synthesis effect that arises in the prior art when a historical speech frame is used to generate the current speech frame.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the part of the technical solution of the present application that in essence contributes beyond the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.
Claims (13)
1. A method of speech synthesis, comprising:
obtaining a second acoustic feature set based on a text to be processed and a first acoustic feature set, wherein the first acoustic feature set comprises: acoustic features obtained by decoding during the synthesis of at least one historical speech frame, and the second acoustic feature set comprises: acoustic features obtained by decoding during the synthesis of the current speech frame;
and performing speech synthesis processing based on at least the second acoustic feature set to obtain the current speech frame.
2. The method of claim 1, wherein the second acoustic feature set comprises: a first component and a second component, wherein the first component comprises: a mel-spectrum acoustic feature, and the second component comprises at least one of: a mel-cepstrum acoustic feature, a fundamental-frequency acoustic feature, and a voiced/unvoiced acoustic feature.
3. The method of claim 1, further comprising:
and acquiring the first acoustic feature set through a first neural network, wherein the first neural network is used for carrying out time delay processing on the first acoustic feature set so that the first acoustic feature set becomes a reference factor of the synthesis process of the current speech frame.
4. The method of claim 3, after deriving the second set of acoustic features based on the text to be processed and the first set of acoustic features, further comprising:
inputting the second acoustic feature set to the first neural network so that the second acoustic feature set is integrated as a reference factor of a synthesis process of a next speech frame.
5. The method of claim 1, wherein performing a speech synthesis process based on at least the second set of acoustic features to obtain the current speech frame comprises:
performing multi-task learning on the acoustic features contained in the second acoustic feature set to obtain input parameters of a second neural network, and outputting a linear spectrum through the second neural network;
and reconstructing the linear spectrum to obtain the current speech frame.
6. The method of claim 1, wherein performing a speech synthesis process based on at least the second set of acoustic features to obtain the current speech frame comprises:
obtaining input parameters of a second neural network by performing multi-task learning on the acoustic features contained in the first acoustic feature set and the second acoustic feature set, and outputting a linear spectrum through the second neural network;
and reconstructing the linear spectrum to obtain the current speech frame.
7. The method of claim 1, wherein the method is applied to a speech synthesis scenario in which the speaker's audio lies in a low frequency band.
8. A speech synthesis apparatus, comprising:
an obtaining module, configured to obtain a second acoustic feature set based on a text to be processed and a first acoustic feature set, wherein the first acoustic feature set comprises: acoustic features obtained by decoding during the synthesis of at least one historical speech frame, and the second acoustic feature set comprises: acoustic features obtained by decoding during the synthesis of the current speech frame;
and a synthesis module, configured to perform speech synthesis processing based on at least the second acoustic feature set to obtain the current speech frame.
9. A storage medium comprising a stored program, wherein the apparatus on which the storage medium is located is controlled to perform the speech synthesis method according to any one of claims 1 to 7 when the program is executed.
10. A processor, characterized in that the processor is configured to run a program, wherein the program is configured to perform the speech synthesis method according to any one of claims 1 to 7 when running.
11. A smart speaker, wherein the smart speaker is configured to perform the following processing steps:
obtaining a second acoustic feature set based on a text to be processed and a first acoustic feature set, wherein the first acoustic feature set comprises: acoustic features obtained by decoding during the synthesis of at least one historical speech frame, and the second acoustic feature set comprises: acoustic features obtained by decoding during the synthesis of the current speech frame;
and performing speech synthesis processing based on at least the second acoustic feature set to obtain the current speech frame.
12. A speech synthesis system, comprising:
a processor; and
a memory coupled to the processor for providing instructions to the processor for processing the following processing steps:
obtaining a second acoustic feature set based on a text to be processed and a first acoustic feature set, wherein the first acoustic feature set comprises: acoustic features obtained by decoding during the synthesis of at least one historical speech frame, and the second acoustic feature set comprises: acoustic features obtained by decoding during the synthesis of the current speech frame;
and performing speech synthesis processing based on at least the second acoustic feature set to obtain the current speech frame.
13. A speech synthesis system, comprising:
the preprocessing module is used for preprocessing the text to be processed to obtain a preprocessed text;
the encoding module is used for encoding the preprocessed text and inputting an encoding result into the decoding module through an attention mechanism;
the decoding module is configured to decode the encoding result to obtain a plurality of acoustic features corresponding to the text to be processed, wherein the plurality of acoustic features comprise: a first acoustic feature set obtained by decoding during the synthesis of at least one historical speech frame, and a second acoustic feature set obtained by decoding during the synthesis of the current speech frame;
and the post-processing module is configured to perform speech synthesis processing based on at least the second acoustic feature set to obtain the current speech frame.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910675961.5A CN112289297B (en) | 2019-07-25 | 2019-07-25 | Speech synthesis method, device and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112289297A true CN112289297A (en) | 2021-01-29 |
CN112289297B CN112289297B (en) | 2024-08-02 |
Family
ID=74418858
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910675961.5A Active CN112289297B (en) | 2019-07-25 | 2019-07-25 | Speech synthesis method, device and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112289297B (en) |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5621858A (en) * | 1992-05-26 | 1997-04-15 | Ricoh Corporation | Neural network acoustic and visual speech recognition system training method and apparatus |
US8527276B1 (en) * | 2012-10-25 | 2013-09-03 | Google Inc. | Speech synthesis using deep neural networks |
US20160343366A1 (en) * | 2015-05-19 | 2016-11-24 | Google Inc. | Speech synthesis model selection |
WO2017114201A1 (en) * | 2015-12-31 | 2017-07-06 | 阿里巴巴集团控股有限公司 | Method and device for executing setting operation |
CN108492818A (en) * | 2018-03-22 | 2018-09-04 | 百度在线网络技术(北京)有限公司 | Conversion method, device and the computer equipment of Text To Speech |
CN109036371A (en) * | 2018-07-19 | 2018-12-18 | 北京光年无限科技有限公司 | Audio data generation method and system for speech synthesis |
CN109036375A (en) * | 2018-07-25 | 2018-12-18 | 腾讯科技(深圳)有限公司 | Phoneme synthesizing method, model training method, device and computer equipment |
CN109065033A (en) * | 2018-09-19 | 2018-12-21 | 华南理工大学 | A kind of automatic speech recognition method based on random depth time-delay neural network model |
CN109147758A (en) * | 2018-09-12 | 2019-01-04 | 科大讯飞股份有限公司 | A kind of speaker's sound converting method and device |
CN109147818A (en) * | 2018-10-30 | 2019-01-04 | Oppo广东移动通信有限公司 | Acoustic feature extracting method, device, storage medium and terminal device |
CN109473104A (en) * | 2018-11-07 | 2019-03-15 | 苏州思必驰信息科技有限公司 | Speech recognition network delay optimization method and device |
CN109767755A (en) * | 2019-03-01 | 2019-05-17 | 广州多益网络股份有限公司 | A kind of phoneme synthesizing method and system |
CN109903750A (en) * | 2019-02-21 | 2019-06-18 | 科大讯飞股份有限公司 | A kind of audio recognition method and device |
Non-Patent Citations (4)
Title |
---|
GE Yongkan; YU Fengqin: "Improved speech synthesis algorithm based on a harmonic-plus-noise excitation model", Computer Engineering, no. 12 *
WANG Yonghe et al.: "Research on Mongolian speech recognition based on TDNN-FSMN", Journal of Chinese Information Processing, vol. 32, no. 9, pages 28-29 *
QIU Zeyu; QU Dan; ZHANG Lianhai: "End-to-end speech synthesis method based on WaveNet", Journal of Computer Applications, no. 05, 21 January 2019 (2019-01-21) *
LEI Guanjun: "HMM-based mixed-excitation English text-to-speech system", Journal of Jiangnan University (Natural Science Edition), no. 06 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||