CN112951202A - Speech synthesis method, apparatus, electronic device and program product - Google Patents
Speech synthesis method, apparatus, electronic device and program product
- Publication number
- CN112951202A (application number CN202110264700.1A)
- Authority
- CN
- China
- Prior art keywords: data, sampling, nonlinear, voice, speech synthesis
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/047—Architecture of speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Telephonic Communication Services (AREA)
Abstract
According to the speech synthesis method, apparatus, electronic device and program product provided by the present disclosure, feature sampling data of acoustic feature data at a plurality of sampling moments are acquired; prediction processing is performed simultaneously on the feature sampling data at the plurality of sampling moments by using a speech synthesis network, so as to obtain linear prediction data and nonlinear prediction data for any two target sampling moments among the plurality of sampling moments; and the speech synthesis data for the two target sampling moments are determined according to their linear prediction data and nonlinear prediction data.
Description
Technical Field
Embodiments of the present disclosure relate to streaming media data processing technologies, and in particular, to a speech synthesis method, apparatus, electronic device, and program product.
Background
In speech technology, the quality of the vocoder determines the quality of the synthesized speech. With the development of deep learning, it has become possible to use neural networks to improve vocoder quality.
The LPCNet vocoder combines a neural network with Linear Predictive Coding (LPC). On the basis of a WaveRNN network, it decomposes each sampling value into a linear part and a nonlinear part: the linear part is output through linear prediction, and the nonlinear part is given by the neural network, so that the sampling values in the vocoder are obtained.
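For illustration only, the decomposition described above can be expressed in code as follows. This is a minimal sketch assuming a 16th-order predictor; the coefficient values and the zero residual are placeholders, not values or structures taken from the patent.

```python
# Minimal sketch of the linear/nonlinear decomposition described above, assuming a
# 16th-order predictor. Coefficient values and the zero residual are placeholders.
import numpy as np

def linear_part(s_history, a):
    # p[m] = sum_k a[k] * s[m-1-k], computed from previously synthesized samples
    return float(np.dot(a, s_history[::-1]))  # most recent sample first

a = np.zeros(16)           # LPC coefficients (derived from the acoustic features in practice)
s_history = np.zeros(16)   # the 16 previously synthesized samples s[m-16] ... s[m-1]

p_m = linear_part(s_history, a)  # linear part, obtained by linear prediction
e_m = 0.0                        # nonlinear part, given by the neural network
s_m = p_m + e_m                  # sampling value at moment m
```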
This approach can effectively ensure the speech quality of the synthesized speech, but the complexity and computation amount of the LPCNet vocoder are still large, so speech synthesis requires considerable operation time and computing resources, which is unfavorable for practical application.
Disclosure of Invention
Embodiments of the present disclosure provide a speech synthesis method, apparatus, electronic device and program product.
In one aspect, an embodiment of the present disclosure provides a speech synthesis method, including:
acquiring feature sampling data of acoustic feature data at a plurality of sampling moments;
simultaneously performing prediction processing on the feature sampling data at the plurality of sampling moments by using a speech synthesis network, so as to obtain linear prediction data and nonlinear prediction data for any two target sampling moments among the plurality of sampling moments;
and determining the speech synthesis data of the two target sampling moments according to the linear prediction data and the nonlinear prediction data of the two target sampling moments.
In another aspect, an embodiment of the present disclosure provides a speech synthesis apparatus, including:
the acquisition module is used for acquiring feature sampling data of the acoustic feature data at a plurality of sampling moments;
the processing module is used for simultaneously predicting the characteristic sampling data of the plurality of sampling moments by utilizing a voice synthesis network to obtain linear prediction data and nonlinear prediction data of any two target sampling moments in the plurality of sampling moments;
and the synthesis module is used for determining the voice synthesis data of the two target sampling moments according to the linear prediction data and the nonlinear prediction data of the two target sampling moments.
In yet another aspect, an embodiment of the present disclosure provides an electronic device, including: a memory and a processor;
the memory is to store program instructions;
the processor is configured to invoke the program instructions in the memory to perform the method described in any of the foregoing aspects.
In yet another aspect, embodiments of the present disclosure provide a computer-readable storage medium having a computer program stored thereon; when executed, the computer program implements the method described in any of the foregoing aspects.
In a final aspect, embodiments of the present disclosure provide a computer program product comprising a computer program that, when executed by a processor, performs the steps of the method described in any of the foregoing aspects.
According to the speech synthesis method, apparatus, electronic device and program product provided by the embodiments of the present disclosure, feature sampling data of acoustic feature data at a plurality of sampling moments are acquired; prediction processing is performed simultaneously on the feature sampling data at the plurality of sampling moments by using a speech synthesis network, so as to obtain linear prediction data and nonlinear prediction data for any two target sampling moments among the plurality of sampling moments; and the speech synthesis data for the two target sampling moments are determined according to their linear prediction data and nonlinear prediction data. Because two target sampling moments are processed in one pass, the real-time rate of speech synthesis is improved while the synthesis quality is maintained.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a schematic diagram of a network system architecture according to the present disclosure;
fig. 2 is a schematic flow chart of a speech synthesis method according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a speech synthesis network according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present disclosure;
FIG. 5 is a block diagram illustrating an electronic device in accordance with an example embodiment.
With the foregoing drawings in mind, certain embodiments of the disclosure have been shown and described in more detail below. These drawings and written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
First, terms related to embodiments of the present disclosure are explained:
the vocoder is a processing module which analyzes the voice signal, extracts the acoustic characteristic data of the voice signal, encodes and encrypts the data to obtain the matching with the channel, transmits the data to the receiving end through the information channel, and restores the original voice waveform according to the received acoustic characteristic data.
LPCNet is a network that combines digital signal processing with neural network techniques and applies them to a vocoder; it can synthesize high-quality speech in real time on an ordinary CPU.
The speech synthesis method provided by the embodiments of the present disclosure can be applied to the network system shown in Fig. 1, which is a schematic diagram of a network system architecture according to the present disclosure. As shown in Fig. 1, the network system includes a speech synthesis apparatus 1 and an electronic device 2.
The speech synthesis apparatus 1 of the present disclosure may be installed or integrated in the electronic device 2, and the electronic device 2 may specifically be an intelligent terminal, such as a smartphone, a tablet computer, a desktop computer, or another device capable of performing data operation processing according to preset operation logic.
The electronic device 2 may acquire the speech text to be synthesized from a network and analyze it to obtain the corresponding acoustic feature data and the feature sampling data of that acoustic feature data. The speech synthesis apparatus 1 then obtains the feature sampling data from the electronic device 2 and performs the corresponding processing to obtain speech synthesis data, which may be returned to the electronic device 2 for use and playback.
It should be noted that the electronic device 2 shown in Fig. 1 may be applicable to different network standards, for example, Global System for Mobile communication (GSM), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Time Division-Synchronous Code Division Multiple Access (TD-SCDMA), Long Term Evolution (LTE), and future 5G networks. Optionally, the electronic device of the network system may be a system operating in an Ultra-Reliable and Low-Latency Communications (URLLC) transmission scenario of a 5G communication system.
The electronic device may be a wireless terminal or a wired terminal. A wireless terminal may refer to a device that provides voice and/or other data connectivity to a user, a handheld device with wireless connection capability, or another processing device connected to a wireless modem. A wireless terminal may communicate with one or more core network devices via a Radio Access Network (RAN) and may exchange speech and/or data with the RAN; it may be a mobile terminal, such as a mobile telephone (or "cellular" telephone) or a computer with a mobile terminal, for example a portable, pocket-sized, handheld, computer-embedded, or vehicle-mounted mobile device. For another example, the wireless terminal may be a Personal Communication Service (PCS) phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA), or another such device. A wireless terminal may also be referred to as a system, a Subscriber Unit, a Subscriber Station, a Mobile Station, a Remote Station, a Remote Terminal, an Access Terminal, a User Terminal, a User Agent, or User Equipment, which is not limited here. Optionally, the electronic device may also be a smart watch, a tablet computer, or the like.
Specific application scenarios of the embodiments of the present disclosure include, for example, speech synthesis in intelligent voice broadcasting, navigation, smart speakers, and voice assistants. As mentioned above, the vocoder is a key component in these speech synthesis scenarios and plays a decisive role in speech quality.
With the continuous development of deep learning, it has become possible to use neural networks to improve vocoder quality, and neural-network-based vocoders represented by WaveNet have emerged accordingly. However, the network structure of WaveNet is complex, and using it for speech synthesis requires a very large amount of computation, which makes it difficult to apply the WaveNet vocoder to the aforementioned electronic devices (such as mobile terminals).
With the advancement of technology, the LPCNet vocoder came into being: a lightweight vocoder that combines a neural network with Linear Predictive Coding (LPC). On the basis of a WaveRNN network, each sampling value in the speech synthesis data is decomposed into a linear part and a nonlinear part; the linear part is output through linear prediction and the nonlinear part is given by the neural network, so that the sampling values in the vocoder are obtained.
However, although the existing LPCNet vocoder can effectively ensure the speech quality of speech synthesis when applied to such electronic devices, its complexity and computation amount are still large, so speech synthesis requires considerable operation time and computing resources, which is unfavorable for practical application.
To address this problem, the present disclosure improves the structure and processing mode of the LPCNet vocoder so that the speech synthesis data of two sampling points can be predicted simultaneously during synthesis; that is, the real-time rate of processing is significantly improved while the quality of the synthesized speech is not reduced.
The following describes technical solutions of embodiments of the present disclosure and how to solve the above technical problems in detail with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present disclosure will be described below with reference to the accompanying drawings.
In a first aspect, referring to fig. 2, fig. 2 is a schematic flowchart of a speech synthesis method provided in an embodiment of the present disclosure. The speech synthesis method provided by the embodiment of the disclosure comprises the following steps:
Step 101: acquiring feature sampling data of acoustic feature data at a plurality of sampling moments.
Step 102: simultaneously performing prediction processing on the feature sampling data at the plurality of sampling moments by using a speech synthesis network, so as to obtain linear prediction data and nonlinear prediction data for any two target sampling moments among the plurality of sampling moments.
Step 103: determining the speech synthesis data of the two target sampling moments according to the linear prediction data and the nonlinear prediction data of the two target sampling moments.
It should be noted that the main execution body of the processing method provided by this example is the aforementioned speech synthesis apparatus, which can be installed in the aforementioned electronic device to process the feature sampling data of the acoustic feature data in the electronic device.
Specifically, the speech synthesis apparatus first obtains feature sampling data of the acoustic feature data. The acoustic feature data represent acoustic feature information of the source information of the speech to be synthesized. The source information may specifically be a piece of content information that is continuous in the time domain; for ease of processing, it is transformed from the time domain to the frequency domain, and acoustic feature extraction and related processing are performed on the transformed data to obtain the acoustic feature data referred to in the present disclosure. Generally, the obtained acoustic feature data consist of data for a plurality of speech frames.
For each speech frame, in order to make the processing result more accurate, the acoustic feature data of the frame are also sampled multiple times to obtain feature sampling data of that frame at a plurality of sampling moments.
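As an informal illustration, frame-level acoustic features can be expanded so that every sampling moment has its own feature data. The sketch below assumes 160 sampling moments per frame and a simple repetition strategy; both are illustrative assumptions, not values taken from the patent.

```python
# Hedged sketch: expand frame-level acoustic features to per-sampling-moment feature data.
# The frame length (160 sampling moments) and the repetition strategy are assumptions.
import numpy as np

def features_per_sampling_moment(frame_features, moments_per_frame=160):
    # frame_features: (num_frames, feature_dim) -> (num_frames * moments_per_frame, feature_dim)
    return np.repeat(frame_features, moments_per_frame, axis=0)

frames = np.random.randn(4, 20)                  # 4 speech frames, 20-dimensional features (illustrative)
per_moment = features_per_sampling_moment(frames)
print(per_moment.shape)                          # (640, 20): feature sampling data for 640 sampling moments
```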
Based on the obtained feature sampling data, the speech synthesis apparatus uses the speech synthesis network in an autoregressive processing mode to predict the data of two target sampling moments among the plurality of sampling moments, obtaining the corresponding prediction data.
Finally, the speech synthesis apparatus respectively performs synthesis processing on the linear prediction data and the nonlinear prediction data of the two target sampling moments to obtain the speech synthesis data of the two target sampling moments.
The speech synthesis method provided by the embodiments of the present disclosure can perform prediction processing simultaneously on feature sampling data at a plurality of sampling moments of the acoustic feature data to obtain speech synthesis data for any two target sampling moments among them, thereby greatly improving the real-time rate of speech synthesis while ensuring speech synthesis quality.
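For illustration, the overall flow of steps 101 to 103 can be sketched as a loop that advances two sampling moments at a time. In the sketch below, predict_linear() and predict_nonlinear() are hypothetical stand-ins for the linear prediction module and the speech synthesis network; their trivial bodies are placeholders, and the exact data dependencies follow the embodiment described later.

```python
# Hedged sketch of the two-samples-at-a-time synthesis loop (steps 101-103).
# predict_linear() and predict_nonlinear() are placeholder stubs, not the patent's implementation.
def predict_linear(features, synthesized, m):
    return 0.0, 0.0   # (Pm, Pm+1): linear prediction data from signal processing

def predict_nonlinear(features, synthesized, p_m, p_m1, m):
    return 0.0, 0.0   # (Em, Em+1): nonlinear prediction data from the neural network

def synthesize(feature_samples):
    synthesized = []                                  # speech synthesis data S0, S1, ...
    for m in range(0, len(feature_samples) - 1, 2):   # two target sampling moments per pass
        p_m, p_m1 = predict_linear(feature_samples, synthesized, m)
        e_m, e_m1 = predict_nonlinear(feature_samples, synthesized, p_m, p_m1, m)
        synthesized.append(p_m + e_m)                 # Sm   = Pm   + Em
        synthesized.append(p_m1 + e_m1)               # Sm+1 = Pm+1 + Em+1
    return synthesized

print(len(synthesize([0.0] * 320)))                   # 320 speech synthesis data values
```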
Fig. 3 is a schematic structural diagram of a speech synthesis network according to an embodiment of the present disclosure. As shown in Fig. 3, the speech synthesis network may specifically be an LPCNet-based network, which includes a frame rate sub-network and a sampling point sub-network.
The speech synthesis method may specifically include:
step 201, feature sampling data of the acoustic feature data at a plurality of sampling moments are obtained.
Similar to the previous embodiment, in this embodiment, the acoustic feature data is to be sampled a plurality of times to obtain feature sample data at a plurality of sampling instants.
Step 202: performing linear prediction processing on the feature sampling data at the plurality of sampling moments to obtain the linear speech data Pm at the mth sampling moment and the linear speech data Pm+1 at the (m+1)th sampling moment, respectively.
Specifically, taking the prediction of the linear speech data at the mth and (m+1)th sampling moments among the plurality of sampling moments as an example: first, the speech synthesis apparatus performs linear prediction processing on the feature sampling data of the plurality of sampling moments by using a linear prediction module to obtain the linear prediction coefficients; then, the historical speech synthesis data are combined with the linear prediction coefficients to obtain the linear speech data Pm at the mth sampling moment and the linear speech data Pm+1 at the (m+1)th sampling moment.
In determining the linear speech data Pm at the mth sampling moment, the historical speech synthesis data used are the speech synthesis data [Sm-16, Sm-15, ..., Sm-1] of the (m-16)th to (m-1)th sampling moments, which are combined with the linear prediction coefficients;
similarly, in determining the linear speech data Pm+1 at the (m+1)th sampling moment, the historical speech synthesis data used are the speech synthesis data [Sm-15, ..., Sm] of the (m-15)th to mth sampling moments, combined with the linear prediction coefficients.
The linear prediction processing here is implemented based on signal processing techniques.
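As an informal illustration of step 202, the sketch below computes Pm and Pm+1 with a 16th-order predictor over the windows [Sm-16, ..., Sm-1] and [Sm-15, ..., Sm]. Because Sm has not yet been synthesized at this point, the sketch approximates it by Pm; that approximation, and how the linear prediction coefficients are derived from the feature sampling data, are assumptions made for illustration only.

```python
# Hedged, numpy-only sketch of the linear prediction step 202 (16th-order predictor assumed).
import numpy as np

def predict_pm_pm1(lpc, s):
    # Pm uses the previously synthesized samples Sm-16 ... Sm-1
    pm = float(np.dot(lpc, s[-1:-17:-1]))            # most recent sample first
    # Pm+1 uses Sm-15 ... Sm; the still-unknown Sm is approximated here by Pm (assumption)
    window = np.concatenate(([pm], s[-1:-16:-1]))    # approx. Sm, then Sm-1 ... Sm-15
    pm1 = float(np.dot(lpc, window))
    return pm, pm1

lpc = np.zeros(16)   # linear prediction coefficients (derived from the acoustic features in practice)
s = np.zeros(100)    # previously synthesized speech data S0 ... Sm-1
print(predict_pm_pm1(lpc, s))
```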
Step 203: acquiring the speech synthesis data Sm-1 and the nonlinear speech data Em-1 at the (m-1)th sampling moment, and the speech synthesis data Sm-2 and the nonlinear speech data Em-2 at the (m-2)th sampling moment.
Specifically, the speech synthesis network in the speech synthesis apparatus performs data prediction in an autoregressive manner. Therefore, to predict the speech synthesis data at the mth and (m+1)th sampling moments, the speech synthesis apparatus first obtains the relevant data at the sampling moments immediately preceding the target sampling moments, namely the speech synthesis data Sm-1 and the nonlinear speech data Em-1 at the (m-1)th sampling moment, and the speech synthesis data Sm-2 and the nonlinear speech data Em-2 at the (m-2)th sampling moment.
Step 204: performing nonlinear prediction processing on the feature sampling data at the mth and (m+1)th sampling moments, the speech synthesis data Sm-1, the nonlinear speech data Em-1, the speech synthesis data Sm-2, the nonlinear speech data Em-2, the linear speech data Pm, and the linear speech data Pm+1, to obtain the nonlinear speech data Em at the mth sampling moment and the nonlinear speech data Em+1 at the (m+1)th sampling moment.
Specifically, step 204 predicts the nonlinear speech data in the speech synthesis data based on a neural network algorithm.
As shown in fig. 3, the speech synthesis network includes a frame rate sub-network and a sampling point sub-network.
Similar to the conventional LPCNet network, the frame rate sub-network includes a plurality of convolutional layers connected in series and a plurality of fully connected layers connected in series. The feature sampling data, including those at the mth and (m+1)th sampling moments, pass through these network layers in sequence, so that the vector f is finally obtained.
That is, the feature sampling data at the mth and (m+1)th sampling moments are input to the frame rate sub-network to obtain the output vector f.
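As an informal illustration, a frame rate sub-network of this kind can be sketched in PyTorch as stacked 1-D convolutions followed by fully connected layers that produce the conditioning vector f. The layer counts, widths, and activations below are assumptions chosen for illustration, not the configuration disclosed in Fig. 3.

```python
# Hedged PyTorch sketch of a frame rate sub-network: serial convolutional layers followed
# by serial fully connected layers producing the vector f. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class FrameRateSubNetwork(nn.Module):
    def __init__(self, feat_dim=20, hidden=128):
        super().__init__()
        self.conv1 = nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(hidden, hidden)
        self.fc2 = nn.Linear(hidden, hidden)

    def forward(self, features):
        # features: (batch, feat_dim, num_frames) -> f: (batch, num_frames, hidden)
        x = torch.tanh(self.conv2(torch.tanh(self.conv1(features))))
        x = x.transpose(1, 2)
        return torch.tanh(self.fc2(torch.tanh(self.fc1(x))))

f = FrameRateSubNetwork()(torch.randn(1, 20, 4))   # conditioning vector f for 4 frames
print(f.shape)                                     # torch.Size([1, 4, 128])
```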
Then, the above data are input into the sampling point sub-network in an autoregressive manner, so as to obtain the nonlinear speech data Em at the mth sampling moment and the nonlinear speech data Em+1 at the (m+1)th sampling moment.
That is, the speech synthesis data Sm-1, the nonlinear speech data Em-1, the speech synthesis data Sm-2, the nonlinear speech data Em-2, the linear speech data Pm+1, and the output vector f are input to the sampling point sub-network, which outputs the nonlinear speech data Em at the mth sampling moment and the nonlinear speech data Em+1 at the (m+1)th sampling moment.
Specifically, as shown in Fig. 3, unlike the prior art, the sampling point sub-network provided by the present disclosure shares part of its structural layers to process the data of two moments simultaneously, and then separates the processing results through an added mapping layer, thereby obtaining the nonlinear prediction data of the two moments.
More specifically, in the network shown in Fig. 3, the sampling point sub-network includes: a sampling layer, a mapping layer, a fully connected layer, and a classifier.
Accordingly, step 204 may further include:
inputting the speech synthesis data Sm-1, the nonlinear speech data Em-1, the speech synthesis data Sm-2, the nonlinear speech data Em-2, the linear speech data Pm+1, and the output vector f into the sampling layer, and outputting sampling data;
inputting the obtained sampling data into the mapping layer to perform data mapping on the sampling data, so as to obtain the sampling data corresponding to the mth sampling moment and the sampling data corresponding to the (m+1)th sampling moment;
inputting the sampling data corresponding to the mth sampling moment and the sampling data corresponding to the (m+1)th sampling moment respectively and sequentially into the fully connected layer and the classifier, so as to obtain the output sampling distribution at the mth sampling moment and the output sampling distribution at the (m+1)th sampling moment, respectively;
determining the nonlinear speech data Em at the mth sampling moment according to the sampling distribution at the mth sampling moment, and determining the nonlinear speech data Em+1 at the (m+1)th sampling moment according to the sampling distribution at the (m+1)th sampling moment.
In an optional embodiment, the speech synthesis network is an LPCNet network, and the sampling layer is a GRU layer in the LPCNet network. The GRU layer includes a first GRU layer and a second GRU layer, and the first GRU layer and the second GRU layer sample the data input to them at different sampling frequencies.
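As an informal illustration, the dual-output sampling point sub-network described above can be sketched as follows: a shared sampling (GRU) layer processes the inputs for both moments in one pass, a mapping layer separates the result into the data for the mth and (m+1)th moments, and a shared fully connected layer plus classifier produces a sampling distribution for each moment, from which Em and Em+1 are determined. All dimensions, the two GRU sizes, and the 256-way output below are illustrative assumptions, not the disclosed configuration.

```python
# Hedged PyTorch sketch of the dual-output sampling point sub-network (sampling layer =
# two GRU layers, then a mapping layer, a fully connected layer, and a classifier).
import torch
import torch.nn as nn

class SamplingPointSubNetwork(nn.Module):
    def __init__(self, in_dim=6 + 128, gru_a=384, gru_b=16, levels=256):
        super().__init__()
        self.gru_a = nn.GRU(in_dim, gru_a, batch_first=True)  # first GRU (sampling) layer
        self.gru_b = nn.GRU(gru_a, gru_b, batch_first=True)   # second GRU (sampling) layer
        self.mapping = nn.Linear(gru_b, 2 * gru_b)            # mapping layer: split into moments m and m+1
        self.fc = nn.Linear(gru_b, gru_b)                     # fully connected layer (shared)
        self.classifier = nn.Linear(gru_b, levels)            # classifier over quantized sample values

    def forward(self, x):
        # x: (batch, 1, in_dim) -- Sm-1, Em-1, Sm-2, Em-2, Pm, Pm+1 and the vector f, concatenated
        h, _ = self.gru_a(x)
        h, _ = self.gru_b(h)
        data_m, data_m1 = self.mapping(h).chunk(2, dim=-1)    # data for the mth and (m+1)th moments
        dist_m = torch.softmax(self.classifier(torch.relu(self.fc(data_m))), dim=-1)
        dist_m1 = torch.softmax(self.classifier(torch.relu(self.fc(data_m1))), dim=-1)
        return dist_m, dist_m1                                # sampling distributions; Em and Em+1 follow from these

net = SamplingPointSubNetwork()
dist_m, dist_m1 = net(torch.randn(1, 1, 134))
print(dist_m.shape, dist_m1.shape)                            # torch.Size([1, 1, 256]) each
```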
In the embodiments provided by the present disclosure, compared with the prediction process of the existing LPCNet network, the number of passes through the sampling layer needed to obtain the nonlinear prediction data at the plurality of sampling moments is halved, which can significantly improve the processing efficiency for the acoustic feature data of each speech frame and improve the real-time rate of speech synthesis.
In a second aspect, referring to fig. 4, fig. 4 is a schematic structural diagram of a speech synthesis apparatus provided in the present disclosure. The speech synthesis device comprises:
the acquiring module 10 is configured to acquire feature sampling data of the acoustic feature data at multiple sampling moments;
the processing module 20 is configured to perform prediction processing on the feature sample data at the multiple sampling moments simultaneously by using a speech synthesis network, so as to obtain linear prediction data and nonlinear prediction data of any two target sampling moments in the multiple sampling moments;
and a synthesis module 30, configured to determine speech synthesis data at the two target sampling moments according to the linear prediction data and the nonlinear prediction data at the two target sampling moments.
In an optional embodiment, the processing module 20 is specifically configured to:
performing linear prediction processing on the feature sampling data at the plurality of sampling moments to obtain the linear speech data Pm at the mth sampling moment and the linear speech data Pm+1 at the (m+1)th sampling moment, respectively;
acquiring the speech synthesis data Sm-1 and the nonlinear speech data Em-1 at the (m-1)th sampling moment, and the speech synthesis data Sm-2 and the nonlinear speech data Em-2 at the (m-2)th sampling moment;
and performing nonlinear prediction processing on the feature sampling data at the mth and (m+1)th sampling moments, the speech synthesis data Sm-1, the nonlinear speech data Em-1, the speech synthesis data Sm-2, the nonlinear speech data Em-2, the linear speech data Pm, and the linear speech data Pm+1, to obtain the nonlinear speech data Em at the mth sampling moment and the nonlinear speech data Em+1 at the (m+1)th sampling moment.
In an alternative embodiment, the speech synthesis network comprises a frame rate sub-network and a sampling point sub-network;
the processing module 20 is specifically configured to input the feature sampling data at the mth sampling moment and the (m+1)th sampling moment to the frame rate sub-network, so as to obtain an output vector f; and to input the speech synthesis data Sm-1, the nonlinear speech data Em-1, the speech synthesis data Sm-2, the nonlinear speech data Em-2, the linear speech data Pm+1, and the output vector f into the sampling point sub-network, and output the nonlinear speech data Em at the mth sampling moment and the nonlinear speech data Em+1 at the (m+1)th sampling moment.
In an alternative embodiment, the sampling point subnetwork comprises: the system comprises a sampling layer, a mapping layer, a full connection layer and a classifier;
the processing module 20 is specifically configured to input the speech synthesis data Sm-1, the nonlinear speech data Em-1, the speech synthesis data Sm-2, the nonlinear speech data Em-2, the linear speech data Pm+1, and the output vector f into the sampling layer and output sampling data; input the obtained sampling data into the mapping layer to perform data mapping on the sampling data, so as to obtain the sampling data corresponding to the mth sampling moment and the sampling data corresponding to the (m+1)th sampling moment; input the sampling data corresponding to the mth sampling moment and the sampling data corresponding to the (m+1)th sampling moment respectively and sequentially into the fully connected layer and the classifier, so as to obtain the output sampling distribution at the mth sampling moment and the output sampling distribution at the (m+1)th sampling moment, respectively; determine the nonlinear speech data Em at the mth sampling moment according to the sampling distribution at the mth sampling moment; and determine the nonlinear speech data Em+1 at the (m+1)th sampling moment according to the sampling distribution at the (m+1)th sampling moment.
In an optional embodiment, the speech synthesis network is an LPCNet network, and the sampling layer is a GRU layer in the LPCNet network.
In an alternative embodiment, the GRU layers include a first GRU layer and a second GRU layer;
and the first GRU layer and the second GRU layer adopt different sampling frequencies to sample data input into the GRU layer.
According to the speech synthesis method provided by the embodiments of the present disclosure, feature sampling data of acoustic feature data at a plurality of sampling moments are acquired; prediction processing is performed simultaneously on the feature sampling data at the plurality of sampling moments by using a speech synthesis network, so as to obtain linear prediction data and nonlinear prediction data for any two target sampling moments among the plurality of sampling moments; and the speech synthesis data for the two target sampling moments are determined according to their linear prediction data and nonlinear prediction data.
FIG. 5 is a block diagram illustrating an electronic device, which may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like, in accordance with an exemplary embodiment.
The apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the apparatus 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, streaming media, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operation mode, such as a photographing mode or a streaming mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed status of the device 800 and the relative positioning of components, such as the display and keypad of the device 800; the sensor assembly 814 may also detect a change in the position of the device 800 or a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in the temperature of the device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer-readable storage medium is also provided: when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the speech synthesis method described above.
A computer program product is also provided, comprising computer instructions for executing the method described above; its implementation principle and technical effects are similar to those described above and are not repeated here.
The present disclosure also provides the following embodiments:
Embodiment 1, a speech synthesis method, comprising:
acquiring feature sampling data of acoustic feature data at a plurality of sampling moments;
performing prediction processing on the feature sampling data of the multiple sampling moments by using a speech synthesis network at the same time to obtain linear prediction data and nonlinear prediction data of any two target sampling moments in the multiple sampling moments;
and determining the speech synthesis data of the two target sampling moments according to the linear prediction data and the nonlinear prediction data of the two target sampling moments.
Embodiment 2, according to the speech synthesis method of embodiment 1, the obtaining of the linear prediction data and the nonlinear prediction data of any two target sampling moments comprises:
performing linear prediction processing on the feature sampling data at the plurality of sampling moments to obtain the linear speech data Pm at the mth sampling moment and the linear speech data Pm+1 at the (m+1)th sampling moment, respectively;
acquiring the speech synthesis data Sm-1 and the nonlinear speech data Em-1 at the (m-1)th sampling moment, and the speech synthesis data Sm-2 and the nonlinear speech data Em-2 at the (m-2)th sampling moment;
and performing nonlinear prediction processing on the feature sampling data at the mth and (m+1)th sampling moments, the speech synthesis data Sm-1, the nonlinear speech data Em-1, the speech synthesis data Sm-2, the nonlinear speech data Em-2, the linear speech data Pm, and the linear speech data Pm+1, to obtain the nonlinear speech data Em at the mth sampling moment and the nonlinear speech data Em+1 at the (m+1)th sampling moment.
Embodiment 3, the speech synthesis method according to embodiment 2, the speech synthesis network comprising a frame rate sub-network and a sampling point sub-network;
the non-linear prediction process comprises:
inputting the feature sampling data of the mth sampling moment and the (m+1)th sampling moment into the frame rate sub-network to obtain an output vector f;
and inputting the speech synthesis data Sm-1, the nonlinear speech data Em-1, the speech synthesis data Sm-2, the nonlinear speech data Em-2, the linear speech data Pm+1, and the output vector f into the sampling point sub-network, and outputting the nonlinear speech data Em at the mth sampling moment and the nonlinear speech data Em+1 at the (m+1)th sampling moment.
Embodiment 4, according to the speech synthesis method of embodiment 3, the sampling point sub-network includes: a sampling layer, a mapping layer, a fully connected layer, and a classifier;
inputting the speech synthesis data Sm-1, the nonlinear speech data Em-1, the speech synthesis data Sm-2, the nonlinear speech data Em-2, the linear speech data Pm+1, and the output vector f into the sampling layer, and outputting sampling data;
inputting the obtained sampling data into the mapping layer to perform data mapping on the sampling data, so as to obtain the sampling data corresponding to the mth sampling moment and the sampling data corresponding to the (m+1)th sampling moment;
inputting the sampling data corresponding to the mth sampling moment and the sampling data corresponding to the (m+1)th sampling moment respectively and sequentially into the fully connected layer and the classifier, so as to obtain the output sampling distribution at the mth sampling moment and the output sampling distribution at the (m+1)th sampling moment, respectively;
determining the nonlinear speech data Em at the mth sampling moment according to the sampling distribution at the mth sampling moment, and determining the nonlinear speech data Em+1 at the (m+1)th sampling moment according to the sampling distribution at the (m+1)th sampling moment.
Embodiment 5, according to the speech synthesis method described in embodiment 4, the speech synthesis network is an LPCNet network, and the sampling layer is a GRU layer in the LPCNet network.
Embodiment 6 the method of speech synthesis of embodiment 5, wherein the GRU layers comprise a first GRU layer and a second GRU layer;
and the first GRU layer and the second GRU layer adopt different sampling frequencies to sample data input into the GRU layer.
Embodiment 7 is a speech synthesis apparatus including:
the acquisition module is used for acquiring feature sampling data of the acoustic feature data at a plurality of sampling moments;
the processing module is used for simultaneously predicting the characteristic sampling data of the plurality of sampling moments by utilizing a voice synthesis network to obtain linear prediction data and nonlinear prediction data of any two target sampling moments in the plurality of sampling moments;
and the synthesis module is used for determining the voice synthesis data of the two target sampling moments according to the linear prediction data and the nonlinear prediction data of the two target sampling moments.
Embodiment 8, according to the speech synthesis apparatus in embodiment 7, the processing module is specifically configured to:
performing linear prediction processing on the feature sampling data at the plurality of sampling moments to obtain the linear speech data Pm at the mth sampling moment and the linear speech data Pm+1 at the (m+1)th sampling moment, respectively;
acquiring the speech synthesis data Sm-1 and the nonlinear speech data Em-1 at the (m-1)th sampling moment, and the speech synthesis data Sm-2 and the nonlinear speech data Em-2 at the (m-2)th sampling moment;
and performing nonlinear prediction processing on the feature sampling data at the mth and (m+1)th sampling moments, the speech synthesis data Sm-1, the nonlinear speech data Em-1, the speech synthesis data Sm-2, the nonlinear speech data Em-2, the linear speech data Pm, and the linear speech data Pm+1, to obtain the nonlinear speech data Em at the mth sampling moment and the nonlinear speech data Em+1 at the (m+1)th sampling moment.
Embodiment 9, the speech synthesis apparatus according to embodiment 8, wherein the speech synthesis network comprises a frame rate sub-network and a sampling point sub-network;
the processing module is specifically configured to input the feature sampling data at the mth sampling moment and the (m+1)th sampling moment to the frame rate sub-network, so as to obtain an output vector f; and input the speech synthesis data Sm-1, the nonlinear speech data Em-1, the speech synthesis data Sm-2, the nonlinear speech data Em-2, the linear speech data Pm+1, and the output vector f into the sampling point sub-network, and output the nonlinear speech data Em at the mth sampling moment and the nonlinear speech data Em+1 at the (m+1)th sampling moment.
Embodiment 10, according to the speech synthesis apparatus of embodiment 9, the sampling point sub-network includes: a sampling layer, a mapping layer, a fully connected layer, and a classifier;
the processing module is specifically configured to input the speech synthesis data Sm-1, the nonlinear speech data Em-1, the speech synthesis data Sm-2, the nonlinear speech data Em-2, the linear speech data Pm+1, and the output vector f into the sampling layer and output sampling data; input the obtained sampling data into the mapping layer to perform data mapping on the sampling data, so as to obtain the sampling data corresponding to the mth sampling moment and the sampling data corresponding to the (m+1)th sampling moment; input the sampling data corresponding to the mth sampling moment and the sampling data corresponding to the (m+1)th sampling moment respectively and sequentially into the fully connected layer and the classifier, so as to obtain the output sampling distribution at the mth sampling moment and the output sampling distribution at the (m+1)th sampling moment, respectively; determine the nonlinear speech data Em at the mth sampling moment according to the sampling distribution at the mth sampling moment; and determine the nonlinear speech data Em+1 at the (m+1)th sampling moment according to the sampling distribution at the (m+1)th sampling moment.
Embodiment 11 and the speech synthesis apparatus according to embodiment 10, wherein the speech synthesis network is an LPCNet network, and the sampling layer is a GRU layer in the LPCNet network.
Embodiment 12, the speech synthesis apparatus of embodiment 11, the GRU layers comprising a first GRU layer and a second GRU layer;
and the first GRU layer and the second GRU layer adopt different sampling frequencies to sample data input into the GRU layer.
Embodiment 13, an electronic device, comprising: a memory and a processor;
the memory is to store program instructions;
the processor is configured to invoke program instructions in the memory to perform the method of any of embodiments 1-6.
Embodiment 14, a computer-readable storage medium having a computer program stored thereon; the computer program, when executed, implements the method of any of embodiments 1-6.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. The embodiments of the disclosure are intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
Claims (10)
1. A method of speech synthesis, comprising:
acquiring feature sampling data of acoustic feature data at a plurality of sampling moments;
performing prediction processing on the feature sampling data of the multiple sampling moments by using a speech synthesis network at the same time to obtain linear prediction data and nonlinear prediction data of any two target sampling moments in the multiple sampling moments;
and determining the speech synthesis data of the two target sampling moments according to the linear prediction data and the nonlinear prediction data of the two target sampling moments.
2. The speech synthesis method according to claim 1, wherein the obtaining linear prediction data and non-linear prediction data of any two target sampling moments in the plurality of sampling moments by performing prediction processing on the feature sampling data of the plurality of sampling moments simultaneously by using a speech synthesis network comprises:
performing linear prediction processing on the feature sampling data at a plurality of sampling moments to respectively obtain linear speech data Pm at the mth sampling moment and linear speech data Pm+1 at the (m+1)th sampling moment;
acquiring speech synthesis data Sm-1 and nonlinear speech data Em-1 at the (m-1)th sampling moment, and speech synthesis data Sm-2 and nonlinear speech data Em-2 at the (m-2)th sampling moment;
and carrying out nonlinear prediction processing on the feature sampling data at the mth sampling moment and the (m+1)th sampling moment, the speech synthesis data Sm-1, the nonlinear speech data Em-1, the speech synthesis data Sm-2, the nonlinear speech data Em-2, the linear speech data Pm and the linear speech data Pm+1, to obtain the nonlinear speech data Em at the mth sampling moment and the nonlinear speech data Em+1 at the (m+1)th sampling moment.
3. The speech synthesis method of claim 2, wherein the speech synthesis network comprises a frame rate sub-network and a sampling point sub-network;
the non-linear prediction process comprises:
inputting feature sampling data of the mth sampling moment and the (m+1)th sampling moment into the frame rate sub-network to obtain an output vector f;
and inputting the speech synthesis data Sm-1, the nonlinear speech data Em-1, the speech synthesis data Sm-2, the nonlinear speech data Em-2, the linear speech data Pm+1 and the output vector f into the sampling point sub-network, and outputting the nonlinear speech data Em at the mth sampling moment and the nonlinear speech data Em+1 at the (m+1)th sampling moment.
4. A speech synthesis method according to claim 3, characterised in that the sampling point sub-network comprises: a sampling layer, a mapping layer, a fully connected layer and a classifier;
inputting the speech synthesis data Sm-1, the nonlinear speech data Em-1, the speech synthesis data Sm-2, the nonlinear speech data Em-2, the linear speech data Pm+1 and the output vector f into the sampling layer, and outputting sampling data;
inputting the obtained sampling data into the mapping layer to perform data mapping on the sampling data to obtain sampling data corresponding to the mth sampling moment and sampling data corresponding to the (m+1)th sampling moment;
respectively and sequentially inputting the sampling data corresponding to the mth sampling moment and the sampling data corresponding to the (m+1)th sampling moment to the fully connected layer and the classifier, and respectively obtaining the output sampling distribution at the mth sampling moment and the output sampling distribution at the (m+1)th sampling moment;
determining nonlinear speech data Em at the mth sampling moment according to the sampling distribution at the mth sampling moment; and determining the nonlinear speech data Em+1 at the (m+1)th sampling moment according to the sampling distribution at the (m+1)th sampling moment.
5. The speech synthesis method of claim 4, wherein the speech synthesis network is an LPCnet network, and the sampling layer is a GRU layer in the LPCnet network.
6. The method of speech synthesis of claim 5, wherein the GRU layers comprise a first GRU layer and a second GRU layer;
and the first GRU layer and the second GRU layer adopt different sampling frequencies to sample data input into the GRU layer.
7. A speech synthesis apparatus, comprising:
the acquisition module is used for acquiring feature sampling data of the acoustic feature data at a plurality of sampling moments;
the processing module is used for simultaneously predicting the characteristic sampling data of the plurality of sampling moments by utilizing a voice synthesis network to obtain linear prediction data and nonlinear prediction data of any two target sampling moments in the plurality of sampling moments;
and the synthesis module is used for determining the voice synthesis data of the two target sampling moments according to the linear prediction data and the nonlinear prediction data of the two target sampling moments.
8. An electronic device, comprising: a memory and a processor;
the memory is to store program instructions;
the processor is configured to invoke program instructions in the memory to perform the method of any of claims 1-6.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program; the computer program, when executed, implementing the method of any one of claims 1-6.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1-6 when executed by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110264700.1A CN112951202B (en) | 2021-03-11 | 2021-03-11 | Speech synthesis method, apparatus, electronic device and program product |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110264700.1A CN112951202B (en) | 2021-03-11 | 2021-03-11 | Speech synthesis method, apparatus, electronic device and program product |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112951202A true CN112951202A (en) | 2021-06-11 |
CN112951202B CN112951202B (en) | 2022-11-08 |
Family
ID=76229478
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110264700.1A Active CN112951202B (en) | 2021-03-11 | 2021-03-11 | Speech synthesis method, apparatus, electronic device and program product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112951202B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113782042A (en) * | 2021-09-09 | 2021-12-10 | 腾讯科技(深圳)有限公司 | Speech synthesis method, vocoder training method, device, equipment and medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110473516A (en) * | 2019-09-19 | 2019-11-19 | 百度在线网络技术(北京)有限公司 | Phoneme synthesizing method, device and electronic equipment |
GB202000883D0 (en) * | 2020-01-21 | 2020-03-04 | Samsung Electronics Co Ltd | An expressive text-to-speech system |
CN112116903A (en) * | 2020-08-17 | 2020-12-22 | 北京大米科技有限公司 | Method and device for generating speech synthesis model, storage medium and electronic equipment |
CN112151003A (en) * | 2019-06-27 | 2020-12-29 | 百度在线网络技术(北京)有限公司 | Parallel speech synthesis method, device, equipment and computer readable storage medium |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112151003A (en) * | 2019-06-27 | 2020-12-29 | 百度在线网络技术(北京)有限公司 | Parallel speech synthesis method, device, equipment and computer readable storage medium |
CN110473516A (en) * | 2019-09-19 | 2019-11-19 | 百度在线网络技术(北京)有限公司 | Phoneme synthesizing method, device and electronic equipment |
GB202000883D0 (en) * | 2020-01-21 | 2020-03-04 | Samsung Electronics Co Ltd | An expressive text-to-speech system |
CN112116903A (en) * | 2020-08-17 | 2020-12-22 | 北京大米科技有限公司 | Method and device for generating speech synthesis model, storage medium and electronic equipment |
Non-Patent Citations (1)
Title |
---|
VADIM POPOV 等: "GAUSSIAN LPCNET FOR MULTISAMPLE SPEECH SYNTHESIS", 《ICASSP 2020 - 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113782042A (en) * | 2021-09-09 | 2021-12-10 | 腾讯科技(深圳)有限公司 | Speech synthesis method, vocoder training method, device, equipment and medium |
CN113782042B (en) * | 2021-09-09 | 2023-09-19 | 腾讯科技(深圳)有限公司 | Speech synthesis method, vocoder training method, device, equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
CN112951202B (en) | 2022-11-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11430427B2 (en) | Method and electronic device for separating mixed sound signal | |
CN110097890B (en) | Voice processing method and device for voice processing | |
CN113362812B (en) | Voice recognition method and device and electronic equipment | |
CN111524521A (en) | Voiceprint extraction model training method, voiceprint recognition method, voiceprint extraction model training device, voiceprint recognition device and voiceprint recognition medium | |
EP3657497B1 (en) | Method and device for selecting target beam data from a plurality of beams | |
CN110992963A (en) | Network communication method, device, computer equipment and storage medium | |
CN110890083A (en) | Audio data processing method and device, electronic equipment and storage medium | |
CN110610720B (en) | Data processing method and device and data processing device | |
CN105427161A (en) | Monetary exchange rate exchange method and device | |
CN110781905A (en) | Image detection method and device | |
CN110909203A (en) | Video analysis method and device, electronic equipment and storage medium | |
CN107437412B (en) | Acoustic model processing method, voice synthesis method, device and related equipment | |
CN112185388A (en) | Speech recognition method, device, equipment and computer readable storage medium | |
CN111985635A (en) | Method, device and medium for accelerating neural network inference processing | |
CN112951202B (en) | Speech synthesis method, apparatus, electronic device and program product | |
CN110929616A (en) | Human hand recognition method and device, electronic equipment and storage medium | |
CN110515623B (en) | Method and device for realizing graphic operation, electronic equipment and storage medium | |
CN111583958A (en) | Audio signal processing method, audio signal processing device, electronic equipment and storage medium | |
CN111650554A (en) | Positioning method and device, electronic equipment and storage medium | |
CN113115104B (en) | Video processing method and device, electronic equipment and storage medium | |
CN110942782A (en) | Voice compression method, voice decompression method, voice compression device, voice decompression device and electronic equipment | |
CN113779500B (en) | Data processing method and device for data processing | |
CN117813652A (en) | Audio signal encoding method, device, electronic equipment and storage medium | |
CN110580910A (en) | Audio processing method, device and equipment and readable storage medium | |
CN114154395A (en) | Model processing method and device for model processing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |