CN112086086A - Speech synthesis method, device, equipment and computer readable storage medium - Google Patents

Speech synthesis method, device, equipment and computer readable storage medium

Info

Publication number
CN112086086A
CN112086086A
Authority
CN
China
Prior art keywords
synthesized
text
emotion
prosodic
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011138907.6A
Other languages
Chinese (zh)
Inventor
曾振
王健宗
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011138907.6A priority Critical patent/CN112086086A/en
Priority to PCT/CN2020/136421 priority patent/WO2021189984A1/en
Publication of CN112086086A publication Critical patent/CN112086086A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Abstract

The invention relates to artificial intelligence and discloses a speech synthesis method, which comprises the following steps: performing semantic extraction processing on an acquired text to be synthesized to obtain a semantic feature sequence; performing prosody prediction processing on the semantic feature sequence through a prosody prediction model to obtain prosodic emotion features of the text to be synthesized; inputting the prosodic emotion features and a pre-acquired syllable sequence of the text to be synthesized into a speech feature prediction model at the same time, and performing speech prediction processing to obtain a speech feature spectrum of the text to be synthesized; and synthesizing the speech feature spectrum and the text to be synthesized into speech with prosodic emotion. The invention also relates to blockchain technology, wherein the prosody prediction model is stored in a blockchain. The invention can adjust the tone, emotion and prosody characteristics of the synthesized speech in real time.

Description

Speech synthesis method, device, equipment and computer readable storage medium
Technical Field
The present invention relates to artificial intelligence, and in particular, to a method and an apparatus for speech synthesis, an electronic device, and a computer-readable storage medium.
Background
Speech synthesis technology generates intelligible, human-like speech from input text. It is an essential part of human-computer interaction systems and is widely used in artificial intelligence terminals such as smart speakers and intelligent customer service agents. Mainstream speech synthesis systems can already produce stable and reliable speech, so the performance of a speech synthesis system is now judged mainly by how natural and human-like the synthesized voice sounds, which strongly affects the experience of the interactive system.
Traditional speech synthesis technology synthesizes speech directly from the provided text, and the synthesized speech is essentially identical for the same text, which makes it difficult to synthesize speech with a specific emotion and prosody. However, in scenarios with high requirements on synthesis quality, such as intelligent customer service, the emotion, speaking rate and prosody of the speech need to be adjusted in time according to the user's responses so that the service content is conveyed effectively.
Most existing speech synthesis systems synthesize speech directly from the input text sequence, and the synthesized speech is essentially the same for the same text input; it cannot be adjusted to the particular application scenario or the current dialogue state.
Disclosure of Invention
The invention provides a speech synthesis method, a speech synthesis apparatus, an electronic device and a computer-readable storage medium, and mainly aims to adjust the tone, emotion and prosody characteristics of synthesized speech in real time.
In a first aspect, to achieve the above object, the present invention provides a speech synthesis method, including:
performing semantic extraction processing on the obtained text to be synthesized to obtain a semantic feature sequence;
performing prosody prediction processing on the semantic feature sequence through a prosody prediction model to obtain prosodic emotion features of the text to be synthesized;
inputting the prosodic emotion features and a pre-acquired syllable sequence of the text to be synthesized into a speech feature prediction model at the same time, and performing speech prediction processing to obtain a speech feature spectrum of the text to be synthesized;
and synthesizing the speech feature spectrum and the text to be synthesized into speech with prosodic emotion.
Optionally, the semantic extracting processing on the obtained text to be synthesized to obtain a semantic feature sequence includes:
performing character separation processing on the text to be synthesized to obtain a character set;
inputting the character set into a pre-trained language model for semantic extraction processing to obtain the semantic feature sequence; wherein the pre-trained language model comprises:
an input layer for inputting the character set, a vector coding layer for performing vector conversion processing on the character set from the input layer, a self-attention mechanism layer for computing an enhanced semantic vector of each word from the word vector, text vector and position vector produced by the vector coding layer, a pooling layer for performing dimension reduction and splicing on the enhanced semantic vector of each word obtained by the self-attention mechanism layer, and an output layer for outputting the semantic feature sequence obtained by the pooling layer.
Optionally, the prosodic prediction model comprises:
an input linear layer for inputting the semantic feature sequence; a memory network layer for finding, according to the semantic feature sequence from the input linear layer, the prosodic emotion features corresponding to the semantic feature sequence from prosodic emotion samples learned in advance; and an output linear layer for outputting the corresponding prosodic emotion features.
Optionally, before the inputting the prosodic emotion features and the pre-obtained syllable sequence of the text to be synthesized into a speech feature prediction model at the same time, performing speech prediction processing to obtain a speech feature spectrum of the text to be synthesized, the method further includes:
adjusting the tone of the prosodic emotion features according to a preset tone adjustment rule to obtain to-be-used prosodic emotion features of the text to be synthesized.
Optionally, the adjusting the tone of the prosodic emotion features according to a preset tone adjustment rule to obtain the to-be-used prosodic emotion features of the text to be synthesized includes:
acquiring the sequence element vectors of the prosodic emotion features;
and adjusting the values of the sequence element vectors according to the preset tone adjustment rule to obtain the to-be-used prosodic emotion features of the text to be synthesized.
Optionally, the speech feature prediction model comprises:
a character embedding layer for converting the syllable sequence into a syllable embedding vector, a superposition layer for superimposing the linearly processed prosodic emotion features onto the syllable embedding vector, and a speech feature output layer for outputting the speech feature spectrum obtained by the superposition layer.
Optionally, the synthesizing the speech feature spectrum and the text to be synthesized into the speech with prosodic emotion includes:
and performing speech synthesis on the text to be synthesized through a vocoder according to the speech feature spectrum to obtain speech with prosodic emotion.
In a second aspect, in order to solve the above problem, the present invention further provides a speech synthesis apparatus, comprising:
the semantic extraction module is used for performing semantic extraction processing on the acquired text to be synthesized to obtain a semantic feature sequence;
the prosodic emotion feature acquisition module is used for performing prosodic prediction processing on the semantic feature sequence through a prosodic prediction model to obtain prosodic emotion features of the text to be synthesized;
the speech feature acquisition module, used for inputting the prosodic emotion features and a pre-acquired syllable sequence of the text to be synthesized into a speech feature prediction model at the same time, and performing speech prediction processing to obtain a speech feature spectrum of the text to be synthesized; and
a speech synthesis module, used for synthesizing the speech feature spectrum and the text to be synthesized into speech with prosodic emotion.
In a third aspect, to solve the above problem, the present invention further provides an electronic apparatus, including:
a memory storing at least one instruction; and
and a processor that executes the instructions stored in the memory to implement the speech synthesis method described above.
In a fourth aspect, to solve the above problem, the present invention further provides a computer-readable storage medium having at least one instruction stored therein, where the at least one instruction is executed by a processor in an electronic device to implement the speech synthesis method described above.
The invention provides a speech synthesis method and apparatus, an electronic device and a computer-readable storage medium. A semantic feature sequence is obtained by performing semantic extraction processing on the acquired text to be synthesized; prosody prediction processing is then performed on the semantic feature sequence through a prosody prediction model to obtain prosodic emotion features; the prosodic emotion features and a pre-acquired syllable sequence of the text to be synthesized are input into a speech feature prediction model at the same time, and speech prediction processing is performed to obtain a speech feature spectrum; finally, the speech feature spectrum and the text to be synthesized are synthesized into speech with prosodic emotion. Because the prosodic emotion features of speech are modeled and extracted directly from speech, accurate prosodic information can be obtained to improve the prediction effect of speech synthesis, and speech with more accurate and more natural prosody can be synthesized from the text to be synthesized. The method is suitable for scenarios with high requirements on the diversity of synthesized speech: speech with various prosodic emotions can be synthesized for the same text, and in particular, in artificial intelligence services the prosodic emotion of the synthesized speech can be adjusted in real time according to the attributes, dialogue state and dialogue emotion of the current user, thereby realizing a more human-like artificial intelligence voice service.
Drawings
Fig. 1 is a schematic flow chart of a speech synthesis method according to an embodiment of the present invention;
Fig. 2 is a functional block diagram of a speech synthesis apparatus according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the internal structure of an electronic device implementing a speech synthesis method according to an embodiment of the present invention;
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a speech synthesis method. Fig. 1 is a schematic flow chart of a speech synthesis method according to an embodiment of the present invention. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
In this embodiment, the speech synthesis method includes:
s110, performing semantic extraction processing on the obtained text to be synthesized to obtain a semantic feature sequence.
Specifically, because the prosodic emotion of a sentence is usually related to the semantics of what is being said, text semantic information is introduced from the input text to be synthesized, which improves the prosody prediction effect.
As a preferred embodiment of the present invention, performing semantic extraction processing on the obtained text to be synthesized to obtain a semantic feature sequence includes:
performing character separation processing on the text to be synthesized to obtain a character set;
inputting the character set into a pre-trained language model for semantic extraction processing to obtain a semantic feature sequence; wherein the pre-trained language model comprises:
an input layer for inputting the character set, a vector coding layer for performing vector conversion processing on the character set from the input layer, a self-attention mechanism layer for computing an enhanced semantic vector of each word from the word vector, text vector and position vector produced by the vector coding layer, a pooling layer for performing dimension reduction and splicing on the enhanced semantic vector of each word obtained by the self-attention mechanism layer, and an output layer for outputting the semantic feature sequence obtained by the pooling layer.
Specifically, each sentence in the text to be synthesized is split character by character to obtain a character set, and the character set is then input into a pre-trained language model for semantic extraction processing, where the pre-trained language model is a semantic extraction model from Natural Language Processing (NLP).
The preferred pre-trained language model is a BERT model, which computes a feature sequence reflecting the semantics of the input text to be synthesized. BERT is a mature network model, and an open-source implementation can be used directly. Its structure may comprise an input layer, a vector coding layer, a self-attention mechanism layer, a pooling layer and an output layer. The character set, whose elements are the individual characters of the text to be synthesized, is fed into the pre-trained language model through the input layer. The vector coding layer encodes each character in the character set according to vector templates learned in advance by the model, producing for each character a word vector, a text vector and a position vector. The values of the text vector are learned automatically during model training; the text vector describes the global semantic information of the text to be synthesized and is fused with the semantic information of the individual characters. The position vector is introduced because characters at different positions of the text to be synthesized carry different semantic information (for example, the same characters arranged in a different order express a different meaning), so characters at different positions are distinguished by adding distinct position vectors. The meaning a word expresses in a text is usually related to its context, so the context information of a word helps to enhance its semantic representation. The self-attention mechanism layer therefore enhances the semantic representations formed by the word vector, text vector and position vector to obtain an enhanced semantic vector for each word; the pooling layer performs feature dimension reduction and splicing on the enhanced semantic vectors, and the output layer outputs the semantic feature sequence.
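As a non-limiting illustration of this step, the following sketch assumes the open-source HuggingFace transformers implementation of a Chinese BERT model; the model name, the choice of the last hidden layer and the dropping of special tokens are assumptions made for the example, not part of the disclosure.

```python
# Hedged sketch: extracting a per-character semantic feature sequence with an
# open-source Chinese BERT model (assumed implementation, not the patented model).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

def semantic_feature_sequence(text_to_synthesize: str) -> torch.Tensor:
    # Chinese BERT tokenizes character by character, matching the
    # "character separation" step described above.
    inputs = tokenizer(text_to_synthesize, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    # last_hidden_state: (1, num_characters + 2, 768); the self-attention layers
    # have already produced an enhanced semantic vector per character.
    return outputs.last_hidden_state[0, 1:-1]   # drop [CLS]/[SEP] tokens

features = semantic_feature_sequence("中国平安")
print(features.shape)   # e.g. torch.Size([4, 768])
```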
S120, performing prosody prediction processing on the semantic feature sequence through a prosody prediction model to obtain the prosodic emotion features of the text to be synthesized.
Specifically, the prosodic emotion features are obtained through a prosody prediction model. The information contained in a piece of speech can be divided into two parts: 1) the pronunciation information of the syllables in the speech, i.e. the syllable sequence of the spoken text content; and 2) the remaining pronunciation characteristics beyond the syllable information, mainly prosody, emotion and tone, which are collectively referred to here as prosodic emotion features. Because prosodic emotion features cannot be extracted from speech directly, a prosody prediction model is required; the ability to obtain prosodic emotion features from speech features is learned during the training of the speech synthesis system.
As a preferred embodiment of the present invention, the prosody prediction model is stored in a blockchain, and the prosody prediction model comprises:
an input linear layer for inputting the semantic feature sequence; a memory network layer for finding, according to the semantic feature sequence from the input linear layer, the prosodic emotion features corresponding to the semantic feature sequence from prosodic emotion samples learned in advance; and an output linear layer for outputting the corresponding prosodic emotion features.
Specifically, the prosody prediction model is a deep learning network whose output linear layer emits a low-dimensional feature sequence, namely the prosodic emotion features (the dimension can be set to 3, 4 or 5 as required, and preferably does not exceed 10).
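A minimal PyTorch sketch of such a network is given below, assuming the memory network layer is realized as an LSTM; the layer sizes and the output dimension of 4 are illustrative values within the range stated above, not values fixed by the disclosure.

```python
# Hedged sketch: prosody prediction model = input linear layer
# -> memory network layer (assumed to be an LSTM) -> output linear layer.
import torch
import torch.nn as nn

class ProsodyPredictor(nn.Module):
    def __init__(self, semantic_dim: int = 768, hidden_dim: int = 256,
                 prosody_dim: int = 4):          # illustrative low dimension
        super().__init__()
        self.input_linear = nn.Linear(semantic_dim, hidden_dim)
        self.memory = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.output_linear = nn.Linear(hidden_dim, prosody_dim)

    def forward(self, semantic_features: torch.Tensor) -> torch.Tensor:
        # semantic_features: (batch, num_characters, semantic_dim)
        x = self.input_linear(semantic_features)
        x, _ = self.memory(x)            # recalls prosody patterns learned in training
        return self.output_linear(x)     # (batch, num_characters, prosody_dim)

prosody = ProsodyPredictor()(torch.randn(1, 4, 768))
print(prosody.shape)   # torch.Size([1, 4, 4])
```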
S130, inputting the prosodic emotion features and the pre-acquired syllable sequence of the text to be synthesized into the speech feature prediction model at the same time, and performing speech prediction processing to obtain a speech feature spectrum of the text to be synthesized.
Specifically, the syllable sequence of the text to be synthesized is first converted into a syllable embedding vector; the prosodic emotion features (sequence vectors) are passed through a linear layer, and the result is superimposed onto the syllable embedding vector to obtain the speech feature spectrum. The syllable sequence of the text to be synthesized is the pinyin sequence corresponding to the text, split syllable by syllable; for example, for '中国平安' (Ping An of China), the pinyin sequence is 'zhong1 guo2 ping2 an1' and the syllable sequence is [zh, ong1, g, uo2, p, ing2, an1].
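The sketch below, which assumes the open-source pypinyin library, illustrates how such a sequence of initials and tone-numbered finals could be produced from Chinese text; the library choice and the handling of syllables without an initial are assumptions for illustration only.

```python
# Hedged sketch: converting Chinese text into a syllable sequence of initials and
# tone-numbered finals, e.g. "中国平安" -> [zh, ong1, g, uo2, p, ing2, an1].
from pypinyin import lazy_pinyin, Style

def syllable_sequence(text: str) -> list:
    initials = lazy_pinyin(text, style=Style.INITIALS, strict=False)
    finals = lazy_pinyin(text, style=Style.FINALS_TONE3)
    sequence = []
    for ini, fin in zip(initials, finals):
        if ini:                  # some syllables (e.g. "安" -> "an1") have no initial
            sequence.append(ini)
        sequence.append(fin)
    return sequence

print(syllable_sequence("中国平安"))
# expected (assumed): ['zh', 'ong1', 'g', 'uo2', 'p', 'ing2', 'an1']
```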
As a preferred embodiment of the present invention, before inputting prosodic emotion features and a pre-obtained syllable sequence of a text to be synthesized into a speech feature prediction model and performing speech prediction processing to obtain a speech feature spectrum of the text to be synthesized, the method further includes:
adjusting the tone of the prosodic emotion features according to a preset tone adjustment rule to obtain to-be-used prosodic emotion features of the text to be synthesized.
Specifically, the preset tone adjustment rule specifies, for different scenarios, the tone values of the prosodic emotion features corresponding to each character; the tone values of the prosodic emotion features are adjusted according to the scenario to obtain the prosodic emotion features of each character in that scenario. In this way the prosodic emotion of the synthesized speech can be adjusted in real time according to the requirements of the application scenario, producing synthesized speech with different prosodic effects.
As a preferred embodiment of the present invention, the adjusting of the tone of the prosodic emotion features according to the preset tone adjustment rule to obtain the to-be-used prosodic emotion features of the text to be synthesized includes:
acquiring the sequence element vectors of the prosodic emotion features;
and adjusting the values of the sequence element vectors according to the preset tone adjustment rule to obtain the to-be-used prosodic emotion features of the text to be synthesized.
Specifically, the prosody prediction model predicts the prosodic emotion features, which are in fact a sequence in which each element is the prosodic emotion feature vector of one character; the values of each character's prosodic emotion feature vector can be modified according to the preset tone adjustment rule to adjust the prosodic emotion. For example, if z = {z_1, z_2, …, z_n} denotes the prosodic emotion feature sequence predicted by the prosody prediction model and α = {α_1, α_2, …, α_n} denotes the adjustment coefficients (each coefficient lying between -1 and 1), the adjusted prosody is z' = z + U·α, where U denotes the adjustable prosody range. The preset tone adjustment rule can be set as required.
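As a numerical illustration of this adjustment, assuming the prosodic emotion features are stored as a tensor of shape (num_characters, prosody_dim), a sketch of z' = z + U·α might look as follows; the value of the adjustable range U and the per-character coefficients are illustrative.

```python
# Hedged sketch of the prosody adjustment z' = z + U * alpha,
# where alpha holds one coefficient in [-1, 1] per character.
import torch

def adjust_prosody(z: torch.Tensor, alpha: torch.Tensor,
                   adjustable_range: float = 0.5) -> torch.Tensor:
    # z:     (num_characters, prosody_dim)  predicted prosodic emotion features
    # alpha: (num_characters,)              scenario-dependent adjustment coefficients
    alpha = alpha.clamp(-1.0, 1.0).unsqueeze(-1)   # broadcast over prosody_dim
    return z + adjustable_range * alpha            # z' = z + U * alpha

z = torch.randn(4, 4)                              # e.g. four characters
alpha = torch.tensor([0.2, -0.5, 0.0, 1.0])        # illustrative coefficients
z_adjusted = adjust_prosody(z, alpha)
```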
As a preferred embodiment of the present invention, the speech feature prediction model includes:
a character embedding layer for converting the syllable sequence into a syllable embedding vector, a superposition layer for superimposing the linearly processed prosodic emotion features onto the syllable embedding vector, and a speech feature output layer for outputting the speech feature spectrum obtained by the superposition layer.
Specifically, the speech feature prediction model is mainly a speech feature prediction network. A mature acoustic model such as Tacotron 2 can be used directly as the underlying network structure, with a slight adjustment to the network: in Tacotron 2, the syllable sequence is first converted into a syllable embedding vector through the Character Embedding layer; the prosodic emotion features are passed through a linear layer, and the result is then superimposed onto the syllable embedding vector.
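A minimal sketch of this modification to the acoustic-model front end is given below; it is not the Tacotron 2 implementation itself, and the vocabulary size, embedding size, prosody dimension and the assumed alignment of prosody features to the syllable sequence are illustrative assumptions.

```python
# Hedged sketch: superimposing linearly projected prosodic emotion features
# onto the syllable embedding before the rest of the acoustic model.
import torch
import torch.nn as nn

class ProsodyConditionedFrontEnd(nn.Module):
    def __init__(self, num_syllables: int = 100, embed_dim: int = 512,
                 prosody_dim: int = 4):
        super().__init__()
        self.character_embedding = nn.Embedding(num_syllables, embed_dim)
        self.prosody_linear = nn.Linear(prosody_dim, embed_dim)

    def forward(self, syllable_ids: torch.Tensor,
                prosody_features: torch.Tensor) -> torch.Tensor:
        # syllable_ids:     (batch, seq_len)  integer ids of the syllable sequence
        # prosody_features: (batch, seq_len, prosody_dim), assumed here to be
        #                   already aligned to the syllable sequence length
        embedded = self.character_embedding(syllable_ids)
        conditioned = embedded + self.prosody_linear(prosody_features)
        return conditioned       # fed to the remaining acoustic-model layers

front_end = ProsodyConditionedFrontEnd()
out = front_end(torch.randint(0, 100, (1, 7)), torch.randn(1, 7, 4))
```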
A traditional acoustic model predicts the speech features from the syllable sequence alone, which makes it difficult to model the prosody, emotion and other information contained in the target speech features: the same syllable sequence spoken with different prosody produces different speech, and hence different speech features. The prosody prediction model is therefore designed to learn and model the prosody, emotion and related information of the speech from the speech features. During training, the prosody prediction model and the speech feature prediction model can be trained together.
S140, synthesizing the speech feature spectrum and the text to be synthesized into speech with prosodic emotion.
Specifically, a speech signal is generated from the speech feature spectrum; this speech with prosodic emotion is the synthesized speech of the text to be synthesized.
As a preferred embodiment of the invention, the synthesizing of the speech feature spectrum and the text to be synthesized into speech with prosodic emotion comprises the following step:
performing speech synthesis on the text to be synthesized through a vocoder according to the speech feature spectrum to obtain speech with prosodic emotion.
Specifically, the speech feature spectrum may be synthesized into speech by a vocoder, which generates a speech signal from the speech features (a mel spectrogram). Vocoders are fairly general-purpose; the invention preferably uses Parallel WaveGAN as the vocoder network.
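The final waveform generation step could then look like the sketch below, assuming `vocoder` is a pre-trained Parallel WaveGAN generator already loaded as a PyTorch module that maps a mel spectrogram of shape (frames, n_mels) to a waveform; the `inference` interface, sampling rate and file output are assumptions for illustration, not the library's documented API.

```python
# Hedged sketch: turning the predicted speech feature spectrum (mel spectrogram)
# into a waveform with an assumed pre-trained Parallel WaveGAN generator.
import torch
import soundfile as sf

def synthesize_waveform(vocoder: torch.nn.Module, mel: torch.Tensor,
                        sample_rate: int = 22050, path: str = "output.wav") -> None:
    # mel: (frames, n_mels) speech feature spectrum predicted for the text
    vocoder.eval()
    with torch.no_grad():
        waveform = vocoder.inference(mel)   # assumed interface of the generator
    sf.write(path, waveform.view(-1).cpu().numpy(), sample_rate)
```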
Fig. 2 is a functional block diagram of a speech synthesis apparatus according to an embodiment of the present invention.
The speech synthesis apparatus 200 of the present invention can be installed in an electronic device. According to the implemented functions, the speech synthesis apparatus may include a semantic extraction module 210, a prosodic emotion feature acquisition module 220, a speech feature acquisition module 230, and a speech synthesis module 240. A module according to the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
The semantic extraction module 210 is configured to perform semantic extraction processing on the obtained text to be synthesized to obtain a semantic feature sequence.
Specifically, because the prosodic emotion of a sentence is usually related to the semantics of what is being said, text semantic information is introduced from the input text to be synthesized, which improves the prosody prediction effect.
As a preferred embodiment of the present invention, performing semantic extraction processing on the obtained text to be synthesized to obtain a semantic feature sequence includes:
performing character separation processing on the text to be synthesized to obtain a character set;
inputting the character set into a pre-trained language model for semantic extraction processing to obtain a semantic feature sequence; wherein the pre-trained language model comprises:
an input layer for inputting the character set, a vector coding layer for performing vector conversion processing on the character set from the input layer, a self-attention mechanism layer for computing an enhanced semantic vector of each word from the word vector, text vector and position vector produced by the vector coding layer, a pooling layer for performing dimension reduction and splicing on the enhanced semantic vector of each word obtained by the self-attention mechanism layer, and an output layer for outputting the semantic feature sequence obtained by the pooling layer.
Specifically, each sentence in the text to be synthesized is split character by character to obtain a character set, and the character set is then input into a pre-trained language model for semantic extraction processing, where the pre-trained language model is a semantic extraction model from Natural Language Processing (NLP).
The preferred pre-trained language model is a BERT model, which computes a feature sequence reflecting the semantics of the input text to be synthesized. BERT is a mature network model, and an open-source implementation can be used directly. Its structure may comprise an input layer, a vector coding layer, a self-attention mechanism layer, a pooling layer and an output layer. The character set, whose elements are the individual characters of the text to be synthesized, is fed into the pre-trained language model through the input layer. The vector coding layer encodes each character in the character set according to vector templates learned in advance by the model, producing for each character a word vector, a text vector and a position vector. The values of the text vector are learned automatically during model training; the text vector describes the global semantic information of the text to be synthesized and is fused with the semantic information of the individual characters. The position vector is introduced because characters at different positions of the text to be synthesized carry different semantic information (for example, the same characters arranged in a different order express a different meaning), so characters at different positions are distinguished by adding distinct position vectors. The meaning a word expresses in a text is usually related to its context, so the context information of a word helps to enhance its semantic representation. The self-attention mechanism layer therefore enhances the semantic representations formed by the word vector, text vector and position vector to obtain an enhanced semantic vector for each word; the pooling layer performs feature dimension reduction and splicing on the enhanced semantic vectors, and the output layer outputs the semantic feature sequence.
The prosodic emotion feature acquisition module 220 is configured to perform prosody prediction processing on the semantic feature sequence through a prosody prediction model to obtain the prosodic emotion features of the text to be synthesized. It is emphasized that the prosody prediction model is stored in a blockchain.
Specifically, the prosodic emotion features are obtained through a prosody prediction model. The information contained in a piece of speech can be divided into two parts: 1) the pronunciation information of the syllables in the speech, i.e. the syllable sequence of the spoken text content; and 2) the remaining pronunciation characteristics beyond the syllable information, mainly prosody, emotion and tone, which are collectively referred to here as prosodic emotion features. Because prosodic emotion features cannot be extracted from speech directly, a prosody prediction model is required; the ability to obtain prosodic emotion features from speech features is learned during the training of the speech synthesis system.
As a preferred embodiment of the present invention, the prosody prediction model includes:
the system comprises an input linear layer used for inputting a semantic feature sequence, and a memory network layer used for finding out prosodic emotion features corresponding to the semantic feature sequence from prosodic emotion samples learned in advance according to the semantic feature sequence of the input linear layer; and the output linear layer is used for outputting the corresponding prosodic emotion characteristics.
Specifically, the prosody prediction model is a deep learning network whose output linear layer emits a low-dimensional feature sequence, namely the prosodic emotion features (the dimension can be set to 3, 4 or 5 as required, and preferably does not exceed 10).
The speech feature acquisition module 230 is configured to input the prosodic emotion features and the pre-acquired syllable sequence of the text to be synthesized into the speech feature prediction model at the same time, and to perform speech prediction processing to obtain a speech feature spectrum of the text to be synthesized.
Specifically, the syllable sequence of the text to be synthesized is first converted into a syllable embedding vector; the prosodic emotion features (sequence vectors) are passed through a linear layer, and the result is superimposed onto the syllable embedding vector to obtain the speech feature spectrum. The syllable sequence of the text to be synthesized is the pinyin sequence corresponding to the text, split syllable by syllable; for example, for '中国平安' (Ping An of China), the pinyin sequence is 'zhong1 guo2 ping2 an1' and the syllable sequence is [zh, ong1, g, uo2, p, ing2, an1].
As a preferred embodiment of the present invention, before inputting prosodic emotion features and a pre-obtained syllable sequence of a text to be synthesized into a speech feature prediction model and performing speech prediction processing to obtain a speech feature spectrum of the text to be synthesized, the method further includes:
adjusting the tone of the prosodic emotion features according to a preset tone adjustment rule to obtain to-be-used prosodic emotion features of the text to be synthesized.
Specifically, the preset tone adjustment rule specifies, for different scenarios, the tone values of the prosodic emotion features corresponding to each character; the tone values of the prosodic emotion features are adjusted according to the scenario to obtain the prosodic emotion features of each character in that scenario. In this way the prosodic emotion of the synthesized speech can be adjusted in real time according to the requirements of the application scenario, producing synthesized speech with different prosodic effects.
As a preferred embodiment of the present invention, the adjusting of the tone of the prosodic emotion features according to the preset tone adjustment rule to obtain the to-be-used prosodic emotion features of the text to be synthesized includes:
acquiring the sequence element vectors of the prosodic emotion features;
and adjusting the values of the sequence element vectors according to the preset tone adjustment rule to obtain the to-be-used prosodic emotion features of the text to be synthesized.
Specifically, the prosody prediction model predicts the prosodic emotion features, which are in fact a sequence in which each element is the prosodic emotion feature vector of one character; the values of each character's prosodic emotion feature vector can be modified according to the preset tone adjustment rule to adjust the prosodic emotion. For example, if z = {z_1, z_2, …, z_n} denotes the prosodic emotion feature sequence predicted by the prosody prediction model and α = {α_1, α_2, …, α_n} denotes the adjustment coefficients (each coefficient lying between -1 and 1), the adjusted prosody is z' = z + U·α, where U denotes the adjustable prosody range. The preset tone adjustment rule can be set as required.
As a preferred embodiment of the present invention, the speech feature prediction model includes:
a character embedding layer for converting the syllable sequence into a syllable embedding vector, a superposition layer for superimposing the linearly processed prosodic emotion features onto the syllable embedding vector, and a speech feature output layer for outputting the speech feature spectrum obtained by the superposition layer.
Specifically, the speech feature prediction model is mainly a speech feature prediction network. A mature acoustic model such as Tacotron 2 can be used directly as the underlying network structure, with a slight adjustment to the network: in Tacotron 2, the syllable sequence is first converted into a syllable embedding vector through the Character Embedding layer; the prosodic emotion features are passed through a linear layer, and the result is then superimposed onto the syllable embedding vector.
A traditional acoustic model predicts the speech features from the syllable sequence alone, which makes it difficult to model the prosody, emotion and other information contained in the target speech features: the same syllable sequence spoken with different prosody produces different speech, and hence different speech features. The prosody prediction model is therefore designed to learn and model the prosody, emotion and related information of the speech from the speech features. During training, the prosody prediction model and the speech feature prediction model can be trained together.
The speech synthesis module 240 is configured to synthesize the speech feature spectrum and the text to be synthesized into speech with prosodic emotion.
Specifically, a speech signal is generated from the speech feature spectrum; this speech with prosodic emotion is the synthesized speech of the text to be synthesized.
As a preferred embodiment of the invention, the synthesizing of the speech feature spectrum and the text to be synthesized into speech with prosodic emotion comprises the following step:
performing speech synthesis on the text to be synthesized through a vocoder according to the speech feature spectrum to obtain speech with prosodic emotion.
Specifically, the speech feature spectrum may be synthesized into speech by a vocoder, which generates a speech signal from the speech features (a mel spectrogram). Vocoders are fairly general-purpose; the invention preferably uses Parallel WaveGAN as the vocoder network.
Fig. 3 is a schematic structural diagram of an electronic device implementing a speech synthesis method according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a speech synthesis program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as codes of a speech synthesis program, but also to temporarily store data that has been output or is to be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by running or executing programs or modules (e.g., a speech synthesis program, etc.) stored in the memory 11 and calling data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 3 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The speech synthesis program 12 stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed by the processor 10, can implement the following:
performing semantic extraction processing on the obtained text to be synthesized to obtain a semantic feature sequence;
performing prosody prediction processing on the semantic feature sequence through a prosody prediction model to obtain prosodic emotion features of the text to be synthesized;
inputting the prosodic emotion features and a pre-acquired syllable sequence of the text to be synthesized into a speech feature prediction model at the same time, and performing speech prediction processing to obtain a speech feature spectrum of the text to be synthesized;
and synthesizing the speech feature spectrum and the text to be synthesized into speech with prosodic emotion.
The specific implementation of the above instructions by the processor 10 may refer to the description of the relevant steps in the embodiment corresponding to Fig. 1, and is not repeated here. It is emphasized that, to further ensure the privacy and security of the prosody prediction model, the prosody prediction model may also be stored in a node of a blockchain.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM).
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method of speech synthesis, the method comprising:
performing semantic extraction processing on the obtained text to be synthesized to obtain a semantic feature sequence;
performing prosody prediction processing on the semantic feature sequence through a prosody prediction model to obtain prosodic emotion features of the text to be synthesized;
inputting the prosodic emotion features and a pre-acquired syllable sequence of the text to be synthesized into a speech feature prediction model at the same time, and performing speech prediction processing to obtain a speech feature spectrum of the text to be synthesized;
and synthesizing the speech feature spectrum and the text to be synthesized into speech with prosodic emotion.
2. The speech synthesis method according to claim 1, wherein the semantic extraction processing on the acquired text to be synthesized to obtain a semantic feature sequence comprises:
performing character separation processing on the text to be synthesized to obtain a character set;
inputting the character set into a pre-trained language model for semantic extraction processing to obtain the semantic feature sequence; wherein the pre-trained language model comprises:
an input layer for inputting the character set, a vector coding layer for performing vector conversion processing on the character set from the input layer, a self-attention mechanism layer for computing an enhanced semantic vector of each word from the word vector, text vector and position vector produced by the vector coding layer, a pooling layer for performing dimension reduction and splicing on the enhanced semantic vector of each word obtained by the self-attention mechanism layer, and an output layer for outputting the semantic feature sequence obtained by the pooling layer.
3. The speech synthesis method of claim 1, wherein the prosodic prediction model comprises:
an input linear layer for inputting the semantic feature sequence; a memory network layer for finding, according to the semantic feature sequence from the input linear layer, the prosodic emotion features corresponding to the semantic feature sequence from prosodic emotion samples learned in advance; and an output linear layer for outputting the corresponding prosodic emotion features.
4. The speech synthesis method according to claim 1, wherein before the inputting of the prosodic emotion features and the pre-acquired syllable sequence of the text to be synthesized into a speech feature prediction model at the same time and the performing of speech prediction processing to obtain a speech feature spectrum of the text to be synthesized, the method further comprises:
adjusting the tone of the prosodic emotion features according to a preset tone adjustment rule to obtain to-be-used prosodic emotion features of the text to be synthesized.
5. The speech synthesis method according to claim 4, wherein the adjusting of the tone of the prosodic emotion features according to the preset tone adjustment rule to obtain the to-be-used prosodic emotion features of the text to be synthesized comprises:
acquiring the sequence element vectors of the prosodic emotion features;
and adjusting the values of the sequence element vectors according to the preset tone adjustment rule to obtain the to-be-used prosodic emotion features of the text to be synthesized.
6. The speech synthesis method of claim 1, wherein the speech feature prediction model comprises:
a character embedding layer for converting the syllable sequence into a syllable embedding vector, a superposition layer for superimposing the linearly processed prosodic emotion features onto the syllable embedding vector, and a speech feature output layer for outputting the speech feature spectrum obtained by the superposition layer.
7. The speech synthesis method of claim 1, wherein the synthesizing the speech feature spectrum and the text to be synthesized into speech with prosodic emotion comprises:
and performing speech synthesis on the text to be synthesized through a vocoder according to the speech feature spectrum to obtain speech with prosodic emotion.
8. A speech synthesis apparatus, characterized in that the apparatus comprises:
the semantic extraction module is used for performing semantic extraction processing on the acquired text to be synthesized to obtain a semantic feature sequence;
the prosodic emotion feature acquisition module is used for performing prosodic prediction processing on the semantic feature sequence through a prosodic prediction model to obtain prosodic emotion features of the text to be synthesized;
the speech feature acquisition module, used for inputting the prosodic emotion features and a pre-acquired syllable sequence of the text to be synthesized into a speech feature prediction model at the same time, and performing speech prediction processing to obtain a speech feature spectrum of the text to be synthesized; and
a speech synthesis module, used for synthesizing the speech feature spectrum and the text to be synthesized into speech with prosodic emotion.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and the number of the first and second groups,
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a speech synthesis method according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a speech synthesis method according to any one of claims 1 to 7.
CN202011138907.6A 2020-10-22 2020-10-22 Speech synthesis method, device, equipment and computer readable storage medium Pending CN112086086A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011138907.6A CN112086086A (en) 2020-10-22 2020-10-22 Speech synthesis method, device, equipment and computer readable storage medium
PCT/CN2020/136421 WO2021189984A1 (en) 2020-10-22 2020-12-15 Speech synthesis method and apparatus, and device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011138907.6A CN112086086A (en) 2020-10-22 2020-10-22 Speech synthesis method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN112086086A true CN112086086A (en) 2020-12-15

Family

ID=73730362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011138907.6A Pending CN112086086A (en) 2020-10-22 2020-10-22 Speech synthesis method, device, equipment and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN112086086A (en)
WO (1) WO2021189984A1 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112599113A (en) * 2020-12-30 2021-04-02 北京大米科技有限公司 Dialect voice synthesis method and device, electronic equipment and readable storage medium
CN112750419A (en) * 2020-12-31 2021-05-04 科大讯飞股份有限公司 Voice synthesis method and device, electronic equipment and storage medium
CN113096634A (en) * 2021-03-30 2021-07-09 平安科技(深圳)有限公司 Speech synthesis method, apparatus, server and storage medium
CN113112985A (en) * 2021-04-21 2021-07-13 合肥工业大学 Speech synthesis method based on deep learning
CN113113047A (en) * 2021-03-17 2021-07-13 北京大米科技有限公司 Audio processing method and device, readable storage medium and electronic equipment
CN113345417A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium
CN113345412A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium
WO2021189984A1 (en) * 2020-10-22 2021-09-30 平安科技(深圳)有限公司 Speech synthesis method and apparatus, and device and computer-readable storage medium
CN113782030A (en) * 2021-09-10 2021-12-10 平安科技(深圳)有限公司 Error correction method based on multi-mode speech recognition result and related equipment
CN113948062A (en) * 2021-12-20 2022-01-18 阿里巴巴达摩院(杭州)科技有限公司 Data conversion method and computer storage medium
CN113990286A (en) * 2021-10-29 2022-01-28 北京大学深圳研究院 Speech synthesis method, apparatus, device and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114420087B (en) * 2021-12-27 2022-10-21 北京百度网讯科技有限公司 Acoustic feature determination method, device, equipment, medium and product
CN115410550B (en) * 2022-06-02 2024-03-26 北京听见科技有限公司 Fine granularity prosody controllable emotion voice synthesis method, system and storage medium
CN116665643B (en) * 2022-11-30 2024-03-26 荣耀终端有限公司 Rhythm marking method and device and terminal equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009133890A (en) * 2007-11-28 2009-06-18 Toshiba Corp Voice synthesizing device and method
CN108470024B (en) * 2018-03-12 2020-10-30 北京灵伴即时智能科技有限公司 Chinese prosodic structure prediction method fusing syntactic and semantic information
CN110534087B (en) * 2019-09-04 2022-02-15 清华大学深圳研究生院 Text prosody hierarchical structure prediction method, device, equipment and storage medium
CN112086086A (en) * 2020-10-22 2020-12-15 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and computer readable storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007079019A (en) * 2005-09-13 2007-03-29 Oki Electric Ind Co Ltd Method and device for speech synthesis, and computer program
WO2018121757A1 (en) * 2016-12-31 2018-07-05 深圳市优必选科技有限公司 Method and system for speech broadcast of text
CN110335587A (en) * 2019-06-14 2019-10-15 平安科技(深圳)有限公司 Phoneme synthesizing method, system, terminal device and readable storage medium storing program for executing
CN110299131A (en) * 2019-08-01 2019-10-01 苏州奇梦者网络科技有限公司 A kind of phoneme synthesizing method, device, the storage medium of controllable rhythm emotion
CN110782870A (en) * 2019-09-06 2020-02-11 腾讯科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN110928997A (en) * 2019-12-04 2020-03-27 北京文思海辉金信软件有限公司 Intention recognition method and device, electronic equipment and readable storage medium
CN111754976A (en) * 2020-07-21 2020-10-09 中国科学院声学研究所 Rhythm control voice synthesis method, system and electronic device

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021189984A1 (en) * 2020-10-22 2021-09-30 平安科技(深圳)有限公司 Speech synthesis method and apparatus, and device and computer-readable storage medium
CN112599113B (en) * 2020-12-30 2024-01-30 北京大米科技有限公司 Dialect voice synthesis method, device, electronic equipment and readable storage medium
CN112599113A (en) * 2020-12-30 2021-04-02 北京大米科技有限公司 Dialect voice synthesis method and device, electronic equipment and readable storage medium
CN112750419A (en) * 2020-12-31 2021-05-04 科大讯飞股份有限公司 Voice synthesis method and device, electronic equipment and storage medium
CN112750419B (en) * 2020-12-31 2024-02-13 科大讯飞股份有限公司 Speech synthesis method, device, electronic equipment and storage medium
CN113113047A (en) * 2021-03-17 2021-07-13 北京大米科技有限公司 Audio processing method and device, readable storage medium and electronic equipment
CN113096634A (en) * 2021-03-30 2021-07-09 平安科技(深圳)有限公司 Speech synthesis method, apparatus, server and storage medium
CN113096634B (en) * 2021-03-30 2024-03-01 平安科技(深圳)有限公司 Speech synthesis method, device, server and storage medium
CN113112985B (en) * 2021-04-21 2022-01-18 合肥工业大学 Speech synthesis method based on deep learning
CN113112985A (en) * 2021-04-21 2021-07-13 合肥工业大学 Speech synthesis method based on deep learning
CN113345412A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium
CN113345417B (en) * 2021-05-31 2024-03-01 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and storage medium
CN113345417A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium
CN113782030A (en) * 2021-09-10 2021-12-10 平安科技(深圳)有限公司 Error correction method based on multi-mode speech recognition result and related equipment
CN113782030B (en) * 2021-09-10 2024-02-02 平安科技(深圳)有限公司 Error correction method based on multi-mode voice recognition result and related equipment
CN113990286A (en) * 2021-10-29 2022-01-28 北京大学深圳研究院 Speech synthesis method, apparatus, device and storage medium
WO2023116243A1 (en) * 2021-12-20 2023-06-29 阿里巴巴达摩院(杭州)科技有限公司 Data conversion method and computer storage medium
CN113948062A (en) * 2021-12-20 2022-01-18 阿里巴巴达摩院(杭州)科技有限公司 Data conversion method and computer storage medium

Also Published As

Publication number Publication date
WO2021189984A1 (en) 2021-09-30

Similar Documents

Publication Publication Date Title
CN112086086A (en) Speech synthesis method, device, equipment and computer readable storage medium
CN112185348B (en) Multilingual voice recognition method and device and electronic equipment
EP4029010B1 (en) Neural text-to-speech synthesis with multi-level context features
CN112397047A (en) Speech synthesis method, device, electronic equipment and readable storage medium
CN112951203B (en) Speech synthesis method, device, electronic equipment and storage medium
CN113096242A (en) Virtual anchor generation method and device, electronic equipment and storage medium
CN112466273A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN111862937A (en) Singing voice synthesis method, singing voice synthesis device and computer readable storage medium
CN113345431A (en) Cross-language voice conversion method, device, equipment and medium
CN112820269A (en) Text-to-speech method, device, electronic equipment and storage medium
CN111696521A (en) Method for training speech clone model, readable storage medium and speech clone method
CN113420556A (en) Multi-mode signal based emotion recognition method, device, equipment and storage medium
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN115002491A (en) Network live broadcast method, device, equipment and storage medium based on intelligent machine
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN113707124A (en) Linkage broadcasting method and device of voice operation, electronic equipment and storage medium
CN116469374A (en) Speech synthesis method, device, equipment and storage medium based on emotion space
CN112242134A (en) Speech synthesis method and device
CN116386594A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116312463A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN115547288A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN114694633A (en) Speech synthesis method, apparatus, device and storage medium
CN114842880A (en) Intelligent customer service voice rhythm adjusting method, device, equipment and storage medium
CN114155832A (en) Speech recognition method, device, equipment and medium based on deep learning
CN113555003A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination