CN112086086B - Speech synthesis method, device, equipment and computer readable storage medium - Google Patents

Speech synthesis method, device, equipment and computer readable storage medium

Info

Publication number
CN112086086B
CN112086086B · CN202011138907.6A · CN202011138907A
Authority
CN
China
Prior art keywords
text
synthesized
emotion
prosodic
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011138907.6A
Other languages
Chinese (zh)
Other versions
CN112086086A (en)
Inventor
曾振
王健宗
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011138907.6A priority Critical patent/CN112086086B/en
Priority to PCT/CN2020/136421 priority patent/WO2021189984A1/en
Publication of CN112086086A publication Critical patent/CN112086086A/en
Application granted granted Critical
Publication of CN112086086B publication Critical patent/CN112086086B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to artificial intelligence and discloses a speech synthesis method comprising the following steps: performing semantic extraction processing on an acquired text to be synthesized to obtain a semantic feature sequence; performing prosody prediction processing on the semantic feature sequence through a prosody prediction model to obtain prosodic emotion features of the text to be synthesized; inputting the prosodic emotion features and a pre-acquired syllable sequence of the text to be synthesized simultaneously into a speech feature prediction model and performing speech prediction processing to obtain a speech feature spectrum of the text to be synthesized; and synthesizing the speech feature spectrum and the text to be synthesized into speech with prosodic emotion. The invention also relates to blockchain technology: the prosody prediction model is stored in a blockchain. The invention can switch the emotion and prosodic characteristics of the synthesized speech in real time.

Description

Speech synthesis method, device, equipment and computer readable storage medium
Technical Field
The present invention relates to artificial intelligence, and more particularly, to a method, apparatus, electronic device, and computer-readable storage medium for speech synthesis.
Background
Speech synthesis technology generates intelligible, human-like speech from input text. As a key link in human-machine interaction systems, it is widely applied in artificial intelligence terminals such as smart speakers and intelligent customer service agents. Mainstream speech synthesis systems can already produce very stable and reliable speech, so the performance of a speech synthesis system is now judged mainly by how human-like the synthesized sound is, which strongly affects the user experience of an interactive system.
Human speech is highly variable: the same content spoken with different emotion and prosody produces different voices. Traditional speech synthesis, however, synthesizes speech directly from the provided text, so the output for the same text is essentially identical, and it is difficult for a user to synthesize speech with a specific emotion or prosody. Yet in scenarios with high demands on the synthesis effect, such as intelligent customer service, the emotion, speed and prosody of the voice must be adjusted in time according to the user's responses in order to convey the service content effectively.
Most existing speech synthesis systems synthesize speech directly from the input text sequence, so the synthesized speech is essentially the same for the same text input and cannot be adjusted to the specific application scenario or the current dialogue state.
Disclosure of Invention
The invention provides a speech synthesis method, a speech synthesis apparatus, an electronic device and a computer-readable storage medium, with the main aim of switching the emotion and prosodic characteristics of synthesized speech in real time.
In order to achieve the above object, a first aspect of the present invention provides a speech synthesis method, including:
carrying out semantic extraction processing on the acquired text to be synthesized to obtain a semantic feature sequence;
performing prosody prediction processing on the semantic feature sequence through a prosody prediction model to obtain prosodic emotion features of the text to be synthesized;
inputting the prosodic emotion features and the pre-acquired syllable sequence of the text to be synthesized simultaneously into a speech feature prediction model, and performing speech prediction processing to obtain a speech feature spectrum of the text to be synthesized;
and synthesizing the speech feature spectrum and the text to be synthesized into speech with prosodic emotion.
Optionally, performing the semantic extraction processing on the acquired text to be synthesized to obtain the semantic feature sequence comprises:
performing character-level segmentation on the text to be synthesized to obtain a character set;
inputting the character set into a pre-trained language model for semantic extraction processing to obtain the semantic feature sequence; wherein the pre-trained language model comprises:
an input layer for inputting the character set, a vector coding layer for performing vector conversion processing on the character set from the input layer, a self-attention layer for computing enhanced semantic vectors from the character vectors, text vectors and position vectors obtained by the vector coding layer, a pooling layer for performing dimension reduction and concatenation on the enhanced semantic vector of each character obtained by the self-attention layer, and an output layer for outputting the semantic feature sequence obtained by the pooling layer.
Optionally, the prosody prediction model includes:
an input linear layer; a memory network layer for finding, according to the semantic feature sequence received by the input linear layer, the prosodic emotion features corresponding to the semantic feature sequence from prosodic emotion samples learned in advance; and an output linear layer for outputting the corresponding prosodic emotion features.
Optionally, before the prosodic emotion features and the syllable sequence of the text to be synthesized are input simultaneously into the speech feature prediction model and the speech prediction processing is performed to obtain the speech feature spectrum of the text to be synthesized, the method further comprises:
performing tone adjustment on the prosodic emotion features according to a preset tone adjustment rule to obtain standby prosodic emotion features of the text to be synthesized.
Optionally, performing the tone adjustment on the prosodic emotion features according to the preset tone adjustment rule to obtain the standby prosodic emotion features of the text to be synthesized comprises:
acquiring sequence element vectors of the prosodic emotion features;
and adjusting the numerical values of the sequence element vectors according to the preset tone adjustment rule to obtain the standby prosodic emotion features of the text to be synthesized.
Optionally, the speech feature prediction model includes:
a character embedding layer for converting the syllable sequence into syllable embedding vectors, a superposition layer for superimposing the linearly processed prosodic emotion features onto the syllable embedding vectors, and a speech feature output layer for outputting the speech feature spectrum obtained by the superposition layer.
Optionally, the synthesizing the speech feature spectrum and the text to be synthesized into the speech with prosodic emotion includes:
And performing voice synthesis on the text to be synthesized according to the voice characteristic spectrum through a vocoder to obtain voice with prosodic emotion.
In a second aspect, in order to solve the above-mentioned problems, the present invention also provides a speech synthesis apparatus, the apparatus comprising:
the semantic extraction module is used for carrying out semantic extraction processing on the acquired text to be synthesized to obtain a semantic feature sequence;
The prosodic emotion feature acquisition module is used for performing prosodic prediction processing on the semantic feature sequence through a prosodic prediction model to obtain prosodic emotion features of the text to be synthesized;
the speech feature acquisition module is used for inputting the prosodic emotion features and the pre-acquired syllable sequence of the text to be synthesized simultaneously into a speech feature prediction model, and performing speech prediction processing to obtain a speech feature spectrum of the text to be synthesized;
and a speech synthesis module for synthesizing the speech feature spectrum and the text to be synthesized into speech with prosodic emotion.
In a third aspect, in order to solve the above problems, the present invention also provides an electronic device, the electronic device comprising:
a memory storing at least one instruction; and
a processor that executes the instructions stored in the memory to implement the speech synthesis method described above.
In a fourth aspect, in order to solve the above-mentioned problems, the present invention also provides a computer-readable storage medium having stored therein at least one instruction that is executed by a processor in an electronic device to implement the above-mentioned speech synthesis method.
According to the speech synthesis method, apparatus, electronic device and computer-readable storage medium of the present invention, a semantic feature sequence is obtained by performing semantic extraction processing on the acquired text to be synthesized; prosody prediction processing is then performed on the semantic feature sequence through a prosody prediction model to obtain prosodic emotion features; the prosodic emotion features and a pre-acquired syllable sequence of the text to be synthesized are input simultaneously into a speech feature prediction model, and speech prediction processing is performed to obtain a speech feature spectrum; finally, the speech feature spectrum and the text to be synthesized are synthesized into speech with prosodic emotion. The method can model and extract prosodic emotion features directly from speech, obtaining accurate prosodic information that improves the prediction effect of speech synthesis, so that speech with more accurate and more natural prosody can be synthesized from the text to be synthesized. The method is suitable for scenarios with high requirements on the diversity of synthesized speech, since it can synthesize speech with various prosodic emotions for the same text; in particular, in artificial intelligence services, the prosodic emotion of the synthesized speech can be adjusted in real time according to the attributes, dialogue state and dialogue emotion of the current user, enabling a more humanized artificial intelligence voice service.
Drawings
FIG. 1 is a flow chart of a speech synthesis method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a speech synthesis apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the internal structure of an electronic device implementing a speech synthesis method according to an embodiment of the present invention;
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention provides a voice synthesis method. Referring to fig. 1, a flow chart of a speech synthesis method according to an embodiment of the invention is shown. The method may be performed by an apparatus, which may be implemented in software and/or hardware.
In this embodiment, the speech synthesis method includes:
S110, carrying out semantic extraction processing on the acquired text to be synthesized to obtain a semantic feature sequence.
Specifically, since the prosodic emotion of a sentence is often related to its semantics, text semantic information is introduced from the input text to be synthesized in order to improve the effect of prosody prediction.
As a preferred embodiment of the present invention, performing semantic extraction processing on the acquired text to be synthesized to obtain a semantic feature sequence comprises:
performing character-level segmentation on the text to be synthesized to obtain a character set;
inputting the character set into a pre-trained language model for semantic extraction processing to obtain the semantic feature sequence; wherein the pre-trained language model comprises:
an input layer for inputting the character set, a vector coding layer for performing vector conversion processing on the character set from the input layer, a self-attention layer for computing enhanced semantic vectors from the character vectors, text vectors and position vectors obtained by the vector coding layer, a pooling layer for performing dimension reduction and concatenation on the enhanced semantic vector of each character obtained by the self-attention layer, and an output layer for outputting the semantic feature sequence obtained by the pooling layer.
Specifically, the whole sentences in the text to be synthesized are separated character by character to obtain a character set, which is then input into the pre-trained language model for semantic extraction processing; the pre-trained language model is a semantic extraction model from natural language processing (NLP).
In a preferred embodiment, the pre-trained language model is a BERT model, which computes a feature sequence reflecting the semantics of the input text to be synthesized. The model is a general-purpose network, and an open-source model can be used directly. Its structure may comprise an input layer, a vector coding layer, a self-attention layer, a pooling layer and an output layer. The character set of the text to be synthesized, whose elements are individual characters, enters the pre-trained language model through the input layer. The vector coding layer encodes each character in the character set according to vector templates learned in advance by the model, producing a character vector, a text vector and a position vector for each character. The value of the text vector is learned automatically during model training; it describes the global semantic information of the text to be synthesized and is fused with the semantic information of the individual characters. The position vector reflects the fact that characters at different positions of the text carry different semantic information (for example, 'I want you' versus 'you want me'), so characters at different positions are distinguished by adding different position vectors. The meaning a character expresses in a text is generally related to its context, so the context of a character helps to enhance its semantic representation. The self-attention layer enhances the semantic representation formed by the character vectors, text vectors and position vectors, yielding an enhanced semantic vector for each character; the pooling layer then performs feature dimension reduction and concatenation on these enhanced semantic vectors, and the output layer outputs the semantic feature sequence.
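To make this semantic-extraction step concrete, the following is a minimal sketch assuming the Hugging Face transformers library and the open-source bert-base-chinese checkpoint; the library, the checkpoint name and the use of the last hidden state as the semantic feature sequence are illustrative assumptions, not requirements of the patent.

```python
import torch
from transformers import BertTokenizer, BertModel

# Assumed open-source Chinese BERT; any comparable pre-trained language model would do.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

def extract_semantic_features(text_to_synthesize: str) -> torch.Tensor:
    # Chinese BERT tokenizes character by character, which plays the role of
    # the character-level segmentation described above.
    inputs = tokenizer(text_to_synthesize, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    # Per-character enhanced semantic vectors produced by the self-attention layers;
    # the patent's pooling/output layers would further reduce and concatenate these
    # into the final semantic feature sequence.
    return outputs.last_hidden_state  # shape: (1, seq_len, hidden_size)

semantic_features = extract_semantic_features("中国平安")
print(semantic_features.shape)
```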
And S120, performing prosody prediction processing on the semantic feature sequence through a prosody prediction model to obtain prosody emotion features of the text to be synthesized.
Specifically, the prosodic emotion features are obtained through the prosody prediction model. The information contained in a piece of speech can be divided into two parts: 1) the syllable pronunciation information corresponding to the speech, i.e. the syllable sequence of the text content; and 2) the remaining pronunciation characteristics once the syllable information is removed, mainly prosody, emotion, intonation and the like, which are collectively referred to here as prosodic emotion features. Since prosodic emotion features cannot be extracted from the speech directly, a prosody prediction model is required, which learns during the training process of speech synthesis the ability to obtain prosodic emotion features from speech features.
As a preferred embodiment of the present invention, a prosody prediction model is stored in a blockchain, the prosody prediction model including:
an input linear layer; a memory network layer for finding, according to the semantic feature sequence received by the input linear layer, the prosodic emotion features corresponding to the semantic feature sequence from prosodic emotion samples learned in advance; and an output linear layer for outputting the corresponding prosodic emotion features.
Specifically, the prosody prediction model is a deep learning network whose output linear layer outputs a low-dimensional feature sequence, namely the prosodic emotion features (the dimension of the feature sequence can be set to 3, 4 or 5 as required, and preferably does not exceed 10).
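A minimal PyTorch sketch of such a prosody prediction model is given below; treating the memory network layer as an LSTM and the specific dimensions are assumptions made for illustration, not the patent's definition.

```python
import torch
import torch.nn as nn

class ProsodyPredictor(nn.Module):
    """Input linear layer -> memory network layer -> output linear layer."""

    def __init__(self, semantic_dim: int = 768, hidden_dim: int = 256, prosody_dim: int = 4):
        super().__init__()
        self.input_linear = nn.Linear(semantic_dim, hidden_dim)          # input linear layer
        self.memory = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)  # memory network layer (assumed LSTM)
        self.output_linear = nn.Linear(hidden_dim, prosody_dim)          # output linear layer

    def forward(self, semantic_features: torch.Tensor) -> torch.Tensor:
        # semantic_features: (batch, seq_len, semantic_dim)
        x = self.input_linear(semantic_features)
        x, _ = self.memory(x)
        # Low-dimensional prosodic emotion feature sequence (one vector per character).
        return self.output_linear(x)

prosody_features = ProsodyPredictor()(torch.randn(1, 4, 768))
print(prosody_features.shape)  # torch.Size([1, 4, 4])
```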
S130, inputting the prosodic emotion characteristics and the syllable sequence of the text to be synthesized, which is obtained in advance, into a voice characteristic prediction model at the same time, and performing voice prediction processing to obtain a voice characteristic spectrum of the text to be synthesized.
Specifically, the syllable sequence of the text to be synthesized is first converted into syllable embedding vectors; the prosodic emotion features (sequence vectors) are passed through a linear layer, and the result is superimposed onto the syllable embedding vectors to obtain the speech feature spectrum. The syllable sequence of the text to be synthesized is the pinyin sequence corresponding to the text, split by syllable; for example, for '中国平安' (Ping An of China) the pinyin sequence is 'zhong1 guo2 ping2 an1' and the syllable sequence is [zh, ong1, g, uo2, p, ing2, an1].
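The split of a pinyin sequence into initial/final syllable units can be illustrated with the short helper below; the initials table and the function name are simplifying assumptions for illustration only.

```python
# Simplified pinyin initials; a production front end would use a complete table.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def split_pinyin(pinyin_tokens):
    """Split each toned pinyin syllable into its initial and final."""
    units = []
    for syllable in pinyin_tokens:
        initial = next((i for i in INITIALS if syllable.startswith(i)), "")
        final = syllable[len(initial):]
        if initial:
            units.append(initial)
        units.append(final)  # syllables without an initial (e.g. 'an1') keep only the final
    return units

# '中国平安' -> pinyin 'zhong1 guo2 ping2 an1'
print(split_pinyin(["zhong1", "guo2", "ping2", "an1"]))
# ['zh', 'ong1', 'g', 'uo2', 'p', 'ing2', 'an1']
```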
As a preferred embodiment of the present invention, before the prosodic emotion features and the pre-acquired syllable sequence of the text to be synthesized are input simultaneously into the speech feature prediction model and the speech prediction processing is performed to obtain the speech feature spectrum of the text to be synthesized, the method further comprises:
performing tone adjustment on the prosodic emotion features according to a preset tone adjustment rule to obtain standby prosodic emotion features of the text to be synthesized.
Specifically, the preset tone adjustment rule contains the tone values of the prosodic emotion features of each character in different scenarios; the tone values of the prosodic emotion features are adjusted according to the scenario, yielding the prosodic emotion features of each character in that scenario. In this way the prosodic emotion of the synthesized speech can be adjusted in real time according to the requirements of the application scenario, so that speech with different prosodic effects can be synthesized.
As a preferred embodiment of the present invention, performing the tone adjustment on the prosodic emotion features according to the preset tone adjustment rule to obtain the standby prosodic emotion features of the text to be synthesized comprises:
acquiring sequence element vectors of the prosodic emotion features;
and adjusting the numerical values of the sequence element vectors according to the preset tone adjustment rule to obtain the standby prosodic emotion features of the text to be synthesized.
Specifically, the prosodic emotion feature predicted by the prosody prediction model is actually a sequence, each element of which is the prosodic emotion feature vector of one character; the numerical values of these vectors can be modified according to the preset tone adjustment rule to adjust the prosodic emotion. For example, let Z = {z_1, z_2, ..., z_n} denote the prosodic emotion feature sequence predicted by the prosody prediction model and α = {α_1, α_2, ..., α_n} the adjustment coefficients (each coefficient lies between -1 and 1); the adjusted prosody is then Z' = Z + U·α, where U denotes the adjustable prosody range. The preset tone adjustment rule can be set as required.
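A hedged NumPy sketch of this adjustment rule follows; the concrete value of U and the helper name are assumptions for illustration.

```python
import numpy as np

def adjust_prosody(Z: np.ndarray, alpha: np.ndarray, U: float = 0.5) -> np.ndarray:
    """Apply Z' = Z + U * alpha per character, with coefficients clipped to [-1, 1]."""
    alpha = np.clip(alpha, -1.0, 1.0)
    if alpha.ndim == 1:              # one scalar coefficient per character
        alpha = alpha[:, None]       # broadcast over the prosody feature dimensions
    return Z + U * alpha

Z = np.random.randn(7, 4)            # 7 characters, 4-dimensional prosodic emotion vectors
alpha = np.array([0.0, 0.3, -0.5, 0.0, 0.8, 0.0, 0.0])
Z_standby = adjust_prosody(Z, alpha)  # standby prosodic emotion features
```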
As a preferred embodiment of the present invention, the speech feature prediction model includes:
a character embedding layer for converting the syllable sequence into syllable embedding vectors, a superposition layer for superimposing the linearly processed prosodic emotion features onto the syllable embedding vectors, and a speech feature output layer for outputting the speech feature spectrum obtained by the superposition layer.
Specifically, the speech feature prediction model mainly consists of a speech feature prediction network; a mature acoustic model such as Tacotron can be used directly as the basic network structure, with a slight adjustment: in Tacotron, the syllable sequence is first converted into syllable embedding vectors by a character embedding network (or layer), the prosodic emotion features are passed through a linear layer, and the result is superimposed onto the syllable embedding vectors.
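The following is a minimal sketch of that front end in PyTorch, assuming the prosodic emotion features have already been aligned to the syllable positions (for example by repeating each character's vector over its initial/final units); the vocabulary size and embedding width are illustrative assumptions, and the Tacotron-style encoder/decoder that would consume the result is omitted.

```python
import torch
import torch.nn as nn

class SpeechFeatureFrontEnd(nn.Module):
    """Character embedding layer plus superposition of linearly processed prosody features."""

    def __init__(self, n_syllable_units: int = 500, embed_dim: int = 256, prosody_dim: int = 4):
        super().__init__()
        self.char_embedding = nn.Embedding(n_syllable_units, embed_dim)  # character embedding layer
        self.prosody_linear = nn.Linear(prosody_dim, embed_dim)          # linear layer for prosodic emotion features

    def forward(self, syllable_ids: torch.Tensor, prosody_features: torch.Tensor) -> torch.Tensor:
        # syllable_ids: (batch, T); prosody_features: (batch, T, prosody_dim), aligned to syllable units
        syllable_embed = self.char_embedding(syllable_ids)
        # Superposition layer: add the linearly processed prosody features onto the syllable embeddings;
        # the result would feed the acoustic network that predicts the speech feature spectrum.
        return syllable_embed + self.prosody_linear(prosody_features)

front_end = SpeechFeatureFrontEnd()
mixed = front_end(torch.randint(0, 500, (1, 7)), torch.randn(1, 7, 4))
print(mixed.shape)  # torch.Size([1, 7, 256])
```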
A traditional acoustic model predicts speech features from the syllable sequence alone, which makes it difficult to model the prosody, emotion and other information contained in the target speech features: the same syllable sequence expressed with different prosody produces different speech and hence different speech features. The prosody prediction model is therefore designed to learn and model the prosody, emotion and other information of speech from the speech features. During training, the prosody prediction model and the speech feature prediction model can be trained together.
S140, synthesizing the voice characteristic spectrum and the text to be synthesized into voice with prosodic emotion.
Specifically, the speech signal generated from the speech feature spectrum, i.e. speech with prosodic emotion, is produced as the synthesized speech of the text to be synthesized.
As a preferred embodiment of the present invention, synthesizing the speech feature spectrum and the text to be synthesized into speech with prosodic emotion comprises:
performing speech synthesis on the text to be synthesized according to the speech feature spectrum through a vocoder to obtain speech with prosodic emotion.
Specifically, the speech feature spectrum can be synthesized into speech by a vocoder, which generates a speech signal from speech features (a mel-spectrogram). The vocoder is fairly general-purpose; the invention preferably uses Parallel WaveGAN as the vocoder network.
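As a usage sketch, the final step might look like the snippet below; the vocoder object, its inference method and the sample rate are placeholders standing in for whatever Parallel WaveGAN implementation is used, not a documented API.

```python
import numpy as np
import soundfile as sf

def synthesize_waveform(mel_spectrogram: np.ndarray, vocoder, sample_rate: int = 22050) -> np.ndarray:
    # The vocoder is assumed to expose an inference method mapping a
    # mel-spectrogram of shape (frames, mel_bins) to raw waveform samples.
    waveform = vocoder.inference(mel_spectrogram)
    sf.write("speech_with_prosodic_emotion.wav", np.asarray(waveform), sample_rate)
    return waveform
```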
Fig. 2 is a functional block diagram of a speech synthesis apparatus according to an embodiment of the present invention.
The speech synthesis apparatus 200 of the present invention may be installed in an electronic device. Depending on the implemented functions, the speech synthesis apparatus may include a semantic extraction module 210, a prosodic emotion feature acquisition module 220, a speech feature acquisition module 230, and a speech synthesis module 240. The module of the present invention may also be referred to as a unit, meaning a series of computer program segments capable of being executed by the processor of the electronic device and of performing fixed functions, stored in the memory of the electronic device.
In the present embodiment, the functions concerning the respective modules/units are as follows:
The semantic extraction module 210 is configured to perform semantic extraction processing on the obtained text to be synthesized, so as to obtain a semantic feature sequence.
Specifically, since the prosodic emotion of a sentence is often related to its semantics, text semantic information is introduced from the input text to be synthesized in order to improve the effect of prosody prediction.
As a preferred embodiment of the present invention, performing semantic extraction processing on the acquired text to be synthesized to obtain a semantic feature sequence comprises:
performing character-level segmentation on the text to be synthesized to obtain a character set;
inputting the character set into a pre-trained language model for semantic extraction processing to obtain the semantic feature sequence; wherein the pre-trained language model comprises:
an input layer for inputting the character set, a vector coding layer for performing vector conversion processing on the character set from the input layer, a self-attention layer for computing enhanced semantic vectors from the character vectors, text vectors and position vectors obtained by the vector coding layer, a pooling layer for performing dimension reduction and concatenation on the enhanced semantic vector of each character obtained by the self-attention layer, and an output layer for outputting the semantic feature sequence obtained by the pooling layer.
Specifically, the whole sentences in the text to be synthesized are separated character by character to obtain a character set, which is then input into the pre-trained language model for semantic extraction processing; the pre-trained language model is a semantic extraction model from natural language processing (NLP).
In a preferred embodiment, the pre-trained language model is a BERT model, which computes a feature sequence reflecting the semantics of the input text to be synthesized. The model is a general-purpose network, and an open-source model can be used directly. Its structure may comprise an input layer, a vector coding layer, a self-attention layer, a pooling layer and an output layer. The character set of the text to be synthesized, whose elements are individual characters, enters the pre-trained language model through the input layer. The vector coding layer encodes each character in the character set according to vector templates learned in advance by the model, producing a character vector, a text vector and a position vector for each character. The value of the text vector is learned automatically during model training; it describes the global semantic information of the text to be synthesized and is fused with the semantic information of the individual characters. The position vector reflects the fact that characters at different positions of the text carry different semantic information (for example, 'I want you' versus 'you want me'), so characters at different positions are distinguished by adding different position vectors. The meaning a character expresses in a text is generally related to its context, so the context of a character helps to enhance its semantic representation. The self-attention layer enhances the semantic representation formed by the character vectors, text vectors and position vectors, yielding an enhanced semantic vector for each character; the pooling layer then performs feature dimension reduction and concatenation on these enhanced semantic vectors, and the output layer outputs the semantic feature sequence.
The prosodic emotion feature obtaining module 220 is configured to perform prosodic prediction processing on the semantic feature sequence through a prosodic prediction model, so as to obtain prosodic emotion features of the text to be synthesized. It is emphasized that the prosody prediction model is stored in the blockchain.
Specifically, the prosodic emotion features are obtained through the prosody prediction model. The information contained in a piece of speech can be divided into two parts: 1) the syllable pronunciation information corresponding to the speech, i.e. the syllable sequence of the text content; and 2) the remaining pronunciation characteristics once the syllable information is removed, mainly prosody, emotion, intonation and the like, which are collectively referred to here as prosodic emotion features. Since prosodic emotion features cannot be extracted from the speech directly, a prosody prediction model is required, which learns during the training process of speech synthesis the ability to obtain prosodic emotion features from speech features.
As a preferred embodiment of the present invention, the prosody prediction model includes:
an input linear layer; a memory network layer for finding, according to the semantic feature sequence received by the input linear layer, the prosodic emotion features corresponding to the semantic feature sequence from prosodic emotion samples learned in advance; and an output linear layer for outputting the corresponding prosodic emotion features.
Specifically, the prosody prediction model is a deep learning network whose output linear layer outputs a low-dimensional feature sequence, namely the prosodic emotion features (the dimension of the feature sequence can be set to 3, 4 or 5 as required, and preferably does not exceed 10).
The speech feature acquisition module 230 is used for inputting the prosodic emotion features and the pre-acquired syllable sequence of the text to be synthesized simultaneously into a speech feature prediction model, and performing speech prediction processing to obtain a speech feature spectrum of the text to be synthesized.
Specifically, the syllable sequence of the text to be synthesized is first converted into syllable embedding vectors; the prosodic emotion features (sequence vectors) are passed through a linear layer, and the result is superimposed onto the syllable embedding vectors to obtain the speech feature spectrum. The syllable sequence of the text to be synthesized is the pinyin sequence corresponding to the text, split by syllable; for example, for '中国平安' (Ping An of China) the pinyin sequence is 'zhong1 guo2 ping2 an1' and the syllable sequence is [zh, ong1, g, uo2, p, ing2, an1].
As a preferred embodiment of the present invention, before the prosodic emotion features and the pre-acquired syllable sequence of the text to be synthesized are input simultaneously into the speech feature prediction model and the speech prediction processing is performed to obtain the speech feature spectrum of the text to be synthesized, the method further comprises:
performing tone adjustment on the prosodic emotion features according to a preset tone adjustment rule to obtain standby prosodic emotion features of the text to be synthesized.
Specifically, the preset tone adjustment rule contains the tone values of the prosodic emotion features of each character in different scenarios; the tone values of the prosodic emotion features are adjusted according to the scenario, yielding the prosodic emotion features of each character in that scenario. In this way the prosodic emotion of the synthesized speech can be adjusted in real time according to the requirements of the application scenario, so that speech with different prosodic effects can be synthesized.
As a preferred embodiment of the present invention, performing the tone adjustment on the prosodic emotion features according to the preset tone adjustment rule to obtain the standby prosodic emotion features of the text to be synthesized comprises:
acquiring sequence element vectors of the prosodic emotion features;
and adjusting the numerical values of the sequence element vectors according to the preset tone adjustment rule to obtain the standby prosodic emotion features of the text to be synthesized.
Specifically, the prosodic emotion feature predicted by the prosody prediction model is actually a sequence, each element of which is the prosodic emotion feature vector of one character; the numerical values of these vectors can be modified according to the preset tone adjustment rule to adjust the prosodic emotion. For example, let Z = {z_1, z_2, ..., z_n} denote the prosodic emotion feature sequence predicted by the prosody prediction model and α = {α_1, α_2, ..., α_n} the adjustment coefficients (each coefficient lies between -1 and 1); the adjusted prosody is then Z' = Z + U·α, where U denotes the adjustable prosody range. The preset tone adjustment rule can be set as required.
As a preferred embodiment of the present invention, the speech feature prediction model includes:
a character embedding layer for converting the syllable sequence into syllable embedding vectors, a superposition layer for superimposing the linearly processed prosodic emotion features onto the syllable embedding vectors, and a speech feature output layer for outputting the speech feature spectrum obtained by the superposition layer.
Specifically, the speech feature prediction model mainly consists of a speech feature prediction network; a mature acoustic model such as Tacotron can be used directly as the basic network structure, with a slight adjustment: in Tacotron, the syllable sequence is first converted into syllable embedding vectors by a character embedding network (or layer), the prosodic emotion features are passed through a linear layer, and the result is superimposed onto the syllable embedding vectors.
A traditional acoustic model predicts speech features from the syllable sequence alone, which makes it difficult to model the prosody, emotion and other information contained in the target speech features: the same syllable sequence expressed with different prosody produces different speech and hence different speech features. The prosody prediction model is therefore designed to learn and model the prosody, emotion and other information of speech from the speech features. During training, the prosody prediction model and the speech feature prediction model can be trained together.
The speech synthesis module 240 is used for synthesizing the speech feature spectrum and the text to be synthesized into speech with prosodic emotion.
Specifically, the speech signal generated from the speech feature spectrum, i.e. speech with prosodic emotion, is produced as the synthesized speech of the text to be synthesized.
As a preferred embodiment of the present invention, synthesizing the speech feature spectrum and the text to be synthesized into speech with prosodic emotion comprises:
performing speech synthesis on the text to be synthesized according to the speech feature spectrum through a vocoder to obtain speech with prosodic emotion.
Specifically, the speech feature spectrum can be synthesized into speech by a vocoder, which generates a speech signal from speech features (a mel-spectrogram). The vocoder is fairly general-purpose; the invention preferably uses Parallel WaveGAN as the vocoder network.
Fig. 3 is a schematic structural diagram of an electronic device implementing a speech synthesis method according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a speech synthesis program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, including flash memory, mobile hard disk, multimedia card, card memory (e.g., SD or DX memory), magnetic memory, magnetic disk, optical disk, etc. In some embodiments the memory 11 may be an internal storage unit of the electronic device 1, such as a mobile hard disk of the electronic device 1. In other embodiments the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a smart media card (SMC), a Secure Digital (SD) card or a flash card provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only for storing application software installed in the electronic device 1 and various types of data, such as the code of the speech synthesis program, but also for temporarily storing data that has been output or is to be output.
In some embodiments, the processor 10 may be composed of integrated circuits, for example a single packaged integrated circuit, or of multiple integrated circuits packaged with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is the control unit of the electronic device; it connects the components of the entire electronic device using various interfaces and lines, and executes the functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 11 (e.g., the speech synthesis program) and calling the data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable communication between the memory 11, the at least one processor 10 and the other components.
Fig. 3 shows only an electronic device with components, it being understood by a person skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or may combine certain components, or may be arranged in different components.
For example, although not shown, the electronic device 1 may further include a power source (such as a battery) for supplying power to each component, and preferably, the power source may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management, and the like are implemented through the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device 1 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described herein.
Further, the electronic device 1 may also comprise a network interface, optionally the network interface may comprise a wired interface and/or a wireless interface (e.g. WI-FI interface, bluetooth interface, etc.), typically used for establishing a communication connection between the electronic device 1 and other electronic devices.
The electronic device 1 may optionally further comprise a user interface, which may be a Display, an input unit, such as a Keyboard (Keyboard), or a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the electronic device 1 and for displaying a visual user interface.
It should be understood that the embodiment described is for illustrative purposes only, and the scope of the patent application is not limited to this configuration.
The speech synthesis program 12 stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed in the processor 10, enable:
carrying out semantic extraction processing on the acquired text to be synthesized to obtain a semantic feature sequence;
Performing prosody prediction processing on the semantic feature sequence through a prosody prediction model to obtain prosody emotion features of the text to be synthesized;
inputting prosodic emotion characteristics and syllable sequences of a pre-acquired text to be synthesized into a voice characteristic prediction model at the same time, and performing voice prediction processing to obtain a voice characteristic spectrum of the text to be synthesized;
and synthesizing the voice characteristic spectrum and the text to be synthesized into voice with prosodic emotion.
Specifically, the specific implementation method of the above instructions by the processor 10 may refer to the description of the relevant steps in the corresponding embodiment of fig. 1, which is not repeated herein. It is emphasized that, to further ensure the privacy and safety of the prosody prediction model, the prosody prediction model may also be stored in a blockchain node.
Further, the modules/units integrated in the electronic device 1 may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as separate products. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM).
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in a form of hardware or a form of hardware and a form of software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The blockchain (Blockchain), essentially a de-centralized database, is a string of data blocks that are generated in association using cryptographic methods, each of which contains information from a batch of network transactions for verifying the validity (anti-counterfeit) of its information and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the system claims can also be implemented by one unit or means through software or hardware. Terms such as "first" and "second" are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (6)

1. A method of speech synthesis, the method comprising:
performing semantic extraction processing on an acquired text to be synthesized to obtain a semantic feature sequence; wherein performing the semantic extraction processing on the acquired text to be synthesized to obtain the semantic feature sequence comprises:
performing character-level segmentation on the text to be synthesized to obtain a character set;
inputting the character set into a pre-trained language model for semantic extraction processing to obtain the semantic feature sequence; wherein the pre-trained language model comprises:
an input layer for inputting the character set, a vector coding layer for performing vector conversion processing on the character set from the input layer, a self-attention layer for computing enhanced semantic vectors from the character vectors, text vectors and position vectors obtained by the vector coding layer, a pooling layer for performing dimension reduction and concatenation on the enhanced semantic vector of each character obtained by the self-attention layer, and an output layer for outputting the semantic feature sequence obtained by the pooling layer;
Performing prosody prediction processing on the semantic feature sequence through a prosody prediction model to obtain prosody emotion features of the text to be synthesized;
performing tone adjustment on the prosodic emotion features according to a preset tone adjustment rule to obtain standby prosodic emotion features of the text to be synthesized, which comprises: acquiring sequence element vectors of the prosodic emotion features; and adjusting the numerical values of the sequence element vectors according to the preset tone adjustment rule to obtain the standby prosodic emotion features of the text to be synthesized;
inputting the standby prosodic emotion features and a syllable sequence of the text to be synthesized simultaneously into a speech feature prediction model, and performing speech prediction processing to obtain a speech feature spectrum of the text to be synthesized; wherein the speech feature prediction model comprises: a character embedding layer for converting the syllable sequence into syllable embedding vectors, a superposition layer for superimposing the linearly processed standby prosodic emotion features onto the syllable embedding vectors, and a speech feature output layer for outputting the speech feature spectrum obtained by the superposition layer;
And synthesizing the voice characteristic spectrum and the text to be synthesized into voice with prosodic emotion.
2. The speech synthesis method according to claim 1, wherein the prosody prediction model comprises:
an input linear layer; a memory network layer for finding, according to the semantic feature sequence received by the input linear layer, the prosodic emotion features corresponding to the semantic feature sequence from prosodic emotion samples learned in advance; and an output linear layer for outputting the corresponding prosodic emotion features.
3. The method of claim 1, wherein synthesizing the speech feature spectrum and the text to be synthesized into speech with prosodic emotion comprises:
And performing voice synthesis on the text to be synthesized according to the voice characteristic spectrum through a vocoder to obtain voice with prosodic emotion.
4. A speech synthesis apparatus, the apparatus comprising:
The semantic extraction module is used for performing semantic extraction processing on an acquired text to be synthesized to obtain a semantic feature sequence; wherein performing the semantic extraction processing on the acquired text to be synthesized to obtain the semantic feature sequence comprises:
performing character-level segmentation on the text to be synthesized to obtain a character set;
inputting the character set into a pre-trained language model for semantic extraction processing to obtain the semantic feature sequence; wherein the pre-trained language model comprises:
an input layer for inputting the character set, a vector coding layer for performing vector conversion processing on the character set from the input layer, a self-attention layer for computing enhanced semantic vectors from the character vectors, text vectors and position vectors obtained by the vector coding layer, a pooling layer for performing dimension reduction and concatenation on the enhanced semantic vector of each character obtained by the self-attention layer, and an output layer for outputting the semantic feature sequence obtained by the pooling layer;
The prosodic emotion feature acquisition module is used for performing prosody prediction processing on the semantic feature sequence through a prosody prediction model to obtain prosodic emotion features of the text to be synthesized, and for performing tone adjustment on the prosodic emotion features according to a preset tone adjustment rule to obtain standby prosodic emotion features of the text to be synthesized, which comprises: acquiring sequence element vectors of the prosodic emotion features; and adjusting the numerical values of the sequence element vectors according to the preset tone adjustment rule to obtain the standby prosodic emotion features of the text to be synthesized;
The speech feature acquisition module is used for inputting the standby prosodic emotion features and a pre-acquired syllable sequence of the text to be synthesized simultaneously into a speech feature prediction model, and performing speech prediction processing to obtain a speech feature spectrum of the text to be synthesized; wherein the speech feature prediction model comprises: a character embedding layer for converting the syllable sequence into syllable embedding vectors, a superposition layer for superimposing the linearly processed standby prosodic emotion features onto the syllable embedding vectors, and a speech feature output layer for outputting the speech feature spectrum obtained by the superposition layer;
and a speech synthesis module for synthesizing the speech feature spectrum and the text to be synthesized into speech with prosodic emotion.
5. An electronic device, the electronic device comprising:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the speech synthesis method of any one of claims 1 to 3.
6. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the speech synthesis method according to any one of claims 1 to 3.
CN202011138907.6A 2020-10-22 2020-10-22 Speech synthesis method, device, equipment and computer readable storage medium Active CN112086086B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011138907.6A CN112086086B (en) 2020-10-22 2020-10-22 Speech synthesis method, device, equipment and computer readable storage medium
PCT/CN2020/136421 WO2021189984A1 (en) 2020-10-22 2020-12-15 Speech synthesis method and apparatus, and device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011138907.6A CN112086086B (en) 2020-10-22 2020-10-22 Speech synthesis method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112086086A CN112086086A (en) 2020-12-15
CN112086086B true CN112086086B (en) 2024-06-25

Family

ID=73730362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011138907.6A Active CN112086086B (en) 2020-10-22 2020-10-22 Speech synthesis method, device, equipment and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN112086086B (en)
WO (1) WO2021189984A1 (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112086086B (en) * 2020-10-22 2024-06-25 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and computer readable storage medium
CN112599113B (en) * 2020-12-30 2024-01-30 北京大米科技有限公司 Dialect voice synthesis method, device, electronic equipment and readable storage medium
CN112750419B (en) * 2020-12-31 2024-02-13 科大讯飞股份有限公司 Speech synthesis method, device, electronic equipment and storage medium
CN113113047A (en) * 2021-03-17 2021-07-13 北京大米科技有限公司 Audio processing method and device, readable storage medium and electronic equipment
CN113096634B (en) * 2021-03-30 2024-03-01 平安科技(深圳)有限公司 Speech synthesis method, device, server and storage medium
CN113112985B (en) * 2021-04-21 2022-01-18 合肥工业大学 Speech synthesis method based on deep learning
CN113345412A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium
CN113345417B (en) * 2021-05-31 2024-03-01 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and storage medium
CN113782030B (en) * 2021-09-10 2024-02-02 平安科技(深圳)有限公司 Error correction method based on multi-mode voice recognition result and related equipment
CN113990286A (en) * 2021-10-29 2022-01-28 北京大学深圳研究院 Speech synthesis method, apparatus, device and storage medium
CN113948062B (en) * 2021-12-20 2022-08-16 阿里巴巴达摩院(杭州)科技有限公司 Data conversion method and computer storage medium
CN115831089B (en) * 2021-12-27 2023-12-01 北京百度网讯科技有限公司 Acoustic feature determination method, acoustic feature determination device, acoustic feature determination equipment, acoustic feature determination medium and acoustic feature determination product
CN115410550B (en) * 2022-06-02 2024-03-26 北京听见科技有限公司 Fine granularity prosody controllable emotion voice synthesis method, system and storage medium
CN116665643B (en) * 2022-11-30 2024-03-26 荣耀终端有限公司 Rhythm marking method and device and terminal equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110782870A (en) * 2019-09-06 2020-02-11 腾讯科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN110928997A (en) * 2019-12-04 2020-03-27 北京文思海辉金信软件有限公司 Intention recognition method and device, electronic equipment and readable storage medium
CN111754976A (en) * 2020-07-21 2020-10-09 中国科学院声学研究所 Rhythm control voice synthesis method, system and electronic device

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4640063B2 (en) * 2005-09-13 2011-03-02 沖電気工業株式会社 Speech synthesis method, speech synthesizer, and computer program
JP2009133890A (en) * 2007-11-28 2009-06-18 Toshiba Corp Voice synthesizing device and method
CN106652995A (en) * 2016-12-31 2017-05-10 深圳市优必选科技有限公司 Voice broadcasting method and system for text
CN108470024B (en) * 2018-03-12 2020-10-30 北京灵伴即时智能科技有限公司 Chinese prosodic structure prediction method fusing syntactic and semantic information
CN110335587B (en) * 2019-06-14 2023-11-10 平安科技(深圳)有限公司 Speech synthesis method, system, terminal device and readable storage medium
CN110299131B (en) * 2019-08-01 2021-12-10 苏州奇梦者网络科技有限公司 Voice synthesis method and device capable of controlling prosodic emotion and storage medium
CN110534087B (en) * 2019-09-04 2022-02-15 清华大学深圳研究生院 Text prosody hierarchical structure prediction method, device, equipment and storage medium
CN112086086B (en) * 2020-10-22 2024-06-25 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and computer readable storage medium

Also Published As

Publication number Publication date
WO2021189984A1 (en) 2021-09-30
CN112086086A (en) 2020-12-15

Similar Documents

Publication Publication Date Title
CN112086086B (en) Speech synthesis method, device, equipment and computer readable storage medium
CN112309366B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
EP4029010B1 (en) Neural text-to-speech synthesis with multi-level context features
CN111667812A (en) Voice synthesis method, device, equipment and storage medium
CN112951203B (en) Speech synthesis method, device, electronic equipment and storage medium
CN113345431B (en) Cross-language voice conversion method, device, equipment and medium
CN112466273A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN112309367B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111696521A (en) Method for training speech clone model, readable storage medium and speech clone method
CN112771607A (en) Electronic device and control method thereof
CN116312463A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN114495902A (en) Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116386594A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN113870835A (en) Speech synthesis method, apparatus, device and storage medium based on artificial intelligence
CN113707124A (en) Linkage broadcasting method and device of voice operation, electronic equipment and storage medium
Cosi et al. Baldini: baldi speaks italian!
CN116469374A (en) Speech synthesis method, device, equipment and storage medium based on emotion space
CN116580698A (en) Speech synthesis method, device, computer equipment and medium based on artificial intelligence
CN116665639A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN113555003B (en) Speech synthesis method, device, electronic equipment and storage medium
CN114694633A (en) Speech synthesis method, apparatus, device and storage medium
CN113990286A (en) Speech synthesis method, apparatus, device and storage medium
CN113160793A (en) Speech synthesis method, device, equipment and storage medium based on low resource language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant