WO2022178941A1 - Speech synthesis method and apparatus, and device and storage medium - Google Patents

Speech synthesis method and apparatus, and device and storage medium

Info

Publication number
WO2022178941A1
Authority
WO
WIPO (PCT)
Prior art keywords
style
audio
text
vector information
speech
Application number
PCT/CN2021/084167
Other languages
French (fr)
Chinese (zh)
Inventor
孙奥兰 (Sun Aolan)
王健宗 (Wang Jianzong)
程宁 (Cheng Ning)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2022178941A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • the present application relates to the technical field of speech processing, and in particular, to a speech synthesis method, apparatus, computer device, and computer-readable storage medium.
  • One of the purposes of the embodiments of the present application is to provide a speech synthesis method, apparatus, computer device and computer-readable storage medium, so as to solve the prior-art technical problem that the speaking style cannot be individually controlled and the emotional expression of the synthesized speech is very limited.
  • an embodiment of the present application provides a speech synthesis method, including:
  • the preset speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer;
  • Feature extraction is performed on the Mel spectrogram through the output layer, and the target audio of the text to be processed is output.
  • an embodiment of the present application provides a speech synthesis apparatus, including:
  • the first acquisition module is used to acquire the text to be processed and the audio of the speech style to be synthesized, and input the text to be processed and the audio of the speech style to be synthesized into a preset speech synthesis model, wherein the speech synthesis model includes multiple reference encoder, text encoder, fully connected layer and output layer;
  • a second acquisition module, configured to encode the speaking-style audio to be synthesized based on the multi-reference encoder to obtain style embedding vector information;
  • a third acquisition module, configured to encode the text to be processed based on the text encoder to obtain text encoding vector information;
  • a generation module, configured to splice the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram;
  • an output module, configured to perform feature extraction on the Mel spectrogram through the output layer and output the target audio of the text to be processed.
  • an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements when executing the computer program:
  • the preset speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer;
  • Feature extraction is performed on the Mel spectrogram through the output layer, and the target audio of the text to be processed is output.
  • an embodiment of the present application provides a computer-readable storage medium, which may be non-volatile or volatile; the computer-readable storage medium stores a computer program that, when executed by a processor, implements:
  • the preset speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer;
  • Feature extraction is performed on the Mel spectrogram through the output layer, and the target audio of the text to be processed is output.
  • the embodiments of the present application have the following beneficial effects: the text to be processed and the speaking-style audio to be synthesized are acquired and input into a preset speech synthesis model,
  • where the preset speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer; the speaking-style audio to be synthesized is encoded based on the multi-reference encoder to obtain style embedding vector information; the text to be processed is encoded based on the text encoder to obtain text encoding vector information; the style embedding vector information and the text encoding vector information are spliced through the fully connected layer to generate a Mel spectrogram; feature extraction is performed on the Mel spectrogram through the output layer, and the target audio of the text to be processed is output, thereby realizing control over the speaking style of the synthesized speech and synthesizing speech with richer emotional expression.
  • FIG. 1 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of sub-steps of the speech synthesis method in FIG. 1;
  • FIG. 3 is a schematic flowchart of sub-steps of the speech synthesis method in FIG. 1;
  • FIG. 4 is a schematic block diagram of a speech synthesis apparatus provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural block diagram of a computer device according to an embodiment of the present application.
  • Embodiments of the present application provide a speech synthesis method, apparatus, computer device, and computer-readable storage medium.
  • the speech synthesis method may be applied to a computer device, and the computer device may be an electronic device such as a notebook computer or a desktop computer.
  • FIG. 1 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application.
  • the speech synthesis method includes steps S101 to S105.
  • Step S101: Acquire the text to be processed and the speaking-style audio to be synthesized, and input the text to be processed and the speaking-style audio to be synthesized into a preset speech synthesis model, where the preset speech synthesis model includes a multi-reference encoder, a text encoder, a connection layer, and an output layer.
  • the text to be processed and the speaking-style audio to be synthesized are acquired, where the text to be processed includes short sentences or short texts, and the speaking-style audio to be synthesized covers timbre, emotion, and prosody.
  • the acquisition method includes acquiring pre-stored text to be processed and/or speech-style audio to be synthesized through a preset storage path, or acquiring pre-stored text to be processed and/or speech-style audio to be synthesized from a preset blockchain.
  • when the text to be processed and the speaking-style audio to be synthesized are acquired, they are input into the preset speech synthesis model, where the preset speech synthesis model includes a multi-reference encoder, a text encoder, and the like.
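  • For illustration only (not part of the application's disclosure), the following is a minimal Python sketch of step S101's input preparation, assuming the style reference audio is converted to a log-mel representation before it reaches the reference encoders; the file paths, sample rate and 80-mel configuration are assumptions for the example only.

```python
import librosa
import numpy as np

def load_inputs(text_path: str, audio_path: str, n_mels: int = 80):
    # Text to be processed: a short sentence or short text.
    with open(text_path, encoding="utf-8") as f:
        text = f.read().strip()

    # Speaking-style reference audio (carries timbre / emotion / prosody).
    wav, sr = librosa.load(audio_path, sr=22050)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    log_mel = np.log(mel + 1e-6)    # log compression, a common TTS front end
    return text, log_mel.T          # (frames, n_mels)
```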
  • Step S102 Encode the speech style audio to be synthesized based on the multi-reference encoder to obtain style embedding vector information.
  • the speech style audio to be synthesized is encoded by the multi-reference encoder in the speech synthesis model to obtain style embedding vector information corresponding to the speech style audio to be synthesized.
  • the reference encoder is composed of a convolutional neural network (Convolutional Neural Network, CNN) and a recurrent neural network (Recurrent Neural Network, RNN); the convolutional part consists of multiple two-dimensional convolutional layers, and the recurrent part consists of one RNN.
  • the kernel of each two-dimensional convolutional layer may be 3*3, and the stride may be 2*2.
  • for example, if the CNN part has six two-dimensional convolutional layers, their output channels can be set to 32, 32, 64, 64, 128 and 128 in sequence.
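  • For illustration only, the following is a minimal PyTorch sketch of one such reference encoder: six two-dimensional convolutional layers with 3*3 kernels, 2*2 strides and output channels 32, 32, 64, 64, 128, 128, followed by a single recurrent layer. The GRU choice, the batch normalization and the 128-dimensional output are assumptions; the application only specifies a CNN followed by an RNN.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, out_dim: int = 128):
        super().__init__()
        channels = [1, 32, 32, 64, 64, 128, 128]
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(channels[i], channels[i + 1],
                          kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(channels[i + 1]),
                nn.ReLU(),
            )
            for i in range(6)
        ])
        # Each stride-2 convolution halves the mel axis (rounding up).
        freq = n_mels
        for _ in range(6):
            freq = (freq + 1) // 2
        self.rnn = nn.GRU(128 * freq, out_dim, batch_first=True)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels) -> add a channel axis for Conv2d.
        x = self.convs(mel.unsqueeze(1))            # (B, 128, T', F')
        b, c, t, f = x.shape
        x = x.transpose(1, 2).reshape(b, t, c * f)  # keep the time axis
        _, h = self.rnn(x)                          # final hidden state
        return h.squeeze(0)                         # (B, out_dim) reference embedding
```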
  • step S102 includes: sub-step S1021 to sub-step S1022.
  • Sub-step S1021: Encode the timbre speaking-style audio, emotional speaking-style audio and prosodic speaking-style audio with the plurality of reference encoders, respectively, to obtain reference embedded latent vector information.
  • the speech synthesis model includes a plurality of reference encoders, and each reference encoder encodes the timbre, emotional and prosodic speaking-style audio to obtain the target reference embedding vector corresponding to the speaking-style audio to be synthesized.
  • specifically, audio features are extracted from each of the timbre, emotional and prosodic speaking-style audio and passed in turn through each two-dimensional convolutional layer of the convolutional neural network in the reference encoder; the resulting tensor is transformed into a three-dimensional tensor while the time resolution of the output is preserved, and the three-dimensional tensor is then processed by the recurrent neural network layer in the reference encoder to obtain the reference embedded latent vector information corresponding to the timbre, emotional and prosodic speaking-style audio.
  • Sub-step S1022 Calculate the reference embedded latent vector information according to the multi-head attention mechanism to obtain style embedded vector information.
  • after the reference embedded latent vector information is obtained, the multi-head attention mechanism is used to calculate the similarity between the preset vector corresponding to each preset style tag and the reference embedded latent vector information.
  • these similarities determine the style weight of each preset style tag for the timbre, emotional and prosodic speaking-style audio: the similarities of all preset style tags are accumulated into a total similarity, the ratio of each tag's similarity to the total similarity is calculated, and that ratio is taken as the tag's style weight.
  • for example, if there are five preset style tags whose similarities to the reference embedded latent vector information are 0.6, 0.3, 0.4, 0.4 and 0.3, the total similarity is 2, and the style weights of the tags for the timbre, emotional and prosodic speaking-style audio are 0.3, 0.15, 0.2, 0.2 and 0.15, respectively.
  • the style weight of each style tag is then multiplied by the reference embedded latent vector information to obtain the style embedding vector of each preset style tag, and the style embedding vectors of all tags are accumulated to obtain the target style embedding vector corresponding to the speaking-style audio to be synthesized.
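  • As a hedged illustration of sub-step S1022 under the ratio-based weighting described above, the sketch below scales the tag vectors by their weights, which is the usual style-token formulation; the text can also be read as scaling the reference embedding itself, and a full multi-head attention would additionally split queries and keys into several heads. The tag values and dimensions are assumptions.

```python
import torch

def style_embedding(ref_embed: torch.Tensor,
                    style_tags: torch.Tensor) -> torch.Tensor:
    """ref_embed: (dim,) reference embedded latent vector;
    style_tags: (num_tags, dim) preset vectors of the style tags."""
    sims = style_tags @ ref_embed   # one similarity per preset style tag
    weights = sims / sims.sum()     # ratio to the total similarity, e.g.
                                    # 0.6,0.3,0.4,0.4,0.3 -> 0.3,0.15,0.2,0.2,0.15
    # Accumulate the weight-scaled tag vectors into one style embedding.
    return (weights.unsqueeze(1) * style_tags).sum(dim=0)
```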
  • Step S103 Encode the text to be processed based on the text encoder to obtain text encoding vector information.
  • the text to be processed is encoded by the text encoder to obtain the corresponding text encoding vector information; for example, the text encoder includes a weight matrix, and the text to be processed is mapped through the weight matrix to obtain the corresponding text encoding vector information.
  • step S103 includes: sub-step S1031 to sub-step S1032.
  • Sub-step S1031: Split the text to be processed into individual words through the text encoder, and obtain the sequence relationship between the words.
  • when the encoder detects the text to be processed, it splits the text into individual words and obtains the sequence relationship between them. For example, the text to be processed is "我爱中国" ("I love China"), which is split into "我", "爱", "中", "国", and the obtained order is "我"--"爱"--"中"--"国".
  • Sub-step S1032: Map and convert the words and the sequence relationships between them to generate the text encoding vector information of the text to be synthesized.
  • when the words of the text to be processed and the sequence relationships between them are obtained, each word and the sequence relationships are mapped to obtain word vector information for each word and sequence vector information between the words, i.e., edge vector information; the word vector information and the edge vector information are combined to obtain the corresponding text encoding vector information, where the weight in the edge vector information is 0.
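  • The following is a minimal sketch of sub-steps S1031 and S1032, assuming a character-level vocabulary and an embedding lookup standing in for the weight matrix; the vocabulary and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab: dict, dim: int = 128):
        super().__init__()
        self.vocab = vocab                          # e.g. {"我": 0, "爱": 1, "中": 2, "国": 3}
        self.embed = nn.Embedding(len(vocab), dim)  # the weight matrix

    def forward(self, text: str):
        tokens = list(text)                         # "我爱中国" -> ["我", "爱", "中", "国"]
        ids = torch.tensor([self.vocab[t] for t in tokens])
        word_vecs = self.embed(ids)                 # (seq_len, dim) word vector information
        # Sequence relation between neighbouring words, kept as zero-weighted edges.
        edges = [(i, i + 1, 0.0) for i in range(len(tokens) - 1)]
        return word_vecs, edges
```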
  • Step S104: Splice the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram.
  • the Mel spectrogram is obtained by splicing the style embedding vector information and the text encoding vector information.
  • for example, the dimension information of the style embedding vector information and of the text encoding vector information is obtained respectively, and the two are spliced in the same dimension to generate the Mel spectrogram.
  • in an embodiment, the style embedding vector information is obtained through connection-layer broadcasting and is connected with the text encoding vector information to obtain splicing vector information; the splicing vector information is then decoded by a preset decoder to generate the Mel spectrogram.
  • specifically, the connection layer sends a broadcast to each multi-reference encoder; when a multi-reference encoder finishes encoding the speaking-style audio to be synthesized and obtains style embedding vector information, it sends that information to the connection layer of the fully connected layer.
  • when the connection layer obtains the style embedding vector information, it obtains the dimension information of the style embedding vector information and of the text encoding vector information respectively and splices the two according to that dimension information; the splicing includes dimension splicing.
  • for example, the dimension coordinates of the style embedding vector information and of the text encoding vector are determined, and the style embedding vector information and the text encoding vector information are spliced at the same dimension coordinates to obtain the corresponding splicing vector information.
  • when the splicing vector information is obtained, it is input into the preset decoder, which decodes it to generate the corresponding Mel spectrogram.
  • for example, the decoder converts the incoming splicing vector information into spectral signal information through its own decoding, and generates the Mel spectrogram from the spectral signal information.
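  • As an illustrative sketch of step S104 (the stand-in GRU decoder is an assumption; the application does not fix the decoder's structure), the style embedding is broadcast across the text sequence, concatenated with the text encoding along the same feature dimension, and decoded into Mel-spectrogram frames.

```python
import torch
import torch.nn as nn

def splice(style: torch.Tensor, text_enc: torch.Tensor) -> torch.Tensor:
    # style: (style_dim,); text_enc: (seq_len, text_dim)
    tiled = style.unsqueeze(0).expand(text_enc.size(0), -1)  # broadcast over time
    return torch.cat([text_enc, tiled], dim=-1)              # splice at the same dimension

class MelDecoder(nn.Module):
    def __init__(self, in_dim: int, n_mels: int = 80):
        super().__init__()
        self.rnn = nn.GRU(in_dim, 256, batch_first=True)
        self.proj = nn.Linear(256, n_mels)

    def forward(self, spliced: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(spliced.unsqueeze(0))
        return self.proj(out).squeeze(0)   # (seq_len, n_mels) Mel spectrogram
```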
  • Step S105: Perform feature extraction on the Mel spectrogram through the output layer, and output the target audio of the text to be processed.
  • after the Mel spectrum information is obtained, the output layer outputs the speech synthesis information of the Mel spectrum information.
  • for example, the output layer includes a vocoder; the vocoder acquires the voice and audio domain feature information in the Mel spectrum information, and generates speech synthesis information by synthesizing these voice and audio domain features.
  • in an embodiment, performing feature extraction on the Mel spectrogram through the output layer and outputting the target audio of the text to be processed includes: extracting the voice and audio domain features in the Mel spectrum information through the output layer, mapping the voice and audio domain features, and outputting the target audio of the text to be processed.
  • exemplarily, the voice and audio domain features in the Mel spectrum information are extracted through the output layer, and these features are mapped to obtain the speech synthesis information output by the output layer.
  • for example, the output layer includes an extraction layer and a mapping layer: the voice and audio domain features in the Mel spectrum information are extracted through the extraction layer, and the features are activated and mapped through the activation function in the mapping layer to obtain the speech synthesis information.
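  • For illustration, a minimal sketch of the extraction-layer/mapping-layer reading of step S105; this toy module stands in for a real neural vocoder (which the application does not pin down), and the convolution size, hop length and Tanh activation are assumptions.

```python
import torch
import torch.nn as nn

class OutputLayer(nn.Module):
    def __init__(self, n_mels: int = 80, hop: int = 256):
        super().__init__()
        self.extract = nn.Conv1d(n_mels, 256, kernel_size=5, padding=2)  # extraction layer
        self.map = nn.Sequential(nn.Linear(256, hop), nn.Tanh())         # activation mapping layer

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (frames, n_mels) -> (1, n_mels, frames) for Conv1d
        feats = self.extract(mel.T.unsqueeze(0)).squeeze(0).T  # (frames, 256) domain features
        frames = self.map(feats)                               # (frames, hop) waveform chunks
        return frames.reshape(-1)                              # target audio samples
```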
  • the acquired text to be processed and speaking-style audio to be synthesized are input into a preset speech synthesis model and encoded to obtain style embedding vector information and text encoding vector information; the style embedding vector information is spliced with the text encoding vector information to generate the Mel spectrogram; feature extraction is performed on the Mel spectrogram through the output layer, and the target audio of the text to be processed is output, thereby realizing control over the speaking style of the synthesized speech and synthesizing speech with richer emotional expression.
  • FIG. 4 is a schematic block diagram of a speech synthesis apparatus provided by an embodiment of the present application.
  • the speech synthesis apparatus 400 includes: a first acquisition module 401 , a second acquisition module 402 , a third acquisition module 403 , a generation module 404 , and an output module 405 .
  • the first acquisition module 401 is used to acquire the text to be processed and the speaking-style audio to be synthesized, and input the text to be processed and the speaking-style audio to be synthesized into a preset speech synthesis model, where the speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer;
  • a second obtaining module 402 configured to encode the speech style audio to be synthesized based on the multi-reference encoder to obtain style embedding vector information
  • a third acquiring module 403, configured to encode the text to be processed based on the text encoder to obtain text encoding vector information
  • the generating module 404 is used to splice the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram;
  • the output module 405 is configured to perform feature extraction on the Mel spectrogram through the output layer, and output the target audio of the text to be processed.
  • the second obtaining module 402 is also specifically used for:
  • the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio are respectively encoded to obtain reference embedded latent vector information
  • the reference embedded latent vector information is calculated according to the multi-head attention mechanism to obtain style embedded vector information.
  • the second obtaining module 402 is also specifically used for:
  • the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio are encoded by the convolutional neural networks in the multiple reference encoders, so as to obtain the three-dimensional tensor of the speaking style audio to be synthesized;
  • the three-dimensional tensor is processed by the recurrent neural network in the reference encoder to obtain the reference embedded latent vector information of the speech style audio to be synthesized.
  • the second obtaining module 402 is also specifically used for:
  • the similarity between the preset vector of each preset style tag and the reference embedded latent vector information is calculated through the multi-head attention mechanism, the style weight of each preset style tag is determined from the similarity, and the style embedding vector of each preset style tag is obtained and accumulated to obtain the style embedding vector information.
  • the third acquiring module 403 is also specifically used for: splitting the text to be processed into individual words through the text encoder, obtaining the sequence relationship between the words, and mapping and converting the words and the sequence relationships to generate the text encoding vector information;
  • the generating module 404 is also specifically used for: obtaining the style embedding vector information through connection-layer broadcasting, and connecting it with the text encoding vector information to obtain splicing vector information;
  • the splicing vector information is decoded by the preset decoder to generate a Mel spectrogram.
  • the output module is also used for:
  • the voice and audio domain features in the Mel spectrum information are extracted through the output layer, and the voice and audio domain features are mapped to output the target audio of the text to be processed.
  • the apparatuses provided by the above embodiments may be implemented in the form of a computer program, and the computer program may be executed on the computer device as shown in FIG. 5 .
  • FIG. 5 is a schematic block diagram of the structure of a computer device according to an embodiment of the present application.
  • the computer device may be a terminal.
  • the computer device includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a non-volatile storage medium and an internal memory.
  • the nonvolatile storage medium can store operating systems and computer programs.
  • the computer program includes program instructions that, when executed, cause the processor to perform any speech synthesis method.
  • the processor is used to provide computing and control capabilities to support the operation of the entire computer equipment.
  • the internal memory provides an environment for running the computer program in the non-volatile storage medium.
  • when the computer program is executed by the processor, it can cause the processor to execute any speech synthesis method.
  • the network interface is used for network communication, such as sending assigned tasks.
  • FIG. 5 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • the processor may be a central processing unit (Central Processing Unit, CPU), and the processor may also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor can be a microprocessor or the processor can also be any conventional processor or the like.
  • the processor is configured to run a computer program stored in the memory to implement the following steps:
  • the preset speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer;
  • Feature extraction is performed on the Mel spectrogram through the output layer, and the target audio of the text to be processed is output.
  • in an embodiment, the multi-reference encoder includes multiple reference encoders and a multi-head attention mechanism;
  • the speaking-style audio to be synthesized includes timbre speaking-style audio, emotional speaking-style audio and prosodic speaking-style audio; when the processor implements encoding the speaking-style audio to be synthesized based on the multi-reference encoder to obtain style embedding vector information, it is configured to implement:
  • the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio are respectively encoded to obtain reference embedded latent vector information
  • the reference embedded latent vector information is calculated according to the multi-head attention mechanism to obtain style embedded vector information.
  • when the processor implements encoding the speaking-style audio to be synthesized with the plurality of reference encoders to obtain reference embedded latent vector information, it is configured to implement:
  • the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio are encoded by the convolutional neural networks in the multiple reference encoders, so as to obtain the three-dimensional tensor of the speaking style audio to be synthesized;
  • the three-dimensional tensor is processed by the recurrent neural network in the reference encoder to obtain the reference embedded latent vector information of the speech style audio to be synthesized.
  • when the processor implements calculating the reference embedded latent vector information according to the multi-head attention mechanism to obtain the style embedding vector information, it is configured to implement:
  • calculating the similarity between the preset vector of each preset style tag and the reference embedded latent vector information, determining the style weight of each preset style tag from the similarity, and obtaining and accumulating the style embedding vector of each preset style tag to obtain the style embedding vector information.
  • when the processor implements encoding the text to be processed based on the text encoder to obtain text encoding vector information, it is configured to implement: splitting the text to be processed into individual words through the text encoder, obtaining the sequence relationship between the words, and mapping and converting the words and the sequence relationships to generate the text encoding vector information.
  • the fully connected layer includes a connection layer and a preset decoder; when the processor implements splicing the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram, it is configured to implement:
  • obtaining the style embedding vector information through connection-layer broadcasting, and connecting it with the text encoding vector information to obtain splicing vector information;
  • the splicing vector information is decoded by the preset decoder to generate a Mel spectrogram.
  • when the processor implements performing feature extraction on the Mel spectrogram through the output layer and outputting the target audio of the text to be processed, it is configured to implement:
  • extracting the voice and audio domain features in the Mel spectrum information through the output layer, and mapping the voice and audio domain features to output the target audio of the text to be processed.
  • Embodiments of the present application further provide a computer-readable storage medium, which may be non-volatile or volatile; a computer program is stored on the computer-readable storage medium, the computer program includes program instructions, and for the method implemented when the program instructions are executed, reference may be made to the embodiments of the speech synthesis method of the present application.
  • the computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or a memory of the computer device.
  • the computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the computer device.
  • the computer-readable storage medium may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, an application program required by at least one function, and the like, and the storage data area may store data created according to the use of the blockchain node, and the like.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed storage of the preset speech synthesis model, point-to-point transmission, consensus mechanisms, and encryption algorithms.
  • a blockchain is essentially a decentralized database: a chain of data blocks associated with one another by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A speech synthesis method and apparatus, and a computer device and a computer-readable storage medium. The method comprises: acquiring text to be processed and speaking-style audio to be synthesized, and inputting said text and said speaking-style audio into a preset speech synthesis model; encoding said speaking-style audio on the basis of a multi-reference encoder, so as to obtain style embedding vector information; encoding said text on the basis of a text encoder, so as to obtain text encoding vector information; splicing the style embedding vector information and the text encoding vector information by means of a fully connected layer, so as to generate a Mel spectrogram; and performing feature extraction on the Mel spectrogram by means of an output layer, and outputting target audio of said text. Control over the speaking style of synthesized speech is thus realized, such that speech with richer emotional expression is synthesized.

Description

Speech synthesis method, apparatus, device and storage medium

This application claims priority to the Chinese patent application No. 202110218672.X, entitled "Speech synthesis method, apparatus, device and storage medium", filed with the Patent Office of the State Intellectual Property Office of the People's Republic of China on February 26, 2021, the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the technical field of speech processing, and in particular, to a speech synthesis method, apparatus, computer device and computer-readable storage medium.
Background

In the process of speech synthesis, not only the clarity and fluency of the synthesized speech must be considered, but also its prosody information, so that the synthesized speech has rich emotional expression. When synthesizing speech, beyond the smoothness of sentences, changes in the speaker's emotional state must also be considered, and a model is used to learn the style information of reference audio so as to reach a level comparable to the human voice. The inventor realized that in current prosody model construction, the common approach is to group all speaking styles into a single expression; the speaking styles cannot be separated, so they cannot be individually controlled, and the emotional expression of the synthesized speech is very limited.

Technical Problem

One of the purposes of the embodiments of the present application is to provide a speech synthesis method, apparatus, computer device and computer-readable storage medium, so as to solve the prior-art technical problem that the speaking style cannot be individually controlled and the emotional expression of the synthesized speech is very limited.
Technical Solutions

In a first aspect, an embodiment of the present application provides a speech synthesis method, including:

acquiring text to be processed and speaking-style audio to be synthesized, and inputting the text to be processed and the speaking-style audio to be synthesized into a preset speech synthesis model, where the preset speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer;

encoding the speaking-style audio to be synthesized based on the multi-reference encoder to obtain style embedding vector information;

encoding the text to be processed based on the text encoder to obtain text encoding vector information;

splicing the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram;

performing feature extraction on the Mel spectrogram through the output layer, and outputting the target audio of the text to be processed.

In a second aspect, an embodiment of the present application provides a speech synthesis apparatus, including:

a first acquisition module, configured to acquire text to be processed and speaking-style audio to be synthesized, and input the text to be processed and the speaking-style audio to be synthesized into a preset speech synthesis model, where the speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer;

a second acquisition module, configured to encode the speaking-style audio to be synthesized based on the multi-reference encoder to obtain style embedding vector information;

a third acquisition module, configured to encode the text to be processed based on the text encoder to obtain text encoding vector information;

a generation module, configured to splice the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram;

an output module, configured to perform feature extraction on the Mel spectrogram through the output layer and output the target audio of the text to be processed.

In a third aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements:

acquiring text to be processed and speaking-style audio to be synthesized, and inputting the text to be processed and the speaking-style audio to be synthesized into a preset speech synthesis model, where the preset speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer;

encoding the speaking-style audio to be synthesized based on the multi-reference encoder to obtain style embedding vector information;

encoding the text to be processed based on the text encoder to obtain text encoding vector information;

splicing the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram;

performing feature extraction on the Mel spectrogram through the output layer, and outputting the target audio of the text to be processed.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, which may be non-volatile or volatile, and which stores a computer program; when the computer program is executed by a processor, it implements:

acquiring text to be processed and speaking-style audio to be synthesized, and inputting the text to be processed and the speaking-style audio to be synthesized into a preset speech synthesis model, where the preset speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer;

encoding the speaking-style audio to be synthesized based on the multi-reference encoder to obtain style embedding vector information;

encoding the text to be processed based on the text encoder to obtain text encoding vector information;

splicing the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram;

performing feature extraction on the Mel spectrogram through the output layer, and outputting the target audio of the text to be processed.
Beneficial Effects

Compared with the prior art, the embodiments of the present application have the following beneficial effects: the text to be processed and the speaking-style audio to be synthesized are acquired and input into a preset speech synthesis model, where the preset speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer; the speaking-style audio to be synthesized is encoded based on the multi-reference encoder to obtain style embedding vector information; the text to be processed is encoded based on the text encoder to obtain text encoding vector information; the style embedding vector information and the text encoding vector information are spliced through the fully connected layer to generate a Mel spectrogram; feature extraction is performed on the Mel spectrogram through the output layer, and the target audio of the text to be processed is output, thereby realizing control over the speaking style of the synthesized speech and synthesizing speech with richer emotional expression.

Description of Drawings

In order to explain the technical solutions of the embodiments of the present application more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.

FIG. 1 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of sub-steps of the speech synthesis method in FIG. 1;

FIG. 3 is a schematic flowchart of sub-steps of the speech synthesis method in FIG. 1;

FIG. 4 is a schematic block diagram of a speech synthesis apparatus provided by an embodiment of the present application;

FIG. 5 is a schematic structural block diagram of a computer device according to an embodiment of the present application.

The realization, functional characteristics and advantages of the purpose of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Embodiments of the Present Invention

The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.

The flowcharts shown in the figures are only illustrative; they need not include all of the contents and operations/steps, nor be executed in the order described. For example, some operations/steps may be decomposed, combined or partially merged, so the actual execution order may change according to the actual situation.

Embodiments of the present application provide a speech synthesis method, apparatus, computer device and computer-readable storage medium. The speech synthesis method may be applied to a computer device, which may be an electronic device such as a notebook computer or a desktop computer.

Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The embodiments described below and the features in the embodiments may be combined with each other without conflict.

Please refer to FIG. 1, which is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application.

As shown in FIG. 1, the speech synthesis method includes steps S101 to S105.
Step S101: Acquire text to be processed and speaking-style audio to be synthesized, and input the text to be processed and the speaking-style audio to be synthesized into a preset speech synthesis model, where the preset speech synthesis model includes a multi-reference encoder, a text encoder, a connection layer, and an output layer.

Exemplarily, the text to be processed and the speaking-style audio to be synthesized are acquired; the text to be processed includes a short sentence or a short text, and the speaking-style audio to be synthesized covers timbre, emotion and prosody. The acquisition methods include obtaining pre-stored text to be processed and/or speaking-style audio to be synthesized through a preset storage path, or obtaining them from a preset blockchain. When the text to be processed and the speaking-style audio to be synthesized are acquired, they are input into the preset speech synthesis model, which includes a multi-reference encoder, a text encoder, and the like.

Step S102: Encode the speaking-style audio to be synthesized based on the multi-reference encoder to obtain style embedding vector information.

Exemplarily, the speaking-style audio to be synthesized is encoded by the multi-reference encoder in the speech synthesis model to obtain the style embedding vector information corresponding to the speaking-style audio to be synthesized. In an embodiment, the reference encoder is composed of a convolutional neural network (CNN) and a recurrent neural network (RNN); the convolutional part consists of multiple two-dimensional convolutional layers, and the recurrent part consists of one RNN. The kernel of each two-dimensional convolutional layer may be 3*3 and the stride may be 2*2; for example, if the CNN part has six two-dimensional convolutional layers, their output channels may be set to 32, 32, 64, 64, 128 and 128 in sequence.

In an embodiment, specifically, referring to FIG. 2, step S102 includes sub-step S1021 to sub-step S1022.

Sub-step S1021: Encode the timbre speaking-style audio, emotional speaking-style audio and prosodic speaking-style audio with the plurality of reference encoders, respectively, to obtain reference embedded latent vector information.

Exemplarily, the speech synthesis model includes a plurality of reference encoders, and each reference encoder encodes the timbre, emotional and prosodic speaking-style audio to obtain the target reference embedding vector corresponding to the speaking-style audio to be synthesized. Specifically, the timbre, emotional and prosodic speaking-style audio are processed by the convolutional neural network in the reference encoder to obtain corresponding three-dimensional tensors: audio features are extracted from each of the timbre, emotional and prosodic speaking-style audio, passed in turn through each two-dimensional convolutional layer in the convolutional neural network to obtain a tensor, and that tensor is transformed into a three-dimensional tensor while the time resolution of the output is preserved; the three-dimensional tensor is then processed by the recurrent neural network layer in the reference encoder to obtain the reference embedded latent vector information corresponding to the timbre, emotional and prosodic speaking-style audio.

Sub-step S1022: Calculate the style embedding vector information from the reference embedded latent vector information according to the multi-head attention mechanism.

Exemplarily, after the reference embedded latent vector information is obtained, the similarity between the preset vector corresponding to each preset style tag and the reference embedded latent vector information is calculated through the multi-head attention mechanism. After these similarities are determined, the style weight of each preset style tag for the timbre, emotional and prosodic speaking-style audio is determined: the similarities of all preset style tags are accumulated to obtain a total similarity, the ratio of each tag's similarity to the total similarity is calculated, and that ratio is taken as the tag's style weight.

For example, if the number of preset style tags is 5 and the similarities between their preset vectors and the reference embedded latent vector information are 0.6, 0.3, 0.4, 0.4 and 0.3, the total similarity is 2, the ratios are 0.3, 0.15, 0.2, 0.2 and 0.15, and the style weights of the tags for the timbre, emotional and prosodic speaking-style audio are therefore 0.3, 0.15, 0.2, 0.2 and 0.15, respectively.

After the style weights are determined, the style weight of each style tag for the speaking-style audio to be synthesized is multiplied by the reference embedded latent vector information to obtain the style embedding vector of each preset style tag, and the style embedding vectors of all tags are accumulated to obtain the target style embedding vector corresponding to the speaking-style audio to be synthesized.
Step S103: Encode the text to be processed based on the text encoder to obtain text encoding vector information.

Exemplarily, the text to be processed is encoded by the text encoder to obtain the corresponding text encoding vector information. For example, the text encoder includes a weight matrix, and the text to be processed is mapped through the weight matrix to obtain the corresponding text encoding vector information.

In an embodiment, specifically, referring to FIG. 3, step S103 includes sub-step S1031 to sub-step S1032.

Sub-step S1031: Split the text to be processed into individual words through the text encoder, and obtain the sequence relationship between the words.

Exemplarily, when the encoder detects the text to be processed, it splits the text into individual words and obtains the sequence relationship between them. For example, the text to be processed is "我爱中国" ("I love China"), which is split into "我", "爱", "中", "国", and the obtained order is "我"--"爱"--"中"--"国".

Sub-step S1032: Map and convert the words and the sequence relationships between them to generate the text encoding vector information of the text to be synthesized.

Exemplarily, when the words of the text to be processed and the sequence relationships between them are obtained, each word and the sequence relationships are mapped to obtain word vector information for each word and sequence vector information between the words, i.e., edge vector information; the word vector information and the edge vector information are combined to obtain the corresponding text encoding vector information, where the weight in the edge vector information is 0.
步骤S104、通过所述全连接层对所述风格嵌入向量信息和所述文本编码向量信息进行拼接,生成梅尔语谱图。Step S104, splicing the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel language spectrogram.
示范性的,通过将该风格嵌入向量信息和文本编码向量信息进行拼接,得到梅尔语谱图。例如,分别获取风格嵌入向量信息和文本编码向量信息的维度信息,在同一维度上将风格嵌入向量信息和文本编码向量信息,进行拼接,生成梅尔语谱图。Exemplarily, the Mel language spectrogram is obtained by splicing the style embedding vector information and the text encoding vector information. For example, the dimension information of the style embedding vector information and the text encoding vector information are obtained respectively, and the style embedding vector information and the text encoding vector information are spliced in the same dimension to generate a Mel language spectrogram.
In an embodiment, the style embedding vector information is acquired through a broadcast of the connection layer, and the acquired style embedding vector information is connected with the text encoding vector information to obtain spliced vector information; the spliced vector information is decoded by a preset decoder to generate the Mel spectrogram.
Exemplarily, the style embedding vector information is acquired through a broadcast of the connection layer and is connected with the text encoding vector information to obtain the spliced vector information. As an example, the connection layer sends a broadcast to each multi-reference encoder; when the multi-reference encoders encode the speaking style audio to be synthesized and obtain the style embedding vector information, each multi-reference encoder sends the obtained style embedding vector information to the connection layer of the fully connected layer. Upon acquiring the style embedding vector information, the connection layer acquires the dimension information of the style embedding vector information and of the text encoding vector information respectively and splices the two accordingly, the splicing including dimension splicing. For example, the dimension information of the style embedding vector information and of the text encoding vector information is acquired, the dimension coordinates of the style embedding vector information and of the text encoding vector are determined, and the style embedding vector information and the text encoding vector information are spliced at the same dimension coordinates to obtain the corresponding spliced vector information.
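By way of illustration only, the dimension splicing can be sketched as broadcasting the single style embedding vector along the time axis of the text encoding and concatenating the two at matching coordinates. The shapes and variable names below are illustrative assumptions, not values fixed by the patent.

```python
import torch

T, d_text, d_style = 4, 8, 6
text_enc = torch.randn(T, d_text)    # one encoding vector per input character
style_emb = torch.randn(d_style)     # single style embedding vector

# Broadcast the style embedding across the time axis, then splice it with
# the text encoding at the same (time) coordinates along the feature dim.
spliced = torch.cat([text_enc, style_emb.expand(T, d_style)], dim=-1)
print(spliced.shape)  # torch.Size([4, 14])
```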
When the spliced vector information is obtained, it is input into the preset decoder, and the spliced vector information is decoded by the preset decoder to generate the corresponding Mel spectrogram. For example, the decoder decodes the incoming spliced vector information into spectral signal information, and the Mel spectrogram is generated from the spectral signal information.
Step S105: feature extraction is performed on the Mel spectrogram through the output layer, and the target audio of the text to be processed is output.
Exemplarily, after the Mel spectrum information is acquired, the speech synthesis information of the Mel spectrum information is output through the output layer. For example, the output layer includes a vocoder; the vocoder acquires the speech frequency-domain feature information in the Mel spectrum information and generates the speech synthesis information by synthesizing this feature information.
Specifically, performing feature extraction on the Mel spectrogram through the output layer and outputting the target audio of the text to be processed includes: extracting the speech frequency-domain features from the Mel spectrum information through the output layer, mapping the speech frequency-domain features, and outputting the target audio of the text to be processed.
Exemplarily, when the Mel spectrum information is acquired, the speech frequency-domain features therein are extracted through the output layer; after extraction, the speech frequency-domain features are mapped, and the speech synthesis information output by the output layer is acquired. For example, the output layer includes an extraction layer and a mapping layer: the extraction layer extracts the speech frequency-domain features from the Mel spectrum information, and the activation function in the mapping layer performs activation mapping on these features to obtain the speech synthesis information, as sketched below.
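By way of illustration only, since the description does not fix a concrete vocoder, the following sketch stands in for the output layer with Griffin-Lim inversion of a mel power spectrogram via librosa; the array shape and parameter values are assumptions, and a neural vocoder would be a drop-in alternative.

```python
import numpy as np
import librosa

# Assume `mel` is an (n_mels, frames) power mel spectrogram produced by the
# decoder; here it is random data standing in for real decoder output.
mel = np.abs(np.random.randn(80, 200)) ** 2

# Griffin-Lim inversion recovers a waveform from the mel spectrogram,
# playing the role of the unspecified vocoder in the output layer.
audio = librosa.feature.inverse.mel_to_audio(
    mel, sr=22050, n_fft=1024, hop_length=256
)
print(audio.shape)  # roughly hop_length * frames waveform samples
```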
In the embodiments of the present application, the acquired text to be processed and speaking style audio to be synthesized are input into a preset speech synthesis model for encoding to obtain style embedding vector information and text encoding vector information; the style embedding vector information and the text encoding vector information are spliced through the fully connected layer to generate the Mel spectrogram; and feature extraction is performed on the Mel spectrogram through the output layer to output the target audio of the text to be processed. The speaking style of the synthesized speech can thus be controlled, and speech with richer emotional expression can be synthesized.
Please refer to FIG. 4, which is a schematic block diagram of a speech synthesis apparatus provided by an embodiment of the present application.
As shown in FIG. 4, the speech synthesis apparatus 400 includes: a first acquisition module 401, a second acquisition module 402, a third acquisition module 403, a generation module 404 and an output module 405.
The first acquisition module 401 is configured to acquire text to be processed and speaking style audio to be synthesized, and to input the text to be processed and the speaking style audio to be synthesized into a preset speech synthesis model, where the speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer.
The second acquisition module 402 is configured to encode the speaking style audio to be synthesized on the basis of the multi-reference encoder to obtain style embedding vector information.
The third acquisition module 403 is configured to encode the text to be processed on the basis of the text encoder to obtain text encoding vector information.
The generation module 404 is configured to splice the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram.
The output module 405 is configured to perform feature extraction on the Mel spectrogram through the output layer and to output the target audio of the text to be processed.
The second acquisition module 402 is further specifically configured to:
encode the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio respectively according to the multiple reference encoders to obtain reference embedding latent vector information; and
calculate the reference embedding latent vector information according to the multi-head attention mechanism to obtain the style embedding vector information.
The second acquisition module 402 is further specifically configured to:
encode the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio through the convolutional neural networks in the multiple reference encoders to obtain a three-dimensional tensor of the speaking style audio to be synthesized; and
process the three-dimensional tensor through the recurrent neural network in the reference encoder to obtain the reference embedding latent vector information of the speaking style audio to be synthesized, as illustrated by the sketch below.
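By way of illustration only, one reference encoder of the kind just described can be sketched as a small convolution stack producing a three-dimensional feature tensor per utterance, followed by a recurrent network whose final state serves as the reference embedding latent vector. The class name, layer sizes and channel counts are illustrative assumptions, not the patented configuration.

```python
import torch
import torch.nn as nn

class TinyReferenceEncoder(nn.Module):
    """Hedged sketch of one reference encoder: CNN -> 3-D tensor -> RNN."""
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        # Two strided convolutions yield a (channels, time, freq) tensor.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.gru = nn.GRU(32 * (n_mels // 4), dim, batch_first=True)

    def forward(self, mel):                 # mel: (batch, frames, n_mels)
        x = self.conv(mel.unsqueeze(1))     # (batch, 32, frames/4, n_mels/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        _, h = self.gru(x)                  # final RNN state as the embedding
        return h.squeeze(0)                 # (batch, dim)

ref = TinyReferenceEncoder()
print(ref(torch.randn(2, 200, 80)).shape)  # torch.Size([2, 128])
```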
The second acquisition module 402 is further specifically configured to:
acquire the style weights of each preset style tag in the multi-head attention mechanism with respect to the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio;
multiply the style weights of each preset style tag for the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio by the reference embedding latent vector information to obtain the style embedding vector of each preset style tag; and
accumulate the style embedding vectors of the preset style tags to obtain the style embedding vector information of the speaking style audio to be synthesized.
The third acquisition module 403 is further specifically configured to:
split the text to be processed into individual words through the text encoder and acquire the order relation between the words; and
perform mapping conversion on the words and on the order relation between the words to generate the text encoding vector information of the text to be synthesized.
The generation module 404 is further specifically configured to:
acquire the style embedding vector information through a broadcast of the connection layer, and connect the acquired style embedding vector information with the text encoding vector information to obtain spliced vector information; and
decode the spliced vector information through the preset decoder to generate the Mel spectrogram.
The output module 405 is further configured to:
extract the speech frequency-domain features from the Mel spectrum information through the output layer, map the speech frequency-domain features, and output the target audio of the text to be processed.
It should be noted that those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the apparatus and of the modules and units described above, reference may be made to the corresponding processes in the foregoing embodiments of the speech synthesis method, which are not repeated here.
The apparatus provided by the above embodiments may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in FIG. 5.
Please refer to FIG. 5, which is a schematic structural block diagram of a computer device provided by an embodiment of the present application. The computer device may be a terminal.
As shown in FIG. 5, the computer device includes a processor, a memory and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions which, when executed, cause the processor to perform any one of the speech synthesis methods.
The processor is used to provide computing and control capabilities and supports the operation of the entire computer device.
The internal memory provides an environment for running the computer program in the non-volatile storage medium; when the computer program is executed by the processor, the processor is caused to perform any one of the speech synthesis methods.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art can understand that the structure shown in FIG. 5 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
It should be understood that the processor may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In an embodiment, the processor is configured to run a computer program stored in the memory to implement the following steps:
acquiring text to be processed and speaking style audio to be synthesized, and inputting the text to be processed and the speaking style audio to be synthesized into a preset speech synthesis model, where the preset speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer;
encoding the speaking style audio to be synthesized on the basis of the multi-reference encoder to obtain style embedding vector information;
encoding the text to be processed on the basis of the text encoder to obtain text encoding vector information;
splicing the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram; and
performing feature extraction on the Mel spectrogram through the output layer, and outputting the target audio of the text to be processed.
In an embodiment, the multi-reference encoder includes multiple reference encoders and a multi-head attention mechanism, and the speaking style audio to be synthesized includes timbre speaking style audio, emotional speaking style audio and prosodic speaking style audio; when implementing the encoding of the speaking style audio to be synthesized on the basis of the multi-reference encoder to obtain the style embedding vector information, the processor is configured to implement:
encoding the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio respectively according to the multiple reference encoders to obtain reference embedding latent vector information; and
calculating the reference embedding latent vector information according to the multi-head attention mechanism to obtain the style embedding vector information.
In an embodiment, when implementing the encoding of the speaking style audio to be synthesized according to the multiple reference encoders to obtain the reference embedding latent vector information, the processor is configured to implement:
encoding the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio through the convolutional neural networks in the multiple reference encoders to obtain a three-dimensional tensor of the speaking style audio to be synthesized; and
processing the three-dimensional tensor through the recurrent neural network in the reference encoder to obtain the reference embedding latent vector information of the speaking style audio to be synthesized.
In an embodiment, when implementing the calculation of the reference embedding latent vector information according to the multi-head attention mechanism to obtain the style embedding vector information, the processor is configured to implement:
acquiring the style weights of each preset style tag in the multi-head attention mechanism with respect to the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio;
multiplying the style weights of each preset style tag for the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio by the reference embedding latent vector information to obtain the style embedding vector of each preset style tag; and
accumulating the style embedding vectors of the preset style tags to obtain the style embedding vector information of the speaking style audio to be synthesized.
In an embodiment, when implementing the encoding of the text to be processed on the basis of the text encoder to obtain the text encoding vector information, the processor is configured to implement:
splitting the text to be processed into individual words through the text encoder and acquiring the order relation between the words; and
performing mapping conversion on the words and on the order relation between the words to generate the text encoding vector information of the text to be synthesized.
In an embodiment, the fully connected layer includes a connection layer and a preset decoder; when implementing the splicing of the style embedding vector information and the text encoding vector information through the fully connected layer to generate the Mel spectrogram, the processor is configured to implement:
acquiring the style embedding vector information through a broadcast of the connection layer, and connecting the acquired style embedding vector information with the text encoding vector information to obtain spliced vector information; and
decoding the spliced vector information through the preset decoder to generate the Mel spectrogram.
In an embodiment, when implementing the feature extraction on the Mel spectrogram through the output layer and outputting the target audio of the text to be processed, the processor is configured to implement:
extracting the speech frequency-domain features from the Mel spectrum information through the output layer, mapping the speech frequency-domain features, and outputting the target audio of the text to be processed.
Embodiments of the present application further provide a computer-readable storage medium, which may be non-volatile or volatile. A computer program is stored on the computer-readable storage medium, and the computer program includes program instructions; for the method implemented when the program instructions are executed, reference may be made to the various embodiments of the speech synthesis method of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or a memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device.
Further, the computer-readable storage medium may mainly include a program storage area and a data storage area, where the program storage area may store the operating system, an application program required by at least one function, and the like, and the data storage area may store data created according to the use of blockchain nodes, etc.
The blockchain referred to in this application is a new application mode of computer technologies such as the storage of the preset speech synthesis model, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association using cryptographic methods; each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. The blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, etc.
It should be noted that, herein, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or system. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or system that includes the element.
The above serial numbers of the embodiments of the present application are for description only and do not represent the merits of the embodiments. The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto; any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and such modifications or substitutions shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. A speech synthesis method, comprising:
    acquiring text to be processed and speaking style audio to be synthesized, and inputting the text to be processed and the speaking style audio to be synthesized into a preset speech synthesis model, wherein the preset speech synthesis model comprises a multi-reference encoder, a text encoder, a fully connected layer and an output layer;
    encoding the speaking style audio to be synthesized on the basis of the multi-reference encoder to obtain style embedding vector information;
    encoding the text to be processed on the basis of the text encoder to obtain text encoding vector information;
    splicing the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram; and
    performing feature extraction on the Mel spectrogram through the output layer, and outputting the target audio of the text to be processed.
  2. The speech synthesis method according to claim 1, wherein the multi-reference encoder comprises multiple reference encoders and a multi-head attention mechanism, and the speaking style audio to be synthesized comprises timbre speaking style audio, emotional speaking style audio and prosodic speaking style audio; and
    encoding the speaking style audio to be synthesized on the basis of the multi-reference encoder to obtain the style embedding vector information comprises:
    encoding the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio respectively according to the multiple reference encoders to obtain reference embedding latent vector information; and
    calculating the reference embedding latent vector information according to the multi-head attention mechanism to obtain the style embedding vector information.
  3. The speech synthesis method according to claim 2, wherein encoding the speaking style audio to be synthesized according to the multiple reference encoders to obtain the reference embedding latent vector information comprises:
    encoding the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio through the convolutional neural networks in the multiple reference encoders to obtain a three-dimensional tensor of the speaking style audio to be synthesized; and
    processing the three-dimensional tensor through the recurrent neural network in the reference encoder to obtain the reference embedding latent vector information of the speaking style audio to be synthesized.
  4. The speech synthesis method according to claim 2, wherein calculating the reference embedding latent vector information according to the multi-head attention mechanism to obtain the style embedding vector information comprises:
    acquiring the style weights of each preset style tag in the multi-head attention mechanism with respect to the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio;
    multiplying the style weights of each preset style tag for the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio by the reference embedding latent vector information to obtain the style embedding vector of each preset style tag; and
    accumulating the style embedding vectors of the preset style tags to obtain the style embedding vector information of the speaking style audio to be synthesized.
  5. The speech synthesis method according to claim 1, wherein encoding the text to be processed on the basis of the text encoder to obtain the text encoding vector information comprises:
    splitting the text to be processed into individual words through the text encoder, and acquiring the order relation between the words; and
    performing mapping conversion on the words and on the order relation between the words to generate the text encoding vector information of the text to be synthesized.
  6. The speech synthesis method according to claim 1, wherein the fully connected layer comprises a connection layer and a preset decoder, and splicing the style embedding vector information and the text encoding vector information through the fully connected layer to generate the Mel spectrogram comprises:
    acquiring the style embedding vector information through a broadcast of the connection layer, and connecting the acquired style embedding vector information with the text encoding vector information to obtain spliced vector information; and
    decoding the spliced vector information through the preset decoder to generate the Mel spectrogram.
  7. The speech synthesis method according to claim 1, wherein performing feature extraction on the Mel spectrogram through the output layer and outputting the target audio of the text to be processed comprises:
    extracting the speech frequency-domain features from the Mel spectrum information through the output layer, mapping the speech frequency-domain features, and outputting the target audio of the text to be processed.
  8. A speech synthesis apparatus, comprising:
    a first acquisition module, configured to acquire text to be processed and speaking style audio to be synthesized, and to input the text to be processed and the speaking style audio to be synthesized into a preset speech synthesis model, wherein the speech synthesis model comprises a multi-reference encoder, a text encoder, a fully connected layer and an output layer;
    a second acquisition module, configured to encode the speaking style audio to be synthesized on the basis of the multi-reference encoder to obtain style embedding vector information;
    a third acquisition module, configured to encode the text to be processed on the basis of the text encoder to obtain text encoding vector information;
    a generation module, configured to splice the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram; and
    an output module, configured to perform feature extraction on the Mel spectrogram through the output layer, and to output the target audio of the text to be processed.
  9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements:
    acquiring text to be processed and speaking style audio to be synthesized, and inputting the text to be processed and the speaking style audio to be synthesized into a preset speech synthesis model, wherein the preset speech synthesis model comprises a multi-reference encoder, a text encoder, a fully connected layer and an output layer;
    encoding the speaking style audio to be synthesized on the basis of the multi-reference encoder to obtain style embedding vector information;
    encoding the text to be processed on the basis of the text encoder to obtain text encoding vector information;
    splicing the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram; and
    performing feature extraction on the Mel spectrogram through the output layer, and outputting the target audio of the text to be processed.
  10. The computer device according to claim 9, wherein the multi-reference encoder comprises multiple reference encoders and a multi-head attention mechanism, the speaking style audio to be synthesized comprises timbre speaking style audio, emotional speaking style audio and prosodic speaking style audio, and the processor, when executing the computer program, further implements:
    encoding the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio respectively according to the multiple reference encoders to obtain reference embedding latent vector information; and
    calculating the reference embedding latent vector information according to the multi-head attention mechanism to obtain the style embedding vector information.
  11. The computer device according to claim 10, wherein the processor, when executing the computer program, further implements:
    encoding the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio through the convolutional neural networks in the multiple reference encoders to obtain a three-dimensional tensor of the speaking style audio to be synthesized; and
    processing the three-dimensional tensor through the recurrent neural network in the reference encoder to obtain the reference embedding latent vector information of the speaking style audio to be synthesized.
  12. The computer device according to claim 10, wherein the processor, when executing the computer program, further implements:
    acquiring the style weights of each preset style tag in the multi-head attention mechanism with respect to the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio;
    multiplying the style weights of each preset style tag for the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio by the reference embedding latent vector information to obtain the style embedding vector of each preset style tag; and
    accumulating the style embedding vectors of the preset style tags to obtain the style embedding vector information of the speaking style audio to be synthesized.
  13. The computer device according to claim 9, wherein the processor, when executing the computer program, further implements:
    splitting the text to be processed into individual words through the text encoder, and acquiring the order relation between the words; and
    performing mapping conversion on the words and on the order relation between the words to generate the text encoding vector information of the text to be synthesized.
  14. The computer device according to claim 9, wherein the fully connected layer comprises a connection layer and a preset decoder, and the processor, when executing the computer program, further implements:
    acquiring the style embedding vector information through a broadcast of the connection layer, and connecting the acquired style embedding vector information with the text encoding vector information to obtain spliced vector information; and
    decoding the spliced vector information through the preset decoder to generate the Mel spectrogram.
  15. The computer device according to claim 9, wherein the processor, when executing the computer program, further implements:
    extracting the speech frequency-domain features from the Mel spectrum information through the output layer, mapping the speech frequency-domain features, and outputting the target audio of the text to be processed.
  16. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements:
    acquiring text to be processed and speaking style audio to be synthesized, and inputting the text to be processed and the speaking style audio to be synthesized into a preset speech synthesis model, wherein the preset speech synthesis model comprises a multi-reference encoder, a text encoder, a fully connected layer and an output layer;
    encoding the speaking style audio to be synthesized on the basis of the multi-reference encoder to obtain style embedding vector information;
    encoding the text to be processed on the basis of the text encoder to obtain text encoding vector information;
    splicing the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram; and
    performing feature extraction on the Mel spectrogram through the output layer, and outputting the target audio of the text to be processed.
  17. The computer-readable storage medium according to claim 16, wherein the multi-reference encoder comprises multiple reference encoders and a multi-head attention mechanism, the speaking style audio to be synthesized comprises timbre speaking style audio, emotional speaking style audio and prosodic speaking style audio, and the computer program, when executed by the processor, further implements:
    encoding the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio respectively according to the multiple reference encoders to obtain reference embedding latent vector information; and
    calculating the reference embedding latent vector information according to the multi-head attention mechanism to obtain the style embedding vector information.
  18. The computer-readable storage medium according to claim 17, wherein the computer program, when executed by the processor, further implements:
    encoding the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio through the convolutional neural networks in the multiple reference encoders to obtain a three-dimensional tensor of the speaking style audio to be synthesized; and
    processing the three-dimensional tensor through the recurrent neural network in the reference encoder to obtain the reference embedding latent vector information of the speaking style audio to be synthesized.
  19. The computer-readable storage medium according to claim 16, wherein the computer program, when executed by the processor, further implements:
    acquiring the style weights of each preset style tag in the multi-head attention mechanism with respect to the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio;
    multiplying the style weights of each preset style tag for the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio by the reference embedding latent vector information to obtain the style embedding vector of each preset style tag; and
    accumulating the style embedding vectors of the preset style tags to obtain the style embedding vector information of the speaking style audio to be synthesized.
  20. The computer-readable storage medium according to claim 16, wherein the computer program, when executed by the processor, further implements:
    splitting the text to be processed into individual words through the text encoder, and acquiring the order relation between the words; and
    performing mapping conversion on the words and on the order relation between the words to generate the text encoding vector information of the text to be synthesized.
PCT/CN2021/084167 2021-02-26 2021-03-30 Speech synthesis method and apparatus, and device and storage medium WO2022178941A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110218672.XA CN112786009A (en) 2021-02-26 2021-02-26 Speech synthesis method, apparatus, device and storage medium
CN202110218672.X 2021-02-26

Publications (1)

Publication Number Publication Date
WO2022178941A1 true WO2022178941A1 (en) 2022-09-01

Family

ID=75761958

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/084167 WO2022178941A1 (en) 2021-02-26 2021-03-30 Speech synthesis method and apparatus, and device and storage medium

Country Status (2)

Country Link
CN (1) CN112786009A (en)
WO (1) WO2022178941A1 (en)


Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345412A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium
CN113345466B (en) * 2021-06-01 2024-03-01 平安科技(深圳)有限公司 Main speaker voice detection method, device and equipment based on multi-microphone scene
CN113822017A (en) * 2021-06-03 2021-12-21 腾讯科技(深圳)有限公司 Audio generation method, device, equipment and storage medium based on artificial intelligence
CN113409765B (en) * 2021-06-11 2024-04-26 北京搜狗科技发展有限公司 Speech synthesis method and device for speech synthesis
CN113506562B (en) * 2021-07-19 2022-07-19 武汉理工大学 End-to-end voice synthesis method and system based on fusion of acoustic features and text emotional features
CN113345416B (en) * 2021-08-02 2021-10-29 智者四海(北京)技术有限公司 Voice synthesis method and device and electronic equipment
CN115272537A (en) * 2021-08-06 2022-11-01 宿迁硅基智能科技有限公司 Audio driving expression method and device based on causal convolution
CN113744713A (en) * 2021-08-12 2021-12-03 北京百度网讯科技有限公司 Speech synthesis method and training method of speech synthesis model
CN113707125B (en) * 2021-08-30 2024-02-27 中国科学院声学研究所 Training method and device for multi-language speech synthesis model
CN113744716B (en) * 2021-10-19 2023-08-29 北京房江湖科技有限公司 Method and apparatus for synthesizing speech
CN114255737B (en) * 2022-02-28 2022-05-17 北京世纪好未来教育科技有限公司 Voice generation method and device and electronic equipment
CN114822495B (en) * 2022-06-29 2022-10-14 杭州同花顺数据开发有限公司 Acoustic model training method and device and speech synthesis method


Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
CN102103856A (en) * 2009-12-21 2011-06-22 盛大计算机(上海)有限公司 Voice synthesis method and system
WO2020209647A1 (en) * 2019-04-09 2020-10-15 네오사피엔스 주식회사 Method and system for generating synthetic speech for text through user interface
CN110718208A (en) * 2019-10-15 2020-01-21 四川长虹电器股份有限公司 Voice synthesis method and system based on multitask acoustic model
CN112270920A (en) * 2020-10-28 2021-01-26 北京百度网讯科技有限公司 Voice synthesis method and device, electronic equipment and readable storage medium
CN112382272B (en) * 2020-12-11 2023-05-23 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and storage medium capable of controlling speech speed

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
US20200394998A1 (en) * 2018-08-02 2020-12-17 Neosapience, Inc. Method, device, and computer readable storage medium for text-to-speech synthesis using machine learning on basis of sequential prosody feature
CN110288973A (en) * 2019-05-20 2019-09-27 平安科技(深圳)有限公司 Phoneme synthesizing method, device, equipment and computer readable storage medium
CN110473516A (en) * 2019-09-19 2019-11-19 百度在线网络技术(北京)有限公司 Phoneme synthesizing method, device and electronic equipment
CN112164379A (en) * 2020-10-16 2021-01-01 腾讯科技(深圳)有限公司 Audio file generation method, device, equipment and computer readable storage medium
CN112349269A (en) * 2020-12-11 2021-02-09 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN115470507A (en) * 2022-10-31 2022-12-13 青岛他坦科技服务有限公司 Medium and small enterprise research and development project data management method
CN115470507B (en) * 2022-10-31 2023-02-07 青岛他坦科技服务有限公司 Medium and small enterprise research and development project data management method

Also Published As

Publication number Publication date
CN112786009A (en) 2021-05-11

Similar Documents

Publication Publication Date Title
WO2022178941A1 (en) Speech synthesis method and apparatus, and device and storage medium
CN110264991B (en) Training method of speech synthesis model, speech synthesis method, device, equipment and storage medium
CN106688034B (en) Text-to-speech conversion with emotional content
KR20220004737A (en) Multilingual speech synthesis and cross-language speech replication
WO2020232997A1 (en) Speech synthesis method and apparatus, and device and computer-readable storage medium
CN112687259B (en) Speech synthesis method, device and readable storage medium
KR20210146368A (en) End-to-end automatic speech recognition for digit sequences
TWI698857B (en) Speech recognition system and method thereof, and computer program product
WO2022121179A1 (en) Speech synthesis method and apparatus, device, and storage medium
KR102625184B1 (en) Speech synthesis training to create unique speech sounds
WO2022203699A1 (en) Unsupervised parallel tacotron non-autoregressive and controllable text-to-speech
WO2021174922A1 (en) Statement sentiment classification method and related device
CN115210809A (en) Consistent prediction of streaming sequence models
CN111696521A (en) Method for training speech clone model, readable storage medium and speech clone method
US20220156552A1 (en) Data conversion learning device, data conversion device, method, and program
CN112035699A (en) Music synthesis method, device, equipment and computer readable medium
CN111444379A (en) Audio feature vector generation method and audio segment representation model training method
CN112735377B (en) Speech synthesis method, device, terminal equipment and storage medium
CN111898363B (en) Compression method, device, computer equipment and storage medium for long and difficult text sentence
US11960852B2 (en) Robust direct speech-to-speech translation
WO2021114617A1 (en) Voice synthesis method and apparatus, computer device, and computer readable storage medium
CN114464163A (en) Method, device, equipment, storage medium and product for training speech synthesis model
CN113112969A (en) Buddhism music score recording method, device, equipment and medium based on neural network
Matoušek et al. VITS: quality vs. speed analysis
WO2021182199A1 (en) Information processing method, information processing device, and information processing program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21927375

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21927375

Country of ref document: EP

Kind code of ref document: A1