CN115346510A - Voice synthesis method and device, electronic equipment and storage medium - Google Patents


Publication number: CN115346510A
Authority: CN (China)
Prior art keywords: sentence, cross, model, speech synthesis, features
Legal status: Pending
Application number: CN202110527979.8A
Other languages: Chinese (zh)
Inventors: 张超, 宋伟, 张政臣, 何晓冬, 周伯文
Current Assignee: Jingdong Technology Holding Co., Ltd.
Original Assignee: Jingdong Technology Holding Co., Ltd.
Application filed by Jingdong Technology Holding Co., Ltd.
Priority to CN202110527979.8A
Publication of CN115346510A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10 Prosody rules derived from text; Stress or intonation

Abstract

The application discloses a speech synthesis method, which comprises the following steps: obtaining the context sentences of a current sentence, and constructing a sentence set comprising the current sentence and the context sentences; performing a text feature extraction operation on the sentences in the sentence set to obtain cross-sentence features, and training a speech synthesis model by using the cross-sentence features; and synthesizing the speech information of a target text by using the trained speech synthesis model. The method and the device train the speech synthesis model with cross-sentence features; because cross-sentence features can describe the discourse structure of the text, the trained speech synthesis model can synthesize speech based on the discourse structure of the text, thereby improving the prosodic effect of speech synthesis. The application also discloses a speech synthesis apparatus, an electronic device, and a storage medium, which have the same beneficial effects.

Description

Voice synthesis method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method and apparatus, an electronic device, and a storage medium.
Background
Speech synthesis refers to a technique for generating artificial speech by mechanical or electronic means. Through speech synthesis technology, text generated by a computer or input from outside can be converted into fluent audio content that a user can understand.
At present, speech synthesis is mainly realized by training a speech synthesis model. However, because the related art trains the speech synthesis model using only the current sentence of the training data, the prosodic effect of the synthesized speech is poor.
Therefore, how to improve the prosodic effect of speech synthesis is a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
An object of the present application is to provide a speech synthesis method, apparatus, an electronic device and a storage medium, which can improve the prosodic effect of speech synthesis.
In order to solve the above technical problem, the present application provides a speech synthesis method, including:
obtaining a context sentence of a current sentence, and constructing a sentence set comprising the current sentence and the context sentence;
executing text feature extraction operation on the sentences in the sentence set to obtain cross-sentence features, and training a speech synthesis model by using the cross-sentence features;
and synthesizing the voice information of the target text by using the trained voice synthesis model.
Optionally, performing a text feature extraction operation on the sentences in the sentence set to obtain cross-sentence features, including:
determining a forward sentence set and a backward sentence set; wherein the forward sentence set comprises the current sentence and a context sentence preceding the current sentence, and the backward sentence set comprises the current sentence and a context sentence following the current sentence;
calculating forward cross sentence characteristics of the forward sentence set and backward cross sentence characteristics of the backward sentence set.
Optionally, calculating the forward cross sentence feature of the forward sentence set and the backward cross sentence feature of the backward sentence set includes:
inputting all the forward sentence sets into the language model to obtain the forward cross sentence characteristics;
and inputting all the backward sentence sets into the language model to obtain the backward cross sentence characteristics.
Optionally, training a speech synthesis model using the cross-sentence feature includes:
splicing the forward cross sentence characteristic and the backward cross sentence characteristic to obtain a context cross sentence characteristic;
acquiring phoneme characteristics output by an encoder of the speech synthesis model;
and splicing the context cross sentence characteristics and the phoneme characteristics to obtain a spliced characteristic vector, and training the speech synthesis model by using the spliced characteristic vector.
Optionally, performing a text feature extraction operation on the sentences in the sentence set to obtain cross-sentence features, including:
acquiring adjacent sentence pairs in the sentence set;
inputting the adjacent sentence pairs into a language model to obtain cross sentence characteristics of the adjacent sentence pairs;
correspondingly, training a speech synthesis model by using the cross sentence characteristics comprises the following steps:
weighting and summing the cross sentence characteristics of all the adjacent sentence pairs to obtain weighted cross sentence characteristics;
learning the weighted cross sentence features by controlling each phoneme feature of an encoder of the speech synthesis model through a self-attention mechanism network structure; wherein the key vector and the value vector of the self-attention mechanism network structure are the weighted cross sentence features, and the query vector of the self-attention mechanism network structure is the phoneme feature vector of the encoder;
and training the voice synthesis model by using the weighted cross sentence characteristics.
Optionally, performing a text feature extraction operation on the sentences in the sentence set to obtain cross-sentence features, including:
querying a word vector of each word in the sentence set by using a word vector table, and carrying out average value calculation on the word vectors corresponding to the sentence texts in the context sentences to obtain sentence text vectors;
and splicing the sentence text vectors to obtain the cross-sentence characteristics.
Optionally, before performing a text feature extraction operation on the sentences in the sentence set to obtain cross-sentence features, the method further includes:
knowledge distillation is carried out on a language model to obtain a Student model, so that the Student model learns the cross sentence characteristics extracted by the language model;
correspondingly, performing a text feature extraction operation on the sentences in the sentence set to obtain cross-sentence features includes:
and performing text feature extraction operation on the sentences in the sentence set by using the Student model to obtain the cross-sentence features.
The present application also provides a speech synthesis apparatus, the apparatus comprising:
the set construction module is used for acquiring a context sentence of a current sentence and constructing a sentence set comprising the current sentence and the context sentence;
the model training module is used for executing text feature extraction operation on the sentences in the sentence set to obtain cross-sentence features and training a speech synthesis model by using the cross-sentence features;
and the voice synthesis module is used for synthesizing the voice information of the target text by utilizing the trained voice synthesis model.
The present application further provides a storage medium on which a computer program is stored, which when executed, implements the steps performed by the above-described speech synthesis method.
The application also provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps executed by the speech synthesis method when calling the computer program in the memory.
The application provides a speech synthesis method, which comprises the following steps: acquiring a context sentence of a current sentence, and constructing a sentence set comprising the current sentence and the context sentence; executing text feature extraction operation on the sentences in the sentence set to obtain cross-sentence features, and training a voice synthesis model by using the cross-sentence features; and synthesizing the voice information of the target text by using the trained voice synthesis model.
The present application obtains the context sentences of a current sentence and uses a sentence set comprising the current sentence and its context sentences as a sample for training a speech synthesis model. Specifically, the application performs a text feature extraction operation on the sentences in the sentence set to obtain cross-sentence features, and because cross-sentence features can describe the discourse structure of the text's context, a speech synthesis model trained with the cross-sentence features can learn discourse-structure features. The prosody of read-aloud text is related to the discourse structure of its context, and the same sentence may have completely different prosodic realizations in different contexts, so a speech synthesis model trained with cross-sentence features can synthesize speech based on the discourse structure of the context, thereby improving the prosodic effect of speech synthesis. The application also provides a speech synthesis apparatus, an electronic device, and a storage medium with the same beneficial effects, which are not described again here.
Drawings
In order to more clearly illustrate the embodiments of the present application, the drawings needed for the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of a speech synthesis method according to an embodiment of the present application;
Fig. 2 is an architecture diagram of an end-to-end speech synthesis model according to an embodiment of the present application;
Fig. 3 is a flowchart of a speech synthesis model training method according to an embodiment of the present application;
Fig. 4 is a schematic diagram of an encoder network structure based on CSE cross-sentence features according to an embodiment of the present application;
Fig. 5 is a flowchart of a speech synthesis model training method according to an embodiment of the present application;
Fig. 6 is a schematic diagram of an encoder network structure based on PSE cross-sentence features according to an embodiment of the present application;
Fig. 7 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
End-to-end (sequence-to-sequence) speech synthesis is one of the mainstream speech synthesis technologies at present. Common end-to-end speech synthesis models include Tacotron (an end-to-end text-to-speech deep learning model), Tacotron 2, DurIAN (a model combining traditional parametric speech synthesis with end-to-end speech synthesis), various similar seq2seq variants, and encoder/decoder network structures based on the Transformer, such as FastSpeech (a speech synthesis model) and FastSpeech 2. However, the end-to-end speech synthesis model is not trained with discourse-structure information; it only uses the linguistic features of the current sentence for speech synthesis. For example, in the prior art, a Global Style Token (GST) is usually learned for prosody control, but this approach learns the GST from the audio data of the current sentence and does not use any cross-sentence information, so the prosodic effect of the speech synthesized by the related art is poor. To address the above drawbacks of the related art, the present application provides a new speech synthesis scheme that can improve the prosodic effect of speech synthesis through the following embodiments.
Referring to fig. 1, fig. 1 is a flowchart of a speech synthesis method according to an embodiment of the present disclosure.
The specific steps may include:
s101: obtaining a context sentence of a current sentence, and constructing a sentence set comprising the current sentence and the context sentence;
the embodiment can be applied to an end-to-end speech synthesis model, and a main speech synthesis framework of the speech synthesis model can be a model such as Tacotron, fastSpeech1, fastSpeech2, EAST (a text detection model), deepVoice or ClariNet (a neural network speech synthesis model). The current sentence may be any sentence in the training data, and in this embodiment, the size of the context query window may be preset, and the current sentence is used as the window center of the context query window to query the upper sentence and the lower sentence, so as to obtain the context sentence. The above sentence is a sentence in the training data located before the current sentence, and the below sentence is a sentence in the training data located after the current sentence. As a possible implementation manner, if the current sentence is the nth sentence of the training data, the size of the context query window is 9, and the context sentences are the N-4 th, N-3 th, N-2 th, N-1 th, N +2 th, N +3 th, and N +4 th sentences of the training data.
After obtaining the current sentence and the context sentence, the present embodiment may construct a sentence set including the current sentence and the context sentence. After the speech synthesis model is trained by using the sentence set corresponding to the current sentence, the embodiment may further re-determine a new current sentence, and construct a sentence set corresponding to the new current sentence, so as to train the speech synthesis model again.
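As an illustration of the windowing described above, the following Python sketch builds the sentence set for a given current sentence. The function name, the dictionary layout, and the default window size are illustrative assumptions, not part of the original disclosure.

```python
def build_sentence_set(sentences, n, window_size=9):
    """Collect the current sentence and its context sentences.

    `sentences` is the ordered list of training sentences, `n` is the index of
    the current sentence, and `window_size` is the size of the context query
    window centered on the current sentence (9 in the example above).
    """
    half = window_size // 2                      # 4 sentences on each side
    lo = max(0, n - half)
    hi = min(len(sentences), n + half + 1)
    context = [sentences[i] for i in range(lo, hi) if i != n]
    return {"current": sentences[n], "context": context}

# Example: for the Nth sentence with a window of 9, the context sentences are
# N-4 ... N-1 and N+1 ... N+4 (clipped at the boundaries of the training data).
```

A new set is built this way for each current sentence as training iterates over the data.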
S102: performing text feature extraction operation on sentences in the sentence set to obtain cross-sentence features, and training a voice synthesis model by using the cross-sentence features;
On the basis of the obtained sentence set, this embodiment may perform the text feature extraction operation on each sentence in the sentence set, and since the sentence set includes the current sentence and its context sentences, the result of performing text feature extraction on the sentence set is the cross-sentence feature of the current sentence.
Specifically, this embodiment may extract cross-sentence features in the following ways: method (1), performing the text feature extraction operation on the sentences in the sentence set based on CSE (Chunked Sentence Embedding) to obtain cross-sentence features; method (2), performing the text feature extraction operation on the sentences in the sentence set based on PSE (Paired Sentence Embedding) to obtain cross-sentence features; method (3), performing the text feature extraction operation on the sentences in the sentence set based on a word vector table (such as a character vector list) to obtain cross-sentence features; and method (4), performing the text feature extraction operation on the sentences in the sentence set based on a Student model obtained by knowledge distillation to obtain cross-sentence features. Further, this embodiment may extract cross-sentence features using any one of methods (1) to (4), or a combination of several of them. If cross-sentence features are extracted with multiple methods, this embodiment may determine the cross-sentence features extracted by each method and average all of them to obtain the cross-sentence features finally used for training the speech synthesis model.
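When several of the extraction methods are used together, the description above averages their outputs. A minimal sketch of that combination step is shown below; it assumes each extractor returns a feature vector of the same dimension, and the function name is illustrative.

```python
import torch

def combine_cross_sentence_features(feature_list):
    """Average the cross-sentence features produced by different extractors.

    `feature_list` holds one tensor of shape (dim,) per extraction method,
    e.g. the CSE-, PSE-, word-vector- and Student-model-based features.
    """
    stacked = torch.stack(feature_list, dim=0)   # (num_methods, dim)
    return stacked.mean(dim=0)                   # (dim,)
```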
After the cross-sentence features are obtained, the speech synthesis model can be trained with them, so that the speech synthesis model learns the discourse-structure information of the context in the training data, which in turn improves the prosodic effect of speech synthesis.
S103: and synthesizing the voice information of the target text by using the trained voice synthesis model.
In this embodiment, the speech synthesis model may be iteratively trained by updating the current sentence multiple times. After the speech synthesis model has been trained, the target text whose speech is to be synthesized may be input into the speech synthesis model, so that the trained speech synthesis model synthesizes the speech information corresponding to the target text.
This embodiment obtains the context sentences of a current sentence and uses a sentence set comprising the current sentence and its context sentences as a sample for training a speech synthesis model. Specifically, this embodiment performs a text feature extraction operation on the sentences in the sentence set to obtain cross-sentence features, and because cross-sentence features can describe the discourse structure of the text's context, a speech synthesis model trained with the cross-sentence features can learn discourse-structure features. The prosody of read-aloud text is related to the discourse structure of its context, and the same sentence may have completely different prosodic realizations in different contexts, so a speech synthesis model trained with cross-sentence features can synthesize speech based on the discourse structure of the context, thereby improving the prosodic effect of speech synthesis.
Although current end-to-end speech synthesis technology has achieved relatively natural and prosody-rich synthesis, the related art does not use discourse-structure information and relies only on the linguistic features of the current sentence. Prosodic information is usually strongly related to the discourse structure of the context, and the same sentence of text has completely different prosodic realizations in different contexts, so an end-to-end system that synthesizes speech using only the text features of the current sentence has difficulty converting text into natural speech with rich prosody according to the contextual information.
Referring to fig. 2, fig. 2 is an architecture diagram of an end-to-end speech synthesis model according to an embodiment of the present application. In fig. 2, Input 1 is the phoneme sequence of the current sentence, Input 2 is the acoustic feature (Mel-Spectrum prediction) predicted by the speech synthesis model at the previous step, 1d-conv refers to one-dimensional convolution, Bi-directional LSTM (Long Short-Term Memory) refers to a bidirectional long short-term memory network, Attention Mechanism refers to the attention mechanism, FC (Fully Connected) refers to a fully connected layer, Query refers to the query vector, Context vector refers to the context vector, plus refers to the parameter adjustment process of the model, and Stop token prediction refers to a module that predicts the probability used to decide when to stop. In the speech synthesis model, the phoneme sequence of the current sentence is processed by three convolution layers and input to the bidirectional long short-term memory network, and the cross-sentence features are concatenated with the output features of the bidirectional LSTM and input to the Attention Mechanism module. The acoustic features predicted at the previous step serve as input features and are passed through the fully connected layer and the LSTM in turn to obtain the query vector, which is also input to the Attention Mechanism module. The Attention Mechanism module generates a context vector from the query vector and the concatenation of the cross-sentence features with the LSTM output features, the parameters of the speech synthesis model are then adjusted according to the context vector, and the predicted acoustic features are output. The speech synthesis model stops generating once the predicted value of the Stop token prediction exceeds a specified value.
The speech synthesis model is trained by combining the text features of the current sentence with the cross-sentence features extracted from the context information corresponding to that sentence, which improves the prosodic effect of the model. A context-continuous TTS (Text To Speech) corpus may also be employed when training the model. For example, the Chinese training data of this embodiment may be novel-reading data from a male speaker with continuous context, and the English training data may be non-novel data from a female speaker with continuous context.
In order to obtain the cross-sentence features of the context text, this embodiment may use a BERT model to extract the cross-sentence features of the context sentences. As a possible implementation, this embodiment may use a pretrained open-source BERT model to extract the cross-sentence features. In order to verify different ways of using cross-sentence features, this embodiment can extract the cross-sentence features in both the CSE manner and the PSE manner. This embodiment uses an end-to-end speech synthesis model similar to Tacotron 2 as the basic framework, as shown in fig. 2, and then combines cross-sentence features within this basic framework to improve the prosodic effect of the model.
As a feasible implementation manner, the present application may perform a text feature extraction operation on the sentences in the sentence set based on CSE to obtain cross-sentence features. The specific process is shown in fig. 3, which is a flowchart of a speech synthesis model training method provided in an embodiment of the present application. This embodiment may include the following steps:
s301: determining a forward sentence set and a backward sentence set;
wherein the forward sentence set comprises the current sentence and a context sentence preceding the current sentence, and the backward sentence set comprises the current sentence and a context sentence following the current sentence;
s302: calculating forward cross sentence characteristics of the forward sentence set and backward cross sentence characteristics of the backward sentence set.
Specifically, this embodiment may use a language model to calculate the cross-sentence features. The process is as follows: inputting the entire forward sentence set into the language model to obtain the forward cross-sentence features, and inputting the entire backward sentence set into the language model to obtain the backward cross-sentence features. Specifically, the language model may be a BERT model, a GPT model, a RoBERTa model, an ELMo model, or a Transformer-XL model. In this embodiment, the whole forward sentence set may be fed into the language model as input data, so that the language model extracts the semantic information contained in all the texts of the forward sentence set, i.e., the forward cross-sentence features are obtained. Likewise, the whole backward sentence set may be fed into the language model as input data, so that the language model extracts the semantic information contained in all the texts of the backward sentence set, i.e., the backward cross-sentence features are obtained.
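A minimal sketch of this step is given below, assuming a HuggingFace-style pretrained BERT; the checkpoint name and helper function are illustrative assumptions, and any of the language models listed above could be substituted.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese").eval()

def cls_feature(sentences):
    """Join a sentence set with [SEP], encode it, and return the [CLS] vector."""
    text = f" {tokenizer.sep_token} ".join(sentences)
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0, :]     # [CLS] position, shape (1, hidden)

# forward_set / backward_set are the sentence lists defined in the steps above.
# e_forward = cls_feature(forward_set)    # forward cross-sentence feature
# e_backward = cls_feature(backward_set)  # backward cross-sentence feature
```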
S303: splicing the forward cross sentence characteristic and the backward cross sentence characteristic to obtain a context cross sentence characteristic;
s304: acquiring phoneme characteristics output by an encoder of the speech synthesis model;
the phoneme features are obtained by processing a current sentence by an encoder of the speech synthesis model.
S305: and splicing the context cross sentence characteristics and the phoneme characteristics to obtain a spliced characteristic vector, and training the speech synthesis model by using the spliced characteristic vector.
As a possible implementation, after the spliced feature vector is obtained, it may be input to a linear transformation layer that maps it to the same dimension as the encoder output vector, so as to reduce the number of parameters of the speech synthesis model. This embodiment provides a scheme for improving the prosodic performance of the model by incorporating contextual cross-sentence features; the cross-sentence feature extraction process can use models such as BERT, GPT, RoBERTa, ELMo, or Transformer-XL to extract the cross-sentence features of the current sentence, and the main speech synthesis framework can be a Tacotron, FastSpeech 2, EAST, DeepSpeech, or ClariNet model.
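The splicing and dimension-reducing projection described here can be sketched as the following simplified PyTorch module; the class name, layer names, and dimensions are illustrative assumptions rather than values from the original disclosure.

```python
import torch
import torch.nn as nn

class CSEFusion(nn.Module):
    """Concatenate the contextual cross-sentence feature with each phoneme
    feature and project back to the encoder dimension."""

    def __init__(self, encoder_dim=512, cu_dim=768 * 2):
        super().__init__()
        # maps [phoneme ; forward CU ; backward CU] back to encoder_dim
        self.proj = nn.Linear(encoder_dim + cu_dim, encoder_dim)

    def forward(self, phoneme_feats, e_forward, e_backward):
        # phoneme_feats: (T, encoder_dim); e_forward / e_backward: (1, 768)
        c = torch.cat([e_forward, e_backward], dim=-1)   # contextual CU feature
        c = c.expand(phoneme_feats.size(0), -1)          # broadcast over phonemes
        fused = torch.cat([phoneme_feats, c], dim=-1)    # spliced feature vector
        return self.proj(fused)                          # same dim as encoder output
```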
The input of the encoder of the speech synthesis model is a phoneme sequence; the phoneme sequence is the result of splitting the pinyin into initials and finals. The encoder usually obtains a vector for each phoneme ID through a lookup table, then performs one-dimensional convolution operations on all the vectors, and finally passes them through a bidirectional LSTM layer.
The decoder of the speech synthesis model is an LSTM-based autoregressive generation model. The input of the decoder is the acoustic features output at the previous time step; the output of the last LSTM layer is then used as the query vector to perform attention computation with the spliced feature vectors of the attention module, yielding a context vector. The context vector is then concatenated with the last-layer LSTM vector, and a final fully connected layer predicts the acoustic features at the current time step. Further, the acoustic features at the current time step are used as the decoder input at the next time step (i.e., Input 2 in fig. 2).
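A simplified sketch of the encoder just described (phoneme lookup table, three one-dimensional convolutions, and a bidirectional LSTM) is given below; the hyperparameters are illustrative assumptions rather than values taken from the original disclosure.

```python
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    """Phoneme-ID lookup, three 1-D convolutions, then a bidirectional LSTM."""

    def __init__(self, num_phonemes=100, dim=512):
        super().__init__()
        self.embedding = nn.Embedding(num_phonemes, dim)   # lookup table by phoneme ID
        self.convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size=5, padding=2),
                nn.BatchNorm1d(dim),
                nn.ReLU(),
            )
            for _ in range(3)
        )
        self.lstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, phoneme_ids):
        # phoneme_ids: (batch, T) integer phoneme IDs
        x = self.embedding(phoneme_ids).transpose(1, 2)    # (batch, dim, T)
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)                              # (batch, T, dim)
        outputs, _ = self.lstm(x)                          # (batch, T, dim)
        return outputs
```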
Referring to fig. 4, fig. 4 is a schematic diagram of an encoder network structure based on CSE cross-sentence features according to an embodiment of the present application, and fig. 4 describes the specific process of extracting cross-sentence features based on CSE in the foregoing embodiment. u_i denotes the text of the i-th sentence of the training data; SEP is the separator used in the BERT model to separate two sentences; CLS is the classification token used in the BERT model to determine whether two sentences are consecutive; p_i is the i-th phoneme of the phoneme list corresponding to a sentence of text; T is the number of phonemes of that sentence; CU (Cross Utterance) denotes a cross-sentence; u_0 is the current sentence; G2P refers to the grapheme-to-phoneme conversion of the words in the current sentence; Phoneme Encoder is the phoneme encoder; f(p_i) is the phoneme encoding result; Cat refers to the splicing (concatenation) operation; Linear Projection W is a weighted linear projection; Decoder is the decoder; u_P is the forward sentence set; u_N is the backward sentence set; e(u_P) is the forward cross-sentence feature; e(u_N) is the backward cross-sentence feature; c is the contextual cross-sentence feature obtained by splicing the forward and backward cross-sentence features; and CU Encoder is the cross-sentence encoder.
In this embodiment, assume u_0 is the current sentence and L is the number of context sentences considered by this embodiment. Define u_P = {CLS, u_-L, SEP, u_-L+1, ..., SEP, u_0}, denoting the current sentence together with the sentences preceding it, and define u_N = {CLS, u_0, SEP, u_1, SEP, u_2, ..., SEP, u_L}, denoting the current sentence together with the sentences following it. This embodiment feeds u_P and u_N into the BERT model and extracts the output-layer vectors corresponding to the CLS tokens of u_P and u_N as the forward and backward cross-sentence features of the current sentence, as shown in fig. 4. In the CSE manner of using cross-sentence features, the cross-sentence feature vectors corresponding to u_P and u_N are spliced together and then spliced with the phoneme feature vectors output by the encoder of the speech synthesis model. To reduce the number of model parameters, the spliced feature vector can be mapped by a linear transformation layer to the same dimension as the encoder output of the speech synthesis model.
As a feasible implementation manner, the present application may obtain cross-sentence features by performing a text feature extraction operation on the sentences in the sentence set based on PSE. The specific process is shown in fig. 5, which is a flowchart of a speech synthesis model training method provided in an embodiment of the present application. This embodiment may include the following steps:
s501: acquiring adjacent sentence pairs in the sentence set;
specifically, the present embodiment may use two adjacent sentences in the sentence set as adjacent sentence pairs, and may set the upper sentence and the lower sentence farthest from the current sentence as the adjacent sentence pairs, so as to obtain the same number of adjacent sentence pairs as the number of sentences in the sentence set.
S502: inputting the adjacent sentence pairs into a language model to obtain cross sentence characteristics of the adjacent sentence pairs;
s503: weighting and summing the cross sentence characteristics of all the adjacent sentence pairs to obtain weighted cross sentence characteristics;
the weight in the weighted summation process can be obtained by the following steps: the weighting information can be obtained by performing relevance calculation (such as vector multiplication) by using the feature vector of each phoneme and the cross-sentence feature of each sentence pair.
S504: learning the weighted cross sentence features by controlling each phoneme feature of an encoder of the speech synthesis model through a self-attention mechanism network structure;
wherein the key vectors and value vectors of the self-attention mechanism network structure are the weighted cross-sentence features, and the query vectors of the self-attention mechanism network structure are the phoneme feature vectors of the encoder. The attention function in the self-attention mechanism network structure can essentially be described as mapping a query vector to a set of key-value vector pairs, and computing the attention score with the attention mechanism network structure may include the following steps: (1) computing the similarity between the query vector and each key vector to obtain the weights; (2) normalizing the weights obtained in the previous step with a softmax function; (3) computing the weighted sum of the normalized weights and the corresponding value vectors to obtain the attention score. In the field of natural language processing, the key vector and the value vector are typically set to the same vector, i.e., the key vector is equal to the value vector.
S505: and training the voice synthesis model by using the weighted cross sentence characteristics.
In the above process, the feature vector of each phoneme is used as a query vector to query the sentence vectors of all the sentence texts obtained from BERT; each sentence vector then receives a weight, and a weighted sum is computed according to these weights, so that each phoneme obtains its own (i.e., distinct) cross-sentence feature. In this embodiment, an individual cross-sentence feature is obtained for the encoder feature of each phoneme through the attention mechanism network structure, so that each pronunciation unit obtains a fine-grained cross-sentence feature that helps the pronunciation of the current unit, and this cross-sentence feature is used to improve the prosodic effect of the model.
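A minimal sketch of this self-attention step, with the phoneme encoder features as queries and the pairwise cross-sentence features as keys and values, is shown below; the shapes, class name, and the use of torch.nn.MultiheadAttention are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PSEAttention(nn.Module):
    """Give each phoneme its own weighted cross-sentence feature."""

    def __init__(self, encoder_dim=512, cu_dim=768, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=encoder_dim, kdim=cu_dim, vdim=cu_dim,
            num_heads=num_heads, batch_first=True,
        )

    def forward(self, phoneme_feats, pair_features):
        # phoneme_feats: (batch, T, encoder_dim) - queries
        # pair_features: (batch, num_pairs, cu_dim) - keys and values
        # (the CLS vectors of the adjacent sentence pairs)
        weighted_cu, _ = self.attn(phoneme_feats, pair_features, pair_features)
        return weighted_cu                       # (batch, T, encoder_dim)
```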
Referring to fig. 6, fig. 6 is a schematic diagram of an encoder network structure based on PSE cross-sentence features according to an embodiment of the present application, and fig. 6 describes the specific process of extracting cross-sentence features based on PSE in the foregoing embodiment. English terms that also appear in fig. 5 have the same meaning and are not repeated here; in fig. 6, Pair refers to an adjacent sentence pair, and Multi-Head Attention refers to the multi-head self-attention mechanism. As shown in fig. 6, each adjacent sentence pair is fed separately into the BERT model; for example, u0 and u1, u1 and u2, ..., u-1 and u0 are each used as an input sentence pair of the BERT model, and the feature vector corresponding to the CLS token of each sentence pair is extracted as the cross-sentence feature of that pair. In order to integrate the cross-sentence features obtained from all sentence pairs and apply them to the speech synthesis model, this embodiment employs a self-attention mechanism network structure and learns an individual cross-sentence feature for the phoneme feature vector at each encoder position of the speech synthesis model, where this cross-sentence feature is a weighted sum of the cross-sentence features of all sentence pairs. In the self-attention mechanism network structure, the phoneme feature vector sequence of the speech synthesis encoder is used as the query vectors, and the cross-sentence feature vectors of all sentence pairs obtained in the PSE manner are used as the key vectors and value vectors. With this self-attention mechanism network structure, the encoder feature of each phoneme can learn its own cross-sentence feature, and phonemes at different positions can learn different cross-sentence features.
Both of the above ways of using cross-sentence features require an additional language model for cross-sentence feature extraction. However, the large number of parameters of the language model imposes substantial extra costs on model deployment and inference. If the speech synthesis model is to be deployed on a mobile device, the large parameter count of the language model makes it unsuitable for an offline model. Therefore, the present application also provides two simpler ways of extracting cross-sentence features, which can overcome the drawbacks of the language model in terms of inference speed and the disk space occupied by the model.
The present application also provides a scheme for obtaining cross-sentence features by performing a text feature extraction operation on the sentences in the sentence set based on a word vector table. The specific process is as follows: querying the word vector of each word in the sentence set by using a word vector table, and averaging the word vectors corresponding to each sentence text in the context sentences to obtain sentence text vectors; and splicing the sentence text vectors to obtain the cross-sentence features.
Specifically, the embodiment can splice sentence text vectors corresponding to the forward sentence set to obtain forward cross-sentence characteristics; and the sentence text vectors corresponding to the backward sentence set can be spliced to obtain backward cross-sentence characteristics. Further, the embodiment may further splice the forward cross-sentence characteristic and the backward cross-sentence characteristic to obtain a context cross-sentence characteristic; acquiring phoneme characteristics output by an encoder of the speech synthesis model; and splicing the context cross sentence characteristics and the phoneme characteristics to obtain a spliced characteristic vector, and training the speech synthesis model by using the spliced characteristic vector.
Specifically, the sentence text vectors of the adjacent sentence texts may be spliced to obtain the cross-sentence characteristics of the adjacent sentence pairs. Further, in this embodiment, the cross-sentence characteristics of all the adjacent sentence pairs may be weighted and summed to obtain weighted cross-sentence characteristics; learning the weighted cross sentence features by controlling each phoneme feature of an encoder of the speech synthesis model through a self-attention mechanism network structure; wherein the key vector and the value vector of the self-attention mechanism network structure are the weighted cross sentence features, and the query vector of the self-attention mechanism network structure is the phoneme feature vector of the encoder; and training the voice synthesis model by using the weighted cross sentence characteristics.
This embodiment can obtain cross-sentence features without relying on a language model. Specifically, the word vectors of each sentence text are used to obtain a sentence vector for each sentence, and the sentence vectors of two adjacent sentences are then spliced together as the cross-sentence feature of those two sentences. In this way, a word vector table is trained for word vector lookup, and the word vectors of the words in each sentence are averaged to obtain the sentence vector. Once the sentence vectors are obtained, the cross-sentence features can be used in either of the two ways mentioned above, i.e., PSE and CSE. If the cross-sentence features are used in the CSE manner, the sentence vectors of the context sentences can be spliced together and then spliced with the output of the TTS encoder for speech synthesis, as shown in fig. 4. If the cross-sentence features are used in the PSE manner, either the sentence vector of each sentence can be used as the key and value sequences in fig. 6, with the encoder output as the query vectors, to obtain a cross-sentence feature for each phoneme; or the sentence vectors of two adjacent sentences can be spliced together as the cross-sentence feature of those two sentences, and all such cross-sentence feature vectors are then used as the key and value sequences in fig. 6.
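The word-vector-table variant can be sketched as follows; the lookup table is assumed to be a mapping from word to vector, and the function names and the dimension are illustrative assumptions.

```python
import numpy as np

def sentence_vector(words, word_vectors, dim=300):
    """Average the word vectors of a sentence (given as a list of words)."""
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def pair_cross_sentence_feature(words_a, words_b, word_vectors):
    """Splice the sentence vectors of two adjacent sentences (PSE-style usage)."""
    return np.concatenate([sentence_vector(words_a, word_vectors),
                           sentence_vector(words_b, word_vectors)])
```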
The present application also provides a scheme for obtaining cross-sentence features by performing a text feature extraction operation on the sentences in the sentence set based on a Student model obtained by knowledge distillation. The specific process is as follows: before the text feature extraction operation is performed on the sentences in the sentence set to obtain cross-sentence features, knowledge distillation is performed on a language model to obtain a Student model, so that the Student model learns the cross-sentence features extracted by the language model; the text feature extraction operation is then performed on the sentences in the sentence set using the Student model to obtain the cross-sentence features. Knowledge distillation is a model compression method that can transfer the knowledge learned by a high-accuracy but heavyweight Teacher model (such as the language model mentioned above) into a more compact Student model better suited for deployment. The Student model in this embodiment is the result of distilling the language model; it can extract cross-sentence features from the sentence set while requiring little storage space.
As a feasible implementation manner, in this embodiment, the Student model may be used to perform text feature extraction on the sentences in the sentence set to obtain a sentence text vector of each sentence text, and the sentence text vectors are spliced to obtain the cross-sentence features.
Specifically, the embodiment can splice sentence text vectors corresponding to the forward sentence set to obtain forward cross-sentence characteristics; and the sentence text vectors corresponding to the backward sentence set can be spliced to obtain backward sentence-crossing characteristics. Further, the embodiment may further splice the forward cross sentence feature and the backward cross sentence feature to obtain a context cross sentence feature; acquiring phoneme characteristics output by an encoder of the speech synthesis model; and splicing the context cross sentence characteristics and the phoneme characteristics to obtain a spliced characteristic vector, and training the speech synthesis model by using the spliced characteristic vector.
Specifically, the sentence text vectors of adjacent sentence texts may be spliced to obtain the cross-sentence characteristics of the adjacent sentence pairs. Further, in this embodiment, the cross-sentence characteristics of all the adjacent sentence pairs may be weighted and summed to obtain weighted cross-sentence characteristics; learning the weighted cross sentence features by controlling each phoneme feature of an encoder of the speech synthesis model through an attention mechanism network structure; wherein the key vector and the value vector of the self-attention mechanism network structure are the weighted cross sentence features, and the query vector of the self-attention mechanism network structure is the phoneme feature vector of the encoder; and training the voice synthesis model by using the weighted cross sentence characteristics.
The present application learns the feature vectors produced by the BERT model with a simple small model, i.e., performs knowledge distillation of the cross-sentence features of the BERT model. Because the BERT model is large and has many parameters, it is not suitable for deployment and inference under constrained conditions. For example, in an environment with a weak CPU, BERT inference may be very slow; if the BERT model is deployed on a mobile device, its large storage requirement brings many challenges, and the inference speed of the mobile device may not meet the inference requirements of the BERT model. Therefore, on this basis, the present application proposes to learn the cross-sentence features extracted by the BERT model with a simple small model, and then to combine the cross-sentence features extracted by the small model with PSE/CSE for speech synthesis.
The above embodiments refer to the small model as the Student model, which takes the cross-sentence features extracted by BERT as its training target, so that the Student model learns the knowledge of the large BERT model and the extraction of cross-sentence features is not seriously affected while the model complexity is reduced. The input of the Student model can be words, or phonemes can be used as input as in the TTS model; if words are used as input, a relatively large word vector table needs to be maintained, whereas if phonemes are used as input, the phoneme vector table can be shared with the TTS model. The Student model in this embodiment may be any simple model, for example an RNN model, a CNN model, or a model based on the Transformer network structure. Taking an RNN as an example, for the CSE manner the Student model takes the words or phonemes of one sentence as input and outputs a vector that is as similar as possible to the sentence vector obtained from BERT. For the PSE manner, the Student model takes the words or phonemes of two adjacent sentences as input and outputs a vector that is as similar as possible to the cross-sentence feature of those two adjacent sentences obtained from BERT. With the Student model, speech synthesis based on cross-sentence features can be performed in the same way as with PSE or CSE as described above. Meanwhile, the Student model can guarantee smaller storage requirements and faster inference, enabling speech synthesis based on cross-sentence features under resource-constrained conditions.
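A minimal sketch of the distillation objective described above is given below: a small GRU-based Student takes the word or phoneme IDs of a sentence (or of two adjacent sentences for the PSE manner) and is trained to reproduce the BERT cross-sentence feature with an MSE loss. All module, parameter, and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StudentEncoder(nn.Module):
    """Small recurrent model that regresses the cross-sentence feature from BERT."""

    def __init__(self, vocab_size=100, dim=256, bert_dim=768):
        super().__init__()
        # the phoneme vector table can be shared with the TTS model if phonemes are used
        self.embedding = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, bert_dim)

    def forward(self, token_ids):
        # token_ids: (batch, T) word or phoneme IDs of one sentence (CSE) or of
        # two adjacent sentences concatenated (PSE)
        x = self.embedding(token_ids)
        _, h = self.rnn(x)                       # h: (1, batch, dim)
        return self.proj(h.squeeze(0))           # predicted cross-sentence feature

# Distillation step: match the teacher (BERT) feature with an MSE loss.
# student = StudentEncoder()
# loss = nn.functional.mse_loss(student(token_ids), bert_feature.detach())
# loss.backward()
```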
In the experiments, MUSHRA MOS (mean opinion score) and ABN choice tests were adopted to evaluate model performance. A total of 65 people participated in the scoring, of whom 50 were Chinese speakers and 15 were English speakers.
The MUSHRA scores of the different models are shown in Table 1 below. The experimental results show that the speech synthesis model incorporating cross-sentence features in the PSE manner is clearly improved over the baseline model on the Chinese data, but the improvement on the English data is not obvious. This is because the prosody of the English data does not vary much, and the prosody of each sentence is relatively uniform. The speech synthesis model incorporating cross-sentence features in the CSE manner scores lower than the baseline because the CSE-based model produces some wrong tones in the synthesized speech, giving the raters a poorer impression and resulting in a lower overall score.
TABLE 1 MUSHRA score comparison Table
As shown in Table 2, the results of the Chinese synthesis experiment indicate that more testers preferred the audio synthesized by the PSE-based model.
TABLE 2 ABN preference test result statistics
According to the above experimental results, the end-to-end speech synthesis model based on cross sentence information provided by this embodiment can effectively utilize the cross sentence characteristics of the current text to improve the prosody performance of the model.
An embodiment of the present application further provides a speech synthesis apparatus, which may include:
the set construction module is used for acquiring a context sentence of a current sentence and constructing a sentence set comprising the current sentence and the context sentence;
the model training module is used for executing text feature extraction operation on the sentences in the sentence set to obtain cross-sentence features and training a speech synthesis model by using the cross-sentence features;
and the voice synthesis module is used for synthesizing the voice information of the target text by utilizing the trained voice synthesis model.
This embodiment obtains the context sentences of a current sentence and takes a sentence set comprising the current sentence and its context sentences as a sample for training a speech synthesis model. Specifically, this embodiment performs a text feature extraction operation on the sentences in the sentence set to obtain cross-sentence features, and because cross-sentence features can describe the discourse structure of the text's context, a speech synthesis model trained with the cross-sentence features can learn discourse-structure features. The prosody of read-aloud text is related to the discourse structure of its context, and the same sentence may have completely different prosodic realizations in different contexts, so the speech synthesis model trained with cross-sentence features in this embodiment can synthesize speech based on the discourse structure of the context, thereby improving the prosodic effect of speech synthesis.
Further, the model training module comprises:
a set determining unit for determining a forward sentence set and a backward sentence set; wherein the forward sentence set comprises the current sentence and a context sentence preceding the current sentence, and the backward sentence set comprises the current sentence and a context sentence following the current sentence;
and the forward and backward cross sentence characteristic calculation unit is used for calculating the forward cross sentence characteristics of the forward sentence set and the backward cross sentence characteristics of the backward sentence set.
Further, a forward and backward cross sentence characteristic calculation unit, configured to input all the forward sentence sets into the language model to obtain the forward cross sentence characteristics; and the language model is also used for inputting all the backward sentence sets into the language model to obtain the backward cross sentence characteristics.
Further, the process of training the speech synthesis model by the model training module using the cross sentence characteristics comprises: splicing the forward cross sentence characteristics and the backward cross sentence characteristics to obtain context cross sentence characteristics; acquiring phoneme characteristics output by an encoder of the speech synthesis model; and splicing the context cross sentence characteristics and the phoneme characteristics to obtain a spliced characteristic vector, and training the speech synthesis model by using the spliced characteristic vector.
Further, the model training module comprises:
a first feature extraction unit, configured to obtain adjacent sentence pairs in the sentence set; the adjacent sentence pair is input into a language model to obtain cross sentence characteristics of the adjacent sentence pair;
the training unit is used for weighting and summing the cross sentence characteristics of all the adjacent sentence pairs to obtain weighted cross sentence characteristics; further for learning the weighted cross sentence features by each phoneme feature of an encoder controlling the speech synthesis model through a self-attention mechanism network structure; wherein the key vector and the value vector of the self-attention mechanism network structure are the weighted cross sentence features, and the query vector of the self-attention mechanism network structure is the phoneme feature vector of the encoder; and is further configured to train the speech synthesis model using the weighted cross-sentence features.
Further, the model training module comprises:
the second characteristic extraction unit is used for inquiring the word vector of each word in the sentence set by using a word vector table, and carrying out average value calculation on the word vectors corresponding to the sentence texts in the context sentences to obtain sentence text vectors; and the sentence-crossing feature is obtained by splicing the sentence text vectors.
Further, the method also comprises the following steps:
the knowledge distillation module is used for performing knowledge distillation on a language model to obtain a Student model before text feature extraction operation is performed on the sentences in the sentence set to obtain cross-sentence features, so that the Student model learns the cross-sentence features extracted by the language model;
correspondingly, the model training module comprises:
and the third feature extraction unit is used for executing text feature extraction operation on the sentences in the sentence set by using the Student model to obtain the cross-sentence features.
Since the embodiments of the apparatus portion and the method portion correspond to each other, please refer to the description of the embodiments of the method portion for the embodiments of the apparatus portion, which is not repeated here.
The present application also provides a storage medium having a computer program stored thereon, which when executed, may implement the steps provided by the above-described embodiments. The storage medium may include: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The present application further provides an electronic device, and referring to fig. 7, a structure diagram of an electronic device provided in an embodiment of the present application may include a processor 710 and a memory 720, as shown in fig. 7.
Processor 710 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 710 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). Processor 710 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in a wake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 710 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 710 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 720 may include one or more computer-readable storage media, which may be non-transitory. Memory 720 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In this embodiment, the memory 720 is at least used for storing the following computer program 721, wherein after being loaded and executed by the processor 710, the computer program can implement the relevant steps in the speech synthesis method disclosed in any of the foregoing embodiments. In addition, the resources stored by the memory 720 may also include an operating system 722, data 723, and the like, which may be stored in a transient or persistent manner. The operating system 722 may include Windows, linux, android, and the like.
In some embodiments, the electronic device may also include a display 730, an input output interface 740, a communication interface 750, sensors 760, a power supply 770, and a communication bus 780.
Of course, the structure of the electronic device shown in fig. 7 does not constitute a limitation of the electronic device in the embodiments of the present application; in practical applications, the electronic device may include more or fewer components than those shown in fig. 7, or combine certain components.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and the relevant points can be found in the description of the method part. It should be noted that those skilled in the art can make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.
It should also be noted that, in this specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a/an ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A method of speech synthesis, comprising:
acquiring a context sentence of a current sentence, and constructing a sentence set comprising the current sentence and the context sentence;
performing a text feature extraction operation on the sentences in the sentence set to obtain cross-sentence features, and training a speech synthesis model by using the cross-sentence features;
and synthesizing speech information of a target text by using the trained speech synthesis model.
2. The speech synthesis method of claim 1, wherein performing a text feature extraction operation on the sentences in the set of sentences to obtain cross-sentence features comprises:
determining a forward sentence set and a backward sentence set; wherein the forward sentence set comprises the current sentence and a context sentence preceding the current sentence, and the backward sentence set comprises the current sentence and a context sentence following the current sentence;
and calculating forward cross-sentence features of the forward sentence set and backward cross-sentence features of the backward sentence set.
3. The speech synthesis method of claim 2, wherein calculating the forward cross-sentence features of the forward sentence set and the backward cross-sentence features of the backward sentence set comprises:
inputting the entire forward sentence set into a language model to obtain the forward cross-sentence features;
and inputting the entire backward sentence set into the language model to obtain the backward cross-sentence features.
4. The speech synthesis method of claim 3, wherein training the speech synthesis model by using the cross-sentence features comprises:
splicing the forward cross-sentence features and the backward cross-sentence features to obtain context cross-sentence features;
acquiring phoneme features output by an encoder of the speech synthesis model;
and splicing the context cross-sentence features and the phoneme features to obtain a spliced feature vector, and training the speech synthesis model by using the spliced feature vector.
5. The speech synthesis method of claim 1, wherein performing a text feature extraction operation on the sentences in the sentence set to obtain cross-sentence features comprises:
acquiring adjacent sentence pairs in the sentence set;
inputting the adjacent sentence pairs into a language model to obtain cross-sentence features of the adjacent sentence pairs;
correspondingly, training the speech synthesis model by using the cross-sentence features comprises:
weighting and summing the cross-sentence features of all the adjacent sentence pairs to obtain weighted cross-sentence features;
causing each phoneme feature of an encoder of the speech synthesis model to learn the weighted cross-sentence features through an attention mechanism network structure; wherein the key vector and the value vector of the attention mechanism network structure are the weighted cross-sentence features, and the query vector of the attention mechanism network structure is the phoneme feature vector of the encoder;
and training the speech synthesis model by using the weighted cross-sentence features.
6. The speech synthesis method of claim 1, wherein performing a text feature extraction operation on the sentences in the set of sentences to obtain cross-sentence features comprises:
querying the word vector of each word in the sentence set by using a word vector table, and averaging the word vectors corresponding to the sentence texts in the context sentences to obtain sentence text vectors;
and splicing the sentence text vectors to obtain the cross-sentence features.
7. The speech synthesis method of claim 1, wherein, before performing the text feature extraction operation on the sentences in the sentence set to obtain the cross-sentence features, the method further comprises:
performing knowledge distillation on a language model to obtain a Student model, so that the Student model learns the cross-sentence features extracted by the language model;
correspondingly, performing the text feature extraction operation on the sentences in the sentence set to obtain the cross-sentence features comprises:
and performing the text feature extraction operation on the sentences in the sentence set by using the Student model to obtain the cross-sentence features.
8. A speech synthesis apparatus, comprising:
the set construction module is used for acquiring a context sentence of a current sentence and constructing a sentence set comprising the current sentence and the context sentence;
the model training module is used for executing text feature extraction operation on the sentences in the sentence set to obtain cross-sentence features and training a speech synthesis model by using the cross-sentence features;
and the speech synthesis module is used for synthesizing speech information of a target text by using the trained speech synthesis model.
9. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when invoking the computer program in the memory, implements the steps of the speech synthesis method according to any one of claims 1 to 7.
10. A storage medium having stored thereon computer-executable instructions which, when loaded and executed by a processor, implement the steps of the speech synthesis method according to any one of claims 1 to 7.
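As a purely illustrative sketch of the attention-based fusion recited in claims 4 and 5, assuming PyTorch, batch-first tensors, and hypothetical feature dimensions (the module name and shapes are assumptions; only the query/key/value roles follow claim 5):

```python
# Illustrative sketch only: the module name, feature dimensions and batch-first
# layout are assumptions; only the query/key/value roles follow claim 5.
import torch
import torch.nn as nn

class CrossSentenceFusion(nn.Module):
    """Each phoneme feature of the encoder attends to the weighted cross-sentence
    features (query = phoneme features, key = value = weighted cross-sentence
    features); the attended context is then spliced onto the phoneme features."""

    def __init__(self, phoneme_dim=256, context_dim=256, num_heads=4):
        super().__init__()
        self.attention = nn.MultiheadAttention(
            embed_dim=phoneme_dim, kdim=context_dim, vdim=context_dim,
            num_heads=num_heads, batch_first=True,
        )

    def forward(self, phoneme_features, weighted_cross_sentence_features):
        # phoneme_features: (batch, num_phonemes, phoneme_dim), output of the encoder
        # weighted_cross_sentence_features: (batch, num_pairs, context_dim)
        context, _ = self.attention(
            query=phoneme_features,
            key=weighted_cross_sentence_features,
            value=weighted_cross_sentence_features,
        )
        # spliced feature vector used to train the speech synthesis model
        return torch.cat([phoneme_features, context], dim=-1)
```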
CN202110527979.8A 2021-05-14 2021-05-14 Voice synthesis method and device, electronic equipment and storage medium Pending CN115346510A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110527979.8A CN115346510A (en) 2021-05-14 2021-05-14 Voice synthesis method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110527979.8A CN115346510A (en) 2021-05-14 2021-05-14 Voice synthesis method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115346510A true CN115346510A (en) 2022-11-15

Family

ID=83946901

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110527979.8A Pending CN115346510A (en) 2021-05-14 2021-05-14 Voice synthesis method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115346510A (en)

Similar Documents

Publication Publication Date Title
CN110782870B (en) Speech synthesis method, device, electronic equipment and storage medium
US11929059B2 (en) Method, device, and computer readable storage medium for text-to-speech synthesis using machine learning on basis of sequential prosody feature
JP7464621B2 (en) Speech synthesis method, device, and computer-readable storage medium
JP7280386B2 (en) Multilingual speech synthesis and cross-language voice cloning
CN111739508B (en) End-to-end speech synthesis method and system based on DNN-HMM bimodal alignment network
WO2022188734A1 (en) Speech synthesis method and apparatus, and readable storage medium
EP4029010B1 (en) Neural text-to-speech synthesis with multi-level context features
KR20230043084A (en) Method and computer readable storage medium for performing text-to-speech synthesis using machine learning based on sequential prosody feature
CN115485766A (en) Speech synthesis prosody using BERT models
WO2021212954A1 (en) Method and apparatus for synthesizing emotional speech of specific speaker with extremely few resources
CN111192568A (en) Speech synthesis method and speech synthesis device
CN113808571B (en) Speech synthesis method, speech synthesis device, electronic device and storage medium
Fahmy et al. A transfer learning end-to-end arabic text-to-speech (tts) deep architecture
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
Liu et al. A novel method for Mandarin speech synthesis by inserting prosodic structure prediction into Tacotron2
CN112802451B (en) Prosodic boundary prediction method and computer storage medium
CN113823259A (en) Method and device for converting text data into phoneme sequence
He et al. DOP-tacotron: A fast chinese TTS system with local-based attention
CN115346510A (en) Voice synthesis method and device, electronic equipment and storage medium
Khorram et al. Soft context clustering for F0 modeling in HMM-based speech synthesis
CN115392189B (en) Method and device for generating multi-language mixed corpus and training method and device
CN114373445B (en) Voice generation method and device, electronic equipment and storage medium
Boco et al. An End to End Bilingual TTS System for Fongbe and Yoruba
Eirini End-to-End Neural based Greek Text-to-Speech Synthesis
WO2023102929A1 (en) Audio synthesis method, electronic device, program product and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination