CN110264991B - Training method of speech synthesis model, speech synthesis method, device, equipment and storage medium - Google Patents

Training method of speech synthesis model, speech synthesis method, device, equipment and storage medium

Info

Publication number
CN110264991B
CN110264991B (application CN201910420168.0A)
Authority
CN
China
Prior art keywords
training
vector
style
voice
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910420168.0A
Other languages
Chinese (zh)
Other versions
CN110264991A (en)
Inventor
王健宗
贾雪丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910420168.0A priority Critical patent/CN110264991B/en
Publication of CN110264991A publication Critical patent/CN110264991A/en
Application granted granted Critical
Publication of CN110264991B publication Critical patent/CN110264991B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The application relates to the field of voice semantics, and particularly discloses a training method, a voice synthesis method, a device, equipment and a storage medium for realizing voice synthesis by using an attention mechanism and a neural network, wherein the training method comprises the following steps: acquiring a data set, wherein the data set comprises training text data and training voice data corresponding to the training text data; generating a training text vector according to the training text data; encoding the training speech data based on a first encoder to obtain a training embedded vector; marking the training embedded vector based on an attention mechanism to obtain a training style vector; and carrying out model training on a preset neural network model according to the training text vector, the training voice data and the training style vector so as to obtain a voice synthesis model.

Description

Training method of speech synthesis model, speech synthesis method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of speech technologies, and in particular, to a training method for a speech synthesis model, a speech synthesis method, a device, equipment, and a storage medium.
Background
Speech synthesis technology, i.e., Text To Speech (TTS) technology, converts text information into speech. With the continuous development of speech synthesis technology, people's requirements for synthesized speech have become increasingly diverse. Ideally, synthesized speech should exhibit a strong sense of rhythm and carry distinctive styles, such as the emotionally rich commentary and recitation styles, or more informal styles such as the thriller and crosstalk (xiangsheng) styles, thereby increasing the diversity of synthesized speech and meeting different user needs.
However, current TTS models cannot accurately define these styles and struggle to capture the details of each speaking style, so a specific style is poorly reflected in the synthesized speech and the user experience suffers.
Disclosure of Invention
The application provides a training method, a voice synthesis method, a device, equipment and a storage medium for a voice synthesis model.
In a first aspect, the present application provides a method for training a speech synthesis model, the method comprising:
Acquiring a data set, wherein the data set comprises training text data and training voice data corresponding to the training text data;
generating a training text vector according to the training text data;
encoding the training speech data based on a first encoder to obtain a training embedded vector;
marking the training embedded vector based on an attention mechanism to obtain a training style vector;
and carrying out model training on a preset neural network model according to the training text vector, the training voice data and the training style vector so as to obtain a voice synthesis model.
In a second aspect, the present application further provides a speech synthesis method, including:
obtaining a target text vector and a target voice style vector;
performing splicing processing on the target text vector and the target voice style vector to obtain a target spliced vector;
inputting the target splicing vector into a voice synthesis model to output target synthesized voice data; the speech synthesis model is a model trained by the training method of the speech synthesis model as described above.
In a third aspect, the present application further provides a training device for a speech synthesis model, where the device includes:
The data acquisition unit is used for acquiring a data set, wherein the data set comprises training text data and training voice data corresponding to the training text data;
the vector generation unit is used for generating training text vectors according to the training text data;
the vector coding unit is used for coding the training voice data based on the first coder so as to obtain a training embedded vector;
the vector acquisition unit is used for marking the training embedded vector based on an attention mechanism so as to obtain a training style vector;
and the model training unit is used for carrying out model training on a preset neural network model according to the training text vector, the training voice data and the training style vector so as to obtain a voice synthesis model.
In a fourth aspect, the present application further provides a speech synthesis apparatus, including:
the vector acquisition unit is used for acquiring a target text vector and a target voice style vector;
the vector splicing unit is used for carrying out splicing processing on the target text vector and the target voice style vector so as to obtain a target spliced vector;
the data output unit is used for inputting the target splicing vector into a voice synthesis model so as to output target synthesized voice data; the speech synthesis model is a model trained by the above-described training method of the speech synthesis model.
In a fifth aspect, the present application also provides a computer device comprising a memory and a processor; the memory is used for storing a computer program; the processor is configured to execute the computer program and implement the training method of the speech synthesis model or the speech synthesis method when executing the computer program.
In a sixth aspect, the present application further provides a computer readable storage medium storing a computer program, which when executed by a processor causes the processor to implement a training method of a speech synthesis model as described above or a speech synthesis method as described above.
The application discloses a training method, a voice synthesis method, a device, equipment and a storage medium of a voice synthesis model, wherein training voice data are encoded based on a first encoder to obtain training embedded vectors; marking the training embedded vector based on an attention mechanism to obtain a training style vector; and carrying out model training on a preset neural network model according to the training text vector, the training voice data and the training style vector so as to obtain a voice synthesis model. The speech synthesis model obtained by training the training method can synthesize natural target speech data, and the synthesized target speech data has a specific speaking style, is not mechanized speech any more, has rich emotion expressive force, and therefore improves the experience of users.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a training method of a speech synthesis model according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of sub-steps of the training method of the speech synthesis model of FIG. 1;
FIG. 3 is a schematic flow chart of the pinyin conversion of the training text data of FIG. 1;
FIG. 4 is a schematic flow chart of sub-steps of the training method of the speech synthesis model of FIG. 1;
FIG. 5 is a schematic flow chart of the construction steps of training style vectors provided by an embodiment of the present application;
FIG. 6 is a schematic flow chart of sub-steps of the training method of the speech synthesis model of FIG. 1;
FIG. 7 is a schematic flow chart diagram of a training model according to the training speech data and the training splice vector provided in an embodiment of the present application;
FIG. 8 is a schematic flow chart of steps of a speech synthesis method according to an embodiment of the present application;
FIG. 9 is a schematic block diagram of a training apparatus of a speech synthesis model according to an embodiment of the present application;
FIG. 10 is a schematic block diagram of a subunit of the training device of the speech synthesis model of FIG. 9;
FIG. 11 is a schematic block diagram of a subunit of the training device of the speech synthesis model of FIG. 9;
FIG. 12 is a schematic block diagram of a speech synthesis apparatus according to an embodiment of the present application;
fig. 13 is a schematic block diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations.
The embodiment of the application provides a training method and device of a speech synthesis model, computer equipment and a storage medium. The training method of the speech synthesis model can be used for synthesizing speech data with a certain style.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating steps of a training method of a speech synthesis model according to an embodiment of the present application.
As shown in fig. 1, the training method of the speech synthesis model specifically includes: step S110 to step S150.
S110, acquiring a data set, wherein the data set comprises training text data and training voice data corresponding to the training text data.
Specifically, the training text data is text data used in a training stage for training a speech synthesis model. The training voice data is voice data corresponding to training text data, and is voice data marked by a developer.
S120, generating a training text vector according to the training text data.
Specifically, after the training text data is obtained, vector conversion may be performed on the training text data to generate a training text vector.
As shown in fig. 2, in an embodiment, the specific process of generating a training text vector according to the training text data, i.e. step S120, may include sub-steps S121 and S122.
S121, performing pinyin conversion on the training text data to obtain corresponding pinyin strings.
In an embodiment, the specific process of pinyin conversion on the training text data is shown in fig. 3, i.e. step S121 may include sub-steps S1211, S1212 and S1213.
S1211, performing word segmentation processing on the training text data to obtain a plurality of word strings.
The word segmentation processing is performed on the training text data to obtain a plurality of word strings, which specifically may include: performing sentence segmentation on the training text data to obtain a plurality of corresponding sentences; and performing word segmentation processing on a plurality of sentences to obtain a plurality of word strings.
Specifically, after the training text data is obtained, sentence segmentation can be performed on the training text data, for example, each training text data can be segmented into a complete sentence according to punctuation marks. Then, word segmentation processing is performed on each sentence, thereby obtaining a plurality of word strings. In an embodiment, word segmentation processing can be performed on each segmented sentence through a word segmentation method of character string matching.
For example, the word segmentation method of the character string matching can be a forward maximum matching method, a reverse maximum matching method, a shortest path word segmentation method, a bidirectional maximum matching method and the like. The forward maximum matching method is to divide a character string in a segmented sentence from left to right. The reverse maximum matching method refers to word segmentation from right to left of a character string in a segmented sentence. The bidirectional maximum matching method refers to the simultaneous word segmentation matching in the forward and reverse directions (left to right and right to left). The shortest path word segmentation method means that the number of words required to be segmented out in a character string in a segmented sentence is the smallest.
In other embodiments, word segmentation processing may also be performed on each segmented sentence by a word-sense word segmentation method. The word-sense word segmentation method performs segmentation by machine judgment of word meaning, using syntactic and semantic information to resolve ambiguity.
Taking the bidirectional maximum matching method as an example, a Chinese dictionary containing a word set is obtained. Given that the longest entry in the dictionary has length m, consecutive characters of length m in the segmented sentence are matched against the dictionary entries in both the forward and reverse directions. If no dictionary entry matches, the length of the character window is reduced step by step and the scan is repeated until an entry in the dictionary is matched, and finally a plurality of word strings are obtained.
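As a concrete illustration of the maximum matching methods described above, the following Python sketch implements forward, reverse and bidirectional maximum matching against a toy dictionary; the dictionary contents and the tie-breaking rule are illustrative assumptions, not part of the application.

```python
# Hypothetical sketch of maximum-matching word segmentation; the dictionary and the
# tie-breaking rule are illustrative assumptions, not taken from the application.
WORD_SET = {"你好", "吗", "早上", "好", "张三"}           # toy Chinese dictionary
MAX_LEN = max(len(w) for w in WORD_SET)                   # length m of the longest entry

def forward_max_match(sentence):
    """Scan left to right, always taking the longest dictionary word that matches."""
    i, words = 0, []
    while i < len(sentence):
        for length in range(min(MAX_LEN, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in WORD_SET:       # single characters are kept as-is
                words.append(candidate)
                i += length
                break
    return words

def backward_max_match(sentence):
    """Scan right to left, mirroring forward_max_match."""
    i, words = len(sentence), []
    while i > 0:
        for length in range(min(MAX_LEN, i), 0, -1):
            candidate = sentence[i - length:i]
            if length == 1 or candidate in WORD_SET:
                words.insert(0, candidate)
                i -= length
                break
    return words

def bidirectional_max_match(sentence):
    """Pick the result with fewer words; prefer the backward result on a tie."""
    fwd, bwd = forward_max_match(sentence), backward_max_match(sentence)
    return fwd if len(fwd) < len(bwd) else bwd

print(bidirectional_max_match("你好吗"))   # ['你好', '吗']
```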
S1212, performing pinyin conversion on each word string to obtain sub pinyin strings corresponding to each word string.
For example, the training text information S is subjected to word segmentation to obtain N word strings FS1, FS2, ..., FSN. After pinyin conversion of the N word strings, the corresponding sub pinyin strings PS1, PS2, ..., PSN are obtained. For example, after the word string "张三" (Zhang San) is pinyin-converted, the sub pinyin string "zhang1san1" is obtained, where the numeral 1 indicates that the tone is yin-ping (the first tone).
S1213, performing splicing processing on each sub pinyin string to obtain the pinyin string.
For example, the training text data "你好吗" is subjected to word segmentation to obtain the two word strings "你好" and "吗", and pinyin conversion of these two word strings yields the two sub pinyin strings "ni1hao3" and "ma0", where the numeral 3 indicates the third (rising-falling) tone and the numeral 0 indicates the neutral tone. Splicing the two sub pinyin strings "ni1hao3" and "ma0" yields the pinyin string "ni1hao3ma0".
S122, converting the pinyin string into a digital sequence based on the corresponding relation of the alphanumeric characters, and storing the digital sequence as a training text vector.
In an embodiment, based on the corresponding relationship of the alphanumeric characters, the pinyin string is converted into a numeric sequence, and before the numeric sequence is stored as the training text vector, the method further comprises:
and establishing a corresponding relation of the character and the number according to the preset character sequence and the preset number of numbers.
Specifically, the character number correspondence has a character sequence and numbers corresponding to each character in the character sequence, and each character corresponds to a number. The character types can be letters, numbers, spaces, and the like.
In an embodiment, the establishing the corresponding relationship between the character and the number according to the preset character sequence and the preset number of numbers specifically includes: acquiring a preset character sequence and a preset number of digits; and marking the characters in the character sequence according to the numbers to obtain the corresponding relation of the character numbers.
Wherein the preset number may be greater than or equal to the length of the character sequence.
For example, the character sequence has 32 characters, each with a corresponding number. The 32 characters may comprise the 26 English letters, the digits 0, 1, 2, 3 and 4, and the space. The 26 letters, the 5 digits and the space are arranged in order and labelled with the numbers 0-31, so that each character corresponds to a numeric label. Specifically, the character-number correspondence is shown in Table 1.
Table 1: schematic representation of the character-number correspondence
a-z → 0-25; "0" → 26, "1" → 27, "2" → 28, "3" → 29, "4" → 30; space → 31
For example, the training text data is "你好吗" and the corresponding pinyin string is "ni1hao3ma0". Based on the character-number correspondence in Table 1, the pinyin string "ni1hao3ma0" may be converted into the number sequence 13/8/27/7/0/14/29/12/0/26, which is stored as the training text vector (13, 8, 27, 7, 0, 14, 29, 12, 0, 26).
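The character-number lookup of step S122 can be illustrated with a short Python sketch; the a-z → 0-25, "0"-"4" → 26-30, space → 31 ordering is taken from the examples above, while the helper name is an assumption.

```python
# Illustrative sketch of the Table 1 character-number correspondence; the exact
# ordering (a-z -> 0-25, '0'-'4' -> 26-30, space -> 31) follows the examples above.
import string

CHARS = list(string.ascii_lowercase) + list("01234") + [" "]
CHAR_TO_NUM = {c: i for i, c in enumerate(CHARS)}    # 32 characters labelled 0-31

def pinyin_to_vector(pinyin_string):
    """Map each character of the pinyin string to its numeric label."""
    return [CHAR_TO_NUM[c] for c in pinyin_string.lower()]

print(pinyin_to_vector("ni1hao3ma0"))
# [13, 8, 27, 7, 0, 14, 29, 12, 0, 26]  -> matches the training text vector above
print(pinyin_to_vector("zao3shang4hao3"))
# [25, 0, 14, 29, 18, 7, 0, 13, 6, 30, 7, 0, 14, 29]
```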
S130, based on the first encoder, the training voice data are encoded to obtain a training embedded vector.
Wherein the first encoder comprises a convolutional neural network (Convolutional Neural Network, CNN) followed by a recurrent neural network (Recurrent Neural Network, RNN). The convolutional neural network may include a plurality of convolutional layers. For example, the convolutional neural network includes 6 convolutional layers, each of which uses a convolution kernel of the same size to extract convolutional features; a smaller convolution kernel, such as a 3 x 3 kernel with a stride (stride) of 2, is typically adopted to capture the speech characteristics of the training speech data. The 6 convolutional layers may use, for example, 32, 32, 64, 64, 128 and 128 output channels respectively, so that the speech feature output by the last convolutional layer of the convolutional neural network is a three-dimensional tensor.
The recurrent neural network may be a unidirectional GRU comprising 128 hidden neurons. In one embodiment, a reshaping layer may be provided between the convolutional neural network and the recurrent neural network, and the output of the convolutional neural network is adjusted, in conjunction with batch normalization (Batch Normalization, BN) and the ReLU (Rectified Linear Unit) activation function, to fit the input of the recurrent neural network.
Specifically, the logarithmic mel spectrum of the training speech data is input into the convolutional neural network, the output of the convolutional neural network is then input into the recurrent neural network, and the output of the recurrent neural network is used as the training embedded vector, so that the prosody of training speech signals of different lengths is converted into a fixed-length training embedded vector.
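A minimal PyTorch sketch of such a first (reference) encoder is given below. The kernel size, stride, channel counts and 128-unit unidirectional GRU follow the description above, but the module names, tensor shapes and the reshaping step are illustrative assumptions rather than the application's actual implementation.

```python
# Hedged sketch of the first (reference) encoder: 6 conv layers + BN/ReLU + GRU.
# Shapes and module names are assumptions; only the structure follows the text above.
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    def __init__(self, n_mels=80, gru_units=128):
        super().__init__()
        channels = [1, 32, 32, 64, 64, 128, 128]
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels[i], channels[i + 1], kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(channels[i + 1]),
                nn.ReLU(),
            )
            for i in range(6)
        ])
        mel_out = n_mels
        for _ in range(6):                       # each stride-2 layer roughly halves the mel axis
            mel_out = (mel_out + 1) // 2
        self.gru = nn.GRU(input_size=128 * mel_out, hidden_size=gru_units, batch_first=True)

    def forward(self, mel):                      # mel: (batch, frames, n_mels) log-mel spectrum
        x = mel.unsqueeze(1)                     # (batch, 1, frames, n_mels)
        for conv in self.convs:
            x = conv(x)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)   # "reshaping layer" before the GRU
        _, hidden = self.gru(x)                  # final GRU state = fixed-length embedding
        return hidden.squeeze(0)                 # (batch, 128) training embedded vector

embedding = ReferenceEncoder()(torch.randn(2, 120, 80))   # 2 utterances, 120 frames, 80 mel bands
print(embedding.shape)                                     # torch.Size([2, 128])
```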
In an embodiment, before the encoding the training speech data based on the first encoder to obtain the training embedded vector, the method further includes: preprocessing the training voice data to obtain a corresponding Mel (Mel) frequency spectrum so as to extract the voice characteristics sensitive to human ears.
The encoding the training speech data based on the first encoder to obtain a training embedded vector includes:
the mel spectrum is input to a first encoder such that the first encoder encodes the mel spectrum to obtain a training embedded vector.
Wherein, the preprocessing the training voice data specifically includes: carrying out framing windowing processing on the training voice data to obtain processed training voice data; and carrying out frequency domain transformation on the processed training voice data to obtain a corresponding amplitude spectrum, wherein the amplitude spectrum is the Mel frequency spectrum.
Specifically, the frame-dividing and windowing process may specifically divide the training speech data according to a set frame length, for example, 60ms, so as to obtain divided training speech data, and then add a hamming window to the divided training speech data, where the hamming window processing refers to multiplying the divided speech information by a window function, so as to perform fourier expansion.
The frequency domain transformation is to perform fast fourier transform (Fast Fourier Transform, FFT) on the training speech data after the frame division windowing process to obtain corresponding parameters, and in this embodiment, to obtain the amplitude as the amplitude spectrum, i.e. the amplitude after the fast fourier transform. Of course, other parameters after FFT may be used, such as amplitude plus phase information.
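The preprocessing above (framing, Hamming windowing, FFT magnitude) can be sketched with librosa as follows; applying a mel filter bank and log compression afterwards is common practice and an assumption here, as are the frame length, hop length and number of mel bands.

```python
# Hedged preprocessing sketch: framing + Hamming window + FFT magnitude, followed by a
# mel filter bank and log compression (the filter-bank/log step is standard practice and
# an assumption; the text above equates the magnitude spectrum with the mel spectrum).
import numpy as np
import librosa

def log_mel_spectrum(wav_path, sr=16000, frame_ms=60, hop_ms=15, n_mels=80):
    wav, _ = librosa.load(wav_path, sr=sr)
    n_fft = int(sr * frame_ms / 1000)             # 60 ms frame length, as in the example
    hop = int(sr * hop_ms / 1000)
    # Framing + Hamming window + FFT; keep only the magnitude (amplitude spectrum).
    magnitude = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop, window="hamming"))
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel = mel_fb @ magnitude                       # project onto mel bands
    return np.log(mel + 1e-6).T                    # (frames, n_mels) log-mel spectrum

# mel = log_mel_spectrum("training_utterance.wav")   # hypothetical file path
```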
And S140, marking the training embedded vector based on an attention mechanism to obtain a training style vector.
Specifically, according to the attention mechanism, the training embedded vector can be subjected to style marking, so that a training style vector is obtained.
In one embodiment, as shown in fig. 4, the attention-based mechanism marks the training embedded vector to obtain a training style vector, and specifically includes sub-steps S141 to S143.
S141, acquiring a plurality of initial voice style vectors.
The style types are, for example, four: recitation style, crosstalk style, thriller style, and commentary style. The initial speech style vector of the recitation style is, for example, (1, 0, 0, 0), that of the crosstalk style is (0, 1, 0, 0), that of the thriller style is (0, 0, 1, 0), and that of the commentary style is (0, 0, 0, 1).
It should be noted that in other embodiments, the style may be one, two, three, five or more.
S142, calculating the similarity between the training embedded vector and each initial voice style vector according to an attention mechanism.
In an embodiment, the calculating the similarity between the training embedded vector and each of the initial speech style vectors according to the attention mechanism specifically includes: and inputting the training embedded vector and the initial voice style vector into an attention model to output the similarity of the training embedded vector and each initial voice style vector.
The attention model adopts a multi-head attention mechanism and a softmax activation function to simulate how humans express language information in natural speech. That is, when a human uses speech to express language information with a certain style, different amounts of attention are allocated to each audio segment of the speech data according to that style: the style type or types related to the speech data receive more attention, while unrelated style types are ignored.
Specifically, the multi-head attention mechanism comprises a plurality of attention heads, for example 8 dot-product attention heads.
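To make the similarity computation concrete, the sketch below shows a single scaled dot-product attention head in which the training embedded vector acts as the query and the initial speech style vectors act as keys and values; the softmax output gives the similarities, and their weighted sum anticipates step S143 below. The projection matrices and dimensions are assumptions, and the application's multi-head variant (e.g. 8 heads) is reduced to one head for brevity.

```python
# Hedged sketch of attention-based style marking: the reference embedding is the query,
# the initial speech style vectors are the keys/values, and softmax similarities act as
# attention weights. A single head is shown; head splitting is omitted for brevity.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def style_attention(embedded, style_tokens, query_proj, key_proj):
    """Return (similarities, training style vector) for one attention head."""
    q = embedded @ query_proj                      # project reference embedding to a query
    k = style_tokens @ key_proj                    # project style tokens to keys
    scores = k @ q / np.sqrt(q.shape[-1])          # scaled dot-product similarity
    weights = softmax(scores)                      # attention weights over style types
    style_vector = weights @ style_tokens          # weighted sum = training style vector
    return weights, style_vector

rng = np.random.default_rng(0)
tokens = np.eye(4)                                 # four initial speech style vectors
embedded = rng.normal(size=128)                    # training embedded vector from the encoder
w, s = style_attention(embedded, tokens, rng.normal(size=(128, 16)), rng.normal(size=(4, 16)))
print(w.round(2), s.round(2))
```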
S143, constructing the training style vector according to the similarity of each initial voice style vector.
Specifically, the training style vector describes the distribution of attention between each style type and the training speech data. As shown in fig. 5, step S143 of constructing the training style vector according to the similarity of each initial speech style vector specifically includes sub-step S1431: taking the similarity corresponding to each initial speech style vector as the attention weight of that initial speech style vector, and weighting and summing the initial speech style vectors to obtain the training style vector.
The style types are, for example, four: recitation style, crosstalk style, thriller style, and commentary style, with initial speech style vectors A = (1, 0, 0, 0), B = (0, 1, 0, 0), C = (0, 0, 1, 0), and D = (0, 0, 0, 1) respectively. For example, for training speech data with the crosstalk style, the similarity between the training embedded vector and the initial speech style vector of the recitation style is 0.1, the similarity with that of the crosstalk style is 0.8, the similarity with that of the thriller style is 0.0, and the similarity with that of the commentary style is 0.1.
The similarity corresponding to each initial speech style vector is taken as its attention weight: the recitation-style initial speech style vector contributes to the training speech data with an attention weight of 0.1, the crosstalk-style vector with 0.8, the thriller-style vector with 0.0, and the commentary-style vector with 0.1. Weighting and summing all the initial speech style vectors according to these attention weights gives the training style vector corresponding to the crosstalk-style training speech data = 0.1 × A + 0.8 × B + 0.0 × C + 0.1 × D.
It should be noted that in other embodiments, the style may be one, two, three, five or more.
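As a quick numeric check of the weighted summation in the crosstalk-style example above (the code is only an illustration of the arithmetic):

```python
# Worked check of the weighted sum from the crosstalk-style example above.
import numpy as np

A, B, C, D = np.eye(4)                           # recitation, crosstalk, thriller, commentary
print(0.1 * A + 0.8 * B + 0.0 * C + 0.1 * D)     # [0.1 0.8 0.  0.1], the training style vector
```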
And S150, performing model training on a preset neural network model according to the training text vector, the training voice data and the training style vector so as to obtain a voice synthesis model.
In one embodiment, the preset neural network model may include a Tacotron model that uses a maximum likelihood function as its objective function. The Tacotron model includes an encoder, an attention mechanism, and a decoder. In other embodiments, the neural network model may also be another deep learning model, such as a GoogLeNet model. The Tacotron model is taken as an example below.
In an embodiment, the specific process of model training the preset neural network model according to the training text vector, the training voice data and the training style vector is shown in fig. 6, that is, step S150 includes sub-steps S151 and S152.
And S151, performing splicing processing on the training text vector and the training style vector to obtain a training spliced vector.
Exemplarily, the training text vector A = (a1, a2, a3, a4) and the training style vector B = (b1, b2, b3, b4) are spliced to obtain the training splicing vector C = (a1, a2, a3, a4, b1, b2, b3, b4).
And S152, performing model training on the neural network model according to the training voice data and the training splicing vector so as to obtain the voice synthesis model.
Specifically, the training splicing vector is input into a preset neural network model, and training synthesized voice data is output. And comparing the training synthesized voice data with the training voice data, and adjusting parameters in the neural network model according to a preset loss function so as to obtain a voice synthesis model.
In an embodiment, according to the training speech data and the training stitching vector, the specific process of model training for the neural network model is shown in fig. 7, that is, step S152 includes sub-steps S1521, S1522 and S1523.
S1521, inputting the training splice vector into the neural network model to output training synthesized voice data.
Specifically, after the training splice vector is obtained through the splicing process, the training splice vector is input into the neural network model, so that training synthesized voice data is output.
S1522, calculating the voice similarity between the training synthesized voice data and the training voice data.
Specifically, the training synthesized voice data and the training voice data are input into a pre-trained similarity model, so that the voice similarity of the training synthesized voice data and the training voice data is output. Wherein the similarity model may be, for example, a convolutional neural network model.
S1523, calculating a loss value according to the voice similarity and a preset loss function, and adjusting parameters in the neural network model according to the loss value to obtain a voice synthesis model.
Specifically, a loss function (loss function) is used to measure the degree of inconsistency between the training synthesized speech data produced by the model and the training speech data; the closer the training synthesized speech data is to the training speech data, the smaller the loss (loss).
Illustratively, the preset loss function is a cross entropy loss function. After the loss value is calculated from the speech similarity and the preset loss function, the parameters in the neural network model can be adjusted by back propagation according to stochastic gradient descent, so as to obtain the speech synthesis model. Back propagation is the process of continuously updating the weights and biases in the neural network model; when the loss value reaches 0 after a certain round of training, the training synthesized speech data matches the training speech data, and the weights and biases no longer need to be updated.
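Sub-steps S1521 to S1523 amount to a standard gradient-descent training step. The hedged PyTorch outline below treats the synthesis model, the pre-trained similarity model and the exact loss form as placeholders; only the synthesize / score / back-propagate / SGD-update structure is taken from the description above.

```python
# Hedged outline of one training step (S1521-S1523): synthesize, score the similarity
# between synthesized and reference speech, compute a loss, and back-propagate with SGD.
# `synthesis_model` and `similarity_model` are placeholders, not the application's code.
import torch

def train_step(synthesis_model, similarity_model, optimizer,
               training_splice_vector, training_speech):
    optimizer.zero_grad()
    synthesized = synthesis_model(training_splice_vector)        # S1521: training synthesized speech
    similarity = similarity_model(synthesized, training_speech)  # S1522: speech similarity in (0, 1)
    # S1523: a cross-entropy-style loss that goes to 0 as the similarity approaches 1.
    loss = -torch.log(similarity.clamp(min=1e-8)).mean()
    loss.backward()                                              # back propagation
    optimizer.step()                                             # stochastic gradient descent update
    return loss.item()

# optimizer = torch.optim.SGD(synthesis_model.parameters(), lr=1e-3)   # assumed setup
```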
According to the training method of the voice synthesis model, a data set is obtained; generating a training text vector according to the training text data; encoding the training speech data based on a first encoder to obtain a training embedded vector; marking the training embedded vector based on an attention mechanism to obtain a training style vector; and carrying out model training on a preset neural network model according to the training text vector, the training voice data and the training style vector so as to obtain a voice synthesis model. The speech synthesis model obtained by training the training method can synthesize natural target speech data, and the synthesized target speech data has a specific speaking style, is not mechanized speech any more, has rich emotion expressive force, and therefore improves the experience of users.
Referring to fig. 8, fig. 8 is a schematic flowchart illustrating steps of a speech synthesis method according to an embodiment of the present application.
As shown in fig. 8, the speech synthesis method specifically includes: step S210 to step S230.
S210, acquiring a target text vector and a target voice style vector.
In an embodiment, before obtaining the target text vector, the method further includes: target text data is acquired.
Specifically, the target text data may be news text, novel text, blog text, or the like.
The method for acquiring the target text vector specifically comprises the following steps: and acquiring a target text vector according to the target text data.
In an embodiment, the obtaining the target text vector according to the target text data specifically includes: word segmentation is carried out on the target text data so as to obtain a plurality of target word strings; performing pinyin conversion on each target word string to obtain a target sub pinyin string corresponding to each target word string; performing splicing treatment on each target sub pinyin string to obtain the target pinyin string; and converting the target pinyin string into a target number sequence based on a preset alphanumeric corresponding relation, and storing the target number sequence as a target text vector.
Specifically, after the target text data is obtained, sentence segmentation may be performed on the target text data, for example, the target text data may be segmented into a complete sentence according to punctuation marks. Then, word segmentation processing is performed on each sentence, so that a plurality of target word strings are obtained. In an embodiment, word segmentation processing can be performed on each segmented sentence through a word segmentation method of character string matching.
For example, after the target text data "早上好" ("good morning") is subjected to word segmentation, the two target word strings "早上" and "好" are obtained. Pinyin conversion of these two target word strings yields the corresponding target sub pinyin strings "zao3shang4" and "hao3", where the numerals indicate the tones. Splicing the two target sub pinyin strings yields the target pinyin string "zao3shang4hao3".
Exemplarily, the character-number correspondence may be as shown in Table 1. For example, the target text data is "早上好" ("good morning") and the corresponding target pinyin string is "zao3shang4hao3". Based on the character-number correspondence in Table 1, the target pinyin string "zao3shang4hao3" is converted into the target number sequence 25/0/14/29/18/7/0/13/6/30/7/0/14/29, which is stored as the target text vector (25, 0, 14, 29, 18, 7, 0, 13, 6, 30, 7, 0, 14, 29).
Wherein obtaining the target speech style vector comprises: acquiring a plurality of initial speech style vectors; and weighting and summing all the initial voice style vectors according to the attention weights corresponding to the various style types to obtain the target voice style vector.
The style types are, for example, four: recitation style, crosstalk style, thriller style, and commentary style, with initial speech style vectors A = (1, 0, 0, 0), B = (0, 1, 0, 0), C = (0, 0, 1, 0), and D = (0, 0, 0, 1) respectively. Assuming that, for speech with the crosstalk style, the attention weight corresponding to the recitation style is 0.1, that of the crosstalk style is 0.8, that of the thriller style is 0.0, and that of the commentary style is 0.1, then the target speech style vector = 0.1 × A + 0.8 × B + 0.0 × C + 0.1 × D.
The attention weights corresponding to the various style types in speech data of a specific style can be preset manually or trained in advance. For example, if the user wants to synthesize the target text data into speech with the crosstalk style, the attention weight corresponding to the recitation style is set to 0.1, that of the crosstalk style to 0.8, that of the thriller style to 0.0, and that of the commentary style to 0.1.
As another example, to synthesize the target text data into speech with the thriller style, the attention weight corresponding to the recitation style is set to 0.02, that of the crosstalk style to 0.05, that of the thriller style to 0.85, and that of the commentary style to 0.08.
It should be noted that in other embodiments, the style may be one, two, three, five or more.
S220, performing splicing processing on the target text vector and the target voice style vector to obtain a target spliced vector.
Exemplarily, the target text vector X = (x1, x2, x3, x4) and the target speech style vector Y = (y1, y2, y3, y4) are spliced to obtain the target splicing vector W = (x1, x2, x3, x4, y1, y2, y3, y4).
S230, inputting the target splicing vector into a voice synthesis model to output target synthesized voice data.
The speech synthesis model is a model obtained by the above training method of the speech synthesis model. Specifically, the target splicing vector is input into the speech synthesis model, so that target synthesized speech data with a specific style is output, such as target synthesized speech data with a recitation style, a crosstalk style, a thriller style, or a commentary style.
It can be understood that the target splicing vector may also be split into segments along the time sequence, so that the speech synthesis model can synthesize target synthesized speech data with a specific style segment by segment. For example, the sequence of target splicing vectors may be synthesized in two segments whose synthesized speech carries the recitation style and the commentary style respectively, thereby reflecting a change of style within the target text data during speech synthesis.
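Putting steps S210 to S230 together, an end-to-end inference call might look like the hedged sketch below; the text-to-vector helper, the style tokens and the trained model are placeholders carried over from the earlier illustrations, not the application's implementation.

```python
# Hedged inference sketch for S210-S230: build the target text vector and the target
# speech style vector, splice them, and feed the trained model. All helpers are placeholders.
import numpy as np

def synthesize(pinyin_text, style_weights, style_tokens, text_to_vector, speech_synthesis_model):
    target_text_vector = np.asarray(text_to_vector(pinyin_text))       # S210: pinyin -> numbers
    target_style_vector = np.asarray(style_weights) @ style_tokens     # S210: weighted style sum
    target_splice_vector = np.concatenate([target_text_vector, target_style_vector])   # S220
    return speech_synthesis_model(target_splice_vector)                # S230: synthesized speech

# Example call with the crosstalk-style weights from above (model and helpers assumed):
# audio = synthesize("zao3shang4hao3", [0.1, 0.8, 0.0, 0.1], np.eye(4),
#                    pinyin_to_vector, trained_speech_synthesis_model)
```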
According to the voice synthesis method, natural target voice data can be synthesized, the synthesized target voice data has a specific speaking style, is not mechanized voice any more, and has rich emotion expressive force, so that the experience of a user is improved.
Referring to fig. 9, fig. 9 is a schematic block diagram of a training apparatus for a speech synthesis model according to an embodiment of the present application, where the training apparatus is used to perform the training method of any one of the foregoing speech synthesis models. The training device of the speech synthesis model can be configured in a server or a terminal.
The servers may be independent servers or may be server clusters. The terminal can be electronic equipment such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, wearable equipment and the like.
As shown in fig. 9, the training apparatus 300 for a speech synthesis model includes: a data acquisition unit 310, a vector generation unit 320, a vector encoding unit 330, a vector acquisition unit 340, and a model training unit 350.
A data acquisition unit 310, configured to acquire a data set, where the data set includes training text data and training voice data corresponding to the training text data;
a vector generating unit 320, configured to generate a training text vector according to the training text data;
a vector encoding unit 330, configured to encode the training speech data based on a first encoder to obtain a training embedded vector;
a vector obtaining unit 340, configured to mark the training embedded vector based on an attention mechanism, so as to obtain a training style vector;
the model training unit 350 is configured to perform model training on a preset neural network model according to the training text vector, the training voice data and the training style vector, so as to obtain a voice synthesis model.
As shown in FIG. 9, in one embodiment, vector generation unit 320 includes a pinyin conversion subunit 321 and a vector storage subunit 322.
The pinyin conversion subunit 321 is configured to perform pinyin conversion on the training text data to obtain a corresponding pinyin string.
The vector storage subunit 322 is configured to convert the pinyin string into a number sequence based on the character-number correspondence, and store the number sequence as a training text vector.
As shown in fig. 10, in an embodiment, the vector acquisition unit 340 includes a style acquisition subunit 341, a similarity calculation subunit 342, and a vector construction subunit 343.
Style acquisition subunit 341 is configured to acquire a plurality of initial speech style vectors.
A similarity calculating subunit 342, configured to calculate, according to an attention mechanism, a similarity between the training embedded vector and each of the initial speech style vectors.
Vector construction subunit 343 is configured to construct the training style vector according to the similarity of each of the initial speech style vectors.
In one embodiment, the vector construction subunit 343 is specifically configured to weight and sum each of the initial speech style vectors with respect to the similarity corresponding to each of the initial speech style vectors as the attention weight of the initial speech style vector, so as to obtain the training style vector.
As shown in FIG. 11, in one embodiment, model training unit 350 includes a stitching processing subunit 351 and a model training subunit 352.
A stitching subunit 351, configured to perform stitching on the training text vector and the training style vector, so as to obtain a training stitching vector;
And a model training subunit 352, configured to perform model training on the neural network model according to the training speech data and the training concatenation vector, so as to obtain the speech synthesis model.
Referring to fig. 12, fig. 12 is a schematic block diagram of a speech synthesis apparatus according to an embodiment of the present application, where the speech synthesis apparatus may be configured in a terminal or a server for performing the foregoing speech synthesis method.
As shown in fig. 12, the speech synthesis apparatus 400 includes: a vector acquisition unit 410, a vector concatenation unit 420, and a data output unit 430.
The vector acquisition unit 410 is configured to acquire a target text vector and a target speech style vector.
And the vector stitching unit 420 is configured to perform stitching processing on the target text vector and the target speech style vector, so as to obtain a target stitched vector.
A data output unit 430 for inputting the target concatenation vector into a speech synthesis model to output target synthesized speech data; the speech synthesis model is a model which is obtained by training by the training method of the speech synthesis model.
It should be noted that, for convenience and brevity of description, the specific working process of the apparatus and each unit described above may refer to a corresponding process in the foregoing embodiment of the training method of the speech synthesis model, which is not described herein again.
The apparatus described above may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 13.
Referring to fig. 13, fig. 13 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device may be a server or a terminal.
With reference to FIG. 13, the computer device includes a processor, memory, and a network interface connected by a system bus, where the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program comprises program instructions that, when executed, cause the processor to perform a method of training a speech synthesis model.
The processor is used to provide computing and control capabilities to support the operation of the entire computer device.
The internal memory provides an environment for the execution of a computer program in a non-volatile storage medium that, when executed by the processor, causes the processor to perform a method of training a speech synthesis model.
The network interface is used for network communication such as transmitting assigned tasks and the like. It will be appreciated by those skilled in the art that the structure shown in fig. 13 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application applies, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
It should be appreciated that the processor may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. Wherein the general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Wherein the processor is configured to run a computer program stored in the memory to implement the steps of:
acquiring a data set, wherein the data set comprises training text data and training voice data corresponding to the training text data; generating a training text vector according to the training text data; encoding the training speech data based on a first encoder to obtain a training embedded vector; marking the training embedded vector based on an attention mechanism to obtain a training style vector; and carrying out model training on a preset neural network model according to the training text vector, the training voice data and the training style vector so as to obtain a voice synthesis model.
In one embodiment, when implementing the generating of the training text vector according to the training text data, the processor is configured to implement:
performing pinyin conversion on the training text data to obtain corresponding pinyin strings; based on the corresponding relation of the character numbers, the pinyin strings are converted into digital sequences, and the digital sequences are stored as training text vectors.
In one embodiment, when implementing the attention-based mechanism, the processor is configured to, when implementing the attention-based mechanism, tag the training embedded vector to obtain a training style vector, implement:
acquiring a plurality of initial speech style vectors; calculating the similarity between the training embedded vector and each initial speech style vector according to an attention mechanism; and constructing the training style vector according to the similarity of each initial voice style vector.
In one embodiment, the processor is configured, when implementing the constructing the training style vector according to the similarity of each of the initial speech style vectors, to implement:
and taking the similarity corresponding to each initial voice style vector as the attention weight of the initial voice style vector, and carrying out weighted summation on each initial voice style vector to obtain the training style vector.
In one embodiment, when implementing model training on a preset neural network model according to the training text vector, the training voice data and the training style vector to obtain a voice synthesis model, the processor is configured to implement:
performing splicing processing on the training text vector and the training style vector to obtain a training spliced vector; and performing model training on the neural network model according to the training voice data and the training splicing vector so as to obtain the voice synthesis model.
Wherein in another embodiment the processor is configured to run a computer program stored in the memory to implement the steps of:
obtaining a target text vector and a target voice style vector; performing splicing processing on the target text vector and the target voice style vector to obtain a target spliced vector; inputting the target splicing vector into a voice synthesis model to output target synthesized voice data; the speech synthesis model is a model trained by the training method of the speech synthesis model described in any one of the above.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, the computer program comprises program instructions, and the processor executes the program instructions to realize the training method or the speech synthesis method of any speech synthesis model provided by the embodiment of the application.
The computer readable storage medium may be an internal storage unit of the computer device according to the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, which are provided on the computer device.
While the application has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (7)

1. A method of training a speech synthesis model, comprising:
acquiring a data set, wherein the data set comprises training text data and training voice data corresponding to the training text data;
generating a training text vector according to the training text data;
Encoding the training speech data based on a first encoder to obtain a training embedded vector;
marking the training embedded vector based on an attention mechanism to obtain a training style vector; the marking the training embedded vector based on the attention mechanism to obtain a training style vector includes: acquiring a plurality of initial speech style vectors; calculating the similarity between the training embedded vector and each initial speech style vector according to an attention mechanism; constructing the training style vector according to the similarity of each initial voice style vector; the constructing the training style vector according to the similarity of the initial speech style vectors comprises the following steps: taking the similarity corresponding to each initial voice style vector as the attention weight of the initial voice style vector, and carrying out weighted summation on each initial voice style vector to obtain the training style vector;
model training is carried out on a preset neural network model according to the training text vector, the training voice data and the training style vector so as to obtain a voice synthesis model; the training of the model of the preset neural network model according to the training text vector, the training voice data and the training style vector to obtain a voice synthesis model comprises the following steps: performing splicing processing on the training text vector and the training style vector to obtain a training spliced vector; and performing model training on the neural network model according to the training voice data and the training splicing vector so as to obtain the voice synthesis model.
2. The method of claim 1, wherein generating training text vectors from the training text data comprises:
performing pinyin conversion on the training text data to obtain corresponding pinyin strings;
based on the corresponding relation of the character numbers, the pinyin strings are converted into digital sequences, and the digital sequences are stored as training text vectors.
3. A method of speech synthesis, comprising:
obtaining a target text vector and a target voice style vector;
performing splicing processing on the target text vector and the target voice style vector to obtain a target spliced vector;
inputting the target splicing vector into a voice synthesis model to output target synthesized voice data; the training method of the voice synthesis model comprises the following steps: acquiring a data set, wherein the data set comprises training text data and training voice data corresponding to the training text data; generating a training text vector according to the training text data; encoding the training speech data based on a first encoder to obtain a training embedded vector; marking the training embedded vector based on an attention mechanism to obtain a training style vector; the marking the training embedded vector based on the attention mechanism to obtain a training style vector includes: acquiring a plurality of initial speech style vectors; calculating the similarity between the training embedded vector and each initial speech style vector according to an attention mechanism; constructing the training style vector according to the similarity of each initial voice style vector; the constructing the training style vector according to the similarity of the initial speech style vectors comprises the following steps: taking the similarity corresponding to each initial voice style vector as the attention weight of the initial voice style vector, and carrying out weighted summation on each initial voice style vector to obtain the training style vector; model training is carried out on a preset neural network model according to the training text vector, the training voice data and the training style vector so as to obtain a voice synthesis model; the training of the model of the preset neural network model according to the training text vector, the training voice data and the training style vector to obtain a voice synthesis model comprises the following steps: performing splicing processing on the training text vector and the training style vector to obtain a training spliced vector; and performing model training on the neural network model according to the training voice data and the training splicing vector so as to obtain the voice synthesis model.
4. A training device for a speech synthesis model, comprising:
the data acquisition unit, configured to acquire a data set, wherein the data set comprises training text data and training speech data corresponding to the training text data;
the vector generation unit, configured to generate a training text vector according to the training text data;
the vector coding unit, configured to encode the training speech data based on a first encoder to obtain a training embedded vector;
the vector acquisition unit, configured to mark the training embedded vector based on an attention mechanism to obtain a training style vector; wherein the marking the training embedded vector based on the attention mechanism to obtain the training style vector comprises: acquiring a plurality of initial speech style vectors; calculating, according to the attention mechanism, the similarity between the training embedded vector and each initial speech style vector; and constructing the training style vector according to the similarities to the initial speech style vectors; wherein the constructing the training style vector according to the similarities to the initial speech style vectors comprises: taking the similarity corresponding to each initial speech style vector as the attention weight of that initial speech style vector, and performing a weighted summation of the initial speech style vectors to obtain the training style vector;
the model training unit, configured to perform model training on a preset neural network model according to the training text vector, the training speech data and the training style vector to obtain a speech synthesis model; wherein the performing model training on the preset neural network model according to the training text vector, the training speech data and the training style vector to obtain the speech synthesis model comprises: performing splicing processing on the training text vector and the training style vector to obtain a training spliced vector; and performing model training on the neural network model according to the training speech data and the training spliced vector to obtain the speech synthesis model.
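Read as software components, the units of this claim map naturally onto a small training pipeline. The class below is a hypothetical skeleton under that reading: the text vectorizer, first encoder, style-attention module and neural network model are injected stand-ins, and the fit call is an assumed interface rather than the patent's concrete training procedure.

```python
# Hypothetical skeleton of the training device in claim 4 (all interfaces are assumptions).
class SpeechSynthesisTrainer:
    def __init__(self, text_vectorizer, first_encoder, style_attention, neural_network_model):
        self.text_vectorizer = text_vectorizer   # e.g. the pinyin-based conversion of claim 2
        self.first_encoder = first_encoder       # produces the training embedded vector
        self.style_attention = style_attention   # weighted sum over initial speech style vectors
        self.model = neural_network_model        # preset neural network model

    def acquire_data(self, dataset):             # data acquisition unit
        return dataset["text"], dataset["speech"]

    def train(self, dataset):                    # generation, coding, acquisition and training units
        text_data, speech_data = self.acquire_data(dataset)
        text_vec = self.text_vectorizer(text_data)      # training text vector
        train_embed = self.first_encoder(speech_data)   # training embedded vector
        style_vec = self.style_attention(train_embed)   # training style vector
        spliced = splice(text_vec, style_vec)           # training spliced vector (splice as sketched above)
        self.model.fit(spliced, speech_data)            # assumed training interface
        return self.model
```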
5. A speech synthesis apparatus, comprising:
the vector acquisition unit, configured to acquire a target text vector and a target speech style vector;
the vector splicing unit, configured to perform splicing processing on the target text vector and the target speech style vector to obtain a target spliced vector;
the data output unit, configured to input the target spliced vector into a speech synthesis model to output target synthesized speech data; wherein the training method of the speech synthesis model comprises: acquiring a data set, wherein the data set comprises training text data and training speech data corresponding to the training text data; generating a training text vector according to the training text data; encoding the training speech data based on a first encoder to obtain a training embedded vector; marking the training embedded vector based on an attention mechanism to obtain a training style vector; wherein the marking the training embedded vector based on the attention mechanism to obtain the training style vector comprises: acquiring a plurality of initial speech style vectors; calculating, according to the attention mechanism, the similarity between the training embedded vector and each initial speech style vector; and constructing the training style vector according to the similarities to the initial speech style vectors; wherein the constructing the training style vector according to the similarities to the initial speech style vectors comprises: taking the similarity corresponding to each initial speech style vector as the attention weight of that initial speech style vector, and performing a weighted summation of the initial speech style vectors to obtain the training style vector; performing model training on a preset neural network model according to the training text vector, the training speech data and the training style vector to obtain the speech synthesis model; wherein the performing model training on the preset neural network model according to the training text vector, the training speech data and the training style vector to obtain the speech synthesis model comprises: performing splicing processing on the training text vector and the training style vector to obtain a training spliced vector; and performing model training on the neural network model according to the training speech data and the training spliced vector to obtain the speech synthesis model.
6. A computer device, the computer device comprising a memory and a processor;
the memory is used for storing a computer program;
the processor is configured to execute the computer program and, when executing the computer program, to implement the training method of the speech synthesis model according to any one of claims 1 to 2 or the speech synthesis method according to claim 3.
7. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the training method of a speech synthesis model according to any one of claims 1 to 2 or the speech synthesis method according to claim 3.
CN201910420168.0A 2019-05-20 2019-05-20 Training method of speech synthesis model, speech synthesis method, device, equipment and storage medium Active CN110264991B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910420168.0A CN110264991B (en) 2019-05-20 2019-05-20 Training method of speech synthesis model, speech synthesis method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910420168.0A CN110264991B (en) 2019-05-20 2019-05-20 Training method of speech synthesis model, speech synthesis method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110264991A CN110264991A (en) 2019-09-20
CN110264991B true CN110264991B (en) 2023-12-22

Family

ID=67914821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910420168.0A Active CN110264991B (en) 2019-05-20 2019-05-20 Training method of speech synthesis model, speech synthesis method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110264991B (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288973B (en) * 2019-05-20 2024-03-29 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and computer readable storage medium
CN110738026B (en) * 2019-10-23 2022-04-19 腾讯科技(深圳)有限公司 Method and device for generating description text
CN110767217B (en) * 2019-10-30 2022-04-12 爱驰汽车有限公司 Audio segmentation method, system, electronic device and storage medium
CN110808027B (en) * 2019-11-05 2020-12-08 腾讯科技(深圳)有限公司 Voice synthesis method and device and news broadcasting method and system
CN112786001B (en) * 2019-11-11 2024-04-09 北京地平线机器人技术研发有限公司 Speech synthesis model training method, speech synthesis method and device
CN112802443B (en) * 2019-11-14 2024-04-02 腾讯科技(深圳)有限公司 Speech synthesis method and device, electronic equipment and computer readable storage medium
CN112837674B (en) * 2019-11-22 2024-06-11 阿里巴巴集团控股有限公司 Voice recognition method, device, related system and equipment
CN112863476B (en) * 2019-11-27 2024-07-02 阿里巴巴集团控股有限公司 Personalized speech synthesis model construction, speech synthesis and test methods and devices
GB2590509B (en) * 2019-12-20 2022-06-15 Sonantic Ltd A text-to-speech synthesis method and system, and a method of training a text-to-speech synthesis system
CN111128137B (en) * 2019-12-30 2023-05-30 广州市百果园信息技术有限公司 Training method and device for acoustic model, computer equipment and storage medium
CN113192482B (en) * 2020-01-13 2023-03-21 北京地平线机器人技术研发有限公司 Speech synthesis method and training method, device and equipment of speech synthesis model
CN111276120B (en) * 2020-01-21 2022-08-19 华为技术有限公司 Speech synthesis method, apparatus and computer-readable storage medium
CN111326136B (en) * 2020-02-13 2022-10-14 腾讯科技(深圳)有限公司 Voice processing method and device, electronic equipment and storage medium
CN111312267B (en) * 2020-02-20 2023-08-11 广州市百果园信息技术有限公司 Voice style conversion method, device, equipment and storage medium
CN118116361A (en) * 2020-03-13 2024-05-31 微软技术许可有限责任公司 Cross-speaker style transfer speech synthesis
CN111489734B (en) * 2020-04-03 2023-08-22 支付宝(杭州)信息技术有限公司 Model training method and device based on multiple speakers
CN111627420B (en) * 2020-04-21 2023-12-08 升智信息科技(南京)有限公司 Method and device for synthesizing emotion voice of specific speaker under extremely low resource
CN111402857B (en) * 2020-05-09 2023-11-21 广州虎牙科技有限公司 Speech synthesis model training method and device, electronic equipment and storage medium
CN111739509B (en) * 2020-06-16 2022-03-22 掌阅科技股份有限公司 Electronic book audio generation method, electronic device and storage medium
CN112233646B (en) * 2020-10-20 2024-05-31 携程计算机技术(上海)有限公司 Voice cloning method, system, equipment and storage medium based on neural network
KR20230088434A (en) * 2020-10-21 2023-06-19 구글 엘엘씨 Improving cross-lingual speech synthesis using speech recognition
CN112509550A (en) * 2020-11-13 2021-03-16 中信银行股份有限公司 Speech synthesis model training method, speech synthesis device and electronic equipment
CN112349269A (en) * 2020-12-11 2021-02-09 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium
CN112837673B (en) * 2020-12-31 2024-05-10 平安科技(深圳)有限公司 Speech synthesis method, device, computer equipment and medium based on artificial intelligence
CN113345410B (en) * 2021-05-11 2024-05-31 科大讯飞股份有限公司 Training method of general speech and target speech synthesis model and related device
CN115662435B (en) * 2022-10-24 2023-04-28 福建网龙计算机网络信息技术有限公司 Virtual teacher simulation voice generation method and terminal
CN118278971B (en) * 2024-05-30 2024-09-06 深圳启程智远网络科技有限公司 Customer group screening system and method based on conversation big data analysis

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109036375A (en) * 2018-07-25 2018-12-18 腾讯科技(深圳)有限公司 Phoneme synthesizing method, model training method, device and computer equipment
CN109616093A (en) * 2018-12-05 2019-04-12 平安科技(深圳)有限公司 End-to-end phoneme synthesizing method, device, equipment and storage medium
CN109767752A (en) * 2019-02-27 2019-05-17 平安科技(深圳)有限公司 A kind of phoneme synthesizing method and device based on attention mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6523893B2 (en) * 2015-09-16 2019-06-05 株式会社東芝 Learning apparatus, speech synthesis apparatus, learning method, speech synthesis method, learning program and speech synthesis program

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109036375A (en) * 2018-07-25 2018-12-18 腾讯科技(深圳)有限公司 Phoneme synthesizing method, model training method, device and computer equipment
CN109616093A (en) * 2018-12-05 2019-04-12 平安科技(深圳)有限公司 End-to-end phoneme synthesizing method, device, equipment and storage medium
CN109767752A (en) * 2019-02-27 2019-05-17 平安科技(深圳)有限公司 A kind of phoneme synthesizing method and device based on attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Minimum generation error training method for speech synthesis models based on perceptually weighted line spectral pair distance; Lei Ming et al.; Pattern Recognition and Artificial Intelligence; Vol. 23, No. 4; 572-579 *

Also Published As

Publication number Publication date
CN110264991A (en) 2019-09-20

Similar Documents

Publication Publication Date Title
CN110264991B (en) Training method of speech synthesis model, speech synthesis method, device, equipment and storage medium
US11205444B2 (en) Utilizing bi-directional recurrent encoders with multi-hop attention for speech emotion recognition
US10388284B2 (en) Speech recognition apparatus and method
JP6923332B2 (en) Automatic interpretation method and equipment
CN110288980A (en) Audio recognition method, the training method of model, device, equipment and storage medium
CN111833845B (en) Multilingual speech recognition model training method, device, equipment and storage medium
US8527276B1 (en) Speech synthesis using deep neural networks
CN112289299B (en) Training method and device of speech synthesis model, storage medium and electronic equipment
BR112019004524B1 NEURAL NETWORK SYSTEM, ONE OR MORE NON-TRANSITORY COMPUTER READABLE STORAGE MEDIA AND METHOD FOR AUTOREGRESSIVELY GENERATING AN AUDIO DATA OUTPUT SEQUENCE
WO2020015153A1 (en) Method and device for generating music for lyrics text, and computer-readable storage medium
CN116072098B (en) Audio signal generation method, model training method, device, equipment and medium
CN105989067B (en) Method, user equipment and the training server of text snippet are generated from picture
CN110335608B (en) Voiceprint verification method, voiceprint verification device, voiceprint verification equipment and storage medium
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN112837669B (en) Speech synthesis method, device and server
CN112634865B (en) Speech synthesis method, apparatus, computer device and storage medium
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN113327578B (en) Acoustic model training method and device, terminal equipment and storage medium
KR102639322B1 (en) Voice synthesis system and method capable of duplicating tone and prosody styles in real time
CN112735371A (en) Method and device for generating speaker video based on text information
CN114882862A (en) Voice processing method and related equipment
CN114360492B (en) Audio synthesis method, device, computer equipment and storage medium
JP2023169230A (en) Computer program, server device, terminal device, learned model, program generation method, and method
CN116469374A (en) Speech synthesis method, device, equipment and storage medium based on emotion space
CN113421554B (en) Voice keyword detection model processing method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant