CN110264991A - Training method, speech synthesis method, device, equipment and storage medium for a speech synthesis model - Google Patents
Training method, speech synthesis method, device, equipment and storage medium for a speech synthesis model
- Publication number
- CN110264991A (application number CN201910420168.0A)
- Authority
- CN
- China
- Prior art keywords
- training
- vector
- style
- model
- speech synthesis
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
This application relates to the field of speech semantics, and specifically applies an attention mechanism and neural networks to speech synthesis. Disclosed are a training method for a speech synthesis model, a speech synthesis method, a device, equipment, and a storage medium. The training method includes: obtaining a data set, the data set including training text data and training voice data corresponding to the training text data; generating a training text vector according to the training text data; encoding the training voice data based on a first encoder to obtain a training embedding vector; labelling the training embedding vector based on an attention mechanism to obtain a training style vector; and performing model training on a preset neural network model according to the training text vector, the training voice data, and the training style vector, to obtain a speech synthesis model.
Description
Technical field
This application relates to the field of speech technology, and in particular to a training method for a speech synthesis model, a speech synthesis method, a device, equipment, and a storage medium.
Background
Speech synthesis, i.e. text-to-speech (TTS) technology, converts text information into speech. With the continuous development of speech synthesis technology, people increasingly demand diversity in synthesized speech. Synthesized speech should at best embody strong prosody and carry a specific style, such as an emotion-heavy storytelling style, a casual reading style, or expressive styles such as a horror style or a crosstalk style, so as to increase the diversity of synthesized speech and meet people's different needs.
However, current TTS models cannot precisely define style and have difficulty capturing the details of each style of speech. As a result, the intended style is not well embodied in the synthesized speech, which degrades the user experience.
Summary of the invention
This application provides a training method for a speech synthesis model, a speech synthesis method, a device, equipment, and a storage medium. The speech synthesis model obtained by the training method can synthesize speech with a specific style and rich emotional expressiveness, thereby improving the user experience.
In a first aspect, this application provides a training method for a speech synthesis model, the method comprising:
obtaining a data set, the data set including training text data and training voice data corresponding to the training text data;
generating a training text vector according to the training text data;
encoding the training voice data based on a first encoder to obtain a training embedding vector;
labelling the training embedding vector based on an attention mechanism to obtain a training style vector;
performing model training on a preset neural network model according to the training text vector, the training voice data, and the training style vector, to obtain a speech synthesis model.
In a second aspect, this application further provides a speech synthesis method, comprising:
obtaining a target text vector and a target voice style vector;
splicing the target text vector and the target voice style vector to obtain a target spliced vector;
inputting the target spliced vector into a speech synthesis model to output target synthesized voice data, the speech synthesis model being a model obtained by the training method of the speech synthesis model described above.
In a third aspect, this application further provides a training device for a speech synthesis model, the device comprising:
a data acquisition unit for obtaining a data set, the data set including training text data and training voice data corresponding to the training text data;
a vector generation unit for generating a training text vector according to the training text data;
a vector encoding unit for encoding the training voice data based on a first encoder to obtain a training embedding vector;
a vector acquisition unit for labelling the training embedding vector based on an attention mechanism to obtain a training style vector;
a model training unit for performing model training on a preset neural network model according to the training text vector, the training voice data, and the training style vector, to obtain a speech synthesis model.
In a fourth aspect, this application further provides a speech synthesis device, comprising:
a vector acquisition unit for obtaining a target text vector and a target voice style vector;
a vector splicing unit for splicing the target text vector and the target voice style vector to obtain a target spliced vector;
a data output unit for inputting the target spliced vector into a speech synthesis model to output target synthesized voice data, the speech synthesis model being a model obtained by the training method of the speech synthesis model described above.
In a fifth aspect, this application further provides a computer device, the computer device including a memory and a processor; the memory stores a computer program, and the processor, when executing the computer program, implements the training method of the speech synthesis model or the speech synthesis method described above.
In a sixth aspect, this application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the training method of the speech synthesis model or the speech synthesis method described above.
This application discloses a training method for a speech synthesis model, a speech synthesis method, a device, equipment, and a storage medium. The training voice data is encoded based on a first encoder to obtain a training embedding vector; the training embedding vector is labelled based on an attention mechanism to obtain a training style vector; and model training is performed on a preset neural network model according to the training text vector, the training voice data, and the training style vector, to obtain a speech synthesis model. The speech synthesis model obtained by this training method can synthesize natural target voice data with a specific speaking style: no longer mechanized speech, but speech with rich emotional expressiveness, thereby improving the user experience.
Brief description of the drawings
To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic flow diagram of a training method for a speech synthesis model provided by an embodiment of this application;
Fig. 2 is a schematic flow diagram of sub-steps of the training method in Fig. 1;
Fig. 3 is a schematic flow diagram of the pinyin conversion of the training text data in Fig. 1;
Fig. 4 is a schematic flow diagram of sub-steps of the training method in Fig. 1;
Fig. 5 is a schematic flow diagram of the construction step of the training style vector provided by an embodiment of this application;
Fig. 6 is a schematic flow diagram of sub-steps of the training method in Fig. 1;
Fig. 7 is a schematic flow diagram of training the model according to the training voice data and the training spliced vector, provided by an embodiment of this application;
Fig. 8 is a schematic flow diagram of the steps of a speech synthesis method provided by an embodiment of this application;
Fig. 9 is a schematic block diagram of a training device for a speech synthesis model provided by an embodiment of this application;
Fig. 10 is a schematic block diagram of subunits of the training device in Fig. 9;
Fig. 11 is a schematic block diagram of subunits of the training device in Fig. 9;
Fig. 12 is a schematic block diagram of a speech synthesis device provided by an embodiment of this application;
Fig. 13 is a schematic structural block diagram of a computer device provided by an embodiment of this application.
Detailed description of embodiments
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this application.
The flow charts shown in the drawings are only illustrative; they need not include all content and operations/steps, nor must they be executed in the described order. For example, some operations/steps may be decomposed, combined, or partially merged, so the actual order of execution may change according to the situation.
Embodiments of this application provide a training method, device, computer equipment, and storage medium for a speech synthesis model. The training method of the speech synthesis model can be used to synthesize voice data with a specific style.
Some embodiments of this application are described in detail below with reference to the accompanying drawings. In the absence of conflict, the following embodiments and the features in the embodiments can be combined with each other.
Referring to Fig. 1, Fig. 1 is a schematic flow diagram of the steps of a training method for a speech synthesis model provided by an embodiment of this application.
As shown in Fig. 1, the training method of the speech synthesis model specifically includes steps S110 to S150.
S110: Obtain a data set, the data set including training text data and training voice data corresponding to the training text data.
Specifically, the training text data is the text data used in the training stage to train the speech synthesis model. The training voice data is voice data corresponding to the training text data, labelled by developers.
S120: Generate a training text vector according to the training text data.
Specifically, after the training text data is obtained, vector conversion can be performed on the training text data to generate the training text vector.
As shown in Fig. 2, in one embodiment, the specific process of generating a training text vector according to the training text data, i.e. step S120, may include sub-steps S121 and S122.
S121: Perform pinyin conversion on the training text data to obtain a corresponding pinyin string.
In one embodiment, the specific process of performing pinyin conversion on the training text data is shown in Fig. 3: step S121 may include sub-steps S1211, S1212, and S1213.
S1211: Perform word segmentation on the training text data to obtain multiple word strings.
Performing word segmentation on the training text data to obtain multiple word strings may specifically include: performing sentence segmentation on the training text data to obtain several corresponding sentences, and then performing word segmentation on the several sentences to obtain multiple word strings.
Specifically, after the training text data is obtained, sentence segmentation can be performed on it, for example by cutting each piece of training text data into complete sentences according to punctuation marks. Word segmentation is then performed on each sentence to obtain multiple word strings. In one embodiment, each cut sentence can be segmented by a string-matching segmentation method.
For example, the string-matching segmentation method can be forward maximum matching, backward maximum matching, shortest-path segmentation, bidirectional maximum matching, and the like. Forward maximum matching segments the character string of a cut sentence from left to right; backward maximum matching segments it from right to left; bidirectional maximum matching performs matching in both directions (left to right and right to left) simultaneously; shortest-path segmentation requires that the number of words cut out of the character string of a sentence be minimal.
In other embodiments, word segmentation can also be performed on each cut sentence by a word-sense segmentation method, a machine-judged segmentation method that uses syntactic and semantic information to resolve ambiguity.
Illustratively, taking bidirectional maximum matching as an example, a Chinese dictionary containing a word collection is obtained. Suppose the length of the longest phrase in the dictionary is m. Substrings of length m in the cut sentence are matched against the words in the dictionary in both the forward and backward directions. If a substring matches no word in the dictionary, the substring length is gradually reduced and matching is rescanned, until the substring matches some word in the dictionary, finally obtaining multiple word strings.
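As a minimal sketch of the bidirectional maximum matching described above (with a toy lexicon standing in for the Chinese dictionary the passage assumes, and word count as the tie-breaking criterion, which the text does not specify):

```python
# Sketch of bidirectional maximum matching segmentation. The lexicon and
# the fewer-words tie-break rule are illustrative assumptions.
def forward_max_match(sentence, lexicon, max_len):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(sentence):
        for j in range(min(max_len, len(sentence) - i), 0, -1):
            if j == 1 or sentence[i:i + j] in lexicon:
                words.append(sentence[i:i + j])
                i += j
                break
    return words

def backward_max_match(sentence, lexicon, max_len):
    """Scan right to left, always taking the longest dictionary word."""
    words, i = [], len(sentence)
    while i > 0:
        for j in range(min(max_len, i), 0, -1):
            if j == 1 or sentence[i - j:i] in lexicon:
                words.insert(0, sentence[i - j:i])
                i -= j
                break
    return words

def bidirectional_max_match(sentence, lexicon):
    """Run both directions and prefer the segmentation with fewer words."""
    max_len = max((len(w) for w in lexicon), default=1)
    fwd = forward_max_match(sentence, lexicon, max_len)
    bwd = backward_max_match(sentence, lexicon, max_len)
    return fwd if len(fwd) <= len(bwd) else bwd
```

On the string "abcd" with the toy lexicon {"ab", "abc", "cd"}, the forward pass yields ["abc", "d"] and the backward pass ["ab", "cd"], illustrating why the two directions must be reconciled.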
S1212: Perform pinyin conversion on each word string to obtain the sub-pinyin string corresponding to each word string.
Illustratively, after word segmentation, training text information S yields N word strings FS1, FS2, ..., FSN. After pinyin conversion, the N word strings yield the corresponding sub-pinyin strings PS1, PS2, ..., PSN. For example, the word string "张三" (Zhangsan) is converted to the sub-pinyin string "zhang1san1", where the digit 1 indicates the first (high level) tone.
S1213: Splice the sub-pinyin strings to obtain the pinyin string.
Illustratively, the training text data is "你好吗" ("how are you"), which yields two word strings, "你好" and "吗", after segmentation. Pinyin conversion of the two word strings gives the two sub-pinyin strings "ni1hao3" and "ma0", where the digit 3 indicates the third (falling-rising) tone and the digit 0 indicates the neutral tone. Splicing the two sub-pinyin strings "ni1hao3" and "ma0" gives the pinyin string "ni1hao3ma0".
S122: Based on a character-number correspondence, convert the pinyin string into a number sequence, and store the number sequence as the training text vector.
In one embodiment, before the pinyin string is converted into a number sequence based on the character-number correspondence and the number sequence is stored as the training text vector, the method further includes: establishing the character-number correspondence according to a preset character sequence and a preset quantity of numbers.
Specifically, the character-number correspondence contains a character sequence and a number corresponding to each character in the sequence, one number per character. The characters can be letters, digits, a space, and so on.
In one embodiment, establishing the character-number correspondence according to the preset character sequence and the preset quantity of numbers specifically includes: obtaining the preset character sequence and the preset quantity of numbers; and labelling the characters in the character sequence with the numbers to obtain the character-number correspondence.
The preset quantity can be greater than or equal to the length of the character sequence.
Illustratively, the character sequence has 32 characters, each with a corresponding number. The 32 characters may include 26 English letters, the 5 digits 0, 1, 2, 3, 4, and a space, arranged in sequence. The 32 characters are labelled with the numbers 0-31, so that each character corresponds to one numeric label. Specifically, the character-number correspondence is shown in Table 1.
Table 1 is a schematic table of the character-number correspondence
For example, the training text data "你好吗" corresponds to the pinyin string "ni1hao3ma0". Based on the character-number correspondence in Table 1, the pinyin string "ni1hao3ma0" can be converted into the number sequence 13/8/27/7/0/14/29/12/0/26. This number sequence is stored as the training text vector (13, 8, 27, 7, 0, 14, 29, 12, 0, 26).
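The conversion above can be sketched directly. Since Table 1 itself is not reproduced in the text, the mapping below (letters a-z labelled 0-25, tone digits 0-4 labelled 26-30, space labelled 31) is reconstructed from the worked example:

```python
import string

# Character-number correspondence reconstructed from the "ni1hao3ma0"
# example: 26 letters, the tone digits 0-4, and a space, labelled 0-31.
CHARS = string.ascii_lowercase + "01234" + " "
CHAR_TO_ID = {c: i for i, c in enumerate(CHARS)}

def pinyin_to_sequence(pinyin):
    """Convert a pinyin string into its number sequence (the text vector)."""
    return [CHAR_TO_ID[c] for c in pinyin]
```

With this table, `pinyin_to_sequence("ni1hao3ma0")` reproduces the sequence (13, 8, 27, 7, 0, 14, 29, 12, 0, 26) from the example.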
S130: Encode the training voice data based on a first encoder to obtain a training embedding vector.
The first encoder consists of a convolutional neural network (CNN) followed by a recurrent neural network (RNN). The convolutional neural network may include multiple convolutional layers. For example, it may include 6 convolutional layers, each extracting convolutional features with kernels of the same size. Smaller kernels are generally used; for example, 3 × 3 kernels with a stride of 2 are enough to capture the phonetic features of the training voice data. The 6 convolutional layers use 32, 32, 64, 64, 128, and 128 output channels respectively, so the speech feature vector output by the last convolutional layer of the convolutional neural network is three-dimensional.
The recurrent neural network can be a unidirectional GRU with 128 hidden units. In one embodiment, a reshaping layer can be placed between the convolutional neural network and the recurrent neural network, combined with batch normalization (BN) and the ReLU (Rectified Linear Units) activation function, to adjust the output of the convolutional neural network to fit the input of the recurrent neural network.
Specifically, the log-Mel spectrum of the training voice data is input to the convolutional neural network, the output of the convolutional neural network is input to the recurrent neural network, and the output of the recurrent neural network is taken as the training embedding vector. In this way, the prosody of training voice signals of different lengths is converted into training embedding vectors of fixed length.
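The layer arithmetic of the six stride-2 convolutions can be sketched as follows. The channel progression comes from the text; the "same" padding assumption (so each stride-2 layer roughly halves both spatial dimensions) is ours, since the padding scheme is not stated:

```python
# Sketch of how six stride-2 conv layers shrink a (time, mels) spectrogram
# while channels grow to 32, 32, 64, 64, 128, 128. "Same" padding assumed.
def conv_output_shapes(time, mels, channels=(32, 32, 64, 64, 128, 128), stride=2):
    shapes = []
    t, f = time, mels
    for c in channels:
        # with "same" padding, stride 2 halves each dimension (rounding up)
        t = (t + stride - 1) // stride
        f = (f + stride - 1) // stride
        shapes.append((t, f, c))
    return shapes
```

For an illustrative 128-frame, 80-mel input, the last layer emits a (2, 2, 128) feature map, which is the three-dimensional (time × frequency × channel) output the text mentions before it is reshaped for the GRU.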
In one embodiment, before the training voice data is encoded based on the first encoder to obtain the training embedding vector, the method further includes: preprocessing the training voice data to obtain the corresponding Mel spectrum, so as to extract the phonetic features to which the human ear is sensitive.
Encoding the training voice data based on the first encoder to obtain the training embedding vector then includes: inputting the Mel spectrum into the first encoder, so that the first encoder encodes the Mel spectrum to obtain the training embedding vector.
Preprocessing the training voice data specifically includes: performing framing and windowing on the training voice data to obtain processed training voice data; and performing a frequency-domain transform on the processed training voice data to obtain the corresponding amplitude spectrum, which serves as the Mel spectrum.
Specifically, framing and windowing divide the training voice data according to a set frame length, e.g. 60 ms, and then apply a Hamming window to the segmented training voice data. Applying a Hamming window means multiplying the segmented voice signal by a window function, in order to facilitate Fourier expansion.
The frequency-domain transform applies a fast Fourier transform (FFT) to the framed and windowed training voice data to obtain the corresponding parameters. In this embodiment, the amplitude after the fast Fourier transform is taken as the amplitude spectrum. Of course, other parameters after the FFT can also be used, for example the amplitude plus phase information.
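The framing, Hamming windowing, and FFT steps above can be sketched as follows. The 60 ms frame length comes from the text; the 16 kHz sample rate (so 960 samples per frame) and the 50% hop are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of the preprocessing: frame the signal, apply a Hamming
# window, and keep the FFT magnitude (the amplitude spectrum). A Mel
# filterbank applied to this output would give the Mel spectrum.
def magnitude_spectrogram(signal, frame_len=960, hop=480):
    """Split the signal into Hamming-windowed frames; return per-frame FFT magnitudes."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))
```

For a real-valued frame of 960 samples, `rfft` yields 481 frequency bins per frame, so a 1920-sample signal produces a (3, 481) amplitude spectrum.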
S140: Label the training embedding vector based on an attention mechanism to obtain a training style vector.
Specifically, style labelling can be performed on the training embedding vector according to the attention mechanism to obtain the training style vector.
As shown in Fig. 4, in one embodiment, labelling the training embedding vector based on the attention mechanism to obtain the training style vector specifically includes sub-steps S141 to S143.
S141: Obtain multiple initial voice style vectors.
Illustratively, there are four style types: reading style, crosstalk style, horror style, and storytelling style. The initial voice style vector of the reading style is (1, 0, 0, 0), that of the crosstalk style is (0, 1, 0, 0), that of the horror style is (0, 0, 1, 0), and that of the storytelling style is (0, 0, 0, 1).
It should be noted that in other embodiments the number of style types may also be one, two, three, five, or more.
S142: Calculate the similarity between the training embedding vector and each initial voice style vector according to the attention mechanism.
In one embodiment, calculating the similarity between the training embedding vector and each initial voice style vector according to the attention mechanism specifically includes: inputting the training embedding vector and the initial voice style vectors into an attention model to output the similarity between the training embedding vector and each initial voice style vector.
The attention model uses a multi-head attention mechanism and a softmax activation function to imitate how humans express linguistic information in natural speech. That is, when humans express linguistic information in speech with style, they allocate different amounts of attention to each audio fragment in the voice data because of the style: people more easily notice the one or more style types relevant to the voice data, and ignore the other, unrelated style types.
Specifically, the multi-head attention mechanism includes multiple dot-product attention mechanisms; for example, it may include 8 dot-product attention heads.
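One way to realize the softmax similarity of a single dot-product attention head is sketched below. The actual model uses 8 heads and learned projections; this unprojected, single-head version is an illustrative simplification:

```python
import math

# Sketch of one dot-product attention head: similarity of the embedding
# to each style token is a scaled dot product passed through softmax,
# so the resulting weights are positive and sum to one.
def attention_weights(embedding, style_tokens):
    """Return the softmax-normalized similarity to each style token."""
    d = len(embedding)
    scores = [sum(e * t for e, t in zip(embedding, tok)) / math.sqrt(d)
              for tok in style_tokens]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The softmax step is what lets these similarities serve directly as the attention weights of the next sub-step: they form a valid weighting over the style types.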
S143: Construct the training style vector according to the similarity of each initial voice style vector.
Specifically, the training style vector describes the attention allocation between each style type and the training voice data. Constructing the training style vector according to the similarity of each initial voice style vector, as shown in Fig. 5, specifically includes sub-step S1431: take the similarity corresponding to each initial voice style vector as the attention weight of that initial voice style vector, and compute the weighted sum of the initial voice style vectors to obtain the training style vector.
Illustratively, there are four style types: reading style, crosstalk style, horror style and storytelling style. The initial speech style vector A of the reading style is (1,0,0,0), the initial speech style vector B of the crosstalk style is (0,1,0,0), the initial speech style vector C of the horror style is (0,0,1,0), and the initial speech style vector D of the storytelling style is (0,0,0,1). For example, for training voice data with the crosstalk style, the similarity between the training embedding vector and the initial speech style vector of the reading style is 0.1, that of the crosstalk style is 0.8, that of the horror style is 0.0, and that of the storytelling style is 0.1.
Taking the similarity corresponding to each initial speech style vector as that vector's attention weight: the initial speech style vector of the reading style contributes to the training voice data with attention weight 0.1, that of the crosstalk style with weight 0.8, that of the horror style with weight 0.0, and that of the storytelling style with weight 0.1. Computing the weighted sum of the initial speech style vectors under these attention weights gives the training style vector corresponding to the crosstalk-style training voice data: 0.1 × A + 0.8 × B + 0.0 × C + 0.1 × D.
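The weighted sum above can be written in a few lines. This sketch uses the one-hot style vectors and the crosstalk-style similarities from the worked example; the variable names are illustrative only.

```python
import numpy as np

# One-hot initial speech style vectors for the four style types,
# rows in order: A (reading), B (crosstalk), C (horror), D (storytelling).
style_vectors = np.eye(4)

# Similarities output by the attention model for a crosstalk-style utterance,
# used directly as attention weights.
weights = np.array([0.1, 0.8, 0.0, 0.1])

# Training style vector = 0.1*A + 0.8*B + 0.0*C + 0.1*D.
training_style_vector = weights @ style_vectors
```

Because the initial speech style vectors are one-hot here, the training style vector simply equals the weight vector; with learned (non-one-hot) style vectors the same weighted sum blends them.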
It should be noted that in other embodiments the number of style types may be one, two, three, five or more.
S150, perform model training on a preset neural network model according to the training text vector, the training voice data and the training style vector, to obtain a speech synthesis model.
In one embodiment, the preset neural network model may be a Tacotron model, with a maximum-likelihood function as the objective function. The Tacotron model comprises an encoder, an attention mechanism and a decoder. In other embodiments, the neural network model may also be another deep-learning model, such as a GoogLeNet model. The Tacotron model is used as an example below.
In one embodiment, the detailed process of performing model training on the preset neural network model according to the training text vector, the training voice data and the training style vector is shown in Fig. 6; that is, step S150 comprises sub-steps S151 and S152.
S151, concatenate the training text vector and the training style vector, to obtain a training spliced vector.
Illustratively, with training text vector A = (a1, a2, a3, a4) and training style vector B = (b1, b2, b3, b4), concatenating A and B gives the training spliced vector C = (a1, a2, a3, a4, b1, b2, b3, b4).
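The concatenation step is straightforward; the numeric values below are placeholders standing in for (a1..a4) and (b1..b4).

```python
import numpy as np

text_vector = np.array([1, 2, 3, 4])   # training text vector A = (a1, a2, a3, a4)
style_vector = np.array([5, 6, 7, 8])  # training style vector B = (b1, b2, b3, b4)

# The training spliced vector simply places B after A.
spliced = np.concatenate([text_vector, style_vector])
```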
S152, perform model training on the neural network model according to the training voice data and the training spliced vector, to obtain the speech synthesis model.
Specifically, the training spliced vector is input into the preset neural network model, which outputs training synthesized voice data. The training synthesized voice data is compared with the training voice data, and the parameters of the neural network model are adjusted according to a preset loss function, to obtain the speech synthesis model.
In one embodiment, the detailed process of performing model training on the neural network model according to the training voice data and the training spliced vector is shown in Fig. 7; that is, step S152 comprises sub-steps S1521, S1522 and S1523.
S1521, input the training spliced vector into the neural network model, to output training synthesized voice data.
Specifically, after concatenation yields the training spliced vector, the training spliced vector is input into the above neural network model, which outputs the training synthesized voice data.
S1522, calculate the voice similarity between the training synthesized voice data and the training voice data.
Specifically, the training synthesized voice data and the training voice data are input into a pre-trained similarity model, which outputs the voice similarity of the two. The similarity model may be, for example, a convolutional neural network model.
S1523, calculate a loss value according to the voice similarity and a preset loss function, and adjust the parameters of the neural network model according to the loss value, to obtain the speech synthesis model.
Specifically, the loss function is used to estimate the degree of inconsistency between the training synthesized voice data and the training voice data; the closer the training synthesized voice data is to the training voice data, the smaller the loss value.
Illustratively, the preset loss function is a cross-entropy loss function. After the loss value is calculated from the voice similarity and the preset loss function, the parameters of the neural network model can be adjusted by backpropagation with stochastic gradient descent, to obtain the speech synthesis model. Backpropagation is the process of continually updating the weights and biases of the neural network model; when the loss value reaches 0 after a certain amount of training, the training synthesized voice data matches the training voice data, and the weights and biases are no longer updated.
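The loss-driven update in S1523 can be sketched with a toy differentiable model. This is purely illustrative: the Tacotron parameters, similarity model and cross-entropy loss are replaced by a small linear model with a squared-error loss, so that the gradient-descent weight and bias updates can be shown in a few lines.

```python
import numpy as np

rng = np.random.default_rng(42)
W = rng.normal(size=(8, 8))   # stand-in for the model's weights
b = np.zeros(8)               # stand-in for the model's biases
x = rng.normal(size=8)        # stand-in for the training spliced vector
target = rng.normal(size=8)   # stand-in for the training voice data
lr = 0.02                     # stochastic-gradient-descent step size

for step in range(500):
    synth = W @ x + b                  # "training synthesized voice data"
    err = synth - target
    loss = 0.5 * (err ** 2).sum()      # stand-in loss value
    # Backpropagation: gradients of the loss w.r.t. the weights and biases.
    W -= lr * np.outer(err, x)
    b -= lr * err
```

As the loss value shrinks toward 0, the synthesized output approaches the target and the parameter updates vanish, mirroring the convergence behaviour described above.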
In the above training method of the speech synthesis model, a data set is obtained; a training text vector is generated according to the training text data; the training voice data is encoded based on the first encoder to obtain a training embedding vector; the training embedding vector is labeled based on the attention mechanism to obtain a training style vector; and a preset neural network model is trained according to the training text vector, the training voice data and the training style vector, to obtain a speech synthesis model. The speech synthesis model obtained by this training method can synthesize natural target voice data; the synthesized target voice data has a specific speaking style and is no longer mechanized speech, and has rich emotional expressiveness, thereby improving the user experience.
Referring to Fig. 8, Fig. 8 is a schematic flow diagram of the steps of a speech synthesis method provided by an embodiment of the present application. As shown in Fig. 8, the speech synthesis method specifically comprises steps S210 to S230.
S210, obtain a target text vector and a target voice style vector.
In one embodiment, before the target text vector is obtained, the method further comprises: obtaining target text data. Specifically, the target text data may be news text, novel text, blog text, and the like.
Wherein, obtaining the target text vector specifically comprises: obtaining the target text vector according to the target text data.
In one embodiment, obtaining the target text vector according to the target text data specifically comprises: performing word segmentation on the target text data to obtain multiple target word strings; performing pinyin conversion on each target word string to obtain the target sub-pinyin string corresponding to each target word string; concatenating the target sub-pinyin strings to obtain the target pinyin string; and, based on a preset character-to-number correspondence, converting the target pinyin string into a target number sequence and storing the target number sequence as the target text vector.
Specifically, after the target text data is obtained, it may be segmented into sentences; for example, the target text data may be cut into complete sentences according to punctuation marks. Then, word segmentation is performed on each sentence to obtain multiple target word strings. In one embodiment, each segmented sentence may be processed by a string-matching word segmentation method.
For example, after word segmentation the target text data "good morning" yields two target word strings, "morning" and "good". Performing pinyin conversion on the two target word strings "morning" and "good" gives their corresponding target sub-pinyin strings "zao3shang4" and "hao3", where the digits denote tones. Concatenating the two target sub-pinyin strings gives the target pinyin string "zao3shang4hao3".
Illustratively, the character-to-number correspondence may be as shown in Table 1. For example, for the target text data "good morning", the corresponding target pinyin string is "zao3shang4hao3". Based on the character-to-number correspondence in Table 1, the target pinyin string "zao3shang4hao3" is converted into the target number sequence 25/0/14/29/18/7/0/13/6/30/7/0/14/29, which is stored as the target text vector (25, 0, 14, 29, 18, 7, 0, 13, 6, 30, 7, 0, 14, 29).
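Table 1 itself is not reproduced in this excerpt, but the worked example implies a simple mapping: letters "a" through "z" map to 0–25 and a tone digit d maps to 26 + d. The sketch below assumes that mapping and reproduces the sequence above.

```python
def pinyin_to_sequence(pinyin: str) -> list:
    """Convert a pinyin string into a number sequence using the mapping implied
    by the worked example: letters a-z -> 0..25, tone digit d -> 26 + d."""
    seq = []
    for ch in pinyin:
        if ch.isdigit():
            seq.append(26 + int(ch))  # tone digit
        else:
            seq.append(ord(ch) - ord("a"))  # lowercase pinyin letter
    return seq

target_text_vector = pinyin_to_sequence("zao3shang4hao3")
# Reproduces the sequence (25, 0, 14, 29, 18, 7, 0, 13, 6, 30, 7, 0, 14, 29).
```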
Wherein, obtaining the target voice style vector comprises: obtaining multiple initial speech style vectors; and computing a weighted sum of the initial speech style vectors according to the attention weight corresponding to each style type, to obtain the target voice style vector.
Illustratively, there are four style types: reading style, crosstalk style, horror style and storytelling style. The initial speech style vector A of the reading style is (1,0,0,0), the initial speech style vector B of the crosstalk style is (0,1,0,0), the initial speech style vector C of the horror style is (0,0,1,0), and the initial speech style vector D of the storytelling style is (0,0,0,1). Assume that for speech with the crosstalk style the attention weight corresponding to the reading style is 0.1, that of the crosstalk style is 0.8, that of the horror style is 0.0 and that of the storytelling style is 0.1; then the target voice style vector = 0.1 × A + 0.8 × B + 0.0 × C + 0.1 × D.
Wherein, the attention weights corresponding to the style types in voice data of a given style may be set manually in advance or obtained by prior training. For example, if the user wants to synthesize the target text data into speech with the crosstalk style, the attention weight corresponding to the reading style is set to 0.1, that of the crosstalk style to 0.8, that of the horror style to 0.0 and that of the storytelling style to 0.1.
As another example, to synthesize the target text data into speech with the storytelling style, the attention weight corresponding to the reading style is set to 0.02, that of the crosstalk style to 0.05, that of the horror style to 0.08 and that of the storytelling style to 0.85.
It should be noted that in other embodiments the number of style types may be one, two, three, five or more.
S220, concatenate the target text vector and the target voice style vector, to obtain a target spliced vector.
Illustratively, with target text vector X = (x1, x2, x3, x4) and target voice style vector Y = (y1, y2, y3, y4), concatenating X and Y gives the target spliced vector W = (x1, x2, x3, x4, y1, y2, y3, y4).
S230, input the target spliced vector into the speech synthesis model, to output target synthesized voice data.
Wherein, the speech synthesis model is a model obtained by the training method of the speech synthesis model described above. Specifically, the target spliced vector is input into the speech synthesis model, which outputs target synthesized voice data with a specific style, for example target synthesized voice data with the reading, crosstalk, horror or storytelling style.
It should be understood that the target spliced vector may be a sequence of segments in time, so the speech synthesis model can also synthesize target synthesized voice data with specific styles segment by segment. For example, the sequential target spliced vector may be split into two segments, synthesized with the reading style and the storytelling style respectively, thereby exhibiting a change of style over the course of synthesizing the target text data into speech.
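The segment-by-segment idea can be sketched as follows. Here `model` is a hypothetical stand-in for the trained speech synthesis model, the style vectors are the one-hot tokens from the example above, and the weight sets are the illustrative reading-style and storytelling-style weights; all names and values are assumptions for illustration.

```python
import numpy as np

def synthesize_with_styles(text_segments, style_weight_sets, style_vectors, model):
    """Build a target spliced vector per segment (text vector + weighted sum of
    the initial speech style vectors) and synthesize each segment in turn."""
    outputs = []
    for text_vec, weights in zip(text_segments, style_weight_sets):
        style_vec = np.asarray(weights) @ style_vectors  # target voice style vector
        spliced = np.concatenate([text_vec, style_vec])  # target spliced vector
        outputs.append(model(spliced))
    return outputs

# Hypothetical stand-in for the trained speech synthesis model.
model = lambda spliced: spliced.sum()

style_vectors = np.eye(4)  # reading, crosstalk, horror, storytelling
segments = [np.ones(4), np.ones(4)]
# First segment read aloud, second segment in the storytelling style.
weight_sets = [(1.0, 0.0, 0.0, 0.0), (0.02, 0.05, 0.08, 0.85)]
audio = synthesize_with_styles(segments, weight_sets, style_vectors, model)
```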
The above speech synthesis method can synthesize natural target voice data; the synthesized target voice data has a specific speaking style and is no longer mechanized speech, and has rich emotional expressiveness, thereby improving the user experience.
Referring to Fig. 9, Fig. 9 is a schematic block diagram of a training device for a speech synthesis model also provided by an embodiment of the present application; the training device of the speech synthesis model is used to execute any of the foregoing training methods of the speech synthesis model. The training device of the speech synthesis model may be configured in a server or a terminal.
Wherein, the server may be an independent server or a server cluster. The terminal may be an electronic device such as a mobile phone, a tablet computer, a laptop, a desktop computer, a personal digital assistant or a wearable device.
As shown in Fig. 9, the training device 300 of the speech synthesis model comprises: a data acquisition unit 310, a vector generation unit 320, a vector encoding unit 330, a vector acquisition unit 340 and a model training unit 350.
The data acquisition unit 310 is configured to obtain a data set, the data set comprising training text data and training voice data corresponding to the training text data.
The vector generation unit 320 is configured to generate a training text vector according to the training text data.
The vector encoding unit 330 is configured to encode the training voice data based on a first encoder, to obtain a training embedding vector.
The vector acquisition unit 340 is configured to label the training embedding vector based on an attention mechanism, to obtain a training style vector.
The model training unit 350 is configured to perform model training on a preset neural network model according to the training text vector, the training voice data and the training style vector, to obtain a speech synthesis model.
As shown in Fig. 9, in one embodiment the vector generation unit 320 comprises a pinyin conversion subunit 321 and a vector storage subunit 322.
The pinyin conversion subunit 321 is configured to perform pinyin conversion on the training text data, to obtain a corresponding pinyin string.
The vector storage subunit 322 is configured to convert the pinyin string into a number sequence based on the character-to-number correspondence, and store the number sequence as the training text vector.
As shown in Fig. 10, in one embodiment the vector acquisition unit 340 comprises a style acquisition subunit 341, a similarity calculation subunit 342 and a vector construction subunit 343.
The style acquisition subunit 341 is configured to obtain multiple initial speech style vectors.
The similarity calculation subunit 342 is configured to calculate, according to the attention mechanism, the similarity between the training embedding vector and each initial speech style vector.
The vector construction subunit 343 is configured to construct the training style vector according to the similarity of each initial speech style vector.
In one embodiment, the vector construction subunit 343 is specifically configured to take the similarity corresponding to each initial speech style vector as that vector's attention weight and compute a weighted sum of the initial speech style vectors, to obtain the training style vector.
As shown in Fig. 11, in one embodiment the model training unit 350 comprises a concatenation subunit 351 and a model training subunit 352.
The concatenation subunit 351 is configured to concatenate the training text vector and the training style vector, to obtain a training spliced vector.
The model training subunit 352 is configured to perform model training on the neural network model according to the training voice data and the training spliced vector, to obtain the speech synthesis model.
Referring to Fig. 12, Fig. 12 is a schematic block diagram of a speech synthesis device provided by an embodiment of the present application; the speech synthesis device may be configured in a terminal or a server and is used to execute the foregoing speech synthesis method.
As shown in Fig. 12, the speech synthesis device 400 comprises: a vector acquisition unit 410, a vector concatenation unit 420 and a data output unit 430.
The vector acquisition unit 410 is configured to obtain a target text vector and a target voice style vector.
The vector concatenation unit 420 is configured to concatenate the target text vector and the target voice style vector, to obtain a target spliced vector.
The data output unit 430 is configured to input the target spliced vector into a speech synthesis model, to output target synthesized voice data; the speech synthesis model is a model obtained by the training method of the speech synthesis model described above.
It should be noted that, as will be apparent to those skilled in the art, for convenience and brevity of description, the specific working processes of the devices and units described above may refer to the corresponding processes in the foregoing embodiments of the training method of the speech synthesis model, and are not repeated here.
The above devices may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in Fig. 13.
Referring to Fig. 13, Fig. 13 is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device may be a server or a terminal.
Referring to Fig. 13, the computer device comprises a processor, a memory and a network interface connected through a system bus, wherein the memory may comprise a non-volatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program comprises program instructions which, when executed, cause the processor to execute a training method of a speech synthesis model.
The processor is used to provide computing and control capability, supporting the operation of the entire computer device.
The internal memory provides an environment for running the computer program stored in the non-volatile storage medium; when the computer program is executed by the processor, the processor is caused to execute a training method of a speech synthesis model.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will understand that the structure shown in Fig. 13 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may comprise more or fewer components than shown, combine certain components, or have a different component arrangement.
It should be understood that the processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. The general-purpose processor may be a microprocessor, or any conventional processor.
Wherein, the processor is configured to run a computer program stored in the memory, to implement the following steps:
obtaining a data set, the data set comprising training text data and training voice data corresponding to the training text data; generating a training text vector according to the training text data; encoding the training voice data based on a first encoder, to obtain a training embedding vector; labeling the training embedding vector based on an attention mechanism, to obtain a training style vector; and performing model training on a preset neural network model according to the training text vector, the training voice data and the training style vector, to obtain a speech synthesis model.
In one embodiment, when generating the training text vector according to the training text data, the processor is configured to implement:
performing pinyin conversion on the training text data, to obtain a corresponding pinyin string; and, based on the character-to-number correspondence, converting the pinyin string into a number sequence and storing the number sequence as the training text vector.
In one embodiment, when labeling the training embedding vector based on the attention mechanism to obtain the training style vector, the processor is configured to implement:
obtaining multiple initial speech style vectors; calculating, according to the attention mechanism, the similarity between the training embedding vector and each initial speech style vector; and constructing the training style vector according to the similarity of each initial speech style vector.
In one embodiment, when constructing the training style vector according to the similarity of each initial speech style vector, the processor is configured to implement:
taking the similarity corresponding to each initial speech style vector as that vector's attention weight, and computing a weighted sum of the initial speech style vectors to obtain the training style vector.
In one embodiment, when performing model training on the preset neural network model according to the training text vector, the training voice data and the training style vector to obtain the speech synthesis model, the processor is configured to implement:
concatenating the training text vector and the training style vector, to obtain a training spliced vector; and performing model training on the neural network model according to the training voice data and the training spliced vector, to obtain the speech synthesis model.
Wherein, in another embodiment, the processor is configured to run a computer program stored in the memory, to implement the following steps:
obtaining a target text vector and a target voice style vector; concatenating the target text vector and the target voice style vector, to obtain a target spliced vector; and inputting the target spliced vector into a speech synthesis model, to output target synthesized voice data, the speech synthesis model being a model obtained by the training method of the speech synthesis model of any of the above embodiments.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program; the computer program comprises program instructions which, when executed by the processor, implement the training method of the speech synthesis model or the speech synthesis method provided by any embodiment of the present application.
Wherein, the computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card or a flash card (Flash Card) provided on the computer device.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person familiar with the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and these modifications or substitutions shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A training method of a speech synthesis model, characterized by comprising:
obtaining a data set, the data set comprising training text data and training voice data corresponding to the training text data;
generating a training text vector according to the training text data;
encoding the training voice data based on a first encoder, to obtain a training embedding vector;
labeling the training embedding vector based on an attention mechanism, to obtain a training style vector; and
performing model training on a preset neural network model according to the training text vector, the training voice data and the training style vector, to obtain a speech synthesis model.
2. The training method of the speech synthesis model according to claim 1, characterized in that generating the training text vector according to the training text data comprises:
performing pinyin conversion on the training text data, to obtain a corresponding pinyin string; and
based on a character-to-number correspondence, converting the pinyin string into a number sequence, and storing the number sequence as the training text vector.
3. The training method of the speech synthesis model according to claim 1, characterized in that labeling the training embedding vector based on the attention mechanism to obtain the training style vector comprises:
obtaining multiple initial speech style vectors;
calculating, according to the attention mechanism, the similarity between the training embedding vector and each initial speech style vector; and
constructing the training style vector according to the similarity of each initial speech style vector.
4. The training method of the speech synthesis model according to claim 3, characterized in that constructing the training style vector according to the similarity of each initial speech style vector comprises:
taking the similarity corresponding to each initial speech style vector as the attention weight of that initial speech style vector, and computing a weighted sum of the initial speech style vectors, to obtain the training style vector.
5. The training method of the speech synthesis model according to claim 1, characterized in that performing model training on the preset neural network model according to the training text vector, the training voice data and the training style vector to obtain the speech synthesis model comprises:
concatenating the training text vector and the training style vector, to obtain a training spliced vector; and
performing model training on the neural network model according to the training voice data and the training spliced vector, to obtain the speech synthesis model.
6. A speech synthesis method, characterized by comprising:
obtaining a target text vector and a target voice style vector;
concatenating the target text vector and the target voice style vector, to obtain a target spliced vector; and
inputting the target spliced vector into a speech synthesis model, to output target synthesized voice data, the speech synthesis model being a model obtained by the training method of the speech synthesis model according to any one of claims 1 to 5.
7. A training device of a speech synthesis model, characterized by comprising:
a data acquisition unit, configured to obtain a data set, the data set comprising training text data and training voice data corresponding to the training text data;
a vector generation unit, configured to generate a training text vector according to the training text data;
a vector encoding unit, configured to encode the training voice data based on a first encoder, to obtain a training embedding vector;
a vector acquisition unit, configured to label the training embedding vector based on an attention mechanism, to obtain a training style vector; and
a model training unit, configured to perform model training on a preset neural network model according to the training text vector, the training voice data and the training style vector, to obtain a speech synthesis model.
8. A speech synthesis device, characterized by comprising:
a vector acquisition unit, configured to obtain a target text vector and a target voice style vector;
a vector concatenation unit, configured to concatenate the target text vector and the target voice style vector, to obtain a target spliced vector; and
a data output unit, configured to input the target spliced vector into a speech synthesis model, to output target synthesized voice data, the speech synthesis model being a model obtained by the training method of the speech synthesis model according to any one of claims 1 to 5.
9. A computer device, characterized in that the computer device comprises a memory and a processor;
the memory is configured to store a computer program; and
the processor is configured to execute the computer program and, when executing the computer program, implement the training method of the speech synthesis model according to any one of claims 1 to 5 or the speech synthesis method according to claim 6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the training method of the speech synthesis model according to any one of claims 1 to 5 or the speech synthesis method according to claim 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910420168.0A CN110264991B (en) | 2019-05-20 | 2019-05-20 | Training method of speech synthesis model, speech synthesis method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110264991A true CN110264991A (en) | 2019-09-20 |
CN110264991B CN110264991B (en) | 2023-12-22 |
Family
ID=67914821
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910420168.0A Active CN110264991B (en) | 2019-05-20 | 2019-05-20 | Training method of speech synthesis model, speech synthesis method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110264991B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170076715A1 (en) * | 2015-09-16 | 2017-03-16 | Kabushiki Kaisha Toshiba | Training apparatus for speech synthesis, speech synthesis apparatus and training method for training apparatus |
CN109036375A (en) * | 2018-07-25 | 2018-12-18 | 腾讯科技(深圳)有限公司 | Speech synthesis method, model training method, device and computer equipment |
CN109616093A (en) * | 2018-12-05 | 2019-04-12 | 平安科技(深圳)有限公司 | End-to-end speech synthesis method, device, equipment and storage medium |
CN109767752A (en) * | 2019-02-27 | 2019-05-17 | 平安科技(深圳)有限公司 | Speech synthesis method and device based on attention mechanism |
Non-Patent Citations (1)
Title |
---|
Lei Ming et al., "Minimum Generation Error Training Method for Speech Synthesis Model Based on Perceptually Weighted Line Spectral Pair Distance", Pattern Recognition and Artificial Intelligence, vol. 23, no. 4, pages 572-579 *
Cited By (45)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110288973A (en) * | 2019-05-20 | 2019-09-27 | 平安科技(深圳)有限公司 | Speech synthesis method, device, equipment and computer readable storage medium |
CN110288973B (en) * | 2019-05-20 | 2024-03-29 | 平安科技(深圳)有限公司 | Speech synthesis method, device, equipment and computer readable storage medium |
CN110738026A (en) * | 2019-10-23 | 2020-01-31 | 腾讯科技(深圳)有限公司 | Method and device for generating description text |
CN110767217A (en) * | 2019-10-30 | 2020-02-07 | 爱驰汽车有限公司 | Audio segmentation method, system, electronic device and storage medium |
CN110767217B (en) * | 2019-10-30 | 2022-04-12 | 爱驰汽车有限公司 | Audio segmentation method, system, electronic device and storage medium |
CN110808027A (en) * | 2019-11-05 | 2020-02-18 | 腾讯科技(深圳)有限公司 | Speech synthesis method and device and news broadcasting method and system |
CN110808027B (en) * | 2019-11-05 | 2020-12-08 | 腾讯科技(深圳)有限公司 | Speech synthesis method and device and news broadcasting method and system |
CN112786001B (en) * | 2019-11-11 | 2024-04-09 | 北京地平线机器人技术研发有限公司 | Speech synthesis model training method, speech synthesis method and device |
CN112786001A (en) * | 2019-11-11 | 2021-05-11 | 北京地平线机器人技术研发有限公司 | Speech synthesis model training method, speech synthesis method and device |
CN112802443B (en) * | 2019-11-14 | 2024-04-02 | 腾讯科技(深圳)有限公司 | Speech synthesis method and device, electronic equipment and computer readable storage medium |
CN112802443A (en) * | 2019-11-14 | 2021-05-14 | 腾讯科技(深圳)有限公司 | Speech synthesis method and apparatus, electronic device, and computer-readable storage medium |
CN112837674B (en) * | 2019-11-22 | 2024-06-11 | 阿里巴巴集团控股有限公司 | Speech recognition method, device, related system and equipment |
CN112837674A (en) * | 2019-11-22 | 2021-05-25 | 阿里巴巴集团控股有限公司 | Speech recognition method, device and related system and equipment |
CN112863476A (en) * | 2019-11-27 | 2021-05-28 | 阿里巴巴集团控股有限公司 | Method and device for constructing personalized speech synthesis model, method and device for speech synthesis and testing |
US20230036020A1 (en) * | 2019-12-20 | 2023-02-02 | Spotify Ab | Text-to-Speech Synthesis Method and System, a Method of Training a Text-to-Speech Synthesis System, and a Method of Calculating an Expressivity Score |
CN111128137A (en) * | 2019-12-30 | 2020-05-08 | 广州市百果园信息技术有限公司 | Acoustic model training method and device, computer equipment and storage medium |
CN113192482A (en) * | 2020-01-13 | 2021-07-30 | 北京地平线机器人技术研发有限公司 | Speech synthesis method and training method, device and equipment of speech synthesis model |
CN113192482B (en) * | 2020-01-13 | 2023-03-21 | 北京地平线机器人技术研发有限公司 | Speech synthesis method and training method, device and equipment of speech synthesis model |
CN111276120A (en) * | 2020-01-21 | 2020-06-12 | 华为技术有限公司 | Speech synthesis method, apparatus and computer-readable storage medium |
CN111276120B (en) * | 2020-01-21 | 2022-08-19 | 华为技术有限公司 | Speech synthesis method, apparatus and computer-readable storage medium |
CN111326136A (en) * | 2020-02-13 | 2020-06-23 | 腾讯科技(深圳)有限公司 | Voice processing method and device, electronic equipment and storage medium |
CN111326136B (en) * | 2020-02-13 | 2022-10-14 | 腾讯科技(深圳)有限公司 | Voice processing method and device, electronic equipment and storage medium |
CN111312267B (en) * | 2020-02-20 | 2023-08-11 | 广州市百果园信息技术有限公司 | Voice style conversion method, device, equipment and storage medium |
CN111312267A (en) * | 2020-02-20 | 2020-06-19 | 广州市百果园信息技术有限公司 | Voice style conversion method, device, equipment and storage medium |
CN113470615B (en) * | 2020-03-13 | 2024-03-12 | 微软技术许可有限责任公司 | Cross-speaker style transfer speech synthesis |
CN113470615A (en) * | 2020-03-13 | 2021-10-01 | 微软技术许可有限责任公司 | Cross-speaker style transfer speech synthesis |
CN111489734B (en) * | 2020-04-03 | 2023-08-22 | 支付宝(杭州)信息技术有限公司 | Model training method and device based on multiple speakers |
CN111489734A (en) * | 2020-04-03 | 2020-08-04 | 支付宝(杭州)信息技术有限公司 | Model training method and device based on multiple speakers |
CN111627420B (en) * | 2020-04-21 | 2023-12-08 | 升智信息科技(南京)有限公司 | Method and device for synthesizing emotion voice of specific speaker under extremely low resource |
CN111627420A (en) * | 2020-04-21 | 2020-09-04 | 升智信息科技(南京)有限公司 | Specific-speaker emotion voice synthesis method and device under extremely low resources |
CN111402857A (en) * | 2020-05-09 | 2020-07-10 | 广州虎牙科技有限公司 | Speech synthesis model training method and device, electronic equipment and storage medium |
CN111739509A (en) * | 2020-06-16 | 2020-10-02 | 掌阅科技股份有限公司 | Electronic book audio generation method, electronic device and storage medium |
CN111739509B (en) * | 2020-06-16 | 2022-03-22 | 掌阅科技股份有限公司 | Electronic book audio generation method, electronic device and storage medium |
CN112233646A (en) * | 2020-10-20 | 2021-01-15 | 携程计算机技术(上海)有限公司 | Voice cloning method, system, device and storage medium based on neural network |
CN112233646B (en) * | 2020-10-20 | 2024-05-31 | 携程计算机技术(上海)有限公司 | Voice cloning method, system, equipment and storage medium based on neural network |
US11990117B2 (en) * | 2020-10-21 | 2024-05-21 | Google Llc | Using speech recognition to improve cross-language speech synthesis |
US20220122581A1 (en) * | 2020-10-21 | 2022-04-21 | Google Llc | Using Speech Recognition to Improve Cross-Language Speech Synthesis |
CN112509550A (en) * | 2020-11-13 | 2021-03-16 | 中信银行股份有限公司 | Speech synthesis model training method, speech synthesis device and electronic equipment |
WO2022121179A1 (en) * | 2020-12-11 | 2022-06-16 | 平安科技(深圳)有限公司 | Speech synthesis method and apparatus, device, and storage medium |
CN112837673A (en) * | 2020-12-31 | 2021-05-25 | 平安科技(深圳)有限公司 | Speech synthesis method, apparatus, computer device and medium based on artificial intelligence |
CN112837673B (en) * | 2020-12-31 | 2024-05-10 | 平安科技(深圳)有限公司 | Speech synthesis method, device, computer equipment and medium based on artificial intelligence |
CN113345410A (en) * | 2021-05-11 | 2021-09-03 | 科大讯飞股份有限公司 | Training method of general speech and target speech synthesis model and related device |
CN113345410B (en) * | 2021-05-11 | 2024-05-31 | 科大讯飞股份有限公司 | Training method of general speech and target speech synthesis model and related device |
CN115662435A (en) * | 2022-10-24 | 2023-01-31 | 福建网龙计算机网络信息技术有限公司 | Virtual teacher simulation voice generation method and terminal |
US11727915B1 (en) | 2022-10-24 | 2023-08-15 | Fujian TQ Digital Inc. | Method and terminal for generating simulated voice of virtual teacher |
Also Published As
Publication number | Publication date |
---|---|
CN110264991B (en) | 2023-12-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110264991A (en) | Training method of speech synthesis model, speech synthesis method, device, equipment and storage medium | |
US11837216B2 (en) | Speech recognition using unspoken text and speech synthesis | |
CN109840287A (en) | Cross-modal information retrieval method and device based on neural network | |
CN110782870A (en) | Speech synthesis method, speech synthesis device, electronic equipment and storage medium | |
US7263488B2 (en) | Method and apparatus for identifying prosodic word boundaries | |
CN110288980A (en) | Speech recognition method, model training method, device, equipment and storage medium | |
WO2020062680A1 (en) | Waveform splicing method, apparatus, device and storage medium based on double-syllable mixing | |
CN111783455B (en) | Training method and device of text generation model, and text generation method and device | |
JP2008134475A (en) | Technique for recognizing accent of input voice | |
CN112802446B (en) | Audio synthesis method and device, electronic equipment and computer readable storage medium | |
Zheng et al. | BLSTM-CRF Based End-to-End Prosodic Boundary Prediction with Context Sensitive Embeddings in a Text-to-Speech Front-End. | |
EP2329489A1 (en) | Stochastic phoneme and accent generation using accent class | |
US11322133B2 (en) | Expressive text-to-speech utilizing contextual word-level style tokens | |
CN116072098A (en) | Audio signal generation method, model training method, device, equipment and medium | |
CN110335608A (en) | Voiceprint verification method, apparatus, equipment and storage medium | |
WO2014183411A1 (en) | Method, apparatus and speech synthesis system for classifying unvoiced and voiced sound | |
CN117099157A (en) | Multitasking learning for end-to-end automatic speech recognition confidence and erasure estimation | |
CN113593520A (en) | Singing voice synthesis method and device, electronic equipment and storage medium | |
JP7502561B2 (en) | Using speech recognition to improve interlingual speech synthesis. | |
CN111328416B (en) | Speech patterns for fuzzy matching in natural language processing | |
JP2006243673A (en) | Data retrieval device and method | |
CN116702770A (en) | Method, device, terminal and storage medium for generating long text | |
CN112951204B (en) | Speech synthesis method and device | |
CN113223486B (en) | Information processing method, information processing device, electronic equipment and storage medium | |
CN114299910B (en) | Training method, using method, device, equipment and medium of speech synthesis model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||