CN110148398A - Training method, apparatus, device and storage medium for a speech synthesis model - Google Patents

Training method, apparatus, device and storage medium for a speech synthesis model

Info

Publication number
CN110148398A
CN110148398A (application CN201910407683.5A)
Authority
CN
China
Prior art keywords
mark
model
training
preset
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910407683.5A
Other languages
Chinese (zh)
Inventor
陈闽川
马骏
王少军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201910407683.5A
Publication of CN110148398A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to the field of artificial intelligence and discloses a training method for a speech synthesis model. The method comprises: upon detecting first training data and second training data, reading unlabeled text information and unlabeled speech information from the first training data, and labeled text information and labeled speech information from the second training data; constructing an unlabeled text model and an unlabeled speech model based on a preset encoder-decoder model; obtaining, based on the labeled text information and the labeled speech information, vector representations of labeled text features and vector representations of labeled acoustic features; and training the unlabeled speech model with the vector representations of the labeled text features and the unlabeled text model with the vector representations of the labeled acoustic features, to generate the speech synthesis model. The invention also discloses an apparatus, a computer device and a storage medium. The present invention obtains a pre-trained model from large amounts of unlabeled speech or text data, so that construction of the speech synthesis model can be completed by training on only a small amount of labeled speech and text data.

Description

Training method, apparatus, device and storage medium for a speech synthesis model
Technical field
The present invention relates to the field of artificial intelligence, and more particularly to a training method for a speech synthesis model, and to a corresponding apparatus, computer device and computer-readable storage medium.
Background
With the rapid development of the internet, people increasingly communicate through functions such as voice chat. To meet user demand, many customer-service systems conduct intelligent dialogue using synthesized speech. It is therefore necessary to synthesize speech from text, converting text into voice, so as to satisfy users' conversational needs.
Training a current speech synthesis model requires building large amounts of high-quality speech and text data; in particular, a parametric end-to-end synthesis system requires tens of hours of a target speaker's voice together with samples of the corresponding text. Building such a database consumes a great deal of time and labor. An existing training method uses text features and acoustic features extracted from a small corpus of a target speaker to train a deep neural network model for speech synthesis, but it still requires several to tens of hours of high-quality labeled speech data per speaker for supervised model training, and its data utilization is low.
Summary of the invention
The main purpose of the present invention is to provide a training method for a speech synthesis model, aiming to solve the technical problem that the prior art must collect large amounts of labeled text and speech in order to train a deep neural network model for speech synthesis.
To achieve the above object, the present invention provides a training method for a speech synthesis model, the method comprising:
upon detecting first training data and second training data, reading unlabeled text information and unlabeled speech information from the first training data, and labeled text information and labeled speech information from the second training data, wherein the quantity of the first training data is greater than the quantity of the second training data;
constructing an unlabeled text model and an unlabeled speech model based on a preset encoder-decoder model;
obtaining vector representations of labeled text features and vector representations of labeled acoustic features based on the labeled text information and the labeled speech information;
training the unlabeled speech model with the vector representations of the labeled text features, and training the unlabeled text model with the vector representations of the labeled acoustic features, to generate the speech synthesis model (a high-level sketch of this pipeline follows).
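As an illustration only, the four steps above can be condensed into the following minimal, runnable sketch. It assumes PyTorch-style models; all class names, dimensions and the MSE alignment loss are assumptions for exposition, not the patent's concrete implementation.

```python
# Minimal sketch of the claimed pipeline (illustrative assumptions throughout).
import torch
import torch.nn as nn

class TextEncoder(nn.Module):            # stands in for the "unlabeled text model"
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
    def forward(self, ids):              # ids: (batch, seq) token indices
        out, _ = self.rnn(self.emb(ids))
        return out.mean(dim=1)           # vector representation of text features

class SpeechEncoder(nn.Module):          # stands in for the "unlabeled speech model"
    def __init__(self, n_mels=80, dim=64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)
    def forward(self, mels):             # mels: (batch, frames, n_mels)
        out, _ = self.rnn(mels)
        return out.mean(dim=1)           # vector representation of acoustic features

def cross_train(text_model, speech_model, labeled_ids, labeled_mels, steps=100):
    # Step 4: align the two representation spaces on the small labeled set,
    # so labeled text vectors supervise the speech model and vice versa.
    opt = torch.optim.Adam(list(text_model.parameters()) +
                           list(speech_model.parameters()), lr=1e-3)
    for _ in range(steps):
        t_vec = text_model(labeled_ids)
        a_vec = speech_model(labeled_mels)
        loss = nn.functional.mse_loss(t_vec, a_vec)   # "mapping relation" loss
        opt.zero_grad(); loss.backward(); opt.step()

# Toy usage: 8 labeled pairs after (unsketched) large-scale unlabeled pretraining.
text_model, speech_model = TextEncoder(), SpeechEncoder()
cross_train(text_model, speech_model,
            torch.randint(0, 1000, (8, 12)), torch.randn(8, 50, 80))
```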
Optionally, constructing the unlabeled text model and the unlabeled speech model based on the preset encoder-decoder model comprises:
when the unlabeled text information and the unlabeled speech information are read, obtaining the preset encoder-decoder model, wherein the preset encoder-decoder model comprises a preset encoder model and a preset decoder model;
training the preset encoder model with the unlabeled text information to obtain vector representations of unlabeled text features, and constructing the unlabeled text model;
training the preset decoder model with the unlabeled speech information to obtain vector representations of unlabeled acoustic features, and constructing the unlabeled speech model (a pretraining sketch follows).
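As an illustration, the construction step can be sketched as two self-supervised pretraining runs, one per modality. Assuming, purely for exposition, that each side of the preset encoder-decoder is pretrained by reconstruction on its own unlabeled data (the patent specifies only that each side is trained on unlabeled text or unlabeled speech, not the loss):

```python
# Sketch: pretrain the preset encoder on unlabeled text and the preset
# decoder on unlabeled speech; reconstruction losses are assumptions.
import torch
import torch.nn as nn

class TextAutoencoder(nn.Module):
    """Preset encoder model, trained into the 'unlabeled text model'."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)
    def forward(self, ids):
        h, _ = self.rnn(self.emb(ids))   # h: per-position text feature vectors
        return self.out(h), h

class SpeechAutoencoder(nn.Module):
    """Preset decoder model, trained into the 'unlabeled speech model'."""
    def __init__(self, n_mels=80, dim=64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)
        self.out = nn.Linear(dim, n_mels)
    def forward(self, mels):
        h, _ = self.rnn(mels)            # h: per-frame acoustic feature vectors
        return self.out(h), h

def pretrain(model, batch, loss_fn, steps=50):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        recon, _ = model(batch)
        loss = loss_fn(recon, batch)
        opt.zero_grad(); loss.backward(); opt.step()
    return model

text_model = pretrain(TextAutoencoder(), torch.randint(0, 1000, (16, 12)),
                      lambda r, t: nn.functional.cross_entropy(r.transpose(1, 2), t))
speech_model = pretrain(SpeechAutoencoder(), torch.randn(16, 50, 80),
                        nn.functional.mse_loss)
```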
Optionally, training the preset encoder model with the unlabeled text information to obtain the vector representations of the unlabeled text features and constructing the unlabeled text model comprises:
upon detecting that the unlabeled text information is used as an input value to train the preset encoder model, obtaining a preset lexical analysis, wherein the preset lexical analysis is the encoding rule of the preset encoder model;
obtaining, based on the preset lexical analysis, the vector representations of the unlabeled text features output by the preset encoder model;
constructing the unlabeled text model based on the vector representations of the unlabeled text features and the unlabeled text information (a lexical-analysis sketch follows).
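As an illustration of the "preset lexical analysis" acting as an encoding rule, the sketch below reduces raw text to the text features named later in the description (words, word length, prosodic pauses). The whitespace segmenter and punctuation-based pause heuristic are assumptions; a real system would use a proper segmenter.

```python
# Sketch: lexical analysis producing text features (illustrative heuristics).
from dataclasses import dataclass

@dataclass
class TextFeatures:
    words: list[str]         # segmented words
    word_lengths: list[int]  # word length per word
    pause_after: list[bool]  # crude prosodic-pause marks

def lexical_analysis(sentence: str) -> TextFeatures:
    words = sentence.split()                    # stand-in for a real segmenter
    return TextFeatures(
        words=words,
        word_lengths=[len(w) for w in words],
        pause_after=[w.endswith((",", ".", "?")) for w in words],
    )

print(lexical_analysis("speech synthesis converts text , into voice ."))
```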
Optionally, training the preset decoder model with the unlabeled speech information to obtain the vector representations of the unlabeled acoustic features and constructing the unlabeled speech model comprises:
upon detecting that the unlabeled speech information is used as an input value to train the preset decoder model, obtaining a preset syntactic analysis, wherein the preset syntactic analysis is the decoding rule of the preset decoder model;
obtaining, based on the preset syntactic analysis, the vector representations of the unlabeled acoustic features output by the preset decoder model;
constructing the unlabeled speech model based on the vector representations of the acoustic features and the unlabeled speech information (an acoustic-feature sketch follows).
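As a counterpart illustration for the speech side, the sketch below reduces a waveform to the acoustic features the description later names (spectral parameters, duration, fundamental frequency). The naive FFT framing and autocorrelation F0 estimate are assumptions standing in for whatever front end the patent's decoder actually uses.

```python
# Sketch: extracting spectral parameters, duration and F0 (illustrative DSP).
import numpy as np

def acoustic_features(wave: np.ndarray, sr: int = 16000,
                      frame: int = 512, hop: int = 256):
    n_frames = max(1, (len(wave) - frame) // hop + 1)
    spec, f0 = [], []
    for i in range(n_frames):
        x = wave[i * hop: i * hop + frame]
        spec.append(np.abs(np.fft.rfft(x)))           # spectral parameters
        ac = np.correlate(x, x, mode="full")[frame - 1:]
        lag = np.argmax(ac[sr // 400: sr // 60]) + sr // 400
        f0.append(sr / lag)                           # crude F0 estimate
    duration = n_frames * hop / sr                    # duration in seconds
    return np.array(spec), duration, np.array(f0)

tone = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # 1 s, 220 Hz
spec, dur, f0 = acoustic_features(tone)
print(spec.shape, round(dur, 2), round(float(f0.mean()), 1))
```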
Optionally, obtaining the vector representations of the labeled text features and the vector representations of the labeled acoustic features based on the labeled text information and the labeled speech information comprises:
when the labeled text information and the labeled speech information are read, obtaining the unlabeled text model and the unlabeled speech model;
upon detecting that the labeled text information is used as an input value of the unlabeled text model, obtaining the preset lexical analysis;
obtaining, based on the preset lexical analysis, the vector representations of the labeled text features output by the unlabeled text model;
upon detecting that the labeled speech information is used as an input value of the unlabeled speech model, obtaining the preset syntactic analysis;
obtaining, based on the preset syntactic analysis, the vector representations of the labeled acoustic features output by the unlabeled speech model (a sketch follows).
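As an illustration, this step amounts to a forward pass of the small labeled set through the two pretrained models, with no weight updates. The GRU stand-ins and shapes below are assumptions.

```python
# Sketch: read out labeled representations from the pretrained models.
import torch
import torch.nn as nn

text_model = nn.GRU(64, 64, batch_first=True)    # stands in for the unlabeled text model
speech_model = nn.GRU(80, 64, batch_first=True)  # stands in for the unlabeled speech model

labeled_text_emb = torch.randn(8, 12, 64)        # labeled text after lexical analysis
labeled_mels = torch.randn(8, 50, 80)            # labeled speech after syntactic analysis

with torch.no_grad():                            # representations only; no training yet
    labeled_text_vecs, _ = text_model(labeled_text_emb)
    labeled_acoustic_vecs, _ = speech_model(labeled_mels)
print(labeled_text_vecs.shape, labeled_acoustic_vecs.shape)
```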
Optionally, after training the preset decoder model with the unlabeled speech information to obtain the vector representations of the unlabeled acoustic features and constructing the unlabeled speech model, the method further comprises:
obtaining, based on an attention mechanism of the encoder model, the vector representations of the unlabeled text features attended to by that attention mechanism;
obtaining, based on an attention mechanism of the decoder model, the vector representations of the unlabeled acoustic features attended to by that attention mechanism (an attention sketch follows).
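As an illustration, ordinary dot-product attention over the per-position feature vectors each model outputs would realize this step; the patent does not fix the attention variant, so this choice is an assumption.

```python
# Sketch: attention picking out feature vector representations.
import torch
import torch.nn.functional as F

def attended_representation(hidden: torch.Tensor, query: torch.Tensor):
    # hidden: (batch, seq, dim) feature vectors; query: (batch, dim)
    scores = torch.bmm(hidden, query.unsqueeze(-1)).squeeze(-1)   # (batch, seq)
    weights = F.softmax(scores, dim=-1)          # what the mechanism "attends to"
    return torch.bmm(weights.unsqueeze(1), hidden).squeeze(1)     # (batch, dim)

text_hidden = torch.randn(4, 12, 64)             # unlabeled text feature vectors
acoustic_hidden = torch.randn(4, 50, 64)         # unlabeled acoustic feature vectors
query = torch.randn(4, 64)
print(attended_representation(text_hidden, query).shape,
      attended_representation(acoustic_hidden, query).shape)
```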
Optionally, training the unlabeled speech model with the vector representations of the labeled text features and training the unlabeled text model with the vector representations of the labeled acoustic features to generate the speech synthesis model comprises:
if the vector representations of the unlabeled text features attended to by the attention mechanism of the encoder model are detected to be identical to the vector representations of the labeled text features, training the unlabeled text model based on the mapping relations between the vector representations of the labeled text features and the vector representations of the labeled acoustic features;
if the vector representations of the unlabeled acoustic features attended to by the attention mechanism of the decoder model are detected to be identical to the vector representations of the labeled acoustic features, training the unlabeled acoustic model based on the mapping relations between the vector representations of the labeled text features and the vector representations of the labeled acoustic features;
modifying, based on the trained unlabeled text model and unlabeled acoustic model, the weight parameters between the unlabeled text model and the unlabeled acoustic model, to generate the speech synthesis model (a fine-tuning sketch follows).
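As an illustration of this generation step, the sketch below relaxes "identical" to a cosine-similarity match and distills the labeled text-to-acoustic mapping into a bridging module with an L2 loss; both relaxations are assumptions over the patent's wording.

```python
# Sketch: fine-tune the weights bridging the two models on matched vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F

def matches(unlabeled_vec, labeled_vec, thresh=0.99):
    return F.cosine_similarity(unlabeled_vec, labeled_vec, dim=-1) >= thresh

def finetune_bridge(bridge: nn.Module, text_vecs, acoustic_vecs,
                    unlabeled_text_vecs, steps=50):
    # bridge maps text representations to acoustic representations; its
    # weights are the ones "modified" to yield the synthesis model.
    opt = torch.optim.Adam(bridge.parameters(), lr=1e-3)
    mask = matches(unlabeled_text_vecs, text_vecs)     # attended == labeled?
    for _ in range(steps):
        loss = F.mse_loss(bridge(text_vecs[mask]), acoustic_vecs[mask])
        opt.zero_grad(); loss.backward(); opt.step()
    return bridge

bridge = nn.Linear(64, 64)
t, a = torch.randn(8, 64), torch.randn(8, 64)
print(finetune_bridge(bridge, t, a, t.clone()))        # toy: every row matches
```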
In addition, to achieve the above object, the present invention further provides a training apparatus for a speech synthesis model, the apparatus comprising:
a reading unit configured to, upon detecting first training data and second training data, read unlabeled text information and unlabeled speech information from the first training data, and labeled text information and labeled speech information from the second training data, wherein the quantity of the first training data is greater than the quantity of the second training data;
a construction unit configured to construct the unlabeled text model and the unlabeled speech model based on the preset encoder-decoder model;
an acquisition unit configured to obtain the vector representations of the labeled text features and the vector representations of the labeled acoustic features based on the labeled text information and the labeled speech information;
a generation unit configured to train the unlabeled speech model with the vector representations of the labeled text features, and to train the unlabeled text model with the vector representations of the labeled acoustic features, generating the speech synthesis model.
Optionally, the construction unit is specifically configured to:
when the unlabeled text information and the unlabeled speech information are read, obtain the preset encoder-decoder model, wherein the preset encoder-decoder model comprises a preset encoder model and a preset decoder model;
train the preset encoder model with the unlabeled text information to obtain the vector representations of the unlabeled text features, and construct the unlabeled text model;
train the preset decoder model with the unlabeled speech information to obtain the vector representations of the unlabeled acoustic features, and construct the unlabeled speech model.
Optionally, the construction unit further comprises:
a first acquisition subunit configured to obtain a preset lexical analysis upon detecting that the unlabeled text information is used as an input value to train the preset encoder model, wherein the preset lexical analysis is the encoding rule of the preset encoder model;
a second acquisition subunit configured to obtain, based on the preset lexical analysis, the vector representations of the unlabeled text features output by the preset encoder model;
a first construction subunit configured to construct the unlabeled text model based on the vector representations of the unlabeled text features and the unlabeled text information.
Optionally, the construction unit further comprises:
a third acquisition subunit configured to obtain a preset syntactic analysis upon detecting that the unlabeled speech information is used as an input value to train the preset decoder model, wherein the preset syntactic analysis is the decoding rule of the preset decoder model;
a fourth acquisition subunit configured to obtain, based on the preset syntactic analysis, the vector representations of the unlabeled acoustic features output by the preset decoder model;
a second construction subunit configured to construct the unlabeled speech model based on the vector representations of the acoustic features and the unlabeled speech information.
Optionally, the acquisition unit is specifically configured to:
when the labeled text information and the labeled speech information are read, obtain the unlabeled text model and the unlabeled speech model;
upon detecting that the labeled text information is used as an input value of the unlabeled text model, obtain the preset lexical analysis;
obtain, based on the preset lexical analysis, the vector representations of the labeled text features output by the unlabeled text model;
upon detecting that the labeled speech information is used as an input value of the unlabeled speech model, obtain the preset syntactic analysis;
obtain, based on the preset syntactic analysis, the vector representations of the labeled acoustic features output by the unlabeled speech model.
Optionally, the training apparatus for the speech synthesis model further comprises:
a first attention acquisition unit configured to obtain, based on the attention mechanism of the encoder model, the vector representations of the unlabeled text features attended to by that attention mechanism;
a second attention acquisition unit configured to obtain, based on the attention mechanism of the decoder model, the vector representations of the unlabeled acoustic features attended to by that attention mechanism.
Optionally, the generation unit is specifically configured to:
if the vector representations of the unlabeled text features attended to by the attention mechanism of the encoder model are detected to be identical to the vector representations of the labeled text features, train the unlabeled text model based on the mapping relations between the vector representations of the labeled text features and the vector representations of the labeled acoustic features;
if the vector representations of the unlabeled acoustic features attended to by the attention mechanism of the decoder model are detected to be identical to the vector representations of the labeled acoustic features, train the unlabeled acoustic model based on the mapping relations between the vector representations of the labeled text features and the vector representations of the labeled acoustic features;
modify, based on the trained unlabeled text model and unlabeled acoustic model, the weight parameters between the unlabeled text model and the unlabeled acoustic model, generating the speech synthesis model.
In addition, to achieve the above object, the present invention further provides a computer device comprising a memory, a processor, and a training program for a speech synthesis model stored in the memory and executable on the processor, wherein the training program, when executed by the processor, implements the steps of the training method for the speech synthesis model described above.
In addition, to achieve the above object, the present invention further provides a computer-readable storage medium storing a training program for a speech synthesis model, wherein the training program, when executed by a processor, implements the steps of the training method for the speech synthesis model described above.
The training method, apparatus, computer device and computer-readable storage medium for a speech synthesis model proposed by the embodiments of the present invention operate as follows: upon detecting first training data and second training data, unlabeled text information and unlabeled speech information are read from the first training data, and labeled text information and labeled speech information are read from the second training data, the quantity of the first training data being greater than that of the second; an unlabeled text model and an unlabeled speech model are constructed based on a preset encoder-decoder model; vector representations of labeled text features and of labeled acoustic features are obtained based on the labeled text information and the labeled speech information; and the unlabeled speech model is trained with the vector representations of the labeled text features and the unlabeled text model with the vector representations of the labeled acoustic features, generating the speech synthesis model. A pre-trained model is thus obtained from large amounts of unlabeled speech or text data, and construction of the speech synthesis model can be completed by training on only a small amount of labeled speech and text data.
Brief description of the drawings
Fig. 1 is a schematic diagram of the terminal structure of the hardware operating environment involved in embodiments of the present invention;
Fig. 2 is a flow diagram of the first embodiment of the training method for a speech synthesis model of the present invention;
Fig. 3 is a flow diagram of the second embodiment of the training method for a speech synthesis model of the present invention;
Fig. 4 is a refined flow diagram of step S22 in Fig. 3;
Fig. 5 is a refined flow diagram of step S23 in Fig. 3;
Fig. 6 is a refined flow diagram of step S30 in Fig. 2;
Fig. 7 is a flow diagram of the third embodiment of the training method for a speech synthesis model of the present invention.
The realization of the objects, functional characteristics and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Specific embodiments
It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
The primary solution of the embodiments of the present invention is: upon detecting first training data and second training data, reading unlabeled text information and unlabeled speech information from the first training data, and labeled text information and labeled speech information from the second training data, wherein the quantity of the first training data is greater than the quantity of the second training data; constructing an unlabeled text model and an unlabeled speech model based on a preset encoder-decoder model; obtaining vector representations of labeled text features and of labeled acoustic features based on the labeled text information and the labeled speech information; and training the unlabeled speech model with the vector representations of the labeled text features and the unlabeled text model with the vector representations of the labeled acoustic features, generating the speech synthesis model.
The prior art must collect large amounts of labeled text and speech in order to train the deep neural network model for speech synthesis.
The present invention provides a solution: a pre-trained model is obtained from large amounts of unlabeled speech or text data, so that construction of the speech synthesis model can be completed by training on only a small amount of labeled speech and text data.
As shown in Fig. 1, Fig. 1 is a schematic diagram of the terminal structure of the hardware operating environment involved in embodiments of the present invention.
The terminal of the embodiments of the present invention may be a PC, or a portable terminal device with a display function such as a laptop computer.
As shown in Fig. 1, the terminal may include a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 realizes connection and communication among these components. The user interface 1003 may include a display (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include standard wired and wireless interfaces. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a stable non-volatile memory, such as a disk memory, and may optionally also be a storage device independent of the aforementioned processor 1001.
Those skilled in the art will understand that the terminal structure shown in Fig. 1 does not limit the terminal, which may include more or fewer components than illustrated, combine certain components, or arrange components differently.
As shown in Fig. 1, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a training program for a speech synthesis model.
In the terminal shown in Fig. 1, the network interface 1004 is mainly used to connect to a back-end server for data communication; the user interface 1003 is mainly used to connect to a client (user terminal) for data communication; and the processor 1001 may be used to invoke the training program for the speech synthesis model in the memory 1005 and perform the following operations:
upon detecting first training data and second training data, reading unlabeled text information and unlabeled speech information from the first training data, and labeled text information and labeled speech information from the second training data, wherein the quantity of the first training data is greater than the quantity of the second training data;
constructing an unlabeled text model and an unlabeled speech model based on a preset encoder-decoder model;
obtaining vector representations of labeled text features and vector representations of labeled acoustic features based on the labeled text information and the labeled speech information;
training the unlabeled speech model with the vector representations of the labeled text features, and training the unlabeled text model with the vector representations of the labeled acoustic features, generating the speech synthesis model.
Further, the processor 1001 may invoke the training program for the speech synthesis model stored in the memory 1005 and also perform the following operations:
when the unlabeled text information and the unlabeled speech information are read, obtaining the preset encoder-decoder model;
training the preset encoder model with the unlabeled text information to obtain the vector representations of the unlabeled text features, and constructing the unlabeled text model;
training the preset decoder model with the unlabeled speech information to obtain the vector representations of the unlabeled acoustic features, and constructing the unlabeled speech model.
Further, the processor 1001 may invoke the training program for the speech synthesis model stored in the memory 1005 and also perform the following operations:
upon detecting that the unlabeled text information is used as an input value to train the preset encoder model, obtaining the preset lexical analysis, wherein the preset lexical analysis is the encoding rule of the preset encoder model;
obtaining, based on the preset lexical analysis, the vector representations of the unlabeled text features output by the preset encoder model;
constructing the unlabeled text model based on the vector representations of the unlabeled text features and the unlabeled text information.
Further, the processor 1001 may invoke the training program for the speech synthesis model stored in the memory 1005 and also perform the following operations:
upon detecting that the unlabeled speech information is used as an input value to train the preset decoder model, obtaining the preset syntactic analysis, wherein the preset syntactic analysis is the decoding rule of the preset decoder model;
obtaining, based on the preset syntactic analysis, the vector representations of the unlabeled acoustic features output by the preset decoder model;
constructing the unlabeled speech model based on the vector representations of the acoustic features and the unlabeled speech information.
Further, the processor 1001 may invoke the training program for the speech synthesis model stored in the memory 1005 and also perform the following operations:
when the labeled text information and the labeled speech information are read, obtaining the unlabeled text model and the unlabeled speech model;
upon detecting that the labeled text information is used as an input value of the unlabeled text model, obtaining the preset lexical analysis;
obtaining, based on the preset lexical analysis, the vector representations of the labeled text features output by the unlabeled text model;
upon detecting that the labeled speech information is used as an input value of the unlabeled speech model, obtaining the preset syntactic analysis;
obtaining, based on the preset syntactic analysis, the vector representations of the labeled acoustic features output by the unlabeled speech model.
Further, the processor 1001 may invoke the training program for the speech synthesis model stored in the memory 1005 and also perform the following operations:
obtaining, based on the attention mechanism of the encoder model, the vector representations of the unlabeled text features attended to by that attention mechanism;
obtaining, based on the attention mechanism of the decoder model, the vector representations of the unlabeled acoustic features attended to by that attention mechanism.
Further, the processor 1001 may invoke the training program for the speech synthesis model stored in the memory 1005 and also perform the following operations:
if the vector representations of the unlabeled text features attended to by the attention mechanism of the encoder model are detected to be identical to the vector representations of the labeled text features, training the unlabeled text model based on the mapping relations between the vector representations of the labeled text features and the vector representations of the labeled acoustic features;
if the vector representations of the unlabeled acoustic features attended to by the attention mechanism of the decoder model are detected to be identical to the vector representations of the labeled acoustic features, training the unlabeled acoustic model based on the mapping relations between the vector representations of the labeled text features and the vector representations of the labeled acoustic features;
modifying, based on the trained unlabeled text model and unlabeled acoustic model, the weight parameters between the unlabeled text model and the unlabeled acoustic model, generating the speech synthesis model.
Referring to Fig. 2, Fig. 2 shows the first embodiment of the training method for a speech synthesis model of the present invention. The training method comprises:
Step S10: upon detecting first training data and second training data, reading unlabeled text information and unlabeled speech information from the first training data, and labeled text information and labeled speech information from the second training data, wherein the quantity of the first training data is greater than the quantity of the second training data.
Upon detecting the first training data and the second training data, the terminal reads the unlabeled text information and unlabeled speech information from the first training data, and the labeled text information and labeled speech information from the second training data. The first training data are the large amounts of unlabeled text and speech information retrieved by the terminal; the second training data are a target speaker's speech obtained by the terminal together with the corresponding text. The target speaker's speech serves as the labeled speech information, and the corresponding text serves as the labeled text information; the labeled speech information and the labeled text information have mapping relations between them. The quantity of unlabeled text and speech information in the first training data is greater than the quantity of labeled text and speech information in the second training data.
Step S20: constructing an unlabeled text model and an unlabeled speech model based on a preset encoder-decoder model.
When the terminal reads the unlabeled text information, the unlabeled speech information, the labeled text information and the labeled speech information, it extracts an initialization database. The initialization database may be a dictionary, or a corpus such as the Chinese Wikipedia. Using the dictionary or corpus, the preset encoder-decoder model decomposes the unlabeled text and speech information and the labeled text and speech information, obtaining the text features of the unlabeled text information, the acoustic features of the unlabeled speech information, the text features of the labeled text information, and the acoustic features of the labeled speech information. The text features include word granularity, words, word length and prosodic pauses; the acoustic features include spectral parameters, duration and fundamental frequency.
When the terminal obtains the text features of the unlabeled text information, it obtains the preset encoder model, takes the unlabeled text information as the input value of the preset encoder model and the unlabeled text features as its output value, and, according to the unlabeled text information and the corresponding unlabeled text features, modifies the weight parameters of the preset encoder model to construct a neural network model of the unlabeled text. The type of this neural network model may be a convolutional neural network model or a recurrent neural network model.
When the terminal obtains the acoustic features of the unlabeled speech information, it obtains the preset decoder model, takes the unlabeled speech information as the input value of the preset decoder model and the unlabeled acoustic features as its output value, and, according to the unlabeled speech information and the corresponding unlabeled acoustic features, modifies the weight parameters of the preset decoder model to construct a neural network model of the unlabeled speech. The type of this neural network model may likewise be a convolutional neural network model or a recurrent neural network model (a construction sketch follows).
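As an illustration of the convolutional-versus-recurrent choice this embodiment allows, the sketch below builds either variant behind one interface; the concrete layer sizes are assumptions.

```python
# Sketch: the unlabeled text/speech network may be a CNN or an RNN.
import torch.nn as nn

def build_feature_net(kind: str, in_dim: int = 80, dim: int = 64) -> nn.Module:
    if kind == "cnn":        # convolutional variant over (batch, in_dim, seq)
        return nn.Sequential(
            nn.Conv1d(in_dim, dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2),
        )
    if kind == "rnn":        # recurrent variant over (batch, seq, in_dim)
        return nn.GRU(in_dim, dim, batch_first=True)
    raise ValueError(kind)

speech_net = build_feature_net("cnn")    # e.g., the unlabeled speech model
text_net = build_feature_net("rnn")      # e.g., the unlabeled text model
```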
Step S30: obtaining vector representations of labeled text features and vector representations of labeled acoustic features based on the labeled text information and the labeled speech information.
When the terminal reads the labeled text information and the labeled speech information, it takes the labeled text information as the input value of the unlabeled text model and the labeled speech information as the input value of the unlabeled speech model. When the labeled text information is used as the input value of the unlabeled text model, the text features of the labeled text information are obtained according to the lexical analysis in the unlabeled text model, and the unlabeled text model obtains the vector representations of the labeled text features based on the labeled text information and its text features; specifically, based on its weight parameters, the unlabeled text model retrieves the vector representations of the labeled text features in the weight matrix. When the labeled speech information is used as the input value of the unlabeled speech model, the acoustic features of the labeled speech information are obtained according to the syntactic analysis in the unlabeled speech model, and the unlabeled speech model obtains the vector representations of the labeled acoustic features based on the labeled speech information and its acoustic features; specifically, based on its weight parameters, the unlabeled speech model retrieves the vector representations of the labeled acoustic features in the weight matrix. The text features include word granularity, words, word length and prosodic pauses; the acoustic features include spectral parameters, duration and fundamental frequency.
Step S40: training the unlabeled speech model with the vector representations of the labeled text features, and training the unlabeled text model with the vector representations of the labeled acoustic features, to generate the speech synthesis model.
When the terminal obtains the unlabeled text model and the unlabeled speech model, it obtains the vector representations of the unlabeled text features in the unlabeled text model and the vector representations of the unlabeled acoustic features in the unlabeled speech model. When the vector representations of the unlabeled text features are identical to those of the labeled text features, or the vector representations of the unlabeled acoustic features are identical to those of the labeled acoustic features, the terminal uses the mapping relations between the labeled text feature vectors and the labeled acoustic feature vectors to obtain the mapping relations between the unlabeled text feature vectors and the labeled acoustic feature vectors, or between the unlabeled acoustic feature vectors and the labeled text feature vectors. For example, the terminal judges whether the unlabeled text feature vectors are identical to the labeled text feature vectors, or whether the unlabeled acoustic feature vectors are identical to the labeled acoustic feature vectors; when they are identical, it obtains the corresponding mapping relations. Upon obtaining the mapping relations between the unlabeled text feature vectors and the unlabeled acoustic feature vectors, the terminal trains the weight parameters of the unlabeled text model and the unlabeled speech model, generating the speech synthesis model.
In this embodiment, upon detecting the first and second training data in the encoder-decoder model, the terminal obtains the text features of the unlabeled text information and the acoustic features of the unlabeled speech information in the first training data, and the text features of the labeled text information and the acoustic features of the labeled speech information in the second training data; based on the encoder-decoder model, it generates the unlabeled text model and the unlabeled speech model; based on these models, it obtains the vector representations of the labeled text features and of the labeled acoustic features; and it trains the unlabeled text model and the unlabeled speech model according to the mapping relations between the labeled text feature vectors and the labeled acoustic feature vectors, generating the speech synthesis model. A pre-trained model is thus obtained from large amounts of unlabeled speech or text data, and construction of the speech synthesis model can be completed by training on only a small amount of labeled speech and text data.
Further, referring to Fig. 3, Fig. 3 shows the second embodiment of the training method for a speech synthesis model of the present invention. Based on the embodiment shown in Fig. 2, step S20 comprises:
Step S21: when the unlabeled text information and the unlabeled speech information are read, obtaining a preset encoder-decoder model;
Step S22: training the preset encoder model with the unlabeled text information to obtain the vector representations of the unlabeled text features, and constructing the unlabeled text model;
Step S23: training the preset decoder model with the unlabeled speech information to obtain the vector representations of the unlabeled acoustic features, and constructing the unlabeled speech model.
When the terminal reads the unlabeled text information and the unlabeled speech information, it obtains the encoding rule and the decoding rule of the preset encoder-decoder model. The encoding rule of the preset encoder model comprises lexical analyses such as word granularity, words and word length; the decoding rule of the preset decoder model comprises syntactic analyses such as prosodic pauses, spectral parameters, duration and fundamental frequency. According to the lexical analyses (word granularity, words, word length, etc.) in the preset encoder model, the terminal encodes the retrieved unlabeled text information and obtains the text features in it, and then constructs the unlabeled text model from those text features and the unlabeled text information. According to the syntactic analyses (prosodic pauses, spectral parameters, duration, fundamental frequency, etc.) in the preset decoder model, the terminal decodes the retrieved unlabeled speech information and obtains the acoustic features in it, and then constructs the unlabeled speech model from those acoustic features and the unlabeled speech information. According to the attention mechanism in the encoder model, the terminal obtains the vector representations of the unlabeled text features attended to by the attention mechanism; specifically, while the encoder model encodes the unlabeled text information and outputs the unlabeled text features, the attention mechanism, upon obtaining the weight matrix of the unlabeled text features, attends to the vector representations of the unlabeled text features in the upper- and lower-layer information of the weight matrix.
In this embodiment, when the terminal obtains the unlabeled text information and the unlabeled speech information, it obtains the encoding and decoding rules of the preset encoder-decoder model, starts the preset encoder model to encode the unlabeled text information and generate the unlabeled text model, and starts the preset decoder model to decode the unlabeled speech information and generate the unlabeled speech model, thereby quickly generating the unlabeled text model and the unlabeled acoustic model from the encoder-decoder model.
Referring to Fig. 4, Fig. 4 is the refined flow diagram of step S22 in Fig. 3. Step S22 comprises:
Step S221: upon detecting that the unlabeled text information is used as an input value to train the preset encoder model, obtaining the preset lexical analysis;
Step S222: obtaining, based on the preset lexical analysis, the vector representations of the unlabeled text features output by the preset encoder model;
Step S223: constructing the unlabeled text model based on the vector representations of the unlabeled text features and the unlabeled text information.
When the terminal detects that the unlabeled text information is used as an input value to train the preset encoder model, it obtains the lexical analysis of the preset encoder model; the lexical analysis is also the encoding rule of the preset encoder model. The preset encoder model encodes the unlabeled text information according to the preset encoding rule and obtains the encoded vector representations of the unlabeled text features. Upon obtaining them, the terminal adjusts the weight parameters in the encoder model according to the unlabeled text information and the vector representations of the unlabeled text features, constructing the unlabeled text model. When the preset encoding rule encodes according to the words in the unlabeled text information, a word-vector weight matrix of the unlabeled text features is generated; when it encodes according to the strokes of the words in the unlabeled text information, a stroke-vector weight matrix of the unlabeled text features is generated. From the word-vector weight matrix or the stroke-vector weight matrix, the terminal obtains the vector representations of the unlabeled text features. According to the attention mechanism in the decoder model, the terminal obtains the vector representations of the unlabeled acoustic features attended to by the attention mechanism; specifically, while the decoder model decodes the unlabeled speech information, the attention mechanism attends to the unlabeled acoustic features input to the decoder model and, upon obtaining their weight matrix, attends to the vector representations of the unlabeled acoustic features in the upper- and lower-layer information of the weight matrix.
In this embodiment, upon detecting that the unlabeled text information is used as an input value to train the preset encoder model, the terminal obtains the lexical analysis of the preset encoder model; the preset encoder model encodes the unlabeled text information according to the preset encoding rule and obtains the encoded vector representations of the unlabeled text features; and the terminal then adjusts the weight parameters in the encoder model according to the unlabeled text information and those vector representations, constructing the unlabeled text model. Through the encoding rule of the encoder model, the model is quickly constructed from training data; the word-versus-stroke choice is sketched below.
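As an illustration of the two encoding rules this refinement describes, the sketch below keeps one embedding table per rule: a word-vector weight matrix and a stroke-vector weight matrix. The vocabulary sizes and the stroke inventory are assumptions (Chinese characters decompose into a few dozen basic strokes).

```python
# Sketch: word-granularity versus stroke-granularity weight matrices.
import torch
import torch.nn as nn

word_matrix = nn.Embedding(num_embeddings=5000, embedding_dim=64)   # word rule
stroke_matrix = nn.Embedding(num_embeddings=40, embedding_dim=64)   # stroke rule

word_ids = torch.tensor([[12, 407, 9]])          # one sentence as word indices
stroke_ids = torch.tensor([[1, 3, 3, 2, 5, 4]])  # same text as stroke indices

word_vecs = word_matrix(word_ids)      # rows of the word-vector weight matrix
stroke_vecs = stroke_matrix(stroke_ids)
print(word_vecs.shape, stroke_vecs.shape)
```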
Referring to Fig. 5, Fig. 5 is the refined flow diagram of step S23 in Fig. 3. Step S23 comprises:
Step S231: upon detecting that the unlabeled speech information is used as an input value to train the preset decoder model, obtaining the preset syntactic analysis;
Step S232: obtaining, based on the preset syntactic analysis, the vector representations of the unlabeled acoustic features output by the preset decoder model;
Step S233: constructing the unlabeled speech model based on the vector representations of the acoustic features and the unlabeled speech information.
When the terminal detects that the unlabeled speech information is used as an input value to train the preset decoder model, it obtains the syntactic analysis of the preset decoder model; the syntactic analysis is also the decoding rule of the preset decoder model. The preset decoder model decodes the unlabeled speech information according to the preset decoding rule and obtains the decoded vector representations of the unlabeled acoustic features. Upon obtaining them, the terminal adjusts the weight parameters in the decoder model according to the unlabeled speech information and the vector representations of the unlabeled acoustic features, constructing the unlabeled speech model. When the preset decoding rule decodes according to the fundamental frequency of the unlabeled acoustic features, a fundamental-frequency vector weight matrix of the unlabeled acoustic features is generated; when it decodes according to the duration of the unlabeled acoustic features, a duration vector weight matrix is generated. From the fundamental-frequency vector weight matrix or the duration vector weight matrix, the terminal obtains the vector representations of the unlabeled acoustic features. Through the decoding rule of the decoder model, the model is quickly constructed from training data.
Referring to Fig. 6, Fig. 6 is the refined flow diagram of step S30 in Fig. 2. Step S30 comprises:
Step S31: when the labeled text information and the labeled speech information are read, obtaining the unlabeled text model and the unlabeled speech model;
Step S32: upon detecting that the labeled text information is used as an input value of the unlabeled text model, obtaining the preset lexical analysis;
Step S33: obtaining, based on the preset lexical analysis, the vector representations of the labeled text features output by the unlabeled text model;
Step S34: upon detecting that the labeled speech information is used as an input value of the unlabeled speech model, obtaining the preset syntactic analysis;
Step S35: obtaining, based on the preset syntactic analysis, the vector representations of the labeled acoustic features output by the unlabeled speech model.
When reading the labeled text information and the labeled speech information in the second training data, the terminal obtains the unlabeled text model and the unlabeled speech model. When the terminal detects that the labeled text information is used as an input value of the unlabeled text model, it obtains the preset lexical analysis and, based on it, obtains the vector representations of the labeled text features output by the unlabeled text model. When the terminal detects that the labeled speech information is used as an input value of the unlabeled speech model, it obtains the preset syntactic analysis and, based on it, obtains the vector representations of the labeled acoustic features output by the unlabeled speech model. Specifically, when the terminal detects that the preset encoding rule in the unlabeled text model encodes according to the words of the labeled text features, a word-vector weight matrix of the labeled text features is generated; when the preset encoding rule encodes according to the strokes of the labeled text features, a stroke-vector weight matrix is generated; from either matrix the terminal obtains the vector representations of the labeled text features. When the terminal detects that the preset decoding rule in the unlabeled speech model decodes according to the fundamental frequency of the labeled acoustic features, a fundamental-frequency vector weight matrix is generated; when the preset decoding rule decodes according to the duration of the labeled acoustic features, a duration vector weight matrix is generated; from either matrix the terminal obtains the vector representations of the labeled acoustic features.
In this embodiment, when reading the labeled text information and the labeled speech information in the second training data, the terminal obtains the unlabeled text model and the unlabeled speech model; using the labeled text information as the input value of the unlabeled text model together with the preset lexical analysis, it obtains the vector representations of the labeled text features output by the unlabeled text model; and using the labeled speech information as the input value of the unlabeled speech model together with the preset syntactic analysis, it obtains the vector representations of the labeled acoustic features output by the unlabeled speech model. Through the constructed models, the vector representations of the labeled acoustic features and of the labeled text features are obtained quickly.
Referring to Fig. 7, Fig. 7 shows the third embodiment of the training method for a speech synthesis model of the present invention. Based on the embodiment shown in Fig. 2, step S40 comprises:
Step S41: if the vector representations of the unlabeled text features attended to by the attention mechanism of the encoder model are detected to be identical to the vector representations of the labeled text features, training the unlabeled text model based on the mapping relations between the vector representations of the labeled text features and the vector representations of the labeled acoustic features;
Step S42: if the vector representations of the unlabeled acoustic features attended to by the attention mechanism of the decoder model are detected to be identical to the vector representations of the labeled acoustic features, training the unlabeled acoustic model based on the mapping relations between the vector representations of the labeled text features and the vector representations of the labeled acoustic features;
Step S43: modifying, based on the trained unlabeled text model and unlabeled acoustic model, the weight parameters between the unlabeled text model and the unlabeled acoustic model, generating the speech synthesis model.
When terminal detect concern without mark character features vector characterization information when, according to the attention of encoder model Mechanism, which is obtained, ceases identical band mark Text eigenvector characterization information with without mark character features vector table reference.Obtaining band Mark Text eigenvector characterization information with mark acoustic feature vector characterization information and establish without mark character features vector Characterization information and with mark acoustic feature vector characterization information between mapping relations.Alternatively, when terminal concern is without mark acoustics When feature vector characterization information, terminal is obtained and according to the attention mechanism of decoder model without mark acoustic feature vector table Reference ceases identical band and marks acoustic feature vector characterization information.Getting band mark acoustic feature vector characterization information and band The mapping relations of character features vector characterization information are marked, are established without mark acoustic feature vector characterization information and with mark text Mapping relations between feature vector characterization information.
According to the mapping relationship established between the unlabeled acoustic feature vector representation information and the labeled text feature vector representation information, or the mapping relationship established between the unlabeled text feature vector representation information and the labeled acoustic feature vector representation information, the terminal trains the unlabeled text model and the unlabeled acoustic model, and then modifies and fine-tunes the weight parameters between the unlabeled text model and the unlabeled acoustic model to generate the speech synthesis model.
In the present embodiment, when the terminal detects that the vector representation information of the unlabeled text features attended to by the attention mechanism of the encoder model is identical to the vector representation information of the labeled text features, it trains the unlabeled text model based on the mapping relationship between the vector representation information of the labeled text features and the vector representation information of the labeled acoustic features. If it is detected that the vector representation information of the unlabeled acoustic features attended to by the attention mechanism of the decoder model is identical to the vector representation information of the labeled acoustic features, it trains the unlabeled acoustic model based on the mapping relationship between the vector representation information of the labeled text features and the vector representation information of the labeled acoustic features. Based on the trained unlabeled text model and unlabeled acoustic model, it modifies the weight parameters between the unlabeled text model and the unlabeled acoustic model to generate the speech synthesis model. A pre-trained model is obtained using a large amount of unlabeled speech data or text data, and only a small amount of labeled speech data and text data is needed to complete the construction of the speech synthesis model. No complicated and huge search network needs to be built, the consistency of text and audio in the speech synthesis training corpus can be effectively improved, and the prosodic style of the speech can be transferred to the trained model through transfer learning and fine-tuning.
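Steps S41 to S43 can likewise be sketched in code. The following continues the sketch above and is a hypothetical reading, not the patent's method: the "identical attended features" test is approximated with a cosine-similarity threshold, and the mapping relationship is reduced to a single learned linear projection; attended_matches, proj, and the threshold value are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Continuation of the sketch above; a hypothetical reading of steps S41-S43.

attention = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
proj = nn.Linear(256, 256)   # stands in for the text -> acoustic mapping relation
optimizer = torch.optim.Adam(
    list(text_model.parameters()) + list(speech_model.parameters())
    + list(proj.parameters()),
    lr=1e-4)                 # small learning rate: fine-tuning, not retraining

def attended_matches(unlabeled, labeled, threshold=0.9):
    # S41/S42 condition: attend over the unlabeled features with the labeled
    # ones as queries, then test whether the result matches the labeled side.
    # The threshold is an assumption, not the patent's exact criterion.
    attended, _ = attention(labeled, unlabeled, unlabeled)
    sim = nn.functional.cosine_similarity(attended, labeled, dim=-1)
    return sim.mean().item() > threshold

text_vec = text_model(labeled_tokens)      # gradients enabled this time
acoustic_vec = speech_model(labeled_mels)

# S41 (the S42 branch is symmetric on the decoder/acoustic side):
if attended_matches(text_vec, text_vectors):
    loss = nn.functional.mse_loss(proj(text_vec).mean(dim=1),
                                  acoustic_vec.mean(dim=1))
    loss.backward()

# S43: modify (fine-tune) the weight parameters between the two models
optimizer.step()
optimizer.zero_grad()
```

The low learning rate in the optimizer reflects the fine-tuning described in this embodiment: the pretrained weights are adjusted slightly rather than relearned, which is what allows a small labeled set to suffice.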
In addition, an embodiment of the present invention further provides a computer device, the computer device comprising: a memory, a processor, and a training program of a speech synthesis model stored on the memory and executable on the processor, wherein the training program of the speech synthesis model, when executed by the processor, implements the steps of the training method of the speech synthesis model of the embodiments described above.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, on which a training program of a speech synthesis model is stored, wherein the training program of the speech synthesis model, when executed by a processor, implements the steps of the training method of the speech synthesis model of the embodiments described above.
It should be noted that, in this document, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or system. In the absence of further limitations, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or system that includes that element.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the advantages or disadvantages of the embodiments.
Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and are not intended to limit the scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

1. A training method of a speech synthesis model, characterized in that the training method of the speech synthesis model comprises:
detecting first data to be trained and second data to be trained, and reading unlabeled text information and unlabeled speech information in the first data to be trained and labeled text information and labeled speech information in the second data to be trained, wherein the quantity of the first data to be trained is greater than the quantity of the second data to be trained;
constructing an unlabeled text model and an unlabeled speech model based on a preset encoder-decoder model;
obtaining vector representation information of labeled text features and vector representation information of labeled acoustic features based on the labeled text information and the labeled speech information;
training the unlabeled speech model according to the vector representation information of the labeled text features, training the unlabeled text model according to the vector representation information of the labeled acoustic features, and generating a speech synthesis model.
2. The training method of the speech synthesis model according to claim 1, characterized in that the constructing the unlabeled text model and the unlabeled speech model based on the preset encoder-decoder model comprises:
when reading the unlabeled text information and the unlabeled speech information, obtaining the preset encoder-decoder model, wherein the preset encoder-decoder model includes a preset encoder model and a preset decoder model;
training the preset encoder model based on the unlabeled text information, obtaining vector representation information of unlabeled text features, and constructing the unlabeled text model;
training the preset decoder model based on the unlabeled speech information, obtaining vector representation information of unlabeled acoustic features, and constructing the unlabeled speech model.
3. The training method of the speech synthesis model according to claim 2, characterized in that the training the preset encoder model based on the unlabeled text information, obtaining the vector representation information of the unlabeled text features, and constructing the unlabeled text model comprises:
when detecting that the unlabeled text information is used as an input value to train the preset encoder model, obtaining a preset lexical analysis, wherein the preset lexical analysis is the encoding rule of the preset encoder model;
based on the preset lexical analysis, obtaining the vector representation information of the unlabeled text features output by the preset encoder model;
constructing the unlabeled text model based on the vector representation information of the unlabeled text features and the unlabeled text information.
4. The training method of the speech synthesis model according to claim 2, characterized in that the training the preset decoder model based on the unlabeled speech information, obtaining the vector representation information of the unlabeled acoustic features, and constructing the unlabeled speech model comprises:
if it is detected that the unlabeled speech information is used as an input value to train the preset decoder model, obtaining a preset syntactic analysis, wherein the preset syntactic analysis is the decoding rule of the preset decoder model;
based on the preset syntactic analysis, obtaining the vector representation information of the unlabeled acoustic features output by the preset decoder model;
constructing the unlabeled speech model based on the vector representation information of the acoustic features and the unlabeled text information.
5. The training method of the speech synthesis model according to claim 1, characterized in that the obtaining the vector representation information of the labeled text features and the vector representation information of the labeled acoustic features based on the labeled text information and the labeled speech information comprises:
when reading the labeled text information and the labeled speech information, obtaining the unlabeled text model and the unlabeled speech model;
if it is detected that the labeled text information is used as an input value of the unlabeled text model, obtaining the preset lexical analysis;
based on the preset lexical analysis, obtaining the vector representation information of the labeled text features output by the unlabeled text model;
if it is detected that the labeled speech information is used as an input value of the unlabeled speech model, obtaining the preset syntactic analysis;
based on the preset syntactic analysis, obtaining the vector representation information of the labeled acoustic features output by the unlabeled speech model.
6. The training method of the speech synthesis model according to claim 2, characterized in that after the training the preset decoder model based on the unlabeled speech information, obtaining the vector representation information of the unlabeled acoustic features, and constructing the unlabeled speech model, the method further comprises:
obtaining, based on the attention mechanism of the encoder model, the vector representation information of the unlabeled text features attended to by the attention mechanism;
obtaining, based on the attention mechanism of the decoder model, the vector representation information of the unlabeled acoustic features attended to by the attention mechanism.
7. The training method of the speech synthesis model according to claim 6, characterized in that the training the unlabeled speech model according to the vector representation information of the labeled text features, training the unlabeled text model according to the vector representation information of the labeled acoustic features, and generating the speech synthesis model comprises:
if it is detected that the vector representation information of the unlabeled text features attended to by the attention mechanism of the encoder model is identical to the vector representation information of the labeled text features, training the unlabeled text model based on the mapping relationship between the vector representation information of the labeled text features and the vector representation information of the labeled acoustic features;
if it is detected that the vector representation information of the unlabeled acoustic features attended to by the attention mechanism of the decoder model is identical to the vector representation information of the labeled acoustic features, training the unlabeled acoustic model based on the mapping relationship between the vector representation information of the labeled text features and the vector representation information of the labeled acoustic features;
based on the unlabeled text model and the unlabeled acoustic model after training, modifying the weight parameters between the unlabeled text model and the unlabeled acoustic model to generate the speech synthesis model.
8. A training device of a speech synthesis model, characterized in that the training device of the speech synthesis model comprises:
a reading unit, configured to detect first data to be trained and second data to be trained, and read unlabeled text information and unlabeled speech information in the first data to be trained and labeled text information and labeled speech information in the second data to be trained, wherein the quantity of the first data to be trained is greater than the quantity of the second data to be trained;
a construction unit, configured to construct an unlabeled text model and an unlabeled speech model based on a preset encoder-decoder model;
an acquisition unit, configured to obtain vector representation information of labeled text features and vector representation information of labeled acoustic features based on the labeled text information and the labeled speech information;
a generation unit, configured to train the unlabeled speech model according to the vector representation information of the labeled text features, train the unlabeled text model according to the vector representation information of the labeled acoustic features, and generate a speech synthesis model.
9. A computer device, characterized in that the computer device comprises: a memory, a processor, and a training program of a speech synthesis model stored on the memory and executable on the processor, wherein the training program of the speech synthesis model, when executed by the processor, implements the steps of the training method of the speech synthesis model according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a training program of a speech synthesis model is stored on the computer-readable storage medium, wherein the training program of the speech synthesis model, when executed by a processor, implements the steps of the training method of the speech synthesis model according to any one of claims 1 to 7.
CN201910407683.5A 2019-05-16 2019-05-16 Training method, device, equipment and the storage medium of speech synthesis model Pending CN110148398A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910407683.5A CN110148398A (en) 2019-05-16 2019-05-16 Training method, device, equipment and the storage medium of speech synthesis model


Publications (1)

Publication Number Publication Date
CN110148398A (en) 2019-08-20

Family

ID=67594320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910407683.5A Pending CN110148398A (en) 2019-05-16 2019-05-16 Training method, device, equipment and the storage medium of speech synthesis model

Country Status (1)

Country Link
CN (1) CN110148398A (en)

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110767210A (en) * 2019-10-30 2020-02-07 四川长虹电器股份有限公司 Method and device for generating personalized voice
CN112908292A (en) * 2019-11-19 2021-06-04 北京字节跳动网络技术有限公司 Text voice synthesis method and device, electronic equipment and storage medium
CN111161703A (en) * 2019-12-30 2020-05-15 深圳前海达闼云端智能科技有限公司 Voice synthesis method with tone, device, computing equipment and storage medium
CN111161703B (en) * 2019-12-30 2023-06-30 达闼机器人股份有限公司 Speech synthesis method and device with language, computing equipment and storage medium
CN111128119A (en) * 2019-12-31 2020-05-08 云知声智能科技股份有限公司 Voice synthesis method and device
CN111276120A (en) * 2020-01-21 2020-06-12 华为技术有限公司 Speech synthesis method, apparatus and computer-readable storage medium
CN111276120B (en) * 2020-01-21 2022-08-19 华为技术有限公司 Speech synthesis method, apparatus and computer-readable storage medium
CN111627420B (en) * 2020-04-21 2023-12-08 升智信息科技(南京)有限公司 Method and device for synthesizing emotion voice of specific speaker under extremely low resource
CN111627420A (en) * 2020-04-21 2020-09-04 升智信息科技(南京)有限公司 Specific-speaker emotion voice synthesis method and device under extremely low resources
CN111883101A (en) * 2020-07-13 2020-11-03 北京百度网讯科技有限公司 Model training and voice synthesis method, device, equipment and medium
CN111883101B (en) * 2020-07-13 2024-02-23 北京百度网讯科技有限公司 Model training and speech synthesis method, device, equipment and medium
CN111951778A (en) * 2020-07-15 2020-11-17 天津大学 Method for synthesizing emotion voice by using transfer learning under low resource
CN111951778B (en) * 2020-07-15 2023-10-17 天津大学 Method for emotion voice synthesis by utilizing transfer learning under low resource
CN111949796A (en) * 2020-08-24 2020-11-17 云知声智能科技股份有限公司 Resource-limited language speech synthesis front-end text analysis method and system
CN111949796B (en) * 2020-08-24 2023-10-20 云知声智能科技股份有限公司 Method and system for analyzing front-end text of voice synthesis of resource-limited language
CN112309375A (en) * 2020-10-28 2021-02-02 平安科技(深圳)有限公司 Training test method, device, equipment and storage medium of voice recognition model
CN112309375B (en) * 2020-10-28 2024-02-23 平安科技(深圳)有限公司 Training test method, device, equipment and storage medium for voice recognition model
WO2022095743A1 (en) * 2020-11-03 2022-05-12 北京有竹居网络技术有限公司 Speech synthesis method and apparatus, storage medium, and electronic device
CN112509553A (en) * 2020-12-02 2021-03-16 出门问问(苏州)信息科技有限公司 Speech synthesis method, device and computer readable storage medium
CN112509553B (en) * 2020-12-02 2023-08-01 问问智能信息科技有限公司 Speech synthesis method, device and computer readable storage medium
CN112786003A (en) * 2020-12-29 2021-05-11 平安科技(深圳)有限公司 Speech synthesis model training method and device, terminal equipment and storage medium
CN113053357A (en) * 2021-01-29 2021-06-29 网易(杭州)网络有限公司 Speech synthesis method, apparatus, device and computer readable storage medium
CN113053357B (en) * 2021-01-29 2024-03-12 网易(杭州)网络有限公司 Speech synthesis method, apparatus, device and computer readable storage medium
CN113345410A (en) * 2021-05-11 2021-09-03 科大讯飞股份有限公司 Training method of general speech and target speech synthesis model and related device
CN113345410B (en) * 2021-05-11 2024-05-31 科大讯飞股份有限公司 Training method of general speech and target speech synthesis model and related device
CN113270090A (en) * 2021-05-19 2021-08-17 平安科技(深圳)有限公司 Combined model training method and device based on ASR model and TTS model
CN113257238A (en) * 2021-07-13 2021-08-13 北京世纪好未来教育科技有限公司 Training method of pre-training model, coding feature acquisition method and related device

Similar Documents

Publication Publication Date Title
CN110148398A (en) Training method, device, equipment and the storage medium of speech synthesis model
CN110288077B (en) Method and related device for synthesizing speaking expression based on artificial intelligence
CN108231059B (en) Processing method and device for processing
CN106611597B (en) Voice awakening method and device based on artificial intelligence
CN111106995B (en) Message display method, device, terminal and computer readable storage medium
WO2022178969A1 (en) Voice conversation data processing method and apparatus, and computer device and storage medium
US20130211838A1 (en) Apparatus and method for emotional voice synthesis
CN113010138B (en) Article voice playing method, device and equipment and computer readable storage medium
CN110210310A (en) A kind of method for processing video frequency, device and the device for video processing
CN107291704A (en) Treating method and apparatus, the device for processing
CN112071300B (en) Voice conversation method, device, computer equipment and storage medium
CN111862938A (en) Intelligent response method, terminal and computer readable storage medium
CN116682411A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
US9087512B2 (en) Speech synthesis method and apparatus for electronic system
KR20200069264A (en) System for outputing User-Customizable voice and Driving Method thereof
CN110781329A (en) Image searching method and device, terminal equipment and storage medium
CN108346424A (en) Phoneme synthesizing method and device, the device for phonetic synthesis
CN112242134A (en) Speech synthesis method and device
CN110781327B (en) Image searching method and device, terminal equipment and storage medium
CN111369975A (en) University music scoring method, device, equipment and storage medium based on artificial intelligence
CN111276118A (en) Method and system for realizing audio electronic book
CN116226411B (en) Interactive information processing method and device for interactive project based on animation
CN116959430A (en) Speech recognition method, device, electronic equipment and storage medium
CN116959409A (en) Recitation audio generation method, recitation audio generation device, computer equipment and storage medium
CN114299909A (en) Audio data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination