CN110148398A - Training method, device, equipment and the storage medium of speech synthesis model - Google Patents
- Publication number: CN110148398A
- Application number: CN201910407683.5A
- Authority
- CN
- China
- Prior art keywords
- mark
- model
- training
- preset
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/047—Architecture of speech synthesisers
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L25/30—Speech or voice analysis techniques characterised by the analysis technique using neural networks
Abstract
The present invention relates to the field of artificial intelligence and discloses a training method for a speech synthesis model. The method comprises: when first training data and second training data are detected, reading unlabeled text information and unlabeled speech information from the first training data, and labeled text information and labeled speech information from the second training data; constructing an unlabeled text model and an unlabeled speech model based on a preset encoder-decoder model; obtaining, based on the labeled text information and the labeled speech information, a vector characterization of labeled text features and a vector characterization of labeled acoustic features; training the unlabeled speech model with the vector characterization of labeled text features, and training the unlabeled text model with the vector characterization of labeled acoustic features, thereby generating the speech synthesis model. The invention also discloses a corresponding device, computer equipment and storage medium. The present invention obtains a pretrained model from a large amount of unlabeled speech or text data, so that only a small amount of labeled speech and text data is needed to complete construction of the speech synthesis model.
Description
Technical field
The present invention relates to the field of artificial intelligence, and more particularly to a training method for a speech synthesis model, a device, computer equipment and a computer-readable storage medium.
Background technique
Currently, with the rapid development of the Internet, people increasingly communicate through functions such as voice chat. To meet user needs, many customer-service systems conduct intelligent dialogue using artificially synthesized speech. It is therefore necessary to synthesize speech from text, converting text into voice, to satisfy users' conversational needs.
Training a current speech synthesis model requires building a large amount of high-quality speech and text data. In particular, a parametric end-to-end synthesis system needs tens of hours of target-speaker speech sampled together with the corresponding text. Establishing such a database consumes a great deal of time and labor. An existing training method extracts text features and acoustic features from a small-scale corpus of the target speaker and trains a deep neural network model for speech synthesis; however, it still requires several to tens of hours of high-quality labeled speech data from different speakers for supervised training, so data utilization remains low.
Summary of the invention
The main purpose of the present invention is to provide a training method for a speech synthesis model, aiming to solve the technical problem that the prior art must collect a large amount of labeled text and speech in order to train a deep neural network model for speech synthesis.
To achieve the above object, the present invention provides a training method for a speech synthesis model, the method comprising:
when first training data and second training data are detected, reading unlabeled text information and unlabeled speech information from the first training data, and labeled text information and labeled speech information from the second training data, wherein the quantity of the first training data is greater than the quantity of the second training data;
constructing an unlabeled text model and an unlabeled speech model based on a preset encoder-decoder model;
obtaining, based on the labeled text information and the labeled speech information, a vector characterization of labeled text features and a vector characterization of labeled acoustic features;
training the unlabeled speech model with the vector characterization of labeled text features, and training the unlabeled text model with the vector characterization of labeled acoustic features, to generate the speech synthesis model.
Optionally, constructing the unlabeled text model and the unlabeled speech model based on the preset encoder-decoder model comprises:
when the unlabeled text information and the unlabeled speech information are read, obtaining a preset encoder-decoder model, wherein the preset encoder-decoder model includes a preset encoder model and a preset decoder model;
training the preset encoder model with the unlabeled text information to obtain a vector characterization of unlabeled text features, and constructing the unlabeled text model;
training the preset decoder model with the unlabeled speech information to obtain a vector characterization of unlabeled acoustic features, and constructing the unlabeled speech model.
Optionally, training the preset encoder model with the unlabeled text information to obtain the vector characterization of unlabeled text features and construct the unlabeled text model comprises:
when the unlabeled text information is detected as the input for training the preset encoder model, obtaining a preset morphological analysis, wherein the preset morphological analysis is the coding rule of the preset encoder model;
based on the preset morphological analysis, obtaining the vector characterization of unlabeled text features output by the preset encoder model;
constructing the unlabeled text model based on the vector characterization of unlabeled text features and the unlabeled text information.
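To illustrate how a preset morphological analysis can act as the encoder's coding rule, the toy sketch below tokenizes unlabeled text and emits a bag-of-words vector characterization. The lowercase-and-split tokenizer and the count-vector scheme are assumptions for illustration only; the patent does not specify the actual analysis rule.

```python
def morphological_analysis(text):
    """Toy coding rule: lowercase the text and split into word tokens."""
    return text.lower().split()


def build_vocab(texts):
    """Collect the token vocabulary from the unlabeled text corpus."""
    vocab = {}
    for text in texts:
        for tok in morphological_analysis(text):
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode_text(text, vocab):
    """Vector characterization of text features: token counts over
    the vocabulary learned from unlabeled text."""
    vec = [0] * len(vocab)
    for tok in morphological_analysis(text):
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec
```

Any text, labeled or not, can then be mapped into the same vector space, which is what later allows the labeled text information to be fed through the unlabeled text model.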
Optionally, training the preset decoder model with the unlabeled speech information to obtain the vector characterization of unlabeled acoustic features and construct the unlabeled speech model comprises:
if the unlabeled speech information is detected as the input for training the preset decoder model, obtaining a preset syntactic analysis, wherein the preset syntactic analysis is the decoding rule of the preset decoder model;
based on the preset syntactic analysis, obtaining the vector characterization of unlabeled acoustic features output by the preset decoder model;
constructing the unlabeled speech model based on the vector characterization of the acoustic features and the unlabeled text information.
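By analogy with the text side, the decoder side can be pretrained on unlabeled speech alone, for example by learning frame statistics so that acoustic frames map into a normalized vector characterization and back. This round-trip sketch is purely illustrative: the statistics-based scheme below is an assumption standing in for the preset syntactic-analysis decoding rule, which the patent does not detail.

```python
def fit_frame_stats(frames):
    """Learn per-dimension mean and spread from unlabeled acoustic
    frames (the 'pretraining' of this toy speech model)."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[i] for f in frames) / n for i in range(dim)]
    spread = [max(max(abs(f[i] - mean[i]) for f in frames), 1e-9)
              for i in range(dim)]
    return mean, spread


def encode_frame(frame, stats):
    """Normalized vector characterization of one acoustic frame."""
    mean, spread = stats
    return [(x - m) / s for x, m, s in zip(frame, mean, spread)]


def decode_frame(code, stats):
    """Inverse mapping: reconstruct the frame from its characterization,
    as a decoder must when producing speech."""
    mean, spread = stats
    return [c * s + m for c, m, s in zip(code, mean, spread)]
```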
Optionally, obtaining the vector characterization of labeled text features and the vector characterization of labeled acoustic features based on the labeled text information and the labeled speech information comprises:
when the labeled text information and the labeled speech information are read, obtaining the unlabeled text model and the unlabeled speech model;
if the labeled text information is detected as the input of the unlabeled text model, obtaining the preset morphological analysis;
based on the preset morphological analysis, obtaining the vector characterization of labeled text features output by the unlabeled text model;
if the labeled speech information is detected as the input of the unlabeled speech model, obtaining the preset syntactic analysis;
based on the preset syntactic analysis, obtaining the vector characterization of labeled acoustic features output by the unlabeled speech model.
Optionally, after training the preset decoder model with the unlabeled speech information to obtain the vector characterization of unlabeled acoustic features and constructing the unlabeled speech model, the method further comprises:
based on the attention mechanism of the encoder model, obtaining the vector characterization of unlabeled text features attended to by the attention mechanism;
based on the attention mechanism of the decoder model, obtaining the vector characterization of unlabeled acoustic features attended to by the attention mechanism.
Optionally, training the unlabeled speech model with the vector characterization of labeled text features and training the unlabeled text model with the vector characterization of labeled acoustic features to generate the speech synthesis model comprises:
if the vector characterization of unlabeled text features attended to by the attention mechanism of the encoder model is detected to coincide with the vector characterization of labeled text features, training the unlabeled text model based on the mapping relation between the vector characterization of labeled text features and the vector characterization of labeled acoustic features;
if the vector characterization of unlabeled acoustic features attended to by the attention mechanism of the decoder model is detected to coincide with the vector characterization of labeled acoustic features, training the unlabeled acoustic model based on the mapping relation between the vector characterization of labeled text features and the vector characterization of labeled acoustic features;
based on the trained unlabeled text model and unlabeled acoustic model, modifying the weight parameters between the unlabeled text model and the unlabeled acoustic model to generate the speech synthesis model.
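The mapping relation between labeled text-feature vectors and labeled acoustic-feature vectors can be pictured as a small learned map fitted on the labeled pairs only. The sketch below fits a linear map by stochastic gradient descent; the linear form, learning rate and step count are illustrative assumptions, not the patent's weight-modification procedure.

```python
def fit_linear_map(text_vecs, acoustic_vecs, lr=0.05, steps=2000):
    """Learn weights W so that W @ text_vec approximates acoustic_vec
    on the small labeled set; these weights play the role of the
    modified parameters coupling the two pretrained models."""
    din, dout = len(text_vecs[0]), len(acoustic_vecs[0])
    w = [[0.0] * din for _ in range(dout)]
    for _ in range(steps):
        for x, y in zip(text_vecs, acoustic_vecs):
            pred = [sum(w[o][i] * x[i] for i in range(din))
                    for o in range(dout)]
            for o in range(dout):
                err = pred[o] - y[o]
                for i in range(din):
                    w[o][i] -= lr * err * x[i]  # squared-error gradient step
    return w


def apply_map(w, x):
    """Synthesize an acoustic-feature vector from a text-feature vector."""
    return [sum(row[i] * x[i] for i in range(len(x))) for row in w]
```

Because the map is trained only on the small labeled set while both endpoint representations come from pretraining, this mirrors the claimed benefit: most of the learning happens without labels.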
In addition, to achieve the above object, the present invention also provides a training device for a speech synthesis model, the training device comprising:
a reading unit, configured to, when first training data and second training data are detected, read unlabeled text information and unlabeled speech information from the first training data, and labeled text information and labeled speech information from the second training data, wherein the quantity of the first training data is greater than the quantity of the second training data;
a construction unit, configured to construct the unlabeled text model and the unlabeled speech model based on a preset encoder-decoder model;
an acquiring unit, configured to obtain the vector characterization of labeled text features and the vector characterization of labeled acoustic features based on the labeled text information and the labeled speech information;
a generation unit, configured to train the unlabeled speech model with the vector characterization of labeled text features and train the unlabeled text model with the vector characterization of labeled acoustic features, to generate the speech synthesis model.
Optionally, the construction unit is specifically configured to:
when the unlabeled text information and the unlabeled speech information are read, obtain a preset encoder-decoder model, wherein the preset encoder-decoder model includes a preset encoder model and a preset decoder model;
train the preset encoder model with the unlabeled text information to obtain a vector characterization of unlabeled text features, and construct the unlabeled text model;
train the preset decoder model with the unlabeled speech information to obtain a vector characterization of unlabeled acoustic features, and construct the unlabeled speech model.
Optionally, the construction unit further includes:
a first obtaining subunit, configured to, when the unlabeled text information is detected as the input for training the preset encoder model, obtain a preset morphological analysis, wherein the preset morphological analysis is the coding rule of the preset encoder model;
a second obtaining subunit, configured to obtain, based on the preset morphological analysis, the vector characterization of unlabeled text features output by the preset encoder model;
a first construction subunit, configured to construct the unlabeled text model based on the vector characterization of unlabeled text features and the unlabeled text information.
Optionally, the construction unit further includes:
a third obtaining subunit, configured to, if the unlabeled speech information is detected as the input for training the preset decoder model, obtain a preset syntactic analysis, wherein the preset syntactic analysis is the decoding rule of the preset decoder model;
a fourth obtaining subunit, configured to obtain, based on the preset syntactic analysis, the vector characterization of unlabeled acoustic features output by the preset decoder model;
a second construction subunit, configured to construct the unlabeled speech model based on the vector characterization of the acoustic features and the unlabeled text information.
Optionally, the acquiring unit is specifically configured to:
when the labeled text information and the labeled speech information are read, obtain the unlabeled text model and the unlabeled speech model;
if the labeled text information is detected as the input of the unlabeled text model, obtain the preset morphological analysis;
based on the preset morphological analysis, obtain the vector characterization of labeled text features output by the unlabeled text model;
if the labeled speech information is detected as the input of the unlabeled speech model, obtain the preset syntactic analysis;
based on the preset syntactic analysis, obtain the vector characterization of labeled acoustic features output by the unlabeled speech model.
Optionally, the training device of the speech synthesis model includes:
a first attention unit, configured to obtain, based on the attention mechanism of the encoder model, the vector characterization of unlabeled text features attended to by the attention mechanism;
a second attention unit, configured to obtain, based on the attention mechanism of the decoder model, the vector characterization of unlabeled acoustic features attended to by the attention mechanism.
Optionally, the generation unit is specifically configured to:
if the vector characterization of unlabeled text features attended to by the attention mechanism of the encoder model coincides with the vector characterization of labeled text features, train the unlabeled text model based on the mapping relation between the vector characterization of labeled text features and the vector characterization of labeled acoustic features;
if the vector characterization of unlabeled acoustic features attended to by the attention mechanism of the decoder model coincides with the vector characterization of labeled acoustic features, train the unlabeled acoustic model based on the mapping relation between the vector characterization of labeled text features and the vector characterization of labeled acoustic features;
based on the trained unlabeled text model and unlabeled acoustic model, modify the weight parameters between the unlabeled text model and the unlabeled acoustic model to generate the speech synthesis model.
In addition, to achieve the above object, the present invention also provides a computer equipment, the computer equipment comprising: a memory, a processor, and a training program for a speech synthesis model stored on the memory and runnable on the processor, wherein the training program, when executed by the processor, implements the steps of the training method of the speech synthesis model described above.
In addition, to achieve the above object, the present invention also provides a computer-readable storage medium, on which a training program for a speech synthesis model is stored, wherein the training program, when executed by a processor, implements the steps of the training method of the speech synthesis model described above.
With the training method, device, computer equipment and computer-readable storage medium for a speech synthesis model proposed by the embodiments of the present invention, when first training data and second training data are detected, unlabeled text information and unlabeled speech information are read from the first training data, and labeled text information and labeled speech information are read from the second training data, the quantity of the first training data being greater than that of the second; the unlabeled text model and the unlabeled speech model are constructed based on a preset encoder-decoder model; the vector characterization of labeled text features and the vector characterization of labeled acoustic features are obtained based on the labeled text information and the labeled speech information; the unlabeled speech model is trained with the vector characterization of labeled text features, and the unlabeled text model is trained with the vector characterization of labeled acoustic features, to generate the speech synthesis model. A pretrained model is thereby obtained from a large amount of unlabeled speech or text data, and only a small amount of labeled speech and text data is needed to complete construction of the speech synthesis model.
Brief description of the drawings
Fig. 1 is a schematic diagram of the terminal structure of the hardware operating environment involved in the embodiments of the present invention;
Fig. 2 is a flow diagram of the first embodiment of the training method of the speech synthesis model of the present invention;
Fig. 3 is a flow diagram of the second embodiment of the training method of the speech synthesis model of the present invention;
Fig. 4 is a detailed flow diagram of step S22 in Fig. 3;
Fig. 5 is a detailed flow diagram of step S23 in Fig. 3;
Fig. 6 is a detailed flow diagram of step S30 in Fig. 2;
Fig. 7 is a flow diagram of the third embodiment of the training method of the speech synthesis model of the present invention.
The realization of the object of the invention, its functional characteristics and its advantages will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Specific embodiments
It should be appreciated that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
The primary solution of the embodiments of the present invention is: when first training data and second training data are detected, reading unlabeled text information and unlabeled speech information from the first training data, and labeled text information and labeled speech information from the second training data, wherein the quantity of the first training data is greater than the quantity of the second training data; constructing an unlabeled text model and an unlabeled speech model based on a preset encoder-decoder model; obtaining the vector characterization of labeled text features and the vector characterization of labeled acoustic features based on the labeled text information and the labeled speech information; training the unlabeled speech model with the vector characterization of labeled text features, and training the unlabeled text model with the vector characterization of labeled acoustic features, to generate the speech synthesis model.
The prior art must collect a large amount of labeled text and speech to train a deep neural network model for speech synthesis. The present invention provides a solution that obtains a pretrained model from a large amount of unlabeled speech or text data, so that only a small amount of labeled speech and text data is needed to complete construction of the speech synthesis model.
As shown in Fig. 1, Fig. 1 is a schematic diagram of the terminal structure of the hardware operating environment involved in the embodiments of the present invention.
The terminal of the embodiments of the present invention may be a PC, or a portable terminal device with a display function such as a laptop computer.
As shown in Fig. 1, the terminal may include: a processor 1001 such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 realizes the connection and communication between these components. The user interface 1003 may include a display (Display) and an input unit such as a keyboard (Keyboard), and optionally may also include standard wired and wireless interfaces. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory, or a stable non-volatile memory such as a magnetic disk memory, and may optionally be a storage device independent of the aforementioned processor 1001.
Those skilled in the art will understand that the terminal structure shown in Fig. 1 does not limit the terminal, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
As shown in Fig. 1, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and the training program of the speech synthesis model.
In the terminal shown in Fig. 1, the network interface 1004 is mainly used to connect to a background server for data communication; the user interface 1003 is mainly used to connect to a client (user terminal) for data communication; and the processor 1001 may be used to call the training program of the speech synthesis model in the memory 1005 and perform the following operations:
when first training data and second training data are detected, reading unlabeled text information and unlabeled speech information from the first training data, and labeled text information and labeled speech information from the second training data, wherein the quantity of the first training data is greater than the quantity of the second training data;
constructing an unlabeled text model and an unlabeled speech model based on a preset encoder-decoder model;
obtaining the vector characterization of labeled text features and the vector characterization of labeled acoustic features based on the labeled text information and the labeled speech information;
training the unlabeled speech model with the vector characterization of labeled text features, and training the unlabeled text model with the vector characterization of labeled acoustic features, to generate the speech synthesis model.
Further, the processor 1001 may call the training program of the speech synthesis model stored in the memory 1005 and also perform the following operations:
when the unlabeled text information and the unlabeled speech information are read, obtaining a preset encoder-decoder model;
training the preset encoder model with the unlabeled text information to obtain a vector characterization of unlabeled text features, and constructing the unlabeled text model;
training the preset decoder model with the unlabeled speech information to obtain a vector characterization of unlabeled acoustic features, and constructing the unlabeled speech model.
Further, the processor 1001 may call the training program of the speech synthesis model stored in the memory 1005 and also perform the following operations:
when the unlabeled text information is detected as the input for training the preset encoder model, obtaining a preset morphological analysis, wherein the preset morphological analysis is the coding rule of the preset encoder model;
based on the preset morphological analysis, obtaining the vector characterization of unlabeled text features output by the preset encoder model;
constructing the unlabeled text model based on the vector characterization of unlabeled text features and the unlabeled text information.
Further, the processor 1001 may call the training program of the speech synthesis model stored in the memory 1005 and also perform the following operations:
if the unlabeled speech information is detected as the input for training the preset decoder model, obtaining a preset syntactic analysis, wherein the preset syntactic analysis is the decoding rule of the preset decoder model;
based on the preset syntactic analysis, obtaining the vector characterization of unlabeled acoustic features output by the preset decoder model;
constructing the unlabeled speech model based on the vector characterization of the acoustic features and the unlabeled text information.
Further, the processor 1001 may call the training program of the speech synthesis model stored in the memory 1005 and also perform the following operations:
when the labeled text information and the labeled speech information are read, obtaining the unlabeled text model and the unlabeled speech model;
if the labeled text information is detected as the input of the unlabeled text model, obtaining the preset morphological analysis;
based on the preset morphological analysis, obtaining the vector characterization of labeled text features output by the unlabeled text model;
if the labeled speech information is detected as the input of the unlabeled speech model, obtaining the preset syntactic analysis;
based on the preset syntactic analysis, obtaining the vector characterization of labeled acoustic features output by the unlabeled speech model.
Further, the processor 1001 can call the training program of the speech synthesis model stored in the memory 1005 to also execute the following operations:
Based on the attention mechanism of the encoder model, obtaining the vector characterization information of the unlabeled character features attended to by the attention mechanism;
Based on the attention mechanism of the decoder model, obtaining the vector characterization information of the unlabeled acoustic features attended to by the attention mechanism.
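The attention operations above can be sketched numerically. The following is a minimal illustration, not the patent's actual network: the attention mechanism scores each row of a feature weight matrix against a query vector and returns the attended vector characterization. The dot-product scoring, the softmax normalization, and all names are assumptions of this sketch.

```python
import math

def attend(weight_rows, query):
    """Score each feature row against the query (dot product) and return
    the softmax attention weights plus the attended characterization vector."""
    scores = [sum(w * q for w, q in zip(row, query)) for row in weight_rows]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]      # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    dim = len(weight_rows[0])
    attended = [sum(p * row[d] for p, row in zip(probs, weight_rows))
                for d in range(dim)]
    return probs, attended
```

A query close to one feature row concentrates the attention weight on that row, so the attended vector approaches that row's characterization.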
Further, the processor 1001 can call the training program of the speech synthesis model stored in the memory 1005 to also execute the following operations:
If detecting that the vector characterization information of the unlabeled character features attended to by the attention mechanism of the encoder model is identical to the vector characterization information of the labeled character features, training the unlabeled text model based on the mapping relation between the vector characterization information of the labeled character features and the vector characterization information of the labeled acoustic features;
If detecting that the vector characterization information of the unlabeled acoustic features attended to by the attention mechanism of the decoder model is identical to the vector characterization information of the labeled acoustic features, training the unlabeled acoustic model based on the mapping relation between the vector characterization information of the labeled character features and the vector characterization information of the labeled acoustic features;
Based on the trained unlabeled text model and unlabeled acoustic model, modifying the weight parameters between the unlabeled text model and the unlabeled acoustic model, generating the speech synthesis model.
Referring to Fig. 2, Fig. 2 shows the first embodiment of the training method of the speech synthesis model of the present invention. The training method of the speech synthesis model includes:
Step S10, when detecting a first set of training data and a second set of training data, reading the unlabeled text information and the unlabeled speech information in the first set of training data and the labeled text information and the labeled speech information in the second set of training data, wherein the quantity of the first set of training data is greater than the quantity of the second set of training data;
When the terminal detects the first set of training data and the second set of training data, it reads the unlabeled text information and the unlabeled speech information in the first set of training data and the labeled text information and the labeled speech information in the second set of training data. The first set of training data is the large quantity of unlabeled text information and speech information retrieved by the terminal; the second set of training data is the speech information of the target speaker obtained by the terminal, together with the text information corresponding to that speech information. The speech information of the target speaker serves as the labeled speech information, and the corresponding text information serves as the labeled text information; the labeled speech information and the labeled text information have a mapping relation, and the quantity of unlabeled text information and unlabeled speech information in the first set of training data is greater than the quantity of labeled text information and labeled speech information in the second set of training data.
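The two training sets described in step S10 can be sketched as a simple data-organization step. This is an illustrative sketch only; the dictionary layout, the function name, and all variable names are assumptions, not the patent's implementation.

```python
def build_training_sets(unlabeled_text, unlabeled_audio, labeled_pairs):
    """Return the first (unlabeled) and second (labeled) training sets.

    labeled_pairs: list of (audio, transcript) tuples for the target
    speaker, i.e. data whose audio-text mapping relation is known.
    """
    first_set = {"text": list(unlabeled_text), "audio": list(unlabeled_audio)}
    second_set = {"audio": [a for a, _ in labeled_pairs],
                  "text": [t for _, t in labeled_pairs]}
    # The embodiment assumes far more unlabeled than labeled data.
    assert len(first_set["text"]) > len(second_set["text"])
    return first_set, second_set
```

Keeping the labeled pairs aligned index-by-index preserves the mapping relation that step S40 later relies on.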
Step S20, based on a preset encoder-decoder model, constructing the unlabeled text model and the unlabeled speech model;
When the terminal reads the unlabeled text information, the unlabeled speech information, the labeled text information, and the labeled speech information, it extracts an initialization database. The initialization database can be a dictionary or a resource such as the Chinese Wikipedia. Using the dictionary or the Chinese Wikipedia, the preset encoder-decoder model decomposes the unlabeled text information, the unlabeled speech information, the labeled text information, and the labeled speech information, obtaining the character features of the unlabeled text information, the acoustic features of the unlabeled speech information, the character features of the labeled text information, and the acoustic features of the labeled speech information. The character features include word granularity, words, word length, and prosodic pauses; the acoustic features include spectral parameters, duration, and fundamental frequency.
When obtaining the character features of the unlabeled text information, the terminal obtains the preset encoder model. The unlabeled text information serves as the input value of the preset encoder model, and the unlabeled character features serve as the output value of the preset encoder model. According to the unlabeled text information and the corresponding unlabeled character features, the terminal modifies the weight parameters of the preset encoder model to construct the neural network model of the unlabeled text, where the type of the neural network model of the unlabeled text can be a convolutional neural network model or a recurrent neural network model.
When obtaining the acoustic features of the unlabeled speech information, the terminal obtains the preset decoder model. The unlabeled speech information serves as the input value of the preset decoder model, and the unlabeled acoustic features serve as the output value of the preset decoder model. According to the unlabeled speech information and the corresponding unlabeled acoustic features, the terminal modifies the weight parameters of the preset decoder model to construct the neural network model of the unlabeled speech, where the type of the neural network model of the unlabeled speech can be a convolutional neural network model or a recurrent neural network model.
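The weight-parameter modification described for the preset encoder can be illustrated with a toy stand-in. This sketch is not the patent's convolutional or recurrent network: it keeps one weight row per character and nudges those rows so that the encoding of an unlabeled sequence approaches a reconstruction target. The learning rate, the mean-pooling encoding, and all names are assumptions.

```python
import random

class ToyTextEncoder:
    """Toy stand-in for the preset encoder model: one weight vector per
    character, adjusted during pretraining on unlabeled text."""

    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-1, 1) for _ in range(dim)]
                        for _ in range(vocab_size)]
        self.dim = dim

    def encode(self, char_ids):
        # Sequence characterization = mean of the per-character weight rows.
        return [sum(self.weights[i][d] for i in char_ids) / len(char_ids)
                for d in range(self.dim)]

    def pretrain_step(self, char_ids, target, lr=0.1):
        # Modify the weight parameters so encode(char_ids) approaches target.
        err = [e - t for e, t in zip(self.encode(char_ids), target)]
        for i in char_ids:
            for d in range(self.dim):
                self.weights[i][d] -= lr * err[d] / len(char_ids)
        return sum(e * e for e in err) ** 0.5      # error norm for monitoring
```

Repeated pretraining steps shrink the error norm, which is the sense in which the unlabeled data alone shapes the model's weight parameters.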
Step S30, based on the labeled text information and the labeled speech information, obtaining the vector characterization information of the labeled character features and the vector characterization information of the labeled acoustic features;
When reading the labeled text information and the labeled speech information, the terminal uses the labeled text information as the input value of the unlabeled text model and the labeled speech information as the output value of the unlabeled speech model. When the labeled text information serves as the input value of the unlabeled text model, the character features of the labeled text information are obtained according to the morphological analysis in the unlabeled text model, and the unlabeled text model obtains the vector characterization information of the labeled character features based on the labeled text information and its character features. Specifically, based on its weight parameters, the unlabeled text model retrieves the vector characterization information of the features to be labeled in the weight matrix. When the labeled speech information serves as the output value of the unlabeled acoustic model, the acoustic features of the labeled speech information are obtained according to the syntactic analysis in the unlabeled speech model, and the unlabeled acoustic model obtains the vector characterization information of the labeled acoustic features based on the acoustic features of the labeled acoustic information and the labeled text information. Specifically, based on its weight parameters, the unlabeled acoustic model retrieves the vector characterization information of the acoustic features to be labeled in the weight matrix. The character features include word granularity, words, word length, and prosodic pauses; the acoustic features include spectral parameters, duration, and fundamental frequency.
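The "retrieve the vector characterization information in the weight matrix" step above amounts to a row lookup. A minimal sketch, assuming a list-of-rows matrix and a feature-to-index vocabulary; both are illustrative placeholders, not the patent's data structures.

```python
def lookup_characterization(weight_matrix, vocab, features):
    """Return the weight-matrix rows for the given labeled features.

    weight_matrix: list of row vectors, one per vocabulary entry.
    vocab: dict mapping a feature (e.g. a word) to its row index.
    """
    return [weight_matrix[vocab[f]] for f in features]
```

The returned rows are the vector characterization information of the labeled features under the pretrained model's weight parameters.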
Step S40, training the unlabeled speech model according to the vector characterization information of the labeled character features, and training the unlabeled text model according to the vector characterization information of the labeled acoustic features, generating the speech synthesis model.
When the terminal obtains the unlabeled text model and the unlabeled speech model, it obtains the vector characterization information of the unlabeled character features in the unlabeled text model and the vector characterization information of the unlabeled acoustic features in the unlabeled speech model. When the unlabeled character-feature vector characterization information is identical to the labeled character-feature vector characterization information, or the unlabeled acoustic-feature vector characterization information is identical to the labeled acoustic-feature vector characterization information, the terminal obtains, according to the mapping relation between the labeled character-feature vector characterization information and the labeled acoustic-feature vector characterization information, the mapping relation between the unlabeled character-feature vector characterization information and the labeled acoustic-feature vector characterization information, or between the unlabeled acoustic-feature vector characterization information and the labeled character-feature vector characterization information. For example, the terminal judges whether the unlabeled character-feature vector characterization information is identical to the labeled character-feature vector characterization information, or whether the unlabeled acoustic-feature vector characterization information is identical to the labeled acoustic-feature vector characterization information; when identical, it obtains the mapping relation between the unlabeled character-feature vector characterization information and the labeled acoustic-feature vector characterization information, or between the unlabeled acoustic-feature vector characterization information and the labeled character-feature vector characterization information. When obtaining the mapping relation between the unlabeled character-feature vector characterization information and the unlabeled acoustic-feature vector characterization information, the terminal trains the weight parameters of the unlabeled text model and the unlabeled speech model to generate the speech synthesis model.
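The matching-and-transfer logic of step S40 can be sketched in a few lines. This is an assumption-laden illustration: the exact-match test is approximated by an elementwise tolerance, and the function and parameter names are invented for this sketch.

```python
def transfer_mapping(unlabeled_vec, labeled_char_vec, labeled_acoustic_vec,
                     tol=1e-6):
    """If the unlabeled characterization matches the labeled character
    characterization, reuse the labeled pair's character-to-acoustic
    mapping for the unlabeled vector; otherwise return None."""
    same = all(abs(u, ) == () for u in ()) if False else all(
        abs(u - l) <= tol for u, l in zip(unlabeled_vec, labeled_char_vec))
    return labeled_acoustic_vec if same else None
```

When the characterizations coincide, the labeled pair's mapping relation supplies the acoustic target for the otherwise unlabeled example, which is what lets the small labeled set train the large unlabeled models.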
In the present embodiment, when the terminal detects the first and second sets of training data for the encoder-decoder model, it obtains the character features of the unlabeled text information and the acoustic features of the unlabeled speech information in the first set of training data, and the character features of the labeled text information and the acoustic features of the labeled speech information in the second set of training data. Based on the encoder-decoder model, it generates the unlabeled text model and the unlabeled speech model; based on the unlabeled text model and the unlabeled speech model, it obtains the vector characterization information of the labeled character features and the vector characterization information of the labeled acoustic features, and trains the unlabeled text model and the unlabeled speech model according to the mapping relation between the two, generating the speech synthesis model. A pre-trained model is obtained from a large quantity of unlabeled speech data or text data, and only a small quantity of labeled speech data and text data is needed to complete the construction of the speech synthesis model.
Further, referring to Fig. 3, Fig. 3 shows the second embodiment of the training method of the speech synthesis model of the present invention. Based on the embodiment shown in Fig. 2 above, step S20 includes:
Step S21, when reading the unlabeled text information and the unlabeled speech information, obtaining the preset encoder-decoder model;
Step S22, training the preset encoder model based on the unlabeled text information, obtaining the vector characterization information of the unlabeled character features, and constructing the unlabeled text model;
Step S23, training the preset decoder model based on the unlabeled speech information, obtaining the vector characterization information of the unlabeled acoustic features, and constructing the unlabeled speech model.
When reading the unlabeled text information and the unlabeled speech information, the terminal obtains the coding rule and the decoding rule of the preset encoder-decoder model. The coding rule of the preset encoder model includes morphological analyses such as word granularity, words, and word length; the decoding rule of the preset decoder model includes syntactic analyses such as prosodic pauses, spectral parameters, duration, and fundamental frequency. According to the morphological analyses such as word granularity, words, and word length in the preset encoder model, the terminal encodes the obtained unlabeled text information and retrieves the character features in the unlabeled text information, and constructs the unlabeled text model according to the character features in the unlabeled text information and the unlabeled text information. According to the syntactic analyses such as prosodic pauses, spectral parameters, duration, and fundamental frequency in the preset decoder model, the obtained unlabeled speech information is decoded and the acoustic features in the unlabeled speech information are retrieved; the terminal constructs the unlabeled speech model according to the acoustic features in the unlabeled speech information and the unlabeled speech information. According to the attention mechanism in the encoder model, the terminal retrieves the vector characterization information of the unlabeled character features attended to by the attention mechanism. Specifically, when the encoder model encodes the unlabeled text information, the attention mechanism in the encoder model attends to the unlabeled character features output by the encoder model; when the weight matrix of the unlabeled character features is obtained, the attention mechanism attends to the vector characterization information of the unlabeled character features in the upper- and lower-layer information of the weight matrix.
In the present embodiment, when obtaining the unlabeled text information and the unlabeled speech information, the terminal obtains the coding rule and the decoding rule of the preset encoder-decoder model, starts the preset encoder model to encode the unlabeled text information and generate the unlabeled text model, and starts the preset decoder model to decode the unlabeled speech information and generate the unlabeled speech model. The unlabeled text model and the unlabeled acoustic model are thus quickly generated according to the encoder-decoder model.
Referring to Fig. 4, Fig. 4 is the refined flow chart of step S22 in Fig. 3 above, and step S22 includes:
Step S221, when detecting that the unlabeled text information serves as the input value for training the preset encoder model, obtaining the preset morphological analysis;
Step S222, based on the preset morphological analysis, obtaining the vector characterization information of the unlabeled character features output by the preset encoder model;
Step S223, based on the vector characterization information of the unlabeled character features and the unlabeled text information, constructing the unlabeled text model.
When the terminal detects that the unlabeled text information serves as the input value for training the preset encoder model, it obtains the morphological analysis of the preset encoder model; the morphological analysis is also the coding rule of the preset encoder model. The preset encoder model encodes the unlabeled text information according to the preset coding rule and obtains the encoded vector characterization information of the unlabeled character features. When obtaining the vector characterization information of the unlabeled character features, the terminal adjusts the weight parameters in the encoder model according to the unlabeled text information and the vector characterization information of the unlabeled character features, constructing the unlabeled text model. When the preset coding rule encodes according to the words in the unlabeled text information, a word vector weight matrix of the unlabeled character features is generated; when the preset coding rule encodes according to the strokes of the words in the unlabeled text information, a stroke vector weight matrix of the unlabeled character features is generated. The terminal obtains the vector characterization information of the unlabeled character features according to the word vector weight matrix or the stroke vector weight matrix of the unlabeled character features. According to the attention mechanism in the decoder model, the terminal retrieves the vector characterization information of the unlabeled acoustic features attended to by the attention mechanism. Specifically, when the decoder model decodes the unlabeled speech information, the attention mechanism in the decoder model attends to the unlabeled acoustic features input to the decoder model; when the weight matrix of the unlabeled acoustic features is obtained, the attention mechanism attends to the vector characterization information of the unlabeled acoustic features in the upper- and lower-layer information of the weight matrix.
In the present embodiment, when the terminal detects that the unlabeled text information serves as the input value for training the preset encoder model, it obtains the morphological analysis of the preset encoder model; the preset encoder model encodes the unlabeled text information according to the preset coding rule and obtains the encoded vector characterization information of the unlabeled character features. When obtaining this vector characterization information, the terminal adjusts the weight parameters in the encoder model according to the unlabeled text information and the vector characterization information of the unlabeled character features, constructing the unlabeled text model. Through the coding rule of the encoder model, the training data quickly constructs the model.
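The two coding rules just described (per word versus per stroke) can be pictured as two differently keyed weight matrices. A minimal sketch under stated assumptions: the coding units, the random initialization, and the dict-of-rows layout are illustrative, not the patent's representation.

```python
import random

def build_weight_matrix(units, dim, seed=0):
    """One weight row per coding unit; `units` are words for the
    word-level coding rule or strokes for the stroke-level rule."""
    rng = random.Random(seed)
    return {u: [rng.uniform(-1, 1) for _ in range(dim)] for u in units}

# Word-level rule: a row per word; stroke-level rule: a row per stroke type.
word_matrix = build_weight_matrix(["speech", "model"], dim=4)
stroke_matrix = build_weight_matrix(["dot", "horizontal", "vertical"], dim=4)
```

Looking up a feature in the matching matrix yields the vector characterization information that step S222 refers to.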
Referring to Fig. 5, Fig. 5 is the refined flow chart of step S23 in Fig. 3, and step S23 includes:
Step S231, if detecting that the unlabeled speech information serves as the input value for training the preset decoder model, obtaining the preset syntactic analysis;
Step S232, based on the preset syntactic analysis, obtaining the vector characterization information of the unlabeled acoustic features output by the preset decoder model;
Step S233, based on the vector characterization information of the acoustic features and the unlabeled speech information, constructing the unlabeled speech model.
When the terminal detects that the unlabeled speech information serves as the input value for training the preset decoder model, it obtains the syntactic analysis of the preset decoder model; the syntactic analysis is also the decoding rule of the preset decoder model. The preset decoder model decodes the unlabeled speech information according to the preset decoding rule and obtains the decoded vector characterization information of the unlabeled acoustic features. When obtaining this vector characterization information, the terminal adjusts the weight parameters in the decoder model according to the unlabeled speech information and the vector characterization information of the unlabeled acoustic features, constructing the unlabeled speech model. When the preset decoding rule decodes according to the fundamental frequency of the unlabeled acoustic features, a fundamental-frequency vector weight matrix of the unlabeled acoustic features is generated; when the preset decoding rule decodes according to the duration of the unlabeled acoustic features, a duration vector weight matrix of the unlabeled acoustic features is generated. The terminal obtains the vector characterization information of the unlabeled acoustic features according to the fundamental-frequency vector weight matrix or the duration vector weight matrix of the unlabeled acoustic features. Through the decoding rule of the decoder model, the training data quickly constructs the model.
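The decoder-side rule above (decode by fundamental frequency versus by duration) can be sketched as two ways of summarizing a run of f0 frames. The summary statistics chosen here are assumptions for illustration; the patent does not fix them.

```python
def acoustic_weight_vector(frames, rule):
    """Summarize f0 frame values under the chosen decoding rule."""
    if rule == "f0":
        mean = sum(frames) / len(frames)
        var = sum((f - mean) ** 2 for f in frames) / len(frames)
        return [mean, var ** 0.5]          # fundamental-frequency statistics
    if rule == "duration":
        return [float(len(frames))]        # duration measured in frames
    raise ValueError("unknown decoding rule: " + rule)
```

Each rule yields a different weight vector for the same speech segment, matching the text's distinction between the fundamental-frequency and duration weight matrices.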
Referring to Fig. 6, Fig. 6 is the refined flow chart of step S30 in Fig. 2, and step S30 includes:
Step S31, when reading the labeled text information and the labeled speech information, obtaining the unlabeled text model and the unlabeled speech model;
Step S32, if detecting that the labeled text information serves as the input value of the unlabeled text model, obtaining the preset morphological analysis;
Step S33, based on the preset morphological analysis, obtaining the vector characterization information of the labeled character features output by the unlabeled text model;
Step S34, if detecting that the labeled speech information serves as the input value of the unlabeled speech model, obtaining the preset syntactic analysis;
Step S35, based on the preset syntactic analysis, obtaining the vector characterization information of the labeled acoustic features output by the unlabeled speech model.
When reading the labeled text information and the labeled speech information in the second set of training data, the terminal obtains the unlabeled text model and the unlabeled speech model. When the terminal detects that the labeled text information serves as the input value of the unlabeled text model, it obtains the preset morphological analysis and, based on the preset morphological analysis, obtains the vector characterization information of the labeled character features output by the unlabeled text model. When the terminal detects that the labeled speech information serves as the input value of the unlabeled speech model, it obtains the preset syntactic analysis and, based on the preset syntactic analysis, obtains the vector characterization information of the labeled acoustic features output by the unlabeled speech model. Specifically, when the terminal detects that the preset coding rule in the unlabeled text model encodes according to the words of the labeled character features, a word vector weight matrix of the labeled character features is generated; when the preset coding rule encodes according to the strokes of the labeled character features, a stroke vector weight matrix of the labeled character features is generated. The terminal obtains the vector characterization information of the labeled character features according to the word vector weight matrix or the stroke vector weight matrix of the labeled character features. When the terminal detects that the preset decoding rule in the unlabeled speech model decodes according to the fundamental frequency of the labeled acoustic features, a fundamental-frequency vector weight matrix of the labeled acoustic features is generated; when the preset decoding rule decodes according to the duration of the labeled acoustic features, a duration vector weight matrix of the labeled acoustic features is generated. The terminal obtains the vector characterization information of the labeled acoustic features according to the fundamental-frequency vector weight matrix or the duration vector weight matrix of the labeled acoustic features.
In the present embodiment, when reading the labeled text information and the labeled speech information in the second set of training data, the terminal obtains the unlabeled text model and the unlabeled speech model. When the terminal detects that the labeled text information serves as the input value of the unlabeled text model, it obtains the preset morphological analysis and, based on the preset morphological analysis, obtains the vector characterization information of the labeled character features output by the unlabeled text model. When the terminal detects that the labeled speech information serves as the input value of the unlabeled speech model, it obtains the preset syntactic analysis and, based on the preset syntactic analysis, obtains the vector characterization information of the labeled acoustic features output by the unlabeled speech model. Through the constructed models, the vector characterization information of the labeled acoustic features and the vector characterization information of the labeled character features are quickly obtained.
Referring to Fig. 7, Fig. 7 shows the third embodiment of the training method of the speech synthesis model of the present invention. Based on the embodiment shown in Fig. 2 above, step S40 includes:
Step S41, if detecting that the vector characterization information of the unlabeled character features attended to by the attention mechanism of the encoder model is identical to the vector characterization information of the labeled character features, training the unlabeled text model based on the mapping relation between the vector characterization information of the labeled character features and the vector characterization information of the labeled acoustic features;
Step S42, if detecting that the vector characterization information of the unlabeled acoustic features attended to by the attention mechanism of the decoder model is identical to the vector characterization information of the labeled acoustic features, training the unlabeled acoustic model based on the mapping relation between the vector characterization information of the labeled character features and the vector characterization information of the labeled acoustic features;
Step S43, based on the trained unlabeled text model and unlabeled acoustic model, modifying the weight parameters between the unlabeled text model and the unlabeled acoustic model, generating the speech synthesis model.
When the terminal detects the attended vector characterization information of the unlabeled character features, it obtains, according to the attention mechanism of the encoder model, the labeled character-feature vector characterization information identical to the unlabeled character-feature vector characterization information. When obtaining the mapping relation between the labeled character-feature vector characterization information and the labeled acoustic-feature vector characterization information, it establishes the mapping relation between the unlabeled character-feature vector characterization information and the labeled acoustic-feature vector characterization information. Alternatively, when the terminal attends to the unlabeled acoustic-feature vector characterization information, it obtains, according to the attention mechanism of the decoder model, the labeled acoustic-feature vector characterization information identical to the unlabeled acoustic-feature vector characterization information. When obtaining the mapping relation between the labeled acoustic-feature vector characterization information and the labeled character-feature vector characterization information, it establishes the mapping relation between the unlabeled acoustic-feature vector characterization information and the labeled character-feature vector characterization information.
According to the mapping relation established between the unlabeled acoustic-feature vector characterization information and the labeled character-feature vector characterization information, or the mapping relation established between the unlabeled character-feature vector characterization information and the labeled acoustic-feature vector characterization information, the terminal trains the unlabeled text model and the unlabeled acoustic model, and modifies and fine-tunes the weight parameters between the unlabeled text model and the unlabeled acoustic model to generate the speech synthesis model.
In the present embodiment, when the terminal detects that the vector characterization information of the unlabeled character features attended to by the attention mechanism of the encoder model is identical to the vector characterization information of the labeled character features, it trains the unlabeled text model based on the mapping relation between the vector characterization information of the labeled character features and the vector characterization information of the labeled acoustic features. If it detects that the vector characterization information of the unlabeled acoustic features attended to by the attention mechanism of the decoder model is identical to the vector characterization information of the labeled acoustic features, it trains the unlabeled acoustic model based on the mapping relation between the vector characterization information of the labeled character features and the vector characterization information of the labeled acoustic features. Based on the trained unlabeled text model and unlabeled acoustic model, the weight parameters between the unlabeled text model and the unlabeled acoustic model are modified to generate the speech synthesis model. A pre-trained model is obtained from a large quantity of unlabeled speech data or text data, and only a small quantity of labeled speech data and text data is needed to complete the construction of the speech synthesis model; no complicated and huge search network needs to be built, the consistency of text and audio in the speech synthesis training corpus can be effectively improved, and the prosodic style of the speech can be transferred to the trained model using transfer-learning and fine-tuning methods.
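The fine-tuning of the weight parameters between the two models described in step S43 can be sketched as a small gradient-style update of an inter-model mapping matrix. This is a sketch under stated assumptions: the squared-error objective, the learning rate, and all names are inventions of this illustration, not the patent's procedure.

```python
def finetune_step(mapping_w, char_vec, acoustic_vec, lr=0.05):
    """One update of the inter-model weights: move the mapped prediction
    of the character characterization toward the acoustic characterization."""
    pred = [sum(w * c for w, c in zip(row, char_vec)) for row in mapping_w]
    err = [p - a for p, a in zip(pred, acoustic_vec)]
    for i, e in enumerate(err):                 # outer-product style update
        for j, c in enumerate(char_vec):
            mapping_w[i][j] -= lr * e * c
    return sum(e * e for e in err) ** 0.5      # residual norm for monitoring
```

Iterating this step on the small labeled set is the transfer-learning phase: the pretrained unlabeled models stay fixed in this sketch, and only the weights between them are modified.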
In addition, an embodiment of the present invention also proposes a computer device. The computer device includes a memory, a processor, and a training program of the speech synthesis model stored on the memory and runnable on the processor; when the training program of the speech synthesis model is executed by the processor, the steps of the training method of the speech synthesis model of the embodiments above are realized.
In addition, an embodiment of the present invention also proposes a computer-readable storage medium. A training program of the speech synthesis model is stored on the computer-readable storage medium; when the training program of the speech synthesis model is executed by a processor, the steps of the training method of the speech synthesis model of the embodiments above are realized.
It should be noted that, in this document, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements not only includes those elements but also includes other elements not explicitly listed, or further includes elements inherent to such a process, method, article, or system. In the absence of further restrictions, an element limited by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or system that includes that element.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be realized by means of software and a necessary general-purpose hardware platform, and naturally also by hardware, but in many cases the former is the preferable implementation. Based on this understanding, the part of the technical solution of the present invention that in essence contributes to the prior art can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disc) and includes instructions for causing a terminal device (which can be a mobile phone, computer, server, air conditioner, network device, etc.) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and are not intended to limit the scope of the invention. Any equivalent structure or equivalent flow transformation made using the contents of the specification and drawings of the present invention, applied directly or indirectly in other related technical fields, is likewise included within the protection scope of the present invention.
Claims (10)
1. A method for training a speech synthesis model, wherein the method comprises:
detecting first to-be-trained data and second to-be-trained data, and reading unlabeled text information and unlabeled speech information from the first to-be-trained data, and labeled text information and labeled speech information from the second to-be-trained data, wherein the quantity of the first to-be-trained data is greater than the quantity of the second to-be-trained data;
constructing an unlabeled text model and an unlabeled speech model based on a preset encoder-decoder model;
obtaining vector characterization information of labeled character features and vector characterization information of labeled acoustic features based on the labeled text information and the labeled speech information;
training the unlabeled speech model according to the vector characterization information of the labeled character features, training the unlabeled text model according to the vector characterization information of the labeled acoustic features, and generating the speech synthesis model.
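The quantity constraint in claim 1 (the unlabeled first data set must outnumber the labeled second data set) can be sketched as follows; the data and function names are illustrative assumptions, not the patented implementation:

```python
def check_training_data(unlabeled_samples, labeled_samples):
    """Enforce the claim-1 constraint: the first (unlabeled) set of
    to-be-trained data must outnumber the second (labeled) set."""
    if len(unlabeled_samples) <= len(labeled_samples):
        raise ValueError("need more unlabeled than labeled samples")
    return len(unlabeled_samples), len(labeled_samples)

# Toy corpora: raw text and speech on one side, annotated pairs on the other.
unlabeled = ["hello world", "speech synthesis", "raw text", "raw audio"]
labeled = [("hello", [0.1, 0.2])]
n_unlabeled, n_labeled = check_training_data(unlabeled, labeled)
```

This reflects the low-resource setting the claims target: a large cheap unlabeled corpus trains the encoder and decoder, while a small labeled corpus aligns the two.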
2. the training method of speech synthesis model as described in claim 1, which is characterized in that described to be based on preset coding and decoding
Device model constructs described without mark text model and described without mark speech model, comprising:
Read it is described without mark text information and it is described without mark voice messaging when, obtain preset encoding-decoder model,
Wherein, the preset encoding-decoder model includes preset encoder model and preset decoder model;
Based on described without the preset encoder model of mark text information training, the vector table reference without mark character features is obtained
Breath constructs described without mark text model;
Based on described without the preset decoder model of mark voice messaging training, the vector table reference without mark acoustic feature is obtained
Breath constructs described without mark speech model.
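A minimal sketch of the two halves of claim 2, using NumPy; the encoder and decoder here are toy stand-ins (an embedding average and a linear projection, both assumed, not specified by the patent) whose outputs play the role of the vector characterization information:

```python
import numpy as np

rng = np.random.default_rng(0)

class Encoder:
    """Toy encoder: maps a sequence of character ids to a fixed-size
    vector characterization (mean of learned embeddings)."""
    def __init__(self, vocab_size, dim):
        self.embed = rng.normal(size=(vocab_size, dim))
    def __call__(self, char_ids):
        return self.embed[char_ids].mean(axis=0)

class Decoder:
    """Toy decoder: maps acoustic frames to a vector characterization
    through a single linear projection followed by mean pooling."""
    def __init__(self, frame_dim, dim):
        self.proj = rng.normal(size=(frame_dim, dim))
    def __call__(self, frames):
        return (frames @ self.proj).mean(axis=0)

enc, dec = Encoder(vocab_size=30, dim=8), Decoder(frame_dim=4, dim=8)
text_vec = enc(np.array([1, 5, 7]))          # unlabeled text -> character features
speech_vec = dec(rng.normal(size=(20, 4)))   # unlabeled speech -> acoustic features
```

Both characterizations land in the same 8-dimensional space, which is what later claims rely on when they compare and map the two feature types.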
3. the training method of speech synthesis model as claimed in claim 2, which is characterized in that described literary without mark based on described
The preset encoder model of this information training, obtains the vector characterization information without mark character features, constructs described without mark text
Model, comprising:
When detecting the preset encoder model trained as input value without mark text information, preset morphology point is obtained
Analysis, wherein the preset morphological analysis is the coding rule of the preset encoder model;
Based on the preset morphological analysis, the vector without mark character features of the preset encoder model output is obtained
Characterization information;
Based on it is described without mark character features vector characterization information and it is described without mark text information, construct it is described without mark text
This model.
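The patent does not define the "preset morphological analysis" beyond calling it the encoder's coding rule; one plausible, purely illustrative reading is a tokenization step that segments text and maps each token to an integer id the encoder can embed:

```python
def morphological_analysis(text):
    """Hypothetical stand-in for the 'preset morphological analysis' of
    claim 3: segment the text into tokens and map each token to an
    integer id that the encoder model can consume."""
    tokens = text.lower().split()
    vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
    return [vocab[tok] for tok in tokens]

ids = morphological_analysis("training a speech synthesis model")
```

A real system would use a fixed pronunciation-aware lexicon rather than a per-sentence vocabulary; the point here is only the text-to-id coding rule that precedes the encoder.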
4. the training method of speech synthesis model as claimed in claim 2, which is characterized in that it is described based on it is described without mark language
The preset decoder model of message breath training, obtains the vector characterization information without mark acoustic feature, constructs described without mark voice
Model, comprising:
If detect the preset decoder model trained as input value without mark voice messaging, preset syntax is obtained
Analysis, wherein the preset syntactic analysis is the decoding rule of the preset decoder model;
Based on the preset syntactic analysis, the vector without mark acoustic feature of the preset decoder model output is obtained
Characterization information;
Vector characterization information based on the acoustic feature and described without mark text information, constructs described without mark voice mould
Type.
5. the training method of speech synthesis model as described in claim 1, which is characterized in that described based on band mark text
This information and the band mark voice messaging, obtain the vector characterization information with mark character features and with mark acoustic feature
Vector characterization information, comprising:
When reading band mark text information and band mark voice messaging, obtain it is described without mark text model and
It is described without mark speech model;
If detecting using band mark text information as the input value without mark text model, the preset word is obtained
Method analysis;
Based on the preset morphological analysis, the vector characterization with mark character features without mark text model output is obtained
Information;
If detecting using the voice messaging to be marked as the input value without mark speech model, the preset sentence is obtained
Method analysis;
Based on the preset syntactic analysis, the vector characterization with mark acoustic feature without mark speech model output is obtained
Information.
6. the training method of speech synthesis model as claimed in claim 2, which is characterized in that it is described based on it is described without mark language
The preset decoder model of message breath training, obtains the vector characterization information without mark acoustic feature, constructs described without mark voice
After model, further includes:
Attention mechanism based on the encoder model obtains the described without mark character features of the attention mechanism concern
Vector characterization information;
Attention mechanism based on the decoder model obtains the described without mark acoustic feature of the attention mechanism concern
Vector characterization information.
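The attention mechanism invoked in claim 6 is not spelled out in the claims; a standard formulation it is compatible with is scaled dot-product attention, sketched here in NumPy (all shapes and names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, keys, values):
    """Scaled dot-product attention: returns the attended vector
    characterization plus the attention weights over the states."""
    scores = query @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores)
    return weights @ values, weights

rng = np.random.default_rng(1)
q = rng.normal(size=(8,))
k = rng.normal(size=(5, 8))   # 5 encoder states (character features)
v = rng.normal(size=(5, 8))
context, w = attention(q, k, v)
```

The `context` vector plays the role of the "vector characterization information attended to by the attention mechanism"; the weights sum to one, so it is a convex combination of the model's states.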
7. the training method of speech synthesis model as claimed in claim 6, which is characterized in that described to mark text according to the band
The vector characterization information training of word feature is described without mark speech model, and the vector table reference of acoustic feature is marked according to the band
Breath training is described without mark text model, generates speech synthesis model, comprising:
If detecting the vector characterization information without mark character features of the attention mechanism concern of the encoder model
The vector table reference manner of breathing for marking character features with the band simultaneously, the vector characterization information of character features is marked based on the band
Mapping relations between the vector characterization information of band mark acoustic feature, training are described without mark text model;
If detecting the vector characterization information without mark acoustic feature of the attention mechanism concern of the decoder model
The vector table reference manner of breathing for marking acoustic feature with the band simultaneously, the vector characterization information of character features is marked based on the band
Mapping relations between the vector characterization information of band mark acoustic feature, training are described without mark acoustic model;
Based on after training it is described without mark text model and it is described without mark acoustic model, modify it is described without mark text model
With the weight parameter without between mark acoustic model, the speech synthesis model is generated.
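Claim 7 hinges on "mapping relations" between labeled character-feature vectors and labeled acoustic-feature vectors but does not specify how they are fitted. One simple, assumed realization is a least-squares linear map:

```python
import numpy as np

def learn_mapping(char_vecs, acoustic_vecs):
    """Fit a linear map W with char_vecs @ W ≈ acoustic_vecs, standing in
    for the 'mapping relations' of claim 7 (the fitting procedure is an
    assumption; the patent does not specify one)."""
    W, *_ = np.linalg.lstsq(char_vecs, acoustic_vecs, rcond=None)
    return W

rng = np.random.default_rng(2)
chars = rng.normal(size=(10, 4))      # 10 labeled character-feature vectors
true_W = rng.normal(size=(4, 3))      # ground-truth map for this toy example
acoustic = chars @ true_W             # matching labeled acoustic vectors
W = learn_mapping(chars, acoustic)
```

Because the toy data are generated exactly by `true_W`, the least-squares solution recovers it; in practice the map (or the cross-model weight parameters the claim modifies) would be learned jointly with the models by gradient descent.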
8. A device for training a speech synthesis model, wherein the device comprises:
a reading unit, configured to detect first to-be-trained data and second to-be-trained data, and read unlabeled text information and unlabeled speech information from the first to-be-trained data, and labeled text information and labeled speech information from the second to-be-trained data, wherein the quantity of the first to-be-trained data is greater than the quantity of the second to-be-trained data;
a construction unit, configured to construct an unlabeled text model and an unlabeled speech model based on a preset encoder-decoder model;
an acquiring unit, configured to obtain vector characterization information of labeled character features and vector characterization information of labeled acoustic features based on the labeled text information and the labeled speech information;
a generation unit, configured to train the unlabeled speech model according to the vector characterization information of the labeled character features, train the unlabeled text model according to the vector characterization information of the labeled acoustic features, and generate the speech synthesis model.
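The four units of the device claim can be mirrored as a small class whose methods are drastically simplified stubs; every name and every stub computation below is an assumption made purely to show the unit boundaries:

```python
class SpeechSynthesisTrainingDevice:
    """Toy mirror of the four units in claim 8; method names are
    assumptions, and each unit is reduced to a stub computation."""

    def __init__(self, unlabeled, labeled):
        # Reading unit: claim 8 requires more unlabeled than labeled data.
        if len(unlabeled) <= len(labeled):
            raise ValueError("need more unlabeled than labeled data")
        self.unlabeled, self.labeled = unlabeled, labeled

    def construct(self):
        # Construction unit: stand-ins for the unlabeled text/speech models.
        self.text_model = lambda text: [float(len(text))]
        self.speech_model = lambda frames: [sum(frames)]

    def acquire(self):
        # Acquiring unit: vector characterizations of a labeled pair.
        text, speech = self.labeled[0]
        return self.text_model(text), self.speech_model(speech)

    def generate(self):
        # Generation unit: bundle both models as the "synthesis model".
        return (self.text_model, self.speech_model)

dev = SpeechSynthesisTrainingDevice(["a", "b", "c"], [("hi", [0.1, 0.2])])
dev.construct()
char_vec, ac_vec = dev.acquire()
```

The structure, not the arithmetic, is the point: the device claim packages the method steps of claim 1 as cooperating units.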
9. A computer device, wherein the computer device comprises a memory, a processor, and a training program for a speech synthesis model that is stored on the memory and executable on the processor, wherein the training program, when executed by the processor, implements the steps of the method for training a speech synthesis model according to any one of claims 1 to 7.
10. A computer-readable storage medium, wherein a training program for a speech synthesis model is stored on the computer-readable storage medium, and the training program, when executed by a processor, implements the steps of the method for training a speech synthesis model according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910407683.5A CN110148398A (en) | 2019-05-16 | 2019-05-16 | Training method, device, equipment and the storage medium of speech synthesis model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110148398A true CN110148398A (en) | 2019-08-20 |
Family
ID=67594320
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910407683.5A Pending CN110148398A (en) | 2019-05-16 | 2019-05-16 | Training method, device, equipment and the storage medium of speech synthesis model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110148398A (en) |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110767210A (en) * | 2019-10-30 | 2020-02-07 | 四川长虹电器股份有限公司 | Method and device for generating personalized voice |
CN112908292A (en) * | 2019-11-19 | 2021-06-04 | 北京字节跳动网络技术有限公司 | Text voice synthesis method and device, electronic equipment and storage medium |
CN111161703A (en) * | 2019-12-30 | 2020-05-15 | 深圳前海达闼云端智能科技有限公司 | Voice synthesis method with tone, device, computing equipment and storage medium |
CN111161703B (en) * | 2019-12-30 | 2023-06-30 | 达闼机器人股份有限公司 | Speech synthesis method and device with language, computing equipment and storage medium |
CN111128119A (en) * | 2019-12-31 | 2020-05-08 | 云知声智能科技股份有限公司 | Voice synthesis method and device |
CN111276120A (en) * | 2020-01-21 | 2020-06-12 | 华为技术有限公司 | Speech synthesis method, apparatus and computer-readable storage medium |
CN111276120B (en) * | 2020-01-21 | 2022-08-19 | 华为技术有限公司 | Speech synthesis method, apparatus and computer-readable storage medium |
CN111627420B (en) * | 2020-04-21 | 2023-12-08 | 升智信息科技(南京)有限公司 | Method and device for synthesizing emotion voice of specific speaker under extremely low resource |
CN111627420A (en) * | 2020-04-21 | 2020-09-04 | 升智信息科技(南京)有限公司 | Specific-speaker emotion voice synthesis method and device under extremely low resources |
CN111883101A (en) * | 2020-07-13 | 2020-11-03 | 北京百度网讯科技有限公司 | Model training and voice synthesis method, device, equipment and medium |
CN111883101B (en) * | 2020-07-13 | 2024-02-23 | 北京百度网讯科技有限公司 | Model training and speech synthesis method, device, equipment and medium |
CN111951778A (en) * | 2020-07-15 | 2020-11-17 | 天津大学 | Method for synthesizing emotion voice by using transfer learning under low resource |
CN111951778B (en) * | 2020-07-15 | 2023-10-17 | 天津大学 | Method for emotion voice synthesis by utilizing transfer learning under low resource |
CN111949796A (en) * | 2020-08-24 | 2020-11-17 | 云知声智能科技股份有限公司 | Resource-limited language speech synthesis front-end text analysis method and system |
CN111949796B (en) * | 2020-08-24 | 2023-10-20 | 云知声智能科技股份有限公司 | Method and system for analyzing front-end text of voice synthesis of resource-limited language |
CN112309375A (en) * | 2020-10-28 | 2021-02-02 | 平安科技(深圳)有限公司 | Training test method, device, equipment and storage medium of voice recognition model |
CN112309375B (en) * | 2020-10-28 | 2024-02-23 | 平安科技(深圳)有限公司 | Training test method, device, equipment and storage medium for voice recognition model |
WO2022095743A1 (en) * | 2020-11-03 | 2022-05-12 | 北京有竹居网络技术有限公司 | Speech synthesis method and apparatus, storage medium, and electronic device |
CN112509553A (en) * | 2020-12-02 | 2021-03-16 | 出门问问(苏州)信息科技有限公司 | Speech synthesis method, device and computer readable storage medium |
CN112509553B (en) * | 2020-12-02 | 2023-08-01 | 问问智能信息科技有限公司 | Speech synthesis method, device and computer readable storage medium |
CN112786003A (en) * | 2020-12-29 | 2021-05-11 | 平安科技(深圳)有限公司 | Speech synthesis model training method and device, terminal equipment and storage medium |
CN113053357A (en) * | 2021-01-29 | 2021-06-29 | 网易(杭州)网络有限公司 | Speech synthesis method, apparatus, device and computer readable storage medium |
CN113053357B (en) * | 2021-01-29 | 2024-03-12 | 网易(杭州)网络有限公司 | Speech synthesis method, apparatus, device and computer readable storage medium |
CN113345410A (en) * | 2021-05-11 | 2021-09-03 | 科大讯飞股份有限公司 | Training method of general speech and target speech synthesis model and related device |
CN113345410B (en) * | 2021-05-11 | 2024-05-31 | 科大讯飞股份有限公司 | Training method of general speech and target speech synthesis model and related device |
CN113270090A (en) * | 2021-05-19 | 2021-08-17 | 平安科技(深圳)有限公司 | Combined model training method and device based on ASR model and TTS model |
CN113257238A (en) * | 2021-07-13 | 2021-08-13 | 北京世纪好未来教育科技有限公司 | Training method of pre-training model, coding feature acquisition method and related device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110148398A (en) | Training method, device, equipment and the storage medium of speech synthesis model | |
CN110288077B (en) | Method and related device for synthesizing speaking expression based on artificial intelligence | |
CN108231059B (en) | Processing method and device for processing | |
CN106611597B (en) | Voice awakening method and device based on artificial intelligence | |
CN111106995B (en) | Message display method, device, terminal and computer readable storage medium | |
WO2022178969A1 (en) | Voice conversation data processing method and apparatus, and computer device and storage medium | |
US20130211838A1 (en) | Apparatus and method for emotional voice synthesis | |
CN113010138B (en) | Article voice playing method, device and equipment and computer readable storage medium | |
CN110210310A (en) | A kind of method for processing video frequency, device and the device for video processing | |
CN107291704A (en) | Treating method and apparatus, the device for processing | |
CN112071300B (en) | Voice conversation method, device, computer equipment and storage medium | |
CN111862938A (en) | Intelligent response method, terminal and computer readable storage medium | |
CN116682411A (en) | Speech synthesis method, speech synthesis system, electronic device, and storage medium | |
US9087512B2 (en) | Speech synthesis method and apparatus for electronic system | |
KR20200069264A (en) | System for outputing User-Customizable voice and Driving Method thereof | |
CN110781329A (en) | Image searching method and device, terminal equipment and storage medium | |
CN108346424A (en) | Phoneme synthesizing method and device, the device for phonetic synthesis | |
CN112242134A (en) | Speech synthesis method and device | |
CN110781327B (en) | Image searching method and device, terminal equipment and storage medium | |
CN111369975A (en) | University music scoring method, device, equipment and storage medium based on artificial intelligence | |
CN111276118A (en) | Method and system for realizing audio electronic book | |
CN116226411B (en) | Interactive information processing method and device for interactive project based on animation | |
CN116959430A (en) | Speech recognition method, device, electronic equipment and storage medium | |
CN116959409A (en) | Recitation audio generation method, recitation audio generation device, computer equipment and storage medium | |
CN114299909A (en) | Audio data processing method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |