CN110264991A - Training method, speech synthesis method, device, equipment and storage medium for a speech synthesis model - Google Patents
Training method, speech synthesis method, device, equipment and storage medium for a speech synthesis model
- Publication number
- CN110264991A (application number CN201910420168.0A)
- Authority
- CN
- China
- Prior art keywords
- training
- vector
- style
- model
- speech synthesis
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
This application relates to the field of speech semantics, and specifically applies an attention mechanism and neural networks to speech synthesis. Disclosed are a training method for a speech synthesis model, a speech synthesis method, a device, equipment, and a storage medium. The training method includes: obtaining a data set, the data set including training text data and training voice data corresponding to the training text data; generating a training text vector according to the training text data; encoding the training voice data based on a first encoder to obtain a training embedding vector; labelling the training embedding vector based on an attention mechanism to obtain a training style vector; and performing model training on a preset neural network model according to the training text vector, the training voice data, and the training style vector, to obtain a speech synthesis model.
Description
Technical field
This application relates to the field of speech technology, and in particular to a training method for a speech synthesis model, a speech synthesis method, a device, equipment, and a storage medium.
Background
Speech synthesis, i.e. text-to-speech (TTS) technology, converts text information into speech. With the continuous development of speech synthesis technology, people increasingly demand diversity in synthesized speech. Synthesized speech should at best embody strong prosody and carry a specific style, such as an emotion-heavy storytelling style, a casual reading style, or expressive styles such as a horror style or a crosstalk style, so as to increase the diversity of synthesized speech and meet people's different needs.
However, current TTS models cannot precisely define style and have difficulty capturing the details of each style of speech. As a result, the intended style is not well embodied in the synthesized speech, which degrades the user experience.
Summary of the invention
This application provides a training method for a speech synthesis model, a speech synthesis method, a device, equipment, and a storage medium. The speech synthesis model obtained by the training method can synthesize speech with a specific style and rich emotional expressiveness, thereby improving the user experience.
In a first aspect, this application provides a training method for a speech synthesis model, the method comprising:
obtaining a data set, the data set including training text data and training voice data corresponding to the training text data;
generating a training text vector according to the training text data;
encoding the training voice data based on a first encoder to obtain a training embedding vector;
labelling the training embedding vector based on an attention mechanism to obtain a training style vector;
performing model training on a preset neural network model according to the training text vector, the training voice data, and the training style vector, to obtain a speech synthesis model.
In a second aspect, this application further provides a speech synthesis method, comprising:
obtaining a target text vector and a target voice style vector;
splicing the target text vector and the target voice style vector to obtain a target spliced vector;
inputting the target spliced vector into a speech synthesis model to output target synthesized voice data, the speech synthesis model being a model obtained by the training method of the speech synthesis model described above.
In a third aspect, this application further provides a training device for a speech synthesis model, the device comprising:
a data acquisition unit for obtaining a data set, the data set including training text data and training voice data corresponding to the training text data;
a vector generation unit for generating a training text vector according to the training text data;
a vector encoding unit for encoding the training voice data based on a first encoder to obtain a training embedding vector;
a vector acquisition unit for labelling the training embedding vector based on an attention mechanism to obtain a training style vector;
a model training unit for performing model training on a preset neural network model according to the training text vector, the training voice data, and the training style vector, to obtain a speech synthesis model.
In a fourth aspect, this application further provides a speech synthesis device, comprising:
a vector acquisition unit for obtaining a target text vector and a target voice style vector;
a vector splicing unit for splicing the target text vector and the target voice style vector to obtain a target spliced vector;
a data output unit for inputting the target spliced vector into a speech synthesis model to output target synthesized voice data, the speech synthesis model being a model obtained by the training method of the speech synthesis model described above.
In a fifth aspect, this application further provides a computer device, the computer device including a memory and a processor; the memory stores a computer program, and the processor, when executing the computer program, implements the training method of the speech synthesis model or the speech synthesis method described above.
In a sixth aspect, this application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the training method of the speech synthesis model or the speech synthesis method described above.
This application discloses a training method for a speech synthesis model, a speech synthesis method, a device, equipment, and a storage medium. The training voice data is encoded based on a first encoder to obtain a training embedding vector; the training embedding vector is labelled based on an attention mechanism to obtain a training style vector; and model training is performed on a preset neural network model according to the training text vector, the training voice data, and the training style vector, to obtain a speech synthesis model. The speech synthesis model obtained by this training method can synthesize natural target voice data with a specific speaking style: no longer mechanized speech, but speech with rich emotional expressiveness, thereby improving the user experience.
Brief description of the drawings
To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic flow diagram of a training method for a speech synthesis model provided by an embodiment of this application;
Fig. 2 is a schematic flow diagram of sub-steps of the training method in Fig. 1;
Fig. 3 is a schematic flow diagram of the pinyin conversion of the training text data in Fig. 1;
Fig. 4 is a schematic flow diagram of sub-steps of the training method in Fig. 1;
Fig. 5 is a schematic flow diagram of the construction step of the training style vector provided by an embodiment of this application;
Fig. 6 is a schematic flow diagram of sub-steps of the training method in Fig. 1;
Fig. 7 is a schematic flow diagram of training the model according to the training voice data and the training spliced vector, provided by an embodiment of this application;
Fig. 8 is a schematic flow diagram of the steps of a speech synthesis method provided by an embodiment of this application;
Fig. 9 is a schematic block diagram of a training device for a speech synthesis model provided by an embodiment of this application;
Fig. 10 is a schematic block diagram of subunits of the training device in Fig. 9;
Fig. 11 is a schematic block diagram of subunits of the training device in Fig. 9;
Fig. 12 is a schematic block diagram of a speech synthesis device provided by an embodiment of this application;
Fig. 13 is a schematic structural block diagram of a computer device provided by an embodiment of this application.
Detailed description of embodiments
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this application.
The flow charts shown in the drawings are only illustrative; they need not include all content and operations/steps, nor must they be executed in the described order. For example, some operations/steps may be decomposed, combined, or partially merged, so the actual order of execution may change according to the situation.
Embodiments of this application provide a training method, device, computer equipment, and storage medium for a speech synthesis model. The training method of the speech synthesis model can be used to synthesize voice data with a specific style.
Some embodiments of this application are described in detail below with reference to the accompanying drawings. In the absence of conflict, the following embodiments and the features in the embodiments can be combined with each other.
Referring to Fig. 1, Fig. 1 is a schematic flow diagram of the steps of a training method for a speech synthesis model provided by an embodiment of this application.
As shown in Fig. 1, the training method of the speech synthesis model specifically includes steps S110 to S150.
S110: Obtain a data set, the data set including training text data and training voice data corresponding to the training text data.
Specifically, the training text data is the text data used in the training stage to train the speech synthesis model. The training voice data is voice data corresponding to the training text data, labelled by developers.
S120: Generate a training text vector according to the training text data.
Specifically, after the training text data is obtained, vector conversion can be performed on the training text data to generate the training text vector.
As shown in Fig. 2, in one embodiment, the specific process of generating a training text vector according to the training text data, i.e. step S120, may include sub-steps S121 and S122.
S121: Perform pinyin conversion on the training text data to obtain a corresponding pinyin string.
In one embodiment, the specific process of performing pinyin conversion on the training text data is shown in Fig. 3: step S121 may include sub-steps S1211, S1212, and S1213.
S1211: Perform word segmentation on the training text data to obtain multiple word strings.
Performing word segmentation on the training text data to obtain multiple word strings may specifically include: performing sentence segmentation on the training text data to obtain several corresponding sentences, and then performing word segmentation on the several sentences to obtain multiple word strings.
Specifically, after the training text data is obtained, sentence segmentation can be performed on it, for example by cutting each piece of training text data into complete sentences according to punctuation marks. Word segmentation is then performed on each sentence to obtain multiple word strings. In one embodiment, each cut sentence can be segmented by a string-matching segmentation method.
For example, the string-matching segmentation method can be forward maximum matching, backward maximum matching, shortest-path segmentation, bidirectional maximum matching, and the like. Forward maximum matching segments the character string of a cut sentence from left to right; backward maximum matching segments it from right to left; bidirectional maximum matching performs matching in both directions (left to right and right to left) simultaneously; shortest-path segmentation requires that the number of words cut out of the character string of a sentence be minimal.
In other embodiments, word segmentation can also be performed on each cut sentence by a word-sense segmentation method, a machine-judged segmentation method that uses syntactic and semantic information to resolve ambiguity.
Illustratively, taking bidirectional maximum matching as an example, a Chinese dictionary containing a word collection is obtained. Suppose the length of the longest phrase in the dictionary is m. Substrings of length m in the cut sentence are matched against the words in the dictionary in both the forward and backward directions. If a substring matches no word in the dictionary, the substring length is gradually reduced and matching is rescanned, until the substring matches some word in the dictionary, finally obtaining multiple word strings.
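As a minimal sketch of the bidirectional maximum matching described above (with a toy lexicon standing in for the Chinese dictionary the passage assumes, and word count as the tie-breaking criterion, which the text does not specify):

```python
# Sketch of bidirectional maximum matching segmentation. The lexicon and
# the fewer-words tie-break rule are illustrative assumptions.
def forward_max_match(sentence, lexicon, max_len):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(sentence):
        for j in range(min(max_len, len(sentence) - i), 0, -1):
            if j == 1 or sentence[i:i + j] in lexicon:
                words.append(sentence[i:i + j])
                i += j
                break
    return words

def backward_max_match(sentence, lexicon, max_len):
    """Scan right to left, always taking the longest dictionary word."""
    words, i = [], len(sentence)
    while i > 0:
        for j in range(min(max_len, i), 0, -1):
            if j == 1 or sentence[i - j:i] in lexicon:
                words.insert(0, sentence[i - j:i])
                i -= j
                break
    return words

def bidirectional_max_match(sentence, lexicon):
    """Run both directions and prefer the segmentation with fewer words."""
    max_len = max((len(w) for w in lexicon), default=1)
    fwd = forward_max_match(sentence, lexicon, max_len)
    bwd = backward_max_match(sentence, lexicon, max_len)
    return fwd if len(fwd) <= len(bwd) else bwd
```

On the string "abcd" with the toy lexicon {"ab", "abc", "cd"}, the forward pass yields ["abc", "d"] and the backward pass ["ab", "cd"], illustrating why the two directions must be reconciled.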
S1212: Perform pinyin conversion on each word string to obtain the sub-pinyin string corresponding to each word string.
Illustratively, after word segmentation, training text information S yields N word strings FS1, FS2, ..., FSN. After pinyin conversion, the N word strings yield the corresponding sub-pinyin strings PS1, PS2, ..., PSN. For example, the word string "张三" (Zhangsan) is converted to the sub-pinyin string "zhang1san1", where the digit 1 indicates the first (high level) tone.
S1213: Splice the sub-pinyin strings to obtain the pinyin string.
Illustratively, the training text data is "你好吗" ("how are you"), which yields two word strings, "你好" and "吗", after segmentation. Pinyin conversion of the two word strings gives the two sub-pinyin strings "ni1hao3" and "ma0", where the digit 3 indicates the third (falling-rising) tone and the digit 0 indicates the neutral tone. Splicing the two sub-pinyin strings "ni1hao3" and "ma0" gives the pinyin string "ni1hao3ma0".
S122: Based on a character-number correspondence, convert the pinyin string into a number sequence, and store the number sequence as the training text vector.
In one embodiment, before the pinyin string is converted into a number sequence based on the character-number correspondence and the number sequence is stored as the training text vector, the method further includes: establishing the character-number correspondence according to a preset character sequence and a preset quantity of numbers.
Specifically, the character-number correspondence contains a character sequence and a number corresponding to each character in the sequence, one number per character. The characters can be letters, digits, a space, and so on.
In one embodiment, establishing the character-number correspondence according to the preset character sequence and the preset quantity of numbers specifically includes: obtaining the preset character sequence and the preset quantity of numbers; and labelling the characters in the character sequence with the numbers to obtain the character-number correspondence.
The preset quantity can be greater than or equal to the length of the character sequence.
Illustratively, the character sequence has 32 characters, each with a corresponding number. The 32 characters may include 26 English letters, the 5 digits 0, 1, 2, 3, 4, and a space, arranged in sequence. The 32 characters are labelled with the numbers 0-31, so that each character corresponds to one numeric label. Specifically, the character-number correspondence is shown in Table 1.
Table 1 is a schematic table of the character-number correspondence
For example, the training text data "你好吗" corresponds to the pinyin string "ni1hao3ma0". Based on the character-number correspondence in Table 1, the pinyin string "ni1hao3ma0" can be converted into the number sequence 13/8/27/7/0/14/29/12/0/26. This number sequence is stored as the training text vector (13, 8, 27, 7, 0, 14, 29, 12, 0, 26).
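The conversion above can be sketched directly. Since Table 1 itself is not reproduced in the text, the mapping below (letters a-z labelled 0-25, tone digits 0-4 labelled 26-30, space labelled 31) is reconstructed from the worked example:

```python
import string

# Character-number correspondence reconstructed from the "ni1hao3ma0"
# example: 26 letters, the tone digits 0-4, and a space, labelled 0-31.
CHARS = string.ascii_lowercase + "01234" + " "
CHAR_TO_ID = {c: i for i, c in enumerate(CHARS)}

def pinyin_to_sequence(pinyin):
    """Convert a pinyin string into its number sequence (the text vector)."""
    return [CHAR_TO_ID[c] for c in pinyin]
```

With this table, `pinyin_to_sequence("ni1hao3ma0")` reproduces the sequence (13, 8, 27, 7, 0, 14, 29, 12, 0, 26) from the example.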
S130: Encode the training voice data based on a first encoder to obtain a training embedding vector.
The first encoder consists of a convolutional neural network (CNN) followed by a recurrent neural network (RNN). The convolutional neural network may include multiple convolutional layers. For example, it may include 6 convolutional layers, each extracting convolutional features with kernels of the same size. Smaller kernels are generally used; for example, 3 × 3 kernels with a stride of 2 are enough to capture the phonetic features of the training voice data. The 6 convolutional layers use 32, 32, 64, 64, 128, and 128 output channels respectively, so the speech feature vector output by the last convolutional layer of the convolutional neural network is three-dimensional.
The recurrent neural network can be a unidirectional GRU with 128 hidden units. In one embodiment, a reshaping layer can be placed between the convolutional neural network and the recurrent neural network, combined with batch normalization (BN) and the ReLU (Rectified Linear Units) activation function, to adjust the output of the convolutional neural network to fit the input of the recurrent neural network.
Specifically, the log-Mel spectrum of the training voice data is input to the convolutional neural network, the output of the convolutional neural network is input to the recurrent neural network, and the output of the recurrent neural network is taken as the training embedding vector. In this way, the prosody of training voice signals of different lengths is converted into training embedding vectors of fixed length.
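The layer arithmetic of the six stride-2 convolutions can be sketched as follows. The channel progression comes from the text; the "same" padding assumption (so each stride-2 layer roughly halves both spatial dimensions) is ours, since the padding scheme is not stated:

```python
# Sketch of how six stride-2 conv layers shrink a (time, mels) spectrogram
# while channels grow to 32, 32, 64, 64, 128, 128. "Same" padding assumed.
def conv_output_shapes(time, mels, channels=(32, 32, 64, 64, 128, 128), stride=2):
    shapes = []
    t, f = time, mels
    for c in channels:
        # with "same" padding, stride 2 halves each dimension (rounding up)
        t = (t + stride - 1) // stride
        f = (f + stride - 1) // stride
        shapes.append((t, f, c))
    return shapes
```

For an illustrative 128-frame, 80-mel input, the last layer emits a (2, 2, 128) feature map, which is the three-dimensional (time × frequency × channel) output the text mentions before it is reshaped for the GRU.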
In one embodiment, before the training voice data is encoded based on the first encoder to obtain the training embedding vector, the method further includes: preprocessing the training voice data to obtain the corresponding Mel spectrum, so as to extract the phonetic features to which the human ear is sensitive.
Encoding the training voice data based on the first encoder to obtain the training embedding vector then includes: inputting the Mel spectrum into the first encoder, so that the first encoder encodes the Mel spectrum to obtain the training embedding vector.
Preprocessing the training voice data specifically includes: performing framing and windowing on the training voice data to obtain processed training voice data; and performing a frequency-domain transform on the processed training voice data to obtain the corresponding amplitude spectrum, which serves as the Mel spectrum.
Specifically, framing and windowing divide the training voice data according to a set frame length, e.g. 60 ms, and then apply a Hamming window to the segmented training voice data. Applying a Hamming window means multiplying the segmented voice signal by a window function, in order to facilitate Fourier expansion.
The frequency-domain transform applies a fast Fourier transform (FFT) to the framed and windowed training voice data to obtain the corresponding parameters. In this embodiment, the amplitude after the fast Fourier transform is taken as the amplitude spectrum. Of course, other parameters after the FFT can also be used, for example the amplitude plus phase information.
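The framing, Hamming windowing, and FFT steps above can be sketched as follows. The 60 ms frame length comes from the text; the 16 kHz sample rate (so 960 samples per frame) and the 50% hop are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of the preprocessing: frame the signal, apply a Hamming
# window, and keep the FFT magnitude (the amplitude spectrum). A Mel
# filterbank applied to this output would give the Mel spectrum.
def magnitude_spectrogram(signal, frame_len=960, hop=480):
    """Split the signal into Hamming-windowed frames; return per-frame FFT magnitudes."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))
```

For a real-valued frame of 960 samples, `rfft` yields 481 frequency bins per frame, so a 1920-sample signal produces a (3, 481) amplitude spectrum.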
S140: Label the training embedding vector based on an attention mechanism to obtain a training style vector.
Specifically, style labelling can be performed on the training embedding vector according to the attention mechanism to obtain the training style vector.
As shown in Fig. 4, in one embodiment, labelling the training embedding vector based on the attention mechanism to obtain the training style vector specifically includes sub-steps S141 to S143.
S141: Obtain multiple initial voice style vectors.
Illustratively, there are four style types: reading style, crosstalk style, horror style, and storytelling style. The initial voice style vector of the reading style is (1, 0, 0, 0), that of the crosstalk style is (0, 1, 0, 0), that of the horror style is (0, 0, 1, 0), and that of the storytelling style is (0, 0, 0, 1).
It should be noted that in other embodiments the number of style types may also be one, two, three, five, or more.
S142: Calculate the similarity between the training embedding vector and each initial voice style vector according to the attention mechanism.
In one embodiment, calculating the similarity between the training embedding vector and each initial voice style vector according to the attention mechanism specifically includes: inputting the training embedding vector and the initial voice style vectors into an attention model to output the similarity between the training embedding vector and each initial voice style vector.
The attention model uses a multi-head attention mechanism and a softmax activation function to imitate how humans express linguistic information in natural speech. That is, when humans express linguistic information in speech with style, they allocate different amounts of attention to each audio fragment in the voice data because of the style: people more easily notice the one or more style types relevant to the voice data, and ignore the other, unrelated style types.
Specifically, the multi-head attention mechanism includes multiple dot-product attention mechanisms; for example, it may include 8 dot-product attention heads.
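One way to realize the softmax similarity of a single dot-product attention head is sketched below. The actual model uses 8 heads and learned projections; this unprojected, single-head version is an illustrative simplification:

```python
import math

# Sketch of one dot-product attention head: similarity of the embedding
# to each style token is a scaled dot product passed through softmax,
# so the resulting weights are positive and sum to one.
def attention_weights(embedding, style_tokens):
    """Return the softmax-normalized similarity to each style token."""
    d = len(embedding)
    scores = [sum(e * t for e, t in zip(embedding, tok)) / math.sqrt(d)
              for tok in style_tokens]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The softmax step is what lets these similarities serve directly as the attention weights of the next sub-step: they form a valid weighting over the style types.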
S143: Construct the training style vector according to the similarity of each initial voice style vector.
Specifically, the training style vector describes the attention allocation between each style type and the training voice data. Constructing the training style vector according to the similarity of each initial voice style vector, as shown in Fig. 5, specifically includes sub-step S1431: take the similarity corresponding to each initial voice style vector as the attention weight of that initial voice style vector, and compute the weighted sum of the initial voice style vectors to obtain the training style vector.
Illustratively, there are four style types: reading style, crosstalk style, horror style and storytelling style. The initial speech style vector A of the reading style is (1,0,0,0), the initial speech style vector B of the crosstalk style is (0,1,0,0), the initial speech style vector C of the horror style is (0,0,1,0), and the initial speech style vector D of the storytelling style is (0,0,0,1). For example, for training voice data with the crosstalk style, the similarity between the training embedding vector and the initial speech style vector of the reading style is 0.1, that of the crosstalk style is 0.8, that of the horror style is 0.0, and that of the storytelling style is 0.1.
Taking the similarity corresponding to each initial speech style vector as that vector's attention weight: the initial speech style vector of the reading style contributes to the training voice data with attention weight 0.1, that of the crosstalk style with weight 0.8, that of the horror style with weight 0.0, and that of the storytelling style with weight 0.1. Computing the weighted sum of the initial speech style vectors under these attention weights gives the training style vector corresponding to the crosstalk-style training voice data: 0.1 × A + 0.8 × B + 0.0 × C + 0.1 × D.
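The weighted sum above can be written in a few lines. This sketch uses the one-hot style vectors and the crosstalk-style similarities from the worked example; the variable names are illustrative only.

```python
import numpy as np

# One-hot initial speech style vectors for the four style types,
# rows in order: A (reading), B (crosstalk), C (horror), D (storytelling).
style_vectors = np.eye(4)

# Similarities output by the attention model for a crosstalk-style utterance,
# used directly as attention weights.
weights = np.array([0.1, 0.8, 0.0, 0.1])

# Training style vector = 0.1*A + 0.8*B + 0.0*C + 0.1*D.
training_style_vector = weights @ style_vectors
```

Because the initial speech style vectors are one-hot here, the training style vector simply equals the weight vector; with learned (non-one-hot) style vectors the same weighted sum blends them.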
It should be noted that in other embodiments the number of style types may be one, two, three, five or more.
S150, perform model training on a preset neural network model according to the training text vector, the training voice data and the training style vector, to obtain a speech synthesis model.
In one embodiment, the preset neural network model may be a Tacotron model, with a maximum-likelihood function as the objective function. The Tacotron model comprises an encoder, an attention mechanism and a decoder. In other embodiments, the neural network model may also be another deep-learning model, such as a GoogLeNet model. The Tacotron model is used as an example below.
In one embodiment, the detailed process of performing model training on the preset neural network model according to the training text vector, the training voice data and the training style vector is shown in Fig. 6; that is, step S150 comprises sub-steps S151 and S152.
S151, concatenate the training text vector and the training style vector, to obtain a training spliced vector.
Illustratively, with training text vector A = (a1, a2, a3, a4) and training style vector B = (b1, b2, b3, b4), concatenating A and B gives the training spliced vector C = (a1, a2, a3, a4, b1, b2, b3, b4).
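The concatenation step is straightforward; the numeric values below are placeholders standing in for (a1..a4) and (b1..b4).

```python
import numpy as np

text_vector = np.array([1, 2, 3, 4])   # training text vector A = (a1, a2, a3, a4)
style_vector = np.array([5, 6, 7, 8])  # training style vector B = (b1, b2, b3, b4)

# The training spliced vector simply places B after A.
spliced = np.concatenate([text_vector, style_vector])
```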
S152, perform model training on the neural network model according to the training voice data and the training spliced vector, to obtain the speech synthesis model.
Specifically, the training spliced vector is input into the preset neural network model, which outputs training synthesized voice data. The training synthesized voice data is compared with the training voice data, and the parameters of the neural network model are adjusted according to a preset loss function, to obtain the speech synthesis model.
In one embodiment, the detailed process of performing model training on the neural network model according to the training voice data and the training spliced vector is shown in Fig. 7; that is, step S152 comprises sub-steps S1521, S1522 and S1523.
S1521, input the training spliced vector into the neural network model, to output training synthesized voice data.
Specifically, after concatenation yields the training spliced vector, the training spliced vector is input into the above neural network model, which outputs the training synthesized voice data.
S1522, calculate the voice similarity between the training synthesized voice data and the training voice data.
Specifically, the training synthesized voice data and the training voice data are input into a pre-trained similarity model, which outputs the voice similarity of the two. The similarity model may be, for example, a convolutional neural network model.
S1523, calculate a loss value according to the voice similarity and a preset loss function, and adjust the parameters of the neural network model according to the loss value, to obtain the speech synthesis model.
Specifically, the loss function is used to estimate the degree of inconsistency between the training synthesized voice data and the training voice data; the closer the training synthesized voice data is to the training voice data, the smaller the loss value.
Illustratively, the preset loss function is a cross-entropy loss function. After the loss value is calculated from the voice similarity and the preset loss function, the parameters of the neural network model can be adjusted by backpropagation with stochastic gradient descent, to obtain the speech synthesis model. Backpropagation is the process of continually updating the weights and biases of the neural network model; when the loss value reaches 0 after a certain amount of training, the training synthesized voice data matches the training voice data, and the weights and biases are no longer updated.
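The loss-driven update in S1523 can be sketched with a toy differentiable model. This is purely illustrative: the Tacotron parameters, similarity model and cross-entropy loss are replaced by a small linear model with a squared-error loss, so that the gradient-descent weight and bias updates can be shown in a few lines.

```python
import numpy as np

rng = np.random.default_rng(42)
W = rng.normal(size=(8, 8))   # stand-in for the model's weights
b = np.zeros(8)               # stand-in for the model's biases
x = rng.normal(size=8)        # stand-in for the training spliced vector
target = rng.normal(size=8)   # stand-in for the training voice data
lr = 0.02                     # stochastic-gradient-descent step size

for step in range(500):
    synth = W @ x + b                  # "training synthesized voice data"
    err = synth - target
    loss = 0.5 * (err ** 2).sum()      # stand-in loss value
    # Backpropagation: gradients of the loss w.r.t. the weights and biases.
    W -= lr * np.outer(err, x)
    b -= lr * err
```

As the loss value shrinks toward 0, the synthesized output approaches the target and the parameter updates vanish, mirroring the convergence behaviour described above.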
In the above training method of the speech synthesis model, a data set is obtained; a training text vector is generated according to the training text data; the training voice data is encoded based on the first encoder to obtain a training embedding vector; the training embedding vector is labeled based on the attention mechanism to obtain a training style vector; and a preset neural network model is trained according to the training text vector, the training voice data and the training style vector, to obtain a speech synthesis model. The speech synthesis model obtained by this training method can synthesize natural target voice data; the synthesized target voice data has a specific speaking style and is no longer mechanized speech, and has rich emotional expressiveness, thereby improving the user experience.
Referring to Fig. 8, Fig. 8 is a schematic flow diagram of the steps of a speech synthesis method provided by an embodiment of the present application. As shown in Fig. 8, the speech synthesis method specifically comprises steps S210 to S230.
S210, obtain a target text vector and a target voice style vector.
In one embodiment, before the target text vector is obtained, the method further comprises: obtaining target text data. Specifically, the target text data may be news text, novel text, blog text, and the like.
Wherein, obtaining the target text vector specifically comprises: obtaining the target text vector according to the target text data.
In one embodiment, obtaining the target text vector according to the target text data specifically comprises: performing word segmentation on the target text data to obtain multiple target word strings; performing pinyin conversion on each target word string to obtain the target sub-pinyin string corresponding to each target word string; concatenating the target sub-pinyin strings to obtain the target pinyin string; and, based on a preset character-to-number correspondence, converting the target pinyin string into a target number sequence and storing the target number sequence as the target text vector.
Specifically, after the target text data is obtained, it may be segmented into sentences; for example, the target text data may be cut into complete sentences according to punctuation marks. Then, word segmentation is performed on each sentence to obtain multiple target word strings. In one embodiment, each segmented sentence may be processed by a string-matching word segmentation method.
For example, after word segmentation the target text data "good morning" yields two target word strings, "morning" and "good". Performing pinyin conversion on the two target word strings "morning" and "good" gives their corresponding target sub-pinyin strings "zao3shang4" and "hao3", where the digits denote tones. Concatenating the two target sub-pinyin strings gives the target pinyin string "zao3shang4hao3".
Illustratively, the character-to-number correspondence may be as shown in Table 1. For example, for the target text data "good morning", the corresponding target pinyin string is "zao3shang4hao3". Based on the character-to-number correspondence in Table 1, the target pinyin string "zao3shang4hao3" is converted into the target number sequence 25/0/14/29/18/7/0/13/6/30/7/0/14/29, which is stored as the target text vector (25, 0, 14, 29, 18, 7, 0, 13, 6, 30, 7, 0, 14, 29).
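Table 1 itself is not reproduced in this excerpt, but the worked example implies a simple mapping: letters "a" through "z" map to 0–25 and a tone digit d maps to 26 + d. The sketch below assumes that mapping and reproduces the sequence above.

```python
def pinyin_to_sequence(pinyin: str) -> list:
    """Convert a pinyin string into a number sequence using the mapping implied
    by the worked example: letters a-z -> 0..25, tone digit d -> 26 + d."""
    seq = []
    for ch in pinyin:
        if ch.isdigit():
            seq.append(26 + int(ch))  # tone digit
        else:
            seq.append(ord(ch) - ord("a"))  # lowercase pinyin letter
    return seq

target_text_vector = pinyin_to_sequence("zao3shang4hao3")
# Reproduces the sequence (25, 0, 14, 29, 18, 7, 0, 13, 6, 30, 7, 0, 14, 29).
```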
Wherein, obtaining the target voice style vector comprises: obtaining multiple initial speech style vectors; and computing a weighted sum of the initial speech style vectors according to the attention weight corresponding to each style type, to obtain the target voice style vector.
Illustratively, there are four style types: reading style, crosstalk style, horror style and storytelling style. The initial speech style vector A of the reading style is (1,0,0,0), the initial speech style vector B of the crosstalk style is (0,1,0,0), the initial speech style vector C of the horror style is (0,0,1,0), and the initial speech style vector D of the storytelling style is (0,0,0,1). Assume that for speech with the crosstalk style the attention weight corresponding to the reading style is 0.1, that of the crosstalk style is 0.8, that of the horror style is 0.0 and that of the storytelling style is 0.1; then the target voice style vector = 0.1 × A + 0.8 × B + 0.0 × C + 0.1 × D.
Wherein, the attention weights corresponding to the style types in voice data of a given style may be set manually in advance or obtained by prior training. For example, if the user wants to synthesize the target text data into speech with the crosstalk style, the attention weight corresponding to the reading style is set to 0.1, that of the crosstalk style to 0.8, that of the horror style to 0.0 and that of the storytelling style to 0.1.
As another example, to synthesize the target text data into speech with the storytelling style, the attention weight corresponding to the reading style is set to 0.02, that of the crosstalk style to 0.05, that of the horror style to 0.08 and that of the storytelling style to 0.85.
It should be noted that in other embodiments the number of style types may be one, two, three, five or more.
S220, concatenate the target text vector and the target voice style vector, to obtain a target spliced vector.
Illustratively, with target text vector X = (x1, x2, x3, x4) and target voice style vector Y = (y1, y2, y3, y4), concatenating X and Y gives the target spliced vector W = (x1, x2, x3, x4, y1, y2, y3, y4).
S230, input the target spliced vector into the speech synthesis model, to output target synthesized voice data.
Wherein, the speech synthesis model is a model obtained by the training method of the speech synthesis model described above. Specifically, the target spliced vector is input into the speech synthesis model, which outputs target synthesized voice data with a specific style, for example target synthesized voice data with the reading, crosstalk, horror or storytelling style.
It should be understood that the target spliced vector may be a sequence of segments in time, so the speech synthesis model can also synthesize target synthesized voice data with specific styles segment by segment. For example, the sequential target spliced vector may be split into two segments, synthesized with the reading style and the storytelling style respectively, thereby exhibiting a change of style over the course of synthesizing the target text data into speech.
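The segment-by-segment idea can be sketched as follows. Here `model` is a hypothetical stand-in for the trained speech synthesis model, the style vectors are the one-hot tokens from the example above, and the weight sets are the illustrative reading-style and storytelling-style weights; all names and values are assumptions for illustration.

```python
import numpy as np

def synthesize_with_styles(text_segments, style_weight_sets, style_vectors, model):
    """Build a target spliced vector per segment (text vector + weighted sum of
    the initial speech style vectors) and synthesize each segment in turn."""
    outputs = []
    for text_vec, weights in zip(text_segments, style_weight_sets):
        style_vec = np.asarray(weights) @ style_vectors  # target voice style vector
        spliced = np.concatenate([text_vec, style_vec])  # target spliced vector
        outputs.append(model(spliced))
    return outputs

# Hypothetical stand-in for the trained speech synthesis model.
model = lambda spliced: spliced.sum()

style_vectors = np.eye(4)  # reading, crosstalk, horror, storytelling
segments = [np.ones(4), np.ones(4)]
# First segment read aloud, second segment in the storytelling style.
weight_sets = [(1.0, 0.0, 0.0, 0.0), (0.02, 0.05, 0.08, 0.85)]
audio = synthesize_with_styles(segments, weight_sets, style_vectors, model)
```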
The above speech synthesis method can synthesize natural target voice data; the synthesized target voice data has a specific speaking style and is no longer mechanized speech, and has rich emotional expressiveness, thereby improving the user experience.
Referring to Fig. 9, Fig. 9 is a schematic block diagram of a training device for a speech synthesis model also provided by an embodiment of the present application; the training device of the speech synthesis model is used to execute any of the foregoing training methods of the speech synthesis model. The training device of the speech synthesis model may be configured in a server or a terminal.
Wherein, the server may be an independent server or a server cluster. The terminal may be an electronic device such as a mobile phone, a tablet computer, a laptop, a desktop computer, a personal digital assistant or a wearable device.
As shown in Fig. 9, the training device 300 of the speech synthesis model comprises: a data acquisition unit 310, a vector generation unit 320, a vector encoding unit 330, a vector acquisition unit 340 and a model training unit 350.
The data acquisition unit 310 is configured to obtain a data set, the data set comprising training text data and training voice data corresponding to the training text data.
The vector generation unit 320 is configured to generate a training text vector according to the training text data.
The vector encoding unit 330 is configured to encode the training voice data based on a first encoder, to obtain a training embedding vector.
The vector acquisition unit 340 is configured to label the training embedding vector based on an attention mechanism, to obtain a training style vector.
The model training unit 350 is configured to perform model training on a preset neural network model according to the training text vector, the training voice data and the training style vector, to obtain a speech synthesis model.
As shown in Fig. 9, in one embodiment the vector generation unit 320 comprises a pinyin conversion subunit 321 and a vector storage subunit 322.
The pinyin conversion subunit 321 is configured to perform pinyin conversion on the training text data, to obtain a corresponding pinyin string.
The vector storage subunit 322 is configured to convert the pinyin string into a number sequence based on the character-to-number correspondence, and store the number sequence as the training text vector.
As shown in Fig. 10, in one embodiment the vector acquisition unit 340 comprises a style acquisition subunit 341, a similarity calculation subunit 342 and a vector construction subunit 343.
The style acquisition subunit 341 is configured to obtain multiple initial speech style vectors.
The similarity calculation subunit 342 is configured to calculate, according to the attention mechanism, the similarity between the training embedding vector and each initial speech style vector.
The vector construction subunit 343 is configured to construct the training style vector according to the similarity of each initial speech style vector.
In one embodiment, the vector construction subunit 343 is specifically configured to take the similarity corresponding to each initial speech style vector as that vector's attention weight and compute a weighted sum of the initial speech style vectors, to obtain the training style vector.
As shown in Fig. 11, in one embodiment the model training unit 350 comprises a concatenation subunit 351 and a model training subunit 352.
The concatenation subunit 351 is configured to concatenate the training text vector and the training style vector, to obtain a training spliced vector.
The model training subunit 352 is configured to perform model training on the neural network model according to the training voice data and the training spliced vector, to obtain the speech synthesis model.
Referring to Fig. 12, Fig. 12 is a schematic block diagram of a speech synthesis device provided by an embodiment of the present application; the speech synthesis device may be configured in a terminal or a server and is used to execute the foregoing speech synthesis method.
As shown in Fig. 12, the speech synthesis device 400 comprises: a vector acquisition unit 410, a vector concatenation unit 420 and a data output unit 430.
The vector acquisition unit 410 is configured to obtain a target text vector and a target voice style vector.
The vector concatenation unit 420 is configured to concatenate the target text vector and the target voice style vector, to obtain a target spliced vector.
The data output unit 430 is configured to input the target spliced vector into a speech synthesis model, to output target synthesized voice data; the speech synthesis model is a model obtained by the training method of the speech synthesis model described above.
It should be noted that, as will be apparent to those skilled in the art, for convenience and brevity of description, the specific working processes of the devices and units described above may refer to the corresponding processes in the foregoing embodiments of the training method of the speech synthesis model, and are not repeated here.
The above devices may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in Fig. 13.
Referring to Fig. 13, Fig. 13 is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device may be a server or a terminal.
Referring to Fig. 13, the computer device comprises a processor, a memory and a network interface connected through a system bus, wherein the memory may comprise a non-volatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program comprises program instructions which, when executed, cause the processor to execute a training method of a speech synthesis model.
The processor is used to provide computing and control capability, supporting the operation of the entire computer device.
The internal memory provides an environment for running the computer program stored in the non-volatile storage medium; when the computer program is executed by the processor, the processor is caused to execute a training method of a speech synthesis model.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will understand that the structure shown in Fig. 13 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may comprise more or fewer components than shown, combine certain components, or have a different component arrangement.
It should be understood that the processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. The general-purpose processor may be a microprocessor, or any conventional processor.
Wherein, the processor is configured to run a computer program stored in the memory, to implement the following steps:
obtaining a data set, the data set comprising training text data and training voice data corresponding to the training text data; generating a training text vector according to the training text data; encoding the training voice data based on a first encoder, to obtain a training embedding vector; labeling the training embedding vector based on an attention mechanism, to obtain a training style vector; and performing model training on a preset neural network model according to the training text vector, the training voice data and the training style vector, to obtain a speech synthesis model.
In one embodiment, when generating the training text vector according to the training text data, the processor is configured to implement:
performing pinyin conversion on the training text data, to obtain a corresponding pinyin string; and, based on the character-to-number correspondence, converting the pinyin string into a number sequence and storing the number sequence as the training text vector.
In one embodiment, when labeling the training embedding vector based on the attention mechanism to obtain the training style vector, the processor is configured to implement:
obtaining multiple initial speech style vectors; calculating, according to the attention mechanism, the similarity between the training embedding vector and each initial speech style vector; and constructing the training style vector according to the similarity of each initial speech style vector.
In one embodiment, when constructing the training style vector according to the similarity of each initial speech style vector, the processor is configured to implement:
taking the similarity corresponding to each initial speech style vector as that vector's attention weight, and computing a weighted sum of the initial speech style vectors to obtain the training style vector.
In one embodiment, when performing model training on the preset neural network model according to the training text vector, the training voice data and the training style vector to obtain the speech synthesis model, the processor is configured to implement:
concatenating the training text vector and the training style vector, to obtain a training spliced vector; and performing model training on the neural network model according to the training voice data and the training spliced vector, to obtain the speech synthesis model.
Wherein, in another embodiment, the processor is configured to run a computer program stored in the memory, to implement the following steps:
obtaining a target text vector and a target voice style vector; concatenating the target text vector and the target voice style vector, to obtain a target spliced vector; and inputting the target spliced vector into a speech synthesis model, to output target synthesized voice data, the speech synthesis model being a model obtained by the training method of the speech synthesis model of any of the above embodiments.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program; the computer program comprises program instructions which, when executed by the processor, implement the training method of the speech synthesis model or the speech synthesis method provided by any embodiment of the present application.
Wherein, the computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card or a flash card (Flash Card) provided on the computer device.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person familiar with the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and these modifications or substitutions shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A training method of a speech synthesis model, characterized by comprising:
obtaining a data set, the data set comprising training text data and training voice data corresponding to the training text data;
generating a training text vector according to the training text data;
encoding the training voice data based on a first encoder, to obtain a training embedding vector;
labeling the training embedding vector based on an attention mechanism, to obtain a training style vector; and
performing model training on a preset neural network model according to the training text vector, the training voice data and the training style vector, to obtain a speech synthesis model.
2. The training method of the speech synthesis model according to claim 1, characterized in that generating the training text vector according to the training text data comprises:
performing pinyin conversion on the training text data, to obtain a corresponding pinyin string; and
based on a character-to-number correspondence, converting the pinyin string into a number sequence, and storing the number sequence as the training text vector.
3. The training method of the speech synthesis model according to claim 1, characterized in that labeling the training embedding vector based on the attention mechanism to obtain the training style vector comprises:
obtaining multiple initial speech style vectors;
calculating, according to the attention mechanism, the similarity between the training embedding vector and each initial speech style vector; and
constructing the training style vector according to the similarity of each initial speech style vector.
4. The training method of the speech synthesis model according to claim 3, characterized in that constructing the training style vector according to the similarity of each initial speech style vector comprises:
taking the similarity corresponding to each initial speech style vector as the attention weight of that initial speech style vector, and computing a weighted sum of the initial speech style vectors, to obtain the training style vector.
5. The training method of the speech synthesis model according to claim 1, characterized in that performing model training on the preset neural network model according to the training text vector, the training voice data and the training style vector to obtain the speech synthesis model comprises:
concatenating the training text vector and the training style vector, to obtain a training spliced vector; and
performing model training on the neural network model according to the training voice data and the training spliced vector, to obtain the speech synthesis model.
6. A speech synthesis method, characterized by comprising:
obtaining a target text vector and a target voice style vector;
concatenating the target text vector and the target voice style vector, to obtain a target spliced vector; and
inputting the target spliced vector into a speech synthesis model, to output target synthesized voice data, the speech synthesis model being a model obtained by the training method of the speech synthesis model according to any one of claims 1 to 5.
7. A training device of a speech synthesis model, characterized by comprising:
a data acquisition unit, configured to obtain a data set, the data set comprising training text data and training voice data corresponding to the training text data;
a vector generation unit, configured to generate a training text vector according to the training text data;
a vector encoding unit, configured to encode the training voice data based on a first encoder, to obtain a training embedding vector;
a vector acquisition unit, configured to label the training embedding vector based on an attention mechanism, to obtain a training style vector; and
a model training unit, configured to perform model training on a preset neural network model according to the training text vector, the training voice data and the training style vector, to obtain a speech synthesis model.
8. A speech synthesis device, characterized by comprising:
a vector acquisition unit, configured to obtain a target text vector and a target voice style vector;
a vector concatenation unit, configured to concatenate the target text vector and the target voice style vector, to obtain a target spliced vector; and
a data output unit, configured to input the target spliced vector into a speech synthesis model, to output target synthesized voice data, the speech synthesis model being a model obtained by the training method of the speech synthesis model according to any one of claims 1 to 5.
9. A computer device, characterized in that the computer device comprises a memory and a processor;
the memory is configured to store a computer program; and
the processor is configured to execute the computer program and, when executing the computer program, implement the training method of the speech synthesis model according to any one of claims 1 to 5 or the speech synthesis method according to claim 6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the training method of the speech synthesis model according to any one of claims 1 to 5 or the speech synthesis method according to claim 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910420168.0A CN110264991B (en) | 2019-05-20 | 2019-05-20 | Training method of speech synthesis model, speech synthesis method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110264991A true CN110264991A (en) | 2019-09-20 |
CN110264991B CN110264991B (en) | 2023-12-22 |
Family
ID=67914821
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910420168.0A Active CN110264991B (en) | 2019-05-20 | 2019-05-20 | Training method of speech synthesis model, speech synthesis method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110264991B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170076715A1 (en) * | 2015-09-16 | 2017-03-16 | Kabushiki Kaisha Toshiba | Training apparatus for speech synthesis, speech synthesis apparatus and training method for training apparatus |
CN109036375A (en) * | 2018-07-25 | 2018-12-18 | 腾讯科技(深圳)有限公司 | Speech synthesis method, model training method, device and computer equipment |
CN109616093A (en) * | 2018-12-05 | 2019-04-12 | 平安科技(深圳)有限公司 | End-to-end speech synthesis method, device, equipment and storage medium |
CN109767752A (en) * | 2019-02-27 | 2019-05-17 | 平安科技(深圳)有限公司 | Speech synthesis method and device based on attention mechanism |
Non-Patent Citations (1)
Title |
---|
Lei Ming et al., "Minimum Generation Error Training Method for Speech Synthesis Model Based on Perceptually Weighted Line Spectral Pair Distance", Pattern Recognition and Artificial Intelligence, vol. 23, no. 4, pages 572-579 *
Cited By (45)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110288973A (en) * | 2019-05-20 | 2019-09-27 | 平安科技(深圳)有限公司 | Speech synthesis method, device, equipment and computer readable storage medium |
CN110288973B (en) * | 2019-05-20 | 2024-03-29 | 平安科技(深圳)有限公司 | Speech synthesis method, device, equipment and computer readable storage medium |
CN110738026A (en) * | 2019-10-23 | 2020-01-31 | 腾讯科技(深圳)有限公司 | Method and device for generating description text |
CN110767217A (en) * | 2019-10-30 | 2020-02-07 | 爱驰汽车有限公司 | Audio segmentation method, system, electronic device and storage medium |
CN110767217B (en) * | 2019-10-30 | 2022-04-12 | 爱驰汽车有限公司 | Audio segmentation method, system, electronic device and storage medium |
CN110808027A (en) * | 2019-11-05 | 2020-02-18 | 腾讯科技(深圳)有限公司 | Speech synthesis method and device and news broadcasting method and system |
CN110808027B (en) * | 2019-11-05 | 2020-12-08 | 腾讯科技(深圳)有限公司 | Speech synthesis method and device and news broadcasting method and system |
CN112786001B (en) * | 2019-11-11 | 2024-04-09 | 北京地平线机器人技术研发有限公司 | Speech synthesis model training method, speech synthesis method and device |
CN112786001A (en) * | 2019-11-11 | 2021-05-11 | 北京地平线机器人技术研发有限公司 | Speech synthesis model training method, speech synthesis method and device |
CN112802443B (en) * | 2019-11-14 | 2024-04-02 | 腾讯科技(深圳)有限公司 | Speech synthesis method and device, electronic equipment and computer readable storage medium |
CN112802443A (en) * | 2019-11-14 | 2021-05-14 | 腾讯科技(深圳)有限公司 | Speech synthesis method and apparatus, electronic device, and computer-readable storage medium |
CN112837674B (en) * | 2019-11-22 | 2024-06-11 | 阿里巴巴集团控股有限公司 | Speech recognition method, device, related system and equipment |
CN112837674A (en) * | 2019-11-22 | 2021-05-25 | 阿里巴巴集团控股有限公司 | Speech recognition method, device and related system and equipment |
CN112863476A (en) * | 2019-11-27 | 2021-05-28 | 阿里巴巴集团控股有限公司 | Method and device for constructing personalized speech synthesis model, method and device for speech synthesis and testing |
US20230036020A1 (en) * | 2019-12-20 | 2023-02-02 | Spotify Ab | Text-to-Speech Synthesis Method and System, a Method of Training a Text-to-Speech Synthesis System, and a Method of Calculating an Expressivity Score |
CN111128137A (en) * | 2019-12-30 | 2020-05-08 | 广州市百果园信息技术有限公司 | Acoustic model training method and device, computer equipment and storage medium |
CN113192482A (en) * | 2020-01-13 | 2021-07-30 | 北京地平线机器人技术研发有限公司 | Speech synthesis method and training method, device and equipment of speech synthesis model |
CN113192482B (en) * | 2020-01-13 | 2023-03-21 | 北京地平线机器人技术研发有限公司 | Speech synthesis method and training method, device and equipment of speech synthesis model |
CN111276120A (en) * | 2020-01-21 | 2020-06-12 | 华为技术有限公司 | Speech synthesis method, apparatus and computer-readable storage medium |
CN111276120B (en) * | 2020-01-21 | 2022-08-19 | 华为技术有限公司 | Speech synthesis method, apparatus and computer-readable storage medium |
CN111326136A (en) * | 2020-02-13 | 2020-06-23 | 腾讯科技(深圳)有限公司 | Voice processing method and device, electronic equipment and storage medium |
CN111326136B (en) * | 2020-02-13 | 2022-10-14 | 腾讯科技(深圳)有限公司 | Voice processing method and device, electronic equipment and storage medium |
CN111312267B (en) * | 2020-02-20 | 2023-08-11 | 广州市百果园信息技术有限公司 | Voice style conversion method, device, equipment and storage medium |
CN111312267A (en) * | 2020-02-20 | 2020-06-19 | 广州市百果园信息技术有限公司 | Voice style conversion method, device, equipment and storage medium |
CN113470615B (en) * | 2020-03-13 | 2024-03-12 | 微软技术许可有限责任公司 | Cross-speaker style transfer speech synthesis |
CN113470615A (en) * | 2020-03-13 | 2021-10-01 | 微软技术许可有限责任公司 | Cross-speaker style transfer speech synthesis |
CN111489734B (en) * | 2020-04-03 | 2023-08-22 | 支付宝(杭州)信息技术有限公司 | Model training method and device based on multiple speakers |
CN111489734A (en) * | 2020-04-03 | 2020-08-04 | 支付宝(杭州)信息技术有限公司 | Model training method and device based on multiple speakers |
CN111627420B (en) * | 2020-04-21 | 2023-12-08 | 升智信息科技(南京)有限公司 | Method and device for synthesizing emotion voice of specific speaker under extremely low resource |
CN111627420A (en) * | 2020-04-21 | 2020-09-04 | 升智信息科技(南京)有限公司 | Specific-speaker emotion voice synthesis method and device under extremely low resources |
CN111402857A (en) * | 2020-05-09 | 2020-07-10 | 广州虎牙科技有限公司 | Speech synthesis model training method and device, electronic equipment and storage medium |
CN111739509A (en) * | 2020-06-16 | 2020-10-02 | 掌阅科技股份有限公司 | Electronic book audio generation method, electronic device and storage medium |
CN111739509B (en) * | 2020-06-16 | 2022-03-22 | 掌阅科技股份有限公司 | Electronic book audio generation method, electronic device and storage medium |
CN112233646A (en) * | 2020-10-20 | 2021-01-15 | 携程计算机技术(上海)有限公司 | Voice cloning method, system, device and storage medium based on neural network |
CN112233646B (en) * | 2020-10-20 | 2024-05-31 | 携程计算机技术(上海)有限公司 | Voice cloning method, system, equipment and storage medium based on neural network |
US11990117B2 (en) * | 2020-10-21 | 2024-05-21 | Google Llc | Using speech recognition to improve cross-language speech synthesis |
US20220122581A1 (en) * | 2020-10-21 | 2022-04-21 | Google Llc | Using Speech Recognition to Improve Cross-Language Speech Synthesis |
CN112509550A (en) * | 2020-11-13 | 2021-03-16 | 中信银行股份有限公司 | Speech synthesis model training method, speech synthesis device and electronic equipment |
WO2022121179A1 (en) * | 2020-12-11 | 2022-06-16 | 平安科技(深圳)有限公司 | Speech synthesis method and apparatus, device, and storage medium |
CN112837673A (en) * | 2020-12-31 | 2021-05-25 | 平安科技(深圳)有限公司 | Speech synthesis method, apparatus, computer device and medium based on artificial intelligence |
CN112837673B (en) * | 2020-12-31 | 2024-05-10 | 平安科技(深圳)有限公司 | Speech synthesis method, device, computer equipment and medium based on artificial intelligence |
CN113345410A (en) * | 2021-05-11 | 2021-09-03 | 科大讯飞股份有限公司 | Training method of general speech and target speech synthesis model and related device |
CN113345410B (en) * | 2021-05-11 | 2024-05-31 | 科大讯飞股份有限公司 | Training method of general speech and target speech synthesis model and related device |
CN115662435A (en) * | 2022-10-24 | 2023-01-31 | 福建网龙计算机网络信息技术有限公司 | Virtual teacher simulation voice generation method and terminal |
US11727915B1 (en) | 2022-10-24 | 2023-08-15 | Fujian TQ Digital Inc. | Method and terminal for generating simulated voice of virtual teacher |
Also Published As
Publication number | Publication date |
---|---|
CN110264991B (en) | 2023-12-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110264991A (en) | Training method of speech synthesis model, speech synthesis method, device, equipment and storage medium | |
US11837216B2 (en) | Speech recognition using unspoken text and speech synthesis | |
CN109840287A (en) | Cross-modal information retrieval method and device based on neural network | |
CN110782870A (en) | Speech synthesis method, speech synthesis device, electronic equipment and storage medium | |
US7263488B2 (en) | Method and apparatus for identifying prosodic word boundaries | |
CN110288980A (en) | Speech recognition method, model training method, device, equipment and storage medium | |
WO2020062680A1 (en) | Waveform splicing method, apparatus, device and storage medium based on double-syllable mixing | |
CN111783455B (en) | Training method and device of text generation model, and text generation method and device | |
JP2008134475A (en) | Technique for recognizing accent of input voice | |
CN112802446B (en) | Audio synthesis method and device, electronic equipment and computer readable storage medium | |
Zheng et al. | BLSTM-CRF Based End-to-End Prosodic Boundary Prediction with Context Sensitive Embeddings in a Text-to-Speech Front-End. | |
EP2329489A1 (en) | Stochastic phoneme and accent generation using accent class | |
US11322133B2 (en) | Expressive text-to-speech utilizing contextual word-level style tokens | |
CN116072098A (en) | Audio signal generation method, model training method, device, equipment and medium | |
CN110335608A (en) | Voiceprint verification method, apparatus, equipment and storage medium | |
WO2014183411A1 (en) | Method, apparatus and speech synthesis system for classifying unvoiced and voiced sound | |
CN117099157A (en) | Multitasking learning for end-to-end automatic speech recognition confidence and erasure estimation | |
CN113593520A (en) | Singing voice synthesis method and device, electronic equipment and storage medium | |
JP7502561B2 (en) | Using speech recognition to improve interlingual speech synthesis. | |
CN111328416B (en) | Speech patterns for fuzzy matching in natural language processing | |
JP2006243673A (en) | Data retrieval device and method | |
CN116702770A (en) | Method, device, terminal and storage medium for generating long text | |
CN112951204B (en) | Speech synthesis method and device | |
CN113223486B (en) | Information processing method, information processing device, electronic equipment and storage medium | |
CN114299910B (en) | Training method, using method, device, equipment and medium of speech synthesis model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||