CN109616093A - End-to-end speech synthesis method, device, equipment and storage medium - Google Patents
End-to-end speech synthesis method, device, equipment and storage medium
- Publication number
- CN109616093A (application number CN201811482781.7A)
- Authority
- CN
- China
- Prior art keywords
- text
- audio
- length
- training sample
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L19/032—Quantisation or dequantisation of spectral components
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L2013/083—Special characters, e.g. punctuation marks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
- G10L2013/105—Duration
Abstract
The invention discloses an end-to-end speech synthesis method belonging to the field of speech synthesis technology. The method comprises: obtaining speech samples and pre-processing them to obtain voice training samples containing audio of a preset audio length, and converting the audio in the voice training samples into audio matrices; performing regularization and vectorization processing on the text samples corresponding to the voice training samples to obtain text training samples containing text vectors of a preset text length; and performing speech synthesis training with the voice training samples and the corresponding text training samples as the input of a self-attention end-to-end model based on a feed-forward neural network, so as to obtain an optimal speech synthesis model. The invention uses a DNN-based attention mechanism which, compared with CNN and RNN training models, reduces model complexity while also accelerating convergence.
Description
Technical field
The present invention relates to the field of speech synthesis technology, and in particular to an end-to-end speech synthesis method, device, equipment and storage medium.
Background technique
In the field of speech synthesis, there has not yet been an end-to-end technical framework whose input is text and whose output is a voice signal. For example, WaveNet (Aaron van den Oord et al., 2016) needs to predict multiple frequency-domain parameters such as mel-cepstral coefficients and the fundamental frequency F0, and its text pre-processing is also relatively complicated (Jonathan Shen et al., 2017). Recent deep learning frameworks for speech synthesis such as Tacotron (Yuxuan Wang et al., 2017), Tacotron 2 (Jonathan Shen et al., 2017) and Deep Voice 3 (Wei Ping et al., 2017) use a vocoder (Griffin-Lim, WORLD or WaveNet) as a speech synthesis post-processing module, so these models are not truly end-to-end and their model complexity is also high.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the high complexity of speech synthesis pre-processing and post-processing in the prior art. To this end, an end-to-end speech synthesis method, device, equipment and storage medium are proposed which, by using an attention mechanism based on a DNN (deep neural network), reduce model complexity while also accelerating convergence.
The present invention solves the above technical problem through the following technical solutions:
An end-to-end speech synthesis method, comprising the following steps:
Obtaining speech samples and pre-processing them to obtain voice training samples containing audio of a preset audio length, and converting the audio in the voice training samples into audio matrices;
Performing regularization and vectorization processing on the text samples corresponding to the voice training samples, to obtain text training samples containing text vectors of a preset text length;
Performing speech synthesis training with the voice training samples and the corresponding text training samples as the input of a self-attention end-to-end model based on a feed-forward neural network, so as to obtain an optimal speech synthesis model.
Preferably, the pre-processing includes cutting the audio in the speech samples and padding the audio in the speech samples, so that the length of the audio equals the preset audio length.
Preferably, the pre-processing specifically includes the following steps:
Calculating the difference between the length of the audio in the speech sample and the preset audio length;
Judging the data type of the difference, the data type including negative number, positive number and zero;
If the difference is negative, appending silence at the end of the speech sample, the length of the silence being equal to the absolute value of the difference;
If the difference is positive, cutting off the part of the audio in the speech sample that exceeds the preset audio length, the length of the audio removed being equal to the absolute value of the difference;
If the difference is zero, leaving the speech sample unprocessed.
Preferably, the regularization processing includes converting non-Chinese character strings into Chinese character strings, and determining the pronunciation of the non-Chinese character strings according to the converted Chinese character strings.
Preferably, the vectorization processing includes:
Converting, according to a preset text dictionary, each character in the text sample into the serial number corresponding to that character in the text dictionary;
Arranging the serial numbers of the characters into a vector in the order of the characters in the text sample, so as to obtain the text vector corresponding to the text sample;
Cutting or padding any text vector whose length differs from the preset text length, so that the length of the text vector equals the preset text length.
Preferably, the cutting is used when the difference between the length of the text vector and the preset text length is positive; the cutting step includes removing the part of the text vector that exceeds the preset text length, the length of the removed part being equal to the absolute value of the difference.
The padding is used when the difference between the length of the text vector and the preset text length is negative; the padding step includes appending zeros at the end of the text vector, the number of appended zeros being equal to the absolute value of the difference.
Preferably, the speech synthesis training comprises the following steps:
Adding a positional mask to the audio matrix contained in the voice training sample to obtain an initial audio matrix, and successively applying to the initial audio matrix non-sparsification processing, self-attention processing, a first residual-network step, one feed-forward neural network layer and a second residual-network step, to obtain a processed audio matrix;
Shifting the text vector contained in the text training sample right by two positions and adding a positional mask to obtain an initial text matrix, and successively applying to the initial text matrix non-sparsification processing, self-attention processing, a first residual-network step, one feed-forward neural network layer and a second residual-network step, to obtain a processed text matrix;
Jointly passing the processed audio matrix and the processed text matrix through self-attention processing, layer normalization and one feed-forward neural network layer, to obtain a voice prediction signal;
Computing a loss function from the voice prediction signal and the processed text matrix, performing back-propagation using the chain rule, and iterating continuously until an optimal speech synthesis model is obtained.
The invention also discloses an end-to-end speech synthesis device, comprising:
a speech processing module, for obtaining speech samples and pre-processing them to obtain voice training samples containing audio of a preset audio length, and converting the audio in the voice training samples into audio matrices;
a text processing module, for performing regularization and vectorization processing on the text samples corresponding to the voice training samples, to obtain text training samples containing text vectors of a preset text length;
a training module, for performing speech synthesis training with the voice training samples and the corresponding text training samples as the input of a self-attention end-to-end model based on a feed-forward neural network, so as to obtain an optimal speech synthesis model.
The invention also discloses a computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the steps of the aforementioned end-to-end speech synthesis method.
The invention also discloses a computer-readable storage medium storing a computer program which can be executed by at least one processor to implement the steps of the aforementioned end-to-end speech synthesis method.
The positive effects of the present invention are: the complexity of pre-processing and post-processing is reduced; and, by using a DNN-based attention mechanism, model complexity is reduced compared with CNN (convolutional neural network) and RNN (recurrent neural network) training models while convergence is also accelerated.
Brief description of the drawings
Fig. 1 shows a flowchart of embodiment one of the end-to-end speech synthesis method of the present invention;
Fig. 2 shows a flowchart of the pre-processing in embodiment one of the end-to-end speech synthesis method of the present invention;
Fig. 3 shows a flowchart of the vectorization processing in embodiment one of the end-to-end speech synthesis method of the present invention;
Fig. 4 shows a flowchart of the speech synthesis training in embodiment one of the end-to-end speech synthesis method of the present invention;
Fig. 5 shows a structural diagram of an embodiment of the end-to-end speech synthesis device of the present invention;
Fig. 6 shows a schematic diagram of the hardware architecture of an embodiment of the computer device of the present invention.
Specific embodiment
The present invention is further illustrated below by way of embodiments, but is not thereby limited to the scope of these embodiments.
Firstly, the present invention proposes an end-to-end speech synthesis method.
In embodiment one, as shown in Fig. 1, the end-to-end speech synthesis method includes the following steps:
Step 01: obtaining speech samples and pre-processing them to obtain voice training samples containing audio of a preset audio length, and converting the audio in the voice training samples into audio matrices.
The speech samples are usually original audio recordings collected in advance. These original recordings have not been processed, their audio lengths vary, and they are usually stored in a corpus after collection. Obtaining speech samples here usually means obtaining them from such a corpus.
The difference between a speech sample and a voice training sample is that the audio lengths of speech samples are not uniform, whereas the audio lengths in the voice training samples are uniform, for example uniformly 10 seconds. The pre-processing is precisely the process of turning speech samples into voice training samples.
The pre-processing specifically includes cutting the audio in the speech samples and padding the audio in the speech samples, so that the length of the audio equals the preset audio length; it mainly targets the samples whose audio length is not equal to the preset audio length. For example, if the audio length of the voice training samples is uniformly preset to 10 seconds, then a speech sample that is too long, such as a 12.5-second recording, has its last 2.5 seconds cut off, while a speech sample that is too short, such as an 8.4-second recording, is padded by appending 1.6 seconds of silence at its end, bringing it up to 10 seconds.
In one embodiment, as shown in Fig. 2, the pre-processing specifically includes the following steps:
Step 11: calculating the difference between the length of the audio in the speech sample and the preset audio length.
Step 12: judging the data type of the difference, the data type including negative number, positive number and zero.
Step 13: if the difference is negative, appending silence at the end of the speech sample, the length of the silence being equal to the absolute value of the difference.
Step 14: if the difference is positive, cutting off the part of the audio in the speech sample that exceeds the preset audio length, the length of the audio removed being equal to the absolute value of the difference.
Step 15: if the difference is zero, leaving the speech sample unprocessed, i.e. using the speech sample directly as the voice training sample.
The dimension of the audio matrix into which the audio is converted is set as needed. Taking a 10-second audio sample at a sample rate of 8 kHz as an example, there are 80,000 sample points in total; if each frame takes 800 points and adjacent frames do not overlap, the audio can be converted into an audio matrix of 100 frames of 800 sample points each, i.e. an audio matrix of dimension 100 by 800.
Step 02: performing regularization and vectorization processing on the text samples corresponding to the voice training samples, to obtain text training samples containing text vectors of a preset text length.
Regularization here mainly means converting non-Chinese character strings into Chinese character strings, and determining the pronunciation of the non-Chinese character strings according to the converted Chinese character strings. For example, the digit string "123" is converted into the corresponding Chinese characters "一二三".
The vectorization processing is mainly intended to obtain text training samples containing text vectors of the preset text length, and specifically includes the following steps (as shown in Fig. 3):
Step 21: converting, according to a preset text dictionary, each character in the text sample into the serial number corresponding to that character in the text dictionary.
A text dictionary is a set of Chinese characters, each of which corresponds to a serial number. Taking the text sample "你吃饭没有" ("Have you eaten?") as an example, in a certain text dictionary the serial number of "你" (you) is 12, that of "吃" (eat) is 66, that of "饭" (meal) is 35, that of "没" (not) is 973, and that of "有" (have) is 465.
Step 22: arranging the serial numbers of the characters into a vector in the order of the characters in the text sample, so as to obtain the text vector corresponding to the text sample.
Continuing the example, the text sample "你吃饭没有" is converted into the text vector [12, 66, 35, 973, 465].
Step 23: cutting or padding any text vector whose length differs from the preset text length, so that the length of the text vector equals the preset text length.
Specifically, padding is used when the difference between the length of the text vector and the preset text length is negative: zeros are appended at the end of the text vector, the number of appended zeros being equal to the absolute value of the difference.
Continuing the example, assume the text length of the training text vectors is uniformly set to 10; then the text vector [12, 66, 35, 973, 465] corresponding to the text sample above needs to be padded with zeros, giving the text vector [12, 66, 35, 973, 465, 0, 0, 0, 0, 0], which can then be used as a text training sample.
Cutting is used when the difference between the length of the text vector and the preset text length is positive: the part of the text vector that exceeds the preset text length is removed, the length of the removed part being equal to the absolute value of the difference.
Similarly, if the length of a text vector exceeds 10, such as the text vector [99, 331, 55, 62, 2355, 888, 999, 535, 676, 2, 22, 36, 68], the text vector needs to be cut, giving the text vector [99, 331, 55, 62, 2355, 888, 999, 535, 676, 2] that can be used as a text training sample. It should be noted that padding and cutting are done because deep learning requires standard samples for batch training, in order to guarantee that every dimension of the samples is consistent in size; the cut-off part of the text vector is simply discarded.
Step 03: performing speech synthesis training with the voice training samples and the corresponding text training samples as the input of a self-attention end-to-end model based on a feed-forward neural network, so as to obtain an optimal speech synthesis model.
The speech synthesis training specifically includes the following four steps (as shown in Fig. 4):
Step 31: adding a positional mask to the audio matrix contained in the voice training sample to obtain an initial audio matrix, and successively applying to the initial audio matrix non-sparsification processing, self-attention processing, a first residual-network step, one feed-forward neural network layer and a second residual-network step, to obtain the processed audio matrix.
The technique of adding a positional mask (positional encoding) belongs to the prior art and is not repeated here.
The positional mask encodes the position information of the original matrix. For example, for the matrix X = [[0.1, 0.5], [0.4, -0.3], [-0.9, 0.6]], the position indices of the three vectors in the matrix are 1, 2 and 3 respectively; three vectors such as [0, -0.5], [-0.4, -0.6], [0.3, 0.5] are randomly generated in the range -1 to 1, and the new matrix formed by these three vectors, [[0, -0.5], [-0.4, -0.6], [0.3, 0.5]], is the positional mask of the original matrix.
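A sketch of this positional mask, following the random-vector example above (a deterministic sinusoidal positional encoding is the more common prior-art choice; the uniform sampling here simply mirrors the example in this description):

```python
import numpy as np

def positional_mask(matrix, rng=np.random.default_rng(0)):
    """One random vector in [-1, 1] per row position of the original matrix, with the same shape."""
    return rng.uniform(-1.0, 1.0, size=matrix.shape)

X = np.array([[0.1, 0.5], [0.4, -0.3], [-0.9, 0.6]])
mask = positional_mask(X)   # three random 2-dimensional vectors, one per position index
```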
The processing of the initial audio matrix is specifically as follows:
The initial audio matrix undergoes non-sparsification processing to obtain a first audio matrix, which is a non-sparsified matrix. The so-called non-sparsification processing means directly summing the positional mask information with the audio matrix; the result is the non-sparsified matrix.
Self-attention processing is then applied to the first audio matrix to obtain a second audio matrix.
Next, using the residual-network technique, the second audio matrix is added to the first audio matrix to obtain a third audio matrix.
Finally, the third audio matrix is passed through one feed-forward neural network layer and a further residual-network step to obtain a fourth audio matrix; this fourth audio matrix is the processed audio matrix mentioned above. The so-called feed-forward neural network layer processing means that, given an input matrix M and a neural network weight matrix K, the result obtained after the feed-forward layer is M*K.
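The audio-side block of step 31 can be sketched as below, reusing positional_mask from the sketch above. The self-attention step is passed in as a function (one possible realisation is sketched after the self-attention description further on), K is the feed-forward weight matrix, assumed square so the shapes line up, and the exact residual wiring is a reading of the description rather than a verbatim specification.

```python
def audio_block(audio_matrix, self_attention, K):
    """Non-sparsification, self-attention, residual add, feed-forward layer (M*K), residual add."""
    first = audio_matrix + positional_mask(audio_matrix)   # non-sparsification: sum with positional mask
    second = self_attention(first)                          # self-attention processing
    third = first + second                                  # first residual-network step
    fourth = third + third @ K                              # feed-forward layer, then second residual step
    return fourth                                           # the "processed audio matrix"
```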
Step 32: shifting the text vector contained in the text training sample right by two positions and adding a positional mask to obtain an initial text matrix, and successively applying to the initial text matrix non-sparsification processing, self-attention processing, a first residual-network step, one feed-forward neural network layer and a second residual-network step, to obtain the processed text matrix.
The processing of the text training sample is similar to the processing of the voice training sample described in step 31. The processing of the initial text matrix is specifically as follows:
The initial text matrix undergoes non-sparsification processing to obtain a first text matrix, which is a non-sparsified matrix.
Self-attention processing is then applied to the first text matrix to obtain a second text matrix.
Next, using the residual-network technique, the second text matrix is added to the first text matrix to obtain a third text matrix.
Finally, the third text matrix is passed through one feed-forward neural network layer and a further residual-network step to obtain a fourth text matrix; this fourth text matrix is the processed text matrix mentioned above.
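The only text-side step not covered by the audio-side sketch is the right shift by two positions; a minimal version is below, assuming the vacated leading positions are filled with zeros (the fill value is an assumption, the patent only states the shift):

```python
def shift_right(text_vector, positions=2, fill=0):
    """Shift a text vector right by `positions` slots, keeping its length."""
    return [fill] * positions + list(text_vector)[:-positions]

print(shift_right([12, 66, 35, 973, 465, 0, 0, 0, 0, 0]))  # -> [0, 0, 12, 66, 35, 973, 465, 0, 0, 0]
```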
Step 33: jointly passing the processed audio matrix and the processed text matrix through self-attention processing, layer normalization and one feed-forward neural network layer, to obtain the voice prediction signal.
The so-called self-attention mechanism refers to a mechanism that selects, from a large amount of information, the information most relevant to the current task target.
The processing of the self-attention mechanism is specifically as follows: the self-attention matrix of the first N frames of the audio matrix, after passing through a residual network and layer normalization, is multiplied with the self-attention matrix of the text matrix after the latter has passed through a feed-forward neural network, a residual network and layer normalization, yielding an attention matrix; this attention matrix is then processed by a residual network, layer normalization, a feed-forward neural network, a residual network and layer normalization again, and after flattening yields frame N+1 of the audio matrix. In this autoregressive manner the fourth audio matrix and the fourth text matrix are processed by the attention mechanism.
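The patent does not spell out the attention computation itself; a standard scaled dot-product self-attention, sketched below with random projection matrices, is one common way such a step is realised and could serve as the self_attention argument of the earlier block sketch (the projections Wq, Wk, Wv are assumptions, not part of the patent):

```python
import numpy as np

def self_attention_factory(dim, rng=np.random.default_rng(1)):
    """Return a scaled dot-product self-attention function over the rows of a (frames x dim) matrix."""
    Wq, Wk, Wv = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))
    def attend(M):
        Q, K_, V = M @ Wq, M @ Wk, M @ Wv
        scores = Q @ K_.T / np.sqrt(dim)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
        return weights @ V
    return attend
```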
The layer normalization and the feed-forward neural network layer operate in a frame-level autoregressive manner, predicting frame by frame: each predicted voice signal frame is fed back as input, together with the text, to predict the next frame of the voice signal.
The so-called layer normalization, applied to a matrix M, works as follows: assume the standard deviation of M is Ma and the mean of M is Mu; then the updated M after layer normalization is M = (M - Mu) / Ma.
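As a quick sketch of this layer normalization (a small epsilon is added to the denominator here for numerical safety; the formula above omits it):

```python
def layer_norm(M, eps=1e-6):
    """Layer normalization over the whole matrix: (M - mean) / standard deviation."""
    return (M - M.mean()) / (M.std() + eps)
```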
Step 34: computing the loss function from the voice prediction signal and the processed text matrix, performing back-propagation using the chain rule, and iterating continuously until the optimal speech synthesis model is obtained.
The loss function is defined as the mean of the absolute value of the difference between the predicted signal and the original signal. Here the optimization is specifically carried out with the TensorFlow 1.4.0 framework under Python 3.6 (TensorFlow is an open-source software library that uses dataflow graphs for numerical computation); only the forward computation, the loss function and the number of iteration steps need to be defined, and the chain-rule back-propagation that iterates towards the optimal model is completed automatically by TensorFlow. It should be noted that, in terms of optimization, "optimal" does not mean "best" (see the operations research literature).
Here, the training learning rate is updated dynamically using exponential decay, and gradient descent is performed with the AdamOptimizer optimizer.
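A minimal TensorFlow 1.x sketch of this training setup follows. The forward computation shown is only a trivial stand-in, not the patent's model; the placeholder shapes follow the 100 x 800 audio matrices and length-10 text vectors of the examples above, and the vocabulary size and decay hyper-parameters are illustrative assumptions.

```python
import tensorflow as tf  # TensorFlow 1.x, as in the description (1.4.0 under Python 3.6)

text_input = tf.placeholder(tf.int32, [None, 10])             # text training samples (preset length 10)
audio_target = tf.placeholder(tf.float32, [None, 100, 800])   # original audio matrices (100 frames x 800)

# Stand-in forward computation (NOT the patent's model): one-hot characters -> dense -> frame matrix.
embedded = tf.layers.dense(tf.one_hot(text_input, depth=5000), 256, activation=tf.nn.relu)
summary = tf.reduce_mean(embedded, axis=1)
prediction = tf.reshape(tf.layers.dense(summary, 100 * 800), [-1, 100, 800])

# Loss: mean absolute difference between the predicted and the original signal.
loss = tf.reduce_mean(tf.abs(prediction - audio_target))

# Exponentially decayed learning rate with Adam-based gradient descent.
global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(1e-3, global_step, decay_steps=10000, decay_rate=0.96)
train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss, global_step=global_step)
```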
Secondly, the invention proposes an end-to-end speech synthesis device; the device 20 may be divided into one or more modules.
For example, Fig. 5 shows a structural diagram of an embodiment of the end-to-end speech synthesis device 20. In this embodiment, the device 20 may be divided into a speech processing module 201, a text processing module 202 and a training module 203. The specific functions of the modules 201-203 are introduced below.
The speech processing module 201 is used for obtaining speech samples and pre-processing them to obtain voice training samples containing audio of a preset audio length, and converting the audio in the voice training samples into audio matrices.
The text processing module 202 is used for performing regularization and vectorization processing on the text samples corresponding to the voice training samples, to obtain text training samples containing text vectors of a preset text length.
The training module 203 is used for performing speech synthesis training with the voice training samples and the corresponding text training samples as the input of a self-attention end-to-end model based on a feed-forward neural network, so as to obtain an optimal speech synthesis model.
Thirdly, the present invention also proposes a computer device.
Fig. 6 is a schematic diagram of the hardware architecture of an embodiment of the computer device of the present invention. In this embodiment, the computer device 2 is a device capable of automatically performing numerical computation and/or information processing according to instructions set or stored in advance. For example, it may be a smartphone, a tablet computer, a laptop, a desktop computer, a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of multiple servers), etc. As shown, the computer device 2 includes at least, but is not limited to, a memory 21, a processor 22 and a network interface 23, which can be communicatively connected to each other through a system bus. Wherein:
The memory 21 includes at least one type of computer-readable storage medium, which includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, such as a hard disk or memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card equipped on the computer device 2. Of course, the memory 21 may also include both the internal storage unit of the computer device 2 and its external storage device. In this embodiment, the memory 21 is generally used for storing the operating system and various application software installed on the computer device 2, such as the computer program for implementing the end-to-end speech synthesis method. In addition, the memory 21 may also be used for temporarily storing various data that have been output or are to be output.
The processor 22 may in some embodiments be a central processing unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor 22 is generally used for controlling the overall operation of the computer device 2, for example performing control and processing related to data interaction or communication with the computer device 2. In this embodiment, the processor 22 is used for running the program code stored in the memory 21 or processing data, for example running the computer program for implementing the end-to-end speech synthesis method.
The network interface 23 may include a wireless network interface or a wired network interface, and is generally used for establishing a communication connection between the computer device 2 and other computer devices. For example, the network interface 23 is used for connecting the computer device 2 with an external terminal through a network, and establishing a data transmission channel and a communication connection between the computer device 2 and the external terminal. The network may be an intranet, the Internet, a Global System for Mobile communication (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or another wireless or wired network.
It should be pointed out that Fig. 6 only shows the computer device 2 with components 21-23; it should be understood that not all of the shown components need be implemented, and more or fewer components may be implemented instead.
In this embodiment, the computer program stored in the memory 21 for implementing the end-to-end speech synthesis method may be executed by one or more processors (in this embodiment, the processor 22) to complete the following steps:
Step 01: obtaining speech samples and pre-processing them to obtain voice training samples containing audio of a preset audio length, and converting the audio in the voice training samples into audio matrices;
Step 02: performing regularization and vectorization processing on the text samples corresponding to the voice training samples, to obtain text training samples containing text vectors of a preset text length;
Step 03: performing speech synthesis training with the voice training samples and the corresponding text training samples as the input of a self-attention end-to-end model based on a feed-forward neural network, so as to obtain an optimal speech synthesis model.
In addition, the computer-readable storage medium of the present invention is a non-volatile readable storage medium storing a computer program; the computer program can be executed by at least one processor to implement the operations of the above end-to-end speech synthesis method or device.
The computer-readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the computer-readable storage medium may be the internal storage unit of a computer device, such as a hard disk or memory of the computer device. In other embodiments, it may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card equipped on the computer device. Of course, the computer-readable storage medium may also include both the internal storage unit of the computer device and its external storage device. In this embodiment, the computer-readable storage medium is generally used for storing the operating system and various application software installed on the computer device, for example the computer program for implementing the end-to-end speech synthesis method. In addition, the computer-readable storage medium may also be used for temporarily storing various data that have been output or are to be output.
Although specific embodiments of the present invention have been described above, it will be appreciated by those skilled in the art that these are only examples, and the protection scope of the present invention is defined by the appended claims. Those skilled in the art may make many changes and modifications to these embodiments without departing from the principle and substance of the present invention, but all such changes and modifications fall within the protection scope of the present invention.
Claims (10)
1. An end-to-end speech synthesis method, characterized by comprising the following steps:
obtaining speech samples and pre-processing them to obtain voice training samples containing audio of a preset audio length, and converting the audio in the voice training samples into audio matrices;
performing regularization and vectorization processing on the text samples corresponding to the voice training samples, to obtain text training samples containing text vectors of a preset text length;
performing speech synthesis training with the voice training samples and the corresponding text training samples as the input of a self-attention end-to-end model based on a feed-forward neural network, so as to obtain an optimal speech synthesis model.
2. The end-to-end speech synthesis method according to claim 1, characterized in that the pre-processing includes cutting the audio in the speech samples and padding the audio in the speech samples, so that the length of the audio equals the preset audio length.
3. The end-to-end speech synthesis method according to claim 2, characterized in that the pre-processing specifically includes the following steps:
calculating the difference between the length of the audio in the speech sample and the preset audio length;
judging the data type of the difference, the data type including negative number, positive number and zero;
if the difference is negative, appending silence at the end of the speech sample, the length of the silence being equal to the absolute value of the difference;
if the difference is positive, cutting off the part of the audio in the speech sample that exceeds the preset audio length, the length of the audio removed being equal to the absolute value of the difference;
if the difference is zero, leaving the speech sample unprocessed.
4. The end-to-end speech synthesis method according to claim 1, characterized in that the regularization processing includes converting non-Chinese character strings into Chinese character strings, and determining the pronunciation of the non-Chinese character strings according to the converted Chinese character strings.
5. The end-to-end speech synthesis method according to claim 4, characterized in that the vectorization processing includes:
converting, according to a preset text dictionary, each character in the text sample into the serial number corresponding to that character in the text dictionary;
arranging the serial numbers of the characters into a vector in the order of the characters in the text sample, so as to obtain the text vector corresponding to the text sample;
cutting or padding any text vector whose length differs from the preset text length, so that the length of the text vector equals the preset text length.
6. The end-to-end speech synthesis method according to claim 5, characterized in that the cutting is used when the difference between the length of the text vector and the preset text length is positive, the cutting step including removing the part of the text vector that exceeds the preset text length, the length of the removed part being equal to the absolute value of the difference;
and the padding is used when the difference between the length of the text vector and the preset text length is negative, the padding step including appending zeros at the end of the text vector, the number of appended zeros being equal to the absolute value of the difference.
7. The end-to-end speech synthesis method according to claim 1, characterized in that the speech synthesis training comprises the following steps:
adding a positional mask to the audio matrix contained in the voice training sample to obtain an initial audio matrix, and successively applying to the initial audio matrix non-sparsification processing, self-attention processing, a first residual-network step, one feed-forward neural network layer and a second residual-network step, to obtain a processed audio matrix;
shifting the text vector contained in the text training sample right by two positions and adding a positional mask to obtain an initial text matrix, and successively applying to the initial text matrix non-sparsification processing, self-attention processing, a first residual-network step, one feed-forward neural network layer and a second residual-network step, to obtain a processed text matrix;
jointly passing the processed audio matrix and the processed text matrix through self-attention processing, layer normalization and one feed-forward neural network layer, to obtain a voice prediction signal;
computing a loss function from the voice prediction signal and the processed text matrix, performing back-propagation using the chain rule, and iterating continuously until an optimal speech synthesis model is obtained.
8. An end-to-end speech synthesis device, characterized by comprising:
a speech processing module, for obtaining speech samples and pre-processing them to obtain voice training samples containing audio of a preset audio length, and converting the audio in the voice training samples into audio matrices;
a text processing module, for performing regularization and vectorization processing on the text samples corresponding to the voice training samples, to obtain text training samples containing text vectors of a preset text length;
a training module, for performing speech synthesis training with the voice training samples and the corresponding text training samples as the input of a self-attention end-to-end model based on a feed-forward neural network, so as to obtain an optimal speech synthesis model.
9. A computer device, comprising a memory and a processor, characterized in that the memory stores a computer program which, when executed by the processor, implements the steps of the end-to-end speech synthesis method according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which can be executed by at least one processor to implement the steps of the end-to-end speech synthesis method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811482781.7A CN109616093B (en) | 2018-12-05 | 2018-12-05 | End-to-end speech synthesis method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811482781.7A CN109616093B (en) | 2018-12-05 | 2018-12-05 | End-to-end speech synthesis method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109616093A true CN109616093A (en) | 2019-04-12 |
CN109616093B CN109616093B (en) | 2024-02-27 |
Family
ID=66006036
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811482781.7A Active CN109616093B (en) | 2018-12-05 | 2018-12-05 | End-to-end speech synthesis method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109616093B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110070852A (en) * | 2019-04-26 | 2019-07-30 | 平安科技(深圳)有限公司 | Synthesize method, apparatus, equipment and the storage medium of Chinese speech |
CN110264991A (en) * | 2019-05-20 | 2019-09-20 | 平安科技(深圳)有限公司 | Training method, phoneme synthesizing method, device, equipment and the storage medium of speech synthesis model |
CN110299131A (en) * | 2019-08-01 | 2019-10-01 | 苏州奇梦者网络科技有限公司 | A kind of phoneme synthesizing method, device, the storage medium of controllable rhythm emotion |
CN110310626A (en) * | 2019-05-23 | 2019-10-08 | 平安科技(深圳)有限公司 | Voice training data creation method, device, equipment and readable storage medium storing program for executing |
CN110827791A (en) * | 2019-09-09 | 2020-02-21 | 西北大学 | Edge-device-oriented speech recognition-synthesis combined modeling method |
CN111753133A (en) * | 2020-06-11 | 2020-10-09 | 北京小米松果电子有限公司 | Video classification method, device and storage medium |
CN112802443A (en) * | 2019-11-14 | 2021-05-14 | 腾讯科技(深圳)有限公司 | Speech synthesis method and apparatus, electronic device, and computer-readable storage medium |
CN113362218A (en) * | 2021-05-21 | 2021-09-07 | 北京百度网讯科技有限公司 | Data processing method and device, electronic equipment and storage medium |
CN116030789A (en) * | 2022-12-28 | 2023-04-28 | 南京硅基智能科技有限公司 | Method and device for generating speech synthesis training data |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105654940A (en) * | 2016-01-26 | 2016-06-08 | 百度在线网络技术(北京)有限公司 | Voice synthesis method and device |
CN107464554A (en) * | 2017-09-28 | 2017-12-12 | 百度在线网络技术(北京)有限公司 | Phonetic synthesis model generating method and device |
US20180114522A1 (en) * | 2016-10-24 | 2018-04-26 | Semantic Machines, Inc. | Sequence to sequence transformations for speech synthesis via recurrent neural networks |
CN108630190A (en) * | 2018-05-18 | 2018-10-09 | 百度在线网络技术(北京)有限公司 | Method and apparatus for generating phonetic synthesis model |
2018-12-05: CN application CN201811482781.7A filed; patent CN109616093B (en), status: active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105654940A (en) * | 2016-01-26 | 2016-06-08 | 百度在线网络技术(北京)有限公司 | Voice synthesis method and device |
US20180114522A1 (en) * | 2016-10-24 | 2018-04-26 | Semantic Machines, Inc. | Sequence to sequence transformations for speech synthesis via recurrent neural networks |
CN107464554A (en) * | 2017-09-28 | 2017-12-12 | 百度在线网络技术(北京)有限公司 | Phonetic synthesis model generating method and device |
CN108630190A (en) * | 2018-05-18 | 2018-10-09 | 百度在线网络技术(北京)有限公司 | Method and apparatus for generating phonetic synthesis model |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110070852B (en) * | 2019-04-26 | 2023-06-16 | 平安科技(深圳)有限公司 | Method, device, equipment and storage medium for synthesizing Chinese voice |
CN110070852A (en) * | 2019-04-26 | 2019-07-30 | 平安科技(深圳)有限公司 | Synthesize method, apparatus, equipment and the storage medium of Chinese speech |
CN110264991A (en) * | 2019-05-20 | 2019-09-20 | 平安科技(深圳)有限公司 | Training method, phoneme synthesizing method, device, equipment and the storage medium of speech synthesis model |
CN110264991B (en) * | 2019-05-20 | 2023-12-22 | 平安科技(深圳)有限公司 | Training method of speech synthesis model, speech synthesis method, device, equipment and storage medium |
CN110310626A (en) * | 2019-05-23 | 2019-10-08 | 平安科技(深圳)有限公司 | Voice training data creation method, device, equipment and readable storage medium storing program for executing |
CN110299131B (en) * | 2019-08-01 | 2021-12-10 | 苏州奇梦者网络科技有限公司 | Voice synthesis method and device capable of controlling prosodic emotion and storage medium |
CN110299131A (en) * | 2019-08-01 | 2019-10-01 | 苏州奇梦者网络科技有限公司 | A kind of phoneme synthesizing method, device, the storage medium of controllable rhythm emotion |
CN110827791B (en) * | 2019-09-09 | 2022-07-01 | 西北大学 | Edge-device-oriented speech recognition-synthesis combined modeling method |
CN110827791A (en) * | 2019-09-09 | 2020-02-21 | 西北大学 | Edge-device-oriented speech recognition-synthesis combined modeling method |
CN112802443A (en) * | 2019-11-14 | 2021-05-14 | 腾讯科技(深圳)有限公司 | Speech synthesis method and apparatus, electronic device, and computer-readable storage medium |
CN112802443B (en) * | 2019-11-14 | 2024-04-02 | 腾讯科技(深圳)有限公司 | Speech synthesis method and device, electronic equipment and computer readable storage medium |
CN111753133A (en) * | 2020-06-11 | 2020-10-09 | 北京小米松果电子有限公司 | Video classification method, device and storage medium |
CN113362218A (en) * | 2021-05-21 | 2021-09-07 | 北京百度网讯科技有限公司 | Data processing method and device, electronic equipment and storage medium |
CN116030789A (en) * | 2022-12-28 | 2023-04-28 | 南京硅基智能科技有限公司 | Method and device for generating speech synthesis training data |
CN116030789B (en) * | 2022-12-28 | 2024-01-26 | 南京硅基智能科技有限公司 | Method and device for generating speech synthesis training data |
Also Published As
Publication number | Publication date |
---|---|
CN109616093B (en) | 2024-02-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109616093A (en) | End-to-end phoneme synthesizing method, device, equipment and storage medium | |
WO2021047286A1 (en) | Text processing model training method, and text processing method and apparatus | |
CN107273503B (en) | Method and device for generating parallel text in same language | |
US10810993B2 (en) | Sample-efficient adaptive text-to-speech | |
CN112259089B (en) | Speech recognition method and device | |
CN111401037B (en) | Natural language generation method and device, electronic equipment and storage medium | |
CN112466314A (en) | Emotion voice data conversion method and device, computer equipment and storage medium | |
CN113434683B (en) | Text classification method, device, medium and electronic equipment | |
CN111382270A (en) | Intention recognition method, device and equipment based on text classifier and storage medium | |
CN109558605A (en) | Method and apparatus for translating sentence | |
WO2022257454A1 (en) | Speech synthesis method, apparatus and terminal, and storage medium | |
JP2023025126A (en) | Training method and apparatus for deep learning model, text data processing method and apparatus, electronic device, storage medium, and computer program | |
CN111243574A (en) | Voice model adaptive training method, system, device and storage medium | |
CN110598210A (en) | Entity recognition model training method, entity recognition device, entity recognition equipment and medium | |
CN112699213A (en) | Speech intention recognition method and device, computer equipment and storage medium | |
CN112634919A (en) | Voice conversion method and device, computer equipment and storage medium | |
CN115129831A (en) | Data processing method and device, electronic equipment and computer storage medium | |
CN115687934A (en) | Intention recognition method and device, computer equipment and storage medium | |
WO2018014537A1 (en) | Voice recognition method and apparatus | |
CN104679733A (en) | Voice conversation translation method, device and system | |
CN114495977A (en) | Speech translation and model training method, device, electronic equipment and storage medium | |
CN117971487A (en) | High-performance operator generation method, device, equipment and storage medium | |
CN111508481B (en) | Training method and device of voice awakening model, electronic equipment and storage medium | |
WO2020153159A1 (en) | Series labeling device, series labeling method, and program | |
JP6633556B2 (en) | Acoustic model learning device, speech recognition device, acoustic model learning method, speech recognition method, and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |