CN108806665A - Speech synthesis method and device - Google Patents
Speech synthesis method and device
- Publication number
- CN108806665A
- Authority
- CN
- China
- Prior art keywords
- information
- sample
- voice
- text
- spectrum
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
Abstract
Embodiments of the present application disclose a speech synthesis method and device. One specific implementation of the method includes: obtaining prediction information corresponding to a target text to be converted into speech, the prediction information including the predicted duration information, the predicted spectrum information, and the predicted fundamental frequency information corresponding to the target text; and inputting the obtained prediction information into a pre-trained speech synthesis model to obtain synthesized speech corresponding to the target text. This embodiment improves the accuracy of the synthesized speech.
Description
Technical field
Embodiments of the present application relate to the field of computer technology, and in particular to a speech synthesis method and device.
Background technology
With the development of artificial intelligence, speech processing technology is widely used. Speech processing typically comprises speech recognition and speech synthesis. Speech recognition enables an intelligent machine to convert a speech signal into corresponding text or commands through recognition and understanding. Speech synthesis converts text, generated by the computer itself or supplied externally, into fluent spoken output that a user can listen to and understand.

In related speech synthesis technology, features are usually extracted from information describing vocal cord vibration and information characterizing mouth shape, and the text is converted into a speech signal by simulating human vocal production.
Summary of the invention
Embodiments of the present application propose a speech synthesis method and device.
In a first aspect, an embodiment of the present application provides a speech synthesis method, including: obtaining prediction information corresponding to a target text to be converted into speech, the prediction information including the predicted duration information, the predicted spectrum information, and the predicted fundamental frequency information corresponding to the target text; and inputting the obtained prediction information into a pre-trained speech synthesis model to obtain synthesized speech corresponding to the target text.
In some embodiments, the speech synthesis model is trained as follows: a training sample set is obtained, in which each training sample includes a sample text and sample speech information corresponding to the sample text; the sample speech information includes acoustic feature information and a sample speech waveform, and the acoustic feature information includes fundamental frequency information, spectrum information, and duration information, where the fundamental frequency information is obtained by fundamental frequency extraction from the sample speech, and the spectrum information is obtained by performing spectrum prediction on the sample text with a pre-trained spectrum prediction model; the acoustic feature information in the sample speech information corresponding to each sample text is used as the input of the speech synthesis model, the sample speech waveform is used as the desired output, and the speech synthesis model is trained by the method of machine learning.
In some embodiments, the spectrum information corresponding to the target text is obtained by performing spectrum prediction on the speech corresponding to the target text with a pre-trained spectrum prediction model; and the spectrum prediction model is trained as follows: the sample texts in the training sample set are used as the input of an initial spectrum prediction model, the spectrum information extracted from the sample speech corresponding to each sample text is used as the desired output, and the spectrum prediction model is trained by the method of machine learning.
In some embodiments, the duration information corresponding to the target text is obtained by performing duration prediction on the speech corresponding to the target text with a pre-trained duration prediction model; and the duration prediction model is trained as follows: the sample texts and duration information in the training sample set are used as the input and desired output of the duration prediction model, respectively, and the duration prediction model is trained by the method of machine learning.
In some embodiments, the fundamental frequency information corresponding to the target text is obtained by performing pitch prediction on the speech corresponding to the target text with a pitch prediction model, and the pitch prediction model is trained as follows: the sample texts in the training sample set are used as the input of an initial pitch prediction model, the fundamental frequency information extracted from the sample speech corresponding to each sample text is used as the desired output, and the pitch prediction model is trained by the method of machine learning.
In some embodiments, the speech synthesis model is a WaveRNN.
In a second aspect, an embodiment of the present application provides a speech synthesis device, including: an acquiring unit configured to obtain prediction information corresponding to a target text to be converted into speech, the prediction information including the predicted duration information, the predicted spectrum information, and the predicted fundamental frequency information corresponding to the target text; and a synthesis unit configured to input the obtained prediction information into a pre-trained speech synthesis model to obtain synthesized speech corresponding to the target text.
In some embodiments, the speech synthesis model is trained as follows: a training sample set is obtained, in which each training sample includes a sample text and sample speech information corresponding to the sample text; the sample speech information includes acoustic feature information and a sample speech waveform, and the acoustic feature information includes fundamental frequency information, spectrum information, and duration information, where the fundamental frequency information is obtained by fundamental frequency extraction from the sample speech, and the spectrum information is obtained by performing spectrum prediction on the sample text with a pre-trained spectrum prediction model; the acoustic feature information in the sample speech information corresponding to each sample text is used as the input of the speech synthesis model, the sample speech waveform is used as the desired output, and the speech synthesis model is trained by the method of machine learning.
In some embodiments, the spectrum information corresponding to the target text is obtained by performing spectrum prediction on the speech corresponding to the target text with a pre-trained spectrum prediction model; and the spectrum prediction model is trained as follows: the sample texts in the training sample set are used as the input of an initial spectrum prediction model, the spectrum information extracted from the sample speech corresponding to each sample text is used as the desired output, and the spectrum prediction model is trained by the method of machine learning.
In some embodiments, the duration information corresponding to the target text is obtained by performing duration prediction on the speech corresponding to the target text with a pre-trained duration prediction model; and the duration prediction model is trained as follows: the sample texts and duration information in the training sample set are used as the input and desired output of the duration prediction model, respectively, and the duration prediction model is trained by the method of machine learning.
In some embodiments, the fundamental frequency information corresponding to the target text is obtained by performing pitch prediction on the speech corresponding to the target text with a pitch prediction model, and the pitch prediction model is trained as follows: the sample texts in the training sample set are used as the input of an initial pitch prediction model, the fundamental frequency information extracted from the sample speech corresponding to each sample text is used as the desired output, and the pitch prediction model is trained by the method of machine learning.

In some embodiments, the speech synthesis model is a WaveRNN.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; and a storage device on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the method of any of the above embodiments.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the method of any of the above embodiments is implemented.
The speech synthesis method and device provided by the embodiments of the present application obtain the duration information, spectrum information, and fundamental frequency information corresponding to a target text to be converted into speech, and then input the obtained prediction information into a pre-trained speech synthesis model to obtain synthesized speech corresponding to the target text, thereby improving the accuracy of the synthesized speech.
Description of the drawings
Other features, objects, and advantages of the present application will become more apparent by reading the following detailed description of non-restrictive embodiments with reference to the accompanying drawings:
Fig. 1 is an exemplary system architecture diagram to which an embodiment of the present application may be applied;
Fig. 2 is a flowchart of one embodiment of the speech synthesis method according to the present application;
Fig. 3 is a schematic diagram of an application scenario of the speech synthesis method according to the present application;
Fig. 4 is a flowchart of another embodiment of the speech synthesis method according to the present application;
Fig. 5 is a structural schematic diagram of one embodiment of the speech synthesis device according to the present application;
Fig. 6 is a structural schematic diagram of a computer system suitable for implementing a server of embodiments of the present application.
Detailed description of the embodiments
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are used only to explain the related invention, not to limit it. It should also be noted that, for convenience of description, only the parts related to the invention are shown in the drawings.

It should be noted that, in the absence of conflict, the embodiments of the present application and the features in the embodiments may be combined with each other. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 in which embodiments of the speech synthesis method or speech synthesis device of the present application may be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
Speech devices (such as microphones and loudspeakers), text input applications, speech synthesis applications, and the like may be installed on the terminal devices 101, 102, 103. Users may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104, so as to receive or send messages.
The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices having a display screen and supporting data input, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above. They may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module. No specific limitation is made here.
The server 105 may be a server providing various services, for example a speech synthesis server that analyzes target text sent by the terminal devices 101, 102, 103 and generates synthesized speech matching the target text. The speech synthesis server may analyze the obtained target text to determine the predicted spectrum information, predicted fundamental frequency information, and predicted duration information corresponding to the target text, and then perform synthesis processing on the determined information, so as to generate synthesized speech corresponding to the target text.
It should be noted that the speech synthesis method provided by the embodiments of the present application is generally executed by the server 105; accordingly, the speech synthesis device is generally arranged in the server 105.
It should be pointed out that the text of the speech to be synthesized may also be stored locally on the server 105, and the server 105 may directly extract the local text of the speech to be synthesized; in this case, the exemplary system architecture 100 may not include the terminal devices 101, 102, 103 and the network 104.
It should be noted that the server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module. No specific limitation is made here.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. Depending on implementation needs, there may be any number of terminal devices, networks, and servers.
Continuing to refer to Fig. 2, a flow 200 of one embodiment of the speech synthesis method according to the present application is shown. The speech synthesis method includes the following steps:
Step 201: obtain the prediction information corresponding to the target text to be converted into speech.
In the present embodiment, the executing body of the speech synthesis method (such as the server shown in Fig. 1) may receive, by a wired or wireless connection, the target text for speech synthesis sent by a terminal device. The executing body may then obtain the prediction information corresponding to the target text to be converted into speech. Here, the prediction information may include: the predicted duration information, the predicted spectrum information, and the predicted fundamental frequency information corresponding to the target text. The predicted duration information corresponding to the target text is the duration information corresponding to each phoneme in the phoneme sequence obtained after converting the target text into a phoneme sequence. A phoneme is the smallest unit of speech, usually analyzed according to the articulatory actions within a syllable; one articulatory action constitutes one phoneme. Phonemes are generally divided into two broad classes, vowels and consonants. Since the pronunciation of each word or letter in a text generally includes multiple articulatory actions, each word or letter includes multiple phonemes. Thus, a target text usually comprises multiple consecutive phonemes ordered by pronunciation time, and these phonemes constitute a phoneme sequence.
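As a minimal illustration of the conversion described above, the following sketch maps a text to its phoneme sequence using a tiny hand-written lexicon. The lexicon entries and phoneme symbols are hypothetical stand-ins for a real grapheme-to-phoneme dictionary or model.

```python
# Hypothetical pronunciation lexicon: word -> list of phoneme symbols
# (ARPAbet-like labels, invented here for illustration only).
LEXICON = {
    "my": ["M", "AY"],
    "motherland": ["M", "AH", "DH", "ER", "L", "AE", "N", "D"],
}

def text_to_phonemes(text):
    """Concatenate per-word phonemes in pronunciation order."""
    phonemes = []
    for word in text.lower().split():
        if word not in LEXICON:
            raise KeyError(f"no pronunciation for {word!r}")
        phonemes.extend(LEXICON[word])
    return phonemes

if __name__ == "__main__":
    # The resulting phoneme sequence is what the duration model consumes.
    print(text_to_phonemes("my motherland"))
```

A real system would also attach stress, tone, and position features to each phoneme before feeding the sequence to the prediction models.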
In the present embodiment, the spectrum of speech is commonly used to express the relationship between sound frequency and loudness. In general, different words are produced at different articulation positions, so the frequency of vocal cord vibration and the loudness of pronunciation are embodied in the time-domain speech waveform. The spectrum is the embodiment of the speech information in the frequency domain, obtained by applying a Fourier transform to the continuous speech waveform, which contains multiple frequencies and amplitudes. Thus, by performing spectrum prediction on the audio corresponding to the text, the pronunciation of each word in the text can be simulated.
In the present embodiment, speech does not consist of a single frequency; it is formed by the superposition of simple harmonic oscillations of many frequencies. Through superposition, the multiple oscillation frequencies form multiple peaks of different amplitudes. Among these peaks, the first peak is the fundamental tone, and the frequency of the first peak is the fundamental frequency. The fundamental frequency of speech thus determines the pitch of the speech.
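The superposition described above can be illustrated numerically: the sketch below builds a signal from simple harmonic oscillations at 100, 200, and 300 Hz and locates the first spectral peak, whose frequency is the fundamental frequency. The naive discrete Fourier transform and the half-of-peak threshold are illustrative choices, not part of the patent.

```python
import math

SR = 1000   # sample rate (Hz)
N = 500     # half a second of signal; DFT bin k corresponds to k * SR / N Hz

# Superposition of simple harmonic oscillations at 100, 200, and 300 Hz
# with decreasing amplitudes, standing in for a voiced speech frame.
signal = [
    1.00 * math.sin(2 * math.pi * 100 * n / SR)
    + 0.50 * math.sin(2 * math.pi * 200 * n / SR)
    + 0.25 * math.sin(2 * math.pi * 300 * n / SR)
    for n in range(N)
]

def dft_magnitude(x, k):
    """Magnitude of the k-th bin of a naive discrete Fourier transform."""
    n = len(x)
    re = sum(x[i] * math.cos(2 * math.pi * k * i / n) for i in range(n))
    im = sum(x[i] * math.sin(2 * math.pi * k * i / n) for i in range(n))
    return math.hypot(re, im)

def fundamental_frequency(x, sr):
    """Frequency of the first (lowest-frequency) dominant spectral peak."""
    n = len(x)
    mags = [dft_magnitude(x, k) for k in range(1, n // 2)]
    peak = max(mags)
    for k, m in enumerate(mags, start=1):
        if m > 0.5 * peak:   # first bin carrying a dominant share of energy
            return k * sr / n
    return 0.0

print(fundamental_frequency(signal, SR))  # 100.0 -- the first peak
```

A production system would use an FFT library instead of this O(N²) transform; the point here is only that the lowest strong peak, not the largest one, defines the fundamental.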
In some optional implementations of the present embodiment, each phoneme in the phoneme sequence corresponding to the target text has a pronunciation duration, and since different pronunciation durations of the same phoneme exhibit different tones, the meanings expressed are usually also different. Therefore, the executing body may use a pre-trained duration prediction model to perform duration prediction on the target text to be converted into speech, so as to obtain the predicted duration information corresponding to each phoneme in the phoneme sequence corresponding to the target text. Here, the duration prediction model can be used to characterize the correspondence between target texts and duration information. The duration prediction model may be a convolutional neural network. A pre-trained convolutional neural network can perform various feature extractions on the phoneme sequence, and can determine the predicted duration information corresponding to each phoneme according to features learned in advance, such as the relationship between each phoneme and its adjacent phonemes and the position at which each phoneme occurs in the phoneme sequence.
The duration prediction model can be trained as follows: the sample texts and duration information in the training sample set are used as the input and desired output of the duration prediction model, respectively, and the duration prediction model is trained by the method of machine learning. Specifically:

First, a training sample set is obtained. Here, each training sample in the training sample set may include a sample text and sample audio information corresponding to the sample text; the sample audio information may include the duration information of each phoneme in the phoneme sequence corresponding to the sample text.
Then, the executing body may select training samples from the training sample set and execute the following training steps. First, the sample text in each selected training sample is used as the input of an initial convolutional neural network, and the duration information of each phoneme in the phoneme sequence corresponding to the sample text is used as the desired output; the initial convolutional neural network is trained to obtain predicted values of the duration information of each phoneme in the phoneme sequence. Then, based on a preset loss function, it is determined whether the loss value of the preset loss function reaches a preset target value. In response to determining that the loss value of the preset loss function reaches the preset target value, it can be determined that training of the initial neural network is complete, and the trained initial neural network is determined to be the duration prediction model. Here, the preset loss function can be used to characterize the difference between the predicted duration information of each phoneme in the phoneme sequence and the labeled duration information.
In response to determining that the loss value of the preset loss function has not reached the preset target value, the executing body adjusts the parameters of the initial convolutional neural network (for example, the number of convolutional layers or the size of the convolution kernels), selects samples again from the training sample set, and continues to execute the above training steps with the adjusted initial convolutional neural network as the initial convolutional neural network.
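The train-check-adjust cycle described above can be sketched as a loop that keeps updating parameters until the loss reaches the preset target value. For the sake of a self-contained example, the model below is a toy linear duration predictor trained by gradient descent rather than the convolutional neural network of the embodiment, and the (feature, duration) pairs are invented.

```python
# Hypothetical training set: (phoneme feature, labeled duration in ms).
samples = [(0.2, 54.0), (0.5, 90.0), (0.8, 126.0), (1.0, 150.0)]

w, b = 0.0, 0.0      # initial model parameters
LR = 0.1             # learning rate
TARGET_LOSS = 1.0    # preset target value for the loss function

def mse(w, b):
    """Preset loss: mean squared difference between predicted and labeled durations."""
    return sum((w * x + b - y) ** 2 for x, y in samples) / len(samples)

steps = 0
while mse(w, b) > TARGET_LOSS and steps < 10_000:
    # Loss target not reached: adjust the parameters (gradient of the MSE)
    # and run the training step again.
    gw = sum(2 * (w * x + b - y) * x for x, y in samples) / len(samples)
    gb = sum(2 * (w * x + b - y) for x, y in samples) / len(samples)
    w, b = w - LR * gw, b - LR * gb
    steps += 1

print(f"training complete after {steps} steps, loss = {mse(w, b):.4f}")
```

The patent's adjustment step (changing layer counts or kernel sizes) is a hyperparameter change rather than a gradient update; the loop structure, with a loss threshold deciding when training is complete, is the same.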
In some optional implementations of the present embodiment, the executing body may use a pre-trained spectrum prediction model to perform spectrum prediction on the speech corresponding to the target text, so as to obtain the predicted spectrum information corresponding to the target text to be converted into speech. Here, the predicted spectrum information may include the formants and the spectral envelope of the spectrum. The spectrum prediction model can be used to characterize the correspondence between target texts and predicted spectrum information. The spectrum prediction model may be a convolutional neural network. A pre-trained convolutional neural network can perform spectrum prediction on the target text according to features learned in advance, such as the pronunciation annotation information corresponding to each character or word in the text and the relationship between adjacent words, so as to obtain the predicted spectrum information. Here, a word may refer to a character, word, vowel, or consonant in Chinese, or to a word or the letters composing a word in other languages (such as English or French). No limitation is made here.
The spectrum prediction model can be trained as follows: the sample texts in the training sample set are used as the input of an initial spectrum prediction model, the spectrum information extracted from the sample speech corresponding to each sample text is used as the desired output, and the spectrum prediction model is trained by the method of machine learning. Specifically:

First, a training sample set is obtained. Here, each training sample in the training sample set may include a sample text and the spectrum information of the sample audio corresponding to the sample text.
Then, the executing body may select training samples from the training sample set and execute the following training steps. First, the sample text in each selected training sample is used as the input of an initial convolutional neural network, and the spectrum information of the sample audio corresponding to the sample text is used as the desired output; the initial convolutional neural network is trained to obtain the spectrum information corresponding to the sample text. Then, based on a preset loss function, it is determined whether the loss value of the preset loss function reaches a preset target value. In response to determining that the loss value of the preset loss function reaches the preset target value, it can be determined that training of the initial neural network is complete, and the trained initial neural network is determined to be the spectrum prediction model. Here, the loss function can be used to characterize the difference between the spectrum information actually output by the initial convolutional neural network and the spectrum information of the desired output.
In response to determining that the loss value of the preset loss function has not reached the preset target value, the executing body adjusts the parameters of the initial convolutional neural network (for example, the number of convolutional layers or the size of the convolution kernels), selects samples again from the training sample set, and continues to execute the above training steps with the adjusted initial convolutional neural network as the initial convolutional neural network.
In some optional implementations of the present embodiment, the executing body may use a pre-trained pitch prediction model to perform pitch prediction on the target text, so as to obtain the predicted fundamental frequency information of the target text. Here, the pitch prediction model can be used to characterize the correspondence between target texts and predicted fundamental frequency information. The pitch prediction model may be a convolutional neural network.
In some optional implementations of the present embodiment, the pitch prediction model can be trained as follows: the sample texts in the training sample set are used as the input of an initial pitch prediction model, the fundamental frequency information extracted from the sample speech corresponding to each sample text is used as the desired output, and the pitch prediction model is trained by the method of machine learning. Here, methods such as the autocorrelation algorithm, parallel processing method, cepstrum method, and simplified inverse filtering method can be used to extract the fundamental frequency from speech.
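Of the extraction methods listed above, the autocorrelation algorithm is the simplest to sketch: the fundamental period is the lag at which a voiced frame best correlates with itself. The frame below is a synthetic 200 Hz tone standing in for real speech; production systems add windowing, voicing decisions, and peak interpolation on top of this.

```python
import math

SR = 8000  # sample rate (Hz)

def autocorr_f0(frame, sr, f0_min=50, f0_max=500):
    """Estimate F0 as the lag of the strongest autocorrelation peak."""
    lag_min = sr // f0_max   # shortest candidate period, in samples
    lag_max = sr // f0_min   # longest candidate period, in samples
    n = len(frame)
    best_lag, best_r = lag_min, float("-inf")
    for lag in range(lag_min, min(lag_max, n - 1) + 1):
        # Correlation of the frame with itself shifted by `lag` samples.
        r = sum(frame[i] * frame[i + lag] for i in range(n - lag))
        if r > best_r:
            best_r, best_lag = r, lag
    return sr / best_lag     # period in samples -> frequency in Hz

# 50 ms synthetic voiced frame at 200 Hz (period = 40 samples at 8 kHz).
frame = [math.sin(2 * math.pi * 200 * n / SR) for n in range(400)]
print(autocorr_f0(frame, SR))  # 200.0
```

The `f0_min`/`f0_max` bounds restrict the search to the plausible pitch range of human speech, which also keeps the algorithm from locking onto a harmonic.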
Step 202: input the obtained prediction information corresponding to the target text into a pre-trained speech synthesis model to obtain synthesized speech corresponding to the target text.
In the present embodiment, the executing body may use the pre-trained speech synthesis model to perform speech synthesis on the obtained prediction information corresponding to the target text, so as to obtain synthesized speech corresponding to the target text. Here, the speech synthesis model can be used to characterize the correspondence between prediction information and synthesized speech.
In the present embodiment, the speech synthesis model may be, for example, a convolutional neural network. This network can perform feature extraction through convolution kernels, place each phoneme in the phoneme sequence in one-to-one correspondence with the predicted spectrum information and predicted fundamental frequency information, and determine, based on the pronunciation duration information of each phoneme, the segment of predicted spectrum information and the segment of predicted fundamental frequency information corresponding to that phoneme. Finally, the synthesized speech is generated based on the segments of predicted spectrum information and predicted fundamental frequency information corresponding to each phoneme in the phoneme sequence.
In some optional implementations of this embodiment, the speech synthesis model may also be a WaveRNN.
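The duration-based correspondence between phonemes and segments of frame-level predictions described above can be sketched as follows; the helper name, durations, and feature values are invented purely for illustration and do not come from the application:

```python
import numpy as np

def align_features(durations_frames, frame_features):
    """Split a frame-level feature sequence into per-phoneme segments,
    where durations_frames[i] is the number of frames of phoneme i."""
    assert sum(durations_frames) == len(frame_features)
    segments, start = [], 0
    for d in durations_frames:
        segments.append(frame_features[start:start + d])
        start += d
    return segments

# Three phonemes lasting 2, 3, and 1 frames; 6 frames of 2-dim features
# standing in for (predicted spectrum, predicted F0) pairs.
durations = [2, 3, 1]
features = np.arange(6 * 2).reshape(6, 2)
segments = align_features(durations, features)
```

Each phoneme thus owns exactly the span of predicted spectrum and fundamental frequency frames that its predicted duration dictates.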
With continued reference to Fig. 3, Fig. 3 is a schematic diagram of an application scenario of the speech synthesis method according to this embodiment. In the application scenario of Fig. 3, a user sends, through a computer 301, a request to a server 302 to convert the text "my motherland" into speech. Upon receiving the request, the server 302 may obtain the prediction information corresponding to the text "my motherland", the prediction information including predicted duration information, predicted spectrum information, and predicted fundamental frequency information. Then, the server 302 inputs the obtained predicted spectrum information, predicted fundamental frequency information, and predicted duration information into a pre-trained speech synthesis model 303, thereby obtaining synthesized speech corresponding to the text "my motherland", which is output through a loudspeaker 304.
The speech synthesis method and apparatus provided by the embodiments of the present application obtain the duration information, spectrum information, and fundamental frequency information corresponding to a target text to be converted into speech, and then input the obtained prediction information corresponding to the target text into a pre-trained speech synthesis model, obtaining synthesized speech corresponding to the target text, thereby improving the accuracy of the synthesized speech.
With further reference to Fig. 4, it illustrates a flow 400 of an embodiment of a training method for the speech synthesis model according to the present application. The flow 400 includes the following steps:
Step 401: obtain a training sample set, wherein each training sample includes a sample text and sample speech information corresponding to the sample text, the sample speech information includes acoustic feature information and a speech waveform, and the acoustic feature information includes fundamental frequency information, spectrum information, and duration information.
In this embodiment, the execution body of the speech synthesis method (e.g., the server shown in Fig. 1) may obtain the training sample set, through a wired or wireless connection, from a storage server storing sample texts and sample speech information. Here, a training sample in the training sample set may include a sample text and sample speech information corresponding to the sample text. The sample speech information may include a speech waveform, which may be obtained by recording a natural person reading the text aloud. The sample speech information may also include acoustic feature information of the sample speech. Here, the acoustic feature information may include fundamental frequency information, spectrum information, and duration information.
In this embodiment, the fundamental frequency information is typically obtained by performing fundamental frequency extraction on the sample speech. The extraction method may include, for example, the autocorrelation algorithm, parallel processing, the cepstrum method, or simplified inverse filtering. The phoneme sequence corresponding to a sample text is typically obtained by segmenting the sample text using a hidden Markov model. The spectrum information is obtained by performing spectrum prediction on the sample text using a pre-trained spectrum prediction model.
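As a hedged sketch of how frame-level spectrum information might be computed from a recorded sample waveform (the application does not fix a particular transform; a short-time Fourier transform with illustrative frame and hop sizes is assumed here, and the function name is invented):

```python
import numpy as np

def magnitude_spectrogram(wave, frame_len=512, hop=128):
    """Frame the waveform, apply a Hann window, and take the magnitude
    of the real FFT of each frame (the non-redundant half-spectrum)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(wave) - frame_len) // hop
    frames = np.stack([wave[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# A random waveform of 4096 samples gives 29 frames of 257 magnitude bins
# (frame_len // 2 + 1 bins per frame).
wave = np.random.default_rng(0).standard_normal(4096)
spec = magnitude_spectrogram(wave)
```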
Step 402: use the acoustic feature information in the sample speech information corresponding to the sample texts in the training sample set as the input of the speech synthesis model, use the sample speech waveforms as the desired output, and train the speech synthesis model using a machine learning method.
In this embodiment, the execution body may select training samples from the training sample set and perform the following training steps:
First, the acoustic feature information in the sample speech information of each selected training sample is used as the input of an initial convolutional neural network, the sample speech waveform corresponding to the sample text is used as the desired output, and the initial convolutional neural network is trained, obtaining a predicted speech waveform corresponding to the sample text. Then, based on a preset loss function, it is determined whether the loss value of the preset loss function reaches a preset target value. In response to determining that the loss value of the preset loss function reaches the preset target value, it may be determined that training of the initial convolutional neural network is complete, and the trained network is determined as the speech synthesis model. Here, the preset loss function may be used to characterize the difference between the predicted speech waveform and the sample speech waveform.
In response to determining that the loss value of the preset loss function does not reach the preset target value, the execution body adjusts the parameters of the initial convolutional neural network, selects samples from the training sample set again, and continues the above training steps with the adjusted network as the initial convolutional neural network. Here, adjusting the parameters of the initial convolutional neural network may include, for example, adjusting the number of convolutional layers and the size of the convolution kernels.
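The loop described above, training until a preset loss function reaches a preset target value and otherwise adjusting parameters and repeating, can be sketched with a toy stand-in model; a linear model on synthetic data replaces the convolutional network here, and all names and numbers are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal((64, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = x @ true_w                       # stand-in for the sample speech waveform

w = np.zeros(4)                      # initial parameters
target_loss = 1e-4                   # preset target value
for step in range(10000):
    pred = x @ w                     # stand-in for the predicted waveform
    loss = np.mean((pred - y) ** 2)  # preset loss: mean squared difference
    if loss <= target_loss:          # target reached: training is complete
        break
    # Otherwise adjust the parameters (one gradient-descent step) and repeat.
    w -= 0.05 * (2 / len(x)) * x.T @ (pred - y)
```

In the patent's setting, "adjusting parameters" may also mean changing the architecture itself (layer count, kernel size), which this gradient step does not capture.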
As can be seen from Fig. 4, unlike the embodiment shown in Fig. 2, this embodiment highlights the training steps of the speech synthesis model, so that the synthesized speech is more accurate.
With further reference to Fig. 5, as an implementation of the method shown in Fig. 4 above, the present application provides an embodiment of a speech synthesis apparatus. The apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus may be applied to various electronic devices.
As shown in Fig. 5, the speech synthesis apparatus 500 of this embodiment includes an acquiring unit 501 and a synthesis unit 502. The acquiring unit 501 is configured to obtain prediction information corresponding to a target text to be converted into speech, the prediction information including: predicted duration information corresponding to the target text, predicted spectrum information corresponding to the target text, and predicted fundamental frequency information corresponding to the target text. The synthesis unit 502 is configured to input the obtained prediction information corresponding to the target text into a pre-trained speech synthesis model, obtaining synthesized speech corresponding to the target text.
In this embodiment, for the specific processing of the acquiring unit 501 and the synthesis unit 502 in the speech synthesis apparatus 500, and the advantageous effects brought thereby, reference may be made to the related descriptions of the implementations of step 201 and step 202 in the embodiment corresponding to Fig. 2, which are not repeated here.
In some optional implementations of this embodiment, the speech synthesis model is trained as follows: a training sample set is obtained, wherein each training sample includes a sample text and sample speech information corresponding to the sample text, the sample speech information includes acoustic feature information and a sample speech waveform, and the acoustic feature information includes fundamental frequency information, spectrum information, and duration information, wherein the fundamental frequency information is obtained by performing fundamental frequency extraction on the sample speech, and the spectrum information is obtained by performing spectrum prediction on the sample text using a pre-trained spectrum prediction model; then the acoustic feature information in the sample speech information corresponding to the sample texts in the training sample set is used as the input of the speech synthesis model, the sample speech waveforms are used as the desired output, and the speech synthesis model is obtained by training with a machine learning method.
In some optional implementations of this embodiment, the spectrum information corresponding to the target text is obtained by using a pre-trained spectrum prediction model to predict the spectrum of the speech corresponding to the target text; and the spectrum prediction model is trained as follows: the sample texts in the training sample set are used as the input of an initial spectrum prediction model, the spectrum information extracted from the sample speech corresponding to each sample text is used as the desired output, and the spectrum prediction model is obtained by training with a machine learning method.
In some optional implementations of this embodiment, the duration information corresponding to the target text is obtained by using a pre-trained duration prediction model to predict the duration of the speech corresponding to the target text; and the duration prediction model is trained as follows: the sample texts and duration information in the training sample set are used as the input and desired output of the duration prediction model, respectively, and the duration prediction model is obtained by training with a machine learning method.
In some optional implementations of this embodiment, the fundamental frequency information corresponding to the target text is obtained by using a pitch prediction model to predict the fundamental frequency of the speech corresponding to the target text, and the pitch prediction model is trained as follows: the sample texts in the training sample set are used as the input of an initial pitch prediction model, the fundamental frequency information extracted from the sample speech corresponding to each sample text is used as the desired output, and the pitch prediction model is obtained by training with a machine learning method.
Referring now to Fig. 6, it shows a schematic structural diagram of a computer system 600 of an electronic device (e.g., the server shown in Fig. 1) suitable for implementing the embodiments of the present application. The electronic device shown in Fig. 6 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in Fig. 6, the computer system 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage portion 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required by the operations of the system 600. The CPU 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, etc.; an output portion 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, etc.; a storage portion 608 including a hard disk, etc.; and a communication portion 609 including a network interface card such as a LAN card or a modem. The communication portion 609 performs communication processing via a network such as the Internet. A driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the driver 610 as needed, so that a computer program read therefrom is installed into the storage portion 608 as needed.
In particular, according to the embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which comprises a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 609, and/or installed from the removable medium 611. When the computer program is executed by the central processing unit (CPU) 601, the above-mentioned functions defined in the method of the present application are performed. It should be noted that the computer-readable medium in the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example but not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, the computer-readable storage medium may be any tangible medium containing or storing a program, and the program may be used by, or used in combination with, an instruction execution system, apparatus, or device. In the present application, the computer-readable signal medium may include a data signal propagated in a baseband or as a part of a carrier wave, in which computer-readable program code is carried. The propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium, and the computer-readable medium can send, propagate, or transmit a program for use by, or use in combination with, an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any suitable medium, including but not limited to: wireless, wire, optical cable, RF, etc., or any suitable combination of the above.
The computer program code for performing the operations of the present application may be written in one or more programming languages or a combination thereof. The programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, and also include conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on a user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In a scenario involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the architectures, functions, and operations that may be implemented according to the systems, methods, and computer program products of the various embodiments of the present application. In this regard, each box in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which comprises one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions denoted by the boxes may occur in a sequence different from that shown in the drawings. For example, two successive boxes may in practice be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and combinations of boxes in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system performing the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by software or by hardware. The described units may also be provided in a processor, for example, described as: a processor comprising an acquiring unit and a synthesis unit. The names of these units do not in some cases constitute a limitation to the units themselves. For example, the acquiring unit may also be described as "a unit for obtaining prediction information corresponding to a target text to be converted into speech".
In another aspect, the present application further provides a computer-readable medium. The computer-readable medium may be included in the electronic device described in the above embodiments, or may exist alone without being assembled into the electronic device. The computer-readable medium carries one or more programs. When the one or more programs are executed by the electronic device, the electronic device: obtains prediction information corresponding to a target text to be converted into speech, the prediction information including predicted duration information corresponding to the target text, predicted spectrum information corresponding to the target text, and predicted fundamental frequency information corresponding to the target text; and inputs the obtained prediction information corresponding to the target text into a pre-trained speech synthesis model, obtaining synthesized speech corresponding to the target text.
The above description only provides an explanation of the preferred embodiments of the present application and the technical principles employed. It should be appreciated by those skilled in the art that the inventive scope of the present application is not limited to the technical solutions formed by the particular combinations of the above-described technical features; it should also cover other technical solutions formed by any combination of the above-described technical features or equivalent features thereof without departing from the inventive concept, for example, technical solutions formed by replacing the features disclosed herein with (but not limited to) technical features having similar functions.
Claims (14)
1. A speech synthesis method, comprising:
obtaining prediction information corresponding to a target text to be converted into speech, the prediction information including: predicted duration information corresponding to the target text, predicted spectrum information corresponding to the target text, and predicted fundamental frequency information corresponding to the target text; and
inputting the obtained prediction information corresponding to the target text into a pre-trained speech synthesis model, obtaining synthesized speech corresponding to the target text.
2. The method according to claim 1, wherein the speech synthesis model is trained as follows:
obtaining a training sample set, wherein each training sample includes a sample text and sample speech information corresponding to the sample text, the sample speech information includes acoustic feature information and a sample speech waveform, and the acoustic feature information includes fundamental frequency information, spectrum information, and duration information, wherein the fundamental frequency information is obtained by performing fundamental frequency extraction on the sample speech, and the spectrum information is obtained by performing spectrum prediction on the sample text using a pre-trained spectrum prediction model; and
using the acoustic feature information in the sample speech information corresponding to the sample texts in the training sample set as an input of the speech synthesis model, using the sample speech waveforms as a desired output, and training the speech synthesis model using a machine learning method.
3. The method according to claim 2, wherein the spectrum information corresponding to the target text is obtained by using the pre-trained spectrum prediction model to predict the spectrum of the speech corresponding to the target text; and
the spectrum prediction model is trained as follows:
using the sample texts in the training sample set as an input of an initial spectrum prediction model, using the spectrum information extracted from the sample speech corresponding to each sample text as a desired output, and training the spectrum prediction model using a machine learning method.
4. The method according to claim 2, wherein the duration information corresponding to the target text is obtained by using a pre-trained duration prediction model to predict the duration of the speech corresponding to the target text; and
the duration prediction model is trained as follows:
using the sample texts and the duration information in the training sample set as an input and a desired output of the duration prediction model, respectively, and training the duration prediction model using a machine learning method.
5. The method according to claim 2, wherein the fundamental frequency information corresponding to the target text is obtained by using a pitch prediction model to predict the fundamental frequency of the speech corresponding to the target text; and
the pitch prediction model is trained as follows:
using the sample texts in the training sample set as an input of an initial pitch prediction model, using the fundamental frequency information extracted from the sample speech corresponding to each sample text as a desired output, and training the pitch prediction model using a machine learning method.
6. The method according to any one of claims 1-5, wherein the type of the speech synthesis model is WaveRNN.
7. A speech synthesis apparatus, comprising:
an acquiring unit, configured to obtain prediction information corresponding to a target text to be converted into speech, the prediction information including: predicted duration information corresponding to the target text, predicted spectrum information corresponding to the target text, and predicted fundamental frequency information corresponding to the target text; and
a synthesis unit, configured to input the obtained prediction information corresponding to the target text into a pre-trained speech synthesis model, obtaining synthesized speech corresponding to the target text.
8. The apparatus according to claim 7, wherein the speech synthesis model is trained as follows:
obtaining a training sample set, wherein each training sample includes a sample text and sample speech information corresponding to the sample text, the sample speech information includes acoustic feature information and a sample speech waveform, and the acoustic feature information includes fundamental frequency information, spectrum information, and duration information, wherein the fundamental frequency information is obtained by performing fundamental frequency extraction on the sample speech, and the spectrum information is obtained by performing spectrum prediction on the sample text using a pre-trained spectrum prediction model; and
using the acoustic feature information in the sample speech information corresponding to the sample texts in the training sample set as an input of the speech synthesis model, using the sample speech waveforms as a desired output, and training the speech synthesis model using a machine learning method.
9. The apparatus according to claim 8, wherein the spectrum information corresponding to the target text is obtained by using the pre-trained spectrum prediction model to predict the spectrum of the speech corresponding to the target text; and
the spectrum prediction model is trained as follows:
using the sample texts in the training sample set as an input of an initial spectrum prediction model, using the spectrum information extracted from the sample speech corresponding to each sample text as a desired output, and training the spectrum prediction model using a machine learning method.
10. The apparatus according to claim 8, wherein the duration information corresponding to the target text is obtained by using a pre-trained duration prediction model to predict the duration of the speech corresponding to the target text; and
the duration prediction model is trained as follows:
using the sample texts and the duration information in the training sample set as an input and a desired output of the duration prediction model, respectively, and training the duration prediction model using a machine learning method.
11. The apparatus according to claim 8, wherein the fundamental frequency information corresponding to the target text is obtained by using a pitch prediction model to predict the fundamental frequency of the speech corresponding to the target text; and
the pitch prediction model is trained as follows:
using the sample texts in the training sample set as an input of an initial pitch prediction model, using the fundamental frequency information extracted from the sample speech corresponding to each sample text as a desired output, and training the pitch prediction model using a machine learning method.
12. The apparatus according to any one of claims 7-11, wherein the type of the speech synthesis model is WaveRNN.
13. An electronic device, comprising:
one or more processors; and
a storage device, storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-6.
14. A computer-readable medium, storing a computer program, wherein the program, when executed by a processor, implements the method according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811061208.9A CN108806665A (en) | 2018-09-12 | 2018-09-12 | Phoneme synthesizing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108806665A true CN108806665A (en) | 2018-11-13 |
Family
ID=64082342
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811061208.9A Pending CN108806665A (en) | 2018-09-12 | 2018-09-12 | Phoneme synthesizing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108806665A (en) |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109599092A (en) * | 2018-12-21 | 2019-04-09 | 秒针信息技术有限公司 | A kind of audio synthetic method and device |
CN109801608A (en) * | 2018-12-18 | 2019-05-24 | 武汉西山艺创文化有限公司 | A kind of song generation method neural network based and system |
CN110473515A (en) * | 2019-08-29 | 2019-11-19 | 郝洁 | A kind of end-to-end speech synthetic method based on WaveRNN |
CN110751940A (en) * | 2019-09-16 | 2020-02-04 | 百度在线网络技术(北京)有限公司 | Method, device, equipment and computer storage medium for generating voice packet |
CN111192566A (en) * | 2020-03-03 | 2020-05-22 | 云知声智能科技股份有限公司 | English speech synthesis method and device |
CN111402855A (en) * | 2020-03-06 | 2020-07-10 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN111429881A (en) * | 2020-03-19 | 2020-07-17 | 北京字节跳动网络技术有限公司 | Sound reproduction method, device, readable medium and electronic equipment |
CN111583904A (en) * | 2020-05-13 | 2020-08-25 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN111599338A (en) * | 2020-04-09 | 2020-08-28 | 云知声智能科技股份有限公司 | Stable and controllable end-to-end speech synthesis method and device |
CN111599343A (en) * | 2020-05-14 | 2020-08-28 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating audio |
CN111613224A (en) * | 2020-04-10 | 2020-09-01 | 云知声智能科技股份有限公司 | Personalized voice synthesis method and device |
CN111785247A (en) * | 2020-07-13 | 2020-10-16 | 北京字节跳动网络技术有限公司 | Voice generation method, device, equipment and computer readable medium |
CN111899720A (en) * | 2020-07-30 | 2020-11-06 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating audio |
CN112289298A (en) * | 2020-09-30 | 2021-01-29 | 北京大米科技有限公司 | Processing method and device for synthesized voice, storage medium and electronic equipment |
CN112420015A (en) * | 2020-11-18 | 2021-02-26 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio synthesis method, device, equipment and computer readable storage medium |
CN112530400A (en) * | 2020-11-30 | 2021-03-19 | 清华珠三角研究院 | Method, system, device and medium for generating voice based on text of deep learning |
CN112542153A (en) * | 2020-12-02 | 2021-03-23 | 北京沃东天骏信息技术有限公司 | Duration prediction model training method and device, and speech synthesis method and device |
CN112767957A (en) * | 2020-12-31 | 2021-05-07 | 科大讯飞股份有限公司 | Method for obtaining prediction model, method for predicting voice waveform and related device |
CN112786013A (en) * | 2021-01-11 | 2021-05-11 | 北京有竹居网络技术有限公司 | Voice synthesis method and device based on album, readable medium and electronic equipment |
CN112951204A (en) * | 2021-03-29 | 2021-06-11 | 北京大米科技有限公司 | Speech synthesis method and device |
CN113012680A (en) * | 2021-03-03 | 2021-06-22 | 北京太极华保科技股份有限公司 | Speech technology synthesis method and device for speech robot |
CN113299272A (en) * | 2020-02-06 | 2021-08-24 | 菜鸟智能物流控股有限公司 | Speech synthesis model training method, speech synthesis apparatus, and storage medium |
CN113763924A (en) * | 2021-11-08 | 2021-12-07 | 北京优幕科技有限责任公司 | Acoustic deep learning model training method, and voice generation method and device |
CN113823256A (en) * | 2020-06-19 | 2021-12-21 | 微软技术许可有限责任公司 | Self-generated text-to-speech (TTS) synthesis |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1835075A (en) * | 2006-04-07 | 2006-09-20 | 安徽中科大讯飞信息科技有限公司 | Speech synthetizing method combined natural sample selection and acaustic parameter to build mould |
CN102122505A (en) * | 2010-01-08 | 2011-07-13 | 王程程 | Modeling method for enhancing expressive force of text-to-speech (TTS) system |
CN105529023A (en) * | 2016-01-25 | 2016-04-27 | 百度在线网络技术(北京)有限公司 | Voice synthesis method and device |
EP3021318A1 (en) * | 2014-11-17 | 2016-05-18 | Samsung Electronics Co., Ltd. | Speech synthesis apparatus and control method thereof |
CN108182936A (en) * | 2018-03-14 | 2018-06-19 | 百度在线网络技术(北京)有限公司 | Voice signal generation method and device |
- 2018-09-12: Application CN201811061208.9A filed in China (CN); published as CN108806665A (en); status: Pending
Cited By (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109801608A (en) * | 2018-12-18 | 2019-05-24 | 武汉西山艺创文化有限公司 | A neural-network-based song generation method and system |
CN109599092A (en) * | 2018-12-21 | 2019-04-09 | 秒针信息技术有限公司 | A kind of audio synthetic method and device |
CN109599092B (en) * | 2018-12-21 | 2022-06-10 | 秒针信息技术有限公司 | Audio synthesis method and device |
CN110473515A (en) * | 2019-08-29 | 2019-11-19 | 郝洁 | An end-to-end speech synthesis method based on WaveRNN |
CN110751940A (en) * | 2019-09-16 | 2020-02-04 | 百度在线网络技术(北京)有限公司 | Method, device, equipment and computer storage medium for generating voice packet |
US11527233B2 (en) | 2019-09-16 | 2022-12-13 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method, apparatus, device and computer storage medium for generating speech packet |
CN113299272B (en) * | 2020-02-06 | 2023-10-31 | 菜鸟智能物流控股有限公司 | Speech synthesis model training and speech synthesis method, equipment and storage medium |
CN113299272A (en) * | 2020-02-06 | 2021-08-24 | 菜鸟智能物流控股有限公司 | Speech synthesis model training method, speech synthesis apparatus, and storage medium |
CN111192566A (en) * | 2020-03-03 | 2020-05-22 | 云知声智能科技股份有限公司 | English speech synthesis method and device |
CN111192566B (en) * | 2020-03-03 | 2022-06-24 | 云知声智能科技股份有限公司 | English speech synthesis method and device |
CN111402855A (en) * | 2020-03-06 | 2020-07-10 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN111402855B (en) * | 2020-03-06 | 2021-08-27 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN111429881B (en) * | 2020-03-19 | 2023-08-18 | 北京字节跳动网络技术有限公司 | Speech synthesis method and device, readable medium and electronic equipment |
CN111429881A (en) * | 2020-03-19 | 2020-07-17 | 北京字节跳动网络技术有限公司 | Sound reproduction method, device, readable medium and electronic equipment |
CN111599338A (en) * | 2020-04-09 | 2020-08-28 | 云知声智能科技股份有限公司 | Stable and controllable end-to-end speech synthesis method and device |
CN111613224A (en) * | 2020-04-10 | 2020-09-01 | 云知声智能科技股份有限公司 | Personalized voice synthesis method and device |
CN111583904A (en) * | 2020-05-13 | 2020-08-25 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN111583904B (en) * | 2020-05-13 | 2021-11-19 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN111599343B (en) * | 2020-05-14 | 2021-11-09 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating audio |
CN111599343A (en) * | 2020-05-14 | 2020-08-28 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating audio |
CN113823256A (en) * | 2020-06-19 | 2021-12-21 | 微软技术许可有限责任公司 | Self-generated text-to-speech (TTS) synthesis |
CN111785247A (en) * | 2020-07-13 | 2020-10-16 | 北京字节跳动网络技术有限公司 | Voice generation method, device, equipment and computer readable medium |
CN111899720B (en) * | 2020-07-30 | 2024-03-15 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating audio |
CN111899720A (en) * | 2020-07-30 | 2020-11-06 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating audio |
CN112289298A (en) * | 2020-09-30 | 2021-01-29 | 北京大米科技有限公司 | Processing method and device for synthesized voice, storage medium and electronic equipment |
CN112420015A (en) * | 2020-11-18 | 2021-02-26 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio synthesis method, device, equipment and computer readable storage medium |
CN112530400A (en) * | 2020-11-30 | 2021-03-19 | 清华珠三角研究院 | Method, system, device and medium for generating voice based on text of deep learning |
CN112542153A (en) * | 2020-12-02 | 2021-03-23 | 北京沃东天骏信息技术有限公司 | Duration prediction model training method and device, and speech synthesis method and device |
CN112767957A (en) * | 2020-12-31 | 2021-05-07 | 科大讯飞股份有限公司 | Method for obtaining prediction model, method for predicting voice waveform and related device |
CN112767957B (en) * | 2020-12-31 | 2024-05-31 | 中国科学技术大学 | Method for obtaining prediction model, prediction method of voice waveform and related device |
CN112786013A (en) * | 2021-01-11 | 2021-05-11 | 北京有竹居网络技术有限公司 | Voice synthesis method and device based on album, readable medium and electronic equipment |
CN113012680A (en) * | 2021-03-03 | 2021-06-22 | 北京太极华保科技股份有限公司 | Speech technology synthesis method and device for speech robot |
CN113012680B (en) * | 2021-03-03 | 2021-10-15 | 北京太极华保科技股份有限公司 | Speech technology synthesis method and device for speech robot |
CN112951204A (en) * | 2021-03-29 | 2021-06-11 | 北京大米科技有限公司 | Speech synthesis method and device |
CN113763924A (en) * | 2021-11-08 | 2021-12-07 | 北京优幕科技有限责任公司 | Acoustic deep learning model training method, and voice generation method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108806665A (en) | Phoneme synthesizing method and device | |
CN108597492B (en) | Phoneme synthesizing method and device | |
US10553201B2 (en) | Method and apparatus for speech synthesis | |
CN109036384B (en) | Audio recognition method and device | |
CN108182936B (en) | Voice signal generation method and device | |
WO2020073944A1 (en) | Speech synthesis method and device | |
CN111369971B (en) | Speech synthesis method, device, storage medium and electronic equipment | |
CN111276120B (en) | Speech synthesis method, apparatus and computer-readable storage medium | |
CN110033755A (en) | Phoneme synthesizing method, device, computer equipment and storage medium | |
CN108630190A (en) | Method and apparatus for generating phonetic synthesis model | |
CN111402843B (en) | Rap music generation method and device, readable medium and electronic equipment | |
CN110197655A (en) | Method and apparatus for synthesizing voice | |
CN112382270A (en) | Speech synthesis method, apparatus, device and storage medium | |
CN111477210A (en) | Speech synthesis method and device | |
CN111627420A (en) | Specific-speaker emotion voice synthesis method and device under extremely low resources | |
CN113421584B (en) | Audio noise reduction method, device, computer equipment and storage medium | |
CN107910005A (en) | The target service localization method and device of interaction text | |
CN113345416B (en) | Voice synthesis method and device and electronic equipment | |
CN113539239B (en) | Voice conversion method and device, storage medium and electronic equipment | |
JP2020013008A (en) | Voice processing device, voice processing program, and voice processing method | |
JP2022133447A (en) | Speech processing method and device, electronic apparatus, and storage medium | |
Płonkowski | Using bands of frequencies for vowel recognition for Polish language | |
CN114464163A (en) | Method, device, equipment, storage medium and product for training speech synthesis model | |
CN112382274A (en) | Audio synthesis method, device, equipment and storage medium | |
Matsumoto et al. | Speech-like emotional sound generation using wavenet |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20181113 |
|