CN110033755A - Speech synthesis method, device, computer equipment and storage medium - Google Patents
Speech synthesis method, device, computer equipment and storage medium
- Publication number
- CN110033755A (application CN201910328125.XA)
- Authority
- CN
- China
- Prior art keywords
- speech
- model
- samples
- acoustic
- acoustic feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/047—Architecture of speech synthesisers
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Signal Processing (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The invention discloses a speech synthesis method, apparatus, computer device and storage medium. The method includes: obtaining speech samples; performing feature extraction on the speech samples to obtain the acoustic feature sequences corresponding to the speech samples; training a preset acoustic model according to the speech samples and the corresponding acoustic feature sequences, and determining the acoustic model that meets a preset requirement after training as the speech synthesis model, the preset acoustic model being based on the WaveNet network; and obtaining speech parameters parsed from the text to be synthesized, inputting the speech parameters into the speech synthesis model, and obtaining the synthesized speech output by the speech synthesis model. The speech synthesis method provided by the invention improves the processing speed and accuracy of sound processing systems that simulate real human pronunciation.
Description
Technical field
The present invention relates to the field of speech synthesis, and in particular to a speech synthesis method, apparatus, computer device and storage medium.
Background technique
Existing text-to-speech (TTS) systems convert text into speech. Although they are highly efficient, the speech they produce still differs considerably from real human speech.
Some existing TTS systems do include sound processing components that synthesize speech simulating real human pronunciation (recordings); however, such systems process slowly and have a high error rate.
Summary of the invention
Based on this, in view of the above technical problems, it is necessary to provide a speech synthesis method, apparatus, computer device and storage medium that improve the processing speed and accuracy of sound processing systems simulating real human pronunciation.
A speech synthesis method, comprising:
obtaining speech samples;
performing feature extraction on the speech samples to obtain the acoustic feature sequences corresponding to the speech samples;
training a preset acoustic model according to the speech samples and the corresponding acoustic feature sequences, and determining the acoustic model that meets a preset requirement after training as the speech synthesis model, the preset acoustic model being based on the WaveNet network;
obtaining speech parameters parsed from the text to be synthesized, inputting the speech parameters into the speech synthesis model, and obtaining the synthesized speech output by the speech synthesis model.
A speech synthesis apparatus, comprising:
a sample obtaining module, configured to obtain speech samples;
a feature extraction module, configured to perform feature extraction on the speech samples to obtain the acoustic feature sequences corresponding to the speech samples;
a training module, configured to train a preset acoustic model according to the speech samples and the corresponding acoustic feature sequences, and to determine the acoustic model that meets a preset requirement after training as the speech synthesis model, the preset acoustic model being based on the WaveNet network;
a synthesis module, configured to obtain the speech parameters parsed from the text to be synthesized, input the speech parameters into the speech synthesis model, and obtain the synthesized speech output by the speech synthesis model.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above speech synthesis method when executing the computer program.
A computer-readable storage medium storing a computer program, wherein the computer program implements the above speech synthesis method when executed by a processor.
In the above speech synthesis method, apparatus, computer device and storage medium, speech samples are obtained; they can be used to train the speech synthesis model to be constructed and improve the authenticity of its pronunciation. Feature extraction is performed on the speech samples to obtain the corresponding acoustic feature sequences; separating the feature extraction step (obtaining the acoustic feature sequences) from the training of the acoustic model reduces the amount of computation during model training. A preset acoustic model based on the WaveNet network is trained according to the speech samples and the corresponding acoustic feature sequences, and the acoustic model that meets a preset requirement after training is determined as the speech synthesis model, yielding a speech synthesis model of high quality in terms of naturalness (i.e., the authenticity with which it simulates human speech). The speech parameters parsed from the text to be synthesized are obtained and input into the speech synthesis model, and the synthesized speech output by the model is obtained, thereby performing the speech synthesis task required by the user and producing the synthesized speech corresponding to the text to be synthesized. The speech synthesis method provided by the invention improves the processing speed and accuracy of sound processing systems that simulate real human pronunciation.
Detailed description of the invention
In order to explain the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic diagram of an application environment of the speech synthesis method in an embodiment of the invention;
Fig. 2 is a flow diagram of the speech synthesis method in an embodiment of the invention;
Fig. 3 is a flow diagram of the speech synthesis method in an embodiment of the invention;
Fig. 4 is a flow diagram of the speech synthesis method in an embodiment of the invention;
Fig. 5 is a flow diagram of the speech synthesis method in an embodiment of the invention;
Fig. 6 is a flow diagram of the speech synthesis method in an embodiment of the invention;
Fig. 7 is a structural schematic diagram of the speech synthesis apparatus in an embodiment of the invention;
Fig. 8 is a structural schematic diagram of the speech synthesis apparatus in an embodiment of the invention;
Fig. 9 is a structural schematic diagram of the speech synthesis apparatus in an embodiment of the invention;
Fig. 10 is a schematic diagram of the computer device in an embodiment of the invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below in combination with the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The speech synthesis method provided in this embodiment can be applied in the application environment of Fig. 1, in which a client communicates with a server over a network. Clients include, but are not limited to, personal computers, laptops, smartphones, tablet computers and portable wearable devices. The server can be implemented as an independent server or as a server cluster composed of multiple servers. In some cases, the server obtains the text to be synthesized from the client and generates the synthesized speech corresponding to that text.
In one embodiment, as shown in Fig. 2, a speech synthesis method is provided. Taking its application to the server in Fig. 1 as an example, the method includes the following steps:
S10: obtaining speech samples;
S20: performing feature extraction on the speech samples to obtain the acoustic feature sequences corresponding to the speech samples;
S30: training a preset acoustic model according to the speech samples and the corresponding acoustic feature sequences, and determining the acoustic model that meets a preset requirement after training as the speech synthesis model, the preset acoustic model being based on the WaveNet network;
S40: obtaining speech parameters parsed from the text to be synthesized, inputting the speech parameters into the speech synthesis model, and obtaining the synthesized speech output by the speech synthesis model.
In this embodiment, the speech samples may come from common speech databases or from voice data collected independently. Under normal circumstances, if the formats of the obtained voice data are inconsistent, the voice data need to be uniformly converted to the same format before they can be used as speech samples. Here, the format of voice data includes, but is not limited to, the file type and the sampling frequency.
Speech samples can be processed with STRAIGHT to extract acoustic features and form acoustic feature sequences. Each speech sample generates a corresponding acoustic feature sequence. An acoustic feature sequence is formed by arranging the acoustic features extracted from a speech sample in a certain order (e.g., chronologically).
STRAIGHT is an algorithm for speech signal analysis and synthesis. Its characteristic is that it can separate the spectral envelope of speech from the voice information, decomposing the speech signal into mutually independent spectral parameters and fundamental frequency parameters, and it can flexibly adjust parameters of the speech signal such as fundamental frequency, duration and speaking rate.
The speech samples, together with the acoustic feature sequences obtained from them, can be used to train the acoustic model. The acoustic model can be a neural network model built on the WaveNet network. WaveNet is an autoregressive network based on CNNs (convolutional neural networks). It models directly at the speech waveform level, decomposing the joint probability of a waveform sequence into a product of conditional probabilities; the probability p(x) of the waveform can be written as:
p(x) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t-1})
where x_t is the sample at time t, and each factor expresses the probability distribution of the current sample predicted from the history up to time t.
WaveNet models the conditional distribution of the waveform sequence o_{1:T} given an auxiliary feature sequence a_{1:N}, as shown below:
p(o_{1:T} | a_{1:N}) = ∏_{t=1}^{T} p(o_t | o_{<t}, a_{1:N})
For each sample at time t, the value depends on all previous observations o_{<t}. In practice, the conditioning on o_{<t} is limited to a finite number of previous samples, collectively known as the receptive field. By sampling the waveform sequentially at each time step, WaveNet can generate synthesized speech of very high quality in terms of naturalness.
Structurally, WaveNet uses a gated activation function similar to that of PixelCNN (pixel convolutional neural network), where the gated unit can be computed as:
z = tanh(W_{f,k} * x) ⊙ σ(W_{g,k} * x)
where * is the convolution operation, ⊙ is element-wise multiplication, σ is the sigmoid function, and W_{f,k} and W_{g,k} are the filter convolution weights and gate convolution weights of the k-th layer, respectively. WaveNet also uses residual network structures and parameterized skip connections to build a deep network; this structure also helps accelerate model convergence.
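As an illustration of the gated activation and residual/skip structure just described, the following is a minimal sketch of one WaveNet-style residual block, written in Python with PyTorch; the class name, layer sizes and the use of left-padding for causality are illustrative assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualBlock(nn.Module):
    """One WaveNet-style block: dilated causal convolution, then the gated
    activation z = tanh(W_f * x) ⊙ σ(W_g * x), then residual and skip outputs."""

    def __init__(self, channels: int, kernel_size: int = 2, dilation: int = 1):
        super().__init__()
        # Left-padding keeps the convolution causal: no future samples are used.
        self.pad = (kernel_size - 1) * dilation
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.residual_conv = nn.Conv1d(channels, channels, 1)
        self.skip_conv = nn.Conv1d(channels, channels, 1)

    def forward(self, x):  # x: (batch, channels, time)
        padded = F.pad(x, (self.pad, 0))
        z = torch.tanh(self.filter_conv(padded)) * torch.sigmoid(self.gate_conv(padded))
        skip = self.skip_conv(z)              # summed across blocks to form the output
        residual = self.residual_conv(z) + x  # residual connection eases deep training
        return residual, skip
```

Stacking such blocks with doubling dilations (1, 2, 4, ...) grows the receptive field exponentially with depth, which is what lets the network condition each sample on a long waveform history.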
If, after a certain number of training iterations, the output of the acoustic model meets the preset requirement, the acoustic model whose output meets the preset requirement can be determined as the speech synthesis model. The obtained speech synthesis model can be used to convert text data into audio data (i.e., synthesized speech). Here, the preset requirement can be determined according to the actual situation. For example, when the difference between the synthesized speech output by the WaveNet network and the original speech samples is less than a specified threshold, the output of the acoustic model can be determined to meet the preset requirement. In other cases, the output can also be determined to meet the preset requirement when the number of training iterations of the acoustic model reaches a preset value.
Here, the text to be synthesized refers to text that needs to be converted into audio data. A preset text parsing model can be used to parse the text to be synthesized and obtain the speech parameters corresponding to it. The speech parameters include, but are not limited to, the tones of words, the prosody, and the intervals between syllables and between sentences.
In steps S10-S40, speech samples are obtained; these can be used to train the speech synthesis model to be constructed and improve the authenticity of its pronunciation. Feature extraction is performed on the speech samples to obtain the corresponding acoustic feature sequences; separating the feature extraction step (obtaining the acoustic feature sequences) from the training of the acoustic model reduces the amount of computation during model training. A preset acoustic model based on the WaveNet network is trained according to the speech samples and the corresponding acoustic feature sequences, and the acoustic model that meets the preset requirement after training is determined as the speech synthesis model, yielding a model of high quality in terms of naturalness (i.e., the authenticity with which it simulates human speech). The speech parameters parsed from the text to be synthesized are obtained and input into the speech synthesis model, which outputs the synthesized speech, thereby performing the speech synthesis task and producing the synthesized speech, corresponding to the text to be synthesized, that the user requires.
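Read end to end, steps S10-S40 amount to a train-then-synthesize pipeline. The Python sketch below only makes that flow explicit; every function and class name in it is a hypothetical placeholder for the operations described above, not an API defined by the patent.

```python
def build_speech_synthesis_model(speech_samples, requirement_met, max_iterations=100):
    """Steps S10-S30: train a WaveNet-based acoustic model on speech samples
    and their acoustic feature sequences (all names are placeholders)."""
    feature_sequences = [extract_acoustic_features(s) for s in speech_samples]  # S20
    model = WaveNetAcousticModel()                       # the preset acoustic model
    for _ in range(max_iterations):                      # S30: iterative training
        model.train_on(speech_samples, feature_sequences)
        if requirement_met(model):   # e.g. output/sample difference below a threshold
            break
    return model                                         # the speech synthesis model

def synthesize(model, text_to_synthesize):
    """Step S40: parse speech parameters from the text and run the model."""
    speech_parameters = parse_text(text_to_synthesize)   # tones, prosody, pauses, ...
    return model(speech_parameters)                      # the synthesized speech
```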
Optionally, as shown in Fig. 3, before step S40 the method further includes:
S41: obtaining the text to be synthesized;
S42: inputting the text to be synthesized into a preset text parsing model, and obtaining the speech parameters, corresponding to the text to be synthesized, output by the preset text parsing model.
In this embodiment, the trained speech synthesis model can synthesize the text to be synthesized into speech close to real human pronunciation. The preset text parsing model can parse information such as the tones, prosody and syllables of all the words contained in the text to be synthesized, and generate the corresponding speech parameters (which can be expressed in the form of a context-information label file). This process can be called text analysis. The obtained speech parameters can then be converted into synthesized speech by the speech synthesis model. In the whole speech synthesis process, text analysis provides important evidence for the back-end speech synthesis, and its quality directly influences the naturalness and accuracy of the synthesized speech.
In one example, the initials and finals of standard Chinese are used as the speech synthesis primitives. For an input Chinese text, guided by a syntactic lexicon and a syntax rule library, text normalization, syntactic analysis, prosody prediction and grapheme-to-phoneme conversion are performed in turn to obtain the sentence information, word information and prosodic structure information of the input text as well as the initials and finals of each Chinese character. This yields the speech synthesis primitives (initials and finals) of the input text together with the context information of each primitive, and finally generates speech parameters that include the phoneme notation and context-sensitive labels of each word in the Chinese text.
In steps S41-S42, the text to be synthesized is obtained, i.e., the text to be processed, from which the corresponding speech is to be synthesized. The text to be synthesized is input into the preset text parsing model, and the speech parameters, corresponding to the text to be synthesized, output by the model are obtained; this yields speech parameters that match the text to be synthesized more closely, and thus a higher-quality synthesized speech.
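To make the text-analysis front-end concrete, here is a small sketch of how a parsed sentence might be turned into per-unit labels of the kind described above (initial/final, tone, context-sensitive marks); the segmentation, field names and label format are illustrative assumptions, not the patent's label scheme.

```python
# Hypothetical front-end output for the greeting "ni hao" after text
# normalization, syntactic analysis and prosody prediction: one record per
# syllable, carrying its initial, final and tone.
parsed = [
    {"syllable": "ni3",  "initial": "n", "final": "i",  "tone": 3},
    {"syllable": "hao3", "initial": "h", "final": "ao", "tone": 3},
]

def context_label(prev, cur, nxt):
    """Combine each unit with its neighbours into a context-sensitive label."""
    name = lambda u: u["syllable"] if u else "sil"   # "sil" marks silence at edges
    return f"{name(prev)}-{cur['syllable']}+{name(nxt)}/tone:{cur['tone']}"

labels = [context_label(parsed[i - 1] if i > 0 else None,
                        unit,
                        parsed[i + 1] if i + 1 < len(parsed) else None)
          for i, unit in enumerate(parsed)]
# -> ['sil-ni3+hao3/tone:3', 'ni3-hao3+sil/tone:3']
```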
Optionally, as shown in Fig. 4, step S20 includes:
S201: cutting the speech samples into multiple speech frames;
S202: calculating the acoustic features of each speech frame separately, the acoustic features including fundamental frequency, energy and mel-frequency cepstral coefficients;
S203: sorting the acoustic features of the speech frames chronologically to form the acoustic feature sequence.
In this embodiment, the speech samples can be cut into multiple speech frames according to the actual situation. Speech samples are speech signals, and only stationary information can undergo signal processing; since a speech signal is a quasi-stationary signal, the speech samples need to be divided into frames. When processing speech samples, the length of each speech frame can be 20-30 ms; within this interval the speech signal can be regarded as a stationary signal. In some cases, a wavelet transform is applied to the framed speech signal, that is, after framing, each frame undergoes wavelet transformation and processing.
A speech sample I_{1:N} can be expressed as I_{1:N} = {I_1, I_2, ..., I_N}, where N is the total number of frames of the sample. Feature extraction can be performed on the speech information of each frame to obtain the acoustic features of each speech frame. The obtained acoustic features include, but are not limited to, fundamental frequency, energy and mel-frequency cepstral coefficients.
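A minimal framing routine consistent with the 20-30 ms frames described above might look like the following NumPy sketch; the exact frame length, hop size and slicing strategy are assumptions for illustration.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Cut one speech sample into short quasi-stationary frames I_1..I_N."""
    frame_len = int(sample_rate * frame_ms / 1000)  # e.g. 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])     # shape: (N, frame_len)
```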
The fundamental frequency determines the timbre and intonation of the voice. When a person utters a voiced sound, the airflow causes the vocal cords to vibrate periodically, producing the fundamental tone; the frequency of this vibration is the fundamental frequency. The fundamental frequency can be extracted with the short-time autocorrelation function, with the following steps:
preprocess the speech frame, e.g. pre-emphasis, denoising and windowing;
compute the short-time autocorrelation function of the data and select its local maximum points;
the pitch period estimate of the speech is the position of the first peak among these points;
taking the reciprocal of the pitch period then gives the fundamental frequency.
The short-time autocorrelation function can be defined as:
R_m(k) = Σ_n x(n) x(n − k)
where x(n) is the speech signal and m indicates that the window function is applied starting from point m.
Speech can be divided into silent segments, unvoiced segments and voiced segments. The signal amplitude of unvoiced segments is small and irregular, while that of voiced segments is larger and varies regularly, with a certain quasi-periodicity. Processing therefore exploits the short-time (10-30 ms) stationarity of speech: because speech changes very slowly, it is usually assumed to be almost unchanged within a short period, so a voiced signal can be considered periodic over a short time:
x(t + nτ_0) = x(t)
where x(t) is the speech signal at time t and τ_0 is the period of the voiced signal, called the pitch period. According to the relationship between time and frequency:
f_0 = 1 / τ_0
where f_0 is the fundamental frequency; since τ_0 is almost unchanged, f_0 can also be assumed to be almost unchanged.
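A plain NumPy sketch of the autocorrelation-based estimate described above (preprocess the frame, autocorrelate, locate the dominant peak, invert the period) follows; the Hamming window and the 50-500 Hz search range are assumptions chosen for typical speech, not values fixed by the patent.

```python
import numpy as np

def estimate_f0(frame, sample_rate, f_min=50.0, f_max=500.0):
    """Estimate the fundamental frequency of one voiced frame from the
    peak of its short-time autocorrelation function: f0 = 1 / pitch period."""
    frame = (frame - frame.mean()) * np.hamming(len(frame))       # windowing
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # R(k) for k >= 0
    lag_min = int(sample_rate / f_max)            # shortest plausible pitch period
    lag_max = min(int(sample_rate / f_min), len(r) - 1)
    pitch_lag = lag_min + int(np.argmax(r[lag_min:lag_max]))      # dominant peak
    return sample_rate / pitch_lag                # period in samples -> Hz
```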
The energy reflects the intensity of the speaker's voice, which differs under different emotional states: when a person is excited, the intensity of the voice is significantly greater than in a normal state, whereas in a low, dispirited state the intensity of the voice drops markedly.
The intensity of the sound is generally expressed with the short-time energy and the short-time average magnitude. The short-time energy of the speech signal is defined as:
E_n = Σ_m [x(m) w(n − m)]^2
where n denotes the n-th moment and x(m) is the speech signal; that is, the short-time energy is the weighted sum of squares of the sample values of one frame. If we let:
h(n) = w^2(n)
then:
E_n = Σ_m x^2(m) h(n − m)
so the formula can be understood as first squaring each sample value of the speech signal, then passing it through a filter h(n), and outputting the time series composed of the short-time energies.
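Since the derivation above reduces the short-time energy to filtering the squared signal with h(n) = w²(n), it can be computed directly with a convolution; in this NumPy sketch the Hamming window, frame length and hop are assumed values.

```python
import numpy as np

def short_time_energy(signal, win_len=400, hop=160):
    """E_n = sum_m [x(m) w(n - m)]^2: square the signal, filter with h = w^2."""
    h = np.hamming(win_len) ** 2                  # h(n) = w^2(n)
    energy = np.convolve(signal ** 2, h, mode="same")
    return energy[::hop]                          # one energy value per frame step
```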
Mel-frequency cepstral coefficients (MFCCs) can be obtained by the following steps:
first perform pre-emphasis, framing and windowing on the speech frame;
for each short-time analysis window, obtain the corresponding spectrum via the FFT (fast Fourier transform);
pass the spectrum through a mel filterbank to obtain the mel spectrum;
perform cepstral analysis on the mel spectrum to obtain the mel-frequency cepstral coefficients.
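These steps (pre-emphasis, framing and windowing, FFT, mel filterbank, cepstral analysis) are what standard speech toolkits implement. As one hedged example, a sketch using the librosa library; the parameter values here (13 coefficients, 25 ms windows, 10 ms hop, pre-emphasis 0.97) are common defaults assumed for illustration, not settings from the patent.

```python
import numpy as np
import librosa

def extract_mfcc(signal, sample_rate, n_mfcc=13, preemph=0.97):
    """MFCCs of a speech signal: pre-emphasis, windowed FFT,
    mel filterbank, then cepstral analysis (log + DCT)."""
    emphasized = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    return librosa.feature.mfcc(y=emphasized, sr=sample_rate, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sample_rate),       # 25 ms window
                                hop_length=int(0.010 * sample_rate))  # 10 ms hop
```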
In some cases, a processing model based on STRAIGHT can also be used to process each speech frame and generate the corresponding acoustic features.
In some cases, the acoustic features may also include the pronunciation duration in the speech samples and features related to it. Here, the pronunciation duration refers to how long the person speaks. People speak at different speeds in different emotional states. For example, when a person is excited, the nerves are in a highly aroused state and speech tends to be faster, with a higher speaking rate; conversely, when a person is sad and depressed, speech tends to be weaker and slower.
In steps S201-S203, the speech samples are cut into multiple speech frames, dividing them into segments (speech frames) that are convenient for computer processing. The acoustic features of each speech frame, including fundamental frequency, energy and mel-frequency cepstral coefficients, are calculated separately, yielding the important acoustic features of the speech samples. The acoustic features of the speech frames are sorted chronologically to form the acoustic feature sequence, which can be input directly into the acoustic model for training, reducing the amount of computation and the computation time of acoustic model training.
Optionally, as shown in Fig. 5, step S30 includes:
S301: obtaining mixed voice data and the corresponding acoustic feature sequences;
S302: inputting the mixed voice data and their corresponding acoustic feature sequences into the acoustic model for pre-training, obtaining a pre-trained model;
S303: obtaining specific voice data and the corresponding acoustic feature sequences;
S304: inputting the specific voice data and the corresponding acoustic feature sequences into the pre-trained model for training.
Here, mixed voice data refers to a sample set composed of speech samples containing different voices; one speech sample may include the voices of more than one person, or the speaker of one speech sample differs from the speaker of another. Feature extraction can be performed on each speech sample in the mixed voice data to obtain its corresponding acoustic feature sequence. Pre-training is the same as the normal training process; the term "pre-training" is used here only to distinguish it from the training process of step S304. The number of iterations required for training can be determined according to actual needs. After training on the mixed voice data, the corresponding pre-trained model is obtained.
Specific voice data refers to a sample set composed of speech samples of the same person's pronunciation; for example, all the speech samples in the specific voice data are uttered by user A. Feature extraction can be performed on each speech sample in the specific voice data to obtain its corresponding acoustic feature sequence. After training on the specific voice data, a speech synthesis model matched to the speaker of the specific voice data is obtained; that is, the resulting model can produce synthesized speech similar to that speaker, e.g., the pronunciation of a particular celebrity. Because the mixed voice data is used first to obtain the pre-trained model, and the specific voice data is then used for further adaptive training, the trained acoustic model has good generalization ability and can establish the mapping relationship between acoustic features and speech waveforms across the acoustic characteristics of different speakers.
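The two-stage scheme (pre-train on mixed multi-speaker data, then adapt on one speaker's data) can be summarized by the following hypothetical training loop; the model class, method names and the reduced learning rate for adaptation are placeholders and assumptions, not APIs from the patent.

```python
def train_two_stage(mixed_data, specific_data, pretrain_epochs=50, adapt_epochs=20):
    """Hypothetical outline of steps S301-S304."""
    model = WaveNetAcousticModel()

    # Stage 1 (S301-S302): pre-train on mixed multi-speaker samples so the model
    # learns a broad mapping from acoustic feature sequences to waveforms.
    for _ in range(pretrain_epochs):
        for sample, features in mixed_data:       # (speech, acoustic feature sequence)
            model.train_step(sample, features)

    # Stage 2 (S303-S304): adaptive fine-tuning on a single target speaker,
    # here with a smaller assumed learning rate, so that the synthesized
    # voice comes to resemble that speaker.
    model.learning_rate *= 0.1
    for _ in range(adapt_epochs):
        for sample, features in specific_data:
            model.train_step(sample, features)
    return model
```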
In steps S301-S304, mixed voice data and the corresponding acoustic feature sequences are obtained, providing speech samples specifically for training. The mixed voice data and their corresponding acoustic feature sequences are input into the acoustic model for pre-training, yielding the pre-trained model; training the acoustic model on mixed voice data gives the trained model good generalization ability. Specific voice data and the corresponding acoustic feature sequences are then obtained, so that the specific voice data can be used for further adaptive training of the pre-trained model. The specific voice data and the corresponding acoustic feature sequences are input into the pre-trained model for training, yielding a speech synthesis model of higher synthesis quality.
Optionally, as shown in Fig. 6, step S30 further includes:
S305: selecting speech samples at a preset ratio;
S306: adding noise to the selected speech samples to form noisy samples;
S307: obtaining the acoustic feature sequences corresponding to the noisy samples;
S308: training the preset acoustic model according to the noisy samples and their corresponding acoustic feature sequences, together with the speech samples and their corresponding acoustic feature sequences.
In this embodiment, in order for the trained acoustic model to have better adaptability and generalization ability while improving the accuracy of speech synthesis, noise can be added to part of the speech samples. The preset ratio can be determined according to actual needs, for example 1-10%. Here, noise can refer to the ambient sound of different scenes, such as airport noise, office noise, food market noise and supermarket noise. The speech samples after adding noise are the noisy samples. The processing of the noisy samples (feature extraction and input to the acoustic model for training) is almost identical to that of the speech samples without added noise, and is not repeated here.
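Selecting a preset fraction of the samples and mixing in scene noise can be sketched as below; the 5% default ratio, the SNR-based gain and the tiling of short noise clips are illustrative assumptions.

```python
import random
import numpy as np

def add_noise_to_subset(samples, noise_clips, ratio=0.05, snr_db=15.0):
    """Return noisy copies of a preset ratio of the speech samples, mixed with
    randomly chosen scene noise (airport, office, market, supermarket, ...)."""
    chosen = random.sample(range(len(samples)), int(len(samples) * ratio))
    noisy = []
    for i in chosen:
        clean = samples[i]
        noise = np.resize(random.choice(noise_clips), len(clean))  # tile/trim to fit
        # Scale the noise so the mixture reaches the target signal-to-noise ratio.
        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
        noisy.append(clean + gain * noise)
    return noisy
```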
In steps S305-S308, speech samples are selected at a preset ratio; different selection ratios can be set for different scenes. Noise is added to the selected speech samples to form noisy samples, thereby obtaining multiple training samples containing specific noise (i.e., the noisy samples). The acoustic feature sequences corresponding to the noisy samples are obtained; these can be input directly into the acoustic model for training. The preset acoustic model is trained according to the noisy samples and their corresponding acoustic feature sequences, together with the speech samples and their corresponding acoustic feature sequences, so that the obtained speech synthesis model has better adaptability and generalization ability.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and does not constitute any limitation on the implementation of the embodiments of the present invention.
In one embodiment, a speech synthesis apparatus is provided, corresponding one-to-one to the speech synthesis method in the above embodiments. As shown in Fig. 7, the speech synthesis apparatus includes a sample obtaining module 10, a feature extraction module 20, a training module 30 and a synthesis module 40. The functional modules are described in detail as follows:
the sample obtaining module 10, configured to obtain speech samples;
the feature extraction module 20, configured to perform feature extraction on the speech samples to obtain the acoustic feature sequences corresponding to the speech samples;
the training module 30, configured to train a preset acoustic model according to the speech samples and the corresponding acoustic feature sequences, and to determine the acoustic model that meets the preset requirement after training as the speech synthesis model, the preset acoustic model being based on the WaveNet network;
the synthesis module 40, configured to obtain the speech parameters parsed from the text to be synthesized, input the speech parameters into the speech synthesis model, and obtain the synthesized speech output by the speech synthesis model.
Optionally, as shown in Fig. 8, the speech synthesis apparatus further includes:
a text obtaining module 50, configured to obtain the text to be synthesized;
a speech parameter obtaining module 60, configured to input the text to be synthesized into the preset text parsing model and obtain the speech parameters, corresponding to the text to be synthesized, output by the preset text parsing model.
Optionally, as shown in Fig. 9, the feature extraction module 20 includes:
a speech cutting unit 201, configured to cut the speech samples into multiple speech frames;
an acoustic feature extraction unit 202, configured to calculate the acoustic features of each speech frame separately, the acoustic features including fundamental frequency, energy and mel-frequency cepstral coefficients;
an acoustic feature sequence generation unit 203, configured to sort the acoustic features of the speech frames chronologically to form the acoustic feature sequence.
Optionally, the training module 30 includes:
a mixed voice obtaining unit, configured to obtain mixed voice data and the corresponding acoustic feature sequences;
a pre-training unit, configured to input the mixed voice data and their corresponding acoustic feature sequences into the acoustic model for pre-training, obtaining the pre-trained model;
a specific voice obtaining unit, configured to obtain specific voice data and the corresponding acoustic feature sequences;
a specific voice training unit, configured to input the specific voice data and the corresponding acoustic feature sequences into the pre-trained model for training.
Optionally, the training module 30 further includes:
a speech sample selection unit, configured to select speech samples at a preset ratio;
a noise adding unit, configured to add noise to the selected speech samples to form noisy samples;
a noise feature obtaining unit, configured to obtain the acoustic feature sequences corresponding to the noisy samples;
a noisy sample training unit, configured to train the preset acoustic model according to the noisy samples and their corresponding acoustic feature sequences, together with the speech samples and their corresponding acoustic feature sequences.
For the specific limitations of the speech synthesis apparatus, reference may be made to the limitations of the speech synthesis method above, which are not repeated here. Each module in the above speech synthesis apparatus can be implemented in whole or in part by software, hardware, or a combination thereof. The modules can be embedded in hardware in, or independent of, the processor of a computer device, or stored in software form in the memory of the computer device, so that the processor can call them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in Fig. 10. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system, a computer program and a database, and the internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device stores the data involved in the speech synthesis method. The network interface of the computer device communicates with external terminals through a network connection. The computer program, when executed by the processor, implements a speech synthesis method.
In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor performs the following steps:
obtaining speech samples;
performing feature extraction on the speech samples to obtain the acoustic feature sequences corresponding to the speech samples;
training a preset acoustic model according to the speech samples and the corresponding acoustic feature sequences, and determining the acoustic model that meets the preset requirement after training as the speech synthesis model, the preset acoustic model being based on the WaveNet network;
obtaining the speech parameters parsed from the text to be synthesized, inputting the speech parameters into the speech synthesis model, and obtaining the synthesized speech output by the speech synthesis model.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program performs the following steps:
obtaining speech samples;
performing feature extraction on the speech samples to obtain the acoustic feature sequences corresponding to the speech samples;
training a preset acoustic model according to the speech samples and the corresponding acoustic feature sequences, and determining the acoustic model that meets the preset requirement after training as the speech synthesis model, the preset acoustic model being based on the WaveNet network;
obtaining the speech parameters parsed from the text to be synthesized, inputting the speech parameters into the speech synthesis model, and obtaining the synthesized speech output by the speech synthesis model.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program, which can be stored in a non-volatile computer-readable storage medium; when executed, the computer program may include the processes of the embodiments of the above methods. Any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
It is apparent to those skilled in the art that, for convenience and brevity of description, only the division of the above functional units and modules is given as an example; in practical applications, the above functions can be allocated to different functional units and modules as needed, that is, the internal structure of the apparatus can be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all be included within the protection scope of the present invention.
Claims (10)
1. A speech synthesis method, characterized by comprising:
obtaining speech samples;
performing feature extraction on the speech samples to obtain the acoustic feature sequences corresponding to the speech samples;
training a preset acoustic model according to the speech samples and the corresponding acoustic feature sequences, and determining the acoustic model that meets a preset requirement after training as the speech synthesis model, the preset acoustic model being based on the WaveNet network;
obtaining speech parameters parsed from the text to be synthesized, inputting the speech parameters into the speech synthesis model, and obtaining the synthesized speech output by the speech synthesis model.
2. The speech synthesis method according to claim 1, characterized in that before the obtaining of the speech parameters parsed from the text to be synthesized, the inputting of the speech parameters into the speech synthesis model, and the obtaining of the synthesized speech output by the speech synthesis model, the method further comprises:
obtaining the text to be synthesized;
inputting the text to be synthesized into a preset text parsing model, and obtaining the speech parameters, corresponding to the text to be synthesized, output by the preset text parsing model.
3. The speech synthesis method according to claim 1, characterized in that the performing of feature extraction on the speech samples to obtain the acoustic feature sequences corresponding to the speech samples comprises:
cutting the speech samples into multiple speech frames;
calculating the acoustic features of each speech frame separately, the acoustic features including fundamental frequency, energy and mel-frequency cepstral coefficients;
sorting the acoustic features of the speech frames chronologically to form the acoustic feature sequence.
4. The speech synthesis method according to claim 1, characterized in that the training of the preset acoustic model according to the speech samples and the corresponding acoustic feature sequences comprises:
obtaining mixed voice data and the corresponding acoustic feature sequences;
inputting the mixed voice data and their corresponding acoustic feature sequences into the acoustic model for pre-training, obtaining a pre-trained model;
obtaining specific voice data and the corresponding acoustic feature sequences;
inputting the specific voice data and the corresponding acoustic feature sequences into the pre-trained model for training.
5. The speech synthesis method according to claim 1, characterized in that the training of the preset acoustic model according to the speech samples and the corresponding acoustic feature sequences comprises:
selecting speech samples at a preset ratio;
adding noise to the selected speech samples to form noisy samples;
obtaining the acoustic feature sequences corresponding to the noisy samples;
training the preset acoustic model according to the noisy samples and their corresponding acoustic feature sequences, together with the speech samples and their corresponding acoustic feature sequences.
6. A speech synthesis apparatus, characterized by comprising:
a sample obtaining module, configured to obtain speech samples;
a feature extraction module, configured to perform feature extraction on the speech samples to obtain the acoustic feature sequences corresponding to the speech samples;
a training module, configured to train a preset acoustic model according to the speech samples and the corresponding acoustic feature sequences, and to determine the acoustic model that meets a preset requirement after training as the speech synthesis model, the preset acoustic model being based on the WaveNet network;
a synthesis module, configured to obtain the speech parameters parsed from the text to be synthesized, input the speech parameters into the speech synthesis model, and obtain the synthesized speech output by the speech synthesis model.
7. The speech synthesis apparatus according to claim 6, characterized by further comprising:
a text obtaining module, configured to obtain the text to be synthesized;
a speech parameter obtaining module, configured to input the text to be synthesized into the preset text parsing model and obtain the speech parameters, corresponding to the text to be synthesized, output by the preset text parsing model.
8. The speech synthesis apparatus according to claim 6, characterized in that the feature extraction module comprises:
a speech cutting unit, configured to cut the speech samples into multiple speech frames;
an acoustic feature extraction unit, configured to calculate the acoustic features of each speech frame separately, the acoustic features including fundamental frequency, energy and mel-frequency cepstral coefficients;
an acoustic feature sequence generation unit, configured to sort the acoustic features of the speech frames chronologically to form the acoustic feature sequence.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the speech synthesis method according to any one of claims 1 to 5 when executing the computer program.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program implements the speech synthesis method according to any one of claims 1 to 5 when executed by a processor.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910328125.XA CN110033755A (en) | 2019-04-23 | 2019-04-23 | Phoneme synthesizing method, device, computer equipment and storage medium |
PCT/CN2019/116509 WO2020215666A1 (en) | 2019-04-23 | 2019-11-08 | Speech synthesis method and apparatus, computer device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910328125.XA CN110033755A (en) | 2019-04-23 | 2019-04-23 | Phoneme synthesizing method, device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110033755A true CN110033755A (en) | 2019-07-19 |
Family
ID=67239848
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910328125.XA Pending CN110033755A (en) | 2019-04-23 | 2019-04-23 | Phoneme synthesizing method, device, computer equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110033755A (en) |
WO (1) | WO2020215666A1 (en) |
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110675881A (en) * | 2019-09-05 | 2020-01-10 | 北京捷通华声科技股份有限公司 | Voice verification method and device |
CN111081216A (en) * | 2019-12-26 | 2020-04-28 | 上海优扬新媒信息技术有限公司 | Audio synthesis method, device, server and storage medium |
CN111276120A (en) * | 2020-01-21 | 2020-06-12 | 华为技术有限公司 | Speech synthesis method, apparatus and computer-readable storage medium |
CN111276119A (en) * | 2020-01-17 | 2020-06-12 | 平安科技(深圳)有限公司 | Voice generation method and system and computer equipment |
CN111276121A (en) * | 2020-01-23 | 2020-06-12 | 北京世纪好未来教育科技有限公司 | Voice alignment method and device, electronic equipment and storage medium |
CN111292720A (en) * | 2020-02-07 | 2020-06-16 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment |
CN111312208A (en) * | 2020-03-09 | 2020-06-19 | 广州深声科技有限公司 | Neural network vocoder system with irrelevant speakers |
CN111402923A (en) * | 2020-03-27 | 2020-07-10 | 中南大学 | Emotional voice conversion method based on wavenet |
CN111489734A (en) * | 2020-04-03 | 2020-08-04 | 支付宝(杭州)信息技术有限公司 | Model training method and device based on multiple speakers |
CN111696517A (en) * | 2020-05-28 | 2020-09-22 | 平安科技(深圳)有限公司 | Speech synthesis method, speech synthesis device, computer equipment and computer readable storage medium |
CN111785303A (en) * | 2020-06-30 | 2020-10-16 | 合肥讯飞数码科技有限公司 | Model training method, simulated sound detection method, device, equipment and storage medium |
CN111816158A (en) * | 2019-09-17 | 2020-10-23 | 北京京东尚科信息技术有限公司 | Voice synthesis method and device and storage medium |
WO2020215666A1 (en) * | 2019-04-23 | 2020-10-29 | 平安科技(深圳)有限公司 | Speech synthesis method and apparatus, computer device, and storage medium |
CN111916049A (en) * | 2020-07-15 | 2020-11-10 | 北京声智科技有限公司 | Voice synthesis method and device |
CN111968678A (en) * | 2020-09-11 | 2020-11-20 | 腾讯科技(深圳)有限公司 | Audio data processing method, device and equipment and readable storage medium |
CN112289298A (en) * | 2020-09-30 | 2021-01-29 | 北京大米科技有限公司 | Processing method and device for synthesized voice, storage medium and electronic equipment |
CN112349268A (en) * | 2020-11-09 | 2021-02-09 | 湖南芒果听见科技有限公司 | Emergency broadcast audio processing system and operation method thereof |
CN112767957A (en) * | 2020-12-31 | 2021-05-07 | 科大讯飞股份有限公司 | Method for obtaining prediction model, method for predicting voice waveform and related device |
CN112863483A (en) * | 2021-01-05 | 2021-05-28 | 杭州一知智能科技有限公司 | Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm |
CN112951203A (en) * | 2021-04-25 | 2021-06-11 | 平安科技(深圳)有限公司 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium |
CN112992162A (en) * | 2021-04-16 | 2021-06-18 | 杭州一知智能科技有限公司 | Tone cloning method, system, device and computer readable storage medium |
CN113192482A (en) * | 2020-01-13 | 2021-07-30 | 北京地平线机器人技术研发有限公司 | Speech synthesis method and training method, device and equipment of speech synthesis model |
CN113257236A (en) * | 2020-04-30 | 2021-08-13 | 浙江大学 | Model score optimization method based on core frame screening |
CN113299272A (en) * | 2020-02-06 | 2021-08-24 | 菜鸟智能物流控股有限公司 | Speech synthesis model training method, speech synthesis apparatus, and storage medium |
CN113450764A (en) * | 2021-07-08 | 2021-09-28 | 平安科技(深圳)有限公司 | Text voice recognition method, device, equipment and storage medium |
CN113569196A (en) * | 2021-07-15 | 2021-10-29 | 苏州仰思坪半导体有限公司 | Data processing method, device, medium and equipment |
CN113838450A (en) * | 2021-08-11 | 2021-12-24 | 北京百度网讯科技有限公司 | Audio synthesis and corresponding model training method, device, equipment and storage medium |
WO2022141126A1 (en) * | 2020-12-29 | 2022-07-07 | 深圳市优必选科技股份有限公司 | Personalized speech conversion training method, computer device, and storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107945786B (en) * | 2017-11-27 | 2021-05-25 | 北京百度网讯科技有限公司 | Speech synthesis method and device |
CN109102796A (en) * | 2018-08-31 | 2018-12-28 | 北京未来媒体科技股份有限公司 | A kind of phoneme synthesizing method and device |
CN110033755A (en) * | 2019-04-23 | 2019-07-19 | 平安科技(深圳)有限公司 | Phoneme synthesizing method, device, computer equipment and storage medium |
2019
- 2019-04-23 CN CN201910328125.XA patent/CN110033755A/en active Pending
- 2019-11-08 WO PCT/CN2019/116509 patent/WO2020215666A1/en active Application Filing
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103514878A (en) * | 2012-06-27 | 2014-01-15 | 北京百度网讯科技有限公司 | Acoustic modeling method and device, and speech recognition method and device |
CN108573694A (en) * | 2018-02-01 | 2018-09-25 | 北京百度网讯科技有限公司 | Language material expansion and speech synthesis system construction method based on artificial intelligence and device |
CN108597492A (en) * | 2018-05-02 | 2018-09-28 | 百度在线网络技术(北京)有限公司 | Phoneme synthesizing method and device |
CN108630190A (en) * | 2018-05-18 | 2018-10-09 | 百度在线网络技术(北京)有限公司 | Method and apparatus for generating phonetic synthesis model |
CN109036371A (en) * | 2018-07-19 | 2018-12-18 | 北京光年无限科技有限公司 | Audio data generation method and system for speech synthesis |
CN108899009A (en) * | 2018-08-17 | 2018-11-27 | 百卓网络科技有限公司 | Phoneme-based Chinese speech synthesis system |
Non-Patent Citations (3)
Title |
---|
Aäron van den Oord et al.: "WaveNet: A Generative Model for Raw Audio", arXiv * |
Jonathan Shen et al.: "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram", arXiv * |
伍宏传 (Wu Hongchuan): "Research on Speech Synthesis Vocoders Based on Convolutional Neural Networks" (基于卷积神经网络的语音合成声码器研究), China Master's Theses Full-text Database, 2019 * |
Cited By (44)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020215666A1 (en) * | 2019-04-23 | 2020-10-29 | 平安科技(深圳)有限公司 | Speech synthesis method and apparatus, computer device, and storage medium |
CN110675881A (en) * | 2019-09-05 | 2020-01-10 | 北京捷通华声科技股份有限公司 | Voice verification method and device |
CN111816158B (en) * | 2019-09-17 | 2023-08-04 | 北京京东尚科信息技术有限公司 | Speech synthesis method and device and storage medium |
CN111816158A (en) * | 2019-09-17 | 2020-10-23 | 北京京东尚科信息技术有限公司 | Voice synthesis method and device and storage medium |
CN111081216A (en) * | 2019-12-26 | 2020-04-28 | 上海优扬新媒信息技术有限公司 | Audio synthesis method, device, server and storage medium |
CN113192482A (en) * | 2020-01-13 | 2021-07-30 | 北京地平线机器人技术研发有限公司 | Speech synthesis method and training method, device and equipment of speech synthesis model |
CN113192482B (en) * | 2020-01-13 | 2023-03-21 | 北京地平线机器人技术研发有限公司 | Speech synthesis method and training method, device and equipment of speech synthesis model |
CN111276119A (en) * | 2020-01-17 | 2020-06-12 | 平安科技(深圳)有限公司 | Voice generation method and system and computer equipment |
CN111276119B (en) * | 2020-01-17 | 2023-08-22 | 平安科技(深圳)有限公司 | Speech generation method, system and computer equipment |
CN111276120A (en) * | 2020-01-21 | 2020-06-12 | 华为技术有限公司 | Speech synthesis method, apparatus and computer-readable storage medium |
CN111276120B (en) * | 2020-01-21 | 2022-08-19 | 华为技术有限公司 | Speech synthesis method, apparatus and computer-readable storage medium |
CN111276121B (en) * | 2020-01-23 | 2021-04-30 | 北京世纪好未来教育科技有限公司 | Voice alignment method and device, electronic equipment and storage medium |
CN111276121A (en) * | 2020-01-23 | 2020-06-12 | 北京世纪好未来教育科技有限公司 | Voice alignment method and device, electronic equipment and storage medium |
CN113299272B (en) * | 2020-02-06 | 2023-10-31 | 菜鸟智能物流控股有限公司 | Speech synthesis model training and speech synthesis method, equipment and storage medium |
CN113299272A (en) * | 2020-02-06 | 2021-08-24 | 菜鸟智能物流控股有限公司 | Speech synthesis model training method, speech synthesis apparatus, and storage medium |
CN111292720B (en) * | 2020-02-07 | 2024-01-23 | 北京字节跳动网络技术有限公司 | Speech synthesis method, device, computer readable medium and electronic equipment |
CN111292720A (en) * | 2020-02-07 | 2020-06-16 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment |
CN111312208A (en) * | 2020-03-09 | 2020-06-19 | 广州深声科技有限公司 | Speaker-independent neural network vocoder system |
CN111402923B (en) * | 2020-03-27 | 2023-11-03 | 中南大学 | Emotion voice conversion method based on wavenet |
CN111402923A (en) * | 2020-03-27 | 2020-07-10 | 中南大学 | Emotional voice conversion method based on wavenet |
CN111489734B (en) * | 2020-04-03 | 2023-08-22 | 支付宝(杭州)信息技术有限公司 | Model training method and device based on multiple speakers |
CN111489734A (en) * | 2020-04-03 | 2020-08-04 | 支付宝(杭州)信息技术有限公司 | Model training method and device based on multiple speakers |
CN113257236B (en) * | 2020-04-30 | 2022-03-29 | 浙江大学 | Model score optimization method based on core frame screening |
CN113257236A (en) * | 2020-04-30 | 2021-08-13 | 浙江大学 | Model score optimization method based on core frame screening |
CN111696517A (en) * | 2020-05-28 | 2020-09-22 | 平安科技(深圳)有限公司 | Speech synthesis method, speech synthesis device, computer equipment and computer readable storage medium |
CN111785303B (en) * | 2020-06-30 | 2024-04-16 | 合肥讯飞数码科技有限公司 | Model training method, imitation sound detection device, equipment and storage medium |
CN111785303A (en) * | 2020-06-30 | 2020-10-16 | 合肥讯飞数码科技有限公司 | Model training method, simulated sound detection method, device, equipment and storage medium |
CN111916049A (en) * | 2020-07-15 | 2020-11-10 | 北京声智科技有限公司 | Voice synthesis method and device |
CN111916049B (en) * | 2020-07-15 | 2021-02-09 | 北京声智科技有限公司 | Voice synthesis method and device |
CN111968678A (en) * | 2020-09-11 | 2020-11-20 | 腾讯科技(深圳)有限公司 | Audio data processing method, device and equipment and readable storage medium |
CN111968678B (en) * | 2020-09-11 | 2024-02-09 | 腾讯科技(深圳)有限公司 | Audio data processing method, device, equipment and readable storage medium |
CN112289298A (en) * | 2020-09-30 | 2021-01-29 | 北京大米科技有限公司 | Processing method and device for synthesized voice, storage medium and electronic equipment |
CN112349268A (en) * | 2020-11-09 | 2021-02-09 | 湖南芒果听见科技有限公司 | Emergency broadcast audio processing system and operation method thereof |
WO2022141126A1 (en) * | 2020-12-29 | 2022-07-07 | 深圳市优必选科技股份有限公司 | Personalized speech conversion training method, computer device, and storage medium |
CN112767957A (en) * | 2020-12-31 | 2021-05-07 | 科大讯飞股份有限公司 | Method for obtaining prediction model, method for predicting voice waveform and related device |
CN112767957B (en) * | 2020-12-31 | 2024-05-31 | 中国科学技术大学 | Method for obtaining prediction model, prediction method of voice waveform and related device |
CN112863483A (en) * | 2021-01-05 | 2021-05-28 | 杭州一知智能科技有限公司 | Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm |
CN112992162A (en) * | 2021-04-16 | 2021-06-18 | 杭州一知智能科技有限公司 | Tone cloning method, system, device and computer readable storage medium |
CN112951203A (en) * | 2021-04-25 | 2021-06-11 | 平安科技(深圳)有限公司 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium |
CN112951203B (en) * | 2021-04-25 | 2023-12-29 | 平安创科科技(北京)有限公司 | Speech synthesis method, device, electronic equipment and storage medium |
CN113450764B (en) * | 2021-07-08 | 2024-02-06 | 平安科技(深圳)有限公司 | Text voice recognition method, device, equipment and storage medium |
CN113450764A (en) * | 2021-07-08 | 2021-09-28 | 平安科技(深圳)有限公司 | Text voice recognition method, device, equipment and storage medium |
CN113569196A (en) * | 2021-07-15 | 2021-10-29 | 苏州仰思坪半导体有限公司 | Data processing method, device, medium and equipment |
CN113838450A (en) * | 2021-08-11 | 2021-12-24 | 北京百度网讯科技有限公司 | Audio synthesis and corresponding model training method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2020215666A1 (en) | 2020-10-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110033755A (en) | Phoneme synthesizing method, device, computer equipment and storage medium | |
JP7106680B2 (en) | Text-to-Speech Synthesis in Target Speaker's Voice Using Neural Networks | |
EP3895159B1 (en) | Multi-speaker neural text-to-speech synthesis | |
CN108573693B (en) | Text-to-speech system and method, and storage medium therefor | |
Wang et al. | Tacotron: A fully end-to-end text-to-speech synthesis model | |
KR20240096867A (en) | Two-level speech prosody transfer | |
JP7228998B2 (en) | Speech synthesizer and program | |
KR20230133362A (en) | Generate diverse and natural text-to-speech conversion samples | |
Hu et al. | Whispered and Lombard neural speech synthesis | |
CN117678013A (en) | Two-level text-to-speech system using synthesized training data | |
CN116601702A (en) | End-to-end neural system for multi-speaker and multi-language speech synthesis | |
WO2015025788A1 (en) | Quantitative f0 pattern generation device and method, and model learning device and method for generating f0 pattern | |
Ronanki et al. | A Hierarchical Encoder-Decoder Model for Statistical Parametric Speech Synthesis. | |
CN113963679A (en) | Voice style migration method and device, electronic equipment and storage medium | |
JP6330069B2 (en) | Multi-stream spectral representation for statistical parametric speech synthesis | |
Raghavendra et al. | Speech synthesis using artificial neural networks | |
Gong et al. | TacoLPCNet: Fast and Stable TTS by Conditioning LPCNet on Mel Spectrogram Predictions. | |
CN115762471A (en) | Voice synthesis method, device, equipment and storage medium | |
JPWO2010104040A1 (en) | Speech synthesis apparatus, speech synthesis method, and speech synthesis program based on a single speech recognition/synthesis model | |
JP7357518B2 (en) | Speech synthesis device and program | |
Sulír et al. | Hidden Markov Model based speech synthesis system in Slovak language with speaker interpolation | |
Govender et al. | The CSTR entry to the 2018 Blizzard Challenge | |
Alastalo | Finnish end-to-end speech synthesis with Tacotron 2 and WaveNet | |
CN118366430B (en) | Personified speech synthesis method and device, and readable storage medium | |
Louw | Neural speech synthesis for resource-scarce languages |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190719 |