CN110033755A - Speech synthesis method, device, computer equipment and storage medium - Google Patents

Speech synthesis method, device, computer equipment and storage medium

Info

Publication number
CN110033755A
Authority
CN
China
Prior art keywords
speech
model
samples
acoustic
acoustic feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910328125.XA
Other languages
Chinese (zh)
Inventor
彭俊清
尚迪雅
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910328125.XA priority Critical patent/CN110033755A/en
Publication of CN110033755A publication Critical patent/CN110033755A/en
Priority to PCT/CN2019/116509 priority patent/WO2020215666A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a speech synthesis method, a device, computer equipment, and a storage medium. The method includes: obtaining speech samples; performing feature extraction on the speech samples to obtain the acoustic feature sequences corresponding to the speech samples; training a preset acoustic model, which is based on the wavenet network, according to the speech samples and the corresponding acoustic feature sequences, and determining the acoustic model that meets a preset requirement after training as the speech synthesis model; and obtaining speech parameters parsed from the speech text to be synthesized, inputting the speech parameters into the speech synthesis model, and obtaining the synthesized speech output by the speech synthesis model. The speech synthesis method provided by the invention improves the processing speed and accuracy of sound processing systems that simulate real human pronunciation.

Description

Speech synthesis method, device, computer equipment and storage medium
Technical field
The present invention relates to the field of speech synthesis, and in particular to a speech synthesis method, a device, computer equipment, and a storage medium.
Background technique
Existing text-to-speech (TTS) systems convert text into speech. Although they are highly efficient, the converted speech still differs considerably from real human speech.
Some existing TTS systems do include sound processing components that can synthesize speech simulating real human pronunciation (recordings); however, these components are slow and have a high error rate.
Summary of the invention
In view of the above technical problems, it is necessary to provide a speech synthesis method, a device, computer equipment, and a storage medium that improve the processing speed and accuracy of sound processing systems simulating real human pronunciation.
A speech synthesis method, comprising:
obtaining speech samples;
performing feature extraction on the speech samples to obtain the acoustic feature sequences corresponding to the speech samples;
training a preset acoustic model according to the speech samples and the corresponding acoustic feature sequences, and determining the acoustic model that meets a preset requirement after training as the speech synthesis model, the preset acoustic model being based on the wavenet network;
obtaining speech parameters parsed from the speech text to be synthesized, inputting the speech parameters into the speech synthesis model, and obtaining the synthesized speech output by the speech synthesis model.
A speech synthesis device, comprising:
a sample obtaining module, configured to obtain speech samples;
a feature extraction module, configured to perform feature extraction on the speech samples to obtain the corresponding acoustic feature sequences;
a training module, configured to train a preset acoustic model according to the speech samples and the corresponding acoustic feature sequences, and to determine the acoustic model that meets a preset requirement after training as the speech synthesis model, the preset acoustic model being based on the wavenet network;
a synthesis module, configured to obtain the speech parameters parsed from the speech text to be synthesized, input the speech parameters into the speech synthesis model, and obtain the synthesized speech output by the speech synthesis model.
A computer equipment, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the above speech synthesis method when executing the computer program.
A computer-readable storage medium storing a computer program, wherein the computer program implements the above speech synthesis method when executed by a processor.
With the above speech synthesis method, device, computer equipment, and storage medium, speech samples are obtained; these samples are used to train the speech synthesis model to be constructed and improve the realism of its pronunciation. Feature extraction is performed on the speech samples to obtain the corresponding acoustic feature sequences; separating the feature-extraction step (obtaining the acoustic feature sequences) from the training of the acoustic model reduces the amount of computation during training. A preset acoustic model based on the wavenet network is trained according to the speech samples and the corresponding acoustic feature sequences, and the acoustic model that meets a preset requirement after training is determined as the speech synthesis model, yielding a model of high quality in terms of naturalness (that is, fidelity to real human speech). The speech parameters parsed from the speech text to be synthesized are obtained and input into the speech synthesis model, and the synthesized speech output by the model is obtained, thereby performing the speech synthesis task required by the user and producing the synthesized speech corresponding to the text. The speech synthesis method provided by the invention improves the processing speed and accuracy of sound processing systems that simulate real human pronunciation.
Detailed description of the invention
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic diagram of an application environment of the speech synthesis method in an embodiment of the invention;
Fig. 2 is a flow diagram of the speech synthesis method in an embodiment of the invention;
Fig. 3 is a flow diagram of the speech synthesis method in an embodiment of the invention;
Fig. 4 is a flow diagram of the speech synthesis method in an embodiment of the invention;
Fig. 5 is a flow diagram of the speech synthesis method in an embodiment of the invention;
Fig. 6 is a flow diagram of the speech synthesis method in an embodiment of the invention;
Fig. 7 is a structural schematic diagram of the speech synthesis device in an embodiment of the invention;
Fig. 8 is a structural schematic diagram of the speech synthesis device in an embodiment of the invention;
Fig. 9 is a structural schematic diagram of the speech synthesis device in an embodiment of the invention;
Fig. 10 is a schematic diagram of the computer equipment in an embodiment of the invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the invention without creative effort shall fall within the protection scope of the invention.
The speech synthesis method provided in this embodiment can be applied in the environment shown in Fig. 1, in which a client communicates with a server over a network. Clients include, but are not limited to, personal computers, laptops, smartphones, tablet computers, and portable wearable devices. The server can be implemented as an independent server or as a server cluster composed of multiple servers. In some cases, the server obtains the speech text to be synthesized from the client and generates the corresponding synthesized speech.
In one embodiment, as shown in Fig. 2, a speech synthesis method is provided. Taking its application on the server in Fig. 1 as an example, the method includes the following steps:
S10: obtain speech samples;
S20: perform feature extraction on the speech samples to obtain the corresponding acoustic feature sequences;
S30: train a preset acoustic model according to the speech samples and the corresponding acoustic feature sequences, and determine the acoustic model that meets a preset requirement after training as the speech synthesis model, the preset acoustic model being based on the wavenet network;
S40: obtain the speech parameters parsed from the speech text to be synthesized, input the speech parameters into the speech synthesis model, and obtain the synthesized speech output by the speech synthesis model.
In this embodiment, the speech samples may come from a public speech database or from voice data collected independently. Usually, if the formats of the collected voice data are inconsistent, the data need to be converted to a uniform format before they can be used as speech samples. Here, the format of the voice data includes, but is not limited to, the file type and the sampling frequency.
Here, the speech samples can be processed with STRAIGHT to extract acoustic features and form acoustic feature sequences. A corresponding acoustic feature sequence can be generated for each speech sample; the sequence consists of the acoustic features extracted from the sample arranged in a certain order.
STRAIGHT is a speech signal analysis and synthesis algorithm. Its characteristic is that it separates the spectral envelope of speech from the voicing information, decomposing the speech signal into mutually independent spectral parameters and fundamental frequency parameters, and it allows parameters such as fundamental frequency, duration, and speaking rate to be adjusted flexibly.
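Because STRAIGHT itself is not freely distributed, the following minimal sketch uses the WORLD vocoder (via the pyworld package) as a stand-in; it performs a comparable decomposition into mutually independent fundamental-frequency, spectral-envelope, and aperiodicity parameters. The file name is a placeholder, and the pitch modification at the end merely illustrates the flexible parameter adjustment described above.

    # Sketch of a STRAIGHT-style analysis/synthesis, using the WORLD
    # vocoder (pyworld) as a freely available substitute for STRAIGHT.
    import numpy as np
    import soundfile as sf
    import pyworld as pw

    x, fs = sf.read("sample.wav")          # placeholder path
    x = np.ascontiguousarray(x, dtype=np.float64)

    f0, t = pw.harvest(x, fs)              # fundamental frequency contour
    sp = pw.cheaptrick(x, f0, t, fs)       # smoothed spectral envelope
    ap = pw.d4c(x, f0, t, fs)              # aperiodicity

    # The decomposition is invertible, so parameters such as F0 can be
    # modified before resynthesis, e.g. raising the pitch by 20%:
    y = pw.synthesize(f0 * 1.2, sp, ap, fs)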
Speech samples, together with the acoustic feature sequences obtained from them, can be used to train the acoustic model. The acoustic model can be a neural network built on the wavenet network, an autoregressive network based on CNNs (convolutional neural networks). The wavenet network models directly at the level of the speech waveform, decomposing the joint probability of the waveform sequence into a product of conditional probabilities, so the distribution of the sample values p(x) can be predicted with the following formula:

p(x) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t-1})

where x_t is the sample at time t, and each factor expresses the probability distribution of the current sample predicted from the history available up to time t.
Given an auxiliary feature sequence a_{1:N}, the wavenet network models the conditional distribution of the waveform sequence o_{1:T}, as shown below:

p(o_{1:T} | a_{1:N}) = ∏_{t=1}^{T} p(o_t | o_{<t}, a_{1:N})

For each sample at time t, the value depends on all previous observations o_{<t}. In practice, the conditioning is limited to a finite number of previous samples, collectively known as the receptive field. By sampling the waveform sequentially at each time step, the wavenet network can generate synthesized speech of very high quality in terms of naturalness.
Structurally, the wavenet network uses a gated activation function similar to that of PixelCNN (a pixel-level autoregressive neural network); the gate unit can be computed with the following formula:

z = tanh(W_{f,k} * x) ⊙ σ(W_{g,k} * x)

where * is the convolution operation, ⊙ is element-wise multiplication, σ is the sigmoid function, and W_{f,k} and W_{g,k} are the filter convolution weights and gate convolution weights of the k-th layer, respectively. The wavenet network also uses residual connections and parameterized skip connections (skip connections) to build a deep network; this structure also helps accelerate model convergence.
After a certain number of training iterations, if the output of the acoustic model meets the preset requirement, the model whose output meets the requirement is determined to be the speech synthesis model. The obtained speech synthesis model can convert text data into audio data (i.e. synthesized speech). Here, the preset requirement can be set according to the actual situation. For example, when the difference between the speech synthesized by the wavenet network and the original speech samples is below a specified threshold, the output of the acoustic model can be deemed to meet the preset requirement. In other cases, the requirement can also be deemed met once the number of training iterations reaches a preset value.
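For illustration only, such a stopping check might look like the following; the difference measure and both thresholds are assumptions, since the text leaves them to the implementer.

    def training_finished(difference, iteration,
                          diff_threshold=0.05, max_iterations=100_000):
        # Stop when synthesized speech is close enough to the original
        # samples, or when the preset iteration budget is exhausted.
        return difference < diff_threshold or iteration >= max_iterations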
Here, the speech text to be synthesized refers to the text that needs to be converted into audio data. A preset text parsing model can parse the speech text to be synthesized and obtain the corresponding speech parameters, which include, but are not limited to, the tones of words, prosody, syllables, and the intervals between sentences.
In steps S10-S40, speech samples are obtained; they are used to train the speech synthesis model to be constructed and improve the realism of its pronunciation. Feature extraction is performed on the speech samples to obtain the corresponding acoustic feature sequences; separating the feature-extraction step (obtaining the acoustic feature sequences) from the training of the acoustic model reduces the amount of computation during training. The preset acoustic model, based on the wavenet network, is trained according to the speech samples and the corresponding acoustic feature sequences, and the model that meets the preset requirement after training is determined as the speech synthesis model, yielding a model of high quality in terms of naturalness (fidelity to real human speech). The speech parameters parsed from the speech text to be synthesized are obtained and input into the speech synthesis model, and the synthesized speech output by the model is obtained, thereby performing the speech synthesis task required by the user and producing the synthesized speech corresponding to the text.
Optionally, as shown in Fig. 3, before step S40 the method further includes:
S41: obtain the speech text to be synthesized;
S42: input the speech text to be synthesized into a preset text parsing model, and obtain the speech parameters corresponding to the speech text to be synthesized output by the preset text parsing model.
In this embodiment, the trained speech synthesis model can synthesize the speech text to be synthesized into speech close to real human pronunciation. The preset text parsing model can parse information such as the tone, prosody, and syllables of every word contained in the text and generate the corresponding speech parameters (which can be expressed in the form of a context-information label file). This process is called text analysis. The obtained speech parameters can then be converted into synthesized speech by the speech synthesis model. In the overall speech synthesis process, text analysis provides important evidence for the back-end synthesis; its quality directly affects the naturalness and accuracy of the synthesized speech.
In one example, the initials and finals of standard Chinese are used as the speech synthesis primitives. For an input Chinese text, guided by a syntactic lexicon and a syntax rule base, the text undergoes normalization, syntactic analysis, prosody prediction analysis, and grapheme-to-phoneme conversion, successively obtaining the sentence information, word information, prosodic structure information, and the initials and finals of every Chinese character of the input text, so as to obtain the speech synthesis primitives (initials and finals) of the input text together with the context information of each primitive. Finally, speech parameters are generated that contain the single-phoneme label and the context-dependent label of every word in the Chinese text.
In steps S41-S42, the speech text to be synthesized is obtained, providing the text to be processed for synthesizing the corresponding speech. The text is input into the preset text parsing model, and the speech parameters output by the model and corresponding to the text are obtained; these parameters match the text to be synthesized closely, helping to obtain synthesized speech of higher quality.
Optionally, as shown in Fig. 4, step S20 includes:
S201: segment the speech samples into multiple speech frames;
S202: compute the acoustic features of each speech frame, the acoustic features including fundamental frequency, energy, and mel-frequency cepstral coefficients;
S203: sort the acoustic features of the speech frames in chronological order to form the acoustic feature sequence.
In this embodiment, the speech samples can be segmented into multiple speech frames as appropriate. Speech samples are speech signals, and only stationary information can be processed directly; since speech is a quasi-stationary signal, the samples need to be divided into frames. Each frame can be 20-30 ms long; within such an interval the speech signal can be regarded as a steady-state signal. In some cases a wavelet transform is applied after framing, that is, each frame is transformed and processed individually.
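A minimal sketch of the framing step; the 25 ms frame length and 10 ms hop are illustrative values chosen within the 20-30 ms range given above, not mandated by the method.

    import numpy as np

    def frame_signal(x, fs, frame_ms=25, hop_ms=10):
        # Split the waveform into overlapping fixed-length frames.
        frame_len = int(fs * frame_ms / 1000)
        hop_len = int(fs * hop_ms / 1000)
        n_frames = 1 + max(0, (len(x) - frame_len) // hop_len)
        return np.stack([x[i * hop_len : i * hop_len + frame_len]
                         for i in range(n_frames)])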
A speech sample I_{1:N} may be expressed as I_{1:N} = {I_1, I_2, ..., I_N}, where N is its total number of frames. Feature extraction can be performed on the speech information of each frame to obtain its acoustic features, which include, but are not limited to, fundamental frequency, energy, and mel-frequency cepstral coefficients.
The fundamental frequency determines the timbre and intonation of speech. When a person utters a voiced sound, the airflow causes the vocal cords to vibrate periodically, producing the fundamental tone; the frequency of this vibration is the fundamental frequency. It can be extracted with the short-time autocorrelation function as follows (a sketch appears after the formulas below):
pre-process the speech frame, including pre-emphasis, denoising, and windowing;
compute the short-time autocorrelation function of the data and find its local maxima;
the estimated pitch period of the speech is the lag at which the first peak appears;
the fundamental frequency is then obtained as the reciprocal of the pitch period.
The short-time autocorrelation function can be defined as:

R_m(k) = Σ_n x(n) x(n - k)

where x(n) is the speech signal and m indicates that the window function is applied starting from sample m.
Speech can be divided into silent, unvoiced, and voiced segments. The signal amplitude of unvoiced segments is small and irregular, while the amplitude of voiced segments is larger and varies regularly, with a certain quasi-periodicity. The analysis therefore exploits the short-time (10-30 ms) stationarity of speech: because speech changes very slowly, it can be assumed to be almost unchanged within a short interval, so a voiced signal can be considered periodic over a short period of time:

x(t + nτ_0) = x(t)

where x(t) is the speech signal at time t and τ_0 is the period of the voiced signal, called the pitch period. From the relationship between time and frequency:

f_0 = 1 / τ_0

where f_0 is the fundamental frequency; since τ_0 is almost unchanged, f_0 can also be considered almost unchanged.
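A minimal sketch of the autocorrelation pitch estimator described by the steps above; the 50-400 Hz search band is an illustrative assumption.

    import numpy as np

    def estimate_f0(frame, fs, f_min=50.0, f_max=400.0):
        frame = frame * np.hamming(len(frame))        # windowing
        r = np.correlate(frame, frame, mode="full")   # autocorrelation
        r = r[len(frame) - 1:]                        # keep lags k >= 0
        lo, hi = int(fs / f_max), int(fs / f_min)     # plausible periods
        k = lo + np.argmax(r[lo:hi])                  # lag of the peak
        return fs / k                                 # f0 = 1 / tau0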
Energy reflects the intensity of the speaker's voice, which differs across emotional states: when a person is excited, the intensity of the speech is significantly greater than in a normal state, while in a listless, dejected state the intensity drops markedly.
The intensity of sound is generally expressed with the short-time energy and the short-time average magnitude. The short-time energy of a speech signal is defined as:

E_n = Σ_m [x(m) w(n - m)]^2

where n denotes the n-th time instant and x(m) is the speech signal; that is, the short-time energy is the weighted sum of squares of the sample values in one frame. Letting

h(n) = w^2(n)

we then have:

E_n = Σ_m x^2(m) h(n - m)

That is, each sample of the speech signal is first squared and then passed through a filter h(n); the output is the time series formed by the short-time energy.
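A minimal sketch of this short-time energy computation; frame and hop lengths are illustrative, and the Hamming window stands in for w(n).

    import numpy as np

    def short_time_energy(x, frame_len=400, hop_len=160):
        h = np.hamming(frame_len) ** 2                # h(n) = w^2(n)
        energies = []
        for start in range(0, len(x) - frame_len + 1, hop_len):
            seg = x[start:start + frame_len] ** 2     # squared samples
            energies.append(np.sum(seg * h))          # E_n = sum x^2(m) h(n-m)
        return np.array(energies)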
The mel-frequency cepstral coefficients (MFCC) can be obtained with the following steps (a sketch follows the steps):
first apply pre-emphasis, framing, and windowing to the speech frame;
for each short-time analysis window, obtain the corresponding spectrum via the FFT (fast Fourier transform);
pass the spectrum through a mel filter bank to obtain the mel spectrum;
perform cepstral analysis on the mel spectrum to obtain the mel-frequency cepstral coefficients.
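A hedged sketch of this MFCC pipeline using librosa, whose mfcc routine internally performs the FFT, mel filter bank, and cepstral (DCT) steps; pre-emphasis is applied explicitly first. All parameter values are illustrative assumptions.

    import numpy as np
    import librosa

    def compute_mfcc(x, fs, n_mfcc=13):
        x = np.append(x[0], x[1:] - 0.97 * x[:-1])    # pre-emphasis
        # Framing, windowing, FFT, mel filtering, and the cepstral DCT
        # all happen inside librosa.feature.mfcc.
        return librosa.feature.mfcc(y=x, sr=fs, n_mfcc=n_mfcc,
                                    n_fft=512, hop_length=160)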
In some cases, a processing model based on STRAIGHT can also be used to process each speech frame and generate the corresponding acoustic features.
In some cases, the acoustic features may also include the pronunciation duration in the speech samples and features related to it. Pronunciation duration refers to how long a person speaks, and speaking rate varies with emotional state. For example, when a person is excited or agitated, the nervous system is in a highly aroused state and speech tends to be fast, with a high speaking rate. Conversely, a sad person tends to speak weakly and slowly because of low spirits.
In steps S201-S203, the speech samples are segmented into multiple speech frames, dividing each sample into speech segments (i.e. speech frames) that are convenient for computer processing. The acoustic features of each frame are computed, including fundamental frequency, energy, and mel-frequency cepstral coefficients, yielding the important acoustic features of the samples. The acoustic features of the frames are sorted in chronological order to form the acoustic feature sequence, which can be fed directly into the acoustic model for training, reducing the computation and training time of the acoustic model.
Optionally, as shown in Fig. 5, step S30 includes:
S301: obtain mixed speech data and the corresponding acoustic feature sequences;
S302: input the mixed speech data and the corresponding acoustic feature sequences into the acoustic model for pre-training, obtaining a pre-trained model;
S303: obtain specific speech data and the corresponding acoustic feature sequences;
S304: input the specific speech data and the corresponding acoustic feature sequences into the pre-trained model for training.
Here, mixed speech data refers to a sample set composed of speech samples containing different voices: one sample may contain the voices of more than one person, or the speaker of one sample differs from the speaker of another. Feature extraction can be performed on each sample in the mixed speech data to obtain its acoustic feature sequence. Pre-training is identical to the normal training process; the term is used here only to distinguish it from the training in step S304. The number of iterations required for training can be determined according to actual needs. After training on the mixed speech data, the corresponding pre-trained model is obtained.
Specific speech data refers to a sample set composed of speech samples of the same person's pronunciation; for example, all samples in the specific speech data are uttered by user A. Feature extraction can be performed on each sample in the specific speech data to obtain its acoustic feature sequence. After training on the specific speech data, a speech synthesis model matched to the speaker of that data is obtained; that is, the model can produce synthesized speech resembling the speaker, e.g., the pronunciation of a specific celebrity. Because the model is first pre-trained on mixed speech data and then further adapted with specific speech data, the trained acoustic model generalizes well and can establish the mapping between acoustic features and speech waveforms across different speakers' acoustic characteristics.
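A minimal sketch of this two-stage schedule; the model and dataset objects, the loss interface, and the step counts are assumptions, not part of the method.

    def train(model, dataset, optimizer, steps):
        # dataset is assumed to yield (acoustic_features, waveform) pairs.
        for step, (features, waveform) in zip(range(steps), dataset):
            loss = model.loss(features, waveform)   # hypothetical model API
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return model

    # Stage 1: pre-train on the mixed multi-speaker set.
    model = train(model, mixed_speaker_data, optimizer, steps=200_000)
    # Stage 2: adapt (fine-tune) on the single target speaker.
    model = train(model, target_speaker_data, optimizer, steps=20_000)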
In steps S301-S304, mixed speech data and the corresponding acoustic feature sequences are obtained as training samples. The mixed speech data and the corresponding acoustic feature sequences are input into the acoustic model for pre-training, yielding a pre-trained model; training on mixed speech data gives the trained acoustic model good generalization ability. Specific speech data and the corresponding acoustic feature sequences are then obtained and input into the pre-trained model for further adaptive training, producing a speech synthesis model of higher synthesis quality.
Optionally, as shown in Fig. 6, step S30 further includes:
S305: select speech samples at a preset ratio;
S306: add noise to the selected speech samples to form noisy samples;
S307: obtain the acoustic feature sequences corresponding to the noisy samples;
S308: train the preset acoustic model according to the noisy samples with their corresponding acoustic feature sequences and the speech samples with their corresponding acoustic feature sequences.
In this embodiment, noise can be added to part of the speech samples so that the trained acoustic model has better adaptability and generalization ability while the accuracy of speech synthesis is improved. The preset ratio can be determined according to actual needs, for example 1-10%. Here, noise refers to ambient sound in different scenes, such as airport noise, office noise, market noise, or supermarket noise. The speech samples with added noise are the noisy samples. Feature extraction and acoustic-model training for the noisy samples proceed in almost the same way as for the noise-free samples and are not repeated here.
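A minimal sketch of the noise-mixing step; scaling the noise to a target signal-to-noise ratio is an illustrative choice, since the embodiment only requires that noise be added to a preset share of the samples.

    import numpy as np

    def add_noise(speech, noise, snr_db=10.0):
        # Mix a scene recording (airport, office, market, ...) into a
        # speech sample at the given signal-to-noise ratio.
        noise = np.resize(noise, speech.shape)        # match lengths
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
        return speech + scale * noise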
In steps S305-S308, speech samples are selected at a preset ratio; different ratios can be set for different scenes. Noise is added to the selected samples to form noisy samples, yielding multiple training samples containing specific noise. The acoustic feature sequences corresponding to the noisy samples are obtained and can be input directly into the acoustic model for training. The preset acoustic model is then trained on the noisy samples and the original speech samples together with their corresponding acoustic feature sequences, so that the obtained speech synthesis model has better adaptability and generalization ability.
It should be understood that the step numbers in the above embodiments do not imply an execution order; the execution order of the processes is determined by their functions and internal logic and does not constitute any limitation on the implementation of the embodiments of the invention.
In one embodiment, a speech synthesis device is provided, corresponding one-to-one to the speech synthesis method in the above embodiments. As shown in Fig. 7, the speech synthesis device includes a sample obtaining module 10, a feature extraction module 20, a training module 30, and a synthesis module 40. The functional modules are described in detail as follows:
the sample obtaining module 10 is configured to obtain speech samples;
the feature extraction module 20 is configured to perform feature extraction on the speech samples to obtain the corresponding acoustic feature sequences;
the training module 30 is configured to train a preset acoustic model according to the speech samples and the corresponding acoustic feature sequences, and to determine the acoustic model that meets the preset requirement after training as the speech synthesis model, the preset acoustic model being based on the wavenet network;
the synthesis module 40 is configured to obtain the speech parameters parsed from the speech text to be synthesized, input the speech parameters into the speech synthesis model, and obtain the synthesized speech output by the speech synthesis model.
Optionally, as shown in Fig. 8, the speech synthesis device further includes:
a text obtaining module 50, configured to obtain the speech text to be synthesized;
a speech parameter obtaining module 60, configured to input the speech text to be synthesized into the preset text parsing model and obtain the speech parameters corresponding to the text output by the preset text parsing model.
Optionally, as shown in Fig. 9, the feature extraction module 20 includes:
a speech segmentation unit 201, configured to segment the speech samples into multiple speech frames;
an acoustic feature extraction unit 202, configured to compute the acoustic features of each speech frame, the acoustic features including fundamental frequency, energy, and mel-frequency cepstral coefficients;
an acoustic feature sequence generation unit 203, configured to sort the acoustic features of the speech frames in chronological order to form the acoustic feature sequence.
Optionally, the training module 30 includes:
a mixed speech obtaining unit, configured to obtain mixed speech data and the corresponding acoustic feature sequences;
a pre-training unit, configured to input the mixed speech data and the corresponding acoustic feature sequences into the acoustic model for pre-training, obtaining a pre-trained model;
a specific speech obtaining unit, configured to obtain specific speech data and the corresponding acoustic feature sequences;
a specific speech training unit, configured to input the specific speech data and the corresponding acoustic feature sequences into the pre-trained model for training.
Optionally, the training module 30 further includes:
a speech sample selection unit, configured to select speech samples at a preset ratio;
a noise addition unit, configured to add noise to the selected speech samples to form noisy samples;
a noise feature obtaining unit, configured to obtain the acoustic feature sequences corresponding to the noisy samples;
a noisy sample training unit, configured to train the preset acoustic model according to the noisy samples with their corresponding acoustic feature sequences and the speech samples with their corresponding acoustic feature sequences.
For the specific limitations of the speech synthesis device, refer to the limitations of the speech synthesis method above, which are not repeated here. Each module of the above speech synthesis device can be implemented in whole or in part by software, hardware, or a combination thereof. The modules can be embedded in or independent of a processor in the computer equipment in hardware form, or stored in the memory of the computer equipment in software form, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer equipment is provided; it can be a server, and its internal structure can be as shown in Fig. 10. The computer equipment includes a processor, a memory, a network interface, and a database connected through a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system, a computer program, and a database, and the internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database stores the data involved in the speech synthesis method. The network interface communicates with external terminals over a network. The computer program, when executed by the processor, implements a speech synthesis method.
In one embodiment, a computer equipment is provided, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; the processor implements the following steps when executing the computer program:
obtain speech samples;
perform feature extraction on the speech samples to obtain the corresponding acoustic feature sequences;
train a preset acoustic model according to the speech samples and the corresponding acoustic feature sequences, and determine the acoustic model that meets a preset requirement after training as the speech synthesis model, the preset acoustic model being based on the wavenet network;
obtain the speech parameters parsed from the speech text to be synthesized, input the speech parameters into the speech synthesis model, and obtain the synthesized speech output by the speech synthesis model.
In one embodiment, a computer-readable storage medium is provided, storing a computer program; the computer program implements the following steps when executed by a processor:
obtain speech samples;
perform feature extraction on the speech samples to obtain the corresponding acoustic feature sequences;
train a preset acoustic model according to the speech samples and the corresponding acoustic feature sequences, and determine the acoustic model that meets a preset requirement after training as the speech synthesis model, the preset acoustic model being based on the wavenet network;
obtain the speech parameters parsed from the speech text to be synthesized, input the speech parameters into the speech synthesis model, and obtain the synthesized speech output by the speech synthesis model.
Those of ordinary skill in the art will appreciate that all or part of the processes in the above method embodiments can be implemented by instructing the relevant hardware with a computer program, which can be stored in a non-volatile computer-readable storage medium; when executed, the program may include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the division of the above functional units and modules is used as an example; in practical applications, the above functions can be assigned to different functional units or modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications or replacements do not depart the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the invention, and shall all be included within the protection scope of the invention.

Claims (10)

1. A speech synthesis method, characterized by comprising:
obtaining speech samples;
performing feature extraction on the speech samples to obtain the acoustic feature sequences corresponding to the speech samples;
training a preset acoustic model according to the speech samples and the corresponding acoustic feature sequences, and determining the acoustic model that meets a preset requirement after training as the speech synthesis model, the preset acoustic model being based on the wavenet network;
obtaining speech parameters parsed from the speech text to be synthesized, inputting the speech parameters into the speech synthesis model, and obtaining the synthesized speech output by the speech synthesis model.
2. The speech synthesis method according to claim 1, characterized in that, before the obtaining of the speech parameters parsed from the speech text to be synthesized, the inputting of the speech parameters into the speech synthesis model, and the obtaining of the synthesized speech output by the speech synthesis model, the method further comprises:
obtaining the speech text to be synthesized;
inputting the speech text to be synthesized into a preset text parsing model, and obtaining the speech parameters corresponding to the speech text to be synthesized output by the preset text parsing model.
3. The speech synthesis method according to claim 1, characterized in that the performing of feature extraction on the speech samples to obtain the corresponding acoustic feature sequences comprises:
segmenting the speech samples into multiple speech frames;
computing the acoustic features of each speech frame, the acoustic features including fundamental frequency, energy, and mel-frequency cepstral coefficients;
sorting the acoustic features of the speech frames in chronological order to form the acoustic feature sequence.
4. The speech synthesis method according to claim 1, characterized in that the training of the preset acoustic model according to the speech samples and the corresponding acoustic feature sequences comprises:
obtaining mixed speech data and the corresponding acoustic feature sequences;
inputting the mixed speech data and the corresponding acoustic feature sequences into the acoustic model for pre-training, obtaining a pre-trained model;
obtaining specific speech data and the corresponding acoustic feature sequences;
inputting the specific speech data and the corresponding acoustic feature sequences into the pre-trained model for training.
5. The speech synthesis method according to claim 1, characterized in that the training of the preset acoustic model according to the speech samples and the corresponding acoustic feature sequences comprises:
selecting speech samples at a preset ratio;
adding noise to the selected speech samples to form noisy samples;
obtaining the acoustic feature sequences corresponding to the noisy samples;
training the preset acoustic model according to the noisy samples with their corresponding acoustic feature sequences and the speech samples with their corresponding acoustic feature sequences.
6. A speech synthesis device, characterized by comprising:
a sample obtaining module, configured to obtain speech samples;
a feature extraction module, configured to perform feature extraction on the speech samples to obtain the corresponding acoustic feature sequences;
a training module, configured to train a preset acoustic model according to the speech samples and the corresponding acoustic feature sequences, and to determine the acoustic model that meets a preset requirement after training as the speech synthesis model, the preset acoustic model being based on the wavenet network;
a synthesis module, configured to obtain the speech parameters parsed from the speech text to be synthesized, input the speech parameters into the speech synthesis model, and obtain the synthesized speech output by the speech synthesis model.
7. The speech synthesis device according to claim 6, characterized by further comprising:
a text obtaining module, configured to obtain the speech text to be synthesized;
a speech parameter obtaining module, configured to input the speech text to be synthesized into a preset text parsing model and obtain the speech parameters corresponding to the text output by the preset text parsing model.
8. The speech synthesis device according to claim 6, characterized in that the feature extraction module comprises:
a speech segmentation unit, configured to segment the speech samples into multiple speech frames;
an acoustic feature extraction unit, configured to compute the acoustic features of each speech frame, the acoustic features including fundamental frequency, energy, and mel-frequency cepstral coefficients;
an acoustic feature sequence generation unit, configured to sort the acoustic features of the speech frames in chronological order to form the acoustic feature sequence.
9. A computer equipment, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that the processor implements the speech synthesis method according to any one of claims 1 to 5 when executing the computer program.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program implements the speech synthesis method according to any one of claims 1 to 5 when executed by a processor.
CN201910328125.XA 2019-04-23 2019-04-23 Speech synthesis method, device, computer equipment and storage medium Pending CN110033755A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910328125.XA CN110033755A (en) 2019-04-23 2019-04-23 Speech synthesis method, device, computer equipment and storage medium
PCT/CN2019/116509 WO2020215666A1 (en) 2019-04-23 2019-11-08 Speech synthesis method and apparatus, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910328125.XA CN110033755A (en) 2019-04-23 2019-04-23 Speech synthesis method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110033755A true CN110033755A (en) 2019-07-19

Family

ID=67239848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910328125.XA Pending CN110033755A (en) Speech synthesis method, device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110033755A (en)
WO (1) WO2020215666A1 (en)


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107945786B (en) * 2017-11-27 2021-05-25 北京百度网讯科技有限公司 Speech synthesis method and device
CN109102796A (en) * 2018-08-31 2018-12-28 北京未来媒体科技股份有限公司 A kind of phoneme synthesizing method and device
CN110033755A (en) * 2019-04-23 2019-07-19 平安科技(深圳)有限公司 Speech synthesis method, device, computer equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514878A (en) * 2012-06-27 2014-01-15 北京百度网讯科技有限公司 Acoustic modeling method and device, and speech recognition method and device
CN108573694A (en) * 2018-02-01 2018-09-25 北京百度网讯科技有限公司 Language material expansion and speech synthesis system construction method based on artificial intelligence and device
CN108597492A (en) * 2018-05-02 2018-09-28 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device
CN108630190A (en) * 2018-05-18 2018-10-09 百度在线网络技术(北京)有限公司 Method and apparatus for generating phonetic synthesis model
CN109036371A (en) * 2018-07-19 2018-12-18 北京光年无限科技有限公司 Audio data generation method and system for speech synthesis
CN108899009A (en) * 2018-08-17 2018-11-27 百卓网络科技有限公司 A kind of Chinese Speech Synthesis System based on phoneme

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Aäron van den Oord et al., "WaveNet: A Generative Model for Raw Audio", arXiv *
Jonathan Shen et al., "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions", arXiv *
伍宏传, "Research on Speech Synthesis Vocoders Based on Convolutional Neural Networks", China Masters' Theses Full-text Database, 2019 *

Cited By (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020215666A1 (en) * 2019-04-23 2020-10-29 平安科技(深圳)有限公司 Speech synthesis method and apparatus, computer device, and storage medium
CN110675881A (en) * 2019-09-05 2020-01-10 北京捷通华声科技股份有限公司 Voice verification method and device
CN111816158B (en) * 2019-09-17 2023-08-04 北京京东尚科信息技术有限公司 Speech synthesis method and device and storage medium
CN111816158A (en) * 2019-09-17 2020-10-23 北京京东尚科信息技术有限公司 Voice synthesis method and device and storage medium
CN111081216A (en) * 2019-12-26 2020-04-28 上海优扬新媒信息技术有限公司 Audio synthesis method, device, server and storage medium
CN113192482A (en) * 2020-01-13 2021-07-30 北京地平线机器人技术研发有限公司 Speech synthesis method and training method, device and equipment of speech synthesis model
CN113192482B (en) * 2020-01-13 2023-03-21 北京地平线机器人技术研发有限公司 Speech synthesis method and training method, device and equipment of speech synthesis model
CN111276119A (en) * 2020-01-17 2020-06-12 平安科技(深圳)有限公司 Voice generation method and system and computer equipment
CN111276119B (en) * 2020-01-17 2023-08-22 平安科技(深圳)有限公司 Speech generation method, system and computer equipment
CN111276120A (en) * 2020-01-21 2020-06-12 华为技术有限公司 Speech synthesis method, apparatus and computer-readable storage medium
CN111276120B (en) * 2020-01-21 2022-08-19 华为技术有限公司 Speech synthesis method, apparatus and computer-readable storage medium
CN111276121B (en) * 2020-01-23 2021-04-30 北京世纪好未来教育科技有限公司 Voice alignment method and device, electronic equipment and storage medium
CN111276121A (en) * 2020-01-23 2020-06-12 北京世纪好未来教育科技有限公司 Voice alignment method and device, electronic equipment and storage medium
CN113299272B (en) * 2020-02-06 2023-10-31 菜鸟智能物流控股有限公司 Speech synthesis model training and speech synthesis method, equipment and storage medium
CN113299272A (en) * 2020-02-06 2021-08-24 菜鸟智能物流控股有限公司 Speech synthesis model training method, speech synthesis apparatus, and storage medium
CN111292720B (en) * 2020-02-07 2024-01-23 北京字节跳动网络技术有限公司 Speech synthesis method, device, computer readable medium and electronic equipment
CN111292720A (en) * 2020-02-07 2020-06-16 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN111312208A (en) * 2020-03-09 2020-06-19 广州深声科技有限公司 Neural network vocoder system with irrelevant speakers
CN111402923B (en) * 2020-03-27 2023-11-03 中南大学 Emotion voice conversion method based on wavenet
CN111402923A (en) * 2020-03-27 2020-07-10 中南大学 Emotional voice conversion method based on wavenet
CN111489734B (en) * 2020-04-03 2023-08-22 支付宝(杭州)信息技术有限公司 Model training method and device based on multiple speakers
CN111489734A (en) * 2020-04-03 2020-08-04 支付宝(杭州)信息技术有限公司 Model training method and device based on multiple speakers
CN113257236B (en) * 2020-04-30 2022-03-29 浙江大学 Model score optimization method based on core frame screening
CN113257236A (en) * 2020-04-30 2021-08-13 浙江大学 Model score optimization method based on core frame screening
CN111696517A (en) * 2020-05-28 2020-09-22 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis device, computer equipment and computer readable storage medium
CN111785303B (en) * 2020-06-30 2024-04-16 合肥讯飞数码科技有限公司 Model training method, imitation sound detection device, equipment and storage medium
CN111785303A (en) * 2020-06-30 2020-10-16 合肥讯飞数码科技有限公司 Model training method, simulated sound detection method, device, equipment and storage medium
CN111916049A (en) * 2020-07-15 2020-11-10 北京声智科技有限公司 Voice synthesis method and device
CN111916049B (en) * 2020-07-15 2021-02-09 北京声智科技有限公司 Voice synthesis method and device
CN111968678A (en) * 2020-09-11 2020-11-20 腾讯科技(深圳)有限公司 Audio data processing method, device and equipment and readable storage medium
CN111968678B (en) * 2020-09-11 2024-02-09 腾讯科技(深圳)有限公司 Audio data processing method, device, equipment and readable storage medium
CN112289298A (en) * 2020-09-30 2021-01-29 北京大米科技有限公司 Processing method and device for synthesized voice, storage medium and electronic equipment
CN112349268A (en) * 2020-11-09 2021-02-09 湖南芒果听见科技有限公司 Emergency broadcast audio processing system and operation method thereof
WO2022141126A1 (en) * 2020-12-29 2022-07-07 深圳市优必选科技股份有限公司 Personalized speech conversion training method, computer device, and storage medium
CN112767957A (en) * 2020-12-31 2021-05-07 科大讯飞股份有限公司 Method for obtaining prediction model, method for predicting voice waveform and related device
CN112767957B (en) * 2020-12-31 2024-05-31 中国科学技术大学 Method for obtaining prediction model, prediction method of voice waveform and related device
CN112863483A (en) * 2021-01-05 2021-05-28 杭州一知智能科技有限公司 Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
CN112992162A (en) * 2021-04-16 2021-06-18 杭州一知智能科技有限公司 Tone cloning method, system, device and computer readable storage medium
CN112951203A (en) * 2021-04-25 2021-06-11 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN112951203B (en) * 2021-04-25 2023-12-29 平安创科科技(北京)有限公司 Speech synthesis method, device, electronic equipment and storage medium
CN113450764B (en) * 2021-07-08 2024-02-06 平安科技(深圳)有限公司 Text voice recognition method, device, equipment and storage medium
CN113450764A (en) * 2021-07-08 2021-09-28 平安科技(深圳)有限公司 Text voice recognition method, device, equipment and storage medium
CN113569196A (en) * 2021-07-15 2021-10-29 苏州仰思坪半导体有限公司 Data processing method, device, medium and equipment
CN113838450A (en) * 2021-08-11 2021-12-24 北京百度网讯科技有限公司 Audio synthesis and corresponding model training method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2020215666A1 (en) 2020-10-29

Similar Documents

Publication Publication Date Title
CN110033755A (en) Speech synthesis method, device, computer equipment and storage medium
JP7106680B2 (en) Text-to-Speech Synthesis in Target Speaker&#39;s Voice Using Neural Networks
EP3895159B1 (en) Multi-speaker neural text-to-speech synthesis
CN108573693B (en) Text-to-speech system and method, and storage medium therefor
Wang et al. Tacotron: A fully end-to-end text-to-speech synthesis model
KR20240096867A (en) Two-level speech prosody transfer
JP7228998B2 (en) speech synthesizer and program
KR20230133362A (en) Generate diverse and natural text-to-speech conversion samples
Hu et al. Whispered and Lombard neural speech synthesis
CN117678013A (en) Two-level text-to-speech system using synthesized training data
CN116601702A (en) End-to-end nervous system for multi-speaker and multi-language speech synthesis
WO2015025788A1 (en) Quantitative f0 pattern generation device and method, and model learning device and method for generating f0 pattern
Ronanki et al. A Hierarchical Encoder-Decoder Model for Statistical Parametric Speech Synthesis.
CN113963679A (en) Voice style migration method and device, electronic equipment and storage medium
JP6330069B2 (en) Multi-stream spectral representation for statistical parametric speech synthesis
Raghavendra et al. Speech synthesis using artificial neural networks
Gong et al. TacoLPCNet: Fast and Stable TTS by Conditioning LPCNet on Mel Spectrogram Predictions.
CN115762471A (en) Voice synthesis method, device, equipment and storage medium
JPWO2010104040A1 (en) Speech synthesis apparatus, speech synthesis method and speech synthesis program based on one model speech recognition synthesis
JP7357518B2 (en) Speech synthesis device and program
Sulír et al. Hidden Markov Model based speech synthesis system in Slovak language with speaker interpolation
Govender et al. The CSTR entry to the 2018 Blizzard Challenge
Alastalo Finnish end-to-end speech synthesis with Tacotron 2 and WaveNet
CN118366430B (en) Personification voice synthesis method, personification voice synthesis device and readable storage medium
Louw Neural speech synthesis for resource-scarce languages

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190719)