CN110379407A - Adaptive speech synthesis method and apparatus, readable storage medium, and computing device - Google Patents
Adaptive speech synthesis method and apparatus, readable storage medium, and computing device
- Publication number
- CN110379407A CN110379407A CN201910661648.6A CN201910661648A CN110379407A CN 110379407 A CN110379407 A CN 110379407A CN 201910661648 A CN201910661648 A CN 201910661648A CN 110379407 A CN110379407 A CN 110379407A
- Authority
- CN
- China
- Prior art keywords
- speaker
- data
- voice
- speech
- voice data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
Embodiments of the present disclosure provide an adaptive speech synthesis method and apparatus, a readable storage medium, and a computing device capable of synthesizing good-quality speech for a speaker even when only a small amount of low-quality voice data is available. The method includes: obtaining base speech data and text data corresponding to the base speech data; training a base speech model from the base speech data and the corresponding text data; obtaining a speaker's voice data and text data corresponding to the speaker's voice data; training a GRU speech model from the speaker's voice data, the corresponding text data, and the base speech model; and, when a speech synthesis instruction is received, synthesizing the speaker's voice from the GRU speech model and the text information contained in the instruction.
Description
Technical Field
The present disclosure relates to the field of speech processing, and in particular to an adaptive speech synthesis method and apparatus, a readable storage medium, and a computing device.
Background Art
Speech synthesis is the technology by which a computer automatically generates speech corresponding to text. Current speech synthesis systems require large amounts of high-quality data (recorded with professional recording equipment), and collecting such data consumes considerable manpower and money. Moreover, every time a new speaker is added, a new batch of data must be recorded in a studio.
How to synthesize good-quality speech for a speaker when only a small amount of low-quality voice data is available is therefore an urgent technical problem.
Summary of the Invention
To this end, the present disclosure provides an adaptive speech synthesis method and apparatus, a readable storage medium, and a computing device that attempt to solve, or at least alleviate, at least one of the problems above.
According to one aspect of the embodiments of the present disclosure, an adaptive speech synthesis method is provided, comprising:
obtaining base speech data and text data corresponding to the base speech data;
training a base speech model from the base speech data and the corresponding text data;
obtaining a speaker's voice data and text data corresponding to the speaker's voice data;
training a GRU speech model from the speaker's voice data, the corresponding text data, and the base speech model; and
when a speech synthesis instruction is received, synthesizing the speaker's voice from the GRU speech model and the text information contained in the instruction.
Optionally, training the GRU speech model from the speaker's voice data, the corresponding text data, and the base speech model comprises:
determining the speaker's acoustic features and phoneme features from the speaker's voice data and the corresponding text data; and
training the GRU speech model from the speaker's acoustic features and phoneme features and the base speech model.
Optionally, determining the speaker's phoneme features comprises:
processing the speaker's voice data and the corresponding text data with a preset speech recognition model to obtain duration alignment information for the speaker's phonemes; and
determining the speaker's phoneme features from the phoneme duration alignment information and the corresponding text data.
Optionally, before determining the speaker's acoustic features and phoneme features, the method further comprises:
preprocessing the speaker's voice data.
Optionally, the preprocessing comprises:
noise reduction and/or dereverberation.
Optionally, the method further comprises:
verifying that the speaker's voice data and the corresponding text data are in one-to-one correspondence.
Optionally, training the GRU speech model from the speaker's acoustic features and phoneme features and the base speech model comprises:
initializing the GRU speech model from the base speech model; and
processing the phoneme features and acoustic features into the format required by the GRU speech model and feeding them to the GRU speech model to complete its training.
According to another aspect of the embodiments of the present disclosure, an adaptive speech synthesis apparatus is provided, comprising:
a base speech data acquisition unit, configured to obtain base speech data and text data corresponding to the base speech data;
a base speech model training unit, configured to train a base speech model from the base speech data and the corresponding text data;
a speaker voice data acquisition unit, configured to obtain a speaker's voice data and text data corresponding to the speaker's voice data;
a GRU model training unit, configured to train a GRU speech model from the speaker's voice data, the corresponding text data, and the base speech model; and
an adaptive speech synthesis unit, configured to synthesize the speaker's voice, when a speech synthesis instruction is received, from the GRU speech model and the text information contained in the instruction.
Optionally, the GRU model training unit is specifically configured to determine the speaker's acoustic features and phoneme features from the speaker's voice data and the corresponding text data, and to train the GRU speech model from those features and the base speech model.
Optionally, when determining the speaker's phoneme features, the GRU model training unit is specifically configured to:
process the speaker's voice data and the corresponding text data with a preset speech recognition model to obtain duration alignment information for the speaker's phonemes; and
determine the speaker's phoneme features from the phoneme duration alignment information and the corresponding text data.
Optionally, the GRU model training unit is further configured to:
preprocess the speaker's voice data before determining the acoustic features and phoneme features.
Optionally, the preprocessing comprises:
noise reduction and/or dereverberation.
Optionally, the GRU model training unit is further configured to:
verify that the speaker's voice data and the corresponding text data are in one-to-one correspondence.
Optionally, when training the GRU speech model from the speaker's acoustic features and phoneme features and the base speech model, the GRU model training unit is specifically configured to:
initialize the GRU speech model from the base speech model; and
process the phoneme features and acoustic features into the format required by the GRU speech model and feed them to the GRU speech model to complete its training.
According to another aspect of the embodiments of the present disclosure, a readable storage medium is provided, having executable instructions thereon that, when executed, cause a computer to perform the operations of the above method.
According to yet another aspect of the embodiments of the present disclosure, a computing device is provided, comprising a processor and a memory storing executable instructions that, when executed, cause the processor to perform the operations of the above method.
In the technical solution provided by the embodiments of the present disclosure, a base speech model is first trained on large-scale base speech data, and a GRU speech model is then trained on a small sample of the speaker's voice data. The GRU model remembers the contextual information of text well; in addition, its structure is simple and its decoding is fast, so it meets the real-time requirements of online speech synthesis. The solution therefore needs only a small amount of low-quality recorded voice data from a speaker to synthesize that speaker's voice with good quality.
Brief Description of the Drawings
The accompanying drawings illustrate exemplary embodiments of the disclosure and, together with the description, serve to explain its principles. They are included to provide a further understanding of the disclosure and are incorporated into and constitute a part of this specification.
Fig. 1 is a structural block diagram of an exemplary computing device;
Fig. 2 is a flowchart of an adaptive speech synthesis method according to an embodiment of the present disclosure;
Fig. 3 is a structural block diagram of a small-sample adaptive speech synthesis system according to an embodiment of the present disclosure;
Fig. 4 is a structural block diagram of an adaptive speech synthesis apparatus according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be realized in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope fully conveyed to those skilled in the art.
Fig. 1 is a block diagram of an example computing device 100 arranged to implement an adaptive speech synthesis method according to the present disclosure. In a basic configuration 102, the computing device 100 typically comprises a system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processors 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be of any type, including but not limited to a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level-one cache 110 and a level-two cache 112, a processor core 114, and registers 116. An example processor core 114 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 118 may be used with the processor 104, or in some implementations the memory controller 118 may be an internal part of the processor 104.
Depending on the desired configuration, the system memory 106 may be of any type, including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM or flash memory), or any combination thereof. The system memory 106 may include an operating system 120, one or more programs 122, and program data 124. In some embodiments, the programs 122 may be configured to be executed by the one or more processors 104 on the operating system using the program data 124.
The computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (for example, output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via a bus/interface controller 130. Example output devices 142 include a graphics processing unit 148 and an audio processing unit 150, which may be configured to communicate with various external devices such as a display terminal or speakers via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to communicate via one or more I/O ports 158 with external devices such as input devices (for example, a keyboard, mouse, pen, voice input device, or touch input device) or other peripherals (for example, a printer or scanner). An example communication device 146 may include a network controller 160, which may be arranged to communicate with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
The network communication link may be one example of a communication medium. A communication medium may typically be embodied as computer-readable instructions, data structures, or program modules in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery medium. A "modulated data signal" is a signal in which one or more of its characteristics is set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or a dedicated-line network, and wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer-readable medium as used herein may include both storage media and communication media.
The computing device 100 may be implemented as part of a small-form-factor portable (or mobile) electronic device, such as a cellular phone, a personal digital assistant (PDA), a personal media player, a wireless web-browsing device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. The computing device 100 may also be implemented as a personal computer, including both desktop and notebook configurations.
One or more programs 122 of the computing device 100 include instructions for executing an adaptive speech synthesis method according to the present disclosure.
Fig. 2 illustrates a flowchart of an adaptive speech synthesis method 200 according to an embodiment of the present disclosure. The method 200 starts at step S210.
Step S210: obtain base speech data and text data corresponding to the base speech data.
Step S220: train a base speech model from the base speech data and the corresponding text data.
Step S230: obtain a speaker's voice data and text data corresponding to the speaker's voice data.
Step S240: train a GRU speech model from the speaker's voice data, the corresponding text data, and the base speech model.
Step S250: when a speech synthesis instruction is received, synthesize the speaker's voice from the GRU speech model and the text information contained in the instruction.
Base speech data refers to existing large-scale voice data. A base speech model is first trained on the large-scale base speech data and its corresponding text data; adaptive training is then performed on a small amount of the speaker's voice data and its corresponding text data. Compared with training a model directly on the speaker's voice data and text data alone, this improves the accuracy of the model.
In step S240, a gated recurrent unit (GRU) model is used. The GRU model is a variant of the recurrent neural network that can learn long-term dependencies in time series. For speech synthesis, this model can remember the contextual information of text well; in addition, the GRU's structure is simple and its decoding is fast, so it meets the real-time requirements of online speech synthesis.
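The gating that lets a GRU carry context across a sequence can be summarized by its update and reset equations. The following minimal sketch uses scalar states and hand-picked weights purely for readability; it is an illustrative assumption, not the patent's implementation, which would use vectorized tensors:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, p):
    """One GRU step for a scalar input and state. p holds the weights
    and biases. Returns the new hidden state."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])        # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])        # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])
    return (1.0 - z) * h + z * h_cand                       # blend old and new

# Run a toy sequence through the cell; the weights are arbitrary.
params = {"wz": 0.5, "uz": 0.5, "bz": 0.0,
          "wr": 0.5, "ur": 0.5, "br": 0.0,
          "wh": 1.0, "uh": 1.0, "bh": 0.0}
h = 0.0
for x in [1.0, -0.5, 0.25]:
    h = gru_step(x, h, params)
print(round(h, 4))
```

Because the new state is a convex combination of the old state and a tanh candidate, the hidden state stays bounded while the update gate decides how much context to keep — the property the text credits for remembering textual context.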
Optionally, step S240 includes:
Step S241: determine the speaker's acoustic features and phoneme features from the speaker's voice data and the corresponding text data.
Step S242: train the gated recurrent unit (GRU) speech model from the speaker's acoustic features and phoneme features and the base speech model.
The acoustic features can be extracted directly from the voice data and may include spectral features, fundamental frequency (F0) features, and interpolation features. The interpolation feature refers to replacing each point where the fundamental frequency is 0 with the mean of the sampling points before and after it, which smooths out discontinuities in the F0 contour.
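The F0 interpolation described above can be sketched as follows. The function name and the plain-list representation of the contour are illustrative assumptions:

```python
def interpolate_f0(f0):
    """Replace zero (unvoiced) F0 points with the mean of the nearest
    non-zero neighbours before and after, smoothing the contour."""
    out = list(f0)
    for i, v in enumerate(f0):
        if v != 0.0:
            continue
        prev = next((f0[j] for j in range(i - 1, -1, -1) if f0[j] != 0.0), None)
        nxt = next((f0[j] for j in range(i + 1, len(f0)) if f0[j] != 0.0), None)
        neighbours = [p for p in (prev, nxt) if p is not None]
        if neighbours:  # leave the point at 0 if the whole contour is unvoiced
            out[i] = sum(neighbours) / len(neighbours)
    return out

print(interpolate_f0([120.0, 0.0, 130.0]))  # → [120.0, 125.0, 130.0]
```

At the edges of the contour only one neighbour exists, so the point simply takes that neighbour's value.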
Determining the speaker's phoneme features comprises:
processing the speaker's voice data and the corresponding text data with a preset speech recognition model to obtain duration alignment information for the speaker's phonemes; and
determining the speaker's phoneme features from the phoneme duration alignment information and the corresponding text data.
Here, phoneme features refer to part-of-speech features, contextual information, and tone features expressed according to the phoneme duration information. Part-of-speech features include verbs, nouns, and emotion words; contextual information refers to combining a word with the words before and after it in various ways, generating a corresponding dictionary that can be used for training.
The preset speech recognition model may be a hidden Markov model (HMM), trained with segmented text as input and features such as the pronunciation spectrum and tone duration of each word as output.
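The duration alignment information produced by such a model can be reduced to per-phoneme durations. A minimal sketch, under the assumption that the aligner emits one phoneme label per fixed-length frame (real alignments usually come from a forced aligner):

```python
from itertools import groupby

def phoneme_durations(frame_labels, frame_shift_ms=10):
    """Collapse a frame-level phoneme alignment into (phoneme, duration_ms)
    pairs, one per contiguous run of identical labels."""
    return [(ph, len(list(run)) * frame_shift_ms)
            for ph, run in groupby(frame_labels)]

print(phoneme_durations(["sil", "sil", "b", "b", "b", "a", "a"]))
# → [('sil', 20), ('b', 30), ('a', 20)]
```

These per-phoneme durations are exactly the "duration of each phoneme's pronunciation" that the phoneme-feature extraction step consumes.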
Optionally, before determining the speaker's acoustic features and phoneme features, the method further comprises:
preprocessing the speaker's voice data.
The preprocessing includes, but is not limited to, noise reduction and dereverberation, so that the low-quality voice data input by the user is processed into clearer voice data.
Optionally, before the speaker's voice data and the corresponding text data are processed by the preset speech recognition model, the speaker's voice data and text data are verified to be in one-to-one correspondence, to avoid mismatched input data causing training to fail.
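This one-to-one check can be as simple as comparing identifiers on both sides. The file-stem pairing scheme below is an illustrative assumption, not the patent's specified format:

```python
def verify_pairing(audio_files, text_files):
    """Check that every audio clip has exactly one transcript and vice
    versa, matching on the file stem (name without extension)."""
    audio_ids = {name.rsplit(".", 1)[0] for name in audio_files}
    text_ids = {name.rsplit(".", 1)[0] for name in text_files}
    missing_text = sorted(audio_ids - text_ids)
    missing_audio = sorted(text_ids - audio_ids)
    return missing_text, missing_audio  # both empty → data is consistent

print(verify_pairing(["u1.wav", "u2.wav"], ["u1.txt"]))  # → (['u2'], [])
```

Returning the mismatched identifiers, rather than a bare boolean, lets the system tell the user exactly which recordings to re-enter.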
Optionally, in step S242, training the GRU speech model from the speaker's acoustic features and phoneme features and the base speech model comprises:
initializing the GRU speech model from the base speech model; and
processing the phoneme features and acoustic features into the format required by the GRU speech model and feeding them to the GRU speech model to complete its training.
Specifically, the GRU speech model is trained with the phoneme features as input and the acoustic features as output, so that the voice synthesized by the trained model matches the speaker's voice characteristics.
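The initialization step can be sketched as copying the base model's weights into the adaptation model before fine-tuning on the speaker's small sample. Representing a model as a name-to-weights dict is an illustrative assumption:

```python
def init_from_base(base_weights, adapt_weights):
    """Copy every base-model parameter whose name and shape match into
    the adaptation model (a warm start); parameters unique to the new
    model keep their fresh initialization. Returns the copied names."""
    copied = []
    for name, w in base_weights.items():
        if name in adapt_weights and len(adapt_weights[name]) == len(w):
            adapt_weights[name] = list(w)
            copied.append(name)
    return copied

base = {"gru.wz": [0.5, 0.5], "out.w": [1.0]}
adapt = {"gru.wz": [0.0, 0.0], "out.w": [0.0], "speaker.embed": [0.0]}
print(init_from_base(base, adapt))  # → ['gru.wz', 'out.w']
```

After this warm start, iterative training on the speaker's small dataset only has to adjust the base model toward the new voice rather than learn speech synthesis from scratch, which is why far less data suffices.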
A specific embodiment of the disclosure is given below with reference to Fig. 3.
Fig. 3 is a structural block diagram of a small-sample adaptive speech synthesis system, which specifically includes:
1. A data checking module.
This module mainly detects whether the voice data entered by the user matches the text data; if not, the user is notified to re-enter it.
2. An automatic alignment module.
Using a built-in speech recognition model, the user's voice is recognized and matched with the corresponding text information to obtain the duration alignment information of each phoneme, i.e., the duration of each phoneme's pronunciation and the corresponding spectrum.
3. A preprocessing module.
Since the user's recording equipment is generally simple, the voice data is usually not of high quality. To obtain a better synthesis result, the system performs a series of preprocessing operations on the voice data in this module, including noise reduction and dereverberation, to obtain clearer voice data.
4. An acoustic feature extraction module.
This module extracts acoustic features from the preprocessed voice data.
5. A phoneme feature extraction module.
The inputs of this module are the text data corresponding to the voice and the phoneme duration alignment information output by the automatic alignment module; its output is the phoneme features, which include the phoneme duration information.
6. A training data generation module.
This module processes the acoustic features and phoneme features to generate the data format required for model training.
7. A base model module.
This provides the base model trained on existing large-scale data.
8. A model training module.
Using the data generated by the training data generation module, this module trains the speech synthesis model for the speaker. Specifically, the model weights are initialized from the model pretrained on big data, and the model is then iteratively trained on the new speaker's small amount of data. The training uses a GRU model, a variant of the recurrent neural network that can learn long-term dependencies in time series. For speech synthesis, this model remembers the contextual information of text well; in addition, its structure is simple and its decoding is fast, so it meets the real-time requirements of online speech synthesis.
9. An automatic speech synthesis module.
This is the personalized speech synthesis system obtained by adaptive training.
10. An input text module and an output audio module.
These are used, respectively, to input the text to be synthesized and to output the speaker's voice synthesized automatically by the system.
In this specific embodiment of the disclosure, training the speech synthesis system with only a small amount of the speaker's data greatly reduces the cost of training the model; and compared with a model trained only on the small amount of speaker voice data, the model trained as described here is significantly better in speech quality, intelligibility, and other respects.
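The ten modules above form a linear pipeline from raw recordings to a speaker-adapted model. The sketch below wires stub stages together to show only the data flow of Fig. 3; every function body here is a hypothetical placeholder, not the patent's implementation:

```python
def run_adaptation_pipeline(recordings, transcripts, base_model):
    """Data flow of Fig. 3: check → align → preprocess → extract
    features → build training data → warm-start and fine-tune."""
    stages = []
    assert len(recordings) == len(transcripts)          # 1. data checking
    stages.append("checked")
    alignments = [f"align({t})" for t in transcripts]   # 2. automatic alignment (stub)
    stages.append("aligned")
    clean = [f"denoise({r})" for r in recordings]       # 3. preprocessing (stub)
    stages.append("preprocessed")
    acoustic = [f"acoustic({c})" for c in clean]        # 4. acoustic features (stub)
    phoneme = [f"phoneme({a})" for a in alignments]     # 5. phoneme features (stub)
    stages.append("features")
    train_data = list(zip(phoneme, acoustic))           # 6. training data pairs
    stages.append("train-data")
    model = dict(base_model)                            # 7.+8. warm start, fine-tune
    model["adapted_on"] = len(train_data)
    stages.append("trained")
    return model, stages

model, stages = run_adaptation_pipeline(["u1.wav"], ["hello"], {"name": "base-gru"})
print(stages[-1])  # → trained
```

The point of the sketch is the ordering: alignment works on the transcripts while preprocessing works on the audio, and the two feature streams only meet when the training pairs are built in module 6.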
Referring to Fig. 4, an embodiment of the present disclosure provides an adaptive speech synthesis apparatus, comprising:
a base speech data acquisition unit 410, configured to obtain base speech data and text data corresponding to the base speech data;
a base speech model training unit 420, configured to train a base speech model from the base speech data and the corresponding text data;
a speaker voice data acquisition unit 430, configured to obtain a speaker's voice data and text data corresponding to the speaker's voice data;
a GRU model training unit 440, configured to train a GRU speech model from the speaker's voice data, the corresponding text data, and the base speech model; and
an adaptive speech synthesis unit 450, configured to synthesize the speaker's voice, when a speech synthesis instruction is received, from the GRU speech model and the text information contained in the instruction.
Optionally, the GRU model training unit 440 is specifically configured to determine the speaker's acoustic features and phoneme features from the speaker's voice data and the corresponding text data, and to train the GRU speech model from those features and the base speech model.
Optionally, when determining the speaker's phoneme features, the GRU model training unit 440 is specifically configured to:
process the speaker's voice data and the corresponding text data with a preset speech recognition model to obtain duration alignment information for the speaker's phonemes; and
determine the speaker's phoneme features from the phoneme duration alignment information and the corresponding text data.
Optionally, the GRU model training unit 440 is further configured to:
preprocess the speaker's voice data before determining the acoustic features and phoneme features.
Optionally, the preprocessing comprises:
noise reduction and/or dereverberation.
Optionally, the GRU model training unit 440 is further configured to:
verify that the speaker's voice data and the corresponding text data are in one-to-one correspondence.
Optionally, when training the GRU speech model from the speaker's acoustic features and phoneme features and the base speech model, the GRU model training unit 440 is specifically configured to:
initialize the GRU speech model from the base speech model; and
process the phoneme features and acoustic features into the format required by the GRU speech model and feed them to the GRU speech model to complete its training.
For specific limitations on the adaptive speech synthesis apparatus, refer to the limitations on the adaptive speech synthesis method above; details are not repeated here.
It should be appreciated that the various techniques described herein may be implemented in hardware, software, or a combination of the two. Thus, the disclosed methods and devices, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media such as floppy disks, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine such as a computer, the machine becomes a device for practicing the disclosure.
Where the program code is executed on programmable computers, the computing device generally comprises a processor, a processor-readable storage medium (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store the program code; the processor is configured to execute the various methods of the disclosure according to the instructions in the program code stored in the memory.
By way of example and not limitation, computer-readable media include computer storage media and communication media. Computer storage media store information such as computer-readable instructions, data structures, program modules, or other data. Communication media generally embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
It should be appreciated that in the foregoing description of exemplary embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various disclosed aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed disclosure requires more features than are expressly recited in each claim. Rather, as the following claims reflect, disclosed aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this disclosure.
Those skilled in the art should understand that the modules, units, or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiments, or may alternatively be located in one or more devices different from the devices in the examples. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components in an embodiment may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or apparatus so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by alternative features serving the same, equivalent, or similar purpose.
Furthermore, those of skill in the art will understand that although some embodiments described herein include certain features included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the disclosure and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the embodiments are described herein as methods, or combinations of method elements, that can be implemented by a processor of a computer system or by other means of carrying out the function. Thus, a processor with the necessary instructions for carrying out such a method or method element forms a means for carrying out the method or method element. Furthermore, an element of an apparatus embodiment described herein is an example of a means for carrying out the function performed by that element for the purpose of carrying out the disclosure.
As used herein, unless otherwise specified, the use of the ordinals "first", "second", "third", etc., to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.
While the disclosure has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be devised within the scope of the disclosure as described herein. It should also be noted that the language used in this specification has been principally selected for readability and instructional purposes, rather than to delineate or circumscribe the disclosed subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. As to the scope of the disclosure, the disclosure made herein is illustrative rather than restrictive, with the scope of the disclosure being defined by the appended claims.
Claims (10)
1. An adaptive speech synthesis method, comprising:
obtaining base speech data and text data corresponding to the base speech data;
training a base speech model according to the base speech data and the text data corresponding to the base speech data;
obtaining speech data of a speaker and text data corresponding to the speech data of the speaker;
training a gated recurrent unit (GRU) speech model according to the speech data of the speaker, the text data corresponding to the speech data of the speaker, and the base speech model; and
when a speech synthesis instruction is received, synthesizing the voice of the speaker according to the GRU speech model and text information included in the instruction.
2. The method according to claim 1, wherein training the gated recurrent unit (GRU) speech model according to the speech data of the speaker, the text data corresponding to the speech data of the speaker, and the base speech model comprises:
determining acoustic features and phoneme features of the speaker according to the speech data of the speaker and the text data corresponding to the speech data of the speaker; and
training the gated recurrent unit (GRU) speech model according to the acoustic features and phoneme features of the speaker and the base speech model.
3. The method according to claim 2, wherein determining the phoneme features of the speaker comprises:
processing the speech data of the speaker and the text data corresponding to the speech data of the speaker according to a preset speech recognition model to obtain duration alignment information of the phonemes of the speaker; and
determining the phoneme features of the speaker according to the duration alignment information of the phonemes of the speaker and the text data corresponding to the speech data of the speaker.
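The duration alignment step can be illustrated with a small sketch. The alignment tuples below stand in for the output of a forced aligner driven by the speech recognition model; the frame shift and the phoneme labels are illustrative assumptions.

```python
def phoneme_duration_features(alignment, frame_shift_ms=10):
    """Convert aligner output (phoneme, start_ms, end_ms) into duration features."""
    feats = []
    for ph, start_ms, end_ms in alignment:
        feats.append({
            "phoneme": ph,
            # Duration expressed in acoustic frames, one frame per frame_shift_ms.
            "duration_frames": round((end_ms - start_ms) / frame_shift_ms),
        })
    return feats

# Hypothetical alignment for a short utterance.
align = [("n", 0, 80), ("i", 80, 230), ("h", 230, 300), ("ao", 300, 520)]
feats = phoneme_duration_features(align)
```

Each phoneme's duration in frames, combined with the text, would then form the phoneme features fed to model training.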
4. The method according to claim 3, wherein before processing the speech data of the speaker and the text data corresponding to the speech data of the speaker according to the preset speech recognition model, the method further comprises:
verifying that the speech data of the speaker and the text data correspond one-to-one.
5. The method according to claim 2, wherein before determining the acoustic features and phoneme features of the speaker, the method further comprises:
preprocessing the speech data of the speaker.
6. The method according to claim 5, wherein the preprocessing comprises:
noise reduction and/or dereverberation.
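One common noise-reduction technique that fits this preprocessing step is spectral subtraction. The minimal sketch below is not the patent's method, just an illustration: the frame size and the use of a separate noise-only sample for the noise estimate are assumptions.

```python
import numpy as np

def spectral_subtraction(noisy, noise_sample, frame=256):
    """Subtract an estimated noise magnitude spectrum from each frame,
    then resynthesize using the noisy signal's phase."""
    noise_mag = np.abs(np.fft.rfft(noise_sample[:frame]))
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame + 1, frame):
        spec = np.fft.rfft(noisy[start:start + frame])
        mag = np.clip(np.abs(spec) - noise_mag, 0.0, None)  # floor at zero
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)))
    return out

rng = np.random.default_rng(0)
noise = 0.1 * rng.standard_normal(256)
clean = np.sin(2 * np.pi * 8 * np.arange(256) / 256)
denoised = spectral_subtraction(clean + noise, noise)
```

A practical system would use overlapping windowed frames and a running noise estimate; this version keeps only the core subtract-and-floor idea.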
7. The method according to claim 2, wherein training the gated recurrent unit (GRU) speech model according to the acoustic features and phoneme features of the speaker and the base speech model comprises:
initializing the GRU speech model according to the base speech model; and
processing the phoneme features and the acoustic features into the format required by the GRU speech model and inputting them into the GRU speech model, completing the training of the GRU speech model.
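The initialization step amounts to warm-starting the speaker model from a copy of the base model's parameters rather than from random values. The sketch below assumes the three standard GRU gates (update, reset, candidate); the parameter names and matrix shapes are illustrative, not drawn from the patent.

```python
import numpy as np

def init_gru_params(input_dim, hidden_dim, base=None, seed=0):
    """Return GRU gate weights; warm-start from a copy of a base model if given."""
    if base is not None:
        # Initialization "according to the base speech model": copy its weights
        # so adaptation starts from the pretrained parameters.
        return {name: w.copy() for name, w in base.items()}
    rng = np.random.default_rng(seed)
    shape = (hidden_dim, input_dim + hidden_dim)  # acts on [x_t; h_{t-1}]
    return {"W_update": rng.standard_normal(shape),
            "W_reset": rng.standard_normal(shape),
            "W_candidate": rng.standard_normal(shape)}

base_params = init_gru_params(4, 8)                        # pretrained stand-in
speaker_params = init_gru_params(4, 8, base=base_params)   # warm start
```

Copying (rather than aliasing) the weights lets the speaker model diverge from the base model during fine-tuning without modifying it.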
8. An adaptive speech synthesis apparatus, comprising:
a base speech data obtaining unit, configured to obtain base speech data and text data corresponding to the base speech data;
a base speech model training unit, configured to train a base speech model according to the base speech data and the text data corresponding to the base speech data;
a speaker speech data obtaining unit, configured to obtain speech data of a speaker and text data corresponding to the speech data of the speaker;
a GRU model training unit, configured to train a gated recurrent unit (GRU) speech model according to the speech data of the speaker, the text data corresponding to the speech data of the speaker, and the base speech model; and
an adaptive speech synthesis unit, configured to, when a speech synthesis instruction is received, synthesize the voice of the speaker according to the GRU speech model and text information included in the instruction.
9. A readable storage medium having executable instructions stored thereon which, when executed, cause a computer to perform the operations included in any one of claims 1-7.
10. A computing device, comprising:
a processor; and
a memory storing executable instructions which, when executed, cause the processor to perform the operations included in any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910661648.6A CN110379407B (en) | 2019-07-22 | 2019-07-22 | Adaptive speech synthesis method, device, readable storage medium and computing equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110379407A true CN110379407A (en) | 2019-10-25 |
CN110379407B CN110379407B (en) | 2021-10-19 |
Family
ID=68254684
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910661648.6A Active CN110379407B (en) | 2019-07-22 | 2019-07-22 | Adaptive speech synthesis method, device, readable storage medium and computing equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110379407B (en) |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20000042363A (en) * | 1998-12-24 | 2000-07-15 | 이계철 | Method for initializing program and data in voice recognition system |
CN105185372A (en) * | 2015-10-20 | 2015-12-23 | 百度在线网络技术(北京)有限公司 | Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device |
US20170076715A1 (en) * | 2015-09-16 | 2017-03-16 | Kabushiki Kaisha Toshiba | Training apparatus for speech synthesis, speech synthesis apparatus and training method for training apparatus |
US9922641B1 (en) * | 2012-10-01 | 2018-03-20 | Google Llc | Cross-lingual speaker adaptation for multi-lingual speech synthesis |
CN107871497A (en) * | 2016-09-23 | 2018-04-03 | 北京眼神科技有限公司 | Audio recognition method and device |
JP2018097115A (en) * | 2016-12-12 | 2018-06-21 | 日本電信電話株式会社 | Fundamental frequency model parameter estimation device, method, and program |
CN108550363A (en) * | 2018-06-04 | 2018-09-18 | 百度在线网络技术(北京)有限公司 | Phoneme synthesizing method and device, computer equipment and readable medium |
CN108665898A (en) * | 2018-03-30 | 2018-10-16 | 西北师范大学 | A method of gesture is converted into the Chinese and hides double-language voice |
CN108831435A (en) * | 2018-06-06 | 2018-11-16 | 安徽继远软件有限公司 | A kind of emotional speech synthesizing method based on susceptible sense speaker adaptation |
US20190066656A1 (en) * | 2017-08-29 | 2019-02-28 | Kabushiki Kaisha Toshiba | Speech synthesis dictionary delivery device, speech synthesis system, and program storage medium |
CN109523993A (en) * | 2018-11-02 | 2019-03-26 | 成都三零凯天通信实业有限公司 | A kind of voice languages classification method merging deep neural network with GRU based on CNN |
CN109599092A (en) * | 2018-12-21 | 2019-04-09 | 秒针信息技术有限公司 | A kind of audio synthetic method and device |
CN109635462A (en) * | 2018-12-17 | 2019-04-16 | 深圳前海微众银行股份有限公司 | Model parameter training method, device, equipment and medium based on federation's study |
CN109885832A (en) * | 2019-02-14 | 2019-06-14 | 平安科技(深圳)有限公司 | Model training, sentence processing method, device, computer equipment and storage medium |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111276120A (en) * | 2020-01-21 | 2020-06-12 | 华为技术有限公司 | Speech synthesis method, apparatus and computer-readable storage medium |
CN111276120B (en) * | 2020-01-21 | 2022-08-19 | 华为技术有限公司 | Speech synthesis method, apparatus and computer-readable storage medium |
CN112133282A (en) * | 2020-10-26 | 2020-12-25 | 厦门大学 | Lightweight multi-speaker speech synthesis system and electronic equipment |
CN112133282B (en) * | 2020-10-26 | 2022-07-08 | 厦门大学 | Lightweight multi-speaker speech synthesis system and electronic equipment |
CN112185340A (en) * | 2020-10-30 | 2021-01-05 | 网易(杭州)网络有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic apparatus |
CN112185340B (en) * | 2020-10-30 | 2024-03-15 | 网易(杭州)网络有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
WO2022094740A1 (en) * | 2020-11-03 | 2022-05-12 | Microsoft Technology Licensing, Llc | Controlled training and use of text-to-speech models and personalized model generated voices |
CN112365876A (en) * | 2020-11-27 | 2021-02-12 | 北京百度网讯科技有限公司 | Method, device and equipment for training speech synthesis model and storage medium |
CN112365876B (en) * | 2020-11-27 | 2022-04-12 | 北京百度网讯科技有限公司 | Method, device and equipment for training speech synthesis model and storage medium |
CN112634856A (en) * | 2020-12-10 | 2021-04-09 | 苏州思必驰信息科技有限公司 | Speech synthesis model training method and speech synthesis method |
CN113314092A (en) * | 2021-05-11 | 2021-08-27 | 北京三快在线科技有限公司 | Method and device for model training and voice interaction |
CN113299268A (en) * | 2021-07-28 | 2021-08-24 | 成都启英泰伦科技有限公司 | Speech synthesis method based on stream generation model |
CN113707122A (en) * | 2021-08-11 | 2021-11-26 | 北京搜狗科技发展有限公司 | Method and device for constructing voice synthesis model |
CN113707122B (en) * | 2021-08-11 | 2024-04-05 | 北京搜狗科技发展有限公司 | Method and device for constructing voice synthesis model |
Also Published As
Publication number | Publication date |
---|---|
CN110379407B (en) | 2021-10-19 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |