CN110379407A - Adaptive speech synthesis method and apparatus, readable storage medium, and computing device - Google Patents

Adaptive speech synthesis method and apparatus, readable storage medium, and computing device

Info

Publication number
CN110379407A
CN110379407A (application CN201910661648A)
Authority
CN
China
Prior art keywords
speaker
data
voice
speech
voice data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910661648.6A
Other languages
Chinese (zh)
Other versions
CN110379407B (en)
Inventor
殷昊 (Yin Hao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Go Out And Ask (suzhou) Information Technology Co Ltd
Original Assignee
Go Out And Ask (suzhou) Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Go Out And Ask (suzhou) Information Technology Co Ltd filed Critical Go Out And Ask (suzhou) Information Technology Co Ltd
Priority to CN201910661648.6A priority Critical patent/CN110379407B/en
Publication of CN110379407A publication Critical patent/CN110379407A/en
Application granted granted Critical
Publication of CN110379407B publication Critical patent/CN110379407B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Embodiments of the present disclosure provide an adaptive speech synthesis method and apparatus, a readable storage medium, and a computing device, capable of synthesizing good-quality speaker speech even when only a small amount of low-quality speech data is available. The method includes: obtaining base speech data and text data corresponding to the base speech data; training a base speech model from the base speech data and its corresponding text data; obtaining speech data of a speaker and text data corresponding to the speaker's speech data; training a GRU speech model from the speaker's speech data, its corresponding text data, and the base speech model; and, when a speech synthesis instruction is received, synthesizing the speaker's voice from the GRU speech model and the text contained in the instruction.

Description

Adaptive speech synthesis method and apparatus, readable storage medium, and computing device
Technical field
The present disclosure relates to the field of speech processing, and in particular to an adaptive speech synthesis method and apparatus, a readable storage medium, and a computing device.
Background technique
Speech synthesis is the technology by which a computer automatically generates speech corresponding to a given text. Current speech synthesis systems require large amounts of high-quality data (recorded with professional equipment), and collecting such data consumes considerable manpower and money. Moreover, every time a new speaker is added, a new batch of data must be recorded in a studio.
How to synthesize good-quality speaker speech when only a small amount of low-quality speech data is available is therefore an urgent technical problem.
Summary of the invention
To this end, the present disclosure provides an adaptive speech synthesis method and apparatus, a readable storage medium, and a computing device, in an effort to solve, or at least alleviate, at least one of the problems above.
According to one aspect of the embodiments of the present disclosure, an adaptive speech synthesis method is provided, comprising:
obtaining base speech data and text data corresponding to the base speech data;
training a base speech model from the base speech data and its corresponding text data;
obtaining speech data of a speaker and text data corresponding to the speaker's speech data;
training a GRU speech model from the speaker's speech data, its corresponding text data, and the base speech model;
when a speech synthesis instruction is received, synthesizing the speaker's voice from the GRU speech model and the text contained in the instruction.
Optionally, training the GRU speech model from the speaker's speech data, its corresponding text data, and the base speech model comprises:
determining the speaker's acoustic features and phoneme features from the speaker's speech data and its corresponding text data;
training the GRU speech model from the speaker's acoustic features and phoneme features and the base speech model.
Optionally, determining the speaker's phoneme features comprises:
processing the speaker's speech data and its corresponding text data with a preset speech recognition model to obtain duration alignment information for the speaker's phonemes;
determining the speaker's phoneme features from the phoneme duration alignment information and the text data corresponding to the speaker's speech data.
Optionally, before determining the speaker's acoustic features and phoneme features, the method further comprises:
preprocessing the speaker's speech data.
Optionally, the preprocessing comprises:
noise reduction and/or dereverberation.
Optionally, the method further comprises:
verifying that the speaker's speech data and its corresponding text data are in one-to-one correspondence.
Optionally, training the GRU speech model from the speaker's acoustic features and phoneme features and the base speech model comprises:
initializing the GRU speech model from the base speech model;
processing the phoneme features and acoustic features into the format required by the GRU speech model, and feeding them to the GRU speech model to complete its training.
According to another aspect of the embodiments of the present disclosure, an adaptive speech synthesis apparatus is provided, comprising:
a base speech data acquisition unit for obtaining base speech data and text data corresponding to the base speech data;
a base speech model training unit for training a base speech model from the base speech data and its corresponding text data;
a speaker speech data acquisition unit for obtaining speech data of a speaker and text data corresponding to the speaker's speech data;
a GRU model training unit for training a GRU speech model from the speaker's speech data, its corresponding text data, and the base speech model;
an adaptive speech synthesis unit for synthesizing, when a speech synthesis instruction is received, the speaker's voice from the GRU speech model and the text contained in the instruction.
Optionally, the GRU model training unit is specifically configured to determine the speaker's acoustic features and phoneme features from the speaker's speech data and its corresponding text data, and
to train the GRU speech model from the speaker's acoustic features and phoneme features and the base speech model.
Optionally, when determining the speaker's phoneme features, the GRU model training unit is specifically configured to:
process the speaker's speech data and its corresponding text data with a preset speech recognition model to obtain duration alignment information for the speaker's phonemes;
determine the speaker's phoneme features from the phoneme duration alignment information and the text data corresponding to the speaker's speech data.
Optionally, the GRU model training unit is further configured to:
preprocess the speaker's speech data before determining the speaker's acoustic features and phoneme features.
Optionally, the preprocessing comprises:
noise reduction and/or dereverberation.
Optionally, the GRU model training unit is further configured to:
verify that the speaker's speech data and its corresponding text data are in one-to-one correspondence.
Optionally, when training the GRU speech model from the speaker's acoustic features and phoneme features and the base speech model, the GRU model training unit is specifically configured to:
initialize the GRU speech model from the base speech model;
process the phoneme features and acoustic features into the format required by the GRU speech model, and feed them to the GRU speech model to complete its training.
According to yet another aspect of the embodiments of the present disclosure, a readable storage medium carrying executable instructions is provided; when the executable instructions are executed, they cause a computer to perform the operations of the method above.
According to yet another aspect of the embodiments of the present disclosure, a computing device is provided, comprising: a processor; and a memory storing executable instructions which, when executed, cause the processor to perform the operations of the method above.
In the technical solution provided by the embodiments of the present disclosure, a base speech model is first trained on large-scale base speech data, and a GRU speech model is then trained on a small sample of speaker speech data. The GRU model memorizes the contextual information of the text well; in addition, its structure is simple and its decoding is fast, which satisfies the real-time requirements of online speech synthesis. The solution therefore needs only a small amount of low-quality recorded speaker data to synthesize good-quality speaker speech.
Brief description of the drawings
The accompanying drawings illustrate exemplary embodiments of the disclosure and, together with the description, serve to explain its principles. They are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification.
Fig. 1 is a structural block diagram of an exemplary computing device;
Fig. 2 is a flowchart of an adaptive speech synthesis method according to an embodiment of the present disclosure;
Fig. 3 is a structural block diagram of a small-sample adaptive speech synthesis system according to an embodiment of the present disclosure;
Fig. 4 is a structural block diagram of an adaptive speech synthesis apparatus according to an embodiment of the present disclosure.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope conveyed fully to those skilled in the art.
Fig. 1 is a block diagram of an example computing device 100 arranged to implement an adaptive speech synthesis method according to the present disclosure. In a basic configuration 102, the computing device 100 typically includes a system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processors 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be of any type, including but not limited to a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level-1 cache 110 and a level-2 cache 112, a processor core 114, and registers 116. The example processor core 114 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 118 may be used with the processor 104, or, in some implementations, the memory controller 118 may be an internal part of the processor 104.
Depending on the desired configuration, the system memory 106 may be of any type, including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM or flash memory), or any combination thereof. The system memory 106 may include an operating system 120, one or more programs 122, and program data 124. In some implementations, the programs 122 may be configured to be executed with the program data 124 on the operating system by the one or more processors 104.
The computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via a bus/interface controller 130. The example output devices 142 include a graphics processing unit 148 and an audio processing unit 150, which may be configured to communicate with various external devices such as a display terminal or speakers via one or more A/V ports 152. The example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to communicate via one or more I/O ports 158 with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner). The example communication device 146 may include a network controller 160, which may be arranged to facilitate communication with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
The network communication link may be an example of a communication medium. A communication medium may typically be embodied by computer-readable instructions, data structures, or program modules in a modulated data signal such as a carrier wave or other transmission mechanism, and may include any information delivery medium. A "modulated data signal" is a signal in which one or more of its characteristics are set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or a dedicated-line network, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer-readable medium as used here may include both storage media and communication media.
The computing device 100 may be implemented as part of a small portable (or mobile) electronic device, such as a cellular phone, a personal digital assistant (PDA), a personal media player, a wireless web-browsing device, a personal headset, an application-specific device, or a hybrid device including any of the above functions. The computing device 100 may also be implemented as a personal computer in a desktop or notebook configuration.
The one or more programs 122 of the computing device 100 include instructions for executing an adaptive speech synthesis method according to the present disclosure.
Fig. 2 shows a flowchart of an adaptive speech synthesis method 200 according to an embodiment of the present disclosure. The adaptive speech synthesis method 200 starts at step S210.
Step S210: obtain base speech data and text data corresponding to the base speech data.
Step S220: train a base speech model from the base speech data and its corresponding text data.
Step S230: obtain speech data of a speaker and text data corresponding to the speaker's speech data.
Step S240: train a GRU speech model from the speaker's speech data, its corresponding text data, and the base speech model.
Step S250: when a speech synthesis instruction is received, synthesize the speaker's voice from the GRU speech model and the text contained in the instruction.
Base speech data refers to existing large-scale speech data. A base speech model is first trained from the large-scale base speech data and its corresponding text data, and adaptive training is then carried out on a small amount of speaker speech data and its corresponding text data. Compared with training a model directly on only the speaker's speech data and its corresponding text data, this improves the model's accuracy.
Step S240 uses a Gated Recurrent Unit (GRU) model. The GRU model is a variant of the recurrent neural network that can learn long-term dependencies in time series. For speech synthesis, the model memorizes the contextual information of the text well; in addition, its structure is simple and its decoding is fast, which satisfies the real-time requirements of online speech synthesis.
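The patent names the GRU only at a high level. As background, one recurrence step of a standard GRU can be sketched as follows (these are the textbook GRU equations, not code from the patent; parameter names are illustrative):

```python
import numpy as np

def gru_step(x, h_prev, W, U, b):
    """One GRU step: the update gate z and reset gate r control how much of
    the previous hidden state h_prev is kept versus overwritten by the
    candidate state. W, U, b each hold three stacked parameter sets
    (for z, r, and the candidate)."""
    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    Wz, Wr, Wh = W
    Uz, Ur, Uh = U
    bz, br, bh = b
    z = sigmoid(x @ Wz + h_prev @ Uz + bz)             # update gate
    r = sigmoid(x @ Wr + h_prev @ Ur + br)             # reset gate
    h_cand = np.tanh(x @ Wh + (r * h_prev) @ Uh + bh)  # candidate state
    return (1.0 - z) * h_prev + z * h_cand             # blend old and new state

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
W = [rng.standard_normal((d_in, d_h)) for _ in range(3)]
U = [rng.standard_normal((d_h, d_h)) for _ in range(3)]
b = [np.zeros(d_h) for _ in range(3)]
h = gru_step(rng.standard_normal(d_in), np.zeros(d_h), W, U, b)
```

Because the state update is a gated interpolation rather than a full rewrite, gradients flow across many time steps, which is what lets the model retain textual context over long sequences.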
Optionally, step S240 includes:
Step S241: determine the speaker's acoustic features and phoneme features from the speaker's speech data and its corresponding text data;
Step S242: train the gated recurrent unit (GRU) speech model from the speaker's acoustic features and phoneme features and the base speech model.
The acoustic features can be extracted directly from the speech data, and may be spectral features, fundamental frequency (F0) features, and interpolation features. The interpolation feature specifically refers to replacing each point where F0 is zero with the mean of the surrounding sampled points, which smooths the otherwise discontinuous F0 contour.
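As an illustration of the interpolation feature described above, the zero-valued (unvoiced) F0 frames can be filled in from neighbouring voiced frames. This is a minimal sketch that assumes a simple linear fill between the nearest voiced values (equivalent to the front/back mean for a single-frame gap); the patent does not specify the exact procedure:

```python
import numpy as np

def interpolate_f0(f0):
    """Replace zero (unvoiced) F0 values by linear interpolation between the
    nearest voiced frames, so the contour has no abrupt drops to zero."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        return f0.copy()
    idx = np.arange(len(f0))
    # np.interp fills each unvoiced index from the surrounding voiced values
    # (endpoints are clamped to the nearest voiced value)
    return np.interp(idx, idx[voiced], f0[voiced])
```

For example, `interpolate_f0([0.0, 200.0, 0.0, 220.0])` fills the gap at index 2 with 210.0, the mean of its voiced neighbours.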
Determining the speaker's phoneme features comprises:
processing the speaker's speech data and its corresponding text data with a preset speech recognition model to obtain duration alignment information for the speaker's phonemes;
determining the speaker's phoneme features from the phoneme duration alignment information and the text data corresponding to the speaker's speech data.
Here, phoneme features refer to part-of-speech features, contextual information, and tone features expressed together with the phoneme duration information. Part-of-speech features cover, for example, verbs, nouns, and emotion words; the contextual information combines each word with its preceding and following words in various ways, generating a corresponding dictionary that can be used for training.
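The paragraph above only names the context combinations. As a hypothetical sketch of one such scheme (the patent does not fix the combination rules), a context dictionary can be built by collecting left/right bigrams and a trigram for each word, with `<pad>` boundary markers as an assumption:

```python
def build_context_dict(words):
    """Collect each word together with its left and right neighbours,
    producing bigram and trigram context entries keyed by the word.
    Sentence boundaries are padded with a '<pad>' marker (an assumption)."""
    padded = ["<pad>"] + list(words) + ["<pad>"]
    contexts = {}
    for i in range(1, len(padded) - 1):
        prev_w, w, next_w = padded[i - 1], padded[i], padded[i + 1]
        entry = contexts.setdefault(w, set())
        entry.add((prev_w, w))           # left bigram
        entry.add((w, next_w))           # right bigram
        entry.add((prev_w, w, next_w))   # trigram
    return contexts
```

Each entry can then be mapped to an index and concatenated with the duration and tone features to form the phoneme feature vector.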
The preset speech recognition model may be a hidden Markov model (HMM), trained with segmented text as input and with features such as each word's pronunciation spectrum and tone duration as output.
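Given the alignment such a recognizer produces, the duration alignment information reduces to per-phoneme segments. A minimal sketch (the data shapes are assumptions, not the patent's format) that collapses a frame-level alignment into phoneme durations:

```python
def phoneme_durations(frame_labels, frame_shift_ms=10):
    """Collapse a frame-level phoneme alignment (one label per frame, as a
    forced aligner would emit) into (phoneme, duration_ms) segments.
    Assumes a fixed frame shift, here 10 ms."""
    segments = []
    for label in frame_labels:
        if segments and segments[-1][0] == label:
            segments[-1][1] += frame_shift_ms      # extend the current segment
        else:
            segments.append([label, frame_shift_ms])  # start a new segment
    return [(p, d) for p, d in segments]
```

The resulting durations are exactly the "duration of each phoneme's pronunciation" that the phoneme feature extraction consumes.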
Optionally, before the speaker's acoustic features and phoneme features are determined, the method further comprises:
preprocessing the speaker's speech data.
The preprocessing includes, but is not limited to, noise reduction and dereverberation, so that the low-quality speech data provided by the user is processed into clearer speech data.
Optionally, before the speaker's speech data and its corresponding text data are processed by the preset speech recognition model, it is verified that the speaker's speech data and text data are in one-to-one correspondence, to avoid mismatched input data causing training to fail.
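A minimal sketch of such a one-to-one verification, assuming recordings and transcripts are matched by a shared utterance id (a hypothetical convention, not one the patent specifies):

```python
def verify_correspondence(audio_ids, text_ids):
    """Check that every recording has exactly one transcript and vice versa.
    Returns (ok, missing_text, missing_audio) so the caller can tell the
    user exactly which items to re-enter."""
    audio_set, text_set = set(audio_ids), set(text_ids)
    missing_text = sorted(audio_set - text_set)    # recordings with no transcript
    missing_audio = sorted(text_set - audio_set)   # transcripts with no recording
    ok = (not missing_text and not missing_audio
          and len(audio_ids) == len(audio_set)     # no duplicate recordings
          and len(text_ids) == len(text_set))      # no duplicate transcripts
    return ok, missing_text, missing_audio
```

On failure the system can prompt the user to re-record or re-enter the listed items, matching the behaviour of the data verification module described later.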
Optionally, in step S242, training the GRU speech model from the speaker's acoustic features and phoneme features and the base speech model comprises:
initializing the GRU speech model from the base speech model;
processing the phoneme features and acoustic features into the format required by the GRU speech model, and feeding them to the GRU speech model to complete its training.
Specifically, the GRU speech model is trained with the phoneme features as input and the acoustic features as output, so that the speech it synthesizes matches the speaker's voice characteristics.
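The initialize-then-fine-tune procedure above can be sketched as plain weight handling. The weight dictionary and the placeholder gradient function are hypothetical; the patent names no particular framework or optimizer:

```python
import copy

def init_from_base(base_weights):
    """Warm-start the speaker model by copying the base model's weights
    instead of starting from a random initialization."""
    return copy.deepcopy(base_weights)

def finetune(weights, speaker_batches, grad_fn, lr=0.01, epochs=3):
    """Iterate over the small speaker dataset, nudging each weight by a
    gradient supplied by grad_fn(weights, batch) -> dict with the same keys
    (a stand-in for backprop through the GRU)."""
    for _ in range(epochs):
        for batch in speaker_batches:
            grads = grad_fn(weights, batch)
            for name in weights:
                weights[name] = weights[name] - lr * grads[name]
    return weights
```

The key property is that the base model is left untouched, so the same large-data initialization can be reused for every new speaker.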
A specific embodiment of the disclosure is described below with reference to Fig. 3.
Fig. 3 is a structural block diagram of the small-sample adaptive speech synthesis system, which specifically includes:
1. Data verification module
This module mainly detects whether the speech data entered by the user matches the text data; if they do not match, the user is notified to re-enter it.
2. Automatic alignment module
Using a built-in speech recognition model, this module recognizes the user's speech and matches it against the corresponding text, obtaining duration alignment information for each phoneme, i.e., how long each phoneme is pronounced and the corresponding spectrum.
3. Preprocessing module
Since the user's recording equipment is generally crude, the quality of the speech data is usually not high. To obtain a better synthesis result, the system applies a series of preprocessing operations to the speech data in this module, including noise reduction and dereverberation, yielding clearer speech data.
4. Acoustic feature extraction module
This module extracts acoustic features from the preprocessed speech data.
5. Phoneme feature extraction module
The inputs of this module are the text data corresponding to the speech and the phoneme duration alignment information output by the automatic alignment module; the output is phoneme features incorporating the phoneme duration information.
6. Training data generation module
This module processes the acoustic features and phoneme features to generate the data format required for training the model.
7. Base model module
Provides the base model trained on existing large-scale data.
8. Model training module
Trains the speaker's speech synthesis model using the data generated by the training data generation module. Specifically, the model weights are initialized from a model pre-trained on large-scale data, and the model is then iteratively trained on the small amount of new speaker data. Training uses a GRU model, a variant of the recurrent neural network that can learn long-term dependencies in time series. For speech synthesis, it memorizes the contextual information of the text well; in addition, its structure is simple and its decoding is fast, satisfying the real-time requirements of online speech synthesis.
9. Automatic speech synthesis module
The personalized speech synthesis system obtained through adaptive training.
10. Input text module and output audio module
Used, respectively, to input the text for which speech is to be synthesized and to output the automatically synthesized speaker speech.
This specific embodiment of the disclosure trains the speech synthesis system with only a small amount of speaker data, which can greatly reduce the cost of training the model; compared with a model trained on only the small amount of speaker speech data alone, the trained model is markedly better in aspects such as speech quality and intelligibility.
Referring to Fig. 4, an embodiment of the present disclosure provides an adaptive speech synthesis apparatus, comprising:
a base speech data acquisition unit 410 for obtaining base speech data and text data corresponding to the base speech data;
a base speech model training unit 420 for training a base speech model from the base speech data and its corresponding text data;
a speaker speech data acquisition unit 430 for obtaining speech data of a speaker and text data corresponding to the speaker's speech data;
a GRU model training unit 440 for training a GRU speech model from the speaker's speech data, its corresponding text data, and the base speech model;
an adaptive speech synthesis unit 450 for synthesizing, when a speech synthesis instruction is received, the speaker's voice from the GRU speech model and the text contained in the instruction.
Optionally, the GRU model training unit 440 is specifically configured to determine the speaker's acoustic features and phoneme features from the speaker's speech data and its corresponding text data, and
to train the GRU speech model from the speaker's acoustic features and phoneme features and the base speech model.
Optionally, when determining the speaker's phoneme features, the GRU model training unit 440 is specifically configured to:
process the speaker's speech data and its corresponding text data with a preset speech recognition model to obtain duration alignment information for the speaker's phonemes;
determine the speaker's phoneme features from the phoneme duration alignment information and the text data corresponding to the speaker's speech data.
Optionally, the GRU model training unit 440 is further configured to:
preprocess the speaker's speech data before determining the speaker's acoustic features and phoneme features.
Optionally, the preprocessing comprises:
noise reduction and/or dereverberation.
Optionally, the GRU model training unit 440 is further configured to:
verify that the speaker's speech data and its corresponding text data are in one-to-one correspondence.
Optionally, when training the GRU speech model from the speaker's acoustic features and phoneme features and the base speech model, the GRU model training unit 440 is specifically configured to:
initialize the GRU speech model from the base speech model;
process the phoneme features and acoustic features into the format required by the GRU speech model, and feed them to the GRU speech model to complete its training.
For the specific limitations of the adaptive speech synthesis apparatus, refer to the limitations of the adaptive speech synthesis method above; details are not repeated here.
It should be understood that the various techniques described here may be implemented in hardware, software, or a combination of the two. Thus, some aspects or parts of the disclosed methods and apparatus may take the form of program code (instructions) embedded in a tangible medium such as a floppy disk, CD-ROM, hard drive, or any other machine-readable storage medium, such that, when the program is loaded into a machine such as a computer and executed by it, the machine becomes an apparatus practicing the disclosure.
When the program code executes on a programmable computer, the computing device generally includes a processor, a processor-readable storage medium (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store the program code; the processor is configured to execute the various methods of the disclosure according to the instructions in the program code stored in the memory.
By way of example and not limitation, computer-readable media include computer storage media and communication media. Computer storage media store information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transmission mechanism, and include any information delivery medium. Any combination of the above is also included within the scope of computer-readable media.
It should be understood that, in order to streamline the disclosure and aid understanding of one or more of its various aspects, features of the disclosure are sometimes grouped together into a single embodiment, figure, or description thereof in the foregoing description of exemplary embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed disclosure requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into that detailed description, with each claim standing on its own as a separate embodiment of the disclosure.
Those skilled in the art should understand that the modules, units, or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiments, or alternatively may be located in one or more devices different from the devices in the examples. The modules in the foregoing examples may be combined into one module or may be further divided into multiple submodules.
Those skilled in the art will understand that the modules in the devices of an embodiment may be adaptively changed and arranged in one or more devices different from those of the embodiment. The modules, units, or components of an embodiment may be combined into one module, unit, or component, and may furthermore be divided into multiple submodules, subunits, or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
Furthermore, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the disclosure and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the embodiments are described herein as methods, or combinations of method elements, that can be implemented by a processor of a computer system or by another device performing the function. Thus, a processor having the instructions necessary for implementing such a method or method element forms a device for implementing the method or method element. Furthermore, an element of an apparatus embodiment described herein is an example of a device for carrying out the function performed by that element for the purpose of implementing the disclosure.
As used herein, unless otherwise specified, the use of the ordinals "first", "second", "third", etc. to describe an ordinary object merely indicates that different instances of similar objects are being referred to, and is not intended to imply that the objects so described must be in a given order, whether temporally, spatially, in ranking, or in any other manner.
Although the disclosure has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of the foregoing description, will appreciate that other embodiments can be envisaged within the scope of the disclosure thus described. It should also be noted that the language used in this specification has been chosen primarily for readability and instructional purposes, rather than to delineate or limit the subject matter of the disclosure. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure herein is illustrative rather than restrictive, and the scope of the disclosure is defined by the appended claims.

Claims (10)

1. An adaptive speech synthesis method, comprising:
obtaining basic speech data and text data corresponding to the basic speech data;
training a basic speech model according to the basic speech data and the text data corresponding to the basic speech data;
obtaining speech data of a speaker and text data corresponding to the speech data of the speaker;
training a gated recurrent unit (GRU) speech model according to the speech data of the speaker, the text data corresponding to the speech data of the speaker, and the basic speech model; and
upon receiving a speech synthesis instruction, synthesizing the speaker's voice according to the GRU speech model and text information contained in the instruction.
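Outside the claim language, the gated recurrent unit (GRU) recited above follows the standard gate equations. The following toy scalar-weight sketch is illustrative only, with made-up weights, and is not the patent's actual model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class GRUCell:
    """Toy single-feature GRU cell: update gate z, reset gate r,
    candidate state h~. All weights are illustrative scalars."""
    def __init__(self, wz=0.5, wr=0.5, wh=0.5, uz=0.5, ur=0.5, uh=0.5):
        # w* weight the input x; u* weight the recurrent state h
        self.wz, self.wr, self.wh = wz, wr, wh
        self.uz, self.ur, self.uh = uz, ur, uh

    def step(self, x, h):
        z = sigmoid(self.wz * x + self.uz * h)             # update gate
        r = sigmoid(self.wr * x + self.ur * h)             # reset gate
        h_cand = math.tanh(self.wh * x + self.uh * r * h)  # candidate state
        return (1.0 - z) * h + z * h_cand                  # new hidden state

cell = GRUCell()
h = 0.0
for x in [0.1, 0.4, -0.2]:  # a toy frame-level input sequence
    h = cell.step(x, h)
print(round(h, 4))
```

The update gate z interpolates between the previous hidden state and the candidate state, which is what lets such a model carry prosodic context across frames.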
2. The method of claim 1, wherein training the gated recurrent unit (GRU) speech model according to the speech data of the speaker, the text data corresponding to the speech data of the speaker, and the basic speech model comprises:
determining acoustic features and phoneme features of the speaker according to the speech data of the speaker and the text data corresponding to the speech data of the speaker; and
training the GRU speech model according to the acoustic features and phoneme features of the speaker and the basic speech model.
3. The method of claim 2, wherein determining the phoneme features of the speaker comprises:
processing the speech data of the speaker and the text data corresponding to the speech data of the speaker with a preset speech recognition model to obtain duration alignment information of the speaker's phonemes; and
determining the phoneme features of the speaker according to the duration alignment information of the speaker's phonemes and the text data corresponding to the speech data of the speaker.
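A minimal sketch of how duration alignment information might be expanded into frame-level phoneme features. The `(phoneme, n_frames)` pair format is a hypothetical simplification of a forced-alignment output, not the format the patent prescribes:

```python
def expand_to_frames(alignment):
    """Expand phoneme duration alignment into a frame-level phoneme sequence.

    `alignment` is a hypothetical list of (phoneme, n_frames) pairs, e.g. a
    simplified view of what a speech recognition model's forced alignment
    could produce.
    """
    frames = []
    for phoneme, n_frames in alignment:
        frames.extend([phoneme] * n_frames)  # repeat label for its duration
    return frames

# toy alignment for the syllables "ni hao"
alignment = [("n", 3), ("i", 5), ("h", 2), ("ao", 6)]
frames = expand_to_frames(alignment)
print(len(frames))  # 3 + 5 + 2 + 6 = 16 frames in total
```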
4. The method of claim 3, further comprising, before processing the speech data of the speaker and the text data corresponding to the speech data of the speaker with the preset speech recognition model:
verifying that the speech data of the speaker and the text data are in one-to-one correspondence.
5. The method of claim 2, further comprising, before determining the acoustic features and phoneme features of the speaker:
preprocessing the speech data of the speaker.
6. The method of claim 5, wherein the preprocessing comprises:
noise reduction and/or dereverberation.
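For illustration only, the simplest conceivable noise-reduction step is an amplitude gate. This toy sketch stands in for whatever denoising/dereverberation the claim covers; the threshold value is arbitrary and the technique is not the patented method:

```python
def noise_gate(samples, threshold=0.02):
    """Crude noise-reduction sketch: zero out samples whose magnitude falls
    below a threshold, leaving louder (presumably voiced) samples intact."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]

noisy = [0.01, -0.005, 0.3, -0.25, 0.015, 0.4]
clean = noise_gate(noisy)
print(clean)  # low-level noise samples replaced by 0.0
```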
7. The method of claim 2, wherein training the gated recurrent unit (GRU) speech model according to the acoustic features and phoneme features of the speaker and the basic speech model comprises:
initializing the GRU speech model from the basic speech model; and
converting the phoneme features and the acoustic features into the format required by the GRU speech model, and feeding them into the GRU speech model to complete training of the GRU speech model.
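Initializing an adaptation model from a pretrained base model can be sketched as copying matching parameters before fine-tuning. The flat parameter-dict representation and the names `gru.w`, `out.w`, and `speaker.embed` are hypothetical, chosen only for illustration:

```python
def init_from_base(base_params, speaker_params):
    """Warm-start an adaptation model from a pretrained base model: copy every
    base parameter whose name and size match, keep the rest as-is."""
    initialized = dict(speaker_params)
    for name, value in base_params.items():
        if name in initialized and len(initialized[name]) == len(value):
            initialized[name] = list(value)  # copy weights from the base model
    return initialized

base = {"gru.w": [0.2, 0.4], "out.w": [0.1]}
speaker = {"gru.w": [0.0, 0.0], "out.w": [0.0], "speaker.embed": [0.0, 0.0]}
params = init_from_base(base, speaker)
print(params["gru.w"])  # [0.2, 0.4] — copied from the base model
```

Fine-tuning would then update `params` on the speaker's (typically small) dataset, which is the usual motivation for warm-starting from a base model trained on a large corpus.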
8. An adaptive speech synthesis apparatus, comprising:
a basic speech data acquisition unit, configured to obtain basic speech data and text data corresponding to the basic speech data;
a basic speech model training unit, configured to train a basic speech model according to the basic speech data and the text data corresponding to the basic speech data;
a speaker speech data acquisition unit, configured to obtain speech data of a speaker and text data corresponding to the speech data of the speaker;
a GRU model training unit, configured to train a gated recurrent unit (GRU) speech model according to the speech data of the speaker, the text data corresponding to the speech data of the speaker, and the basic speech model; and
an adaptive speech synthesis unit, configured to, upon receiving a speech synthesis instruction, synthesize the speaker's voice according to the GRU speech model and text information contained in the instruction.
9. A readable storage medium having executable instructions stored thereon which, when executed, cause a computer to perform the operations of any one of claims 1-7.
10. A computing device, comprising:
a processor; and
a memory storing executable instructions which, when executed, cause the processor to perform the operations of any one of claims 1-7.
CN201910661648.6A 2019-07-22 2019-07-22 Adaptive speech synthesis method, device, readable storage medium and computing equipment Active CN110379407B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910661648.6A CN110379407B (en) 2019-07-22 2019-07-22 Adaptive speech synthesis method, device, readable storage medium and computing equipment


Publications (2)

Publication Number Publication Date
CN110379407A (en) 2019-10-25
CN110379407B CN110379407B (en) 2021-10-19

Family

ID=68254684

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910661648.6A Active CN110379407B (en) 2019-07-22 2019-07-22 Adaptive speech synthesis method, device, readable storage medium and computing equipment

Country Status (1)

Country Link
CN (1) CN110379407B (en)


Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20000042363A (en) * 1998-12-24 2000-07-15 이계철 Method for initializing program and data in voice recognition system
CN105185372A (en) * 2015-10-20 2015-12-23 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
US20170076715A1 (en) * 2015-09-16 2017-03-16 Kabushiki Kaisha Toshiba Training apparatus for speech synthesis, speech synthesis apparatus and training method for training apparatus
US9922641B1 (en) * 2012-10-01 2018-03-20 Google Llc Cross-lingual speaker adaptation for multi-lingual speech synthesis
CN107871497A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Speech recognition method and device
JP2018097115A (en) * 2016-12-12 2018-06-21 日本電信電話株式会社 Fundamental frequency model parameter estimation device, method, and program
CN108550363A (en) * 2018-06-04 2018-09-18 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device, computer equipment and readable medium
CN108665898A (en) * 2018-03-30 2018-10-16 西北师范大学 Method for converting gesture into Chinese-Tibetan bilingual speech
CN108831435A (en) * 2018-06-06 2018-11-16 安徽继远软件有限公司 Emotional speech synthesis method based on emotion-sensitive speaker adaptation
US20190066656A1 (en) * 2017-08-29 2019-02-28 Kabushiki Kaisha Toshiba Speech synthesis dictionary delivery device, speech synthesis system, and program storage medium
CN109523993A (en) * 2018-11-02 2019-03-26 成都三零凯天通信实业有限公司 Speech language classification method based on a deep neural network fusing CNN with GRU
CN109599092A (en) * 2018-12-21 2019-04-09 秒针信息技术有限公司 Audio synthesis method and device
CN109635462A (en) * 2018-12-17 2019-04-16 深圳前海微众银行股份有限公司 Model parameter training method, device, equipment and medium based on federated learning
CN109885832A (en) * 2019-02-14 2019-06-14 平安科技(深圳)有限公司 Model training, sentence processing method, device, computer equipment and storage medium


Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111276120A (en) * 2020-01-21 2020-06-12 华为技术有限公司 Speech synthesis method, apparatus and computer-readable storage medium
CN111276120B (en) * 2020-01-21 2022-08-19 华为技术有限公司 Speech synthesis method, apparatus and computer-readable storage medium
CN112133282A (en) * 2020-10-26 2020-12-25 厦门大学 Lightweight multi-speaker speech synthesis system and electronic equipment
CN112133282B (en) * 2020-10-26 2022-07-08 厦门大学 Lightweight multi-speaker speech synthesis system and electronic equipment
CN112185340A (en) * 2020-10-30 2021-01-05 网易(杭州)网络有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic apparatus
CN112185340B (en) * 2020-10-30 2024-03-15 网易(杭州)网络有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
WO2022094740A1 (en) * 2020-11-03 2022-05-12 Microsoft Technology Licensing, Llc Controlled training and use of text-to-speech models and personalized model generated voices
CN112365876A (en) * 2020-11-27 2021-02-12 北京百度网讯科技有限公司 Method, device and equipment for training speech synthesis model and storage medium
CN112365876B (en) * 2020-11-27 2022-04-12 北京百度网讯科技有限公司 Method, device and equipment for training speech synthesis model and storage medium
CN112634856A (en) * 2020-12-10 2021-04-09 苏州思必驰信息科技有限公司 Speech synthesis model training method and speech synthesis method
CN113314092A (en) * 2021-05-11 2021-08-27 北京三快在线科技有限公司 Method and device for model training and voice interaction
CN113299268A (en) * 2021-07-28 2021-08-24 成都启英泰伦科技有限公司 Speech synthesis method based on stream generation model
CN113707122A (en) * 2021-08-11 2021-11-26 北京搜狗科技发展有限公司 Method and device for constructing voice synthesis model
CN113707122B (en) * 2021-08-11 2024-04-05 北京搜狗科技发展有限公司 Method and device for constructing voice synthesis model

Also Published As

Publication number Publication date
CN110379407B (en) 2021-10-19

Similar Documents

Publication Publication Date Title
CN110379407A (en) Adaptive voice synthetic method, device, readable storage medium storing program for executing and calculating equipment
CN108520741B (en) Method, device and equipment for restoring ear voice and readable storage medium
US8510103B2 (en) System and method for voice recognition
CN110111775A (en) A kind of Streaming voice recognition methods, device, equipment and storage medium
KR101323061B1 (en) Speaker authentication
CN110415687A (en) Method of speech processing, device, medium, electronic equipment
CN110189748B (en) Model construction method and device
CN112233646B (en) Voice cloning method, system, equipment and storage medium based on neural network
CN111653265B (en) Speech synthesis method, device, storage medium and electronic equipment
CN110232907A (en) A kind of phoneme synthesizing method, device, readable storage medium storing program for executing and calculate equipment
CN112309365B (en) Training method and device of speech synthesis model, storage medium and electronic equipment
CN110310619A (en) Polyphone prediction technique, device, equipment and computer readable storage medium
CN107240401B (en) Tone conversion method and computing device
EP3989217B1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
CN111081230A (en) Speech recognition method and apparatus
CN104361896B (en) Voice quality assessment equipment, method and system
CN112837669B (en) Speech synthesis method, device and server
CN105654955B (en) Audio recognition method and device
CN114333865A (en) Model training and tone conversion method, device, equipment and medium
CN112489623A (en) Language identification model training method, language identification method and related equipment
CN110580897B (en) Audio verification method and device, storage medium and electronic equipment
CN109688271A (en) The method, apparatus and terminal device of contact information input
CN110853669A (en) Audio identification method, device and equipment
CN108922523B (en) Position prompting method and device, storage medium and electronic equipment
CN108989551B (en) Position prompting method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant