CN105931631A - Voice synthesis system and method - Google Patents
Voice synthesis system and method
- Publication number
- CN105931631A (application CN201610236400.1A)
- Authority
- CN
- China
- Prior art keywords
- information
- extract
- training
- synthesis
- image
- Prior art date
- 2016-04-15
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/047—Architecture of speech synthesisers
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Abstract
The invention provides a voice synthesis system and method. The method comprises: collecting several pieces of synthesis material information and preprocessing each piece to extract synthesis feature information, wherein the synthesis material information comprises text information, and at least one of voice information and image information; predicting all of the synthesis feature information through a prediction model to generate acoustic parameter information; and generating voice synthesis result information according to the acoustic parameter information. By collecting text information together with at least one of voice information and image information, extracting the synthesis feature information, predicting with the prediction model, and finally generating voice, the system and method predict the emotion or context of a user from the features extracted from the voice and/or image information, and thereby synthesize personalized voice that expresses the user's emotion or context.
Description
Technical field
The present application relates to the field of speech synthesis technology, and in particular to a speech synthesis system and method.
Background technology
Existing text-to-speech (TTS) solutions fall mainly into two classes: concatenative (splicing) systems and parametric systems. What the two classes have in common is that both require text analysis. They differ in that a concatenative system draws on a large inventory of recorded speech fragments and, guided by the text analysis result, splices recorded fragments together to obtain synthesized speech, whereas a parametric system uses the text analysis result to generate speech parameters from a model, such as the fundamental frequency, and then converts those parameters into a waveform.
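By way of illustration only, the following minimal Python sketch contrasts the two classes; the unit names, the sine-wave "recordings" and the trivial renderer are assumptions for demonstration, not part of any patented method:

```python
import numpy as np

SR = 16000
# Stand-in "recordings": two 100 ms unit waveforms keyed by unit name
units = {name: np.sin(2 * np.pi * f * np.arange(SR // 10) / SR)
         for name, f in [("ni", 220.0), ("hao", 330.0)]}

def splice(sequence):
    """Class 1: concatenative synthesis joins recorded fragments."""
    return np.concatenate([units[u] for u in sequence])

def parametric(f0_track):
    """Class 2: a model-produced F0 track is rendered into a waveform."""
    phase = 2 * np.pi * np.cumsum(f0_track) / SR
    return np.sin(phase)

spliced = splice(["ni", "hao"])                              # spliced speech
generated = parametric(np.linspace(200.0, 180.0, SR // 2))   # falling F0 contour
```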
The models of existing systems are trained using only textual features, or offline acoustic features, and at prediction time only textual features are used. Such systems take no account of the user's expression, the surrounding environment, or the speech the user produces in different contexts and under different emotional states. Because of this inability to observe the surrounding environment and the user's state, existing systems generate speech that is not natural enough and lacks emotion: for the same text in different contexts, the same voice is generated every time.
Summary of the invention
In view of the above drawbacks and deficiencies of the prior art, it is desirable to provide a speech synthesis system and method that combine text with image and voice information to synthesize personalized speech expressing the user's emotion or context.
In a first aspect, the present invention provides a speech synthesis system, the system comprising:
a feature extraction unit, configured to collect several pieces of synthesis material information and to preprocess each piece of synthesis material information to extract synthesis feature information, wherein the synthesis material information comprises text information, and at least one of voice information and image information;
a prediction unit, configured to predict each piece of synthesis feature information through a prediction model to generate acoustic parameter information;
a synthesis unit, configured to generate voice synthesis result information according to the acoustic parameter information.
In a second aspect, the present invention provides a speech synthesis method, the method comprising:
collecting several pieces of synthesis material information, and preprocessing each piece of synthesis material information to extract synthesis feature information, wherein the synthesis material information comprises text information, and at least one of voice information and image information;
predicting each piece of synthesis feature information through a prediction model to generate acoustic parameter information;
generating voice synthesis result information according to the acoustic parameter information.
The speech synthesis system and method provided by many embodiments of the present invention collect text information together with at least one of voice information and image information, extract each piece of synthesis feature information, predict through a prediction model, and finally generate voice. By predicting the user's emotion or context from the features extracted from voice and/or image information, they synthesize personalized speech expressing the user's emotion or context.
The speech synthesis system and method provided by some embodiments of the invention further collect text information and at least one of voice information and image information to extract each piece of training feature information and train the prediction model, broadening the types of input the prediction model can match and improving prediction accuracy.
Brief description of the drawings
Other features, objects and advantages will become more apparent upon reading the following detailed description of non-limiting embodiments, made with reference to the accompanying drawings:
Fig. 1 is a structural schematic diagram of a speech synthesis system in one embodiment of the invention.
Fig. 2 is a flow chart of a speech synthesis method in one embodiment of the invention.
Fig. 3 is a structural schematic diagram of a speech synthesis system in a preferred embodiment of the invention.
Fig. 4 is a flow chart of a speech synthesis method in a preferred embodiment of the invention.
Fig. 5 is a structural schematic diagram of a speech synthesis system in a preferred embodiment of the invention.
Detailed description of the invention
The application is described in further detail below in conjunction with the accompanying drawings and embodiments. It is to be understood that the specific embodiments described herein serve only to explain the related invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the invention.
It should be noted that, where no conflict arises, the embodiments in the application and the features in the embodiments may be combined with one another. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 is a structural schematic diagram of a speech synthesis system in one embodiment of the invention.
As shown in Fig. 1, in the present embodiment, the speech synthesis system provided by the invention includes a feature extraction unit 10, a prediction unit 30 and a synthesis unit 50.
The feature extraction unit 10 is configured to collect several pieces of synthesis material information and to preprocess each piece of synthesis material information to extract synthesis feature information.
The prediction unit 30 is configured to predict each piece of synthesis feature information through a prediction model to generate acoustic parameter information.
The synthesis unit 50 is configured to generate voice synthesis result information according to the acoustic parameter information.
In some embodiments, the synthesis material information includes text information and voice information; in other embodiments, it includes text information and image information; in still other embodiments, it includes text information, voice information and image information simultaneously.
In the present embodiment, the prediction unit 30 stores a trained prediction model; in some embodiments, it can further receive updated prediction models by means such as wireless communication.
In the present embodiment, the synthesis unit 50 is a vocoder, and the acoustic parameter information includes the fundamental frequency and formant frequencies. In further embodiments, the synthesis unit 50 may use a different sound synthesis device according to actual requirements, with correspondingly different acoustic parameters.
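As a purely illustrative sketch of how such a vocoder might turn these two parameters into a waveform, consider a one-formant source-filter model in Python; the resonator design, bandwidth and gain normalization are assumptions, not the patent's vocoder:

```python
import numpy as np
from scipy.signal import lfilter

def resonator(f_formant, bandwidth, sr):
    """Second-order IIR resonator centred on a formant frequency."""
    r = np.exp(-np.pi * bandwidth / sr)
    theta = 2 * np.pi * f_formant / sr
    b = [1 - r * r]                             # rough gain normalization
    a = [1.0, -2 * r * np.cos(theta), r * r]
    return b, a

def synthesize(f0, f_formant, duration=0.5, sr=16000, bandwidth=100.0):
    """Impulse train at F0 shaped by a single formant resonator."""
    n = int(duration * sr)
    source = np.zeros(n)
    source[::int(sr / f0)] = 1.0                # glottal impulse train at F0
    b, a = resonator(f_formant, bandwidth, sr)
    return lfilter(b, a, source)

wave = synthesize(f0=180.0, f_formant=700.0)    # a raised F0 suggests an excited tone
```

Raising or lowering F0 and the formant shifts the perceived tone, which is exactly the degree of freedom the prediction model controls.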
In some embodiments, the prediction model is a logistic regression model; in other embodiments, the prediction model is a deep neural network model.
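A hedged sketch of both model families on synthetic data follows; the feature dimensions, class labels and parameter targets are illustrative assumptions, and scikit-learn is one possible implementation choice, not named by the patent:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))          # combined text/voice/image feature vectors
tones = rng.integers(0, 3, size=200)    # e.g. 0 = neutral, 1 = gentle, 2 = agitated
params = rng.normal(size=(200, 2))      # e.g. [F0, formant] targets

clf = LogisticRegression(max_iter=1000).fit(X, tones)                         # logistic regression
reg = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500).fit(X, params)  # small deep network

x_new = rng.normal(size=(1, 40))
tone_class = clf.predict(x_new)[0]      # discrete tone / emotion class
f0, formant = reg.predict(x_new)[0]     # continuous acoustic parameters
```

The logistic regression yields a discrete tone or emotion class, while the small network regresses acoustic parameters directly; either output can drive the synthesis unit.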
Fig. 2 is a flow chart of a speech synthesis method in one embodiment of the invention. The speech synthesis method shown in Fig. 2 can be applied in the speech synthesis system shown in Fig. 1.
As shown in Fig. 2, in the present embodiment, the speech synthesis method provided by the invention includes:
S30: collecting several pieces of synthesis material information, and preprocessing each piece of synthesis material information to extract synthesis feature information, wherein the synthesis material information includes text information, and at least one of voice information and image information.
S50: predicting each piece of synthesis feature information through a prediction model to generate acoustic parameter information.
S70: generating voice synthesis result information according to the acoustic parameter information.
Specifically, the prediction model predicts from the synthesis feature information extracted from the voice information and/or image information, thereby judging the user's emotion and context and generating the corresponding acoustic parameter information. Take the text "please shut the door" as an example: an existing speech synthesis system can generally only generate a single flat voice, without emotion or tone. In the present embodiment, by contrast, synthesis feature information corresponding to a gentle tone can be extracted from the voice information, or synthesis feature information corresponding to an angry expression can be extracted from the image information; the prediction model then predicts from this synthesis feature information and generates acoustic parameter information for a correspondingly gentle or agitated tone, ultimately producing voice synthesis result information that matches the user's tone or expression and expresses the user's emotion or context.
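To make the walkthrough concrete, here is a self-contained sketch in which a predicted tone class (hard-coded here, standing in for the prediction model's output) selects acoustic parameters for a vocoder such as the one sketched above; all parameter values are illustrative assumptions:

```python
# Map a predicted tone class to acoustic parameter templates (assumed values)
TONE_PARAMS = {
    "gentle":   {"f0": 140.0, "f_formant": 500.0},   # soft, lowered pitch
    "agitated": {"f0": 220.0, "f_formant": 900.0},   # raised pitch and formant
    "neutral":  {"f0": 170.0, "f_formant": 700.0},
}

def params_for(tone: str) -> dict:
    """Acoustic parameters for 'please shut the door' under a predicted tone."""
    return TONE_PARAMS.get(tone, TONE_PARAMS["neutral"])

print(params_for("gentle"))   # -> {'f0': 140.0, 'f_formant': 500.0}
```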
The speech synthesis system and method provided by the above embodiment collect text information together with at least one of voice information and image information, extract each piece of synthesis feature information, predict through a prediction model, and finally generate voice. By predicting the user's emotion or context from the features extracted from voice and/or image information, they synthesize personalized speech expressing the user's emotion or context.
Fig. 3 is a structural schematic diagram of a speech synthesis system in a preferred embodiment of the invention.
As shown in Fig. 3, in a preferred embodiment, the system further includes a model training unit 20. The feature extraction unit 10 is further configured to collect several pieces of training material information and to preprocess each piece of training material information to extract training feature information. The model training unit 20 is configured to train the prediction model according to each piece of training feature information.
As with the synthesis material information, in some embodiments the training material information includes text information and voice information; in other embodiments, text information and image information; and in still other embodiments, text information, voice information and image information simultaneously.
Specifically, in the present embodiment, the model training unit 20 can train a new prediction model, and can also further train a prediction model pre-stored in the prediction unit 30 to improve prediction accuracy.
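Continuing the earlier illustrative setup, a sketch of the model training unit's two roles; the data here is synthetic and scikit-learn is an assumed implementation choice:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X, y = rng.normal(size=(200, 40)), rng.normal(size=(200, 2))

# Role 1: train a new prediction model from scratch
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=300).fit(X, y)

# Role 2: further train a pre-stored model on fresh training material;
# partial_fit keeps the existing weights and refines them
X_new, y_new = rng.normal(size=(50, 40)), rng.normal(size=(50, 2))
for _ in range(10):
    model.partial_fit(X_new, y_new)
```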
Fig. 4 is a flow chart of a speech synthesis method in a preferred embodiment of the invention. The speech synthesis method shown in Fig. 4 can be applied in the speech synthesis system shown in Fig. 3.
As shown in Fig. 4, in a preferred embodiment, the method further includes, before step S30:
S10: collecting several pieces of training material information, and preprocessing each piece of training material information to extract training feature information, wherein the training material information includes text information, and at least one of voice information and image information.
S20: training the prediction model according to each piece of training feature information.
The speech synthesis system and method provided by the above embodiment further collect text information and at least one of voice information and image information to extract each piece of training feature information and train the prediction model, broadening the types of input the prediction model can match and improving prediction accuracy.
Fig. 5 is a structural schematic diagram of a speech synthesis system in a preferred embodiment of the invention.
As shown in Fig. 5, in a preferred embodiment, the feature extraction unit 10 includes:
a text feature extraction subunit 101, configured to collect first text information and to preprocess the first text information to extract first text feature information for prediction;
and preferably also at least one of the following:
a speech feature extraction subunit 103, configured to collect first voice information and to preprocess the first voice information to extract first voice feature information for predicting the acquisition environment and the user's context;
an image feature extraction subunit 105, configured to collect first image information and to preprocess the first image information to extract first image feature information for predicting the user's expression.
Specifically, besides being used to predict the user's context, the first voice feature information can also be used to predict features of the acquisition environment, such as how noisy it is, to further improve prediction accuracy.
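One assumed heuristic for the "noisiness" feature (the patent names the feature but not a method): estimate the background level as the quiet-frame floor of short-time RMS energy, here using librosa:

```python
import numpy as np
import librosa

def noisiness(wav_path, frame_length=2048, hop_length=512):
    """Estimate ambient noise level as the quiet-frame floor of RMS energy."""
    y, sr = librosa.load(wav_path, sr=16000)
    rms = librosa.feature.rms(y=y, frame_length=frame_length,
                              hop_length=hop_length)[0]
    return float(np.percentile(rms, 10))   # low percentile ≈ background noise
```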
In a corresponding preferred embodiment, step S30 includes:
S301: collecting first text information, and preprocessing the first text information to extract first text feature information for prediction;
and preferably also at least one of the following:
S302: collecting first voice information, and preprocessing the first voice information to extract first voice feature information for predicting the acquisition environment and the user's context;
S303: collecting first image information, and preprocessing the first image information to extract first image feature information for predicting the user's expression.
In a preferred embodiment, preprocessing the first text information includes performing text normalization and prosody prediction on the first text information.
Preprocessing the first voice information includes performing mel-frequency cepstral coefficient (MFCC) feature extraction and digitization on the first voice information.
Preprocessing the first image information includes performing face recognition on the first image information and extracting the relevant color, texture, shape and spatial relationship features.
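A sketch of the three named preprocessing steps using common open-source tools; librosa for MFCCs, OpenCV's stock Haar cascade standing in for face recognition, and a color histogram standing in for the color/texture/shape/spatial-relationship features — all library and feature choices are assumptions, not specified by the patent:

```python
import re
import cv2
import librosa

def text_preprocess(text):
    """Trivial stand-in for text normalization; real systems also expand
    numbers and abbreviations and run prosody prediction, omitted here."""
    return re.sub(r"\s+", " ", text.strip().lower())

def speech_preprocess(wav_path):
    """MFCC feature extraction, as named in the text."""
    y, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

def image_preprocess(img_path):
    """Face detection plus a simple color descriptor (assumes at least
    one face is detected in the image)."""
    img = cv2.imread(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    x, y, w, h = cascade.detectMultiScale(gray)[0]
    face = img[y:y + h, x:x + w]
    hist = cv2.calcHist([face], [0, 1, 2], None, [8, 8, 8],
                        [0, 256, 0, 256, 0, 256])
    return hist.flatten()
```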
In a preferred embodiment, the text feature extraction subunit 101 is further configured to collect second text information and to preprocess the second text information to extract second text feature information for training the prediction model.
The speech feature extraction subunit 103 is further configured to collect second voice information and to preprocess the second voice information to extract second voice feature information for training the prediction model.
The image feature extraction subunit 105 is further configured to collect second image information and to preprocess the second image information to extract second image feature information for training the prediction model.
In a corresponding preferred embodiment, step S10 includes:
S101: collecting second text information, and preprocessing the second text information to extract second text feature information for training the prediction model;
and preferably also at least one of the following:
S103: collecting second voice information, and preprocessing the second voice information to extract second voice feature information for training the prediction model;
S105: collecting second image information, and preprocessing the second image information to extract second image feature information for training the prediction model.
In some embodiments, the prediction model is a logistic regression model; in other embodiments, the prediction model is a deep neural network model.
The flow charts and block diagrams in the accompanying drawings illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to various embodiments of the invention. In this regard, each block in a flow chart or block diagram may represent a module, program segment or portion of code, which comprises one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flow charts, and combinations of blocks in the block diagrams and/or flow charts, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by combinations of special-purpose hardware and computer instructions.
The units or modules described in the embodiments of the present application may be implemented in software or in hardware. The described units or modules may also be arranged in a processor; for example, the prediction unit 30 may be a software program provided in a computer or smart device, or a separate hardware device that performs prediction. The names of these units or modules do not in any case limit the units or modules themselves; for example, the prediction unit 30 may also be described as "a comparison unit for matching and scoring feature information against a model to generate parameters".
As another aspect, the present application also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the apparatus of the above embodiments, or a stand-alone computer-readable storage medium not fitted into any device. The computer-readable storage medium stores one or more programs, which are used by one or more processors to perform the methods described in the present application.
The above description is only a preferred embodiment of the present application and an explanation of the technical principles employed. Those skilled in the art will appreciate that the scope of the invention involved in the application is not limited to technical solutions formed by the particular combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed herein.
Claims (12)
1. A speech synthesis system, characterised in that the system comprises:
a feature extraction unit, configured to collect several pieces of synthesis material information and to preprocess each piece of synthesis material information to extract synthesis feature information, wherein the synthesis material information comprises text information, and at least one of voice information and image information;
a prediction unit, configured to predict each piece of synthesis feature information through a prediction model to generate acoustic parameter information;
a synthesis unit, configured to generate voice synthesis result information according to the acoustic parameter information.
2. The speech synthesis system according to claim 1, characterised in that the feature extraction unit is further configured to collect several pieces of training material information and to preprocess each piece of training material information to extract training feature information, wherein the training material information comprises text information, and at least one of voice information and image information;
the system further comprises:
a model training unit, configured to train the prediction model according to each piece of training feature information.
3. The speech synthesis system according to claim 1 or 2, characterised in that the feature extraction unit comprises:
a text feature extraction subunit, configured to collect first text information and to preprocess the first text information to extract first text feature information for prediction;
and preferably further comprises at least one of the following:
a speech feature extraction subunit, configured to collect first voice information and to preprocess the first voice information to extract first voice feature information for predicting the acquisition environment and the user's context;
an image feature extraction subunit, configured to collect first image information and to preprocess the first image information to extract first image feature information for predicting the user's expression.
4. The speech synthesis system according to claim 3, characterised in that preprocessing the first text information comprises performing text normalization and prosody prediction on the first text information;
preprocessing the first voice information comprises performing mel-frequency cepstral coefficient (MFCC) feature extraction and digitization on the first voice information;
preprocessing the first image information comprises performing face recognition on the first image information and extracting the relevant color, texture, shape and spatial relationship features.
5. The speech synthesis system according to claim 3, characterised in that the text feature extraction subunit is further configured to collect second text information and to preprocess the second text information to extract second text feature information for training the prediction model;
the speech feature extraction subunit is further configured to collect second voice information and to preprocess the second voice information to extract second voice feature information for training the prediction model;
the image feature extraction subunit is further configured to collect second image information and to preprocess the second image information to extract second image feature information for training the prediction model.
6. The speech synthesis system according to claim 1, characterised in that the prediction model is a logistic regression model or a deep neural network model.
7. A speech synthesis method, characterised in that the method comprises:
collecting several pieces of synthesis material information, and preprocessing each piece of synthesis material information to extract synthesis feature information, wherein the synthesis material information comprises text information, and at least one of voice information and image information;
predicting each piece of synthesis feature information through a prediction model to generate acoustic parameter information;
generating voice synthesis result information according to the acoustic parameter information.
8. The speech synthesis method according to claim 7, characterised in that, before collecting the several pieces of synthesis material information and preprocessing each piece of synthesis material information to extract the synthesis feature information, the method further comprises:
collecting several pieces of training material information, and preprocessing each piece of training material information to extract training feature information, wherein the training material information comprises text information, and at least one of voice information and image information;
training the prediction model according to each piece of training feature information.
9. The speech synthesis method according to claim 7 or 8, characterised in that collecting the several pieces of synthesis material information and preprocessing each piece of synthesis material information to extract the synthesis feature information comprises:
collecting first text information, and preprocessing the first text information to extract first text feature information for prediction;
and preferably further comprises at least one of the following:
collecting first voice information, and preprocessing the first voice information to extract first voice feature information for predicting the acquisition environment and the user's context;
collecting first image information, and preprocessing the first image information to extract first image feature information for predicting the user's expression.
10. The speech synthesis method according to claim 9, characterised in that preprocessing the first text information comprises performing text normalization and prosody prediction on the first text information;
preprocessing the first voice information comprises performing mel-frequency cepstral coefficient (MFCC) feature extraction and digitization on the first voice information;
preprocessing the first image information comprises performing face recognition on the first image information and extracting the relevant color, texture, shape and spatial relationship features.
11. The speech synthesis method according to claim 8, characterised in that collecting the several pieces of training material information and preprocessing each piece of training material information to extract the training feature information comprises:
collecting second text information, and preprocessing the second text information to extract second text feature information for training the prediction model;
and preferably further comprises at least one of the following:
collecting second voice information, and preprocessing the second voice information to extract second voice feature information for training the prediction model;
collecting second image information, and preprocessing the second image information to extract second image feature information for training the prediction model.
12. The speech synthesis method according to claim 7, characterised in that the prediction model is a logistic regression model or a deep neural network model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610236400.1A CN105931631A (en) | 2016-04-15 | 2016-04-15 | Voice synthesis system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610236400.1A CN105931631A (en) | 2016-04-15 | 2016-04-15 | Voice synthesis system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105931631A true CN105931631A (en) | 2016-09-07 |
Family
ID=56838218
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610236400.1A Pending CN105931631A (en) | 2016-04-15 | 2016-04-15 | Voice synthesis system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105931631A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5704007A (en) * | 1994-03-11 | 1997-12-30 | Apple Computer, Inc. | Utilization of multiple voice sources in a speech synthesizer |
US5890115A (en) * | 1997-03-07 | 1999-03-30 | Advanced Micro Devices, Inc. | Speech synthesizer utilizing wavetable synthesis |
CN1460232A (en) * | 2001-03-29 | 2003-12-03 | 皇家菲利浦电子有限公司 | Text to visual speech system and method incorporating facial emotions |
WO2002084643A1 (en) * | 2001-04-11 | 2002-10-24 | International Business Machines Corporation | Speech-to-speech generation system and method |
CN1379392A (en) * | 2001-04-11 | 2002-11-13 | 国际商业机器公司 | Feeling speech sound and speech sound translation system and method |
CN101064104A (en) * | 2006-04-24 | 2007-10-31 | 中国科学院自动化研究所 | Emotion voice creating method based on voice conversion |
CN101474481A (en) * | 2009-01-12 | 2009-07-08 | 北京科技大学 | Emotional robot system |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106328139A (en) * | 2016-09-14 | 2017-01-11 | 努比亚技术有限公司 | Voice interaction method and voice interaction system |
CN107437413A (en) * | 2017-07-05 | 2017-12-05 | 百度在线网络技术(北京)有限公司 | voice broadcast method and device |
WO2019007308A1 (en) * | 2017-07-05 | 2019-01-10 | 百度在线网络技术(北京)有限公司 | Voice broadcasting method and device |
CN110288077A (en) * | 2018-11-14 | 2019-09-27 | 腾讯科技(深圳)有限公司 | A kind of synthesis based on artificial intelligence is spoken the method and relevant apparatus of expression |
CN110288077B (en) * | 2018-11-14 | 2022-12-16 | 腾讯科技(深圳)有限公司 | Method and related device for synthesizing speaking expression based on artificial intelligence |
CN110047462B (en) * | 2019-01-31 | 2021-08-13 | 北京捷通华声科技股份有限公司 | Voice synthesis method and device and electronic equipment |
CN110047462A (en) * | 2019-01-31 | 2019-07-23 | 北京捷通华声科技股份有限公司 | A kind of phoneme synthesizing method, device and electronic equipment |
WO2021004113A1 (en) * | 2019-07-05 | 2021-01-14 | 深圳壹账通智能科技有限公司 | Speech synthesis method and apparatus, computer device and storage medium |
CN111312210A (en) * | 2020-03-05 | 2020-06-19 | 云知声智能科技股份有限公司 | Text-text fused voice synthesis method and device |
WO2022048405A1 (en) * | 2020-09-01 | 2022-03-10 | 魔珐(上海)信息科技有限公司 | Text-based virtual object animation generation method, apparatus, storage medium, and terminal |
US11908451B2 (en) | 2020-09-01 | 2024-02-20 | Mofa (Shanghai) Information Technology Co., Ltd. | Text-based virtual object animation generation method, apparatus, storage medium, and terminal |
CN114945110A (en) * | 2022-05-31 | 2022-08-26 | 深圳市优必选科技股份有限公司 | Speaking head video synthesis method and device, terminal equipment and readable storage medium |
CN114945110B (en) * | 2022-05-31 | 2023-10-24 | 深圳市优必选科技股份有限公司 | Method and device for synthesizing voice head video, terminal equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20160907 |