CN105931631A - Voice synthesis system and method - Google Patents

Info

Publication number
CN105931631A
Authority
CN
China
Prior art keywords
information
extract
training
synthesis
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610236400.1A
Other languages
Chinese (zh)
Inventor
曹立新 (Cao Lixin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority to CN201610236400.1A
Publication of CN105931631A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/047 - Architecture of speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

The invention provides a voice synthesis system and method. The method comprises: collecting several pieces of synthesis material information and preprocessing each piece to extract synthesis feature information, where the synthesis material information includes text information and at least one of voice information and image information; performing prediction on all synthesis feature information through a prediction model to generate acoustic parameter information; and generating voice synthesis result information according to the acoustic parameter information. The system and method collect text information together with at least one of voice information and image information, extract synthesis feature information, perform prediction through the prediction model, and finally generate voice. By predicting the user's emotion or context from the features extracted from the voice and/or image information, they achieve the synthesis of personalized voice expressing the user's emotion or context.

Description

Speech synthesis system and method
Technical field
The present application relates to the field of speech synthesis, and in particular to a speech synthesis system and method.
Background technology
Existing text-to-speech (Text To Speech, abbreviated TTS) solutions fall into two main classes: concatenative systems and parametric systems. Both require text analysis. A concatenative system uses a large number of recorded speech fragments and, guided by the text analysis result, splices the recorded fragments together to obtain synthesized speech. A parametric system instead uses the text analysis result to produce the parameters of the speech, such as the fundamental frequency, through a model, and then converts them into a waveform.
Existing systems train their models using only textual features or offline acoustic features, and at prediction time use only textual features. They take no account of the user's facial expression, the surrounding environment, or the speech the user produces in different contexts and emotional states. Because they cannot observe the surrounding environment or the user's state, the generated speech is not natural and lacks emotion: for the same text in different contexts, the same speech is generated every time.
Summary of the invention
In view of the above drawbacks and deficiencies of the prior art, it is desirable to provide a speech synthesis system and method that combine text with images and voice to synthesize personalized speech expressing the user's emotion or context.
In a first aspect, the present invention provides a speech synthesis system, the system comprising:
a feature extraction unit for collecting several pieces of synthesis material information and preprocessing each piece of synthesis material information to extract synthesis feature information, wherein the synthesis material information includes text information and at least one of voice information and image information;
a prediction unit for performing prediction on each piece of synthesis feature information through a prediction model to generate acoustic parameter information;
a synthesis unit for generating speech synthesis result information according to the acoustic parameter information.
In a second aspect, the present invention provides a speech synthesis method, the method comprising:
collecting several pieces of synthesis material information and preprocessing each piece of synthesis material information to extract synthesis feature information, wherein the synthesis material information includes text information and at least one of voice information and image information;
performing prediction on each piece of synthesis feature information through a prediction model to generate acoustic parameter information;
generating speech synthesis result information according to the acoustic parameter information.
The speech synthesis system and method provided by many embodiments of the present invention collect text information together with at least one of voice information and image information, extract each piece of synthesis feature information, perform prediction through the prediction model, and finally generate speech. By predicting the user's emotion or context from the feature information extracted from the voice information and/or image information, they achieve the synthesis of personalized speech expressing the user's emotion or context.
The speech synthesis system and method provided by some embodiments of the present invention further collect text information together with at least one of voice information and image information to extract each piece of training feature information and train the prediction model, which extends the range of inputs the prediction model can match and improves prediction accuracy.
Accompanying drawing explanation
Other features, objects, and advantages will become more apparent upon reading the following detailed description of non-limiting embodiments, made with reference to the accompanying drawings:
Fig. 1 is a schematic structural diagram of the speech synthesis system in an embodiment of the present invention.
Fig. 2 is a flow chart of the speech synthesis method in an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of the speech synthesis system in a preferred embodiment of the present invention.
Fig. 4 is a flow chart of the speech synthesis method in a preferred embodiment of the present invention.
Fig. 5 is a schematic structural diagram of the speech synthesis system in a preferred embodiment of the present invention.
Detailed description of the invention
The application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are used only to explain the related invention, not to limit it. It should also be noted that, for ease of description, the accompanying drawings show only the parts relevant to the invention.
It should be noted that, where no conflict arises, the embodiments in the application and the features in the embodiments may be combined with one another. The application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 is a schematic structural diagram of the speech synthesis system in an embodiment of the present invention.
As shown in Fig. 1, in the present embodiment the speech synthesis system provided by the present invention includes a feature extraction unit 10, a prediction unit 30, and a synthesis unit 50.
The feature extraction unit 10 collects several pieces of synthesis material information and preprocesses each piece of synthesis material information to extract synthesis feature information.
The prediction unit 30 performs prediction on each piece of synthesis feature information through a prediction model to generate acoustic parameter information.
The synthesis unit 50 generates speech synthesis result information according to the acoustic parameter information.
In some embodiments, the synthesis material information includes text information and voice information; in other embodiments, text information and image information; and in still other embodiments, text information, voice information, and image information simultaneously.
In the present embodiment, the prediction unit 30 stores a trained prediction model; in some embodiments, an updated prediction model can further be received through wireless communication or similar means.
In the present embodiment, the synthesis unit 50 is a vocoder, and the acoustic parameter information includes the fundamental frequency and the formant frequencies. In further embodiments, the synthesis unit 50 may use a different sound synthesis device according to actual requirements, and the acoustic parameter information may use the corresponding acoustic parameters.
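As a rough illustration of how a vocoder can turn these two parameter types into sound, here is a minimal source-filter sketch in Python; the impulse-train source, the fixed 80 Hz formant bandwidth, and the example formant values are assumptions for illustration, not the patent's method:

```python
import numpy as np
from scipy.signal import lfilter

def synthesize(f0_hz, formants_hz, duration_s=0.5, sr=16000):
    """Drive a cascade of second-order resonators (one per formant)
    with an impulse train at the fundamental frequency."""
    n = int(duration_s * sr)
    source = np.zeros(n)
    source[::int(sr / f0_hz)] = 1.0            # glottal pulse train at f0
    signal = source
    for f in formants_hz:
        bw = 80.0                               # assumed bandwidth (Hz)
        r = np.exp(-np.pi * bw / sr)            # pole radius from bandwidth
        theta = 2.0 * np.pi * f / sr            # pole angle from frequency
        signal = lfilter([1.0], [1.0, -2.0 * r * np.cos(theta), r * r], signal)
    return signal / np.max(np.abs(signal))      # normalize to full scale

# A vowel-like tone: f0 = 120 Hz with formants roughly those of /a/.
waveform = synthesize(120.0, [730.0, 1090.0, 2440.0])
```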
In some embodiments, the prediction model is a logistic regression model; in other embodiments, it is a deep neural network model.
Fig. 2 is a flow chart of the speech synthesis method in an embodiment of the present invention. The speech synthesis method shown in Fig. 2 can be applied in the speech synthesis system shown in Fig. 1.
As shown in Fig. 2, in the present embodiment the speech synthesis method provided by the present invention includes:
S30: collecting several pieces of synthesis material information, and preprocessing each piece of synthesis material information to extract synthesis feature information, wherein the synthesis material information includes text information and at least one of voice information and image information.
S50: performing prediction on each piece of synthesis feature information through a prediction model to generate acoustic parameter information.
S70: generating speech synthesis result information according to the acoustic parameter information.
Specifically, the prediction model performs prediction on the synthesis feature information extracted from the voice information and/or image information, thereby judging the user's emotion and context and generating the corresponding acoustic parameter information. For example, for the text "please shut the door", an existing speech synthesis system can generally only generate one flat voice, without emotion or tone. In the present embodiment, synthesis feature information corresponding to a gentle tone can be extracted from the voice information, or synthesis feature information corresponding to an angry expression can be extracted from the image information; the prediction model then performs prediction on this synthesis feature information and generates acoustic parameter information for a correspondingly gentle or agitated tone, ultimately producing speech synthesis result information that matches the user's tone or expression and expresses the user's emotion or context.
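Purely as an illustration of this conditioning (the emotion labels, parameter values, and rules below are invented for this sketch, not taken from the patent), the same text can map to different acoustic parameters once an emotion cue is attached:

```python
def acoustic_params_for(text_features, emotion):
    """Toy stand-in for the trained prediction model: shift the
    fundamental frequency and its range according to the detected
    emotion, leaving the text features to drive everything else."""
    base = {"f0_hz": 180.0, "f0_range_hz": 40.0}
    if emotion == "gentle":                 # e.g. inferred from a soft voice
        return {"f0_hz": base["f0_hz"] * 0.9, "f0_range_hz": 20.0}
    if emotion == "angry":                  # e.g. inferred from the face image
        return {"f0_hz": base["f0_hz"] * 1.3, "f0_range_hz": 60.0}
    return base

text_features = ["please", "shut", "the", "door"]   # stand-in text analysis
print(acoustic_params_for(text_features, "gentle"))
print(acoustic_params_for(text_features, "angry"))
```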
The speech synthesis system and method provided by the above embodiment collect text information together with at least one of voice information and image information, extract each piece of synthesis feature information, perform prediction through the prediction model, and finally generate speech. By predicting the user's emotion or context from the feature information extracted from the voice information and/or image information, they achieve the synthesis of personalized speech expressing the user's emotion or context.
Fig. 3 is a schematic structural diagram of the speech synthesis system in a preferred embodiment of the present invention.
As shown in Fig. 3, in a preferred embodiment the system further includes a model training unit 20. The feature extraction unit 10 is additionally used to collect several pieces of training material information and preprocess each piece of training material information to extract training feature information. The model training unit 20 trains the prediction model according to each piece of training feature information.
As with the synthesis material information, in some embodiments the training material information includes text information and voice information; in other embodiments, text information and image information; and in still other embodiments, text information, voice information, and image information simultaneously.
Specifically, in the present embodiment the model training unit 20 can train a new prediction model, and can also further train the prediction model prestored in the prediction unit 30 to improve prediction accuracy.
Fig. 4 is a flow chart of the speech synthesis method in a preferred embodiment of the present invention. The speech synthesis method shown in Fig. 4 can be applied in the speech synthesis system shown in Fig. 3.
As shown in Fig. 4, in a preferred embodiment the method further includes, before step S30:
S10: collecting several pieces of training material information, and preprocessing each piece of training material information to extract training feature information, wherein the training material information includes text information and at least one of voice information and image information.
S20: training the prediction model according to each piece of training feature information.
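A minimal sketch of steps S10 and S20 followed by prediction, assuming scikit-learn and random stand-in data; the feature dimensions, the target layout, and the choice of MLPRegressor are illustrative assumptions (the patent itself only names logistic regression and deep neural network models as candidates):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# S10: each row stands in for a training feature vector built by
# concatenating text, voice (MFCC), and image features; dims are assumed.
X_train = rng.normal(size=(1000, 20 + 13 + 10))
# Targets: acoustic parameter vectors, e.g. f0 plus three formants.
y_train = rng.normal(size=(1000, 4))

# S20: train the prediction model (a small feed-forward network here).
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=300)
model.fit(X_train, y_train)

# At synthesis time (S50): map new synthesis features to acoustic parameters.
acoustic_params = model.predict(rng.normal(size=(1, 43)))
```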
The speech synthesis system and method provided by the above embodiment further collect text information together with at least one of voice information and image information to extract each piece of training feature information and train the prediction model, which extends the range of inputs the prediction model can match and improves prediction accuracy.
Fig. 5 is a schematic structural diagram of the speech synthesis system in a preferred embodiment of the present invention.
As shown in Fig. 5, in a preferred embodiment the feature extraction unit 10 includes:
a text feature extraction subunit 101 for collecting first text information and preprocessing the first text information to extract first text feature information for prediction;
and further, at least one of the following:
a voice feature extraction subunit 103 for collecting first voice information and preprocessing the first voice information to extract first voice feature information for predicting the capture environment and the user's context;
an image feature extraction subunit 105 for collecting first image information and preprocessing the first image information to extract first image feature information for predicting the user's facial expression.
Specifically, besides being used to predict the user's context, the first voice feature information can also be used to predict features of the capture environment, such as how noisy it is, to further improve prediction accuracy.
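One plausible way to derive such a noisiness feature from the first voice information is a frame-energy noise-floor estimate; the quietest-quartile heuristic below is an assumption for illustration:

```python
import numpy as np

def noise_floor_db(samples, frame_len=1024):
    """Estimate environment noisiness from the quietest frames of the
    captured audio (RMS of the lowest-energy quartile, in dBFS)."""
    usable = len(samples) // frame_len * frame_len
    frames = np.asarray(samples[:usable]).reshape(-1, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))      # per-frame energy
    return 20.0 * np.log10(np.quantile(rms, 0.25) + 1e-10)
```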
In a corresponding preferred embodiment, step S30 includes:
S301: collecting first text information and preprocessing the first text information to extract first text feature information for prediction;
and further, at least one of the following:
S302: collecting first voice information and preprocessing the first voice information to extract first voice feature information for predicting the capture environment and the user's context;
S303: collecting first image information and preprocessing the first image information to extract first image feature information for predicting the user's facial expression.
In a preferred embodiment, preprocessing the first text information includes performing text normalization and prosody prediction on the first text information.
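As a toy sketch of these two steps (real front ends also expand dates, units, and abbreviations, and predict prosody with statistical models rather than the punctuation rule used here):

```python
import re

DIGITS = "zero one two three four five six seven eight nine".split()

def normalize_text(text):
    """Expand bare digits to words (a tiny slice of text normalization)."""
    expanded = re.sub(r"\d", lambda m: f" {DIGITS[int(m.group(0))]} ", text)
    return " ".join(expanded.split())

def phrase_breaks(text):
    """Naive prosody prediction: break phrases at punctuation marks."""
    return [c.strip() for c in re.split(r"[,.;:!?]", text) if c.strip()]

print(normalize_text("Gate 3 closes in 5 minutes"))
print(phrase_breaks("Please shut the door, it is cold."))
```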
Preprocessing the first voice information includes performing mel-frequency cepstral coefficient (MFCC) feature extraction and digitization on the first voice information.
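A minimal sketch of this MFCC step using librosa; the file name, sample rate, and per-utterance mean pooling are assumptions:

```python
import librosa

samples, sr = librosa.load("utterance.wav", sr=16000)      # digitization
mfcc = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=13)   # (13, n_frames)
voice_features = mfcc.mean(axis=1)   # one fixed-length vector per utterance
```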
Preprocessing the first image information includes performing face recognition on the first image information and extracting the relevant color, texture, shape, and spatial relationship features.
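A minimal sketch of this image step with OpenCV, assuming a single frame on disk; the Haar-cascade face detector and the joint colour histogram stand in for the face recognition and colour/texture/shape/spatial features the patent names:

```python
import cv2

img = cv2.imread("frame.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
if len(faces):
    x, y, w, h = faces[0]
    face = img[y:y + h, x:x + w]
    # Joint 8x8x8 BGR histogram -> 512-dim colour feature for the face region.
    image_features = cv2.calcHist([face], [0, 1, 2], None, [8, 8, 8],
                                  [0, 256, 0, 256, 0, 256]).flatten()
```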
In a preferred embodiment, the text feature extraction subunit 101 is further used to collect second text information and preprocess the second text information to extract second text feature information for training the prediction model.
The voice feature extraction subunit 103 is further used to collect second voice information and preprocess the second voice information to extract second voice feature information for training the prediction model.
The image feature extraction subunit 105 is further used to collect second image information and preprocess the second image information to extract second image feature information for training the prediction model.
In a corresponding preferred embodiment, step S10 includes:
S101: collecting second text information and preprocessing the second text information to extract second text feature information for training the prediction model;
and further, at least one of the following:
S103: collecting second voice information and preprocessing the second voice information to extract second voice feature information for training the prediction model;
S105: collecting second image information and preprocessing the second image information to extract second image feature information for training the prediction model.
In some embodiments, the prediction model is a logistic regression model; in other embodiments, it is a deep neural network model.
The flow charts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flow chart or block diagram may represent a module, a program segment, or a portion of code, which comprises one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions noted in a block may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flow charts, and combinations of such blocks, can be implemented by a special-purpose hardware-based system that performs the specified functions or operations, or by a combination of special-purpose hardware and computer instructions.
The units or modules described in the embodiments of the present application may be implemented in software or in hardware. The described units or modules may also be arranged in a processor; for example, the prediction unit 30 may be a software program arranged in a computer or smart device, or a separate hardware device that performs prediction. The names of these units or modules do not, in certain cases, limit the units or modules themselves; for example, the prediction unit 30 may also be described as "a comparison unit for matching and scoring feature information against the model to generate parameters".
As another aspect, the present application also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the apparatus of the above embodiments, or a computer-readable storage medium that exists separately and is not fitted into the device. The computer-readable storage medium stores one or more programs, which are used by one or more processors to perform the speech synthesis method described in the present application.
The above description is merely a preferred embodiment of the present application and an explanation of the technical principles applied. Those skilled in the art will appreciate that the scope of the invention involved in the present application is not limited to technical solutions formed by the particular combination of the above technical features, but also covers other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the inventive concept, for example, solutions formed by substituting the features disclosed above with (but not limited to) technical features having similar functions.

Claims (12)

1. A speech synthesis system, characterized in that the system comprises:
a feature extraction unit for collecting several pieces of synthesis material information and preprocessing each piece of synthesis material information to extract synthesis feature information, wherein the synthesis material information includes text information and at least one of voice information and image information;
a prediction unit for performing prediction on each piece of synthesis feature information through a prediction model to generate acoustic parameter information;
a synthesis unit for generating speech synthesis result information according to the acoustic parameter information.
2. The speech synthesis system according to claim 1, characterized in that the feature extraction unit is further used to collect several pieces of training material information and preprocess each piece of training material information to extract training feature information, wherein the training material information includes text information and at least one of voice information and image information;
and the system further comprises:
a model training unit for training the prediction model according to each piece of training feature information.
3. The speech synthesis system according to claim 1 or 2, characterized in that the feature extraction unit comprises:
a text feature extraction subunit for collecting first text information and preprocessing the first text information to extract first text feature information for prediction;
and further, at least one of the following:
a voice feature extraction subunit for collecting first voice information and preprocessing the first voice information to extract first voice feature information for predicting the capture environment and the user's context;
an image feature extraction subunit for collecting first image information and preprocessing the first image information to extract first image feature information for predicting the user's facial expression.
4. The speech synthesis system according to claim 3, characterized in that preprocessing the first text information comprises performing text normalization and prosody prediction on the first text information;
preprocessing the first voice information comprises performing mel-frequency cepstral coefficient (MFCC) feature extraction and digitization on the first voice information;
and preprocessing the first image information comprises performing face recognition on the first image information and extracting the relevant color, texture, shape, and spatial relationship features.
5. The speech synthesis system according to claim 3, characterized in that the text feature extraction subunit is further used to collect second text information and preprocess the second text information to extract second text feature information for training the prediction model;
the voice feature extraction subunit is further used to collect second voice information and preprocess the second voice information to extract second voice feature information for training the prediction model;
and the image feature extraction subunit is further used to collect second image information and preprocess the second image information to extract second image feature information for training the prediction model.
6. The speech synthesis system according to claim 1, characterized in that the prediction model is a logistic regression model or a deep neural network model.
7. A speech synthesis method, characterized in that the method comprises:
collecting several pieces of synthesis material information and preprocessing each piece of synthesis material information to extract synthesis feature information, wherein the synthesis material information includes text information and at least one of voice information and image information;
performing prediction on each piece of synthesis feature information through a prediction model to generate acoustic parameter information;
generating speech synthesis result information according to the acoustic parameter information.
8. The speech synthesis method according to claim 7, characterized in that, before collecting several pieces of synthesis material information and preprocessing each piece of synthesis material information to extract synthesis feature information, the method further comprises:
collecting several pieces of training material information and preprocessing each piece of training material information to extract training feature information, wherein the training material information includes text information and at least one of voice information and image information;
and training the prediction model according to each piece of training feature information.
9. The speech synthesis method according to claim 7 or 8, characterized in that collecting several pieces of synthesis material information and preprocessing each piece of synthesis material information to extract synthesis feature information comprises:
collecting first text information and preprocessing the first text information to extract first text feature information for prediction;
and further, at least one of the following:
collecting first voice information and preprocessing the first voice information to extract first voice feature information for predicting the capture environment and the user's context;
collecting first image information and preprocessing the first image information to extract first image feature information for predicting the user's facial expression.
10. The speech synthesis method according to claim 9, characterized in that preprocessing the first text information comprises performing text normalization and prosody prediction on the first text information;
preprocessing the first voice information comprises performing mel-frequency cepstral coefficient (MFCC) feature extraction and digitization on the first voice information;
and preprocessing the first image information comprises performing face recognition on the first image information and extracting the relevant color, texture, shape, and spatial relationship features.
11. The speech synthesis method according to claim 8, characterized in that collecting several pieces of training material information and preprocessing each piece of training material information to extract training feature information comprises:
collecting second text information and preprocessing the second text information to extract second text feature information for training the prediction model;
and further, at least one of the following:
collecting second voice information and preprocessing the second voice information to extract second voice feature information for training the prediction model;
collecting second image information and preprocessing the second image information to extract second image feature information for training the prediction model.
12. The speech synthesis method according to claim 7, characterized in that the prediction model is a logistic regression model or a deep neural network model.
CN201610236400.1A 2016-04-15 2016-04-15 Voice synthesis system and method Pending CN105931631A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610236400.1A CN105931631A (en) 2016-04-15 2016-04-15 Voice synthesis system and method

Publications (1)

Publication Number Publication Date
CN105931631A true CN105931631A (en) 2016-09-07

Family

ID=56838218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610236400.1A Pending CN105931631A (en) 2016-04-15 2016-04-15 Voice synthesis system and method

Country Status (1)

Country Link
CN (1) CN105931631A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5704007A (en) * 1994-03-11 1997-12-30 Apple Computer, Inc. Utilization of multiple voice sources in a speech synthesizer
US5890115A (en) * 1997-03-07 1999-03-30 Advanced Micro Devices, Inc. Speech synthesizer utilizing wavetable synthesis
CN1460232A * 2001-03-29 2003-12-03 Koninklijke Philips Electronics N.V. Text to visual speech system and method incorporating facial emotions
WO2002084643A1 (en) * 2001-04-11 2002-10-24 International Business Machines Corporation Speech-to-speech generation system and method
CN1379392A * 2001-04-11 2002-11-13 International Business Machines Corp. Emotional speech-to-speech translation system and method
CN101064104A * 2006-04-24 2007-10-31 Institute of Automation, Chinese Academy of Sciences Emotion voice creating method based on voice conversion
CN101474481A * 2009-01-12 2009-07-08 University of Science and Technology Beijing Emotional robot system

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106328139A * 2016-09-14 2017-01-11 Nubia Technology Co., Ltd. Voice interaction method and voice interaction system
CN107437413A * 2017-07-05 2017-12-05 Baidu Online Network Technology (Beijing) Co., Ltd. Voice broadcasting method and device
WO2019007308A1 * 2017-07-05 2019-01-10 Baidu Online Network Technology (Beijing) Co., Ltd. Voice broadcasting method and device
CN110288077A * 2018-11-14 2019-09-27 Tencent Technology (Shenzhen) Co., Ltd. Method and related device for synthesizing speaking expression based on artificial intelligence
CN110288077B * 2018-11-14 2022-12-16 Tencent Technology (Shenzhen) Co., Ltd. Method and related device for synthesizing speaking expression based on artificial intelligence
CN110047462B * 2019-01-31 2021-08-13 Beijing Jietong Huasheng Technology Co., Ltd. Speech synthesis method and device, and electronic equipment
CN110047462A * 2019-01-31 2019-07-23 Beijing Jietong Huasheng Technology Co., Ltd. Speech synthesis method and device, and electronic equipment
WO2021004113A1 * 2019-07-05 2021-01-14 Shenzhen OneConnect Smart Technology Co., Ltd. Speech synthesis method and apparatus, computer device and storage medium
CN111312210A * 2020-03-05 2020-06-19 Unisound Intelligent Technology Co., Ltd. Text-text fused voice synthesis method and device
WO2022048405A1 * 2020-09-01 2022-03-10 Mofa (Shanghai) Information Technology Co., Ltd. Text-based virtual object animation generation method, apparatus, storage medium, and terminal
US11908451B2 2020-09-01 2024-02-20 Mofa (Shanghai) Information Technology Co., Ltd. Text-based virtual object animation generation method, apparatus, storage medium, and terminal
CN114945110A * 2022-05-31 2022-08-26 Shenzhen Ubtech Technology Co., Ltd. Talking head video synthesis method and device, terminal equipment and readable storage medium
CN114945110B * 2022-05-31 2023-10-24 Shenzhen Ubtech Technology Co., Ltd. Talking head video synthesis method and device, terminal equipment and readable storage medium

Legal Events

Code Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2016-09-07)