EP1271469A1 - Method for generating personality patterns and method for synthesizing speech

Method for generating personality patterns and method for synthesizing speech

Info

Publication number
EP1271469A1
Authority
EP
European Patent Office
Prior art keywords
speech
features
anyone
acoustical
synthesizing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP01115216A
Other languages
German (de)
English (en)
Inventor
Krzysztof Marasek
Thomas Kemp
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony Deutschland GmbH
Original Assignee
Sony International Europe GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony International Europe GmbH filed Critical Sony International Europe GmbH
Priority to EP01115216A
Publication of EP1271469A1 (fr)
Legal status: Withdrawn (current)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 - Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L13/033 - Voice editing, e.g. manipulating the voice of the synthesiser

Definitions

  • The present invention relates to a method for generating personality patterns and to a method for synthesizing speech.
  • Man-machine dialogue systems must ensure easy and reliable use by a human user.
  • These man-machine dialogue systems are enabled to receive and consider users' utterances, in particular orders and/or inquiries, and to react and respond in an appropriate way.
  • However, current speech synthesis systems involved in such man-machine dialogue systems suffer from a lack of personality and naturalness.
  • Although the systems are able to deal with the context of the situation in an appropriate way, the prepared and output speech of the dialogue system often sounds monotonic, machine-like, and not embedded in the particular situation.
  • The object is achieved by a method for generating personality patterns, in particular for synthesizing speech, with the features of claim 1. Furthermore, the object is achieved by a method for synthesizing speech according to the characterizing features of claim 11.
  • A system and a computer program product for carrying out the inventive methods are the subject-matter of claims 14 and 15, respectively. Preferred embodiments of the inventive methods are within the scope of the dependent subclaims.
  • In the inventive method, a speech input is received and/or preprocessed.
  • Acoustical and/or non-acoustical speech features are extracted.
  • From these features, a personality pattern is generated and/or stored.
  • Online input speech and/or speech from a speech database for at least one given speaker can be used as said speech input.
  • A speech database enables a system involving the inventive method to generate the personality patterns in advance of an application. That means that, before the system is applied, for example in a speech synthesizing unit, a speech model for a single speaker or for a variety of speakers can be constructed.
  • Alternatively, the personality patterns can be generated during the application in a speech synthesizing unit in a real-time or online manner, so as to adapt a speech output generated in a dialogue system during the application and/or during the dialogue with the user.
  • Within the class of prosodic features, pitch, pitch range, intonation attitude, loudness, speaking rate, phone duration, speech element duration features, and/or the like can be employed.
  • Within the class of voice quality features, phonation type, articulation manner, voice timbre features, and/or the like can be employed.
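As an illustration only (the patent prescribes no implementation), the following minimal sketch shows how such prosodic features might be extracted, assuming the Python library librosa; the onset-based speaking-rate proxy and all numeric bounds are the author's assumptions.

```python
# Illustrative sketch of prosodic feature extraction; not part of the patent.
import numpy as np
import librosa

def extract_prosodic_features(path):
    y, sr = librosa.load(path, sr=16000)
    # Pitch track via probabilistic YIN; NaN frames are unvoiced.
    f0, _, _ = librosa.pyin(y, fmin=65.0, fmax=1000.0, sr=sr)
    f0 = f0[~np.isnan(f0)]
    # Loudness proxy: per-frame root-mean-square energy.
    rms = librosa.feature.rms(y=y)[0]
    # Crude speaking-rate proxy: onset events per second.
    onsets = librosa.onset.onset_detect(y=y, sr=sr)
    duration = len(y) / sr
    return {
        "pitch_mean_hz": float(np.mean(f0)) if f0.size else 0.0,
        "pitch_range_hz": float(np.ptp(f0)) if f0.size else 0.0,
        "loudness_mean": float(np.mean(rms)),
        "onsets_per_second": len(onsets) / duration,
    }
```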
  • Contextual features and/or the like may also be important in accordance with a further advantageous embodiment of the present invention.
  • Syntactical, grammatical, semantical features, and/or the like can be used as contextual features.
  • A process of speech recognition is preferably carried out within the inventive method.
  • A process of speaker identification and/or adaptation can be performed, in particular so as to increase the matching rate of the feature extraction and/or the recognition rate of the process of speech recognition.
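Purely as a sketch of what a speaker identification step could look like (the patent does not specify one), here is a per-speaker Gaussian mixture model over MFCC frames, assuming librosa and scikit-learn; modern systems would use dedicated speaker-embedding models instead.

```python
# Hedged sketch: GMM-based speaker identification over MFCCs.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path):
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # (frames, 13)

def train_speaker_models(files_per_speaker):
    """files_per_speaker: dict mapping speaker id -> list of audio paths."""
    models = {}
    for speaker, files in files_per_speaker.items():
        X = np.vstack([mfcc_frames(f) for f in files])
        models[speaker] = GaussianMixture(n_components=8, random_state=0).fit(X)
    return models

def identify(models, path):
    X = mfcc_frames(path)
    # Pick the speaker whose model gives the highest average log-likelihood.
    return max(models, key=lambda s: models[s].score(X))
```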
  • In the inventive method for synthesizing speech, in particular for a man-machine dialogue system, the inventive method for generating personality patterns is employed.
  • The method for generating personality patterns is essentially carried out in a preprocessing step, in particular based on a speech database or the like.
  • Alternatively, the method for generating personality patterns can be carried out and/or continued in a continuous, real-time, or online manner. This enables a system involving said method for synthesizing speech to adapt its speech output in accordance with the received input during the dialogue.
  • Both of the methods, for generating personality patterns and for synthesizing speech, can be configured to create a personality pattern or a speech output which is in some sense complementary to the personality pattern or character assigned to the speaker of the speech input. That means, for instance, that in the case of an emergency call system for activating ambulance or fire alarm services, the speaker of the speech input might be excited and/or confused. It might therefore be necessary to calm down the speaking person, and this can be achieved by creating a personality pattern for the speech synthesis reflecting a strong, confident, and safe character. Additionally, it might also be possible to construct personality patterns for the synthesized speech output which reflect a gender complementary to the gender of the speaker of the speech input, i.e. in the case of a male speaker, the system might respond as a female speaker so as to make the dialogue most convenient for the speaking person.
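To make the "complementary" idea concrete, here is a hypothetical rule table in the spirit of the emergency-call example; the state labels, scale factors, and gender rule are illustrative assumptions, not values from the patent.

```python
# Illustrative rule table for choosing a complementary response personality.
COMPLEMENTARY_PERSONALITY = {
    # detected speaker state -> synthesis personality to respond with
    "excited":  {"rate_scale": 0.8,  "volume": 0.9, "style": "calm"},
    "confused": {"rate_scale": 0.85, "volume": 1.0, "style": "confident"},
    "neutral":  {"rate_scale": 1.0,  "volume": 1.0, "style": "neutral"},
}

def choose_personality(speaker_state, speaker_gender):
    base = COMPLEMENTARY_PERSONALITY.get(
        speaker_state, COMPLEMENTARY_PERSONALITY["neutral"]
    )
    pattern = dict(base)
    # Optionally respond with the complementary gender, as suggested above.
    pattern["voice_gender"] = "female" if speaker_gender == "male" else "male"
    return pattern
```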
  • A computer program product comprising computer program means adapted to perform and/or realize the inventive method for generating personality patterns and/or for synthesizing speech, and/or the steps thereof, when it is executed on a computer, a digital signal processing means, and/or the like, is also provided.
  • For a given speaker, both the relevant voice quality features and the speech itself - as described by any units, such as words, syllables, diphones, sentences, and/or the like - are automatically extracted according to the invention. Information about preferred sentence structure and word usage is also extracted and used to create a speech synthesis system with those characteristics in a completely unsupervised way.
  • The proposed methods can be used not only to mimic the actual speaker talking to the device but also to equip the device with different personalities, e.g. gathered from the speaking style of famous people, movie stars, or the like. This can be very attractive for potential customers.
  • The proposed system can be used not only to mimic the speaker's behavior but, more generally, to control the dialogue depending on the changing speaking style and emotions of the human partner.
  • The collection of features describing the speaker's personality can be done on different levels during the conversation between the human and a dialogue unit.
  • The speech signal has to be recorded and segmented into phones, diphones, and/or other speech units or speech elements, depending on the speech synthesis method used in the system.
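As a rough illustration of segmentation (a real system would use forced alignment to obtain phones or diphones), the following sketch, assuming librosa, merely splits the recording at silences:

```python
# Crude, illustrative segmentation: split speech at silent stretches.
import librosa

def split_into_units(path, top_db=30):
    y, sr = librosa.load(path, sr=16000)
    # Intervals of non-silent audio, as (start, end) sample indices.
    intervals = librosa.effects.split(y, top_db=top_db)
    return [y[start:end] for start, end in intervals], sr
```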
  • Prosodic features like pitch, pitch range, attitude of sentence intonation (monotonous or affected), loudness, speaking rate, durations of phones, and/or the like can be collected to characterize the speaker's prosody.
  • Voice quality features like phonation type, articulation manner, voice timbre, and/or the like can be automatically extracted from the collected speech data.
  • Speaker identification or a speaker identification module is necessary for the proper functioning of the system.
  • The system can also collect all the words recognized from the utterances spoken by the speaker and generate and evaluate statistics on their usage. This can be used to find the most frequent phrases and words used by a given speaker, and/or the like. Syntactic information gathered from the recognized phrases can also enhance the quality of the personality description.
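A minimal sketch of such usage statistics over recognizer transcripts, using only the Python standard library; treating bigrams as "phrases" is an assumption for illustration.

```python
# Word- and phrase-usage statistics over recognized transcripts.
from collections import Counter

def usage_statistics(transcripts, n=2):
    words, ngrams = Counter(), Counter()
    for text in transcripts:
        tokens = text.lower().split()
        words.update(tokens)
        # n-grams stand in for frequently used phrases.
        ngrams.update(zip(*(tokens[i:] for i in range(n))))
    return words.most_common(20), ngrams.most_common(20)
```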
  • The dialogue system can adjust parameters and units of acoustic output - for example the synthesized waveforms or the like - and modes of text generation to suit the recognized speaker's characteristics.
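How such adjustment of the acoustic output might look in practice is sketched below, assuming the pyttsx3 text-to-speech library; the rate_scale and volume keys are the hypothetical personality parameters used in the sketches above.

```python
# Hedged sketch: apply a stored personality pattern to synthesized output.
import pyttsx3

def speak_with_personality(text, pattern):
    engine = pyttsx3.init()
    base_rate = engine.getProperty("rate")
    engine.setProperty("rate", int(base_rate * pattern.get("rate_scale", 1.0)))
    engine.setProperty("volume", pattern.get("volume", 1.0))
    engine.say(text)
    engine.runAndWait()
```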
  • The parameterized personality can be stored for future use or can be preprogrammed in the dialogue device.
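Storing the parameterized personality for future use could be as simple as serializing the feature dictionary; a standard-library sketch:

```python
# Persist and reload a parameterized personality pattern as JSON.
import json

def save_personality(pattern, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(pattern, f, indent=2)

def load_personality(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```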
  • The information can be used to recognize speakers and to change the personality of the system depending on the user's preference or mood, for example in the case of a system with a built-in emotion recognition engine.
  • The personality can be changed according to the user's wish, according to a preprogrammed sequence, or depending on the speaker's changing style and emotions.
  • The main advantage of such a system is the possibility of adapting the dialogue to the given speaker, making the dialogue more attractive, and/or the like.
  • The possibility of mimicking certain speakers or switching between different personalities or speaking styles can be very entertaining and attractive for the user.
  • FIG. 1 shows a preferred embodiment of the inventive method for synthesizing speech, employing an embodiment of the inventive method for generating personality patterns from a given received speech input SI.
  • In step S1, the speech input SI is received.
  • In a first section S10 of the inventive method for synthesizing speech, non-acoustical features are extracted from the received speech input SI.
  • In a second section S20, acoustical features are extracted from the received speech input SI.
  • The sections S10 and S20 can be performed in parallel or sequentially on a given device or apparatus.
  • For extracting non-acoustical features from the speech input SI, in a first step S11, speech parameters are extracted from said speech input SI.
  • The speech input SI is then fed into a speech recognizer to analyze the content and the context of the received speech input SI.
  • Next, contextual features are extracted from said speech input SI; in particular, syntactical, semantical, grammatical, and statistical information on particular speech elements is obtained.
  • The second section S20 of the inventive method for synthesizing speech consists of three steps S21, S22, and S23, which can be performed independently of each other.
  • Prosodic features are extracted from the received speech input SI.
  • Said prosodic features may comprise pitch, pitch range, intonation attitude, loudness, speaking rate, speech element duration features, and/or the like.
  • Voice quality features are also extracted from the given received speech input SI, for instance phonation type, articulation manner, voice timbre features, and/or the like.
  • The non-acoustical features and the acoustical features obtained from sections S10 and S20 are merged in a following postprocessing step S30 to detect, model, and store a personality pattern PP for the given speaker.
  • The data describing the personality pattern PP for the current speaker are fed into a following step S40, which includes speech synthesis, text generation, and dialogue management, from which a responsive speech output SO is generated and then output in a final step S50.
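Tying the steps of FIG. 1 together, the sketch below wires the hypothetical helpers from the previous sketches in the order S1, S10/S20, S30, S40/S50; the recognizer and respond callables stand in for components the patent leaves unspecified.

```python
# Illustrative pipeline only; all helpers are the hypothetical sketches above.
def dialogue_turn(audio_path, recognizer, respond):
    # S1: receive the speech input.
    # S10: non-acoustical features (recognition, contextual statistics).
    transcript = recognizer(audio_path)          # assumed: path -> text
    words, phrases = usage_statistics([transcript])
    # S20: acoustical features (prosody; voice quality omitted for brevity).
    prosody = extract_prosodic_features(audio_path)
    # S30: merge both feature sets into a personality pattern PP and store it.
    pattern = {
        "prosody": prosody,
        "frequent_words": words,
        "frequent_phrases": [(" ".join(p), c) for p, c in phrases],
    }
    save_personality(pattern, "personality.json")
    # S40/S50: generate a response adapted to the pattern and output it.
    reply = respond(transcript, pattern)         # assumed: dialogue manager
    speak_with_personality(reply, {"rate_scale": 1.0, "volume": 1.0})
```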

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
EP01115216A 2001-06-22 2001-06-22 Method for generating personality patterns and method for synthesizing speech Withdrawn EP1271469A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP01115216A EP1271469A1 (fr) 2001-06-22 2001-06-22 Method for generating personality patterns and method for synthesizing speech

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
EP01115216A EP1271469A1 (fr) 2001-06-22 2001-06-22 Method for generating personality patterns and method for synthesizing speech

Publications (1)

Publication Number Publication Date
EP1271469A1 (fr) 2003-01-02

Family

ID=8177799

Family Applications (1)

Application Number Title Priority Date Filing Date
EP01115216A Withdrawn EP1271469A1 (fr) 2001-06-22 2001-06-22 Method for generating personality patterns and method for synthesizing speech

Country Status (1)

Country Link
EP (1) EP1271469A1 (fr)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004068466A1 (fr) * 2003-01-24 2004-08-12 Voice Signal Technologies, Inc. Prosodic mimic method and apparatus
WO2005081508A1 (fr) * 2004-02-17 2005-09-01 Voice Signal Technologies, Inc. Methods and apparatus for replaceable customization of multimodal embedded interfaces
EP2147429A1 (fr) * 2007-05-24 2010-01-27 Microsoft Corporation Personality-based device
US7873390B2 (en) 2002-12-09 2011-01-18 Voice Signal Technologies, Inc. Provider-activated software for mobile communication devices
WO2014024399A1 (fr) * 2012-08-10 2014-02-13 Casio Computer Co., Ltd. Content reproduction control device, content reproduction control method, and program therefor
US9363378B1 (en) 2014-03-19 2016-06-07 Noble Systems Corporation Processing stored voice messages to identify non-semantic message characteristics
US9865281B2 (en) 2015-09-02 2018-01-09 International Business Machines Corporation Conversational analytics
CN110751940A (zh) * 2019-09-16 2020-02-04 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus, device, and computer storage medium for generating a voice package

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5278943A (en) * 1990-03-23 1994-01-11 Bright Star Technology, Inc. Speech animation and inflection system
WO1999012324A1 (fr) * 1997-09-02 1999-03-11 Jack Hollins Natural language conversation system simulating the voices of known personalities, activated by a telephone card
US6144938A (en) * 1998-05-01 2000-11-07 Sun Microsystems, Inc. Voice user interface with personality

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JANET E. CAHN: "The Generation of Affect in Synthesized Speech", JOURNAL OF THE AMERICAN VOICE I/O SOCIETY, vol. 8, July 1990 (1990-07-01), pages 1 - 19, XP002183399, Retrieved from the Internet <URL:http://www.media.mit.edu/~cahn/masters-thesis.htm> [retrieved on 20011120] *
KLASMEYER ET AL: "The perceptual importance of selected voice quality parameters", ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 1997. ICASSP-97., 1997 IEEE INTERNATIONAL CONFERENCE ON MUNICH, GERMANY 21-24 APRIL 1997, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 21 April 1997 (1997-04-21), pages 1615 - 1618, XP010226301, ISBN: 0-8186-7919-0 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7873390B2 (en) 2002-12-09 2011-01-18 Voice Signal Technologies, Inc. Provider-activated software for mobile communication devices
WO2004068466A1 (fr) * 2003-01-24 2004-08-12 Voice Signal Technologies, Inc. Prosodic mimic method and apparatus
US8768701B2 (en) 2003-01-24 2014-07-01 Nuance Communications, Inc. Prosodic mimic method and apparatus
CN1742321B (zh) * 2003-01-24 2010-08-18 语音信号科技公司 韵律模仿合成方法和装置
WO2005081508A1 (fr) * 2004-02-17 2005-09-01 Voice Signal Technologies, Inc. Methods and apparatus for replaceable customization of multimodal embedded interfaces
US8285549B2 (en) 2007-05-24 2012-10-09 Microsoft Corporation Personality-based device
US8131549B2 (en) * 2007-05-24 2012-03-06 Microsoft Corporation Personality-based device
AU2008256989B2 (en) * 2007-05-24 2012-07-19 Microsoft Technology Licensing, Llc Personality-based device
EP2147429A4 (fr) * 2011-10-19 Personality-based device
EP2147429A1 (fr) * 2007-05-24 2010-01-27 Microsoft Corporation Personality-based device
WO2014024399A1 (fr) * 2012-08-10 2014-02-13 Casio Computer Co., Ltd. Content reproduction control device, content reproduction control method, and program therefor
US9363378B1 (en) 2014-03-19 2016-06-07 Noble Systems Corporation Processing stored voice messages to identify non-semantic message characteristics
US9865281B2 (en) 2015-09-02 2018-01-09 International Business Machines Corporation Conversational analytics
US9922666B2 (en) 2015-09-02 2018-03-20 International Business Machines Corporation Conversational analytics
US11074928B2 (en) 2015-09-02 2021-07-27 International Business Machines Corporation Conversational analytics
CN110751940A (zh) * 2019-09-16 2020-02-04 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus, device, and computer storage medium for generating a voice package

Similar Documents

Publication Publication Date Title
JP7355306B2 (ja) Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium
KR100811568B1 (ko) Method and apparatus for preventing speech comprehension by interactive voice response systems
Shichiri et al. Eigenvoices for HMM-based speech synthesis.
US7966186B2 (en) System and method for blending synthetic voices
JP4884212B2 (ja) Speech synthesizer
US20200251104A1 (en) Content output management based on speech quality
JPH10507536A (ja) Language recognition
JP5507260B2 (ja) Systems and techniques for producing spoken voice prompts
WO2007148493A1 (fr) Emotion recognition device
CA2167200A1 (fr) Systeme de reconnaissance vocale multilangue
EP1280137B1 (fr) Procédé de reconnaissance du locuteur
JP2006517037A (ja) Prosodic mimic synthesis method and apparatus
JP2011186143A (ja) Speech synthesis device that learns user behavior, speech synthesis method, and program therefor
EP1271469A1 (fr) Method for generating personality patterns and method for synthesizing speech
Levinson et al. Speech synthesis in telecommunications
O'Shaughnessy Modern methods of speech synthesis
US20230148275A1 (en) Speech synthesis device and speech synthesis method
Creer et al. Building personalized synthetic voices for individuals with dysarthria using the HTS toolkit
US20230146945A1 (en) Method of forming augmented corpus related to articulation disorder, corpus augmenting system, speech recognition platform, and assisting device
JP3706112B2 (ja) Speech synthesizer and computer program
Westall et al. Speech technology for telecommunications
Carlson Synthesis: Modeling variability and constraints
Nthite et al. End-to-End Text-To-Speech synthesis for under resourced South African languages
Houidhek et al. Evaluation of speech unit modelling for HMM-based speech synthesis for Arabic
KR102116014B1 (ko) Speaker voice-mimicking system using a speech recognition engine and a voice-mimicking speech synthesis engine

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR

AX Request for extension of the european patent

Free format text: AL;LT;LV;MK;RO;SI

AKX Designation fees paid
REG Reference to a national code

Ref country code: DE

Ref legal event code: 8566

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20030703