EP3144929A1 - Synthetic generation of a natural-sounding speech signal - Google Patents

Info

Publication number
EP3144929A1
Authority
EP
European Patent Office
Prior art keywords
speech
signal
synthesis
parameter
speech signal
Prior art date
Legal status
Ceased
Application number
EP15185879.2A
Other languages
German (de)
English (en)
Inventor
Felix Burkhardt
Current Assignee
Deutsche Telekom AG
Original Assignee
Deutsche Telekom AG
Priority date
Filing date
Publication date
Application filed by Deutsche Telekom AG filed Critical Deutsche Telekom AG
Priority to EP15185879.2A
Publication of EP3144929A1
Legal status: Ceased

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033: Voice editing, e.g. manipulating the voice of the synthesiser

Definitions

  • the invention relates to a solution for automated speech synthesis which makes it possible to generate speech sequences that sound particularly natural to the listener addressed by them, owing to the simulation of an emotional manner of speaking. It refers to a solution which, by including extralinguistic and/or paralinguistic features, goes well beyond the purely linguistic generation of synthetic speech.
  • Objects of the invention are a corresponding method and a system suitable for carrying out this method.
  • one goal in the development of speech synthesis solutions is to include the aforementioned extralinguistic and paralinguistic features in the synthesis and thereby to obtain natural-sounding, synthetically generated speech sequences. This is guided by the consideration that it then becomes possible to influence the behavior of people more effectively, and ultimately in a more goal-directed manner, during a dialogue conducted in the course of human-machine communication.
  • average values of the abovementioned feature groups are measured for each specific target emotion, and these feature groups are then specifically adapted during the synthesis by changing the corresponding parameters. For example, for the emotion "exultant joy" the duration of voiced fricatives is prolonged, or the voice quality is changed from "modal" to "breathy" to achieve the target emotion "mourning".
  • the object of the invention is to provide a solution for automated speech synthesis which leads to speech sequences whose acoustic rendering of emotional speech sounds even more natural than is the case in the prior art.
  • a method and a system suitable for carrying out the method are to be specified.
  • a speech signal generated as part of an automated speech synthesis is influenced to simulate an emotional way of speaking.
  • this is done by generating, during the speech synthesis, a raw speech signal that is not yet emotionally charged and specifically modulating it with a parameter mixture before the generated speech signal is output.
  • this parameter mixture comprises parameters from a plurality of feature groups, corresponding to melody features, duration features, voice features, or features relating to the articulation accuracy of the speech.
  • the parameters set in this respect are parameters from feature groups with which at least two predefined target emotions are associated.
  • a raw speech signal is a signal consisting of phonemes with a parameter assignment that is neutral with regard to the above-mentioned feature groups, that is, in particular with regard to prosody and voice quality, and which does not yet take the form of an acoustically reproducible audio signal.
  • the system proposed for solving the problem and suitable for carrying out the method explained above is a specially designed, hardware- and software-based speech synthesizer.
  • it consists of a phonemization component encompassed by an input stage of the synthesizer, an emotion simulator, and the actual synthesis unit comprised by an output stage equipped with means for the acoustic output of the generated speech signal.
  • Decisive for the solution of the problem is the way in which the aforementioned components, which as such may well be known from the prior art, are brought into operative connection with one another in the speech synthesizer according to the invention.
  • the phonemes generated by the phonemization component are neither transferred to the synthesis unit directly, nor are their speech parameters, in particular those determining the prosody and/or the voice quality, merely influenced in the direction of a single target emotion. Rather, the phonemes provided with the neutral prosody description are modulated by means of a parameter mixture before being transferred to the synthesis unit.
  • the corresponding parameter mixture is generated by the emotion simulator belonging to the speech synthesizer. It evaluates the specification of at least two target emotions provided with the transferred text that is to be converted into synthetic speech, and influences, in accordance with this specification, the parameters of at least two different speech feature groups associated with these target emotions.
  • the parameters influenced in this way with regard to their respective characteristics, with which emotions such as joy, grief and anger are associated, are furthermore mixed by the emotion simulator, that is, combined to form a parameter mixture and applied to the raw speech signal, thereby modulating it (a schematic code sketch of such parameter mixing is given at the end of this section).
  • the synthesis component then generates, according to one of the known methods, the desired synthetic speech signal from the raw speech signal modulated with the parameter mixture, and outputs it.
  • the first example refers to a robot that can express multiple emotions in parallel. It is assumed that a new message has arrived for the owner of this robot and that, based on content analysis or a sender marking, it is found to have joyful content. The robot then uses its ability to express multiple emotions simultaneously, blending "surprise" (at the arrival of the message) with "joy" in order to prepare the owner for the message's enjoyable content. For its speech output, that is, for the actual speech synthesis, the robot uses software from the Mbrola project, which is known from the prior art and based on diphone synthesis (see [3]).
  • diphone synthesis generates speech by concatenating individual two-sound units (diphones) taken from a database and subsequently adapting the prosody (influencing the pitch melody and the sound durations) according to the PSOLA method (Pitch Synchronous Overlap and Add; a compact sketch of this idea is given at the end of this section).
  • the robot described here by way of example has three diphone databases for different voice qualities, namely one inventory each for relaxed, normal and tense speech. In the present case, the robot uses its stored parameter data for the two target emotions "surprise" and "joy".
  • a phonemization component, for example "Mary" (see [4]), generates a raw speech signal as a neutral version of the target utterance (the synthetic speech with the simultaneously expressed emotions "surprise" and "joy"), consisting of the phonemes to be used and a neutral prosody description.
  • this raw speech signal is then modified or modulated in accordance with the changes described above, for example by means of the "Emofilt" software (see [5]).
  • the synthesizer based on "Mbrola" (see [3]) is then passed, as already stated, the modified speech signal description together with a reference to the diphone database to be used, and from this it generates the speech signal that expresses both surprise and joy.
  • Another embodiment relates to a virtual agent in a game.
  • the user interacts with the agent, inter alia, by voice.
  • its mood changes based on events that happen in the game. For example, the agent learned in the morning that he has won the lottery and is therefore generally in a good mood. Now, in a game situation, he knocks over a valuable vase, which breaks, and he utters a furious exclamation, which is mitigated by the influence of his positive mood.
  • the generation of the mixed emotional expression in the speech is performed here, for example, by formant synthesis, as described in [2].
  • the phonemization component, which is again based on "Mary" (see [4]), generates from the text to be spoken a canonical pronunciation variant together with a neutral prosody description as the raw speech signal. This is modified by the emotionalization component "Emofilt" (see [5]) in such a way that a duration and melody progression typical of the emotion "anger" results: the overall fundamental frequency level is increased by 150%, the range (the pitch range of the voice) is widened by 40%, and the contour receives a final rise in frequency. All syllables are accelerated by 30%, and the stressed ones by a further 20% (a code sketch of this prosody transformation is given at the end of this section).
  • the role of the synthesizer is taken over in this case not by "Mbrola" (see [3]) but by the "EmoSyn" software (see [2]), which controls a Klatt formant synthesizer by means of a combination of parameter templates for voiced sounds and rules for unvoiced sounds. All aspects of the acoustic speech signal can be modeled as the result of a source-filter system, that is, the acoustic characteristics of the laryngeal excitation signal and of the vocal tract can be controlled parametrically (a minimal source-filter sketch is given at the end of this section).
  • the features of the "articulation accuracy" and "vocal sound" groups are modeled according to the second target emotion, "satisfaction". That is, for the voice quality a rather breathy speech is used, in which (according to [1], p. 218) the excitation signal is adjusted via the Liljencrants-Fant parameters open quotient, spectral tilt, bandwidth of the first formant, amplitude of the voiced portion and amplitude of the noise component. The formants are raised slightly, as this corresponds to smiling speech. The articulation accuracy is increased, that is, a "vowel target overshoot" is modeled by decentralizing the first two formants.
  • Fig. 1 shows a first possible embodiment of the system 1 according to the invention, namely a speech synthesizer, in a highly simplified schematic representation. All components of the speech synthesizer shown are software-based and hardware-based components.
  • as can be seen in Fig. 1, the speech synthesizer 1 essentially consists of the input stage 2 with the phonemization component 3, the emotion simulator 4, and the output stage 5 with the actual synthesis unit 6 and with means (for example loudspeakers, not shown here) for the acoustic output.
  • the phonemization component 3 is formed, for example, by the phonemization system "Mary" known for this purpose.
  • the emotion simulator 4 is based on the "Emofilt" software created for the production of emotional speech.
  • this component relies on a database in which parameters for the emotions "joy", "grief" and "anger" are kept in four independent feature groups, namely melody features, duration features, articulation features and voice features.
  • by means of the "Emofilt" software, the relevant parameters are adjusted in accordance with at least two target emotions transferred to the system 1 together with the text to be converted into speech, and the parameters adjusted in this way are combined into a mixture by program routines additionally provided as part of the emotion simulator 4; by means of this mixture, the phonemes generated in the phonemization component 3 and forming a raw speech signal are modulated. As the synthesis unit 6, the embodiment shown in Fig. 1 uses a component in which the "Mbrola" software is implemented.
  • this component based on "Mbrola" generates, from the modulated raw speech signal with its originally neutral speech signal parameter assignment - that is, from phonemes which are in any case neutral in their phonetics ("Mary" does not provide the phonemes with information or parameters concerning the voice quality) and which carry an equally neutral prosody, both having been modulated according to the two target emotions - the synthetic speech signal that is finally output by the (not shown) output means of the speech synthesizer 1.
  • Fig. 2 shows a further embodiment of a speech synthesizer 1 designed according to the invention.
  • this speech synthesizer 1 comprises basically the same components as the speech synthesizer 1 according to the embodiment explained above.
  • the components in question also work together in the same way.
  • in this speech synthesizer 1, compared to the embodiment of Fig. 1, only the actual synthesis unit 6 has been replaced by another embodiment of such a unit.
  • the actual speech synthesis is carried out by means of this component, in which, for example, the software "EmoSyn" has been implemented. This generates a synthetic speech signal according to the principle of formant synthesis.
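
The following is a minimal, illustrative sketch of the parameter mixing performed by the emotion simulator, as referenced above. It is not the patented implementation or "Emofilt" code; the parameter names, values and the averaging rule are assumptions chosen only to make the mechanism concrete: parameter sets of at least two target emotions, drawn from the four feature groups, are combined into one mixture that then modulates the neutral raw speech signal.

    # Illustrative sketch of the claimed parameter mixture (assumed names/values).
    # Per-emotion parameter sets over the four feature groups named above
    # (melody, duration, voice, articulation accuracy).
    EMOTION_PARAMS = {
        "surprise": {"f0_mean_scale": 1.3, "f0_range_scale": 1.4, "tempo_scale": 1.1,
                     "voice_quality": "tense", "articulation_scale": 1.1},
        "joy":      {"f0_mean_scale": 1.2, "f0_range_scale": 1.3, "tempo_scale": 1.2,
                     "voice_quality": "breathy", "articulation_scale": 1.0},
    }

    def mix_parameters(target_emotions, weights=None):
        """Combine the parameter sets of several target emotions into one mixture."""
        if weights is None:
            weights = [1.0 / len(target_emotions)] * len(target_emotions)
        numeric = ("f0_mean_scale", "f0_range_scale", "tempo_scale", "articulation_scale")
        # Numeric features are blended as a weighted average.
        mixture = {k: sum(w * EMOTION_PARAMS[e][k]
                          for e, w in zip(target_emotions, weights)) for k in numeric}
        # Categorical features (e.g. the voice quality / diphone inventory)
        # cannot be averaged; take the value of the most strongly weighted emotion.
        dominant = target_emotions[max(range(len(weights)), key=weights.__getitem__)]
        mixture["voice_quality"] = EMOTION_PARAMS[dominant]["voice_quality"]
        return mixture

    print(mix_parameters(["surprise", "joy"]))  # blended "surprise" + "joy" settings

In the claimed system, a mixture produced in this way would then modulate the neutral raw speech signal (the role played by "Emofilt" in the examples) before the synthesis unit ("Mbrola" or "EmoSyn") renders the audible signal.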
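
Next, a compact sketch of the time-domain PSOLA idea referenced in the first example: two-period segments are windowed around analysis pitch marks and overlap-added at synthesis marks whose spacing sets the new pitch. Real diphone systems such as "Mbrola" derive the pitch marks from the recordings and handle alignment far more carefully, so this is an illustration of the principle only, and the test tone at the end is an assumed input.

    import math

    def psola(signal, marks, f0_scale=1.0, dur_scale=1.0):
        """Naive time-domain PSOLA: marks are sample indices of pitch periods."""
        out_len = int(len(signal) * dur_scale)
        out = [0.0] * out_len
        t_out = marks[0] * dur_scale
        while t_out < out_len:
            # find the analysis mark corresponding to this synthesis instant
            src_t = t_out / dur_scale
            k = min(range(len(marks)), key=lambda j: abs(marks[j] - src_t))
            period = max(1, (marks[min(k + 1, len(marks) - 1)]
                             - marks[max(k - 1, 0)]) // 2)
            new_period = max(1, int(period / f0_scale))  # smaller spacing = higher pitch
            for offset in range(-period, period + 1):
                i = marks[k] + offset
                j = int(t_out) + offset
                if 0 <= i < len(signal) and 0 <= j < out_len:
                    w = 0.5 * (1.0 + math.cos(math.pi * offset / period))  # Hann window
                    out[j] += w * signal[i]
            t_out += new_period
        return out

    # illustrative input: a 0.3 s, 120 Hz tone with one pitch mark per period
    sig = [math.sin(2 * math.pi * 120 * t / 16000) for t in range(4800)]
    marks = list(range(0, len(sig), 16000 // 120))
    higher_slower = psola(sig, marks, f0_scale=1.3, dur_scale=1.2)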
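
The third sketch illustrates, under assumptions, the "anger" prosody transformation quoted in the second example, applied to an Mbrola-style phoneme description (phoneme, duration in ms, and (position %, F0 Hz) pitch points). The 150%/40%/30%/20% figures come from the text above; reading "increased by 150%" as a scale factor of 1.5, the neutral F0 mean of 120 Hz, and the way stressed syllables are marked are assumptions, not "Emofilt" internals.

    # Sketch of an "Emofilt"-style anger transformation on a neutral raw
    # signal given as (phoneme, duration_ms, [(pos_percent, f0_hz)]) tuples.
    def apply_anger(raw, f0_mean=120.0, stressed=frozenset()):
        out, n = [], len(raw)
        for i, (ph, dur, pitch) in enumerate(raw):
            d = dur / 1.3                 # all syllables accelerated by 30%
            if ph in stressed:            # hypothetical stress marking
                d /= 1.2                  # stressed ones by a further 20%
            new_pitch = []
            for pos, f0 in pitch:
                f0 = f0_mean + (f0 - f0_mean) * 1.4  # widen pitch range by 40%
                f0 *= 1.5                            # raise overall F0 level
                if i == n - 1:                       # final rise of the contour
                    f0 *= 1.0 + 0.3 * pos / 100.0
                new_pitch.append((pos, f0))
            out.append((ph, round(d), new_pitch))
        return out

    neutral = [("h", 60, [(50, 120.0)]), ("a", 110, [(50, 125.0)]),
               ("l", 70, [(50, 120.0)]), ("o", 140, [(50, 115.0)])]
    for ph, dur, pitch in apply_anger(neutral, stressed={"a"}):
        print(ph, dur, *(f"{pos} {round(f0)}" for pos, f0 in pitch))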
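
Finally, a minimal source-filter sketch of the formant synthesis principle used in the second example: a glottal excitation (an impulse train for voiced sounds, noise for unvoiced ones) is shaped by a cascade of second-order resonators at the formant frequencies. This is an illustrative reimplementation of the principle a Klatt-type synthesizer relies on, not "EmoSyn" code; all formant frequencies and bandwidths are assumed values.

    import math, random

    def resonator(x, freq, bw, sr=16000):
        """Second-order IIR resonance filter modelling one formant."""
        r = math.exp(-math.pi * bw / sr)
        a1 = 2.0 * r * math.cos(2.0 * math.pi * freq / sr)
        a2 = -r * r
        g = 1.0 - r                       # rough gain normalisation
        y, y1, y2 = [], 0.0, 0.0
        for s in x:
            v = g * s + a1 * y1 + a2 * y2
            y.append(v)
            y1, y2 = v, y1
        return y

    def synthesize_vowel(f0=120, formants=((700, 80), (1100, 90), (2600, 120)),
                         dur=0.3, sr=16000, voiced=True):
        n, period = int(dur * sr), int(sr / f0)
        # source: laryngeal excitation - impulse train (voiced) or noise (unvoiced)
        src = [1.0 if voiced and i % period == 0
               else (0.0 if voiced else random.uniform(-0.1, 0.1)) for i in range(n)]
        for freq, bw in formants:         # filter: vocal-tract formant cascade
            src = resonator(src, freq, bw, sr)
        return src

    samples = synthesize_vowel()          # ~0.3 s of an /a/-like vowel at 120 Hz

Raising the formants slightly (smiling speech) or de-centralizing the first two formants (higher articulation accuracy), as described above for the target emotion "satisfaction", then amounts to shifting the assumed formant frequencies before synthesis.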

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
EP15185879.2A 2015-09-18 2015-09-18 Synthetic generation of a natural-sounding speech signal Ceased EP3144929A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP15185879.2A EP3144929A1 (fr) 2015-09-18 2015-09-18 Synthetic generation of a natural-sounding speech signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
EP15185879.2A EP3144929A1 (fr) 2015-09-18 2015-09-18 Synthetic generation of a natural-sounding speech signal

Publications (1)

Publication Number Publication Date
EP3144929A1 (fr) 2017-03-22

Family

Family ID: 54150328

Family Applications (1)

Application Number Title Priority Date Filing Date
EP15185879.2A Ceased EP3144929A1 (fr) 2015-09-18 2015-09-18 Génération synthétique d'un signal vocale ayant un son naturel

Country Status (1)

Country Link
EP (1) EP3144929A1 (fr)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6226614B1 (en) * 1997-05-21 2001-05-01 Nippon Telegraph And Telephone Corporation Method and apparatus for editing/creating synthetic speech message and recording medium with the method recorded thereon

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
[1] FELIX BURKHARDT: "Simulation emotionaler Sprechweise mit Sprachsynthesesystemen" [Simulation of emotional speaking style with speech synthesis systems], Shaker Verlag, 2001
[2] FELIX BURKHARDT; W. F. SENDLMEIER: "Verification of Acoustical Correlates of Emotional Speech using Formant-Synthesis", Proceedings ISCA Workshop (ITRW) on Speech and Emotion, 2000
[3] T. DUTOIT; V. PAGEL; N. PIERRET; F. BATAILLE; O. VAN DER VREKEN: "The Mbrola project: Towards a set of high-quality speech synthesizers free of use for non-commercial purposes", Proc. ICSLP'96, Philadelphia, vol. 3, 1996, pages 1393-1396, XP010237942, DOI: 10.1109/ICSLP.1996.607874
[4] M. SCHRÖDER; J. TROUVAIN: "The German text-to-speech synthesis system MARY: A tool for research, development and teaching", International Journal of Speech Technology, 2003, pages 365-377, XP019207412, DOI: 10.1023/A:1025708916924
[5] FELIX BURKHARDT: "Emofilt: the Simulation of Emotional Speech by Prosody-Transformation", Proc. Interspeech, 4 September 2005, pages 1-4, XP055225958, Retrieved from the Internet <URL:http://felix.syntheticspeech.de/publications/emofiltInterspeech05.pdf> [retrieved on 20151104] *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023012116A1 (fr) * 2021-08-02 2023-02-09 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Speech signal processing device, speech signal playback system and method for outputting an emotion-free speech signal

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

17P Request for examination filed

Effective date: 20170922

RBV Designated contracting states (corrected)

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

17Q First examination report despatched

Effective date: 20180801

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED

18R Application refused

Effective date: 20190830