EP3144929A1 - Synthetic generation of a naturally sounding speech signal - Google Patents
Synthetic generation of a naturally sounding speech signal
- Publication number
- EP3144929A1 (application EP15185879.2A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- speech
- signal
- synthesis
- parameter
- speech signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
Definitions
- the invention relates to a solution for automated speech synthesis which makes it possible to generate speech sequences that, by simulating an emotional way of speaking, sound particularly natural to the listener they address. It concerns a solution which, by including extralinguistic and/or paralinguistic features, goes well beyond the purely linguistic generation of synthetic speech.
- Objects of the invention are a corresponding method and a system suitable for carrying out this method.
- one goal in the development of speech synthesis solutions is to include the aforementioned extralinguistic and paralinguistic features in the synthesis and thereby obtain natural-acting, synthetically generated speech sequences. The guiding idea is that this makes it possible to influence the behavior of people in a dialogue conducted as part of human-machine communication better and, ultimately, in a more goal-directed way.
- average values of the abovementioned feature groups are measured for each specific target emotion, and during the synthesis these feature groups are specifically adapted by changing the corresponding parameters. For example, for the emotion "exultant joy" the duration of the voiced fricatives is prolonged, or, to achieve the target emotion "mourning", the vocal sound is changed from "modal" to "breathy".
- the object of the invention is to provide a solution for automated speech synthesis which leads to speech sequences whose phonetic rendering of emotion-based speech sounds even more natural than is the case in the prior art.
- a method and a system suitable for carrying out the method are to be specified.
- a speech signal generated as part of an automated speech synthesis is influenced to simulate an emotional way of speaking.
- this is done by generating, during the speech synthesis, a raw voice signal that is not yet emotionally charged and specifically modulating it with a parameter mixture before the generated speech signal is output.
- this parameter mixture comprises parameters from several feature groups, namely melody features, duration features, voice features and/or features relating to the articulation accuracy of the speech.
- the parameters set in this respect are parameters from feature groups with which at least two predefined target emotions are associated.
- a raw speech signal here is a signal consisting of phonemes with a parameter assignment that is neutral with regard to the abovementioned feature groups, in particular with regard to prosody and vocal sound, and that does not yet take the form of an acoustically reproducible audio signal.
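To make the notion of a raw speech signal concrete, the following is a minimal sketch (the class and field names are hypothetical; the patent does not prescribe any data structure) of a phoneme sequence carrying a neutral prosody description that is deliberately not yet an audio signal:

```python
from dataclasses import dataclass, field

@dataclass
class Phoneme:
    symbol: str                # phoneme symbol, e.g. a SAMPA label like "a:"
    duration_ms: float         # neutral segment duration
    f0_targets: list = field(default_factory=list)  # (position %, F0 in Hz) pitch targets

@dataclass
class RawSpeechSignal:
    """Phonemes plus neutral prosody; intentionally not a waveform yet."""
    phonemes: list
    voice_quality: str = "neutral"   # no emotional colouring applied yet

    def total_duration_ms(self) -> float:
        return sum(p.duration_ms for p in self.phonemes)

raw = RawSpeechSignal([Phoneme("h", 60.0), Phoneme("a", 120.0, [(50, 120.0)])])
```

Only after modulation with the parameter mixture would such a structure be handed to the synthesis unit for waveform generation.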
- the system proposed for the solution of the problem and suitable for implementing the method explained above is a specially designed, hardware-based and software-based speech synthesizer.
- This consists of a phonemization component encompassed by an input stage of the synthesizer, an emotion simulator, and the actual synthesis unit, which is comprised by an output stage equipped with means for the acoustic output of the generated speech signal.
- Decisive for the solution of the problem is the way in which the aforementioned components, which as such may well be known from the prior art, are brought into operative connection with one another in the speech synthesizer according to the invention.
- the phonemes generated by the phonemization component are neither transferred to the synthesis unit directly nor merely influenced in the direction of a single target emotion with respect to the speech parameters which in particular determine the prosody and/or the vocal sound. Rather, the phonemes provided with the neutral prosody description are modulated with a parameter mixture before being transferred to the synthesis unit.
- the corresponding parameter mixture is generated by the emotion simulator belonging to the speech synthesizer. It evaluates specifications of at least two target emotions, provided together with the transferred text that is to be converted into synthetic speech, and influences the parameters of at least two different speech feature groups in accordance with these specifications.
- the parameters influenced in this way with regard to their respective characteristics, with which emotions such as joy, grief and anger are associated, are then mixed by the emotion simulator, that is, combined into a parameter mixture and applied to the raw voice signal, thereby modulating it.
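As an illustration of combining the parameters of at least two target emotions into one mixture, here is a small sketch; the parameter names, their values and the simple weighted-average mixing rule are assumptions made for the example and are not taken from the patent or from Emofilt:

```python
# Illustrative per-emotion parameter sets (values invented for the sketch).
EMOTION_PARAMS = {
    "surprise": {"f0_mean_scale": 1.3, "f0_range_scale": 1.4, "tempo_scale": 1.1},
    "joy":      {"f0_mean_scale": 1.2, "f0_range_scale": 1.3, "tempo_scale": 1.2},
}

def mix_parameters(emotions, weights=None):
    """Combine the parameter sets of several target emotions into a single
    parameter mixture by a weighted average of the scale factors."""
    if weights is None:
        weights = [1.0 / len(emotions)] * len(emotions)
    keys = EMOTION_PARAMS[emotions[0]].keys()
    return {k: sum(w * EMOTION_PARAMS[e][k] for e, w in zip(emotions, weights))
            for k in keys}

mixture = mix_parameters(["surprise", "joy"])   # equal-weight blend
```

The resulting mixture would then be applied to the neutral raw voice signal in one modulation step.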
- the synthesis component then generates, according to one of the known methods, the desired synthetic speech signal from the raw voice signal modulated with the parameter mixture, and outputs it.
- the first example refers to a robot that can express multiple emotions in parallel. It is assumed that a new message has arrived for the owner of this robot and that, based on content analysis or a sender marking, it has joyful content. The robot then uses its ability to express multiple emotions simultaneously, blending "surprise" (over the arrival of the message) with "joy" to prepare the owner for the message's enjoyable content. For its speech output, that is, for the actual speech synthesis, the robot uses software from the Mbrola project, which is known from the prior art and based on diphone synthesis (see [3]).
- diphone synthesis generates speech by concatenating individual two-sound units (diphones) from a database and subsequently adapting the prosody (influencing the pitch melody and the sound durations) according to the PSOLA method (Pitch Synchronous Overlap and Add).
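The core overlap-add step of PSOLA can be sketched as follows. This is a simplified illustration only (pitch modification with a constant pitch period, no duration handling or unvoiced segments), not the complete algorithm used by Mbrola:

```python
import numpy as np

def psola_shift(signal, pitch_marks, ratio):
    """Re-place Hann-windowed grains centred on pitch marks at intervals
    scaled by `ratio` (ratio < 1 raises the pitch) and overlap-add them."""
    period = pitch_marks[1] - pitch_marks[0]   # assume a constant pitch period
    win = np.hanning(2 * period)
    out = np.zeros(len(signal))
    new_pos = float(pitch_marks[0])
    for m in pitch_marks[1:-1]:
        grain = signal[m - period:m + period] * win   # one two-period grain
        s = int(new_pos) - period
        if 0 <= s and s + 2 * period <= len(out):
            out[s:s + 2 * period] += grain            # overlap-add at new spot
        new_pos += period * ratio                     # new, scaled spacing
    return out
```

In a real system the pitch marks come from a pitch tracker and the target spacing follows the desired F0 contour rather than a single ratio.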
- the robot described here by way of example has three diphone databases for different voice qualities, namely one inventory each for relaxed, normal and tense speech. In the present case, the robot uses the parameter data for the two target emotions "surprise" and "joy".
- a phonemization component, for example "Mary" (see [4]), generates a raw voice signal as a neutral version of the target utterance (the synthetic speech with the simultaneously expressed emotions "surprise" and "joy"), consisting of the phonemes to be used and a neutral prosody description.
- This voice raw signal is then modified or modulated according to the changes described above, for example by means of the "Emofilt” software (see [5]).
- the synthesizer based on "Mbrola" (see also [3]) then, as already stated, receives the modified speech signal description as well as the reference to the diphone database to be used, and from this generates the speech signal that expresses both surprise and joy.
- Another embodiment relates to a virtual agent in a game.
- the user interacts with the agent, inter alia, by voice.
- its mood changes based on events that happen in the game. For example, the agent learned in the morning that he has won the lottery and is generally in a good mood. In a game situation he then knocks over a valuable vase, which breaks, and he utters a furious exclamation that is softened by the influence of his positive mood.
- the generation of this mixed emotional expression in the speech is accomplished here, for example, by formant synthesis, as described for instance in [2].
- the phonemization component, which is again based on "Mary" (see [4]), generates a canonical pronunciation variant as well as a neutral prosody description from the text to be spoken as the raw voice signal. This is modified by the emotion simulation component "Emofilt" (see [5]) so that a duration and melody progression typical of the emotion "anger" results: the overall fundamental frequency level is increased by 150%, the range (the pitch-related vocal range) is widened by 40%, and the contour is given a final rise in frequency. All syllables are accelerated by 30%, and the stressed ones by a further 20%.
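Applied to per-syllable values, the prosody rules listed above can be sketched like this. Note two assumptions: "increased by 150%" is read as scaling F0 to 150% of its neutral value (the source wording is ambiguous), and the final contour rise is omitted:

```python
def apply_anger_prosody(syllables, f0_mean):
    """Apply the described 'anger' rules to (f0_hz, duration_ms, stressed)
    triples: raise F0 level to 150%, widen the range by 40% around the
    raised mean, speed all syllables up by 30% and stressed ones by 20% more."""
    raised_mean = f0_mean * 1.5
    out = []
    for f0, dur_ms, stressed in syllables:
        f0 = f0 * 1.5                                  # raise overall F0 level
        f0 = raised_mean + (f0 - raised_mean) * 1.4    # widen range by 40%
        dur_ms = dur_ms / 1.3                          # 30% faster overall
        if stressed:
            dur_ms = dur_ms / 1.2                      # stressed: 20% faster again
        out.append((f0, dur_ms))
    return out
```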
- the role of the synthesizer is taken over in this case not by "Mbrola" (see [3]), but by the "EmoSyn" software (see [2]), which controls a Klatt formant synthesizer through a combination of parameter templates for voiced sounds and rules for unvoiced sounds. All aspects of the acoustic speech signal can thus be modeled as a source-filter system; that is, the acoustic characteristics of the laryngeal excitation signal and of the vocal tract can be controlled parametrically.
- the features of the "articulation accuracy" and "vocal sound" groups are modeled according to the second target emotion, "satisfaction". That is, a rather breathy voice is used for the vocal sound, where (according to [1], p. 218) the excitation signal is adjusted through the Liljencrants-Fant parameters open quotient, spectral attenuation, bandwidth of the first formant, amplitude of the voiced portion and amplitude of the noise component. The formants are raised slightly, as this corresponds to smiling speech. The articulation accuracy is increased; that is, a "vowel target overshoot" is modeled by decentralizing the first two formants.
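The source-filter principle mentioned above can be illustrated with a very small sketch: an impulse-train excitation (the source) is passed through second-order resonators standing in for the formant filters. The sampling rate, formant frequencies and bandwidths are illustrative values only, not those of EmoSyn or a real Klatt synthesizer:

```python
import numpy as np

def resonator(x, freq, bw, sr):
    """Second-order digital resonator (a single formant filter)."""
    r = np.exp(-np.pi * bw / sr)
    theta = 2 * np.pi * freq / sr
    b0 = 1 - 2 * r * np.cos(theta) + r * r     # gain so the peak is ~unity
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = b0 * x[n] + 2 * r * np.cos(theta) * y[n - 1] - r * r * y[n - 2]
    return y

def synth_vowel(f0=120, formants=((700, 130), (1200, 70)), sr=16000, dur=0.1):
    """Impulse-train source filtered by cascaded formant resonators."""
    n = int(sr * dur)
    src = np.zeros(n)
    src[::sr // f0] = 1.0                      # glottal-pulse-like excitation
    for f, bw in formants:
        src = resonator(src, f, bw, sr)
    return src
```

Changing the formant frequencies or the excitation shape is what lets a formant synthesizer model voice quality parametrically, as the description requires.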
- Fig. 1 shows a first possible embodiment of the system 1 according to the invention, namely a speech synthesizer, in a greatly simplified schematic representation. All components of the speech synthesizer shown are software-based and hardware-based components.
- as can be seen in Fig. 1, the speech synthesizer 1 essentially consists of the input stage 2 with the phonemization component 3, the emotion simulator 4, and the output stage 5 with the actual synthesis unit 6 and with means (for example loudspeakers), not shown here, for the acoustic output.
- the phonemization component 3 is formed, for example, by the phonemization system "Mary" known for this purpose.
- the emotion simulator 4 is based on the software "Emofilt", created for the production of emotional speech.
- this component relies on a database in which parameters for the emotions "joy", "grief" and "anger" are kept in four independent feature groups, namely melody features, duration features, articulation features and voice features.
- in connection with the transfer of a text to be converted into speech, at least two target emotions are passed to the system 1. By means of the "Emofilt" software, the relevant parameters are adjusted accordingly and combined into a mixture by program sequences additionally provided as part of the emotion simulator 4. With this mixture, the phonemes generated in the phonemization component 3, which form the raw speech signal, are modulated. As the synthesis unit 6 of the output stage 5, the embodiment shown in Fig. 1 uses a component in which the software "Mbrola" is implemented.
- this "Mbrola"-based component produces the synthetic speech signal, which is finally output by the output means (not shown) of the speech synthesizer 1, from the modulated raw voice signal, that is, from the phonemes which are in any case neutral in vocal sound ("Mary" does not provide the phonemes with vocal-sound information or parameters) and carry an equally neutral prosody, modulated according to the two target emotions.
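The component chain just described (phonemization, emotion simulation, synthesis) can be summarized in a toy end-to-end sketch. The function names merely stand in for the roles of "Mary", "Emofilt" and "Mbrola"; they do not reflect the real APIs of those tools, and the one-phoneme-per-letter rule is a deliberate simplification:

```python
def phonemize(text):
    """Stand-in for the phonemization component: (phoneme, neutral
    duration in ms) pairs, one per letter for illustration."""
    return [(ch, 80.0) for ch in text if ch.isalpha()]

def simulate_emotion(raw, tempo_scale):
    """Stand-in for the emotion simulator: modulate the neutral raw
    signal with a (here: duration-only) parameter mixture."""
    return [(p, d / tempo_scale) for p, d in raw]

def synthesize(modulated):
    """Stand-in for the waveform generator: report the total duration
    of the modulated signal description instead of producing audio."""
    return sum(d for _, d in modulated)

dur = synthesize(simulate_emotion(phonemize("hello"), tempo_scale=1.25))
```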
- Fig. 2 shows a further embodiment of a speech synthesizer 1 designed according to the invention.
- Speech synthesizer 1 comprises the basically same components as the speech synthesizer 1 according to the embodiment explained above.
- the components in question also work together in the same way.
- compared to the embodiment of Fig. 1, in this speech synthesizer 1 only the speech synthesis unit 6 has been replaced by another embodiment of such a unit.
- the actual speech synthesis is carried out by means of this component, in which, for example, the software "EmoSyn" has been implemented. This generates a synthetic speech signal according to the principle of formant synthesis.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP15185879.2A EP3144929A1 (fr) | 2015-09-18 | 2015-09-18 | Génération synthétique d'un signal vocale ayant un son naturel |
Publications (1)
Publication Number | Publication Date |
---|---|
EP3144929A1 true EP3144929A1 (fr) | 2017-03-22 |
Family
ID=54150328
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP15185879.2A Ceased EP3144929A1 (fr) | 2015-09-18 | 2015-09-18 | Génération synthétique d'un signal vocale ayant un son naturel |
Country Status (1)
Country | Link |
---|---|
EP (1) | EP3144929A1 (fr) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023012116A1 (fr) * | 2021-08-02 | 2023-02-09 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Dispositif de traitement de signal vocal, système de lecture de signal vocal et procédé de sortie de signal vocal sans émotion |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6226614B1 (en) * | 1997-05-21 | 2001-05-01 | Nippon Telegraph And Telephone Corporation | Method and apparatus for editing/creating synthetic speech message and recording medium with the method recorded thereon |
- 2015-09-18: EP patent application EP15185879.2A filed (status: not active, Ceased)
Non-Patent Citations (6)
Title |
---|
FELIX BURKHARDT: "Emofilt: the Simulation of Emotional Speech by Prosody-Transformation", PROC. INTERSPEECH, 4 September 2005, pages 1 - 4 |
FELIX BURKHARDT: "Simulation emotionaler Sprechweise mit Sprachsynthesesystemen", 2001, SHAKER VERLAG |
FELIX BURKHARDT; W. F. SENDLMEIER: "Verification of Acoustical Correlates of Emotional Speech using Formant-Synthesis", PROCEEDINGS ISCA WORKSHOP (ITRW) ON SPEECH AND EMOTION, 2000 |
M. SCHRÖDER; J. TROUVAIN: "The German text-to-speech synthesis system mary: A tool for research, development and teaching", INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2003, pages 365 - 377, XP019207412, DOI: doi:10.1023/A:1025708916924 |
T. DUTOIT; V. PAGEL; N. PIERRET; F. BATAILLE; O. VAN DER VREKEN: "The Mbrola project: Towards a set of high-quality speech synthesizers free of use for non-commercial purposes", PROC. ICSLP'96, PHILADELPHIA, vol. 3, 1996, pages 1393 - 1396, XP010237942, DOI: doi:10.1109/ICSLP.1996.607874 |
Legal Events
Date | Code | Title |
---|---|---|
| PUAI | Public reference made under article 153(3) EPC to a published international application that has entered the european phase (original code: 0009012) |
| AK | Designated contracting states (kind code of ref document: A1): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| AX | Request for extension of the european patent; extension state: BA ME |
2017-09-22 | 17P | Request for examination filed |
| RBV | Designated contracting states (corrected): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
2018-08-01 | 17Q | First examination report despatched |
| STAA | Status: the application has been refused |
2019-08-30 | 18R | Application refused |