EP1256932A2 - Method and device for synthesising an emotion conveyed on a sound - Google Patents
Method and device for synthesising an emotion conveyed on a sound
- Publication number
- EP1256932A2 (application EP01401880A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- operator
- sound
- elementary
- elementary sound
- portions
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
Definitions
- the invention relates to the field of voice synthesis or reproduction with controllable emotional content. More particularly, the invention relates to a method and device for controllably adding an emotional feel to a synthesised or sampled voice, with a view to providing a more natural or interesting delivery for talking or other sound-emitting objects.
- the possibility of adding an emotional content to delivered speech is also useful for computer-aided systems which read texts or speeches for persons who cannot read for one reason or another. Examples are systems which read out novels, magazine articles, or the like, for which listening pleasure and the ability to focus attention can be enhanced if the reading voice can simulate emotions.
- a first approach, which is the most complicated and probably the least satisfactory, is based on linguistic theories for determining intonations.
- a second approach uses databases containing phrases tinted with different emotions, produced by human speakers.
- the nearest-sounding phrase with the corresponding emotion content is extracted from the database. Its pitch contour is measured and copied to apply to the selected phrase to be produced. This approach is mainly usable when the database and the produced phrases have very close grammatical structures. It is also difficult to implement.
- a third approach, which is recognised as being the most effective, consists in utilising voice synthesisers which sample from a database of recorded human voices. These synthesisers operate by concatenating phonemes or short syllables produced by a human voice to resynthesise sound sequences that correspond to a required spoken message. Instead of containing just neutral human voices, the database comprises voices spoken with different emotions.
- these systems have two basic limitations. Firstly, they are difficult to implement, and secondly, the databases are usually created by voices from different persons, for practical reasons. This can be disadvantageous when listeners expect the synthesised voice always to appear to be coming from a same speaker.
- there also exist voice synthesis software devices which allow a certain number of parameters to be controlled, but within a closed architecture which is not amenable to developing new applications.
- the invention proposes a new approach which is easy to implement, provides convincing results and is easy to parameterise.
- the invention also makes it possible to reproduce emotions in synthesised speech for meaningful speech contents in a recognisable language both in a naturally sounding voice and in deliberately distorted, exaggerated voices, for example as spoken by cartoon characters, talking animals or non-human animated forms, simply by playing on parameters.
- the invention is also amenable to imparting emotions on voices that deliver meaningless sounds, such as babble.
- the invention according to a first aspect proposes a method of synthesising an emotion conveyed on a sound, by selectively modifying at least one elementary sound portion thereof prior to delivering the sound, characterised in that the modification is produced by an operator application step in which at least one operator is selectively applied to at least one elementary sound portion to impose a specific modification in a characteristic thereof, such as pitch and/or duration, in accordance with an emotion to be synthesised.
- the operator application step preferably comprises forming at least one set of operators, the set comprising at least one operator to modify a pitch characteristic and/or at least one operator to modify a duration characteristic of the elementary sound portions.
- the operator application step comprises applying:
- the method can comprise a universal phase in which at least one operator is applied systematically to all elementary sound portions forming a determined sequence of the sound.
- At least one operator can be applied with a same operator parameterisation to all elementary sound portions forming a determined sequence of the sound.
- the method can comprise a probabilistic accentuation phase in which at least one operator is applied only to selected elementary sound portions chosen to be accentuated.
- the selected elementary sound portions can be selected by a random draw from candidate elementary sound portions, preferably to select elementary sound portions with a probability which is programmable.
- the candidate elementary sound portions can be:
- a same operator parameterisation may be used for the at least one operator applied in the probabilistic accentuation phase.
- the method can comprise a first and last elementary sound portions accentuation phase in which at least one operator is applied only to a group of at least one elementary sound portion forming the start and end of said determined sequence of sound, the latter being e.g. a phrase.
- the elementary portions of sound may correspond to a syllable or to a phoneme.
- the determined sequence of sound can correspond to intelligible speech or to unintelligible sounds.
- the elementary sound portions can be presented as formatted data values specifying a duration and/or at least one pitch value existing over determined parts of or all the duration of the elementary sound.
- the operators can act to selectively modify the data values.
- the method may be performed without changing the data format of the elementary sound portion data and upstream of an interpolation stage, whereby the interpolation stage can process data modified in accordance with an emotion to be synthesised in the same manner as data obtained from an arbitrary source of elementary sound portions.
- the invention provides a device for synthesising an emotion conveyed on a sound, using means for selectively modifying at least one elementary sound portion thereof prior to delivering the sound, characterised in that the means comprise operator application means for applying at least one operator to at least one of the elementary sound portion to impose a specific modification in a characteristic thereof in accordance with an emotion to be synthesised.
- the invention provides a data medium comprising software module means for executing the method according to the first aspect mentioned above.
- the invention is a development from work that forms the subject of earlier European patent application number 01 401 203.3 of the Applicant, filed on May 11, 2001 and to which the present application claims priority.
- the above earlier application concerns a voice synthesis method for synthesising a voice in accordance with information from an apparatus having a capability of uttering and having at least one emotional model.
- the method here comprises an emotional state discrimination step for discriminating an emotional state of the model of the apparatus having a capability of uttering, a sentence output step for outputting a sentence representing a content to be uttered in the form of a voice, a parameter control step for controlling a parameter for use in voice synthesis, depending upon the emotional state discriminated in the emotional state discrimination step, and a voice synthesis step for inputting, to a voice synthesis unit, the sentence output in the sentence output step and synthesising a voice in accordance with the controlled parameter.
- the voice in the earlier application has a meaningless content.
- when the emotional state of the emotional model becomes greater than a predetermined value, the sentence output step outputs the sentence and supplies it to the voice synthesis unit.
- the sentence output step can output a sentence obtained at random for each utterance and supply it to the voice synthesis unit.
- the sentences can include a number of phonemes and the parameter can include pitch, a duration, and an intensity of a phoneme.
- the apparatus having the capability of uttering can be an autonomous type robot apparatus which acts in response to supplied input information.
- the emotion model can be such as to cause the action in question.
- the voice synthesis method can then further include the step of changing the state of the emotion model in accordance with the input information thereby determining the action.
- the earlier application also proposes an autonomous-type apparatus, e.g. a robot, which acts in accordance with supplied input information, comprising an emotional model which causes the action in question, emotional state discrimination means for discriminating the emotional state of the emotional model, sentence output means for outputting a sentence representing a content to be uttered in the form of a voice, parameter control means for controlling a parameter used in voice synthesis depending upon the emotional state discriminated by the emotional state discrimination means, and voice synthesis means which receive the sentence output from the sentence output means and resynthesise a voice in accordance with the controlled parameter.
- the Applicant's research consisted in providing a baby-like robot with means to express emotions vocally. Unlike most existing work, the Applicant has also studied the possibility of conferring emotions in cartoon-like meaningless speech, which has different needs and different constraints than, for example, trying to produce naturally sounding adult-like normal emotional speech. For example, a goal was that the emotions should be recognisable by people with different cultural or linguistic backgrounds.
- the approach uses concatenative speech synthesis, and the algorithms are simpler and more completely specified than those used in other studies, such as those conducted by Breazal.
- the first result indicates that the goal of making a machine express affect both with meaningless speech and in a way recognisable by people from different cultures with the accuracy of a human speaker is attainable in theory.
- the second result shows that a perfect result cannot be expected.
- the fact that humans are not so good is mainly explained by the fact that several emotional states have very similar physiological correlates and thus acoustic correlates. In actual situations, humans resolve the ambiguities by using the context and/or other modalities.
- MBR-PSOLA: Text-to-Speech synthesis based on an MBE re-synthesis of the segments database, Speech Communication.
- the MBROLA software, freely available on the web at http://tcts.fpms.ac.be/synthesis/mbrola.html, is an enhancement of more traditional PSOLA techniques (it produces fewer distortions when pitch is manipulated).
- the price of quality is that very little control is possible over the signal, but this is compatible with a need for simplicity.
- the approach adopted in the invention is - from an algorithmic point of view - completely generative (it does not rely on recordings of human speech that would serve as input), and uses concatenative speech synthesis as a basis. It has been found to express emotions as efficiently as with formant synthesis, yet with simpler controls and a more life-like signal quality.
- An algorithm developed by the Applicant consists in generating a meaningless sentence and specifying the pitch contour and the durations of the phonemes (the rhythm of the sentence). For the sake of simplicity, only one pitch target per phoneme is specified, which can often be sufficient.
- the program generates a file as shown in table I below, which is fed into the MBROLA speech synthesiser.
- the idea of the algorithm is to initially generate a sentence composed of random words, each word being composed of random syllables (of type CV or CCV). Initially, the duration of all phonemes is constant and the pitch of each phoneme is constant and equal to a predetermined value (to which noise is added, which is advantageous to make the speech sound natural; many different kinds of noise were experimented with, and it was found that the type of noise used does not make a significant difference; for the perceptual experiment reported below, Gaussian noise was used).
- the sentence's pitch and duration information are then altered so as to yield a particular affect. Distortions consist in deciding that a number of syllables become stressed, and in applying a certain stress contour to these syllables as well as some duration modifications. Also, a certain default pitch contour and duration deformation are applied to all syllables.
- a key aspect of this algorithm resides in its stochastic parts: on the one hand, they allow a different utterance to be produced each time for a given set of parameters (mainly by virtue of the random number of words, the random constitution of syllables from phonemes, and the probabilistic attribution of accents); on the other hand, details like adding noise to the duration and pitch of phonemes (see lines 14 and 15 of the program shown in figure 1, where random(n) means "a random number between 0 and n") are advantageous for the naturalness of the vocalisations (if these values remain fixed, one perceives clearly that a machine is talking). Finally, accents are implemented only by changing the pitch, not the loudness.
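- To make the generative step concrete, the sketch below illustrates (under stated assumptions, and not as the Applicant's actual program of Table I / figure 1) how a meaningless sentence of random CV/CCV syllables with noisy default duration and pitch could be produced in the phoneme/duration/pitch format expected by an MBROLA-style synthesiser; the phoneme inventories, default values and noise amplitudes are hypothetical choices for illustration only.

```python
import random

# Hypothetical phoneme inventories and defaults -- illustration only.
CONSONANTS = ["b", "d", "g", "k", "m", "n", "p", "t"]
VOWELS = ["a", "e", "i", "o", "u"]
DEFAULT_DURATION_MS = 120   # constant initial duration per phoneme
DEFAULT_PITCH_HZ = 250      # constant initial pitch target per phoneme

def random_syllable():
    """Return a CV or CCV syllable as a list of phonemes."""
    onset = random.sample(CONSONANTS, random.choice([1, 2]))
    return onset + [random.choice(VOWELS)]

def babble_sentence(min_words=2, max_words=5, max_syllables=3):
    """Generate a meaningless sentence as (phoneme, duration_ms, pitch_hz) tuples.

    Gaussian noise is added to duration and pitch so that repeated runs with the
    same parameters never sound identical (the stochastic aspect noted above).
    """
    phonemes = []
    for _ in range(random.randint(min_words, max_words)):      # random number of words
        for _ in range(random.randint(1, max_syllables)):      # random syllables per word
            for ph in random_syllable():
                duration = DEFAULT_DURATION_MS + random.gauss(0, 10)
                pitch = DEFAULT_PITCH_HZ + random.gauss(0, 15)
                phonemes.append((ph, round(duration), round(pitch)))
    return phonemes

def to_pho_lines(phonemes):
    """Format as MBROLA-like lines: phoneme, duration, one pitch target at 50 % of it."""
    return ["%s %d 50 %d" % (ph, dur, pitch) for ph, dur, pitch in phonemes]

print("\n".join(to_pho_lines(babble_sentence())))
```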
- a last step is added to the algorithm in order to get a voice typical of a young creature : the sound file sampling rate is overridden by setting it to 30000 or 35000 Hz as compared to the 16000 Hz produced by MBROLA (this is equivalent to playing the file quicker).
- to compensate, the speech rate is initially made slower in the program sent to MBROLA, so that effectively only the voice quality and pitch are modified.
- This last step is preferable, since no child voice database exists for MBROLA (which is understandable since making it would be difficult with a child). Accordingly, a female adult voice was chosen.
- table II gives examples of values of the parameters obtained for 5 affects : calm, anger, sadness, happiness, comfort.
- Figure 2 shows how these emotions are positioned in a chart which represents an "emotional space", in which the parameters "valence” and “excitement” are expressed respectively along vertical and horizontal axes 2 and 4.
- the valence axis ranges from negative to positive values, while the excitement axis ranges from low to high values.
- the cross-point O of these axes is at the centre of the chart and corresponds to a calm/neutral state.
- happy/praising (quadrant Q1), characterised by positive valence and high excitement
- comfort/soothing (quadrant Q2), characterised by positive valence and low excitement
- sad (quadrant Q3), characterised by negative valence and low excitement
- angry/admonishing (quadrant Q4), characterised by negative valence and high excitement.
- the method and device in accordance with the invention are a development of the above concepts.
- the idea resides in controlling at least one of the pitch contour, the intensity contour and the rhythm of a phrase produced by voice synthesis.
- the inventive approach is relatively exhaustive and can easily be reproduced by other workers.
- the preferred embodiments are developed from freely available software modules that are well documented, simple to use and for which there are many equivalent technologies. Accordingly, the modules produced by these embodiments of the invention are totally transparent.
- the embodiments allow a total, or at a least high degree of control of pitch contour, rhythm (duration of phonemes), etc.
- the approach in accordance with the present invention is based on considering a phrase as a succession of syllables.
- the phrase can be speech in a recognised language, or simply meaningless utterances.
- for each syllable, the following characteristics can be controlled:
- f0: the pitch contour
- volume: the intensity contour
- duration: the duration of the syllable.
- control of the intensity is not strictly necessary, as a modification in the pitch can give the impression of a change in intensity.
- the problem is then to determine these contours - pitch contour, duration, and possibly intensity contour - throughout a sentence so as to produce an intonation that corresponds to a given emotion.
- the concept behind the solution is to start off from a phrase having a set contour (f0), a set intensity and a set duration for each syllable.
- This reference phrase can be produced either from a voice synthesiser for a recognised language, giving an initial contour (f0), an initial duration (t) and possibly an initial intensity.
- the reference phrase can be meaningless utterances, such as babble from infants.
- each syllable to be synthesised can be encoded as follows (case of the syllable "be", characterised in terms of a duration and five successive pitch values within that duration):
- the above data are contained in a frame simply by encoding the parameters: be; 100, 80, 100, 120, 90, 230, each value being identified by the synthesiser according to the protocol.
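- As an illustration of this frame format, the short sketch below parses such a frame into a syllable name, a duration and five pitch targets; the separator conventions (a semicolon, then commas) are simply read off the example above and are not a formally specified protocol.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SyllableData:
    name: str               # e.g. "be"
    duration_ms: int        # e.g. 100
    pitches_hz: List[int]   # five pitch targets spread over the duration

def parse_frame(frame: str) -> SyllableData:
    """Parse a frame such as 'be; 100, 80, 100, 120, 90, 230'."""
    name, values = frame.split(";")
    numbers = [int(v) for v in values.split(",")]
    return SyllableData(name.strip(), numbers[0], numbers[1:])

frame = parse_frame("be; 100, 80, 100, 120, 90, 230")
assert frame.duration_ms == 100 and frame.pitches_hz == [80, 100, 120, 90, 230]
```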
- Figure 3 shows the different stages by which these digital data are converted into a synthesised sound output.
- a voice message is composed in terms of a succession of syllables to be uttered.
- the message can be intelligible words forming grammatical sentences conveying meaning in a given recognised language, or meaningless sounds such as babble, animal-like sounds, or totally imaginary sounds.
- the syllables are encoded in the above-described digital data format in a vocalisation data file 10.
- a decoder 12 reads out the successive syllable data from the data file 10.
- Figure 4a shows graphically how these data are organised by the decoder 12 in terms of a coordinate grid with pitch fundamental frequency (in Hertz) along the ordinate axis and time (in milliseconds) along the abscissa.
- the area of the grid is divided into five columns of equal duration, one for each of the five respective pitch values, as indicated by the arrowed lines.
- the pitch value is placed at the centre of each column.
- the syllable data are transferred to an interpolator 14 which produces from the five elementary pitch values P1-P5 a close succession of interpolated pitch values, using standard interpolation techniques.
- the result is a relatively smooth curve of the evolution of pitch over the 100 ms duration of the syllable "be", as shown in figure 4b.
- the process is repeated for each inputted syllable data, to produce a continuous pitch curve over successive syllables of the phrase.
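- A minimal sketch of such an interpolation stage is given below, assuming plain linear interpolation between the five pitch targets anchored at the centres of five equal time columns; the description only calls for "standard interpolation techniques", so linear interpolation and the 1 ms output step are assumptions.

```python
def interpolate_pitch(duration_ms, pitch_targets, step_ms=1.0):
    """Expand a handful of pitch targets into a dense pitch curve over one syllable.

    Each of the N targets is anchored at the centre of its time column (cf. figure 4a);
    values between anchors are linearly interpolated, and the curve is held flat
    before the first anchor and after the last one.
    """
    n = len(pitch_targets)
    anchors = [(i + 0.5) * duration_ms / n for i in range(n)]   # anchor times in ms

    curve, t = [], 0.0
    while t <= duration_ms:
        if t <= anchors[0]:
            curve.append(pitch_targets[0])
        elif t >= anchors[-1]:
            curve.append(pitch_targets[-1])
        else:
            for i in range(n - 1):                              # find the surrounding anchors
                if anchors[i] <= t <= anchors[i + 1]:
                    frac = (t - anchors[i]) / (anchors[i + 1] - anchors[i])
                    curve.append(pitch_targets[i] + frac * (pitch_targets[i + 1] - pitch_targets[i]))
                    break
        t += step_ms
    return curve

# 100 ms syllable "be" with the five pitch targets of the example frame
curve = interpolate_pitch(100, [80, 100, 120, 90, 230])
```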
- the pitch waveform thus produced by the interpolator is supplied to an audio frequency sound processor 16 which generates a corresponding modulated amplitude audio signal.
- the sound processor may also add some random noise to the final audio signal to give a more realistic effect to the synthesised sound, as explained above.
- This final audio signal is supplied to an audio amplifier 18 where its level is raised to a suitable volume, and then outputted on a loudspeaker 20 which thus reproduces the synthesised sound data from vocalisation data file 10.
- part of the syllable data associated with the syllables will normally include an indication of which syllables may be accentuated to give a more naturally sounding delivery.
- the pitch values contained in the syllable data correspond to a "neutral" form of speech, i.e. not charged with a discernible emotion.
- Figure 5 is a block diagram showing in functional terms how an emotion generator 22 of the preferred embodiment integrates with the synthesiser 1 shown in figure 3.
- the emotion generator 22 operates by selectively applying operators on the syllable data read out from the vocalisation data file 10. Depending on their type, these operators can modify either the pitch data (pitch operator) or the syllable duration data (duration operator). These modifications take place upstream of the interpolator 14, e.g. before the decoder 12, so that the interpolation is performed on the operator-modified values. As explained below, the modification is such as to transform selectively a neutral form of speech into a speech conveying a chosen emotion (sad, calm, happy, angry) in a chosen quantity.
- the basic operator forms are stored in an operator set library 24, from which they can be selectively accessed by an operator set configuration unit 26.
- the latter serves to prepare and parameterise the operators in accordance with current requirements.
- an operator parameterisation unit 28 determines the parameterisation of the operators in accordance with: i) the emotion to be imprinted on the voice (calm, sad, happy, angry, etc.), ii) possibly the degree - or intensity - of the emotion to apply, and iii) the context of the syllable, as explained below.
- the emotion and degree of emotion are instructed to the operator parameterisation unit 28 by an emotion selection interface 30 which presents inputs accessible to a user 32.
- the emotion selection interface can be in the form of a computer interface with on-screen menus and icons, allowing the user 32 to indicate all the necessary emotion characteristics and other operating parameters.
- the context of the syllable which is operator-sensitive is: i) the position of the syllable in a phrase, as some operator sets are applied only to the first and last syllables of the phrase, ii) whether the syllables relate to intelligible word sentences or to unintelligible sounds (babble, etc.), and iii) as the case arises, whether or not the syllable considered is allowed to be accentuated, as indicated in the vocalisation data file 10.
- a first and last syllables identification unit 34 and an authorised syllable accentuation detection unit 36 both have access to the vocalisation data file unit 10 and inform the operator parameterisation unit 28 of the appropriate context-sensitive parameters.
- the random selection is provided by a controllable probability random draw unit 38 operatively connected between the authorised syllable accentuation unit 36 and the operator parameterisation unit 28.
- the random draw unit 38 has a controllable degree of probability of selecting a syllable from the candidates. Specifically, if N is the probability of a candidate being selected, with N ranging controllably from 0 to 1, then for P candidate syllables, N·P syllables shall be selected on average for being subjected to a specific operator set associated with a random accentuation. The distribution of the randomly selected candidates is substantially uniform over the sequence of syllables.
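- A minimal sketch of this draw, assuming each candidate syllable is selected independently with probability N (which yields N·P accentuated syllables on average and an approximately uniform spread over the sequence):

```python
import random

def draw_accented(candidate_indices, probability_n):
    """Select candidate syllables for accentuation, each independently with probability N in [0, 1]."""
    return [i for i in candidate_indices if random.random() < probability_n]

# e.g. with N = 0.3 and 10 candidate syllables, about 3 are accentuated on average
accented = draw_accented(range(10), 0.3)
```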
- the suitably configured operator sets from the operator set configuration unit 26 are sent to a syllable data modifier unit 40 where they operate on the syllable data.
- the syllable data modifier unit 40 receives the syllable data directly from vocalisation data file 10, in a manner analogous to the decoder 12 of figure 3.
- the thus-received syllable data are modified by unit 40 as a function of the operator set, notably in terms of pitch and duration data.
- the resulting modified syllable data (new syllable data) are then outputted by the syllable data modifier unit 40 to the decoder 12, with the same structure as presented in the vocalisation data file (cf. figure 2a).
- the decoder can process the new syllable data exactly as if it originated directly from the vocalisation data file. From there, the new syllable data are interpolated (interpolator unit 14) and processed by the other downstream units of figure 3 in exactly the same way. However, the sound produced at the speaker then no longer corresponds to a neutral tone, but rather to the sound with a simulation of an emotion as defined by the user 32.
- All the above functional units are under the overall control of an operations sequencer unit 42 which governs complete execution of the emotion generation procedure in accordance with a prescribed set of rules.
- Figure 6 illustrates graphically the effect of the pitch operator set OP on a pitch curve (as in figure 4b) of a synthesised sound.
- the figure shows - respectively on left and right columns - a pitch curve (fundamental frequency f against time t) before the action of the pitch operator and after the action of a pitch operator.
- the input pitch curves are identical for all operators and happen to be relatively flat.
- the rising slope and falling slope operators OPrs and OPfs have the following characteristic: the pitch at the central point in time (t1/2 for a pitch duration of t1) remains substantially unchanged after the operator. In other words, the operators act to pivot the input pitch curve about the pitch value at the central point in time, so as to impose the required slope. This means that in the case of a rising slope operator OPrs, the pitch values before the central point in time are in fact lowered, and that in the case of a falling slope operator OPfs, the pitch values before the central point in time are in fact raised, as shown by the figure.
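- As a concrete illustration of this pivoting behaviour, the sketch below applies a rising or falling slope to the five pitch targets of a syllable while leaving the central target unchanged; representing the slope parameter as a per-target difference in Hz is an assumption made for illustration.

```python
def apply_slope(pitch_targets, slope_param):
    """Pivot the pitch targets about the central value.

    slope_param > 0 gives a rising slope (OPrs): targets before the centre are lowered,
    targets after it are raised.  slope_param < 0 gives a falling slope (OPfs).
    The middle target is left unchanged.
    """
    centre = (len(pitch_targets) - 1) / 2.0
    return [p + slope_param * (i - centre) for i, p in enumerate(pitch_targets)]

rising = apply_slope([80, 100, 120, 90, 230], slope_param=10)    # OPrs
falling = apply_slope([80, 100, 120, 90, 230], slope_param=-10)  # OPfs
```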
- there are also intensity operators, designated OI, illustrated in figure 7, which is directly analogous to the illustration of figure 6.
- These operators are also four in number and are identical to those of the pitch operators OP, except that they act on the curve of intensity I over time t. Accordingly, these operators shall not be detailed separately, for the sake of conciseness.
- the pitch and intensity operators can each be parameterised as follows :
- Figure 8 illustrates graphically the effect of a duration (or time) operator OD on a syllable.
- the illustration shows, in left and right columns respectively, the duration of the input syllable (in terms of a horizontal line expressing an initial length of time t1) before and after the effect of a duration operator.
- the duration operator can be:
- the operator can also be neutralised, i.e. made a neutral operator, simply by inserting the value 0 for the parameter D.
- although the duration operator has been represented as being of two different types, respectively dilation and contraction, it is clear that the only difference resides in the plus or minus sign placed before the parameter D.
- a same operator mechanism can produce both operator functions (dilation and contraction) if it can handle both positive and negative numbers.
- the range of possible values for D and its possible incremental values in the range can be chosen according to requirements.
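- A corresponding sketch of the duration operator, in which a single signed parameter D covers both dilation (positive) and contraction (negative) and D = 0 gives the neutral operator; interpreting D as a fractional change of the duration is one possible parameterisation among others, assumed here for illustration.

```python
def apply_duration_operator(duration_ms, d_param):
    """Dilate (d_param > 0) or contract (d_param < 0) a syllable duration.

    d_param is interpreted here as a fractional change: +0.2 lengthens the syllable
    by 20 %, -0.2 shortens it by 20 %, and 0 leaves it unchanged (neutral operator).
    """
    return max(1, round(duration_ms * (1.0 + d_param)))

dilated = apply_duration_operator(100, 0.2)      # ODd: 120 ms
contracted = apply_duration_operator(100, -0.2)  # ODc: 80 ms
```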
- the embodiment further uses a separate operator, which establishes the probability N for the random draw unit 38.
- This value is selected from a range of 0 (no possibility of selection) to 1 (certainty of selection).
- the value N serves to control the density of accentuated syllables in the vocalised output as appropriate for the emotional quality to reproduce.
- Figures 9A and 9B constitute a flow chart indicating the process of forming and applying selectively the above operators to syllable data on the basis of the system described with reference to figure 5.
- Figure 9B is a continuation of figure 9A.
- the process starts with an initialisation phase P1 which involves loading input syllable data from the vocalisation data file 10 (step S2).
- the data appear as an identification of the syllable e.g. "be”, followed by a first value t1 expressing the normal duration of the syllable, followed by five values P1 to P5 indicating the fundamental frequency of the pitch at five successive intervals of the indicated duration t1, as explained with reference to figure 4a.
- in step S4, the emotion to be conveyed on the phrase or passage of which the loaded syllable data forms a part is loaded, using the interface unit 30.
- the emotions can be calm, sad, happy, angry, etc.
- the interface also inputs the degree of emotion to be given, e.g. by attributing a weighting value (step S6).
- the system then enters into a universal operator phase P2, in which a universal operator set OS(U) is applied systematically to all the syllables.
- the universal operator set OS(U) contains all the operators of figures 6 and 8, i.e. OPrs, OPfs, OPsu, OPsd, forming the four pitch operators, plus ODd and ODc, forming the two duration operators.
- Each of these operators of operator set OS(U) is parameterised by a respective associated value, respectively Prs(U), Pfs(U), Psu(U), Psd(U), Dd(U), and Dc(U), as explained above (step S8).
- This step involves attributing numerical values to these parameters, and is performed by the operator set configuration unit 26.
- the choice of parameter values for the universal operator set OS(U) is determined by the operator parameterisation unit 28 as a function of the programmed emotion and quantity of emotion, plus other factors as the case arises.
- the universal operator set OS(U) is then applied systematically to all the syllables of a phrase or group of phrases (step S10).
- the action involves modifying the numerical values t1, P1-P5 of the syllable data.
- the slope parameter Prs or Pfs is translated into a group of five difference values to be applied arithmetically to the values P1-P5 respectively. These difference values are chosen to move each of the values P1-P5 according to the parameterised slope, the middle value P3 remaining substantially unchanged, as explained earlier.
- the first two values of the rising slope parameters will be negative to cause the first half of the pitch to be lowered and the last two values will be positive to cause the last half of the pitch to be raised, so creating the rising slope articulated at the centre point in time, as shown in figure 6.
- the degree of slope forming the parameterisation is expressed in terms of these difference values.
- a similar approach in reverse is used for the falling slope parameter.
- the shift up or shift down operators can be applied before or after the slope operators. They simply add or subtract a same value, determined by the parameterisation, to the five pitch values P1-P5.
- the operators form mutually exclusive pairs, i.e. a rising slope operator will not be applied if a falling slope operator is to be applied, and likewise for the shift up and down and duration operators.
- the application of the operators, i.e. the calculation to modify the data parameters t1, P1-P5, is performed by the syllable data modifier unit 40.
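- Putting the earlier sketches together, the following hedged sketch shows how one parameterised operator set (here the universal set OS(U)) could be applied to a syllable frame; it reuses the SyllableData, parse_frame, apply_slope and apply_duration_operator helpers sketched above, and the encoding of a parameter set as a plain dictionary with one signed value per operator pair (which automatically expresses the mutually exclusive pairs) is an assumption, as are the example parameter values.

```python
def apply_operator_set(syllable, params):
    """Apply one parameterised operator set to a single syllable frame.

    params holds a signed value per operator pair: 'slope' (Prs/Pfs), 'shift' (Psu/Psd)
    and 'duration' (Dd/Dc).  A single signed value per pair guarantees that a rising
    and a falling operator (or a dilation and a contraction) are never applied together.
    """
    pitches = apply_slope(syllable.pitches_hz, params.get("slope", 0))
    pitches = [p + params.get("shift", 0) for p in pitches]      # shift up or down
    duration = apply_duration_operator(syllable.duration_ms, params.get("duration", 0))
    return SyllableData(syllable.name, duration, [round(p) for p in pitches])

# illustrative (made-up) universal parameterisation, not taken from the patent:
os_universal = {"slope": -5, "shift": -20, "duration": 0.15}
modified = apply_operator_set(parse_frame("be; 100, 80, 100, 120, 90, 230"), os_universal)
```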
- the system then enters into a probabilistic accentuation phase P3, for which another operator accentuation parameter set OS(PA) is prepared.
- This operator set has the same operators as the universal operator set, but with different values for the parameterisation.
- the operator set OS(PA) is parameterised by respective values: Prs(PA), Pfs(PA), Psu(PA), Psd(PA), Dd(PA), and Dc(PA).
- these values are likewise calculated by the operator parameterisation unit 28 as a function of the emotion, the degree of emotion and the other factors provided by the interface unit 30.
- the choice of the parameters is generally made to add a degree of intonation (prosody) to the speech according to the emotion considered.
- next, the syllables to be submitted to this operator set OS(PA) are determined by the random draw unit 38 (step S14).
- the latter supplies the list of the randomly drawn syllables for accentuating by this operator set.
- the candidate syllables are:
- the randomly selected syllables among the candidates are then submitted for processing by the probabilistic accentuation operator set OS(PA) by the syllable data modifier unit 40 (step S16).
- the actual processing performed is the same as explained above for the universal operator set, with the same technical considerations, the only difference being in the parameter values involved.
- the syllable data modifier unit 40 will supply the following modified forms of the syllable data (generically denoted S) originally in the file 10:
- the system then enters a phase P4 of processing an accentuation specific to the first and last syllables of a phrase.
- this phase P4 acts to accentuate all the syllables of the first and last words of the phrase.
- the term phrase can be understood in the normal grammatical sense for intelligible text to be spoken, e.g. in terms of pauses in the recitation.
- for meaningless utterances, a phrase is understood in terms of a beginning and end of the utterance, marked by a pause. Typically, such a phrase can last from around one to three or four seconds.
- the phase P4 of accentuating the first and last syllables applies to at least the first and last syllables, and preferably to the first m and last n syllables, where m and n are typically around 2 or 3 and can be the same or different.
- the resulting operator set OS(FL) is then applied to the first and last syllables of each phrase (step S20), these syllables being identified by the first/last syllables detector unit 34.
- the syllable data on which operator set OS(FL) is applied will have previously been processed by the universal operator set OS(U) at step S10. Additionally, it may happen that a first or last syllable has also been drawn at the random selection step S14 and has thereby also been processed by the probabilistic accentuation operator set OS(PA).
- the parameterisation is of the same general type for all operator sets, only the actual values being different.
- the values are usually chosen so that the least amount of change is produced by the universal operator set, and the largest amount of change is produced by the first and last syllable accentuation, the probabilistic accentuation operator set producing an intermediate amount of change.
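- The overall sequencing of the three phases can then be sketched as below, again reusing the helpers introduced earlier; the per-phase parameter dictionaries and the exact handling of overlaps between first/last and randomly drawn syllables are assumptions illustrating the rule that the universal set produces the least change and the first and last syllable accentuation the most.

```python
def generate_emotional_phrase(syllables, os_u, os_pa, os_fl, probability_n, m=2, n=2):
    """Apply the universal (P2), probabilistic (P3) and first/last (P4) phases in order.

    syllables: list of SyllableData for one phrase.
    os_u, os_pa, os_fl: parameter dicts for OS(U), OS(PA) and OS(FL).
    probability_n: probability used by the random draw for OS(PA).
    m, n: number of leading/trailing syllables accentuated by OS(FL).
    """
    # Phase P2: universal operators applied systematically to every syllable.
    out = [apply_operator_set(s, os_u) for s in syllables]

    # Phase P3: probabilistic accentuation of randomly drawn candidate syllables.
    for i in draw_accented(range(len(out)), probability_n):
        out[i] = apply_operator_set(out[i], os_pa)

    # Phase P4: accentuation of the first m and last n syllables of the phrase.
    first_last = set(range(min(m, len(out)))) | set(range(max(0, len(out) - n), len(out)))
    for i in sorted(first_last):
        out[i] = apply_operator_set(out[i], os_fl)
    return out
```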
- the system can also be made to use intensity operators OI in its set, depending on the parameterisation used.
- the interface unit 30 can be integrated into a computer interface to provide different controls. Among these can be direct choice of parameters of the different operator sets mentioned above, in order to allow the user 32 to fine-tune the system.
- the interface can be made user friendly by providing visual scales, showing e.g. graphically the slope values, shift values, contraction/dilation values for the different parameters.
- the examples are illustrated for a given format of speech data, but it is clear that any other formatting of data can be accommodated.
- the number of pitch or intensity values given in the examples can be different from 5, typical numbers of values ranging from just one to more than five.
- the embodiment can be implemented in a large variety of devices, for instance: robotic pets and other intelligent electronic creatures, sound systems for educational training, studio productions (dubbing, voice animations, narration, etc.), devices for reading texts out loud (books, articles, mail, etc.), sound experimentation systems (psycho-acoustic research, etc.), humanised computer interfaces for PCs, instruments and other equipment, and other applications.
- the form of the embodiment can range from a stand-alone unit fully equipped to provide complete synthesised sound reproduction (cf. figure 3), to an accessory operating with an existing sound synthesiser, or to software modules recorded on a medium or in downloadable form to be run on adapted processor systems.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Toys (AREA)
- Feedback Control In General (AREA)
- Soundproofing, Sound Blocking, And Sound Damping (AREA)
Priority Applications (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP20010401880 EP1256932B1 (de) | 2001-05-11 | 2001-07-13 | Verfahren und Vorrichtung um eine mittels eines Klangs übermittelte Emotion zu synthetisieren |
DE2001631521 DE60131521T2 (de) | 2001-05-11 | 2001-08-14 | Verfahren und Vorrichtung zur Steuerung des Betriebs eines Geräts bzw. eines Systems sowie System mit einer solchen Vorrichtung und Computerprogramm zur Ausführung des Verfahrens |
EP20010402176 EP1256933B1 (de) | 2001-05-11 | 2001-08-14 | Verfahren und Vorrichtung zur Steuerung eines Emotionssynthesegeräts |
US10/192,974 US20030093280A1 (en) | 2001-07-13 | 2002-07-11 | Method and apparatus for synthesising an emotion conveyed on a sound |
JP2002206013A JP2003177772A (ja) | 2001-07-13 | 2002-07-15 | 感情合成装置の処理を制御する方法及び装置 |
JP2002206012A JP2003084800A (ja) | 2001-07-13 | 2002-07-15 | 音声による感情合成方法及び装置 |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP01401203A EP1256931A1 (de) | 2001-05-11 | 2001-05-11 | Verfahren und Vorrichtung zur Sprachsynthese und Roboter |
EP01401203 | 2001-05-11 | ||
EP20010401880 EP1256932B1 (de) | 2001-05-11 | 2001-07-13 | Verfahren und Vorrichtung um eine mittels eines Klangs übermittelte Emotion zu synthetisieren |
Publications (3)
Publication Number | Publication Date |
---|---|
EP1256932A2 true EP1256932A2 (de) | 2002-11-13 |
EP1256932A3 EP1256932A3 (de) | 2004-10-13 |
EP1256932B1 EP1256932B1 (de) | 2006-05-10 |
Family
ID=26077240
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP20010401880 Expired - Lifetime EP1256932B1 (de) | 2001-05-11 | 2001-07-13 | Verfahren und Vorrichtung um eine mittels eines Klangs übermittelte Emotion zu synthetisieren |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP1256932B1 (de) |
DE (1) | DE60131521T2 (de) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7809572B2 (en) | 2005-07-20 | 2010-10-05 | Panasonic Corporation | Voice quality change portion locating apparatus |
CN111816158A (zh) * | 2019-09-17 | 2020-10-23 | 北京京东尚科信息技术有限公司 | 一种语音合成方法及装置、存储介质 |
CN113611326A (zh) * | 2021-08-26 | 2021-11-05 | 中国地质大学(武汉) | 一种实时语音情感识别方法及装置 |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101901598A (zh) * | 2010-06-30 | 2010-12-01 | 北京捷通华声语音技术有限公司 | 一种哼唱合成方法和系统 |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5860064A (en) * | 1993-05-13 | 1999-01-12 | Apple Computer, Inc. | Method and apparatus for automatic generation of vocal emotion in a synthetic text-to-speech system |
US6212502B1 (en) * | 1998-03-23 | 2001-04-03 | Microsoft Corporation | Modeling and projecting emotion and personality from a computer user interface |
EP1107227A2 (de) * | 1999-11-30 | 2001-06-13 | Sony Corporation | Sprachverarbeitung |
Non-Patent Citations (2)
Title |
---|
GALANIS D ET AL: "Investigating emotional speech parameters for speech synthesis" ELECTRONICS, CIRCUITS, AND SYSTEMS, 1996. ICECS '96., PROCEEDINGS OF THE THIRD IEEE INTERNATIONAL CONFERENCE ON RODOS, GREECE 13-16 OCT. 1996, NEW YORK, NY, USA,IEEE, US, 13 October 1996 (1996-10-13), pages 1227-1230, XP010217293 ISBN: 0-7803-3650-X * |
IGNASI IRIONDO ET AL: "VALIDATION OF AN ACOUSTICAL MODELLING OF EMOTIONAL EXPRESSION IN SPANISH USING SPEECH SYNTHESIS TECHNIQUES" PROCEEDINGS OF THE ISCA WORKSHOP ON SPEECH AND EMOTION, September 2000 (2000-09), pages 1-6, XP007005765 BELFAST, NORTHERN IRELAND * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7809572B2 (en) | 2005-07-20 | 2010-10-05 | Panasonic Corporation | Voice quality change portion locating apparatus |
CN111816158A (zh) * | 2019-09-17 | 2020-10-23 | 北京京东尚科信息技术有限公司 | 一种语音合成方法及装置、存储介质 |
CN111816158B (zh) * | 2019-09-17 | 2023-08-04 | 北京京东尚科信息技术有限公司 | 一种语音合成方法及装置、存储介质 |
CN113611326A (zh) * | 2021-08-26 | 2021-11-05 | 中国地质大学(武汉) | 一种实时语音情感识别方法及装置 |
CN113611326B (zh) * | 2021-08-26 | 2023-05-12 | 中国地质大学(武汉) | 一种实时语音情感识别方法及装置 |
Also Published As
Publication number | Publication date |
---|---|
EP1256932B1 (de) | 2006-05-10 |
EP1256932A3 (de) | 2004-10-13 |
DE60131521T2 (de) | 2008-10-23 |
DE60131521D1 (de) | 2008-01-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20030093280A1 (en) | Method and apparatus for synthesising an emotion conveyed on a sound | |
Pierre-Yves | The production and recognition of emotions in speech: features and algorithms | |
DE60119496T2 (de) | Verfahren und Vorrichtung um eine mittels eines Klangs übermittelte Emotion zu synthetisieren | |
Burkhardt et al. | Verification of acoustical correlates of emotional speech using formant-synthesis | |
JP4363590B2 (ja) | 音声合成 | |
JP4458321B2 (ja) | 感情認識方法および感情認識装置 | |
Cahn | The generation of affect in synthesized speech | |
Theune et al. | Generating expressive speech for storytelling applications | |
US5860064A (en) | Method and apparatus for automatic generation of vocal emotion in a synthetic text-to-speech system | |
Ladd et al. | Evidence for the independent function of intonation contour type, voice quality, and F 0 range in signaling speaker affect | |
Nose et al. | HMM-based expressive singing voice synthesis with singing style control and robust pitch modeling | |
Mareüil et al. | Generation of emotions by a morphing technique in English, French and Spanish | |
Hill et al. | Low-level articulatory synthesis: A working text-to-speech solution and a linguistic tool1 | |
Gahlawat et al. | Natural speech synthesizer for blind persons using hybrid approach | |
US7457752B2 (en) | Method and apparatus for controlling the operation of an emotion synthesizing device | |
EP1256932B1 (de) | Verfahren und Vorrichtung um eine mittels eines Klangs übermittelte Emotion zu synthetisieren | |
Keller | Towards greater naturalness: Future directions of research in speech synthesis | |
Lobanov et al. | TTS-Synthesizer as a Computer Means for Personal Voice Cloning (On the example of Russian) | |
Gahlawat et al. | Integrating human emotions with spatial speech using optimized selection of acoustic phonetic units | |
Olaszy | The most important prosodic patterns of Hungarian | |
Oudeyer | The synthesis of cartoon emotional speech | |
Vine et al. | Synthesis of emotional speech using RP-PSOLA | |
Makarova et al. | Phonetics of emotion in Russian speech | |
Suchato et al. | Digital storytelling book generator with customizable synthetic voice styles | |
Henton et al. | Generating and manipulating emotional synthetic speech on a personal computer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
AK | Designated contracting states |
Kind code of ref document: A2 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR |
|
AX | Request for extension of the european patent |
Free format text: AL;LT;LV;MK;RO;SI |
|
PUAL | Search report despatched |
Free format text: ORIGINAL CODE: 0009013 |
|
AK | Designated contracting states |
Kind code of ref document: A3 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR |
|
AX | Request for extension of the european patent |
Extension state: AL LT LV MK RO SI |
|
RIC1 | Information provided on ipc code assigned before grant |
Ipc: 7G 10L 13/08 A Ipc: 7G 10L 13/02 B |
|
17P | Request for examination filed |
Effective date: 20050317 |
|
17Q | First examination report despatched |
Effective date: 20050429 |
|
AKX | Designation fees paid |
Designated state(s): DE FR GB |
|
GRAP | Despatch of communication of intention to grant a patent |
Free format text: ORIGINAL CODE: EPIDOSNIGR1 |
|
GRAS | Grant fee paid |
Free format text: ORIGINAL CODE: EPIDOSNIGR3 |
|
GRAA | (expected) grant |
Free format text: ORIGINAL CODE: 0009210 |
|
AK | Designated contracting states |
Kind code of ref document: B1 Designated state(s): DE FR GB |
|
REG | Reference to a national code |
Ref country code: GB Ref legal event code: FG4D |
|
REF | Corresponds to: |
Ref document number: 60119496 Country of ref document: DE Date of ref document: 20060614 Kind code of ref document: P |
|
ET | Fr: translation filed | ||
PLBE | No opposition filed within time limit |
Free format text: ORIGINAL CODE: 0009261 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT |
|
26N | No opposition filed |
Effective date: 20070213 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: FR Payment date: 20110729 Year of fee payment: 11 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: GB Payment date: 20110721 Year of fee payment: 11 Ref country code: DE Payment date: 20110722 Year of fee payment: 11 |
|
GBPC | Gb: european patent ceased through non-payment of renewal fee |
Effective date: 20120713 |
|
REG | Reference to a national code |
Ref country code: FR Ref legal event code: ST Effective date: 20130329 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: GB Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20120713 Ref country code: DE Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20130201 Ref country code: FR Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20120731 |
|
REG | Reference to a national code |
Ref country code: DE Ref legal event code: R119 Ref document number: 60119496 Country of ref document: DE Effective date: 20130201 |