EP1256932A2 - Method and device for synthesising an emotion conveyed by a sound - Google Patents

Method and device for synthesising an emotion conveyed by a sound

Info

Publication number
EP1256932A2
Authority
EP
European Patent Office
Prior art keywords
operator
sound
elementary
elementary sound
portions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
EP01401880A
Other languages
English (en)
French (fr)
Other versions
EP1256932B1 (de)
EP1256932A3 (de)
Inventor
Pierre-Yves c/o Sony France S.A. Oudeyer
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony France SA
Original Assignee
Sony France SA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from EP01401203A external-priority patent/EP1256931A1/de
Application filed by Sony France SA filed Critical Sony France SA
Priority to EP20010401880 priority Critical patent/EP1256932B1/de
Priority to DE2001631521 priority patent/DE60131521T2/de
Priority to EP20010402176 priority patent/EP1256933B1/de
Priority to US10/192,974 priority patent/US20030093280A1/en
Priority to JP2002206013A priority patent/JP2003177772A/ja
Priority to JP2002206012A priority patent/JP2003084800A/ja
Publication of EP1256932A2 publication Critical patent/EP1256932A2/de
Publication of EP1256932A3 publication Critical patent/EP1256932A3/de
Publication of EP1256932B1 publication Critical patent/EP1256932B1/de
Application granted granted Critical
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Classifications

    • G10L13/033 — Voice editing, e.g. manipulating the voice of the synthesiser (G Physics; G10 Musical instruments, acoustics; G10L Speech analysis techniques or speech synthesis, speech recognition, speech or voice processing techniques, speech or audio coding or decoding; G10L13/00 Speech synthesis, text-to-speech systems; G10L13/02 Methods for producing synthetic speech, speech synthesisers)
    • G10L13/10 — Prosody rules derived from text; stress or intonation (G10L13/00 Speech synthesis, text-to-speech systems; G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme-to-phoneme translation, prosody generation or stress or intonation determination)
    • G10L13/04 — Details of speech synthesis systems, e.g. synthesiser structure or memory management (G10L13/00 Speech synthesis, text-to-speech systems; G10L13/02 Methods for producing synthetic speech, speech synthesisers)

Definitions

  • the invention relates to the field of voice synthesis or reproduction with controllable emotional content. More particularly, the invention relates to a method and device for controllably adding an emotional feel to a synthesised or sampled voice, with a view to providing a more natural or interesting delivery to talking or other sound emitting objects.
  • the possibility of adding an emotion content to delivered speech is also useful for computer aided systems which read texts or speeches for persons who cannot read for one reason or another. Examples are systems which read out novels, magazine articles, or the like, and whose listening pleasure and ability to hold the listener's attention can be enhanced if the reading voice can simulate emotions.
  • a first approach, which is the most complicated and probably the least satisfactory, is based on linguistic theories for determining intonations.
  • a second approach uses databases containing phrases tinted with different emotions produced by human speakers.
  • the nearest-sounding phrase with the corresponding emotion content is extracted from the database. Its pitch contour is measured, copied and applied to the selected phrase to be produced. This approach is mainly useable when the database and produced phrases have very close grammatical structures. It is also difficult to implement.
  • a third approach which is recognised as being the most effective, consists in utilising voice synthesisers which sample from a database of recorded human voices. These synthesisers operate by concatenating phonemes or short syllables produced by human voice to resynthesise sound sequences that correspond to a required spoken message. Instead of containing just neutral human voices, the database comprises voices spoken with different emotions.
  • these systems have two basic limitations. Firstly, they are difficult to implement, and secondly, the databases are usually created by voices from different persons, for practical reasons. This can be disadvantageous when listeners expect the synthesised voice always to appear to be coming from a same speaker.
  • there also exists voice synthesis software which allows a certain number of parameters to be controlled, but within a closed architecture which is not amenable to developing new applications.
  • the invention proposes a new approach which is easy to implement, provides convincing results and is easy to parameterise.
  • the invention also makes it possible to reproduce emotions in synthesised speech for meaningful speech contents in a recognisable language both in a naturally sounding voice and in deliberately distorted, exaggerated voices, for example as spoken by cartoon characters, talking animals or non-human animated forms, simply by playing on parameters.
  • the invention is also amenable to imparting emotions on voices that deliver meaningless sounds, such as babble.
  • the invention according to a first aspect proposes a method of synthesising an emotion conveyed on a sound, by selectively modifying at least one elementary sound portion thereof prior to delivering the sound, characterised in that the modification is produced by an operator application step in which at least one operator is selectively applied to at least one elementary sound portion to impose a specific modification in a characteristic thereof, such as pitch and/or duration, in accordance with an emotion to be synthesised.
  • the operator application step preferably comprises forming at least one set of operators, the set comprising at least one operator to modify a pitch characteristic and/or at least one operator to modify a duration characteristic of the elementary sound portions.
  • the operator application step comprises applying:
  • the method can comprise a universal phase in which at least one operator is applied systematically to all elementary sound portions forming a determined sequence of the sound.
  • At least one operator can be applied with a same operator parameterisation to all elementary sound portions forming a determined sequence of the sound.
  • the method can comprise a probabilistic accentuation phase in which at least one operator is applied only to selected elementary sound portions chosen to be accentuated.
  • the selected elementary sound portions can be selected by a random draw from candidate elementary sound portions, preferably with a probability which is programmable.
  • the candidate elementary sound portions can be:
  • a same operator parameterisation may be used for the at least one operator applied in the probabilistic accentuation phase.
  • the method can comprise a first and last elementary sound portions accentuation phase in which at least one operator is applied only to a group of at least one elementary sound portion forming the start and end of said determined sequence of sound, the latter being e.g. a phrase.
  • the elementary portions of sound may correspond to a syllable or to a phoneme.
  • the determined sequence of sound can correspond to intelligible speech or to unintelligible sounds.
  • the elementary sound portions can be presented as formatted data values specifying a duration and/or at least one pitch value existing over determined parts of or all the duration of the elementary sound.
  • the operators can act to selectively modify the data values.
  • the method may be performed without changing the data format of the elementary sound portion data and upstream of an interpolation stage, whereby the interpolation stage can process data modified in accordance with an emotion to be synthesised in the same manner as for data obtained from an arbitrary source of elementary sound portions.
  • the invention provides a device for synthesising an emotion conveyed on a sound, using means for selectively modifying at least one elementary sound portion thereof prior to delivering the sound, characterised in that the means comprise operator application means for applying at least one operator to at least one of the elementary sound portions to impose a specific modification in a characteristic thereof in accordance with an emotion to be synthesised.
  • the invention provides a data medium comprising software module means for executing the method according to the first aspect mentioned above.
  • the invention is a development from work that forms the subject of earlier European patent application number 01 401 203.3 of the Applicant, filed on May 11, 2001 and to which the present application claims priority.
  • the above earlier application concerns a voice synthesis method for synthesising a voice in accordance with information from an apparatus having a capability of uttering and having at least one emotional model.
  • the method here comprises an emotional state discrimination step for discriminating an emotional state of the model of the apparatus having a capability of uttering, a sentence output step for outputting a sentence representing a content to be uttered in the form of a voice, a parameter control step for controlling a parameter for use in voice synthesis, depending upon the emotional state discriminated in the emotional state discrimination step, and a voice synthesis step for inputting, to a voice synthesis unit, the sentence output in the sentence output step and synthesising a voice in accordance with the controlled parameter.
  • the voice in the earlier application has a meaningless content.
  • when the emotional state of the emotional model becomes greater than a predetermined value, the sentence output step outputs the sentence and supplies it to the voice synthesis unit.
  • the sentence output step can output a sentence obtained at random for each utterance and supply it to the voice synthesis unit.
  • the sentences can include a number of phonemes, and the parameter can include a pitch, a duration, and an intensity of a phoneme.
  • the apparatus having the capability of uttering can be an autonomous type robot apparatus which acts in response to supplied input information.
  • the emotion model can be such as to cause the action in question.
  • the voice synthesis method can then further include the step of changing the state of the emotion model in accordance with the input information thereby determining the action.
  • the earlier application also proposes an autonomous apparatus, e.g. comprising a robot, which acts in accordance with supplied input information, comprising an emotional model which causes the action in question, emotional state discrimination means for discriminating the emotional state of the emotional model, sentence output means for outputting a sentence representing a content to be uttered in the form of a voice, parameter control means for controlling a parameter used in voice synthesis depending upon the emotional state discriminated by the emotional state discrimination means, and voice synthesis means which receive the sentence output from the sentence output means and resynthesise a voice in accordance with the controlled parameter.
  • the Applicant's research consisted in providing a baby-like robot with means to express emotions vocally. Unlike most existing work, the Applicant has also studied the possibility of conferring emotions on cartoon-like meaningless speech, which has different needs and different constraints from, for example, trying to produce naturally sounding, adult-like normal emotional speech. For example, a goal was that emotions can be recognised by people with different cultural or linguistic backgrounds.
  • the approach uses concatenative speech synthesis, and the algorithm is simpler and more completely specified than those used in other studies, such as that conducted by Breazal.
  • the first result indicates that the goal of making a machine express affect both with meaningless speech and in a way recognisable by people from different cultures with the accuracy of a human speaker is attainable in theory.
  • the second result shows that a perfect result cannot be expected.
  • the fact that humans are not so good is mainly explained by the fact that several emotional states have very similar physiological correlates, and thus acoustic correlates. In actual situations, humans resolve the ambiguities by using the context and/or other modalities.
  • MBR-PSOLA: Text-to-Speech synthesis based on an MBE re-synthesis of the segments database, Speech Communication.
  • the MBROLA software is freely available on the web at http://tcts.fpms.ac.be/synthesis/mbrola.html; it is an enhancement of more traditional PSOLA techniques (it produces fewer distortions when pitch is manipulated).
  • the price of quality is that very little control is possible over the signal, but this is compatible with a need for simplicity.
  • the approach adopted in the invention is - from an algorithmic point of view - completely generative (it does not rely on recordings of human speech that would serve as input), and uses concatenative speech synthesis as a basis. It has been found to express emotions as efficiently as formant synthesis, yet with simpler controls and a more life-like signal quality.
  • An algorithm developed by the Applicant consists in generating a meaningless sentence and specifying the pitch contour and the durations of phonemes (the rhythm of the sentence). For the sake of simplicity, only one pitch target is specified per phoneme, which is often sufficient.
  • the program generates a file as shown in table I below, which is fed into the MBROLA speech synthesiser.
  • the idea of the algorithm is to initially generate a sentence composed of random words, each word being composed of random syllables (of type CV or CCV). Initially, the duration of all phonemes is constant and the pitch of each phoneme is constant and equal to a predetermined value, to which noise is added, which is advantageous to make the speech sound natural. Many different kinds of noise were experimented with, and it was found that the type of noise used does not make a significant difference; for the perceptual experiment reported below, Gaussian noise was used.
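  • By way of illustration, the following Python sketch generates such a random babble sentence as an MBROLA-style listing (one phoneme per line, a duration in milliseconds and a single pitch target). It is a minimal sketch only: the phoneme inventory, the parameter values and the use of "_" as a pause symbol are assumptions made for the example, not values taken from the description above.

```python
import random

# Illustrative phoneme inventory (an assumption, not taken from the patent).
CONSONANTS = list("bdgkmnpst")
VOWELS = list("aeiou")

def random_syllable():
    """Return a random CV or CCV syllable as a list of phonemes."""
    pattern = random.choice(["CV", "CCV"])
    return [random.choice(CONSONANTS) for _ in pattern[:-1]] + [random.choice(VOWELS)]

def generate_pho(n_words=4, base_pitch=200, base_dur=120, noise=15.0):
    """Build a meaningless sentence and emit MBROLA-style lines:
    one line per phoneme with a duration (ms) and one pitch target placed
    at 50% of the phoneme, plus Gaussian noise for naturalness."""
    lines = []
    for _ in range(n_words):
        for _ in range(random.randint(1, 3)):          # 1-3 syllables per word
            for ph in random_syllable():
                dur = base_dur + random.gauss(0, noise)
                pitch = base_pitch + random.gauss(0, noise)
                lines.append(f"{ph} {dur:.0f} 50 {pitch:.0f}")
        lines.append("_ 60")                            # short pause between words
    return "\n".join(lines)

if __name__ == "__main__":
    print(generate_pho())
```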
  • the sentence's pitch and duration information are then altered so as to yield a particular affect. Distortions consist in deciding that a number of syllables become stressed, and in applying a certain stress contour to these syllables as well as some duration modifications. Also, a certain default pitch contour and duration deformation are applied to all syllables.
  • a key aspect of this algorithm resides in its stochastic parts: on the one hand, they allow a different utterance to be produced each time for a given set of parameters (mainly by virtue of the random number of words, the random constituent phonemes of syllables and the probabilistic attribution of accents); on the other hand, details like adding noise to the duration and pitch of phonemes (see lines 14 and 15 of the program shown in figure 1, where random(n) means "random number between 0 and n") are advantageous for the naturalness of the vocalisations (if they remain fixed, then one perceives clearly that this is a machine talking). Finally, accents are implemented only by changing the pitch and not the loudness.
  • a last step is added to the algorithm in order to get a voice typical of a young creature: the sound file sampling rate is overridden by setting it to 30000 or 35000 Hz, as compared to the 16000 Hz produced by MBROLA (this is equivalent to playing the file faster).
  • to compensate, the speech rate is initially made slower in the program sent to MBROLA, so that in effect only the voice quality and pitch are modified.
  • This last step is preferable, since no child voice database exists for MBROLA (which is understandable, since making one would be difficult with a child). Accordingly, a female adult voice was chosen.
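  • A minimal sketch of the sampling-rate override described above, assuming the synthesiser output is a standard WAV file; the file names are hypothetical. Only the header rate is rewritten, so the samples are untouched and the file simply plays back faster and higher-pitched.

```python
import wave

def raise_playback_rate(src_path, dst_path, new_rate=30000):
    """Rewrite a WAV file header with a higher sampling rate (e.g. 30000 Hz
    instead of the 16000 Hz produced by the synthesiser), which is equivalent
    to playing the file faster, giving the 'young creature' voice effect."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params._replace(framerate=new_rate))  # only the rate changes
        dst.writeframes(frames)

# Example (hypothetical file names):
# raise_playback_rate("mbrola_out.wav", "child_voice.wav")
```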
  • table II gives examples of values of the parameters obtained for five affects: calm, anger, sadness, happiness and comfort.
  • Figure 2 shows how these emotions are positioned in a chart which represents an "emotional space", in which the parameters "valence" and "excitement" are expressed respectively along vertical and horizontal axes 2 and 4.
  • the valence axis ranges from negative to positive values, while the excitement axis ranges from low to high values.
  • the cross-point O of these axes is at the centre of the chart and corresponds to a calm/neutral state.
  • happy/praising (quadrant Q1), characterised by positive valence and high excitement;
  • comfort/soothing (quadrant Q2), characterised by positive valence and low excitement;
  • sad (quadrant Q3), characterised by negative valence and low excitement;
  • angry/admonishing (quadrant Q4), characterised by negative valence and high excitement.
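  • For illustration, the chart of figure 2 can be represented as a small table mapping each affect to a (valence, excitement) coordinate; the numeric values below are assumptions chosen only to place each affect in the correct quadrant, not values taken from the patent.

```python
# Illustrative placement of the five affects in the valence/excitement space.
EMOTION_SPACE = {
    "calm":    (0.0,  0.0),   # origin O: calm/neutral state
    "happy":   (+0.7, +0.7),  # quadrant Q1: positive valence, high excitement
    "comfort": (+0.7, -0.5),  # quadrant Q2: positive valence, low excitement
    "sad":     (-0.7, -0.5),  # quadrant Q3: negative valence, low excitement
    "angry":   (-0.7, +0.7),  # quadrant Q4: negative valence, high excitement
}

def quadrant(emotion):
    """Return the quadrant label from the sign of valence and excitement."""
    v, e = EMOTION_SPACE[emotion]
    if v == 0 and e == 0:
        return "neutral"
    return {(True, True): "Q1", (True, False): "Q2",
            (False, False): "Q3", (False, True): "Q4"}[(v > 0, e > 0)]
```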
  • the method and device in accordance with the invention are a development of the above concepts.
  • the idea resides in controlling at least one of the pitch contour, the intensity contour and the rhythm of a phrase produced by voice synthesis.
  • the inventive approach is relatively exhaustive and can easily be reproduced by other workers.
  • the preferred embodiments are developed from freely available software modules that are well documented, simple to use and for which there are many equivalent technologies. Accordingly, the modules produced by these embodiments of the invention are totally transparent.
  • the embodiments allow total, or at least a high degree of, control of pitch contour, rhythm (duration of phonemes), etc.
  • the approach in accordance with the present invention is based on considering a phrase as a succession of syllables.
  • the phrase can be speech in a recognised language, or simply meaningless utterances.
  • f0: the pitch contour;
  • volume: the intensity contour;
  • duration: the duration of the syllable.
  • at least the control of the intensity is not strictly necessary, as a modification in the pitch can give the impression of a change in intensity.
  • the problem is then to determine these contours - pitch contour, duration, and possibly intensity contour - throughout a sentence so as to produce an intonation that corresponds to a given emotion.
  • the concept behind the solution is to start off from a phrase having a set contour (f0), a set intensity and a set duration for each syllable.
  • This reference phrase can be produced from a voice synthesiser for a recognised language, giving an initial pitch contour (f0), an initial duration (t) and possibly an initial intensity.
  • alternatively, the reference phrase can be composed of meaningless utterances, such as babble from infants.
  • each syllable to be synthesised can be encoded as follows (case of the syllable "be", characterised in terms of a duration and five successive pitch values within that duration):
  • the above data is contained in a frame simply by encoding the parameters: be; 100, 80, 100, 120, 90, 230, each value being identified by the synthesiser according to the protocol.
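  • A hedged sketch of how such a frame might be represented and parsed in software; the separators follow the example above, while the class and function names are purely illustrative.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SyllableData:
    """One elementary sound portion as encoded in the vocalisation data file:
    an identifier, a duration in milliseconds and a handful of pitch targets
    (fundamental frequency in Hz) spread over that duration."""
    syllable: str
    duration_ms: float
    pitches: List[float]

def parse_frame(frame: str) -> SyllableData:
    """Parse a frame such as 'be; 100, 80, 100, 120, 90, 230': the first
    number is the duration, the remaining numbers are pitch targets."""
    name, numbers = frame.split(";")
    values = [float(v) for v in numbers.split(",")]
    return SyllableData(name.strip(), values[0], values[1:])

# parse_frame("be; 100, 80, 100, 120, 90, 230")
# -> SyllableData(syllable='be', duration_ms=100.0,
#                 pitches=[80.0, 100.0, 120.0, 90.0, 230.0])
```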
  • Figure 3 shows the different stages by which these digital data are converted into a synthesised sound output.
  • a voice message is composed in terms of a succession of syllables to be uttered.
  • the message can be intelligible words forming grammatical sentences conveying meaning in a given recognised language, or meaningless sounds such as babble, animal-like sounds, or totally imaginary sounds.
  • the syllables are encoded in the above-described digital data format in a vocalisation data file 10.
  • a decoder 12 reads out the successive syllable data from the data file 10.
  • Figure 4a shows graphically how these data are organised by the decoder 12 in terms of a coordinate grid with pitch fundamental frequency (in Hertz) along the ordinate axis and time (in milliseconds) along the abscissa.
  • the area of the grid is divided into five columns, one for each of the five successive pitch values, as indicated by the arrowed lines.
  • the pitch value is placed at the centre of each column.
  • the syllable data are transferred to an interpolator 14 which produces from the five elementary pitch values P1-P5 a close succession of interpolated pitch values, using standard interpolation techniques.
  • the result is a relatively smooth curve of the evolution of pitch over the 100 ms duration of the syllable "be", as shown in figure 4b.
  • the process is repeated for each inputted syllable data, to produce a continuous pitch curve over successive syllables of the phrase.
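  • The interpolation stage can be sketched as follows, assuming plain linear interpolation between pitch targets placed at the centres of equal-length columns (cf. figure 4a); the actual interpolator 14 may use any standard interpolation technique, and the function name is illustrative.

```python
import numpy as np

def interpolate_pitch(duration_ms, pitch_targets, step_ms=1.0):
    """Turn the handful of pitch targets of one syllable into a dense pitch
    curve.  Each target is placed at the centre of its own column of the
    syllable's duration; linear interpolation is used for this sketch."""
    n = len(pitch_targets)
    col = duration_ms / n
    centres = np.array([col * (i + 0.5) for i in range(n)])   # x of P1..Pn
    t = np.arange(0.0, duration_ms, step_ms)                  # dense time axis (ms)
    return t, np.interp(t, centres, pitch_targets)

# t, f0 = interpolate_pitch(100, [80, 100, 120, 90, 230])
# -> a relatively smooth pitch curve over the 100 ms of the syllable "be"
```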
  • the pitch waveform thus produced by the interpolator is supplied to an audio frequency sound processor 16 which generates a corresponding modulated amplitude audio signal.
  • the sound processor may also add some random noise to the final audio signal to give a more realistic effect to the synthesised sound, as explained above.
  • This final audio signal is supplied to an audio amplifier 18 where its level is raised to a suitable volume, and then outputted on a loudspeaker 20 which thus reproduces the synthesised sound data from vocalisation data file 10.
  • part of the syllable data associated with the syllables will normally include an indication of which syllables may be accentuated to give a more naturally sounding delivery.
  • the pitch values contained in the syllable data correspond to a "neutral" form of speech, i.e. not charged with a discernible emotion.
  • Figure 5 is a block diagram showing in functional terms how an emotion generator 22 of the preferred embodiment integrates with the synthesiser 1 shown in figure 3.
  • the emotion generator 22 operates by selectively applying operators on the syllable data read out from the vocalisation data file 10. Depending on their type, these operators can modify either the pitch data (pitch operator) or the syllable duration data (duration operator). These modifications take place upstream of the interpolator 14, e.g. before the decoder 12, so that the interpolation is performed on the operator-modified values. As explained below, the modification is such as to transform selectively a neutral form of speech into a speech conveying a chosen emotion (sad, calm, happy, angry) in a chosen quantity.
  • the basic operator forms are stored in an operator set library 24, from which they can be selectively accessed by an operator set configuration unit 26.
  • the latter serves to prepare and parameterise the operators in accordance with current requirements.
  • an operator parameterisation unit 28 determines the parameterisation of the operators in accordance with: i) the emotion to be imprinted on the voice (calm, sad, happy, angry, etc.), ii) possibly the degree - or intensity - of the emotion to apply, and iii) the context of the syllable, as explained below.
  • the emotion and degree of emotion are instructed to the operator parameterisation unit 28 by an emotion selection interface 30 which presents inputs accessible to a user 32.
  • the emotion selection interface can be in the form of a computer interface with on-screen menus and icons, allowing the user 32 to indicate all the necessary emotion characteristics and other operating parameters.
  • the operator-sensitive context of the syllable is: i) the position of the syllable in a phrase, as some operator sets are applied only to the first and last syllables of the phrase, ii) whether the syllables relate to intelligible word sentences or to unintelligible sounds (babble, etc.), and iii) as the case arises, whether or not the syllable considered is allowed to be accentuated, as indicated in the vocalisation data file 10.
  • a first and last syllables identification unit 34 and an authorised syllable accentuation detection unit 36 are provided, both having access to the vocalisation data file unit 10 and informing the operator parameterisation unit 28 of the appropriate context-sensitive parameters.
  • the random selection is provided by a controllable probability random draw unit 38 operatively connected between the authorised syllable accentuation unit 36 and the operator parameterisation unit 28.
  • the random draw unit 38 has a controllable degree of probability of selecting a syllable from the candidates. Specifically, if N is the probability of a candidate being selected, with N ranging controllably from 0 to 1, then for P candidate syllables, N·P syllables will be selected on average to be subjected to a specific operator set associated with random accentuation. The distribution of the randomly selected candidates is substantially uniform over the sequence of syllables.
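  • A minimal sketch of such a controllable-probability draw; the function name is illustrative.

```python
import random

def draw_accented(candidate_indices, probability):
    """Random draw unit 38: each candidate syllable is retained independently
    with the programmed probability N (0 = never, 1 = always), so that for P
    candidates roughly N*P syllables are accented, spread substantially
    uniformly over the phrase."""
    return [i for i in candidate_indices if random.random() < probability]

# draw_accented(range(12), 0.3)   # e.g. [1, 5, 6, 10]
```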
  • the suitably configured operator sets from the operator set configuration unit 26 are sent to a syllable data modifier unit 40 where they operate on the syllable data.
  • the syllable data modifier unit 40 receives the syllable data directly from vocalisation data file 10, in a manner analogous to the decoder 12 of figure 3.
  • the thus-received syllable data are modified by unit 40 as a function of the operator set, notably in terms of pitch and duration data.
  • the resulting modified syllable data (new syllable data) are then outputted by the syllable data modifier unit 40 to the decoder 12, with the same structure as presented in the vocalisation data file (cf. figure 2a).
  • the decoder can process the new syllable data exactly as if it originated directly from the vocalisation data file. From there, the new syllable data are interpolated (interpolator unit 14) and processed by the other downstream units of figure 3 in exactly the same way. However, the sound produced at the speaker then no longer corresponds to a neutral tone, but rather to the sound with a simulation of an emotion as defined by the user 32.
  • All the above functional units are under the overall control of an operations sequencer unit 42 which governs complete execution of the emotion generation procedure in accordance with a prescribed set of rules.
  • Figure 6 illustrates graphically the effect of the pitch operator set OP on a pitch curve (as in figure 4b) of a synthesised sound.
  • the figure shows - respectively on left and right columns - a pitch curve (fundamental frequency f against time t) before the action of the pitch operator and after the action of a pitch operator.
  • the input pitch curves are identical for all operators and happen to be relatively flat.
  • the rising slope and falling slope operators OPrs and OPfs have the following characteristic: the pitch at the central point in time (1/2 t1 for a pitch duration of t1) remains substantially unchanged after application of the operator. In other words, the operators act to pivot the input pitch curve about the pitch value at the central point in time, so as to impose the required slope. This means that in the case of a rising slope operator OPrs, the pitch values before the central point in time are in fact lowered, and that in the case of a falling slope operator OPfs, the pitch values before the central point in time are in fact raised, as shown by the figure.
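  • The pitch operators can be sketched as follows, assuming the syllable's pitch is held as a short list of targets as described earlier; the slope parameter and its units are illustrative assumptions of the sketch.

```python
def apply_shift(pitches, amount):
    """Shift-up / shift-down operator (OPsu / OPsd): add or subtract the same
    parameterised amount (Hz) to every pitch target of the portion."""
    return [p + amount for p in pitches]

def apply_slope(pitches, slope):
    """Rising / falling slope operator (OPrs / OPfs): pivot the pitch targets
    about the central point in time, so the middle value stays put while
    earlier targets move one way and later targets move the other.  'slope'
    is an illustrative parameter in Hz per target position; its sign selects
    rising (positive) or falling (negative)."""
    mid = (len(pitches) - 1) / 2.0
    return [p + slope * (i - mid) for i, p in enumerate(pitches)]

# apply_slope([80, 100, 120, 90, 230], +10)
# -> [60.0, 90.0, 120.0, 100.0, 250.0]  (first targets lowered, last raised)
```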
  • intensity operators, designated OI, are illustrated in figure 7, which is directly analogous to the illustration of figure 6.
  • These operators are also four in number and are identical to those of the pitch operators OP, except that they act on the curve of intensity I over time t. Accordingly, these operators shall not be detailed separately, for the sake of conciseness.
  • the pitch and intensity operators can each be parameterised as follows :
  • Figure 8 illustrates graphically the effect of a duration (or time) operator OD on a syllable.
  • the illustration shows, in left and right columns respectively, the duration of the input syllable (in terms of a horizontal line expressing an initial length of time t1) before and after the effect of a duration operator.
  • the duration operator can be:
  • the operator can also be neutralised or made as a neutral operator, simply by inserting the value 0 for the parameter D.
  • although the duration operator has been represented as being of two different types, respectively dilation and contraction, it is clear that the only difference resides in the sign, plus or minus, placed before the parameter D.
  • a same operator mechanism can produce both operator functions (dilation and contraction) if it can handle both positive and negative numbers.
  • the range of possible values for D and its possible incremental values in the range can be chosen according to requirements.
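  • A sketch of the duration operator, under the assumption that D is an additive offset in milliseconds (a relative factor would work equally well); the floor of 1 ms is added only to keep the sketch safe.

```python
def apply_duration(duration_ms, d):
    """Dilation / contraction operator (ODd / ODc): lengthen or shorten the
    elementary sound portion by the parameter D.  The sign of D selects the
    direction (positive = dilation, negative = contraction) and D = 0 gives
    a neutral operator."""
    return max(1.0, duration_ms + d)

# apply_duration(100, +30) -> 130.0    # dilation
# apply_duration(100, -30) -> 70.0     # contraction
```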
  • the embodiment further uses a separate operator, which establishes the probability N for the random draw unit 38.
  • This value is selected from a range of 0 (no possibility of selection) to 1 (certainty of selection).
  • the value N serves to control the density of accentuated syllables in the vocalised output as appropriate for the emotional quality to reproduce.
  • Figures 9A and 9B constitute a flow chart indicating the process of forming and applying selectively the above operators to syllable data on the basis of the system described with reference to figure 5.
  • Figure 9B is a continuation of figure 9A.
  • the process starts with an initialisation phase P1 which involves loading input syllable data from the vocalisation data file 10 (step S2).
  • the data appear as an identification of the syllable, e.g. "be", followed by a first value t1 expressing the normal duration of the syllable, followed by five values P1 to P5 indicating the fundamental frequency of the pitch at five successive intervals of the indicated duration t1, as explained with reference to figure 4a.
  • in step S4, the emotion to be conveyed on the phrase or passage of which the loaded syllable data forms a part is loaded, using the interface unit 30.
  • the emotions can be calm, sad, happy, angry, etc.
  • the interface also inputs the degree of emotion to be given, e.g. by attributing a weighting value (step S6).
  • the system then enters into a universal operator phase P2, in which a universal operator set OS(U) is applied systematically to all the syllables.
  • the universal operator set OS(U) contains all the operators of figures 6 and 8, i.e. OPrs, OPfs, OPsu, OPsd, forming the four pitch operators, plus ODd and ODc, forming the two duration operators.
  • Each of these operators of operator set OS(U) is parameterised by a respective associated value, respectively Prs(U), Pfs(U), Psu(U), Psd(U), Dd(U), and Dc(U), as explained above (step S8).
  • This step involves attributing numerical values to these parameters, and is performed by the operator set configuration unit 26.
  • the choice of parameter values for the universal operator set OS(U) is determined by the operator parameterisation unit 28 as a function of the programmed emotion and quantity of emotion, plus other factors as the case arises.
  • the universal operator set OS(U) is then applied systematically to all the syllables of a phrase or group of phrases (step S10).
  • the action involves modifying the numerical values t1, P1-P5 of the syllable data.
  • the slope parameter Prs or Pfs is translated into a group of five difference values to be applied arithmetically to the values P1-P5 respectively. These difference values are chosen to move each of the values P1-P5 according to the parameterised slope, the middle value P3 remaining substantially unchanged, as explained earlier.
  • in the case of the rising slope, the first two difference values will be negative, to cause the first half of the pitch to be lowered, and the last two will be positive, to cause the last half of the pitch to be raised, so creating the rising slope articulated at the centre point in time, as shown in figure 6.
  • the degree of slope forming the parameterisation is expressed in terms of these difference values.
  • a similar approach in reverse is used for the falling slope parameter.
  • the shift up or shift down operators can be applied before or after the slope operators. They simply add or subtract a same value, determined by the parameterisation, to the five pitch values P1-P5.
  • the operators form mutually exclusive pairs, i.e. a rising slope operator will not be applied if a falling slope operator is to be applied, and likewise for the shift up and down and duration operators.
  • the application of the operators, i.e. the calculations that modify the data parameters t1, P1-P5, is performed by the syllable data modifier unit 40.
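  • Combining the above, one pass of the syllable data modifier unit over a phrase might look like the following sketch; the parameter names ("slope", "shift", "duration") and the packing of an operator set into a dict are assumptions of the sketch, not identifiers from the description.

```python
def apply_operator_set(syllable, params):
    """Apply one parameterised operator set (such as OS(U)) to one syllable:
    the slope parameter is turned into one difference value per pitch target
    (middle target unchanged), the shift parameter is added uniformly, and
    the duration parameter stretches or shrinks t1.
    'syllable' is a (duration_ms, [P1..P5]) pair."""
    t1, pitches = syllable
    mid = (len(pitches) - 1) / 2.0
    diffs = [params.get("slope", 0.0) * (i - mid) for i in range(len(pitches))]
    new_pitches = [p + d + params.get("shift", 0.0) for p, d in zip(pitches, diffs)]
    return t1 + params.get("duration", 0.0), new_pitches

# Universal phase: apply the same parameterisation to every syllable of the phrase.
# phrase = [(100, [80, 100, 120, 90, 230]), ...]
# modified = [apply_operator_set(s, {"slope": 8, "shift": 20, "duration": -10})
#             for s in phrase]
```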
  • the system then enters into a probabilistic accentuation phase P3, for which another accentuation operator set OS(PA) is prepared.
  • This operator set has the same operators as the universal operator set, but with different values for the parameterisation.
  • the operator set OS(PA) is parameterised by respective values: Prs(PA), Pfs(PA), Psu(PA), Psd(PA), Dd(PA), and Dc(PA).
  • Prs(PA), Pfs(PA), Psu(PA), Psd(PA), Dd(PA), and Dc(PA) are likewise calculated by the operator parameterisation unit 28 as a function of the emotion, degree of emotion and other factors provided by the interface unit 30.
  • the choice of the parameters is generally made to add a degree of intonation (prosody) to the speech according to the emotion considered.
  • next, it is determined which of the syllables are to be submitted to this operator set OS(PA), as drawn by the random unit 38 (step S14).
  • the latter supplies the list of the randomly drawn syllables for accentuating by this operator set.
  • the candidate syllables are:
  • the randomly selected syllables among the candidates are then submitted for processing by the probabilistic accentuation operator set OS(PA) by the syllable data modifier unit 40 (step S16).
  • the actual processing performed is the same as explained above for the universal operator set, with the same technical considerations, the only difference being in the parameter values involved.
  • the syllable data modifier unit 40 will supply the following modified forms of the syllable data (generically denoted S) originally in the file 10:
  • the system then enters a phase P4 of processing an accentuation specific to the first and last syllables of a phrase.
  • this phase P4 acts to accentuate all the syllables of the first and last words of the phrase.
  • the term phrase can be understood in the normal grammatical sense for intelligible text to be spoken, e.g. in terms of pauses in the recitation.
  • a phrase is understood in terms of a beginning and end of the utterance, marked by a pause. Typically, such a phrase can last from around one to three or four seconds.
  • the phase P4 of accentuating the first and last syllables applies to at least the first and last syllables, and preferably to the first m and last n syllables, where m and n are typically equal to around 2 or 3 and can be the same or different.
  • the resulting operator set OS(FL) is then applied to the first and last syllables of each phrase (step S20), these syllables being identified by the first/last syllables detector unit 34.
  • the syllable data on which operator set OS(FL) is applied will have previously been processed by the universal operator set OS(U) at step S10. Additionally, it may happen that a first or last syllable has also been drawn at the random selection step S14 and has thereby also been processed by the probabilistic accentuation operator set OS(PA).
  • the parameterisation is of the same general type for all operator sets, only the actual values being different.
  • the values are usually chosen so that the least amount of change is produced by the universal operator set, and the largest amount of change is produced by the first and last syllable accentuation, the probabilistic accentuation operator set producing an intermediate amount of change.
  • the system can also be made to use intensity operators OI in its set, depending on the parameterisation used.
  • the interface unit 30 can be integrated into a computer interface to provide different controls. Among these can be direct choice of parameters of the different operator sets mentioned above, in order to allow the user 32 to fine-tune the system.
  • the interface can be made user friendly by providing visual scales, showing e.g. graphically the slope values, shift values, contraction/dilation values for the different parameters.
  • the examples are illustrated for a given format of speech data, but it is clear that any other formatting of data can be accommodated.
  • the number of pitch or intensity values given in the examples can be different from 5, typical numbers of values ranging from just one to more than five.
  • the embodiment can be implemented in a large variety of devices, for instance: robotic pets and other intelligent electronic creatures, sound systems for educational training, studio productions (dubbing, voice animations, narration, etc.), devices for reading texts out loud (books, articles, mail, etc.), sound experimentation systems (psycho-acoustic research, etc.), humanised computer interfaces for PCs, instruments and other equipment, and other applications.
  • the form of the embodiment can range from a stand-alone unit fully equipped to provide complete synthesised sound reproduction (cf. figure 3), to an accessory operable with existing sound synthesising equipment, or to software modules recorded on a medium or in downloadable form to be run on adapted processor systems.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Toys (AREA)
  • Feedback Control In General (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)
EP20010401880 2001-05-11 2001-07-13 Verfahren und Vorrichtung um eine mittels eines Klangs übermittelte Emotion zu synthetisieren Expired - Lifetime EP1256932B1 (de)

Priority Applications (6)

Application Number Priority Date Filing Date Title
EP20010401880 EP1256932B1 (de) 2001-05-11 2001-07-13 Verfahren und Vorrichtung um eine mittels eines Klangs übermittelte Emotion zu synthetisieren
DE2001631521 DE60131521T2 (de) 2001-05-11 2001-08-14 Verfahren und Vorrichtung zur Steuerung des Betriebs eines Geräts bzw. eines Systems sowie System mit einer solchen Vorrichtung und Computerprogramm zur Ausführung des Verfahrens
EP20010402176 EP1256933B1 (de) 2001-05-11 2001-08-14 Verfahren und Vorrichtung zur Steuerung eines Emotionssynthesegeräts
US10/192,974 US20030093280A1 (en) 2001-07-13 2002-07-11 Method and apparatus for synthesising an emotion conveyed on a sound
JP2002206013A JP2003177772A (ja) 2001-07-13 2002-07-15 感情合成装置の処理を制御する方法及び装置
JP2002206012A JP2003084800A (ja) 2001-07-13 2002-07-15 音声による感情合成方法及び装置

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
EP01401203A EP1256931A1 (de) 2001-05-11 2001-05-11 Verfahren und Vorrichtung zur Sprachsynthese und Roboter
EP01401203 2001-05-11
EP20010401880 EP1256932B1 (de) 2001-05-11 2001-07-13 Verfahren und Vorrichtung um eine mittels eines Klangs übermittelte Emotion zu synthetisieren

Publications (3)

Publication Number Publication Date
EP1256932A2 true EP1256932A2 (de) 2002-11-13
EP1256932A3 EP1256932A3 (de) 2004-10-13
EP1256932B1 EP1256932B1 (de) 2006-05-10

Family

ID=26077240

Family Applications (1)

Application Number Title Priority Date Filing Date
EP20010401880 Expired - Lifetime EP1256932B1 (de) 2001-05-11 2001-07-13 Verfahren und Vorrichtung um eine mittels eines Klangs übermittelte Emotion zu synthetisieren

Country Status (2)

Country Link
EP (1) EP1256932B1 (de)
DE (1) DE60131521T2 (de)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7809572B2 (en) 2005-07-20 2010-10-05 Panasonic Corporation Voice quality change portion locating apparatus
CN111816158A (zh) * 2019-09-17 2020-10-23 北京京东尚科信息技术有限公司 一种语音合成方法及装置、存储介质
CN113611326A (zh) * 2021-08-26 2021-11-05 中国地质大学(武汉) 一种实时语音情感识别方法及装置

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901598A (zh) * 2010-06-30 2010-12-01 北京捷通华声语音技术有限公司 一种哼唱合成方法和系统

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5860064A (en) * 1993-05-13 1999-01-12 Apple Computer, Inc. Method and apparatus for automatic generation of vocal emotion in a synthetic text-to-speech system
US6212502B1 (en) * 1998-03-23 2001-04-03 Microsoft Corporation Modeling and projecting emotion and personality from a computer user interface
EP1107227A2 (de) * 1999-11-30 2001-06-13 Sony Corporation Sprachverarbeitung

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GALANIS D ET AL: "Investigating emotional speech parameters for speech synthesis" ELECTRONICS, CIRCUITS, AND SYSTEMS, 1996. ICECS '96., PROCEEDINGS OF THE THIRD IEEE INTERNATIONAL CONFERENCE ON RODOS, GREECE 13-16 OCT. 1996, NEW YORK, NY, USA,IEEE, US, 13 October 1996 (1996-10-13), pages 1227-1230, XP010217293 ISBN: 0-7803-3650-X *
IGNASI IRIONDO ET AL: "VALIDATION OF AN ACOUSTICAL MODELLING OF EMOTIONAL EXPRESSION IN SPANISH USING SPEECH SYNTHESIS TECHNIQUES" PROCEEDINGS OF THE ISCA WORKSHOP ON SPEECH AND EMOTION, September 2000 (2000-09), pages 1-6, XP007005765 BELFAST, NORTHERN IRELAND *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7809572B2 (en) 2005-07-20 2010-10-05 Panasonic Corporation Voice quality change portion locating apparatus
CN111816158A (zh) * 2019-09-17 2020-10-23 北京京东尚科信息技术有限公司 一种语音合成方法及装置、存储介质
CN111816158B (zh) * 2019-09-17 2023-08-04 北京京东尚科信息技术有限公司 一种语音合成方法及装置、存储介质
CN113611326A (zh) * 2021-08-26 2021-11-05 中国地质大学(武汉) 一种实时语音情感识别方法及装置
CN113611326B (zh) * 2021-08-26 2023-05-12 中国地质大学(武汉) 一种实时语音情感识别方法及装置

Also Published As

Publication number Publication date
EP1256932B1 (de) 2006-05-10
EP1256932A3 (de) 2004-10-13
DE60131521T2 (de) 2008-10-23
DE60131521D1 (de) 2008-01-03

Similar Documents

Publication Publication Date Title
US20030093280A1 (en) Method and apparatus for synthesising an emotion conveyed on a sound
Pierre-Yves The production and recognition of emotions in speech: features and algorithms
DE60119496T2 (de) Verfahren und Vorrichtung um eine mittels eines Klangs übermittelte Emotion zu synthetisieren
Burkhardt et al. Verification of acoustical correlates of emotional speech using formant-synthesis
JP4363590B2 (ja) 音声合成
JP4458321B2 (ja) 感情認識方法および感情認識装置
Cahn The generation of affect in synthesized speech
Theune et al. Generating expressive speech for storytelling applications
US5860064A (en) Method and apparatus for automatic generation of vocal emotion in a synthetic text-to-speech system
Ladd et al. Evidence for the independent function of intonation contour type, voice quality, and F 0 range in signaling speaker affect
Nose et al. HMM-based expressive singing voice synthesis with singing style control and robust pitch modeling
Mareüil et al. Generation of emotions by a morphing technique in English, French and Spanish
Hill et al. Low-level articulatory synthesis: A working text-to-speech solution and a linguistic tool1
Gahlawat et al. Natural speech synthesizer for blind persons using hybrid approach
US7457752B2 (en) Method and apparatus for controlling the operation of an emotion synthesizing device
EP1256932B1 (de) Verfahren und Vorrichtung um eine mittels eines Klangs übermittelte Emotion zu synthetisieren
Keller Towards greater naturalness: Future directions of research in speech synthesis
Lobanov et al. TTS-Synthesizer as a Computer Means for Personal Voice Cloning (On the example of Russian)
Gahlawat et al. Integrating human emotions with spatial speech using optimized selection of acoustic phonetic units
Olaszy The most important prosodic patterns of Hungarian
Oudeyer The synthesis of cartoon emotional speech
Vine et al. Synthesis of emotional speech using RP-PSOLA
Makarova et al. Phonetics of emotion in Russian speech
Suchato et al. Digital storytelling book generator with customizable synthetic voice styles
Henton et al. Generating and manipulating emotional synthetic speech on a personal computer

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR

AX Request for extension of the european patent

Free format text: AL;LT;LV;MK;RO;SI

PUAL Search report despatched

Free format text: ORIGINAL CODE: 0009013

AK Designated contracting states

Kind code of ref document: A3

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR

AX Request for extension of the european patent

Extension state: AL LT LV MK RO SI

RIC1 Information provided on ipc code assigned before grant

Ipc: 7G 10L 13/08 A

Ipc: 7G 10L 13/02 B

17P Request for examination filed

Effective date: 20050317

17Q First examination report despatched

Effective date: 20050429

AKX Designation fees paid

Designated state(s): DE FR GB

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

GRAS Grant fee paid

Free format text: ORIGINAL CODE: EPIDOSNIGR3

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): DE FR GB

REG Reference to a national code

Ref country code: GB

Ref legal event code: FG4D

REF Corresponds to:

Ref document number: 60119496

Country of ref document: DE

Date of ref document: 20060614

Kind code of ref document: P

ET Fr: translation filed
PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

26N No opposition filed

Effective date: 20070213

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: FR

Payment date: 20110729

Year of fee payment: 11

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: GB

Payment date: 20110721

Year of fee payment: 11

Ref country code: DE

Payment date: 20110722

Year of fee payment: 11

GBPC Gb: european patent ceased through non-payment of renewal fee

Effective date: 20120713

REG Reference to a national code

Ref country code: FR

Ref legal event code: ST

Effective date: 20130329

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: GB

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20120713

Ref country code: DE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20130201

Ref country code: FR

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20120731

REG Reference to a national code

Ref country code: DE

Ref legal event code: R119

Ref document number: 60119496

Country of ref document: DE

Effective date: 20130201