CN102184731A - Method for converting emotional speech by combining rhythm parameters with tone parameters
- Publication number: CN102184731A
- Application number: CN2011101220344A
- Authority: CN (China)
- Prior art keywords: speech, emotional, energy, signal, voice
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention discloses a method for converting emotional speech by combining prosodic parameters (fundamental frequency, duration and energy) with a voice-quality parameter (the formant), which mainly comprises the following steps: 1, extracting and analyzing feature parameters from emotional speech samples of the Beihang University emotional speech database (BHUDES), comprising neutral speech and four types of emotional speech (sadness, anger, happiness and surprise); 2, formulating emotional speech conversion rules from the extracted feature parameters and defining each conversion constant; 3, extracting feature parameters from the neutral speech to be converted and performing pitch-synchronous marking; 4, setting each conversion constant according to the conversion rules of step 2, modifying the fundamental-frequency contour, duration and energy, and synthesizing the speech signal by pitch-synchronous overlap-add; and 5, performing linear predictive coding (LPC) analysis on the speech signal of step 4 and modifying the formants through the poles of the transfer function, finally obtaining emotional speech rich in expressiveness.
Description
Technical field
The present invention relates to the fields of speech signal processing and artificial intelligence, and in particular to an emotional speech conversion method that combines prosodic parameters with voice-quality parameters.
Background technology
Speech synthesis is an important component of human-computer interaction. What people want to hear is no longer a dull machine voice that is merely highly intelligible, but expressive speech that can convey emotion. Existing speech synthesis is still at the stage of converting text to speech (TTS: Text to Speech), and the emotional information in speech is not well expressed.
In addition, emotional speech can be combined with other multimedia technologies, for example pairing emotional speech with corresponding facial features so that sound and expression are synchronized; this is the currently popular "visual speech" technology.
Extracting affective features from speech signals, relating human emotion to the speech signal, and applying affective features to speech synthesis is a research topic that has arisen only recently in this field at home and abroad, and many modeling problems remain unsolved. Emotional speech synthesis lies at the intersection of affective computing and speech synthesis; speech synthesis has been studied for a long time, while affective computing is a relatively young research field.
PSOLA (Pitch Synchronous Overlap Add) is a waveform-concatenation algorithm used in speech synthesis. It differs in principle from earlier waveform concatenation: before splicing speech units, the algorithm adjusts the fundamental frequency, duration and energy of each concatenation unit, and the waveform is modified in units of pitch periods rather than of the traditional frame length, taking the integrity of the pitch period as the basic premise for keeping the waveform and spectrum smooth and continuous.
In emotional speech conversion, the application of PSOLA is not yet mature: it can only modify the prosodic parameters of the speech signal and cannot change voice-quality parameters. Proposing a more effective conversion method therefore has strong practical significance.
Summary of the invention
The present invention proposes a method that converts prosodic parameters and voice-quality parameters simultaneously to accomplish emotional speech conversion.
The main content of the invention is: extract and statistically analyze feature parameters of emotional speech samples, formulate conversion rules, modify the fundamental-frequency contour and formant positions of the speech according to the rules, and thereby convert neutral speech into four kinds of emotional speech (sadness, anger, happiness and surprise).
The specific steps of the method are as follows:
Step 1: extract and analyze the feature parameters of the emotional speech samples (comprising neutral speech and four kinds of emotional speech: sadness, anger, happiness and surprise);
Step 2: formulate the emotional speech conversion rules according to the extracted feature parameters and define each conversion constant;
Step 3: extract the feature parameters of the neutral speech to be converted and perform pitch-synchronous marking;
Step 4: set the modification parameters according to the emotion conversion rules of step 2, modify the fundamental-frequency contour, duration and energy, and synthesize the speech signal by pitch-synchronous overlap-add;
Step 5: perform LPC analysis on the speech signal of step 4 and modify the formants by changing the poles of the transfer function.
In step 1, the chosen corpus is BHUDES (the Beihang University emotional speech database), and the extracted feature parameters include fundamental frequency, duration, energy and formants.
In step 2, the fundamental frequency, duration, energy and other feature parameters of the neutral speech and of the four kinds of emotional speech are extracted separately, and the conversion rules are derived through statistics.
On the basis of these conversion rules, the constants UP_POSITION (rise position), DOWN_POSITION (fall position), MEANf0 (overall fundamental-frequency change), DUR_POSITION (lengthening position), DUR_LEN (lengthening length) and ENERGY_SCALE (energy factor) are defined.
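As an illustration only, these constants could be collected into a per-emotion rule table. The sketch below is hypothetical Python; the patent does not disclose the numeric values, so every number here is a placeholder:

```python
# Hypothetical rule table for one emotion; all numeric values are placeholders,
# not values disclosed in the patent.
SURPRISE_RULE = {
    "UP_POSITION": 0.8,    # fraction of the utterance where the F0 tail rise begins
    "DOWN_POSITION": 0.0,  # fraction where an F0 fall begins (unused for surprise)
    "MEANf0": 1.15,        # overall fundamental-frequency scaling factor
    "DUR_POSITION": 0.5,   # position at which duration is modified
    "DUR_LEN": 0.85,       # duration scaling (surprised speech is shorter)
    "ENERGY_SCALE": 1.1,   # overall energy factor
}
```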
In step 3, the input speech signal x(n) is first subjected to speech/silence segmentation and a voiced/unvoiced decision.
Speech and silent segments are separated with a double-threshold method based on short-time energy and short-time zero-crossing rate.
The voiced/unvoiced decision combines the prediction residual energy e_r with the first-order reflection coefficient r_1. The judgment condition is: if r_1 > 0.2 and e_r > threshold, the frame is voiced; otherwise it is unvoiced. Here N is the frame length and e_r is the residual energy after linear prediction.
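A minimal sketch of this frame-level decision, assuming an autocorrelation-method LPC (Levinson-Durbin) and taking r_1 as the lag-1 normalized autocorrelation; the function name and the residual-energy threshold are illustrative, not from the patent:

```python
import numpy as np

def is_voiced(frame, er_threshold, lpc_order=12):
    """Voiced/unvoiced decision for one frame: voiced iff the first-order
    reflection coefficient r1 > 0.2 AND the linear-prediction residual
    energy e_r exceeds a threshold (the judgment condition above)."""
    n = len(frame)
    # autocorrelation r[0..lpc_order]
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(lpc_order + 1)])
    if r[0] <= 0.0:
        return False  # empty/silent frame
    r1 = r[1] / r[0]  # first-order reflection coefficient (lag-1 normalized autocorr.)
    # Levinson-Durbin recursion; e shrinks to the order-p residual energy e_r
    a = np.zeros(lpc_order + 1)
    a[0] = 1.0
    e = r[0]
    for p in range(1, lpc_order + 1):
        k = -np.dot(a[:p], r[p:0:-1]) / e  # p-th reflection coefficient
        a[1:p + 1] += k * a[p - 1::-1]     # update predictor polynomial A(z)
        e *= 1.0 - k * k                   # residual energy after order p
    return bool(r1 > 0.2 and e > er_threshold)
```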
Pitch marks are placed on the voiced portions; the unvoiced portions are marked equidistantly for convenience of computation. According to the conversion rules of step 2, each relevant parameter is set.
The pitch-marked speech signal is multiplied by a series of pitch-synchronous window functions to obtain a number of overlapping short-time analysis signals. A standard Hanning or Hamming window is generally used, with a window length of two pitch periods and about 50% overlap between adjacent short-time analysis signals. The accuracy and the starting position of the pitch periods are extremely important, as they strongly affect the quality of the synthesized speech. This method uses a Hamming window; the window function is given by Equation 5 (the standard Hamming window):

w(n) = 0.54 − 0.46 cos(2πn/(N−1)), 0 ≤ n ≤ N−1    (5)
The short-time analysis signal obtained by multiplying the original signal x(n) by the series of pitch-synchronous window functions ω_m(n) is:

x_m(n) = ω_m(n_m − n) × x(n)    (6)

where n_m is the pitch mark point.
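A sketch of this pitch-synchronous windowing under the stated two-pitch-period Hamming window, assuming `pitch_marks` holds the sample indices of the marks (all names are illustrative):

```python
import numpy as np

def pitch_synchronous_frames(x, pitch_marks):
    """Cut a Hamming-windowed short-time signal centred on each pitch mark
    n_m, spanning the two adjacent pitch periods, per Equation (6)."""
    frames = []
    for i in range(1, len(pitch_marks) - 1):
        left, right = pitch_marks[i - 1], pitch_marks[i + 1]  # ~2 pitch periods
        w = np.hamming(right - left)
        frames.append((pitch_marks[i], x[left:right] * w))    # (centre mark, segment)
    return frames
```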
According to the emotion conversion rules, the mapping relations of the pitch periods between the converted waveform and the original waveform are established (see Fig. 2); the mapping relations then determine the sequence of short-time signals needed for synthesis.
The short-time signal sequence is aligned with the target pitch periods and overlap-added to obtain the synthesized waveform. The synthesized speech waveform y(n) then has the desired affective characteristics.
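Continuing the previous sketch, the overlap-add step might look as follows; `mapping` encodes the Fig. 2 pitch-period correspondence (duplicating or dropping source frames changes duration, and the target mark spacing sets the new fundamental frequency). This is an illustrative reading of the PSOLA step, not code from the patent:

```python
import numpy as np

def psola_overlap_add(frames, target_marks, mapping, length):
    """Place the windowed short-time signal selected by mapping[j] at target
    pitch mark j and overlap-add everything into the output buffer."""
    y = np.zeros(length)
    for j, tm in enumerate(target_marks):
        _, seg = frames[mapping[j]]
        lo = max(tm - len(seg) // 2, 0)   # centre the segment on the target mark
        hi = min(lo + len(seg), length)
        y[lo:hi] += seg[:hi - lo]
    return y
```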
In step 4, the constants defined in step 2 are set: UP_POSITION (rise position), DOWN_POSITION (fall position), MEANf0 (overall fundamental-frequency change), DUR_POSITION (lengthening position), DUR_LEN (lengthening length) and ENERGY_SCALE (energy factor).
In step 5, the speech signal is first subjected to LPC analysis; in this method the analysis order is 12 (the flow is shown in Fig. 3). The poles of the resulting transfer function are modified so as to shift the formants in frequency.
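A minimal sketch of this pole-based formant shift, assuming an autocorrelation-method LPC and a uniform scaling of the pole angles; the scaling factor and function names are illustrative, not values from the patent:

```python
import numpy as np
from scipy.signal import lfilter

def lpc(frame, order):
    """Autocorrelation-method LPC (Levinson-Durbin); returns A(z) with a[0] = 1."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for p in range(1, order + 1):
        k = -np.dot(a[:p], r[p:0:-1]) / e
        a[1:p + 1] += k * a[p - 1::-1]
        e *= 1.0 - k * k
    return a

def shift_formants(frame, shift=1.05, order=12):
    """Order-12 LPC analysis; scale the angles of the complex poles of 1/A(z)
    to raise the formant frequencies, then re-filter the LPC residual through
    the modified synthesis filter."""
    a = lpc(frame, order)
    residual = lfilter(a, [1.0], frame)          # inverse filter: e(n) = A(z) x(n)
    poles = np.roots(a)
    ang = np.angle(poles)
    ang = np.where(np.abs(poles.imag) > 1e-8, ang * shift, ang)  # keep real poles fixed
    a_new = np.real(np.poly(np.abs(poles) * np.exp(1j * ang)))   # rebuild A(z)
    return lfilter([1.0], a_new, residual)       # synthesis with shifted formants
```

Leaving the pole radii unchanged preserves the formant bandwidths and keeps the modified synthesis filter stable.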
The advantages and positive effects of the present invention are: because the invention modifies the prosodic features (fundamental frequency, energy and duration) and the voice-quality feature (formants) of the speech simultaneously, the converted emotional speech is more natural. In addition, when the fundamental-frequency contour is modified, setting the necessary parameters gives a better prosody-modification effect.
Description of drawings
Fig. 1: Flowchart of emotional speech conversion
Fig. 2: Schematic diagram of the pitch-period mapping relations
Fig. 3: Flowchart of formant modification by LPC
Embodiment
The present invention is a new method for converting neutral speech into four kinds of emotional speech.
The main content of the invention is: extract and statistically analyze feature parameters of the chosen BHUDES emotional speech samples, formulate conversion rules, modify the fundamental-frequency contour and formant positions of the speech according to the rules, and thereby convert neutral speech into four kinds of emotional speech (sadness, anger, happiness and surprise).
To explain the purpose, technical scheme and advantages of the present invention more clearly, the conversion of neutral speech into surprised speech is described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here are intended only to illustrate the present invention and not to limit it.
The embodiment follows the flowchart of Fig. 1; the main steps are as follows:
Step 1: extract and analyze the feature parameters of the emotional speech samples (comprising neutral speech and four kinds of emotional speech: sadness, anger, happiness and surprise); the extracted feature parameters include fundamental frequency, duration, energy and formants;
Step 2: according to the extracted fundamental frequency, duration, energy and other feature parameters of the neutral speech and the four kinds of emotional speech, formulate the emotional speech conversion rules and define each conversion constant;
Step 3: take the rules for surprised speech from the emotional speech conversion rules of step 2.
From these rules, surprised speech has a slightly higher mean fundamental frequency, a slightly wider fundamental-frequency range, a rising tail in the fundamental-frequency contour, slightly higher energy, slightly higher formant positions and a shorter duration.
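As a purely illustrative sketch of applying such a rule to a frame-level F0 contour (reusing the hypothetical SURPRISE_RULE table from step 2; the 20% tail rise is a placeholder, not a disclosed value):

```python
import numpy as np

def apply_surprise_f0(f0, rule=SURPRISE_RULE):
    """Raise the mean F0 and add a rising tail after UP_POSITION, as the
    surprise rule above prescribes (all numeric values are placeholders)."""
    f0 = np.asarray(f0, dtype=float) * rule["MEANf0"]   # overall F0 raise
    start = int(len(f0) * rule["UP_POSITION"])
    ramp = 1.0 + 0.2 * np.linspace(0.0, 1.0, len(f0) - start)  # linear tail rise
    f0[start:] *= ramp
    return f0
```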
The input speech signal x(n) is first subjected to speech/silence segmentation and a voiced/unvoiced decision.
Speech and silent segments are separated with a double-threshold method based on short-time energy and short-time zero-crossing rate.
The voiced/unvoiced decision combines the prediction residual energy e_r with the first-order reflection coefficient r_1. The judgment condition is: if r_1 > 0.2 and e_r > threshold, the frame is voiced; otherwise it is unvoiced. Here N is the frame length and e_r is the residual energy after linear prediction.
Pitch marks are placed on the voiced portions; the unvoiced portions are marked equidistantly for convenience of computation. According to the conversion rules of step 2, each relevant parameter is set.
The pitch-marked speech signal is multiplied by a series of pitch-synchronous window functions to obtain a number of overlapping short-time analysis signals. A standard Hanning or Hamming window is generally used, with a window length of two pitch periods and about 50% overlap between adjacent short-time analysis signals. The accuracy and the starting position of the pitch periods are extremely important, as they strongly affect the quality of the synthesized speech. This method uses a Hamming window; the window function is given by Equation 5.
The short-time analysis signal obtained by multiplying the original signal x(n) by the series of pitch-synchronous window functions ω_m(n) is:

x_m(n) = ω_m(n_m − n) × x(n)    (6)

where n_m is the pitch mark point.
According to the emotion conversion rules, the mapping relations of the pitch periods between the converted waveform and the original waveform are established; the mapping relations then determine the sequence of short-time signals needed for synthesis.
The short-time signal sequence is aligned with the target pitch periods and overlap-added to obtain the synthesized waveform. The synthesized speech waveform y(n) then has the desired affective characteristics.
Step 4: according to the emotion conversion rules of step 2, set the constants UP_POSITION (rise position), DOWN_POSITION (fall position), MEANf0 (overall fundamental-frequency change), DUR_POSITION (lengthening position), DUR_LEN (lengthening length) and ENERGY_SCALE (energy factor). Then modify the fundamental-frequency contour, duration and energy, and synthesize the speech signal by pitch-synchronous overlap-add.
Step 5: perform LPC analysis on the speech signal of step 4 and modify the formants through the poles of the transfer function, finally obtaining the target emotional speech.
Claims (6)
1. An emotional speech conversion method combining prosodic parameters (fundamental frequency, duration and energy) with a voice-quality parameter (formant), the specific steps of the method being as follows:
Step 1: extract and analyze the feature parameters of the BHUDES emotional speech samples (comprising neutral speech and four kinds of emotional speech: sadness, anger, happiness and surprise);
Step 2: formulate the emotional speech conversion rules according to the extracted feature parameters and define each conversion constant;
Step 3: extract the feature parameters of the neutral speech to be converted and perform pitch-synchronous marking;
Step 4: set the modification parameters according to the emotion conversion rules of step 2, modify the fundamental-frequency contour, duration and energy, and synthesize the speech signal by pitch-synchronous overlap-add;
Step 5: perform LPC analysis on the speech signal of step 4 and modify the formants through the poles of the transfer function.
2. The method according to claim 1, wherein the principal feature of step 1 is the extraction of parameters from neutral speech and from the four kinds of emotional speech (sadness, anger, happiness and surprise).
3. The method according to claim 1, wherein the principal feature of step 2 is: the fundamental frequency, duration, energy and other feature parameters of the neutral speech and of the four kinds of emotional speech are extracted separately; the conversion rules are derived through statistical analysis; and on the basis of these conversion rules, the constants UP_POSITION (rise position), DOWN_POSITION (fall position), MEANf0 (overall fundamental-frequency change), DUR_POSITION (lengthening position), DUR_LEN (lengthening length) and ENERGY_SCALE (energy factor) are defined.
4. The method according to claim 1, wherein the principal features of step 3 are: the input speech signal x(n) is first subjected to speech/silence segmentation and a voiced/unvoiced decision, the speech and silent segments being separated with a double-threshold method based on short-time energy and short-time zero-crossing rate;
the voiced/unvoiced decision combines the prediction residual energy e_r with the first-order reflection coefficient r_1, the judgment condition being: if r_1 > 0.2 and e_r > threshold, the frame is voiced, otherwise it is unvoiced;
where e_r is the residual energy after linear prediction and N is the frame length;
pitch marks are placed on the voiced portions and the unvoiced portions are marked equidistantly; each relevant parameter is set according to the conversion rules of step 2; the pitch-marked speech signal is multiplied by a series of pitch-synchronous window functions to obtain a number of overlapping short-time analysis signals; this method uses a Hamming window with a window length of two pitch periods and about 50% overlap between adjacent short-time analysis signals; the window function is given by Equation 5.
The short-time analysis signal obtained by multiplying the original signal x(n) by the series of pitch-synchronous window functions ω_m(n) is:

x_m(n) = ω_m(n_m − n) × x(n)    (6)

where n_m is the pitch mark point.
5. The method according to claim 1, wherein the principal features of step 4 are: according to the emotion conversion rules, the mapping relations of the pitch periods between the converted waveform and the original waveform are established; the mapping relations then determine the sequence of short-time signals needed for synthesis; the short-time signal sequence is aligned with the target pitch periods and overlap-added to obtain the synthesized waveform.
6. The method according to claim 5, wherein the principal feature is: LPC analysis is performed on the speech signal with an analysis order of 12; the poles of the resulting transfer function are modified, thereby changing the vocal-tract transfer function and hence the formant positions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011101220344A CN102184731A (en) | 2011-05-12 | 2011-05-12 | Method for converting emotional speech by combining rhythm parameters with tone parameters |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102184731A true CN102184731A (en) | 2011-09-14 |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | C06 | Publication |
 | PB01 | Publication |
 | C10 | Entry into substantive examination |
 | SE01 | Entry into force of request for substantive examination |
 | C12 | Rejection of a patent application after its publication |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20110914