CN101930747A

CN101930747A - Method and device for converting voice into mouth shape image

Info

Publication number: CN101930747A
Application number: CN2010102408835A
Authority: CN
Inventors: 蒋一宁; 付晓毅; 蒋涛; 张�成; 蔺君刚; 赵旭
Original assignee: SICHUAN WEIDI DIGITAL TECHNOLOGY Co Ltd
Current assignee: SICHUAN WEIDI DIGITAL TECHNOLOGY Co Ltd
Priority date: 2010-07-30
Filing date: 2010-07-30
Publication date: 2010-12-29

Abstract

The invention discloses a method and a device for converting a voice into a mouth shape image. The method comprises the following steps of: firstly, acquiring the voice through an acquisition unit and performing spectrum analysis on the acquired voice through a recognition unit; secondly, recognizing phonemes in the voice according to a resonance peak and a volume parameter obtained by the spectrum analysis, forming sequences by using the phonemes obtained by the recognition and converting the sequences into corresponding mouth shape models one by one by using a conversion unit; thirdly, correcting mouth opening degree parameters of the mouth shape models according to the resonance peak and the volume parameter; and lastly, continuously playing the mouth shape models obtained by the correction according to the phoneme sequences to form the mouth shape image by using a display unit. The phonemes in the voice can be recognized, the parameters of the mouth shape models are determined through the phonemes and a correct mouth shape model is obtained by coordinating with the correction of the resonance peak and the volume parameter.

Description

Method and device for converting voice into a mouth shape image

Technical field

The present invention relates to technology in the communications field for converting between voice and mouth shapes, and in particular to a method and device for converting voice into a mouth shape image.

Background art

Existing schemes for converting between mouth shapes and speech first synchronously capture the sound of the speech and a video of the mouth shape, then run a specific recognition algorithm on the video to find certain syllables in the speech together with their corresponding image sequences. In use, conversion in either direction is then performed on the basis of the recognized images or sound segments.

Chinese patent document CN101510256A, entitled "A mouth-shape language conversion method and device", discloses a method that segments a captured lip-motion video into a set of mouth shape image sequences and recognizes that set to obtain the corresponding speech syllables. The device comprises an acquisition module, a segmentation module, and a recognition module. By segmenting the captured lip-motion video into mouth shape image sequences and recognizing the speech syllables corresponding to those sequences, that invention converts mouth-shape language into speech syllables, addressing the conversation needs of people with speech impairments and making communication more convenient for them.

The method of converting voice into images discussed in that document recognizes the syllables in the voice. (In Chinese, a syllable is the basic unit of speech that can be clearly distinguished by ear; one Chinese character is one syllable, and each syllable consists of three parts: an initial, a final, and a tone.) In other words, what is recognized is one or more of the initial, the final, and the tone. However, the document does not explain how syllables are to be recognized, nor how the corresponding mouth shape image is obtained after recognition, so the practicality of the scheme is doubtful. Even if some method were available for recognizing syllables and converting them into mouth shape images, recognition errors and conversion errors would remain, so such a scheme cannot meet users' actual needs or be convenient to use.

Summary of the invention

To overcome the above technical problems, the present invention provides a method and device for converting voice into a mouth shape image. Phonemes in the voice are recognized, the parameters of a mouth shape model are determined from the phonemes, and the model is then corrected using formants (resonance peaks) and volume to obtain a correct mouth shape model. The resulting mouth shape models can be combined into a continuous mouth shape image for the user.

The technical solution of the present invention is as follows:

A method for converting voice into a mouth shape image, characterized by the following steps:

Acquiring voice, and recognizing the acquired voice by spectrum analysis;

Forming a phoneme sequence from the phonemes obtained by recognition;

Converting the phoneme sequence into corresponding mouth shape models one by one;

Correcting the parameters of the mouth shape models according to formant frequency and volume, then playing the corrected models continuously in phoneme-sequence order to form the mouth shape image.
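
The conversion and correction steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `MouthShape` and `Phoneme` types, the `BASE_SHAPES` table, and the blending weights in `correct` are all hypothetical, and the rule that a higher first formant and a higher volume both widen the opening is an assumed simplification.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MouthShape:
    lip_shape: str   # e.g. "circle" or "semicircle"
    opening: float   # 0.0 (closed) to 1.0 (fully open)


@dataclass(frozen=True)
class Phoneme:
    symbol: str
    f1_hz: float     # first formant frequency in Hz
    volume: float    # normalized to 0.0 .. 1.0


# Hypothetical phoneme-to-model table; the values are illustrative only.
BASE_SHAPES = {
    "a": MouthShape("semicircle", 0.8),
    "u": MouthShape("circle", 0.3),
    "b": MouthShape("closed", 0.0),
}


def convert(sequence):
    """Map each phoneme to its base mouth shape model, one by one."""
    return [BASE_SHAPES[p.symbol] for p in sequence]


def correct(shape, phoneme):
    """Assumed correction rule: blend the base opening with volume and F1."""
    if shape.opening == 0.0:
        return shape  # closed lips stay closed
    # Normalize F1 over an assumed 200-800 Hz range.
    f1_norm = min(1.0, max(0.0, (phoneme.f1_hz - 200.0) / 600.0))
    opening = 0.5 * shape.opening + 0.25 * phoneme.volume + 0.25 * f1_norm
    return replace(shape, opening=min(1.0, max(0.0, opening)))


def to_playback_sequence(sequence):
    """Return corrected models in phoneme-sequence order, ready for playback."""
    return [correct(s, p) for s, p in zip(convert(sequence), sequence)]
```

Calling `to_playback_sequence` with a list of `Phoneme` values returns one corrected `MouthShape` per phoneme, in the same order, which a display routine could then play back continuously.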

Phoneme: the speech sounds of a language are divided, according to their physiological and physical properties, into a limited number of smallest speech units. Phonemes are divided into vowels and consonants. The spectral envelope contains several relatively broad peaks, called formants (resonance peaks). Changes in a speech signal can be represented in terms of time, frequency, and intensity; a formant can be described as a signal with a certain intensity, within a certain frequency range, lasting a certain time. A speech signal usually has three formants, and vowels and consonants can be identified from the way the first and second formants change. In addition, formant frequency and volume are related to how wide the lips are open: for example, the wider the mouth opens, the louder the sound.

A mouth shape model can be described by the lip shape formed by the upper and lower lips and by the size of the mouth opening; lip shapes include, for example, a circle or a semicircle.

Formants are regions of the sound spectrum in which energy is relatively concentrated. They not only determine sound quality but also reflect the physical characteristics of the vocal tract (the resonant cavity). As sound passes through the resonant cavity it is filtered by the cavity, so that energy at different frequencies is redistributed in the frequency domain: some frequencies are reinforced by the resonance of the cavity while others are attenuated. The reinforced frequencies appear as dense dark bands on the spectrogram produced by time-frequency analysis. Because the energy is unevenly distributed and the strong parts resemble mountain peaks, they are called resonance peaks. The broad spectra of the voice and of most musical instruments contain some fixed frequency peaks (Formant Synthesis); in a sound spectrum these frequency peaks are called formants. In speech acoustics, formants determine the quality of vowels; in computer music, they are important parameters that determine timbre and sound quality.

Formants and volume can be obtained by performing spectrum analysis on the voice, and the vowel and consonant phonemes in the voice can thereby be recognized.

The parameters of the mouth shape model are corrected according to formant frequency and volume for the following reason: when the voice is analyzed in the time domain, the time-domain parameters are sometimes identical, but this does not show that the mouth shape model obtained by conversion matches the actual voice. A voice signal not only varies with time but also carries frequency, phase, and other information, so the frequency structure of the signal must be analyzed further and the signal described in the frequency domain.
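
The point about identical time-domain parameters can be illustrated with a small sketch. It assumes RMS amplitude as the time-domain volume measure and uses a naive discrete Fourier transform written with the standard library only; the helper names `rms`, `dominant_frequency`, and `tone`, the 8000 Hz sample rate, and the two test tones are illustrative choices, not values from the source.

```python
import cmath
import math


def rms(samples):
    """Time-domain volume measure: root-mean-square amplitude."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def dominant_frequency(samples, sample_rate):
    """Naive DFT: return the frequency (Hz) of the strongest non-DC bin."""
    n = len(samples)
    best_bin, best_mag = 1, 0.0
    for k in range(1, n // 2):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * sample_rate / n


def tone(freq_hz, sample_rate=8000, n=400):
    """Synthetic pure tone standing in for a voiced sound."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


low, high = tone(300), tone(1000)
same_volume = abs(rms(low) - rms(high)) < 1e-9   # time domain cannot tell them apart
peaks = (dominant_frequency(low, 8000), dominant_frequency(high, 8000))
```

Both tones have the same RMS volume, yet their spectral peaks differ, which is why a frequency-domain description is needed before the mouth shape can be matched to the sound. A practical system would use a fast Fourier transform on short windowed frames rather than this quadratic-time loop.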

A device for converting voice into a mouth shape image comprises an acquisition unit for acquiring voice, a recognition unit for performing spectrum analysis on the voice to obtain phonemes, a conversion unit for converting the phonemes into mouth shape models, and a display unit for playing the mouth shape models continuously and dynamically.

While the acquisition unit acquires the voice, the recognition unit simultaneously performs spectrum analysis on it to obtain the formants and volume and recognizes the phoneme sequence. The conversion unit then converts the phoneme sequence into mouth shape models and corrects the parameters of the models according to formant frequency and volume. Finally, the display unit plays the mouth shape models continuously and dynamically to produce the mouth shape image.
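
The data flow between the four units can be sketched as a simple composition. The `Device` class and the stub functions below are hypothetical stand-ins for real hardware and recognition logic; only the order of the units is taken from the description.

```python
class Device:
    """Wires the four units together in the order the description gives."""

    def __init__(self, acquire, recognize, convert, display):
        self.acquire = acquire      # acquisition unit: () -> samples
        self.recognize = recognize  # recognition unit: samples -> phoneme sequence
        self.convert = convert      # conversion unit: phonemes -> corrected models
        self.display = display      # display unit: models -> played frames

    def run(self):
        samples = self.acquire()
        phonemes = self.recognize(samples)
        models = self.convert(phonemes)
        return self.display(models)


# Stub units; each returns fixed or trivially derived data for illustration.
def stub_microphone():
    return [0.0, 0.5, -0.5, 0.25]


def stub_recognizer(samples):
    volume = max(abs(s) for s in samples)
    return [("a", volume)]  # (phoneme, volume)


def stub_converter(phonemes):
    # Assumed rule: the opening follows the volume directly.
    return [("semicircle", volume) for _symbol, volume in phonemes]


def stub_display(models):
    return list(models)


device = Device(stub_microphone, stub_recognizer, stub_converter, stub_display)
frames = device.run()
```

Passing the units in as callables keeps each one replaceable, so a real microphone driver or recognizer could be substituted without changing `Device.run`.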

The acquisition unit is a microphone. The microphone converts the acquired voice signal into a level signal and inputs it to a digital signal processor; the digital signal processor converts the level signal into a frequency-domain signal for spectrum analysis; and the voice recognition unit then recognizes the formant frequency, volume, and phonemes from the frequency-domain signal.

The digital signal processor also converts the level signal into a digital signal, which is output through a loudspeaker connected to the digital signal processor.

The mouth shape image obtained through the display unit comprises a basic mouth shape and a lip-opening size parameter.

The beneficial effects of the present invention are as follows:

The present invention uses spectrum analysis to obtain the volume and the formants, which determine vowel quality, and recognizes the phonemes of the voice. The parameters of the mouth shape model are determined from the phonemes, and the model is then corrected using the formants and volume to obtain a correct mouth shape model. The corrected mouth shape models yield a highly accurate mouth shape image, making it easier for people with speech impairments to communicate with others.

Description of drawings

Fig. 1 is a structural schematic diagram of the device of the present invention.

Fig. 2 is a structural schematic diagram of one embodiment of the device of the present invention.

Embodiment

A method for converting voice into a mouth shape image, with the following conversion steps:

Acquiring voice, and recognizing the acquired voice by spectrum analysis;

Forming a phoneme sequence from the phonemes obtained by recognition;

Formants and volume can be obtained by performing spectrum analysis on the voice, and the vowel and consonant phonemes of the voice can thereby be recognized. The mouth shape model is then corrected using formant frequency and volume, which yields a highly accurate mouth shape image.

The mouth shape model is described by the lip shape formed by the upper and lower lips and by the size of the mouth opening; lip shapes include, for example, a circle or a semicircle.

The mouth shape image comprises a basic mouth shape (such as a semicircle or circle) and a lip-opening size parameter (for example, the greater the volume, the wider the lips open).

As shown in Figs. 1-2, the device for converting voice into a mouth shape image comprises an acquisition unit for acquiring voice, a recognition unit for performing spectrum analysis on the voice to obtain phonemes, a conversion unit for converting the phonemes into mouth shape models, and a display unit for playing the mouth shape models continuously and dynamically.

The acquisition unit is a microphone. The microphone converts the acquired voice signal into a level signal and inputs it to a digital signal processor. The digital signal processor first converts the level signal into a time-domain digital signal and then converts the time-domain digital signal into a frequency-domain signal for spectrum analysis. The voice recognition unit then recognizes the formant frequency, the volume, and the vowels and consonants of the phonemes. The phonemes recognized one by one form a phoneme sequence, which is converted into mouth shape models. Because the mouth shape models obtained at this point are not accurate enough, they are corrected using formant frequency and volume. The duration of each corrected mouth shape image is then adjusted according to the duration of its phoneme, which produces a continuous mouth shape image.
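
The final duration adjustment can be sketched as follows, assuming the display shows one image every 40 ms (25 frames per second). The frame interval, the `build_timeline` name, and the `(label, duration_ms)` input layout are illustrative assumptions; the source only states that each mouth shape image lasts as long as its phoneme.

```python
def build_timeline(models, frame_ms=40):
    """Expand (label, duration_ms) pairs into one label per display frame.

    Each mouth shape is held for a number of frames proportional to the
    duration of its phoneme, with at least one frame so that no phoneme
    is dropped from the played sequence.
    """
    frames = []
    for label, duration_ms in models:
        count = max(1, round(duration_ms / frame_ms))
        frames.extend([label] * count)
    return frames


timeline = build_timeline([("b", 40), ("a", 200), ("u", 120)])
```

Here `timeline` holds one frame of "b", five frames of "a", and three frames of "u", so playing it back at the assumed frame interval reproduces the timing of the phoneme sequence.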

Formants can be extracted from the frequency-domain signal by filters. By selecting suitable filter bandwidths, the frequencies of the first, second, and third formants, denoted F1, F2, and F3, can be obtained. Combined with the duration of the formants, vowels and consonants can then be recognized. For example, F1 at 300-400 Hz and F2 at about 1000 Hz with a duration of less than 200 ms is recognized as the vowel u, and F1 = 200 Hz, F2 = 720 Hz, F3 = 2100 Hz is recognized as the consonants /b, p/.
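
The two recognition examples above can be written as a small rule table. This sketch assumes tolerance bands (±100 Hz around F2 for the vowel, ±50 Hz around each consonant formant), since the source gives only nominal values; the `classify` function and its return labels are hypothetical.

```python
def classify(f1, f2, f3=None, duration_ms=None):
    """Return a phoneme label for the formant pattern, or None if no rule fits.

    Frequencies are in Hz. The tolerances are assumptions for illustration.
    """
    # Vowel u: F1 in 300-400 Hz, F2 about 1000 Hz, duration under 200 ms.
    if (300 <= f1 <= 400 and abs(f2 - 1000) <= 100
            and duration_ms is not None and duration_ms < 200):
        return "u"
    # Consonants /b, p/: F1 = 200 Hz, F2 = 720 Hz, F3 = 2100 Hz.
    if (f3 is not None and abs(f1 - 200) <= 50
            and abs(f2 - 720) <= 50 and abs(f3 - 2100) <= 50):
        return "b/p"
    return None
```

For instance, `classify(350, 1000, duration_ms=150)` matches the vowel rule, while the same formants held for 300 ms match nothing. A fuller recognizer would need one rule, or a statistical model, for every phoneme of the language.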

Because the mouth shape image obtained by this method and device is highly accurate, it can effectively help people with speech impairments communicate with others.

Claims

1. A method for converting voice into a mouth shape image, characterized by the following steps:

Acquiring voice, and recognizing the acquired voice by spectrum analysis;

Forming a phoneme sequence from the phonemes obtained by recognition;

2. The method for converting voice into a mouth shape image according to claim 1, characterized in that: the spectrum analysis obtains formants and volume, and the recognition obtains the phonemes in the voice, namely vowels and consonants.

3. A device for implementing the method for converting voice into a mouth shape image according to claim 1 or 2, characterized in that it comprises: an acquisition unit for acquiring voice, a recognition unit for performing spectrum analysis on the voice to obtain phonemes, a conversion unit for converting the phonemes into mouth shape models, and a display unit for playing the mouth shape models continuously and dynamically.

4. The device for converting voice into a mouth shape image according to claim 3, characterized in that: while the acquisition unit acquires the voice, the recognition unit simultaneously performs spectrum analysis on the voice to obtain the formants and volume and recognizes the phoneme sequence; the conversion unit then converts the phoneme sequence into mouth shape models and corrects the parameters of the mouth shape models according to formant frequency and volume; and finally the display unit plays the mouth shape models continuously and dynamically to obtain the mouth shape image.

5. The device for converting voice into a mouth shape image according to claim 3, characterized in that: the acquisition unit is a microphone; the microphone converts the acquired voice signal into a level signal and inputs it to a digital signal processor; the digital signal processor converts the level signal into a frequency-domain signal for spectrum analysis; and the voice recognition unit then recognizes the formant frequency, volume, and phonemes from the frequency-domain signal.

6. The device for converting voice into a mouth shape image according to claim 3, characterized in that: the mouth shape image obtained through the display unit comprises a basic mouth shape and a lip-opening size parameter.