CN101894566A

CN101894566A - Visualization method of Chinese mandarin complex vowels based on formant frequency

Info

Publication number: CN101894566A
Application number: CN2010102348459A
Authority: CN
Inventors: 赵胜辉; 严静雨; 王晶; 匡镜明
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2010-07-23
Filing date: 2010-07-23
Publication date: 2010-11-24

Abstract

The invention relates to a method for visualizing Mandarin Chinese compound vowels based on formant frequency, comprising the following steps: feature extraction, i.e. performing pre-filtering, framing, pre-emphasis, windowing and endpoint detection on the original compound vowel signal, and extracting the first three formant frequencies F1, F2 and F3 of each frame; and compound vowel visualization, i.e. representing the first formant frequency F1 on the abscissa and the ratio between two formant frequencies on the ordinate, calculating the values of F2/F1 and F3/F2 for each frame, and plotting the points (F1, F2/F1) and (F1, F3/F2) on the coordinate graph with different icons and colors. The invention displays compound vowels intuitively as images and can accurately distinguish compound vowel speech signals. It only needs to extract simple acoustic parameters of the speech signal, such as the short-time average energy and the first three formant frequencies, and is easy to implement.

Description

A visualization method for Mandarin Chinese compound vowels based on formant frequency

Technical field

The present invention relates to a method for visualizing Mandarin Chinese compound vowels, in particular a compound vowel visualization method based on formant frequency, and belongs to the field of speech visualization.

Background technology

Speech is the meaning-bearing sound produced by the human vocal organs and is indispensable in daily life. For hearing-impaired people, however, fluent communication is often very difficult because they do not receive sufficient acoustic information. Studies show that, in human perception of the outside world, vision supplies the most information and hearing the next most, and that vision and hearing combined supply more information than any single sense. Experience also tells us that charts are one of the most convenient and intuitive ways for people to express ideas and convey information. People have therefore tried to perceive speech visually, or to convey more useful information by combining the auditory and visual channels. The purpose of the present invention is to explore and find a speech visualization method, that is, to display speech with visual elements so that speech can be perceived visually, and thereby to give hearing-impaired people practical help in perceiving speech effectively and practicing correct pronunciation.

Before the present invention, many speech visualization methods were based on face models or vocal organs. Such methods describe the mouth shape during pronunciation qualitatively or quantitatively. Qualitative descriptions include rounded lips, spread lips, the size of the mouth opening, the height of the tongue position, and so on. Many current applications, such as virtual face synthesis and automatic machine lip reading, require objective quantitative measurement of visual speech. The international standard MPEG-4 defines the Facial Definition Parameter (FDP), the Facial Animation Parameter (FAP) and the Facial Animation Parameter Unit (FAPU). The advantages of FAP have made it the international standard for facial animation. By defining the FAPU, it normalizes the differences between individual faces, so that the same parameters produce similar facial expressions on different face models.

Speech visualization based on the movement of the vocal organs and changes in facial expression is comparatively natural. It analyzes the human articulation process effectively and helps hearing-impaired people practice pronunciation. However, sounds produced by internal vocal organs such as the soft palate and lower jaw are difficult to show visually. In terms of speech intelligibility the results are also far from ideal: apart from a very few experts, people find it difficult to perceive speech accurately and efficiently by directly observing the movement of the vocal organs. In addition, the visual effect is rather monotonous and not very expressive.

Some scholars have also studied human auditory characteristics. They analyze the hearing mechanism of the auditory organs and use corresponding auditory models to obtain features that distinguish speech signals, which are then visualized. However, research on human auditory characteristics is still at an early stage, and the information that can be exploited is very limited.

Summary of the invention

The technical problem to be solved by the present invention is to provide a speech visualization method that integrates different speech features into a single readable image. Methods of this kind use different colors, icons and icon sizes to represent speech visually as an image. Compared with methods based on vocal organ models or face models, a speech visualization method based on integrated speech features has good readability and intelligibility. After relatively short training, both hearing-impaired people and people with normal hearing can intuitively identify the visual image corresponding to a pronunciation. By reading the visual images of this invention, the diphthong compound vowels of Mandarin Chinese can be distinguished easily.

Technical scheme of the present invention is:

A speech visualization method for Mandarin Chinese compound vowels based on formant frequency comprises the following steps:

Step 1: feature extraction. The specific method is:

(1) pre-filter the original compound vowel to eliminate power-frequency interference;

(2) perform framing, pre-emphasis, windowing and endpoint detection on the pre-filtered compound vowel to determine its start endpoint and end endpoint;

(3) extract the first three formant frequencies F1, F2 and F3 of each frame between the start endpoint and the end endpoint.

Step 2: compound vowel visualization. The specific method is: the abscissa represents the first formant frequency F1 and the ordinate represents the ratio between two formant frequencies. For each frame, the values of F2/F1 and F3/F2 are calculated, and the points (F1, F2/F1) and (F1, F3/F2) are plotted on the coordinate graph with different icons or different colors respectively.

The radius of each point on the coordinate graph increases or decreases regularly with the frame number, so that the direction of change of the formant trajectory over time is reflected intuitively on the coordinate graph.

Beneficial effect:

(1) The present invention represents compound vowels intuitively as images. It distinguishes the pronunciations of different Mandarin Chinese compound vowels by the trend of the first formant F1 over time and by the trends over time and relative positions of F2/F1 and F3/F2. The images of different Mandarin Chinese compound vowels differ markedly, so compound vowel speech signals can be distinguished accurately. Some specific compound vowels can be distinguished even more accurately by the sparseness of the two trajectories and by how the two trajectories overlap.

(2) The present invention extracts only simple acoustic parameters of the speech signal, such as the short-time average energy and the first three formant frequencies, and is easy to implement.

Description of drawings

Fig. 1 is a block diagram of the Mandarin Chinese compound vowel speech visualization system.

Fig. 2 is a flow chart of the formant frequency solution.

Fig. 3 is an example visualization of the Mandarin Chinese compound vowel ai for a male voice.

Fig. 4 is an example visualization of the Mandarin Chinese compound vowel ai for a female voice.

Fig. 5 is an example visualization of the Mandarin Chinese compound vowel ao for a male voice.

Fig. 6 is an example visualization of the Mandarin Chinese compound vowel ao for a female voice.

Fig. 7 is an example visualization of the Mandarin Chinese compound vowel ia for a male voice.

Fig. 8 is an example visualization of the Mandarin Chinese compound vowel ia for a female voice.

Fig. 9 is an example visualization of the Mandarin Chinese compound vowel ve for a male voice.

Fig. 10 is an example visualization of the Mandarin Chinese compound vowel ve for a female voice.

Fig. 11 is an example visualization of the Mandarin Chinese compound vowel ua for a male voice.

Fig. 12 is an example visualization of the Mandarin Chinese compound vowel ua for a female voice.

Embodiment

Specific embodiments of the invention are described below with reference to the accompanying drawings.

Fig. 1 shows a block diagram of a system implementing the method of the invention. It consists of two main parts: a feature extraction module and a visualization generation module.

Module 1: feature extraction module. This module implements the feature extraction step of the invention.

First, the speech signal is preprocessed by pre-filtering, framing, windowing and so on. The short-time energy and the first three formant frequencies of each frame are then extracted directly. The formant frequencies of the last several frames of the second half of the compound vowel are discarded, after which linear time-axis normalization and smoothing are applied.
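The preprocessing chain described above can be sketched as follows. This is a minimal illustration, not the inventors' implementation: the function names, the pre-emphasis coefficient 0.97 and the choice of a Hamming window are all assumptions, since the text names the steps but not their parameters. The pre-filter that removes power-frequency interference is omitted.

```python
import math

def pre_emphasize(signal, alpha=0.97):
    """First-order pre-emphasis y[n] = x[n] - alpha * x[n-1].
    alpha = 0.97 is an assumed, commonly used value."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def hamming(length):
    """Hamming window (assumed; the source does not name a window type)."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def frame_and_window(signal, frame_len, hop):
    """Split the signal into overlapping frames and window each frame."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] * window[n] for n in range(frame_len)])
    return frames
```

With a 10-sample signal, `frame_and_window(signal, 4, 2)` yields four frames of four samples each. Real speech would use frames of a few tens of milliseconds.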

(1) Short-time energy of the speech signal:

E_m = \sum_{n=m}^{m+N-1} s_w^2(n) \qquad (1)

where s_w(n) is the windowed speech signal, m is the starting point of the window, and N is the window length in samples.
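Equation (1) translates directly into code. The sketch below also shows one way the short-time energy might be used for endpoint detection. The threshold rule (a fixed fraction of the maximum frame energy) and the function names are assumptions, because the text does not specify the endpoint detection criterion.

```python
def short_time_energy(windowed, m, N):
    """E_m = sum of s_w(n)^2 for n = m .. m+N-1 (equation 1)."""
    return sum(x * x for x in windowed[m:m + N])

def frame_energies(windowed, N, hop):
    """Short-time energy of every frame of length N taken every hop samples."""
    return [short_time_energy(windowed, m, N)
            for m in range(0, len(windowed) - N + 1, hop)]

def detect_endpoints(energies, ratio=0.1):
    """Return (first, last) indices of frames whose energy exceeds
    ratio * max(energies). This threshold rule is an assumption."""
    threshold = ratio * max(energies)
    active = [i for i, e in enumerate(energies) if e > threshold]
    return active[0], active[-1]
```

An all-silent input has no frame above the threshold and would need separate handling.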

(2) Solving for the formant frequencies using LPC:

As shown in Fig. 2, linear predictive coding (LPC) is first used to obtain the transfer function H(z) of the vocal system. The roots of the numerator and denominator polynomials of a digital filter's transfer function correspond to the zeros and poles of the system's frequency response. The speech transfer function H(z) used here is all-pole and has only a denominator polynomial, that is:

H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{M} a_k z^{-k}} \qquad (2)

where M is the linear prediction order.

Setting A(z) = 0 yields the M/2 pairs of complex conjugate roots of this polynomial:

z_i = r_i e^{j\theta_i}, \quad z_i^{*} = r_i e^{-j\theta_i} \qquad (3)

where r_i is the modulus of the complex root and \theta_i is its argument. Theoretical derivation shows that they are related to the formant frequency F_i as follows:

F_i = \frac{\theta_i}{2\pi T} \qquad (4)

where T is the sampling period. For general speech analysis, M takes a value of 10 to 18.
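Equations (2) to (4) can be sketched with NumPy as shown below. The source says only that LPC is used, so the autocorrelation method with the Levinson-Durbin recursion is an assumed choice, and the function names are hypothetical. Practical formant trackers usually also screen roots by bandwidth, which this sketch does not do.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC via Levinson-Durbin (assumed variant).
    Returns a_1..a_M such that A(z) = 1 - sum_k a_k z^-k (equation 2)."""
    frame = np.asarray(frame, dtype=float)
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    a = np.zeros(order + 1)
    err = r[0]
    for i in range(1, order + 1):
        # reflection coefficient for order i
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        new_a = a.copy()
        new_a[i] = k
        new_a[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]

def formants_from_lpc(a, fs, count=3):
    """Solve A(z) = 0 and convert root angles to frequencies (equations 3, 4)."""
    # z^M * A(z) = z^M - a_1 z^(M-1) - ... - a_M
    roots = np.roots(np.concatenate(([1.0], -np.asarray(a, dtype=float))))
    roots = roots[np.imag(roots) > 0]      # keep one root per conjugate pair
    # F_i = theta_i / (2 * pi * T) with T = 1 / fs
    freqs = np.sort(np.angle(roots) * fs / (2.0 * np.pi))
    return freqs[:count]
```

For a frame sampled at `fs` Hz, `formants_from_lpc(lpc_coefficients(frame, 12), fs)` returns up to three candidate values for F1, F2 and F3. The order 12 is simply one value inside the 10 to 18 range stated above.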

(3) Linear time-axis normalization:

For a diphthong, the formant frequencies of its initial segment and middle transition segment are decisive for distinguishing it, so the formant frequencies of the last several frames of the second half of the compound vowel are discarded first. Because the formant trajectories of different compound vowels differ in length, the trajectories must be normalized to a common length. Here the normalized trajectory length is 50 frames. If fewer than 50 frames remain after discarding, the trajectory is not compressed. If more than 50 frames remain, the normalization coefficient is:

coeff = \frac{\text{original formant trajectory length}}{\text{normalized formant trajectory length}} \qquad (5)

Let the n nodes of the original formant trajectory be x_1 < x_2 < ... < x_n, with corresponding formant frequency values y_i (i = 1, 2, ..., n). Let the m nodes of the normalized formant trajectory be x_1^0 < x_2^0 < ... < x_m^0, with corresponding formant frequencies z_i (i = 1, 2, ..., m).

To obtain the frequency value at node x_i^0 of the normalized trajectory, the node is first mapped onto the original formant trajectory, giving the corresponding position x_i:

x_i = coeff \cdot x_i^0 \qquad (6)

Because x_i is in most cases not an integer, the frequency values at the two original nodes nearest to x_i, denoted x_{i-1} and x_{i+1}, are used to calculate the frequency value of the normalized trajectory at x_i^0:

z_i = y_{i-1} (x_{i+1} - x_i) + y_{i+1} (x_i - x_{i-1}) \qquad (7)

Here y_{i-1} and y_{i+1} are the original frequency values at those two nodes. Because adjacent frames are one unit apart, the two weights sum to 1 and equation (7) is a linear interpolation.
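The normalization in equations (5) to (7) might be sketched as follows. The sketch assumes that frames are numbered 1..n and 1..m, and that the "two nearest points" are the integer frames immediately below and above the mapped position. The function name and the handling of positions that land exactly on a frame are illustrative choices.

```python
import math

def normalize_trajectory(y, target_len=50):
    """Resample a formant trajectory to target_len frames by linear
    interpolation. Trajectories with target_len frames or fewer are
    returned unchanged, as the text specifies for fewer than 50 frames."""
    n = len(y)
    if n <= target_len:
        return list(y)
    coeff = n / target_len                  # equation (5)
    z = []
    for i0 in range(1, target_len + 1):     # normalized nodes 1..m
        x = coeff * i0                      # equation (6)
        lo = math.floor(x)
        if lo >= n or x == lo:              # position falls on an original frame
            z.append(y[min(lo, n) - 1])
            continue
        hi = lo + 1
        # equation (7): weights are the distances to the neighbouring frames
        z.append(y[lo - 1] * (hi - x) + y[hi - 1] * (x - lo))
    return z
```

For a 75-frame trajectory the coefficient is 1.5, so the first normalized value is the average of original frames 1 and 2.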

(4) Median smoothing:

Median smoothing is a sliding-window method based on histogram statistics. Its basic principle is as follows: let {x(n)} be the input signal and {y(n)} the output of the median filter, with window length 2L+1. The output value y(n₀) at n₀ is then the median of the input samples inside the window when the window is centered at n₀. The median is obtained by accumulating the 2L+1 input samples x(n₀-L), x(n₀-L+1), ..., x(n₀), x(n₀+1), x(n₀+2), ..., x(n₀+L) into a cumulative histogram, whose 1/2 quantile is the median.

Median filtering can correct isolated outliers without affecting the values of the surrounding samples.
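A minimal sketch of the median smoother follows. Sorting the window and taking its middle element gives the same value as the 1/2 quantile of the cumulative histogram. Leaving the first and last L samples unchanged is an assumption, since the text does not say how the edges are treated.

```python
def median_smooth(x, L=1):
    """Sliding median with window length 2L+1.
    Samples closer than L to either edge are copied unchanged (assumption)."""
    y = list(x)
    for n0 in range(L, len(x) - L):
        window = sorted(x[n0 - L:n0 + L + 1])
        y[n0] = window[L]                   # middle of 2L+1 sorted samples
    return y
```

A single outlier such as the 9 in `[1, 1, 9, 1, 1]` is replaced by 1, while a monotonic sequence passes through unchanged.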

Linear smoothing applies a linear filter with a sliding window, that is:

y(n) = \sum_{m=-L}^{L} x(n-m) w(m) \qquad (8)

where {w(m), m = -L, -L+1, ..., 0, 1, 2, ..., L} is a (2L+1)-point smoothing window satisfying:

\sum_{m=-L}^{L} w(m) = 1 \qquad (9)

For example, a 3-point window may take the values {0.25, 0.5, 0.25}. While correcting sample values at unsmooth points of the input signal, linear smoothing also modifies the values of nearby samples. The two smoothing techniques above can be used in combination.
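Equations (8) and (9) with the 3-point window from the text can be sketched as below. Copying the edge samples unchanged is again an assumption, and the function name is illustrative.

```python
def linear_smooth(x, w=(0.25, 0.5, 0.25)):
    """y(n) = sum over m = -L..L of x(n - m) * w(m), equation (8).
    Edge samples without a full window are copied unchanged (assumption)."""
    if abs(sum(w) - 1.0) > 1e-12:           # equation (9)
        raise ValueError("window weights must sum to 1")
    L = len(w) // 2
    y = list(x)
    for n in range(L, len(x) - L):
        y[n] = sum(x[n - m] * w[m + L] for m in range(-L, L + 1))
    return y
```

Unlike the median smoother, this spreads an isolated spike over its neighbours: `[0, 0, 4, 0, 0]` becomes `[0, 1.0, 2.0, 1.0, 0]`. That is why applying median smoothing first and linear smoothing second is a common combination.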

Module 2: visualization module.

Figs. 3 to 12 show the visualizations of the zero-initial diphthong finals /ai/, /ao/, /ia/, /ve/ and /ua/ of Mandarin Chinese, each shown for both a male voice and a female voice. The abscissa represents the first formant frequency F1 and the ordinate represents the ratio between two formant frequencies. For each frame, the values of F2/F1 and F3/F2 are calculated, and the points (F1, F2/F1) and (F1, F3/F2) are plotted on the coordinate graph with different icons or different colors respectively. In this embodiment, F2/F1 and F3/F2 are represented in each view by red round dots and blue diamonds respectively. To reflect the order of the formant trajectory over time, the size of each icon varies according to the following rule:

d_i = 3 + i^{0.6} \qquad (10)

where i is the index of the icon and d_i is the diameter of the i-th icon.
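The plotted quantities can be computed as in the sketch below. It produces only the coordinates and icon diameters and leaves the drawing to whatever scatter-plot routine is available. The function names are illustrative, and the unit of d_i is not stated in the source.

```python
def icon_diameters(num_frames):
    """d_i = 3 + i**0.6 for i = 1..num_frames (equation 10)."""
    return [3 + i ** 0.6 for i in range(1, num_frames + 1)]

def plot_points(f1, f2, f3):
    """Return the two tracks drawn in each view:
    (F1, F2/F1) per frame and (F1, F3/F2) per frame."""
    track_f2_f1 = [(a, b / a) for a, b in zip(f1, f2)]
    track_f3_f2 = [(a, c / b) for a, b, c in zip(f1, f2, f3)]
    return track_f2_f1, track_f3_f2
```

Because the diameter grows monotonically with the frame index, later frames appear as larger icons, which is what lets a reader see the direction of the trajectory.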

The image reflects the trend of the first formant F1 over time and the relationships among F1, F2 and F3. Different Mandarin Chinese diphthong finals are distinguished by whether F1 varies from large to small, whether F3/F2 and F2/F1 vary from large to small, and whether the value of F3/F2 in the view is greater than the value of F2/F1. For some specific compound vowels it can also be seen that the F2/F1 trajectory is more sparsely distributed, or that the F3/F2 and F2/F1 trajectories overlap at their ends. Both observations provide additional information for distinguishing compound vowels more accurately. Specifically:

As can be seen from the figures, the trend of F1 and the trends and relative positions of F2/F1 and F3/F2 differ clearly between pronunciations, so the human eye can fairly easily divide them into several broad classes. The F2/F1 and F3/F2 trajectories of a few pronunciations are discontinuous, mainly because of formant extraction errors in some frames.

With the method of the invention, Mandarin Chinese compound vowel speech signals are represented as coordinate graphs that can be distinguished intuitively, which can give hearing-impaired people practical help in perceiving speech effectively and practicing correct pronunciation.

The above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the specific embodiments may still be modified, or some technical features replaced by equivalents, without departing from the spirit of the technical solution of the present invention, and that all such changes fall within the scope of protection claimed by the present invention.
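The visual cues listed above (direction of F1, direction of the two ratios, and which ratio lies higher) are meant to be read by eye. Purely as an illustration, they could be approximated numerically as below. The trend test, which compares the mean of the first half of a trajectory with the mean of the second half, is an assumption and not part of the described method, and all names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def is_decreasing(values):
    """Crude trend test (assumption): the first-half mean exceeds
    the second-half mean. Needs at least two values."""
    half = len(values) // 2
    return mean(values[:half]) > mean(values[half:])

def trajectory_cues(f1, f2, f3):
    """Summarize the cues a reader would look for in the plot."""
    r21 = [b / a for a, b in zip(f1, f2)]
    r32 = [c / b for b, c in zip(f2, f3)]
    return {
        "f1_decreasing": is_decreasing(f1),
        "f2_over_f1_decreasing": is_decreasing(r21),
        "f3_over_f2_decreasing": is_decreasing(r32),
        "f3_over_f2_above_f2_over_f1": mean(r32) > mean(r21),
    }
```

The sketch ignores the sparseness and overlap cues mentioned in the text, and frames with formant extraction errors would distort the means.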

Claims

1. A speech visualization method for Mandarin Chinese compound vowels based on formant frequency, characterized in that it comprises the following steps:

Step 1: feature extraction, the specific method being:

(1) pre-filtering the original compound vowel to eliminate power-frequency interference;

(2) performing framing, pre-emphasis, windowing and endpoint detection on the pre-filtered compound vowel to determine the start endpoint and the end endpoint of the compound vowel;

(3) extracting the first three formant frequencies F1, F2 and F3 of each frame between the start endpoint and the end endpoint;

Step 2: compound vowel visualization, the specific method being: representing the first formant frequency F1 on the abscissa and the ratio between two formant frequencies on the ordinate, calculating the values of F2/F1 and F3/F2 for each frame, and plotting the points (F1, F2/F1) and (F1, F3/F2) on the coordinate graph with different icons or different colors respectively.

2. The speech visualization method for Mandarin Chinese compound vowels based on formant frequency according to claim 1, characterized in that the radius of each point on the coordinate graph increases or decreases regularly with the frame number, so that the direction of change of the formant trajectory over time is reflected intuitively on the coordinate graph.