CN101894566A - Visualization method of Chinese mandarin complex vowels based on formant frequency - Google Patents

Visualization method of Chinese mandarin complex vowels based on formant frequency

Info

Publication number
CN101894566A
Authority
CN
China
Prior art keywords
formant
formant frequency
voice
complex
complex vowels
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2010102348459A
Other languages
Chinese (zh)
Inventor
赵胜辉
严静雨
王晶
匡镜明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2010-07-23
Filing date
2010-07-23
Publication date
2010-11-24
Application filed by Beijing Institute of Technology BIT
Priority to CN2010102348459A
Publication of CN101894566A
Legal status: Pending

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention relates to a formant-frequency-based method for visualizing Mandarin Chinese compound vowels, comprising the following steps: feature extraction, i.e. applying pre-filtering, framing, pre-emphasis, windowing and endpoint detection to the original compound-vowel signal and extracting the first three formant frequencies F1, F2 and F3 of each frame; and compound-vowel visualization, i.e. using the abscissa for the first formant frequency F1 and the ordinate for the ratio between two formant frequencies, calculating F2/F1 and F3/F2 for each frame, and plotting the points (F1, F2/F1) and (F1, F3/F2) on a coordinate graph with different icons and colors. The invention presents compound vowels visually as images and can accurately distinguish compound-vowel speech signals. It only needs simple acoustic parameters of the speech signal, such as the short-time average energy and the first three formant frequencies, and is easy to implement.

Description

A visualization method for Mandarin Chinese compound vowels based on formant frequency
Technical field
The present invention relates to a method for visualizing Mandarin Chinese compound vowels, in particular a compound-vowel visualization method based on formant frequency, and belongs to the field of speech visualization.
Background technology
Speech is sound produced by the human vocal organs that serves to distinguish meaning, and it is indispensable in daily life. For people with hearing impairment, however, fluent communication is often very difficult because they do not receive sufficient acoustic information. Studies show that, among the ways people perceive the outside world, vision supplies the most information, followed by hearing, and that vision and hearing combined supply more information than any single sense. Experience also tells us that charts are one of the most convenient and intuitive ways to express ideas and convey information, so people have tried to perceive speech visually, or to convey more useful information by combining hearing and vision. The purpose of the present invention is to explore a speech visualization method, that is, to present speech with visual elements so that speech can be perceived through vision, thereby giving practical help to hearing-impaired people in perceiving speech effectively and practicing correct pronunciation.
Before the present invention, many speech visualization methods were based on face models or on the vocal organs. Such methods describe the mouth shape during pronunciation either qualitatively or quantitatively. Qualitative descriptions include rounded lips, spread lips, the degree of mouth opening, the height of the tongue position, and so on. Many current applications, such as visual face synthesis and automatic machine lip reading, require objective quantitative measurement of visual speech. The international standard MPEG-4 defines Facial Definition Parameters (FDP), Facial Animation Parameters (FAP) and Facial Animation Parameter Units (FAPU). The advantages of the FAP parameters have made them the international standard for facial animation, and the FAPU normalizes the differences between individual faces so that the same parameters produce similar facial expressions on different face models.
Visualizing speech through the movement of the vocal organs and changes in facial expression is relatively natural for users. It analyzes the human articulation process effectively and helps hearing-impaired people practice pronunciation. However, sounds produced with internal vocal organs such as the soft palate and the lower jaw are difficult to show visually. In terms of intelligibility the results are also far from ideal: apart from a very small number of experts, people find it difficult to perceive speech accurately and efficiently by directly observing the movement of the vocal organs. In addition, the visual effect is rather monotonous and lacks expressiveness.
Some researchers have also studied the properties of human hearing, attempting to analyze the hearing mechanism of the auditory organs and to use a corresponding auditory model to obtain features that distinguish speech signals and then visualize them. However, research on the characteristics of human hearing is still at an early stage, and the information that can be exploited is very limited.
Summary of the invention
The technical problem to be solved by the present invention is to provide a speech visualization method that integrates different speech features into a single readable image. Methods of this kind use different colors, icons and icon sizes to represent speech visually as an image. Compared with methods based on vocal-organ models or face models, speech visualization based on integrated speech features offers good readability and intelligibility. After relatively brief training, both hearing-impaired and hearing people can intuitively identify the visual image of the corresponding pronunciation. By reading the images produced by this invention, the diphthong compound vowels of Mandarin Chinese can be distinguished easily.
The technical scheme of the present invention is:
A formant-frequency-based speech visualization method for Mandarin Chinese compound vowels, comprising the following steps:
One, feature extraction, specifically:
(1) pre-filter the original compound-vowel signal to eliminate power-frequency interference;
(2) apply framing, pre-emphasis, windowing and endpoint detection to the pre-filtered compound vowel to determine its start point and end point;
(3) extract the first three formant frequencies F1, F2 and F3 of every frame between the start point and the end point;
Two, compound-vowel visualization, specifically: the abscissa represents the first formant frequency F1 and the ordinate represents the ratio between two formant frequencies; for each frame, the values of F2/F1 and F3/F2 are calculated, and the points (F1, F2/F1) and (F1, F3/F2) are plotted on the coordinate graph with different icons or different colors respectively.
The radius of each point on the coordinate graph increases or decreases regularly with the frame number, so that the direction in which the formant trajectory evolves over time can be read directly from the graph.
Beneficial effect:
(1) The present invention represents compound vowels intuitively as images, using the trend of the first formant F1 over time, together with the trends of F2/F1 and F3/F2 over time and their relative positions, to distinguish the pronunciations of different Mandarin compound vowels. The images of different compound vowels differ clearly, so compound-vowel speech signals can be distinguished accurately. For certain compound vowels, the sparsity of the two trajectories and the way they overlap allow an even more precise distinction.
(2) The present invention only extracts simple acoustic parameters of the speech signal, such as the short-time average energy and the first three formant frequencies, and is therefore easy to implement.
Description of drawings
Fig. 1 is a block diagram of the Mandarin compound-vowel speech visualization system.
Fig. 2 is a flowchart of the formant frequency solution.
Fig. 3 is an example visualization of the Mandarin compound vowel ai, male voice.
Fig. 4 is an example visualization of the Mandarin compound vowel ai, female voice.
Fig. 5 is an example visualization of the Mandarin compound vowel ao, male voice.
Fig. 6 is an example visualization of the Mandarin compound vowel ao, female voice.
Fig. 7 is an example visualization of the Mandarin compound vowel ia, male voice.
Fig. 8 is an example visualization of the Mandarin compound vowel ia, female voice.
Fig. 9 is an example visualization of the Mandarin compound vowel ve, male voice.
Fig. 10 is an example visualization of the Mandarin compound vowel ve, female voice.
Fig. 11 is an example visualization of the Mandarin compound vowel ua, male voice.
Fig. 12 is an example visualization of the Mandarin compound vowel ua, female voice.
Embodiment
Specific embodiments of the invention are described below with reference to the accompanying drawings.
Fig. 1 shows a block diagram of a system implementing the method of the invention. It has two main parts: a feature extraction module and a visualization image generation module.
One, feature extraction module. This module implements the feature extraction step of the invention.
First, the speech signal is pre-processed by pre-filtering, framing, windowing and so on. Then the short-time energy and the first three formant frequencies of each frame are extracted directly. Finally, the formant frequencies of the last few frames in the latter half of the compound vowel are discarded, and the corresponding linear time-axis normalization and smoothing are applied.
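The pre-processing chain described above can be sketched as follows. This is only an illustration under stated assumptions: the patent does not specify the filter type, the pre-emphasis coefficient, the window shape or the frame sizes, so the 50 Hz biquad notch, the coefficient 0.97, the Hamming window and all function names here are hypothetical choices.

```python
import math

def notch_filter(x, fs, f0=50.0, q=5.0):
    """Second-order notch (biquad) suppressing a narrow band around f0 Hz.
    Assumption: power-frequency interference sits at 50 Hz."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    b0, b1, b2 = 1.0, -2.0 * math.cos(w0), 1.0
    a0, a1, a2 = 1.0 + alpha, -2.0 * math.cos(w0), 1.0 - alpha
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = (b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

def pre_emphasis(x, mu=0.97):
    """First-order high-frequency boost: y[n] = x[n] - mu * x[n-1]."""
    if not x:
        return []
    return [x[0]] + [x[n] - mu * x[n - 1] for n in range(1, len(x))]

def frame_and_window(x, frame_len, hop):
    """Split x into overlapping frames and apply a Hamming window to each."""
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frames.append([x[start + n] * window[n] for n in range(frame_len)])
    return frames
```

A typical call order would be `frame_and_window(pre_emphasis(notch_filter(x, fs)), frame_len, hop)`, with `frame_len` and `hop` chosen by the implementer.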
(1) Short-time energy of the speech signal:
$$E_m = \sum_{n=m}^{m+N-1} s_w^2(n) \qquad (1)$$
where $s_w(n)$ is the windowed speech signal, m is the starting point of the window and N is the window length (in samples).
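Equation (1) translates directly into code. The endpoint detector below is a hypothetical illustration of how the frame energies might be used: the patent mentions endpoint detection and short-time average energy but does not give the decision rule, so the single relative threshold is an assumption.

```python
def short_time_energy(s_w, m, N):
    """Eq. (1): energy of the N windowed samples starting at index m."""
    return sum(v * v for v in s_w[m:m + N])

def detect_endpoints(frame_energies, ratio=0.1):
    """Return (first, last) frame indices whose energy exceeds
    ratio * max energy, or None if no frame qualifies.
    Assumption: a single relative threshold, not taken from the patent."""
    threshold = ratio * max(frame_energies)
    active = [i for i, e in enumerate(frame_energies) if e > threshold]
    if not active:
        return None
    return active[0], active[-1]
```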
(2) Solving for the formant frequencies with LPC:
As shown in Fig. 2, the LPC technique is first used to obtain the transfer function H(z) of the speech production system. The roots of the polynomials of a digital filter's transfer function H(z) correspond to the poles and zeros of the system's frequency response. Here the speech transfer function H(z) is all-pole, with only a denominator polynomial:
$$H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{M} a_k z^{-k}} \qquad (2)$$
where M is the linear prediction order.
Setting A(z) = 0 yields the M/2 pairs of complex-conjugate roots of this polynomial:
$$z_i = r_i e^{j\theta_i}, \qquad z_i^* = r_i e^{-j\theta_i} \qquad (3)$$
where $r_i$ is the modulus of the complex root and $\theta_i$ is its argument. Theoretical derivation shows that they are related to the formant frequency $F_i$ as follows:
$$F_i = \frac{\theta_i}{2\pi T} \qquad (4)$$
where T is the sampling period. For general speech analysis, M is taken as 10 to 18.
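Equations (2) to (4) can be illustrated with a short NumPy sketch. It assumes the autocorrelation method with the Levinson-Durbin recursion for the LPC coefficients, which the patent does not specify, and the function names are hypothetical. It also simply returns the lowest root angles. Practical formant trackers usually also reject roots with very large bandwidth, which is omitted here.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC via Levinson-Durbin.
    Returns a_1..a_M as used in A(z) = 1 - sum_k a_k z^-k (eq. 2)."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - np.dot(a[1:i], r[i - 1:0:-1])
        k = acc / err                # reflection coefficient
        new_a = a.copy()
        new_a[i] = k
        new_a[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:]

def formants_from_lpc(a, fs, n_formants=3):
    """Solve A(z) = 0 and convert root angles to Hz: F_i = theta_i / (2*pi*T)."""
    poly = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    roots = np.roots(poly)
    roots = roots[np.imag(roots) > 0]        # keep one root of each conjugate pair
    freqs = np.angle(roots) * fs / (2.0 * np.pi)   # T = 1 / fs
    return np.sort(freqs)[:n_formants]
```

For a windowed frame `frame` sampled at `fs`, `formants_from_lpc(lpc_coefficients(frame, 12), fs)` would give candidate values for F1, F2 and F3. The order 12 is just one value inside the 10 to 18 range mentioned above.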
(3) Linear time-axis normalization:
For a diphthong, the formant frequencies of its initial segment and middle transition segment are decisive for distinguishing it, so the formant frequencies of the last few frames in the latter half of the compound vowel are discarded first. Because the formant trajectories of different compound vowels differ in length, the trajectories must be normalized. The normalized trajectory length is taken as 50 frames here. If fewer than 50 frames remain after discarding, no compression is applied; if more than 50 frames remain, the normalization coefficient is:
coeff = original formant trajectory length / normalized formant trajectory length (5)
Let the n nodes of the original formant trajectory be $x_1 < x_2 < \dots < x_n$, with corresponding formant frequency values $y_i$ (i = 1, 2, …, n). Let the m nodes of the normalized formant trajectory be $x_1^0 < x_2^0 < \dots < x_m^0$, with corresponding formant frequencies $z_i$ (i = 1, 2, …, m).
To obtain the frequency value at node $x_i^0$ of the normalized trajectory, the node is first mapped onto the original formant trajectory to obtain the corresponding position $x_i$:
$$x_i = \mathrm{coeff} \cdot x_i^0 \qquad (6)$$
Because $x_i$ is in most cases not an integer, the frequency values at the two original nodes nearest to $x_i$, denoted $x_{i-1}$ and $x_{i+1}$, are used to calculate the frequency value at node $x_i^0$ of the normalized trajectory:
$$z_i = y_{i-1}(x_{i+1} - x_i) + y_{i+1}(x_i - x_{i-1}) \qquad (7)$$
When $x_{i-1}$ and $x_{i+1}$ are adjacent frames on either side of $x_i$, the two weights sum to 1, so equation (7) is a linear interpolation.
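A minimal sketch of equations (5) to (7), assuming that the original nodes are the 1-based frame indices 1..n and that the two nearest nodes are the integer frames just below and above the mapped position. The function name and the handling of the final node are hypothetical.

```python
import math

def normalize_trajectory(y, target_len=50):
    """Compress a formant trajectory y to target_len frames by linear
    interpolation; trajectories that are already short are left unchanged."""
    n = len(y)
    if n <= target_len:
        return list(y)
    coeff = n / target_len                    # eq. (5)
    out = []
    for i in range(1, target_len + 1):        # normalized node x_i^0 = i
        x = coeff * i                         # eq. (6): position on original axis
        lo = int(math.floor(x))
        frac = x - lo
        lo = min(max(lo, 1), n)               # clamp to valid 1-based frames
        hi = min(lo + 1, n)
        # eq. (7) with unit node spacing: weights (1 - frac) and frac
        out.append(y[lo - 1] * (1.0 - frac) + y[hi - 1] * frac)
    return out
```

The same function would be applied separately to the F1, F2 and F3 trajectories so that all three end up with the same number of frames.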
(4) Median smoothing:
Median smoothing is a sliding-window method based on histogram statistics. Its basic principle is as follows: let {x(n)} be the input signal and {y(n)} the output of the median filter, with window length 2L+1. The output value $y(n_0)$ at $n_0$ is then the median of the input samples inside the window when the window is centered at $n_0$. The median is obtained by taking the 2L+1 input samples $x(n_0-L), x(n_0-L+1), \dots, x(n_0), x(n_0+1), x(n_0+2), \dots, x(n_0+L)$ and building their cumulative histogram; the 1/2 quantile is the median.
Median filtering corrects isolated outliers without affecting the values of the surrounding samples.
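The median smoother can be sketched in a few lines. Replicating the first and last samples at the edges is an assumption, since the patent does not say how the window is handled at the boundaries.

```python
from statistics import median

def median_smooth(x, L=1):
    """Sliding-window median with window length 2L+1.
    Edges are padded by repeating the first and last samples (assumption)."""
    padded = [x[0]] * L + list(x) + [x[-1]] * L
    return [median(padded[n:n + 2 * L + 1]) for n in range(len(x))]
```

For example, a single spurious F1 value of 2400 Hz inside a run of values near 700 Hz is replaced by a neighboring value.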
Linear smoothing applies linear filtering with a sliding window, that is:
$$y(n) = \sum_{m=-L}^{L} x(n-m)\,w(m) \qquad (8)$$
where {w(m), m = -L, -L+1, …, 0, 1, 2, …, L} is a (2L+1)-point smoothing window satisfying:
$$\sum_{m=-L}^{L} w(m) = 1 \qquad (9)$$
For example, a 3-point window may take the values {0.25, 0.5, 0.25}. While correcting the sample values at rough spots in the input signal, linear smoothing also modifies the values of nearby samples. The two smoothing techniques above can be used in combination.
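Equation (8) with the 3-point window from the example can be written as below. As with the median smoother, repeating the edge samples is an assumption, and the function name is hypothetical.

```python
def linear_smooth(x, w=(0.25, 0.5, 0.25)):
    """Eq. (8): y(n) = sum_m x(n - m) * w(m) for m = -L..L.
    The window w must sum to 1 (eq. 9); edges are padded by replication."""
    L = len(w) // 2
    padded = [x[0]] * L + list(x) + [x[-1]] * L
    out = []
    for n in range(len(x)):
        # x(n - m) lives at padded[n + L - m]; w(m) is stored at w[m + L]
        out.append(sum(padded[n + L - m] * w[m + L] for m in range(-L, L + 1)))
    return out
```

One way to combine the two techniques is `linear_smooth(median_smooth(track))`, which removes isolated outliers first and then smooths the remaining small fluctuations.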
Two, visualization module:
Figs. 3 to 12 show the visualization images of the zero-initial diphthong finals /ai/, /ao/, /ia/, /ve/ and /ua/ of Mandarin Chinese, each final being pronounced by a male voice and a female voice. The abscissa represents the first formant frequency F1 and the ordinate represents the ratio between two formant frequencies; for each frame, the values of F2/F1 and F3/F2 are calculated, and the points (F1, F2/F1) and (F1, F3/F2) are plotted on the coordinate graph with different icons or different colors respectively. In this embodiment, red dots and blue diamonds are used in each view as the icons for F2/F1 and F3/F2 respectively. To reflect the order in which the formant trajectory evolves over time, the size of each icon changes according to the following rule:
$$d_i = 3 + i^{0.6} \qquad (10)$$
where i denotes the i-th icon and $d_i$ is the diameter of the i-th icon.
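The coordinate and icon-size calculation can be sketched as follows. Reading equation (10) as 3 plus i raised to the power 0.6 is an assumption about the flattened formula, and the function names are hypothetical. The plotting helper uses Matplotlib, whose `scatter` size argument `s` is a marker area in points squared, so the diameter is squared before being passed.

```python
def visualization_points(f1, f2, f3):
    """Return the (F1, F2/F1) points, the (F1, F3/F2) points and the
    icon diameters d_i = 3 + i**0.6 for frames i = 1, 2, ..."""
    pts_21 = [(a, b / a) for a, b in zip(f1, f2)]
    pts_32 = [(a, c / b) for a, b, c in zip(f1, f2, f3)]
    diameters = [3.0 + i ** 0.6 for i in range(1, len(f1) + 1)]
    return pts_21, pts_32, diameters

def plot_trajectories(f1, f2, f3):
    """Draw red dots for F2/F1 and blue diamonds for F3/F2."""
    import matplotlib.pyplot as plt  # imported here so the helper above has no dependency
    pts_21, pts_32, d = visualization_points(f1, f2, f3)
    areas = [v * v for v in d]       # scatter expects area in points^2
    plt.scatter([p[0] for p in pts_21], [p[1] for p in pts_21],
                s=areas, c="red", marker="o", label="F2/F1")
    plt.scatter([p[0] for p in pts_32], [p[1] for p in pts_32],
                s=areas, c="blue", marker="D", label="F3/F2")
    plt.xlabel("F1 (Hz)")
    plt.ylabel("formant frequency ratio")
    plt.legend()
    plt.show()
```

Because the diameter grows with the frame index, later frames appear as larger icons, which shows the direction of the trajectory over time.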
The image reflects the trend of the first formant F1 over time and the relationship among F1, F2 and F3. Different Mandarin diphthong finals are distinguished by whether F1 changes from large to small, whether F3/F2 and F2/F1 follow a decreasing trend, and whether the value of F3/F2 lies above the value of F2/F1 in the view. For certain compound vowels it can also be seen that the F2/F1 trajectory is more sparsely distributed, or that the F3/F2 and F2/F1 trajectories overlap at their ends, which provides additional information for distinguishing compound vowels more precisely. Specifically:
As the figures show, the pronunciations differ clearly in the trend of F1 and in the trends and relative positions of F2/F1 and F3/F2, so the human eye can fairly easily divide them into several broad classes. For a few pronunciations the F2/F1 and F3/F2 trajectories appear discontinuous, mainly because of formant extraction errors in some frames.
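The visual cues listed above are meant to be read by eye, but they can also be summarized numerically. The sketch below is a hypothetical illustration and not part of the patent: it judges each trend by comparing the first and last frames and compares the two ratio tracks by their means, both of which are simplifying assumptions.

```python
def trajectory_cues(f1, f2, f3):
    """Summarize the cues used to tell compound vowels apart."""
    r21 = [b / a for a, b in zip(f1, f2)]     # F2/F1 per frame
    r32 = [c / b for b, c in zip(f2, f3)]     # F3/F2 per frame
    return {
        "f1_decreasing": f1[-1] < f1[0],
        "f2_f1_decreasing": r21[-1] < r21[0],
        "f3_f2_decreasing": r32[-1] < r32[0],
        "f3_f2_above_f2_f1": sum(r32) / len(r32) > sum(r21) / len(r21),
    }
```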
With the method of the invention, Mandarin compound-vowel speech signals are expressed as coordinate graphs that can be distinguished intuitively, giving practical help to hearing-impaired people in perceiving speech effectively and practicing correct pronunciation.
The above embodiments only illustrate the technical scheme of the present invention and do not limit it. Although the invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the specific embodiments may still be modified, or some technical features replaced with equivalents, without departing from the spirit of the technical scheme of the invention, and all such modifications shall fall within the scope of protection claimed by the invention.

Claims (2)

1. A formant-frequency-based speech visualization method for Mandarin Chinese compound vowels, characterized by comprising the following steps:
One, feature extraction, specifically:
(1) pre-filter the original compound-vowel signal to eliminate power-frequency interference;
(2) apply framing, pre-emphasis, windowing and endpoint detection to the pre-filtered compound vowel to determine its start point and end point;
(3) extract the first three formant frequencies F1, F2 and F3 of every frame between the start point and the end point;
Two, compound-vowel visualization, specifically: the abscissa represents the first formant frequency F1 and the ordinate represents the ratio between two formant frequencies; for each frame, the values of F2/F1 and F3/F2 are calculated, and the points (F1, F2/F1) and (F1, F3/F2) are plotted on the coordinate graph with different icons or different colors respectively.
2. The formant-frequency-based speech visualization method for Mandarin Chinese compound vowels according to claim 1, characterized in that the radius of each point on the coordinate graph increases or decreases regularly with the frame number, so that the direction in which the formant trajectory evolves over time can be read directly from the graph.
CN2010102348459A 2010-07-23 2010-07-23 Visualization method of Chinese mandarin complex vowels based on formant frequency Pending CN101894566A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010102348459A CN101894566A (en) 2010-07-23 2010-07-23 Visualization method of Chinese mandarin complex vowels based on formant frequency

Publications (1)

Publication Number Publication Date
CN101894566A true CN101894566A (en) 2010-11-24

Family

ID=43103737

Country Status (1)

Country Link
CN (1) CN101894566A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102231281A (en) * 2011-07-18 2011-11-02 渤海大学 Voice visualization method based on integration characteristic and neural network
CN102820037A (en) * 2012-07-21 2012-12-12 渤海大学 Chinese initial and final visualization method based on combination feature
CN103077728A (en) * 2012-12-31 2013-05-01 上海师范大学 Patient weak voice endpoint detection method
CN107993071A (en) * 2017-11-21 2018-05-04 平安科技(深圳)有限公司 Electronic device, auth method and storage medium based on vocal print
CN108962251A (en) * 2018-06-26 2018-12-07 珠海金山网络游戏科技有限公司 A kind of game role Chinese speech automatic identifying method
TWI749796B (en) * 2020-09-30 2021-12-11 瑞軒科技股份有限公司 Resonance test system and resonance test method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0473698A (en) * 1990-07-13 1992-03-09 Sony Corp Shape control method based on audio signal
CN101281747A (en) * 2008-05-30 2008-10-08 苏州大学 Method for recognizing Chinese language whispered pectoriloquy intonation based on acoustic channel parameter

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20101124