CN102820037B - Chinese initial and final visualization method based on combination feature

Chinese initial and final visualization method based on combination feature

Info

Publication number
CN102820037B
CN102820037B CN201210252989.6A
Authority
CN
China
Prior art keywords
information
image
frame
feature
voice signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210252989.6A
Other languages
Chinese (zh)
Other versions
CN102820037A (en)
Inventor
韩志艳
伦淑娴
王健
于忠党
郭艳东
尹作友
郭兆正
王巍
韩建群
苏宪利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bohai University
Original Assignee
Bohai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bohai University filed Critical Bohai University
Priority to CN201210252989.6A priority Critical patent/CN102820037B/en
Publication of CN102820037A publication Critical patent/CN102820037A/en
Application granted granted Critical
Publication of CN102820037B publication Critical patent/CN102820037B/en

Abstract

The invention relates to a method for visualizing Chinese initials and finals based on combined features, comprising the steps of: pre-processing a speech signal; computing the frame count of the pre-processed signal as a duration feature, representing a resonance-strength feature by the relative size of the frequency-domain peak amplitude with respect to the average amplitude, obtaining the formant feature values of each frame, and computing the robust feature parameters WPTC1~WPTC20 and PMUSIC-MFCC1~PMUSIC-MFCC12; encoding the image width information and the image length information from the duration feature and the resonance-strength feature respectively; encoding the main color information from the formant features; feeding the 32 feature parameters to a neural network whose output is the corresponding pattern information, the outputs corresponding in turn to 23 initials and 24 finals; and fusing the width, length, main-color and pattern information into one image displayed on a screen. The method helps deaf-mutes undertaking speech training to establish and improve auditory perception and to form correct speech reflexes, so as to recover their speech function.

Description

Method for visualizing Chinese initials and finals based on combined features
Technical field
The present invention relates to a method for visualizing the initials and finals of Chinese speech, and in particular to a method for visualizing Chinese initials and finals based on combined features.
Background art
Speech is the acoustic expression of language: it is the most natural, effective and convenient means of human communication, and also a support of human thought. For deaf-mutes, however, communication is difficult; some deaf-mutes are mute because their hearing organs are damaged, so that speech information cannot reach the brain. Research shows that the human auditory and visual systems are two complementary information systems of different kinds. The visual system is a highly parallel information-receiving and processing system: the millions of cone cells on the retina of the human eye are connected to the brain by nerve fibres, forming a highly parallel channel whose information-receiving rate is very high. By measurement and estimation, the information rate when watching television can roughly reach a value (given only as an image in the original) thousands of times the rate at which the auditory system receives speech; hence it is believed that about 70% of the information humans acquire is obtained through vision. For deaf people this is therefore undoubtedly a great aid: the defect of hearing can be compensated by vision, so that speech can not only be heard but also be "seen" by deaf-mutes in various other forms.
In 1947, R.K.Potter and G.A.Kopp proposed a visualization method, the sound spectrograph. Various speech researchers subsequently studied and improved this visualization method, such as the color spectrograph proposed by L.C.Stewart et al. in 1976 for training deaf people and the real-time spectrograph system proposed by G.M.Kuhn et al. in 1984; P.E.Stern in 1986, F.Plante et al. in 1998 and R.Steinberg in 2008 also proposed many improvements to the spectrograph. However, the displayed spectrograms are highly technical and hard to distinguish and memorize. In particular, the same utterance spoken by different people, or even by the same person, may produce different spectrograms, and robustness is worse still for speech recorded in different environments.
In addition, some scholars have visualized speech through the movements of the vocal organs and the changes of facial expression, which effectively dissects the human phonation process; but in terms of speech intelligibility the effect is still far from ideal, since apart from a few experts, people can hardly perceive speech accurately and directly by observing the motion of the vocal organs and the changes of facial expression.
Summary of the invention
The technical problem to be solved by this invention is to provide a speech visualization method based on combined features that is simple, easy to memorize and highly robust. The method can help deaf-mutes carry out speech training, establish and improve auditory perception, form correct speech reflexes and rebuild the auditory speech chain, so as to recover their own speech function as far as possible.
The technical solution of the present invention is as follows:
A method for visualizing Chinese initials and finals based on combined features comprises the following steps:
1. Speech signal pre-processing
A speech signal is input through a microphone and sampled and quantized by a processing unit to obtain the corresponding speech data, which then undergo pre-emphasis, framing, windowing and endpoint detection;
2. Feature extraction
(2.1) Compute the frame count of the pre-processed speech signal as its duration feature;
(2.2) Represent the resonance-strength feature by the relative size of the frequency-domain peak amplitude with respect to the average amplitude. For the framed speech signal, the resonance intensity of each frame is given by a formula that appears only as an image in the original; in it (using notation introduced here, since the original symbols are likewise images) the complex coefficient X(k) is the k-th harmonic component transformed to the frequency domain, N is the number of harmonics of the frame, X is the frequency-domain transform of the frame, E[·] denotes averaging, and a parameter β is adjusted according to the type of speech being recognized within a stated range;
(2.3) Estimate the formant features of the pre-processed speech signal by a method based on the Hilbert-Huang transform, obtaining the formant feature values F1, F2, F3 of each frame;
(2.4) Compute the robust speech feature parameters based on the wavelet packet transform (WPTC): WPTC1~WPTC20;
(2.5) Compute the robust feature parameters based on MUSIC and perceptual characteristics (PMUSIC-MFCC): PMUSIC-MFCC1~PMUSIC-MFCC12;
3. Width information coding
Encode the image width information with the duration feature: according to the pixel size of the display area, convert the duration feature into image width information by a linear transformation;
4. Length information coding
Encode the image length information with the resonance-strength feature: according to the pixel size of the display area, convert the per-frame mean of the resonance-strength feature into image length information by a linear transformation;
5. Main color coding
Encode the main color information with the formant features: average the formant feature values F1, F2, F3 respectively over all frames, then convert them into main color information by R=5F1/F3, G=3F3/(5F2), B=F2/(3F1);
6. Neural network design
The neural network is a three-layer BP neural network with 32 input-layer neurons and 6 output-layer neurons;
7. Pattern information coding
The 32 combined features WPTC1~WPTC20 and PMUSIC-MFCC1~PMUSIC-MFCC12 serve as the input of the neural network, and its output is the corresponding pattern information; the 6 output-layer neurons all use binary coding, giving 64 distinct codes, of which only the first 47 are used, corresponding in turn to the 23 initials b, p, m, f, d, t, n, l, g, k, h, j, q, x, zh, ch, sh, r, z, c, s, y, w and the 24 finals a, o, e, i, u, ü, ai, ei, ui, ao, ou, iu, ie, üe, er, an, en, in, un, ün, ang, eng, ing, ong;
8. Image synthesis
During image synthesis, the width, length, main-color and pattern information are fused into one image and shown on the display screen.
Specifically, during image synthesis the width and length information first determine the image size; the main color is then filled in at the image position; finally the pattern information replaces the main color at the corresponding positions, yielding the image of the pronunciation.
During the speech signal pre-processing, sampling and quantization are performed by the processing unit at a sampling frequency of 11.025kHz with 16-bit quantization precision; pre-emphasis is realized by a first-order digital pre-emphasis filter whose coefficient lies between 0.93 and 0.97; framing and windowing use a frame length of 256 samples, a Hamming window being applied to the framed data; and endpoint detection uses the short-time energy-zero product method.
The image width information = duration feature × a scale factor whose symbol appears only as an image in the original; its value is chosen so that the displayed image is easiest for an observer to view and identify.
The image length information = per-frame mean of the resonance-strength feature × a scale factor whose symbol appears only as an image in the original; its value is chosen so that the displayed image is easiest for an observer to view and identify.
The pattern of an initial's image is a white texture, and the pattern of a final's image is a black texture.
When the relative size of the frequency-domain peak amplitude with respect to the average amplitude is used to represent the resonance-strength feature, a frame is taken as 256 points.
The beneficial effects of the present invention are as follows:
(1) By combining different speech features into one image, the invention creates a readable pattern of the speech signal for deaf-mutes. Compared with the prior art it has good robustness and understandability, making up for the difficulty of distinguishing and memorizing visualizations based on the sound spectrograph. Whether hearing-impaired or not, a person can, after a period of specialized training, intuitively recognize the pronunciation corresponding to a visual image and communicate with able-bodied people.
(2) The invention fully exploits deaf-mutes' visual discrimination ability and their stronger visual memory for color stimuli; the image colors of different initials and finals differ, which greatly increases deaf-mutes' interest in learning.
(3) The invention trains and models the neural network with a dynamic training set, avoiding the excessive training load caused by blindly assembling a training set and effectively improving the correct coding rate of the pattern information.
(4) The invention lays out the information according to the pronunciation rules of the initials and finals, greatly reducing the memory burden on deaf-mutes.
Brief description of the drawings
Fig. 1 is the system architecture diagram of the present invention;
Fig. 2 is the main color coding block diagram;
Fig. 3 is the structural diagram of the neural network in Fig. 1;
Fig. 4 is a schematic diagram of the pattern-information coding of the compound finals;
Fig. 5 is a schematic diagram of the pattern-information coding of the front nasal finals (an en in un ün);
Fig. 6 is a schematic diagram of the pattern-information coding of the back nasal finals (ang eng ing ong);
Fig. 7 is a schematic diagram of the pattern-information coding of the bilabials (b p m);
Fig. 8 is a schematic diagram of the pattern-information coding of the labiodental (f);
Fig. 9 is a schematic diagram of the pattern-information coding of the dentals (z c s);
Fig. 10 is a schematic diagram of the pattern-information coding of the blade-alveolars (d t n l);
Fig. 11 is a schematic diagram of the pattern-information coding of the blade-palatals (zh ch sh r);
Fig. 12 is a schematic diagram of the pattern-information coding of the dorsals (j q x);
Fig. 13 is a schematic diagram of the pattern-information coding of the velars (g k h);
Fig. 14 is a schematic diagram of the pattern-information coding of the initials (y w);
Fig. 15 is an example of the speech visualization effect of the initial "y";
Fig. 16 is an example of the speech visualization effect of the two-phone syllable "y+u";
Fig. 17 is an example of the speech visualization effect of the three-phone syllable "y+u+an";
Fig. 18 is an example of the speech visualization effect of the initials "y, w" and the finals "i, u".
Detailed description of the embodiments
The technical solution of the invention is elaborated below with reference to the drawings and an embodiment:
As shown in Fig. 1, the method comprises a speech-signal pre-processing module, a feature-extraction module, a width-information coding module, a length-information coding module, a main-color coding module, a neural-network design module, a pattern-information coding module and an image-synthesis module, as follows:
One. Speech signal pre-processing
The processing unit performs sampling and quantization at a sampling frequency of 11.025kHz with 16-bit quantization precision to obtain the corresponding speech data; pre-emphasis is then realized with a first-order digital pre-emphasis filter whose coefficient lies between 0.93 and 0.97, 0.9375 in this example. Next the data are framed with a frame length of 256 samples and a Hamming window is applied to each frame, after which endpoint detection is carried out with the short-time energy-zero product method. The processing unit may be a computer, a single-chip microcomputer, a DSP chip, etc.; this example uses a computer.
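For concreteness, the pre-processing chain can be sketched as follows. This is a minimal Python illustration (the patent prescribes no code): endpoint detection is omitted, the input is assumed to be at least one frame long, and the function name and the 80-sample frame shift (taken from the feature-extraction step below) are illustrative.

```python
import numpy as np

def preprocess(signal: np.ndarray, alpha: float = 0.9375,
               frame_len: int = 256, frame_shift: int = 80) -> np.ndarray:
    """Pre-emphasize, frame and window a sampled speech signal."""
    # First-order digital pre-emphasis filter: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Split into overlapping frames of frame_len samples
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([emphasized[i * frame_shift:i * frame_shift + frame_len]
                       for i in range(n_frames)])
    # Apply a Hamming window to every frame (endpoint detection omitted here)
    return frames * np.hamming(frame_len)
```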
Two. Feature extraction
1. Duration feature
The frame count of the pre-processed speech signal is computed as its duration feature, with 256 sampling points per frame and a frame shift of 80 sampling points.
2. Resonance-strength feature
The resonance-strength feature is represented by the relative size of the frequency-domain peak amplitude with respect to the average amplitude.
The harmonic model of speech is widely used in speech analysis and synthesis; its core is a sinusoidal representation of the speech signal. For the framed signal, assuming the harmonic characteristics change little within one short frame, the resonance intensity of each frame is given by a formula that appears only as an image in the original. In it (using notation introduced here, since the original symbols are likewise images) the complex coefficient X(k) is the k-th harmonic component transformed to the frequency domain, N is the number of harmonics of the frame, X is the frequency-domain transform of the frame, E[·] denotes averaging, and the parameter β is adjusted according to the type of speech being recognized over the range 2~8; the value taken in this example also appears only as an image.
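Since the defining formula survives only as an image, any implementation is necessarily a reading of the surrounding description. A minimal sketch, assuming the intensity is the ratio of the mean of the β largest frequency-domain magnitudes to the mean magnitude of the whole frame (β in 2~8 as stated); the exact role of β in the original formula is not recoverable:

```python
import numpy as np

def resonance_intensity(frame: np.ndarray, beta: int = 4) -> float:
    """Peak-to-average amplitude ratio of one windowed frame (one plausible
    reading of the patent's formula, which is given only as an image)."""
    mag = np.abs(np.fft.rfft(frame))      # |X(k)|: harmonic magnitudes
    peak = np.sort(mag)[-beta:].mean()    # mean of the beta largest peaks (assumed)
    return float(peak / mag.mean())       # relative to the average amplitude
```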
3. Formant feature
The formant frequency features of the pre-processed speech are estimated by a method based on the Hilbert-Huang transform, yielding the formant feature values F1, F2, F3 of each frame.
Specifically, the formant frequencies of the speech signal preliminarily estimated by the fast Fourier transform (FFT) determine the parameters of corresponding band-pass filters, with which the speech signal is filtered; the filtered signal is subjected to empirical mode decomposition (EMD) to obtain a family of intrinsic mode functions (IMFs); the IMF containing the formant frequency is selected by the maximum-energy principle; and the instantaneous frequency and Hilbert spectrum of that IMF give the formant frequency parameters of the speech signal.
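A sketch of one band's formant estimate along these lines. To stay short it replaces the EMD/IMF-selection stage with a plain Hilbert transform of the band-passed signal, which is an assumption rather than the patented procedure; the filter order and bandwidth are likewise illustrative.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def formant_estimate(frame: np.ndarray, fs: float = 11025.0,
                     bandwidth: float = 300.0) -> float:
    """One formant from a frame: FFT peak -> band-pass -> instantaneous freq.
    (EMD/IMF selection of the patent replaced by a Hilbert transform.)"""
    frame = np.asarray(frame, dtype=float)
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    f0 = freqs[np.argmax(spec[1:]) + 1]             # preliminary FFT estimate
    lo = max(f0 - bandwidth, 1.0)
    hi = min(f0 + bandwidth, fs / 2 - 1.0)
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    band = filtfilt(b, a, frame)                    # isolate the formant band
    phase = np.unwrap(np.angle(hilbert(band)))      # instantaneous phase
    inst_freq = np.diff(phase) * fs / (2 * np.pi)   # instantaneous frequency
    return float(np.median(inst_freq))              # robust per-frame value
```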
4. Computing the WPTC parameters
Exploiting the constant-Q (quality-factor) property of the wavelet packet transform in each analysis band, which matches the way the human auditory system processes signals, the frequency band is divided at multiple levels by wavelet packets and sub-bands are selected adaptively according to the auditory perception bands, yielding the robust speech feature parameters based on the wavelet packet transform (WPTC): WPTC1~WPTC20.
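The patent does not spell out the adaptive band selection here, so the sketch below assumes a fixed five-level wavelet-packet decomposition (32 frequency-ordered sub-bands) whose first 20 log-energies stand in for WPTC1~WPTC20; the db4 wavelet is an arbitrary choice.

```python
import numpy as np
import pywt

def wptc(frame, wavelet: str = "db4", level: int = 5) -> np.ndarray:
    """Log sub-band energies of a wavelet-packet decomposition (assumed
    stand-in for WPTC1..WPTC20; the patent's band selection is adaptive)."""
    frame = np.asarray(frame, dtype=float)
    wp = pywt.WaveletPacket(frame, wavelet=wavelet, maxlevel=level)
    # Frequency-ordered terminal nodes of the level-5 decomposition
    bands = [node.data for node in wp.get_level(level, order="freq")]
    energies = np.array([np.sum(b ** 2) + 1e-12 for b in bands])
    return np.log(energies[:20])
```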
5. Computing the PMUSIC-MFCC parameters
To improve the robustness of the speech visualization, the Multiple Signal Classification (MUSIC) spectrum-estimation technique is adopted and perceptual characteristics are introduced into it, yielding the robust feature parameters based on MUSIC and perceptual characteristics (PMUSIC-MFCC): PMUSIC-MFCC1~PMUSIC-MFCC12.
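The construction suggests a MUSIC pseudospectrum feeding an MFCC-style mel/log/DCT pipeline. A sketch under that assumption; the correlation-matrix size, model order, filterbank shape and frequency grid are all illustrative choices, not values fixed by the patent.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from scipy.fft import dct
from scipy.linalg import eigh

def pmusic_mfcc(frame, fs=11025.0, order=16, m=24, n_grid=256, n_ceps=12):
    """MUSIC pseudospectrum -> mel filterbank -> log -> DCT (assumed pipeline)."""
    frame = np.asarray(frame, dtype=float)
    n = 64                                    # correlation matrix size (assumed)
    x = sliding_window_view(frame, n)
    r = x.T @ x / len(x)                      # sample autocorrelation matrix
    w, v = eigh(r)                            # eigenvalues in ascending order
    noise = v[:, : n - order]                 # noise subspace
    freqs = np.linspace(0, fs / 2, n_grid)
    a = np.exp(-2j * np.pi * np.outer(np.arange(n), freqs) / fs)
    # MUSIC pseudospectrum: 1 / ||E_n^H a(f)||^2
    p = 1.0 / np.maximum(np.sum(np.abs(noise.conj().T @ a) ** 2, axis=0), 1e-12)
    # Uniform-in-mel triangular filterbank applied to the pseudospectrum
    mel = 2595 * np.log10(1 + freqs / 700)
    centers = np.linspace(mel[0], mel[-1], m + 2)
    fbank = np.maximum(0, 1 - np.abs(mel[None, :] - centers[1:-1, None])
                       / (centers[1] - centers[0]))
    return dct(np.log(fbank @ p + 1e-12), norm="ortho")[:n_ceps]
```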
Three. Width information coding
The image width information is encoded with the duration feature: according to the pixel size of the display area, the duration feature is converted into image width information by a linear transformation, i.e. image width information = duration feature × scale factor. The factor is chosen so that the displayed image is easiest for an observer to view and identify; this example uses a display area of 300 × 300 pixels and takes the factor as 6. For instance, the duration information of the initial y becomes a width of 15 pixels after the linear operation.
Four. Length information coding
The image length information is encoded with the resonance-strength feature: according to the pixel size of the display area, the per-frame mean of the resonance-strength feature is converted into image length information by a linear transformation, i.e. image length information = mean per-frame resonance strength × scale factor. The factor is again chosen so that the displayed image is easiest to view and identify, and is taken as 180 in this example. For instance, the resonance-strength information of the initial y becomes a length of 150 pixels after the linear operation.
Five. Main color coding
As shown in Fig. 2, the main color information is mapped from the formant features: the formant feature values F1, F2, F3 are each averaged over all frames and then converted into main color information by the formulas R=5F1/F3, G=3F3/(5F2), B=F2/(3F1). The coefficients 5, 3/5 and 1/3 were verified experimentally to give good color discrimination; they were selected so that most pronunciations differ in color, which helps deaf-mutes recognize and memorize them. The main color is obtained by assigning the RGB values of the corresponding screen positions: full amplitude 1 on all three primaries gives white, amplitude 0 on all three gives black, and each primary contributes to the color by the additive-color rule.
For example, the three formant means of the initial "b" are F1=538.97Hz, F2=1059.73Hz and F3=2841.58Hz, giving R=0.9484, G=1.6089 and B=0.6554, so the main color of the produced image is light yellow.
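The color mapping can be checked directly against the worked "b" example. A sketch that clips each channel to [0, 1] before display; the clipping is an assumption (the text states only that full amplitude 1 gives white and 0 gives black):

```python
import numpy as np

def main_color(f1: float, f2: float, f3: float) -> np.ndarray:
    """RGB triple from the three formant means (clipping to [0,1] assumed)."""
    r = 5 * f1 / f3
    g = 3 * f3 / (5 * f2)
    b = f2 / (3 * f1)
    return np.clip([r, g, b], 0.0, 1.0)

# The "b" example: R=0.9484, G=1.6089 (clipped to 1), B=0.6554 -> light yellow
print(main_color(538.97, 1059.73, 2841.58))
```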
Six. Neural network design
As shown in Fig. 3, the neural network is a three-layer BP neural network with 32 input-layer neurons and 6 output-layer neurons. It is trained and modelled with a dynamic training set: recognition starts from an existing, relatively small sample set; whenever an actual pronunciation is misrecognized, that speech sample is added to the corresponding training set, while correctly recognized samples are discarded, so that the training set grows ever richer under the actual operating conditions. Once the error probability has fallen to a sufficiently low level, the neural network model for this practical application is obtained.
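A sketch of the 32-input network and the dynamic-training-set loop, using scikit-learn's MLPClassifier as a stand-in for the BP network. The hidden-layer size, the training hyperparameters, full retraining on every misrecognized sample, and the use of class indices in place of the 6-bit output codes are all assumptions; the patent fixes only the input/output dimensions and the add-on-error rule.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def dynamic_training(X_train, y_train, stream):
    """Grow the training set from live pronunciations: misrecognized samples
    are added and the network retrained; correct ones are discarded."""
    net = MLPClassifier(hidden_layer_sizes=(24,), activation="logistic",
                        solver="sgd", max_iter=2000)  # BP-network stand-in
    net.fit(X_train, y_train)                         # initial small sample set
    for x, y in stream:                               # (32 features, true label)
        if net.predict(x.reshape(1, -1))[0] != y:     # misrecognized:
            X_train = np.vstack([X_train, x])         # add sample to the set
            y_train = np.append(y_train, y)
            net.fit(X_train, y_train)                 # retrain on enlarged set
    return net
```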
Seven. Pattern information coding
As shown in Figs. 4-14, the 32 combined features WPTC1~WPTC20 and PMUSIC-MFCC1~PMUSIC-MFCC12 serve as the input of the neural network, and its output is the corresponding pattern information. The 6 output-layer neurons all use binary coding, giving 64 distinct codes, of which only the first 47 are used, corresponding in turn to the 23 initials and 24 finals: 000000 represents the initial b, 000001 the initial p, and so on. The pattern of every initial's image is a white texture and that of every final's image a black texture; different texture patterns are displayed by changing the saturation of the RGB primaries at the corresponding positions, and a, o, e, i, u, ü carry no pattern. Sounds sharing the same pattern in the figures are very similar in pronunciation: the first three, b, p, m, are bilabials, whose patterns all have a white textured stripe at top and bottom, while d, t, n, l are blade-alveolars, whose patterns sit in the middle position.
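The code assignment itself is mechanical: the 47 sounds take the first 47 of the 64 possible 6-bit codes in the order listed. A sketch:

```python
# 23 initials followed by 24 finals, in the order given in the text
PHONEMES = ("b p m f d t n l g k h j q x zh ch sh r z c s y w "
            "a o e i u ü ai ei ui ao ou iu ie üe er an en in un ün "
            "ang eng ing ong").split()

# First 47 of the 64 six-bit binary codes, assigned in order
CODES = {ph: format(i, "06b") for i, ph in enumerate(PHONEMES)}

assert len(PHONEMES) == 47
assert CODES["b"] == "000000" and CODES["p"] == "000001"
```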
Eight. Image synthesis
During image synthesis, the width, length, main-color and pattern information are fused into one image and shown on the display screen.
Specifically, the width and length information first determine the image size; the main color is then filled in at the image position; finally the pattern information replaces the main color at the corresponding positions, yielding the image of the pronunciation.
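A sketch of the composition step. The real texture patterns exist only as figures, so a horizontal white stripe stands in for the pattern overlay; the stripe position and size are illustrative.

```python
import numpy as np

def synthesize(width: int, length: int, main_color,
               pattern_color=(1.0, 1.0, 1.0)) -> np.ndarray:
    """Fill a length x width canvas with the main color, then overwrite a
    band with the pattern (placeholder for the patent's texture figures)."""
    img = np.empty((length, width, 3), dtype=float)
    img[:] = main_color                             # main-color fill
    img[: max(1, length // 8), :] = pattern_color   # pattern placeholder
    return img                                      # show with e.g. plt.imshow
```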
The entire process above is carried out by the computer.
Examples of image synthesis:
1. As shown in Fig. 15, the main color of the image of the initial y is blue, with a white texture pattern at the top.
2. As shown in Fig. 16, in the two-phone syllable "yu" an initial and a final are spelled into one sound; in pinyin the initial comes first and the final after, and the final u at the end is pronounced both heavy and long.
3. As shown in Fig. 17, in Mandarin the three-phone syllable "yuan" has a sound between the initial and the final that serves as its medial; here the vowel u serves as the medial and is weakened, becoming short and light, while the final an at the end is pronounced both heavy and long.
4. As shown in Fig. 18, although y is very similar in pronunciation to i, and w to u, their pronunciations differ: the initials y and w are pronounced short and light while the finals i and u are pronounced long and heavy, so the two are easy to distinguish in the images, whereas their spectrograms are very similar and very hard to tell apart.

Claims (7)

1. A method for visualizing Chinese initials and finals based on combined features, characterized in that it comprises:
1.1 Speech signal pre-processing
A speech signal is input through a microphone and sampled and quantized by a processing unit to obtain the corresponding speech data, which then undergo pre-emphasis, framing, windowing and endpoint detection;
1.2 Feature extraction
(a) computing the frame count of the pre-processed speech signal as its duration feature;
(b) representing the resonance-strength feature by the relative size of the frequency-domain peak amplitude with respect to the average amplitude: for the framed speech signal, the resonance intensity of each frame is given by a formula that appears only as an image in the original, in which (using notation introduced here) the complex coefficient X(k) is the k-th harmonic component transformed to the frequency domain, N is the number of harmonics of the frame, X is the frequency-domain transform of the frame, E[·] denotes averaging, and a parameter β is adjusted according to the type of speech being recognized within a stated range;
(c) estimating the formant features of the pre-processed speech signal by a method based on the Hilbert-Huang transform, obtaining the formant feature values F1, F2, F3 of each frame;
(d) computing the robust speech feature parameters based on the wavelet packet transform, WPTC: WPTC1~WPTC20;
(e) computing the robust feature parameters based on MUSIC and perceptual characteristics, PMUSIC-MFCC: PMUSIC-MFCC1~PMUSIC-MFCC12;
1.3 Width information coding
The image width information is encoded with the duration feature: according to the pixel size of the display area, the duration feature is converted into image width information by a linear transformation;
1.4 Length information coding
The image length information is encoded with the resonance-strength feature: according to the pixel size of the display area, the per-frame mean of the resonance-strength feature is converted into image length information by a linear transformation;
1.5 Main color coding
The main color information is encoded with the formant features: the formant feature values F1, F2, F3 are each averaged over all frames and then converted into main color information by R=5F1/F3, G=3F3/(5F2), B=F2/(3F1);
1.6 Neural network design
The neural network is a three-layer BP neural network with 32 input-layer neurons and 6 output-layer neurons;
1.7 Pattern information coding
The 32 combined features WPTC1~WPTC20 and PMUSIC-MFCC1~PMUSIC-MFCC12 serve as the input of the neural network, and its output is the corresponding pattern information; the 6 output-layer neurons all use binary coding, giving 64 distinct codes, of which only the first 47 are used, corresponding in turn to the 23 initials b, p, m, f, d, t, n, l, g, k, h, j, q, x, zh, ch, sh, r, z, c, s, y, w and the 24 finals a, o, e, i, u, ü, ai, ei, ui, ao, ou, iu, ie, üe, er, an, en, in, un, ün, ang, eng, ing, ong;
1.8 Image synthesis
During image synthesis, the width, length, main-color and pattern information are fused into one image and shown on the display screen.
2. The method for visualizing Chinese initials and finals based on combined features according to claim 1, characterized in that: during image synthesis, the width and length information first determine the image size; the main color is then filled in at the image position; finally the pattern information replaces the main color at the corresponding positions, yielding the image of the pronunciation.
3. The method for visualizing Chinese initials and finals based on combined features according to claim 1, characterized in that: during the speech signal pre-processing, sampling and quantization are performed by the processing unit at a sampling frequency of 11.025kHz with 16-bit quantization precision; pre-emphasis is realized by a first-order digital pre-emphasis filter whose coefficient lies between 0.93 and 0.97; framing and windowing use a frame length of 256 samples, a Hamming window being applied to the framed data; and endpoint detection uses the short-time energy-zero product method.
4. The method for visualizing Chinese initials and finals based on combined features according to claim 1 or 2, characterized in that: image width information = duration feature × a scale factor whose symbol appears only as an image in the original; its value is chosen so that the displayed image is easiest for an observer to view and identify.
5. The method for visualizing Chinese initials and finals based on combined features according to claim 1 or 2, characterized in that: image length information = per-frame mean of the resonance-strength feature × a scale factor whose symbol appears only as an image in the original; its value is chosen so that the displayed image is easiest for an observer to view and identify.
6. The method for visualizing Chinese initials and finals based on combined features according to claim 1, characterized in that: the pattern of an initial's image is a white texture, and the pattern of a final's image is a black texture.
7. The method for visualizing Chinese initials and finals based on combined features according to claim 1, characterized in that: when the relative size of the frequency-domain peak amplitude with respect to the average amplitude is used to represent the resonance-strength feature, a frame is taken as 256 points.
CN201210252989.6A 2012-07-21 2012-07-21 Chinese initial and final visualization method based on combination feature Expired - Fee Related CN102820037B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210252989.6A CN102820037B (en) 2012-07-21 2012-07-21 Chinese initial and final visualization method based on combination feature

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210252989.6A CN102820037B (en) 2012-07-21 2012-07-21 Chinese initial and final visualization method based on combination feature

Publications (2)

Publication Number Publication Date
CN102820037A CN102820037A (en) 2012-12-12
CN102820037B true CN102820037B (en) 2014-03-12

Family

ID=47304122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210252989.6A Expired - Fee Related CN102820037B (en) 2012-07-21 2012-07-21 Chinese initial and final visualization method based on combination feature

Country Status (1)

Country Link
CN (1) CN102820037B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105825847B (en) * 2016-03-16 2019-08-16 北京语言大学 The parameter synthesis method and perception scope measurement method, device of front and back nasal sound simple or compound vowel of a Chinese syllable
CN106024010B (en) * 2016-05-19 2019-08-20 渤海大学 A kind of voice signal dynamic feature extraction method based on formant curve
CN111009234B (en) * 2019-12-25 2023-06-02 上海锦晟电子科技有限公司 Voice conversion method, device and equipment
CN111613240B (en) * 2020-05-22 2023-06-27 杭州电子科技大学 Camouflage voice detection method based on attention mechanism and Bi-LSTM

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894566A (en) * 2010-07-23 2010-11-24 北京理工大学 Visualization method of Chinese mandarin complex vowels based on formant frequency
CN102176313A (en) * 2009-10-10 2011-09-07 北京理工大学 Formant-frequency-based Mandarin single final voice visualizing method
CN102231281A (en) * 2011-07-18 2011-11-02 渤海大学 Voice visualization method based on integration characteristic and neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1166345A (en) * 1997-08-11 1999-03-09 Sega Enterp Ltd Image acoustic processor and recording medium
US7624019B2 (en) * 2005-10-17 2009-11-24 Microsoft Corporation Raising the visibility of a voice-activated user interface

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102176313A (en) * 2009-10-10 2011-09-07 北京理工大学 Formant-frequency-based Mandarin single final voice visualizing method
CN101894566A (en) * 2010-07-23 2010-11-24 北京理工大学 Visualization method of Chinese mandarin complex vowels based on formant frequency
CN102231281A (en) * 2011-07-18 2011-11-02 渤海大学 Voice visualization method based on integration characteristic and neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JP H11-66345 A 1999.03.09
韩志艳 et al., "Optimization of speech recognition feature parameters based on orthogonal experimental design", 计算机科学 (Computer Science), Vol. 37, No. 1, Jan. 2010, pp. 214-216, 250 *

Also Published As

Publication number Publication date
CN102820037A (en) 2012-12-12

Similar Documents

Publication Publication Date Title
CN105976809B (en) Identification method and system based on speech and facial expression bimodal emotion fusion
CN102231281B (en) Voice visualization method based on integration characteristic and neural network
CN105023573B (en) It is detected using speech syllable/vowel/phone boundary of auditory attention clue
CN101916566B (en) Electronic larynx speech reconstructing method and system thereof
US20200178883A1 (en) Method and system for articulation evaluation by fusing acoustic features and articulatory movement features
CN110619301A (en) Emotion automatic identification method based on bimodal signals
CN103996155A (en) Intelligent interaction and psychological comfort robot service system
CN102820037B (en) Chinese initial and final visualization method based on combination feature
CN105788608B (en) Chinese phonetic mother method for visualizing neural network based
CN105448291A (en) Parkinsonism detection method and detection system based on voice
CN111326178A (en) Multi-mode speech emotion recognition system and method based on convolutional neural network
WO2022048404A1 (en) End-to-end virtual object animation generation method and apparatus, storage medium, and terminal
CN114863905A (en) Voice category acquisition method and device, electronic equipment and storage medium
CN101894566A (en) Visualization method of Chinese mandarin complex vowels based on formant frequency
CN113349801A (en) Imaginary speech electroencephalogram signal decoding method based on convolutional neural network
CN116434759B (en) Speaker identification method based on SRS-CL network
CN114626424B (en) Data enhancement-based silent speech recognition method and device
CN108831472B (en) Artificial intelligent sounding system and sounding method based on lip language recognition
CN111009262A (en) Voice gender identification method and system
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN115472182A (en) Attention feature fusion-based voice emotion recognition method and device of multi-channel self-encoder
CN114550701A (en) Deep neural network-based Chinese electronic larynx voice conversion device and method
JP4381404B2 (en) Speech synthesis system, speech synthesis method, speech synthesis program
CN115410061B (en) Image-text emotion analysis system based on natural language processing
CN116172580B (en) Auditory attention object decoding method suitable for multi-sound source scene

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140312

Termination date: 20140721

EXPY Termination of patent right or utility model