CN102820037B - Chinese initial and final visualization method based on combination feature

Chinese initial and final visualization method based on combination feature

Info

Publication number
CN102820037B
CN102820037B CN201210252989.6A
Authority
CN
China
Prior art keywords
information
image
frame
feature
voice signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210252989.6A
Other languages
Chinese (zh)
Other versions
CN102820037A (en)
Inventor
韩志艳
伦淑娴
王健
于忠党
郭艳东
尹作友
郭兆正
王巍
韩建群
苏宪利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bohai University
Original Assignee
Bohai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bohai University filed Critical Bohai University
Priority to CN201210252989.6A priority Critical patent/CN102820037B/en
Publication of CN102820037A publication Critical patent/CN102820037A/en
Application granted granted Critical
Publication of CN102820037B publication Critical patent/CN102820037B/en

Abstract

The invention relates to a method for visualizing Chinese initials and finals based on combined features, comprising the steps of: pre-processing a speech signal; computing the frame count of the pre-processed signal as a duration feature, representing a resonance-strength feature by the relative size of the frequency-domain peak amplitude with respect to the average amplitude, obtaining the formant feature values of each frame, and computing the robust feature parameters WPTC1~WPTC20 and PMUSIC-MFCC1~PMUSIC-MFCC12; encoding the image width information and the image length information from the duration feature and the resonance-strength feature respectively; encoding the main color information from the formant features; feeding the 32 feature parameters to a neural network whose output is the corresponding pattern information, the outputs corresponding in turn to 23 initials and 24 finals; and fusing the width, length, main-color and pattern information into one image displayed on a screen. The method helps deaf-mutes undertaking speech training to establish and improve auditory perception and to form correct speech reflexes, so as to recover their speech function.

Description

Method for visualizing Chinese initials and finals based on combined features
Technical field
The present invention relates to a method for visualizing the initials and finals of Chinese speech, and in particular to a method for visualizing Chinese initials and finals based on combined features.
Background art
Speech is the acoustic expression of language: it is the most natural, effective and convenient means of human communication, and also a support of human thought. For deaf-mutes, however, communication is difficult; some deaf-mutes are mute because their hearing organs are damaged, so that speech information cannot reach the brain. Research shows that the human auditory and visual systems are two complementary information systems of different kinds. The visual system is a highly parallel information-receiving and processing system: the millions of cone cells on the retina of the human eye are connected to the brain by nerve fibres, forming a highly parallel channel whose information-receiving rate is very high. By measurement and estimation, the information rate when watching television can roughly reach a value (given only as an image in the original) thousands of times the rate at which the auditory system receives speech; hence it is believed that about 70% of the information humans acquire is obtained through vision. For deaf people this is therefore undoubtedly a great aid: the defect of hearing can be compensated by vision, so that speech can not only be heard but also be "seen" by deaf-mutes in various other forms.
In 1947, R.K.Potter and G.A.Kopp proposed a visualization method, the sound spectrograph. Various speech researchers subsequently studied and improved this visualization method, such as the color spectrograph proposed by L.C.Stewart et al. in 1976 for training deaf people and the real-time spectrograph system proposed by G.M.Kuhn et al. in 1984; P.E.Stern in 1986, F.Plante et al. in 1998 and R.Steinberg in 2008 also proposed many improvements to the spectrograph. However, the displayed spectrograms are highly technical and hard to distinguish and memorize. In particular, the same utterance spoken by different people, or even by the same person, may produce different spectrograms, and robustness is worse still for speech recorded in different environments.
In addition, some scholars have visualized speech through the movements of the vocal organs and the changes of facial expression, which effectively dissects the human phonation process; but in terms of speech intelligibility the effect is still far from ideal, since apart from a few experts, people can hardly perceive speech accurately and directly by observing the motion of the vocal organs and the changes of facial expression.
Summary of the invention
The technical problem to be solved by this invention is to provide a speech visualization method based on combined features that is simple, easy to memorize and highly robust. The method can help deaf-mutes carry out speech training, establish and improve auditory perception, form correct speech reflexes and rebuild the auditory speech chain, so as to recover their own speech function as far as possible.
The technical solution of the present invention is as follows:
A method for visualizing Chinese initials and finals based on combined features comprises the following steps:
1. Speech signal pre-processing
A speech signal is input through a microphone and sampled and quantized by a processing unit to obtain the corresponding speech data, which then undergo pre-emphasis, framing, windowing and endpoint detection;
2. Feature extraction
(2.1) Compute the frame count of the pre-processed speech signal as its duration feature;
(2.2) Represent the resonance-strength feature by the relative size of the frequency-domain peak amplitude with respect to the average amplitude. For the framed speech signal, the resonance intensity of each frame is given by a formula that appears only as an image in the original; in it (using notation introduced here, since the original symbols are likewise images) the complex coefficient X(k) is the k-th harmonic component transformed to the frequency domain, N is the number of harmonics of the frame, X is the frequency-domain transform of the frame, E[·] denotes averaging, and a parameter β is adjusted according to the type of speech being recognized within a stated range;
(2.3) Estimate the formant features of the pre-processed speech signal by a method based on the Hilbert-Huang transform, obtaining the formant feature values F1, F2, F3 of each frame;
(2.4) Compute the robust speech feature parameters based on the wavelet packet transform (WPTC): WPTC1~WPTC20;
(2.5) Compute the robust feature parameters based on MUSIC and perceptual characteristics (PMUSIC-MFCC): PMUSIC-MFCC1~PMUSIC-MFCC12;
3. Width information coding
Encode the image width information with the duration feature: according to the pixel size of the display area, convert the duration feature into image width information by a linear transformation;
4. Length information coding
Encode the image length information with the resonance-strength feature: according to the pixel size of the display area, convert the per-frame mean of the resonance-strength feature into image length information by a linear transformation;
5. Main color coding
Encode the main color information with the formant features: average the formant feature values F1, F2, F3 respectively over all frames, then convert them into main color information by R=5F1/F3, G=3F3/(5F2), B=F2/(3F1);
6. Neural network design
The neural network is a three-layer BP neural network with 32 input-layer neurons and 6 output-layer neurons;
7. Pattern information coding
The 32 combined features WPTC1~WPTC20 and PMUSIC-MFCC1~PMUSIC-MFCC12 serve as the input of the neural network, and its output is the corresponding pattern information; the 6 output-layer neurons all use binary coding, giving 64 distinct codes, of which only the first 47 are used, corresponding in turn to the 23 initials b, p, m, f, d, t, n, l, g, k, h, j, q, x, zh, ch, sh, r, z, c, s, y, w and the 24 finals a, o, e, i, u, ü, ai, ei, ui, ao, ou, iu, ie, üe, er, an, en, in, un, ün, ang, eng, ing, ong;
8. Image synthesis
During image synthesis, the width, length, main-color and pattern information are fused into one image and shown on the display screen.
Specifically, during image synthesis the width and length information first determine the image size; the main color is then filled in at the image position; finally the pattern information replaces the main color at the corresponding positions, yielding the image of the pronunciation.
During the speech signal pre-processing, sampling and quantization are performed by the processing unit at a sampling frequency of 11.025kHz with 16-bit quantization precision; pre-emphasis is realized by a first-order digital pre-emphasis filter whose coefficient lies between 0.93 and 0.97; framing and windowing use a frame length of 256 samples, a Hamming window being applied to the framed data; and endpoint detection uses the short-time energy-zero product method.
The image width information = duration feature × a scale factor whose symbol appears only as an image in the original; its value is chosen so that the displayed image is easiest for an observer to view and identify.
The image length information = per-frame mean of the resonance-strength feature × a scale factor whose symbol appears only as an image in the original; its value is chosen so that the displayed image is easiest for an observer to view and identify.
The pattern of an initial's image is a white texture, and the pattern of a final's image is a black texture.
When the relative size of the frequency-domain peak amplitude with respect to the average amplitude is used to represent the resonance-strength feature, a frame is taken as 256 points.
The beneficial effects of the present invention are as follows:
(1) By combining different speech features into one image, the invention creates a readable pattern of the speech signal for deaf-mutes. Compared with the prior art it has good robustness and understandability, making up for the difficulty of distinguishing and memorizing visualizations based on the sound spectrograph. Whether hearing-impaired or not, a person can, after a period of specialized training, intuitively recognize the pronunciation corresponding to a visual image and communicate with able-bodied people.
(2) The invention fully exploits deaf-mutes' visual discrimination ability and their stronger visual memory for color stimuli; the image colors of different initials and finals differ, which greatly increases deaf-mutes' interest in learning.
(3) The invention trains and models the neural network with a dynamic training set, avoiding the excessive training load caused by blindly assembling a training set and effectively improving the correct coding rate of the pattern information.
(4) The invention lays out the information according to the pronunciation rules of the initials and finals, greatly reducing the memory burden on deaf-mutes.
Brief description of the drawings
Fig. 1 is the system architecture diagram of the present invention;
Fig. 2 is the main color coding block diagram;
Fig. 3 is the structural diagram of the neural network in Fig. 1;
Fig. 4 is a schematic diagram of the pattern-information coding of the compound finals;
Fig. 5 is a schematic diagram of the pattern-information coding of the front nasal finals (an en in un ün);
Fig. 6 is a schematic diagram of the pattern-information coding of the back nasal finals (ang eng ing ong);
Fig. 7 is a schematic diagram of the pattern-information coding of the bilabials (b p m);
Fig. 8 is a schematic diagram of the pattern-information coding of the labiodental (f);
Fig. 9 is a schematic diagram of the pattern-information coding of the dentals (z c s);
Fig. 10 is a schematic diagram of the pattern-information coding of the blade-alveolars (d t n l);
Fig. 11 is a schematic diagram of the pattern-information coding of the blade-palatals (zh ch sh r);
Fig. 12 is a schematic diagram of the pattern-information coding of the dorsals (j q x);
Fig. 13 is a schematic diagram of the pattern-information coding of the velars (g k h);
Fig. 14 is a schematic diagram of the pattern-information coding of the initials (y w);
Fig. 15 is an example of the speech visualization effect of the initial "y";
Fig. 16 is an example of the speech visualization effect of the two-phone syllable "y+u";
Fig. 17 is an example of the speech visualization effect of the three-phone syllable "y+u+an";
Fig. 18 is an example of the speech visualization effect of the initials "y, w" and the finals "i, u".
Detailed description of the embodiments
The technical solution of the invention is elaborated below with reference to the drawings and an embodiment:
As shown in Fig. 1, the method comprises a speech-signal pre-processing module, a feature-extraction module, a width-information coding module, a length-information coding module, a main-color coding module, a neural-network design module, a pattern-information coding module and an image-synthesis module, as follows:
One. Speech signal pre-processing
The processing unit performs sampling and quantization at a sampling frequency of 11.025kHz with 16-bit quantization precision to obtain the corresponding speech data; pre-emphasis is then realized with a first-order digital pre-emphasis filter whose coefficient lies between 0.93 and 0.97, 0.9375 in this example. Next the data are framed with a frame length of 256 samples and a Hamming window is applied to each frame, after which endpoint detection is carried out with the short-time energy-zero product method. The processing unit may be a computer, a single-chip microcomputer, a DSP chip, etc.; this example uses a computer.
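For concreteness, the pre-processing chain can be sketched as follows. This is a minimal Python illustration (the patent prescribes no code): endpoint detection is omitted, the input is assumed to be at least one frame long, and the function name and the 80-sample frame shift (taken from the feature-extraction step below) are illustrative.

```python
import numpy as np

def preprocess(signal: np.ndarray, alpha: float = 0.9375,
               frame_len: int = 256, frame_shift: int = 80) -> np.ndarray:
    """Pre-emphasize, frame and window a sampled speech signal."""
    # First-order digital pre-emphasis filter: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Split into overlapping frames of frame_len samples
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([emphasized[i * frame_shift:i * frame_shift + frame_len]
                       for i in range(n_frames)])
    # Apply a Hamming window to every frame (endpoint detection omitted here)
    return frames * np.hamming(frame_len)
```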
Two. Feature extraction
1. Duration feature
The frame count of the pre-processed speech signal is computed as its duration feature, with 256 sampling points per frame and a frame shift of 80 sampling points.
2. Resonance-strength feature
The resonance-strength feature is represented by the relative size of the frequency-domain peak amplitude with respect to the average amplitude.
The harmonic model of speech is widely used in speech analysis and synthesis; its core is a sinusoidal representation of the speech signal. For the framed signal, assuming the harmonic characteristics change little within one short frame, the resonance intensity of each frame is given by a formula that appears only as an image in the original. In it (using notation introduced here, since the original symbols are likewise images) the complex coefficient X(k) is the k-th harmonic component transformed to the frequency domain, N is the number of harmonics of the frame, X is the frequency-domain transform of the frame, E[·] denotes averaging, and the parameter β is adjusted according to the type of speech being recognized over the range 2~8; the value taken in this example also appears only as an image.
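Since the defining formula survives only as an image, any implementation is necessarily a reading of the surrounding description. A minimal sketch, assuming the intensity is the ratio of the mean of the β largest frequency-domain magnitudes to the mean magnitude of the whole frame (β in 2~8 as stated); the exact role of β in the original formula is not recoverable:

```python
import numpy as np

def resonance_intensity(frame: np.ndarray, beta: int = 4) -> float:
    """Peak-to-average amplitude ratio of one windowed frame (one plausible
    reading of the patent's formula, which is given only as an image)."""
    mag = np.abs(np.fft.rfft(frame))      # |X(k)|: harmonic magnitudes
    peak = np.sort(mag)[-beta:].mean()    # mean of the beta largest peaks (assumed)
    return float(peak / mag.mean())       # relative to the average amplitude
```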
3. Formant feature
The formant frequency features of the pre-processed speech are estimated by a method based on the Hilbert-Huang transform, yielding the formant feature values F1, F2, F3 of each frame.
Specifically, the formant frequencies of the speech signal preliminarily estimated by the fast Fourier transform (FFT) determine the parameters of corresponding band-pass filters, with which the speech signal is filtered; the filtered signal is subjected to empirical mode decomposition (EMD) to obtain a family of intrinsic mode functions (IMFs); the IMF containing the formant frequency is selected by the maximum-energy principle; and the instantaneous frequency and Hilbert spectrum of that IMF give the formant frequency parameters of the speech signal.
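A sketch of one band's formant estimate along these lines. To stay short it replaces the EMD/IMF-selection stage with a plain Hilbert transform of the band-passed signal, which is an assumption rather than the patented procedure; the filter order and bandwidth are likewise illustrative.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def formant_estimate(frame: np.ndarray, fs: float = 11025.0,
                     bandwidth: float = 300.0) -> float:
    """One formant from a frame: FFT peak -> band-pass -> instantaneous freq.
    (EMD/IMF selection of the patent replaced by a Hilbert transform.)"""
    frame = np.asarray(frame, dtype=float)
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    f0 = freqs[np.argmax(spec[1:]) + 1]             # preliminary FFT estimate
    lo = max(f0 - bandwidth, 1.0)
    hi = min(f0 + bandwidth, fs / 2 - 1.0)
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    band = filtfilt(b, a, frame)                    # isolate the formant band
    phase = np.unwrap(np.angle(hilbert(band)))      # instantaneous phase
    inst_freq = np.diff(phase) * fs / (2 * np.pi)   # instantaneous frequency
    return float(np.median(inst_freq))              # robust per-frame value
```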
4. Computing the WPTC parameters
Exploiting the constant-Q (quality-factor) property of the wavelet packet transform in each analysis band, which matches the way the human auditory system processes signals, the frequency band is divided at multiple levels by wavelet packets and sub-bands are selected adaptively according to the auditory perception bands, yielding the robust speech feature parameters based on the wavelet packet transform (WPTC): WPTC1~WPTC20.
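The patent does not spell out the adaptive band selection here, so the sketch below assumes a fixed five-level wavelet-packet decomposition (32 frequency-ordered sub-bands) whose first 20 log-energies stand in for WPTC1~WPTC20; the db4 wavelet is an arbitrary choice.

```python
import numpy as np
import pywt

def wptc(frame, wavelet: str = "db4", level: int = 5) -> np.ndarray:
    """Log sub-band energies of a wavelet-packet decomposition (assumed
    stand-in for WPTC1..WPTC20; the patent's band selection is adaptive)."""
    frame = np.asarray(frame, dtype=float)
    wp = pywt.WaveletPacket(frame, wavelet=wavelet, maxlevel=level)
    # Frequency-ordered terminal nodes of the level-5 decomposition
    bands = [node.data for node in wp.get_level(level, order="freq")]
    energies = np.array([np.sum(b ** 2) + 1e-12 for b in bands])
    return np.log(energies[:20])
```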
5. Computing the PMUSIC-MFCC parameters
To improve the robustness of the speech visualization, the Multiple Signal Classification (MUSIC) spectrum-estimation technique is adopted and perceptual characteristics are introduced into it, yielding the robust feature parameters based on MUSIC and perceptual characteristics (PMUSIC-MFCC): PMUSIC-MFCC1~PMUSIC-MFCC12.
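The construction suggests a MUSIC pseudospectrum feeding an MFCC-style mel/log/DCT pipeline. A sketch under that assumption; the correlation-matrix size, model order, filterbank shape and frequency grid are all illustrative choices, not values fixed by the patent.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from scipy.fft import dct
from scipy.linalg import eigh

def pmusic_mfcc(frame, fs=11025.0, order=16, m=24, n_grid=256, n_ceps=12):
    """MUSIC pseudospectrum -> mel filterbank -> log -> DCT (assumed pipeline)."""
    frame = np.asarray(frame, dtype=float)
    n = 64                                    # correlation matrix size (assumed)
    x = sliding_window_view(frame, n)
    r = x.T @ x / len(x)                      # sample autocorrelation matrix
    w, v = eigh(r)                            # eigenvalues in ascending order
    noise = v[:, : n - order]                 # noise subspace
    freqs = np.linspace(0, fs / 2, n_grid)
    a = np.exp(-2j * np.pi * np.outer(np.arange(n), freqs) / fs)
    # MUSIC pseudospectrum: 1 / ||E_n^H a(f)||^2
    p = 1.0 / np.maximum(np.sum(np.abs(noise.conj().T @ a) ** 2, axis=0), 1e-12)
    # Uniform-in-mel triangular filterbank applied to the pseudospectrum
    mel = 2595 * np.log10(1 + freqs / 700)
    centers = np.linspace(mel[0], mel[-1], m + 2)
    fbank = np.maximum(0, 1 - np.abs(mel[None, :] - centers[1:-1, None])
                       / (centers[1] - centers[0]))
    return dct(np.log(fbank @ p + 1e-12), norm="ortho")[:n_ceps]
```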
Three. Width information coding
The image width information is encoded with the duration feature: according to the pixel size of the display area, the duration feature is converted into image width information by a linear transformation, i.e. image width information = duration feature × scale factor. The factor is chosen so that the displayed image is easiest for an observer to view and identify; this example uses a display area of 300 × 300 pixels and takes the factor as 6. For instance, the duration information of the initial y becomes a width of 15 pixels after the linear operation.
Four. Length information coding
The image length information is encoded with the resonance-strength feature: according to the pixel size of the display area, the per-frame mean of the resonance-strength feature is converted into image length information by a linear transformation, i.e. image length information = mean per-frame resonance strength × scale factor. The factor is again chosen so that the displayed image is easiest to view and identify, and is taken as 180 in this example. For instance, the resonance-strength information of the initial y becomes a length of 150 pixels after the linear operation.
Five. Main color coding
As shown in Fig. 2, the main color information is mapped from the formant features: the formant feature values F1, F2, F3 are each averaged over all frames and then converted into main color information by the formulas R=5F1/F3, G=3F3/(5F2), B=F2/(3F1). The coefficients 5, 3/5 and 1/3 were verified experimentally to give good color discrimination; they were selected so that most pronunciations differ in color, which helps deaf-mutes recognize and memorize them. The main color is obtained by assigning the RGB values of the corresponding screen positions: full amplitude 1 on all three primaries gives white, amplitude 0 on all three gives black, and each primary contributes to the color by the additive-color rule.
For example, the three formant means of the initial "b" are F1=538.97Hz, F2=1059.73Hz and F3=2841.58Hz, giving R=0.9484, G=1.6089 and B=0.6554, so the main color of the produced image is light yellow.
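The color mapping can be checked directly against the worked "b" example. A sketch that clips each channel to [0, 1] before display; the clipping is an assumption (the text states only that full amplitude 1 gives white and 0 gives black):

```python
import numpy as np

def main_color(f1: float, f2: float, f3: float) -> np.ndarray:
    """RGB triple from the three formant means (clipping to [0,1] assumed)."""
    r = 5 * f1 / f3
    g = 3 * f3 / (5 * f2)
    b = f2 / (3 * f1)
    return np.clip([r, g, b], 0.0, 1.0)

# The "b" example: R=0.9484, G=1.6089 (clipped to 1), B=0.6554 -> light yellow
print(main_color(538.97, 1059.73, 2841.58))
```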
Six. Neural network design
As shown in Fig. 3, the neural network is a three-layer BP neural network with 32 input-layer neurons and 6 output-layer neurons. It is trained and modelled with a dynamic training set: recognition starts from an existing, relatively small sample set; whenever an actual pronunciation is misrecognized, that speech sample is added to the corresponding training set, while correctly recognized samples are discarded, so that the training set grows ever richer under the actual operating conditions. Once the error probability has fallen to a sufficiently low level, the neural network model for this practical application is obtained.
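A sketch of the 32-input network and the dynamic-training-set loop, using scikit-learn's MLPClassifier as a stand-in for the BP network. The hidden-layer size, the training hyperparameters, full retraining on every misrecognized sample, and the use of class indices in place of the 6-bit output codes are all assumptions; the patent fixes only the input/output dimensions and the add-on-error rule.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def dynamic_training(X_train, y_train, stream):
    """Grow the training set from live pronunciations: misrecognized samples
    are added and the network retrained; correct ones are discarded."""
    net = MLPClassifier(hidden_layer_sizes=(24,), activation="logistic",
                        solver="sgd", max_iter=2000)  # BP-network stand-in
    net.fit(X_train, y_train)                         # initial small sample set
    for x, y in stream:                               # (32 features, true label)
        if net.predict(x.reshape(1, -1))[0] != y:     # misrecognized:
            X_train = np.vstack([X_train, x])         # add sample to the set
            y_train = np.append(y_train, y)
            net.fit(X_train, y_train)                 # retrain on enlarged set
    return net
```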
Seven. Pattern information coding
As shown in Figs. 4-14, the 32 combined features WPTC1~WPTC20 and PMUSIC-MFCC1~PMUSIC-MFCC12 serve as the input of the neural network, and its output is the corresponding pattern information. The 6 output-layer neurons all use binary coding, giving 64 distinct codes, of which only the first 47 are used, corresponding in turn to the 23 initials and 24 finals: 000000 represents the initial b, 000001 the initial p, and so on. The pattern of every initial's image is a white texture and that of every final's image a black texture; different texture patterns are displayed by changing the saturation of the RGB primaries at the corresponding positions, and a, o, e, i, u, ü carry no pattern. Sounds sharing the same pattern in the figures are very similar in pronunciation: the first three, b, p, m, are bilabials, whose patterns all have a white textured stripe at top and bottom, while d, t, n, l are blade-alveolars, whose patterns sit in the middle position.
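The code assignment itself is mechanical: the 47 sounds take the first 47 of the 64 possible 6-bit codes in the order listed. A sketch:

```python
# 23 initials followed by 24 finals, in the order given in the text
PHONEMES = ("b p m f d t n l g k h j q x zh ch sh r z c s y w "
            "a o e i u ü ai ei ui ao ou iu ie üe er an en in un ün "
            "ang eng ing ong").split()

# First 47 of the 64 six-bit binary codes, assigned in order
CODES = {ph: format(i, "06b") for i, ph in enumerate(PHONEMES)}

assert len(PHONEMES) == 47
assert CODES["b"] == "000000" and CODES["p"] == "000001"
```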
Eight. Image synthesis
During image synthesis, the width, length, main-color and pattern information are fused into one image and shown on the display screen.
Specifically, the width and length information first determine the image size; the main color is then filled in at the image position; finally the pattern information replaces the main color at the corresponding positions, yielding the image of the pronunciation.
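A sketch of the composition step. The real texture patterns exist only as figures, so a horizontal white stripe stands in for the pattern overlay; the stripe position and size are illustrative.

```python
import numpy as np

def synthesize(width: int, length: int, main_color,
               pattern_color=(1.0, 1.0, 1.0)) -> np.ndarray:
    """Fill a length x width canvas with the main color, then overwrite a
    band with the pattern (placeholder for the patent's texture figures)."""
    img = np.empty((length, width, 3), dtype=float)
    img[:] = main_color                             # main-color fill
    img[: max(1, length // 8), :] = pattern_color   # pattern placeholder
    return img                                      # show with e.g. plt.imshow
```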
The entire process above is carried out by the computer.
Examples of image synthesis:
1. As shown in Fig. 15, the main color of the image of the initial y is blue, with a white texture pattern at the top.
2. As shown in Fig. 16, in the two-phone syllable "yu" an initial and a final are spelled into one sound; in pinyin the initial comes first and the final after, and the final u at the end is pronounced both heavy and long.
3. As shown in Fig. 17, in Mandarin the three-phone syllable "yuan" has a sound between the initial and the final that serves as its medial; here the vowel u serves as the medial and is weakened, becoming short and light, while the final an at the end is pronounced both heavy and long.
4. As shown in Fig. 18, although y is very similar in pronunciation to i, and w to u, their pronunciations differ: the initials y and w are pronounced short and light while the finals i and u are pronounced long and heavy, so the two are easy to distinguish in the images, whereas their spectrograms are very similar and very hard to tell apart.

Claims (7)

1. A method for visualizing Chinese initials and finals based on combined features, characterized in that it comprises:
1.1 Speech signal pre-processing
A speech signal is input through a microphone and sampled and quantized by a processing unit to obtain the corresponding speech data, which then undergo pre-emphasis, framing, windowing and endpoint detection;
1.2 Feature extraction
(a) computing the frame count of the pre-processed speech signal as its duration feature;
(b) representing the resonance-strength feature by the relative size of the frequency-domain peak amplitude with respect to the average amplitude: for the framed speech signal, the resonance intensity of each frame is given by a formula that appears only as an image in the original, in which (using notation introduced here) the complex coefficient X(k) is the k-th harmonic component transformed to the frequency domain, N is the number of harmonics of the frame, X is the frequency-domain transform of the frame, E[·] denotes averaging, and a parameter β is adjusted according to the type of speech being recognized within a stated range;
(c) estimating the formant features of the pre-processed speech signal by a method based on the Hilbert-Huang transform, obtaining the formant feature values F1, F2, F3 of each frame;
(d) computing the robust speech feature parameters based on the wavelet packet transform, WPTC: WPTC1~WPTC20;
(e) computing the robust feature parameters based on MUSIC and perceptual characteristics, PMUSIC-MFCC: PMUSIC-MFCC1~PMUSIC-MFCC12;
1.3 Width information coding
The image width information is encoded with the duration feature: according to the pixel size of the display area, the duration feature is converted into image width information by a linear transformation;
1.4 Length information coding
The image length information is encoded with the resonance-strength feature: according to the pixel size of the display area, the per-frame mean of the resonance-strength feature is converted into image length information by a linear transformation;
1.5 Main color coding
The main color information is encoded with the formant features: the formant feature values F1, F2, F3 are each averaged over all frames and then converted into main color information by R=5F1/F3, G=3F3/(5F2), B=F2/(3F1);
1.6 Neural network design
The neural network is a three-layer BP neural network with 32 input-layer neurons and 6 output-layer neurons;
1.7 Pattern information coding
The 32 combined features WPTC1~WPTC20 and PMUSIC-MFCC1~PMUSIC-MFCC12 serve as the input of the neural network, and its output is the corresponding pattern information; the 6 output-layer neurons all use binary coding, giving 64 distinct codes, of which only the first 47 are used, corresponding in turn to the 23 initials b, p, m, f, d, t, n, l, g, k, h, j, q, x, zh, ch, sh, r, z, c, s, y, w and the 24 finals a, o, e, i, u, ü, ai, ei, ui, ao, ou, iu, ie, üe, er, an, en, in, un, ün, ang, eng, ing, ong;
1.8 Image synthesis
During image synthesis, the width, length, main-color and pattern information are fused into one image and shown on the display screen.
2. The method for visualizing Chinese initials and finals based on combined features according to claim 1, characterized in that: during image synthesis, the width and length information first determine the image size; the main color is then filled in at the image position; finally the pattern information replaces the main color at the corresponding positions, yielding the image of the pronunciation.
3. The method for visualizing Chinese initials and finals based on combined features according to claim 1, characterized in that: during the speech signal pre-processing, sampling and quantization are performed by the processing unit at a sampling frequency of 11.025kHz with 16-bit quantization precision; pre-emphasis is realized by a first-order digital pre-emphasis filter whose coefficient lies between 0.93 and 0.97; framing and windowing use a frame length of 256 samples, a Hamming window being applied to the framed data; and endpoint detection uses the short-time energy-zero product method.
4. The method for visualizing Chinese initials and finals based on combined features according to claim 1 or 2, characterized in that: image width information = duration feature × a scale factor whose symbol appears only as an image in the original; its value is chosen so that the displayed image is easiest for an observer to view and identify.
5. The method for visualizing Chinese initials and finals based on combined features according to claim 1 or 2, characterized in that: image length information = per-frame mean of the resonance-strength feature × a scale factor whose symbol appears only as an image in the original; its value is chosen so that the displayed image is easiest for an observer to view and identify.
6. The method for visualizing Chinese initials and finals based on combined features according to claim 1, characterized in that: the pattern of an initial's image is a white texture, and the pattern of a final's image is a black texture.
7. The method for visualizing Chinese initials and finals based on combined features according to claim 1, characterized in that: when the relative size of the frequency-domain peak amplitude with respect to the average amplitude is used to represent the resonance-strength feature, a frame is taken as 256 points.
CN201210252989.6A 2012-07-21 2012-07-21 Chinese initial and final visualization method based on combination feature Expired - Fee Related CN102820037B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210252989.6A CN102820037B (en) 2012-07-21 2012-07-21 Chinese initial and final visualization method based on combination feature

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210252989.6A CN102820037B (en) 2012-07-21 2012-07-21 Chinese initial and final visualization method based on combination feature

Publications (2)

Publication Number Publication Date
CN102820037A CN102820037A (en) 2012-12-12
CN102820037B true CN102820037B (en) 2014-03-12

Family

ID=47304122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210252989.6A Expired - Fee Related CN102820037B (en) 2012-07-21 2012-07-21 Chinese initial and final visualization method based on combination feature

Country Status (1)

Country Link
CN (1) CN102820037B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105825847B (en) * 2016-03-16 2019-08-16 北京语言大学 The parameter synthesis method and perception scope measurement method, device of front and back nasal sound simple or compound vowel of a Chinese syllable
CN106024010B (en) * 2016-05-19 2019-08-20 渤海大学 A kind of voice signal dynamic feature extraction method based on formant curve
CN111009234B (en) * 2019-12-25 2023-06-02 上海锦晟电子科技有限公司 Voice conversion method, device and equipment
CN111613240B (en) * 2020-05-22 2023-06-27 杭州电子科技大学 Camouflage voice detection method based on attention mechanism and Bi-LSTM

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894566A (en) * 2010-07-23 2010-11-24 北京理工大学 Visualization method of Chinese mandarin complex vowels based on formant frequency
CN102176313A (en) * 2009-10-10 2011-09-07 北京理工大学 Formant-frequency-based Mandarin single final voice visualizing method
CN102231281A (en) * 2011-07-18 2011-11-02 渤海大学 Voice visualization method based on integration characteristic and neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1166345A (en) * 1997-08-11 1999-03-09 Sega Enterp Ltd Image acoustic processor and recording medium
US7624019B2 (en) * 2005-10-17 2009-11-24 Microsoft Corporation Raising the visibility of a voice-activated user interface

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102176313A (en) * 2009-10-10 2011-09-07 北京理工大学 Formant-frequency-based Mandarin single final voice visualizing method
CN101894566A (en) * 2010-07-23 2010-11-24 北京理工大学 Visualization method of Chinese mandarin complex vowels based on formant frequency
CN102231281A (en) * 2011-07-18 2011-11-02 渤海大学 Voice visualization method based on integration characteristic and neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JP H11-66345 A 1999.03.09
韩志艳 et al., "Optimization of speech recognition feature parameters based on orthogonal experimental design", 计算机科学 (Computer Science), Vol. 37, No. 1, Jan. 2010, pp. 214-216, 250 *

Also Published As

Publication number Publication date
CN102820037A (en) 2012-12-12

Similar Documents

Publication Publication Date Title
CN105976809B (en) Identification method and system based on speech and facial expression bimodal emotion fusion
CN102231281B (en) Voice visualization method based on integration characteristic and neural network
CN105023573B (en) It is detected using speech syllable/vowel/phone boundary of auditory attention clue
CN101916566B (en) Electronic larynx speech reconstructing method and system thereof
US20200178883A1 (en) Method and system for articulation evaluation by fusing acoustic features and articulatory movement features
CN110619301A (en) Emotion automatic identification method based on bimodal signals
CN103996155A (en) Intelligent interaction and psychological comfort robot service system
CN102820037B (en) Chinese initial and final visualization method based on combination feature
CN105788608B (en) Chinese phonetic mother method for visualizing neural network based
CN105448291A (en) Parkinsonism detection method and detection system based on voice
CN111326178A (en) Multi-mode speech emotion recognition system and method based on convolutional neural network
WO2022048404A1 (en) End-to-end virtual object animation generation method and apparatus, storage medium, and terminal
CN114863905A (en) Voice category acquisition method and device, electronic equipment and storage medium
CN101894566A (en) Visualization method of Chinese mandarin complex vowels based on formant frequency
CN113349801A (en) Imaginary speech electroencephalogram signal decoding method based on convolutional neural network
CN116434759B (en) Speaker identification method based on SRS-CL network
CN114626424B (en) Data enhancement-based silent speech recognition method and device
CN108831472B (en) Artificial intelligent sounding system and sounding method based on lip language recognition
CN111009262A (en) Voice gender identification method and system
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN115472182A (en) Attention feature fusion-based voice emotion recognition method and device of multi-channel self-encoder
CN114550701A (en) Deep neural network-based Chinese electronic larynx voice conversion device and method
JP4381404B2 (en) Speech synthesis system, speech synthesis method, speech synthesis program
CN115410061B (en) Image-text emotion analysis system based on natural language processing
CN116172580B (en) Auditory attention object decoding method suitable for multi-sound source scene

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140312

Termination date: 20140721

EXPY Termination of patent right or utility model