CN105788608A - Chinese initial consonant and compound vowel visualization method based on neural network


Info

Publication number
CN105788608A
Authority
CN
China
Prior art keywords
sound
neural network
voice signal
mother
wavelet neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610121430.8A
Other languages
Chinese (zh)
Other versions
CN105788608B (en)
Inventor
韩志艳
王健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bohai University
Original Assignee
Bohai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bohai University filed Critical Bohai University
Priority to CN201610121430.8A
Publication of CN105788608A
Application granted
Publication of CN105788608B
Expired - Fee Related
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a method for visualizing Chinese initials and finals (initial consonants and compound vowels) based on a neural network. The method comprises: a step of acquiring a voice signal; a step of extracting characteristic parameters of the voice signal and reducing their dimensionality by PCA; a step of designing and training a wavelet neural network, wherein the 64 binary codes output by the wavelet neural network correspond in order to the 8×8 cells of a display screen, the first 47 binary codes and their cells corresponding in turn to the 47 initials and finals arranged by pronunciation characteristics, so that when the composite feature vector of an initial or final is input to the wavelet neural network, the network outputs the position information of that initial or final; a step of dividing the 47 initials and finals into 12 groups and assigning different RGB values to the cells of each group to obtain color information; and a step of synthesizing the position information and the color information to realize the visualization of the initials and finals. The visualized images are easy for deaf-mute users to memorize, the method has good robustness and intelligibility, and deaf-mute users can accurately identify the pronunciation corresponding to a visualized image.

Description

Method for visualizing Chinese initials and finals based on a neural network
Technical field
The present invention relates to a method for visualizing Chinese initials (shengmu) and finals (yunmu), and in particular to a neural-network-based method for visualizing Chinese initials and finals.
Background art
Speech is the acoustic expression of language and the most natural, most effective and most convenient means of human communication; it is indispensable in daily life. For deaf-mute people, however, spoken communication is out of reach. Research shows that, in perceiving the outside world, humans receive information fastest and in the greatest quantity through vision. If speech could be perceived visually, it would be of enormous help in speech training for deaf-mute people and in establishing and improving their auditory cognition.
In 1947, R. K. Potter and G. A. Kopp et al. proposed the sound spectrograph, a speech visualization method, and speech researchers subsequently studied and improved it: L. C. Stewart et al. proposed a color spectrogram in 1976, G. M. Kuhn et al. proposed a real-time spectrograph system for training the hard of hearing in 1984, and P. E. Stern (1986), F. Plante (1998) and R. Steinberg (2008) et al. proposed further improvements to the spectrograph. However, the displayed spectrograms are highly specialized and hard to distinguish and memorize. The same utterance spoken by different people, or even repeated by the same person, can change the spectrogram, and robustness to speech recorded in different environments is even worse.
In addition, some scholars have realized speech visualization from the motion of the articulatory organs and changes of facial expression, which effectively dissects the human phonation process; but the speech intelligibility achieved is far from ideal, and apart from a small number of experts, people can hardly perceive speech accurately just by observing articulator motion and facial expression.
Summary of the invention
To address the deficiencies of the prior art, the present invention proposes a neural-network-based method for visualizing Chinese initials and finals. The method comprises the following steps:
Step 1, voice signal acquisition: input speech data through a microphone, and obtain the corresponding voice signal after sampling and quantization by a processing unit.
Step 2, voice signal pre-processing: apply pre-emphasis, framing with windowing, and endpoint detection to the acquired voice signal.
Step 3, extraction of voice signal characteristic parameters.
Step 3.1, estimate the formant frequencies of the pre-processed speech with a method based on the Hilbert-Huang transform, obtaining the formant eigenvalues F1, F2, F3 and F4 of each frame;
Step 3.2, compute the robust voice signal feature parameters WPTC1~WPTC20 based on the wavelet packet transform.
Step 3.3, compute the robust feature parameters PMUSIC-MFCC1~PMUSIC-MFCC12 based on MUSIC and perceptual characteristics.
Step 3.4, compute the Mel-frequency cepstral coefficients MFCC1~MFCC12.
Step 4, PCA dimensionality reduction: apply principal component analysis (PCA) to the above characteristic parameters to obtain the composite voice signal feature vector.
Step 5, neural network design: adopt a three-layer wavelet neural network with 12 input neurons, 8 hidden neurons and 6 output neurons. Train the network with M composite feature vectors, with expected error P and maximum iteration count Q; stop training when the network output error falls below the expected error or the number of training iterations reaches the maximum, completing the neural network design.
Step 6, position information mapping: the output layer of the wavelet neural network has 6 neurons, all binary-coded, giving 64 distinct binary codes. The display screen is divided into 64 cells arranged in 8 rows and 8 columns, and the 64 binary codes correspond in turn, left to right and top to bottom, to the 8×8 cells. The first 47 codes and their cells correspond in turn to the 47 initials and finals arranged by pronunciation characteristics: a o e i u ü, y w, an en in un ün, j q x, b p m f, d t n l, ang eng ing ong, zh ch sh r, g k h, z c s, ai ei ui ao ou iu ie üe er. When the composite feature vector of an initial or final is input to the wavelet neural network, the network outputs the binary code of the corresponding cell; this code is the position information of that initial or final, and the corresponding cell is selected.
Step 7, color information acquisition: divide the 47 initials and finals into 12 groups by pronunciation characteristic or place of articulation, and assign different RGB values to the cells of each of the 12 groups, so that the cells of the 12 groups display different colors.
Step 8, information synthesis: synthesize the position information and the color information; when the composite feature vector of an initial or final is input, the cell corresponding to that initial or final displays its assigned color while the remaining cells display black, realizing the visualization of the initials and finals.
In the voice signal acquisition of step 1, the sampling frequency used for sampling and quantization is 11.025 kHz and the quantization precision is 16 bits.
In the pre-processing of step 2, pre-emphasis is realized with a first-order digital pre-emphasis filter whose coefficient lies between 0.93 and 0.97; framing uses a frame length of 256 samples; each frame is weighted with a Hamming window; and endpoint detection uses the short-time energy-zero product method.
In the information synthesis of step 8, the position information of the input initial or final is obtained first, and the color information is then applied to the corresponding cell, so that the cell displays its assigned color.
Beneficial effect:
1) The invention designs the position information of each initial and final according to its pronunciation characteristics, making the display easy for deaf-mute users to memorize;
2) The invention divides the 47 cells into 12 regions of different colors by pronunciation characteristic or place of articulation, fully exploiting the strong visual memory of deaf-mute users for color stimuli;
3) The invention synthesizes the position information and color information into a single image, realizing voice signal visualization; compared with the prior art it has good robustness and intelligibility and overcomes the drawback that spectrograms are hard to distinguish and memorize, so that after a period of specialized training deaf-mute users can accurately recognize the pronunciation corresponding to a visualized image and communicate with hearing people;
4) The invention uses a wavelet neural network for the position information mapping; the wavelet neural network offers a designable structure, controllable convergence precision and fast convergence, effectively improving the correct coding rate of the Chinese initials and finals.
Brief description of the drawings
Fig. 1 is the flow chart of an embodiment of the invention;
Fig. 2 is a structural diagram of the wavelet neural network of an embodiment of the invention;
Fig. 3 is a position information mapping diagram of an embodiment of the invention;
Fig. 4 is an example of the visualization of the initial p in an embodiment of the invention;
Fig. 5 is an example of the visualization of the final o in an embodiment of the invention;
Fig. 6 is an example of the visualization of the initials y, w and the finals i, u in an embodiment of the invention.
Detailed description of the invention
The specific embodiments of the invention are described in detail below with reference to the accompanying drawings. The neural-network-based method for visualizing Chinese initials and finals comprises the following steps, as shown in Fig. 1:
Step 1, voice signal acquisition: speech data are input through a microphone and sampled and quantized by a processing unit such as a computer, microcontroller or DSP chip, at a sampling frequency of 11.025 kHz with a quantization precision of 16 bits, to obtain the corresponding voice signal. In this embodiment a computer is used as the processing unit.
Step 2, voice signal pre-processing: the acquired voice signal undergoes pre-emphasis, framing with windowing, and endpoint detection. Pre-emphasis is performed with a first-order digital pre-emphasis filter whose coefficient lies between 0.93 and 0.97; in this embodiment it is 0.9375. The signal is then divided into frames of 256 samples, each frame is weighted with a Hamming window, and endpoint detection is performed with the short-time energy-zero product method.
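By way of illustration only, the following Python sketch implements this pre-processing step; the frame shift of 128 samples and the endpoint-detection thresholds are assumptions, since the description fixes only the pre-emphasis coefficient, the frame length and the window type.

```python
import numpy as np

def preprocess(signal, alpha=0.9375, frame_len=256, hop=128):
    # First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Split into overlapping frames and weight each with a Hamming window
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    window = np.hamming(frame_len)
    return np.stack([emphasized[i * hop:i * hop + frame_len] * window
                     for i in range(n_frames)])

def is_speech(frame, energy_thresh=1e-4, ezp_thresh=1e-6):
    # Simplified short-time energy-zero product criterion: keep a frame when
    # the product of its energy and zero-crossing rate exceeds a threshold.
    energy = np.mean(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    return energy > energy_thresh and energy * zcr > ezp_thresh
```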
Step 3, extraction of voice signal characteristic parameters.
Step 3.1, estimate the formant frequencies of the pre-processed speech with a method based on the Hilbert-Huang transform, obtaining the formant eigenvalues F1, F2, F3 and F4 of each frame.
The rough formant frequencies of the voice signal are first estimated with the fast Fourier transform (FFT) and used to set the parameters of the corresponding band-pass filters. The voice signal is filtered accordingly, and the filtered signal is decomposed by empirical mode decomposition (EMD) into a family of intrinsic mode functions (IMFs). The IMF containing the formant frequency is selected by the maximum-energy principle, and its instantaneous frequency and Hilbert spectrum are computed to obtain the formant frequency parameters of the voice signal.
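A hedged sketch of this formant estimator follows, using the third-party PyEMD package (pip install EMD-signal) for the empirical mode decomposition; the band-pass bandwidth and filter order are assumptions, and the Hilbert-spectrum step is reduced to the mean instantaneous frequency of the maximum-energy IMF.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert
from PyEMD import EMD

def formant_from_band(frame, f_rough, fs=11025, half_bw=300.0):
    # Band-pass filter around the rough FFT-based formant estimate
    lo = max(50.0, f_rough - half_bw)
    hi = min(fs / 2 - 50.0, f_rough + half_bw)
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    band = sosfiltfilt(sos, frame)
    # Empirical mode decomposition; keep the maximum-energy IMF
    imfs = EMD().emd(band)
    imf = imfs[np.argmax([np.sum(c ** 2) for c in imfs])]
    # Instantaneous frequency from the analytic (Hilbert) signal
    phase = np.unwrap(np.angle(hilbert(imf)))
    inst_freq = np.diff(phase) * fs / (2 * np.pi)
    return float(np.mean(inst_freq))  # formant frequency estimate in Hz
```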
Step 3.2, compute the robust voice signal feature parameters WPTC1~WPTC20 based on the wavelet packet transform.
The constant-Q property of each analysis band of the wavelet packet transform matches the way the human auditory system processes speech. Combining the multi-level band division of the wavelet packet with the characteristics of the auditory perception bands, frequency bands are selected adaptively and the robust feature parameters WPTC1~WPTC20 are computed from them.
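The text does not define the WPTC coefficients exactly, so the following PyWavelets sketch assumes a common recipe: a full wavelet packet decomposition, a fixed choice of 20 subbands standing in for the adaptive auditory-band selection, and log subband energies as the coefficients. The wavelet ('db4') and decomposition level (5) are also assumptions.

```python
import numpy as np
import pywt

def wptc(frame, wavelet="db4", level=5, n_coeffs=20):
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet, maxlevel=level)
    nodes = wp.get_level(level, order="freq")  # 32 subbands in frequency order
    energies = np.array([np.sum(node.data ** 2) for node in nodes])
    # Keep the 20 lowest-frequency subbands as a stand-in for the
    # perceptually motivated band selection described above.
    return np.log(energies[:n_coeffs] + 1e-12)
```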
Step 3.3, compute the robust feature parameters PMUSIC-MFCC1~PMUSIC-MFCC12 based on MUSIC and perceptual characteristics.
To improve the robustness of the visualization, the multiple signal classification (MUSIC) spectral estimation technique is adopted and perceptual characteristics are introduced into it, yielding the robust feature parameters PMUSIC-MFCC1~PMUSIC-MFCC12.
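A minimal MUSIC pseudo-spectrum sketch is given below; the correlation-matrix size and signal-subspace order are assumptions, and the perceptual (mel-style) cepstral processing that turns this spectrum into the 12 PMUSIC-MFCC coefficients, which the text does not detail, is omitted.

```python
import numpy as np
from scipy.linalg import eigh, toeplitz

def music_spectrum(frame, fs=11025, m=32, p=12, n_freqs=257):
    n = len(frame)
    # Autocorrelation matrix of order m
    r = np.correlate(frame, frame, mode="full")[n - 1:n - 1 + m] / n
    R = toeplitz(r)
    _, v = eigh(R)                 # eigenvalues in ascending order
    noise = v[:, :m - p]           # noise subspace (smallest eigenvalues)
    freqs = np.linspace(0.0, fs / 2, n_freqs)
    steering = np.exp(-2j * np.pi * np.outer(np.arange(m), freqs / fs))
    denom = np.sum(np.abs(noise.conj().T @ steering) ** 2, axis=0)
    return freqs, 1.0 / (denom + 1e-12)   # peaks mark dominant frequencies
```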
Step 3.4, compute the Mel-frequency cepstral coefficients MFCC1~MFCC12.
Each pre-processed frame undergoes a discrete Fourier transform to obtain its linear spectrum, which is passed through a Mel-frequency filter bank; taking the logarithm and applying a discrete cosine transform yields the Mel-frequency cepstral coefficients.
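A compact sketch of this MFCC recipe (DFT, mel filter bank, logarithm, DCT) follows; the FFT size (512) and the 26-filter bank are assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs=11025, n_fft=512, n_filt=26, n_ceps=12):
    spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2          # power spectrum
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_filt + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(1, n_filt + 1):                          # triangular filters
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    feat = np.log(fbank @ spec + 1e-12)                     # log mel energies
    return dct(feat, type=2, norm="ortho")[:n_ceps]         # MFCC1~MFCC12
```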
Step 4, PCA dimensionality reduction: principal component analysis (PCA) reduces the 48-dimensional voice signal feature vector formed by the above parameters to a 12-dimensional composite feature vector.
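Sketched with scikit-learn, the reduction of the concatenated 48-dimensional vector (4 formants + 20 WPTC + 12 PMUSIC-MFCC + 12 MFCC) to 12 dimensions looks as follows; the random matrix is only a placeholder for real feature vectors.

```python
import numpy as np
from sklearn.decomposition import PCA

features = np.random.randn(1000, 48)        # placeholder: 1000 48-dim vectors
pca = PCA(n_components=12)
composite = pca.fit_transform(features)     # 12-dim composite feature vectors
print(pca.explained_variance_ratio_.sum())  # variance retained by 12 components
```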
Step 5, neural network design: a three-layer wavelet neural network is adopted, as shown in Fig. 2, with 12 input neurons, 8 hidden neurons and 6 output neurons. The network is trained with 1000 composite feature vectors, the expected error being 0.001 and the maximum iteration count 200; training stops when the network output error falls below the expected error or the number of training iterations reaches the maximum, completing the neural network design.
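A PyTorch sketch of the 12-8-6 wavelet neural network is given below. The text fixes only the layer sizes, the expected error 0.001 and the 200-iteration cap; the Morlet-type hidden activation with learnable scales and translations, the optimizer and the loss function are assumptions, and X, Y stand in for the real training data.

```python
import torch
import torch.nn as nn

class WaveletNet(nn.Module):
    def __init__(self, n_in=12, n_hidden=8, n_out=6):
        super().__init__()
        self.w1 = nn.Linear(n_in, n_hidden)
        self.a = nn.Parameter(torch.ones(n_hidden))    # wavelet scales
        self.b = nn.Parameter(torch.zeros(n_hidden))   # wavelet translations
        self.w2 = nn.Linear(n_hidden, n_out)

    def forward(self, x):
        u = (self.w1(x) - self.b) / self.a
        h = torch.cos(1.75 * u) * torch.exp(-u ** 2 / 2)  # Morlet wavelet
        return torch.sigmoid(self.w2(h))   # 6 outputs, thresholded to bits

net = WaveletNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
X = torch.randn(1000, 12)                        # composite feature vectors
Y = torch.randint(0, 2, (1000, 6)).float()       # target binary codes
for epoch in range(200):                         # Q = 200 maximum iterations
    opt.zero_grad()
    loss = loss_fn(net(X), Y)
    loss.backward()
    opt.step()
    if loss.item() < 0.001:                      # P = 0.001 expected error
        break
```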
Step 6, position information mapping: the output layer of the wavelet neural network has 6 neurons, all binary-coded, giving 64 distinct binary codes. The display screen is divided into 64 cells arranged in 8 rows and 8 columns, and the 64 binary codes correspond in turn, left to right and top to bottom, to the 8×8 cells. The first 47 codes and their cells correspond in turn to the 47 initials and finals arranged by pronunciation characteristics: a o e i u ü, y w, an en in un ün, j q x, b p m f, d t n l, ang eng ing ong, zh ch sh r, g k h, z c s, ai ei ui ao ou iu ie üe er, as shown in Fig. 3. When the composite feature vector of an initial or final is input to the wavelet neural network, the network outputs the binary code of the corresponding cell; this code is the position information of that initial or final, and the corresponding cell is selected. For example, 000000 denotes the cell in the first row, first column and corresponds to the final a; 000001 denotes the cell in the first row, second column and corresponds to the final o; and so on.
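A minimal sketch of this code-to-cell decoding, assuming the six network outputs are rounded to bits:

```python
def code_to_cell(bits):
    """Map six 0/1 output values to a (row, column) cell of the 8x8 grid."""
    idx = int("".join(str(int(round(b))) for b in bits), 2)
    return divmod(idx, 8)   # row-major: left to right, top to bottom

assert code_to_cell([0, 0, 0, 0, 0, 0]) == (0, 0)   # final 'a'
assert code_to_cell([0, 0, 0, 0, 0, 1]) == (0, 1)   # final 'o'
```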
Step 7, color information acquisition: the 47 initials and finals are divided into 12 groups by pronunciation characteristic or place of articulation, and the cells of each group are assigned different RGB values so that the 12 groups display different colors:
• Zone 1, the simple-final zone (codes 000000-000101: a o e i u ü): R=0.95, G=0.75, B=0.68, pink;
• Zone 2, the y/w zone (codes 000110-000111): R=0, G=0.95, B=0, green;
• Zone 3, the front-nasal-final zone (codes 001000-001100: an en in un ün): R=0.52, G=0.38, B=0.76, blue-violet;
• Zone 4, the dorsal zone (codes 001101-001111: j q x): R=0.25, G=0.52, B=0.18, dark green;
• Zone 5, the bilabial zone (codes 010000-010010: b p m): R=0.12, G=0.98, B=0.76, verdigris;
• Zone 6, the labiodental zone (code 010011: f): R=0, G=0, B=0.55, blue;
• Zone 7, the blade-alveolar zone (codes 010100-010111: d t n l): R=0.75, G=0, B=0.55, purple;
• Zone 8, the back-nasal-final zone (codes 011000-011011: ang eng ing ong): R=0.75, G=0, B=0, red;
• Zone 9, the blade-palatal zone (codes 011100-011111: zh ch sh r): R=0.98, G=0.96, B=0, yellow;
• Zone 10, the velar (tongue-root) zone (codes 100000-100010: g k h): R=0.87, G=0.87, B=0.79, gray-white;
• Zone 11, the dental zone (codes 100011-100101: z c s): R=0.74, G=0.42, B=0, brown;
• Zone 12, the compound-final zone (codes 100110-101110: ai ei ui ao ou iu ie üe er): R=1, G=1, B=1, white.
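The zone table above translates directly into a Python lookup; the indices are the decimal values of the 6-bit binary codes.

```python
ZONES = [
    (range(0, 6),   (0.95, 0.75, 0.68)),  # 1 simple finals, pink
    (range(6, 8),   (0.00, 0.95, 0.00)),  # 2 y/w, green
    (range(8, 13),  (0.52, 0.38, 0.76)),  # 3 front nasal finals, blue-violet
    (range(13, 16), (0.25, 0.52, 0.18)),  # 4 dorsal j/q/x, dark green
    (range(16, 19), (0.12, 0.98, 0.76)),  # 5 bilabial b/p/m, verdigris
    (range(19, 20), (0.00, 0.00, 0.55)),  # 6 labiodental f, blue
    (range(20, 24), (0.75, 0.00, 0.55)),  # 7 blade-alveolar d/t/n/l, purple
    (range(24, 28), (0.75, 0.00, 0.00)),  # 8 back nasal finals, red
    (range(28, 32), (0.98, 0.96, 0.00)),  # 9 blade-palatal zh/ch/sh/r, yellow
    (range(32, 35), (0.87, 0.87, 0.79)),  # 10 velar g/k/h, gray-white
    (range(35, 38), (0.74, 0.42, 0.00)),  # 11 dental z/c/s, brown
    (range(38, 47), (1.00, 1.00, 1.00)),  # 12 compound finals, white
]

def code_to_rgb(idx):
    for codes, rgb in ZONES:
        if idx in codes:
            return rgb
    return (0.0, 0.0, 0.0)   # codes 47-63 are unused and stay black
```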
Step 8, information synthesis: the position information and the color information are synthesized; when the composite feature vector of an initial or final is input, the cell corresponding to that initial or final displays its assigned color and the remaining cells display black, realizing the visualization. The position information of the input initial or final is obtained first, and the color information is then applied to the corresponding cell so that it displays its color. As shown in Fig. 4, the cell for the initial p is in the third row, second column; its binary code is 010001 and its color is verdigris. As shown in Fig. 5, the cell for the final o is in the first row, second column; its binary code is 000001 and its color is pink. As shown in Fig. 6, y and i, and likewise w and u, are pronounced very similarly and their spectrograms are also very similar and hard to tell apart, whereas the present invention distinguishes them easily.
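An end-to-end rendering sketch that reuses code_to_cell and code_to_rgb from the sketches above; the 32-pixel cell size and the use of matplotlib are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

def render(bits, cell=32):
    row, col = code_to_cell(bits)                # position information (step 6)
    rgb = code_to_rgb(row * 8 + col)             # color information (step 7)
    img = np.zeros((8 * cell, 8 * cell, 3))      # all cells start black
    img[row * cell:(row + 1) * cell, col * cell:(col + 1) * cell] = rgb
    plt.imshow(img)
    plt.axis("off")
    plt.show()

render([0, 1, 0, 0, 0, 1])   # initial 'p': third row, second column, verdigris
```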

Claims (5)

1. A method for visualizing Chinese initials and finals based on a neural network, characterized by comprising the following steps:
Step 1, voice signal acquisition: inputting speech data through a microphone, and obtaining the corresponding voice signal after sampling and quantization by a processing unit;
Step 2, voice signal pre-processing: applying pre-emphasis, framing with windowing, and endpoint detection to the acquired voice signal;
Step 3, extracting voice signal characteristic parameters;
Step 4, PCA dimensionality reduction: applying principal component analysis (PCA) to the above characteristic parameters to obtain the composite voice signal feature vector;
Step 5, neural network design: adopting a three-layer wavelet neural network with 12 input neurons, 8 hidden neurons and 6 output neurons; training the network with M composite feature vectors, with expected error P and maximum iteration count Q; stopping training when the network output error falls below the expected error or the number of training iterations reaches the maximum, completing the neural network design;
Step 6, position information mapping: the output layer of the wavelet neural network has 6 neurons, all binary-coded, giving 64 distinct binary codes; the display screen is divided into 64 cells arranged in 8 rows and 8 columns, and the 64 binary codes correspond in turn, left to right and top to bottom, to the 8×8 cells; the first 47 codes and their cells correspond in turn to the 47 initials and finals arranged by pronunciation characteristics: a o e i u ü, y w, an en in un ün, j q x, b p m f, d t n l, ang eng ing ong, zh ch sh r, g k h, z c s, ai ei ui ao ou iu ie üe er; when the composite feature vector of an initial or final is input to the wavelet neural network, the network outputs the binary code of the corresponding cell, this code being the position information of that initial or final, and the corresponding cell is selected;
Step 7, color information acquisition: dividing the 47 initials and finals into 12 groups by pronunciation characteristic or place of articulation, and assigning different RGB values to the cells of each of the 12 groups, so that the cells of the 12 groups display different colors;
Step 8, information synthesis: synthesizing the position information and the color information, so that when the composite feature vector of an initial or final is input, the cell corresponding to that initial or final displays its assigned color while the remaining cells display black, realizing the visualization of the initials and finals.
2. The method for visualizing Chinese initials and finals based on a neural network according to claim 1, characterized in that step 3 comprises the following steps:
Step 3.1, estimating the formant frequencies of the pre-processed speech with a method based on the Hilbert-Huang transform to obtain the formant eigenvalues F1, F2, F3 and F4 of each frame;
Step 3.2, computing the robust voice signal feature parameters WPTC1~WPTC20 based on the wavelet packet transform;
Step 3.3, computing the robust feature parameters PMUSIC-MFCC1~PMUSIC-MFCC12 based on MUSIC and perceptual characteristics;
Step 3.4, computing the Mel-frequency cepstral coefficients MFCC1~MFCC12.
3. The method for visualizing Chinese initials and finals based on a neural network according to claim 1, characterized in that in step 1 the processing unit performs sampling and quantization at a sampling frequency of 11.025 kHz with a quantization precision of 16 bits.
4. The method for visualizing Chinese initials and finals based on a neural network according to claim 1, characterized in that step 2 is carried out as follows: pre-emphasis is realized with a first-order digital pre-emphasis filter whose coefficient lies between 0.93 and 0.97; framing uses a frame length of 256 samples; each frame is weighted with a Hamming window; and endpoint detection uses the short-time energy-zero product method.
5. The method for visualizing Chinese initials and finals based on a neural network according to claim 1, characterized in that in the information synthesis of step 8 the position information of the input initial or final is obtained first, and the color information is then applied to the corresponding cell, so that the cell displays its assigned color.
CN201610121430.8A 2016-03-03 2016-03-03 Method for visualizing Chinese initials and finals based on a neural network Expired - Fee Related CN105788608B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610121430.8A CN105788608B (en) 2016-03-03 2016-03-03 Method for visualizing Chinese initials and finals based on a neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610121430.8A CN105788608B (en) 2016-03-03 2016-03-03 Method for visualizing Chinese initials and finals based on a neural network

Publications (2)

Publication Number Publication Date
CN105788608A 2016-07-20
CN105788608B CN105788608B (en) 2019-03-26

Family

ID=56387776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610121430.8A Expired - Fee Related CN105788608B (en) 2016-03-03 2016-03-03 Method for visualizing Chinese initials and finals based on a neural network

Country Status (1)

Country Link
CN (1) CN105788608B (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6070140A (en) * 1995-06-05 2000-05-30 Tran; Bao Q. Speech recognizer
CN102231281A (en) * 2011-07-18 2011-11-02 渤海大学 Voice visualization method based on integration characteristic and neural network
CN104205062A (en) * 2012-03-26 2014-12-10 微软公司 Profile data visualization
KR20140079937A (en) * 2012-12-20 2014-06-30 엘지전자 주식회사 Mobile device for having touch sensor and method for controlling the same
CN103177733A (en) * 2013-03-11 2013-06-26 哈尔滨师范大学 Method and system for evaluating Chinese mandarin retroflex suffixation pronunciation quality
US20150235637A1 (en) * 2014-02-14 2015-08-20 Google Inc. Recognizing speech in the presence of additional audio
CN104392728A (en) * 2014-11-26 2015-03-04 东北师范大学 Colored repeated sentence spectrum construction method for speech reconstruction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
苏敏 et al.: "Segmentation of Chinese Initials and Finals Based on a Fuzzy-Rough Neural Network" (in Chinese), 《电声技术》 (Audio Engineering) *
韩志艳 et al.: "Design of a Speech Recognition Classifier Based on a Genetic Wavelet Neural Network" (in Chinese), 《计算机科学》 (Computer Science) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111312208A (en) * 2020-03-09 2020-06-19 广州深声科技有限公司 Neural network vocoder system with irrelevant speakers
CN111599347A (en) * 2020-05-27 2020-08-28 广州科慧健远医疗科技有限公司 Standardized sampling method for extracting pathological voice MFCC (Mel frequency cepstrum coefficient) features for artificial intelligence analysis
CN111599347B (en) * 2020-05-27 2024-04-16 广州科慧健远医疗科技有限公司 Standardized sampling method for extracting pathological voice MFCC (Mel-frequency cepstral coefficient) features for artificial intelligence analysis
CN111899724A (en) * 2020-08-06 2020-11-06 中国人民解放军空军预警学院 Voice feature coefficient extraction method based on Hilbert-Huang transform and related equipment
CN112101462A (en) * 2020-09-16 2020-12-18 北京邮电大学 Electromechanical device audio-visual information fusion method based on BMFCC-GBFB-DNN
CN112101462B (en) * 2020-09-16 2022-04-19 北京邮电大学 Electromechanical device audio-visual information fusion method based on BMFCC-GBFB-DNN
CN112270406A (en) * 2020-11-11 2021-01-26 浙江大学 Neural information visualization method of brain-like computer operating system

Also Published As

Publication number Publication date
CN105788608B (en) 2019-03-26

Similar Documents

Publication Publication Date Title
CN105788608B (en) Method for visualizing Chinese initials and finals based on a neural network
CN104272382B (en) Personalized singing synthetic method based on template and system
CN110085245B (en) Voice definition enhancing method based on acoustic feature conversion
Lee EMG-based speech recognition using hidden Markov models with global control variables
CN110675891B (en) Voice separation method and module based on multilayer attention mechanism
CN108564965B (en) Anti-noise voice recognition system
CN101916566A (en) Electronic larynx speech reconstructing method and system thereof
CN106653056A (en) Fundamental frequency extraction model based on LSTM recurrent neural network and training method thereof
CN102426834B (en) Method for testing rhythm level of spoken English
CN107293286A (en) A kind of speech samples collection method that game is dubbed based on network
CN112992121B (en) Voice enhancement method based on attention residual error learning
CN112382308A (en) Zero-order voice conversion system and method based on deep learning and simple acoustic features
CN102176313B (en) Formant-frequency-based Mandarin simple final voice visualizing method
CN116364096B (en) Electroencephalogram signal voice decoding method based on generation countermeasure network
CN109452932A (en) A kind of Constitution Identification method and apparatus based on sound
CN107274887A (en) Speaker's Further Feature Extraction method based on fusion feature MGFCC
CN111326170B (en) Method and device for converting ear voice into normal voice by combining time-frequency domain expansion convolution
Diener et al. Improving fundamental frequency generation in emg-to-speech conversion using a quantization approach
CN110349565B (en) Auxiliary pronunciation learning method and system for hearing-impaired people
CN102820037B (en) Chinese initial and final visualization method based on combination feature
CN101894566A (en) Visualization method of Chinese mandarin complex vowels based on formant frequency
Krecichwost et al. Automated detection of sigmatism using deep learning applied to multichannel speech signal
Healy et al. Deep learning based speaker separation and dereverberation can generalize across different languages to improve intelligibility
CN117854473A (en) Zero sample speech synthesis method based on local association information
CN102231275B (en) Embedded speech synthesis method based on weighted mixed excitation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190326

Termination date: 20200303