Summary of the invention
The object of the present invention is to provide a speech production method based on the DIVA neural network model that has high pronunciation precision and a fast learning speed.
The technical solution that realizes the object of the invention is: a speech production method based on the DIVA neural network model, comprising speech sample extraction, speech sample classification and learning, speech output, and correction of the output speech, wherein the speech sample classification and learning uses an adaptive growing neural network (AGNN) to perform classification learning on the speech samples, specifically:
Step 1: convert the extracted speech formant frequencies into matrix form by means of the Jacobian; the dimension of the feature vector of this matrix is the number m of input-layer candidate neurons. Compute the fitness function value of each input-layer candidate neuron and arrange the candidate neurons in increasing order of fitness function value; the corresponding list of input-layer candidate neuron fitness function values is S = {S_i1 ≤ S_i2 ≤ ... ≤ S_im}, and the candidate neurons are placed in the list X in the corresponding order, X = (x_1, ..., x_m). The fitness function is computed from the actual output values y_i, the corresponding desired values, and n, the number of samples in the data set (n is a natural number);
Step 2: initialize the number of hidden neurons r = 0 and set C_0 = S_i1, where C_0 is the fitness function value when the number of hidden neurons r = 0;
Step 3: set r = r + 1 and p = r + 1, where r denotes the r-th hidden-layer candidate neuron; generate a hidden-layer candidate neuron with p inputs;
If step 4 r>1, is connected respectively to hidden neurons all before it and input node x by this hidden layer candidate neuron
1on; Otherwise this hidden layer candidate neuron is only connected to input node x
1on;
Step 5: set the initial value of h, the position in list X of the next element to be connected to the newly added hidden-layer candidate neuron, to 2, where 2 ≤ h ≤ m and m, h are positive integers; connect the p-th input of this hidden-layer candidate neuron to the input node at position h in list X;
Step 6: train this hidden-layer candidate neuron and compute its fitness function value C_r. If C_r ≥ C_(r-1), go to step 7; if C_r < C_(r-1), connect this hidden-layer candidate neuron into the network as the r-th hidden neuron and return to steps 3 to 6, until this condition is no longer met or all m input-layer nodes have been connected into the network;
Step 7: set h = h + 1 and train this hidden-layer candidate neuron again; if, when h = m, C_r < C_(r-1) is still not satisfied, finish training: this hidden-layer candidate neuron is irrelevant to the classification and is discarded, and the hidden neuron preceding this hidden-layer candidate neuron is taken as the output layer;
Step 8: determine the phoneme according to the output value of the output layer.
Further, in the speech production method based on the DIVA neural network model of the present invention, training this hidden-layer candidate neuron and computing its fitness function value C_r in step 6 is specifically:
(1) divide the data set formed by the normalized speech formant frequencies into a training set, a validation set, and a test set, where the numbers of samples in the training set and the validation set are n_a and n_b respectively, divided according to n_a = n_b;
(2) according to the three sets obtained by the division, compute the fitness function value C_r of the hidden-layer candidate neuron using the fitness formula evaluated over the validation samples i = 1, ..., n_b, where n_b is the number of samples in the validation set. Here y_b ∈ Y_b, Y_b is the target vector of the validation set, U_b is the input of the validation set to the hidden neuron, U_b being a matrix of p × 1 vectors, W_(k-1) is the weight vector, and k is the iteration count with range k = 0, 1, 2, 3, ..., n, where n is a positive integer.
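A minimal sketch of one way the validation-based fitness value C_r could be evaluated is given below. It assumes a sigmoid candidate neuron trained by plain gradient descent on the training set and a root-mean-square error over the validation set as the fitness measure; the function names, the training rule, and the error measure are illustrative assumptions, not the patented formula.

```python
import numpy as np

def train_candidate(U_a, y_a, lr=0.1, iters=200, rng=None):
    """Train one candidate neuron (sigmoid unit) on the training set.
    U_a: (n_a, p) inputs, y_a: (n_a,) targets. Returns the weight vector W."""
    rng = np.random.default_rng() if rng is None else rng
    W = rng.normal(size=U_a.shape[1])            # initial weights drawn from a normal distribution
    for _ in range(iters):                       # k = 1..iters gradient steps
        out = 1.0 / (1.0 + np.exp(-U_a @ W))     # sigmoid activation
        grad = U_a.T @ ((out - y_a) * out * (1.0 - out)) / len(y_a)
        W -= lr * grad
    return W

def fitness(W, U_b, y_b):
    """Fitness C_r evaluated on the validation set (lower is better);
    an RMS error over the n_b validation samples is assumed here."""
    out = 1.0 / (1.0 + np.exp(-U_b @ W))
    return float(np.sqrt(np.mean((out - y_b) ** 2)))
```

Evaluating the candidate on a held-out validation set rather than on the training set is what lets the growth procedure stop adding neurons before the network overfits.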
Further, in the speech production method based on the DIVA neural network model of the present invention, determining the phoneme according to the output value of the output layer in step 8 is specifically: the output value of the output layer is a number in the interval from 0 to 1, and the phoneme corresponding to the AGNN output value is determined according to the value range assigned to each phoneme in the DIVA neural network model.
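The mapping from an output value to a phoneme can be pictured with the short sketch below; the numeric ranges and phoneme labels are purely illustrative, the actual 29 ranges being those of the DIVA model table in the embodiment.

```python
def decode_phoneme(output_value, phoneme_ranges):
    """Map an AGNN output value in [0, 1] to a phoneme label.
    phoneme_ranges: list of (low, high, label) tuples; illustrative values only."""
    for low, high, label in phoneme_ranges:
        if low <= output_value < high:
            return label
    return None  # no stored phoneme range contains this value

# Illustrative ranges only (the real table assigns one range per phoneme):
ranges = [(0.00, 0.04, "b"), (0.04, 0.08, "d"), (0.08, 0.12, "g")]
print(decode_phoneme(0.05, ranges))  # -> "d"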
Compared with the prior art, the present invention has notable advantages. Because the adaptive growing neural network model starts learning from a single input node, adjusts the neuron weights according to external rules, and gradually adds new input nodes and new hidden neurons, the constructed AGNN is a narrow and deep network with close to the minimal number of input neurons, hidden neurons, and network connections; this effectively prevents overfitting of the network, keeps its computational cost low, and makes learning fast. On the classification accuracy of the samples, the RBF network originally used in the DIVA model reaches about 80% on average, while the AGNN reaches more than 90%. For learning samples of ordinary difficulty, classification learning and speech generation with the original model takes 10 s to 13 s, whereas under the same conditions the system improved with the AGNN model takes only 8 s to 10 s for the same process, i.e. 2 s to 3 s faster. For learning samples of medium difficulty and above, the system improved with the AGNN model performs even better: it is 4 s to 5 s faster than the model before improvement, and while the classification accuracy of the original system drops to 70%-75%, the AGNN-improved system still maintains a high accuracy of about 90% under the same conditions. It can be seen that applying the adaptive growing neural network model to the DIVA model gives the model higher pronunciation precision and a faster learning speed.
Embodiment
The present invention is described in further detail below with reference to the accompanying drawings.
As shown in Figure 1, a speech production method based on the DIVA neural network model of the present invention comprises speech sample extraction, speech sample classification and learning, speech output, and correction of the output speech, and is characterized in that the speech sample classification and learning uses an adaptive growing neural network (AGNN) to perform classification learning on the speech samples, specifically:
Step 1: convert the extracted speech formant frequencies into matrix form by means of the Jacobian; the dimension of the feature vector of this matrix is the number m of input-layer candidate neurons. Compute the fitness function value of each input-layer candidate neuron and arrange the candidate neurons in increasing order of fitness function value; the corresponding list of input-layer candidate neuron fitness function values is S = {S_i1 ≤ S_i2 ≤ ... ≤ S_im}, and the candidate neurons are placed in the list X in the corresponding order, X = (x_1, ..., x_m). The fitness function is computed from the actual output values y_i, the corresponding desired values, and n, the number of samples in the data set (n is a natural number);
Step 2: initialize the number of hidden neurons r = 0 and set C_0 = S_i1, where C_0 is the fitness function value when the number of hidden neurons r = 0;
Step 3: set r = r + 1 and p = r + 1, where r denotes the r-th hidden-layer candidate neuron; generate a hidden-layer candidate neuron with p inputs;
If step 4 r>1, is connected respectively to hidden neurons all before it and input node x by this hidden layer candidate neuron
1on; Otherwise this hidden layer candidate neuron is only connected to input node x
1on;
Step 5: set the initial value of h, the position in list X of the next element to be connected to the newly added hidden-layer candidate neuron, to 2, where 2 ≤ h ≤ m and m, h are positive integers; connect the p-th input of this hidden-layer candidate neuron to the input node at position h in list X;
Step 6: train this hidden-layer candidate neuron and compute its fitness function value C_r. If C_r ≥ C_(r-1), go to step 7; if C_r < C_(r-1), connect this hidden-layer candidate neuron into the network as the r-th hidden neuron and return to steps 3 to 6, until this condition is no longer met or all m input-layer nodes have been connected into the network. The fitness function value C_r is computed specifically as follows:
(1) divide the data set formed by the normalized speech formant frequencies into a training set, a validation set, and a test set, where the numbers of samples in the training set and the validation set are n_a and n_b respectively, divided according to n_a = n_b;
(2) according to the three sets obtained by the division, compute the fitness function value C_r of the hidden-layer candidate neuron using the fitness formula evaluated over the validation samples i = 1, ..., n_b, where n_b is the number of samples in the validation set. Here y_b ∈ Y_b, Y_b is the target vector of the validation set, U_b is the input of the validation set to the hidden neuron, U_b being a matrix of p × 1 vectors, W_(k-1) is the weight vector, and k is the iteration count with range k = 0, 1, 2, 3, ..., n, where n is a positive integer; the higher the required training precision, the larger the iteration count k.
Step 7: set h = h + 1 and train this hidden-layer candidate neuron again; if, when h = m, C_r < C_(r-1) is still not satisfied, finish training: this hidden-layer candidate neuron is irrelevant to the classification and is discarded, and the hidden neuron preceding this hidden-layer candidate neuron is taken as the output layer;
Step 8: determine the phoneme according to the output value of the output layer; the output value of the output layer is a number in the interval from 0 to 1, and the phoneme corresponding to the AGNN output value is determined according to the value range assigned to each phoneme in the DIVA neural network model.
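The growth procedure of steps 1 to 8 can be outlined as in the following simplified sketch. It assumes the illustrative helpers train_candidate and fitness from the earlier sketch (lower fitness is better), an assumed helper build_sets that assembles training and validation data for a given wiring, and 0-based list indexing; it is an outline of the procedure, not a literal reproduction of the patented implementation.

```python
def grow_agnn(X, S, build_sets, train_candidate, fitness):
    """Grow a narrow, deep cascade network following steps 1-8 (sketch).
    X: input-node labels sorted by increasing fitness values S (step 1).
    build_sets(wiring) -> (U_a, y_a, U_b, y_b) is an assumed data helper."""
    m = len(X)
    hidden = []                  # accepted hidden neurons: (label, wiring, weights)
    C_prev = S[0]                # step 2: r = 0, C_0 = S_i1
    h = 1                        # step 5: index in X of the candidate's last input
    while h < m:
        r = len(hidden) + 1
        # steps 3-5: candidate z_r with p = r + 1 inputs: the previous hidden
        # neurons z_1..z_(r-1), input node x_1 (= X[0]), and the node X[h]
        wiring = [label for label, *_ in hidden] + [X[0], X[h]]
        U_a, y_a, U_b, y_b = build_sets(wiring)
        W = train_candidate(U_a, y_a)            # step 6: train the candidate
        C_r = fitness(W, U_b, y_b)
        if C_r < C_prev:                         # improvement: accept as neuron r
            hidden.append((f"z{r}", wiring, W))
            C_prev = C_r
        else:                                    # step 7: rewire last input to X[h+1]
            h += 1
    # termination (step 7): the last accepted hidden neuron serves as the output layer
    return hidden
```

Because every accepted neuron feeds all later candidates, the resulting network is narrow and deep with close to the minimal number of connections, which is the property the advantage paragraph above relies on.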
Embodiment
As shown in Figure 2, in this embodiment, speech is first collected by a pronunciation device such as a microphone and passes through the auditory channel module with a given delay, and the formant frequencies of the speech are sent to the cochlea module in vector form. The cochlea module computes the cochlear representation (spectrum) of this speech and sends the formant frequencies to the auditory cortex module. The auditory cortex module transmits the speech, represented by the formant frequencies delivered by the cochlea module, to the auditory cortex classification sensing module. After receiving this speech, the auditory cortex classification sensing module divides it into the basic units of speech, namely phonemes; the initialized phoneme targets are output via the voice cell collection module (the speech sound map of the DIVA model) to the auditory and somatosensory results formed respectively by the auditory cortex and somatosensory cortex modules. This module identifies speech fragments by comparing the fragments received from the auditory cortex module with the stored phoneme representations, where each phoneme is represented by a numerical range between 0 and 1 stored in the voice cell collection module. The identification process is specifically: the auditory cortex classification sensing module matches the separated phonemes (that is, the output values of the AGNN) one by one with the phoneme representations in the voice cell collection; if no matching phoneme representation is found in the voice cell collection, this phoneme has not yet been learned, and the voice cell collection module creates a new phoneme representation in a dedicated region to represent the current phoneme. The relationship between the phoneme targets output by the auditory cortex classification sensing module and the voice cell collection is one to one. Afterwards, the voice cell collection module starts the generation of phoneme fragments and sends the index of the phoneme target to be produced to the motor cortex, auditory cortex, and somatosensory cortex modules. After receiving the phoneme target index from the voice cell collection module, the motor cortex sends control commands to the vocal tract module; the vocal tract module computes the vocal tract parameters for the received control commands and sends them to the sound device to produce the corresponding speech, and at the same time the vocal tract module sends the computed auditory effect and parameter configuration to the cochlea module and the sensory module through the auditory channel and the sensory channel respectively, forming feedback. After receiving the vocal tract configuration information transmitted in vector form through the sensory channel, the sensory module computes the somatosensory results related to the vocal tract configuration and sends them to the somatosensory cortex module. The somatosensory cortex module then computes the difference between the cortical representation of the somatosensation and the input somatosensory target and sends the somatosensory error to the motor cortex module in order to correct the generated speech. After receiving the formant frequencies of the speech produced by the vocal tract module, transmitted in vector form through the auditory channel, the cochlea module passes them to the auditory cortex module; the auditory cortex module computes the difference between this speech and the cortical representation of its target speech and propagates the error to the motor cortex module in order to correct the generated speech.
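The matching-or-creation step performed on the AGNN output might look like the sketch below. The class and method names are illustrative stand-ins for the voice cell collection module, and the range width used when a new phoneme representation is created is an assumed placeholder.

```python
class SpeechSoundMap:
    """Illustrative stand-in for the voice cell collection (speech sound map) module:
    it stores one numerical range in [0, 1] per learned phoneme."""
    def __init__(self):
        self.phonemes = {}            # label -> (low, high) range

    def identify(self, value):
        """Return the stored phoneme whose range contains the AGNN output value,
        or create a new representation if this phoneme has not been learned yet."""
        for label, (low, high) in self.phonemes.items():
            if low <= value < high:
                return label
        # no match: create a new phoneme representation in a dedicated region
        label = f"phoneme_{len(self.phonemes) + 1}"
        width = 0.02                  # assumed, illustrative range width
        self.phonemes[label] = (value - width / 2, value + width / 2)
        return label
```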
As shown in the table below, the 29 phonemes stored in the voice cell collection module of the existing DIVA neural network model each correspond to a numerical range. The classification result of the AGNN is a single value, and the value obtained represents a particular phoneme: the numerical interval into which it falls determines which phoneme it represents.
As shown in Figure 3, the learning rate is taken as η = 1.9 and Δ = 0.0015, and the initial weights are drawn from a normal distribution.
According to the input data set X, the dimension of its feature vector gives the number of input-layer candidate neurons m = 8. Using the fitness formula, with y_i the actual output value, the corresponding desired value, and n the number of samples in the data set, the fitness function value of each element in the input data set X is computed; the elements are arranged in order of increasing fitness function value, and the first 8 are chosen in turn as candidate neurons, namely x_8, x_5, x_12, x_16, x_24, x_27, x_19, and x_23, where the first input neuron x_8 has the smallest fitness function value, which is denoted C_0.
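One way the input candidates could be ranked is sketched below. It assumes each feature is scored individually by training a one-input sigmoid neuron with the learning rate η = 1.9 from Figure 3 and measuring its RMS error; the per-feature scoring scheme, the training rule, and the function names are assumptions for illustration only.

```python
import numpy as np

def rank_input_candidates(X_data, y, m=8, lr=1.9, iters=200):
    """Rank input features by fitness and keep the best m as candidate inputs.
    X_data: (n_samples, n_features) normalized formant data; y: (n_samples,) targets."""
    rng = np.random.default_rng(0)
    scores = []
    for j in range(X_data.shape[1]):
        u = X_data[:, j:j + 1]
        w = rng.normal(size=1)                 # normally distributed initial weight
        for _ in range(iters):
            out = 1.0 / (1.0 + np.exp(-u @ w))
            w -= lr * (u.T @ ((out - y) * out * (1.0 - out))) / len(y)
        out = 1.0 / (1.0 + np.exp(-u @ w))
        scores.append(np.sqrt(np.mean((out - y) ** 2)))
    order = np.argsort(scores)[:m]             # increasing fitness, keep the first m
    return list(order), [scores[j] for j in order]
```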
Add a hidden-layer candidate neuron z_1 with 2 inputs. Its two inputs are connected to the input-layer candidate neurons x_8 and x_5. Train this hidden-layer candidate neuron and then compute its fitness function value C_1; comparing C_1 with C_0 gives C_1 < C_0, so z_1 joins the network as the 1st hidden neuron. Add another hidden-layer candidate neuron z_2 with 3 inputs. Its first 2 inputs are connected to the preceding hidden neuron z_1 and to x_8, and its 3rd input is connected to x_5; train this candidate hidden neuron and compute its fitness function value C_2. Comparing C_2 with C_1 gives C_2 < C_1, so z_2 joins the network as the 2nd hidden neuron. Add a hidden-layer candidate neuron z_3 with 4 inputs. Its first 3 inputs are connected to the hidden neurons z_1 and z_2 and to input node x_8, and its 4th input is connected to x_5; train this candidate neuron and compute its fitness function value, but the value obtained is not less than C_2, so the 4th input is reconnected to x_12. Training this candidate neuron again now gives the fitness function value C_3, with C_3 < C_2, so z_3 joins the network as the 3rd hidden neuron.

z_4 is added as a hidden-layer candidate neuron with 5 inputs; its first 4 inputs are connected to z_1 to z_3 and to x_8, and its 5th input is connected to x_12. Train this candidate neuron and compute its fitness function value, but it is not less than C_3, so the 5th input is reconnected to x_16; training this candidate neuron and computing its fitness function value gives C_4, and because C_4 < C_3, z_4 joins the network as the 4th hidden neuron. Then z_5 is added as a hidden-layer candidate neuron with 6 inputs; its first 5 inputs are connected to z_1 to z_4 and to x_8, and its 6th input is connected to x_16. Train this candidate neuron and compute its fitness function value C_5; because C_5 < C_4, z_5 joins the network as the 5th hidden neuron. Continue by adding z_6 as a hidden-layer candidate neuron with 7 inputs; its first 6 inputs are connected to z_1 to z_5 and to x_8, and its 7th input is connected to x_16. Train this candidate neuron and compute its fitness function value, but it is not less than C_5, so the 7th input is reconnected to x_24; training this candidate neuron and computing its fitness function value gives C_6, and because C_6 < C_5, z_6 joins the network as the 6th hidden neuron. Next z_7 is connected into the network as a hidden-layer candidate neuron with 8 inputs; its first 7 inputs are connected to z_1 to z_6 and to x_8, and its 8th input is connected to x_24. Train this candidate neuron and compute its fitness function value C_7; since C_7 < C_6, z_7 joins the network as the 7th hidden neuron. Then z_8 is added as a hidden-layer candidate neuron with 9 inputs; its first 8 inputs are connected to z_1 to z_7 and to x_8, and its 9th input is connected to x_24. Train this hidden-layer candidate neuron and compute its fitness function value, but it is not less than C_7, so the 9th input is reconnected to x_27; training this hidden-layer candidate neuron and computing its fitness function value gives C_8, and because C_8 < C_7, z_8 joins the network as the 8th hidden neuron.

Next z_9 is added to the network as a hidden-layer candidate neuron with 10 inputs; its first 9 inputs are connected to z_1 to z_8 and to x_8, and its 10th input is connected to x_27. Train this hidden-layer candidate neuron and compute its fitness function value C_9; because C_9 < C_8, z_9 joins the network as the 9th hidden neuron. Continue by adding z_10 as a hidden-layer candidate neuron with 11 inputs; its first 10 inputs are connected to z_1 to z_9 and to x_8, and its 11th input is connected to x_27. Train this hidden-layer candidate neuron and compute its fitness function value, but it is not less than C_9, so the 11th input is reconnected to x_19; training this hidden-layer candidate neuron and computing its fitness function value gives C_10, and because C_10 < C_9, z_10 joins the network as the 10th hidden neuron. Then add z_11 as a hidden-layer candidate neuron with 12 inputs; its first 11 inputs are connected to z_1 to z_10 and to x_8, and its 12th input is connected to x_19. Train this hidden-layer candidate neuron and compute its fitness function value C_11; because C_11 < C_10, z_11 joins the network. Add z_12 as a hidden-layer candidate neuron with 13 inputs; its first 12 inputs are connected to z_1 to z_11 and to x_8, and its 13th input is connected to x_19. Train this hidden-layer candidate neuron and compute its fitness function value, but it is not less than C_11, so the 13th input is reconnected to x_23; training this hidden-layer candidate neuron and computing its fitness function value gives C_12, and because C_12 < C_11, z_12 joins the network as a hidden neuron. Add z_13 as a hidden-layer candidate neuron with 14 inputs; its first 13 inputs are connected to z_1 to z_12 and to x_8, and its 14th input is connected to x_23. Train this hidden-layer candidate neuron and compute its fitness function value C_13; since C_13 < C_12, z_13 joins the network as a hidden neuron. Add z_14 as a hidden-layer candidate neuron with 15 inputs; its first 14 inputs are connected to z_1 to z_13 and to x_8, and its 15th input is connected to x_23. Train this hidden-layer candidate neuron and compute its fitness function value, but it is not less than C_13, and there is no further candidate input node left to connect, so z_14 is discarded and z_13 serves as the output neuron.
The network has thus selected 8 input features, 12 hidden neurons, and 1 output neuron. The first hidden neuron is connected to the input nodes x_8 and x_5. The inputs of the output neuron are connected to the outputs of the hidden neurons z_1 to z_12 and to the input nodes x_8 and x_23.
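Assuming sigmoid units, the forward pass of the resulting cascade (output neuron z_13 fed by z_1 to z_12, x_8, and x_23; each hidden neuron z_r fed by z_1 to z_(r-1), x_8, and one further input node) might be organized as in the following sketch; the weight values and input values are placeholders, and the wiring is reconstructed from the worked example above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cascade_forward(x, wiring, weights):
    """Evaluate the grown cascade for one sample.
    x: dict of input-node values, e.g. {"x8": 0.3, "x5": 0.7, ...}.
    wiring: per neuron, the ordered list of node labels feeding it.
    weights: per neuron, a weight vector matching its wiring."""
    values = dict(x)
    for r, (inputs, w) in enumerate(zip(wiring, weights), start=1):
        u = np.array([values[name] for name in inputs])
        values[f"z{r}"] = sigmoid(u @ w)
    return values[f"z{len(wiring)}"]      # the last neuron (z13 here) is the output

# Wiring reconstructed from the embodiment: z1..z12 hidden, z13 output
wiring = [["x8", "x5"]]
extras = ["x5", "x12", "x16", "x16", "x24", "x24", "x27", "x27", "x19", "x19", "x23", "x23"]
for r, extra in enumerate(extras, start=2):
    wiring.append([f"z{j}" for j in range(1, r)] + ["x8", extra])

rng = np.random.default_rng(0)
weights = [rng.normal(size=len(inputs)) for inputs in wiring]     # placeholder weights
sample = {f"x{j}": 0.5 for j in (8, 5, 12, 16, 24, 27, 19, 23)}   # placeholder inputs
print(cascade_forward(sample, wiring, weights))                   # value in (0, 1)
```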