CN102789594B - Voice generation method based on DIVA neural network model - Google Patents


Info

Publication number: CN102789594B
Application number: CN201210219670.3A
Authority: CN (China)
Prior art keywords: neuron, hidden, hidden layer, candidate, layer candidate
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other versions: CN102789594A (Chinese)
Inventors: 张少白, 徐磊, 刘欣
Current assignees: Boao Zongheng Network Technology Co., Ltd.; Guangzhou Zib Artificial Intelligence Technology Co., Ltd. (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original assignee: Nanjing University of Posts and Telecommunications
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201210219670.3A; publication of CN102789594A; application granted; publication of CN102789594B
Landscapes: Feedback Control In General (AREA)
Abstract

The invention discloses a voice generation method based on the DIVA neural network model, comprising speech-sample extraction, speech-sample classification and learning, speech output, and correction of the output speech. The classification and learning of speech samples uses an adaptive growth neural network (AGNN). The number of candidate neurons in the input layer is computed from the acquired speech formant frequencies; the hidden-layer neurons are then determined from the input-layer candidate neurons; finally the output value of the AGNN is obtained and a phoneme is identified from that value. A neural network with this structure trains to high accuracy and learns quickly.

Description

Speech generation method based on a DIVA neural network model
Technical field
The present invention relates to a speech generation method, in particular to a speech generation method based on the DIVA neural network model.
Background technology
With the development of artificial intelligence, research in this field continues to deepen. Generating and acquiring speech similar to human pronunciation, and controlling this process, is an urgent problem for robotic articulation systems. Speech production and acquisition is a complex cognitive process involving many regions of the brain: it comprises a hierarchy extending from sentences or phrases organized by syntax and grammar down to phonemes, and it requires a neural network model of the interaction between the sensory and motor regions of the brain during vocalization. The DIVA (Directions Into Velocities of Articulators) model is a mathematical model describing speech production and acquisition, used mainly to simulate and describe the functions of the brain regions involved in speech production and speech understanding. It can also be described as an adaptive neural network model for generating words, syllables or phonemes and for controlling the motion of a simulated vocal tract. Among existing neural network models of speech production and acquisition with genuine biological significance, the DIVA model is comparatively the best defined and tested, and it is the only model that applies a pseudo-inverse control scheme.
The DIVA model was developed in response to the demand for a unified computational model of human language ability. Since it was first proposed by Guenther of the MIT speech laboratory in 1994, the model has been continuously updated and improved. The DIVA system consists of a speech-channel module, a cochlea module, an auditory cortex module, an auditory-cortex classification and perception module, a speech cell set module, a motor cortex module, a vocal tract module, a somatosensory cortex module, a sensory module and a sensory-channel module.
Analysis of the DIVA model shows that the classification method used in its auditory-cortex classification and perception module is an RBF network. An RBF neural network depends heavily on its samples, and for a given concrete problem there is currently no general, effective algorithm or theorem for determining a suitable number of hidden-layer nodes. In practice the network size is determined by experience and repeated trial, and this trial-and-error method is tedious and makes it difficult to find a suitable structure. The number of hidden-layer nodes strongly affects the convergence speed, precision and generalization ability of the network: with too many hidden nodes the network can complete training but converges slowly and may overfit, while with too few it cannot learn sufficiently and fails to reach the required training precision. In addition, training an RBF neural network is not fast enough.
Summary of the invention
The object of the present invention is to provide a speech generation method based on the DIVA neural network model with high pronunciation precision and fast learning speed.
The technical solution that realizes the object of the invention is a speech generation method based on the DIVA neural network model, comprising speech-sample extraction, speech-sample classification and learning, speech output, and correction of the output speech, where the speech-sample classification and learning uses an adaptive growth neural network (AGNN) to perform classification learning on the speech samples, specifically:

Step 1: convert the extracted speech formant frequencies to matrix form via a Jacobian transformation; the dimension of the feature vector of this matrix is the number m of input-layer candidate neurons. Compute the fitness function value of each input-layer candidate neuron and arrange the candidates in increasing order of fitness value, so that the list of input-layer candidate fitness values is S = {S_i1 ≤ S_i2 ≤ … ≤ S_im}, and place the candidate neurons in the corresponding order in a list X = (x_1, …, x_m). The fitness function is computed as

S = sqrt( Σ_{i=1..n} (y_i − ŷ_i)² / n )

where y_i is the actual output value, ŷ_i is the desired value, and n is the number of samples in the data set (n a natural number);
Step 2: initialize the number of hidden neurons r = 0 and set C_0 = S_i1, where C_0 is the fitness function value when the number of hidden neurons r = 0;
Step 3: set r = r + 1 and p = r + 1, where r indexes the r-th hidden-layer candidate neuron, and generate a hidden-layer candidate neuron with p inputs;
Step 4: if r > 1, connect this hidden-layer candidate neuron to all previous hidden neurons and to input node x_1; otherwise connect it only to input node x_1;
Step 5: set the initial value of h, the position in list X of the next element to be connected to the newly added hidden-layer candidate neuron, to 2, where 2 ≤ h ≤ m and m, h are positive integers; connect the p-th input of this hidden-layer candidate neuron to the input node at position h of list X;
Step 6: train this hidden-layer candidate neuron and compute its fitness function value C_r. If C_r ≥ C_{r−1}, go to step 7; if C_r < C_{r−1}, connect this candidate into the network as the r-th hidden neuron and return to step 3, repeating steps 3 to 6 until all m input-layer nodes have been connected into the network or the condition is no longer met;
Step 7: set h = h + 1 and train this hidden-layer candidate neuron again, repeating until h = m; if C_r < C_{r−1} is still not satisfied at h = m, training ends: this candidate is irrelevant to the classification, so it is discarded and the hidden neuron added immediately before it is taken as the output layer;
Step 8: determine the phoneme according to the output value of the output layer.
Further, in the speech generation method based on the DIVA neural network model, training the hidden-layer candidate neuron and computing its fitness function value C_r in step 6 is specifically:
(1) the data set formed by the normalized speech formant frequencies is divided into a training set, a validation set and a test set; the training set and validation set contain n_A and n_B samples respectively, divided so that n_A = n_B;
(2) using the three sets, the fitness value of the hidden-layer candidate neuron is computed as C_r = E_B(k) = ( Σ_{i=1..n_B} (e_Bi^(k))² )^{1/2}, where e_Bi^(k) = y_Bi − u_Bi^T W_{k−1} with y_Bi ∈ Y_B, Y_B is the target vector of the validation set, U_B (a matrix of p × 1 vectors) is the input presented by the validation set to the hidden neuron, W_{k−1} is the weight vector, and k is the iteration count, k = 0, 1, 2, 3, …, n, n a positive integer.
Further, in the speech generation method based on the DIVA neural network model, determining the phoneme from the output value of the output layer in step 8 is specifically: the output value of the output layer is a number in the interval 0 to 1, and the phoneme corresponding to the AGNN output value is determined from the value range assigned to each phoneme in the DIVA neural network model.
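The growth procedure of steps 1 to 7 can be illustrated with a minimal sketch. Two simplifying assumptions are made that are not in the patent: each candidate neuron is a plain least-squares linear unit rather than an iteratively trained neuron, and each ranked input is offered to exactly one new candidate; the data here is synthetic.

```python
import numpy as np

def rmse(y, y_hat):
    # Patent's fitness function: sqrt(sum_i (y_i - y_hat_i)^2 / n).
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def fit_unit(U, y):
    # "Train" a candidate neuron as a least-squares linear read-out of its inputs.
    W, *_ = np.linalg.lstsq(U, y, rcond=None)
    return W

def grow_agnn(Xtr, ytr, Xva, yva):
    """Grow hidden neurons one at a time (steps 2-7): each candidate sees x_1,
    every previously accepted hidden output, and the next ranked input x_h;
    it is kept only if its validation fitness C_r drops below C_{r-1}."""
    m = Xtr.shape[1]
    # Step 1: rank the input candidates by single-feature fitness (ascending).
    scores = [rmse(yva, Xva[:, [j]] @ fit_unit(Xtr[:, [j]], ytr)) for j in range(m)]
    order = np.argsort(scores)
    Xtr, Xva = Xtr[:, order], Xva[:, order]      # the sorted list X of the patent
    C_prev = scores[order[0]]                    # step 2: C_0 = S_i1
    hid_tr, hid_va = [], []                      # outputs of accepted hidden neurons
    for h in range(1, m):                        # steps 3-7, one ranked input per try
        Utr = np.column_stack([Xtr[:, 0], *hid_tr, Xtr[:, h]])
        Uva = np.column_stack([Xva[:, 0], *hid_va, Xva[:, h]])
        W = fit_unit(Utr, ytr)                   # step 6: train the candidate
        C_r = rmse(yva, Uva @ W)
        if C_r < C_prev:                         # accept as the r-th hidden neuron
            hid_tr.append(Utr @ W)
            hid_va.append(Uva @ W)
            C_prev = C_r
    return C_prev, len(hid_tr)                   # final fitness, hidden-neuron count
```

Because a candidate is accepted only when it strictly lowers the validation fitness, the returned fitness never exceeds that of the best single input, mirroring the narrow, minimally sized network the patent describes.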
Compared with the prior art, the invention has notable advantages. Because the adaptive growth neural network starts learning from a single input node, adjusts neuron weights according to external rules, and gradually adds new input nodes and new hidden neurons, the constructed AGNN is a narrow, deep network with close to the minimum number of input neurons, hidden neurons and connections. This effectively prevents overfitting of the network, keeps its computational cost low, and makes learning fast. The RBF network originally used in the DIVA model reaches about 80% classification precision on the samples, while the AGNN averages over 90%. For learning samples of ordinary difficulty, the original model takes 10–13 s for classification learning and speech generation, while the system refined with the AGNN model takes only 8–10 s under the same conditions, i.e. 2–3 s faster. For learning samples of medium difficulty and above, the AGNN-refined system performs even better: it is 4–5 s faster than the model before the improvement, and while the classification precision of the original system drops to 70–75%, the AGNN-refined system still maintains a high accuracy of 90% under equal conditions. Applying the adaptive growth neural network model to the DIVA model therefore yields higher pronunciation precision and faster learning.
Brief description of the drawings
Fig. 1 is the flow chart of the present invention;
Fig. 2 is the structural block diagram of the DIVA neural network model;
Fig. 3 is a schematic diagram of the AGNN structure used for classification in the embodiment.
Embodiment
The present invention is described in further detail below in conjunction with the drawings.
As shown in Fig. 1, the speech generation method based on the DIVA neural network model of the present invention comprises speech-sample extraction, speech-sample classification and learning, speech output, and correction of the output speech, and is characterized in that the speech-sample classification and learning uses an adaptive growth neural network (AGNN) to perform classification learning on the speech samples, specifically:
Step 1: convert the extracted speech formant frequencies to matrix form via a Jacobian transformation; the dimension of the feature vector of this matrix is the number m of input-layer candidate neurons. Compute the fitness function value of each input-layer candidate neuron and arrange the candidates in increasing order of fitness value, so that the list of input-layer candidate fitness values is S = {S_i1 ≤ S_i2 ≤ … ≤ S_im}, and place the candidate neurons in the corresponding order in a list X = (x_1, …, x_m). The fitness function is computed as S = sqrt( Σ_{i=1..n} (y_i − ŷ_i)² / n ), where y_i is the actual output value, ŷ_i is the desired value, and n is the number of samples in the data set (n a natural number);
Step 2: initialize the number of hidden neurons r = 0 and set C_0 = S_i1, where C_0 is the fitness function value when the number of hidden neurons r = 0;
Step 3: set r = r + 1 and p = r + 1, where r indexes the r-th hidden-layer candidate neuron, and generate a hidden-layer candidate neuron with p inputs;
Step 4: if r > 1, connect this hidden-layer candidate neuron to all previous hidden neurons and to input node x_1; otherwise connect it only to input node x_1;
Step 5: set the initial value of h, the position in list X of the next element to be connected to the newly added hidden-layer candidate neuron, to 2, where 2 ≤ h ≤ m and m, h are positive integers; connect the p-th input of this hidden-layer candidate neuron to the input node at position h of list X;
Step 6: train this hidden-layer candidate neuron and compute its fitness function value C_r. If C_r ≥ C_{r−1}, go to step 7; if C_r < C_{r−1}, connect this candidate into the network as the r-th hidden neuron and return to step 3, repeating steps 3 to 6 until all m input-layer nodes have been connected into the network or the condition is no longer met. The fitness function value C_r is computed as follows:
(1) the data set formed by the normalized speech formant frequencies is divided into a training set, a validation set and a test set; the training set and validation set contain n_A and n_B samples respectively, divided so that n_A = n_B;
(2) using the three sets, the fitness value of the hidden-layer candidate neuron is computed as C_r = E_B(k) = ( Σ_{i=1..n_B} (e_Bi^(k))² )^{1/2}, where e_Bi^(k) = y_Bi − u_Bi^T W_{k−1} with y_Bi ∈ Y_B, Y_B is the target vector of the validation set, U_B (a matrix of p × 1 vectors) is the input presented by the validation set to the hidden neuron, W_{k−1} is the weight vector, and k is the iteration count, k = 0, 1, 2, 3, …, n for positive integer n; the higher the required training precision, the larger the iteration count k.
Step 7: set h = h + 1 and train this hidden-layer candidate neuron again, repeating until h = m; if C_r < C_{r−1} is still not satisfied at h = m, training ends: this candidate is irrelevant to the classification, so it is discarded and the hidden neuron added immediately before it is taken as the output layer;
Step 8: determine the phoneme according to the output value of the output layer; the output value of the output layer is a number in the interval 0 to 1, and the phoneme corresponding to the AGNN output value is determined from the value range assigned to each phoneme in the DIVA neural network model.
Embodiment
As shown in Fig. 2, in this embodiment speech is first collected by a pronunciation device such as a microphone and passes through the speech-channel module with a given delay, which sends the formant frequencies of the speech to the cochlea module in vector form. The cochlea module computes the cochlear representation (spectrum) of the speech and sends the formant frequencies to the auditory cortex module. The auditory cortex module transmits the formant-frequency representation received from the cochlea module to the auditory-cortex classification and perception module. On receiving the speech, this module divides it into the basic units of speech, phonemes; the initialized phoneme targets reach, via the speech cell set module, the auditory and somatosensory results formed by the auditory cortex and somatosensory cortex modules respectively. The module identifies a speech fragment by comparing it with the stored phoneme representations, where each phoneme is represented by a numerical range between 0 and 1 stored in the speech cell set module. The identification process is as follows: the auditory-cortex classification and perception module matches the phoneme obtained from classification (i.e. the output value of the AGNN) one by one against the phoneme representations in the speech cell set; if no matching phoneme representation is found in the speech cell set, meaning this phoneme has not yet been learned, the speech cell set module creates a new phoneme representation in a specific region to represent the current phoneme. The relation between the phoneme targets output by the auditory-cortex classification and perception module and the speech cell set is one-to-one. Afterwards, the speech cell set module starts the generation of the phoneme fragment and sends the index of the phoneme target to be produced to the motor cortex, auditory cortex and somatosensory cortex modules. After receiving the phoneme target index from the speech cell set module, the motor cortex sends a control command to the vocal tract module; the vocal tract module computes the vocal-tract parameters for the received command and sends them to the sound device to produce the corresponding speech, while simultaneously sending the computed auditory effect and parameter configuration through the speech channel and sensory channel to the cochlea module and the sensory module respectively, forming feedback. After receiving the vocal-tract configuration information transmitted in vector form over the sensory channel, the sensory module computes the somatosensory result associated with that configuration and sends it to the somatosensory cortex module. The somatosensory cortex module then computes the difference between the cortical representation of the somatosensation and the somatosensory target, and sends the somatosensory error to the motor cortex module to correct the generated speech. After receiving the formant frequencies of the speech produced by the vocal tract module, transmitted in vector form through the speech channel, the cochlea module passes them to the auditory cortex module; the auditory cortex module computes the difference between this speech and the cortical representation of its target speech, and propagates the error to the motor cortex module to correct the generated speech.
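The feedback path just described (motor command → vocal tract → cochlea → auditory error → motor correction) can be sketched as a toy loop. Every function, gain and number below is an illustrative stand-in, not the DIVA implementation: the articulator is a made-up linear map and the formant targets are assumed values.

```python
import numpy as np

def vocal_tract(command):
    # Hypothetical articulator model: motor command -> formant frequencies.
    return 2.0 * command + 1.0

def cochlea(formants):
    # Cochlear representation; the identity is enough for this sketch.
    return formants

def auditory_error(produced, target):
    # Auditory cortex: difference between produced speech and its target.
    return produced - target

def motor_cortex(command, error, gain=0.3):
    # Correct the motor command with the fed-back auditory error.
    return command - gain * error

# Feedback loop of the embodiment: produce, compare, correct, repeat.
target = np.array([500.0, 1500.0, 2500.0])   # assumed formant targets (Hz)
command = np.zeros(3)
for _ in range(50):
    produced = cochlea(vocal_tract(command))
    command = motor_cortex(command, auditory_error(produced, target))
```

With any gain small enough that the correction is a contraction, the produced formants converge to the auditory target, which is the role the error feedback plays in the module description above.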
As shown in the table, the 29 phonemes stored in the speech cell set module of the existing DIVA neural network model each correspond to a numerical range. The classification result of the AGNN is a single number, and the value obtained represents a particular phoneme (the numerical interval into which the value falls identifies a specific phoneme).
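Since the table of 29 ranges is not reproduced here, the interval-to-phoneme lookup can be illustrated with a few hypothetical ranges; the phonemes and boundaries below are invented for the example and are not the patent's table.

```python
# Hypothetical value ranges standing in for the patent's table of 29 phonemes.
PHONEME_RANGES = {
    "/a/": (0.00, 0.30),
    "/i/": (0.30, 0.60),
    "/u/": (0.60, 1.00),
}

def phoneme_for(output_value):
    """Map an AGNN output in [0, 1] to the phoneme whose range contains it."""
    for phoneme, (lo, hi) in PHONEME_RANGES.items():
        if lo <= output_value < hi or (hi == 1.00 and output_value == 1.00):
            return phoneme
    # No stored representation matches: the speech cell set module would
    # create a new phoneme representation for this value.
    return None
```

The `None` branch mirrors the behaviour described above, where an unmatched output value causes the speech cell set module to create a new phoneme representation.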
As shown in Fig. 3, the learning rate η = 1.9 and Δ = 0.0015 are used, and the initial weights are drawn from a normal distribution.
From the input data set X, the dimension of the feature vector is computed to be m = 8 input-layer candidate neurons, using the formula S = sqrt( Σ (y_i − ŷ_i)² / n ), where y_i is the actual output value, ŷ_i is the desired value, and n is the number of samples in the data set. The fitness function value of each element of the input data set X is computed, the elements are arranged in increasing order of fitness, and the first 8 are chosen as candidate neurons: x_8, x_5, x_12, x_16, x_24, x_27, x_19, x_23, where the first input neuron x_8 has the smallest fitness function value, which is denoted C_0.
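The ranking step above can be sketched as follows. How a single input candidate is scored is not fully specified in the translated text, so this sketch assumes each candidate is scored by the fitness of a one-feature least-squares fit, on synthetic data with assumed sizes n = 40 and m = 8.

```python
import numpy as np

def fitness(y, y_hat, n):
    # S = sqrt( sum_i (y_i - y_hat_i)^2 / n ), as in the embodiment.
    return float(np.sqrt(np.sum((y - y_hat) ** 2) / n))

rng = np.random.default_rng(1)
n, m = 40, 8                        # assumed sample count and feature count
X = rng.normal(size=(n, m))
y = rng.normal(size=n)

# Score each input candidate by the error of a one-feature least-squares fit.
scores = []
for j in range(m):
    w, *_ = np.linalg.lstsq(X[:, [j]], y, rcond=None)
    scores.append(fitness(y, X[:, [j]] @ w, n))

order = np.argsort(scores)          # increasing fitness, as in list X
ranked = [f"x_{j + 1}" for j in order]  # ranked candidate input neurons
```

On the patent's real formant data this ranking would produce the specific order x_8, x_5, x_12, … quoted above; on the synthetic data here the order is arbitrary.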
A hidden-layer candidate neuron z_1 with 2 inputs is added. Its two inputs are connected to input-layer candidate neurons x_8 and x_5; the candidate is trained and its fitness value C_1 is computed. Comparing C_1 with C_0 gives C_1 < C_0, so z_1 joins the network as the 1st hidden neuron. A hidden-layer candidate neuron z_2 with 3 inputs is then added: its first 2 inputs are connected to the previous hidden neuron z_1 and to x_8, and its 3rd input to x_5. The candidate is trained, its fitness C_2 is computed, and since C_2 < C_1, z_2 joins the network as the 2nd hidden neuron. A hidden-layer candidate z_3 with 4 inputs is added: its first 3 inputs connect to hidden neurons z_1 and z_2 and to input node x_8, and its 4th input to x_5. After training, the computed fitness value is not less than C_2, so the 4th input is reconnected to x_12; training again gives C_3 with C_3 < C_2, and z_3 joins the network as the 3rd hidden neuron. z_4 is added as a hidden-layer candidate with 5 inputs: its first 4 inputs connect to z_1–z_3 and x_8 and its 5th to x_12, but the trained fitness is not less than C_3, so the 5th input is reconnected to x_16; retraining gives C_4 with C_4 < C_3, and z_4 joins as the 4th hidden neuron. Next z_5 is added with 6 inputs: the first 5 connect to z_1–z_4 and x_8 and the 6th to x_16; training gives C_5 with C_5 < C_4, so z_5 joins as the 5th hidden neuron. z_6 is added with 7 inputs: the first 6 connect to z_1–z_5 and x_8 and the 7th to x_16, but the trained fitness is not less than C_5, so the 7th input is reconnected to x_24; retraining gives C_6 with C_6 < C_5, and z_6 joins as the 6th hidden neuron. z_7 is added with 8 inputs: the first 7 connect to z_1–z_6 and x_8 and the 8th to x_24; training gives C_7 with C_7 < C_6, so z_7 joins as the 7th hidden neuron. z_8 is added with 9 inputs: the first 8 connect to z_1–z_7 and x_8 and the 9th to x_24, but the trained fitness is not less than C_7, so the 9th input is reconnected to x_27; retraining gives C_8 with C_8 < C_7, and z_8 joins as the 8th hidden neuron. z_9 is then added with 10 inputs: the first 9 connect to z_1–z_8 and x_8 and the 10th to x_27; training gives C_9 with C_9 < C_8, so z_9 joins as the 9th hidden neuron. z_10 is added with 11 inputs: the first 10 connect to z_1–z_9 and x_8 and the 11th to x_27, but the trained fitness is not less than C_9, so the 11th input is reconnected to x_19; retraining gives C_10 with C_10 < C_9, and z_10 joins as the 10th hidden neuron. z_11 is added with 12 inputs: the first 11 connect to z_1–z_10 and x_8 and the 12th to x_19; training gives C_11 with C_11 < C_10, so z_11 joins the network. z_12 is added with 13 inputs: its first 12 connect to z_1–z_11 and x_8 and the 13th to x_19, but the trained fitness is not less than C_11, so the 13th input is reconnected to x_23; retraining gives C_12 with C_12 < C_11, and z_12 joins the network as a hidden neuron. z_13 is added with 14 inputs: its first 13 connect to z_1–z_12 and x_8 and the 14th to x_23; training gives C_13 with C_13 < C_12, so z_13 joins the network as a hidden neuron. Finally z_14 is added with 15 inputs: its first 14 connect to z_1–z_13 and x_8 and the 15th to x_23, but the trained fitness is not less than C_13, and there are no further candidate inputs to connect, so z_14 is discarded and z_13 becomes the output neuron.
The network has thus selected 8 input features, 12 hidden neurons and 1 output neuron. The first hidden neuron is connected to input nodes x_8 and x_5; the inputs of the output neuron are connected to the outputs z_1–z_12 of the hidden neurons and to input nodes x_8 and x_23.

Claims (3)

1. A speech generation method based on a DIVA neural network model, comprising speech-sample extraction, speech-sample classification and learning, speech output, and correction of the output speech, characterized in that the speech-sample classification and learning uses an adaptive growth neural network to perform classification learning on the speech samples, specifically:
Step 1: convert the extracted speech formant frequencies to matrix form via a Jacobian transformation; the dimension of the feature vector of this matrix is the number m of input-layer candidate neurons. Compute the fitness function value of each input-layer candidate neuron and arrange the candidates in increasing order of fitness value, so that the list of input-layer candidate fitness values is S = {S_i1 ≤ S_i2 ≤ … ≤ S_im}, and place the candidate neurons in the corresponding order in a list X = (x_1, …, x_m). The fitness function is computed as S = sqrt( Σ_{i=1..n} (y_i − ŷ_i)² / n ), where y_i is the actual output value, ŷ_i is the desired value, and n is the number of samples in the data set (n a natural number);
Step 2: initialize the number of hidden neurons r = 0 and set C_0 = S_i1, where C_0 is the fitness function value when the number of hidden neurons r = 0;
Step 3: set r = r + 1 and p = r + 1, where r indexes the r-th hidden-layer candidate neuron, and generate a hidden-layer candidate neuron with p inputs;
Step 4: if r > 1, connect this hidden-layer candidate neuron to all previous hidden neurons and to input node x_1; otherwise connect it only to input node x_1;
Step 5: set the initial value of h, the position in list X of the next element to be connected to the newly added hidden-layer candidate neuron, to 2, where 2 ≤ h ≤ m and m, h are positive integers; connect the p-th input of this hidden-layer candidate neuron to the input node at position h of list X;
Step 6: train this hidden-layer candidate neuron and compute its fitness function value C_r; if C_r ≥ C_{r−1}, go to step 7; if C_r < C_{r−1}, connect this candidate into the network as the r-th hidden neuron and return to step 3, repeating steps 3 to 6 until all m input-layer nodes have been connected into the network or the condition is no longer met;
Step 7: set h = h + 1 and train this hidden-layer candidate neuron again, repeating until h = m; if C_r < C_{r−1} is still not satisfied at h = m, training ends: this candidate is irrelevant to the classification, so it is discarded and the hidden neuron added immediately before it is taken as the output layer;
Step 8: determine the phoneme according to the output value of the output layer.
2. The speech generation method based on the DIVA neural network model according to claim 1, characterized in that training the hidden-layer candidate neuron and computing its fitness function value C_r in step 6 is specifically:
(1) dividing the data set formed by the normalized speech formant frequencies into a training set, a validation set and a test set, the training set and validation set containing n_A and n_B samples respectively, divided so that n_A = n_B;
(2) computing, from the three sets, the fitness value of the hidden-layer candidate neuron as C_r = E_B(k) = ( Σ_{i=1..n_B} (e_Bi^(k))² )^{1/2}, where e_Bi^(k) = y_Bi − u_Bi^T W_{k−1} with y_Bi ∈ Y_B, Y_B being the target vector of the validation set, U_B (a matrix of p × 1 vectors) the input presented by the validation set to the hidden neuron, W_{k−1} the weight vector, and k the iteration count, k = 0, 1, 2, 3, …, n, n a positive integer.
3. The speech generation method based on the DIVA neural network model according to claim 1, characterized in that determining the phoneme from the output value of the output layer in step 8 is specifically: the output value of the output layer is a number in the interval 0 to 1, and the phoneme corresponding to the AGNN output value is determined from the value range assigned to each phoneme in the DIVA neural network model.
CN201210219670.3A 2012-06-28 2012-06-28 Voice generation method based on DIVA neural network model Active CN102789594B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210219670.3A CN102789594B (en) 2012-06-28 2012-06-28 Voice generation method based on DIVA neural network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210219670.3A CN102789594B (en) 2012-06-28 2012-06-28 Voice generation method based on DIVA neural network model

Publications (2)

Publication Number Publication Date
CN102789594A CN102789594A (en) 2012-11-21
CN102789594B true CN102789594B (en) 2014-08-13

Family

ID=47154995

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210219670.3A Active CN102789594B (en) 2012-06-28 2012-06-28 Voice generation method based on DIVA neural network model

Country Status (1)

Country Link
CN (1) CN102789594B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110119810B (en) * 2019-03-29 2023-05-16 华东师范大学 Human behavior dependency analysis method based on neural network
CN112861988B (en) * 2021-03-04 2022-03-11 西南科技大学 Feature matching method based on attention-seeking neural network
CN115565540B (en) * 2022-12-05 2023-04-07 浙江大学 Invasive brain-computer interface Chinese pronunciation decoding method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101650945A (en) * 2009-09-17 2010-02-17 浙江工业大学 Method for recognizing speaker based on multivariate core logistic regression model
CN102201236A (en) * 2011-04-06 2011-09-28 中国人民解放军理工大学 Speaker recognition method combining Gaussian mixture model and quantum neural network
CN102222501A (en) * 2011-06-15 2011-10-19 中国科学院自动化研究所 Method for generating duration parameter in speech synthesis


Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
A neural network model of speech acquisition and motor equivalent speech production; Frank H. Guenther; Biological Cybernetics; 1994-12-31; vol. 72, no. 1, pp. 43-53 *
Brain-computer interfaces for speech communication; J.S. Brumberg et al.; Speech Communication; 2010-12-31; vol. 52, no. 2, pp. 367-379 *
A new cerebellar model construction method applicable to the DIVA model; Zhang Shaobai et al.; Proceedings of the 2009 Chinese Control and Decision Conference; 2009-06-17; pp. 954-959 *
Research on the influence of speech rate on speech production in the DIVA model; Liu Yanyan et al.; Computer Technology and Development; 2011-12-10; vol. 21, no. 12, pp. 33-35, 40 *
Application of an improved pseudo-inverse control scheme in the DIVA model; Zhang Xin et al.; Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition); 2012; vol. 32, no. 3, pp. 81-85 *

Also Published As

Publication number Publication date
CN102789594A (en) 2012-11-21

Similar Documents

Publication Publication Date Title
CN105139864B (en) Audio recognition method and device
CN105741832B (en) Spoken language evaluation method and system based on deep learning
CN105513591B (en) The method and apparatus for carrying out speech recognition with LSTM Recognition with Recurrent Neural Network model
CN107132516B (en) A kind of Radar range profile&#39;s target identification method based on depth confidence network
CN105279555B (en) A kind of adaptive learning neural network implementation method based on evolution algorithm
Wan et al. Day-ahead prediction of wind speed with deep feature learning
CN105243398A (en) Method of improving performance of convolutional neural network based on linear discriminant analysis criterion
CN106297792A (en) The recognition methods of a kind of voice mouth shape cartoon and device
CN111259750A (en) Underwater sound target identification method for optimizing BP neural network based on genetic algorithm
CN105023570B (en) A kind of method and system for realizing sound conversion
CN106683666A (en) Field adaptive method based on deep neural network (DNN)
CN109376933A (en) Lithium ion battery negative material energy density prediction technique neural network based
CN105259331A (en) Uniaxial strength forecasting method for jointed rock mass
CN107293290A (en) The method and apparatus for setting up Speech acoustics model
CN108109615A (en) A kind of construction and application method of the Mongol acoustic model based on DNN
CN102789594B (en) Voice generation method based on DIVA neural network model
CN107862329A (en) A kind of true and false target identification method of Radar range profile&#39;s based on depth confidence network
Haikun et al. Speech recognition model based on deep learning and application in pronunciation quality evaluation system
CN108461080A (en) A kind of Acoustic Modeling method and apparatus based on HLSTM models
Shibata et al. Analytic automated essay scoring based on deep neural networks integrating multidimensional item response theory
CN106611599A (en) Voice recognition method and device based on artificial neural network and electronic equipment
CN113378581B (en) Knowledge tracking method and system based on multivariate concept attention model
CN103680491A (en) Speed dependent prosodic message generating device and speed dependent hierarchical prosodic module
Weihong et al. Optimization of BP neural network classifier using genetic algorithm
CN113743083A (en) Test question difficulty prediction method and system based on deep semantic representation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20121121

Assignee: Jiangsu Nanyou IOT Technology Park Ltd.

Assignor: NANJING University OF POSTS AND TELECOMMUNICATIONS

Contract record no.: 2016320000207

Denomination of invention: Voice generation method based on DIVA neural network model

Granted publication date: 20140813

License type: Common License

Record date: 20161109

LICC Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model
EC01 Cancellation of recordation of patent licensing contract

Assignee: Jiangsu Nanyou IOT Technology Park Ltd.

Assignor: NANJING University OF POSTS AND TELECOMMUNICATIONS

Contract record no.: 2016320000207

Date of cancellation: 20180116

EC01 Cancellation of recordation of patent licensing contract
TR01 Transfer of patent right

Effective date of registration: 20180517

Address after: Rooms 1101 and 1102, No. 374 Beijing Road, Yuexiu District, Guangzhou, Guangdong 510030 (for office use only).

Patentee after: GUANGZHOU ZIB ARTIFICIAL INTELLIGENCE TECHNOLOGY CO.,LTD.

Address before: Floors 1-4 of the podium of buildings B1 and B2, No. 231 and No. 233 Science Avenue, Guangzhou, Guangdong 510000.

Patentee before: BOAO ZONGHENG NETWORK TECHNOLOGY Co.,Ltd.

Effective date of registration: 20180517

Address after: Floors 1-4 of the podium of buildings B1 and B2, No. 231 and No. 233 Science Avenue, Guangzhou, Guangdong 510000.

Patentee after: BOAO ZONGHENG NETWORK TECHNOLOGY Co.,Ltd.

Address before: No. 66 Xinmofan Road, Gulou District, Nanjing, Jiangsu 210003

Patentee before: NANJING University OF POSTS AND TELECOMMUNICATIONS

TR01 Transfer of patent right