Summary of the invention
The object of the present invention is to provide a speech production method based on the DIVA neural network model that has high pronunciation precision and a fast learning speed.
The technical solution that realizes the object of the invention is: a speech production method based on the DIVA neural network model, comprising speech sample extraction, speech sample classification and learning, speech output, and correction of the output speech, wherein the speech sample classification and learning uses an adaptive growing neural network (AGNN) to perform classification learning on the speech samples, specifically:
Step 1: convert the extracted speech formant frequencies into matrix form by means of the Jacobian; the dimension of the feature vector of this matrix is the number m of input-layer candidate neurons. Compute the fitness function value of each input-layer candidate neuron and arrange the candidate neurons in increasing order of fitness function value; the corresponding list of input-layer candidate neuron fitness function values is S = {S_i1 ≤ S_i2 ≤ ... ≤ S_im}, and the candidate neurons are placed in the list X in the corresponding order, X = (x_1, ..., x_m). The fitness function is computed from the actual output values y_i, the corresponding desired values, and n, the number of samples in the data set (n is a natural number);
Step 2: initialize the number of hidden neurons r = 0 and set C_0 = S_i1, where C_0 is the fitness function value when the number of hidden neurons r = 0;
Step 3: set r = r + 1 and p = r + 1, where r denotes the r-th hidden-layer candidate neuron; generate a hidden-layer candidate neuron with p inputs;
If step 4 r>1, is connected respectively to hidden neurons all before it and input node x by this hidden layer candidate neuron
1on; Otherwise this hidden layer candidate neuron is only connected to input node x
1on;
Step 5: set the initial value of h, the position in list X of the next element to be connected to the newly added hidden-layer candidate neuron, to 2, where 2 ≤ h ≤ m and m, h are positive integers; connect the p-th input of this hidden-layer candidate neuron to the input node at position h in list X;
Step 6: train this hidden-layer candidate neuron and compute its fitness function value C_r. If C_r ≥ C_(r-1), go to step 7; if C_r < C_(r-1), connect this hidden-layer candidate neuron into the network as the r-th hidden neuron and return to steps 3 to 6, until this condition is no longer met or all m input-layer nodes have been connected into the network;
Step 7: set h = h + 1 and train this hidden-layer candidate neuron again; if, when h = m, C_r < C_(r-1) is still not satisfied, finish training: this hidden-layer candidate neuron is irrelevant to the classification and is discarded, and the hidden neuron preceding this hidden-layer candidate neuron is taken as the output layer;
Step 8: determine the phoneme according to the output value of the output layer.
Further, in the speech production method based on the DIVA neural network model of the present invention, training this hidden-layer candidate neuron and computing its fitness function value C_r in step 6 is specifically:
(1) divide the data set formed by the normalized speech formant frequencies into a training set, a validation set, and a test set, where the numbers of samples in the training set and the validation set are n_a and n_b respectively, divided according to n_a = n_b;
(2) according to the three sets obtained by the division, compute the fitness function value C_r of the hidden-layer candidate neuron using the fitness formula evaluated over the validation samples i = 1, ..., n_b, where n_b is the number of samples in the validation set. Here y_b ∈ Y_b, Y_b is the target vector of the validation set, U_b is the input of the validation set to the hidden neuron, U_b being a matrix of p × 1 vectors, W_(k-1) is the weight vector, and k is the iteration count with range k = 0, 1, 2, 3, ..., n, where n is a positive integer.
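A minimal sketch of one way the validation-based fitness value C_r could be evaluated is given below. It assumes a sigmoid candidate neuron trained by plain gradient descent on the training set and a root-mean-square error over the validation set as the fitness measure; the function names, the training rule, and the error measure are illustrative assumptions, not the patented formula.

```python
import numpy as np

def train_candidate(U_a, y_a, lr=0.1, iters=200, rng=None):
    """Train one candidate neuron (sigmoid unit) on the training set.
    U_a: (n_a, p) inputs, y_a: (n_a,) targets. Returns the weight vector W."""
    rng = np.random.default_rng() if rng is None else rng
    W = rng.normal(size=U_a.shape[1])            # initial weights drawn from a normal distribution
    for _ in range(iters):                       # k = 1..iters gradient steps
        out = 1.0 / (1.0 + np.exp(-U_a @ W))     # sigmoid activation
        grad = U_a.T @ ((out - y_a) * out * (1.0 - out)) / len(y_a)
        W -= lr * grad
    return W

def fitness(W, U_b, y_b):
    """Fitness C_r evaluated on the validation set (lower is better);
    an RMS error over the n_b validation samples is assumed here."""
    out = 1.0 / (1.0 + np.exp(-U_b @ W))
    return float(np.sqrt(np.mean((out - y_b) ** 2)))
```

Evaluating the candidate on a held-out validation set rather than on the training set is what lets the growth procedure stop adding neurons before the network overfits.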
Further, in the speech production method based on the DIVA neural network model of the present invention, determining the phoneme according to the output value of the output layer in step 8 is specifically: the output value of the output layer is a number in the interval from 0 to 1, and the phoneme corresponding to the AGNN output value is determined according to the value range assigned to each phoneme in the DIVA neural network model.
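The mapping from an output value to a phoneme can be pictured with the short sketch below; the numeric ranges and phoneme labels are purely illustrative, the actual 29 ranges being those of the DIVA model table in the embodiment.

```python
def decode_phoneme(output_value, phoneme_ranges):
    """Map an AGNN output value in [0, 1] to a phoneme label.
    phoneme_ranges: list of (low, high, label) tuples; illustrative values only."""
    for low, high, label in phoneme_ranges:
        if low <= output_value < high:
            return label
    return None  # no stored phoneme range contains this value

# Illustrative ranges only (the real table assigns one range per phoneme):
ranges = [(0.00, 0.04, "b"), (0.04, 0.08, "d"), (0.08, 0.12, "g")]
print(decode_phoneme(0.05, ranges))  # -> "d"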
Compared with the prior art, the present invention has notable advantages. Because the adaptive growing neural network model starts learning from a single input node, adjusts the neuron weights according to external rules, and gradually adds new input nodes and new hidden neurons, the constructed AGNN is a narrow and deep network with close to the minimal number of input neurons, hidden neurons, and network connections; this effectively prevents overfitting of the network, keeps its computational cost low, and makes learning fast. On the classification accuracy of the samples, the RBF network originally used in the DIVA model reaches about 80% on average, while the AGNN reaches more than 90%. For learning samples of ordinary difficulty, classification learning and speech generation with the original model takes 10 s to 13 s, whereas under the same conditions the system improved with the AGNN model takes only 8 s to 10 s for the same process, i.e. 2 s to 3 s faster. For learning samples of medium difficulty and above, the system improved with the AGNN model performs even better: it is 4 s to 5 s faster than the model before improvement, and while the classification accuracy of the original system drops to 70%-75%, the AGNN-improved system still maintains a high accuracy of about 90% under the same conditions. It can be seen that applying the adaptive growing neural network model to the DIVA model gives the model higher pronunciation precision and a faster learning speed.
Embodiment
The present invention is described in further detail below with reference to the accompanying drawings.
As shown in Figure 1, a speech production method based on the DIVA neural network model of the present invention comprises speech sample extraction, speech sample classification and learning, speech output, and correction of the output speech, and is characterized in that the speech sample classification and learning uses an adaptive growing neural network (AGNN) to perform classification learning on the speech samples, specifically:
Step 1: convert the extracted speech formant frequencies into matrix form by means of the Jacobian; the dimension of the feature vector of this matrix is the number m of input-layer candidate neurons. Compute the fitness function value of each input-layer candidate neuron and arrange the candidate neurons in increasing order of fitness function value; the corresponding list of input-layer candidate neuron fitness function values is S = {S_i1 ≤ S_i2 ≤ ... ≤ S_im}, and the candidate neurons are placed in the list X in the corresponding order, X = (x_1, ..., x_m). The fitness function is computed from the actual output values y_i, the corresponding desired values, and n, the number of samples in the data set (n is a natural number);
Step 2: initialize the number of hidden neurons r = 0 and set C_0 = S_i1, where C_0 is the fitness function value when the number of hidden neurons r = 0;
Step 3: set r = r + 1 and p = r + 1, where r denotes the r-th hidden-layer candidate neuron; generate a hidden-layer candidate neuron with p inputs;
If step 4 r>1, is connected respectively to hidden neurons all before it and input node x by this hidden layer candidate neuron
1on; Otherwise this hidden layer candidate neuron is only connected to input node x
1on;
Step 5: set the initial value of h, the position in list X of the next element to be connected to the newly added hidden-layer candidate neuron, to 2, where 2 ≤ h ≤ m and m, h are positive integers; connect the p-th input of this hidden-layer candidate neuron to the input node at position h in list X;
Step 6: train this hidden-layer candidate neuron and compute its fitness function value C_r. If C_r ≥ C_(r-1), go to step 7; if C_r < C_(r-1), connect this hidden-layer candidate neuron into the network as the r-th hidden neuron and return to steps 3 to 6, until this condition is no longer met or all m input-layer nodes have been connected into the network. The fitness function value C_r is computed specifically as follows:
(1) divide the data set formed by the normalized speech formant frequencies into a training set, a validation set, and a test set, where the numbers of samples in the training set and the validation set are n_a and n_b respectively, divided according to n_a = n_b;
(2) according to the three sets obtained by the division, compute the fitness function value C_r of the hidden-layer candidate neuron using the fitness formula evaluated over the validation samples i = 1, ..., n_b, where n_b is the number of samples in the validation set. Here y_b ∈ Y_b, Y_b is the target vector of the validation set, U_b is the input of the validation set to the hidden neuron, U_b being a matrix of p × 1 vectors, W_(k-1) is the weight vector, and k is the iteration count with range k = 0, 1, 2, 3, ..., n, where n is a positive integer; the higher the required training precision, the larger the iteration count k.
Step 7: set h = h + 1 and train this hidden-layer candidate neuron again; if, when h = m, C_r < C_(r-1) is still not satisfied, finish training: this hidden-layer candidate neuron is irrelevant to the classification and is discarded, and the hidden neuron preceding this hidden-layer candidate neuron is taken as the output layer;
Step 8: determine the phoneme according to the output value of the output layer; the output value of the output layer is a number in the interval from 0 to 1, and the phoneme corresponding to the AGNN output value is determined according to the value range assigned to each phoneme in the DIVA neural network model.
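The growth procedure of steps 1 to 8 can be outlined as in the following simplified sketch. It assumes the illustrative helpers train_candidate and fitness from the earlier sketch (lower fitness is better), an assumed helper build_sets that assembles training and validation data for a given wiring, and 0-based list indexing; it is an outline of the procedure, not a literal reproduction of the patented implementation.

```python
def grow_agnn(X, S, build_sets, train_candidate, fitness):
    """Grow a narrow, deep cascade network following steps 1-8 (sketch).
    X: input-node labels sorted by increasing fitness values S (step 1).
    build_sets(wiring) -> (U_a, y_a, U_b, y_b) is an assumed data helper."""
    m = len(X)
    hidden = []                  # accepted hidden neurons: (label, wiring, weights)
    C_prev = S[0]                # step 2: r = 0, C_0 = S_i1
    h = 1                        # step 5: index in X of the candidate's last input
    while h < m:
        r = len(hidden) + 1
        # steps 3-5: candidate z_r with p = r + 1 inputs: the previous hidden
        # neurons z_1..z_(r-1), input node x_1 (= X[0]), and the node X[h]
        wiring = [label for label, *_ in hidden] + [X[0], X[h]]
        U_a, y_a, U_b, y_b = build_sets(wiring)
        W = train_candidate(U_a, y_a)            # step 6: train the candidate
        C_r = fitness(W, U_b, y_b)
        if C_r < C_prev:                         # improvement: accept as neuron r
            hidden.append((f"z{r}", wiring, W))
            C_prev = C_r
        else:                                    # step 7: rewire last input to X[h+1]
            h += 1
    # termination (step 7): the last accepted hidden neuron serves as the output layer
    return hidden
```

Because every accepted neuron feeds all later candidates, the resulting network is narrow and deep with close to the minimal number of connections, which is the property the advantage paragraph above relies on.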
Embodiment
As shown in Figure 2, in this embodiment, speech is first collected by a pronunciation device such as a microphone and passes through the auditory channel module with a given delay, and the formant frequencies of the speech are sent to the cochlea module in vector form. The cochlea module computes the cochlear representation (spectrum) of this speech and sends the formant frequencies to the auditory cortex module. The auditory cortex module transmits the speech, represented by the formant frequencies delivered by the cochlea module, to the auditory cortex classification sensing module. After receiving this speech, the auditory cortex classification sensing module divides it into the basic units of speech, namely phonemes; the initialized phoneme targets are output via the voice cell collection module (the speech sound map of the DIVA model) to the auditory and somatosensory results formed respectively by the auditory cortex and somatosensory cortex modules. This module identifies speech fragments by comparing the fragments received from the auditory cortex module with the stored phoneme representations, where each phoneme is represented by a numerical range between 0 and 1 stored in the voice cell collection module. The identification process is specifically: the auditory cortex classification sensing module matches the separated phonemes (that is, the output values of the AGNN) one by one with the phoneme representations in the voice cell collection; if no matching phoneme representation is found in the voice cell collection, this phoneme has not yet been learned, and the voice cell collection module creates a new phoneme representation in a dedicated region to represent the current phoneme. The relationship between the phoneme targets output by the auditory cortex classification sensing module and the voice cell collection is one to one. Afterwards, the voice cell collection module starts the generation of phoneme fragments and sends the index of the phoneme target to be produced to the motor cortex, auditory cortex, and somatosensory cortex modules. After receiving the phoneme target index from the voice cell collection module, the motor cortex sends control commands to the vocal tract module; the vocal tract module computes the vocal tract parameters for the received control commands and sends them to the sound device to produce the corresponding speech, and at the same time the vocal tract module sends the computed auditory effect and parameter configuration to the cochlea module and the sensory module through the auditory channel and the sensory channel respectively, forming feedback. After receiving the vocal tract configuration information transmitted in vector form through the sensory channel, the sensory module computes the somatosensory results related to the vocal tract configuration and sends them to the somatosensory cortex module. The somatosensory cortex module then computes the difference between the cortical representation of the somatosensation and the input somatosensory target and sends the somatosensory error to the motor cortex module in order to correct the generated speech. After receiving the formant frequencies of the speech produced by the vocal tract module, transmitted in vector form through the auditory channel, the cochlea module passes them to the auditory cortex module; the auditory cortex module computes the difference between this speech and the cortical representation of its target speech and propagates the error to the motor cortex module in order to correct the generated speech.
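The matching-or-creation step performed on the AGNN output might look like the sketch below. The class and method names are illustrative stand-ins for the voice cell collection module, and the range width used when a new phoneme representation is created is an assumed placeholder.

```python
class SpeechSoundMap:
    """Illustrative stand-in for the voice cell collection (speech sound map) module:
    it stores one numerical range in [0, 1] per learned phoneme."""
    def __init__(self):
        self.phonemes = {}            # label -> (low, high) range

    def identify(self, value):
        """Return the stored phoneme whose range contains the AGNN output value,
        or create a new representation if this phoneme has not been learned yet."""
        for label, (low, high) in self.phonemes.items():
            if low <= value < high:
                return label
        # no match: create a new phoneme representation in a dedicated region
        label = f"phoneme_{len(self.phonemes) + 1}"
        width = 0.02                  # assumed, illustrative range width
        self.phonemes[label] = (value - width / 2, value + width / 2)
        return label
```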
As shown in the table below, the 29 phonemes stored in the voice cell collection module of the existing DIVA neural network model each correspond to a numerical range. The classification result of the AGNN is a single value, and the value obtained represents a particular phoneme: the numerical interval into which it falls determines which phoneme it represents.
As shown in Figure 3, the learning rate is taken as η = 1.9 and Δ = 0.0015, and the initial weights are drawn from a normal distribution.
According to the input data set X, the dimension of its feature vector gives the number of input-layer candidate neurons m = 8. Using the fitness formula, with y_i the actual output value, the corresponding desired value, and n the number of samples in the data set, the fitness function value of each element in the input data set X is computed; the elements are arranged in order of increasing fitness function value, and the first 8 are chosen in turn as candidate neurons, namely x_8, x_5, x_12, x_16, x_24, x_27, x_19, and x_23, where the first input neuron x_8 has the smallest fitness function value, which is denoted C_0.
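One way the input candidates could be ranked is sketched below. It assumes each feature is scored individually by training a one-input sigmoid neuron with the learning rate η = 1.9 from Figure 3 and measuring its RMS error; the per-feature scoring scheme, the training rule, and the function names are assumptions for illustration only.

```python
import numpy as np

def rank_input_candidates(X_data, y, m=8, lr=1.9, iters=200):
    """Rank input features by fitness and keep the best m as candidate inputs.
    X_data: (n_samples, n_features) normalized formant data; y: (n_samples,) targets."""
    rng = np.random.default_rng(0)
    scores = []
    for j in range(X_data.shape[1]):
        u = X_data[:, j:j + 1]
        w = rng.normal(size=1)                 # normally distributed initial weight
        for _ in range(iters):
            out = 1.0 / (1.0 + np.exp(-u @ w))
            w -= lr * (u.T @ ((out - y) * out * (1.0 - out))) / len(y)
        out = 1.0 / (1.0 + np.exp(-u @ w))
        scores.append(np.sqrt(np.mean((out - y) ** 2)))
    order = np.argsort(scores)[:m]             # increasing fitness, keep the first m
    return list(order), [scores[j] for j in order]
```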
Add a hidden-layer candidate neuron z_1 with 2 inputs. Its two inputs are connected to the input-layer candidate neurons x_8 and x_5. Train this hidden-layer candidate neuron and then compute its fitness function value C_1; comparing C_1 with C_0 gives C_1 < C_0, so z_1 joins the network as the 1st hidden neuron. Add another hidden-layer candidate neuron z_2 with 3 inputs. Its first 2 inputs are connected to the preceding hidden neuron z_1 and to x_8, and its 3rd input is connected to x_5; train this candidate hidden neuron and compute its fitness function value C_2. Comparing C_2 with C_1 gives C_2 < C_1, so z_2 joins the network as the 2nd hidden neuron. Add a hidden-layer candidate neuron z_3 with 4 inputs. Its first 3 inputs are connected to the hidden neurons z_1 and z_2 and to input node x_8, and its 4th input is connected to x_5; train this candidate neuron and compute its fitness function value, but the value obtained is not less than C_2, so the 4th input is reconnected to x_12. Training this candidate neuron again now gives the fitness function value C_3, with C_3 < C_2, so z_3 joins the network as the 3rd hidden neuron.

z_4 is added as a hidden-layer candidate neuron with 5 inputs; its first 4 inputs are connected to z_1 to z_3 and to x_8, and its 5th input is connected to x_12. Train this candidate neuron and compute its fitness function value, but it is not less than C_3, so the 5th input is reconnected to x_16; training this candidate neuron and computing its fitness function value gives C_4, and because C_4 < C_3, z_4 joins the network as the 4th hidden neuron. Then z_5 is added as a hidden-layer candidate neuron with 6 inputs; its first 5 inputs are connected to z_1 to z_4 and to x_8, and its 6th input is connected to x_16. Train this candidate neuron and compute its fitness function value C_5; because C_5 < C_4, z_5 joins the network as the 5th hidden neuron. Continue by adding z_6 as a hidden-layer candidate neuron with 7 inputs; its first 6 inputs are connected to z_1 to z_5 and to x_8, and its 7th input is connected to x_16. Train this candidate neuron and compute its fitness function value, but it is not less than C_5, so the 7th input is reconnected to x_24; training this candidate neuron and computing its fitness function value gives C_6, and because C_6 < C_5, z_6 joins the network as the 6th hidden neuron. Next z_7 is connected into the network as a hidden-layer candidate neuron with 8 inputs; its first 7 inputs are connected to z_1 to z_6 and to x_8, and its 8th input is connected to x_24. Train this candidate neuron and compute its fitness function value C_7; since C_7 < C_6, z_7 joins the network as the 7th hidden neuron. Then z_8 is added as a hidden-layer candidate neuron with 9 inputs; its first 8 inputs are connected to z_1 to z_7 and to x_8, and its 9th input is connected to x_24. Train this hidden-layer candidate neuron and compute its fitness function value, but it is not less than C_7, so the 9th input is reconnected to x_27; training this hidden-layer candidate neuron and computing its fitness function value gives C_8, and because C_8 < C_7, z_8 joins the network as the 8th hidden neuron.

Next z_9 is added to the network as a hidden-layer candidate neuron with 10 inputs; its first 9 inputs are connected to z_1 to z_8 and to x_8, and its 10th input is connected to x_27. Train this hidden-layer candidate neuron and compute its fitness function value C_9; because C_9 < C_8, z_9 joins the network as the 9th hidden neuron. Continue by adding z_10 as a hidden-layer candidate neuron with 11 inputs; its first 10 inputs are connected to z_1 to z_9 and to x_8, and its 11th input is connected to x_27. Train this hidden-layer candidate neuron and compute its fitness function value, but it is not less than C_9, so the 11th input is reconnected to x_19; training this hidden-layer candidate neuron and computing its fitness function value gives C_10, and because C_10 < C_9, z_10 joins the network as the 10th hidden neuron. Then add z_11 as a hidden-layer candidate neuron with 12 inputs; its first 11 inputs are connected to z_1 to z_10 and to x_8, and its 12th input is connected to x_19. Train this hidden-layer candidate neuron and compute its fitness function value C_11; because C_11 < C_10, z_11 joins the network. Add z_12 as a hidden-layer candidate neuron with 13 inputs; its first 12 inputs are connected to z_1 to z_11 and to x_8, and its 13th input is connected to x_19. Train this hidden-layer candidate neuron and compute its fitness function value, but it is not less than C_11, so the 13th input is reconnected to x_23; training this hidden-layer candidate neuron and computing its fitness function value gives C_12, and because C_12 < C_11, z_12 joins the network as a hidden neuron. Add z_13 as a hidden-layer candidate neuron with 14 inputs; its first 13 inputs are connected to z_1 to z_12 and to x_8, and its 14th input is connected to x_23. Train this hidden-layer candidate neuron and compute its fitness function value C_13; since C_13 < C_12, z_13 joins the network as a hidden neuron. Add z_14 as a hidden-layer candidate neuron with 15 inputs; its first 14 inputs are connected to z_1 to z_13 and to x_8, and its 15th input is connected to x_23. Train this hidden-layer candidate neuron and compute its fitness function value, but it is not less than C_13, and there is no further candidate input node left to connect, so z_14 is discarded and z_13 serves as the output neuron.
The network has thus selected 8 input features, 12 hidden neurons, and 1 output neuron. The first hidden neuron is connected to the input nodes x_8 and x_5. The inputs of the output neuron are connected to the outputs of the hidden neurons z_1 to z_12 and to the input nodes x_8 and x_23.
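Assuming sigmoid units, the forward pass of the resulting cascade (output neuron z_13 fed by z_1 to z_12, x_8, and x_23; each hidden neuron z_r fed by z_1 to z_(r-1), x_8, and one further input node) might be organized as in the following sketch; the weight values and input values are placeholders, and the wiring is reconstructed from the worked example above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cascade_forward(x, wiring, weights):
    """Evaluate the grown cascade for one sample.
    x: dict of input-node values, e.g. {"x8": 0.3, "x5": 0.7, ...}.
    wiring: per neuron, the ordered list of node labels feeding it.
    weights: per neuron, a weight vector matching its wiring."""
    values = dict(x)
    for r, (inputs, w) in enumerate(zip(wiring, weights), start=1):
        u = np.array([values[name] for name in inputs])
        values[f"z{r}"] = sigmoid(u @ w)
    return values[f"z{len(wiring)}"]      # the last neuron (z13 here) is the output

# Wiring reconstructed from the embodiment: z1..z12 hidden, z13 output
wiring = [["x8", "x5"]]
extras = ["x5", "x12", "x16", "x16", "x24", "x24", "x27", "x27", "x19", "x19", "x23", "x23"]
for r, extra in enumerate(extras, start=2):
    wiring.append([f"z{j}" for j in range(1, r)] + ["x8", extra])

rng = np.random.default_rng(0)
weights = [rng.normal(size=len(inputs)) for inputs in wiring]     # placeholder weights
sample = {f"x{j}": 0.5 for j in (8, 5, 12, 16, 24, 27, 19, 23)}   # placeholder inputs
print(cascade_forward(sample, wiring, weights))                   # value in (0, 1)
```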