WO2021208455A1 - Neural network speech recognition method and system oriented to home spoken environment - Google Patents

Neural network speech recognition method and system oriented to home spoken environment

Info

Publication number
WO2021208455A1
WO2021208455A1 PCT/CN2020/133554 CN2020133554W WO2021208455A1 WO 2021208455 A1 WO2021208455 A1 WO 2021208455A1 CN 2020133554 W CN2020133554 W CN 2020133554W WO 2021208455 A1 WO2021208455 A1 WO 2021208455A1
Authority
WO
WIPO (PCT)
Prior art keywords
chinese
english
model
vector set
output
Prior art date
Application number
PCT/CN2020/133554
Other languages
French (fr)
Chinese (zh)
Inventor
张晖
程铭
赵海涛
孙雁飞
倪艺洋
朱洪波
Original Assignee
南京邮电大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京邮电大学
Priority to JP2021551834A (patent JP7166683B2)
Publication of WO2021208455A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training

Definitions

  • the invention belongs to the technical field of intelligent recognition, and specifically relates to a neural network speech recognition method and system oriented to the home spoken environment.
  • the key object of speech recognition research is speech.
  • the speech signal is converted into information that can be recognized by the computer, so as to recognize the speaker's voice commands and text content.
  • although the first method appeared earlier, it has not yet reached a practical stage because of the complexity of its models; in the second method, the hidden Markov model is the most widely used: it is a probabilistic model for labeling problems that describes how observation sequences are randomly generated, and it has greatly advanced speech recognition technology.
  • the third method trains shallow neural networks, which easily causes gradient instability; manual extraction of sample features is time-consuming and laborious, and the recognition results are not very good.
  • the GMM-HMM acoustic modeling method is the most widely used in practice, but when dealing with complex speech signals in the home environment, the application scenarios of the traditional model are rather limited.
  • the present invention provides a neural network speech recognition method oriented to the spoken environment of the home, which can solve the problems of low speech recognition rate and poor recognition efficiency.
  • the present invention also provides a neural network speech recognition system oriented to the home spoken environment.
  • the neural network speech recognition method oriented to the spoken language environment at home includes:
  • Model construction: add a long short-term memory network to the deep neural network to build a combined neural network DNN-LSTM model;
  • Chinese speech data training: preprocess the collected Chinese speech data set to obtain a Chinese feature vector set, and use the Chinese feature vector set as the input of the DNN-LSTM model for iterative training until the optimal Chinese acoustic model is obtained;
  • English speech data training: preprocess the collected English speech data set to obtain an English feature vector set, and use the English feature vector set as the input of the DNN-LSTM model for iterative training until the optimal English acoustic model is obtained;
  • the combined neural network DNN-LSTM model includes an input layer, a long short-term memory network, a second hidden layer, a third hidden layer, a fourth hidden layer, and an output layer, and the long short-term memory network serves as the first hidden layer.
  • the first hidden layer has 512 nodes, and its activation functions are the sigmoid function and the tanh function; the second, third and fourth hidden layers each have 1024 nodes, and their activation function is the sigmoid function.
  • the Chinese feature vector set is used as the input of the DNN-LSTM model for iterative training, and the training steps include:
  • Performing language matching according to the Chinese output probability vector set and the English output probability vector set includes:
  • if the probability vector set output by the Chinese acoustic model contains a value p_i that is significantly larger than the other probability values, while the probability values in the set output by the English acoustic model differ little from one another, where 1 ≤ i ≤ q;
  • the corresponding input speech signal voice0 of the unknown language is Chinese, and the output probability of the Chinese acoustic model is used as the final output result;
  • the corresponding input speech signal voice0 of the unknown language is English, and the output probability of the English acoustic model is used as the final output result.
  • the present invention also provides a system implemented by the above-mentioned neural network speech recognition method for the spoken language environment at home, which includes:
  • Model building module: used to add a long short-term memory network to the deep neural network to build the combined neural network DNN-LSTM model
  • the model training module includes a Chinese model training unit and an English model training unit.
  • the Chinese model training unit is used to preprocess the collected Chinese speech data set to obtain a Chinese feature vector set, and to use the Chinese feature vector set as the input of the DNN-LSTM model for iterative training until the optimal Chinese acoustic model is obtained;
  • the English model training unit is used to preprocess the collected English speech data set to obtain an English feature vector set, and to use the English feature vector set as the input of the DNN-LSTM model for iterative training until the optimal English acoustic model is obtained;
  • the model testing module includes a voice input unit and a voice type judgment unit.
  • the voice input unit is used to pass an input speech signal voice0 in an unknown language through the Chinese acoustic model and the English acoustic model, respectively, to obtain the Chinese output probability vector set and the English output probability vector set;
  • the voice type judgment unit is configured to perform language matching according to the Chinese output probability vector set and the English output probability vector set, and output the judgment result.
  • the present invention combines the ability of the LSTM to record long historical information in its memory unit with the ability of the DNN to effectively extract high-level features from the data.
  • it proposes adding an LSTM as the first hidden layer of the DNN, builds a combined neural network of DNN and LSTM for acoustic modeling, and trains and tests it on a Chinese data set and an English data set to obtain a Chinese acoustic model and an English acoustic model. Using the concept of entropy, it compares the outputs of the two models for the input speech signal and takes the result with the smaller entropy value as the output of the acoustic model, which achieves simple language recognition and improves the overall speech recognition rate.
  • the content of the speaker in the home scene can therefore be identified quickly and accurately, and the method can be widely used in actual home scenes.
  • Fig. 1 is a block diagram of the overall structure of the combined neural network speech recognition algorithm for the spoken language environment at home according to the present invention
  • Figure 2 is a structural diagram of the DNN-LSTM model of the present invention.
  • Figure 3 shows the overall structure of LSTM.
  • Figure 1 is a block diagram of the overall structure of the combined neural network speech recognition algorithm for the home spoken environment. First, the characteristics of DNN and LSTM are combined to construct the DNN-LSTM model; then, the DNN-LSTM model is trained on the Chinese data set and the English data set, and the Chinese acoustic model and the English acoustic model are saved; finally, the result is output through language matching, achieving both language recognition and speech recognition.
  • DNN stands for deep neural network (Deep Neural Networks), and LSTM stands for long short-term memory network (Long Short-Term Memory).
  • Figure 3 shows the structure of the three-gate logic computation inside the LSTM.
  • the core element of LSTM is the cell state, which carries information forward over time. It runs straight through the entire chain with only a few minor linear interactions, so information can easily flow along it without significant change. During this transfer, information is added to or removed from the cell state through the input at the current time, the hidden-layer state at the previous time, the cell state at the previous time, and the gate structure.
  • in speech recognition, the memory unit of the LSTM model is mainly used to store and process speech features. It implements three gate computations, namely the forget gate, the input gate and the output gate, which protect and control the neuron state c_t at the current moment, as follows:
  • Input gate: this gate determines how much of the input x_t is kept in c_t.
  • the realization formula is:
  • i_t is the state at time t after passing through the input gate, together with the corresponding state saved by the input gate.
  • W_i, W_c represent the weight matrices
  • b_i, b_c represent the bias terms
  • x_{t-1}, x_t, x_{t+1} represent the input at the previous time, the current time and the next time, respectively
  • h_{t-1}, h_t, h_{t+1} represent the neuron state at the previous moment, the current moment and the next moment, respectively
  • σ represents the sigmoid function.
  • Forget gate: this gate determines how much of c_{t-1} is retained in c_t at time t.
  • the realization formula is:
  • W_f represents the weight matrix
  • b_f represents the bias term
  • Output gate: this gate controls how much of the unit state c_t is output to the current output value h_t of the LSTM.
  • the state after the input gate and the forget gate, that is, the realization formula of c_t, is:
  • the first half is the component remaining in c_t after the information passes through the forget gate
  • the second half is the component remaining in c_t after the information passes through the input gate.
  • o_t is the state of the output layer at time t
  • W_o represents the weight matrix
  • b_o represents the bias term
  • the combined neural network speech recognition method for the oral environment at home includes:
  • Figure 2 shows the DNN-LSTM model.
  • the structure is as follows: layer 0 is the input layer, layers 1 to 4 are hidden layers, and layer 5 is the output layer, whose activation function is the softmax function.
  • the first hidden layer is an LSTM network structure with 512 nodes, and its activation functions are the sigmoid function and the tanh function.
  • a Dropout strategy is added to the neural units; the following three layers are all DNN network structures with 1024 nodes each and the sigmoid activation function. That is, the combined neural network DNN-LSTM model includes the input layer, the long short-term memory network, the second hidden layer, the third hidden layer, the fourth hidden layer and the output layer, and the long short-term memory network serves as the first hidden layer.
  • the model has 6 layers; for layer n, the input vector of the neurons is z_n and the output vector is y_n, then:
  • W_n is the weight matrix from layer n-1 to layer n
  • b_n is the bias of layer n
  • f_n is the activation function of layer n.
  • Chinese speech data training: preprocess the collected Chinese speech data set to obtain the Chinese feature vector set vector0, and use the Chinese feature vector set as the input of the DNN-LSTM model for iterative training until the optimal Chinese acoustic model is obtained;
  • the preprocessing operations include sampling, pre-emphasis, windowing and framing, and endpoint detection; the feature vector set vector0 is used as the DNN-LSTM model input for iterative training until the optimal acoustic model China_model is obtained.
  • the training steps are as follows:
  • the maximum number of iterations is set to 50. Each iteration traverses from the first training sample to the last training sample, where i denotes the training sample being traversed;
  • English speech data training: preprocess the collected English speech data set to obtain the English feature vector set vector1, and use the English feature vector set as the input of the DNN-LSTM model for iterative training until the optimal English acoustic model is obtained;
  • the preprocessing operations include sampling, pre-emphasis, windowing and framing, and endpoint detection; the feature vector set vector1 is used as the DNN-LSTM model input for iterative training until the optimal acoustic model English_model is obtained. The specific traversal steps are the same as those of the Chinese speech data training and are not repeated here.
  • the information entropy formula is:
  • if the probability vector set output by the Chinese acoustic model contains a value p_i that is significantly greater than the other probability values, while the probability values in the set output by the English acoustic model differ little from one another, where 1 ≤ i ≤ q;
  • whether probability values differ significantly depends on the number of output classes of the softmax output layer.
  • the more output classes, the smaller the corresponding range.
  • a threshold range is set: if the difference between probability values is greater than or equal to the threshold, the difference is significant; if the differences between the probability values are all smaller than the threshold, the difference is not significant.
  • the threshold range is approximately 0.2. The more output categories, the smaller the range.
  • the corresponding input speech signal voice0 of the unknown language is Chinese, and the output probability of the Chinese acoustic model is used as the final output result;
  • the corresponding input speech signal voice0 of the unknown language is English, and the output probability of the English acoustic model is used as the final output result.
  • the present invention also provides a neural network speech recognition system oriented to the spoken language environment at home.
  • the system includes:
  • Model building module: used to add a long short-term memory network to the deep neural network to build the combined neural network DNN-LSTM model
  • the model training module includes a Chinese model training unit and an English model training unit.
  • the Chinese model training unit is used to preprocess the collected Chinese speech data set to obtain a Chinese feature vector set, and to use the Chinese feature vector set as the input of the DNN-LSTM model for iterative training until the optimal Chinese acoustic model is obtained;
  • the English model training unit is used to preprocess the collected English speech data set to obtain an English feature vector set, and to use the English feature vector set as the input of the DNN-LSTM model for iterative training until the optimal English acoustic model is obtained;
  • the model testing module includes a voice input unit and a voice type judgment unit.
  • the voice input unit is used to pass an input speech signal voice0 in an unknown language through the Chinese acoustic model and the English acoustic model, respectively, to obtain the Chinese output probability vector set and the English output probability vector set;
  • the voice type judgment unit is configured to perform language matching according to the Chinese output probability vector set and the English output probability vector set, and output the judgment result.
  • this application can be provided as methods, systems, or computer program products. Therefore, this application may adopt the form of a complete hardware embodiment, a complete application embodiment, or an embodiment combining applications and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment thereby provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
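
The gate bullets in the list above introduce each gate with "the realization formula is:", but the formulas themselves are not present in this list. As a reference sketch only, the textbook LSTM formulation can be written with the symbols the list defines (W_i, W_c, W_f, W_o, b_i, b_c, b_f, b_o, σ, x_t, h_{t-1}, c_t, o_t). The names i_t, f_t and \tilde{c}_t, the concatenation [h_{t-1}, x_t] and the element-wise product ⊙ are assumptions taken from the standard form; this has not been checked against the patent's own equations.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{c}_t &= \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right) \\
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

In this form the first term of c_t is the part of the previous cell state kept by the forget gate and the second term is the part of the new input admitted by the input gate, which matches the "first half" and "second half" description in the list.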

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

A neural network speech recognition method and system oriented to a home spoken environment, comprising: model construction: adding a long short-term memory network to a deep neural network to construct a combined neural network DNN-LSTM model; preprocessing an acquired speech data set to obtain a feature vector set, and using the feature vector set as the input of the DNN-LSTM model for iterative training until an optimal acoustic model is obtained; passing an input speech signal of an unknown language through the trained DNN-LSTM models to obtain a Chinese output probability vector set and an English output probability vector set, respectively; performing language matching according to the Chinese output probability vector set and the English output probability vector set, and outputting a determination result. The present invention is able to quickly and accurately recognize the content of a speaker in a home scene, and can be widely applied to actual home scenes.

Description

Neural network speech recognition method and system oriented to a home spoken environment
Technical field
The invention belongs to the technical field of intelligent recognition, and specifically relates to a neural network speech recognition method and system oriented to the home spoken environment.
Background
The key object of speech recognition research is speech: the speech signal is converted into information that a computer can recognize, so as to recognize the speaker's voice commands and text content. Speech recognition methods fall into three basic groups: methods based on linguistics and acoustics, model matching methods, and neural network methods. Although the first method appeared earlier, it has not yet reached a practical stage because of the complexity of its models. In the second method, the hidden Markov model is the most widely used: it is a probabilistic model for labeling problems that describes how observation sequences are randomly generated, and it has greatly advanced speech recognition technology. The third method trains shallow neural networks, which easily causes gradient instability; manual extraction of sample features is time-consuming and laborious, and the recognition results are not very good. In traditional speech recognition systems, the GMM-HMM acoustic modeling method is the most widely used in practice, but when dealing with complex speech signals in the home environment, the application scenarios of the traditional model are rather limited.
Summary of the invention
Objective of the invention: to overcome the shortcomings of the prior art, the present invention provides a neural network speech recognition method oriented to the home spoken environment, which can solve the problems of low speech recognition rate and poor recognition efficiency. The present invention also provides a neural network speech recognition system oriented to the home spoken environment.
Technical solution: in one aspect, the neural network speech recognition method oriented to the home spoken environment according to the present invention includes:
Model construction: add a long short-term memory network to the deep neural network to build a combined neural network DNN-LSTM model;
Model training:
Chinese speech data training: preprocess the collected Chinese speech data set to obtain a Chinese feature vector set, and use the Chinese feature vector set as the input of the DNN-LSTM model for iterative training until the optimal Chinese acoustic model is obtained;
English speech data training: preprocess the collected English speech data set to obtain an English feature vector set, and use the English feature vector set as the input of the DNN-LSTM model for iterative training until the optimal English acoustic model is obtained;
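
The preprocessing step in the training stage is listed elsewhere in this document as sampling, pre-emphasis, windowing and framing, and endpoint detection. The following is a minimal sketch of the last three under stated assumptions: the pre-emphasis coefficient 0.97, the Hamming window, the energy-threshold endpoint detector and all function names are illustrative choices, not details taken from the patent.

```python
import math


def pre_emphasis(signal, alpha=0.97):
    """High-pass filter y[n] = x[n] - alpha * x[n-1] (alpha is an assumed value)."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


def frame_and_window(signal, frame_len, hop):
    """Split the signal into overlapping frames and apply a Hamming window."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * k / (frame_len - 1))
              for k in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + k] * window[k] for k in range(frame_len)])
    return frames


def frame_energy(frame):
    """Short-time energy of one frame."""
    return sum(x * x for x in frame)


def detect_endpoints(frames, threshold):
    """Return (first, last) indices of frames whose energy exceeds the threshold.

    Returns None when no frame is above the threshold.
    """
    active = [i for i, f in enumerate(frames) if frame_energy(f) > threshold]
    return (active[0], active[-1]) if active else None
```

In a real pipeline the frames between the detected endpoints would then be turned into feature vectors; which features the patent uses is not stated in this section.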
Model testing:
Pass an input speech signal voice0 in an unknown language through the Chinese acoustic model and the English acoustic model, respectively, to obtain a Chinese output probability vector set and an English output probability vector set;
Perform language matching according to the Chinese output probability vector set and the English output probability vector set, and output the judgment result.
Further, the method includes:
The combined neural network DNN-LSTM model includes an input layer, a long short-term memory network, a second hidden layer, a third hidden layer, a fourth hidden layer, and an output layer, and the long short-term memory network serves as the first hidden layer.
Further, the method includes:
The first hidden layer has 512 nodes, and its activation functions are the sigmoid function and the tanh function; the second, third and fourth hidden layers each have 1024 nodes, and their activation function is the sigmoid function.
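
To make the layer layout concrete, here is a small sketch of the dense part of the stack: three sigmoid hidden layers followed by a softmax output layer (the softmax output is stated elsewhere in this document). It assumes the usual per-layer rule z_n = W_n y_{n-1} + b_n, y_n = f_n(z_n), and it takes the output of the LSTM layer as a given input rather than implementing the LSTM itself. The function names, the random initialization range and the scaled-down demo sizes are hypothetical.

```python
import math
import random

# Layer widths stated in the text; the demo builder below uses smaller sizes.
LSTM_NODES = 512
DENSE_NODES = (1024, 1024, 1024)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(zs):
    """Softmax with the maximum subtracted for numerical stability."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random weight matrix (n_out x n_in) and bias vector (n_out)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.uniform(-0.1, 0.1) for _ in range(n_out)]
    return W, b


def affine(W, b, y_prev):
    """Assumed layer rule: z_n = W_n * y_{n-1} + b_n."""
    return [sum(w * y for w, y in zip(row, y_prev)) + bias
            for row, bias in zip(W, b)]


def dense_stack(lstm_out, hidden_layers, output_layer):
    """Apply the sigmoid hidden layers, then the softmax output layer."""
    y = lstm_out
    for W, b in hidden_layers:
        y = [sigmoid(z) for z in affine(W, b, y)]
    W_out, b_out = output_layer
    return softmax(affine(W_out, b_out, y))


def build_demo(lstm_nodes=8, dense_nodes=(16, 16, 16), n_classes=5, seed=0):
    """Build a scaled-down stack; pass LSTM_NODES and DENSE_NODES for full size."""
    rng = random.Random(seed)
    sizes = [lstm_nodes, *dense_nodes]
    hidden = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
    output = init_layer(sizes[-1], n_classes, rng)
    return hidden, output
```

Calling `dense_stack` on a vector of length `lstm_nodes` returns one probability per output class, which is the kind of probability vector the language-matching step later compares.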
Further, the method includes:
The Chinese feature vector set is used as the input of the DNN-LSTM model for iterative training, and the training steps include:
(1) Initialize the weight matrix W and the bias vector b in the model structure to random values;
(2) Iterate from the first iteration to the maximum number of iterations; each iteration traverses from the first speech training sample to the last training sample;
(3) For each training sample, input the corresponding feature vector to the input layer; traverse from the first hidden layer to the output layer, using the forward propagation algorithm to compute the layer being traversed, and then compute the output layer according to the loss function; after forward propagation is complete, traverse from the fourth hidden layer to the first hidden layer, using the backpropagation algorithm for each corresponding hidden layer;
(4) After backpropagation is complete, traverse from the first hidden layer to the output layer and update the weight matrix W_n and bias vector b_n of each layer; this completes the training on one sample within one iteration. If not all samples have been traversed, continue with the next sample; if all samples have been traversed, proceed to the next iteration;
(5) When none of the changes in W and b exceeds the iteration threshold, stop the iteration loop;
(6) Save the optimal weight matrix W and bias vector b of each layer.
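
The control flow of steps (1) to (6) can be sketched with a deliberately tiny stand-in model. The sketch below trains a single sigmoid neuron, not the DNN-LSTM, so that the loop structure stays visible: random initialization, up to a maximum number of iterations over all samples, a forward and backward pass per sample, and an early stop when no parameter changed by more than a threshold during an iteration. The learning rate, the threshold value, the cross-entropy gradient and the choice to measure the change per full iteration are assumptions; the default maximum of 50 iterations is the value given elsewhere in this document.

```python
import math
import random


def predict(w, b, x):
    """Forward pass of one sigmoid neuron."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, max_iters=50, threshold=1e-4, lr=0.5, seed=0):
    """Return (w, b, iterations_run) following steps (1) to (6)."""
    rng = random.Random(seed)
    n_features = len(samples[0][0])
    # Step (1): random initialization of weights and bias.
    w = [rng.uniform(-0.1, 0.1) for _ in range(n_features)]
    b = rng.uniform(-0.1, 0.1)
    iterations = 0
    # Step (2): iterate up to the maximum number of iterations.
    for iterations in range(1, max_iters + 1):
        w_old, b_old = list(w), b
        for x, target in samples:
            # Step (3): forward pass, then gradient of the loss.
            y = predict(w, b, x)
            grad = y - target  # dLoss/dz for sigmoid with cross-entropy loss
            # Step (4): update the parameters for this sample.
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
        # Step (5): stop when no parameter changed by more than the threshold.
        change = max(max(abs(a - c) for a, c in zip(w, w_old)), abs(b - b_old))
        if change <= threshold:
            break
    # Step (6): return (that is, "save") the final parameters.
    return w, b, iterations
```

For the real model the same loop would wrap the layer-by-layer forward and backward passes over W_n and b_n instead of a single neuron.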
Further, the method includes:
Performing language matching according to the Chinese output probability vector set and the English output probability vector set includes:
Using the information entropy formula, calculate the information entropy of the Chinese output probability vector set P and of the English output probability vector set P′, denoted H and H′ respectively; here P = {p_1, p_2, ..., p_q} and P′ = {p′_1, p′_2, ..., p′_t}, where q is the total number of output classes of the Chinese acoustic model and t is the total number of output classes of the English acoustic model;
If the probability vector set output by the Chinese acoustic model contains a value p_i that is significantly larger than the other probability values, while the probability values in the set output by the English acoustic model differ little from one another, where 1 ≤ i ≤ q;
and if the information entropy H of the Chinese acoustic model's output is smaller than the information entropy H′ of the English acoustic model's output, then the input speech signal voice0 of the unknown language is Chinese, and the output probability of the Chinese acoustic model is used as the final output result;
If the probability vector set output by the English acoustic model contains a value p′_j that is significantly larger than the other probability values, while the probability values in the set output by the Chinese acoustic model differ little from one another,
and if the information entropy H′ of the English acoustic model's output is smaller than the information entropy H of the Chinese acoustic model's output, then the input speech signal voice0 of the unknown language is English, and the output probability of the English acoustic model is used as the final output result.
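
The matching rule above can be sketched as follows. The information entropy formula itself is not reproduced in this section, so the sketch assumes the standard Shannon form H = -Σ p_i log p_i. The margin of 0.2 comes from the approximate range given elsewhere in this document; the tests "largest value exceeds the second largest by at least the margin" and "maximum minus minimum is below the margin" are one possible reading of "significantly larger" and "differ little". The function names and the undecided return value are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log); zero probabilities contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def has_dominant(probs, margin):
    """True if the largest value exceeds the second largest by at least margin."""
    ordered = sorted(probs, reverse=True)
    return len(ordered) > 1 and ordered[0] - ordered[1] >= margin


def is_flat(probs, margin):
    """True if all values lie within a range smaller than margin."""
    return max(probs) - min(probs) < margin


def match_language(p_chinese, p_english, margin=0.2):
    """Return (language, probabilities), or (None, None) if neither rule applies."""
    h_cn, h_en = entropy(p_chinese), entropy(p_english)
    if has_dominant(p_chinese, margin) and is_flat(p_english, margin) and h_cn < h_en:
        return "Chinese", p_chinese
    if has_dominant(p_english, margin) and is_flat(p_chinese, margin) and h_en < h_cn:
        return "English", p_english
    # The text does not say what to do in this case; the sketch reports "undecided".
    return None, None
```

A peaked distribution has low entropy and a near-uniform one has high entropy, so the model that "recognizes" the utterance is the one whose output is both dominated by a single class and lower in entropy.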
In another aspect, the present invention also provides a system implementing the above neural network speech recognition method oriented to the home spoken environment, the system including:
A model building module, used to add a long short-term memory network to the deep neural network to build the combined neural network DNN-LSTM model;
A model training module, which includes a Chinese model training unit and an English model training unit. The Chinese model training unit is used to preprocess the collected Chinese speech data set to obtain a Chinese feature vector set, and to use the Chinese feature vector set as the input of the DNN-LSTM model for iterative training until the optimal Chinese acoustic model is obtained;
The English model training unit is used to preprocess the collected English speech data set to obtain an English feature vector set, and to use the English feature vector set as the input of the DNN-LSTM model for iterative training until the optimal English acoustic model is obtained;
A model testing module, which includes a voice input unit and a voice type judgment unit. The voice input unit is used to pass an input speech signal voice0 in an unknown language through the Chinese acoustic model and the English acoustic model, respectively, to obtain the Chinese output probability vector set and the English output probability vector set; the voice type judgment unit is used to perform language matching according to the Chinese output probability vector set and the English output probability vector set, and to output the judgment result.
Beneficial effects: compared with the prior art, the significant advantages of the present invention are as follows. The present invention combines the ability of the LSTM to record long historical information in its memory unit with the ability of the DNN to effectively extract high-level features from the data. It proposes adding an LSTM as the first hidden layer of the DNN, builds a combined neural network of DNN and LSTM for acoustic modeling, and trains and tests it on a Chinese data set and an English data set to obtain a Chinese acoustic model and an English acoustic model. Using the concept of entropy, it compares the outputs of the Chinese acoustic model and the English acoustic model for the input speech signal and takes the result with the smaller entropy value as the output of the acoustic model. This achieves simple language recognition and improves the overall speech recognition rate, so that the content of the speaker in the home scene can be identified quickly and accurately, and the method can be widely used in actual home scenes.
Description of the drawings
Fig. 1 is a block diagram of the overall structure of the combined neural network speech recognition algorithm for the home spoken environment according to the present invention;
Fig. 2 is a structural diagram of the DNN-LSTM model according to the present invention;
Fig. 3 is a diagram of the overall structure of the LSTM.
Detailed description of the embodiments
To describe the combined neural network speech recognition algorithm for the home spoken environment proposed by the present invention in more detail, an example is given below with reference to the accompanying drawings.
Fig. 1 is a block diagram of the overall structure of the combined neural network speech recognition algorithm for the home spoken environment. First, the DNN-LSTM model is constructed by combining the characteristics of the DNN and the LSTM. Then, the DNN-LSTM model is trained on the Chinese data set and on the English data set, and the resulting Chinese acoustic model and English acoustic model are saved. Finally, the result is output through language matching, thereby achieving both language identification and speech recognition.
DNN stands for deep neural network and LSTM for long short-term memory network. Fig. 3 shows the internal three-gate logic of the LSTM. The core element of the LSTM is the cell state, which carries information through time. It runs straight through the entire chain with only minor linear interactions, so information can flow along it largely unchanged. During this transfer, information is added to or removed from the cell state according to the input at the current time, the hidden-layer state at the previous time, the cell state at the previous time, and the gate structures. In speech recognition, the memory unit of the LSTM model is mainly used to store and process speech features. It implements three gates, namely the forget gate, the input gate and the output gate, which protect and control the neuron state $c_t$ at the current time, as follows:
(1) Input gate: this gate determines how much of the information in the input $x_t$ is retained in $c_t$. The formulas are:
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{3}$$
$$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \tag{4}$$
where $i_t$ is the value of the input gate at time t; through the input gate, the corresponding candidate state $\tilde{c}_t$ is retained. $W_i$ and $W_c$ are weight matrices, and $b_i$ and $b_c$ are bias terms. $x_{t-1}$, $x_t$ and $x_{t+1}$ are the inputs at the previous, current and next time respectively; $h_{t-1}$, $h_t$ and $h_{t+1}$ are the neuron (hidden) states at the previous, current and next time respectively; and $\sigma$ is the sigmoid function.
(2) Forget gate: this gate determines how much of $c_{t-1}$ is retained in $c_t$ at time t. The formula is:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{5}$$
where $W_f$ is the weight matrix and $b_f$ is the bias term.
(3) Output gate: this gate controls how much of the cell state $c_t$ is output to the current LSTM output value $h_t$. First, the state after the input gate and the forget gate, namely $c_t$, is given by:
$$c_t = f_t * c_{t-1} + i_t * \tilde{c}_t \tag{6}$$
where the first term is the component that remains in $c_t$ after the information passes through the forget gate, and the second term is the component that remains in $c_t$ after the information passes through the input gate. Then, to determine how much of $c_t$ is retained in $h_t$, the output gate is computed as:
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{7}$$
where $o_t$ is the state of the output gate at time t, $W_o$ is the weight matrix and $b_o$ is the bias term. Finally, after the output gate, the final output of the hidden layer is:
$$h_t = o_t * \tanh(c_t) \tag{8}$$
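The three gates can be sketched as a single time step in NumPy. This is only an illustrative sketch, not the patent's implementation: it uses the standard LSTM forms for the candidate state (4) and the cell-state update (6), and the helper `init_params`, the parameter dictionary layout and the initialisation scale are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following equations (3)-(8)."""
    concat = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])       # (3) input gate
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])   # (4) candidate state
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])       # (5) forget gate
    c_t = f_t * c_prev + i_t * c_tilde                          # (6) new cell state
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])       # (7) output gate
    h_t = o_t * np.tanh(c_t)                                    # (8) hidden output
    return h_t, c_t

def init_params(input_dim, hidden_dim, seed=0):
    """Hypothetical helper: small random weights, zero biases."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("i", "c", "f", "o"):
        params[f"W_{gate}"] = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim + input_dim))
        params[f"b_{gate}"] = np.zeros(hidden_dim)
    return params
```

Calling `lstm_step` once per frame, feeding each returned `h_t` and `c_t` back in as `h_prev` and `c_prev`, carries the cell state along the sequence in the way the text describes.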
Specifically, the combined neural network speech recognition method for the home spoken environment includes the following.
First, model construction: a long short-term memory network is added to the deep neural network to construct the combined neural network DNN-LSTM model.
Fig. 2 shows the DNN-LSTM model, structured as follows: layer 0 is the input layer, layers 1 to 4 are hidden layers, and layer 5 is the output layer, whose activation function is the softmax function. Among the hidden layers, the first is an LSTM network structure with 512 nodes, using the sigmoid and tanh functions as activation functions; to prevent the network from overfitting the data, a Dropout strategy is added inside the neural units. The remaining three hidden layers are DNN network structures with 1024 nodes each, using the sigmoid function as the activation function. That is, the combined neural network DNN-LSTM model includes an input layer, a long short-term memory network, a second hidden layer, a third hidden layer, a fourth hidden layer and an output layer, with the long short-term memory network serving as the first hidden layer.
The model has 6 layers. With the input vector of the neurons in each layer denoted $z_n$ and the output vector denoted $y_n$:
$$z_n = W_n y_{n-1} + b_n, \quad n = 1, 2, 3, 4, 5 \tag{1}$$
where $y_{n-1}$ is the output vector of layer n-1 (with $y_0$ the feature vector at the input layer), $W_n$ is the weight matrix from layer n-1 to layer n, and $b_n$ is the bias of layer n. The output obtained from the input vector is:
$$y_n = f_n(z_n) \tag{2}$$
where $f_n$ is the activation function of layer n.
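As a rough illustration of equations (1) and (2), the following sketch applies the three 1024-node sigmoid layers and the softmax output layer to a 512-dimensional vector standing in for the output of the LSTM layer. The zero input vector, the choice of 5 output classes, the random initialisation and the helper names are assumptions for the example; each layer is applied to the previous layer's output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # shift by the max for numerical stability
    return e / e.sum()

def dense_forward(h_lstm, weights, biases):
    """Apply layers 2-5: three sigmoid layers, then a softmax output layer."""
    y = h_lstm
    last = len(weights) - 1
    for n, (W, b) in enumerate(zip(weights, biases)):
        z = W @ y + b                                 # equation (1), on the previous output
        y = softmax(z) if n == last else sigmoid(z)   # equation (2)
    return y

def init_dense(sizes, seed=0):
    """Hypothetical helper: random weights, zero biases for consecutive layer sizes."""
    rng = np.random.default_rng(seed)
    weights = [rng.normal(0.0, 0.05, (out, inp)) for inp, out in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(out) for out in sizes[1:]]
    return weights, biases

# 512 LSTM outputs -> 1024 -> 1024 -> 1024 -> 5 output classes (class count assumed)
sizes = [512, 1024, 1024, 1024, 5]
weights, biases = init_dense(sizes)
probs = dense_forward(np.zeros(512), weights, biases)
```

The returned `probs` is a probability vector over the output classes, which is the quantity the later language-matching step operates on.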
Second, model training:
Chinese speech data training: the collected Chinese speech data set is preprocessed to obtain the Chinese feature vector set vector0, and the Chinese feature vector set is used as the input of the DNN-LSTM model for iterative training until the optimal Chinese acoustic model is obtained;
The preprocessing operations include sampling, pre-emphasis, windowing and framing, and endpoint detection. The feature vector set vector0 is then used as the input of the DNN-LSTM model for iterative training until the optimal acoustic model China_model is obtained.
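The preprocessing chain can be sketched as follows, assuming the signal has already been sampled. The source names the steps but not their parameters, so the pre-emphasis coefficient 0.97, the 25 ms / 10 ms frame and hop lengths, the Hamming window and the simple energy-threshold endpoint detector are all assumptions chosen because they are common defaults.

```python
import numpy as np

def pre_emphasis(signal, coeff=0.97):
    """y[k] = x[k] - coeff * x[k-1]; boosts high-frequency content."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

def frame_and_window(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Split into overlapping frames and apply a Hamming window to each frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    if len(signal) < frame_len:
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)

def detect_endpoints(frames, ratio=0.1):
    """Energy-based endpoint detection: mark frames above ratio * max energy."""
    energy = np.sum(frames ** 2, axis=1)
    return energy > ratio * energy.max()

# Synthetic test signal: silence, one second of a 440 Hz tone, silence.
sr = 16000
t = np.arange(sr) / sr
signal = np.concatenate([np.zeros(sr // 2), np.sin(2 * np.pi * 440 * t), np.zeros(sr // 2)])
frames = frame_and_window(pre_emphasis(signal), sr)
speech = frames[detect_endpoints(frames)]
```

Feature extraction on the retained frames would follow; the source does not say which features are used, so that step is left out here.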
The training steps are as follows:
(1) Initialize the weight matrices W and bias vectors b in the network structure to random values.
(2) Iterate from the first iteration up to the maximum number of iterations, which is set to 50 in this embodiment. In each iteration, the training samples are traversed from the first to the last, with i denoting the training sample currently being traversed;
(3) In the training of each sample, the input vector is taken as the first-layer input of the DNN, denoted $a_1$. The layers are then traversed from the first hidden layer to the output layer, with n denoting the layer currently being traversed, and each layer applies the forward propagation algorithm to compute $a_{i,n} = f(z_{i,n}) = f(W_n a_{i,n-1} + b_n)$, where $a_{i,n}$ is the output of layer n for the i-th sample, which serves as the input to the next layer.
The output-layer error $\delta_{i,L}$ is computed from the loss function, where L denotes the output layer. After forward propagation is complete, the layers are traversed from the last hidden layer back to the first hidden layer, and the back-propagation algorithm computes $\delta_{i,n} = (W_{n+1})^T \delta_{i,n+1} \odot f'(z_{i,n})$, where $\delta_{i,n}$ is the error term of layer n for the i-th training sample, T denotes transposition, $f'$ denotes the derivative, and $\odot$ denotes the element-wise (Hadamard) product.
(4) After back-propagation is complete, the layers are traversed from the first hidden layer to the output layer, and $W_n$ and $b_n$ of the layer n being traversed are updated as follows:
[Figure PCTCN2020133554-appb-000004: update formulas for $W_n$ and $b_n$]
This completes the training on one sample within one iteration. If the samples have not all been traversed, traversal continues with the next sample; if they have, the next iteration begins. Here m is the total number of training samples and α is the iteration step size;
(5) When none of the changes in W and b exceeds the iteration threshold, the iteration loop stops;
(6) The optimal weight matrix W and bias vector b of each layer are saved.
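Steps (1) to (6) can be sketched for a small fully connected network as below. Several points are assumptions rather than the patent's stated method: the loss is taken to be cross-entropy on a softmax output (so the output-layer error is the prediction minus the one-hot target), the update is plain per-sample gradient descent with step size `alpha`, the LSTM layer is omitted for brevity, and the names `train` and `predict` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def predict(x, W, b):
    """Forward pass: sigmoid hidden layers, softmax output layer."""
    a = x
    for n in range(len(W)):
        z = W[n] @ a + b[n]
        a = softmax(z) if n == len(W) - 1 else sigmoid(z)
    return a

def train(samples, labels, sizes, alpha=0.1, max_iter=50, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Step (1): random initialisation of W and b.
    W = [rng.normal(0.0, 0.1, (o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
    b = [rng.normal(0.0, 0.1, o) for o in sizes[1:]]
    for _ in range(max_iter):                       # step (2): iterations
        max_change = 0.0
        for x, y in zip(samples, labels):           # traverse every training sample
            # Step (3): forward propagation, keeping each layer's activation.
            acts = [x]
            for n in range(len(W)):
                z = W[n] @ acts[-1] + b[n]
                acts.append(softmax(z) if n == len(W) - 1 else sigmoid(z))
            # Output-layer error (assumed softmax + cross-entropy loss).
            delta = acts[-1] - y
            deltas = [delta]
            # Back-propagation from the last hidden layer to the first.
            for n in range(len(W) - 1, 0, -1):
                a = acts[n]                          # output of the layer below W[n]
                delta = (W[n].T @ delta) * a * (1.0 - a)   # Hadamard product with f'(z)
                deltas.insert(0, delta)
            # Step (4): update W_n and b_n layer by layer.
            for n in range(len(W)):
                dW = alpha * np.outer(deltas[n], acts[n])
                db = alpha * deltas[n]
                W[n] -= dW
                b[n] -= db
                max_change = max(max_change, np.abs(dW).max(), np.abs(db).max())
        if max_change <= tol:                       # step (5): stopping threshold
            break
    return W, b                                     # step (6): parameters to save
```

For instance, `W, b = train(X, Y, sizes=[2, 4, 2])` with two-dimensional inputs `X` and one-hot targets `Y` trains a toy classifier; the same loop structure would be run once on the Chinese feature set and once on the English one.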
English speech data training: the collected English speech data set is preprocessed to obtain the English feature vector set vector1, and the English feature vector set is used as the input of the DNN-LSTM model for iterative training until the optimal English acoustic model is obtained. The preprocessing operations again include sampling, pre-emphasis, windowing and framing, and endpoint detection, and the feature vector set vector1 is used as the input of the DNN-LSTM model for iterative training until the optimal acoustic model English_model is obtained. The traversal steps are the same as those for Chinese speech data training and are not repeated here.
Finally, model testing is performed, with the following steps:
An input speech signal voice0 of unknown language is passed through the Chinese acoustic model and the English acoustic model respectively, obtaining a Chinese output probability vector set and an English output probability vector set;
Language matching is performed according to the Chinese output probability vector set and the English output probability vector set, and the judgment result is output.
The information entropy formula is used to compute the information entropy of the Chinese output probability vector set P and of the English output probability vector set P′, denoted H and H′ respectively, where $P = \{p_1, p_2, \ldots, p_q\}$, $P' = \{p'_1, p'_2, \ldots, p'_t\}$, q is the total number of output classes of the Chinese acoustic model, and t is the total number of output classes of the English acoustic model;
The information entropy formula is:
$$H = -\sum_{i=1}^{q} p_i \log p_i$$
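As a worked illustration of how entropy separates a confident output from an uncertain one, consider a 5-class output layer. The two probability vectors below are made up for the example, and base-2 logarithms are assumed since the source does not state the base.

```latex
% Uniform output: every class equally likely (maximum entropy)
H_{\text{uniform}} = -\sum_{i=1}^{5} 0.2 \log_2 0.2 = \log_2 5 \approx 2.32 \text{ bits}

% Peaked output: one class at 0.8, the other four at 0.05 each
H_{\text{peaked}} = -0.8 \log_2 0.8 - 4 \cdot 0.05 \log_2 0.05
                  \approx 0.26 + 0.86 = 1.12 \text{ bits}
```

The peaked vector has less than half the entropy of the uniform one, which is the gap the language-matching rule relies on.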
If the probability vector set output by the Chinese acoustic model contains a $p_i$ that is significantly larger than the other probability values, while the probability values in the probability vector set output by the English acoustic model do not differ significantly from one another, where $1 \le i \le q$;
In this embodiment, whether probability values differ significantly depends on the number of output classes of the softmax output layer: the more output classes, the smaller the corresponding range. This range is denoted β. A difference between probability values of at least β counts as significant, and if the differences among the probability values are all smaller than β, they count as not significant. In the experiments, with 5 output classes β was approximately 0.2; the more output classes, the smaller the range.
and if the information entropy H obtained from the Chinese acoustic model is smaller than the information entropy H′ obtained from the English acoustic model, then the input speech signal voice0 of unknown language is judged to be Chinese, and the output probabilities of the Chinese acoustic model are taken as the final output. That is, by the properties of information entropy, the larger the entropy, the greater the amount of information in the system and the higher the uncertainty, with the maximum reached when $p_1 = p_2 = \cdots = p_q$. Here the information entropy of the Chinese output probabilities is much smaller than that of the English output probabilities, meaning that the speech signal matches the Chinese acoustic model better, so the output probabilities of the Chinese acoustic model are taken as the final output.
If the probability vector set output by the English acoustic model contains a $p'_j$ that is significantly larger than the other probability values, while the probability values in the probability vector set output by the Chinese acoustic model do not differ significantly from one another, where $1 \le j \le t$;
and if the information entropy H′ obtained from the English acoustic model is smaller than the information entropy H obtained from the Chinese acoustic model, then the input speech signal voice0 of unknown language is judged to be English, and the output probabilities of the English acoustic model are taken as the final output.
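The matching rule above can be sketched in a few lines. This is an illustrative reading, not the patent's code: "significantly larger" is interpreted as the top probability exceeding the runner-up by at least β, "not significantly different" as the spread between the largest and smallest probability being below β, base-2 logarithms are assumed, and the `"undetermined"` fallback for inputs that satisfy neither case is an addition the source does not describe.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def has_dominant_class(probs, beta):
    """True if the largest probability exceeds the runner-up by at least beta."""
    p = np.sort(np.asarray(probs, dtype=float))
    return bool(p[-1] - p[-2] >= beta)

def is_flat(probs, beta):
    """True if all probabilities lie within a range smaller than beta."""
    p = np.asarray(probs, dtype=float)
    return bool(p.max() - p.min() < beta)

def match_language(p_chinese, p_english, beta=0.2):
    """Return (language, chosen probability vector) per the entropy rule."""
    h_zh, h_en = entropy(p_chinese), entropy(p_english)
    if has_dominant_class(p_chinese, beta) and is_flat(p_english, beta) and h_zh < h_en:
        return "chinese", p_chinese
    if has_dominant_class(p_english, beta) and is_flat(p_chinese, beta) and h_en < h_zh:
        return "english", p_english
    return "undetermined", None   # assumed fallback, not specified in the source

language, chosen = match_language([0.8, 0.05, 0.05, 0.05, 0.05],
                                  [0.25, 0.2, 0.2, 0.2, 0.15])
```

With the 5-class vectors in the usage line, the Chinese model's output is peaked and the English model's output is nearly flat, so the Chinese branch is taken. The default `beta=0.2` mirrors the value reported for 5 output classes and would need to shrink for larger class counts.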
In another aspect, the present invention also provides a neural network speech recognition system for the home spoken environment. The system includes:
a model construction module, configured to add a long short-term memory network to a deep neural network to construct the combined neural network DNN-LSTM model;
a model training module, which in turn includes a Chinese model training unit and an English model training unit, the Chinese model training unit being configured to preprocess the collected Chinese speech data set to obtain a Chinese feature vector set, and to use the Chinese feature vector set as the input of the DNN-LSTM model for iterative training until the optimal Chinese acoustic model is obtained;
the English model training unit being configured to preprocess the collected English speech data set to obtain an English feature vector set, and to use the English feature vector set as the input of the DNN-LSTM model for iterative training until the optimal English acoustic model is obtained;
a model testing module, which in turn includes a voice input unit and a voice type judgment unit, the voice input unit being configured to pass an input speech signal voice0 of unknown language through the Chinese acoustic model and the English acoustic model respectively, obtaining a Chinese output probability vector set and an English output probability vector set, and the voice type judgment unit being configured to perform language matching according to the Chinese output probability vector set and the English output probability vector set, and to output the judgment result.
Since the system/device embodiment is basically similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the corresponding description of the method embodiment.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely application (software) embodiment, or an embodiment combining application and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing equipment produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing equipment to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, such that a series of operational steps is performed on the computer or other programmable equipment to produce computer-implemented processing, so that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications falling within the scope of the present invention.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims (6)

  1. A neural network speech recognition method for the home spoken environment, characterized in that the method includes:
    model construction: adding a long short-term memory network to a deep neural network to construct a combined neural network DNN-LSTM model;
    model training:
    Chinese speech data training: preprocessing the collected Chinese speech data set to obtain a Chinese feature vector set, and using the Chinese feature vector set as the input of the DNN-LSTM model for iterative training until the optimal Chinese acoustic model is obtained;
    English speech data training: preprocessing the collected English speech data set to obtain an English feature vector set, and using the English feature vector set as the input of the DNN-LSTM model for iterative training until the optimal English acoustic model is obtained;
    model testing:
    passing an input speech signal voice0 of unknown language through the Chinese acoustic model and the English acoustic model respectively, obtaining a Chinese output probability vector set and an English output probability vector set;
    performing language matching according to the Chinese output probability vector set and the English output probability vector set, and outputting the judgment result.
  2. The neural network speech recognition method for the home spoken environment according to claim 1, characterized in that the combined neural network DNN-LSTM model includes an input layer, a long short-term memory network, a second hidden layer, a third hidden layer, a fourth hidden layer and an output layer, the long short-term memory network serving as the first hidden layer.
  3. The neural network speech recognition method for the home spoken environment according to claim 2, characterized in that the first hidden layer has 512 nodes and uses the sigmoid function and the tanh function as activation functions, and the second hidden layer, the third hidden layer and the fourth hidden layer each have 1024 nodes and use the sigmoid function as the activation function.
  4. The neural network speech recognition method for the home spoken environment according to claim 3, characterized in that the Chinese feature vector set is used as the input of the DNN-LSTM model for iterative training, the training steps including:
    (1) initializing the weight matrices W and bias vectors b in the model structure to random values;
    (2) iterating from the first iteration up to the maximum number of iterations, wherein in each iteration the training samples are traversed from the first speech data training sample to the last training sample;
    (3) in the training of each training sample, inputting the corresponding feature vector to the input layer; traversing from the first hidden layer to the output layer and using the forward propagation algorithm to express the corresponding layer being traversed, then expressing the output layer according to the loss function; after the forward propagation algorithm is complete, traversing from the fourth hidden layer to the first hidden layer and using the back-propagation algorithm to express the corresponding first hidden layer;
    (4) after the back-propagation algorithm is complete, traversing in order from the first hidden layer to the output layer and updating the weight matrix $W_n$ and bias vector $b_n$ of the corresponding layer, where n is the layer being traversed, n = 1, 2, 3, 4, 5, at which point the training on one sample within one iteration ends; if the samples have not all been traversed, continuing to traverse the samples; if the samples have all been traversed, proceeding to the next iteration;
    (5) when none of the changes in W and b exceeds the iteration threshold, stopping the iteration loop;
    (6) saving the optimal weight matrix W and bias vector b of each layer.
  5. The neural network speech recognition method for the home spoken environment according to claim 1, characterized in that performing language matching according to the Chinese output probability vector set and the English output probability vector set includes:
    using the information entropy formula to compute the information entropy of the Chinese output probability vector set P and of the English output probability vector set P′, denoted H and H′ respectively, where $P = \{p_1, p_2, \ldots, p_q\}$, $P' = \{p'_1, p'_2, \ldots, p'_t\}$, q is the total number of output classes of the Chinese acoustic model, and t is the total number of output classes of the English acoustic model;
    if the probability vector set output by the Chinese acoustic model contains a $p_i$ that is significantly larger than the other probability values, while the probability values in the probability vector set output by the English acoustic model do not differ significantly from one another, where $1 \le i \le q$;
    and if the information entropy H obtained from the Chinese acoustic model is smaller than the information entropy H′ obtained from the English acoustic model, then the input speech signal voice0 of unknown language is Chinese, and the output probabilities of the Chinese acoustic model are taken as the final output;
    if the probability vector set output by the English acoustic model contains a $p'_j$ that is significantly larger than the other probability values, while the probability values in the probability vector set output by the Chinese acoustic model do not differ significantly from one another, where $1 \le j \le t$;
    and if the information entropy H′ obtained from the English acoustic model is smaller than the information entropy H obtained from the Chinese acoustic model, then the input speech signal voice0 of unknown language is English, and the output probabilities of the English acoustic model are taken as the final output.
  6. A system implemented by the neural network speech recognition method for the home spoken environment according to any one of claims 1-5, characterized in that the system includes:
    a model construction module, configured to add a long short-term memory network to a deep neural network to construct a combined neural network DNN-LSTM model;
    a model training module, which in turn includes a Chinese model training unit and an English model training unit, the Chinese model training unit being configured to preprocess the collected Chinese speech data set to obtain a Chinese feature vector set, and to use the Chinese feature vector set as the input of the DNN-LSTM model for iterative training until the optimal Chinese acoustic model is obtained;
    the English model training unit being configured to preprocess the collected English speech data set to obtain an English feature vector set, and to use the English feature vector set as the input of the DNN-LSTM model for iterative training until the optimal English acoustic model is obtained;
    a model testing module, which in turn includes a voice input unit and a voice type judgment unit, the voice input unit being configured to pass an input speech signal voice0 of unknown language through the Chinese acoustic model and the English acoustic model respectively, obtaining a Chinese output probability vector set and an English output probability vector set, and the voice type judgment unit being configured to perform language matching according to the Chinese output probability vector set and the English output probability vector set, and to output the judgment result.
PCT/CN2020/133554 2020-04-15 2020-12-03 Neural network speech recognition method and system oriented to home spoken environment WO2021208455A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2021551834A JP7166683B2 (en) 2020-04-15 2020-12-03 Neural Network Speech Recognition Method and System for Domestic Conversation Environment

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010295068.2 2020-04-15
CN202010295068.2A CN111477220B (en) 2020-04-15 2020-04-15 Neural network voice recognition method and system for home spoken language environment

Publications (1)

Publication Number Publication Date
WO2021208455A1 true WO2021208455A1 (en) 2021-10-21

Family

ID=71753345

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/133554 WO2021208455A1 (en) 2020-04-15 2020-12-03 Neural network speech recognition method and system oriented to home spoken environment

Country Status (3)

Country Link
JP (1) JP7166683B2 (en)
CN (1) CN111477220B (en)
WO (1) WO2021208455A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111477220B (en) * 2020-04-15 2023-04-25 南京邮电大学 Neural network voice recognition method and system for home spoken language environment
CN112700792B (en) * 2020-12-24 2024-02-06 南京邮电大学 Audio scene identification and classification method
CN113823275A (en) * 2021-09-07 2021-12-21 广西电网有限责任公司贺州供电局 Voice recognition method and system for power grid dispatching

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108335693A (en) * 2017-01-17 2018-07-27 腾讯科技(深圳)有限公司 A kind of Language Identification and languages identification equipment
CN108389573A (en) * 2018-02-09 2018-08-10 北京易真学思教育科技有限公司 Language Identification and device, training method and device, medium, terminal
WO2019006473A1 (en) * 2017-06-30 2019-01-03 The Johns Hopkins University Systems and method for action recognition using micro-doppler signatures and recurrent neural networks
CN110517668A (en) * 2019-07-23 2019-11-29 普强信息技术(北京)有限公司 A kind of Chinese and English mixing voice identifying system and method
CN111477220A (en) * 2020-04-15 2020-07-31 南京邮电大学 Neural network speech recognition method and system for household spoken language environment

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9460711B1 (en) * 2013-04-15 2016-10-04 Google Inc. Multilingual, acoustic deep neural networks
JP6164639B2 (en) * 2013-05-23 2017-07-19 国立研究開発法人情報通信研究機構 Deep neural network learning method and computer program
CN103400577B (en) * 2013-08-01 2015-09-16 百度在线网络技术(北京)有限公司 The acoustic model method for building up of multilingual speech recognition and device
KR102305584B1 (en) * 2015-01-19 2021-09-27 삼성전자주식회사 Method and apparatus for training language model, method and apparatus for recognizing language
CN106297773B (en) * 2015-05-29 2019-11-19 中国科学院声学研究所 A kind of neural network acoustic training model method
CN107301860B (en) * 2017-05-04 2020-06-23 百度在线网络技术(北京)有限公司 Voice recognition method and device based on Chinese-English mixed dictionary
CN110970018B (en) * 2018-09-28 2022-05-27 珠海格力电器股份有限公司 Speech recognition method and device
CN110211588A (en) * 2019-06-03 2019-09-06 北京达佳互联信息技术有限公司 Audio recognition method, device and electronic equipment
CN110517663B (en) * 2019-08-01 2021-09-21 北京语言大学 Language identification method and system
CN110634502B (en) * 2019-09-06 2022-02-11 南京邮电大学 Single-channel voice separation algorithm based on deep neural network
CN110853618B (en) * 2019-11-19 2022-08-19 腾讯科技(深圳)有限公司 Language identification method, model training method, device and equipment
CN112562640B (en) * 2020-12-01 2024-04-12 北京声智科技有限公司 Multilingual speech recognition method, device, system, and computer-readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHAO SHUFANG;DONG XIAOYU: "Research on Speech Recognition Based on Improved LSTM Deep Neural Network", JOURNAL OF ZHENGZHOU UNIVERSITY(ENGINEERING SCIENCE), vol. 39, no. 5, 19 July 2018 (2018-07-19), pages 63 - 67, XP055857765, ISSN: 1671-6833, DOI: 10.13708/j.issn.1671-6833.2018.02.004 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116187205A (en) * 2023-04-24 2023-05-30 北京智芯微电子科技有限公司 Running state prediction method and device for digital twin body of power distribution network and training method
CN116187205B (en) * 2023-04-24 2023-08-15 北京智芯微电子科技有限公司 Running state prediction method and device for digital twin body of power distribution network and training method
CN116306787A (en) * 2023-05-22 2023-06-23 江西省气象灾害应急预警中心(江西省突发事件预警信息发布中心) Visibility early warning model construction method, system, computer and readable storage medium
CN116306787B (en) * 2023-05-22 2023-08-22 江西省气象灾害应急预警中心(江西省突发事件预警信息发布中心) Visibility early warning model construction method, system, computer and readable storage medium

Also Published As

Publication number Publication date
JP2022540968A (en) 2022-09-21
CN111477220A (en) 2020-07-31
JP7166683B2 (en) 2022-11-08
CN111477220B (en) 2023-04-25

Similar Documents

Publication Publication Date Title
WO2021208455A1 (en) Neural network speech recognition method and system oriented to home spoken environment
CN112163426B (en) Relationship extraction method based on combination of attention mechanism and graph long-time memory neural network
CN107301864B (en) Deep bidirectional LSTM acoustic model based on Maxout neuron
US10325200B2 (en) Discriminative pretraining of deep neural networks
Shinozaki et al. Structure discovery of deep neural network based on evolutionary algorithms
WO2019083812A1 (en) Generating dual sequence inferences using a neural network model
CN110750665A (en) Open set domain adaptation method and system based on entropy minimization
TW201905897A (en) Voice wake-up method, device and electronic device
CN111653275B (en) Method and device for constructing voice recognition model based on LSTM-CTC tail convolution and voice recognition method
CN107688849A (en) A kind of dynamic strategy fixed point training method and device
WO2019037700A1 (en) Speech emotion detection method and apparatus, computer device, and storage medium
US10580432B2 (en) Speech recognition using connectionist temporal classification
Huang et al. Recurrent poisson process unit for speech recognition
Huang et al. Speaker adaptation of RNN-BLSTM for speech recognition based on speaker code
WO2018153200A1 (en) Hlstm model-based acoustic modeling method and device, and storage medium
CN111882042B (en) Neural network architecture automatic search method, system and medium for liquid state machine
Gopalakrishnan et al. Sentiment analysis using simplified long short-term memory recurrent neural networks
WO2022028378A1 (en) Voice intention recognition method, apparatus and device
Regmi et al. Nepali speech recognition using rnn-ctc model
CN111123894A (en) Chemical process fault diagnosis method based on combination of LSTM and MLP
Andrew et al. Sequential deep belief networks
Kang et al. Gated convolutional networks based hybrid acoustic models for low resource speech recognition
Ying English pronunciation recognition and detection based on HMM-DNN
Kumar et al. Analysis of automated text generation using deep learning
CN112598065B (en) Memory-based gating convolutional neural network semantic processing system and method

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021551834

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20931383

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20931383

Country of ref document: EP

Kind code of ref document: A1