WO2018166316A1 - Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures - Google Patents

Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures

Info

Publication number
WO2018166316A1
WO2018166316A1 (PCT/CN2018/076272)
Authority
WO
WIPO (PCT)
Prior art keywords
network
neural network
input
speech
speaker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2018/076272
Other languages
French (fr)
Chinese (zh)
Inventor
李明
倪志东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Shunde Sun Yat-Sen University Research Institute
Sun Yat Sen University
SYSU CMU Shunde International Joint Research Institute
Original Assignee
Foshan Shunde Sun Yat-Sen University Research Institute
Sun Yat Sen University
SYSU CMU Shunde International Joint Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan Shunde Sun Yat-Sen University Research Institute, Sun Yat Sen University, SYSU CMU Shunde International Joint Research Institute filed Critical Foshan Shunde Sun Yat-Sen University Research Institute
Publication of WO2018166316A1 publication Critical patent/WO2018166316A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique, using neural networks

Definitions

  • The invention relates to the field of speech processing technology and proposes a speaker cold-symptom recognition method that fuses multiple end-to-end deep learning structures.
  • Speaker recognition, also known as voiceprint recognition, is the technique of automatically identifying a speaker with pattern recognition technology, exploiting the speaker-specific information contained in speech.
  • Current speaker-recognition technology performs well under experimental conditions, but in practice speech is affected by environmental noise and by the speaker's health, which reduces the robustness of existing speaker-recognition techniques.
  • A cold-speech recognition method classifies existing speech to decide whether it is cold speech; by determining in advance whether an utterance is cold speech and only then performing speaker recognition, the robustness of speaker recognition can be improved.
  • Speech feature extraction extracts the speaker's voice and vocal-tract characteristics.
  • At present the mainstream feature parameters, including MFCC, LPCC, and CQCC, are each based on a single feature; they carry insufficient information to characterize a speaker's cold symptoms, which limits recognition accuracy.
  • Such approaches also require extensive knowledge for distinguishing the target speech classes.
  • Methods based on vocal-tract and speech-model knowledge appeared earliest but, owing to model complexity, never achieved good practical results; model-matching techniques such as dynamic time warping, hidden Markov models, and vector quantization later began to deliver good recognition performance.
  • Studying feature extraction and pattern classification separately is the usual approach in recognition research, but it suffers from mismatches between features and models, difficult training, and features that are hard to find.
  • The classical recognition framework has all of the above problems.
  • The present invention proposes a speaker cold-symptom recognition method that fuses multiple end-to-end deep learning structures: four different end-to-end deep learning networks are constructed and finally fused for speaker cold-symptom recognition.
  • The four end-to-end deep learning structures are: (1) the input is raw speech and the network is a multi-layer convolutional neural network plus a long short-term memory (LSTM) network; (2) the input is the speech spectrum and the network is a multi-layer convolutional neural network plus an LSTM network; (3) the input is the speech spectrum and the network is a multi-layer convolutional neural network plus a fully connected network; (4) the input is the mel-frequency cepstral coefficients (MFCC) and constant-Q cepstral coefficients (CQCC) and the network is an LSTM network.
  • The beneficial effects of the present invention are: given the uncertainty of traditional features, the output obtained through neural network training expresses the characteristics of the speaker's cold symptoms better, and the input is relatively simple, requiring little feature processing. Because speech carries temporal information, classification through LSTM networks yields better results. By unifying feature learning and pattern classification, the whole speaker cold-symptom recognition process becomes simpler and faster and has broad application prospects.
  • FIG. 1 shows the pipeline for extracting mel-frequency cepstral coefficients (MFCC) from speech.
  • FIG. 2 shows the pipeline for extracting constant-Q cepstral coefficients (CQCC) from speech.
  • Figure 3 shows the first end-to-end neural network: the input is raw speech and the network is CNN+LSTM.
  • Figure 4 shows the second end-to-end neural network: the input is the speech spectrum and the network is CNN+LSTM.
  • Figure 5 shows the third end-to-end neural network: the input is the speech spectrum and the network is CNN plus a fully connected network.
  • Figure 6 shows the fourth end-to-end neural network: the input is the MFCC or CQCC features and the network is an LSTM.
  • Step 1: construct an end-to-end neural network whose input is raw speech and whose network is CNN+LSTM. The input speech is divided into segments of the same size (for example 40 ms) and mean-normalized. The corresponding convolutional neural network consists of 8 modules, each composed of a one-dimensional convolution layer, a ReLU activation layer, and a one-dimensional max-pooling layer; every convolution kernel has size 32, every pooling kernel has size 2, and the pooling stride is 2. An LSTM network then performs classification.
  • Step 2: construct an end-to-end neural network whose input is the speech spectrum and whose network is CNN+LSTM. The input speech is divided into segments of the same size, and a fast Fourier transform yields the spectrogram of each segment. The convolutional neural network consists of 6 modules, each composed of a two-dimensional convolution layer, a ReLU activation layer, and a two-dimensional max-pooling layer; the first convolution layer uses a 7×7 kernel, the second a 5×5 kernel, and the remaining four layers 3×3 kernels, while all max-pooling layers use 3×3 kernels with stride 2. An LSTM network finally performs classification.
  • Step 3: construct an end-to-end neural network whose input is the speech spectrum and whose network is a CNN plus a fully connected network. The input speech is divided into segments of the same size, and a fast Fourier transform yields the spectrogram of each segment. The convolutional neural network consists of 6 modules, each composed of a two-dimensional convolution layer, a ReLU activation layer, and a two-dimensional max-pooling layer; the first convolution layer uses a 7×7 kernel, the second a 5×5 kernel, and the remaining four layers 3×3 kernels, while all max-pooling layers use 3×3 kernels with stride 2. The output then passes through a fully connected layer and is finally classified by Softmax.
  • Step 4: construct an end-to-end neural network whose input is the MFCC or CQCC features and whose network is an LSTM. The MFCC features are obtained by pre-emphasizing the speech, windowing and framing, applying a fast Fourier transform, filtering with a mel-scale triangular filter bank, taking logarithms, and applying a discrete cosine transform; the CQCC features are obtained by applying a constant-Q transform to the speech, computing the energy spectral density, taking logarithms, and applying a cosine transform. The extracted MFCC or CQCC features serve as the input to the neural network, and an LSTM network finally performs classification.
  • Step 5: fuse the above four networks to perform speaker cold-speech recognition.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

A speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures, consisting of four end-to-end neural networks, the method comprising: when the input is raw speech or the speech spectrum, extracting optimal features by means of a convolutional neural network and finally performing classification by means of a long short-term memory network or a fully connected network; and when the input is mel-frequency cepstral coefficients (MFCC) or constant-Q cepstral coefficients (CQCC), directly performing classification by means of the long short-term memory network; finally, these systems are fused together. The whole process integrates feature extraction and model classification, so that the entire process of recognizing a speaker's flu symptoms becomes simpler and quicker.

Description

Speaker cold-symptom recognition method fusing multiple end-to-end neural network structures

Technical field

The invention relates to the field of speech processing technology and proposes a speaker cold-symptom recognition method that fuses multiple end-to-end deep learning structures.

Background

1. Speaker recognition, also known as voiceprint recognition, is the technique of automatically identifying a speaker with pattern recognition technology, exploiting the speaker-specific information contained in speech. Current speaker-recognition technology performs well under experimental conditions, but in practice speech is affected by environmental noise and by the speaker's health, which reduces the robustness of existing speaker-recognition techniques. A cold-speech recognition method classifies existing speech to decide whether it is cold speech; by determining in advance whether an utterance is cold speech and only then performing speaker recognition, the robustness of speaker recognition can be improved.

2. In speech technology research, researchers have always hoped to find features that represent the target type, describing the characteristics of the target speech that clearly distinguish it from normal speech. Speech feature extraction extracts the speaker's voice and vocal-tract characteristics. At present the mainstream feature parameters, including MFCC, LPCC, and CQCC, are each based on a single feature; they carry insufficient information to characterize a speaker's cold symptoms, which limits recognition accuracy. Such approaches also require extensive knowledge for distinguishing the target speech classes. Among speech recognition algorithms, methods based on vocal-tract and speech-model knowledge appeared earliest but, owing to model complexity, never achieved good practical results; model-matching techniques such as dynamic time warping, hidden Markov models, and vector quantization later began to deliver good recognition performance. Studying feature extraction and pattern classification separately is the usual approach in recognition research, but it suffers from mismatches between features and models, difficult training, and features that are hard to find; the classical recognition framework has all of these problems.

3. In recent years, with the development of deep learning, deep neural networks have shown great power in image and speech recognition, and a series of neural network structures have been proposed, such as autoencoder networks, convolutional neural networks, and recurrent neural networks. Many researchers have found that learning from speech with neural networks yields hidden structural features that describe the speech better. End-to-end recognition methods handle feature learning and feature recognition jointly with as little prior knowledge as possible, and achieve good recognition performance.

Summary of the invention

Because existing recognition techniques study features and pattern classification separately, and therefore suffer from mismatches between features and models, difficult training, and features that are hard to find, the present invention proposes a speaker cold-symptom recognition method that fuses multiple end-to-end deep learning structures. Four different end-to-end deep learning networks are constructed, and finally the four end-to-end neural network structures are fused for speaker cold-symptom recognition.

The four end-to-end deep learning structures are: (1) the input is raw speech and the network is a multi-layer convolutional neural network plus a long short-term memory (LSTM) network; (2) the input is the speech spectrum and the network is a multi-layer convolutional neural network plus an LSTM network; (3) the input is the speech spectrum and the network is a multi-layer convolutional neural network plus a fully connected network; (4) the input is the mel-frequency cepstral coefficients (MFCC) and constant-Q cepstral coefficients (CQCC) and the network is an LSTM network.

The beneficial effects of the present invention are: given the uncertainty of traditional features, the output obtained through neural network training expresses the characteristics of the speaker's cold symptoms better, and the input is relatively simple, requiring little feature processing. Because speech carries temporal information, classification through LSTM networks yields better results. By unifying feature learning and pattern classification, the whole speaker cold-symptom recognition process becomes simpler and faster and has broad application prospects.

Brief description of the drawings

Figure 1 shows the pipeline for extracting mel-frequency cepstral coefficients (MFCC) from speech.

Figure 2 shows the pipeline for extracting constant-Q cepstral coefficients (CQCC) from speech.

Figure 3 shows the first end-to-end neural network: the input is raw speech and the network is CNN+LSTM.

Figure 4 shows the second end-to-end neural network: the input is the speech spectrum and the network is CNN+LSTM.

Figure 5 shows the third end-to-end neural network: the input is the speech spectrum and the network is CNN plus a fully connected network.

Figure 6 shows the fourth end-to-end neural network: the input is the MFCC or CQCC features and the network is an LSTM.

Detailed description

To make the technical solution and advantages of the present invention clearer, the technical solution of the invention is described clearly and completely below with reference to the accompanying drawings:

Step 1: construct an end-to-end neural network whose input is raw speech and whose network is CNN+LSTM. Specifically, the input speech is divided into segments of the same size (for example 40 ms) and mean-normalized. The corresponding convolutional neural network consists of 8 modules, each composed of a one-dimensional convolution layer, a ReLU activation layer, and a one-dimensional max-pooling layer; every convolution kernel has size 32, every pooling kernel has size 2, and the pooling stride is 2. An LSTM network then performs classification.
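The front end of Step 1 can be sketched in numpy. This is an illustrative toy, not the patent's implementation: the 16 kHz sample rate, the random kernel, and the non-overlapping segmentation are assumptions, and a single convolution module stands in for the full 8-module CNN and the LSTM classifier.

```python
import numpy as np

def segment_and_normalize(speech, sr=16000, seg_ms=40):
    """Split raw speech into equal 40 ms segments and mean-normalize each segment."""
    seg_len = int(sr * seg_ms / 1000)                 # 640 samples at 16 kHz
    n_seg = len(speech) // seg_len
    segs = speech[:n_seg * seg_len].reshape(n_seg, seg_len)
    return segs - segs.mean(axis=1, keepdims=True)    # per-segment mean normalization

def conv1d_relu_maxpool(x, kernel, pool=2, stride=2):
    """One module of the step-1 network: 1-D convolution, ReLU, then max pooling
    (pooling kernel size 2, stride 2, as in the patent)."""
    k = len(kernel)
    conv = np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])
    relu = np.maximum(conv, 0.0)
    n = (len(relu) - pool) // stride + 1
    return np.array([relu[i * stride:i * stride + pool].max() for i in range(n)])

sr = 16000
speech = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)   # 1 s synthetic tone
segs = segment_and_normalize(speech, sr)
kernel = np.random.default_rng(0).standard_normal(32)   # kernel size 32 per the patent
out = conv1d_relu_maxpool(segs[0], kernel)
print(segs.shape, out.shape)                             # (25, 640) (304,)
```

In the full network, eight such modules would be stacked and their output sequence fed to an LSTM for the cold/non-cold decision.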

Step 2: construct an end-to-end neural network whose input is the speech spectrum and whose network is CNN+LSTM. Specifically, the input speech is divided into segments of the same size, and a fast Fourier transform yields the spectrogram of each segment. The convolutional neural network consists of 6 modules, each composed of a two-dimensional convolution layer, a ReLU activation layer, and a two-dimensional max-pooling layer; the first convolution layer uses a 7×7 kernel, the second a 5×5 kernel, and the remaining four layers 3×3 kernels, while all max-pooling layers use 3×3 kernels with stride 2. An LSTM network finally performs classification.
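The spectrogram input of Step 2 (segment, then FFT each segment) can be sketched as follows. The sample rate, 40 ms segment length, and Hann window are assumptions for illustration; the patent does not specify a window.

```python
import numpy as np

def speech_spectrogram(speech, sr=16000, seg_ms=40):
    """Cut speech into equal-length segments and take the magnitude FFT of each,
    producing the spectrogram that would feed the 2-D CNN."""
    seg_len = int(sr * seg_ms / 1000)
    n_seg = len(speech) // seg_len
    frames = speech[:n_seg * seg_len].reshape(n_seg, seg_len)
    window = np.hanning(seg_len)                 # window choice is an assumption
    return np.abs(np.fft.rfft(frames * window, axis=1))  # (n_segments, seg_len//2 + 1)

sr = 16000
t = np.arange(sr) / sr
speech = np.sin(2 * np.pi * 1000 * t)            # 1 kHz test tone
spec = speech_spectrogram(speech, sr)
peak_bin = spec[0].argmax()
print(spec.shape, peak_bin * sr / 640)           # frequency resolution is sr/640 = 25 Hz
```

The resulting (time, frequency) matrix is what the six 2-D convolution modules would consume.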

Step 3: construct an end-to-end neural network whose input is the speech spectrum and whose network is a CNN plus a fully connected network. Specifically, the input speech is divided into segments of the same size, and a fast Fourier transform yields the spectrogram of each segment. The convolutional neural network consists of 6 modules, each composed of a two-dimensional convolution layer, a ReLU activation layer, and a two-dimensional max-pooling layer; the first convolution layer uses a 7×7 kernel, the second a 5×5 kernel, and the remaining four layers 3×3 kernels, while all max-pooling layers use 3×3 kernels with stride 2. The output then passes through a fully connected layer and is finally classified by Softmax.

Step 4: construct an end-to-end neural network whose input is the MFCC or CQCC features and whose network is an LSTM. The MFCC features are obtained by pre-emphasizing the speech, windowing and framing, applying a fast Fourier transform, filtering with a mel-scale triangular filter bank, taking logarithms, and applying a discrete cosine transform; the CQCC features are obtained by applying a constant-Q transform to the speech, computing the energy spectral density, taking logarithms, and applying a cosine transform. The extracted MFCC or CQCC features serve as the input to the neural network, and an LSTM network finally performs classification.
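The MFCC pipeline named in Step 4 (pre-emphasis, windowed framing, FFT, mel triangular filter bank, log, DCT) can be sketched end to end in numpy. Frame size, hop, filter count, and coefficient count are not given in the patent and are assumed here; the CQCC branch would follow the same shape with a constant-Q transform in place of the FFT.

```python
import numpy as np

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(speech, sr=16000, n_fft=512, hop=256, n_mels=26, n_ceps=13):
    """MFCC features following the patent's pipeline."""
    # 1. pre-emphasis
    x = np.append(speech[0], speech[1:] - 0.97 * speech[:-1])
    # 2. framing + Hamming window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hamming(n_fft)
    # 3. power spectrum via FFT
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2 / n_fft
    # 4. mel-scale triangular filter bank
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    # 5. log energies, 6. DCT-II, keep the first n_ceps coefficients
    logmel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return logmel @ dct.T                          # (n_frames, n_ceps)

speech = np.random.default_rng(1).standard_normal(16000)  # 1 s of noise as a stand-in
feats = mfcc(speech)
print(feats.shape)                                 # (61, 13)
```

Each row is the feature vector for one frame; the sequence of rows is what the LSTM classifier would consume.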

Step 5: fuse the above four networks to perform speaker cold-speech recognition.
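The patent does not specify the fusion rule for Step 5; score-level fusion by (weighted) averaging of each network's class posteriors is one common choice and can be sketched as follows. The posterior values below are hypothetical.

```python
import numpy as np

def fuse_scores(score_list, weights=None):
    """Score-level fusion of the four end-to-end networks: a weighted average of
    each network's [non-cold, cold] posteriors, followed by argmax."""
    scores = np.stack(score_list)                    # (n_systems, n_utts, n_classes)
    if weights is None:
        weights = np.full(len(score_list), 1.0 / len(score_list))  # equal weights
    fused = np.tensordot(weights, scores, axes=1)    # (n_utts, n_classes)
    return fused, fused.argmax(axis=1)               # class 1 = "cold speech"

# hypothetical posteriors from the four networks for 3 utterances
s1 = np.array([[0.8, 0.2], [0.4, 0.6], [0.3, 0.7]])
s2 = np.array([[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]])
s3 = np.array([[0.9, 0.1], [0.3, 0.7], [0.4, 0.6]])
s4 = np.array([[0.6, 0.4], [0.2, 0.8], [0.1, 0.9]])
fused, labels = fuse_scores([s1, s2, s3, s4])
print(fused.round(3), labels)                        # labels: [0 1 1]
```

Unequal weights (e.g. tuned on a development set) would slot into the same function.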

Claims (5)

1. A speaker cold-symptom recognition method fusing multiple end-to-end deep learning structures, comprising: S1, constructing an end-to-end neural network whose input is raw speech and whose network is a convolutional neural network plus an LSTM network; S2, constructing an end-to-end neural network whose input is the speech spectrum and whose network is a convolutional neural network plus an LSTM network; S3, constructing an end-to-end neural network whose input is the speech spectrum and whose network is a convolutional neural network plus a fully connected network; S4, constructing an end-to-end neural network whose input is the speech MFCC/CQCC features and whose network is an LSTM network; S5, fusing the above four end-to-end neural networks to recognize the speaker's cold symptoms.

2. The method of claim 1, wherein the end-to-end neural network of S1, whose input is raw speech and whose network is CNN+LSTM, is constructed as follows: the input speech is divided into segments of the same size (for example 40 ms) and mean-normalized; the corresponding convolutional neural network consists of 8 modules, each composed of a one-dimensional convolution layer, a ReLU activation layer, and a one-dimensional max-pooling layer, where every convolution kernel has size 32, every pooling kernel has size 2, and the pooling stride is 2; an LSTM network then performs classification.

3. The method of claim 1, wherein the end-to-end neural network of S2, whose input is the speech spectrum and whose network is CNN+LSTM, is constructed as follows: the input speech is divided into segments of the same size, and a fast Fourier transform yields the spectrogram of each segment; the convolutional neural network consists of 6 modules, each composed of a two-dimensional convolution layer, a ReLU activation layer, and a two-dimensional max-pooling layer, where the first convolution layer uses a 7×7 kernel, the second a 5×5 kernel, and the remaining four layers 3×3 kernels, while all max-pooling layers use 3×3 kernels with stride 2; an LSTM network finally performs classification.

4. The method of claim 1, wherein the end-to-end neural network of S3, whose input is the speech spectrum and whose network is a CNN plus a fully connected network, is constructed as follows: the input speech is divided into segments of the same size, and a fast Fourier transform yields the spectrogram of each segment; the convolutional neural network consists of 6 modules, each composed of a two-dimensional convolution layer, a ReLU activation layer, and a two-dimensional max-pooling layer, where the first convolution layer uses a 7×7 kernel, the second a 5×5 kernel, and the remaining four layers 3×3 kernels, while all max-pooling layers use 3×3 kernels with stride 2; the output then passes through a fully connected layer and is finally classified by Softmax.

5. The method of claim 1, wherein in S4 the MFCC features are obtained by pre-emphasizing the speech, windowing and framing, applying a fast Fourier transform, filtering with a mel-scale triangular filter bank, taking logarithms, and applying a discrete cosine transform, while the CQCC features are obtained by applying a constant-Q transform to the speech, computing the energy spectral density, taking logarithms, and applying a cosine transform; the extracted MFCC or CQCC features serve as the input to the neural network, and an LSTM network finally performs classification.
PCT/CN2018/076272 2017-03-13 2018-02-11 Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures Ceased WO2018166316A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710146957.0 2017-03-13
CN201710146957.0A CN107068167A (en) 2017-03-13 2017-03-13 Speaker cold-symptom recognition method fusing multiple end-to-end neural network structures

Publications (1)

Publication Number Publication Date
WO2018166316A1 true WO2018166316A1 (en) 2018-09-20

Family

ID=59621946

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/076272 Ceased WO2018166316A1 (en) 2017-03-13 2018-02-11 Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures

Country Status (2)

Country Link
CN (1) CN107068167A (en)
WO (1) WO2018166316A1 (en)


Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107068167A (en) * 2017-03-13 2017-08-18 广东顺德中山大学卡内基梅隆大学国际联合研究院 Speaker cold-symptom recognition method fusing multiple end-to-end neural network structures
CN108053841A (en) * 2017-10-23 2018-05-18 平安科技(深圳)有限公司 The method and application server of disease forecasting are carried out using voice
CN109960910B (en) * 2017-12-14 2021-06-08 Oppo广东移动通信有限公司 Voice processing method, device, storage medium and terminal device
CN109086892B (en) * 2018-06-15 2022-02-18 中山大学 General dependency tree-based visual problem reasoning model and system
CN109192226A (en) * 2018-06-26 2019-01-11 深圳大学 A kind of signal processing method and device
CN108899051B (en) * 2018-06-26 2020-06-16 北京大学深圳研究生院 A speech emotion recognition model and recognition method based on joint feature representation
CN109256118B (en) * 2018-10-22 2021-06-25 江苏师范大学 End-to-end Chinese dialect recognition system and method based on generative auditory model
CN109282837B (en) * 2018-10-24 2021-06-22 福州大学 Demodulation method of fiber Bragg grating interleaved spectrum based on LSTM network
CN111028859A (en) * 2019-12-15 2020-04-17 中北大学 A hybrid neural network vehicle recognition method based on audio feature fusion
CN116110437B (en) * 2023-04-14 2023-06-13 天津大学 Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5214743A (en) * 1989-10-25 1993-05-25 Hitachi, Ltd. Information processing apparatus
CN105139864A (en) * 2015-08-17 2015-12-09 北京天诚盛业科技有限公司 Voice recognition method and voice recognition device
CN106328122A (en) * 2016-08-19 2017-01-11 深圳市唯特视科技有限公司 Voice identification method using long-short term memory model recurrent neural network
CN107068167A (en) * 2017-03-13 2017-08-18 广东顺德中山大学卡内基梅隆大学国际联合研究院 Speaker cold-symptom recognition method fusing multiple end-to-end neural network structures

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TARA N. SAINATH ET AL.: "Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks", ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2015 IEEE INTERNATIONAL CONFERENCE ON, 6 August 2015 (2015-08-06), XP033187628, ISSN: 2379-190X *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10692502B2 (en) * 2017-03-03 2020-06-23 Pindrop Security, Inc. Method and apparatus for detecting spoofing conditions
US11488605B2 (en) 2017-03-03 2022-11-01 Pindrop Security, Inc. Method and apparatus for detecting spoofing conditions
US12488072B2 (en) 2020-04-15 2025-12-02 Pindrop Security, Inc. Passive and continuous multi-speaker voice biometrics
CN114299987A (en) * 2021-12-08 2022-04-08 中国科学技术大学 Training method of event analysis model, event analysis method and device thereof

Also Published As

Publication number Publication date
CN107068167A (en) 2017-08-18

Similar Documents

Publication Publication Date Title
WO2018166316A1 (en) Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures
CN112509564B (en) End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism
CN110245608B (en) An underwater target recognition method based on semi-tensor product neural network
CN112885372B (en) Intelligent diagnosis method, system, terminal and medium for power equipment fault sound
CN110459225B (en) Speaker recognition system based on CNN fusion characteristics
CN114863937B (en) Hybrid bird song recognition method based on deep transfer learning and XGBoost
WO2018227780A1 (en) Speech recognition method and device, computer device and storage medium
CN105096955B (en) A kind of rapid speaker identification method and system based on model-growth clustering
CN107331384A (en) Audio recognition method, device, computer equipment and storage medium
CN111048097B (en) Twin network voiceprint recognition method based on 3D convolution
CN110148408A (en) A kind of Chinese speech recognition method based on deep residual networks
CN115101076B (en) Speaker clustering method based on multi-scale channel separation convolution feature extraction
CN110349588A (en) A kind of LSTM-network voiceprint recognition method based on word embeddings
CN111785262B (en) Speaker age and gender classification method based on residual networks and fused features
CN113450830A (en) Voice emotion recognition method of convolution cyclic neural network with multiple attention mechanisms
CN110299142A (en) A kind of voiceprint recognition method and device based on network fusion
CN115440228A (en) Self-adaptive voiceprint recognition method and system
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
CN116153339B (en) A speech emotion recognition method and device based on improved attention mechanism
CN118072746A (en) Marine mammal call recognition and classification method based on feature fusion
CN113571095A (en) Speech emotion recognition method and system based on nested deep neural network
CN111653267A (en) Rapid language identification method based on time delay neural network
CN115249479B (en) Complex speech recognition method, system and terminal for power grid dispatching based on BRNN
CN118585889A (en) A ship type identification method and system based on ship radiated noise data
CN116842460A (en) Cough-related disease identification method and system based on attention mechanism and residual neural network

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 18768185

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: PCT application non-entry into the European phase

Ref document number: 18768185

Country of ref document: EP

Kind code of ref document: A1