WO2018166316A1 - Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures - Google Patents

Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures

Info

Publication number
WO2018166316A1
WO2018166316A1 (PCT/CN2018/076272)
Authority
WO
WIPO (PCT)
Prior art keywords
network
neural network
input
speech
speaker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2018/076272
Other languages
French (fr)
Chinese (zh)
Inventor
李明
倪志东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Shunde Sun Yat-Sen University Research Institute
Sun Yat Sen University
SYSU CMU Shunde International Joint Research Institute
Original Assignee
Foshan Shunde Sun Yat-Sen University Research Institute
Sun Yat Sen University
SYSU CMU Shunde International Joint Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan Shunde Sun Yat-Sen University Research Institute, Sun Yat Sen University, SYSU CMU Shunde International Joint Research Institute filed Critical Foshan Shunde Sun Yat-Sen University Research Institute
Publication of WO2018166316A1 publication Critical patent/WO2018166316A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique, using neural networks

Definitions

  • The invention relates to the field of speech processing technology and proposes a speaker cold-symptom recognition method that fuses multiple end-to-end deep learning structures.
  • Speaker recognition, also known as voiceprint recognition, is the technique of automatically identifying a speaker with pattern recognition technology, exploiting the speaker-specific information contained in speech.
  • Current speaker-recognition technology performs well under experimental conditions, but in practice speech is affected by environmental noise and by the speaker's health, which reduces the robustness of existing speaker-recognition techniques.
  • A cold-speech recognition method classifies existing speech to decide whether it is cold speech; by determining in advance whether an utterance is cold speech and only then performing speaker recognition, the robustness of speaker recognition can be improved.
  • Speech feature extraction extracts the speaker's voice and vocal-tract characteristics.
  • At present the mainstream feature parameters, including MFCC, LPCC, and CQCC, are each based on a single feature; they carry insufficient information to characterize a speaker's cold symptoms, which limits recognition accuracy.
  • Such approaches also require extensive knowledge for distinguishing the target speech classes.
  • Methods based on vocal-tract and speech-model knowledge appeared earliest but, owing to model complexity, never achieved good practical results; model-matching techniques such as dynamic time warping, hidden Markov models, and vector quantization later began to deliver good recognition performance.
  • Studying feature extraction and pattern classification separately is the usual approach in recognition research, but it suffers from mismatches between features and models, difficult training, and features that are hard to find.
  • The classical recognition framework has all of the above problems.
  • The present invention proposes a speaker cold-symptom recognition method that fuses multiple end-to-end deep learning structures: four different end-to-end deep learning networks are constructed and finally fused for speaker cold-symptom recognition.
  • The four end-to-end deep learning structures are: (1) the input is raw speech and the network is a multi-layer convolutional neural network plus a long short-term memory (LSTM) network; (2) the input is the speech spectrum and the network is a multi-layer convolutional neural network plus an LSTM network; (3) the input is the speech spectrum and the network is a multi-layer convolutional neural network plus a fully connected network; (4) the input is the mel-frequency cepstral coefficients (MFCC) and constant-Q cepstral coefficients (CQCC) and the network is an LSTM network.
  • The beneficial effects of the present invention are: given the uncertainty of traditional features, the output obtained through neural network training expresses the characteristics of the speaker's cold symptoms better, and the input is relatively simple, requiring little feature processing. Because speech carries temporal information, classification through LSTM networks yields better results. By unifying feature learning and pattern classification, the whole speaker cold-symptom recognition process becomes simpler and faster and has broad application prospects.
  • FIG. 1 shows the pipeline for extracting mel-frequency cepstral coefficients (MFCC) from speech.
  • FIG. 2 shows the pipeline for extracting constant-Q cepstral coefficients (CQCC) from speech.
  • Figure 3 shows the first end-to-end neural network: the input is raw speech and the network is CNN+LSTM.
  • Figure 4 shows the second end-to-end neural network: the input is the speech spectrum and the network is CNN+LSTM.
  • Figure 5 shows the third end-to-end neural network: the input is the speech spectrum and the network is CNN plus a fully connected network.
  • Figure 6 shows the fourth end-to-end neural network: the input is the MFCC or CQCC features and the network is an LSTM.
  • Step 1: construct an end-to-end neural network whose input is raw speech and whose network is CNN+LSTM. The input speech is divided into segments of the same size (for example 40 ms) and mean-normalized. The corresponding convolutional neural network consists of 8 modules, each composed of a one-dimensional convolution layer, a ReLU activation layer, and a one-dimensional max-pooling layer; every convolution kernel has size 32, every pooling kernel has size 2, and the pooling stride is 2. An LSTM network then performs classification.
  • Step 2: construct an end-to-end neural network whose input is the speech spectrum and whose network is CNN+LSTM. The input speech is divided into segments of the same size, and a fast Fourier transform yields the spectrogram of each segment. The convolutional neural network consists of 6 modules, each composed of a two-dimensional convolution layer, a ReLU activation layer, and a two-dimensional max-pooling layer; the first convolution layer uses a 7×7 kernel, the second a 5×5 kernel, and the remaining four layers 3×3 kernels, while all max-pooling layers use 3×3 kernels with stride 2. An LSTM network finally performs classification.
  • Step 3: construct an end-to-end neural network whose input is the speech spectrum and whose network is a CNN plus a fully connected network. The input speech is divided into segments of the same size, and a fast Fourier transform yields the spectrogram of each segment. The convolutional neural network consists of 6 modules, each composed of a two-dimensional convolution layer, a ReLU activation layer, and a two-dimensional max-pooling layer; the first convolution layer uses a 7×7 kernel, the second a 5×5 kernel, and the remaining four layers 3×3 kernels, while all max-pooling layers use 3×3 kernels with stride 2. The output then passes through a fully connected layer and is finally classified by Softmax.
  • Step 4: construct an end-to-end neural network whose input is the MFCC or CQCC features and whose network is an LSTM. The MFCC features are obtained by pre-emphasizing the speech, windowing and framing, applying a fast Fourier transform, filtering with a mel-scale triangular filter bank, taking logarithms, and applying a discrete cosine transform; the CQCC features are obtained by applying a constant-Q transform to the speech, computing the energy spectral density, taking logarithms, and applying a cosine transform. The extracted MFCC or CQCC features serve as the input to the neural network, and an LSTM network finally performs classification.
  • Step 5: fuse the above four networks to perform speaker cold-speech recognition.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

A speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures, consisting of four end-to-end neural networks, the method comprising: when the input is raw speech or the speech spectrum, extracting optimal features by means of a convolutional neural network and finally performing classification by means of a long short-term memory network or a fully connected network; and when the input is mel-frequency cepstral coefficients (MFCC) or constant-Q cepstral coefficients (CQCC), directly performing classification by means of the long short-term memory network; finally, these systems are fused together. The whole process integrates feature extraction and model classification, so that the entire process of recognizing a speaker's flu symptoms becomes simpler and quicker.

Description

Speaker cold-symptom recognition method fusing multiple end-to-end neural network structures

Technical field

The invention relates to the field of speech processing technology and proposes a speaker cold-symptom recognition method that fuses multiple end-to-end deep learning structures.

Background

1. Speaker recognition, also known as voiceprint recognition, is the technique of automatically identifying a speaker with pattern recognition technology, exploiting the speaker-specific information contained in speech. Current speaker-recognition technology performs well under experimental conditions, but in practice speech is affected by environmental noise and by the speaker's health, which reduces the robustness of existing speaker-recognition techniques. A cold-speech recognition method classifies existing speech to decide whether it is cold speech; by determining in advance whether an utterance is cold speech and only then performing speaker recognition, the robustness of speaker recognition can be improved.

2. In speech technology research, researchers have always hoped to find features that represent the target type, describing the characteristics of the target speech that clearly distinguish it from normal speech. Speech feature extraction extracts the speaker's voice and vocal-tract characteristics. At present the mainstream feature parameters, including MFCC, LPCC, and CQCC, are each based on a single feature; they carry insufficient information to characterize a speaker's cold symptoms, which limits recognition accuracy. Such approaches also require extensive knowledge for distinguishing the target speech classes. Among speech recognition algorithms, methods based on vocal-tract and speech-model knowledge appeared earliest but, owing to model complexity, never achieved good practical results; model-matching techniques such as dynamic time warping, hidden Markov models, and vector quantization later began to deliver good recognition performance. Studying feature extraction and pattern classification separately is the usual approach in recognition research, but it suffers from mismatches between features and models, difficult training, and features that are hard to find; the classical recognition framework has all of these problems.

3. In recent years, with the development of deep learning, deep neural networks have shown great power in image and speech recognition, and a series of neural network structures have been proposed, such as autoencoder networks, convolutional neural networks, and recurrent neural networks. Many researchers have found that learning from speech with neural networks yields hidden structural features that describe the speech better. End-to-end recognition methods handle feature learning and feature recognition jointly with as little prior knowledge as possible, and achieve good recognition performance.

Summary of the invention

Because existing recognition techniques study features and pattern classification separately, and therefore suffer from mismatches between features and models, difficult training, and features that are hard to find, the present invention proposes a speaker cold-symptom recognition method that fuses multiple end-to-end deep learning structures. Four different end-to-end deep learning networks are constructed, and finally the four end-to-end neural network structures are fused for speaker cold-symptom recognition.

The four end-to-end deep learning structures are: (1) the input is raw speech and the network is a multi-layer convolutional neural network plus a long short-term memory (LSTM) network; (2) the input is the speech spectrum and the network is a multi-layer convolutional neural network plus an LSTM network; (3) the input is the speech spectrum and the network is a multi-layer convolutional neural network plus a fully connected network; (4) the input is the mel-frequency cepstral coefficients (MFCC) and constant-Q cepstral coefficients (CQCC) and the network is an LSTM network.

The beneficial effects of the present invention are: given the uncertainty of traditional features, the output obtained through neural network training expresses the characteristics of the speaker's cold symptoms better, and the input is relatively simple, requiring little feature processing. Because speech carries temporal information, classification through LSTM networks yields better results. By unifying feature learning and pattern classification, the whole speaker cold-symptom recognition process becomes simpler and faster and has broad application prospects.

Brief description of the drawings

Figure 1 shows the pipeline for extracting mel-frequency cepstral coefficients (MFCC) from speech.

Figure 2 shows the pipeline for extracting constant-Q cepstral coefficients (CQCC) from speech.

Figure 3 shows the first end-to-end neural network: the input is raw speech and the network is CNN+LSTM.

Figure 4 shows the second end-to-end neural network: the input is the speech spectrum and the network is CNN+LSTM.

Figure 5 shows the third end-to-end neural network: the input is the speech spectrum and the network is CNN plus a fully connected network.

Figure 6 shows the fourth end-to-end neural network: the input is the MFCC or CQCC features and the network is an LSTM.

Detailed description

To make the technical solution and advantages of the present invention clearer, the technical solution of the invention is described clearly and completely below with reference to the accompanying drawings:

Step 1: construct an end-to-end neural network whose input is raw speech and whose network is CNN+LSTM. Specifically, the input speech is divided into segments of the same size (for example 40 ms) and mean-normalized. The corresponding convolutional neural network consists of 8 modules, each composed of a one-dimensional convolution layer, a ReLU activation layer, and a one-dimensional max-pooling layer; every convolution kernel has size 32, every pooling kernel has size 2, and the pooling stride is 2. An LSTM network then performs classification.
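The front end of Step 1 can be sketched in numpy. This is an illustrative toy, not the patent's implementation: the 16 kHz sample rate, the random kernel, and the non-overlapping segmentation are assumptions, and a single convolution module stands in for the full 8-module CNN and the LSTM classifier.

```python
import numpy as np

def segment_and_normalize(speech, sr=16000, seg_ms=40):
    """Split raw speech into equal 40 ms segments and mean-normalize each segment."""
    seg_len = int(sr * seg_ms / 1000)                 # 640 samples at 16 kHz
    n_seg = len(speech) // seg_len
    segs = speech[:n_seg * seg_len].reshape(n_seg, seg_len)
    return segs - segs.mean(axis=1, keepdims=True)    # per-segment mean normalization

def conv1d_relu_maxpool(x, kernel, pool=2, stride=2):
    """One module of the step-1 network: 1-D convolution, ReLU, then max pooling
    (pooling kernel size 2, stride 2, as in the patent)."""
    k = len(kernel)
    conv = np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])
    relu = np.maximum(conv, 0.0)
    n = (len(relu) - pool) // stride + 1
    return np.array([relu[i * stride:i * stride + pool].max() for i in range(n)])

sr = 16000
speech = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)   # 1 s synthetic tone
segs = segment_and_normalize(speech, sr)
kernel = np.random.default_rng(0).standard_normal(32)   # kernel size 32 per the patent
out = conv1d_relu_maxpool(segs[0], kernel)
print(segs.shape, out.shape)                             # (25, 640) (304,)
```

In the full network, eight such modules would be stacked and their output sequence fed to an LSTM for the cold/non-cold decision.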

Step 2: construct an end-to-end neural network whose input is the speech spectrum and whose network is CNN+LSTM. Specifically, the input speech is divided into segments of the same size, and a fast Fourier transform yields the spectrogram of each segment. The convolutional neural network consists of 6 modules, each composed of a two-dimensional convolution layer, a ReLU activation layer, and a two-dimensional max-pooling layer; the first convolution layer uses a 7×7 kernel, the second a 5×5 kernel, and the remaining four layers 3×3 kernels, while all max-pooling layers use 3×3 kernels with stride 2. An LSTM network finally performs classification.
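The spectrogram input of Step 2 (segment, then FFT each segment) can be sketched as follows. The sample rate, 40 ms segment length, and Hann window are assumptions for illustration; the patent does not specify a window.

```python
import numpy as np

def speech_spectrogram(speech, sr=16000, seg_ms=40):
    """Cut speech into equal-length segments and take the magnitude FFT of each,
    producing the spectrogram that would feed the 2-D CNN."""
    seg_len = int(sr * seg_ms / 1000)
    n_seg = len(speech) // seg_len
    frames = speech[:n_seg * seg_len].reshape(n_seg, seg_len)
    window = np.hanning(seg_len)                 # window choice is an assumption
    return np.abs(np.fft.rfft(frames * window, axis=1))  # (n_segments, seg_len//2 + 1)

sr = 16000
t = np.arange(sr) / sr
speech = np.sin(2 * np.pi * 1000 * t)            # 1 kHz test tone
spec = speech_spectrogram(speech, sr)
peak_bin = spec[0].argmax()
print(spec.shape, peak_bin * sr / 640)           # frequency resolution is sr/640 = 25 Hz
```

The resulting (time, frequency) matrix is what the six 2-D convolution modules would consume.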

Step 3: construct an end-to-end neural network whose input is the speech spectrum and whose network is a CNN plus a fully connected network. Specifically, the input speech is divided into segments of the same size, and a fast Fourier transform yields the spectrogram of each segment. The convolutional neural network consists of 6 modules, each composed of a two-dimensional convolution layer, a ReLU activation layer, and a two-dimensional max-pooling layer; the first convolution layer uses a 7×7 kernel, the second a 5×5 kernel, and the remaining four layers 3×3 kernels, while all max-pooling layers use 3×3 kernels with stride 2. The output then passes through a fully connected layer and is finally classified by Softmax.

Step 4: construct an end-to-end neural network whose input is the MFCC or CQCC features and whose network is an LSTM. The MFCC features are obtained by pre-emphasizing the speech, windowing and framing, applying a fast Fourier transform, filtering with a mel-scale triangular filter bank, taking logarithms, and applying a discrete cosine transform; the CQCC features are obtained by applying a constant-Q transform to the speech, computing the energy spectral density, taking logarithms, and applying a cosine transform. The extracted MFCC or CQCC features serve as the input to the neural network, and an LSTM network finally performs classification.
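The MFCC pipeline named in Step 4 (pre-emphasis, windowed framing, FFT, mel triangular filter bank, log, DCT) can be sketched end to end in numpy. Frame size, hop, filter count, and coefficient count are not given in the patent and are assumed here; the CQCC branch would follow the same shape with a constant-Q transform in place of the FFT.

```python
import numpy as np

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(speech, sr=16000, n_fft=512, hop=256, n_mels=26, n_ceps=13):
    """MFCC features following the patent's pipeline."""
    # 1. pre-emphasis
    x = np.append(speech[0], speech[1:] - 0.97 * speech[:-1])
    # 2. framing + Hamming window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hamming(n_fft)
    # 3. power spectrum via FFT
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2 / n_fft
    # 4. mel-scale triangular filter bank
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    # 5. log energies, 6. DCT-II, keep the first n_ceps coefficients
    logmel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return logmel @ dct.T                          # (n_frames, n_ceps)

speech = np.random.default_rng(1).standard_normal(16000)  # 1 s of noise as a stand-in
feats = mfcc(speech)
print(feats.shape)                                 # (61, 13)
```

Each row is the feature vector for one frame; the sequence of rows is what the LSTM classifier would consume.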

Step 5: fuse the above four networks to perform speaker cold-speech recognition.
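The patent does not specify the fusion rule for Step 5; score-level fusion by (weighted) averaging of each network's class posteriors is one common choice and can be sketched as follows. The posterior values below are hypothetical.

```python
import numpy as np

def fuse_scores(score_list, weights=None):
    """Score-level fusion of the four end-to-end networks: a weighted average of
    each network's [non-cold, cold] posteriors, followed by argmax."""
    scores = np.stack(score_list)                    # (n_systems, n_utts, n_classes)
    if weights is None:
        weights = np.full(len(score_list), 1.0 / len(score_list))  # equal weights
    fused = np.tensordot(weights, scores, axes=1)    # (n_utts, n_classes)
    return fused, fused.argmax(axis=1)               # class 1 = "cold speech"

# hypothetical posteriors from the four networks for 3 utterances
s1 = np.array([[0.8, 0.2], [0.4, 0.6], [0.3, 0.7]])
s2 = np.array([[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]])
s3 = np.array([[0.9, 0.1], [0.3, 0.7], [0.4, 0.6]])
s4 = np.array([[0.6, 0.4], [0.2, 0.8], [0.1, 0.9]])
fused, labels = fuse_scores([s1, s2, s3, s4])
print(fused.round(3), labels)                        # labels: [0 1 1]
```

Unequal weights (e.g. tuned on a development set) would slot into the same function.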

Claims (5)

1. A speaker cold-symptom recognition method fusing multiple end-to-end deep learning structures, comprising: S1, constructing an end-to-end neural network whose input is raw speech and whose network is a convolutional neural network plus an LSTM network; S2, constructing an end-to-end neural network whose input is the speech spectrum and whose network is a convolutional neural network plus an LSTM network; S3, constructing an end-to-end neural network whose input is the speech spectrum and whose network is a convolutional neural network plus a fully connected network; S4, constructing an end-to-end neural network whose input is the speech MFCC/CQCC features and whose network is an LSTM network; S5, fusing the above four end-to-end neural networks to recognize the speaker's cold symptoms.

2. The method of claim 1, wherein the end-to-end neural network of S1, whose input is raw speech and whose network is CNN+LSTM, is constructed as follows: the input speech is divided into segments of the same size (for example 40 ms) and mean-normalized; the corresponding convolutional neural network consists of 8 modules, each composed of a one-dimensional convolution layer, a ReLU activation layer, and a one-dimensional max-pooling layer, where every convolution kernel has size 32, every pooling kernel has size 2, and the pooling stride is 2; an LSTM network then performs classification.

3. The method of claim 1, wherein the end-to-end neural network of S2, whose input is the speech spectrum and whose network is CNN+LSTM, is constructed as follows: the input speech is divided into segments of the same size, and a fast Fourier transform yields the spectrogram of each segment; the convolutional neural network consists of 6 modules, each composed of a two-dimensional convolution layer, a ReLU activation layer, and a two-dimensional max-pooling layer, where the first convolution layer uses a 7×7 kernel, the second a 5×5 kernel, and the remaining four layers 3×3 kernels, while all max-pooling layers use 3×3 kernels with stride 2; an LSTM network finally performs classification.

4. The method of claim 1, wherein the end-to-end neural network of S3, whose input is the speech spectrum and whose network is a CNN plus a fully connected network, is constructed as follows: the input speech is divided into segments of the same size, and a fast Fourier transform yields the spectrogram of each segment; the convolutional neural network consists of 6 modules, each composed of a two-dimensional convolution layer, a ReLU activation layer, and a two-dimensional max-pooling layer, where the first convolution layer uses a 7×7 kernel, the second a 5×5 kernel, and the remaining four layers 3×3 kernels, while all max-pooling layers use 3×3 kernels with stride 2; the output then passes through a fully connected layer and is finally classified by Softmax.

5. The method of claim 1, wherein in S4 the MFCC features are obtained by pre-emphasizing the speech, windowing and framing, applying a fast Fourier transform, filtering with a mel-scale triangular filter bank, taking logarithms, and applying a discrete cosine transform, while the CQCC features are obtained by applying a constant-Q transform to the speech, computing the energy spectral density, taking logarithms, and applying a cosine transform; the extracted MFCC or CQCC features serve as the input to the neural network, and an LSTM network finally performs classification.
PCT/CN2018/076272 2017-03-13 2018-02-11 Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures Ceased WO2018166316A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710146957.0 2017-03-13
CN201710146957.0A CN107068167A (en) 2017-03-13 2017-03-13 Speaker cold-symptom recognition method fusing multiple end-to-end neural network structures

Publications (1)

Publication Number Publication Date
WO2018166316A1 true WO2018166316A1 (en) 2018-09-20

Family

ID=59621946

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/076272 Ceased WO2018166316A1 (en) 2017-03-13 2018-02-11 Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures

Country Status (2)

Country Link
CN (1) CN107068167A (en)
WO (1) WO2018166316A1 (en)


Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107068167A (en) * 2017-03-13 2017-08-18 广东顺德中山大学卡内基梅隆大学国际联合研究院 Speaker cold-symptom recognition method fusing multiple end-to-end neural network structures
CN108053841A (en) * 2017-10-23 2018-05-18 平安科技(深圳)有限公司 The method and application server of disease forecasting are carried out using voice
CN109960910B (en) * 2017-12-14 2021-06-08 Oppo广东移动通信有限公司 Voice processing method, device, storage medium and terminal device
CN109086892B (en) * 2018-06-15 2022-02-18 中山大学 General dependency tree-based visual problem reasoning model and system
CN109192226A (en) * 2018-06-26 2019-01-11 深圳大学 A kind of signal processing method and device
CN108899051B (en) * 2018-06-26 2020-06-16 北京大学深圳研究生院 A speech emotion recognition model and recognition method based on joint feature representation
CN109256118B (en) * 2018-10-22 2021-06-25 江苏师范大学 End-to-end Chinese dialect recognition system and method based on generative auditory model
CN109282837B (en) * 2018-10-24 2021-06-22 福州大学 Demodulation method of fiber Bragg grating interleaved spectrum based on LSTM network
CN111028859A (en) * 2019-12-15 2020-04-17 中北大学 A hybrid neural network vehicle recognition method based on audio feature fusion
CN116110437B (en) * 2023-04-14 2023-06-13 天津大学 Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5214743A (en) * 1989-10-25 1993-05-25 Hitachi, Ltd. Information processing apparatus
CN105139864A (en) * 2015-08-17 2015-12-09 北京天诚盛业科技有限公司 Voice recognition method and voice recognition device
CN106328122A (en) * 2016-08-19 2017-01-11 深圳市唯特视科技有限公司 Voice identification method using long-short term memory model recurrent neural network
CN107068167A (en) * 2017-03-13 2017-08-18 广东顺德中山大学卡内基梅隆大学国际联合研究院 Speaker cold-symptom recognition method fusing multiple end-to-end neural network structures

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TARA N. SAINATH ET AL.: "Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks", ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2015 IEEE INTERNATIONAL CONFERENCE ON, 6 August 2015 (2015-08-06), XP033187628, ISSN: 2379-190X *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10692502B2 (en) * 2017-03-03 2020-06-23 Pindrop Security, Inc. Method and apparatus for detecting spoofing conditions
US11488605B2 (en) 2017-03-03 2022-11-01 Pindrop Security, Inc. Method and apparatus for detecting spoofing conditions
US12488072B2 (en) 2020-04-15 2025-12-02 Pindrop Security, Inc. Passive and continuous multi-speaker voice biometrics
CN114299987A (en) * 2021-12-08 2022-04-08 中国科学技术大学 Training method of event analysis model, event analysis method and device thereof

Also Published As

Publication number Publication date
CN107068167A (en) 2017-08-18

Similar Documents

Publication Publication Date Title
WO2018166316A1 (en) Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures
CN112509564B (en) End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism
CN110245608B (en) An underwater target recognition method based on semi-tensor product neural network
CN112885372B (en) Intelligent diagnosis method, system, terminal and medium for power equipment fault sound
CN110459225B (en) Speaker recognition system based on CNN fusion characteristics
CN114863937B (en) Hybrid bird song recognition method based on deep transfer learning and XGBoost
WO2018227780A1 (en) Speech recognition method and device, computer device and storage medium
CN105096955B (en) A kind of rapid speaker identification method and system based on model-growth clustering
CN107331384A (en) Audio recognition method, device, computer equipment and storage medium
CN111048097B (en) Twin network voiceprint recognition method based on 3D convolution
CN110148408A (en) A kind of Chinese speech recognition method based on deep residual networks
CN115101076B (en) Speaker clustering method based on multi-scale channel separation convolution feature extraction
CN110349588A (en) A kind of LSTM-network voiceprint recognition method based on word embeddings
CN111785262B (en) Speaker age and gender classification method based on residual networks and fused features
CN113450830A (en) Voice emotion recognition method of convolution cyclic neural network with multiple attention mechanisms
CN110299142A (en) A kind of voiceprint recognition method and device based on network fusion
CN115440228A (en) Self-adaptive voiceprint recognition method and system
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
CN116153339B (en) A speech emotion recognition method and device based on improved attention mechanism
CN118072746A (en) Marine mammal call recognition and classification method based on feature fusion
CN113571095A (en) Speech emotion recognition method and system based on nested deep neural network
CN111653267A (en) Rapid language identification method based on time delay neural network
CN115249479B (en) Complex speech recognition method, system and terminal for power grid dispatching based on BRNN
CN118585889A (en) A ship type identification method and system based on ship radiated noise data
CN116842460A (en) Cough-related disease identification method and system based on attention mechanism and residual neural network

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 18768185

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: PCT application non-entry into the European phase

Ref document number: 18768185

Country of ref document: EP

Kind code of ref document: A1