WO2018166316A1 - Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures - Google Patents
- Publication number
- WO2018166316A1 (application PCT/CN2018/076272, CN2018076272W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- network
- neural network
- input
- speech
- speaker
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/66—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Description
Technical field
The invention relates to the field of speech processing technology and proposes a speaker cold symptom recognition method that fuses multiple end-to-end deep learning structures.
Background
1. Speaker recognition, also known as voiceprint recognition, is the technique of automatically identifying a speaker, using pattern recognition technology, from the speaker-specific information contained in speech. Current speaker recognition achieves good performance under experimental conditions, but in practice speech is affected by environmental noise and by the speaker's state of health, which reduces the robustness of existing speaker recognition techniques. A cold speech recognition method classifies incoming speech to decide whether it is cold-affected speech; by first making this decision and then performing speaker recognition, the robustness of speaker recognition can be improved.
2. In speech technology research, researchers always hope to find features that represent the target type, that is, characteristics of the target speech that clearly distinguish it from normal speech. Speech feature extraction extracts the speaker's voice and vocal tract characteristics. At present, the mainstream feature parameters, including MFCC, LPCC and CQCC, are all single features; they carry insufficient information to characterize a speaker's cold symptoms, which limits recognition accuracy. They also require a large amount of knowledge for distinguishing the target speech classes. Among speech recognition algorithms, methods based on vocal tract and speech models appeared earliest, but because of their complexity they did not achieve good practical results; model matching techniques such as dynamic time warping, hidden Markov models and vector quantization then began to deliver good recognition performance. Studying feature extraction and pattern classification separately is the usual approach in recognition research, but it suffers from mismatches between features and models, difficult training, and features that are hard to find; the classical recognition framework has all of these problems.
3. In recent years, with the development of deep learning, recognition of images and speech based on deep neural networks has shown great power, and a series of neural network structures have been proposed, such as autoencoders, convolutional neural networks and recurrent neural networks. Many researchers have found that learning from speech with neural networks yields hidden structural features that describe the speech better. End-to-end recognition handles feature learning and classification jointly, with as little prior knowledge as possible, and achieves good recognition results.
Summary of the invention
Existing recognition techniques study features and pattern classification separately, which leads to mismatches between features and models, difficult training, and features that are hard to find. The present invention therefore proposes a speaker cold symptom recognition method that fuses multiple end-to-end deep learning structures: four different end-to-end deep learning networks are constructed, and the four network structures are finally fused to recognize the speaker's cold symptoms.
The four end-to-end deep learning structures are: 1. the input is the speech waveform and the network is a multi-layer convolutional neural network (CNN) followed by a long short-term memory (LSTM) network; 2. the input is the speech spectrum and the network is a multi-layer CNN followed by an LSTM network; 3. the input is the speech spectrum and the network is a multi-layer CNN followed by a fully connected network; 4. the input is the Mel-frequency cepstral coefficients (MFCC) and constant-Q cepstral coefficients (CQCC) and the network is an LSTM network.
The beneficial effects of the invention are as follows. Given the uncertainty of traditional features, the output obtained by neural network training expresses the characteristics of a speaker's cold symptoms better, and the input is relatively simple, requiring little feature processing. Because speech carries temporal information, classification with an LSTM network gives better results. By unifying feature learning and pattern classification, the whole speaker cold symptom recognition process becomes simpler and faster, and has broad application prospects.
Brief description of the drawings
Figure 1 shows the pipeline for extracting Mel-frequency cepstral coefficients (MFCC) from speech.
Figure 2 shows the pipeline for extracting constant-Q cepstral coefficients (CQCC) from speech.
Figure 3 shows the first end-to-end neural network: the input is the speech waveform and the network is CNN+LSTM.
Figure 4 shows the second end-to-end neural network: the input is the speech spectrum and the network is CNN+LSTM.
Figure 5 shows the third end-to-end neural network: the input is the speech spectrum and the network is a CNN followed by a fully connected network.
Figure 6 shows the fourth end-to-end neural network: the input is the MFCC or CQCC features and the network is an LSTM.
Detailed description
To make the technical solution and advantages of the invention clearer, the technical solution is described completely below with reference to the accompanying drawings.
Step 1: Construct the end-to-end neural network whose input is the speech waveform and whose network is CNN+LSTM. Specifically, the input speech is divided into segments of equal size, for example 40 ms, and each segment is mean-normalized. The corresponding convolutional neural network consists of 8 modules, each composed of a one-dimensional convolution layer, a ReLU activation layer, and a one-dimensional max-pooling layer; each convolution kernel has size 32, each pooling kernel has size 2, and the pooling stride is 2. Classification is then performed with an LSTM network.
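As an illustration of the front end of Step 1, the segmentation and mean normalization, and the time-axis length left for the LSTM after the eight conv+pool modules, can be sketched in NumPy. The 16 kHz sampling rate and the use of 'same' convolutions are assumptions; the patent only fixes the 40 ms segment length, kernel size 32, and pooling size/stride 2.

```python
import numpy as np

SR = 16000        # assumed sampling rate; the patent does not fix one
SEG_MS = 40       # segment length named in Step 1

def segment_and_normalize(speech):
    """Split speech into equal 40 ms segments and mean-normalize each segment."""
    seg_len = SR * SEG_MS // 1000                    # 640 samples per segment
    n = len(speech) // seg_len
    segs = speech[:n * seg_len].reshape(n, seg_len)
    return segs - segs.mean(axis=1, keepdims=True)   # mean normalization

def time_length_after_cnn(length, modules=8, pool=2):
    """Time-axis length after the 8 conv + max-pool modules, assuming 'same'
    convolutions (length preserved) and pooling with kernel 2, stride 2
    (length halved per module)."""
    for _ in range(modules):
        length //= pool
    return length

speech = np.random.randn(SR)                     # 1 s of dummy audio
segs = segment_and_normalize(speech)
print(segs.shape)                                # (25, 640)
print(time_length_after_cnn(segs.shape[1]))      # 2
```

A 40 ms segment at 16 kHz is only 640 samples, so after eight halvings the LSTM sees a very short sequence; longer inputs or fewer pooling layers would leave more temporal context.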
Step 2: Construct the end-to-end neural network whose input is the speech spectrum and whose network is CNN+LSTM. Specifically, the input speech is divided into segments of equal size and a fast Fourier transform is applied to obtain the spectrogram of each segment. The convolutional neural network consists of 6 modules, each composed of a two-dimensional convolution layer, a ReLU activation layer, and a two-dimensional max-pooling layer. The first convolution layer uses a 7×7 kernel, the second a 5×5 kernel, and the remaining 4 layers 3×3 kernels; all max-pooling layers use 3×3 pooling kernels with a stride of 2. Classification is finally performed with an LSTM network.
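The spectrogram input described in Step 2 (equal-size segments followed by a fast Fourier transform) can be sketched in NumPy as follows. The frame length, hop, FFT size and Hann window are illustrative choices; the patent fixes none of them.

```python
import numpy as np

def log_spectrogram(speech, frame=400, hop=160, n_fft=512):
    """Log-magnitude spectrogram: cut the speech into equal frames and apply
    a fast Fourier transform to each (Step 2 front end)."""
    idx = range(0, len(speech) - frame + 1, hop)
    frames = np.stack([speech[i:i + frame] for i in idx]) * np.hanning(frame)
    mag = np.abs(np.fft.rfft(frames, n_fft))    # FFT magnitude per frame
    return np.log(mag + 1e-8)                   # shape (n_frames, n_fft//2 + 1)

S = log_spectrogram(np.random.randn(16000))     # 1 s of dummy audio
print(S.shape)                                  # (98, 257)
```

The resulting (time, frequency) matrix is what the 7×7 / 5×5 / 3×3 convolution stack would consume as a single-channel image.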
Step 3: Construct the end-to-end neural network whose input is the speech spectrum and whose network is a CNN followed by a fully connected network. Specifically, the input speech is divided into segments of equal size and a fast Fourier transform is applied to obtain the spectrogram of each segment. The convolutional neural network consists of 6 modules, each composed of a two-dimensional convolution layer, a ReLU activation layer, and a two-dimensional max-pooling layer. The first convolution layer uses a 7×7 kernel, the second a 5×5 kernel, and the remaining 4 layers 3×3 kernels; all max-pooling layers use 3×3 pooling kernels with a stride of 2. The output then passes through a fully connected layer, and classification is finally performed by Softmax.
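Step 3 differs from Step 2 only in its classifier head: a fully connected layer with Softmax instead of an LSTM. A minimal NumPy sketch of that head is below; the 256-dimensional flattened CNN feature and the two-class (cold / normal) output are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
cnn_features = rng.standard_normal(256)      # flattened CNN output (size assumed)
W = rng.standard_normal((256, 2)) * 0.1      # fully connected layer, 2 classes
b = np.zeros(2)                              # (cold speech, normal speech)
probs = softmax(cnn_features @ W + b)
print(probs.shape)                           # (2,)
```

In training, W and b would of course be learned jointly with the convolutional layers, end to end.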
Step 4: Construct the end-to-end neural network whose input is the MFCC or CQCC features and whose network is an LSTM. The MFCC features are obtained from the speech by pre-emphasis, windowing and framing, fast Fourier transform, Mel-scale triangular filter bank filtering, a logarithm operation, and a discrete cosine transform; the CQCC features are obtained by applying a constant-Q transform to the speech, computing the energy spectral density, taking the logarithm, and applying a cosine transform. The MFCC or CQCC features extracted from the speech serve as the input to the neural network, and classification is finally performed with an LSTM network.
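The MFCC pipeline named in Step 4 (pre-emphasis, windowed framing, FFT, Mel-scale triangular filter bank, logarithm, discrete cosine transform) can be sketched end to end in NumPy. All sizes here (16 kHz, 25 ms frames, 26 filters, 13 cepstra, 0.97 pre-emphasis) are conventional choices, not values fixed by the patent.

```python
import numpy as np

def mel_filterbank(n_filt=26, n_fft=512, sr=16000):
    """Mel-scale triangular filter bank."""
    mel_max = 2595 * np.log10(1 + sr / 2 / 700)
    hz = 700 * (10 ** (np.linspace(0, mel_max, n_filt + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_filt, n_fft // 2 + 1))
    for m in range(1, n_filt + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc(speech, sr=16000, frame=400, hop=160, n_fft=512, n_filt=26, n_ceps=13):
    speech = np.append(speech[0], speech[1:] - 0.97 * speech[:-1])   # pre-emphasis
    frames = np.stack([speech[i:i + frame] * np.hamming(frame)       # windowing + framing
                       for i in range(0, len(speech) - frame + 1, hop)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft          # fast Fourier transform
    log_mel = np.log(power @ mel_filterbank(n_filt, n_fft, sr).T + 1e-8)  # filter bank + log
    k = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2 * n_filt))  # DCT-II
    return log_mel @ dct.T                                           # (n_frames, n_ceps)

feats = mfcc(np.random.randn(16000))    # 1 s of dummy audio
print(feats.shape)                      # (98, 13)
```

The resulting per-frame coefficient sequence is exactly the kind of variable-length input an LSTM classifier consumes; CQCC extraction follows the analogous constant-Q pipeline.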
Step 5: Fuse the above four networks to perform speaker cold speech recognition.
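The patent does not specify the fusion rule for Step 5; one common choice is score-level fusion, i.e. a (possibly weighted) average of the four networks' class posteriors, sketched below with hypothetical posterior values.

```python
import numpy as np

def fuse_posteriors(posteriors, weights=None):
    """Score-level fusion: weighted average of per-network class posteriors.
    This averaging rule is an assumption; the patent only says the four
    networks are fused."""
    P = np.asarray(posteriors, dtype=float)              # (n_networks, n_classes)
    w = np.full(len(P), 1 / len(P)) if weights is None else np.asarray(weights)
    return w @ P

# Hypothetical (cold, normal) posteriors from the four end-to-end networks:
nets = [[0.8, 0.2], [0.6, 0.4], [0.7, 0.3], [0.9, 0.1]]
fused = fuse_posteriors(nets)
print(fused)                     # [0.75 0.25]
print(int(np.argmax(fused)))     # 0 -> classified as cold speech
```

Weighted variants (e.g. weights tuned on a validation set) drop in by passing `weights`.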
Claims (5)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710146957.0 | 2017-03-13 | ||
| CN201710146957.0A CN107068167A (en) | 2017-03-13 | 2017-03-13 | Merge speaker's cold symptoms recognition methods of a variety of end-to-end neural network structures |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2018166316A1 true WO2018166316A1 (en) | 2018-09-20 |
Family
ID=59621946
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2018/076272 Ceased WO2018166316A1 (en) | 2017-03-13 | 2018-02-11 | Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN107068167A (en) |
| WO (1) | WO2018166316A1 (en) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10692502B2 (en) * | 2017-03-03 | 2020-06-23 | Pindrop Security, Inc. | Method and apparatus for detecting spoofing conditions |
| CN114299987A (en) * | 2021-12-08 | 2022-04-08 | 中国科学技术大学 | Training method of event analysis model, event analysis method and device thereof |
| US12488072B2 (en) | 2020-04-15 | 2025-12-02 | Pindrop Security, Inc. | Passive and continuous multi-speaker voice biometrics |
Families Citing this family (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107068167A (en) * | 2017-03-13 | 2017-08-18 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Merge speaker's cold symptoms recognition methods of a variety of end-to-end neural network structures |
| CN108053841A (en) * | 2017-10-23 | 2018-05-18 | 平安科技(深圳)有限公司 | The method and application server of disease forecasting are carried out using voice |
| CN109960910B (en) * | 2017-12-14 | 2021-06-08 | Oppo广东移动通信有限公司 | Voice processing method, device, storage medium and terminal device |
| CN109086892B (en) * | 2018-06-15 | 2022-02-18 | 中山大学 | General dependency tree-based visual problem reasoning model and system |
| CN109192226A (en) * | 2018-06-26 | 2019-01-11 | 深圳大学 | A kind of signal processing method and device |
| CN108899051B (en) * | 2018-06-26 | 2020-06-16 | 北京大学深圳研究生院 | A speech emotion recognition model and recognition method based on joint feature representation |
| CN109256118B (en) * | 2018-10-22 | 2021-06-25 | 江苏师范大学 | End-to-end Chinese dialect recognition system and method based on generative auditory model |
| CN109282837B (en) * | 2018-10-24 | 2021-06-22 | 福州大学 | Demodulation method of fiber Bragg grating interleaved spectrum based on LSTM network |
| CN111028859A (en) * | 2019-12-15 | 2020-04-17 | 中北大学 | A hybrid neural network vehicle recognition method based on audio feature fusion |
| CN116110437B (en) * | 2023-04-14 | 2023-06-13 | 天津大学 | Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5214743A (en) * | 1989-10-25 | 1993-05-25 | Hitachi, Ltd. | Information processing apparatus |
| CN105139864A (en) * | 2015-08-17 | 2015-12-09 | 北京天诚盛业科技有限公司 | Voice recognition method and voice recognition device |
| CN106328122A (en) * | 2016-08-19 | 2017-01-11 | 深圳市唯特视科技有限公司 | Voice identification method using long-short term memory model recurrent neural network |
| CN107068167A (en) * | 2017-03-13 | 2017-08-18 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Merge speaker's cold symptoms recognition methods of a variety of end-to-end neural network structures |
- 2017
  - 2017-03-13 CN CN201710146957.0A patent/CN107068167A/en active Pending
- 2018
  - 2018-02-11 WO PCT/CN2018/076272 patent/WO2018166316A1/en not_active Ceased
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5214743A (en) * | 1989-10-25 | 1993-05-25 | Hitachi, Ltd. | Information processing apparatus |
| CN105139864A (en) * | 2015-08-17 | 2015-12-09 | 北京天诚盛业科技有限公司 | Voice recognition method and voice recognition device |
| CN106328122A (en) * | 2016-08-19 | 2017-01-11 | 深圳市唯特视科技有限公司 | Voice identification method using long-short term memory model recurrent neural network |
| CN107068167A (en) * | 2017-03-13 | 2017-08-18 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Merge speaker's cold symptoms recognition methods of a variety of end-to-end neural network structures |
Non-Patent Citations (1)
| Title |
|---|
| TARA N.: "Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks", ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2015 IEEE INTERNATIONAL CONFERENCE ON, 6 August 2015 (2015-08-06), XP033187628, ISSN: 2379-190X * |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10692502B2 (en) * | 2017-03-03 | 2020-06-23 | Pindrop Security, Inc. | Method and apparatus for detecting spoofing conditions |
| US11488605B2 (en) | 2017-03-03 | 2022-11-01 | Pindrop Security, Inc. | Method and apparatus for detecting spoofing conditions |
| US12488072B2 (en) | 2020-04-15 | 2025-12-02 | Pindrop Security, Inc. | Passive and continuous multi-speaker voice biometrics |
| CN114299987A (en) * | 2021-12-08 | 2022-04-08 | 中国科学技术大学 | Training method of event analysis model, event analysis method and device thereof |
Also Published As
| Publication number | Publication date |
|---|---|
| CN107068167A (en) | 2017-08-18 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2018166316A1 (en) | Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures | |
| CN112509564B (en) | End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism | |
| CN110245608B (en) | An underwater target recognition method based on semi-tensor product neural network | |
| CN112885372B (en) | Intelligent diagnosis method, system, terminal and medium for power equipment fault sound | |
| CN110459225B (en) | Speaker recognition system based on CNN fusion characteristics | |
| CN114863937B (en) | Hybrid bird song recognition method based on deep transfer learning and XGBoost | |
| WO2018227780A1 (en) | Speech recognition method and device, computer device and storage medium | |
| CN105096955B (en) | A kind of speaker's method for quickly identifying and system based on model growth cluster | |
| CN107331384A (en) | Audio recognition method, device, computer equipment and storage medium | |
| CN111048097B (en) | Twin network voiceprint recognition method based on 3D convolution | |
| CN110148408A (en) | A kind of Chinese speech recognition method based on depth residual error | |
| CN115101076B (en) | Speaker clustering method based on multi-scale channel separation convolution feature extraction | |
| CN110349588A (en) | A kind of LSTM network method for recognizing sound-groove of word-based insertion | |
| CN111785262B (en) | Speaker age and gender classification method based on residual error network and fusion characteristics | |
| CN113450830A (en) | Voice emotion recognition method of convolution cyclic neural network with multiple attention mechanisms | |
| CN110299142A (en) | A kind of method for recognizing sound-groove and device based on the network integration | |
| CN115440228A (en) | Self-adaptive voiceprint recognition method and system | |
| Zheng et al. | MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios | |
| CN116153339B (en) | A speech emotion recognition method and device based on improved attention mechanism | |
| CN118072746A (en) | Marine mammal call recognition and classification method based on feature fusion | |
| CN113571095A (en) | Speech emotion recognition method and system based on nested deep neural network | |
| CN111653267A (en) | Rapid language identification method based on time delay neural network | |
| CN115249479B (en) | Complex speech recognition method, system and terminal for power grid dispatching based on BRNN | |
| CN118585889A (en) | A ship type identification method and system based on ship radiated noise data | |
| CN116842460A (en) | Cough-related disease identification method and system based on attention mechanism and residual neural network |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18768185; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 18768185; Country of ref document: EP; Kind code of ref document: A1 |