WO2018166316A1 - Speaker cold symptom recognition method fusing multiple end-to-end neural network structures - Google Patents

Speaker cold symptom recognition method fusing multiple end-to-end neural network structures

Info

Publication number
WO2018166316A1
WO2018166316A1 (PCT/CN2018/076272)
Authority
WO
WIPO (PCT)
Prior art keywords
network
neural network
input
speech
speaker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2018/076272
Other languages
English (en)
Chinese (zh)
Inventor
李明
倪志东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Shunde Sun Yat-Sen University Research Institute
Sun Yat Sen University
SYSU CMU Shunde International Joint Research Institute
Original Assignee
Foshan Shunde Sun Yat-Sen University Research Institute
Sun Yat Sen University
SYSU CMU Shunde International Joint Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan Shunde Sun Yat-Sen University Research Institute, Sun Yat Sen University, SYSU CMU Shunde International Joint Research Institute filed Critical Foshan Shunde Sun Yat-Sen University Research Institute
Publication of WO2018166316A1 publication Critical patent/WO2018166316A1/fr
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: specially adapted for particular use
    • G10L25/51: for comparison or discrimination
    • G10L25/66: for extracting parameters related to health condition
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/24: the extracted parameters being the cepstrum
    • G10L25/27: characterised by the analysis technique
    • G10L25/30: using neural networks

Definitions

  • The invention relates to the field of speech processing technology and proposes a speaker cold symptom recognition method that fuses multiple end-to-end deep learning structures.
  • Speaker recognition, also known as voiceprint recognition, refers to the technique of automatically identifying a speaker by pattern recognition, exploiting the speaker-specific information carried in the voice.
  • Current speaker recognition technology achieves good performance under experimental conditions, but in practice speech is affected by environmental noise and by the speaker's health condition, so the robustness of existing speaker recognition techniques is limited; this motivates a cold speech recognition method.
  • Robustness can be improved by first using the cold speech recognition method to decide whether an utterance is cold speech, and only then performing speaker recognition.
  • Speech feature extraction extracts the speaker's speech and vocal characteristics.
  • Mainstream characteristic parameters, including MFCC, LPCC and CQCC, are all based on a single feature; they lack information that characterizes the speaker's cold symptoms, which hurts recognition accuracy.
  • A large amount of knowledge about the target speech classes is required.
  • Methods based on vocal tract models and speech production models appeared earliest, but owing to model complexity they have not achieved good practical results; model matching methods such as dynamic time warping, hidden Markov models and vector quantization subsequently began to deliver good recognition results.
  • Separating feature extraction from pattern classification is a common approach in recognition research, but it suffers from mismatch between features and models, difficult training, and features that are hard to design.
  • The classical recognition framework exhibits the above problems.
  • To address these problems, the present invention proposes a speaker cold symptom recognition method that fuses multiple end-to-end deep learning structures. Four different end-to-end deep learning networks are constructed and finally merged for speaker cold symptom recognition.
  • The four end-to-end deep learning structures are: (1) the input is raw speech and the network is a multi-layer convolutional neural network followed by a long short-term memory network; (2) the input is the speech spectrum and the network is a multi-layer convolutional neural network followed by a long short-term memory network; (3) the input is the speech spectrum and the network is a multi-layer convolutional neural network followed by a fully connected network; (4) the input is the Mel-frequency cepstral coefficients and constant-Q cepstral coefficients, and the network is a long short-term memory network.
  • The beneficial effects of the present invention are: given the uncertainty of traditional hand-crafted features, the representations learned through neural network training better express the characteristics of the speaker's cold symptoms, and the input is relatively simple, requiring little feature processing. Because speech carries temporal information, long short-term memory networks give better results. By unifying feature learning and pattern classification, the whole speaker cold symptom recognition process becomes simpler and faster, with broad application prospects.
  • FIG. 1 shows the flow of extracting Mel-frequency cepstral coefficients (MFCC) from speech.
  • FIG. 2 shows the flow of extracting constant-Q cepstral coefficients (CQCC) from speech.
  • Figure 3 shows the first end-to-end neural network: the input is raw speech and the network is CNN+LSTM.
  • Figure 4 shows the second end-to-end neural network: the input is the speech spectrum and the network is CNN+LSTM.
  • Figure 5 shows the third end-to-end neural network: the input is the speech spectrum and the network is CNN + fully connected network.
  • Figure 6 shows the fourth end-to-end neural network: the input is the Mel-frequency cepstral coefficients or the constant-Q cepstral coefficients, and the network is an LSTM.
  • Step 1: Construct an end-to-end neural network whose input is raw speech and whose network is CNN+LSTM. Specifically, the input speech is divided into segments of the same size, such as 40 ms, which are averaged and fed to the corresponding convolutional neural network. The network consists of 8 modules, each composed of a one-dimensional convolution layer, a ReLU activation layer and a one-dimensional max-pooling layer. Each convolution kernel has size 32, each pooling kernel has size 2, and the pooling stride is 2. A long short-term memory network then performs the classification.
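The Step 1 architecture can be sketched in PyTorch as follows. The patent fixes only the module count (8), the convolution kernel size (32) and the pooling (size 2, stride 2); the channel count, LSTM hidden size, two-class output and the 16 kHz sampling rate (so a 40 ms segment is 640 samples) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RawWaveCNNLSTM(nn.Module):
    """Sketch of the first network: raw speech in, 8 one-dimensional
    Conv/ReLU/MaxPool modules, then an LSTM classifier."""

    def __init__(self, n_classes: int = 2, channels: int = 16, hidden: int = 64):
        super().__init__()
        blocks, in_ch = [], 1
        for _ in range(8):  # 8 modules: Conv1d -> ReLU -> MaxPool1d
            blocks += [
                nn.Conv1d(in_ch, channels, kernel_size=32, padding=16),
                nn.ReLU(),
                nn.MaxPool1d(kernel_size=2, stride=2),  # pool size 2, stride 2
            ]
            in_ch = channels
        self.cnn = nn.Sequential(*blocks)
        self.lstm = nn.LSTM(channels, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, wave: torch.Tensor) -> torch.Tensor:
        # wave: (batch, samples); a 40 ms segment at 16 kHz is 640 samples
        x = self.cnn(wave.unsqueeze(1))   # (batch, channels, time')
        x = x.transpose(1, 2)             # (batch, time', channels)
        _, (h, _) = self.lstm(x)          # take the last hidden state
        return self.out(h[-1])            # (batch, n_classes)
```

A batch of four 40 ms segments, `model(torch.randn(4, 640))`, yields a `(4, 2)` score tensor, one cold/non-cold score pair per segment.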
  • Step 2: Construct an end-to-end neural network whose input is the speech spectrum and whose network is CNN+LSTM. Specifically, the input speech is divided into segments of the same size and a fast Fourier transform is applied to obtain the spectrum of each speech segment.
  • The convolutional neural network consists of six modules, each composed of a two-dimensional convolution layer, a ReLU activation layer and a two-dimensional max-pooling layer. The first convolution layer uses a 7×7 kernel, the second a 5×5 kernel and the remaining four layers 3×3 kernels; all max-pooling layers use a 3×3 pooling kernel with stride 2. Finally, an LSTM network performs the classification.
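A PyTorch sketch of the Step 2 network follows. The kernel sizes (7×7, 5×5, four 3×3) and the 3×3 stride-2 pooling come from the text; the channel count, hidden size and the assumed 128×128 spectrogram input are illustrative.

```python
import torch
import torch.nn as nn

class SpectrogramCNNLSTM(nn.Module):
    """Sketch of the second network: segment spectrogram in, six
    Conv2d/ReLU/MaxPool2d modules, then an LSTM classifier over time."""

    def __init__(self, n_classes=2, channels=8, hidden=64, n_freq=128):
        super().__init__()
        blocks, in_ch = [], 1
        for k in (7, 5, 3, 3, 3, 3):  # kernel sizes per the text
            blocks += [
                nn.Conv2d(in_ch, channels, kernel_size=k, padding=k // 2),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1),  # 3x3, stride 2
            ]
            in_ch = channels
        self.cnn = nn.Sequential(*blocks)
        # six stride-2 pools shrink the frequency axis by 2**6 = 64
        self.lstm = nn.LSTM(channels * (n_freq // 64), hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, spec):                  # spec: (batch, 1, n_freq, frames)
        x = self.cnn(spec)                    # (batch, C, F', T')
        x = x.permute(0, 3, 1, 2).flatten(2)  # (batch, T', C*F') time sequence
        _, (h, _) = self.lstm(x)
        return self.out(h[-1])
```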
  • Step 3: Construct an end-to-end neural network whose input is the speech spectrum and whose network is CNN + fully connected network. Specifically, the input speech is divided into segments of the same size and a fast Fourier transform is applied to obtain the spectrum of each speech segment.
  • The convolutional neural network consists of six modules, each composed of a two-dimensional convolution layer, a ReLU activation layer and a two-dimensional max-pooling layer. The first convolution layer uses a 7×7 kernel, the second a 5×5 kernel and the remaining four layers 3×3 kernels; all max-pooling layers use a 3×3 pooling kernel with stride 2. After a fully connected layer, classification is finally performed by Softmax.
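The Step 3 network reuses the same six-module 2D CNN as Step 2, with a fully connected layer and Softmax in place of the LSTM. Channel count and the 128×128 input size are again illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class SpectrogramCNNFC(nn.Module):
    """Sketch of the third network: six-module 2D CNN, then a fully
    connected layer and Softmax for classification."""

    def __init__(self, n_classes=2, channels=8, n_freq=128, n_frames=128):
        super().__init__()
        blocks, in_ch = [], 1
        for k in (7, 5, 3, 3, 3, 3):  # 7x7, 5x5, then four 3x3 kernels
            blocks += [
                nn.Conv2d(in_ch, channels, kernel_size=k, padding=k // 2),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            ]
            in_ch = channels
        self.cnn = nn.Sequential(*blocks)
        # six stride-2 pools shrink each spatial axis by 2**6 = 64
        flat = channels * (n_freq // 64) * (n_frames // 64)
        self.fc = nn.Linear(flat, n_classes)

    def forward(self, spec):                      # spec: (batch, 1, freq, frames)
        x = self.cnn(spec).flatten(1)             # flatten CNN feature maps
        return torch.softmax(self.fc(x), dim=-1)  # per-class probabilities
```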
  • Step 4: Construct an end-to-end neural network whose input is MFCC or CQCC features and whose network is an LSTM.
  • The MFCC features are obtained by pre-emphasizing the speech, windowed framing, fast Fourier transform, filtering with a Mel-scale triangular filter bank, taking the logarithm, and finally applying a discrete cosine transform. The CQCC features are obtained by applying a constant-Q transform to the speech, computing the energy spectral density, taking the logarithm and applying a cosine transform.
  • The MFCC or CQCC features extracted from the speech are used as input to the neural network and finally classified by a long short-term memory network.
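The MFCC pipeline described above (pre-emphasis, windowed framing, FFT, Mel-scale triangular filter bank, logarithm, DCT) can be sketched with NumPy/SciPy as follows; all parameter values (frame length, filter count, FFT size, etc.) are common defaults assumed here, not taken from the patent. The CQCC path is analogous but starts from a constant-Q transform and is not shown.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, n_filters=26, n_ceps=13,
         frame_len=0.025, frame_step=0.010, pre_emph=0.97, nfft=512):
    """Minimal MFCC extraction following the steps in the text."""
    # 1. pre-emphasis
    signal = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # 2. framing + Hamming window
    flen, fstep = int(sr * frame_len), int(sr * frame_step)
    n_frames = 1 + (len(signal) - flen) // fstep
    frames = np.stack([signal[i * fstep:i * fstep + flen] for i in range(n_frames)])
    frames = frames * np.hamming(flen)
    # 3. FFT power spectrum
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # 4. Mel-scale triangular filter bank
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((nfft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    # 5. logarithm of filter-bank energies, 6. DCT -> cepstral coefficients
    log_energy = np.log(power @ fbank.T + 1e-10)
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

For one second of 16 kHz audio this yields a 98×13 feature matrix (98 frames, 13 coefficients), which can then be fed frame-by-frame to the LSTM classifier.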
  • Step 5: Fuse the above four networks to perform speaker cold symptom recognition.
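The patent states that the four networks are combined but does not specify the fusion rule; score-level fusion by weighted averaging of each network's per-class outputs is one common choice and is assumed in this sketch.

```python
import numpy as np

def fuse_scores(score_list, weights=None):
    """Score-level fusion: weighted average of per-class score vectors
    from several classifiers, returning (predicted label, fused scores)."""
    scores = np.stack(score_list)                  # (n_models, n_classes)
    if weights is None:                            # default: equal weights
        weights = np.full(len(score_list), 1.0 / len(score_list))
    fused = np.tensordot(weights, scores, axes=1)  # weighted average per class
    return int(np.argmax(fused)), fused

# four hypothetical per-class probability vectors, one per network
outputs = [np.array([0.9, 0.1]), np.array([0.6, 0.4]),
           np.array([0.8, 0.2]), np.array([0.3, 0.7])]
label, fused = fuse_scores(outputs)
print(label, fused)  # 0 [0.65 0.35]
```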

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed is a speaker cold symptom recognition method fusing multiple end-to-end neural network structures, built from four end-to-end neural networks. The method comprises the following steps: when the input is the original speech or the speech spectrum, extracting optimal features by means of a convolutional neural network and then classifying by means of a long short-term memory network or a fully connected network; when the input is Mel-frequency cepstral coefficients (MFCC) or constant-Q cepstral coefficients (CQCC), classifying directly by means of the long short-term memory network; and then fusing these systems together. The whole method integrates feature extraction and pattern classification, so that the entire speaker cold symptom recognition process is simpler and faster.
PCT/CN2018/076272 2017-03-13 2018-02-11 Speaker cold symptom recognition method fusing multiple end-to-end neural network structures Ceased WO2018166316A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710146957.0 2017-03-13
CN201710146957.0A CN107068167A (zh) 2017-03-13 2017-03-13 Speaker cold symptom recognition method fusing multiple end-to-end neural network structures

Publications (1)

Publication Number Publication Date
WO2018166316A1 true WO2018166316A1 (fr) 2018-09-20

Family

ID=59621946

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/076272 Ceased WO2018166316A1 (fr) Speaker cold symptom recognition method fusing multiple end-to-end neural network structures

Country Status (2)

Country Link
CN (1) CN107068167A (fr)
WO (1) WO2018166316A1 (fr)


Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107068167A (zh) * 2017-03-13 2017-08-18 广东顺德中山大学卡内基梅隆大学国际联合研究院 Speaker cold symptom recognition method fusing multiple end-to-end neural network structures
CN108053841A (zh) * 2017-10-23 2018-05-18 平安科技(深圳)有限公司 Method for disease prediction using speech, and application server
CN109960910B (zh) * 2017-12-14 2021-06-08 Oppo广东移动通信有限公司 Speech processing method and apparatus, storage medium, and terminal device
CN109086892B (zh) * 2018-06-15 2022-02-18 中山大学 Visual question reasoning model and system based on general dependency trees
CN109192226A (zh) * 2018-06-26 2019-01-11 深圳大学 Signal processing method and apparatus
CN108899051B (zh) * 2018-06-26 2020-06-16 北京大学深圳研究生院 Speech emotion recognition model and recognition method based on joint feature representation
CN109256118B (zh) * 2018-10-22 2021-06-25 江苏师范大学 End-to-end Chinese dialect recognition system and method based on a generative auditory model
CN109282837B (zh) * 2018-10-24 2021-06-22 福州大学 Demodulation method for interleaved spectra of fiber Bragg gratings based on an LSTM network
CN111028859A (zh) * 2019-12-15 2020-04-17 中北大学 Hybrid neural network vehicle type recognition method based on audio feature fusion
CN116110437B (zh) * 2023-04-14 2023-06-13 天津大学 Pathological voice quality assessment method based on fusion of speech features and speaker features

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5214743A (en) * 1989-10-25 1993-05-25 Hitachi, Ltd. Information processing apparatus
CN105139864A (zh) * 2015-08-17 2015-12-09 北京天诚盛业科技有限公司 Speech recognition method and device
CN106328122A (zh) * 2016-08-19 2017-01-11 深圳市唯特视科技有限公司 Speech recognition method using a long short-term memory recurrent neural network
CN107068167A (zh) * 2017-03-13 2017-08-18 广东顺德中山大学卡内基梅隆大学国际联合研究院 Speaker cold symptom recognition method fusing multiple end-to-end neural network structures


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SAINATH, Tara N. et al.: "Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks", ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2015 IEEE INTERNATIONAL CONFERENCE ON, 6 August 2015 (2015-08-06), XP033187628, ISSN: 2379-190X *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10692502B2 (en) * 2017-03-03 2020-06-23 Pindrop Security, Inc. Method and apparatus for detecting spoofing conditions
US11488605B2 (en) 2017-03-03 2022-11-01 Pindrop Security, Inc. Method and apparatus for detecting spoofing conditions
US12488072B2 (en) 2020-04-15 2025-12-02 Pindrop Security, Inc. Passive and continuous multi-speaker voice biometrics
CN114299987A (zh) * 2021-12-08 2022-04-08 中国科学技术大学 Training method of event analysis model, event analysis method, and device therefor

Also Published As

Publication number Publication date
CN107068167A (zh) 2017-08-18

Similar Documents

Publication Publication Date Title
WO2018166316A1 (fr) Speaker cold symptom recognition method fusing multiple end-to-end neural network structures
CN112509564B (zh) End-to-end speech recognition method based on connectionist temporal classification and a self-attention mechanism
CN110245608B (zh) Underwater target recognition method based on a semi-tensor product neural network
CN112885372B (zh) Intelligent diagnosis method, system, terminal and medium for power equipment fault sounds
CN110459225B (zh) Speaker identification system based on CNN fused features
CN114863937B (zh) Hybrid birdsong recognition method based on deep transfer learning and XGBoost
WO2018227780A1 (fr) Speech recognition method, computer device and storage medium
CN105096955B (zh) Fast speaker recognition method and system based on model-growing clustering
CN107331384A (zh) Speech recognition method and apparatus, computer device and storage medium
CN111048097B (zh) Siamese network voiceprint recognition method based on 3D convolution
CN110148408A (zh) Chinese speech recognition method based on deep residual networks
CN115101076B (zh) Speaker clustering method based on multi-scale channel-separated convolutional feature extraction
CN110349588A (zh) LSTM network voiceprint recognition method based on word embeddings
CN111785262B (zh) Speaker age and gender classification method based on residual networks and fused features
CN113450830A (zh) Speech emotion recognition method using a convolutional recurrent neural network with multiple attention mechanisms
CN110299142A (zh) Voiceprint recognition method and apparatus based on network fusion
CN115440228A (zh) Adaptive voiceprint recognition method and system
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
CN116153339B (zh) Speech emotion recognition method and apparatus based on an improved attention mechanism
CN118072746A (zh) Marine mammal call recognition and classification method based on feature fusion
CN113571095A (zh) Speech emotion recognition method and system based on nested deep neural networks
CN111653267A (zh) Fast language identification method based on time-delay neural networks
CN115249479B (zh) BRNN-based complex speech recognition method, system and terminal for power grid dispatching
CN118585889A (zh) Ship type recognition method and system based on ship-radiated noise data
CN116842460A (zh) Cough-related disease recognition method and system based on attention mechanism and residual neural networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18768185

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18768185

Country of ref document: EP

Kind code of ref document: A1