WO2020181998A1 - A mixed sound event detection method based on supervised variational encoder factor decomposition - Google Patents

A mixed sound event detection method based on supervised variational encoder factor decomposition

Info

Publication number
WO2020181998A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound event
factor decomposition
sound
speech signal
supervised
Prior art date
Application number
PCT/CN2020/077189
Other languages
English (en)
French (fr)
Inventor
毛启容
陈静静
高利剑
黄多林
张飞飞
Original Assignee
江苏大学 (Jiangsu University)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 江苏大学 (Jiangsu University)
Publication of WO2020181998A1 publication Critical patent/WO2020181998A1/zh


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Definitions

  • The invention relates to the fields of speech signal processing and pattern recognition, and in particular to a sound event detection method involving a variational autoencoder and a factor decomposition method.
  • Multi-class sound event detection refers to detecting whether each kind of event occurs in a recording in which multiple sounds are mixed. Compared with traditional detection of a small number of event classes, it is more widely applicable in the real world and has broad application prospects and practical significance in fields such as medical scene monitoring and sound event detection in traffic scenes.
  • Traditional multi-class sound event detection methods mainly follow the ideas of speech recognition and template matching: for example, Gaussian mixture models and hidden Markov models with Mel-frequency cepstral coefficient features, or non-negative matrix factorization to represent each kind of event and match it against a dictionary of sound events. However, the hand-crafted features in these traditional methods cannot fully represent different sound events. Recently, deep neural networks with a bottleneck layer have been introduced to learn bottleneck features for multi-class sound event detection and have achieved good results, but the accuracy is still not very high. Unsupervised feature representation learning has made good progress in capturing data generation factors; applied directly to multi-class sound event detection, however, it learns the same set of features for all sound events, and this set of features is not sufficiently discriminative for multiple classes of sound events.
  • The present invention provides a factor decomposition method such that the decomposed features are not disturbed by factors unrelated to the detection task and each decomposed feature targets only one specific sound event, thereby solving the problem of low accuracy of multi-class sound event detection in real environments and improving detection accuracy.
  • The present invention first preprocesses the speech signal and extracts features, then extracts the latent attribute space of sound events through a supervised variational encoder, learns a feature representation for each specific sound event through factor decomposition, and then uses the corresponding sound event detector to detect whether a specific sound event occurs.
  • A mixed sound event detection method based on supervised variational encoder factor decomposition includes the following steps:
  • Step 1: preprocess the speech signal;
  • Step 2: extract features from the preprocessed speech signal;
  • Step 3: extract the latent attribute space of sound events using a supervised variational autoencoder;
  • Step 4: decompose the various factors that make up the mixed sound using factor decomposition, and then learn a feature representation for each specific sound event;
  • Step 5: use the corresponding sound event detector to detect whether a specific sound event has occurred.
  • Step 1 specifically comprises: dividing the speech signal into frames of fixed length, with overlap between adjacent frames.
  • Step 2 specifically comprises: extracting the Mel-frequency cepstral coefficients of the preprocessed speech signal.
  • The latent attribute space of sound events in Step 3 is specifically obtained by compressing the input speech signal features into a low-dimensional Gaussian distribution.
  • The corresponding sound event detector in Step 5 uses a deep neural network as the detector network.
  • Compared with traditional multi-class sound event detection, this mixed sound event detection method based on supervised variational encoder factor decomposition introduces feature representation learning, learns the latent attribute space of sound events, and can handle the detection of multi-class sound events in real scenes. Another advantage is that the method introduces a generative model (a variational autoencoder), so that more training data can be generated and detection accuracy can be improved through data augmentation. The method can also be used for various recognition tasks, such as speaker detection.
  • Fig. 1 is a flowchart of a mixed sound event detection method based on factor decomposition of a supervised variational encoder.
  • Fig. 2 is an explanatory diagram of the attention mechanism in the embodiment.
  • Fig. 1 shows the specific flow of a factor-decomposition-based sound event detection method according to an embodiment of the present invention.
  • the method includes the following steps:
  • Step 1: receive the speech signal and preprocess it, mainly by dividing the signal into frames of fixed length with overlap between adjacent frames.
  • Step 2: extract features from the preprocessed speech signal.
  • Extracting features from the preprocessed speech signal means extracting the MFCC (Mel-frequency cepstral coefficient) features of each frame of the speech signal and taking 5 frames as one sample.
  • The 5 frames correspond to consecutive time instants, so each sample contains temporal information.
  • Step 3: use a supervised variational autoencoder to extract the latent attribute space of sound events.
  • In the sampling formula z = μ + σ ⊙ ε, ε is a random number drawn from a normal distribution with mean 0 and variance 1. Because each sample contains the features of 5 frames of speech, z contains temporal information; this is also the main reason for choosing a long short-term memory (LSTM) network to process the speech signal features: an LSTM can process temporal information and retain it in the network over long spans, greatly reducing the possibility of vanishing and exploding gradients.
  • Step 4: use factor decomposition to decompose the various factors that make up the mixed sound, and then learn the feature representations related to each specific sound event.
  • The attention mechanism is applied in the latent attribute space of sound events to avoid encoding the input sequence as a fixed-length latent vector, thereby providing greater flexibility. One attention layer is designed for each sound event type; there are K sound event types, so K attention layers are designed.
  • After the latent attribute space of sound events is activated with the softmax function, the attention weight a_k over the latent attribute space of sound events is obtained as a_k = softmax_k(z).
  • In the Kullback-Leibler divergence between the posterior and the prior, i denotes the i-th sample, and μ_k^i and σ_k^i are respectively the mean and variance of z_k^i; for each feature representation z_k, the posterior distribution should match the standard normal prior p(z_k).
  • Step 5: use the corresponding sound event detector to detect whether a specific sound event has occurred.
  • Using the corresponding sound event detector to detect whether a specific sound event has occurred means constructing one sound event detector for each specific sound event type and using the binary classification function sigmoid to estimate the probability that the corresponding sound event occurs, thereby judging whether the event happens.
  • Detector_k is the constructed sound event detector; each sound event detector corresponds to one feature representation z_k.
  • The detector is a multilayer perceptron with a sigmoid function as the output.
  • The binary cross-entropy loss with which the detectors are trained is used as the second part of the factor decomposition loss function.
  • The total event-specific factor decomposition loss function proposed in the embodiment of the present invention combines the two parts, where β measures the degree of factor decomposition of the latent representation of each sound event.
  • The embodiment also trains a decoder to reconstruct the input speech signal features from the latent attribute space z of sound events, to ensure that the latent attribute space z captures the data generation factors; E in its loss function denotes the mean squared error loss.
  • The final total loss function combines the reconstruction loss and the factor decomposition loss, where λ is a weighting factor balancing the sound event detection and reconstruction tasks.
  • The embodiment selects two widely used sound event detection benchmark databases for experimental evaluation, TUT2017 and Freesound, and also evaluates speaker recognition on the TIMIT dataset.
  • The embodiment method is compared with the current state-of-the-art methods (plain deep neural network DNN, long short-term memory network LSTM, joint NeuroEvolution of Augmenting Topologies J-NEAT, convolutional recurrent neural network CRNN, and identity vector i-Vector) to demonstrate the effectiveness of the algorithm proposed in the embodiment.
  • The embodiment uses two evaluation metrics, the F1 score and the error rate (ER), in which:
  • TP(k) is the number of true positives;
  • FP(k) is the number of false positives;
  • FN(k) is the number of false negatives;
  • N(k) is the total number of samples;
  • S(k), D(k), and I(k) are the numbers of substitutions, deletions, and insertions, respectively.
  • The TUT2017 dataset contains sounds from a wide variety of street scenes with different volume levels; this dataset is most closely related to human activities and real traffic scenes.
  • The factor decomposition method based on the supervised variational encoder of the embodiment achieved the highest F1 score while maintaining a very competitive ER.
  • In the DCASE2017 international sound event detection challenge, the J-NEAT method achieved the highest F1 score but ranked 15th in ER, and the CRNN method achieved the best ER but ranked 11th in F1 score.
  • By comparison, the factor decomposition method based on the supervised variational encoder of the embodiment achieved the highest F1 score and ranked 4th in ER.
  • The Freesound dataset is a sound event database extracted from audio samples uploaded by users; it contains 28 kinds of sound events and is used to evaluate the performance of the algorithm proposed in the embodiment as complexity increases.
  • The TIMIT dataset contains a total of 6,300 utterances from 630 speakers, 10 utterances per speaker.
  • Each utterance in the TIMIT dataset originates from a single speaker; the dataset is used to evaluate the performance of the algorithm proposed in the embodiment for speaker recognition on mixed speech.
  • The sound event detection method based on supervised variational encoder factor decomposition used in the embodiment can effectively solve the problem of low detection accuracy when many classes of sound events are present and improve accuracy; at the same time, it provides a general framework for sound event detection and recognition tasks.

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Monitoring And Testing Of Exchanges (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

A mixed sound event detection method based on supervised variational encoder factor decomposition, comprising the following steps: receiving a speech signal and preprocessing it; extracting features from the preprocessed speech signal; extracting the latent attribute space of sound events using a supervised variational autoencoder; decomposing the various factors that make up the mixed sound using factor decomposition, and thereby learning a feature representation related to each specific sound event; and then using the corresponding sound event detector to detect whether a specific sound event occurs. The factor decomposition learning approach solves the problem of low sound event detection accuracy when a mixed sound contains many classes of sound events, effectively improves the accuracy of sound event detection in real scenes, and can also be used for tasks such as speaker recognition.

Description

A mixed sound event detection method based on supervised variational encoder factor decomposition

Technical Field

The present invention relates to the fields of speech signal processing and pattern recognition, and in particular to a sound event detection method involving a variational autoencoder and a factor decomposition method.

Background Art

Multi-class sound event detection refers to detecting whether each kind of event occurs in a recording in which multiple sounds are mixed. Compared with traditional detection of a small number of event classes, it is more widely applicable in the real world and has broad application prospects and practical significance in fields such as medical scene monitoring and sound event detection in traffic scenes.

Traditional multi-class sound event detection methods mainly follow the ideas of speech recognition and template matching: for example, Gaussian mixture models and hidden Markov models with Mel-frequency cepstral coefficient features, or non-negative matrix factorization to represent each kind of event and match it against a dictionary of sound events. However, the hand-crafted features in these traditional methods cannot fully represent different sound events. Recently, deep neural networks with a bottleneck layer have been introduced to learn bottleneck features for multi-class sound event detection and have achieved good results, but the accuracy is still not very high. Unsupervised feature representation learning has made good progress in capturing data generation factors; however, if it is applied directly to multi-class sound event detection, the same set of features is learned for all sound events, which may degrade performance; that is, this set of features is not sufficiently discriminative for multiple classes of sound events. Although many methods have made new progress through feature learning, how to perform multi-class sound event detection through factor decomposition remains unsolved, and this is precisely the top priority of sound event detection in real environments.
Summary of the Invention

The present invention provides a factor decomposition method such that the decomposed features are not disturbed by factors unrelated to the detection task and each decomposed feature targets only one specific sound event, thereby solving the problem of low accuracy of multi-class sound event detection in real environments and improving detection accuracy.

To solve the above technical problem, the present invention first preprocesses the speech signal and extracts features, then extracts the latent attribute space of sound events through a supervised variational encoder, learns a feature representation for each specific sound event through factor decomposition, and then uses the corresponding sound event detector to detect whether a specific sound event occurs.

The specific technical solution is as follows:

A mixed sound event detection method based on supervised variational encoder factor decomposition comprises the following steps:

Step 1: preprocess the speech signal;

Step 2: extract features from the preprocessed speech signal;

Step 3: extract the latent attribute space of sound events using a supervised variational autoencoder;

Step 4: decompose the various factors that make up the mixed sound using factor decomposition, and thereby learn a feature representation for each specific sound event;

Step 5: use the corresponding sound event detector to detect whether a specific sound event occurs.

Further, Step 1 specifically comprises: dividing the speech signal into frames of fixed length, with overlap between adjacent frames.

Further, Step 2 specifically comprises: extracting the Mel-frequency cepstral coefficients of the preprocessed speech signal.

Further, the latent attribute space of sound events in Step 3 is specifically obtained by compressing the input speech signal features into a low-dimensional Gaussian distribution.

Further, the feature representation of a specific sound event in Step 4 is

z_k = a_k ⊙ z

where a_k is the attention weight over the latent attribute space of sound events and z is the latent attribute space of sound events.

Further, the corresponding sound event detector in Step 5 uses a deep neural network as the detector network.

The invention has beneficial effects. Compared with traditional multi-class sound event detection, this mixed sound event detection method based on supervised variational encoder factor decomposition introduces feature representation learning, learns the latent attribute space of sound events, and can handle the detection of multi-class sound events in real scenes. Another advantage is that the method introduces a generative model (a variational autoencoder), so that more training data can be generated and detection accuracy can be improved through data augmentation. The method can also be used for various recognition tasks, such as speaker detection.
Brief Description of the Drawings

Fig. 1 is a flowchart of the mixed sound event detection method based on supervised variational encoder factor decomposition.

Fig. 2 is an explanatory diagram of the attention mechanism in the embodiment.

Detailed Description of the Embodiments

The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Referring to Fig. 1, which shows the specific flow of a factor-decomposition-based sound event detection method according to an embodiment of the present invention, the method comprises the following steps:
Step 1: receive the speech signal and preprocess it. This mainly consists of dividing the speech signal into frames of fixed length, with overlap between adjacent frames, that is, adjacent frames share samples.

Step 2: extract features from the preprocessed speech signal.

Extracting features from the preprocessed speech signal means extracting the MFCC (Mel-frequency cepstral coefficient) features of each frame of the speech signal and taking 5 frames as one sample. The 5 frames correspond to consecutive time instants, so each sample contains temporal information.
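As an illustration of this framing-and-feature step, the minimal sketch below computes per-frame MFCCs and groups 5 consecutive frames into one sample. The librosa library, the 25 ms frame length, the 10 ms hop, and the 40 coefficients are assumptions for illustration only; the patent does not specify these values.

```python
import librosa

def extract_samples(wav_path, n_mfcc=40, frames_per_sample=5):
    """Frame the signal, compute per-frame MFCCs, group 5 frames per sample."""
    y, sr = librosa.load(wav_path, sr=None)
    # 25 ms frames with a 10 ms hop, so adjacent frames overlap (assumed values).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr)).T  # (frames, n_mfcc)
    n = len(mfcc) // frames_per_sample
    # Each sample stacks 5 consecutive frames, preserving temporal information.
    return mfcc[:n * frames_per_sample].reshape(n, frames_per_sample, n_mfcc)
```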
Step 3: extract the latent attribute space of sound events using a supervised variational autoencoder.

A long short-term memory (LSTM) network compresses the input 5-frame speech signal features X into a low-dimensional Gaussian distribution with mean μ and variance σ. The latent attribute space z of sound events is computed as:

z = μ + σ ⊙ ε     (1)

where ε is a random number drawn from a normal distribution with mean 0 and variance 1. Because each sample contains the features of 5 frames of speech, z contains temporal information; this is also the main reason for choosing an LSTM network to process the speech signal features: an LSTM can process temporal information and retain it in the network over long spans, greatly reducing the possibility of vanishing and exploding gradients.
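A minimal PyTorch sketch of such an encoder, including the reparameterization of formula (1), follows; the layer sizes and the 16-dimensional latent space are illustrative assumptions, not values given in the patent.

```python
import torch
import torch.nn as nn

class SupervisedVAEEncoder(nn.Module):
    """LSTM encoder mapping a 5-frame MFCC sample to a Gaussian latent z."""
    def __init__(self, n_mfcc=40, hidden=128, latent=16):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)

    def forward(self, x):                  # x: (batch, 5, n_mfcc)
        _, (h, _) = self.lstm(x)           # h: (1, batch, hidden)
        mu = self.to_mu(h[-1])
        logvar = self.to_logvar(h[-1])
        eps = torch.randn_like(mu)         # ε drawn from N(0, 1)
        z = mu + torch.exp(0.5 * logvar) * eps  # formula (1): z = μ + σ ⊙ ε
        return z, mu, logvar
```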
Step 4: use factor decomposition to decompose the various factors that make up the mixed sound, and then learn the feature representation related to each specific sound event.

As shown in Fig. 2, an attention mechanism is applied in the latent attribute space of sound events, avoiding encoding the input sequence as a single fixed-length latent vector and thereby providing greater flexibility. One attention layer is designed for each sound event type; there are K sound event types, so K attention layers are designed in total. After the latent attribute space of sound events is activated with the softmax function, the attention weight a_k over the latent attribute space of sound events is obtained as:

a_k = softmax_k(z)     (2)

The feature representation z_k related to a specific sound event is then computed as:

z_k = a_k ⊙ z     (3)
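Continuing the sketch above, the K attention layers of formulas (2) and (3) could take the following form; the learned per-event projection ahead of the softmax is an assumption, since the patent only states that one attention layer is designed per event type.

```python
class FactorDecomposition(nn.Module):
    """One attention layer per sound event type (K layers in total)."""
    def __init__(self, latent=16, num_events=10):   # K = num_events (assumed)
        super().__init__()
        self.attn = nn.ModuleList(
            nn.Linear(latent, latent) for _ in range(num_events))

    def forward(self, z):                           # z: (batch, latent)
        z_ks = []
        for layer in self.attn:
            a_k = torch.softmax(layer(z), dim=-1)   # formula (2): a_k = softmax_k(z)
            z_ks.append(a_k * z)                    # formula (3): z_k = a_k ⊙ z
        return torch.stack(z_ks, dim=1)             # (batch, K, latent)
```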
It is usually reasonable to assume that the occurrences of sound events are mutually independent, that is, that the z_k are mutually independent; the KL (Kullback-Leibler) divergence between the posterior distribution and the prior distribution can then be computed as:

L_KL = Σ_{k=1..K} Σ_{i=1..I} KL( q(z_k^i) ‖ p(z_k) )     (4)

where i denotes the i-th sample, and μ_k^i and σ_k^i are respectively the mean and variance of z_k^i. For each feature representation z_k, the posterior distribution q(z_k) should match the prior distribution p(z_k), where p(z_k) follows a standard normal distribution with mean 0 and variance 1, i = 1…I with I the total number of samples, and k = 1…K. This divergence serves as the first part of the factor decomposition loss function.
Step 5: use the corresponding sound event detector to detect whether a specific sound event has occurred.

Using the corresponding sound event detector to detect whether a specific sound event has occurred means constructing one sound event detector for each specific sound event type and using the binary classification function sigmoid to estimate the probability that the corresponding sound event occurs, thereby judging whether the event happens. The method is:

ŷ_k = Detector_k(z_k)     (5)

where Detector_k is the constructed sound event detector; each sound event detector corresponds to one feature representation z_k, and the detector is a multilayer perceptron with a sigmoid function as the output.

All detectors are trained with a binary cross-entropy loss:

L_BCE = −Σ_{k=1..K} Σ_{i=1..I} [ y_k^i log ŷ_k^i + (1 − y_k^i) log(1 − ŷ_k^i) ]     (6)

where y_k^i denotes the ground-truth label of the i-th sample, either 1 or 0, and ŷ_k^i is the probability that the i-th sample is recognized as the k-th sound event. This loss function serves as the second part of the factor decomposition loss function.
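Continuing the sketch, each detector of formula (5) can be a small multilayer perceptron with a sigmoid output; the hidden width of 32 is an assumed value. In use, detector k is applied to the z_k produced by the factor decomposition above.

```python
class EventDetector(nn.Module):
    """Per-event MLP detector with a sigmoid output, as in formula (5)."""
    def __init__(self, latent=16, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, z_k):                # z_k: (batch, latent)
        return self.mlp(z_k).squeeze(-1)   # probability that event k occurs
```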
In summary, the total event-specific factor decomposition loss function proposed in the embodiment of the present invention is:

L_disent(φ, θ'; x, y, z) = L_BCE + β L_KL     (7)

where β measures the degree of factor decomposition of the latent representation of each sound event.

In addition, the embodiment also trains a decoder to reconstruct the input speech signal features from the latent attribute space z of sound events, to ensure that the latent attribute space z captures the data generation factors; its loss function is:

L_recons(θ, φ; x, z) = E[ ‖x − x̂‖² ]     (8)

where E denotes the mean squared error loss and x̂ is the decoder's reconstruction of the input x.

The final total loss function is defined as:

L_s-β-VAE(θ, φ, θ'; x, y, z) = L_recons(θ, φ; x, z) + λ L_disent(φ, θ'; x, y, z)     (9)

where λ is a weighting factor that balances the sound event detection and reconstruction tasks.
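The loss terms can be assembled as in the sketch below. How the per-event posterior statistics mu_k and logvar_k are derived from the attention weights is not spelled out in the patent, so here they are simply passed in, and the standard closed-form KL between a Gaussian and the standard normal is used for formula (4); the values of beta and lam are illustrative.

```python
import torch.nn.functional as F

def total_loss(x, x_hat, y, y_hat, mu_k, logvar_k, beta=1.0, lam=1.0):
    """Assembles formulas (4) and (6)-(9) under the assumptions stated above."""
    # (4): closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over events/samples.
    l_kl = -0.5 * torch.sum(1 + logvar_k - mu_k.pow(2) - logvar_k.exp())
    # (6): binary cross-entropy accumulated over all K detectors.
    l_bce = F.binary_cross_entropy(y_hat, y.float(), reduction='sum')
    l_disent = l_bce + beta * l_kl                    # (7)
    l_recons = F.mse_loss(x_hat, x, reduction='sum')  # (8)
    return l_recons + lam * l_disent                  # (9)
```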
The embodiment selects two widely used sound event detection benchmark databases for experimental evaluation, TUT2017 and Freesound, and also evaluates speaker recognition on the TIMIT dataset. To compare the performance of the embodiment method with other methods, on each dataset the embodiment method is compared with the current state-of-the-art methods (plain deep neural network DNN, long short-term memory network LSTM, joint NeuroEvolution of Augmenting Topologies J-NEAT, convolutional recurrent neural network CRNN, and identity vector i-Vector) to demonstrate the effectiveness of the algorithm proposed in the embodiment. In all experiments, two evaluation metrics are used, the F1 score and the error rate (ER), computed as:

F1 = 2 Σ_k TP(k) / ( 2 Σ_k TP(k) + Σ_k FP(k) + Σ_k FN(k) )

where TP(k) is the number of true positives, FP(k) the number of false positives, and FN(k) the number of false negatives;

ER = ( Σ_k S(k) + Σ_k D(k) + Σ_k I(k) ) / Σ_k N(k)

where N(k) is the total number of samples, and S(k), D(k), and I(k) are respectively the numbers of substitutions, deletions, and insertions.
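Assuming the micro-averaged, DCASE-style definitions implied by the variables above, the two metrics reduce to a few lines:

```python
import numpy as np

def f1_and_er(tp, fp, fn, s, d, ins, n):
    """F1 and ER from per-class counts (NumPy arrays indexed by class k)."""
    f1 = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    er = (s.sum() + d.sum() + ins.sum()) / n.sum()
    return f1, er
```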
(1) The TUT2017 dataset

The TUT2017 dataset contains sounds from a wide variety of street scenes with different volume levels; this dataset is most closely related to human activities and real traffic scenes.

Table 1: F1 score and error rate (ER) of the different methods

Method                                 F1 (%)    ER
DNN                                    42.80     0.9358
LSTM                                   43.22     0.9031
J-NEAT                                 44.90     0.8979
CRNN                                   41.70     0.7914
Supervised variational autoencoder     45.86     0.8259

From the experimental results in Table 1, it can be seen that the supervised variational encoder factor decomposition method of the embodiment achieves the highest F1 score while maintaining a very competitive ER. In the DCASE2017 international sound event detection challenge, the J-NEAT method achieved the highest F1 score but ranked 15th in ER, while the CRNN method achieved the best ER but ranked 11th in F1 score. By comparison, the method of the embodiment achieves the highest F1 score and ranks 4th in ER.
(2) The Freesound dataset

The Freesound dataset is a sound event database extracted from audio samples uploaded by users; it contains 28 kinds of sound events and is used to evaluate the performance of the algorithm proposed in the embodiment as the complexity gradually increases.

Table 2: F1 score and error rate (ER) for different numbers of sound event classes [the table is an image in the original publication; its values are not reproduced here]

From the experimental results in Table 2, it can be seen that as the number of sound event classes increases, the F1 scores of the DNN and CRNN methods drop rapidly, whereas the F1 score of the proposed algorithm declines much more slowly; likewise, the ER of the DNN and CRNN methods increases rapidly, whereas the ER of the proposed algorithm increases only slowly. The greatest advantage of the proposed algorithm is therefore that it can handle multi-class sound event detection in real scenes, which the other methods are not good at.
(3) The TIMIT dataset

The TIMIT dataset contains a total of 6,300 utterances from 630 speakers, 10 utterances per speaker. Each utterance in the TIMIT dataset originates from a single speaker; the dataset is used to evaluate the performance of the algorithm proposed in the embodiment for speaker recognition on mixed speech.

Table 3: F1 score and error rate (ER) of speaker recognition on the TIMIT dataset for different methods

Method                                 F1 (%)    ER
Supervised variational autoencoder     81.20     0.3049
i-Vector                               73.38     0.4255

From the experimental results in Table 3, the i-Vector method attains an F1 score of 73.38% and an ER of 0.4255, while the method of the embodiment attains an F1 score of 81.20% and an ER of 0.3049; the method of the embodiment therefore performs better than the i-Vector method.

From the above verification results, it can be seen that the method proposed in the embodiment provides a general framework for a wide variety of sound event detection and recognition tasks.
The above experimental results show that, compared with other algorithms, the sound event detection method based on supervised variational encoder factor decomposition adopted in the embodiment can effectively solve the problem of low detection accuracy when many classes of sound events are present and improve accuracy, while also providing a general framework for sound event detection and recognition tasks.

The above is a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principle of the present invention, and these improvements and refinements are also regarded as falling within the protection scope of the present invention.

Claims (6)

  1. A mixed sound event detection method based on supervised variational encoder factor decomposition, characterized by comprising the following steps:
    Step 1: preprocessing the speech signal;
    Step 2: extracting features from the preprocessed speech signal;
    Step 3: extracting the latent attribute space of sound events using a supervised variational autoencoder;
    Step 4: decomposing the various factors that make up the mixed sound using factor decomposition, and thereby learning a feature representation for each specific sound event;
    Step 5: using the corresponding sound event detector to detect whether a specific sound event occurs.
  2. The mixed sound event detection method based on supervised variational encoder factor decomposition according to claim 1, characterized in that Step 1 specifically comprises: dividing the speech signal into frames of fixed length, with overlap between adjacent frames.
  3. The mixed sound event detection method based on supervised variational encoder factor decomposition according to claim 1, characterized in that Step 2 specifically comprises: extracting the Mel-frequency cepstral coefficients of the preprocessed speech signal.
  4. The mixed sound event detection method based on supervised variational encoder factor decomposition according to claim 1, characterized in that the latent attribute space of sound events in Step 3 is specifically obtained by compressing the input speech signal features into a low-dimensional Gaussian distribution.
  5. The mixed sound event detection method based on supervised variational encoder factor decomposition according to claim 1, characterized in that the feature representation of a specific sound event in Step 4 is z_k = a_k ⊙ z, where a_k is the attention weight over the latent attribute space of sound events and z is the latent attribute space of sound events.
  6. The mixed sound event detection method based on supervised variational encoder factor decomposition according to claim 1, characterized in that the corresponding sound event detector in Step 5 uses a deep neural network as the detector network.
PCT/CN2020/077189 2019-03-11 2020-02-28 A mixed sound event detection method based on supervised variational encoder factor decomposition WO2020181998A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910179592.0 2019-03-11
CN201910179592.0A CN110070895B (zh) 2019-03-11 2019-03-11 A mixed sound event detection method based on supervised variational encoder factor decomposition

Publications (1)

Publication Number Publication Date
WO2020181998A1 true WO2020181998A1 (zh) 2020-09-17

Family

ID=67365195

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/077189 WO2020181998A1 (zh) 2019-03-11 2020-02-28 一种基于监督变分编码器因素分解的混合声音事件检测方法

Country Status (2)

Country Link
CN (1) CN110070895B (zh)
WO (1) WO2020181998A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070895B (zh) A mixed sound event detection method based on supervised variational encoder factor decomposition
CN110659468B (zh) File encryption and decryption system based on a C/S architecture and speaker recognition technology
CN110600059B (zh) Acoustic event detection method and apparatus, electronic device, and storage medium
CN111312288A (zh) Broadcast audio event processing method and system, and computer-readable storage medium
CN111753549B (zh) Multimodal emotion feature learning and recognition method based on an attention mechanism
CN113707175B (zh) Acoustic event detection system based on a feature-decomposition classifier and adaptive post-processing
CN115376484A (zh) Method for constructing a lightweight end-to-end speech synthesis system based on multi-frame prediction


Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101819770A (zh) Audio event detection system and method
CN102486920A (zh) Audio event detection method and device
CN103678483A (zh) Video semantic analysis method based on an adaptive probability hypergraph and semi-supervised learning
US9807037B1 (en) Automatically suggesting completions of text
CN108510982B (zh) Audio event detection method and device, and computer-readable storage medium
CN108777140B (zh) A VAE-based voice conversion method trained on non-parallel corpora
CN108875818B (zh) Zero-shot image classification method combining a variational autoencoder and an adversarial network
CN108881196B (zh) Semi-supervised intrusion detection method based on a deep generative model
CN109102798A (zh) A decoration event detection method and apparatus, computer device, and medium
US10789941B2 (en) Acoustic event detector with reduced resource consumption
CN109447263B (zh) An aerospace anomaly event detection method based on generative adversarial networks

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015057630A (ja) Acoustic event identification model learning device, acoustic event detection device, acoustic event identification model learning method, acoustic event detection method, and program
CN104021373A (zh) A semi-supervised speech feature variable factor decomposition method
CN104795064A (zh) Method for recognizing sound events in acoustic scenes with a low signal-to-noise ratio
US20170372725A1 (en) System and method for cluster-based audio event detection
CN106251860A (zh) Unsupervised novel audio event detection method and system for the security field
CN110070895A (zh) A mixed sound event detection method based on supervised variational encoder factor decomposition

Also Published As

Publication number Publication date
CN110070895A (zh) 2019-07-30
CN110070895B (zh) 2021-06-22

Similar Documents

Publication Publication Date Title
WO2020181998A1 (zh) A mixed sound event detection method based on supervised variational encoder factor decomposition
CN110400579B (zh) Speech emotion recognition based on a directional self-attention mechanism and bidirectional long short-term memory networks
Yang et al. Multimodal measurement of depression using deep learning models
Hu et al. Temporal multimodal learning in audiovisual speech recognition
CN112216271B (zh) An audio-visual bimodal speech recognition method based on a convolutional block attention mechanism
CN109473120A (zh) An abnormal sound signal recognition method based on a convolutional neural network
CN110491416A (zh) A telephone speech emotion analysis and recognition method based on LSTM and SAE
CN110956953B (zh) Quarrel recognition method based on audio analysis and deep learning
WO2016155047A1 (zh) Method for recognizing sound events in acoustic scenes with a low signal-to-noise ratio
CN113221673B (zh) Speaker authentication method and system based on multi-scale feature aggregation
CN102201237B (zh) Emotional speaker recognition method based on reliability detection using a fuzzy support vector machine
CN109243446A (zh) A voice wake-up method based on an RNN network
CN111048097A (zh) A Siamese-network voiceprint recognition method based on 3D convolution
Elshaer et al. Transfer learning from sound representations for anger detection in speech
CN113707175A (zh) Acoustic event detection system based on a feature-decomposition classifier and adaptive post-processing
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
Janbakhshi et al. Automatic dysarthric speech detection exploiting pairwise distance-based convolutional neural networks
CN110246509A (zh) A stacked denoising autoencoder and deep neural network structure for speech lie detection
CN114547601B (zh) A random forest intrusion detection method based on a multi-layer classification strategy
CN116434786A (zh) A teacher speech emotion recognition method incorporating text semantic assistance
Hu et al. Speaker Recognition Based on 3DCNN-LSTM.
Zhang The algorithm of voiceprint recognition model based DNN-RELIANCE
Mızrak et al. Gender Detection by Acoustic Characteristics of Sound with Machine Learning Algorithms
CN114267361A (zh) A speaker recognition system with high recognition accuracy
Li et al. Audio similarity detection algorithm based on Siamese LSTM network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20769192

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20769192

Country of ref document: EP

Kind code of ref document: A1