WO2015124006A1 - Audio detection and classification method with customizable categories (一种具有自定义功能的音频检测分类方法) - Google Patents

Audio detection and classification method with customizable categories (一种具有自定义功能的音频检测分类方法)

Info

Publication number
WO2015124006A1
WO2015124006A1 · PCT/CN2014/091959
Authority
WO
WIPO (PCT)
Prior art keywords
gaussian mixture
mixture model
training samples
audio
training
Prior art date
Application number
PCT/CN2014/091959
Other languages
English (en)
French (fr)
Inventor
杨毅 (Yang Yi)
Original Assignee
清华大学 (Tsinghua University)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 清华大学 (Tsinghua University)
Publication of WO2015124006A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/48 - Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum

Definitions

  • The invention belongs to the technical field of audio processing, and in particular relates to an audio detection and classification method with customizable categories.
  • In systems such as audio recognition and speaker recognition, Voice Activity Detection (VAD) technology is widely used to eliminate the silence and noise signals in a continuous audio signal that are unrelated to the speaker, and to determine the start and end positions of the audio segments, improving the performance of speech recognition and speaker recognition systems.
  • Effective and accurate audio activation detection removes the signals of noise segments or silent segments, reducing the system's data-processing load and the interference with subsequent audio analysis and processing, thereby improving the system's recognition performance.
  • Audio activation detection algorithms have been studied for many years.
  • Traditional audio activation detection methods basically process audio signals obtained in quiet environments, such as methods based on short-term average energy, algorithms based on the short-term average zero-crossing rate, and methods based on cepstral features.
  • The activation detection algorithm based on short-term average energy exploits the difference between unvoiced and voiced energy, using the short-term average energy feature to distinguish silent segments from the unvoiced sounds of audio segments in a quiet environment.
  • Ordered by short-term energy, the three signal types rank voiced > unvoiced > silence; on this basis, silent segments and audio segments in a quiet environment, as well as the unvoiced and voiced parts of the audio segment signal, can be distinguished.
  • The dual-threshold activation detection algorithm for audio signals combines the short-term average zero-crossing rate with the short-term average energy, i.e. it combines two characteristic parameters of the audio signal.
  • This method first uses the short-term average energy to distinguish audio segments from non-audio segments, and then refines that distinction with the zero-crossing rate. Compared with the activation detection algorithm based on short-term average energy alone, it better avoids misclassifying audio signals that begin with an unvoiced consonant as non-audio segments.
  • The cepstrum represents the characteristics of audio well, which is why cepstral coefficients are selected as input feature vectors in most audio recognition systems and are likewise used as parameters for endpoint detection.
  • The cepstrum-based activation detection algorithm splits the audio signal in the frequency domain into a high-band and a low-band signal, which may overlap. The two resulting signals are preprocessed to extract linear predictive coding (LPC) cepstral parameters, which are further nonlinearly transformed on the Mel scale to obtain LPC mel-cepstral coefficients. The cepstral distance method is then used, with the cepstral distance replacing short-term energy as the threshold quantity.
  • The first few frames are assumed to be background noise; the average of their cepstral coefficient vectors estimates the background-noise cepstral vector, which is kept updated, and the cepstral distance between every test frame and the background noise is computed.
  • This yields a cepstral distance trajectory, from which activation detection can be performed.
  • The Hidden Markov Model (HMM) can also serve, like cepstral coefficients, as a statistical model of audio features.
  • A continuous HMM labeling words and a continuous HMM labeling background noise are trained to represent general audio and noise characteristics respectively; training uses cepstral vectors and the Baum-Welch algorithm.
  • The HMMs are connected to a grammar model, and the noisy audio is preprocessed at the endpoint detection stage to obtain input feature vectors, each consisting of the cepstral coefficients, the cepstral coefficient increments or time derivatives, and the short-term energy increment of the current frame.
  • The audio activation detection algorithm based on sub-band energy features draws on the edge detection methods used in the field of image processing.
  • Edge detection is a classic problem in the field of image processing.
  • A common approach is a linear filter derived from some optimization criterion, such as an exponential filter or a first-order differential Gaussian filter.
  • The main target of sub-band selection is to remove the part of the spectrum where the noise signal energy is relatively concentrated, while retaining most of the energy of the audio signal.
  • The audio signal is accordingly divided into a high-frequency and a low-frequency sub-band for the audio/non-audio decision. After the start and end points of both sub-bands are obtained, the sub-bands must be fused in a combined decision.
  • The final start point of the audio segment is the earlier of the two sub-band start points, and the final end point is the later of the two sub-band end points.
  • The decision method based on an entropy function sets the frame length of the speech signal s(n) to N.
  • The maximum and minimum amplitudes in a frame of speech are M and -M, respectively.
  • The entropy of this frame is then defined by the formula given in the description below.
  • From it, the information entropy of each frame of the speech signal can be calculated.
  • A threshold h is defined and compared against the entropy of each speech frame: frames whose entropy exceeds h are speech frames, and frames whose entropy falls below h are silent frames.
  • The VAD should therefore be designed as a customizable classifier, and new audio data should be usable to update the classifier and improve its environmental adaptability.
  • An object of the present invention is to provide an audio detection and classification method with customizable categories, which first divides part of the original training set by type into several classes of training sets, performs feature extraction for each class, and trains the corresponding Gaussian mixture model and its parameters to obtain a global Gaussian mixture model; further uses other training sets as new training samples to update the global Gaussian mixture model and obtain a local model; and finally extracts features from the test set, feeds them into the local-model classifier, and smooths and outputs the result. Its main advantage is overcoming the inability of existing audio activation detection to define multiple custom categories and make decisions over them.
  • An audio detection and classification method with customizable categories comprises the following steps:
  • The first step is feature extraction for the different classes of training samples.
  • The training samples include different classes of audio signals, and acoustic features are extracted from the training samples as training features for speaker recognition.
  • The second step is training the global Gaussian mixture model parameters.
  • Gaussian mixture model parameters are trained on the first class of training samples, and the Gaussian mixture model parameters corresponding to the first class are output; and so on, Gaussian mixture model parameter training is applied to the m-th class of training samples, and the parameters corresponding to the m-th class are output.
  • The third step is training the local Gaussian mixture model parameters.
  • The fourth step is testing the classifier.
  • The acoustic features in the first step cover human speech, background noise, door-closing sounds, and babble noise.
  • The purpose of global model training is to train the most basic and most widely used models, such as human speech, background noise, door-closing sounds, and babble noise, which are objects that need to be defined in almost all applications. It is therefore necessary to perform model training on these kinds of data in advance to obtain their probability density distributions, and thereby train the global model.
  • The local Gaussian mixture model training in the third step mainly combines the new training data with the global model to further train the Gaussian mixture model parameters and obtain a local model, covering two cases: in one, the new training samples belong to an existing audio class and are added to the existing training samples to update the Gaussian mixture model parameters; in the other, the new training samples do not belong to an existing audio class, and a new category must be added to the Gaussian mixture model and the parameters updated.
  • The Gaussian mixture model parameters are usually solved with the Expectation Maximization (EM) method: given training data with l samples, all unknown parameters are estimated. When building a Gaussian mixture model, storing all the training samples would consume very large resources; the idea of incremental learning can instead be used to update the Gaussian mixture model parameters from the existing parameters and the new training samples.
  • The method is given in the description below.
  • N and K are the numbers of the original training samples x_i and of the new training samples, respectively.
  • The present invention refines the classification of different types of training samples by establishing a global model and a local model, trains a local Gaussian mixture model on top of the global Gaussian mixture model, and finally realizes audio activation detection with customizable categories.
  • The method of the invention can be regarded as a machine-learning approach that replaces global learning with local learning to model different types of data; it effectively solves the problem that audio classes could not be custom-defined and distinguished. Applying this approach to some audio activation detection data sets yields better performance than methods based on audio energy or other features.
  • FIG. 1 is a flow chart of the global model training module of the audio detection classification of the present invention.
  • FIG. 2 is a flow chart of the local model training module of the audio detection classification of the present invention.
  • FIG. 3 is a flow chart of the classifier testing method of the audio detection classification of the present invention.
  • FIG. 1 is a flowchart of the global model training of the audio detection classification according to the present invention, including the following:
  • The present invention proposes a global model training method and apparatus based on audio detection classification, in particular for audio activation detection classification scenarios.
  • These methods and apparatus are not limited to audio activation detection classification, but can be any method and apparatus related to audio classification.
  • Figure 1 depicts an example of global model training based on audio detection classification.
  • The first class of training samples 101 shown in FIG. 1 includes all audio signals of the first class used for training.
  • The second class of training samples 102 includes all audio signals of the second class used for training.
  • The M-th class of training samples 103 includes all audio signals of the M-th class used for training.
  • Feature extraction 104 refers to extracting acoustic features as detection information after the audio signal has been obtained in the first step. These acoustic features may be Mel Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), or other acoustic features.
  • The first class Gaussian mixture model 105 first trains on the first class of training samples 101 to obtain their probability density distribution; the output is the Gaussian mixture model parameters corresponding to the first class of training samples.
  • Here π denotes the mixture weights of the mixture model, and μ and Σ are the mean vector and covariance matrix of each Gaussian component.
  • By analogy, the output of the second class Gaussian mixture model 106 is the Gaussian mixture model parameters corresponding to the second class of training samples, and so on for each class.
  • N_m denotes the number of Gaussian components of the m-th mixture model, and M denotes the number of categories.
  • FIG. 2 is a flow chart of the local model training of the audio detection classification of the present invention, including the following:
  • Local model training includes two cases: in one, the new training samples belong to an existing audio class and must be added to the existing training samples to update the Gaussian mixture model parameters; in the other, the new training samples do not belong to an existing audio class, and a new category must be added to the Gaussian mixture model and the parameters updated.
  • The Gaussian mixture model parameters are usually solved with the Expectation Maximization (EM) method: given training data with l samples, all unknown parameters are estimated. When building a Gaussian mixture model, storing all the training samples would consume very large resources; the idea of incremental learning can instead be used to update the Gaussian mixture model parameters from the existing parameters and the new training samples.
  • The method is given in the description below.
  • N and K are the numbers of the original training samples x_i and of the new training samples, respectively.
  • FIG. 3 is a flow chart of the classifier test of the audio detection classification of the present invention, including the following:
  • The test samples 301 include all audio signals of the first class used for testing.
  • Feature extraction 302 refers to extracting acoustic features as detection information after the audio signal has been obtained in the first step. These acoustic features may be Mel Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), or other acoustic features.
  • The local classifier 303 is a Bayesian classifier based on the Gaussian mixture model; the classifier is defined in the description below.
  • π_j is the weight of the j-th mixture component.
  • p_j(x; μ_j, Σ_j) is the j-th multidimensional Gaussian distribution, also defined in the description below.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Complex Calculations (AREA)

Abstract

An audio detection and classification method with customizable categories performs audio activation detection on audio data. Part of the original training samples are first divided by type into several classes of training samples (101, 102, 103); feature extraction (104) is performed for each class of training samples (101, 102, 103), and the corresponding Gaussian mixture models (105, 106, 107) and their parameters are trained to obtain a global Gaussian mixture model (202). Other training samples (201) are further used as new training samples to update the parameters of the global Gaussian mixture model (202), yielding a local model (204). Finally, features are extracted (302) from the test samples (301) and fed into the local-model classifier (303), and the results are smoothed (304) and output. Through the training of the global and local Gaussian mixture models, the method allows the categories and parameters of the Gaussian mixture model to be updated as samples accumulate; the combination with the classifier further improves system performance, finally achieving audio detection classification. It can be widely applied in machine-learning fields involving audio detection and classification, such as speaker recognition, speech recognition, and human-computer interaction.

Description

Audio detection and classification method with customizable categories
Technical Field
The invention belongs to the technical field of audio processing, and in particular relates to an audio detection and classification method with customizable categories.
Background Art
In systems such as audio recognition and speaker recognition, voice activity detection (VAD) technology is widely used, mainly to eliminate the silence and noise signals in a continuous audio signal that are unrelated to the speaker and to determine the start and end positions of the audio segments, improving the performance of speech recognition and speaker recognition systems. Effective and accurate audio activation detection removes the signals of noise segments or silent segments, reducing the system's data-processing load and the interference with subsequent audio analysis and processing, and thereby improves the system's recognition performance. Audio activation detection algorithms have been studied for many years; traditional methods basically process audio signals obtained in quiet environments, such as methods based on short-term average energy, algorithms based on the short-term average zero-crossing rate, and methods based on cepstral features.
The activation detection algorithm based on short-term average energy exploits the difference between unvoiced and voiced energy, using the short-term average energy feature to distinguish silent segments from audio segments, and unvoiced from voiced sounds, in a quiet environment. Ordered by short-term energy, the three rank voiced > unvoiced > silence; on this basis, silent segments and audio segments in a quiet environment, as well as the unvoiced and voiced parts of the audio segment signal, can be distinguished.
The dual-threshold activation detection algorithm for audio signals is an audio activation detection algorithm that combines the short-term average zero-crossing rate with the short-term average energy, i.e. it combines two characteristic parameters of the audio signal. This method first uses the short-term average energy to distinguish audio segments from non-audio segments, and then uses the zero-crossing rate to refine that distinction. Compared with the activation detection algorithm based on short-term average energy alone, it better avoids misclassifying audio signals that begin with an unvoiced consonant as non-audio segments.
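By way of illustration only, a minimal Python sketch of such a dual-threshold decision is given below; the frame length, both thresholds, and the exact decision logic are illustrative assumptions, not values prescribed by the patent.

```python
import numpy as np

def dual_threshold_vad(signal, frame_len=256, energy_thr=1e-2, zcr_thr=0.3):
    """Toy dual-threshold VAD: short-term energy first, zero-crossing rate second."""
    n_frames = len(signal) // frame_len
    decisions = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = np.mean(frame ** 2)                          # short-term average energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0  # short-term average zero-crossing rate
        # High energy marks an audio frame outright; a low-energy frame with a high
        # zero-crossing rate may be an unvoiced-consonant onset and is kept as audio.
        decisions.append(energy > energy_thr or zcr > zcr_thr)
    return np.array(decisions)
```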
In noisy environments, neither short-term energy nor the other feature parameters can reliably separate audio segments from non-audio segments. The cepstrum represents the characteristics of audio well, which is why cepstral coefficients are selected as input feature vectors in most audio recognition systems and are likewise used as parameters for endpoint detection. The activation detection algorithm based on cepstral features splits the audio signal in the frequency domain into a high-band and a low-band signal, which may overlap; the two resulting signals are preprocessed and linear predictive coding (LPC) cepstral parameters are extracted, which are further nonlinearly transformed on the Mel scale to obtain LPC mel-cepstral coefficients. The cepstral distance method is then used, with the cepstral distance replacing short-term energy as the threshold quantity. The first few frames of the audio signal are assumed to be background noise and their cepstral coefficient vectors are computed; the average of these vectors estimates the background-noise cepstral vector, which is continuously updated. Computing the cepstral distance between every test frame and the background noise yields a cepstral distance trajectory, from which activation detection can be performed.
The Hidden Markov Model (HMM) can also serve, like cepstral coefficients, as a statistical model of audio features. In an HMM audio detector, a continuous HMM labeling words and a continuous HMM labeling background noise are trained to represent the characteristics of general audio and of noise respectively; training uses cepstral vectors and the Baum-Welch algorithm. The HMMs are connected to a grammar model, and at the endpoint detection stage the noisy audio is preprocessed to obtain input feature vectors, each consisting of the cepstral coefficients, the cepstral coefficient increments or time derivatives, and the short-term energy increment of the current frame. Viterbi decoding is then introduced: following the model parameters and the input audio feature stream, it produces audio very similar to what is actually occurring, and the Viterbi decoder gives the endpoints of the audio. The basic system structure of this method is the same as that of an ordinary audio recognizer.
The audio activation detection algorithm based on sub-band energy features draws on the edge detection methods used in the field of image processing. Edge detection is a classic problem in that field, where a common approach is a linear filter derived from some optimization criterion, such as an exponential filter or a first-order differential Gaussian filter. The main target of sub-band selection is to remove the part of the spectrum where the noise signal energy is relatively concentrated while retaining most of the energy of the audio signal; accordingly, the audio signal is divided into a high-frequency and a low-frequency sub-band for the audio/non-audio decision. After the start and end points of both sub-bands are obtained, the sub-bands are fused in a combined decision: the final start point of the audio segment is the earlier of the two sub-band start points, and the final end point is the later of the two sub-band end points.
The decision method based on an entropy function sets the frame length of the speech signal s(n) to N; with the maximum and minimum amplitudes in a frame of speech being M and -M respectively, the entropy of this frame is defined as:
[Equation image PCTCN2014091959-appb-000001: definition of the frame entropy]
Once the entropy function has been constructed, the information entropy of each frame of the speech signal can be calculated. Based on the principle that background-noise signals have a small entropy value while voiced signals have a large one, a threshold h is defined and compared against the entropy of each speech frame: frames whose entropy exceeds h are speech frames, and frames whose entropy falls below h are silent frames.
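The exact entropy formula is only available as an image in this copy; a minimal sketch under the common assumption that the amplitude range [-M, M] is partitioned into histogram bins and the Shannon entropy of the bin probabilities is taken per frame might look as follows (the bin count and the threshold h are illustrative):

```python
import numpy as np

def frame_entropy(frame, n_bins=32):
    """Shannon entropy of the amplitude histogram of one frame (assumed formulation)."""
    m = float(np.max(np.abs(frame)))
    if m == 0.0:
        return 0.0                      # an all-zero frame carries no information
    hist, _ = np.histogram(frame, bins=n_bins, range=(-m, m))
    p = hist / len(frame)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def entropy_vad(signal, frame_len=256, h=2.0):
    """Frames whose entropy exceeds the threshold h are labeled as speech."""
    n_frames = len(signal) // frame_len
    return np.array([frame_entropy(signal[i * frame_len:(i + 1) * frame_len]) > h
                     for i in range(n_frames)])
```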
The algorithms above perform well in quiet environments, but in realistic complex background-noise environments system performance degrades markedly, and they fail when the background noise is strong or high-energy burst noise is present. Because the applications of speech recognition and speaker recognition are extremely varied and flexible, designing a single fixed classifier for audio activation detection has no generality.
Most audio activation detection methods in current use perform very well in quiet environments but fail under heavy background noise or high-energy burst noise. Because the applications of speech recognition and speaker recognition are so varied and flexible, a fixed classifier for noise detection lacks generality and has no practical significance. For example, if the device is installed next to an air conditioner, the sound emitted by the air conditioner should be defined as the principal noise; if it is installed next to a door, the sounds produced by opening, closing, and knocking on the door should be defined as the principal noise. Likewise, in a speech recognition system, ambient background sound and low-energy human voices may be defined as the principal noise, whereas in some speaker recognition systems burst signals such as screams and explosions are defined as noise while human voices, car sounds, and the like are not. The VAD should therefore be designed as a customizable classifier that can be updated with new audio data to improve its environmental adaptability.
Summary of the Invention
To overcome the shortcomings of the prior art described above, the object of the present invention is to provide an audio detection and classification method with customizable categories: part of the original training set is first divided by type into several classes of training sets; features are extracted for each class of training set, and the corresponding Gaussian mixture model and its parameters are trained to obtain a global Gaussian mixture model; other training sets are further used as new training samples to update the parameters of the global Gaussian mixture model, yielding a local model; finally, features are extracted from the test set, fed into the local-model classifier, and the results are smoothed and output. Its main advantage is overcoming the inability of existing audio activation detection to define multiple custom categories and make decisions over them.
To achieve the above object, the technical solution adopted by the present invention is:
An audio detection and classification method with customizable categories, comprising the following steps:
Step 1: feature extraction for the different classes of training samples
The training samples comprise audio signals of different classes; acoustic features are extracted from these training samples as training features for speaker recognition;
Step 2: training the global Gaussian mixture model parameters
After feature extraction of the training samples is completed, Gaussian mixture model parameter training is performed on the first class of training samples, and the Gaussian mixture model parameters corresponding to the first class are output; and so on, Gaussian mixture model parameter training is performed on the m-th class of training samples, and the corresponding parameters are output;
Step 3: training the local Gaussian mixture model parameters
Assuming the series of Gaussian mixture model parameters has been obtained in step 2, when new training samples are acquired the global Gaussian mixture model is updated to obtain the local Gaussian mixture model parameters; the new training samples are combined with the global Gaussian mixture model to further train the Gaussian mixture model parameters, yielding the local Gaussian mixture model;
Step 4: testing the classifier
After the local Gaussian mixture model parameters have been obtained in step 3, a Bayesian classifier based on the local Gaussian mixture model is constructed:
ĵ(x) = argmax_{1 ≤ j ≤ l} π_j p_j(x; μ_j, Σ_j)
and audio detection classification is performed on all test samples.
The acoustic features in the first step cover human speech, background noise, door-closing sounds, and babble noise.
In the first step, the purpose of global model training is to train the most basic and most widely used models, such as human speech, background noise, door-closing sounds, and babble noise; these sounds are objects that need to be defined in almost all applications. It is therefore necessary to perform model training on these kinds of data in advance to obtain their probability density distributions, and thereby train the global model. Analogous to the Universal Background Model (UBM) in speaker recognition, the output of the global model is a set of Gaussian mixture model parameters
{π_mn, μ_mn, Σ_mn},
n = 1, 2, ..., N_m, m = 1, 2, ..., M, where π denotes the mixture weights of the mixture models, and μ and Σ are the mean vector and covariance matrix of each Gaussian component. N_m denotes the number of Gaussian components of the m-th mixture model, and M denotes the number of categories.
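As an illustration of this step, per-class global model training could be sketched with scikit-learn's GaussianMixture; the component count, the feature pipeline, and the dictionary interface are assumptions for demonstration, since the patent does not prescribe a particular EM implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_global_models(class_features, n_components=8):
    """Fit one GMM per audio class.

    class_features maps a class name to an (n_frames, dim) array of
    acoustic feature vectors (e.g. MFCCs) extracted from that class.
    """
    models = {}
    for name, feats in class_features.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='full', random_state=0)
        gmm.fit(feats)
        # gmm.weights_, gmm.means_ and gmm.covariances_ play the roles of
        # pi_mn, mu_mn and Sigma_mn for this class.
        models[name] = gmm
    return models
```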
In the third step, local Gaussian mixture model training mainly combines the new training data with the global model to further train the Gaussian mixture model parameters and obtain a local model, covering two cases: in one, the new training samples belong to an existing audio class and are added to the existing training samples, and the Gaussian mixture model parameters are updated; in the other, the new training samples do not belong to an existing audio class, and a new category must be added to the Gaussian mixture model and the Gaussian mixture model parameters updated.
In the first case, the Gaussian mixture model parameters are usually solved with the Expectation Maximization (EM) method, i.e. given training data
x_1, x_2, ..., x_l,
where l is the number of samples, all unknown parameters are estimated. When building a Gaussian mixture model, storing all the training samples would consume very large resources; the idea of incremental learning can instead be used to update the Gaussian mixture model parameters from the existing parameters and the new training samples. The method is as follows:
Assume the parameters of some class's Gaussian mixture model are π_j, μ_j, Σ_j, j = 1, 2, ..., g, where g is the number of mixture components; the samples used to train it are x_1, x_2, ..., x_N, and the new training samples are
x̃_1, x̃_2, ..., x̃_K.
The Gaussian mixture model parameters π′_j, μ′_j, Σ′_j, j = 1, 2, ..., g must be re-estimated. The total expectation Q is then:
[Equation image PCTCN2014091959-appb-000006: the total expectation Q over the original and new samples]
where θ = {π_j, μ_j, Σ_j}, j = 1, 2, ..., g, and θ′ = {π′_j, μ′_j, Σ′_j}, j = 1, 2, ..., g,
[Equation image PCTCN2014091959-appb-000007]
Replacing the stored training samples by their mathematical expectations, π′_j, μ′_j, Σ′_j, j = 1, 2, ..., g are estimated as:
[Equation images PCTCN2014091959-appb-000008 to appb-000010: the update formulas for π′_j, μ′_j and Σ′_j]
where N and K are the numbers of the original training samples x_i and of the new training samples x̃_k, respectively.
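The update formulas themselves survive only as equation images in this copy. The sketch below implements the standard incremental-EM update that matches the surrounding description, in which the sums over the stored samples are replaced by their expectations (N·π_j, N·π_j·μ_j, and so on) and each new sample is weighted by its posterior responsibility; the exact formulas in the patent may differ, so this is an assumed reconstruction rather than the patented equations.

```python
import numpy as np
from scipy.stats import multivariate_normal

def incremental_gmm_update(pi, mu, sigma, n_old, x_new):
    """Incrementally update GMM parameters from K new samples.

    pi: (g,) mixture weights; mu: (g, d) means; sigma: (g, d, d) covariances;
    n_old: N, the number of original training samples; x_new: (K, d) new samples.
    """
    g, k = len(pi), len(x_new)
    # Posterior responsibilities p(j | x_k) of each component for each new sample.
    lik = np.stack([pi[j] * multivariate_normal.pdf(x_new, mu[j], sigma[j])
                    for j in range(g)], axis=1)            # (K, g)
    resp = lik / lik.sum(axis=1, keepdims=True)
    s0 = resp.sum(axis=0)                                  # sum over k of p(j | x_k)
    pi_new = (n_old * pi + s0) / (n_old + k)
    mu_new, sigma_new = np.empty_like(mu), np.empty_like(sigma)
    for j in range(g):
        denom = n_old * pi[j] + s0[j]
        mu_new[j] = (n_old * pi[j] * mu[j] + resp[:, j] @ x_new) / denom
        d_old = (mu[j] - mu_new[j])[:, None]               # shift of the old mean
        d_new = x_new - mu_new[j]
        sigma_new[j] = (n_old * pi[j] * (sigma[j] + d_old @ d_old.T)
                        + (resp[:, j, None] * d_new).T @ d_new) / denom
    return pi_new, mu_new, sigma_new
```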
In the second case, when one or several new audio classes must be added and discriminated, the current Gaussian mixture model parameters of some class are known to be π_j, μ_j, Σ_j, j = 1, 2, ..., g, where g is the number of mixture components, and the number of originally trained samples is N. At the same time, some new training samples x̃_1, x̃_2, ..., x̃_K are obtained that do not belong to any existing Gaussian mixture model. To re-estimate the Gaussian mixture model parameters, assume h new Gaussian mixture components with parameters π_j, μ_j, Σ_j, j = g+1, g+2, ..., g+h are added; the full set of g+h Gaussian mixture model parameters is then π′_j, μ′_j, Σ′_j, j = 1, 2, ..., g+h.
Compared with the prior art, the present invention refines the classification of different types of training samples by establishing a global model and a local model, obtains a local Gaussian mixture model by training on top of the global Gaussian mixture model, and finally realizes audio activation detection with customizable categories. The method of the invention can be regarded as a machine-learning approach that replaces global learning with local learning to model different types of data; it effectively solves the problem that audio classes could not be custom-defined and distinguished. Applying this approach to some audio activation detection data sets yields better performance than detection methods based on audio energy or other features.
Brief Description of the Drawings
Figure 1 is a flow chart of the global model training module of the audio detection classification of the present invention.
Figure 2 is a flow chart of the local model training module of the audio detection classification of the present invention.
Figure 3 is a flow chart of the classifier testing method of the audio detection classification of the present invention.
Detailed Description of the Embodiments
The embodiments of the present invention are described in detail below with reference to the drawings and examples.
Figure 1 is a flowchart of the global model training of the audio detection classification of the present invention, and includes the following:
The present invention proposes a global model training method and apparatus based on audio detection classification, in particular for audio activation detection classification scenarios. These methods and apparatus are not limited to audio activation detection classification and may be any method and apparatus related to audio classification.
Figure 1 depicts an example of global model training based on audio detection classification.
As shown in Figure 1, the first class of training samples 101 includes all audio signals of the first class used for training, the second class of training samples 102 includes all audio signals of the second class used for training, and so on; the M-th class of training samples 103 includes all audio signals of the M-th class used for training.
Feature extraction 104 refers to extracting acoustic features as detection information after the audio signal has been obtained in the first step; these acoustic features may be Mel Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), or other acoustic features.
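For illustration, frame-level MFCC extraction could be sketched with the librosa library; the native-sample-rate handling and the coefficient count are assumptions for demonstration, not requirements of the patent.

```python
import librosa

def extract_mfcc(path, n_mfcc=13):
    """Load an audio file and return its MFCCs as an (n_frames, n_mfcc) array."""
    y, sr = librosa.load(path, sr=None)            # keep the file's native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                  # one feature vector per frame
```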
The first class Gaussian mixture model 105 first performs model training on the first class of training samples 101 to obtain their probability density distribution; the output is the Gaussian mixture model parameters corresponding to the first class of training samples
{π_1n, μ_1n, Σ_1n}, n = 1, 2, ..., N_1,
where π denotes the mixture weights of the mixture model, and μ and Σ are the mean vector and covariance matrix of each Gaussian component. By analogy, the output of the second class Gaussian mixture model 106 is the Gaussian mixture model parameters corresponding to the second class of training samples
{π_2n, μ_2n, Σ_2n}, n = 1, 2, ..., N_2,
and the output of the M-th class Gaussian mixture model 107 is the Gaussian mixture model parameters corresponding to the M-th class of training samples
{π_Mn, μ_Mn, Σ_Mn}, n = 1, 2, ..., N_M.
In general, for the parameters {π_mn, μ_mn, Σ_mn}, n = 1, 2, ..., N_m, m = 1, 2, ..., M, N_m denotes the number of Gaussian components of the m-th mixture model, and M denotes the number of categories.
Figure 2 is a flowchart of the local model training of the audio detection classification of the present invention, and includes the following:
The parameters of the current global model 202 are known to be π_j, μ_j, Σ_j, j = 1, 2, ..., g, where g is the number of mixture components, and the number of originally trained samples is N. When new training samples 201 are obtained, the parameter update 203 proceeds as follows:
Local model training includes two cases: in one, the new training samples belong to an existing audio class and must be added to the existing training samples to update the Gaussian mixture model parameters; in the other, the new training samples do not belong to an existing audio class, and a new category must be added to the Gaussian mixture model and the Gaussian mixture model parameters updated.
In the first case, the Gaussian mixture model parameters are usually solved with the Expectation Maximization (EM) method, i.e. given training data
x_1, x_2, ..., x_l,
where l is the number of samples, all unknown parameters are estimated. When building a Gaussian mixture model, storing all the training samples would consume very large resources; the idea of incremental learning can instead be used to update the Gaussian mixture model parameters from the existing parameters and the new training samples. The method is as follows:
Assume the parameters of some class's Gaussian mixture model are π_j, μ_j, Σ_j, j = 1, 2, ..., g, where g is the number of mixture components; the samples used to train it are x_1, x_2, ..., x_N, and the new training samples are
x̃_1, x̃_2, ..., x̃_K.
The Gaussian mixture model parameters π′_j, μ′_j, Σ′_j, j = 1, 2, ..., g must be re-estimated. The total expectation Q is then:
[Equation image PCTCN2014091959-appb-000018: the total expectation Q over the original and new samples]
where θ = {π_j, μ_j, Σ_j}, j = 1, 2, ..., g, and θ′ = {π′_j, μ′_j, Σ′_j}, j = 1, 2, ..., g,
[Equation image PCTCN2014091959-appb-000019]
Replacing the stored training samples by their mathematical expectations, π′_j, μ′_j, Σ′_j, j = 1, 2, ..., g are estimated as:
[Equation images PCTCN2014091959-appb-000020 to appb-000022: the update formulas for π′_j, μ′_j and Σ′_j]
where N and K are the numbers of the original training samples x_i and of the new training samples x̃_k, respectively.
In the second case, when one or several new audio classes must be added and discriminated, the current Gaussian mixture model parameters of some class are known to be π_j, μ_j, Σ_j, j = 1, 2, ..., g, where g is the number of mixture components, and the number of originally trained samples is N. At the same time, some new training samples x̃_1, x̃_2, ..., x̃_K are obtained that do not belong to any existing Gaussian mixture model. To re-estimate the Gaussian mixture model parameters, assume h new Gaussian mixture components with parameters π_j, μ_j, Σ_j, j = g+1, g+2, ..., g+h are added; the full set of g+h Gaussian mixture model parameters is then π′_j, μ′_j, Σ′_j, j = 1, 2, ..., g+h.
Figure 3 is a flowchart of the classifier test of the audio detection classification of the present invention, and includes the following:
The test samples 301 include all audio signals of the first class used for testing.
Feature extraction 302 refers to extracting acoustic features as detection information after the audio signal has been obtained in the first step; these acoustic features may be Mel Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), or other acoustic features.
The local classifier 303 is a Bayesian classifier based on the Gaussian mixture model; the classifier is defined as follows:
ĵ(x) = argmax_{1 ≤ j ≤ l} π_j p_j(x; μ_j, Σ_j),
where l = g + h is the total number of Gaussian mixture components, π_j is the weight of the j-th mixture component, and p_j(x; μ_j, Σ_j) is the j-th multidimensional Gaussian distribution, defined as
p_j(x; μ_j, Σ_j) = (2π)^(-d/2) |Σ_j|^(-1/2) exp(-(1/2)(x - μ_j)^T Σ_j^(-1) (x - μ_j)),
where d is the dimension of the feature vector x.
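A minimal sketch of this decision rule follows; the pooled parameter arrays and the per-component class labels are an assumed interface, with log-domain scores used for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal

def classify_frame(x, pi, mu, sigma, labels):
    """Assign feature vector x to the class of the best weighted Gaussian component.

    pi: (l,) component weights pooled over all g+h models; mu: (l, d) means;
    sigma: (l, d, d) covariances; labels: (l,) audio class of each component.
    """
    scores = [np.log(pi[j]) + multivariate_normal.logpdf(x, mu[j], sigma[j])
              for j in range(len(pi))]
    return labels[int(np.argmax(scores))]
```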

Claims (3)

  1. An audio detection and classification method with customizable categories, characterized by comprising the following steps:
    Step 1: feature extraction for the different classes of training samples
    The training samples comprise audio signals of different classes; acoustic features are extracted from these training samples as training features for speaker recognition;
    Step 2: training the global Gaussian mixture model parameters
    After feature extraction of the training samples is completed, Gaussian mixture model parameter training is performed on the first class of training samples, and the Gaussian mixture model parameters corresponding to the first class of training samples are output; and so on, Gaussian mixture model parameter training is performed on the m-th class of training samples, and the Gaussian mixture model parameters corresponding to the m-th class of training samples are output;
    Step 3: training the local Gaussian mixture model parameters
    Assuming the series of Gaussian mixture model parameters has been obtained in step 2, when new training samples are obtained the global Gaussian mixture model is updated to obtain the local Gaussian mixture model parameters; the new training samples are combined with the global Gaussian mixture model to further train the Gaussian mixture model parameters, yielding the local Gaussian mixture model;
    Step 4: testing the classifier
    After the local Gaussian mixture model parameters have been obtained in step 3, a Bayesian classifier based on the local Gaussian mixture model
    ĵ(x) = argmax_{1 ≤ j ≤ l} π_j p_j(x; μ_j, Σ_j)
    is constructed, and audio detection classification is performed on all test samples.
  2. The audio detection and classification method with customizable categories according to claim 1, characterized in that the acoustic features in the first step cover human speech, background noise, door-closing sounds, and babble noise.
  3. The audio detection and classification method with customizable categories according to claim 1, characterized in that the local Gaussian mixture model training in the third step comprises two cases: in one, the new training samples belong to an existing audio class and are added to the existing training samples, and the Gaussian mixture model parameters are updated; in the other, the new training samples do not belong to an existing audio class, and a new category must be added to the Gaussian mixture model and the Gaussian mixture model parameters updated;
    in the first case, assume the known parameters of some class's Gaussian mixture model are π_j, μ_j, Σ_j, j = 1, 2, ..., g, where π denotes the mixture weights of the Gaussian mixture model, μ is the mean vector of each Gaussian component, Σ is the covariance matrix of each Gaussian component, and g is the number of mixture components; the samples used to train it are x_1, x_2, ..., x_N, and the new training samples are x̃_1, x̃_2, ..., x̃_K;
    the Gaussian mixture model parameters π′_j, μ′_j, Σ′_j, j = 1, 2, ..., g are re-estimated as:
    [Equation images PCTCN2014091959-appb-100003 to appb-100005: the update formulas for π′_j, μ′_j and Σ′_j]
    where N and K are the numbers of the training samples x_i and of the new training samples x̃_k, respectively;
    in the second case, when one or several new audio classes must be added and discriminated, the current parameters of some class's Gaussian mixture model are known to be π_j, μ_j, Σ_j, j = 1, 2, ..., g, where π denotes the mixture weights of the mixture model, μ is the mean vector of each Gaussian component, Σ is the covariance matrix of each Gaussian component, and g is the number of mixture components; the number of originally trained samples is N; the new training samples x̃_1, x̃_2, ..., x̃_K do not belong to any existing Gaussian mixture model; to re-estimate the Gaussian mixture model parameters, assume h new Gaussian mixture components with parameters π_j, μ_j, Σ_j, j = g+1, g+2, ..., g+h are added, so that the full set of g+h Gaussian mixture model parameters is π′_j, μ′_j, Σ′_j, j = 1, 2, ..., g+h.
PCT/CN2014/091959 2014-02-19 2014-11-22 Audio detection and classification method with customizable categories WO2015124006A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410055255.8 2014-02-19
CN201410055255.8A CN103824557B (zh) 2014-02-19 Audio detection and classification method with customizable categories

Publications (1)

Publication Number Publication Date
WO2015124006A1 (zh) 2015-08-27

Family

ID=50759580

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/091959 WO2015124006A1 (zh) 2014-11-22 2015-08-27 清华大学 Audio detection and classification method with customizable categories

Country Status (2)

Country Link
CN (1) CN103824557B (zh)
WO (1) WO2015124006A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112396084A (zh) * 2019-08-19 2021-02-23 中国移动通信有限公司研究院 Data processing method and apparatus, device, and storage medium
CN114186581A (zh) * 2021-11-15 2022-03-15 国网天津市电力公司 Cable hidden-danger identification method and apparatus based on MFCC and a diffused Gaussian mixture model

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103824557B (zh) * 2014-02-19 2016-06-15 清华大学 一种具有自定义功能的音频检测分类方法
CN104361891A (zh) * 2014-11-17 2015-02-18 科大讯飞股份有限公司 特定人群的个性化彩铃自动审核方法及系统
CN104409080B (zh) * 2014-12-15 2018-09-18 北京国双科技有限公司 语音端点检测方法和装置
CN105895080A (zh) * 2016-03-30 2016-08-24 乐视控股(北京)有限公司 语音识别模型训练方法、说话人类型识别方法及装置
US10152974B2 (en) * 2016-04-15 2018-12-11 Sensory, Incorporated Unobtrusive training for speaker verification
CN106251861B (zh) * 2016-08-05 2019-04-23 重庆大学 一种基于场景建模的公共场所异常声音检测方法
CN107358947A (zh) * 2017-06-23 2017-11-17 武汉大学 说话人重识别方法及系统
WO2019084419A1 (en) * 2017-10-27 2019-05-02 Google Llc NON-SUPERVISED LEARNING OF SEMANTIC AUDIO REPRESENTATIONS
CN107993664B (zh) * 2018-01-26 2021-05-28 北京邮电大学 一种基于竞争神经网络的鲁棒说话人识别方法
CN109473112B (zh) * 2018-10-16 2021-10-26 中国电子科技集团公司第三研究所 一种脉冲声纹识别方法、装置、电子设备及存储介质
CN111797708A (zh) * 2020-06-12 2020-10-20 瑞声科技(新加坡)有限公司 气流杂音检测方法、装置、终端及存储介质
CN113393848A (zh) * 2021-06-11 2021-09-14 上海明略人工智能(集团)有限公司 用于训练说话人识别模型的方法、装置、电子设备和可读存储介质
CN113421552A (zh) * 2021-06-22 2021-09-21 中国联合网络通信集团有限公司 音频识别方法和装置
CN114626418A (zh) * 2022-03-18 2022-06-14 中国人民解放军32802部队 一种基于多中心复残差网络的辐射源识别方法及装置

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6963835B2 (en) * 2003-03-31 2005-11-08 Bae Systems Information And Electronic Systems Integration Inc. Cascaded hidden Markov model for meta-state estimation
CN101188107A (zh) * 2007-09-28 2008-05-28 中国民航大学 Speech recognition method based on wavelet packet decomposition and Gaussian mixture model estimation
CN101546556A (zh) * 2008-03-28 2009-09-30 展讯通信(上海)有限公司 Classification system for identifying audio content
CN101546557A (zh) * 2008-03-28 2009-09-30 展讯通信(上海)有限公司 Method for updating classifier parameters for audio content recognition
CN101937678A (zh) * 2010-07-19 2011-01-05 东南大学 Decision-capable automatic speech emotion recognition method for the emotion of irritability
US8180638B2 (en) * 2009-02-24 2012-05-15 Korea Institute Of Science And Technology Method for emotion recognition based on minimum classification error
CN103035239A (zh) * 2012-12-17 2013-04-10 清华大学 Speaker recognition method based on local learning
CN103824557A (zh) * 2014-02-19 2014-05-28 清华大学 Audio detection and classification method with customizable categories

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050021337A1 (en) * 2003-07-23 2005-01-27 Tae-Hee Kwon HMM modification method
JP4891806B2 (ja) * 2007-02-27 2012-03-07 日本電信電話株式会社 Adaptive model learning method and apparatus, acoustic model creation method and apparatus for speech recognition using the same, speech recognition method and apparatus using the acoustic model, programs for these apparatuses, and storage media for the programs
CN103077708B (zh) * 2012-12-27 2015-04-01 安徽科大讯飞信息科技股份有限公司 Method for improving rejection capability in a speech recognition system



Also Published As

Publication number Publication date
CN103824557B (zh) 2016-06-15
CN103824557A (zh) 2014-05-28

Similar Documents

Publication Publication Date Title
WO2015124006A1 (zh) 一种具有自定义功能的音频检测分类方法
Ying et al. Voice activity detection based on an unsupervised learning framework
Zelinka et al. Impact of vocal effort variability on automatic speech recognition
Alam et al. Supervised/unsupervised voice activity detectors for text-dependent speaker recognition on the RSR2015 corpus
CN101136199A (zh) 语音数据处理方法和设备
Washani et al. Speech recognition system: A review
Akbacak et al. Environmental sniffing: noise knowledge estimation for robust speech systems
US11100932B2 (en) Robust start-end point detection algorithm using neural network
Vydana et al. Improved emotion recognition using GMM-UBMs
Unnibhavi et al. LPC based speech recognition for Kannada vowels
Lee et al. Speech/audio signal classification using spectral flux pattern recognition
CN102419976A (zh) 一种基于量子学习优化决策的音频索引方法
Trabelsi et al. A multi level data fusion approach for speaker identification on telephone speech
WO2016152132A1 (ja) 音声処理装置、音声処理システム、音声処理方法、および記録媒体
MY An improved feature extraction method for Malay vowel recognition based on spectrum delta
Sarma et al. Analysis of spurious vowel-like regions (vlrs) detected by excitation source information
Kaur et al. Speech Activity Detection and its Evaluation in Speaker Diarization System
Shahrul Azmi et al. Noise robustness of Spectrum Delta (SpD) features in Malay vowel recognition
Janicki et al. Improving GMM-based speaker recognition using trained voice activity detection
Pammi et al. Detection of nonlinguistic vocalizations using alisp sequencing
Hartmann et al. Nothing doing: Reevaluating missing feature ASR
Tsao et al. A study on separation between acoustic models and its applications.
Chao et al. Two-stage Vocal Effort Detection Based on Spectral Information Entropy for Robust Speech Recognition.
Ying et al. Robust voice activity detection based on noise eigenspace
Kshirsagar et al. Comparative study of phoneme recognition techniques

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14883492

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14883492

Country of ref document: EP

Kind code of ref document: A1