CN101228577B

CN101228577B - Method and system for automatic speech recognition channel normalization

Info

Publication number: CN101228577B
Application number: CN2005800022461A
Authority: CN
Inventors: Igor Zlokarnik; Laurence S. Gillick; Jordan Cohen
Original assignee: Voice Signal Technologies Inc
Current assignee: Voice Signal Technologies Inc
Priority date: 2004-01-12
Filing date: 2005-01-10
Publication date: 2011-11-23
Anticipated expiration: 2025-01-10
Also published as: CN101228577A; JP4682154B2; DE602005026949D1; EP1774516A4; US7797157B2; WO2005070130A2; WO2005070130A3; US20050182621A1; EP1774516B1; JP2007536562A; EP1774516A2

Abstract

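Statistics are measured from an initial portion of a speech utterance. Feature normalization parameters are estimated based on the measured statistics and on a statistically derived mapping that relates measured statistics to feature normalization parameters.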

Description

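Method and system for automatic speech recognition channel normalization

Cross-reference to related applications

This application claims priority to U.S. Provisional Application Serial No. 60/535,863, filed January 12, 2004.

Technical field

The invention relates to channel normalization for automatic speech recognition.

Background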

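The recognition performance (e.g., accuracy) of an automatic speech recognition system can be adversely affected by variability of the communication channel. Sources of variability include the speaker (e.g., vocal tract geometry, glottal excitation), the transmission channel (e.g., variable position and direction relative to the microphone, room acoustics, ambient noise), and the use of microphones with different characteristics. Numerous schemes have been proposed to reduce the influence of the communication channel on recognition performance. One such technique normalizes the recognition feature vector of cepstral coefficients so that each feature dimension feature[i] has zero mean and unit variance with respect to time t. The technique is typically applied using K cepstral coefficients (or mel-frequency cepstral coefficients) cepstrum[i] and their first- and second-order derivatives (Δcepstrum[i] and ΔΔcepstrum[i]) to compute the normalized recognition features:

feature[i] = (cep[i] - μ[i]) / σ[i]   for 0 ≤ i < 3K

where:

cep[i] = cepstrum[i]

cep[i+K] = Δcepstrum[i]   for 0 ≤ i < K

cep[i+2K] = ΔΔcepstrum[i]

where μ[i] is the mean of cep[i] over time t, and σ²[i] is the variance of cep[i] over time t.

Cepstral mean normalization (i.e., subtracting μ[i]) allows removal of a stationary and linear, albeit unknown, channel transfer function. Cepstral variance normalization (i.e., dividing by σ[i]) helps compensate for the reduction in the variance of the cepstral coefficients caused by additive noise.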
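As a minimal sketch of utterance-based mean and variance normalization, the following C fragment computes μ[i] and σ²[i] over all frames of an utterance and then normalizes one frame. The function names, the row-major frame layout, and the choice to pass σ[i] in precomputed are assumptions made for illustration; they are not taken from the patent.

```c
#include <assert.h>
#include <stddef.h>

/* Mean and variance over time of each feature dimension.
 * cep is row-major: cep[t * dim + i] is dimension i at frame t. */
void utterance_stats(const double *cep, size_t num_frames, size_t dim,
                     double *mean, double *var)
{
    size_t t, i;

    assert(cep && mean && var && num_frames > 0);
    for (i = 0; i < dim; i++) {
        double sum = 0.0, sq = 0.0;

        for (t = 0; t < num_frames; t++)
            sum += cep[t * dim + i];
        mean[i] = sum / (double)num_frames;

        for (t = 0; t < num_frames; t++) {
            double d = cep[t * dim + i] - mean[i];
            sq += d * d;
        }
        var[i] = sq / (double)num_frames;
    }
}

/* feature[i] = (cep[i] - mean[i]) / sigma[i] for a single frame.
 * sigma[i] is the standard deviation (square root of var[i]); the caller
 * is assumed to have floored it away from zero. */
void normalize_frame(const double *cep, const double *mean,
                     const double *sigma, size_t dim, double *feature)
{
    size_t i;

    assert(cep && mean && sigma && feature);
    for (i = 0; i < dim; i++) {
        assert(sigma[i] > 0.0);
        feature[i] = (cep[i] - mean[i]) / sigma[i];
    }
}
```

Because the statistics need every frame of the utterance, no frame can be normalized until the utterance has ended, which is the source of the delay discussed in the background.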

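The amount of time over which the channel characteristics are estimated can affect the performance of the speech recognizer. If the time window is chosen too long, the channel may no longer be considered stationary. If the time window is chosen too short, the particular phonetic content of the speech segment can bias the estimate of the channel characteristics. As a compromise, many recognition systems estimate the channel based on a complete utterance of speech. Depending on the processing speed of the recognition system, this utterance-based normalization can lead to undesirable system delays, since processing of the utterance does not start until the utterance has ended. Time-synchronous (or online) schemes typically use some type of recursive realization of channel normalization, in which the long-term estimates of the mean and variance of the cepstral features are updated incrementally over time t, every τ = 10-20 msec:

μ[i, t] = α μ[i, t-τ] + (1-α) cep[i, t]

σ²[i, t] = α σ²[i, t-τ] + (1-α) (cep[i, t] - μ[i, t])²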
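A recursive update of this kind can be sketched in C as follows. The struct layout, the names, and the in-place overwrite of the previous estimates are illustrative assumptions; the patent gives only the recursion itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state for a recursive channel estimate. */
typedef struct {
    double *mean;   /* long-term mean estimate per dimension     */
    double *var;    /* long-term variance estimate per dimension */
    size_t dim;     /* number of feature dimensions (3K)         */
    double alpha;   /* forgetting factor, 0 < alpha < 1          */
} RecursiveChannel;

/* Fold one new frame of cepstral features into the running estimates. */
void recursive_update(RecursiveChannel *ch, const double *cep)
{
    size_t i;

    assert(ch && cep && ch->alpha > 0.0 && ch->alpha < 1.0);
    for (i = 0; i < ch->dim; i++) {
        double d;

        /* previous estimate weighted by alpha, new frame by (1 - alpha) */
        ch->mean[i] = ch->alpha * ch->mean[i] + (1.0 - ch->alpha) * cep[i];

        /* the deviation uses the mean that was just updated */
        d = cep[i] - ch->mean[i];
        ch->var[i] = ch->alpha * ch->var[i] + (1.0 - ch->alpha) * d * d;
    }
}
```

A value of alpha close to 1 gives a long memory and slow adaptation; a smaller alpha adapts faster but lets the phonetic content of recent frames bias the estimate.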

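Non-speech segments are another complicating factor in channel estimation. Since the transmission channel separates the speaker from the microphone, the effect of the transmission channel becomes audibly apparent only during speech segments. Consequently, a variable ratio of non-speech segments to speech segments will have a profound effect on the estimated channel characteristics. However, attempting to use a fixed ratio is limited by the uncertainty involved in distinguishing between speech and non-speech segments.

Summary

In one aspect, in general, the invention features a method, and corresponding software and system, for processing data. The method includes measuring statistics from an initial portion of a speech utterance, and estimating feature normalization parameters based on the measured statistics and on a statistically derived mapping that relates measured statistics to feature normalization parameters.

According to one embodiment of the invention, a method for automatic speech recognition channel normalization is provided. The method includes: forming a statistically derived mapping based on statistics measured from speech utterances received during offline processing and on feature normalization parameters associated with those speech utterances; measuring statistics from an initial portion of a speech utterance received during online processing; and estimating feature normalization parameters for the speech utterance received during online processing based on the measured statistics of the initial portion and on the statistically derived mapping.

According to another embodiment of the invention, a system for automatic speech recognition channel normalization is provided. The system includes: a regression module configured to form a statistically derived mapping based on statistics measured from speech utterances received during offline processing and on feature normalization parameters associated with those speech utterances; an initial processing module configured to measure statistics from an initial portion of a speech utterance received during online processing; and a mapping module configured to estimate feature normalization parameters for the speech utterance received during online processing based on the measured statistics of the initial portion and on the statistically derived mapping.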

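Aspects of the invention can include one or more of the following features.

The measured statistics include measures of energy from a portion of the speech utterance.

The measures of energy include extreme values of the energy.

The method also includes accepting a plurality of utterances, each associated with corresponding feature normalization parameters. Statistics are measured from a portion of each of the plurality of utterances, and the statistically derived mapping is formed based on the measured statistics and the feature normalization parameters corresponding to the plurality of utterances. The portion of each of the plurality of utterances can include an initial portion of each utterance, or the entirety of each utterance.

Forming the statistically derived mapping includes forming a statistical regression.

The feature normalization parameters corresponding to the plurality of utterances include means and variances over time of the plurality of utterances.

Aspects of the invention can include one or more of the following advantages.

The amount of speech needed to reliably estimate the characteristics of the communication channel is reduced. The system delay associated with channel estimation and normalization is reduced. No explicit discrimination between speech and non-speech segments is performed, which improves the robustness of automatic speech recognition for noisy speech.

Other features and advantages of the invention will be apparent from the following description and from the claims.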

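Description of drawings

FIG. 1 is a block diagram of a processing system for automatic speech recognition channel normalization.

Detailed description

A processing system for automatic speech recognition channel normalization includes offline processing and online processing to generate normalization parameters. The system is configured to exploit observations about the nature of the communication channel. For example, the following observations can be made about the speaker and the parts of the communication channel, including the room, the microphone, and ambient noise:

· The long-term spectrum of a speaker can be characterized primarily by two parameters: the overall loudness, and the spectral tilt, which describes the overall slope of the spectrum. The spectral tilt is a direct result of the ratio between the time the glottis remains open and the time it remains closed during each pitch period. Although this ratio varies slightly across speakers and vocal effort (normal, shouting), the spectral tilt is typically -12 dB/octave. In the cepstral domain, the overall loudness is captured by the 0th cepstral coefficient and the spectral tilt by the 1st cepstral coefficient. For the long-term spectrum, all higher-order cepstral coefficients are close to zero because of its smooth shape in the frequency domain.

· The transfer function of a room exhibits strong peaks and notches due to reflections and echoes. In the cepstral domain, these frequency-to-frequency variations primarily affect higher-order coefficients, not the coefficients used in speech recognition systems. Apart from these variations, the distance and direction between the speaker and the microphone primarily impart an overall attenuation in loudness, which mainly affects the 0th cepstral coefficient.

· The microphone and audio circuitry typically impart some type of bandpass characteristic on the audio signal. The corresponding frequency shape generally affects cepstral coefficients of all orders.

· During speech segments, ambient acoustic noise reduces the variance of cepstral coefficients of all orders. This reduction increases as the signal-to-noise ratio decreases.

A number of properties of the processing system are based on these observations:

A reliable estimate of μ[0] should preferably include at least some speech segments (e.g., speech frames, where a "frame" is the value of the cepstral coefficients cep[i, t] at time t, derived from values of the speech signal V_s[t] over a finite time window), because μ[0] depends on the speaker's loudness and on the geometry of the speaker and/or microphone. The higher-order coefficients of the channel mean μ depend primarily on the microphone and audio circuitry and can therefore be estimated from frames that are not necessarily speech frames. The channel variance depends on the signal-to-noise ratio. The noise level can be estimated from non-speech frames alone, whereas an estimate of the signal level should include at least some speech frames.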

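Referring to FIG. 1, a processing system 10 for automatic speech recognition channel normalization estimates the cepstral means and variances of a communication channel 12 via a mapping module 20. The mapping module 20 uses a functional map whose input parameters, supplied by an initial processing module 14, converge quickly based on a few speech frames. In particular, the following linear map responds quickly to the onset of speech while eliminating the need to detect the speech onset time explicitly:

μ[i, t] = a[0] (S[t] - N[t]) + b[0] + N[t]   for i = 0

μ[i, t] = avg_cep[i, t]   for 0 < i < K

μ[i, t] = 0   for K ≤ i < 3K

σ[i, t] = a[i+1] (S[t] - N[t]) + b[i+1]   for 0 ≤ i < 3K

Here, a[i] and b[i] are the weights of the functional map. S[t] and N[t] are estimates of the signal level and the noise level, respectively. The coefficient avg_cep[i, t] is the average over time of the cepstral coefficient cep[i, t].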
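The linear map can be sketched compactly in C. The appendix contains the source's own implementation (ChannelMapFloat_Apply); the stand-alone function below is only an illustration, and its name, argument order, and array layout (3K+1 weights, with index 0 reserved for the mean of the 0th coefficient) are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Estimate channel means mu[0..3K-1] and scales sigma[0..3K-1] from the
 * signal level S, the noise level N, and the running average cepstrum.
 * a and b each hold 3K+1 weights: a[0], b[0] map the SNR to mu[0];
 * a[i+1], b[i+1] map the SNR to sigma[i]. */
void map_channel(const double *a, const double *b,
                 double S, double N, const double *avg_cep,
                 size_t K, double *mu, double *sigma)
{
    size_t i;
    double snr = S - N;

    assert(a && b && avg_cep && mu && sigma && K > 0);

    /* 0th coefficient: linear in the SNR, shifted by the noise level */
    mu[0] = a[0] * snr + b[0] + N;

    /* remaining static coefficients: running average */
    for (i = 1; i < K; i++)
        mu[i] = avg_cep[i];

    /* first- and second-order derivatives: zero mean */
    for (i = K; i < 3 * K; i++)
        mu[i] = 0.0;

    /* all scales: linear in the SNR */
    for (i = 0; i < 3 * K; i++)
        sigma[i] = a[i + 1] * snr + b[i + 1];
}
```

Only S, N, and the running average change from frame to frame, so the map can be re-evaluated on every frame at negligible cost.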

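The initial processing module 14 estimates the signal level and the noise level online by tracking the extreme values of the frame energy cep[0] over time:

S[t] = max{cep[0, τ]}   for 0 ≤ τ ≤ t

N[t] = min{cep[0, τ]}   for 0 ≤ τ ≤ t

Alternatively, other methods of estimating S and N can be used, including the use of percentiles of cep[0, τ] (e.g., the 80th and 20th percentiles of cep[0, τ], respectively).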
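The percentile alternative might look like the following C sketch. It assumes that all frame energies seen so far are kept in a buffer and uses the nearest-rank definition of a percentile; both choices, along with the function names, are assumptions, since the patent does not say how the percentiles are computed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for doubles in ascending order */
static int cmp_double(const void *p, const void *q)
{
    double x = *(const double *)p, y = *(const double *)q;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile (0 < pct <= 100) of energy[0..n-1].
 * The input buffer is left untouched.
 * Returns 0 on success, -1 if memory could not be allocated. */
int energy_percentile(const double *energy, size_t n, double pct, double *out)
{
    double *sorted;
    size_t rank;

    assert(energy && out && n > 0 && pct > 0.0 && pct <= 100.0);

    sorted = malloc(n * sizeof *sorted);
    if (sorted == NULL)
        return -1;
    memcpy(sorted, energy, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, cmp_double);

    /* rank = ceil(pct / 100 * n), computed without libm */
    rank = (size_t)(pct * (double)n / 100.0);
    if ((double)rank * 100.0 < pct * (double)n)
        rank++;

    *out = sorted[rank - 1];
    free(sorted);
    return 0;
}
```

Calling it with pct = 80 gives a signal-level estimate and with pct = 20 a noise-level estimate. Percentiles are less sensitive to a single outlier frame than the maximum and minimum, at the cost of buffering and sorting the energies.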

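The initial processing module 14 estimates the average cepstral coefficients avg_cep[i, t] online by averaging over all frames encountered so far:

avg_cep[i, t] = Σ cep[i, τ] / (t+1)   over all 0 ≤ τ ≤ t

Alternatively, a recursive scheme can be used.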
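The running average can be maintained without storing past frames. Writing \bar{c}[i,t] for the average of cep[i, τ] over frames 0 to t (a notation assumed here), the sum splits into the previous average and the newest frame:

```latex
\begin{aligned}
\bar{c}[i,t] &= \frac{1}{t+1}\sum_{\tau=0}^{t} \mathrm{cep}[i,\tau]
              = \frac{t\,\bar{c}[i,t-1] + \mathrm{cep}[i,t]}{t+1} \\
             &= \bar{c}[i,t-1] + \frac{\mathrm{cep}[i,t] - \bar{c}[i,t-1]}{t+1}
\end{aligned}
```

Replacing the shrinking step size 1/(t+1) with a fixed step 2^{-s} turns this into an exponentially weighted recursion. The appendix code appears to do exactly that once a set number of frames has been accumulated, using the field cepAvgShift as s.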

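The linear weights a[i] and b[i] are determined during prior offline processing using a speech database 16, which contains utterances V_1[t], ..., V_n[t] of numerous speakers recorded with various audio equipment in different acoustic environments. The weights are determined by linear regression, performed by a linear regression module 18, based on corresponding pairs of "input patterns" and "output patterns" of the mapping module 20. As "input patterns", the system 10 uses the signal and noise levels obtained after each utterance, where each utterance is treated independently. The system 10 measures these signal and noise levels based on a portion of each utterance (e.g., an initial portion of the utterance, or the whole utterance). As "output patterns", the system 10 uses the channel means and variances based on all speech frames of a given session, using the standard formulas:

μ[i] = Σ cep[i, τ] / (t+1)   over all 0 ≤ τ ≤ t

σ²[i] = Σ (cep[i, τ] - μ[i])² / (t+1)   over all 0 ≤ τ ≤ t

where a session includes all utterances for which the communication channel 12 can be assumed to be stationary. Since the linear weights model only the overall trend of the data, the particular speech/silence discrimination used in this step is not critical.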
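For one weight pair, the regression reduces to an ordinary least-squares line fit of an output pattern y (a channel mean or scale) against an input pattern x (the measured SNR). The sketch below uses the same moment accumulators as the appendix program (mean of x, mean of x squared, mean of x times y); the function name and the error convention are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Least-squares fit of y ~ a * x + b over n pattern pairs.
 * Returns 0 on success, -1 if x has no variance (the fit is undefined). */
int fit_line(const double *x, const double *y, size_t n,
             double *a, double *b)
{
    double mean_x = 0.0, mean_y = 0.0, var_x = 0.0, cov_xy = 0.0;
    size_t k;

    assert(x && y && a && b && n > 0);

    for (k = 0; k < n; k++) {
        mean_x += x[k];
        mean_y += y[k];
        var_x  += x[k] * x[k];
        cov_xy += x[k] * y[k];
    }
    mean_x /= (double)n;
    mean_y /= (double)n;
    var_x  = var_x  / (double)n - mean_x * mean_x;   /* variance of x   */
    cov_xy = cov_xy / (double)n - mean_x * mean_y;   /* covariance x, y */

    if (var_x <= 0.0)
        return -1;

    *a = cov_xy / var_x;            /* weight */
    *b = mean_y - *a * mean_x;      /* offset */
    return 0;
}
```

One such fit is needed for μ[0] and one for each of the 3K scales, which matches the 3K+1 weight pairs used by the linear map.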

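The channel estimation scheme used by the system 10 performs well even with few speech frames because it relies primarily on estimates of two properties of the audio signal: its minimum and maximum energy. Values close to the final minimum energy are typically encountered within the first few frames, i.e., even before the utterance starts. Values close to the final maximum energy are typically encountered within the first vowel spoken, regardless of its phonetic identity.

Before the utterance starts, the proposed channel estimation scheme will generally underestimate the signal-to-noise ratio SNR = S - N. More accurate results are therefore obtained when the SNR estimate is floored at a value that represents the SNR of the noisiest acoustic environment in which the system 10 is expected to operate well. Likewise, introducing a processing delay as small as 100-200 ms between the estimation of the SNR and the channel normalization ensures that a fairly mature channel estimate is also applied to the few speech frames that precede the first vowel of the utterance.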
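The SNR floor and the short processing delay can be combined in a small delay line, sketched below in C. The 15-frame delay (150 ms at an assumed 10 ms frame step), the struct, and the function names are hypothetical; the patent specifies only the 100-200 ms range and the flooring.

```c
#include <assert.h>
#include <stddef.h>

#define DELAY_FRAMES 15   /* assumption: 150 ms at a 10 ms frame step */

typedef struct {
    double energy[DELAY_FRAMES];  /* ring buffer of recent cep[0] values */
    size_t count;                 /* frames pushed so far                */
    double min_e, max_e;          /* running extremes (N and S)          */
} DelayedSnr;

void delayed_snr_init(DelayedSnr *d)
{
    assert(d);
    d->count = 0;
    d->min_e = 0.0;
    d->max_e = 0.0;
}

/* Push the newest frame energy. Once the delay line is full, returns 1
 * and writes the energy of the frame from DELAY_FRAMES ago together with
 * the current floored SNR; returns 0 while the line is still filling. */
int delayed_snr_push(DelayedSnr *d, double energy, double snr_floor,
                     double *delayed_energy, double *snr)
{
    size_t slot;
    int ready;

    assert(d && delayed_energy && snr);

    /* update the extremes with the newest frame first */
    if (d->count == 0) {
        d->min_e = d->max_e = energy;
    } else {
        if (energy < d->min_e) d->min_e = energy;
        if (energy > d->max_e) d->max_e = energy;
    }

    slot = d->count % DELAY_FRAMES;
    ready = d->count >= DELAY_FRAMES;
    if (ready) {
        *delayed_energy = d->energy[slot];   /* oldest buffered frame */
        *snr = d->max_e - d->min_e;
        if (*snr < snr_floor)
            *snr = snr_floor;
    }
    d->energy[slot] = energy;
    d->count++;
    return ready;
}
```

The frame that comes out is normalized with an SNR estimate that has already seen the following 150 ms of audio, so frames just before the first vowel benefit from the maximum energy reached inside it.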

The appendix includes a software implementation of the normalization method.

Other embodiments are within the scope of the claims.

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include "nrutil.h"
#include "speechdef.h"

#define HISTOGRAM_RANGE 500
#define FLOOR_PERC 0.        /* alternative: 0.01 */
#define SCALE_ESTIMATOR 0    /* alternative: (i%13) */

int main(int argc, char *argv[])
{
    char ioFileList[300], channelMapFilnam[300],
         line[500], cepFilename[300], channelFilename[300],
         prevChannelFilename[300];
    float *channelMean, *channelScale, *avgMax, *avgSqMax, *avgMean,
          *avgScale,
          *crossMaxMean, *crossMaxScale, *weight, *offset, floor,
          *minCepstrum, *maxCepstrum, sessionMin, sessionMax;
    int i, outDim, numFrames, dimFeature, frame, numPattern,
        floorHistogram[HISTOGRAM_RANGE],
        index, numAccum, numHist, j, numEstimatedMeans;
    struct FEATURE featureStruct;
    FILE *listFp, *outFp;
    int ReadFeatureHTK(char filename[], int frameIndex,
                       struct FEATURE *feature);

    /*** file handling ***/
    if (argc != 4) {
        printf("\nusage: %s <I/O file list> <num estimated means> "
               "<output map>\n", argv[0]);
        printf("\nnote: I/O file list needs to have format: "
               "(cepstrum HTK file, channel HTK file)\n\n");
        exit(0);
    }
    strcpy(ioFileList, argv[1]);
    numEstimatedMeans = atoi(argv[2]);
    strcpy(channelMapFilnam, argv[3]);

    /* read file list */
    featureStruct.data = NULL;
    channelMean = NULL;
    channelScale = NULL;
    avgMax = NULL;
    avgSqMax = NULL;
    avgMean = NULL;
    avgScale = NULL;
    crossMaxMean = NULL;
    crossMaxScale = NULL;
    minCepstrum = NULL;
    maxCepstrum = NULL;
    numPattern = 0;
    dimFeature = 0;
    strcpy(prevChannelFilename, "");
    for (i = 0; i < HISTOGRAM_RANGE; i++) {
        floorHistogram[i] = 0;
    }
    numHist = 0;
    if ((listFp = fopen(ioFileList, "r")) == NULL) {
        printf("\n%s: can't open I/O file listing %s !!\n",
               argv[0], ioFileList);
        exit(1);
    }

    while (fgets(line, 500, listFp) != NULL) {
        /* each line names a cepstrum file and its channel file */
        sscanf(line, "%299s %299s", cepFilename, channelFilename);

        /* read channel file */
        if (ReadFeatureHTK(channelFilename, -1, &featureStruct) != 0) {
            printf("\n%s: error occurred while reading "
                   "header of %s !!\n", argv[0], channelFilename);
            exit(1);
        }
        numFrames = featureStruct.header.numFrames;
        dimFeature = featureStruct.header.dimFeature;

        /* allocate and clear the accumulators on the first pass */
        if (featureStruct.data == NULL) {
            featureStruct.data = Vector(0, dimFeature - 1);
            avgMax = Vector(0, dimFeature - 1);
            avgSqMax = Vector(0, dimFeature - 1);
            avgMean = Vector(0, dimFeature - 1);
            avgScale = Vector(0, dimFeature - 1);
            crossMaxMean = Vector(0, dimFeature - 1);
            crossMaxScale = Vector(0, dimFeature - 1);
            channelMean = Vector(0, dimFeature - 1);
            channelScale = Vector(0, dimFeature - 1);
            minCepstrum = Vector(0, dimFeature - 1);
            maxCepstrum = Vector(0, dimFeature - 1);
            for (i = 0; i < dimFeature; i++) {
                avgMax[i] = 0.;
                avgSqMax[i] = 0.;
                avgMean[i] = 0.;
                avgScale[i] = 0.;
                crossMaxMean[i] = 0.;
                crossMaxScale[i] = 0.;
            }
        }

        /* read channel means */
        if (ReadFeatureHTK(channelFilename, 0, &featureStruct) != 0) {
            printf("\n%s: error occurred while reading "
                   "feature vector in %s !!\n", argv[0], channelFilename);
            exit(1);
        }
        for (i = 0; i < dimFeature; i++) {
            channelMean[i] = featureStruct.data[i];
        }

        /* read channel scales */
        if (ReadFeatureHTK(channelFilename, 1, &featureStruct) != 0) {
            printf("\n%s: error occurred while reading "
                   "feature vector in %s !!\n", argv[0], channelFilename);
            exit(1);
        }
        for (i = 0; i < dimFeature; i++) {
            channelScale[i] = featureStruct.data[i];
        }

        /* read channel min */
        if (ReadFeatureHTK(channelFilename, 2, &featureStruct) != 0) {
            printf("\n%s: error occurred while reading "
                   "feature vector in %s !!\n", argv[0], channelFilename);
            exit(1);
        }
        sessionMin = featureStruct.data[0];

        /* read channel max */
        if (ReadFeatureHTK(channelFilename, 3, &featureStruct) != 0) {
            printf("\n%s: error occurred while reading "
                   "feature vector in %s !!\n", argv[0], channelFilename);
            exit(1);
        }
        sessionMax = featureStruct.data[0];

        /* read (raw) cepstrum file */
        if (ReadFeatureHTK(cepFilename, -1, &featureStruct) != 0) {
            printf("\n%s: error occurred while reading "
                   "header of %s !!\n", argv[0], cepFilename);
            exit(1);
        }
        numFrames = featureStruct.header.numFrames;
        if (numFrames <= 4) continue;
        if (dimFeature != featureStruct.header.dimFeature) {
            printf("\n%s: cepstrum file %s and channel file %s "
                   "have inconsistent number of feature dimensions !!\n\n",
                   argv[0], cepFilename, channelFilename);
            exit(0);
        }

        /* calculate SNR for file: track per-dimension min and max */
        for (frame = 0; frame < numFrames; frame++) {
            if (ReadFeatureHTK(cepFilename, frame, &featureStruct) != 0) {
                printf("\n%s: error occurred while reading "
                       "feature vector in %s !!\n", argv[0], cepFilename);
                exit(1);
            }
            if (frame == 0) {
                for (i = 0; i < dimFeature; i++) {
                    minCepstrum[i] = featureStruct.data[i];
                    maxCepstrum[i] = featureStruct.data[i];
                }
            }
            for (i = 0; i < dimFeature; i++) {
                if (featureStruct.data[i] < minCepstrum[i]) {
                    minCepstrum[i] = featureStruct.data[i];
                }
                if (featureStruct.data[i] > maxCepstrum[i]) {
                    maxCepstrum[i] = featureStruct.data[i];
                }
            }
        }

        /* normalize input and outputs by minCepstrum */
        for (i = 0; i < dimFeature; i++) {
            maxCepstrum[i] -= minCepstrum[i];
            channelMean[i] -= minCepstrum[i];
        }

        /* accumulate pattern statistics for mapping */
        for (i = 0; i < dimFeature; i++) {
            avgMax[i] += maxCepstrum[i];
            avgSqMax[i] += maxCepstrum[i] * maxCepstrum[i];
            avgMean[i] += channelMean[i];
            avgScale[i] += channelScale[i];
            crossMaxMean[i] += maxCepstrum[i] * channelMean[i];
            crossMaxScale[i] += maxCepstrum[SCALE_ESTIMATOR] *
                                channelScale[i];
        }
        numPattern++;

        /* create floorHistogram of session SNRs (one entry per session) */
        if (strcmp(channelFilename, prevChannelFilename)) {
            index = (int)(sessionMax - sessionMin + 0.5);
            if (index >= HISTOGRAM_RANGE) {
                index = HISTOGRAM_RANGE - 1;
            }
            floorHistogram[index]++;
            numHist++;
            strcpy(prevChannelFilename, channelFilename);
        }
    }
    fclose(listFp);

    /* determine mapping: turn the sums into means, variances, covariances */
    if (numPattern < 1) {
        printf("\n%s: no data loaded !!\n", argv[0]);
        exit(0);
    }
    for (i = 0; i < dimFeature; i++) {
        avgMax[i] /= numPattern;
        avgSqMax[i] /= numPattern;
        avgSqMax[i] -= avgMax[i] * avgMax[i];
        avgMean[i] /= numPattern;
        avgScale[i] /= numPattern;
        crossMaxMean[i] /= numPattern;
        crossMaxMean[i] -= avgMax[i] * avgMean[i];
        crossMaxScale[i] /= numPattern;
        crossMaxScale[i] -= avgMax[SCALE_ESTIMATOR] * avgScale[i];
    }

    /* determine SNR floor */
    numAccum = 0;
    floor = -1.;
    for (i = 0; i < HISTOGRAM_RANGE; i++) {
        if (floorHistogram[i] == 0) {
            continue;
        }
        numAccum += floorHistogram[i];
        if (numAccum >= FLOOR_PERC * numHist) {
            floor = i;
            break;
        }
        /* printf("\nhist[%d]=%d (%f)",
                  i, floorHistogram[i], numAccum / (float)numHist); */
        if (numAccum == numHist) {
            break;
        }
    }

    /* print mapping to file */
    outDim = numEstimatedMeans + dimFeature;
    weight = Vector(0, outDim - 1);
    offset = Vector(0, outDim - 1);
    j = 0;
    /* regression of channel means on the maximum cepstrum */
    for (i = 0; i < numEstimatedMeans; i++) {
        weight[j] = crossMaxMean[i] / avgSqMax[i];
        offset[j] = avgMean[i] - weight[j] * avgMax[i];
        j++;
    }
    /* regression of channel scales on the SNR estimator */
    for (i = 0; i < dimFeature; i++) {
        weight[j] = crossMaxScale[i] / avgSqMax[SCALE_ESTIMATOR];
        offset[j] = avgScale[i] - weight[j] * avgMax[SCALE_ESTIMATOR];
        j++;
    }
    if ((outFp = fopen(channelMapFilnam, "w")) == NULL) {
        printf("%s: Cannot open %s for writing.\n",
               argv[0], channelMapFilnam);
        exit(0);
    }
    fprintf(outFp, "ChannelMap %d entries\n", outDim);
    fprintf(outFp, "Weights\n");
    for (i = 0; i < outDim; i++) {
        fprintf(outFp, "%e ", weight[i]);
    }
    fprintf(outFp, "\n");
    fprintf(outFp, "offsets\n");
    for (i = 0; i < outDim; i++) {
        fprintf(outFp, "%e ", offset[i]);
    }
    fprintf(outFp, "\n");
    fclose(outFp);
    printf("\nFloor %e\n", floor);

    FreeVector(featureStruct.data, 0, dimFeature - 1);
    FreeVector(channelMean, 0, dimFeature - 1);
    FreeVector(channelScale, 0, dimFeature - 1);
    FreeVector(avgMax, 0, dimFeature - 1);
    FreeVector(avgSqMax, 0, dimFeature - 1);
    FreeVector(avgMean, 0, dimFeature - 1);
    FreeVector(avgScale, 0, dimFeature - 1);
    FreeVector(crossMaxMean, 0, dimFeature - 1);
    FreeVector(crossMaxScale, 0, dimFeature - 1);
    FreeVector(minCepstrum, 0, dimFeature - 1);
    FreeVector(maxCepstrum, 0, dimFeature - 1);
    FreeVector(weight, 0, outDim - 1);
    FreeVector(offset, 0, outDim - 1);
    exit(0);
}

#include "lvrobfuscate.h" /* make this the first include! */
#include "lvr.h"
#include <limits.h> /* for SHRT_MAX, SHRT_MIN */
#include <math.h>   /* for floor */
#include <string.h> /* for memset */
#include "featenums.h"
#include "channelmapfloat.h"

#define VARIANCE_FLOOR 0.0001

/*
 * ======================================================================
 * ChannelMap_InitDynamic / ChannelMap_FreeDynamic
 * ======================================================================
 */
void
ChannelMapFloat_InitDynamic(ChannelMapFloat *pChannelMap)
{
    pChannelMap->pAvgCepstrum =
        VST_NEW_ARRAYZ(double, pChannelMap->numCepstralCoeff);
    pChannelMap->numFrames = 0;
    /* min > max marks the energy extremes as not yet initialized */
    pChannelMap->minEnergy = 1.;
    pChannelMap->maxEnergy = 0.;
}

void
ChannelMapFloat_FreeDynamic(ChannelMapFloat *pChannelMap)
{
    VST_FREE_ARRAY(pChannelMap->pAvgCepstrum);
}

/*
 * ======================================================================
 * ChannelMap_Reset
 * ======================================================================
 */
void
ChannelMapFloat_Reset(ChannelMapFloat *pChannelMap,
                      BOOL bResetOnlyMinMax)
{
    double decay;

    if (bResetOnlyMinMax)
    {
        /* pull both extremes toward each other by a fraction of the SNR */
        decay = pChannelMap->snrDecay *
                (pChannelMap->maxEnergy - pChannelMap->minEnergy) /
                2.;
        pChannelMap->minEnergy += decay;
        pChannelMap->maxEnergy -= decay;
    }
    else
    {
        pChannelMap->numFrames = 0;
        memset(pChannelMap->pAvgCepstrum, 0,
               pChannelMap->numCepstralCoeff *
               sizeof(*pChannelMap->pAvgCepstrum));
        pChannelMap->minEnergy = 1.;
        pChannelMap->maxEnergy = 0.;
    }
}

/*
 * ======================================================================
 * ChannelMap_GetMinMaxEnergy
 * ======================================================================
 */
void
ChannelMapFloat_GetMinMaxEnergy(ChannelMapFloat *pChannelMap,
                                int16 *minEnergy, int16 *maxEnergy)
{
    int tmp;

    /* convert to fixed point, round, and saturate to the int16 range */
    tmp = (int)floor(pChannelMap->minEnergy * (1 << CEP_Q_PT) + 0.5);
    if (tmp > SHRT_MAX)
    {
        *minEnergy = SHRT_MAX;
    }
    else if (tmp < SHRT_MIN)
    {
        *minEnergy = SHRT_MIN;
    }
    else
    {
        *minEnergy = (int16)tmp;
    }

    tmp = (int)floor(pChannelMap->maxEnergy * (1 << CEP_Q_PT) + 0.5);
    if (tmp > SHRT_MAX)
    {
        *maxEnergy = SHRT_MAX;
    }
    else if (tmp < SHRT_MIN)
    {
        *maxEnergy = SHRT_MIN;
    }
    else
    {
        *maxEnergy = (int16)tmp;
    }
}

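/*
 * ======================================================================
 * ChannelMap_UpdateMinMaxEnergy
 * ======================================================================
 */
void
ChannelMapFloat_UpdateMinMaxEnergy(ChannelMapFloat *pChannelMap,
                                   double energy)
{
    if (energy < pChannelMap->minEnergy)
    {
        pChannelMap->minEnergy = energy;
    }
    if (energy > pChannelMap->maxEnergy)
    {
        pChannelMap->maxEnergy = energy;
    }
}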

/*
 * ======================================================================
 * ChannelMap_Apply
 * ======================================================================
 */
void
ChannelMapFloat_Apply(ChannelMapFloat *pChannelMap, double *pCepstrum,
                      ChannelFloat *pChannel)
{
    int i;
    double temp;
    double snr;

    VST_ASSERT(pChannelMap->pWeights != NULL);
    VST_ASSERT(pChannelMap->pOffsets != NULL);
    VST_ASSERT(pChannelMap->pAvgCepstrum != NULL);
    VST_ASSERT(pChannel->numTotalCepCoeff ==
               3 * pChannelMap->numCepstralCoeff);
    VST_ERROR_IF_0(pChannel->bDoOnlineChannelNorm == TRUE,
                   "Channel map incompatible with algorithm.\n");

    /* track minimum and maximum energy */
    if (pChannelMap->maxEnergy < pChannelMap->minEnergy)
    {
        /* extremes not yet initialized: seed both with this frame */
        pChannelMap->minEnergy = pCepstrum[0];
        pChannelMap->maxEnergy = pCepstrum[0];
    }
    else if (pCepstrum[0] < pChannelMap->minEnergy)
    {
        pChannelMap->minEnergy = pCepstrum[0];
    }
    else if (pCepstrum[0] > pChannelMap->maxEnergy)
    {
        pChannelMap->maxEnergy = pCepstrum[0];
    }

    /* track average cepstrum */
    if (pChannelMap->numFrames < pChannelMap->numAccumFrames)
    {
        /* exact running average over the first numAccumFrames frames */
        for (i = 0; i < pChannelMap->numCepstralCoeff; i++)
        {
            pChannelMap->pAvgCepstrum[i] =
                (pChannelMap->numFrames * pChannelMap->pAvgCepstrum[i] +
                 pCepstrum[i]) / (pChannelMap->numFrames + 1);
        }
        pChannelMap->numFrames++;
    }
    else
    {
        /* afterwards: recursive average with step 2^-cepAvgShift */
        for (i = 0; i < pChannelMap->numCepstralCoeff; i++)
        {
            pChannelMap->pAvgCepstrum[i] +=
                (pCepstrum[i] - pChannelMap->pAvgCepstrum[i]) /
                (1 << pChannelMap->cepAvgShift);
        }
    }

    /* estimate SNR and floor it */
    snr = pChannelMap->maxEnergy - pChannelMap->minEnergy;
    if (snr < pChannelMap->snrFloor)
    {
        snr = pChannelMap->snrFloor;
    }

    /* estimate channel mean[0] from SNR */
    temp = pChannelMap->pWeights[0] * snr + pChannelMap->pOffsets[0];
    temp += pChannelMap->minEnergy;
    pChannel->pSpeechCepstralMean[0] = temp;

    /* for other static dimensions: set channel means to running average */
    for (i = 1; i < pChannelMap->numCepstralCoeff; i++)
    {
        pChannel->pSpeechCepstralMean[i] = pChannelMap->pAvgCepstrum[i];
    }

    /* for all dynamic dimensions: set channel means to 0 */
    for (i = pChannelMap->numCepstralCoeff;
         i < pChannel->numTotalCepCoeff; i++)
    {
        pChannel->pSpeechCepstralMean[i] = 0.;
    }

    /* estimate scales from SNR */
    if (pChannel->bDoVarNorm)
    {
        for (i = 0; i < pChannel->numTotalCepCoeff; i++)
        {
            temp = pChannelMap->pWeights[i + 1] * snr +
                   pChannelMap->pOffsets[i + 1];
            if (temp < VARIANCE_FLOOR)
            {
                temp = VARIANCE_FLOOR;
            }
            pChannel->pSpeechCepstralVar[i] = temp;
        }
    }
}

/*
 * ======================================================================
 * ChannelMap_ComputeCRC
 * ======================================================================
 */
VSTCrc
ChannelMapFloat_ComputeCRC(ChannelMapFloat *pChannelMap)
{
    VSTCrc crc = CRC_INITIAL_VALUE;

    if (pChannelMap != NULL)
    {
        CRC_ADD(crc, pChannelMap->numCepstralCoeff);
        CRC_ADD(crc, pChannelMap->numFrames);
        CRC_ADDDOUBLE(crc, pChannelMap->minEnergy);
        CRC_ADDDOUBLE(crc, pChannelMap->maxEnergy);
        if (pChannelMap->pAvgCepstrum)
        {
            CRC_ADDDOUBLEARRAY(crc, pChannelMap->pAvgCepstrum,
                               pChannelMap->numCepstralCoeff);
        }
    }
    return crc;
}

/*
 * ======================================================================
 * ChannelMap_GetSnapshot
 * ======================================================================
 */
void
ChannelMapFloat_GetSnapshot(ChannelMapFloat *pChannelMap,
                            int numCepstralCoeff,
                            double *pAvgCepstrum,
                            double *minEnergy,
                            double *maxEnergy,
                            int *numFrames)
{
    int i;

    if (pChannelMap != NULL)
    {
        *numFrames = pChannelMap->numFrames;
        for (i = 0; i < numCepstralCoeff; i++)
        {
            pAvgCepstrum[i] = pChannelMap->pAvgCepstrum[i];
        }
        *minEnergy = pChannelMap->minEnergy;
        *maxEnergy = pChannelMap->maxEnergy;
    }
    else
    {
        /* no map: report the same state as after a full reset */
        *numFrames = 0;
        memset(pAvgCepstrum, 0, numCepstralCoeff * sizeof(*pAvgCepstrum));
        *minEnergy = 1.;
        *maxEnergy = 0.;
    }
}

/*
 * ======================================================================
 * ChannelMap_SetSnapshot
 * ======================================================================
 */
void
ChannelMapFloat_SetSnapshot(ChannelMapFloat *pChannelMap,
                            const double *pAvgCepstrum,
                            double minEnergy,
                            double maxEnergy,
                            int numFrames)
{
    int i;

    if (pChannelMap != NULL)
    {
        pChannelMap->numFrames = numFrames;
        for (i = 0; i < pChannelMap->numCepstralCoeff; i++)
        {
            pChannelMap->pAvgCepstrum[i] = pAvgCepstrum[i];
        }
        pChannelMap->minEnergy = minEnergy;
        pChannelMap->maxEnergy = maxEnergy;
    }
}


Claims

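1. A method for automatic speech recognition channel normalization, comprising:

forming a statistically derived mapping based on statistics measured from speech utterances received during offline processing and on feature normalization parameters associated with the speech utterances;

measuring statistics from an initial portion of a speech utterance received during online processing; and

estimating feature normalization parameters for the speech utterance received during online processing based on the measured statistics of the initial portion and on the statistically derived mapping.

2. The method of claim 1, wherein the measured statistics include measures of energy from the initial portion of the speech utterance.

3. The method of claim 2, wherein the measures of the energy include extreme values of the energy.

4. The method of claim 1, wherein forming the statistically derived mapping comprises:

receiving a plurality of utterances, each associated with corresponding feature normalization parameters;

measuring statistics from a portion of each of the plurality of utterances; and

forming the statistically derived mapping based on the measured statistics and on the feature normalization parameters corresponding to the plurality of utterances.

5. The method of claim 4, wherein the portion of each of the plurality of utterances includes an initial portion of each of the plurality of utterances.

6. The method of claim 4, wherein the portion of each of the plurality of utterances includes the entirety of each of the plurality of utterances.

7. The method of claim 4, wherein forming the statistically derived mapping includes forming a statistical regression.

8. The method of claim 4, wherein the feature normalization parameters corresponding to the plurality of utterances include means and variances over time of the plurality of utterances.

9. The method of claim 1, wherein each measured statistic is based on a portion of a single speech utterance, and each feature normalization parameter is based on a plurality of speech utterances.

10. The method of claim 1, wherein forming the statistically derived mapping comprises determining linear weights that relate the statistics measured from the speech utterances received during offline processing to the associated feature normalization parameters.

11. The method of claim 1, wherein the speech utterances received during offline processing are received from a speech database that includes utterances of multiple speakers in multiple acoustic environments.

12. The method of claim 2, wherein a first of the measures of energy includes a maximum value of the energy over a time interval, and a second of the measures of energy includes a minimum value of the energy over a time interval.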

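13. A system for automatic speech recognition channel normalization, comprising:

a regression module configured to form a statistically derived mapping based on statistics measured from speech utterances received during offline processing and on feature normalization parameters associated with the speech utterances;

an initial processing module configured to measure statistics from an initial portion of a speech utterance received during online processing; and

a mapping module configured to estimate feature normalization parameters for the speech utterance received during online processing based on the measured statistics of the initial portion and on the statistically derived mapping.

14. The system of claim 13, wherein the measured statistics include measures of energy from the initial portion of the speech utterance.

15. The system of claim 13,

wherein the regression module is configured to:

receive a plurality of utterances, each associated with corresponding feature normalization parameters;

16. The system of claim 15, wherein the portion of each of the plurality of utterances includes an initial portion of each of the plurality of utterances.

17. The system of claim 15, wherein the portion of each of the plurality of utterances includes the entirety of each of the plurality of utterances.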