KR100883650B1

KR100883650B1 - Method for speech recognition using normalized state likelihood and apparatus thereof

Info

Publication number: KR100883650B1
Application number: KR1020020020916A
Authority: KR
Inventors: 최인정
Original assignee: 삼성전자주식회사
Priority date: 2002-04-17
Filing date: 2002-04-17
Publication date: 2009-02-18
Also published as: KR20030082265A

Abstract

The present invention relates to a speech recognition method and apparatus using a normalized state likelihood. The method comprises: (a) receiving a speech signal, extracting feature vectors from frames of the speech signal, and configuring the states of a hidden Markov model; (b) calculating a state log likelihood for the states; (c) determining normalization factors and calculating a normalized state log likelihood; and (d) searching for and outputting the text corresponding to the normalized state log likelihood. The method thereby compensates for differences in contribution to recognition caused by frame-to-frame differences in the maximum log likelihood, for the discriminability problem caused by state-to-state differences in the maximum likelihood, and for the loss of discriminative power caused by differences in log likelihood between feature streams, and so improves recognition performance.

Description

Methods for speech recognition using normalized state likelihood and apparatus

FIG. 1 is a state transition diagram for calculating the state log likelihood using a hidden Markov model (hereinafter, HMM).

FIG. 2(a) is a flowchart of the speech recognition method using the normalized state likelihood according to the present invention, and FIG. 2(b) is a flowchart of the calculation of the normalized state log likelihood.

FIG. 3 is a block diagram of the speech recognition apparatus using the normalized state likelihood according to the present invention.

FIG. 4 shows the sigmoid function applied to normalize the state log likelihood.

FIG. 5 shows experimental results obtained with the method of the present invention.

The present invention relates to the field of speech recognition using hidden Markov models (HMMs), and in particular to a speech recognition method and apparatus using a state log likelihood normalized by applying a sigmoid function.

Conventional methods for modifying the state likelihood include sorting states by local log likelihood and converting the rank information into scores, and applying a penalty to the state log likelihood values. These methods, however, suffer from differences in contribution to recognition caused by frame-to-frame differences in the maximum log likelihood, a discriminability problem caused by state-to-state differences in the likelihood of the maximum likelihood (ML) state, and a loss of discriminative power caused by differences in log likelihood between feature streams.

To solve these problems, the technical object of the present invention is to provide a speech recognition method and apparatus using a normalized state log likelihood that compensate for the differences in contribution to recognition caused by frame-to-frame differences in the maximum log likelihood of the speech signal, for the discriminability problem caused by state-to-state differences in the maximum likelihood, and for the loss of discriminative power caused by differences in log likelihood between feature streams.

Another technical object of the present invention is to provide a computer-readable recording medium on which a program for executing the method on a computer is recorded.

To achieve the above object, the speech recognition method using the normalized state log likelihood according to the present invention comprises: (a) receiving a speech signal, extracting feature vectors from frames of the speech signal, and configuring the states of a hidden Markov model; (b) calculating a state log likelihood for the states; (c) determining normalization factors and calculating a normalized state log likelihood; and (d) searching for and outputting the text corresponding to the normalized state log likelihood.

To achieve the above object, the speech recognition apparatus using the normalized state log likelihood according to the present invention comprises: a state configuration unit that receives a speech signal and extracts feature vectors from frames of the speech signal to configure the states of a hidden Markov model; a probability calculation unit that calculates a state log likelihood for the states; a normalization calculation unit that determines normalization factors and calculates a normalized state log likelihood; and a text search unit that searches for and outputs the text corresponding to the normalized state log likelihood.

Hereinafter, a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a state transition diagram for calculating the state log likelihood using a hidden Markov model (HMM): FIG. 1(a) shows the speech signal for the word "cat", FIG. 1(b) shows the frames of that speech signal, and FIG. 1(c) is a state transition diagram of a left-to-right HMM.

If the feature vector of one frame consists of C feature streams, the feature vector is expressed as in Equation 1, where t denotes the frame index.

The HMM consists of three states (in-transit, center, and out-transit). Each HMM state is characterized by its state transition probabilities and by the parameters of its output probability density function, and the state log likelihood is calculated as in Equation 2.

Here, s is the state index, c is the stream index, and m is the Gaussian index. The remaining symbols in Equation 2 denote, in order: the feature vector; λ_sc, the hidden Markov model of state s and stream c; the mean vector of the Gaussian density function; the covariance matrix of the Gaussian density function; and the weight.
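The equation images for Equations 1 and 2 are not reproduced in this text, so their exact form cannot be confirmed. The LaTeX below is one plausible reconstruction from the symbol definitions, assuming standard Gaussian-mixture HMM notation. The symbol names (o, w, μ, Σ), the mixture count M, and the choice to sum per-stream log likelihoods over the C streams are assumptions for illustration, not the patent's verified notation.

```latex
% Assumed form of Equation 1: a frame's feature vector as C feature streams.
\mathbf{o}_t = \left[\, \mathbf{o}_{1t},\ \mathbf{o}_{2t},\ \ldots,\ \mathbf{o}_{Ct} \,\right]

% Assumed form of Equation 2: state log likelihood as a sum of
% per-stream Gaussian-mixture log likelihoods.
\log p(\mathbf{o}_t \mid \lambda_s)
  = \sum_{c=1}^{C} \log \sum_{m=1}^{M}
    w_{scm}\, \mathcal{N}\!\left(\mathbf{o}_{ct};\ \boldsymbol{\mu}_{scm},\ \boldsymbol{\Sigma}_{scm}\right)
```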

FIG. 2(a) is a flowchart of the speech recognition method using the normalized state likelihood according to the present invention, and FIG. 2(b) is a flowchart of the calculation of the normalized state log likelihood.

FIGS. 2 and 3 are described together below.

When a speech signal is input (step 210), the state configuration unit 320 performs the preprocessing that precedes recognition, such as environment adaptation, endpoint detection, echo cancellation, or noise removal, and then extracts feature vectors that efficiently represent the digitized speech signal (step 220). Feature extraction generally uses cepstrum extraction, in which the low-order terms of the feature vector reflect the vocal tract characteristics at the time of utterance and the high-order terms reflect the characteristics of the excitation signal that drove the utterance. More recently, MFCC (Mel Frequency Cepstrum Coefficient), a cepstrum extraction method that reflects human auditory perception, has also been used.
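As a rough illustration of cepstrum extraction, the sketch below computes a plain real cepstrum of one frame using only the Python standard library. It is not MFCC (no mel filter bank is applied) and it is not the patent's front end. The function name `real_cepstrum`, the Hamming window, and the naive O(n²) DFT are illustrative assumptions chosen to keep the example self-contained.

```python
import cmath
import math


def real_cepstrum(frame, n_coeffs=12):
    """Return the first n_coeffs real-cepstrum coefficients of one frame.

    Hypothetical helper for illustration: window -> DFT -> log magnitude
    -> inverse DFT. Coefficient 0 (overall log energy) is skipped.
    """
    n = len(frame)
    # Hamming window to reduce spectral leakage at the frame edges.
    windowed = [
        x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    # Naive DFT followed by log magnitude (small floor avoids log(0)).
    log_mag = []
    for k in range(n):
        acc = sum(
            windowed[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n)
        )
        log_mag.append(math.log(abs(acc) + 1e-10))
    # Inverse DFT of the log-magnitude spectrum; keep the real part.
    cepstrum = []
    for q in range(1, n_coeffs + 1):
        acc = sum(
            log_mag[k] * cmath.exp(2j * math.pi * k * q / n) for k in range(n)
        )
        cepstrum.append(acc.real / n)
    return cepstrum


# Usage: a synthetic 64-sample frame containing a single sinusoid.
frame = [math.sin(2.0 * math.pi * 5 * i / 64) for i in range(64)]
coeffs = real_cepstrum(frame)
```

A production front end would use an FFT and a mel filter bank; the naive DFT here only keeps the example dependency-free.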

The calculation of the normalized state likelihood (step 230) is performed by the normalization calculation unit 340. First, the probability calculation unit 330 calculates the state log likelihood for one stream as in Equation 3 (step 231).

The symbols in Equation 3 denote, in order: the state log likelihood for stream c, the feature vector, the mean vector of the Gaussian density function, the covariance matrix of the Gaussian density function, and the weight.
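The per-stream calculation described above can be sketched as the log of a weighted sum of Gaussian densities. The code below is a minimal illustration under stated assumptions: diagonal covariance matrices (the patent does not say whether they are diagonal or full) and hypothetical function names. It uses the log-sum-exp trick so that very small densities do not underflow.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumption) at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - mu) ** 2 / v
        for xi, mu, v in zip(x, mean, var)
    )


def stream_log_likelihood(x, weights, means, variances):
    """Compute log sum_m w_m N(x; mu_m, Sigma_m) for one state and one stream."""
    terms = [
        math.log(w) + log_gaussian_diag(x, mu, var)
        for w, mu, var in zip(weights, means, variances)
    ]
    peak = max(terms)
    # log-sum-exp: factor out the largest term for numerical stability.
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Usage: a 2-component mixture over a 2-dimensional stream vector.
ll = stream_log_likelihood(
    [0.1, -0.2],
    weights=[0.6, 0.4],
    means=[[0.0, 0.0], [1.0, -1.0]],
    variances=[[1.0, 1.0], [0.5, 0.5]],
)
```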

The normalization factor determination unit 341 applies a sigmoid function such as the one shown in FIG. 4 to normalize the state log likelihood, and determines the normalization factors, namely the slope and the center value of the sigmoid function (step 233). The normalization factors are determined by minimum classification error (MCE) training: they are adjusted iteratively so that the recognition error rate on the training data is minimized.
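The iterative adjustment described above can be sketched as gradient descent on a smoothed misclassification loss. The patent does not give its MCE formulation in this text, so the following is a toy sketch under several assumptions: each training token is a pair of frame-level log likelihood lists (correct model, best rival model), the class score is the sum of sigmoid-normalized frame scores, the loss is a sigmoid of the score difference, and gradients are taken by finite differences. All names and default values are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mce_loss(data, slope, center, gamma=1.0):
    """Mean smoothed misclassification loss over (correct, rival) pairs."""
    total = 0.0
    for correct_lls, rival_lls in data:
        g_correct = sum(sigmoid(slope * (l - center)) for l in correct_lls)
        g_rival = sum(sigmoid(slope * (l - center)) for l in rival_lls)
        # Loss approaches 1 when the rival outscores the correct model.
        total += sigmoid(gamma * (g_rival - g_correct))
    return total / len(data)


def train_mce(data, slope=0.1, center=-50.0, lr=0.5, steps=200, eps=1e-4):
    """Adjust slope and center iteratively; return the best pair seen."""
    best = (mce_loss(data, slope, center), slope, center)
    for _ in range(steps):
        # Central finite differences stand in for analytic gradients.
        grad_slope = (
            mce_loss(data, slope + eps, center) - mce_loss(data, slope - eps, center)
        ) / (2.0 * eps)
        grad_center = (
            mce_loss(data, slope, center + eps) - mce_loss(data, slope, center - eps)
        ) / (2.0 * eps)
        slope -= lr * grad_slope
        center -= lr * grad_center
        loss = mce_loss(data, slope, center)
        if loss < best[0]:
            best = (loss, slope, center)
    return best[1], best[2]


# Usage: two toy training tokens with frame-level log likelihoods.
data = [([-40.0, -42.0], [-45.0, -47.0]), ([-60.0, -55.0], [-58.0, -61.0])]
slope, center = train_mce(data)
```

Returning the best parameters seen, rather than the last ones, guards against a fixed learning rate overshooting on this small toy problem.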

Once the normalization factors are determined, the normalized likelihood calculation unit 343 calculates the normalized state log likelihood as in Equation 4.

The symbols in Equation 4 denote, in order: the normalized state log likelihood, the maximum state log likelihood, the minimum state log likelihood, the variable representing the slope of the sigmoid function applied for normalization, the center value of the sigmoid function, and the state log likelihood.
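Equation 4 itself is not reproduced in this text, so only its inputs are known. The sketch below shows one plausible form that uses exactly those inputs: a sigmoid of the state log likelihood, controlled by a slope and a center value, rescaled to lie between the minimum and maximum state log likelihoods. This functional form and the function name are assumptions for illustration, not the patent's confirmed equation.

```python
import math


def normalized_state_log_likelihood(ll, ll_max, ll_min, slope, center):
    """Map a state log likelihood into (ll_min, ll_max) through a sigmoid.

    Assumed form: ll_min + (ll_max - ll_min) * sigmoid(slope * (ll - center)).
    """
    z = slope * (ll - center)
    # Numerically stable logistic function.
    if z >= 0:
        s = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        s = e / (1.0 + e)
    return ll_min + (ll_max - ll_min) * s


# Usage: a score exactly at the center maps to the midpoint of the range.
mid = normalized_state_log_likelihood(-50.0, -10.0, -90.0, 0.1, -50.0)
```

With a positive slope the mapping is monotonic, so the ranking of states within a frame is preserved while extreme values are compressed toward the bounds.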

The text search unit 350 searches for the most likely text (step 240) using the calculated normalized state log likelihood together with the registered vocabulary, pronunciation dictionary, and language model probabilities stored in the reference model storage unit 360, and outputs the result (step 250).
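The patent does not specify the search algorithm in this text. For isolated-word recognition with left-to-right HMMs, a common choice is Viterbi scoring of each vocabulary word followed by picking the best total score, which the sketch below illustrates. The function names, the fixed transition probabilities of 0.5, and the dictionary-based data layout are all hypothetical.

```python
import math


def viterbi_left_to_right(frame_scores, log_self=math.log(0.5), log_next=math.log(0.5)):
    """Best-path log score through a left-to-right HMM.

    frame_scores[t][s] is the (normalized) log likelihood of state s at
    frame t. The path must start in the first state and end in the last.
    """
    n_states = len(frame_scores[0])
    neg_inf = float("-inf")
    best = [neg_inf] * n_states
    best[0] = frame_scores[0][0]
    for scores in frame_scores[1:]:
        new = [neg_inf] * n_states
        for s in range(n_states):
            stay = best[s] + log_self
            move = best[s - 1] + log_next if s > 0 else neg_inf
            new[s] = max(stay, move) + scores[s]
        best = new
    return best[-1]


def recognize(word_frame_scores, lm_log_probs):
    """Return the word with the highest acoustic plus language-model score."""
    return max(
        word_frame_scores,
        key=lambda w: viterbi_left_to_right(word_frame_scores[w]) + lm_log_probs[w],
    )


# Usage: two candidate words, three frames, three states each.
scores = {
    "cat": [[0.0, -5.0, -5.0], [-5.0, 0.0, -5.0], [-5.0, -5.0, 0.0]],
    "dog": [[-5.0, -5.0, -5.0], [-5.0, -5.0, -5.0], [-5.0, -5.0, -5.0]],
}
lm = {"cat": math.log(0.5), "dog": math.log(0.5)}
best_word = recognize(scores, lm)
```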

The experimental conditions are given in (1), (2), and (3) below.

(1) Feature vector: 12-dim MFCC with live CMS, energy with energy normalization, 12-dim DMFCC, delta energy (total 26-dim/frame).

(2) 45 monophones, 3-state left-to-right HMMs, continuous density HMMs, 4 Gaussian pdfs per mixture.

(3) Training data: 1,500-word isolated-word DB related to trade consultation, 20 speakers, about 6,000 words in total, quiet office.

Evaluation data: 500-word isolated-word vocabulary DB of place names, 48 speakers, 7,559 words, anechoic recording room.

FIG. 5(a) shows the word recognition rate for each method, covering normalization (1) and normalization (2) below.

Normalization (1): stream-dependent factors are used as the normalization factors.

Normalization (2): state-dependent factors are used as the normalization factors. Here, the local best state index is defined as in Equation 5.
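Equation 5 is not reproduced in this text. A local best state index is conventionally the state with the highest log likelihood at a given frame, which can be written as below. The symbol names and the argmax form are assumptions based on that usual meaning, not the patent's verified equation.

```latex
% Assumed form of Equation 5: the locally best state at frame t.
\hat{s}_t = \operatorname*{arg\,max}_{s}\ \log p(\mathbf{o}_t \mid \lambda_s)
```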

FIG. 5(b) shows, for cases where the maximum likelihood (ML) state differs, the distribution of the number of states over ranges of each state's average log likelihood, and FIG. 5(c) shows the distribution of the number of states over ranges of the average log likelihood per state and stream.

The present invention can also be embodied as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples include ROM, RAM, CD-ROM, magnetic tape, hard disk, floppy disk, flash memory, and optical data storage devices, as well as media implemented in the form of carrier waves (for example, transmission over the Internet). The computer-readable recording medium can also be distributed over network-connected computer systems so that the computer-readable code is stored and executed in a distributed manner.

As described above, by applying the normalized state log likelihood, the present invention compensates for the differences in contribution to recognition caused by frame-to-frame differences in the maximum log likelihood, for the discriminability problem caused by state-to-state differences in the likelihood of the ML state, and for the loss of discriminative power caused by differences in log likelihood between feature streams, and can thereby improve recognition performance.

Claims

A speech recognition method using a normalized state likelihood, comprising:

(a) receiving a speech signal, extracting feature vectors from frames of the speech signal, and configuring the states of a hidden Markov model;

(b) calculating a state log likelihood for the states;

(c) determining normalization factors and calculating a normalized state log likelihood; and

(d) searching for and outputting the text corresponding to the normalized state log likelihood,

wherein step (c) comprises:

(c1) determining, as the normalization factors, the slope and the center value of a sigmoid function; and

(c2) normalizing the state log likelihood by substituting the normalization factors.

The method of claim 1, wherein in step (b) the state log likelihood is calculated by the following equation:

[Equation]

where s is the state index, c is the stream index, m is the Gaussian index, and the remaining symbols denote, in order, the state log likelihood for stream c, the feature vector, the mean vector of the Gaussian density function, the covariance matrix of the Gaussian density function, and the weight.

(Deleted)

The method of claim 1, wherein step (c1) iteratively adjusts the values of the normalization factors by applying minimum classification error training.

The method of claim 1, wherein in step (c2) the normalized state log likelihood is calculated by the following equation:

[Equation]

where the symbols denote, in order, the normalized state log likelihood, the maximum state log likelihood, the minimum state log likelihood, the center value of the sigmoid function, and the state log likelihood.

A speech recognition apparatus using a normalized state likelihood, comprising:

a state configuration unit that receives a speech signal and extracts feature vectors from frames of the speech signal to configure the states of a hidden Markov model;

a probability calculation unit that calculates a state log likelihood for the states;

a normalization calculation unit that determines normalization factors and calculates a normalized state log likelihood; and

a text search unit that searches for and outputs the text corresponding to the normalized state log likelihood,

wherein the normalization calculation unit normalizes the state log likelihood by determining, as the normalization factors, the slope and the center value of a sigmoid function.

(Deleted)

A computer-readable recording medium having recorded thereon a program for executing the method of any one of claims 1, 2, 4, and 5 on a computer.