KR20120056086A

KR20120056086A - Method for adapting acoustic model and Voice recognition Apparatus using the method

Info

Publication number: KR20120056086A
Application number: KR1020100117611A
Authority: KR
Inventors: 김동현; 김영익; 조훈영; 김승희; 김상훈; 박준
Original assignee: 한국전자통신연구원
Priority date: 2010-11-24
Filing date: 2010-11-24
Publication date: 2012-06-01

Abstract

PURPOSE: An acoustic model adaptation method and a speech recognition apparatus using the same are provided to remove the user's retraining burden for a quantized acoustic model in an embedded speech recognizer. CONSTITUTION: An extraction unit (110) extracts features from a waveform corresponding to a speech signal and generates quantized data. A probability measuring unit (120) applies the quantized data, an adaptation network, and a quantized acoustic model to fixed-point high-speed computation and calculates the Gaussian occupancy probability. An adaptation unit (130) updates the acoustic model. A speech recognition unit (150) recognizes the extracted features using the updated acoustic model.

Description

Method for adapting acoustic model and Voice recognition Apparatus using the method

The present invention relates to an acoustic model adaptation method and a speech recognition apparatus using the same.

As the performance of mobile devices improves, speech recognizers are being used more and more widely in embedded systems. However, because a speech recognizer must handle the many parameters of an acoustic model and a large vocabulary, techniques for reducing the computational cost of the recognition process are being studied. These techniques include high-speed computation using fixed-point arithmetic, and model quantization techniques such as SDCHMM (subspace distribution clustering HMM), which divides the acoustic model parameters into subspaces and quantizes them into codebooks to suit the small memory of mobile embedded environments.

In general, the recognition performance of a speech recognizer varies greatly with the surrounding environment and the speaker. In particular, a speech recognizer running on a personalized mobile device such as a smartphone needs a speaker adaptation technique to improve performance further.

Among speaker adaptation techniques, MAP (maximum a posteriori) and MLLR (maximum likelihood linear regression) are representative techniques with relatively low computational cost. However, for speaker adaptation in a speech recognizer that uses a compact acoustic model in a mobile environment, the adaptation must take the structure of the quantized acoustic model into account. In addition, the speech recognizer must adopt high-speed computation techniques so that the computation time required for adaptation is practical.

An object of the present invention is to provide a method for adapting a quantized acoustic model and a speech recognition apparatus using the same.

To solve the above problem, a speech recognition apparatus using an acoustic model adaptation method according to an embodiment of the present invention includes: an extraction unit that extracts features from a waveform corresponding to a speech signal and generates quantized data based on the extracted features; a probability measuring unit that applies the quantized data, an adaptation network, and a quantized acoustic model to fixed-point high-speed computation to calculate a Gaussian occupancy probability; an adaptation unit that updates the acoustic model using the Gaussian occupancy probability and an adaptation technique; and a speech recognition unit that recognizes the extracted features using the updated acoustic model and outputs a sentence corresponding to the recognition result.

According to an embodiment of the present invention, the speech recognition apparatus uses the acoustic model adaptation method to remove the user's retraining burden for the quantized acoustic model in an embedded speech recognizer such as a smartphone, and can adapt effectively at low computational cost.

In addition, according to an embodiment of the present invention, the acoustic model adaptation method requires little computation and therefore also allows online adaptation. It can improve performance incrementally in a mobile device environment such as a smartphone, where adaptation data from a single user can be collected and used continuously.

In addition, the speech recognition apparatus using the acoustic model adaptation method can automatically generate an FST network from a recognized sentence and use it for unsupervised adaptation. This reduces the inconvenience of separate adaptation training, and a correction mode for the recognition result after recognition provides a learning effect in which misrecognized words are adapted to the speaker.

FIG. 1 is a configuration diagram schematically showing a speech recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a configuration diagram showing a probability measuring unit and an adaptation unit according to an embodiment of the present invention.
FIG. 3 is a configuration diagram showing an adaptation network generation unit according to an embodiment of the present invention.
FIG. 4 is a flowchart showing an acoustic model adaptation method according to an embodiment of the present invention.
FIG. 5 is a flowchart showing a method of generating an adaptation network according to an embodiment of the present invention.

The present invention will now be described in detail with reference to the accompanying drawings. Repeated descriptions, and detailed descriptions of well-known functions and configurations that could unnecessarily obscure the subject matter of the present invention, are omitted. The embodiments of the present invention are provided to describe the present invention more completely to those of ordinary skill in the art. Accordingly, the shapes and sizes of elements in the drawings may be exaggerated for clarity.

First, the computation methods applied in the acoustic model adaptation method and in the speech recognition apparatus using it are described.

1. Fast Gaussian computation method for scalar quantization

In general, a speech recognition apparatus performs the computation of a single Gaussian log density, given by Equation 1, as its basic unit.

Here, the computation on the observation data with respect to the d-th dimension mean and variance of a Gaussian can be simplified into a table operation over the quantized parameters, as in Equation 2.

In addition, when the mean and the variance, each a 32-bit float, are scalar-quantized into 5-bit and 3-bit index values respectively, the 64 bits of combined mean and variance data of an acoustic model Gaussian are replaced by 8 bits of index data, which yields a compression effect.
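The 64-bit to 8-bit reduction can be illustrated with a small sketch. The bit layout below (mean index in the upper five bits, variance index in the lower three) and the function names are assumptions made for illustration; the source only states the 5-bit and 3-bit index widths.

```python
from typing import Tuple

MEAN_BITS = 5  # 2**5 = 32 mean codewords
VAR_BITS = 3   # 2**3 = 8 variance codewords


def pack_link(mean_idx: int, var_idx: int) -> int:
    """Pack a 5-bit mean index and a 3-bit variance index into one byte."""
    if not 0 <= mean_idx < (1 << MEAN_BITS):
        raise ValueError("mean index out of range")
    if not 0 <= var_idx < (1 << VAR_BITS):
        raise ValueError("variance index out of range")
    # Assumed layout: mean index in the high bits, variance index in the low bits.
    return (mean_idx << VAR_BITS) | var_idx


def unpack_link(byte: int) -> Tuple[int, int]:
    """Recover (mean_idx, var_idx) from a packed byte."""
    return byte >> VAR_BITS, byte & ((1 << VAR_BITS) - 1)
```

Each packed byte then stands in for one (mean, variance) pair that would otherwise occupy two 32-bit floats.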

As in Equation 2, when a table of the operations among the actual quantized mean, variance, and observation values is built and stored in fixed-point, the Gaussian log density can be obtained as a sum of fixed-point values, which makes high-speed computation possible.
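The table-based computation can be sketched as follows. This is a minimal illustration, not the patent's Equation 2: it assumes the standard diagonal-covariance Gaussian log density, a Q-format with 10 fractional bits, and a single table shared by all dimensions (per-dimension scale and shift are ignored for brevity). The codebooks and function names are hypothetical.

```python
import math

FRAC_BITS = 10  # assumed fixed-point format: stored value = real value * 2**10


def to_fixed(x: float) -> int:
    return int(round(x * (1 << FRAC_BITS)))


def build_table(mean_cb, var_cb, obs_cb):
    """table[m][v][o] holds the fixed-point per-dimension log-density term
    -0.5 * (log(2*pi*var) + (obs - mean)**2 / var), computed once offline."""
    return [
        [
            [to_fixed(-0.5 * (math.log(2 * math.pi * var) + (o - mu) ** 2 / var))
             for o in obs_cb]
            for var in var_cb
        ]
        for mu in mean_cb
    ]


def log_density_fixed(table, mean_idx, var_idx, obs_idx) -> int:
    """Run-time log density: integer table lookups and additions only."""
    return sum(table[m][v][o] for m, v, o in zip(mean_idx, var_idx, obs_idx))
```

At run time the float multiplications and logarithms are replaced by one lookup and one integer addition per dimension, which is the source of the speed-up described above.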

2. Scalar quantization method

The scalar quantization method represents the means and variances of the original model vectors using only scalar-valued mean and variance codewords, so that every acoustic model can be expressed as a combination of links. To quantize the mean and the variance into 32 and 8 levels, respectively, the following procedure is required.

First, all mean and variance values of the acoustic model are collected per dimension, and mean subtraction and variance normalization are applied to obtain zero mean and a fixed variance. All dimensions are then pooled and the k-means algorithm is run. Here, the Lloyd-Max algorithm is run together with k-means, and the values are sorted so that the smallest and the largest 0.5% are each excluded. This procedure is repeated to fix the quantized mean values. Then, taking the variance of the observation data into account, the variance values of all dimensions are set equal, and nonlinear quantization values are obtained by principal component analysis (PCA). In this process, shift and scale values are generated for each dimension of the mean and the variance, so that all dimensions can be represented with the scalar quantization values.
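The mean-quantization part of this procedure can be sketched roughly as below. It is a simplified illustration under stated assumptions: a plain one-dimensional Lloyd iteration stands in for the combined k-means and Lloyd-Max step, the tail trimming is applied once rather than repeatedly, and the PCA-based variance quantization is not shown. All function names are hypothetical.

```python
from statistics import mean, pstdev


def normalize_per_dim(rows):
    """rows: list of mean vectors. Returns the normalized rows together with
    the per-dimension shift (mean) and scale (standard deviation)."""
    cols = list(zip(*rows))
    shift = [mean(c) for c in cols]
    scale = [pstdev(c) or 1.0 for c in cols]  # guard against zero spread
    norm = [[(x - s) / c for x, s, c in zip(row, shift, scale)] for row in rows]
    return norm, shift, scale


def trim_tails(values, frac=0.005):
    """Drop the smallest and largest `frac` of the sorted values."""
    vals = sorted(values)
    k = int(len(vals) * frac)
    return vals[k:len(vals) - k] if k else vals


def lloyd_1d(values, levels, iters=20):
    """One-dimensional Lloyd iteration producing `levels` scalar codewords."""
    vals = sorted(values)
    # Initialise the codewords at evenly spaced quantiles of the data.
    code = [vals[(2 * i + 1) * len(vals) // (2 * levels)] for i in range(levels)]
    for _ in range(iters):
        buckets = [[] for _ in code]
        for x in vals:
            j = min(range(levels), key=lambda i: abs(x - code[i]))
            buckets[j].append(x)
        code = [mean(b) if b else c for b, c in zip(buckets, code)]
    return sorted(code)


def quantize_means(rows, levels=32):
    """Pool all normalized dimensions and learn one shared scalar codebook."""
    norm, shift, scale = normalize_per_dim(rows)
    pooled = trim_tails([x for row in norm for x in row])
    return lloyd_1d(pooled, levels), shift, scale
```

The returned `shift` and `scale` lists play the role of the per-dimension constants that map the shared scalar codebook back to each dimension.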

3. Link structure adaptation method for the scalar quantization model

Adaptation methods for vector-quantized acoustic models such as SDCHMM include codebook adaptation and link structure adaptation.

Codebook adaptation directly updates a relatively small number of quantized codeword parameters, so its computational cost is low. Its drawback is that, to avoid conflicts during the update among the model parameters that share one codeword, only global transformation-based adaptation can be used. Link structure adaptation, by contrast, can update the model parameters with a general adaptation method, but because the update modifies the index so that the parameter is quantized to the nearest codeword value, a quantization error remains. The adaptation gain from modifying the index usually outweighs this drawback, so the present invention also applies link structure adaptation to the scalar-quantized acoustic model.

To apply link structure adaptation to the scalar quantization model, the following equations are used. Equations 3 and 4 update the link structure of the adapted model parameters (the mean and the variance of a given state, Gaussian, and dimension) against the index codewords of the scalar-quantized acoustic model: a distance function is used to change the link structure index to that of the nearest codeword value. Here, the mean and variance codewords reflect the characteristics of each dimension through per-dimension scale constants and shift constants.
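The index update described here can be sketched as a nearest-codeword search. The squared-difference distance and the mapping `scale * codeword + shift` from the shared codebook to a given dimension are assumptions for illustration; the exact forms of Equations 3 and 4 are not recoverable from this text.

```python
def nearest_codeword_index(adapted_value, codebook, scale, shift):
    """Index of the codeword closest to an adapted parameter in one dimension.
    Assumed mapping: codeword c represents the value scale * c + shift."""
    return min(
        range(len(codebook)),
        key=lambda k: (adapted_value - (scale * codebook[k] + shift)) ** 2,
    )


def relink(adapted_vector, codebook, scales, shifts):
    """Re-link every dimension of an adapted vector to its nearest codeword."""
    return [
        nearest_codeword_index(v, codebook, a, b)
        for v, a, b in zip(adapted_vector, scales, shifts)
    ]
```

Only the stored indices change; the codebook itself is untouched, which is why the parameters sharing a codeword do not conflict.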

Next, Equation 5 gives the indexing method that quantizes the value of each dimension of the input feature vector and links it to an index.

The mean and the variance of the scalar-quantized acoustic model can be adapted with the fixed-point link structure MAP adaptation algorithm of Equation 6.

Here, one constant in Equation 6 is an experimentally chosen value that sets the proportion in which the adaptation data is reflected, and the posterior term is the fixed-point posterior probability of Gaussian g in state s at time t, obtained through the Baum-Welch algorithm.
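As a rough illustration of a fixed-point MAP update, the sketch below assumes the standard MAP mean re-estimation form (prior mean weighted by a constant, plus posterior-weighted observations, divided by the total weight) and a Q-format with 10 fractional bits. The name `tau` for the weighting constant and the function name are hypothetical, and this is not claimed to be the patent's Equation 6.

```python
FRAC_BITS = 10
ONE = 1 << FRAC_BITS  # fixed-point representation of 1.0


def map_mean_fixed(prior_mean, tau, gammas, observations):
    """Standard MAP mean update using integer arithmetic only.

    All arguments are fixed-point integers (real value * 2**FRAC_BITS).
    Computes (tau * mu + sum_t gamma_t * o_t) / (tau + sum_t gamma_t).
    The numerator carries 2 * FRAC_BITS fractional bits and the denominator
    FRAC_BITS, so the integer quotient is again in the FRAC_BITS format.
    """
    num = tau * prior_mean + sum(g * o for g, o in zip(gammas, observations))
    den = tau + sum(gammas)
    return num // den if den else prior_mean
```

In a link structure scheme, the adapted value returned here would then be mapped to the index of its nearest codeword rather than stored directly.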

In addition, MLLR, which is widely used for acoustic model adaptation, is a transformation-based algorithm that uses Baum's auxiliary function as its objective function. In this method, the mean of the scalar-quantized acoustic model is linearly transformed as in Equation 7 to obtain the adapted mean.

Here, the transformation matrix is applied to the (n+1)-dimensional extended mean vector. The transformation matrix is obtained in the direction of maximum likelihood; substituting it into the objective function, Baum's auxiliary function, gives the simplified expression of Equation 8.

Here, the two model symbols denote the existing acoustic model and the new acoustic model, respectively. The posterior term is the fixed-point posterior probability, at time t, of Gaussian g in state s belonging to a given regression class. The covariance term is the quantization-linked value for the covariance matrix.

Partially differentiating Equation 8 with respect to the elements of the transformation matrix and rearranging gives, for each transformation vector, the summary of Equation 9.
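For orientation, the mean transform in MLLR is commonly written in the following form. The symbols are the conventional ones from the MLLR literature and are assumptions here, since the patent's own notation for Equation 7 is not recoverable from this text.

```latex
\hat{\mu} = W \xi, \qquad
\xi = \begin{bmatrix} 1 & \mu_1 & \cdots & \mu_n \end{bmatrix}^{\top}, \qquad
W \in \mathbb{R}^{n \times (n+1)}
```

The leading 1 in the extended mean vector lets the first column of the matrix act as a bias term, so one matrix carries both the linear transform and the offset.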

4. High-speed computation method using fixed-point

On mobile embedded devices, the cost of an operation differs greatly between float and integer types of the same bit width. Even for expressions that yield the same value, one expanded into float multiplications takes considerably more computation time than one expanded into integer additions. To address this, the Gaussian probability computations, which involve many fractional operations, use fixed-point instead of floating-point to reduce the computation cost.

Updating the acoustic model with adaptation data generally follows a three-step adaptation process.

The first step collects the data statistics, that is, it obtains the Gaussian posterior probabilities from the adaptation data.

The second step computes the parameters of the adaptation algorithm and from them the adapted model parameters. This is the step in which the various adaptation techniques are applied.

The third step updates the existing acoustic model with the adapted model parameters.

Of the three steps, the first and second are the most expensive. For the second step, the adaptation algorithm is implemented as fixed-point MAP and fixed-point MLLR, which are inexpensive relative to the performance they deliver. For the first step, whose Baum-Welch (BW) algorithm accounts for the most computation as the adaptation data grows longer, the algorithm is organized as follows to remove redundant operations and minimize cost.

① Computing the Gaussian output probability from the feature vector: the output probability is obtained through the table operation of Equation 2 and the fixed-point log-add operation.

② Forward and backward algorithm: the sums over the output probabilities obtained in ① are computed with the fixed-point log-add operation.

③ Computing the Gaussian posterior probability: the values obtained in ① and ② are combined and converted to fixed-point.

Compared with the Viterbi algorithm, the BW method, which yields more precise Gaussian posterior probabilities, is computationally expensive because it involves many log-add operations. The logarithm itself has a wide range and is difficult to quantize, but the log-add of Equation 10 is bounded as in Equation 11, so quantizing its entire distribution and storing it as a fixed-point table greatly reduces the computation cost.
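A table-driven log-add can be sketched as follows. It relies on the standard identity log(e^a + e^b) = max(a, b) + log(1 + e^(-|a - b|)), whose correction term always lies between 0 and log 2, so it can be tabulated over a bounded range. The 10-bit fractional format, the table range, and the names are assumptions for illustration, not the patent's Equations 10 and 11.

```python
import math

FRAC_BITS = 10
ONE = 1 << FRAC_BITS
MAX_DIFF = 8 * ONE  # beyond a gap of 8.0 the correction rounds to 0 at this precision

# LOG_ADD_TABLE[d] = fixed-point log(1 + exp(-d / ONE)) for an integer gap d.
LOG_ADD_TABLE = [
    int(round(math.log1p(math.exp(-d / ONE)) * ONE)) for d in range(MAX_DIFF + 1)
]


def log_add_fixed(a: int, b: int) -> int:
    """Fixed-point log(exp(a) + exp(b)) using one subtraction, one lookup,
    and one addition; a and b are fixed-point log values."""
    hi, lo = (a, b) if a >= b else (b, a)
    diff = hi - lo
    return hi + (LOG_ADD_TABLE[diff] if diff <= MAX_DIFF else 0)
```

Because the forward and backward recursions call this operation for every state and frame, replacing the exponential and logarithm with a lookup removes most of the floating-point work in the first adaptation step.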

5. Unsupervised adaptation method using an FST network

In general, acoustic model adaptation methods are divided into supervised adaptation and unsupervised adaptation.

Supervised adaptation holds a reference transcription network for the input adaptation data in advance and supplies the adaptation algorithm with the acoustic models corresponding to that transcription network.

Unsupervised adaptation does not fix a transcription in advance. It builds a transcription network from the recognition result of the input data and uses the acoustic model sequence of that network.

Unsupervised adaptation usually reconstructs the recognized words into a 1-best acoustic model sequence to supply the acoustic models needed for adaptation. In the present invention, by contrast, the recognized sentence is split into the vocabulary entries of the recognizer's pronunciation dictionary, the multiple-pronunciation network is built as a sentence FST (finite state transducer) network, and this network is converted into a sentence FST network in GMM (Gaussian mixture model) units, so that the acoustic models needed for adaptation are supplied automatically.

This method overcomes a weakness of the 1-best acoustic model sequence, namely the difficulty of representing pronunciation variation. And because it is built as an FST network, it can hold together the multiple paths along which the words may be realized in the sentence.
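How a sentence network can keep several pronunciations per word is sketched below. The arc representation `(source, destination, phone, output word)`, the `<eps>` label, and the toy lexicon are all hypothetical; the sketch only illustrates parallel pronunciation paths and assumes every pronunciation has at least one phone.

```python
def build_sentence_fst(words, lexicon):
    """Build a linear sentence network with parallel pronunciation paths.

    lexicon maps each word to a list of pronunciations (lists of monophones).
    Returns (arcs, final_state), where each arc is
    (source_state, destination_state, input_phone, output_label).
    """
    arcs = []
    state = 0       # start state of the current word
    next_state = 1  # next unused state id
    for word in words:
        end = next_state  # all pronunciations of this word rejoin here
        next_state += 1
        for pron in lexicon[word]:
            src = state
            for i, phone in enumerate(pron):
                if i == len(pron) - 1:
                    dst = end
                else:
                    dst = next_state
                    next_state += 1
                # Emit the word on the first arc, epsilon on the others.
                arcs.append((src, dst, phone, word if i == 0 else "<eps>"))
                src = dst
        state = end
    return arcs, state
```

A 1-best sequence would keep only one of the parallel paths; retaining all of them lets the forward-backward pass weight whichever pronunciation the speaker actually used.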

Hereinafter, the acoustic model adaptation method and the speech recognition apparatus using it according to an embodiment of the present invention are described in detail with reference to the computation methods described above and the accompanying drawings.

FIG. 1 is a configuration diagram schematically showing a speech recognition apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the speech recognition apparatus (100) includes a quantization feature extraction unit (110), a probability measuring unit (120), an adaptation unit (130), an acoustic model storage unit (140), and a speech recognition unit (150). The speech recognition apparatus also operates in conjunction with an external adaptation network generation unit (200), although it is not limited to this arrangement. Here, the adaptation network generation unit (200) uses the recognition result of the speech recognition apparatus (100), that is, a sentence, to generate the adaptation network, which is a finite state transducer (FST) network in units of Gaussian mixture models (hereinafter "GMM").

The speech recognition apparatus (100) receives speech, that is, a speech signal, from the user.

The quantization feature extraction unit (110) extracts features from the waveform of the speech signal and, based on the extracted features, generates quantized data in the cepstrum domain. Here, the quantized data includes a quantized feature vector.

The probability measuring unit (120) applies the adaptation network, the quantized feature vector, and the quantized acoustic model to fixed-point high-speed computation to calculate the Gaussian output probability and the occupancy (posterior) probability.

The adaptation unit (130) updates the acoustic model using the Gaussian occupancy probability and an adaptation technique. Here, the adaptation techniques include MLLR (maximum likelihood linear regression) and MAP (maximum a posteriori) adaptation in the fixed-point link structure.

The acoustic model storage unit (140) stores the acoustic model updated by the adaptation unit (130). The acoustic model is a scalar-quantized acoustic model and includes a codebook and an index link structure for the mean and the variance.

The speech recognition unit (150) recognizes the features extracted by the quantization feature extraction unit (110) using the updated acoustic model and outputs a sentence containing the words that correspond to the recognition result.

Next, the probability measuring unit (120) and the adaptation unit (130) are described in detail with reference to FIG. 2.

FIG. 2 is a configuration diagram showing a probability measuring unit and an adaptation unit according to an embodiment of the present invention.

Referring to FIG. 2, the probability measuring unit (120) includes an output probability calculation unit (121), an algorithm execution unit (122), and an occupancy probability calculation unit (123).

The output probability calculation unit (121) applies the quantized feature vector and the scalar-quantized acoustic model to the fixed-point log-add method over the GMM network to calculate the Gaussian output probability.

The algorithm execution unit (122) applies the Gaussian output probability to the fixed-point log-add method over the GMM network to calculate the forward-backward probabilities.

The occupancy probability calculation unit (123) applies the forward-backward probabilities to fixed-point computation to calculate the Gaussian occupancy probability. Here, the Gaussian occupancy probability is that of the scalar-quantized HMM (hidden Markov model).

Next, the adaptation unit (130) includes a parameter calculation unit (131) and a model update unit (132).

The parameter calculation unit (131) applies the Gaussian occupancy probability to the MLLR technique in the fixed-point link structure to calculate the fixed-point transformation parameters.

The model update unit (132) updates the acoustic model using the calculated fixed-point transformation parameters, then applies the updated acoustic model and the Gaussian occupancy probability to the MAP technique in the fixed-point link structure to update the acoustic model again.

Next, the adaptation network generation unit (200) is described in detail with reference to FIG. 3.

FIG. 3 is a configuration diagram showing an adaptation network generation unit according to an embodiment of the present invention.

First, the adaptation network generation unit (200) receives the recognition result of the speech recognition apparatus (100), that is, a sentence.

Referring to FIG. 3, the adaptation network generation unit (200) includes a dictionary search unit (210), a network generation unit (220), and a network conversion unit (230).

The dictionary search unit (210) searches a multiple-pronunciation dictionary for the monophone pronunciations of the one or more words contained in the received sentence, and receives the search result, that is, at least one monophone pronunciation per word.

The network generation unit (220) generates a sentence FST network using the at least one monophone pronunciation per word.

The network conversion unit (230) converts the sentence FST network into a GMM-unit FST network using the at least one monophone pronunciation.

Specifically, the network conversion unit (230) converts the monophone pronunciations contained in the sentence FST network into triphones. Next, it looks up each converted triphone in the list of tied triphones and converts it into the tied triphone given by the lookup. It then looks up the three-state GMM list for each tied triphone and converts the network into the GMM-unit FST network given by the lookup.
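The conversion chain (monophone, triphone, tied triphone, three-state GMM list) can be sketched for a single pronunciation path as below. The `left-center+right` triphone notation, the `sil` padding at the sentence boundaries, and the lookup tables are assumptions for illustration; in the apparatus the same lookups would be applied to every path of the sentence FST.

```python
def to_triphones(monophones, boundary="sil"):
    """Expand a monophone sequence into left-center+right context triphones.
    The boundary symbol used for the outer contexts is an assumption."""
    padded = [boundary] + list(monophones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def to_gmm_sequence(monophones, tied_map, state_map):
    """Map monophones to the GMM ids of their tied triphone states.

    tied_map:  triphone name -> tied triphone name (hypothetical table)
    state_map: tied triphone name -> list of three GMM ids (hypothetical table)
    """
    gmm_ids = []
    for triphone in to_triphones(monophones):
        tied = tied_map.get(triphone, triphone)  # untied triphones map to themselves
        gmm_ids.extend(state_map[tied])
    return gmm_ids
```

If the names are replaced by integer indices, each stage becomes a direct array lookup, which matches the source's point that bit-level indices keep the search and conversion fast.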

The adaptation network generation unit (200) according to an embodiment of the present invention performs these searches and conversions on words, monophones, triphones, and GMMs that are all represented as bit-level indices. The searches and conversions are therefore fast, which minimizes the computational cost of generating the GMM sentence FST from the recognized sentence.

Next, the acoustic model adaptation method is described in detail with reference to FIG. 4.

FIG. 4 is a flowchart showing an acoustic model adaptation method according to an embodiment of the present invention.

First, the speech recognition apparatus (100) according to an embodiment of the present invention operates in conjunction with the adaptation network generation unit (200) and receives from it the GMM-unit FST network that serves as the adaptation network.

Referring to FIG. 4, the speech recognition apparatus (100) extracts features from the waveform of the speech signal received from the user and generates quantized data based on the extracted features (S410).

The speech recognition apparatus (100) applies the adaptation network, the quantized data, and the quantized acoustic model to fixed-point high-speed computation to calculate the Gaussian output probability and occupancy probability (S420).

The speech recognition apparatus (100) updates the acoustic model using the Gaussian occupancy probability and the MLLR and MAP techniques in the fixed-point link structure (S430).

The speech recognition apparatus (100) recognizes the features extracted from the waveform of the speech signal using the updated acoustic model and outputs the words corresponding to the recognition result (S440).

다음, 인식결과에 해당하는 단어를 이용하여 적응 네트워크를 생성하는 방법을 도 5를 참조하여 상세하게 설명한다.Next, a method of generating an adaptive network using words corresponding to the recognition result will be described in detail with reference to FIG. 5.

도 5는 본 발명의 실시예에 따른 적응 네트워크를 생성하는 방법을 나타내는 흐름도이다.5 is a flowchart illustrating a method of generating an adaptive network according to an embodiment of the present invention.

Referring to FIG. 5, the adaptive network generator 200 searches a multiple-pronunciation dictionary for the monophone pronunciations of each of the one or more words contained in the received sentence (S510). The adaptive network generator 200 then receives the search result as input, that is, at least one monophone pronunciation per word.

The adaptive network generator 200 generates a sentence FST network using the at least one monophone pronunciation per word returned by the search (S520).

The adaptive network generator 200 converts the monophone pronunciations contained in the sentence FST network into triphones (S530).

The adaptive network generator 200 looks up each converted triphone in a tied triphone list, and converts the triphone into the tied triphone returned by the lookup (S540).

The adaptive network generator 200 looks up the three-state GMM list for each tied triphone, and converts the network into the GMM-unit FST network corresponding to the lookup result (S550).
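Steps S510 to S550 can be sketched as a chain of small lookups. The sketch below is a simplification under stated assumptions: instead of building an actual FST, it enumerates every pronunciation path of the sentence as a flat list, and the dictionary, the tied triphone table, the `sil` padding, and the GMM state naming are all hypothetical toy data rather than the resources of the embodiment.

```python
from itertools import product

# Hypothetical toy resources; a real system would load these from model files.
PRONUNCIATIONS = {"hi": [["h", "ay"]], "you": [["y", "uw"], ["y", "ah"]]}
TIED = {"sil-h+ay": "h+ay"}  # triphone -> tied (shared) triphone

def lookup(sentence):
    """S510: one list of candidate monophone pronunciations per word."""
    return [PRONUNCIATIONS[w] for w in sentence.split()]

def phone_paths(candidates):
    """S520 (simplified): enumerate every path through the sentence network."""
    return [[p for pron in combo for p in pron] for combo in product(*candidates)]

def to_triphones(phones):
    """S530: attach left and right context, padding the ends with silence."""
    padded = ["sil"] + phones + ["sil"]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

def tie(triphones):
    """S540: replace each triphone by its tied representative, if listed."""
    return [TIED.get(t, t) for t in triphones]

def to_gmm_states(tied):
    """S550: expand each tied triphone into three GMM state labels."""
    return [f"{t}[{k}]" for t in tied for k in (1, 2, 3)]

paths = [
    to_gmm_states(tie(to_triphones(p)))
    for p in phone_paths(lookup("hi you"))
]
```

Because "you" has two dictionary entries here, `paths` holds two GMM-level sequences; an FST would represent the same alternatives as parallel arcs sharing the common prefix.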

In general, acoustic model adaptation requires a large amount of data, and converting the base model involves transforming many parameters and performing many fractional operations for the probability calculations. Fractional values must be computed with floating-point types, and such computation is a heavy burden in an embedded environment. Moreover, quantizing the acoustic model to reduce memory usage changes the structure of the parameters that must be handled. Therefore, to reduce the amount of computation, a fixed-point arithmetic method is used in which fractional numbers are converted to integers for calculation.
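The idea of replacing fractional arithmetic with integer arithmetic can be illustrated with a Q-format sketch: each real value is stored as an integer scaled by a power of two, and a product is rescaled by a right shift. The choice of 12 fractional bits and all function names are assumptions made for the example; the source does not state the format used in the embodiment.

```python
FRAC_BITS = 12          # assumed Q-format: 12 fractional bits
ONE = 1 << FRAC_BITS    # fixed-point representation of 1.0

def to_fixed(x):
    """Convert a float to the nearest fixed-point integer."""
    return int(round(x * ONE))

def to_float(q):
    """Convert a fixed-point integer back to a float."""
    return q / ONE

def fixed_mul(a, b):
    """Multiply two fixed-point values and rescale the product."""
    return (a * b) >> FRAC_BITS

def fixed_sq_dist(x, mean, inv_var):
    """Sum of (x - mean)^2 * inv_var using integer arithmetic only."""
    total = 0
    for xi, mi, iv in zip(x, mean, inv_var):
        d = xi - mi
        total += fixed_mul(fixed_mul(d, d), iv)
    return total
```

For example, `fixed_sq_dist([to_fixed(1.0)], [to_fixed(0.5)], [to_fixed(2.0)])` evaluates (1.0 - 0.5)^2 x 2.0 entirely in integers. Note that the right shift in `fixed_mul` rounds toward negative infinity, so small errors accumulate in long sums; more fractional bits reduce that error at the cost of a wider integer range.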

The present invention is an unsupervised link-structure MAP-MLLR adaptation method using fixed-point arithmetic. It operates on a scalar quantized model that reduces the size of the acoustic model by about one-eighth while maintaining the existing recognition performance, and it can adapt quickly and continuously using a small amount of adaptation data.

Therefore, an embedded speech recognition apparatus such as a smartphone according to an embodiment of the present invention can perform speaker adaptation effectively while eliminating the user's retraining burden.

As described above, the best mode embodiments have been disclosed in the drawings and the specification. Although specific terms are used herein, they are used only to describe the present invention, and not to limit its meaning or the scope of the invention set forth in the claims. Accordingly, those of ordinary skill in the art will understand that various modifications and equivalent embodiments are possible. The true technical scope of protection of the present invention should therefore be determined by the technical spirit of the appended claims.

100: speech recognition apparatus
110: quantized feature extraction unit
120: probability measurement unit
121: occurrence probability calculation unit
122: algorithm execution unit
123: occupancy probability calculation unit
130: adaptation unit
131: parameter calculation unit
132: model update unit
140: acoustic model storage unit
150: speech recognition unit
200: adaptive network generator
210: dictionary search unit
220: network generation unit
230: network conversion unit

Claims

A speech recognition apparatus using an acoustic model adaptation method, comprising:
an extraction unit that extracts features from a waveform corresponding to speech and generates quantized data based on the extracted features;
a probability measurement unit that applies the quantized data, an adaptive network, and a quantized acoustic model to a fast fixed-point computation to calculate Gaussian occupancy probabilities;
an adaptation unit that updates the acoustic model using the Gaussian occupancy probabilities and an adaptation technique; and
a speech recognition unit that recognizes the extracted features using the updated acoustic model and outputs a sentence corresponding to the recognition result.