KR20070109314A

KR20070109314A - Method of selecting the training data based on non-uniform sampling for the speech recognition vector quantization

Info

Publication number: KR20070109314A
Application number: KR1020060042109A
Authority: KR
Inventors: 류창선; 김재인; 김홍국; 오유리; 이길호
Original assignee: 주식회사 케이티
Priority date: 2006-05-10
Filing date: 2006-05-10
Publication date: 2007-11-15
Also published as: KR100901640B1; WO2007129802A1

Abstract

A method of selecting training data based on non-uniform sampling for speech recognition vector quantization is provided, in which the training data are selected by non-uniform sampling that takes the occurrence frequency of each phoneme into account. When sample speech data are received, pronunciation information for each phoneme is obtained by performing forced alignment on the speech data (201). A per-phoneme occurrence list is generated from the obtained pronunciation information (202). The occurrence frequency ratio of each phoneme in the corresponding language is statistically computed based on the generated list (203). Training data are then derived, with reference to the computed statistics, so as to minimize the difference between the total frequency of each phoneme and the frequency ratio of each phoneme (204).

Description

Method of selecting the training data based on non-uniform sampling for the speech recognition vector quantization

FIG. 1 is a configuration diagram of an embodiment of a distributed speech recognition system to which the present invention is applied.

FIG. 2 is a flowchart of an embodiment of the method of selecting training data based on non-uniform sampling for speech feature vector quantization according to the present invention.

FIG. 3 is an explanatory diagram of an embodiment of the speech recognition process used in the present invention.

FIG. 4 is a flowchart of an embodiment of the training process for generating the acoustic model of FIG. 3.

FIG. 5 is a graph showing speech recognition performance for Korean using a speech feature vector quantizer to which the present invention is applied.

FIG. 6 is a graph showing speech recognition performance for English using a speech feature vector quantizer to which the present invention is applied.

* Explanation of reference numerals for the main parts of the drawings

11: A/D converter 12: speech feature vector extractor

13: speech feature vector quantizer

21: speech feature vector dequantizer 22: speech recognizer

The present invention relates to a method of selecting training data for speech feature vector quantization in speech recognition. More particularly, it relates to a method of selecting training data based on non-uniform sampling, in which the training data used to train a speech feature vector quantizer are selected by non-uniform sampling that takes into account language-dependent characteristics of speech, such as the occurrence frequency of each phoneme.

Distributed speech recognition (DSR) is a technology that allows speech recognition to be implemented even on low-specification terminals such as mobile phones. It is a two-part processing system: the low-specification terminal extracts the features of the speech signal, and a high-specification server performs speech recognition based on the speech features provided by the terminal.

In general, the MFCC (Mel-Frequency Cepstrum Coefficient) is widely used in this field of speech recognition. MFCCs represent the shape of the mel-scale frequency spectrum in terms of sinusoidal components, and they serve as the speech feature vector (also called the speech feature parameters) that represents the speech input by a user. For MFCCs to be transmitted from the terminal to the server over a communication network, they must first be quantized. The speech recognition process is briefly described below with reference to the distributed speech recognition system mentioned above.

On the terminal side, a speech feature vector is extracted from the user's speech as MFCCs and is then quantized into a form suitable for transmission over the communication network. That is, among the finite set of centroid vectors in a codebook, the vector closest to the MFCC vector of the input speech is selected and transmitted as bit data. In other words, recognizing a user's utterance requires a codebook that holds the centroid of each cluster of similar values, and this codebook serves as the reference in speech feature vector quantization.
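The nearest-centroid lookup described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the squared Euclidean distance, the function names `quantize` and `dequantize`, and the two-dimensional toy codebook are all assumptions made for the example.

```python
import math


def quantize(vector, codebook):
    """Return the index of the codebook centroid closest to `vector`.

    Squared Euclidean distance is an assumption; the source only says
    "closest distance".
    """
    best_index, best_dist = 0, math.inf
    for i, centroid in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, centroid))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def dequantize(index, codebook):
    """Server side: map a received index back to its centroid vector."""
    return list(codebook[index])


# Hypothetical toy codebook with three 2-dimensional centroids.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
index = quantize([0.9, 1.2], codebook)      # only this index is transmitted
restored = dequantize(index, codebook)      # approximation seen by the server
```

Only the index crosses the network, so the number of bits per vector is set by the codebook size rather than by the precision of the coefficients.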

On the server side, the speech feature vector received from the terminal is dequantized, and the word corresponding to the speech is recognized using a hidden Markov model (HMM) as the speech model. Here, the HMM models a basic unit of speech recognition, for example a phoneme; words and sentences are formed by combining the phonemes entering the speech recognition engine with the phonemes registered in the engine's database.

As described above, speech feature vector quantization is a major factor determining the performance of a speech recognition system, and designing a speech feature vector quantizer amounts to generating a codebook. In particular, the codebook should be built by extracting training data (training MFCC vectors) from a large amount of speech data and then choosing, from the extracted data, the most representative values as the training data for speech feature vector quantization.

To elaborate, the codebook is indexed by the centroid vectors of the speech data, and a large amount of training data is needed to raise the speech recognition rate. As noted above, the result of speech feature vector quantization is a sequence of codebook indices: each vector is assigned the index of the nearest codebook entry, found by computing distances between vectors.

However, most speech recognition techniques proposed to date concern only the processing algorithm of the speech feature vector quantizer; no algorithm has been proposed for selecting the training data actually needed to train the quantizer.

For example, conventional practice for selecting training data for speech feature vector quantization simply uses arbitrary, unplanned sample training data chosen in advance by an administrator or the like.

Also, as mentioned above, the HMM is adopted in most speech recognition systems because of its excellent recognition performance. However, when the training data are unreliable, accurate acoustic model estimation on the server side is difficult, and extensive prior knowledge of phoneme context is required. Conversely, if the training data used for quantizing the speech feature vector on the terminal side are highly reliable, the server side needs no additional speech recognition processing and only has to dequantize the speech feature vector using existing techniques.

Therefore, a technique that allows speech feature vectors to be quantized with fewer bits is urgently needed, and those skilled in the art will readily understand that this first requires an algorithm for selecting the training data used in speech feature vector quantization.

The present invention has been proposed to solve the above problems and to meet the above need. Its object is to provide a method of selecting training data based on non-uniform sampling for speech feature vector quantization in speech recognition, in which the training data used to train the speech feature vector quantizer are selected by non-uniform sampling that takes into account language-dependent characteristics of speech, such as the occurrence frequency of each phoneme.

To achieve the above object, the present invention provides a method of selecting training data for speech feature vector quantization in speech recognition, comprising: a forced alignment step of, upon receiving sample speech data, performing forced alignment on the speech data to obtain pronunciation information for each phoneme; a phoneme occurrence list generation step of generating a per-phoneme occurrence list based on the obtained pronunciation information; a phoneme occurrence frequency ratio calculation step of statistically calculating the occurrence frequency ratio of phonemes in the corresponding language based on the generated per-phoneme occurrence list; and a training data derivation step of deriving, with reference to the calculated statistics, training data that minimize, for each phoneme, the error between the total frequency and the per-phoneme occurrence frequency ratio.

The above objects, features, and advantages will become more apparent from the following detailed description taken in conjunction with the accompanying drawings, so that those skilled in the art to which the present invention pertains can readily practice its technical idea. In describing the present invention, detailed descriptions of related known technology are omitted where they would unnecessarily obscure the gist of the invention. A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is a configuration diagram of an embodiment of a distributed speech recognition system to which the present invention is applied.

As shown in FIG. 1, the distributed speech recognition system to which the present invention is applied consists of a terminal (client), such as a low-specification mobile phone, and a high-specification server.

The terminal includes an A/D converter (11), a speech feature vector extractor (12), and a speech feature vector quantizer (13), and the server includes a speech feature vector dequantizer (21) and a speech recognizer (22); descriptions of other basic components are omitted. Although the present invention is described using a distributed speech recognition system as an example, those skilled in the art will readily understand that it also applies to a standalone speech recognition system, for example any speech recognition apparatus composed of an A/D converter, a speech feature vector extractor, a speech feature vector quantizer/dequantizer, and a speech recognizer.

The feature of the present invention is that the training data used to train (that is, to design) the speech feature vector quantizer (13) are selected by non-uniform sampling that takes into account speech characteristics specific to the language (e.g., Korean or English), such as the occurrence frequency of each phoneme.

That is, the present invention uses the statistical characteristics of the language in selecting the training data for quantizing speech feature vectors expressed as MFCCs. Speech characteristics differ from language to language; building on this observation, training data are selected according to those characteristics, a codebook is generated from the selected data, and the codebook is applied to the speech feature vector quantizer (13) to raise the speech recognition rate.

Before describing the training data selection algorithm of the present invention, the components of the distributed speech recognition system are briefly described.

The A/D converter (11) receives an analog speech signal from the user and converts it into digital speech data.

The speech feature vector extractor (12) extracts, based on MFCC, a speech feature vector representing the speech features from the speech data converted by the A/D converter (11).

The speech feature vector quantizer (13), trained in advance on the training data, quantizes the speech feature vector extracted by the speech feature vector extractor (12) into a bit stream suitable for the communication network. The quantization process applied in the speech feature vector quantizer (13) is the split vector quantization scheme proposed in the ETSI standard "ETSI ES 201 108".
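As a rough sketch of the idea behind split vector quantization, the feature vector is cut into short sub-vectors and each sub-vector is quantized against its own small codebook. The sub-vector width of two, the helper names `nearest` and `split_vq`, and the toy codebooks below are assumptions for illustration only; the actual grouping and codebook sizes are defined by the standard, not by this example.

```python
def nearest(sub, codebook):
    """Index of the centroid in `codebook` closest to sub-vector `sub`."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((s - c) ** 2 for s, c in zip(sub, codebook[i])),
    )


def split_vq(vector, codebooks, width=2):
    """Quantize each `width`-sized slice of `vector` with its own codebook."""
    if len(vector) != width * len(codebooks):
        raise ValueError("vector length must equal width * number of codebooks")
    return [
        nearest(vector[k * width:(k + 1) * width], cb)
        for k, cb in enumerate(codebooks)
    ]


# Hypothetical toy setup: a 4-dimensional vector split into two pairs.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # codebook for coefficients 0-1
    [[0.0, 0.0], [2.0, 2.0]],   # codebook for coefficients 2-3
]
indices = split_vq([0.9, 1.1, 0.2, -0.1], toy_codebooks)
```

Splitting keeps each codebook small: several small tables are searched instead of one table covering the whole vector space.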

The speech feature vector dequantizer (21) dequantizes the bit stream data received from the terminal.

The speech recognizer (22) recognizes, based on the HMM, the data dequantized by the speech feature vector dequantizer (21), that is, the speech feature vector, and outputs a speech recognition result such as a word or a sentence. In addition, the server side may also reconstruct and play back speech from the speech feature vector.

In such a distributed speech recognition system, the speech features are extracted as MFCCs and the speech feature vector is expressed as a 14th-order (14-coefficient) vector; this 14th-order MFCC vector then undergoes quantization. Note that an MFCC vector is not restricted to a particular order; 14th- or 15th-order vectors are generally used because speech recognition performance is good at these orders.

Below, the method of selecting training data based on non-uniform sampling proposed by the present invention is described in detail with reference to FIG. 2; the speech recognition process used in the present invention and its acoustic model generation process are described with reference to FIGS. 3 and 4; and the speech recognition performance obtained by applying the present invention to Korean and English is described with reference to FIGS. 5 and 6.

FIG. 2 is a flowchart of an embodiment of the method of selecting training data based on non-uniform sampling for speech feature vector quantization according to the present invention.

First, when sample speech data to be used for training the speech feature vector quantizer are received, forced alignment is performed on the speech data to obtain pronunciation information for each phoneme (201). That is, forced alignment yields the spoken sentence, its pronunciation transcription, the phoneme types, and the start time and end time of each phoneme. [Table 1] below shows the result of forced alignment of sample speech data.

In [Table 1], the sample speech data are an utterance of "그리고 계속해서 사람들이 뛰어 나오고 있죠" ("And people keep running out"), transcribed as "ㄱㅡㄹㅣㄱㅗ ㄱㅖㅅㅗㅋㅐㅅㅓ ㅅㅏㄹㅏㅁㄷㅡㄹㅣ ㄸㅟㅓ ㄴㅏㅗㄱ ㅗ ㅣㄷㅉㅗ", and the start time and end time of each phoneme can be read from the table.

Next, a per-phoneme occurrence list is generated based on the pronunciation information obtained through forced alignment (202). [Table 2] below shows the per-phoneme occurrence list resulting from forced alignment of the speech data in the Korean DB used to verify the present invention; its last two rows give the frequency of each phoneme and the total number of phonemes in the training data.
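Steps 201 and 202 can be pictured with a small sketch that counts phonemes in forced-alignment output. The record layout `(phoneme, start_time, end_time)`, the romanized phoneme labels, the timings, and the function names are all hypothetical; a real aligner's output format will differ.

```python
from collections import Counter

# Hypothetical forced-alignment output: (phoneme, start_sec, end_sec).
alignment = [
    ("g", 0.00, 0.05), ("eu", 0.05, 0.12), ("r", 0.12, 0.16),
    ("i", 0.16, 0.24), ("g", 0.24, 0.29), ("o", 0.29, 0.40),
]


def phoneme_counts(segments):
    """Build the per-phoneme occurrence list (phoneme -> count)."""
    return Counter(phoneme for phoneme, _start, _end in segments)


def frequency_ratios(counts):
    """Convert counts to each phoneme's share of all phonemes."""
    total = sum(counts.values())
    return {phoneme: n / total for phoneme, n in counts.items()}


counts = phoneme_counts(alignment)
ratios = frequency_ratios(counts)   # share of each phoneme in the sample
```

The resulting sample-side ratios are what later gets compared against the language-level ratios taken from pronunciation dictionary statistics.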

Then, the occurrence frequency ratio of phonemes in the language concerned is statistically calculated based on the generated per-phoneme occurrence list (203). In this calculation, the proportion of each pronunciation among the pronunciation sequences is analyzed using a pronunciation dictionary. For example, to analyze phoneme occurrence frequency in Korean, a frequency analysis is performed for each consonant and each vowel based on pronunciation dictionary material such as "A Study on the Frequency of Speech Sounds in Adults' Everyday Conversation" and "A Quantitative Linguistic Study of the Functional Load and Phoneme Sequences of Korean Phonemes". Likewise, to analyze phoneme occurrence frequency in English, a frequency analysis is performed for each consonant and each vowel based on pronunciation dictionary material such as the "CMU dictionary".

[Table 3] below shows the occurrence frequency ratio of phonemes in Korean, and [Table 4] below shows the occurrence frequency ratio of phonemes in English.

From Table 2 and Table 3 (or Table 4 for English), training data are derived that minimize, for each phoneme, the error between the total frequency [Table 2] and the per-phoneme occurrence frequency ratio [Table 3] (204). This training data derivation step (step 204) is described in more detail with an example later in this description.

The training data selected through steps 201 to 204 are adopted as the training data of the speech feature vector quantizer (13). The training data thus determined are used as the parameters for codebook generation. Any algorithm, such as the K-means algorithm, may be used to generate the codebook.
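Since the text names K-means as one admissible codebook generator, a bare-bones version is sketched here. The initialization from the first `k` vectors, the fixed iteration count, and the two-dimensional toy data are assumptions made to keep the example deterministic; they are not taken from the source.

```python
def kmeans(vectors, k, iterations=20):
    """Plain K-means; returns k centroid vectors to use as a codebook."""
    # Assumption: seed the centroids with the first k training vectors.
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            # Assign each vector to its nearest centroid.
            idx = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(v, centroids[i])),
            )
            clusters[idx].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


# Hypothetical selected training vectors (2-dimensional for readability).
data = [[0.0, 0.0], [0.2, 0.0], [4.0, 4.0], [4.2, 4.0]]
codebook = kmeans(data, k=2)
```

Because the centroids are averages of whatever vectors are fed in, the phoneme balance of the selected training data directly shapes where the codebook entries end up.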

To aid understanding of the training data selection algorithm described above, consider an example. Suppose that the frequency ratio of some phoneme "/a/" in English, according to the pronunciation dictionary, is 2%, that the phoneme "/a/" occurs 40 times in the sample training data (sample speech data), and that the sample training data contain 1,000 phonemes in total; the frequency ratio of "/a/" in the sample training data is then 40/1,000 = 4%. In this case, only the 1st, 3rd, 5th, ..., 35th, 37th, and 39th occurrences of "/a/" in the sample training data list generated after forced alignment are selected and used as training data for the speech feature vector quantizer.

In other words, to minimize the difference between the occurrence frequency ratio of a given phoneme in the pronunciation dictionary and that phoneme's share of all phonemes in the sample training data (here 4% versus 2%, a factor of two), half of the 40 sample training data items (the 1st, 3rd, 5th, ..., 35th, 37th, and 39th) are selected as training data for training the speech feature vector quantizer.
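The worked example can be reproduced with a short sketch. The function name `select_indices`, the even-spacing rule, and the choice to keep every occurrence of a phoneme that is already at or below its target share are assumptions; the source specifies only the outcome for this one case.

```python
def select_indices(count, total, target_ratio):
    """Pick evenly spaced occurrences of one phoneme.

    count        -- occurrences of the phoneme in the sample data
    total        -- total phonemes in the sample data
    target_ratio -- the phoneme's language-level frequency ratio
    Returns 0-based positions within that phoneme's occurrence list.
    """
    # Number of occurrences that would match the target share of `total`.
    keep = min(count, round(total * target_ratio))
    if keep == 0:
        return []
    step = count / keep
    return [int(i * step) for i in range(keep)]


# The example from the text: /a/ occurs 40 times in 1,000 phonemes (4%),
# while its dictionary ratio is 2%.
selected = select_indices(count=40, total=1000, target_ratio=0.02)
positions = [i + 1 for i in selected]   # 1-based: 1, 3, 5, ..., 39
```

Note that once 20 items are dropped the sample shrinks to 980 phonemes, so the resulting share is 20/980, slightly above 2%; a fuller treatment would balance all phonemes jointly, which this sketch does not attempt.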

FIG. 3 is an explanatory diagram of an embodiment of the speech recognition process used in the present invention, and FIG. 4 is a flowchart of an embodiment of the training process for generating the acoustic model of FIG. 3.

FIG. 3 shows a speech recognition process using an HMM: speech features are extracted from the user's speech (301), and the words and sentences corresponding to the speech are recognized through pattern matching (302), which searches an acoustic model (303), a language model (304), and a pronunciation dictionary (305) for the extracted speech features.

Speech feature extraction (301) uses the method proposed in the European Telecommunications Standards Institute (ETSI) standard "ETSI ES 201 108". That is, speech features are extracted from the speech data as MFCCs to form a 14th-order speech feature vector, and the word sequence with the highest probability for this vector is found through pattern matching using the acoustic model (303), the language model (304), and the pronunciation dictionary (305).

An HMM is used for the acoustic model (303); in particular, the present invention uses a language-specific phoneme model as the acoustic model. The training process for generating this phoneme model is described below with reference to FIG. 4.

First, a monophone-based model, which is a phoneme-independent (context-independent) model, is generated using speech feature vectors extracted from the training data selected according to the present invention as described with reference to FIG. 2 (401).

Next, forced alignment is performed using the generated monophone-based model to create a new phoneme label file (402).

The generated monophone-based model is then extended into a triphone-based model, which is a phoneme-dependent (context-dependent) model (403).

State sharing is then performed, considering that the amount of training data available for the generated triphone-based model is small (404).

Finally, the mixture densities of the model obtained from state sharing are increased to produce the final acoustic model (405).

The language model (304) shown in FIG. 3 uses a statistics-based approach, that is, one that statistically estimates the probabilities of possible word sequences from a database of speech uttered in a given situation. A representative statistical language model is the n-gram, which approximates the probability of a word sequence as a product of conditional probabilities of each word given the preceding n-1 words; a bigram is used in FIG. 3.
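A bigram model of the kind mentioned above can be sketched with maximum-likelihood counts. The start marker `<s>`, the absence of smoothing, the function names, and the two-sentence toy corpus are assumptions for illustration; a practical recognizer would use a smoothed model trained on a large corpus.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and word pairs; `<s>` marks the sentence start."""
    contexts, pairs = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + list(words)
        contexts.update(tokens[:-1])            # every token used as a context
        pairs.update(zip(tokens, tokens[1:]))   # (previous, current) pairs
    return contexts, pairs


def sentence_probability(words, contexts, pairs):
    """P(w1..wn) approximated as the product of P(wi | w(i-1))."""
    prob = 1.0
    tokens = ["<s>"] + list(words)
    for prev, cur in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0  # unseen context; no smoothing in this sketch
        prob *= pairs[(prev, cur)] / contexts[prev]
    return prob


# Hypothetical toy corpus of two sentences.
corpus = [["a", "b"], ["a", "c"]]
contexts, pairs = train_bigram(corpus)
p_ab = sentence_probability(["a", "b"], contexts, pairs)   # 1.0 * 0.5
```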

For the pronunciation dictionary (305), the pronunciation dictionary provided with SiTEC's "CleanSent01" is used for Korean, and "CMU dictionary V.0.6" provided by Carnegie Mellon University is used for English.

For the speech DB used in FIG. 3, the "read-sentence speech DB (CleanSent01)" provided by the Speech Information Industry Technology Center (SiTEC) is used for Korean, and the "AURORA 4 DB (Wall Street Journal)" built by ETSI is used for English.

FIG. 5 is a graph showing speech recognition performance for Korean using a speech feature vector quantizer to which the present invention is applied, and FIG. 6 is a graph showing speech recognition performance for English using a speech feature vector quantizer to which the present invention is applied.

In FIG. 5, the left graph shows the average word recognition rate for Korean when the speech feature vector quantizer is trained on unplanned sample training data, and the right graph shows the average word recognition rate for Korean when the quantizer is trained on training data selected according to the per-phoneme occurrence frequency as proposed by the present invention.

In FIG. 6, the left graph shows the average word recognition rate for English when the speech feature vector quantizer is trained on unplanned sample training data, and the right graph shows the average word recognition rate for English when the quantizer is trained on training data selected according to the per-phoneme occurrence frequency as proposed by the present invention.

The speech recognition performance experiments of the present invention shown in FIGS. 5 and 6 are described below.

For English, the multi-condition training data of the Wall Street Journal DB (WSJ0) were used as the sample training data. The speech input conditions comprised 7,138 utterances recorded with a Sennheiser close-talking microphone and several types of far-field microphones, and the utterances were stored at a sampling rate of 16 kHz.

For the speech recognition performance evaluation, Set 1 to Set 7 of the 14 multi-condition evaluation sets in the 5,000-word WSJ0 evaluation data were used. Each set consists of 330 utterances recorded from eight speakers, with each speaker contributing about 40 utterances.

A 39-dimensional speech feature vector was used, consisting of 12 MFCCs and log energy together with their first- and second-order derivative coefficients. The speech feature vectors were extracted with the distributed speech recognition front-end standard algorithm proposed at ETSI.
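As an illustrative, non-limiting sketch of how 13 static coefficients (12 MFCCs plus log energy) expand to 39 dimensions, the following Python code appends first- and second-order derivative coefficients. The regression formula, the window of 2 frames, the edge handling, and the names `deltas` and `stack_39` are assumptions made for illustration; they are not taken from the ETSI front-end or from the experiment described here.

```python
def deltas(frames, window=2):
    """Regression-based time derivatives of a list of feature frames.

    Assumed formula: d_t = sum_n n * (c[t+n] - c[t-n]) / (2 * sum_n n^2),
    with the first and last frames replicated at the utterance edges.
    """
    n_frames = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(n * n for n in range(1, window + 1))
    out = []
    for t in range(n_frames):
        row = []
        for d in range(dim):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, n_frames - 1)][d]  # clamp at last frame
                behind = frames[max(t - n, 0)][d]            # clamp at first frame
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def stack_39(static):
    """Concatenate 13 static values with their first and second derivatives."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Toy input: 5 frames of 13 values, each increasing by 1 per frame.
static = [[float(t + k) for k in range(13)] for t in range(5)]
features = stack_39(static)
print(len(features), len(features[0]))
```

Each output frame holds the 13 static values followed by 13 first-order and 13 second-order derivative values, which gives the 39 dimensions stated above.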

Cepstral mean normalization and energy normalization were applied to the speech feature vectors used for training.
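The two normalizations may be sketched as follows. Cepstral mean normalization subtracts the per-utterance mean of each coefficient. The energy normalization shown, which subtracts the utterance maximum from every log-energy value, is only one common convention and is an assumption here, since the exact scheme used in the experiment is not stated. The function names are hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-utterance mean of each cepstral coefficient."""
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]


def normalize_log_energy(log_energies):
    """Assumed convention: shift log energy so the utterance peak is 0."""
    peak = max(log_energies)
    return [e - peak for e in log_energies]


# Toy utterance: 3 frames of 2 cepstral coefficients, plus log energies.
cepstra = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
log_e = [10.0, 12.5, 11.0]

cmn = cepstral_mean_normalize(cepstra)
norm_e = normalize_log_energy(log_e)
print(cmn)
print(norm_e)
```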

The acoustic model used a three-state left-to-right model, with four context-independent mixture densities and cross-word triphone models. The acoustic models were trained with the HTK version 3.2 toolkit. Specifically, training started from 41 monophone-based acoustic models, including two silence models, and was extended to triphone-based acoustic models, after which state tying with a binary tree was performed to reduce the number of states. The resulting acoustic model consists of 8,360 triphone models with 5,356 states.

With the speech recognition system configured as above, the following three cases are compared: the recognition performance without speech feature vector quantization; the recognition performance when the speech feature vector quantizer is trained with unplanned sample training data; and the recognition performance when the quantizer is trained with training data selected according to the per-phoneme occurrence frequency proposed in the present invention.

[Table 5] below shows the word error rate of speech recognition without speech feature vector quantization. As [Table 5] shows, the average word error rate over the seven sets is 21.8%.

[Table 5] Word error rate (%)
Set 1   Set 2   Set 3   Set 4   Set 5   Set 6   Set 7   Avg
14.3    17.6    22.1    27.0    25.1    23.1    23.5    21.8

Next, the word error rate obtained when a codebook is generated from unplanned training data and applied to the speech feature vector quantizer is compared with the word error rate obtained when a codebook is generated from training data selected by the algorithm proposed in the present invention and applied to the speech feature vector quantizer.
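To illustrate what generating a codebook from training data and applying it to a vector quantizer involves, the following non-limiting sketch trains a small codebook and quantizes a vector to its nearest codeword index. Plain k-means with a fixed initialization is used as a stand-in and is an assumption; the codebook training procedure actually used in the experiment is not specified in this passage. All names and the toy data are hypothetical.

```python
def nearest(codebook, v):
    """Index of the codeword closest to v in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)),
    )


def train_codebook(vectors, k, iterations=10):
    """Plain k-means (assumed stand-in), initialized from the first k vectors."""
    codebook = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest(codebook, v)].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old codeword if a cluster is empty
                codebook[i] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


# Toy 2-dimensional training vectors forming two well-separated groups.
training = [
    [0.0, 0.0], [10.0, 10.0], [0.2, 0.1],
    [9.8, 10.1], [0.1, 0.3], [10.2, 9.9],
]
codebook = train_codebook(training, k=2)

# Quantization transmits only the codeword index, not the vector itself.
index = nearest(codebook, [9.0, 9.0])
print(index)
```

Because the codewords are centroids of the training vectors, the composition of the training data directly determines where the quantizer spends its resolution, which is why the choice of training data affects recognition accuracy.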

[Table 6] below shows the word error rate obtained when a codebook is generated from unplanned training data and applied to the speech feature vector quantizer, and [Table 7] shows the word error rate obtained when a codebook is generated from training data selected by the algorithm proposed in the present invention and applied to the speech feature vector quantizer.

[Table 6] Word error rate (%)
Set 1   Set 2   Set 3   Set 4   Set 5   Set 6   Set 7   Avg
17.8    18.9    25.1    29.4    26.5    24.8    25.9    24.1

[Table 7] Word error rate (%)
Set 1   Set 2   Set 3   Set 4   Set 5   Set 6   Set 7   Avg
13.6    16.8    23.0    27.6    24.9    21.8    24.5    21.7

As [Table 6] and [Table 7] show, the conventional method based on unplanned training data gives an average word error rate of 24.1% over the seven sets, whereas the training data selection method based on phoneme occurrence frequency proposed in the present invention gives an average word error rate of 21.7% over the seven sets.

In other words, the algorithm proposed in the present invention reduces the word error rate by about 10% in relative terms (from 24.1% to 21.7%) compared with the conventional unplanned training data selection. The resulting rate is almost identical to the 21.8% average word error rate obtained without speech feature vector quantization, so those skilled in the art will readily understand that, with the present invention, speech feature vector quantization does not degrade speech recognition performance.
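The averages and the relative reduction can be checked with the short calculation below, which uses only the per-set values listed in [Table 5] to [Table 7]. The variable names are illustrative.

```python
# Per-set word error rates (%) copied from Tables 5, 6 and 7.
table5 = [14.3, 17.6, 22.1, 27.0, 25.1, 23.1, 23.5]  # no quantization
table6 = [17.8, 18.9, 25.1, 29.4, 26.5, 24.8, 25.9]  # unplanned training data
table7 = [13.6, 16.8, 23.0, 27.6, 24.9, 21.8, 24.5]  # proposed selection


def mean(values):
    return sum(values) / len(values)


avg5, avg6, avg7 = mean(table5), mean(table6), mean(table7)

# Relative reduction of the proposed method against the unplanned baseline.
relative_reduction = (avg6 - avg7) / avg6 * 100.0

print(round(avg5, 1), round(avg6, 1), round(avg7, 1))
print(round(relative_reduction, 1))
```

The absolute difference between the two quantized systems is about 2.4 percentage points, and dividing it by the baseline average gives the relative figure of roughly 10%.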

As described above, the training data selection algorithm proposed in the present invention is applicable not only to the distributed speech recognition system shown in FIG. 1 but also to any device or system equipped with a speech feature vector quantizer. For example, it can be applied to a quantizer for the linear prediction coefficients (LPC) used in speech coding and, in view of the characteristics of the vectors to be quantized, to multimedia processing systems for video, audio, and the like (collectively referred to as speech processing devices), as those skilled in the art will readily understand.

The method of the present invention described above may be implemented as a program and stored in a computer-readable form on a recording medium (CD-ROM, RAM, ROM, floppy disk, hard disk, magneto-optical disk, etc.). Since this can be readily carried out by those of ordinary skill in the art to which the present invention pertains, it is not described in further detail.

The present invention described above is not limited by the foregoing embodiments and the accompanying drawings, since those of ordinary skill in the art to which the present invention pertains can make various substitutions, modifications, and changes without departing from the technical spirit of the present invention.

As described above, the present invention improves speech recognition performance compared with the conventional use of unplanned sample training data, under the condition that the number of bits required for speech feature vector quantization is the same.

In addition, the present invention achieves speech recognition performance equivalent to that of a system that does not perform speech feature vector quantization, and thus prevents the degradation in recognition performance that quantization would otherwise cause.

In addition, the present invention increases the reliability of the training data for speech recognition, so that an accurate acoustic model can be estimated through the HMM and a higher recognition success rate can be ensured even with an existing speech recognition device that has no additional processor or equipment.

Claims

A method of selecting training data based on non-uniform sampling in speech feature vector quantization for speech recognition, the method comprising:

a forced alignment step of obtaining pronunciation information for each phoneme by performing forced alignment on sample speech data when the sample speech data are received;

a phoneme occurrence list generation step of generating a phoneme occurrence list based on the obtained pronunciation information;

a phoneme occurrence frequency ratio calculation step of statistically calculating the occurrence frequency ratio of each phoneme for the corresponding language based on the generated phoneme occurrence list; and

a training data derivation step of deriving, with reference to the calculated statistics, training data that minimize the error between the total occurrence frequency of each phoneme and the occurrence frequency ratio of each phoneme.

The method of claim 1,

wherein the pronunciation information includes speech data, a pronunciation transcription, phoneme types, a start time for each phoneme, and an end time for each phoneme.

The method of claim 1,

wherein the phoneme occurrence list includes the occurrence frequency of each phoneme and the total number of phonemes in the training data.

The method of claim 1,

wherein the speech feature vector is expressed as Mel-Frequency Cepstral Coefficients (MFCC).

The method of claim 1,

wherein the language is Korean.

The method of claim 1,

wherein the language is English.

The method of claim 1,

wherein the training data selected in the training data derivation step are used as training data for generating an acoustic model based on a Hidden Markov Model (HMM).

The method of claim 1,

wherein the phoneme occurrence frequency ratio calculation step analyzes the proportion of each pronunciation among the pronunciation strings based on a pronunciation dictionary of the language.

The method according to any one of claims 1 to 8,

wherein the training data derivation step selects, from the sample training data, a number of occurrences corresponding to the difference between the occurrence frequency ratio of a specific phoneme and the phoneme occurrence frequency ratio of all phonemes in the sample speech data.

A computer-readable recording medium having recorded thereon a program for causing a speech processing device having a processor to realize:

a function of obtaining pronunciation information for each phoneme by performing forced alignment on sample speech data when the sample speech data are received;

a function of generating a phoneme occurrence list based on the obtained pronunciation information;

a function of statistically calculating the occurrence frequency ratio of each phoneme for the corresponding language based on the generated phoneme occurrence list; and

a function of deriving, with reference to the calculated statistics, training data that minimize the error between the total occurrence frequency of each phoneme and the occurrence frequency ratio of each phoneme.