KR101398639B1

KR101398639B1 - Method and apparatus for speech registration

Info

Publication number: KR101398639B1
Application number: KR1020070100995A
Authority: KR
Inventors: 공동건; 명현; 방석원; 윤재삼; 오유리; 김민아; 김홍국; 박지훈
Original assignee: 삼성전자주식회사
Priority date: 2007-10-08
Filing date: 2007-10-08
Publication date: 2014-05-28
Also published as: KR20090035944A

Abstract

A speech recognition apparatus and method are provided. A speech recognition apparatus according to an embodiment of the present invention includes a first phoneme recognition unit that generates a phonetic lattice from a feature vector of an input speech signal, a language model correction unit that corrects the phonetic lattice using phoneme rules, and a second phoneme recognition unit that recognizes speech from the feature vector using the corrected phonetic lattice.

Speech recognition, speech registration, language model, acoustic model, language model correction, real-time speech recognition.

Description

[0001] The present invention relates to a speech recognition apparatus and method, and more particularly to a two-pass decoding speech recognition method and apparatus based on language model correction.

In general, an intelligent robot is a robot that can make judgments and act on its own. It is a technology convergence system that shares real space with humans, interacts with them, and performs human functions; it can assist with household chores or serve as a cleaning robot, a lawn-mowing robot, and the like. For interaction between a person and an intelligent robot, it is efficient to use speech, one of the most natural and easiest interfaces for people. In particular, to increase user convenience, home cleaning robots, entertainment robots, and the like need to be operable by voice command.

Performing these functions basically requires the ability to register and recognize the user's speech in real time. Real-time speech processing means that the user perceives no actual waiting time while the robot processes speech, so that a conversation between a person and a robot feels like a conversation between two people. Accordingly, the robot must output its result as soon as the user finishes speaking.

Conventional speech recognition apparatuses are word-based and can be used only with words registered in advance, so their speech recognition capability is limited depending on the purpose, and out-of-vocabulary (OOV) input requires separate handling. Therefore, as one way of accepting an unlimited vocabulary, phoneme recognition has been performed using a phoneme-level language model and an acoustic model, without a pronunciation dictionary.

However, conventional speech recognition apparatuses and methods have the following problems.

First, when phoneme recognition is performed using a monophone-based acoustic model and a phoneme-based n-gram language model, recognition can run in real time but recognition performance degrades significantly. Conversely, when a triphone-based acoustic model and a phoneme-based n-gram language model are used (hereinafter, prior art A), recognition performance improves, but the increased computation makes real-time phoneme recognition difficult.

In addition, the existing two-pass decoding method (hereinafter, prior art B) first performs fast-pass decoding with a simple language model such as a unigram or bigram, then reduces the search space of a more complex language model such as a trigram, readjusts the probabilities, and performs second-pass decoding. This reduces computation and improves recognition accuracy to some extent, but the problem of a high misrecognition rate remains.

The present invention was devised to address the above problems. Its technical object is to provide a speech recognition apparatus and method that, through two-pass speech recognition based on language model correction, can register an unlimited vocabulary in real time and improve recognition accuracy.

The technical objects of the present invention are not limited to those mentioned above; other objects not mentioned will be clearly understood by those skilled in the art from the description below.

To achieve the above object, a speech recognition apparatus according to an embodiment of the present invention includes a first phoneme recognition unit that generates a phonetic lattice from a feature vector of an input speech signal, a language model correction unit that corrects the phonetic lattice using phoneme rules, and a second phoneme recognition unit that recognizes speech from the feature vector using the corrected phonetic lattice.

Likewise, a speech recognition method according to an embodiment of the present invention includes generating a phonetic lattice from a feature vector of an input speech signal, correcting the phonetic lattice using phoneme rules, and recognizing speech from the feature vector using the corrected phonetic lattice.

The speech recognition apparatus and method of the present invention described above provide one or more of the following effects.

First, unlike existing technology, which registers only words present in a built-in pronunciation dictionary, an unlimited vocabulary can be registered and recognized.

Second, the performance gain from two-pass speech recognition based on language model correction allows new words to be registered and recognized in real time and improves recognition accuracy.

Third, applying speech recognition to robots and home appliances for user convenience enables immediate, smooth interaction with the user and eliminates the need to approach the appliance or to operate input tools such as a keyboard.

The effects of the present invention are not limited to those mentioned above; other effects not mentioned will be clearly understood by those skilled in the art from the claims.

The advantages and features of the present invention, and the methods of achieving them, will become clear with reference to the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the invention pertains are fully informed of the scope of the invention; the invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

As used in this embodiment, the term "unit" (~부) refers to a software component or a hardware component such as an FPGA or ASIC, and a unit performs certain roles. However, a unit is not limited to software or hardware. A unit may be configured to reside in an addressable storage medium or to run on one or more processors. Thus, by way of example, a unit may include components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. The functionality provided by components and units may be combined into fewer components and units or further separated into additional components and units.

Hereinafter, the present invention is described with reference to the drawings that illustrate the speech recognition apparatus and method according to its embodiments.

FIG. 1 is a block diagram showing the configuration of a speech recognition apparatus according to an embodiment of the present invention.

The speech recognition apparatus 100 according to an embodiment of the present invention may include a feature vector extraction unit 110, a first phoneme recognition unit 120, a language model correction unit 130, and a second phoneme recognition unit 140.

The speech recognition apparatus 100 according to an embodiment of the present invention can be used not only to recognize speech but also to register new speech. The recognition function is used as the example below.

The feature vector extraction unit 110 divides the input speech signal 10 into unit frames and extracts a feature vector 112 corresponding to each frame.

First, when a speech signal is input to the speech recognition apparatus 100, the actual speech interval in the input speech signal 10 can be detected through voice activity detection (VAD). Speech features are then extracted from the detected interval to obtain information suitable for speech recognition. Here, the frequency characteristics of the speech signal are computed for each unit frame to extract the feature vector 112. To this end, the feature vector extraction unit 110 may include an analog-to-digital (A/D) converter that converts the analog speech signal to digital form, and the digitized signal may be processed in frames of about 10 ms. For example, at a sampling rate of 16 kHz, each 10 ms frame contains 160 samples. Such a unit frame may include at least one hidden Markov model (HMM) state.
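The frame arithmetic above can be illustrated with a minimal sketch. It assumes consecutive, non-overlapping frames and a hypothetical helper name `split_into_frames`; the description does not specify frame overlap or windowing, so this is only one possible reading.

```python
def split_into_frames(samples, sample_rate_hz=16000, frame_ms=10):
    """Split a sequence of samples into consecutive non-overlapping frames."""
    frame_len = sample_rate_hz * frame_ms // 1000  # 16000 * 10 / 1000 = 160
    n_frames = len(samples) // frame_len           # a trailing partial frame is dropped
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# One second of silence at 16 kHz yields 100 frames of 160 samples each.
frames = split_into_frames([0] * 16000)
```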

Preferably, the feature vector extraction unit 110 may extract the feature vector 112 using mel-frequency cepstral coefficient (MFCC) feature extraction. With MFCC feature extraction, the feature vector 112 may combine the mel-cepstral coefficients, the log energy, and their first and second derivatives.
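As a rough illustration of combining static features with their first and second derivatives, the sketch below stacks them per frame. The central-difference derivative and the function names `deltas` and `stack_features` are assumptions made for this example; practical systems often use a regression window instead, and the description does not say which is used.

```python
def deltas(frames):
    """Central-difference time derivative of per-frame feature vectors."""
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        prev_f = frames[max(t - 1, 0)]      # repeat the edge frame at the boundaries
        next_f = frames[min(t + 1, last)]
        out.append([(n - p) / 2.0 for p, n in zip(prev_f, next_f)])
    return out


def stack_features(static):
    """Concatenate static features with their first and second derivatives."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Toy input: 3 frames, each with 2 cepstral coefficients plus log energy.
static = [[1.0, 2.0, 0.5], [2.0, 2.0, 0.7], [4.0, 2.0, 0.9]]
features = stack_features(static)
```

Each output vector is three times the length of the static vector, so the three static values per frame here become nine.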

Alternatively, to extract features of the speech signal in each unit frame, the feature vector extraction unit 110 may use methods such as linear predictive coding (LPC), LPC-derived cepstrum, perceptual linear prediction (PLP), auditory-model feature extraction, or a filter bank.

The first phoneme recognition unit 120 generates a phonetic lattice 128 from the feature vector 112 of the input speech signal 10.

FIG. 2 shows the process by which the first phoneme recognition unit of the speech recognition apparatus of FIG. 1 generates a phonetic lattice from the feature vector of a speech signal.

As shown in FIG. 2, the first phoneme recognition unit 120 may recognize phonemes by pattern matching the feature vector 112 against a first acoustic model 122 and a language model 124, generate a first phoneme string group 126, and generate the phonetic lattice 128 from the first phoneme string group 126. Phoneme recognition at this stage is called fast-pass decoding.

Preferably, the first acoustic model 122 may be a monophone-based acoustic model. In general, an acoustic model can use hidden Markov models (HMMs) trained on units such as words, syllables, or phonemes, and phoneme-level HMMs are efficient for speech recognition. Phoneme-level HMMs are mainly of two kinds: monophone models, which consider only a single phoneme, and triphone models, which also consider the preceding and following phonemes. Because the acoustic characteristics of a phoneme as actually pronounced vary with its neighbors, using a triphone model improves recognition performance. However, the number of triphone models is the cube of the number of monophone models, so phoneme recognition takes an enormous amount of time.

Accordingly, the first phoneme recognition unit 120 may perform phoneme recognition using a monophone-based acoustic model. The monophone-based acoustic model used for speech recognition may contain 42 phonemes in total: 21 vowels, 19 consonants, and 2 silences.

The language model 124 models human language patterns such as grammar and constrains the recognition space, thereby reducing the search space, recognition time, and computation; a statistical approach may be used. A statistical approach estimates the probabilities of possible word sequences from a database of speech uttered in a given situation. A language model built this way is flexible enough to accept ungrammatical sentences, but the search space grows as a result. A representative statistical language model is the n-gram, which defines the probability of the next word given the preceding n-1 words. That is, the probability of a word sequence can be approximated by the product of the conditional probabilities of each word given its preceding n-1 words.
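The n-gram approximation described above can be written compactly as follows. The symbols $w_i$ (the i-th word) and $m$ (the sequence length) are introduced here only for illustration and do not appear in the original text.

```latex
P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P\left(w_i \mid w_{i-n+1}, \dots, w_{i-1}\right)
```

For a bigram model, $n = 2$ and each factor reduces to $P(w_i \mid w_{i-1})$.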

Preferably, when the input speech signal 10 is to be registered, a phoneme-level bigram language model may be used as the language model 124, and when the input speech signal 10 is to be recognized, a word-level unigram language model may be used.

FIG. 3 shows a phoneme-level bigram language model.

A phoneme-level bigram language model approximates the probability of a phoneme string by the product of the conditional probabilities of each phoneme given the preceding phoneme. Thus, speech recognition with a monophone-based acoustic model of n phonemes requires a search space of n×n.
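A minimal sketch of such a phoneme-level bigram model is shown below. The relative-frequency estimate, the `sil` boundary padding, the probability floor for unseen pairs, and the names `train_bigram` and `log_prob` are all assumptions made for illustration; the description does not say how the model is trained or smoothed.

```python
import math
from collections import Counter


def train_bigram(sequences):
    """Estimate P(cur | prev) by relative frequency from phoneme sequences."""
    pair_counts, prev_counts = Counter(), Counter()
    for seq in sequences:
        padded = ["sil"] + list(seq) + ["sil"]   # silence marks both word boundaries
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    return {pair: c / prev_counts[pair[0]] for pair, c in pair_counts.items()}


def log_prob(model, seq, floor=1e-6):
    """Log-probability of a phoneme sequence; unseen pairs get a small floor."""
    padded = ["sil"] + list(seq) + ["sil"]
    return sum(math.log(model.get(pair, floor)) for pair in zip(padded, padded[1:]))


model = train_bigram([["h", "a", "g"], ["h", "a", "n"]])
score = log_prob(model, ["h", "a", "g"])
```

With n phonemes the model holds at most n×n conditional probabilities, which is the search space size mentioned above.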

For example, when speech recognition uses an acoustic model with 42 phonemes, a triphone-based acoustic model must account for the phonemes before and after each of the 42 phonemes, so acoustic models for about 70,000 (42×42×42) units are needed. Furthermore, with a bigram language model, which considers the probability of transitions between two units, the required search space is the square of that number (about 70,000×70,000). Recognition performance improves, but the increased computation and the much longer recognition time make real-time phoneme recognition difficult.
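The counts in this paragraph can be checked with a few lines of arithmetic. The variable names are illustrative only.

```python
n_monophones = 42
n_triphones = n_monophones ** 3          # 74,088, which the text rounds to about 70,000

bigram_space_mono = n_monophones ** 2    # 1,764 phoneme pairs
bigram_space_tri = n_triphones ** 2      # 5,489,031,744 triphone pairs

ratio = bigram_space_tri // bigram_space_mono   # 42 ** 4 = 3,111,696
```

Under these assumptions, the bigram search space over triphones is roughly three million times larger than the one over monophones.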

Accordingly, to shorten the time needed to recognize the first phoneme string group 126, the speech recognition apparatus 100 according to an embodiment of the present invention builds a bigram language model over the monophone-based acoustic model and uses it for recognition, which shrinks the search space and greatly reduces processing time.

Meanwhile, the first phoneme string group 126 may be generated using the Viterbi algorithm. That is, the Viterbi algorithm, a dynamic programming method, may be used as the pattern matching method that matches the feature vector 112 against the first acoustic model 122 and the language model 124 to generate the first phoneme string group 126. For each frame of the input speech signal 10, the Viterbi algorithm computes state probabilities for every state of every phoneme, thereby computing the probabilities of all possible phoneme strings for the input speech signal 10, and then finds the first phoneme strings with the highest probabilities (that is, the N-best phoneme strings). The number of phoneme strings in the first phoneme string group 126 may be a value preset in the speech recognition apparatus 100. To improve the efficiency of the Viterbi algorithm, pruning may be used so that only states whose probability exceeds a certain threshold are computed. The phonetic lattice 128 can thus be generated from the N-best phoneme strings 126 produced by the Viterbi algorithm with pruning.
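The following toy sketch shows frame-synchronous Viterbi decoding with pruning. It is heavily simplified and rests on assumptions not found in the description: each phoneme has a single state, scores are log-probabilities, pruning is a beam relative to the best state rather than an absolute threshold, and only the single best path is returned instead of an N-best list. The function name `viterbi` and its parameters are hypothetical.

```python
def viterbi(emissions, trans, beam=10.0, floor=-1e9):
    """Best phoneme-state path for per-frame log-likelihoods (one state per phoneme)."""
    # scores maps each surviving state to (log score, best path ending in that state)
    scores = {p: (lp, [p]) for p, lp in emissions[0].items()}
    for frame in emissions[1:]:
        best = max(s for s, _ in scores.values())
        # Pruning: keep only states whose score is within `beam` of the best one.
        alive = {p: v for p, v in scores.items() if v[0] >= best - beam}
        new_scores = {}
        for cur, emit_lp in frame.items():
            cand = max(
                ((s + trans.get((prev, cur), floor) + emit_lp, path)
                 for prev, (s, path) in alive.items()),
                key=lambda c: c[0],
            )
            new_scores[cur] = (cand[0], cand[1] + [cur])
        scores = new_scores
    return max(scores.values(), key=lambda v: v[0])


# Two frames, two phonemes; emissions and transitions are made-up log-probabilities.
emissions = [{"h": -0.1, "a": -3.0}, {"h": -2.0, "a": -0.2}]
trans = {("h", "h"): -0.7, ("h", "a"): -0.7, ("a", "a"): -0.7, ("a", "h"): -5.0}
best_score, best_path = viterbi(emissions, trans)
```

A narrower `beam` discards more states per frame, trading accuracy for speed, which is the purpose of pruning in the first pass.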

FIG. 4 illustrates a state in which the first phoneme recognition unit has generated the first phoneme string group and then the corresponding phonetic lattice.

For example, when the first phoneme recognition unit 120 performs phoneme recognition on the input speech signal 10 for the word "학교" ("school") using the monophone-based acoustic model 122 and the bigram language model 124, the first phoneme string group 126 that survives pruning in the Viterbi algorithm may be "h a g g jo", "h a g G jo", "a g g I o a", and "h E g g jo". This first phoneme string group 126 consists of the phoneme strings with the highest probabilities (the N-best phoneme strings) among all phoneme strings recognizable for the input speech signal 10. The phonetic lattice 128 can then be generated from the first phoneme string group 126.
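One simple way to picture a lattice built from these N-best strings is as the set of phoneme-to-phoneme transitions they contain. The sketch below makes that assumption explicitly: a real phonetic lattice would normally also carry time alignments and scores, and the `sil` boundary nodes and the name `build_lattice` are introduced here only for illustration.

```python
def build_lattice(nbest):
    """Collect the phoneme-to-phoneme transitions observed in the N-best strings."""
    edges = {}
    for string in nbest:
        phones = ["sil"] + string.split() + ["sil"]
        for prev, cur in zip(phones, phones[1:]):
            edges.setdefault(prev, set()).add(cur)
    return edges


nbest = ["h a g g jo", "h a g G jo", "a g g I o a", "h E g g jo"]
lattice = build_lattice(nbest)
```

Only the eight phonemes that occur in the N-best strings appear in the result, far fewer than the full 42-phoneme inventory.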

As described above, the first phoneme recognition unit 120 performs fast phoneme recognition with the monophone-based acoustic model 122 to generate the phonetic lattice 128, and the lattice is then reused as a language model consisting only of a limited set of phonemes. This greatly reduces the search space for the subsequent phoneme recognition in the second phoneme recognition unit 140.

The language model correction unit 130 corrects the phonetic lattice 128 using phoneme rules. That is, by correcting the phonetic lattice 128 generated by the first phoneme recognition unit 120, the language model correction unit 130 enables the second phoneme recognition unit 140 to perform more accurate and faster phoneme recognition.

FIG. 5 is a block diagram showing the language model correction unit in the speech recognition apparatus of FIG. 1.

The language model correction unit 130 of the speech recognition apparatus 100 according to an embodiment of the present invention may include a phoneme string generation unit 131, a phoneme string selection unit 132, a phoneme string correction unit 133, and a phonetic lattice generation unit 134.

The phoneme string generation unit 131 may generate the group of possible phoneme strings from the phonetic lattice 128 generated by the first phoneme recognition unit 120. The phoneme string selection unit 132 may sort the phoneme string group generated by the phoneme string generation unit 131 by probability and then select the phoneme strings with high probability values. The phoneme string correction unit 133 may convert the phoneme string group selected by the phoneme string selection unit 132 using phoneme rules to generate a corrected phoneme string group.
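The sort-and-select step can be sketched in one function. The candidate strings, their log-probability scores, and the name `select_top` are hypothetical; the description does not state how many strings are kept or how they are scored.

```python
def select_top(candidates, k):
    """Sort (phoneme_string, log_probability) pairs and keep the k most probable."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:k]


# Made-up scores for strings drawn from a lattice.
candidates = [("h a g g jo", -12.4), ("a g g I o a", -19.8), ("h a g G jo", -13.1)]
selected = select_top(candidates, k=2)
```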

The phoneme rules used by the phoneme string correction unit 133 to convert the phoneme string group may be universal rules based on knowledge of phonetics or phonology, rather than domain-dependent statistical techniques that rely on large amounts of experimental data for language model correction. For example, the phoneme rules may follow the Standard Pronunciation rules in the Korean language regulations (한국 어문 규정집).

Table 1 below shows examples of phoneme rules that the phoneme string correction unit of the language model correction unit can apply when converting phoneme strings.

[Table 1]

No. | Rule name | Phoneme string | Corrected phoneme string
Rule 1 | Removal of redundant consonant sequences | consonant1-consonant2-consonant3 | consonant1-consonant2; consonant1-consonant3; consonant2-consonant3
Rule 2 | Removal of a word-final phoneme of a double consonant | consonant1-consonant2-silence(sil) | consonant1-silence(sil); consonant2-silence(sil)
Rule 3 | Removal of a word-final consonant that cannot occur as a final consonant | phoneme1-consonant1-silence(sil) | phoneme1-silence(sil)
Rule 4 | Removal of a word-initial phoneme of a double consonant | silence(sil)-consonant1-consonant2 | silence(sil)-consonant1; silence(sil)-consonant2
Rule 5 | Correction of devoicing of short high vowels | {/ㅍ, ㅌ, ㅋ, ㅊ, ㅅ, ㅆ, ㅎ/}-{consonant} | {/ㅍ, ㅌ, ㅋ, ㅊ, ㅅ, ㅆ, ㅎ/}-{/ㅣ, ㅟ, ㅜ, ㅡ/}
Rule 6 | Correction of liquidization | {/ㄹ/}-{/ㄹ/} | {/ㄴ/}-{/ㄹ/}; {/ㄹ/}-{/ㄴ/}; {/ㄹ/}-{/ㄹ/}
Rule 7 | Correction of nasalization of obstruents | {/ㅁ, ㄴ, ㅇ/}-{nasal} | {/ㅂ, ㄷ, ㄱ/}-{nasal}; {/ㅁ, ㄴ, ㅇ/}-{nasal}
Rule 8 | Correction of ㄷ-palatalization | {/ㅈ, ㅊ/}-{vowel}-silence(sil) | {/ㄷ, ㅌ/}-{vowel}; {/ㅈ, ㅊ/}-{vowel}
Rule 9 | Compensation for tensification after stem-final /ㄴ, ㅁ/ | {/ㄴ, ㅁ/}-{/ㄲ, ㄸ, ㅃ, ㅆ, ㅉ/} | {/ㄴ, ㅁ/}-{/ㄱ, ㄷ, ㅂ, ㅅ, ㅈ/}; {/ㄴ, ㅁ/}-{/ㄲ, ㄸ, ㅃ, ㅆ, ㅉ/}
Rule 10 | Correction of the ㄹ initial sound law | silence(sil)-{/ㄴ/}-{/ㅣ, ㅖ, ㅕ, ㅑ, ㅒ/} | silence(sil)-{/ㄴ/}-{/ㅣ, ㅖ, ㅕ, ㅑ, ㅒ/}; silence(sil)-{/ㄹ/}-{/ㅣ, ㅖ, ㅕ, ㅑ, ㅒ/}
Rule 11 | Correction of palatal glide deletion after palatal consonants | /ㅈ, ㅊ/-/ㅕ/ | /ㅈ, ㅊ/-/ㅓ/
Rule 12 | "j" insertion | /ㅣ, ㅔ, ㅐ, ㅟ, ㅚ/-/ㅓ/ | /ㅣ, ㅔ, ㅐ, ㅟ, ㅚ/-/ㅕ/; /ㅣ, ㅔ, ㅐ, ㅟ, ㅚ/-/ㅓ/

For example, when the input speech signal 10 is "같다", the phoneme string generated by the first phoneme recognition unit 120 may be "ㄱ ㅏ ㅌ ㅅ ㄷ ㅏ". In this case, as shown in Table 1, the "ㅌ", "ㅅ", "ㄷ" portion can be converted by Rule 1 (removal of redundant consonant sequences) into phoneme strings containing "ㅌ"-"ㅅ", "ㅌ"-"ㄷ", or "ㅅ"-"ㄷ". The corrected phoneme string group is therefore "ㄱ ㅏ ㅌ ㅅ ㅏ", "ㄱ ㅏ ㅌ ㄷ ㅏ", and "ㄱ ㅏ ㅅ ㄷ ㅏ".
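The Rule 1 conversion in this example can be reproduced with a short sketch. The consonant inventory `CONSONANTS`, the choice to rewrite only the first run of three consonants, and the name `apply_rule_1` are assumptions made for illustration.

```python
CONSONANTS = set("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")


def apply_rule_1(phones):
    """Replace the first run of three consonants by each of its two-consonant pairs."""
    for i in range(len(phones) - 2):
        if all(p in CONSONANTS for p in phones[i:i + 3]):
            c1, c2, c3 = phones[i:i + 3]
            pairs = [(c1, c2), (c1, c3), (c2, c3)]   # order follows Table 1
            return [phones[:i] + list(pair) + phones[i + 3:] for pair in pairs]
    return [phones]  # no triple-consonant run: leave the string unchanged


corrected = apply_rule_1("ㄱ ㅏ ㅌ ㅅ ㄷ ㅏ".split())
corrected_strings = [" ".join(c) for c in corrected]
```

Here `corrected_strings` contains the three corrected phoneme strings listed in the example, in the same order.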

Finally, the phonetic lattice generation unit 134 may generate a corrected phonetic lattice 135 from the phoneme string group corrected by the phoneme string correction unit 133. In addition, a higher-order language model such as a 2-gram or 3-gram may be applied to the corrected phonetic lattice 135 to adjust the probabilities between phoneme strings within it.

The phonetic lattice 135 corrected by the language model correction unit 130 in this way can be used as the language model for phoneme recognition in the second phoneme recognition unit 140. Errors that may arise during fast-pass decoding in the first phoneme recognition unit 120 are thus corrected by the language model correction technique, and applying the result to second-pass decoding in the second phoneme recognition unit 140 yields more reliable phoneme recognition.
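The data flow between the three units can be summarized in a small skeleton. Each stage is passed in as a callable, so nothing here reflects how the units are actually implemented; the function name `two_pass_recognize` and the stub stages are hypothetical.

```python
def two_pass_recognize(features, first_pass, correct_lattice, second_pass):
    """Wire the units of FIG. 1 together; each stage is passed in as a callable."""
    lattice = first_pass(features)            # first phoneme recognition unit (120)
    corrected = correct_lattice(lattice)      # language model correction unit (130)
    return second_pass(features, corrected)   # second phoneme recognition unit (140)


# Stub stages that only demonstrate the order of the calls.
result = two_pass_recognize(
    [0.1, 0.2],
    first_pass=lambda f: {"n_frames": len(f)},
    correct_lattice=lambda lat: {**lat, "corrected": True},
    second_pass=lambda f, lat: (len(f), lat),
)
```

Note that the second pass receives the original features again, together with the corrected lattice, rather than the output of the first pass alone.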

제2 음소 인식부(140)는 보정된 음성 격자(135)를 이용하여 특징 벡터(112)로부터 음성을 인식하는 역할을 한다.The second phoneme recognition unit 140 recognizes speech from the feature vector 112 using the corrected speech grid 135.

FIG. 6 illustrates the phoneme recognition process in which the second phoneme recognition unit of the speech recognition apparatus of FIG. 1 generates N-Best phoneme strings from the feature vector of the input speech signal.

As described for the first phoneme recognition unit 120, the second phoneme recognition unit 140 can recognize phonemes by pattern-matching the feature vector 112 using the second acoustic model 142 and the corrected phonetic lattice 135. Phoneme recognition at this stage is referred to as Second-Pass Decoding.

Preferably, the second acoustic model 142 is a triphone-based acoustic model. As described above, using a triphone-based acoustic model allows phoneme recognition to be performed more accurately.

The recognition process ends when the N-Best phoneme string group 144, generated by phoneme recognition in the second phoneme recognition unit 140, is recognized as the pronunciation of the input speech signal 10. A speech registration process can also be carried out through this phoneme recognition. Here, the N-Best phoneme strings are the phoneme strings having the highest probability values.

FIG. 7 is a graph comparing the time required for word recognition when an input speech signal is recognized using prior art A, prior art B, and a speech recognition apparatus according to an embodiment of the present invention.

In the case of prior art A, a triphone-based acoustic model with 42 phonemes was used; because about 70,000 phonemes must be searched, it requires the most time for speech recognition. In the case of prior art B, phoneme recognition is first performed with a monophone-based acoustic model (Fast-Pass Decoding), and phoneme recognition is then performed with a language model based on that result and a triphone-based acoustic model (Second-Pass Decoding). This shrinks the search space of the language model and reduces the amount of computation accordingly, so prior art B requires the least time for speech recognition. In the case of the present invention, the phonetic lattice 128 is generated using a monophone-based acoustic model, and phoneme recognition is performed using a language model based on the corrected phonetic lattice 135, so it takes somewhat more time than prior art B.
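The figure of about 70,000 is consistent with counting every (left, center, right) combination of 42 phonemes, although the specification does not state how the number was derived. The short sketch below shows that arithmetic; the function name `context_dependent_units` is hypothetical.

```python
def context_dependent_units(n_phonemes: int, context_width: int = 3) -> int:
    """Upper bound on distinct units when each unit spans `context_width` phonemes."""
    return n_phonemes ** context_width


monophones = context_dependent_units(42, context_width=1)  # 42 context-independent units
triphones = context_dependent_units(42)                    # 42 * 42 * 42 = 74088
print(monophones, triphones)
```

The gap between 42 and 74,088 units gives a sense of why a monophone first pass is so much cheaper than a full triphone search.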

FIG. 8 is a graph comparing the word error rates when an input speech signal is recognized using prior art A, prior art B, and a speech recognition apparatus according to an embodiment of the present invention.

In FIG. 8, the word error rate (WER) is used to indicate the accuracy of speech recognition.
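For reference, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference. The specification does not describe its scoring procedure, so the sketch below shows the standard definition only; the function name and the English sample sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] is the edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    return dist[len(ref)][len(hyp)] / len(ref)


print(word_error_rate("turn on the light", "turn off the light"))  # 0.25
```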

Prior art A showed the lowest error rate because it uses a triphone-based acoustic model, whereas prior art B showed a high error rate because it performs speech recognition using a triphone-based acoustic model together with a language model generated by phoneme recognition with a monophone-based acoustic model.

With the present invention, however, the word error rate increased by about 14% in relative terms compared with prior art A, but decreased by about 18% in relative terms compared with prior art B.

Referring to FIGS. 7 and 8, prior art A showed the lowest error rate, but its gain in accuracy is not large relative to the time it requires for speech recognition. Prior art B requires little time for speech recognition, but its high error rate means that its accuracy is poor. Therefore, the speech recognition apparatus and method according to an embodiment of the present invention can greatly reduce the required time compared with prior art A and can also substantially increase recognition accuracy compared with prior art B.

Thus, according to the speech recognition apparatus and method of an embodiment of the present invention, two-pass decoding reduces the search space and thereby the time required for speech recognition, so that speech can be recognized in real time. In addition, correcting the language model improves the accuracy of speech recognition.

A speech recognition method according to an embodiment of the present invention, configured as described above, is described below.

FIG. 9 is a flowchart illustrating a speech recognition method according to an embodiment of the present invention.

First, the input speech signal 10 is divided into unit frames, and the feature vector 112 corresponding to each divided frame region is extracted (S201). The first phoneme recognition unit 120 then recognizes phonemes by pattern-matching the feature vector 112 using the first acoustic model 122, that is, a monophone-based acoustic model, and a language model, and generates the first phoneme string group 126 (S202). The phonetic lattice 128 is then generated from the first phoneme string group 126 (S203). Because phoneme recognition in the first phoneme recognition unit 120 (Fast-Pass Decoding) uses a monophone-based acoustic model, the time required for phoneme recognition can be reduced.
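The framing part of step S201 can be sketched as follows. The specification only says that the signal is divided into unit frames, so the overlapping-window scheme, the parameter names `frame_length` and `hop_length`, and the choice to drop a trailing partial frame are assumptions. Feature extraction itself (for example MFCC) is not shown.

```python
def split_into_frames(samples, frame_length, hop_length):
    """Split a sample sequence into fixed-length frames taken every `hop_length` samples.

    With hop_length == frame_length the frames do not overlap.
    A trailing partial frame is dropped.
    """
    if frame_length <= 0 or hop_length <= 0:
        raise ValueError("frame_length and hop_length must be positive")
    last_start = len(samples) - frame_length
    return [samples[start:start + frame_length]
            for start in range(0, last_start + 1, hop_length)]


frames = split_into_frames(list(range(10)), frame_length=4, hop_length=2)
print(len(frames))  # 4
```

Each frame would then be passed to the feature extractor to obtain one feature vector per frame.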

Next, the language model correction unit 130 can correct the phonetic lattice 128 generated by the first phoneme recognition unit 120. That is, a phoneme string group is generated from the phonetic lattice 128 produced by the first phoneme recognition unit 120 (S204), the generated phoneme string group is sorted by probability value, and the phoneme strings having high probability values are selected (S205).
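Steps S204 and S205 can be pictured with a small sketch that expands a lattice into phoneme strings, sorts them by probability, and keeps the best ones. The lattice representation (a dictionary of arcs), the function names, and the toy probabilities are assumptions for illustration; the sketch also assumes the lattice is acyclic.

```python
def enumerate_paths(lattice, start, end):
    """Yield (phoneme_list, probability) for every path from `start` to `end`.

    `lattice` maps a node to a list of (next_node, phoneme, probability) arcs.
    """
    stack = [(start, [], 1.0)]
    while stack:
        node, phonemes, prob = stack.pop()
        if node == end:
            yield phonemes, prob
            continue
        for next_node, phoneme, arc_prob in lattice.get(node, []):
            stack.append((next_node, phonemes + [phoneme], prob * arc_prob))


def select_top(lattice, start, end, k):
    """Sort all paths by probability in descending order and keep the best k."""
    paths = sorted(enumerate_paths(lattice, start, end),
                   key=lambda item: item[1], reverse=True)
    return paths[:k]


# Hypothetical lattice with one ambiguous position (node 2 to node 3).
toy_lattice = {
    0: [(1, "ㄱ", 1.0)],
    1: [(2, "ㅏ", 1.0)],
    2: [(3, "ㅌ", 0.6), (3, "ㅅ", 0.4)],
    3: [(4, "ㄷ", 1.0)],
    4: [(5, "ㅏ", 1.0)],
}
top = select_top(toy_lattice, start=0, end=5, k=1)
print(" ".join(top[0][0]))
```

Full enumeration grows quickly with lattice size, so a practical implementation would more likely use a beam or an N-best search instead of expanding every path.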

Next, the selected phoneme string group is converted to generate a corrected phoneme string group (S206); here, as described above, rules such as those in FIG. 6 can be applied. The corrected phonetic lattice 135 can then be generated from the corrected phoneme string group (S207). Because the language model correction unit 130 generates the corrected phonetic lattice 135 and the second phoneme recognition unit 140 uses it for phoneme recognition, the time required for speech recognition can be reduced.

Finally, the second phoneme recognition unit 140 can recognize phonemes by pattern-matching the feature vector 112 using the second acoustic model 142, that is, a triphone-based acoustic model, and a language model based on the corrected phonetic lattice 135 (S208). Performing phoneme recognition with the phonetic lattice 135 corrected by the language model correction unit 130 reduces the time required for speech recognition, and using a triphone-based acoustic model increases the recognition rate.
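The data flow of steps S202 through S208 can be summarized in a skeleton like the one below. It only shows how the three stages hand results to one another; the function `two_pass_recognize` and its callable parameters are hypothetical stand-ins for units 120, 130, and 140, and the stub lambdas in the usage lines return fixed values rather than performing real decoding.

```python
from typing import Any, Callable, List, Sequence

FeatureVectors = Sequence[Sequence[float]]


def two_pass_recognize(
    features: FeatureVectors,
    first_pass: Callable[[FeatureVectors], Any],
    correct_lattice: Callable[[Any], Any],
    second_pass: Callable[[FeatureVectors, Any], List[List[str]]],
) -> List[List[str]]:
    """Run first-pass decoding, lattice correction, and second-pass decoding in order."""
    lattice = first_pass(features)           # S202-S203: monophone model -> lattice
    corrected = correct_lattice(lattice)     # S204-S207: apply phoneme rules
    return second_pass(features, corrected)  # S208: triphone model + corrected lattice


# Stub stages (assumptions) that only demonstrate the hand-off between units.
n_best = two_pass_recognize(
    features=[[0.0, 0.0]],
    first_pass=lambda f: [["ㄱ", "ㅏ", "ㅌ", "ㅅ", "ㄷ", "ㅏ"]],
    correct_lattice=lambda lat: [["ㄱ", "ㅏ", "ㅌ", "ㄷ", "ㅏ"]],
    second_pass=lambda f, lat: lat,
)
print(n_best)
```

Note that the same feature vectors are passed to both decoding stages, which matches the description: only the acoustic model and the language model differ between the passes.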

The speech recognition apparatus 100 and method according to an embodiment of the present invention are applicable not only to intelligent robot systems but to any field that uses a speech interface. Applied to portable devices or home appliances, they allow the device to be commanded and operated by voice, and by identifying the speaker from a registered speaker's voice they can restrict use of the device by anyone other than registered users, thereby personalizing the device. In addition, out-of-vocabulary (OOV) words, which are a problem in large-vocabulary speech recognition for natural conversation between people and machines, can be handled efficiently through unrestricted speech recognition, which helps make interaction between devices and people more natural.

Those of ordinary skill in the art to which the present invention pertains will understand that the invention may be embodied in other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims set out below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

FIG. 7 is a graph comparing the time required for word recognition when an input speech signal is recognized using prior art A, prior art B, and a speech recognition apparatus according to an embodiment of the present invention.

FIG. 8 is a graph comparing the word error rates when an input speech signal is recognized using prior art A, prior art B, and a speech recognition apparatus according to an embodiment of the present invention.

FIG. 9 is a flowchart illustrating a speech recognition method according to an embodiment of the present invention.

<Description of Reference Numerals for Main Parts of the Drawings>

100: Speech recognition apparatus

110: Feature vector extraction unit

120: First phoneme recognition unit

130: Language model correction unit

140: Second phoneme recognition unit

Claims

A speech recognition apparatus comprising:

a first phoneme recognition unit that generates a phonetic lattice from a feature vector of an input speech signal;

a language model correction unit that corrects the phonetic lattice using a phoneme rule; and

a second phoneme recognition unit that recognizes speech from the feature vector using the corrected phonetic lattice.

The apparatus of claim 1,

further comprising a feature vector extraction unit that divides the input speech signal into unit frames and extracts the feature vector corresponding to each divided frame region.

The apparatus of claim 2,

wherein the feature vector extraction unit

extracts the feature vector using a Mel-Frequency Cepstrum Coefficients (MFCC) feature extraction method.

The apparatus of claim 1,

wherein the first phoneme recognition unit

pattern-matches the feature vector using a first acoustic model and a language model to recognize phonemes and generate a first phoneme string group, and generates the phonetic lattice from the first phoneme string group.

The apparatus of claim 4,

wherein the first acoustic model is a monophone-based acoustic model.

The apparatus of claim 4,

wherein the language model is a phoneme-level bigram-based language model when the input speech signal is to be registered.

The apparatus of claim 4,

wherein the language model is a word-level unigram-based language model when the input speech signal is to be recognized.

The apparatus of claim 4,

wherein the first phoneme string group is generated using a Viterbi algorithm.

The apparatus of claim 1,

wherein the language model correction unit comprises:

a phoneme string generation unit that generates a phoneme string group from the phonetic lattice;

a phoneme string selection unit that sorts the generated phoneme string group by probability value and then selects a phoneme string group having high probability values;

a phoneme string correction unit that converts the selected phoneme string group using the phoneme rule to generate a corrected phoneme string group; and

a phonetic lattice generation unit that generates the corrected phonetic lattice from the corrected phoneme string group.

The apparatus of claim 1 or 9,

wherein the phoneme rule is based on knowledge of phonetics or phonology.

The apparatus of claim 1,

wherein the second phoneme recognition unit

recognizes phonemes by pattern-matching the feature vector using a second acoustic model and the corrected phonetic lattice.

The apparatus of claim 11,

wherein the second acoustic model is a triphone-based acoustic model.

A speech recognition method comprising:

generating a phonetic lattice from a feature vector of an input speech signal;

correcting the phonetic lattice using a phoneme rule; and

recognizing speech from the feature vector using the corrected phonetic lattice.

The method of claim 13,

further comprising dividing the input speech signal into unit frames and extracting the feature vector corresponding to each divided frame region.

The method of claim 14,

wherein extracting the feature vector comprises

extracting the feature vector using a Mel-Frequency Cepstrum Coefficients (MFCC) feature extraction method.

The method of claim 13,

wherein generating the phonetic lattice comprises:

recognizing phonemes by pattern-matching the feature vector using a first acoustic model and a language model to generate a first phoneme string group; and

generating the phonetic lattice from the first phoneme string group.

The method of claim 16,

wherein the first acoustic model is a monophone-based acoustic model.

The method of claim 16,

wherein the language model is a phoneme-level bigram-based language model when the input speech signal is to be registered.

The method of claim 16,

wherein the first phoneme string group is generated using a Viterbi algorithm.

The method of claim 13,

wherein correcting the phonetic lattice comprises:

generating a phoneme string group from the phonetic lattice;

sorting the generated phoneme string group by probability value and then selecting a phoneme string group having high probability values;

converting the selected phoneme string group using the phoneme rule to generate a corrected phoneme string group; and

generating the corrected phonetic lattice from the corrected phoneme string group.

The method of claim 13 or 21,

wherein the phoneme rule is based on knowledge of phonetics or phonology.

The method of claim 13,

wherein recognizing speech from the feature vector comprises

recognizing phonemes by pattern-matching the feature vector using a second acoustic model and the corrected phonetic lattice.

The method of claim 23,

wherein the second acoustic model is a triphone-based acoustic model.