KR101677530B1

KR101677530B1 - Apparatus for speech recognition and method thereof

Info

Publication number: KR101677530B1
Application number: KR1020100039217A
Authority: KR
Inventors: 한익상; 박치연; 김정수; 조정미
Original assignee: 삼성전자주식회사
Priority date: 2010-04-27
Filing date: 2010-04-27
Publication date: 2016-11-21
Also published as: KR20110119478A

Abstract

A speech recognition apparatus and method for a voice interactive user interface are provided. The speech recognition apparatus includes: a primary speech recognition unit that detects first word lattice information by performing continuous speech recognition on input speech; a second word lattice information detection unit that detects second word lattice information through phoneme string matching between a phoneme string recognized from the input speech and word branches included in a word list; a merging unit that merges the first word lattice information and the second word lattice information to generate integrated lattice information; and a secondary speech recognition unit that generates a sentence by performing secondary speech recognition using the integrated lattice information.

Description

Apparatus for speech recognition and method thereof

The following description relates to speech recognition technology and, more particularly, to a speech recognition apparatus and method for a voice interactive user interface.

Multimedia content is becoming richer and easier to access, while the hardware specifications of the devices that support it are improving. Accordingly, user interfaces between users and devices, such as touch, gesture, and voice dialogue user interfaces, are also becoming more user-friendly.

Among these user interfaces, voice dialogue user interfaces are in commercial use mainly in relatively limited and simple systems, such as airplane or train reservation systems operated over the telephone network. One reason for this limited use is that the speech recognition rate and dialogue success rate experienced by users are low. These rates are low because utterances contain errors caused by noise and the like, or because users input sentences in many forms that do not follow the patterns expected by the voice dialogue user interface.

A speech recognition apparatus and method are provided for improving speech recognition performance when a user utters a natural sentence through a voice interactive interface.

A speech recognition apparatus according to one aspect includes: a primary speech recognition unit that detects first word lattice information by performing continuous speech recognition on input speech; a second word lattice information detection unit that detects second word lattice information through phoneme string matching between a phoneme string recognized from the input speech and word branches included in a word list; a merging unit that merges the first word lattice information and the second word lattice information to generate integrated lattice information; and a secondary speech recognition unit that performs secondary speech recognition using the integrated lattice information to generate a sentence for the input speech.

A speech recognition method according to another aspect includes: detecting first word lattice information by performing continuous speech recognition on input speech; detecting second word lattice information through phoneme string matching between a phoneme string recognized from the input speech and word branches included in a word list; merging the first word lattice information and the second word lattice information to generate integrated lattice information; and performing secondary speech recognition using the integrated lattice information to generate a sentence for the input speech.

A speech recognition apparatus according to yet another aspect includes: a primary speech recognition unit that detects first word lattice information by performing continuous speech recognition on input speech; a second word lattice information detection unit that detects second word lattice information through phoneme string matching between a phoneme string recognized from the input speech and word branches included in a word list; and a merging unit that merges the first word lattice information and the second word lattice information to generate integrated lattice information.

When a user utters sentences in various forms, a language model built from limited data can be supplemented, and the probability of misrecognizing speech because of the existing language model can be reduced, thereby improving speech recognition performance.

FIG. 1 is a diagram illustrating an example of the configuration of a speech recognition apparatus.
FIG. 2 is a diagram illustrating an example of the configuration of the second word lattice information detection unit included in the speech recognition apparatus of FIG. 1.
FIG. 3 is a diagram illustrating an example of the second word detection operation of the second word lattice information detection unit of FIG. 2.
FIG. 4 is a flowchart illustrating an example of the operation of the merging unit included in the speech recognition apparatus of FIG. 1.
FIG. 5 is a diagram illustrating an example of the operation of the secondary speech recognition unit of FIG. 1.
FIG. 6 is a diagram illustrating an example of the second language model of FIG. 1.
FIG. 7 is a flowchart illustrating an example of a speech recognition method.
FIG. 8A illustrates an example in which the speech recognition method according to an embodiment is applied when there is an utterance error in the front part of the input speech, and FIG. 8B illustrates an example in which the method is applied when the speech recognizer has an insufficient language model.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings. In describing the present invention, detailed descriptions of related well-known functions or configurations are omitted where they would unnecessarily obscure the subject matter of the invention. The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or practice of users and operators. Their definitions should therefore be based on the content of this specification as a whole.

FIG. 1 is a diagram illustrating an example of the configuration of a speech recognition apparatus.

The speech recognition apparatus 100 includes a primary speech recognition unit 110, a second word lattice information detection unit 120, a merging unit 130, a secondary speech recognition unit 140, and a storage unit 150. The storage unit 150 may include an acoustic model 152, a word list 154, a first language model 156, and a second language model 158.

The storage unit 150 may be implemented as a storage medium external to the speech recognition apparatus 100. The acoustic model 152 is a model representing the characteristics of speech signals. The word list 154 is a list containing a plurality of words and the phoneme string of each word. The first language model 156 may be a statistical language model, such as an n-gram model (a language model over n adjacent words), or a grammar-based language model, such as a context-free grammar. The second language model 158 may be a looser language model than the first language model 156.

The primary speech recognition unit 110 performs a first pass of speech recognition using the acoustic model 152, the word list 154, and the first language model 156, and generates first word lattice information composed of the candidate words to be recognized.

Specifically, the primary speech recognition unit 110 may be a part of a conventional continuous speech recognizer. A continuous speech recognition system is a system that performs recognition in units of sentences. The primary speech recognition unit 110 may divide the input speech into frames, convert each frame into a frequency-domain signal, and extract feature information from the converted signal to generate speech feature vectors. The primary speech recognition unit 110 may then apply the acoustic model 152 to the extracted feature vectors to detect words in the word list 154, and apply the first language model 156, which represents the relationships between words, to the detected words to generate the first word lattice information.
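The front-end steps described above (framing, conversion to the frequency domain, feature extraction) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: the frame length, hop size, the naive DFT, and the choice of a log-magnitude spectrum as the feature are all assumptions, and every function name is hypothetical.

```python
import cmath
import math

def frame_signal(samples, frame_len=8, hop=4):
    """Split a sample sequence into overlapping frames (sizes are illustrative)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def dft_magnitudes(frame):
    """Naive DFT magnitude spectrum of one frame (non-negative frequency bins)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def feature_vectors(samples, frame_len=8, hop=4):
    """Return one log-magnitude feature vector per frame."""
    return [[math.log(m + 1e-10) for m in dft_magnitudes(f)]
            for f in frame_signal(samples, frame_len, hop)]

# Toy input: a sinusoid with a period of 8 samples, 16 samples long.
signal = [math.sin(2 * math.pi * t / 8) for t in range(16)]
features = feature_vectors(signal)
```

A practical recognizer would normally use cepstral features and a windowing function; the log spectrum is used here only to keep the sketch short.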

Here, the first word lattice information means information on a word lattice, which is a set of first word branches. A first word branch may include a word, a word identifier (ID), the start frame position of the word, the end frame position of the word, the acoustic model score of the word, and the like. The start frame position indicates the time at which detection of the word begins, and the end frame position indicates the time at which detection of the word ends.
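A word branch as described above maps naturally onto a small record type. The sketch below is one possible representation; the class name, field names, the inclusive end frame, and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WordBranch:
    """One word hypothesis in a word lattice (hypothetical layout)."""
    word: str
    word_id: int
    start_frame: int       # frame at which detection of the word begins
    end_frame: int         # frame at which detection ends (inclusive, an assumption)
    acoustic_score: float  # acoustic model score of the word

    @property
    def num_frames(self) -> int:
        return self.end_frame - self.start_frame + 1

def branches_at(lattice, frame):
    """Return the branches whose frame span contains the given frame."""
    return [b for b in lattice if b.start_frame <= frame <= b.end_frame]

# Hypothetical lattice: two competing hypotheses at the start, one later word.
lattice = [
    WordBranch("로마", 101, 0, 39, -2.1),
    WordBranch("도마", 102, 0, 41, -2.8),
    WordBranch("신분", 205, 42, 80, -1.9),
]
```

Representing the lattice as a flat list of branches keeps the first and second lattices structurally identical, which makes it straightforward to combine them later.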

The operation of extracting feature vectors so that the acoustic model 152 can be applied to the input speech has been described as being performed by the primary speech recognition unit 110. However, this operation may instead be performed by a separate preprocessing unit (not shown), with the resulting feature vectors stored in a designated area of the storage unit 150 for later use. Accordingly, not only the primary speech recognition unit 110 but also the second word lattice information detection unit 120 and the secondary speech recognition unit 140 may be configured to use the feature vectors stored in the storage unit 150.

The second word lattice information detection unit 120 may detect second word lattice information through phoneme string matching between a phoneme string recognized from the input speech and the word branches included in the word list.

Here, the second word lattice information means information on a second word lattice, which is a set of second word branches. A second word branch may include a word, a word identifier (ID), the start frame position of the word, the end frame position of the word, the acoustic model score of the word, and the like. The start frame position indicates the time at which detection of the word begins, and the end frame position indicates the time at which detection of the word ends.

The second word lattice information detection unit 120 passes the second word lattice information that it finally outputs to the merging unit 130. The detailed configuration and operation of the second word lattice information detection unit 120 are described below with reference to FIG. 2.

The first word lattice information can cover the entire utterance, but because it is produced under the influence of the language model, error propagation may prevent key words from being extracted. In contrast, the second word lattice information mainly contains words that were uttered relatively accurately within the spoken sentence and therefore have high acoustic scores. However, the second word lattice information covers the utterance only partially, so it may not be possible to form a sentence that covers the whole utterance from the second word lattice information alone.

The merging unit 130 merges the first word lattice information and the second word lattice information to generate an integrated word lattice. Merging the two lets each compensate for the other's shortcomings. The integrated lattice information can therefore be used so that the secondary speech recognition unit 140 can build a complete sentence centered on the accurately uttered parts.

The secondary speech recognition unit 140 generates a sentence for the input speech using the integrated lattice information generated by the merging unit 130. The secondary speech recognition unit 140 may correspond to the stack decoding part of a conventional continuous speech recognizer. Unlike a conventional continuous speech recognizer, however, it applies language models in a dual manner.

The secondary speech recognition unit 140 may perform speech recognition using the first language model 156 to connect word branches that both belong to the first word lattice information, and using the second language model 158, which is looser than the first language model, to connect a second word branch belonging to the second word lattice information with a word branch belonging to the first word lattice information. Connections between word branches that both belong to the second word lattice information may also be made using the second language model 158. Here, a language model with fewer constraints than the first language model 156 is used as the second language model 158.
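The dual application of language models described above amounts to a selection rule keyed on where each of the two branches came from. The sketch below shows only that rule; the origin tags, function names, and the constant scores returned by the two stand-in models are hypothetical.

```python
FIRST, SECOND = "first", "second"  # which lattice a branch originated from

def pick_language_model(left_origin, right_origin, first_lm, second_lm):
    """Choose the language model used to connect two adjacent word branches.

    first-first   -> first (stricter) language model
    any pair that involves a second-lattice branch -> second (looser) model
    """
    if left_origin == FIRST and right_origin == FIRST:
        return first_lm
    return second_lm

# Toy stand-ins for the two models; real models would score the word pair.
def strict_lm(left_word, right_word):
    return -5.0

def loose_lm(left_word, right_word):
    return -1.0

def connection_score(left, right):
    """left and right are (word, origin) pairs."""
    lm = pick_language_model(left[1], right[1], strict_lm, loose_lm)
    return lm(left[0], right[0])
```

Keeping the origin tag on each branch after merging is what makes this rule applicable during decoding.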

For example, the first language model 156 may be an n-gram language model, and the second language model 158 may be a co-occurrence language model that models, as a probability, the degree to which two words appear together within a predetermined distance. As another example, the first language model 156 may be a language model that connects words by considering both the words and their morphemes, and the second language model 158 may be a language model that connects words by considering only their morphemes.
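To illustrate why a co-occurrence model is looser than an n-gram model, the sketch below estimates, from a toy corpus, how often two words appear within a fixed distance of each other regardless of order or adjacency. The estimator (sentence-level relative frequency), the window size, the corpus, and all names are assumptions made for illustration; the patent does not specify how the model is trained.

```python
from collections import Counter

def train_cooccurrence(sentences, max_distance=2):
    """Count, per sentence, unordered word pairs occurring within max_distance."""
    pair_counts = Counter()  # sentences in which a pair co-occurs within the window
    word_counts = Counter()  # sentences in which a word occurs
    for words in sentences:
        word_counts.update(set(words))
        pairs = set()
        for i, left in enumerate(words):
            for right in words[i + 1:i + 1 + max_distance]:
                if left != right:
                    pairs.add(frozenset((left, right)))
        pair_counts.update(pairs)
    return pair_counts, word_counts

def cooccurrence_prob(given, other, pair_counts, word_counts):
    """Estimate P(other appears within the window | given appears)."""
    if word_counts[given] == 0:
        return 0.0
    return pair_counts[frozenset((given, other))] / word_counts[given]

corpus = [
    ["tell", "me", "about", "rome"],
    ["rome", "history", "tell", "me"],
    ["about", "rome"],
]
pairs, words = train_cooccurrence(corpus, max_distance=2)
p_tell_rome = cooccurrence_prob("tell", "rome", pairs, words)
p_about_rome = cooccurrence_prob("about", "rome", pairs, words)
```

Because the pair is unordered and need not be adjacent, such a model can connect two words even when the words between them were misrecognized.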

If the secondary speech recognition unit 140 generates a plurality of candidate sentences, it may use the acoustic model 152 again, together with the acoustic feature information of the words in each candidate sentence, and output the candidate sentence with the highest acoustic score as the final recognized sentence.

The speech recognition apparatus 100 of FIG. 1 has been described using an example in which the secondary speech recognition unit 140 outputs a recognition result in sentence form using the integrated lattice information. However, the speech recognition apparatus 100 may instead consist of only the primary speech recognition unit 110, the second word lattice information detection unit 120, and the merging unit 130, or may combine these three units with another application module. For example, the other application module may be a search application execution unit that searches voice recording files using the integrated lattice information. Here, search applications may include applications used for searches in various fields, such as spoken lecture search and video search.

The speech recognition apparatus 100 may be implemented as a voice interactive user interface embedded not only in various telephone reservation systems but also in a wide range of electronic products, such as multimedia devices (televisions, mobile phones, and the like), robots, and kiosks.

FIG. 2 is a diagram illustrating an example of the configuration of the second word lattice information detection unit 120 included in the speech recognition apparatus of FIG. 1.

The second word lattice information detection unit 120 may include a phoneme string recognition unit 210, a word matching unit 220, a rescoring unit 230, and a second word integration unit 240.

The phoneme string recognition unit 210 recognizes a phoneme string from the given speech. It may detect feature vectors from the input speech signal and recognize a phoneme string of a predetermined length using the detected feature vectors and the acoustic model 152. Instead of detecting feature vectors itself, it may use feature vectors stored in advance in the storage unit 150.

The phoneme string recognition unit 210 may extract a phoneme string optimized for each language from the detected feature vectors using predetermined language-specific phoneme combination rules. For example, it may convert the detected feature vectors into scalar values, analyze the sequence of 39-dimensional cepstrum vectors, and recognize the best-matching letter of the alphabet (for example, /a/). In addition, in an environment where the combination rules for the initial, medial, and final sounds of Hangul are stored in a predetermined phone grammar, the phoneme string recognition unit 210 may recognize the detected feature vectors as Hangul phonemes, taking the linguistic characteristics of Korean into account. Through this process, the phoneme string recognition unit 210 can recognize phoneme strings composed of around 45 phonemes in the case of Korean. The above is only one example of the operation of the phoneme string recognition unit 210; phoneme string recognition may be performed in various other ways.

The word matching unit 220 matches the words in the word list 154 and their phoneme strings against the phoneme string recognized by the phoneme string recognition unit 210, and computes a similarity. The similarity may be expressed as a matching score indicating the degree of matching. Specifically, among the phoneme strings of the words in the word list 154, those related to the recognized phoneme string may be selected as recognition candidates. To this end, the word matching unit 220 may compute the similarity between the recognized phoneme string and the phoneme strings of the vocabulary in the word list 154, and extract recognition candidates based on the computed similarity.

The word matching unit 220 may use a phone confusion matrix to compute the similarity, that is, the matching score, between the recognized phoneme string and the phoneme strings of the vocabulary in the word list 154. Here, the phone confusion matrix expresses, as probability values, the degree of confusion between the phone set used by the phoneme string recognition unit 210 and the reference phoneme strings defined in the word list 154.
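One common way to turn a phone confusion matrix into a matching score is a dynamic-programming alignment in which each substitution contributes the log of its confusion probability. The sketch below takes that approach as an assumption; the patent does not state the alignment method. The confusion values, the insertion/deletion penalty, the probability floor, the length normalization, and the romanized phone symbols are all hypothetical.

```python
import math

def matching_score(recognized, reference, confusion,
                   indel_logp=math.log(0.01), floor=1e-4):
    """Length-normalized log-probability of the best alignment.

    confusion maps (reference_phone, recognized_phone) to a probability;
    pairs missing from the matrix fall back to a small floor value.
    """
    n, m = len(recognized), len(reference)
    # best[i][j]: best log-score aligning recognized[:i] with reference[:j]
    best = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                p = confusion.get((reference[j - 1], recognized[i - 1]), floor)
                best[i][j] = max(best[i][j], best[i - 1][j - 1] + math.log(p))
            if i > 0:  # recognized phone with no reference counterpart
                best[i][j] = max(best[i][j], best[i - 1][j] + indel_logp)
            if j > 0:  # reference phone missing from the recognized string
                best[i][j] = max(best[i][j], best[i][j - 1] + indel_logp)
    return best[n][m] / max(n, m, 1)

# Hypothetical confusion probabilities P(recognized | reference).
confusion = {("r", "r"): 0.8, ("d", "r"): 0.15,
             ("o", "o"): 0.9, ("m", "m"): 0.9, ("a", "a"): 0.9}
word_list = {"로마": ["r", "o", "m", "a"],
             "도마": ["d", "o", "m", "a"],
             "신": ["s", "i", "n"]}
recognized = ["r", "o", "m", "a"]
scores = {w: matching_score(recognized, phones, confusion)
          for w, phones in word_list.items()}
candidates = sorted(w for w, s in scores.items() if s >= -1.0)
```

With these toy values the acoustically similar words pass the threshold while the unrelated word does not, which mirrors how near-homophones end up as competing word branches.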

Although the word matching unit 220 has been described as matching words using the word list 154 used by the primary speech recognition unit 110, it may instead use a separate core word list that contains a smaller core vocabulary than the word list 154. The core word list may be composed differently depending on the field in which the speech recognition apparatus 100 is used. For example, when the speech recognition apparatus 100 is applied to navigation, the core word list may consist mainly of place names. Using a core word list in this way can raise the recognition rate for the words chiefly used in the target application.

The rescoring unit 230 may rescore those of the second words output from the word matching unit 220 whose matching score is at or above a specific threshold, using the acoustic model 152 and a phone grammar, through a Viterbi search (Viterbi matching) process or the like. Because rescoring with the acoustic model 152 is applied only to the relatively small set of second words whose phoneme-matching score is at or above the threshold, speech recognition can run smoothly even when processing resources are limited.

The second word integration unit 240 may combine two or more of the second words obtained through the rescoring unit 230 using the first language model 156. It may also compute a score while grouping adjacent second words using the first language model 156. This does not mean that the second word integration unit 240 must output every second word in combined form. The second word integration unit 240 may compute a score for each combination of two or more words and output the combined words whose score is at or above a threshold.

The second word integration unit 240 may compute the score for the combined second words whose computed score is at or above the threshold in the manner of Equation 1.
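Equation 1 itself is not reproduced in this text. One form that is consistent with the term definitions given in the paragraphs that follow, offered here as an assumption rather than as the original equation, adds the weighted language model score to the acoustic score and subtracts the length penalty:

```latex
\mathrm{Score} = \mathrm{Score}_{\mathrm{Acoustic}}
  + \omega_{\mathrm{Language}} \cdot \mathrm{Score}_{\mathrm{Language}}
  - \frac{\omega_{\mathrm{Length}}}{\#\mathrm{Frame}}
```

Under this assumed form the score increases with the acoustic score, with the language model score, and with the number of frames, and the penalty term shrinks toward zero for long words.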

Score_Acoustic is the matching score of the acoustic model 152 for a second word, normalized by the number of frames of the second word. In general, short words such as '오', '예', and '우' tend to obtain high Score_Acoustic values.

To correct for this, the second word integration unit 240 may apply a penalty that grows as the number of frames decreases, as in the term ω_Length/#Frame. Here, ω_Length is a tuning parameter that keeps the length-dependent penalty from differing much among words longer than a certain length. #Frame is the number of acoustic frames corresponding to the two or more combined second words.

However, when the ω_Length/#Frame term is included in the score, the score of each word may come out too low because of the word-length penalty when a sentence made up of short words is uttered. Scoring second words after combining those that can be grouped together prevents this loss caused by word length.

To group second words, a language model that models the probability of adjacent words, such as a trigram, may be used. Score_Language expresses, as a score, the probability that two or more second words are adjacent to each other, and ω_Language is a tuning parameter.

For example, the utterance '로마/라는/말/의/유래/가/뭐/야' ("What is the origin of the word 'Rome'?") consists of short second words. When such words are grouped, for example as '로마/라는', the score can be computed in the grouped state, which makes the score much more accurate. When the score is computed with second words grouped, for example with "로마" and "라는" grouped as "로마라는", Score_Acoustic may be the average of the acoustic model 152 matching score for "로마" and the acoustic model 152 matching score for "라는".

That is, the second word integration unit 240 may compute the score so that it is proportional to Score_Acoustic, the acoustic model 152 matching score of each word included in the combined second word; to Score_Language, the probability that the two or more words included in the combined second word are adjacent to each other; and to #Frame, the number of acoustic frames corresponding to the combined second word. By combining two or more second words in this way, the probability that an important word goes unrecognized can be lowered.
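The effect of grouping short words can be checked numerically. The sketch below assumes an additive score of the form Score_Acoustic + ω_Language · Score_Language − ω_Length / #Frame, which matches the term descriptions above but is not taken from the original equation. The weights, acoustic scores, frame counts, and language score are made-up values, and the function name is hypothetical.

```python
def integrated_score(acoustic_scores, language_score, num_frames,
                     w_language=0.5, w_length=10.0):
    """Score for one second word or a group of combined second words.

    acoustic_scores: frame-normalized acoustic scores, one per word
    language_score:  adjacency score of the grouped words (0.0 for one word)
    num_frames:      total acoustic frames spanned by the word or group
    """
    # Score_Acoustic of a group is the average of its members' scores.
    score_acoustic = sum(acoustic_scores) / len(acoustic_scores)
    length_penalty = w_length / num_frames  # larger for shorter spans
    return score_acoustic + w_language * language_score - length_penalty

# Hypothetical values for "로마" (30 frames) and "라는" (20 frames).
roma_alone = integrated_score([-2.0], 0.0, 30)
raneun_alone = integrated_score([-3.0], 0.0, 20)
grouped = integrated_score([-2.0, -3.0], -1.0, 50)
```

With these numbers the length penalty for the grouped span (10/50) is smaller than for either word alone (10/30 and 10/20), which is the effect the grouping step is meant to achieve.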

FIG. 2 shows an example in which the second word integration unit 240 merges words using the first language model 156 that is used by the first speech recognition unit 110, but the arrangement is not limited to this. For example, the first speech recognition unit 110 may use a 5-gram adjacency-type language model, while the second word integration unit 240 uses an adjacency-type language model such as a trigram.

FIG. 3 is a diagram illustrating an example of the second word detection operation of the second word lattice information detection unit of FIG. 2.

The speech input of FIG. 3 shows the acoustic signal waveform for the utterance '로마 신분제도에 대해 알려 줘' ("Tell me about the Roman social status system").

The phoneme string recognition of FIG. 3 shows the result of the phoneme string recognition unit 210 of FIG. 2 recognizing a phoneme string from the uttered acoustic signal.

The word matching of FIG. 3 shows the result of the word matching unit 220 of FIG. 2 matching the phoneme strings of the words in the word list 152 against the input phoneme string: matched words whose matching score is at or above a threshold are shown at the corresponding part of the speech. For example, '로마' (Rome), '로마인' (Roman), '도마' (cutting board), and so on are matched as the first word branch at the beginning of the input speech, and '신분' (status), '제도' (system), '분재' (bonsai), and so on are matched in the middle. This shows that second word lattice information containing second words that are mainly good acoustic matches has been extracted.

FIG. 4 is a flowchart illustrating an example of the operation of the merging unit 130 included in the speech recognition apparatus of FIG. 1.

As described above, the merging unit 130 merges the first word lattice information and the second word lattice information to generate an integrated word lattice composed of the first word lattice information and the second word lattice information.

The merging unit 130 receives the first word lattice information and the second word lattice information (410). The merging unit 130 removes, from the second word branches of the second word lattice information, the portion that overlaps with the first word branches of the first word lattice information (420).

The merging unit 130 inserts the non-overlapping second word branches into the first word lattice information to generate integrated lattice information (430). In doing so, the merging unit 130 may include, in the integrated lattice information, information for distinguishing whether each piece of lattice information it contains is first word lattice information or second word lattice information.
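Steps 410 to 430 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Branch` record, the function name `merge_lattices`, the use of frame indices, and the choice to treat a second word branch as "overlapping" only when it has the same word and the same time span as a first word branch are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Branch:
    """Hypothetical word branch: a word hypothesis over a span of frames."""
    word: str
    start: int  # first frame index
    end: int    # last frame index


def merge_lattices(first, second):
    """Return (branch, source) pairs; source is 1 or 2 (origin lattice)."""
    first_set = set(first)
    merged = [(b, 1) for b in first]
    # Step 420: drop second branches that duplicate a first branch.
    # Step 430: insert the rest, tagged as coming from the second lattice.
    merged += [(b, 2) for b in second if b not in first_set]
    merged.sort(key=lambda pair: (pair[0].start, pair[0].end, pair[0].word))
    return merged
```

Keeping the source tag on every branch is what lets a later recognition pass choose a language model per connection.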

FIG. 5 is a diagram illustrating an example of the operation of the secondary speech recognition unit 140 of FIG. 1.

When first word branches from the first word lattice are connected to each other, the secondary speech recognition unit 140 applies the first language model 156, which is the existing language model. When a second word branch from the second word lattice is connected to a first word branch, it may apply the second language model 158, which is less constrained than the first language model 156. The reason is that, if the first language model 156 were applied uniformly when generating a sentence from the combined first and second word lattice information, second word branches in the second word lattice information that have high acoustic scores could be excluded; applying the second language model prevents this.

Case 1 is the case where w3a, which follows w1 and w2, comes from the first word lattice. Here the language model that is applied depends on whether the next word, w4, comes from the first word lattice or from the second word lattice. If it comes from the first word lattice, like w4a, scoring applies the first language model 156; if it comes from the second word lattice, like w4b, scoring may apply the second language model 158.

Case 2 is the case where w3b, which follows w1 and w2, comes from the second word lattice. Here, when extending to the next word w4, scoring may be performed using the second language model 158 regardless of whether w4 comes from the first word lattice or the second word lattice, because w3b comes from the second word lattice.
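The two cases reduce to a simple rule: the first language model is used only when both words of a connection come from the first word lattice. The sketch below illustrates that rule. The function names, the encoding of the source lattice as `1` or `2`, and the simplification of each language model to a callable over a single preceding word are assumptions, not details from the description.

```python
def choose_language_model(prev_source, next_source):
    """Pick the model for the link prev -> next (sources are 1 or 2)."""
    if prev_source == 1 and next_source == 1:
        return "first"   # both words from the first word lattice
    return "second"      # any link touching the second word lattice


def score_path(path, first_lm, second_lm):
    """Sum link scores along a path of (word, source) pairs.

    first_lm and second_lm are hypothetical callables
    (prev_word, word) -> log score.
    """
    total = 0.0
    for (prev_word, prev_src), (word, src) in zip(path, path[1:]):
        if choose_language_model(prev_src, src) == "first":
            total += first_lm(prev_word, word)
        else:
            total += second_lm(prev_word, word)
    return total
```

For a path such as w1, w2, w3b, w4a, the link w1 to w2 is scored with the first model, and the links into and out of w3b are scored with the second model.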

FIG. 6 is a diagram illustrating an example of the second language model 158 of FIG. 1.

The second language model 158 of FIG. 2 may be a language model that is less constrained, that is, looser, than the first language model 156.

The first language model 156 of FIG. 1 may be a general n-gram. The first language model 156 may be expressed by Equation 2.

N(·) denotes the number of times a word appears in the training DB.
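Equation 2 is not reproduced in this text. For reference only, a general n-gram estimated from counts is commonly written in the maximum-likelihood form below; this is the standard textbook expression, offered as an assumption, and may differ in notation or smoothing from the equation in the original document.

```latex
P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
  = \frac{N(w_{i-n+1}, \dots, w_{i-1}, w_i)}{N(w_{i-n+1}, \dots, w_{i-1})}
```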

In FIG. 6, each block represents word information. Word information may consist of a word name and morpheme information. For example, the word information w3 may consist of a word name (n3) and a morpheme (t3). The arrows in FIG. 6 indicate that the second language model 158 is a co-occurrence language model representing the probability that the two words connected by an arrow occur together. In this case, the second language model 158 may be expressed by Equation 3.

d is the distance between x and y: d = 1 means the two words are directly adjacent, and d = 2 means one other word lies between x and y. Accordingly, N_d(x, y) denotes the number of times x and y occur together within distance d.

Whereas the first language model 156 models adjacent words, in the co-occurrence language model 158 the words need not be adjacent and only need to occur together within the given distance, so it is a more flexible approach than the first language model 156.

However, even if two words are within distance d, they should preferably not be joined when the morpheme grammar indicates a low probability that they can be placed next to each other, so a constraint such as f(t_x, t_y) is imposed. t_x is the morpheme of word x and t_y is the morpheme of word y. f(t_x, t_y) may be designed to approach 1 when the probability that the two morphemes connect is high and to approach 0 when that probability is low.
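The counting and the morpheme constraint can be illustrated with a short Python sketch. Equation 3 is not reproduced in this text, so the scoring form used here, f(t_x, t_y) · N_d(x, y) / N(x), is an assumption, as are the function names, the decision to count ordered pairs only (y after x), and the representation of morphemes as a `tag_of` dictionary.

```python
from collections import Counter


def cooccurrence_counts(sentences, d):
    """Count ordered pairs (x, y) where y follows x within distance d."""
    pair_counts = Counter()
    word_counts = Counter()
    for words in sentences:
        word_counts.update(words)
        for i, x in enumerate(words):
            # d = 1 looks only at the next word; d = 2 allows one word between.
            for y in words[i + 1:i + 1 + d]:
                pair_counts[(x, y)] += 1
    return pair_counts, word_counts


def cooccurrence_score(x, y, tag_of, f, pair_counts, word_counts):
    """Hypothetical score: f(t_x, t_y) * N_d(x, y) / N(x)."""
    if word_counts[x] == 0:
        return 0.0
    return f(tag_of[x], tag_of[y]) * pair_counts[(x, y)] / word_counts[x]
```

With this layout, a pair that co-occurs often but has an unlikely morpheme combination is still scored near 0, because f acts as a multiplicative gate.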

FIG. 7 is a flowchart illustrating an example of a speech recognition method.

The speech recognition apparatus 100 of FIG. 1 performs continuous word recognition on the input speech to detect first word lattice information (710).

The speech recognition apparatus 100 detects second word lattice information through phoneme string matching between the phoneme string recognized from the input speech and the word branches included in the word list (720).

To detect the second word lattice information, a matching score may be calculated by matching the phoneme string recognized from the input speech against the phoneme strings of the words in the word list, and an acoustic model may then be used to rescore those matched second words whose matching score is at or above a threshold. After that, the second words may be merged using a language model so that two or more second words can be recognized as a group during speech recognition. A phone confusion matrix may be used when calculating the matching score between the recognized phoneme string and the phoneme strings of the words in the word list.
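One common way to use a phone confusion matrix in such matching is a weighted edit distance, where substituting easily confused phonemes costs less than substituting unrelated ones. The sketch below takes that approach as an assumption; the description does not specify the matching algorithm. The names `match_cost` and `matching_words`, the dictionary form of the confusion matrix, the romanized phoneme symbols, and the choice to define the matching score as the negated cost are all illustrative, and the sketch compares whole sequences rather than scanning a word across a longer utterance.

```python
def match_cost(word_phones, heard_phones, confusion, indel=1.0):
    """Weighted edit distance between two phoneme sequences."""
    def sub(a, b):
        if a == b:
            return 0.0
        # Confusable pairs are cheap; unlisted pairs cost a full 1.0.
        return confusion.get((a, b), 1.0)

    n, m = len(word_phones), len(heard_phones)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * indel
    for j in range(1, m + 1):
        dist[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j] + indel,      # phoneme missing from the input
                dist[i][j - 1] + indel,      # extra phoneme in the input
                dist[i - 1][j - 1] + sub(word_phones[i - 1], heard_phones[j - 1]),
            )
    return dist[n][m]


def matching_words(word_list, heard_phones, confusion, threshold):
    """Keep words whose matching score (negated cost) is >= threshold."""
    return [word for word, phones in word_list.items()
            if -match_cost(phones, heard_phones, confusion) >= threshold]
```

With a low substitution cost for the pair ("d", "r"), an input recognized as "r o m a" would keep both '로마' and '도마' as candidates, similar to the candidate sets shown for FIG. 3.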

The speech recognition apparatus 100 combines the first word lattice information and the second word lattice information to generate integrated lattice information (730). The speech recognition apparatus 100 then performs secondary speech recognition using the integrated lattice information to generate a sentence for the input speech (740).

FIG. 8A shows an example in which the speech recognition method according to an embodiment is applied when there is an utterance error at the beginning of the input speech, and FIG. 8B shows an example in which it is applied when the speech recognizer has an insufficient language model.

FIG. 8A shows the case where there is an utterance error at the beginning of the input speech. Assume that the correct recognition result for the input speech, that is, what the user intended to say, is "AB···". Here "···" stands for the words uttered after the first words A and B, which are omitted.

The first word lattice shows that, because the beginning of the utterance was misrecognized as "C", the application of the language model caused the next part of the utterance to be recognized as "D". The second word lattice also shows that the correct word "A" does not appear at the beginning of the utterance, which was misrecognized as "C". However, in the second word lattice, even though "A" did not appear in the erroneous beginning, the following part can still be correctly recognized as "B" because it is not constrained by the language model.

Meanwhile, the first word lattice covers the entire utterance, whereas the second word lattice covers only part of it, so the second word lattice cannot be guaranteed to output every word in the utterance.

The integrated lattice combines both the first word lattice and the second word lattice. Therefore, when final speech recognition is performed using the integrated lattice according to an embodiment, even if the beginning of the utterance is misrecognized as "C" because of an utterance error, the rest of the recognition result need not be affected by that misrecognition. Performing speech recognition with the integrated lattice can thus reduce the additional recognition errors that an utterance error would otherwise cause.

FIG. 8B shows the case where, even though the utterance contains no particular error, the language model does not have enough data to cover the user's varied utterances. The first word lattice shows that the word "B" failed to appear after "A" because of the insufficient language model, so the first word lattice outputs "AD···". The second word lattice, however, is not constrained by the language model, so "AB···" appears in full.

Because the integrated lattice contains both the first word lattice and the second word lattice, the final speech recognition result can be the correct answer, "AB···".

Thus, according to the speech recognition apparatus and method of an embodiment, when a user utters sentences of various forms, a language model with limited data can be complemented and the probability of misrecognizing speech because of the existing language model can be reduced, which improves speech recognition performance.

One aspect of the present invention may be implemented as computer-readable code on a computer-readable recording medium. The code and code segments that implement the program can be readily inferred by computer programmers in the field. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical disks. The computer-readable recording medium may also be distributed over networked computer systems so that the computer-readable code is stored and executed in a distributed manner.

The above description is only one embodiment of the present invention, and a person of ordinary skill in the art to which the present invention pertains will be able to implement it in modified forms without departing from its essential characteristics. Therefore, the scope of the present invention should not be limited to the embodiment described above but should be construed to include the various embodiments that fall within a scope equivalent to what is recited in the claims.

Claims

A speech recognition apparatus comprising:
a first speech recognition unit for detecting first word lattice information by performing continuous word recognition on input speech;
a second word lattice information detection unit for detecting second word lattice information through phoneme string matching between a phoneme string recognized from the input speech and word branches included in a word list;
a merging unit for merging the first word lattice information and the second word lattice information to generate integrated lattice information; and
a secondary speech recognition unit for performing secondary speech recognition using the integrated lattice information to generate a sentence for the input speech,
wherein the merging unit includes, in the integrated lattice information, information for distinguishing whether lattice information included in the integrated lattice information is the first word lattice information or the second word lattice information.

The apparatus of claim 1,
wherein the second word lattice information detection unit comprises:
a phoneme string recognition unit for recognizing a phoneme string from the input speech;
a word matching unit for calculating a matching score by performing phoneme string matching between the recognized phoneme string and words included in the word list;
a rescoring unit for using an acoustic model to rescore, among the matched second words, second words whose matching score is at or above a threshold; and
a second word integration unit for merging the second words using a language model so that two or more of the second words are recognized as a group during speech recognition.

The apparatus of claim 2,
wherein the word matching unit calculates the matching score using a phoneme confusion matrix.

The apparatus of claim 2,
wherein, when two or more second words are merged, the second word integration unit calculates a score for the merged second words so that the score is proportional to the acoustic model matching score (Score_Acoustic) of each word in the merged second words, the probability (Score_Language) that the two or more merged second words are adjacent to each other, and the number of acoustic frames (#Frame) corresponding to the two or more merged second words.

The apparatus of claim 1,
wherein the merging unit removes the portion of the second word branches of the second word lattice information that overlaps with the first word lattice information and inserts the remaining non-overlapping second word branches into the first word lattice information, thereby merging the first word lattice information and the second word lattice information.

(Deleted)

The apparatus of claim 1,
wherein the secondary speech recognition unit performs speech recognition
using a first language model for connections between word branches belonging to the first word lattice information, and using a second language model that is looser than the first language model for connections between a first word branch belonging to the first word lattice information and a second word branch belonging to the second word lattice information.

The apparatus of claim 7,
wherein the first language model is an n-gram language model and the second language model is a co-occurrence language model that models the probability that two words occur together within a predetermined distance.

The apparatus of claim 8,
wherein the second language model is designed to approach 1 when the probability that two morphemes are connected is high and to approach 0 when that probability is low.

The apparatus of claim 7,
wherein the first language model is a language model that connects words in consideration of both the words and the morphemes of the words, and the second language model is a language model that connects words in consideration of the morphemes of the words.

A speech recognition method comprising:
detecting first word lattice information by performing continuous word recognition on input speech;
detecting second word lattice information through phoneme string matching between a phoneme string recognized from the input speech and word branches included in a word list;
combining the first word lattice information and the second word lattice information to generate integrated lattice information; and
performing secondary speech recognition using the integrated lattice information to generate a sentence for the input speech,
wherein the generating of the integrated lattice information comprises
including, in the integrated lattice information, information for distinguishing whether lattice information included in the integrated lattice information is the first word lattice information or the second word lattice information.

The method of claim 11,
wherein the detecting of the second word lattice information comprises:
recognizing a phoneme string from the input speech;
calculating a matching score by performing phoneme string matching between the recognized phoneme string and words included in the word list;
using an acoustic model to rescore, among the matched second words, second words whose matching score is at or above a threshold; and
merging the second words using a language model so that two or more of the second words are recognized as a group during speech recognition.

The method of claim 12,
wherein, in the calculating of the matching score by matching the recognized phoneme string against the phoneme strings of the words included in the word list, the matching score is calculated using a phoneme confusion matrix.

The method of claim 12,
wherein the merging of the second words comprises,
when two or more second words are merged, calculating a score for the merged second words so that the score is proportional to the acoustic model matching score (Score_Acoustic) of each word in the merged second words, the probability (Score_Language) that the two or more merged second words are adjacent to each other, and the number of acoustic frames (#Frame) corresponding to the two or more merged second words.

The method of claim 11,
wherein the generating of the integrated lattice information comprises:
removing the portion of the second word branches of the second word lattice information that overlaps with the first word lattice information; and
inserting the remaining non-overlapping second word branches into the first word lattice information, thereby merging the first word lattice information and the second word lattice information.

The method of claim 11,
wherein the performing of the secondary speech recognition using the integrated lattice information to generate the sentence for the input speech comprises
performing speech recognition using a first language model for connections between word branches belonging to the first word lattice information, and using a second language model that is looser than the first language model for connections between a first word branch belonging to the first word lattice information and a second word branch belonging to the second word lattice information.

The method of claim 16,
wherein the first language model is an n-gram language model and the second language model is a co-occurrence language model that models the probability that two words occur together within a predetermined distance.

A speech recognition apparatus comprising:
a first speech recognition unit for detecting first word lattice information by performing continuous word recognition on input speech;
a second word lattice information detection unit for detecting second word lattice information through phoneme string matching between a phoneme string recognized from the input speech and word branches included in a word list;
a merging unit for merging the first word lattice information and the second word lattice information to generate integrated lattice information; and
a search application execution unit for performing a search for a voice file using the integrated lattice information.

The apparatus of claim 18,
wherein the second word lattice information detection unit comprises:
a phoneme string recognition unit for recognizing a phoneme string from the input speech;
a word matching unit for calculating a matching score by performing phoneme string matching between the recognized phoneme string and words included in the word list;
a rescoring unit for using an acoustic model to rescore, among the matched second words, second words whose matching score is at or above a threshold; and
a second word integration unit for merging the second words using a language model so that two or more of the second words are recognized as a group during speech recognition.

(Deleted)