KR20170109178A

KR20170109178A - Method of detecting a misperception section in speech recognition on natural language

Info

Publication number: KR20170109178A
Application number: KR1020160032897A
Authority: KR
Inventors: 오유리; 박전규; 이윤근
Original assignee: 한국전자통신연구원
Priority date: 2016-03-18
Filing date: 2016-03-18
Publication date: 2017-09-28

Abstract

According to an embodiment of the present invention, a method for detecting a misrecognized section in natural language speech recognition comprises the following steps: extracting a feature vector from externally input speech; performing a first Viterbi decoding on the feature vector using an acoustic model and a language model; performing a second Viterbi decoding on the feature vector using the acoustic model; and comparing a first string and first time information obtained by the first Viterbi decoding with a second string and second time information obtained by the second Viterbi decoding. The weight for the language model when performing the second Viterbi decoding may be zero.

Description

The present invention relates to speech recognition and, more particularly, to a method for detecting a misrecognized section in natural language speech recognition.

In speech recognition, an utterance verification method is generally applied to decide whether a recognition result should be accepted. Utterance verification typically computes a confidence score for the recognized string. Various methods can be used for this, such as a likelihood-based calculation using anti-models (anti-phoneme models) or a method using string probabilities within the speech recognition search space.

In speech recognition, the result for an input utterance is generally produced using the probabilities of an acoustic model and a language model. When the input speech is inaccurate, or when its characteristics differ from those of the trained acoustic model, the acoustic model probability is low and the result is heavily influenced by the language model. In that case, a string with a high language model probability is output, and misrecognized strings appear as a cluster. Detecting misrecognized cluster sections is therefore very important for improving the reliability of speech recognition.

The technical idea of the present invention provides a method for efficiently predicting the section in which a misrecognized cluster occurs in natural language speech recognition.

A method for detecting a misrecognized cluster section in natural language speech recognition according to an embodiment of the present invention includes: extracting a feature vector from externally input speech; performing a first Viterbi decoding on the feature vector using an acoustic model and a language model; performing a second Viterbi decoding on the feature vector using the acoustic model; and comparing a first string and first time information obtained by the first Viterbi decoding with a second string and second time information obtained by the second Viterbi decoding. The weight for the language model may be non-zero when the first Viterbi decoding is performed and zero when the second Viterbi decoding is performed.

According to an embodiment of the present invention, a method can be provided for efficiently predicting the section in which a misrecognized cluster occurs in natural language speech recognition.

FIG. 1 is a block diagram illustrating a speech recognition system according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the schematic operation of misrecognized cluster section detection in a speech recognition system according to an embodiment of the present invention.
FIG. 3 is a table showing an example of misrecognized cluster section detection according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a speech recognition operation according to an embodiment of the present invention.
FIG. 5 is a block diagram illustrating a mobile device to which a speech recognition system according to an embodiment of the present invention is applied.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and are intended to provide further explanation of the claimed invention. Reference numerals are indicated in detail in the preferred embodiments of the present invention, examples of which are shown in the accompanying drawings. Wherever possible, the same reference numerals are used in the description and drawings to refer to the same or like parts.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. Terms such as "a" or "an" should be understood to include plural forms unless clearly indicated otherwise. Terms such as "comprising" or "consisting of" specify the presence of the stated features, steps, operations, components, and/or elements, and do not exclude the presence of one or more additional features, steps, operations, components, elements, and/or groups thereof.

FIG. 1 is a block diagram illustrating a speech recognition system 100 according to an embodiment of the present invention. Referring to FIG. 1, the speech recognition system 100 may include a feature extraction unit 110, a search unit 120, a recognition result output unit 130, and a database 140.

For example, the speech recognition system 100 may be a module that is provided in a mobile device such as a smartphone or tablet PC, or in a computer, and that runs as software or firmware. Alternatively, the speech recognition system 100 may be a device such as a semiconductor chip manufactured as hardware for a specific purpose.

Referring to FIG. 1, the feature extraction unit 110 may extract feature vectors from externally input speech. For example, before the feature information is extracted, preprocessing such as removing echo or noise from the input speech signal may be performed. The feature information may be, for example, information that effectively represents the digitized speech signal.

The search unit 120 may search for the word sequence most similar to the feature vectors generated by the feature extraction unit 110. For example, both an acoustic model and a language model may be needed to find that word sequence; these are the acoustic model 140 and the language model 150, respectively. For example, the search unit 120 may use the Viterbi algorithm to find the most probable word sequence.

The recognition result output unit 130 may output the word sequence found by the search unit 120 as text. For example, the word sequence may be shown, as characters the user can read, on the display of the mobile device or computer equipped with the speech recognition system 100 according to an embodiment of the present invention.

The acoustic model 140 is a kind of database needed for the search unit 120 to recognize words. For example, the acoustic model 140 may be data obtained by learning, with deep learning techniques, the speech input to the speech recognition system 100 in phoneme units (for example, 'ㄱ', 'ㄴ', 'ㄷ', ..., 'ㅏ', 'ㅑ', 'ㅓ', ...). Because each pronounced phoneme is influenced by its neighboring phonemes, not only a simple phoneme model (Simplified Phoneme Like Unit Model) but also a context-dependent phoneme model (Context-dependent Phone Model) may be used. A training procedure may be used to estimate the parameters of each acoustic model. Training uses large amounts of speech data collected in varied environments so that the resulting acoustic model is insensitive to speaker characteristics, environmental noise, and the like.

The language model 150 may be statistical data that captures the relationships between words in a given sentence so that they can be reflected in speech recognition. The language model 150 may include data learned from how words and vocabulary are used. For example, the language model 150 may apply a statistical model that indicates how probable it is for particles such as '는', '가', or '를' to appear after the word '아버지' ("father"). This builds on the observation that, when several words are given in order, the next word is strongly related to the preceding words.
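As a hedged illustration of the kind of statistics described above, the sketch below estimates bigram probabilities by simple counting. The toy corpus, the function name `train_bigram`, and the maximum-likelihood estimate are assumptions for illustration only; the patent does not specify how the language model 150 is built.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Estimate P(next | prev) by maximum likelihood from tokenized sentences."""
    pair_counts = defaultdict(Counter)
    for tokens in sentences:
        # Count each adjacent (previous token, next token) pair.
        for prev, nxt in zip(tokens, tokens[1:]):
            pair_counts[prev][nxt] += 1
    model = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        model[prev] = {nxt: count / total for nxt, count in counter.items()}
    return model

# Hypothetical toy corpus: which particle follows '아버지' ("father")?
corpus = [
    ["아버지", "는", "집", "에"],
    ["아버지", "가", "오셨다"],
    ["아버지", "는", "바쁘다"],
    ["아버지", "를", "만났다"],
]
bigram = train_bigram(corpus)
particle_probs = bigram["아버지"]  # probabilities of '는', '가', '를' after '아버지'
```

A production language model would add smoothing and longer contexts; this sketch only shows where such conditional probabilities come from.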

In addition, for the word sequence search, a pronunciation rule database (not shown) may be used to reflect general phonological phenomena such as consonant assimilation and palatalization, and a vocabulary dictionary (not shown) may be used to register the recognition vocabulary itself.

According to an embodiment of the present invention, the acoustic model 140 and the language model 150 are weighted differently in order to detect misrecognized cluster portions in speech recognition. For example, the search unit 120 may perform Viterbi decoding on the input speech using both the acoustic model 140 and the language model 150. The language model weight used here reconciles the range of the acoustic model probabilities with the range of the language model probabilities, and its optimal value may be determined in advance during training or development. The search unit 120 may then perform Viterbi decoding using only the acoustic model 140. That is, by setting the language model weight to 0, speech recognition is performed with the language model probability ignored. In other words, the search space of large-vocabulary speech recognition is still used, but the recognized string reflects only the acoustic model probabilities.
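The effect of the language model weight can be sketched with a toy Viterbi search. This is a minimal illustration, not the patent's decoder: the slot-based search space, the function `viterbi`, the log-domain score "acoustic score + weight × language model score", and all numbers are assumptions chosen so that the two decodings disagree.

```python
import math

def viterbi(slots, lm_logprob, lm_weight):
    """Return (best_path, best_score) over a sequence of time slots.

    slots      : list of dicts mapping token -> acoustic log-likelihood
    lm_logprob : function (prev_token, token) -> log P(token | prev_token)
    lm_weight  : scale applied to the language model term (0 ignores it)
    """
    # best[token] = (score, path) of the best partial path ending in token
    best = {tok: (am, [tok]) for tok, am in slots[0].items()}
    for slot in slots[1:]:
        new_best = {}
        for tok, am in slot.items():
            # Pick the predecessor that maximizes the combined score.
            score, path = max(
                ((s + lm_weight * lm_logprob(prev, tok), p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[tok] = (score + am, path + [tok])
        best = new_best
    score, path = max(best.values(), key=lambda item: item[0])
    return path, score

# Hypothetical toy data: two time slots with two candidate tokens each.
slots = [{"a": -1.0, "b": -1.2}, {"c": -1.0, "d": -1.5}]
bigram = {("a", "c"): 0.05, ("a", "d"): 0.05, ("b", "c"): 0.1, ("b", "d"): 0.9}

def lm_logprob(prev, tok):
    return math.log(bigram[(prev, tok)])

first_path, _ = viterbi(slots, lm_logprob, lm_weight=1.0)   # acoustic + language model
second_path, _ = viterbi(slots, lm_logprob, lm_weight=0.0)  # acoustic model only
```

With the weight set to 1.0 the strong bigram pulls the search to `["b", "d"]`, whereas with the weight set to 0 the acoustically best tokens `["a", "c"]` win; the same search code serves both passes, which mirrors the idea of reusing the search space while ignoring the language model probability.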

The result decoded with both the acoustic model 140 and the language model 150 (that is, with a non-zero language model weight) is then compared with the result decoded with only the acoustic model 140 (that is, with a language model weight of 0) to detect misrecognized cluster sections. This detection may be carried out directly in the recognition result output unit 130 or by a separate comparison unit (not shown). The decoding result with a language model weight of 0 may differ considerably in syllable count from the decoding result with a non-zero weight. Therefore, within the aligned string sections, a section whose syllable counts differ greatly can be identified as a misrecognized cluster section. The outline of this detection is illustrated in FIGS. 2 and 3.

FIG. 2 is a diagram illustrating the schematic operation of misrecognized cluster section detection in a speech recognition system according to an embodiment of the present invention. FIG. 3 is a table showing an example of misrecognized cluster section detection according to an embodiment of the present invention. FIGS. 2 and 3 are described together below.

First, in step S110, speech may be received from outside (for example, from a user). As an example, assume the user says "네, 좀 많이 부담이 좀 되는 것 같아요" (roughly, "Yes, I think it is somewhat of a burden"). Although not shown in the drawing, feature vectors may be extracted from the received speech after preprocessing such as echo cancellation or noise removal.

In step S120, a first Viterbi decoding operation may be performed. The first Viterbi decoding applies weights to both the acoustic model and the language model; that is, the language model weight is not 0. As an example, the first Viterbi decoding may recognize the sentence "네, 그런 것 같아요" ("Yes, I think so"). Here, "좀 많이 부담이 되는 것" has been misrecognized as "그런 것". This can be interpreted as follows: because the utterance in this section was unclear or mixed with noise, the acoustic model probability was low and the language model probability dominated.

In step S130, a second Viterbi decoding operation may be performed. The second Viterbi decoding applies a weight only to the acoustic model; that is, the language model weight is 0. With the language model probability ignored, speech recognition may yield, as an example, the string "네예 금 퓌 빨리그라님니 런 되예예걷같애렷". Because the language model probability is excluded, the recognized string may differ from what was actually spoken, but the number of syllables obtained is close to the number actually uttered.
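Since the comparison relies on syllable counts, a small helper for counting Hangul syllables may make the idea concrete. The sketch below is an assumption about how counting could be done (by testing for precomposed syllable blocks in the Unicode range U+AC00 to U+D7A3); the patent does not describe a counting procedure, and the pairing of strings with the middle section of the utterance is inferred from the example.

```python
import unicodedata

def count_hangul_syllables(text):
    """Count precomposed Hangul syllable blocks (U+AC00..U+D7A3) in text."""
    # NFC composes any decomposed jamo sequences into single syllable blocks.
    normalized = unicodedata.normalize("NFC", text)
    return sum(1 for ch in normalized if "\uac00" <= ch <= "\ud7a3")

spoken = "좀 많이 부담이 좀 되는"            # middle section as actually uttered
with_lm = "그런"                              # first decoding (language model applied)
acoustic_only = "예금퓌빨리그라님니런되예예"  # second decoding (language model weight 0)

counts = {
    "spoken": count_hangul_syllables(spoken),
    "with_lm": count_hangul_syllables(with_lm),
    "acoustic_only": count_hangul_syllables(acoustic_only),
}
```

Spaces and punctuation are ignored, so the count depends only on the syllable blocks themselves.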

In step S140, the first and second Viterbi decoding results may be compared. That is, the string and time information obtained in step S120 are compared with the string and time information obtained in step S130 to detect misrecognized cluster sections. The whole sentence can be divided into sections either by time or by using the time information to compare when the corresponding strings are aligned. For example, the first Viterbi decoding result may be divided into three sections, "네 // 그런 // 것 같아요", and the second Viterbi decoding result into three sections, "네 // 예금퓌빨리그라님니런되예예 // 걷 같애렷".
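One way to divide both results into the same sections is to cut them at shared time boundaries. The following is a sketch under stated assumptions: the token timestamps, the boundary times, the midpoint rule, and the function `split_by_boundaries` are all hypothetical, since the patent only states that time information is used.

```python
from bisect import bisect_right

def split_by_boundaries(tokens, boundaries):
    """Group timed tokens into sections separated by the given cut times.

    tokens     : list of (text, start_sec, end_sec)
    boundaries : sorted list of interior cut times in seconds
    Each token is assigned to a section by the midpoint of its time span.
    """
    sections = [[] for _ in range(len(boundaries) + 1)]
    for text, start, end in tokens:
        midpoint = (start + end) / 2.0
        sections[bisect_right(boundaries, midpoint)].append(text)
    return [" ".join(section) for section in sections]

# Hypothetical timestamps for the two decoding results in the example.
first_result = [("네", 0.0, 0.4), ("그런", 0.4, 2.6), ("것", 2.6, 2.9), ("같아요", 2.9, 3.6)]
second_result = [("네", 0.0, 0.4), ("예금퓌빨리그라님니런되예예", 0.4, 2.6),
                 ("걷", 2.6, 2.9), ("같애렷", 2.9, 3.6)]
boundaries = [0.4, 2.6]  # assumed cut times shared by both results

first_sections = split_by_boundaries(first_result, boundaries)
second_sections = split_by_boundaries(second_result, boundaries)
```

With these assumed times, both results fall into the three sections quoted in the example, so they can be compared section by section.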

Comparing the two results shows that the syllable counts differ greatly in the second section. The second section can therefore be detected as a misrecognized cluster section. Although this embodiment shows the first Viterbi decoding being performed before the second, the second Viterbi decoding may be performed first, or the two may be performed simultaneously.
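The section-wise comparison can be sketched as follows. The threshold `min_diff`, the function names, and the rule "flag a section when the syllable counts differ by at least `min_diff`" are assumptions; the patent says only that a section with a large difference in syllable count is detected, without giving a numeric criterion.

```python
def count_hangul_syllables(text):
    """Count precomposed Hangul syllable blocks (U+AC00..U+D7A3) in text."""
    return sum(1 for ch in text if "\uac00" <= ch <= "\ud7a3")

def find_cluster_sections(first_sections, second_sections, min_diff=3):
    """Return indices of aligned sections whose syllable counts differ greatly.

    min_diff is a hypothetical threshold, not a value taken from the patent.
    """
    flagged = []
    for index, (first, second) in enumerate(zip(first_sections, second_sections)):
        diff = abs(count_hangul_syllables(first) - count_hangul_syllables(second))
        if diff >= min_diff:
            flagged.append(index)
    return flagged

first_sections = ["네", "그런", "것 같아요"]                      # language model weight non-zero
second_sections = ["네", "예금퓌빨리그라님니런되예예", "걷 같애렷"]  # language model weight 0

flagged = find_cluster_sections(first_sections, second_sections)
```

Here the per-section counts are (1, 1), (2, 13), and (4, 4), so only the middle section (index 1, the second section) is flagged.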

With this method, the misrecognized cluster sections of the input speech can be found easily, without complex algorithms or computation, simply by performing Viterbi decoding with the language model weight set to 0.

FIG. 4 is a flowchart illustrating a speech recognition operation according to an embodiment of the present invention.

In step S210, feature vectors may be generated from feature information extracted from the input speech. The feature information may be information that effectively represents the digitized speech signal.

In step S220, a first Viterbi decoding operation may be performed. For example, the first Viterbi decoding may decode the feature vectors generated in step S210 using the acoustic model and the language model, with a non-zero language model weight. This step may be executed, for example, by the search unit 120 shown in FIG. 1, using the acoustic model 140 and the language model 150.

In step S230, a second Viterbi decoding operation may be performed. For example, the second Viterbi decoding may decode the feature vectors generated in step S210 using the acoustic model, with the language model weight set to 0. That is, speech recognition is performed with the language model probability ignored. This step may be executed, for example, by the search unit 120 shown in FIG. 1, using the acoustic model 140.

In step S240, the first and second Viterbi decoding results may be compared. For example, this step may be performed by the recognition result output unit 130 shown in FIG. 1 or by a separately provided comparison unit (not shown). Because the language model is not considered in the second Viterbi decoding, the two results may differ greatly in syllable count. Accordingly, a comparison may be performed using the string and time information obtained in the first Viterbi decoding and the string and time information obtained in the second.

In step S250, misrecognized cluster section detection may be performed, based on the comparison result of step S240. For example, as shown in the table of FIG. 3, the first and second Viterbi decoding results differ greatly in syllable count, so the misrecognized cluster section can be detected more easily. If a misrecognized cluster section is detected in this step, the user may be prompted to input the speech again. Such a prompt may be delivered, for example, through the speaker or display of the mobile device.
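The flow of steps S220 to S250 can be outlined in code. This is only a structural sketch: `decode` and `find_clusters` are injected placeholders, the stub implementations below return canned strings rather than performing recognition, and the default weight of 8.0 and the prompt text are invented for illustration.

```python
def recognize_with_check(features, decode, find_clusters, lm_weight=8.0):
    """Run both decodings and decide whether to accept or ask for re-entry.

    decode(features, weight) and find_clusters(first, second) are supplied
    by the caller; lm_weight=8.0 is a hypothetical default.
    """
    first = decode(features, lm_weight)       # S220: acoustic + language model
    second = decode(features, 0.0)            # S230: language model weight 0
    clusters = find_clusters(first, second)   # S240 and S250: compare and detect
    if clusters:
        return {"action": "reprompt", "message": "Please say that again.",
                "sections": clusters}
    return {"action": "accept", "text": first, "sections": []}

# Stubs standing in for a real decoder and comparison unit (assumptions).
def fake_decode(features, weight):
    return "네 그런 것 같아요" if weight > 0 else "네 예금퓌빨리그라님니런되예예 걷 같애렷"

def fake_find_clusters(first, second):
    # Crude whole-string length check; a real comparison would work per section.
    diff = abs(len(first.replace(" ", "")) - len(second.replace(" ", "")))
    return [1] if diff >= 3 else []

result = recognize_with_check(None, fake_decode, fake_find_clusters)
```

Passing the decoder and comparison as parameters keeps the control flow independent of any particular recognizer, and reflects that the two decodings may run in either order.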

FIG. 5 is a block diagram illustrating a mobile device 1000 to which a speech recognition system according to an embodiment of the present invention (shown in the drawing as speech recognition module 1610) is applied. For example, the mobile device 1000 may be any of various electronic devices such as a smartphone, a tablet PC, or a digital camera. Referring to FIG. 5, the mobile device 1000 may be configured to support the MIPI (mobile industry processor interface) standard or the eDP (Embedded DisplayPort) standard. The mobile device 1000 may include an application processor 1100, a display unit 1200, an image processing unit 1300, a data storage 1400, a wireless transceiver unit 1500, a DRAM 1600, an AHRS (Attitude Heading Reference System) 1700, and an infrared sensor 1800.

The application processor 1100 may control the overall operation of the mobile device 1000. The application processor 1100 may include a DSI host that interfaces with the display unit 1200 and a CSI host that interfaces with the image processing unit 1300.

The display unit 1200 may include a display panel 1210 and a DSI (display serial interface) peripheral circuit 1220. The display panel 1210 may display image data. The DSI host embedded in the application processor 1100 may perform serial communication with the display panel 1210 through the DSI. The DSI peripheral circuit 1220 may include a timing controller, a source driver, and the like needed to drive the display panel 1210.

The image processing unit 1300 may include a camera module 1310 and a CSI (camera serial interface) peripheral circuit 1320. The camera module 1310 and the CSI peripheral circuit 1320 may include a lens, an image sensor, and the like. Image data generated by the camera module 1310 may be processed by an image processor, and the processed image may be transferred to the application processor 1100 through the CSI.

The data storage 1400 may include embedded storage and removable storage, which may communicate with the application processor 1100 through an M-PHY layer. For example, the application processor 1100 and the removable storage may communicate using various card protocols (for example, UFDs, MMC, eMMC, SD (secure digital), mini SD, Micro SD). The embedded storage and removable storage may be built from non-volatile memory devices such as flash memory.

The wireless transceiver unit 1500 may include an antenna 1510, an RF unit 1520, and a modem 1530. The modem 1530 is shown communicating with the application processor 1100 through the M-PHY layer. Depending on the embodiment, however, the modem 1530 may be embedded in the application processor 1100.

Various applications, firmware, software, and the like that are run by the application processor 1100 may be loaded into the DRAM 1600. For example, the speech recognition module 1610 according to an embodiment of the present invention may be loaded into the DRAM 1600 and run by the application processor 1100. Although DRAM is shown as an example, the invention is not limited to it; any of various memory devices into which the speech recognition module 1610 can be loaded and run by the application processor 1100 may be used.

The speaker 1700 may receive the speech to be recognized. Speech input by the user through the speaker 1700 is preprocessed, for example by echo cancellation or noise removal, and then processed by the speech recognition module 1610.

It will be apparent to those skilled in the art that the structure of the present invention can be modified or changed in various ways without departing from its scope or technical spirit. In view of the foregoing, the present invention is considered to include such modifications and changes provided they fall within the scope of the following claims and their equivalents.

100: Speech recognition system
110: Feature extraction unit
120: Search unit
130: Recognition result output unit
140: Acoustic model
150: Language model

Claims

1. A method for detecting a misrecognized cluster section in speech recognition using a device having a speech recognition system, the method comprising:
extracting a feature vector from externally input speech;
performing a first Viterbi decoding on the feature vector using an acoustic model and a language model;
performing a second Viterbi decoding on the feature vector using the acoustic model; and
comparing a first string and first time information obtained by the first Viterbi decoding with a second string and second time information obtained by the second Viterbi decoding,
wherein the weight for the language model is non-zero when performing the first Viterbi decoding and zero when performing the second Viterbi decoding.