KR101483945B1

KR101483945B1 - Method for spoken semantic analysis with speech recognition and apparatus thereof

Info

Publication number: KR101483945B1
Application number: KR1020130129046A
Authority: KR
Inventors: 김영준
Original assignee: 에스케이텔레콤 주식회사
Priority date: 2013-10-29
Filing date: 2013-10-29
Publication date: 2015-01-19

Abstract

The present invention relates to a speech recognition method capable of semantic analysis and to a speech recognition apparatus therefor. A semantic search is carried out at the same time as a voice search to generate a combined network, and the combined network is used to perform speech recognition and semantic analysis on voice data simultaneously. To this end, a speech recognition apparatus according to an embodiment of the present invention comprises: a storage unit that stores an acoustic model, a language model, and a semantic model; and a speech recognition unit that performs a voice search on input voice data using the acoustic model and the language model, performs a semantic search using the acoustic model and the semantic model, combines the nodes obtained from the voice search and the semantic search to generate a combined network, and performs a final search using the combined network to output a speech recognition result and a semantic analysis result.

Description

The present invention relates to a speech recognition method capable of semantic analysis and a speech recognition apparatus therefor, and more particularly to a method and apparatus that perform a semantic search at the same time as a voice search to generate a combined network and use that network to perform speech recognition and semantic analysis on voice data simultaneously.

The content described in this section merely provides background information on the present embodiment and does not constitute prior art.

A speech recognition system represents the target domain to be recognized as a single search space (search network) and performs a search that finds, within the constraints of that search space, the word sequence most similar to the input speech signal (voice data).

There are several ways to build the search space, and the approach based on the weighted finite state transducer (WFST) has become widespread. The WFST approach combines the language model, the pronunciation dictionary, the context-dependent acoustic model, and the HMM state network, treats the result as one large weighted automaton, and performs speech recognition by extending classical automata theory to weighted automata and applying it to the recognition network.

In WFST-based algorithms, once every network is represented as an automaton, the automata are optimized using a composition algorithm that merges them, a determinization algorithm, and a minimization algorithm that merges common paths to produce an optimal path.
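
As a rough illustration of the composition step only, the following Python sketch composes two tiny epsilon-free weighted transducers in the tropical semiring, where weights add along a path. The dictionary layout, the function name `compose`, and the toy symbols are assumptions made for this example and are not taken from the patent; determinization and minimization are not shown.

```python
from collections import deque

def compose(a, b):
    """Compose two epsilon-free weighted transducers (weights are added).

    Each transducer is a dict with 'start', 'finals' (a set) and 'arcs',
    where 'arcs' maps a state to a list of (input, output, weight, next_state).
    """
    start = (a["start"], b["start"])
    arcs, finals = {}, set()
    queue, seen = deque([start]), {start}
    while queue:
        qa, qb = queue.popleft()
        if qa in a["finals"] and qb in b["finals"]:
            finals.add((qa, qb))
        for (inp, mid, w1, na) in a["arcs"].get(qa, []):
            for (mid2, out, w2, nb) in b["arcs"].get(qb, []):
                if mid != mid2:
                    continue  # output of A must match input of B
                nxt = (na, nb)
                arcs.setdefault((qa, qb), []).append((inp, out, w1 + w2, nxt))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return {"start": start, "finals": finals, "arcs": arcs}

# Toy transducers with abstract symbols: A maps "a" to "b", B maps "b" to "c".
first = {"start": 0, "finals": {1}, "arcs": {0: [("a", "b", 0.5, 1)]}}
second = {"start": 0, "finals": {1},
          "arcs": {0: [("b", "c", 1.0, 1), ("x", "y", 2.0, 1)]}}
composed = compose(first, second)
```

In this toy case the composed machine has a single arc that maps "a" to "c" with the two weights added; the non-matching arc of the second transducer is dropped.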

However, conventional WFST-based speech recognition is used only to convert the input speech into text; to analyze the meaning, the converted text has to be processed again with a separate natural language processing (NLP) method. As a result, when an error occurs during speech recognition, it is difficult for the language processing stage, which is a post-processing step, to cope with that error.

Korean Patent Application Publication No. 10-2008-0026951, published March 26, 2008 (Title: Speech recognition method for a robust distant speech recognition system)

The present invention has been proposed to solve the problems of the prior art described above. Its object is to provide a speech recognition method capable of semantic analysis, and a speech recognition apparatus therefor, that perform a semantic search at the same time as a voice search to generate a combined network and use it to perform speech recognition and semantic analysis on voice data simultaneously.

However, the objects of the present invention are not limited to the object described above, and other objects not mentioned will be clearly understood from the description below.

To achieve the object described above, a speech recognition apparatus according to an embodiment of the present invention may comprise: a storage unit that stores an acoustic model, a language model, and a semantic model; and a speech recognition unit that performs a voice search on input voice data using the acoustic model and the language model, performs a semantic search using the acoustic model and the semantic model, combines the nodes obtained from the voice search and the semantic search to generate a combined network, and performs a final search using the combined network to output a speech recognition result and a semantic analysis result.

The storage unit may further store a pronunciation dictionary, and the speech recognition unit may perform the voice search using the acoustic model, the language model, and the pronunciation dictionary.

The semantic model may include one or more items of semantic pattern information, each comprising a morpheme and a semantic tag.

The combined network may be at least one of a finite state network (FSN), a word-pair grammar, and an n-gram network.

When voice data is input, the speech recognition unit may extract feature data from the input voice data and perform the voice search and the semantic search on the extracted feature data.

The speech recognition unit may perform the final search using the combined network and the acoustic model.

The speech recognition unit may generate the speech recognition result by weighting the nodes obtained from the semantic search.

To achieve the object described above, a speech recognition method capable of semantic analysis according to an embodiment of the present invention may comprise: receiving, by a speech recognition apparatus, voice data; performing in parallel, by the speech recognition apparatus, a voice search on the voice data using an acoustic model and a language model and a semantic search using the acoustic model and a semantic model; combining, by the speech recognition apparatus, the nodes obtained from the voice search and the semantic search performed in parallel to generate a combined network; and performing, by the speech recognition apparatus, final speech recognition and semantic analysis using the combined network and the acoustic model.

The method may further comprise, after the receiving step, extracting, by the speech recognition apparatus, feature data from the voice data.

In the parallel performing step, the speech recognition apparatus may perform the voice search and the semantic search on the feature data.

In the parallel performing step, the speech recognition apparatus may carry out speech recognition by assigning weights to the nodes obtained from the semantic search.

The method may further comprise, after the step of performing final speech recognition and semantic analysis, identifying, by the speech recognition apparatus, the corresponding object information using the semantic tag contained in a node obtained from the semantic search, and processing it into an analysis result.
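
The control flow of the method summarized above can be sketched in Python as follows. This is only a structural outline under stated assumptions: the function names and the placeholder bodies of `voice_search`, `semantic_search`, and `final_search` are hypothetical and stand in for the real searches against the acoustic, language, and semantic models, which the patent does not specify at code level.

```python
from concurrent.futures import ThreadPoolExecutor

def voice_search(features):
    # Placeholder: a real implementation would consult the acoustic and language models.
    return [("word", f) for f in features]

def semantic_search(features):
    # Placeholder: a real implementation would consult the acoustic and semantic models.
    return [("tagged", f) for f in features]

def final_search(combined_network, features):
    # Placeholder: a real implementation would search the combined network
    # with the acoustic model; here the network is returned unchanged.
    return {"features": features, "nodes": combined_network}

def recognize(features):
    # Run the two searches in parallel on the same feature data.
    with ThreadPoolExecutor(max_workers=2) as pool:
        voice_future = pool.submit(voice_search, features)
        semantic_future = pool.submit(semantic_search, features)
        voice_nodes = voice_future.result()
        semantic_nodes = semantic_future.result()
    # Combine the nodes from both searches into one network, then search it.
    combined_network = voice_nodes + semantic_nodes
    return final_search(combined_network, features)

result = recognize(["f1", "f2"])
```

A thread pool is used here only to make the parallel step visible; the patent does not state how the parallelism is realized.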

According to the speech recognition method capable of semantic analysis and the speech recognition apparatus therefor of the present invention, a semantic search is performed at the same time as a voice search to generate a combined network, so that speech recognition and semantic analysis of voice data can be performed simultaneously. Because no separate procedure for semantic analysis is needed, the semantic analysis result can be provided more quickly.

FIG. 1 is an exemplary diagram schematically illustrating the operation of a speech recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the main components of a speech recognition apparatus according to an embodiment of the present invention.
FIG. 3 is a block diagram illustrating the main components of a speech recognition unit according to an embodiment of the present invention.
FIG. 4 is a block diagram illustrating the main components of a storage unit according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating a speech recognition method capable of semantic analysis according to an embodiment of the present invention.
FIGS. 6 and 7 are exemplary diagrams illustrating a speech recognition method capable of semantic analysis according to an embodiment of the present invention.

Preferred embodiments, which a person of ordinary skill in the art to which the present invention pertains can readily carry out, are described in detail below with reference to the accompanying drawings. In describing the operating principle of the preferred embodiments, however, detailed descriptions of related well-known functions or configurations are omitted where they could unnecessarily obscure the gist of the present invention. Omitting unnecessary description conveys the core of the invention more clearly. The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail, but this is not intended to limit the invention to those specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope.

In addition, when a component is said to be "connected" or "coupled" to another component, this means that it may be connected or coupled logically or physically. In other words, the component may be directly connected or coupled to the other component, but another component may exist in between, or the connection may be indirect.

The terms used in this specification are used only to describe particular embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. Terms such as "comprise" or "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

A speech recognition method capable of semantic analysis and a speech recognition apparatus therefor according to an embodiment of the present invention will now be described in detail with reference to the drawings. The same reference numerals are used throughout the drawings for parts with similar functions and operations, and redundant descriptions of them are omitted.

FIG. 1 is an exemplary diagram schematically illustrating the operation of a speech recognition apparatus according to an embodiment of the present invention.

Referring to FIG. 1, when voice data is input from a user, the speech recognition apparatus 100 of the present invention performs speech recognition on the input voice data and outputs the recognition result. The speech recognition apparatus 100 can output the semantic analysis result for the voice data together with the speech recognition result. The operation of the speech recognition apparatus 100 is described in more detail later. The speech recognition apparatus 100 may be implemented as a single standalone device. In that case, it may include an input module with a microphone that senses the user's speech signal and generates voice data, and a display module that outputs the speech recognition result and the semantic analysis result.

The speech recognition apparatus 100 may also be implemented as a device embedded in particular hardware. In that case, it can receive voice data from the input module of the host device and pass the semantic analysis and speech recognition results to a display module capable of outputting them. The speech recognition apparatus 100 may also be implemented as a program such as an application. For example, it may be embedded in any of various electronic devices available to users, such as a smartphone, or installed on such a device as a program.

The speech recognition apparatus 100 may also be implemented as a web server. In that case, a user can generate voice data as a file using a user terminal such as a smartphone and send it to the speech recognition apparatus 100 over a communication network, and the speech recognition apparatus 100 receives the voice data transmitted from the user terminal over that network. The speech recognition apparatus 100 then sends the semantic analysis and speech recognition results to the user terminal over the communication network, and the user terminal outputs them through its display module. The speech recognition apparatus 100 may also operate in conjunction with a web server that supports a particular service, such as language learning or call classification, or may be formed integrally with that web server. Finally, the speech recognition apparatus 100 may be implemented as a two-part processing system in which the module that receives speech and the module that recognizes speech are separate pieces of hardware.

The main components and operation of the speech recognition apparatus 100 according to an embodiment of the present invention are described in more detail below.

FIG. 2 is a block diagram showing the main components of a speech recognition apparatus according to an embodiment of the present invention.

Referring to FIG. 2, the speech recognition apparatus 100 according to an embodiment of the present invention may include an interface unit 10, a speech recognition unit 20, and a storage unit 30.

Describing each component in more detail, the interface unit 10 receives the user's voice data and passes it to the speech recognition unit 20, as described above, and provides the user with the semantic analysis and speech recognition results delivered from the speech recognition unit 20. When the speech recognition apparatus 100 is implemented as a standalone device, the interface unit 10 converts the user's speech signal input through the microphone into voice data in digital form and provides the speech recognition result and the semantic analysis result for output through a separate display module. When the speech recognition apparatus 100 is embedded in particular hardware, implemented as a program, or implemented as a web server, the interface unit 10 may receive voice data transmitted through a separate interface or a wired or wireless communication module and transmit the speech recognition result and the semantic analysis result.

When voice data is input through the interface unit 10, the speech recognition unit 20 recognizes the input voice data and generates a speech recognition result. In particular, the speech recognition unit 20 according to an embodiment of the present invention performs semantic analysis on the voice data at the same time as it recognizes the voice data, and generates the results of both.

The storage unit 30 stores and manages the various information used for speech recognition and semantic analysis in the present invention. The storage unit 30 may comprise storage media such as flash memory, a hard disk, multimedia card micro type memory (for example, SD or XD memory), RAM, and ROM.

The speech recognition unit 20 and the storage unit 30 are described in more detail below with reference to FIGS. 3 and 4.

FIG. 3 is a block diagram illustrating the main components of a speech recognition unit according to an embodiment of the present invention, and FIG. 4 is a block diagram illustrating the main components of a storage unit according to an embodiment of the present invention.

Referring first to FIG. 3, the speech recognition unit 20 of the present invention may include a feature extraction module 21, a voice search module 22, a semantic search module 23, a combined network generation module 24, and a final search module 25.

The feature extraction module 21 extracts useful features from the input voice data: it extracts feature data that reflects the characteristics of human hearing and passes it to the voice search module 22 and the semantic search module 23.

To do so, the feature extraction module 21 first performs analog-to-digital conversion (ADC), converting the speech signal, a continuous analog sound signal, into discrete digital voice data values. The feature extraction module 21 then extracts feature data from the digitized voice data. Here, feature data means the speech and acoustic features of the digitized voice data in the frequency domain. For example, duration, energy, pitch, power, linear predictive coding (LPC) coefficients, formants (the constituent components of vowels), RFC (Rising Falling Connection)/Tilt, spectrum, and VOT (voice onset time) may be extracted as feature data.
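
As a small illustration of one of the features listed above, the following Python sketch computes short-time energy over overlapping frames of an already digitized signal. The function name `frame_energies`, the frame length, and the hop size are assumptions chosen for this example; the patent does not specify how energy is computed.

```python
def frame_energies(samples, frame_len=4, hop=2):
    """Return the short-time energy (sum of squared samples) of each frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame))
    return energies

# Toy digitized signal whose amplitude grows over time.
signal = [0, 1, 0, -1, 0, 2, 0, -2]
energies = frame_energies(signal)
```

Real front ends use much longer frames (typically tens of milliseconds of samples) and usually apply a window function before computing such features.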

This feature data may be extracted using any one of the following techniques: MFCC (Mel-Frequency Cepstral Coefficients), LPCC (Linear Prediction Cepstral Coefficients), PLPCC (Perceptual Linear Prediction Cepstral Coefficients), EIH (Ensemble Interval Histogram), or SMC (Short-time Modified Coherence).

The voice search module 22 performs a voice search using the acoustic model database 31 and the language model database 32 of the storage unit 30 and passes the nodes it finds to the combined network generation module 24. The voice search module 22 may also use the pronunciation dictionary database 33, which stores the pronunciation dictionary, to convert written phonemes into pronounced phonemes. More specifically, the voice search module 22 first compares the feature data of the voice data received from the feature extraction module 21 with the acoustic model stored in the acoustic model database 31. It then extracts the phoneme sequence corresponding to the feature data and uses the language model database 32 to extract the words corresponding to that phoneme sequence. In this way, it extracts one or more nodes that can appear in the combined network.

The semantic search module 23 performs a semantic search using the acoustic model database 31 and the semantic model database 32 and passes the nodes it finds to the combined network generation module 24. More specifically, the semantic search module 23 first compares the feature data of the voice data received from the feature extraction module 21 with the acoustic model stored in the acoustic model database 31. It then extracts the phoneme sequence corresponding to the feature data and uses the semantic model database 32 to extract the semantic pattern information corresponding to that phoneme sequence. In this way, it extracts one or more nodes that can appear in the combined network.

The combined network generation module 24 combines the nodes derived by the voice search module 22 with the nodes derived by the semantic search module 23 to generate the combined network.

The combined network may be at least one of a finite state network (FSN), a word-pair grammar, and an n-gram network. A network here means a structure that links each unit to the words that can follow it, with the links either fixed by rules or weighted by statistical probability.

A word-pair grammar links a word only to the words that can appear after it. For example, "먹고" ("eat", connective form) followed by "싶습니다" ("want to") is a valid link, giving "want to eat", whereas the reverse order cannot be linked; the search exploits this constraint.
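
A word-pair grammar of this kind can be sketched as a table of allowed successors. In the Python example below, the table `ALLOWED_NEXT` and the function `is_allowed` are hypothetical names introduced for illustration; only the word pair itself comes from the text.

```python
# Hypothetical word-pair table: each word maps to the set of words allowed to follow it.
ALLOWED_NEXT = {
    "먹고": {"싶습니다"},
    "싶습니다": set(),
}

def is_allowed(sequence):
    """Return True if every adjacent pair in the sequence is a permitted link."""
    return all(
        nxt in ALLOWED_NEXT.get(prev, set())
        for prev, nxt in zip(sequence, sequence[1:])
    )

forward = is_allowed(["먹고", "싶습니다"])
backward = is_allowed(["싶습니다", "먹고"])
```

Here `forward` is true and `backward` is false, which mirrors the one-directional link described above.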

An n-gram uses statistical probability for the links between words: the probability that a given word follows another is computed from training data, and the search proceeds toward the more probable words.
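
For the simplest case, a bigram, the probability can be estimated by relative frequency. The Python sketch below assumes a tiny, made-up training corpus and hypothetical names (`train_bigram`, `corpus`); it is meant only to show the counting, not the model used in the patent.

```python
from collections import Counter

def train_bigram(sentences):
    """Return a function p(prev, nxt) estimating P(nxt | prev) by relative frequency."""
    context_counts, pair_counts = Counter(), Counter()
    for words in sentences:
        for prev, nxt in zip(words, words[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, nxt)] += 1

    def probability(prev, nxt):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: no estimate available
        return pair_counts[(prev, nxt)] / context_counts[prev]

    return probability

# Made-up training sentences, already split into words.
corpus = [
    ["포인트", "얼마", "남았어요"],
    ["포인트", "어디서", "써야", "하나요"],
    ["포인트", "얼마", "남았죠"],
]
p = train_bigram(corpus)

# The search prefers the successor with the higher estimated probability.
best = max(["얼마", "어디서"], key=lambda word: p("포인트", word))
```

With this corpus, "얼마" follows "포인트" in two of three cases, so it is the preferred successor.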

An FSN, by contrast, bundles all the sentences that can be formed into a single network. It consists of states, each with a unique name, and transitions, which are the operations that change the FSN from one state to another.
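
A minimal FSN can be written as a transition table keyed by (state, word). In the Python sketch below, the state names, the two toy sentences, and the function `accepts` are all assumptions made for illustration.

```python
# Hypothetical transition table: (current state, word) -> next state.
TRANSITIONS = {
    ("START", "포인트"): "TOPIC",
    ("TOPIC", "얼마"): "QUANTITY",
    ("QUANTITY", "남았어요"): "END",
    ("TOPIC", "어디서"): "PLACE",
    ("PLACE", "써야"): "USE",
    ("USE", "하나요"): "END",
}

def accepts(words, start="START", final="END"):
    """Follow one transition per word; accept only if the final state is reached."""
    state = start
    for word in words:
        state = TRANSITIONS.get((state, word))
        if state is None:
            return False  # no transition for this word from the current state
    return state == final

ok = accepts(["포인트", "얼마", "남았어요"])
rejected = accepts(["얼마", "포인트", "남았어요"])
```

Only word sequences that trace a path from the start state to the final state are accepted, which is how the network bundles the permitted sentences.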

이하에서는 설명의 편의를 위해 결합 네트워크가 FSN인 것을 예로 들어 설명하나, 이에 한정되는 것은 아니다. Hereinafter, for convenience of description, it is assumed that the combined network is an FSN, but the present invention is not limited thereto.

이러한 결합 네트워크 생성 모듈(24)은 음성 탐색 모듈(22)로부터 탐색에 따른 노드가 인가되고, 의미 탐색 모듈(23)로부터 탐색에 따른 노드가 전달되면 상기 노드들을 결합하여 최종 결합 네트워크를 생성하게 된다. 이때, 분절된 형태소 단위로 의미적으로 개념이 동일한 하나 이상의 어휘를 추가, 삭제 또는 변경하여 복수의 어휘 클래스와 상기 복수의 노드 간 전이(transition)를 포함하는 네트워크를 구성할 수 있다. The combined network generation module 24 generates a final combined network by combining nodes when a node according to the search is applied from the voice search module 22 and a node according to the search is transmitted from the meaning search module 23 . At this time, a network including a plurality of vocabulary classes and a transition between the plurality of nodes can be constructed by adding, deleting or changing one or more vocabularies having the same semantical concept in segmented morpheme units.

그리고, 이를 최종 탐색 모듈(25)로 전달하면, 최종 탐색 모듈(25)은 상기 결합 네트워크를 이용하여 최종 음성 인식 결과 및 의미 분석 결과를 생성할 수 있다. 다시 말해, 최종 탐색 모듈(25)은 상기 음성 데이터의 특징 데이터를 상기 결합 네트워크와 음향 모델을 이용하여 최종 탐색을 수행하게 된다. The final search module 25 may then generate the final speech recognition result and the semantic analysis result using the combined network. In other words, the final search module 25 performs the final search of the feature data of the voice data using the combined network and the acoustic model.

예를 들어 설명하면, 음성 탐색 모듈(22)이 음성 탐색을 진행한 결과 다음과 같은 하나 이상의 어휘 노드를 도출할 수 있다.For example, the voice search module 22 may derive one or more vocabulary nodes as a result of the voice search.

어, 저, 음, 멤버쉽, 포인트, 가, 는, 얼마, 남, 았어, 았죠, 았어요, 어디서, 를, 을, 써야, 되죠, 하죠, 하나요Uh, uh, um, membership, point, go, uh, how much, man, it was, okay, where, uh, uh, uh, uh, uh, uh ...

또한, 의미 탐색 모듈(23)이 의미 탐색을 진행한 결과 다음과 같은 하나 이상의 어휘 노드를 도출할 수 있다. In addition, the semantic search module 23 can derive one or more lexical nodes as a result of the semantic search.

멤버쉽[서비스명], 포인트[서비스명], 얼마[수량], 남다[잔여], 쓰[소비], 어디[장소], 쓰다[소비] Membership [service name], point [service name], how much [quantity], remain [remaining], use (stem) [consumption], where [place], use [consumption]

음성 탐색 모듈(22) 및 의미 탐색 모듈(23)은 도출된 하나 이상의 노드를 결합 네트워크 생성 모듈(24)은 상기 노드들을 이용하여 네트워크를 생성할 수 있다. 이후, 결합 네트워크 생성 모듈(24)은 생성된 노드들을 최종 탐색 모듈(25)로 전달한다. The voice search module 22 and the semantic search module 23 pass the one or more nodes they derive to the combined network generation module 24, which uses those nodes to generate a network. The combined network generation module 24 then passes the resulting network of nodes to the final search module 25.
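The combining step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `combine_nodes` and the representation of semantic-search nodes as `(morpheme, tag)` pairs are hypothetical, and the patent does not specify how the combined network generation module 24 stores nodes internally.

```python
def combine_nodes(speech_nodes, semantic_nodes):
    """Merge voice-search words and semantic-search (morpheme, tag) pairs.

    Returns a dict mapping each node's word to its semantic tag, or None
    when the node came only from the voice search (hypothetical layout).
    """
    combined = {word: None for word in speech_nodes}
    for morpheme, tag in semantic_nodes:
        # A semantic-search node either annotates an existing word
        # or is added as a new node of the combined network.
        combined[morpheme] = tag
    return combined


# Example data adapted from the node lists in the description.
speech_nodes = ["멤버쉽", "포인트", "가", "얼마"]
semantic_nodes = [("멤버쉽", "서비스명"), ("얼마", "수량"), ("어디", "장소")]
combined = combine_nodes(speech_nodes, semantic_nodes)
```

Transitions between the combined nodes would still have to be added to obtain a full network; the sketch covers only the node-merging part.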

최종 탐색 모듈(25)은 상기 결합 네트워크 생성 모듈(24)을 통해 생성된 결합 네트워크를 이용하여 음성 인식 및 의미 분석을 진행한다. 이때, 상기 최종 탐색 모듈(25)은 음성 데이터에 대한 특징 데이터를 음향 모델 데이터베이스(31)의 음향 모델과 비교하여 음소열을 추출하고, 추출된 음소열에 해당하는 노드들의 집합으로 음성 인식 결과 및 의미 분석 결과를 생성할 수 있다. The final search module 25 performs speech recognition and semantic analysis using the combined network generated by the combined network generation module 24. At this time, the final search module 25 compares the feature data of the voice data with the acoustic model in the acoustic model database 31 to extract a phoneme string, and generates the speech recognition result and the semantic analysis result as the set of nodes corresponding to the extracted phoneme string.

이때, 상기 최종 탐색 모듈(25)은 의미 탐색 모듈(23)을 통해 도출된 노드들에 대해 가중치를 부여하고, 상기 의미 탐색 모듈(23)을 통해 도출된 노드, 즉 의미 태그가 부여된 노드들 중심으로 문장을 생성할 수도 있다. At this time, the final search module 25 may assign weights to the nodes derived through the semantic search module 23 and may generate a sentence centered on those nodes, that is, the nodes to which semantic tags have been assigned.
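One way to picture the weighting described above is a path score in which every node carrying a semantic tag receives a bonus. The sketch below is an assumption for illustration only: the patent does not give a scoring formula, and the names `path_score`, `best_path`, and `semantic_weight`, as well as the log-likelihood-style acoustic scores, are hypothetical.

```python
def path_score(path, acoustic_scores, tags, semantic_weight):
    """Sum per-word acoustic scores and add a bonus for semantically tagged nodes."""
    score = 0.0
    for word in path:
        score += acoustic_scores.get(word, 0.0)
        if tags.get(word) is not None:
            score += semantic_weight  # weight given to semantic-search nodes
    return score


def best_path(candidates, acoustic_scores, tags, semantic_weight=1.0):
    """Pick the candidate word sequence with the highest weighted score."""
    return max(candidates,
               key=lambda p: path_score(p, acoustic_scores, tags, semantic_weight))


# Made-up scores: the filler words score slightly better acoustically.
acoustic_scores = {"멤버쉽": -1.0, "포인트": -1.0, "얼마": -1.2,
                   "남았어": -1.3, "어": -1.0, "음": -1.0}
tags = {"멤버쉽": "서비스명", "포인트": "서비스명", "얼마": "수량"}
candidates = [["멤버쉽", "포인트", "얼마", "남았어"],
              ["멤버쉽", "포인트", "어", "음"]]
```

With the bonus enabled, the path built around tagged nodes wins even though the filler path has a slightly better acoustic score; with `semantic_weight=0.0` the filler path would be chosen instead.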

그리고 나서, 최종 탐색 모듈(25)은 생성된 음성 인식 결과 및 의미 분석 결과를 인터페이스부(10)로 전달하여 사용자에게 제공할 수도 있다. Then, the final search module 25 may transmit the generated speech recognition result and semantic analysis result to the interface unit 10 and provide the result to the user.

이때, 상기 결합 네트워크를 구성하는 어느 하나의 노드가 의미 태그를 포함하는 경우, 음성 인식 장치(100)의 최종 탐색 모듈(25)은 상기 의미 태그에 대응하는 객체 정보를 확인하고, 이를 분석 결과로 가공하여 출력하게 된다. 여기서 객체 정보는 객체 유형, 객체 대상, 객체 속성 중 어느 하나를 포함할 수 있다. At this time, if any node constituting the combined network includes a semantic tag, the final search module 25 of the speech recognition apparatus 100 identifies the object information corresponding to the semantic tag, processes it into an analysis result, and outputs it. Here, the object information may include any one of an object type, an object target, and an object attribute.

의미 태그 (Semantic tag) | 객체 정보 (Object information)
문의, 확인, 요청, 권유, ... (inquiry, confirmation, request, suggestion, ...) | 객체 유형 (object type)
서비스명, 장소, ... (service name, place, ...) | 객체 대상 (object target)
잔여, 한도, 이동, 최대, 근접, ... (remaining, limit, movement, maximum, proximity, ...) | 객체 속성 (object attribute)

예를 들어, 의미 태그가 서비스명일 경우, 최종 탐색 모듈(25)은 상기 서비스명에 대응하는 객체 정보(객체 대상)를 확인하고 이를 분석 결과로 가공할 수 있다. 이때, "객체 유형: 의미 태그"와 같은 형태로 가공할 수 있다. For example, if the semantic tag is a service name (서비스명), the final search module 25 may identify the object information corresponding to the service name (here, the object target) and process it into an analysis result, for instance in a form such as "object type: semantic tag" ("객체 유형: 의미 태그").
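The lookup from semantic tag to object information can be sketched as a simple table. The dictionary below copies only the tags listed in the table above; the names `OBJECT_INFO` and `format_tag` are hypothetical, and reading the output form as "object information: semantic tag" is an interpretation of the example given in the text.

```python
# Semantic tag -> object information category, taken from the table above.
OBJECT_INFO = {
    "문의": "객체 유형", "확인": "객체 유형", "요청": "객체 유형", "권유": "객체 유형",
    "서비스명": "객체 대상", "장소": "객체 대상",
    "잔여": "객체 속성", "한도": "객체 속성", "이동": "객체 속성",
    "최대": "객체 속성", "근접": "객체 속성",
}


def format_tag(tag):
    """Return '<object information>: <semantic tag>', or None for an unknown tag."""
    info = OBJECT_INFO.get(tag)
    if info is None:
        return None
    return f"{info}: {tag}"
```

Tags that the table only hints at with "..." are deliberately left out rather than guessed.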

아울러, 저장부(30)는 전술한 바와 같이 음향 모델 데이터베이스(31), 언어 모델 데이터베이스(32), 발음 사전 데이터베이스(33) 및 의미 모델 데이터베이스(34)를 포함하여 구성된다.The storage unit 30 includes an acoustic model database 31, a language model database 32, a pronunciation dictionary database 33, and a semantic model database 34 as described above.

음향 모델 데이터베이스(31)는 음소들을 통계적으로 모델링한 음향 모델을 저장하고 관리한다. 이때, 상기 음향 모델은 HMM(hidden Markov Model)이 될 수 있으며, 음향 모델의 기본 단위는 음소열이 될 수 있다. 따라서, 음향 모델 데이터베이스(31)를 통해 특징 데이터에 대한 음소열을 추출할 수 있다. The acoustic model database 31 stores and manages an acoustic model that statistically models phonemes. The acoustic model may be a hidden Markov model (HMM), and its basic unit may be a phoneme string. The phoneme string for the feature data can therefore be extracted through the acoustic model database 31.

언어 모델 데이터베이스(32)는 언어 모델을 저장하고 관리하며, 학습 및 탐색 시 임의적인 문장보다는 문법에 맞는 문장이 선별되도록 지원하는 역할을 수행한다. 여기서, 상기 언어 모델은 FSN, word-pair grammar, n-gram 중 적어도 어느 하나의 네트워크 형태로 구현될 수 있다. The language model database 32 stores and manages the language model, and supports the selection of sentences matching grammar rather than arbitrary sentences in learning and searching. Here, the language model may be implemented as a network of at least one of an FSN, a word-pair grammar, and an n-gram.

발음 사전 데이터베이스(33)는 발음 사전을 저장하고 관리한다. 발음 사전이란 표준 발음법에 의거하여 간단한 규칙을 정하거나 특정 환경과 발화자 및 사투리까지의 특색을 고려하는 정의한 것을 의미한다. The pronunciation dictionary database 33 stores and manages the pronunciation dictionary. A pronunciation dictionary either sets out simple rules based on standard pronunciation or is defined to take into account the characteristics of a specific environment, speaker, and even dialect.

의미 모델 데이터베이스(34)는 의미 모델을 저장하고 관리한다. 이때 상기 의미 모델은 LSP(Lexico-Syntactic Pattern)에 따라 생성될 수 있다. LSP에 따라 의미 모델을 생성하는 과정에 대해 간략히 설명하면, 먼저, 하나의 문장에서 하나 이상의 형태소 후보들을 생성하고, 생성된 후보들에 대하여 사전 탐색, 단어 형성 규칙 등을 고려하여 형태소를 선택한다. 그리고 선택된 형태소에 품사 태그를 분석한다. 이후, 하나 혹은 복수 개의 형태소를 결합하고, 하나 혹은 복수 개의 형태소에 의미 태그를 부여한다. 이러한 LSP에 따라 의미 모델을 생성하는 과정은 공지된 기술을 이용하므로, 구체적인 설명은 생략하도록 하며, 상기 의미 모델 데이터베이스(34)에 저장되고 관리되는 의미 패턴 정보는 전술한 바와 같이 하나 혹은 복수 개의 형태소와 의미 태그를 포함하는 형태가 될 수 있다. 예컨대, 멤버쉽[서비스명], 포인트[서비스명], 얼마[수량], 남다[잔여], 쓰[소비], 어디[장소], 쓰다[소비]의 형태로 저장될 수 있으며, 멤버쉽[서비스명]을 예로 들면, "멤버쉽"은 언어의 최소 단위의 형태소를 의미한다. [서비스명]은 상기 형태소(멤버쉽)가 의미하는 바를 나타내는 의미 태그이다. 또 다른 예를 들면, "주유소", "서울"에 대하여 [장소], "들렸다가"에 대하여 [경유], "가자"에 대하여 [이동], "가장"에 대하여 [최대], "가까운"에 대하여 [근접]이라는 의미 태그가 부여될 수 있다. The semantic model database 34 stores and manages the semantic model. The semantic model may be generated according to a Lexico-Syntactic Pattern (LSP). Briefly, generating a semantic model from LSPs proceeds as follows. First, one or more morpheme candidates are generated from a sentence, and morphemes are selected from the candidates in consideration of dictionary lookup, word-formation rules, and the like. Part-of-speech tags are then analyzed for the selected morphemes. Next, one or more morphemes are combined, and a semantic tag is assigned to one or more morphemes. Because this process uses known techniques, a detailed description is omitted. As described above, the semantic pattern information stored and managed in the semantic model database 34 may take a form that includes one or more morphemes and a semantic tag. For example, it may be stored in forms such as 멤버쉽[서비스명] (membership [service name]), 포인트[서비스명] (point [service name]), 얼마[수량] (how much [quantity]), 남다[잔여] (remain [remaining]), 쓰[소비] (use, stem [consumption]), 어디[장소] (where [place]), and 쓰다[소비] (use [consumption]). Taking 멤버쉽[서비스명] as an example, "멤버쉽" is a morpheme, the smallest meaningful unit of language, and [서비스명] is the semantic tag indicating what that morpheme means. As a further example, the tag [장소] (place) may be assigned to "주유소" (gas station) and "서울" (Seoul), [경유] (stopover) to "들렸다가" (after stopping by), [이동] (movement) to "가자" (let's go), [최대] (maximum) to "가장" (most), and [근접] (proximity) to "가까운" (near).
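The "morpheme[semantic tag]" notation used for the semantic pattern information can be parsed as sketched below. This is a minimal illustration that assumes the bracketed text form shown in the description is also a usable storage format; the names `SemanticPattern`, `parse_pattern`, and `parse_patterns` are hypothetical, and the patent does not state how the semantic model database 34 actually encodes its entries.

```python
import re
from typing import NamedTuple, Optional


class SemanticPattern(NamedTuple):
    morpheme: str  # e.g. "멤버쉽"
    tag: str       # e.g. "서비스명"


# One entry looks like: 멤버쉽[서비스명]
_ENTRY = re.compile(r"^\s*([^\[\]]+?)\s*\[([^\[\]]+)\]\s*$")


def parse_pattern(text: str) -> Optional[SemanticPattern]:
    """Parse a single 'morpheme[tag]' entry; return None if it does not match."""
    match = _ENTRY.match(text)
    if match is None:
        return None
    return SemanticPattern(match.group(1), match.group(2).strip())


def parse_patterns(line: str) -> list:
    """Parse a comma-separated list of entries, skipping malformed ones."""
    parsed = (parse_pattern(part) for part in line.split(","))
    return [p for p in parsed if p is not None]


patterns = parse_patterns("멤버쉽[서비스명], 포인트[서비스명], 얼마[수량], 남다[잔여]")
```

Each parsed entry pairs one morpheme with one tag, which matches the simplest case described above; entries that combine several morphemes under one tag would need a richer format.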

이상으로 본 발명의 실시 예에 따른 음성 인식 장치(100)의 주요 구성에 대해 설명하였다. The main configuration of the speech recognition apparatus 100 according to the embodiment of the present invention has been described above.

본 발명의 일 실시 예에 따른 음성 인식 장치(100)는 인터페이스부(10), 음성 인식부(20) 및 저장부(30)만을 포함하여 구성되는 것을 예로 들어 설명하였으나, 본 발명의 음성 인식 장치(100)는 전처리부(미도시) 및 후처리부(미도시)를 더 포함하여 구성될 수도 있다. The speech recognition apparatus 100 according to an embodiment of the present invention has been described as including only the interface unit 10, the speech recognition unit 20, and the storage unit 30, but it may further include a preprocessing unit (not shown) and a post-processing unit (not shown).

이때, 전처리부(미도시)는 입력된 음성 데이터를 음성 인식에 적합하도록 전처리하는 역할을 수행할 수 있다. 예컨대 불필요한 잡음 제거, 음성 향상의 기능 등을 수행할 수 있다. 후처리부(미도시)는 음성 인식 결과에 대하여 띄어쓰기와 맞춤법 오류 등을 수정하고, 외래어 표기의 일관성을 맞추며 판별이 불가능한 발성이 포함되는 경우, 이를 삭제하는 등의 기능을 수행할 수 있다. The preprocessing unit (not shown) may preprocess the input voice data so that it is suitable for speech recognition, for example by removing unnecessary noise and enhancing the speech. The post-processing unit (not shown) may correct spacing and spelling errors in the speech recognition result, make the transcription of loanwords consistent, and delete any utterance that cannot be identified.

이하, 본 발명의 실시 예에 따른 의미 분석이 가능한 음성 인식 방법에 대해 설명하도록 한다.Hereinafter, a speech recognition method capable of semantic analysis according to an embodiment of the present invention will be described.

도 5는 본 발명의 실시 예에 따른 의미 분석이 가능한 음성 인식 방법을 설명하기 위한 흐름도이다.5 is a flowchart for explaining a speech recognition method capable of semantic analysis according to an embodiment of the present invention.

도 2 및 도 5를 참조하면, 본 발명의 실시 예에 따른 의미 분석이 가능한 음성 인식 방법은 먼저 음성 인식 장치(100)가 음성 데이터가 입력되면(S101), 상기 음성 데이터에서 특징 데이터를 추출한다(S103). 이때, 상기 음성 인식 장치(100)는 MFCC(Mel-Frequency Cepstrum Coefficients), LPCC(Linear Prediction Cepstral Coefficients), EIH(Ensemble Interval Histogram), SMC (Short-time Modified Coherence) 및 PLP(Perceptual Linear Prediction) 중 어느 하나의 기법으로 특징 데이터를 추출할 수 있다. Referring to FIGS. 2 and 5, in the speech recognition method capable of semantic analysis according to an embodiment of the present invention, when voice data is input to the speech recognition apparatus 100 (S101), the apparatus first extracts feature data from the voice data (S103). The speech recognition apparatus 100 may extract the feature data using any one of MFCC (Mel-Frequency Cepstral Coefficients), LPCC (Linear Prediction Cepstral Coefficients), EIH (Ensemble Interval Histogram), SMC (Short-time Modified Coherence), and PLP (Perceptual Linear Prediction).

그리고, 음성 인식 장치(100)는 음향 모델 및 언어 모델을 이용하여 음성 탐색을 수행하고, 음향 모델 및 의미 모델을 이용하여 의미 탐색을 수행한다(S105).Then, the speech recognition apparatus 100 performs a voice search using an acoustic model and a language model, and performs a semantic search using an acoustic model and a semantic model (S105).

즉, 음성 인식 장치(100)는 상기 특징 데이터를 먼저 음향 모델과 비교하여, 상기 특징 데이터에 대응하는 음소열을 추출하고, 추출된 음소열에 해당하는 언어를 언어 모델을 통해 추출함으로써, 음성 탐색에 따른 하나 이상의 노드를 추출한다. That is, the speech recognition apparatus 100 first compares the feature data with the acoustic model to extract the phoneme string corresponding to the feature data, and then uses the language model to extract the words corresponding to that phoneme string, thereby obtaining one or more nodes from the voice search.

또한, 이와 동시에 음성 인식 장치(100)는 상기 특징 데이터를 음향 모델과 비교하여, 상기 특징 데이터에 대응하는 음소열을 추출하고, 추출된 음소열에 해당하는 의미 패턴 정보를 의미 모델을 통해 추출함으로써, 의미 탐색에 따른 하나 이상의 노드를 추출한다. At the same time, the speech recognition apparatus 100 compares the feature data with the acoustic model to extract the phoneme string corresponding to the feature data, and uses the semantic model to extract the semantic pattern information corresponding to that phoneme string, thereby obtaining one or more nodes from the semantic search.
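Running the voice search and the semantic search at the same time (step S105) can be sketched with two concurrent tasks. Everything below is an assumption for illustration: `voice_search` and `semantic_search` are placeholder stubs that return fixed nodes instead of consulting an acoustic, language, or semantic model, and the patent does not say whether threads, processes, or some other mechanism is used.

```python
from concurrent.futures import ThreadPoolExecutor


def voice_search(features):
    # Placeholder: a real search would match features against the
    # acoustic model and then the language model.
    return ["멤버쉽", "포인트", "얼마", "남", "았어"]


def semantic_search(features):
    # Placeholder: a real search would match features against the
    # acoustic model and then the semantic model.
    return [("멤버쉽", "서비스명"), ("포인트", "서비스명"), ("남", "잔여")]


def search_in_parallel(features):
    """Run both searches concurrently on the same feature data."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        voice_future = pool.submit(voice_search, features)
        semantic_future = pool.submit(semantic_search, features)
        return voice_future.result(), semantic_future.result()


voice_nodes, semantic_nodes = search_in_parallel([0.1, 0.2, 0.3])
```

Both searches receive the same feature data, so neither has to wait for the other's result before the nodes are combined in step S107.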

이후, 음성 인식 장치(100)는 S105 단계를 통해 도출된 노드를 이용하여 결합 네트워크를 생성한다(S107). 여기서, 상기 결합 네트워크는 FSN(Finite State Network), word-pair grammar, n-gram 중 적어도 어느 하나의 네트워크가 될 수 있다.Thereafter, the speech recognition apparatus 100 creates a combined network using the derived node through step S105 (S107). Here, the combined network may be at least one of a finite state network (FSN), a word-pair grammar, and an n-gram.

그리고 나서, 음성 인식 장치(100)는 상기 결합 네트워크를 이용하여 최종 음성 탐색 및 의미 탐색을 수행한다(S109). 이때, 상기 음성 인식 장치(100)는 상기 특징 데이터를 음향 모델과 비교하여 음소열을 추출하고, 추출된 음소열에 해당하는 노드들의 집합으로 음성 인식 결과 및 의미 분석 결과를 생성하게 된다. Then, the speech recognition apparatus 100 performs a final voice search and a semantic search using the combined network (S109). At this time, the speech recognition apparatus 100 extracts a phoneme string by comparing the feature data with an acoustic model, and generates a speech recognition result and a semantic analysis result as a set of nodes corresponding to the extracted phoneme string.

여기서, 음성 인식 장치(100)는 의미 탐색을 통해 도출된 노드들에 대해 가중치를 부여하고, 상기 가중치가 부여된 노드들 중심으로 문장을 생성할 수도 있다. Here, the speech recognition apparatus 100 may assign a weight to nodes derived through semantic searching and generate a sentence based on the weighted nodes.

그리고 음성 인식 장치(100)는 음성 인식 결과 및 의미 분석 결과를 출력할 수 있다(S111). Then, the speech recognition apparatus 100 can output the speech recognition result and the semantic analysis result (S111).

이때, 상기 결합 네트워크를 구성하는 어느 하나의 노드가 의미 태그를 포함하는 경우, 음성 인식 장치(100)는 상기 의미 태그에 대응하는 객체 정보를 확인하고, 이를 분석 결과로 가공하여 출력하게 된다. 여기서 객체 정보는 객체 유형, 객체 대상, 객체 속성 중 어느 하나를 포함할 수 있다. At this time, if any node constituting the combined network includes a semantic tag, the speech recognition apparatus 100 identifies the object information corresponding to the semantic tag, processes it into an analysis result, and outputs it. Here, the object information may include any one of an object type, an object target, and an object attribute.

도 6 및 도 7을 참조하여 설명하면, 도 6 및 도 7은 본 발명의 실시 예에 따른 의미 분석이 가능한 음성 인식 방법을 설명하기 위한 예시도이다. Referring to FIGS. 6 and 7, FIGS. 6 and 7 are diagrams for explaining a speech recognition method capable of semantic analysis according to an embodiment of the present invention.

즉, 소정의 문장이 FSN 형식의 네트워크로 표시된 상태 전이도의 일 예를 도시한 것으로, 총 6개의 상태에 대한 노드와 7개의 전이를 포함한다. 아울러, 설명의 편의를 위해, 상기 각각의 노드는 단일 단어만을 포함하는 것을 예로 들어 설명하나 이에 한정되는 것은 아니며, 유사한 단어의 목록으로 이뤄질 수도 있다. 예컨대 "쓰고"에 대한 단어를 포함하는 노드는 상기 "쓰고" 이외도 "쓰는", "써", "쓸 수"와 같이 상기 "쓰고"와 유사한 단어를 더 포함할 수 있다. That is, the figures show an example of a state transition diagram in which a given sentence is represented as an FSN-type network, comprising nodes for a total of six states and seven transitions. For convenience of description, each node is described as containing only a single word, but the invention is not limited to this, and a node may consist of a list of similar words. For example, the node containing the word "쓰고" ("after using") may also contain similar words such as "쓰는", "써", and "쓸 수".

또한, 설명의 편의를 위해 6개의 상태에 대한 노드와 7개의 전이만을 포함하는 것을 예로 들어 설명하나, 더 많은 노드와 전이로 이뤄질 수 도 있다. Also, for convenience of description, an example including only six states and seven transitions is described as an example, but may be made up of more nodes and transitions.

이러한 네트워크를 통해 다음과 같은 문장이 생성될 수 있다.Through this network, the following sentence can be generated.

멤버쉽 포인트 쓰고 얼마 남았어 (How many membership points are left after spending them?)

그러나, 전술한 바와 같기 상기 네트워크의 노드가 더 많은 단어를 포함하고, 더 많은 전이로 이뤄지는 경우 다음과 같은 다양한 문장들이 생성될 수도 있다. However, as described above, if the node of the network includes more words and more transitions are made, various sentences such as the following may be generated.

멤버쉽 포인트 얼마 남았어 (How many membership points are left?)
멤버쉽 포인트가 얼마 남았죠 (How many membership points are left, then?)
멤버쉽이 얼마 남았어요 (How much membership is left?)
멤버쉽 포인트 쓰고 얼마 남았죠 (How many membership points are left after spending them?)
멤버쉽 포인트를 어디서 써야하죠 (Where should I use my membership points?)
멤버쉽 포인트는 어디서 써야하죠 (As for membership points, where should I use them?)
멤버쉽 포인트 어 음 어디서 저 써요 (Membership points, uh, um, where do I, er, use them?)
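How a network with more words and transitions yields several sentences like those listed above can be sketched by enumerating every path from the start state to a final state. The network below is a made-up example, not a reproduction of the patent's figures, and the function name `generate_sentences` is hypothetical; the sketch assumes the network has no cycles.

```python
def generate_sentences(transitions, start, finals):
    """Enumerate every word sequence along a path from start to a final state.

    transitions maps a state name to a list of (word, next_state) pairs.
    Assumes an acyclic network, otherwise the walk would not terminate.
    """
    sentences = []

    def walk(state, words):
        if state in finals:
            sentences.append(" ".join(words))
        for word, next_state in transitions.get(state, []):
            walk(next_state, words + [word])

    walk(start, [])
    return sentences


# Illustrative network: optional "쓰고" and two alternative endings.
transitions = {
    "S0": [("멤버쉽", "S1")],
    "S1": [("포인트", "S2")],
    "S2": [("얼마", "S3"), ("쓰고", "S4")],
    "S4": [("얼마", "S3")],
    "S3": [("남았어", "S5"), ("남았죠", "S5")],
}
sentences = generate_sentences(transitions, "S0", {"S5"})
```

Adding one optional word and one alternative ending already doubles the sentence count twice, which is why a richer network produces the variety shown in the list.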

아울러, 도 7은 일반적인 음성 탐색만을 수행하고 이에 따른 노드만을 이용하여 네트워크를 생성한 상태이며, 도 8은 본 발명은 음성 탐색과 함께 의미 탐색에 따른 노드를 이용하여 네트워크를 생성한 상태를 도시한 것이다. FIG. 7 shows a network generated using only the nodes obtained from a general voice search, whereas FIG. 8 shows a network generated, according to the present invention, using the nodes obtained from the semantic search together with the voice search.

즉, 전술한 바와 같이 도 7의 일반적인 음성 탐색만이 가능한 네트워크를 이용하여서는 음성 인식 결과, "멤버쉽 포인트 쓰고 얼마 남았어"만을 생성할 수 있으나, 본 발명의 실시 예에 따른 결합 네트워크는 도 8에 도시된 바와 같이 의미 탐색에 따른 노드, 즉 "멤버쉽[서비스명]", "포인트[서비스명]", "남[잔여]", "았어[문의]"를 더 포함하며 상기 결합 네트워크를 통해 "멤버쉽 포인트 쓰고 얼마 남았어"라는 음성 인식 결과와 "질의유형: 문의, 질의대상: 멤버쉽 포인트, 대상속성: 잔여"와 같은 의미 분석 결과를 함께 추출하게 된다. 즉, 전술한 바와 같이 음성 인식 장치(100)는 노드에 포함된 의미 태그를 확인하고, 상기 의미 태그에 대응하는 객체 정보를 확인한 후 이를 분석 결과로 가공하여 출력하게 된다. That is, as described above, a network that supports only the general voice search of FIG. 7 can produce only the speech recognition result "멤버쉽 포인트 쓰고 얼마 남았어" ("How many membership points are left after spending them?"). The combined network according to an embodiment of the present invention, however, further includes the nodes resulting from the semantic search, namely "멤버쉽[서비스명]", "포인트[서비스명]", "남[잔여]", and "았어[문의]", as shown in FIG. 8. Through this combined network, the speech recognition result "멤버쉽 포인트 쓰고 얼마 남았어" is extracted together with a semantic analysis result such as "질의유형: 문의, 질의대상: 멤버쉽 포인트, 대상속성: 잔여" (query type: inquiry, query target: membership points, target attribute: remaining). In other words, as described above, the speech recognition apparatus 100 checks the semantic tags included in the nodes, identifies the object information corresponding to those tags, and then processes it into an analysis result for output.
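Assembling the semantic analysis result from the tagged nodes of a recognized sentence can be sketched as follows. The grouping of tags into type, target, and attribute follows the tag table given earlier in the description, but the function name `analyze`, the `(morpheme, tag)` node representation, and the rule that target morphemes are joined while type and attribute report the tag itself are assumptions inferred from the single example in the text.

```python
# Tag groups taken from the semantic tag / object information table.
TYPE_TAGS = {"문의", "확인", "요청", "권유"}
TARGET_TAGS = {"서비스명", "장소"}
ATTRIBUTE_TAGS = {"잔여", "한도", "이동", "최대", "근접"}


def analyze(nodes):
    """Build a result string from (morpheme, tag) pairs; tag may be None."""
    query_type = None
    targets = []
    attribute = None
    for morpheme, tag in nodes:
        if tag in TYPE_TAGS:
            query_type = tag
        elif tag in TARGET_TAGS:
            targets.append(morpheme)  # target reports the words themselves
        elif tag in ATTRIBUTE_TAGS:
            attribute = tag
    parts = []
    if query_type:
        parts.append(f"질의유형: {query_type}")
    if targets:
        parts.append(f"질의대상: {' '.join(targets)}")
    if attribute:
        parts.append(f"대상속성: {attribute}")
    return ", ".join(parts)


# Nodes of the recognized sentence "멤버쉽 포인트 쓰고 얼마 남았어".
nodes = [("멤버쉽", "서비스명"), ("포인트", "서비스명"), ("쓰고", None),
         ("얼마", "수량"), ("남", "잔여"), ("았어", "문의")]
result = analyze(nodes)
```

Because the tags travel with the nodes of the recognized path, the analysis result falls out of the same final search and no separate semantic-analysis pass over the text is needed.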

이와 같이, 본 발명은 음성 탐색과 동시에 의미 탐색을 진행하여 결합 네트워크를 생성함으로써, 이를 이용하여 음성 데이터에 대한 음성 인식은 물론 의미 분석이 동시에 가능하며, 의미 분석을 위한 별도의 절차를 수행하지 않으므로 보다 신속하게 의미 분석 결과를 제공할 수 있다는 우수한 효과가 있다.As described above, the present invention performs semantic search simultaneously with voice search to generate a combined network, thereby making it possible to perform semantic analysis as well as voice recognition on voice data at the same time, and does not perform a separate procedure for semantic analysis There is an excellent effect that the semantic analysis result can be provided more quickly.

이상으로 본 발명의 실시 예에 따른 의미 분석이 가능한 음성 인식 방법에 대해 설명하였다.The speech recognition method capable of semantic analysis according to the embodiment of the present invention has been described above.

본 발명의 실시 예에 따른 음성 인식 방법은 컴퓨터로 읽을 수 있는 기록매체에 컴퓨터가 읽을 수 있는 코드로서 구현하는 것이 가능하다. 컴퓨터가 읽을 수 있는 기록매체는 프로그램 명령, 데이터 파일, 데이터 구조 등을 단독으로 또는 조합하여 포함할 수 있으며, 컴퓨터 시스템에 의해 읽혀질 수 있는 데이터가 저장되는 모든 종류의 기록장치를 포함한다. 컴퓨터가 읽을 수 있는 기록매체의 예로는 하드 디스크, 플로피 디스크 및 자기 테이프와 같은 자기 매체(Magnetic Media), CD-ROM(Compact Disk Read Only Memory), DVD(Digital Video Disk)와 같은 광기록 매체(Optical Media), 플롭티컬 디스크(Floptical Disk)와 같은 자기-광 매체(Magneto-Optical Media) 및 롬(ROM, Read Only Memory), 램(RAM, Random Access Memory), 플래시 메모리 등과 같은 프로그램 명령을 저장하고 수행하도록 특별히 구성된 하드웨어 장치를 포함한다. The speech recognition method according to an embodiment of the present invention can be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination, and includes every kind of recording device that stores data readable by a computer system. Examples of the computer-readable recording medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM (Compact Disk Read Only Memory) and DVD (Digital Video Disk); magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM (Read Only Memory), RAM (Random Access Memory), and flash memory.

또한, 컴퓨터가 읽을 수 있는 기록매체는 네트워크로 연결된 컴퓨터 시스템에 분산되어, 분산방식으로 컴퓨터가 읽을 수 있는 코드가 저장되고 실행될 수 있다. 그리고, 본 발명을 구현하기 위한 기능적인(functional) 프로그램, 코드 및 코드 세그먼트들은 본 발명이 속하는 기술분야의 프로그래머들에 의해 용이하게 추론될 수 있다.In addition, the computer-readable recording medium may be distributed over network-connected computer systems so that computer readable codes can be stored and executed in a distributed manner. In addition, functional programs, codes, and code segments for implementing the present invention can be easily inferred by programmers of the technical field to which the present invention belongs.

이상으로 본 발명의 기술적 사상을 예시하기 위한 바람직한 실시예와 관련하여 설명하고 도시하였지만, 본 발명은 이와 같이 도시되고 설명된 그대로의 구성 및 작용에만 국한되는 것은 아니며, 기술적 사상의 범주를 이탈함없이 본 발명에 대해 다수의 변경 및 수정이 가능함을 당업자들은 잘 이해할 수 있을 것이다. 따라서 그러한 모든 적절한 변경 및 수정과 균등물들도 본 발명의 범위에 속하는 것으로 간주되어야 할 것이다. Although the present invention has been described and illustrated above with reference to preferred embodiments intended to exemplify its technical idea, the invention is not limited to the exact configuration and operation shown and described, and those skilled in the art will appreciate that many changes and modifications can be made to it without departing from the scope of that technical idea. Accordingly, all such appropriate changes, modifications, and equivalents should be regarded as falling within the scope of the present invention.

본 발명에 의하면, 음성 탐색과 동시에 의미 탐색을 진행하여 결합 네트워크를 생성함으로써, 이를 이용하여 음성 데이터에 대한 음성 인식은 물론 의미 분석이 동시에 가능하며, 의미 분석을 위한 별도의 절차를 수행하지 않으므로 보다 신속하게 의미 분석 결과를 제공할 수 있다는 우수한 효과가 있으며, 이를 통해 음성 인식 산업의 발전에 이바지할 수 있다.According to the present invention, a semantic search is performed simultaneously with a voice search to create a combined network. By using this, a voice recognition for voice data as well as a semantic analysis can be performed at the same time. It is possible to provide quick semantic analysis results, which can contribute to the development of the speech recognition industry.

더불어, 본 발명은 시판 또는 영업의 가능성이 충분할 뿐만 아니라 현실적으로 명백하게 실시할 수 있는 정도이므로 산업상 이용가능성이 있다. In addition, the present invention has industrial applicability because it has sufficient potential for sale or commercial operation and can clearly be put into practice.

10: 인터페이스부 20: 음성 인식부 21: 특징 추출 모듈
22: 음성 탐색 모듈 23: 의미 탐색 모듈
24: 결합 네트워크 생성 모듈 25: 최종 탐색 모듈
30: 저장부 31: 음향 모델 데이터베이스
32: 언어 모델 데이터베이스 33: 발음 사전 데이터베이스
34: 의미 모델 데이터베이스 100: 음성 인식 장치
10: interface unit 20: speech recognition unit 21: feature extraction module
22: voice search module 23: semantic search module
24: combined network generation module 25: final search module
30: storage unit 31: acoustic model database
32: language model database 33: pronunciation dictionary database
34: semantic model database 100: speech recognition apparatus

Claims

1. A speech recognition apparatus comprising:
a storage unit for storing an acoustic model, a language model, and a semantic model; and
a speech recognition unit for performing a voice search on input voice data using the acoustic model and the language model, performing a semantic search using the acoustic model and the semantic model, combining the nodes resulting from the voice search and the semantic search to generate a combined network, and performing a final search using the combined network to output a speech recognition result and a semantic analysis result,
wherein the speech recognition unit generates the speech recognition result by weighting the nodes resulting from the semantic search when performing the final search using the combined network.

2. The apparatus of claim 1,
wherein the storage unit further stores a pronunciation dictionary, and
the speech recognition unit performs the voice search using the acoustic model, the language model, and the pronunciation dictionary.

3. The apparatus of claim 1,
wherein the semantic model includes at least one piece of semantic pattern information comprising a morpheme and a semantic tag.

4. The apparatus of claim 1,
wherein the combined network is at least one of a finite state network (FSN), a word-pair grammar, and an n-gram.

5. The apparatus of claim 1,
wherein, when voice data is input, the speech recognition unit extracts feature data from the input voice data and performs the voice search and the semantic search on the extracted feature data.

6. The apparatus of claim 1,
wherein the speech recognition unit performs the final search using the combined network and the acoustic model.

7. (Deleted)

8. A speech recognition method comprising:
receiving, by a speech recognition apparatus, input speech data;
performing, by the speech recognition apparatus and in parallel, a voice search on the speech data using an acoustic model and a language model and a semantic search using the acoustic model and a semantic model;
generating, by the speech recognition apparatus, a combined network by combining the nodes resulting from the voice search and the semantic search performed in parallel; and
performing, by the speech recognition apparatus, final speech recognition and semantic analysis using the combined network and the acoustic model,
wherein the speech recognition apparatus performs the speech recognition by assigning weights to the nodes resulting from the semantic search when performing the final speech recognition using the combined network and the acoustic model.

9. The method of claim 8, further comprising,
after the receiving of the input speech data,
extracting feature data from the speech data.

10. The method of claim 9,
wherein, in the performing in parallel,
the speech recognition apparatus performs the voice search and the semantic search on the feature data.

11. (Deleted)

12. The method of claim 8, further comprising,
after the performing of the final speech recognition and semantic analysis,
identifying corresponding object information using a semantic tag included in a node resulting from the semantic search, and processing the object information into an analysis result.