KR20210074649A

KR20210074649A - Speech Recognition Method Determining the Subject of Response using Multi-Modal Analysis in Natural Language Sentences

Info

Publication number: KR20210074649A
Application number: KR1020190165579A
Authority: KR
Inventors: 정민화; 이규환; 조원익; 김종인; 정지오
Original assignee: 서울대학교산학협력단
Priority date: 2019-12-12
Filing date: 2019-12-12
Publication date: 2021-06-22
Also published as: KR102334961B1

Abstract

The present invention relates to a speech recognition method that determines whether to respond to a natural utterance, without requiring a wake word, by using the text and acoustic information extracted from that utterance. The method is a two-pass cascade of an intention classifier and a topic classifier. The topic classifier makes it possible to build a dedicated language model for each topic. The intention classifier uses acoustic information in addition to text information, which addresses utterances that are hard to classify as declarative or interrogative because the distinction depends on intonation, so that response and non-response can be determined more accurately.

Description

Speech Recognition Method Determining the Subject of Response using Multi-Modal Analysis in Natural Language Sentences

The present invention relates to an interactive speech recognition method, and more particularly to a speech recognition method that determines whether to respond, without requiring a wake word, by using text and acoustic information extracted from natural speech.

Speech recognition is the process by which a computer interprets spoken language and converts its content into text data. It is distinct from speaker recognition, which compares speech against a pre-recorded voice pattern of a specific person for authentication. Speech recognition is widely used to control devices and retrieve information by voice in intelligent machines such as robots and telematics systems, where information and communication technology converges with the automobile industry. To cover a broad range of users, speech uttered by many different speakers is statistically modeled to build an acoustic model and a pronunciation model, and a language model is built from a collected corpus.

For humans and machines to converse through speech, the input/output interface of an intelligent machine must be speech; such a machine is also called a speech recognition device. As the recognition accuracy of these devices has improved, application services have expanded from assistant-type speech recognition systems on smartphones, through speaker-type artificial intelligence (AI) assistants, to an input technology for the Internet of Things (IoT).

In conventional speech recognition devices, the speech recognition mode is activated with a wake word such as "Siri" or "OK Google", or the range of wake words is widened so that any of several predetermined wake words activates it. When the wake word is fixed in this way, the device cannot be controlled by recognizing utterances made in the course of natural conversation, which reduces the device's versatility.

Korean Patent Application Laid-Open No. 2014-0073889, titled 'Call word buffering and filling interface for interactive voice recognition', discloses a technique that performs speech recognition on command phrases in the user's natural conversational speech input, so that the user need not repeat the wake word every time speech is input. However, in that invention the user must first enter the wake word in his or her own voice, and the device recognizes the voice waveform entered together with the wake word and re-recognizes that waveform in the user's answer when the device asks a question. It can therefore hardly be regarded as natural language speech recognition. Accordingly, there is a need for a system that allows a speech recognizer to determine whether to respond without using a wake word.

Korean Patent Application Laid-Open No. 2014-0073889

An object of the present invention is to provide a speech recognition method that, without a wake word, analyzes the content and acoustic information of a user's utterance to distinguish what the speech recognition device should respond to from what it should not.

The present invention provides a speech recognition method for determining whether to respond to a natural language sentence, the method comprising: converting speech uttered by a user into a digital speech signal at a speech input device; extracting acoustic information from the converted digital speech signal using the OpenSmile Toolkit, an acoustic information extraction toolkit, and recognizing the text word by word using an embedded speech recognizer; sending the extracted acoustic information and the recognized words to an intention classifier, which classifies the utterance as a request, a question, or a declarative sentence; determining non-response for speech that the intention classifier classifies as a declarative sentence, and sending requests and questions to a topic classifier; and classifying, in the topic classifier, the requests and questions into topics of predetermined classes or "other", determining non-response for "other", and determining that topics of the predetermined classes are response targets, wherein the topic classifier and the intention classifier use the Linear Bag of Words Classifier, a sentence classification algorithm of the natural language processing toolkit Fasttext.
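The decision flow of the steps above can be sketched as a small function. This is a minimal illustration under stated assumptions, not the patented implementation: the label strings, the `decide_response` name, and the two toy classifiers are hypothetical stand-ins for the trained intention and topic classifiers, and the text does not say whether the topic classifier also receives acoustic information (here it receives only the words).

```python
from typing import Callable

# Label names are illustrative; the text only names the three sentence types
# and the split between predetermined topics and "other".
NON_RESPONSE_INTENT = "declarative"
OTHER_TOPIC = "other"


def decide_response(
    words: list[str],
    acoustic: dict[str, float],
    intent_classifier: Callable[[list[str], dict[str, float]], str],
    topic_classifier: Callable[[list[str]], str],
) -> tuple[bool, str]:
    """Return (respond?, reason) for one utterance using the two-pass cascade."""
    # Pass 1: intention. Declarative sentences are never answered.
    intent = intent_classifier(words, acoustic)
    if intent == NON_RESPONSE_INTENT:
        return False, "declarative"
    # Pass 2: topic. Only requests and questions reach this stage.
    topic = topic_classifier(words)
    if topic == OTHER_TOPIC:
        return False, "other-topic"
    return True, topic


# Toy stand-ins for the trained classifiers (hypothetical rules, demo only).
def toy_intent(words: list[str], acoustic: dict[str, float]) -> str:
    if acoustic.get("final_pitch_slope", 0.0) > 0:
        return "question"
    return "request" if words and words[0] == "please" else "declarative"


def toy_topic(words: list[str]) -> str:
    return "weather" if "weather" in words else OTHER_TOPIC
```

Because declarative sentences are rejected in the first pass, the topic classifier only ever sees requests and questions, which is what makes the arrangement a cascade rather than two independent classifiers.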

The present invention also provides the speech recognition method, wherein the intention classifier includes a sentence database for word recognition that contains pitch and formant information as acoustic information; the sentence database includes sentence data, grouped into requests, declarative sentences, and questions and annotated with pitch and formant information, for classifying an input sentence as a request, a declarative sentence, or a question; and this sentence data is updated and stored at predetermined intervals.
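The text does not specify how pitch information is combined with the recognized words. One possible approach, shown here purely as an assumption, is to reduce the end of the pitch (F0) contour to a coarse intonation pseudo-token and append it to the word sequence, so that a bag-of-words classifier can tell a rising "question" contour from a falling "statement" contour. The function names, token strings, and thresholds below are all hypothetical.

```python
def final_pitch_token(f0_hz: list[float], tail: int = 5,
                      threshold_hz: float = 10.0) -> str:
    """Map the end of an F0 contour to a coarse intonation pseudo-token."""
    voiced = [f for f in f0_hz if f > 0]  # 0 marks unvoiced frames in this sketch
    if len(voiced) < 2 * tail:
        return "__pitch_unknown__"
    end = sum(voiced[-tail:]) / tail               # mean F0 of the last frames
    before = sum(voiced[-2 * tail:-tail]) / tail   # mean F0 just before them
    if end - before > threshold_hz:
        return "__pitch_rise__"
    if before - end > threshold_hz:
        return "__pitch_fall__"
    return "__pitch_flat__"


def fuse(words: list[str], f0_hz: list[float]) -> list[str]:
    """Append the intonation pseudo-token to the recognized words."""
    return words + [final_pitch_token(f0_hz)]
```

With this kind of fusion, the same word sequence yields different classifier inputs depending on intonation, which is the case the text identifies as hard to resolve from text alone.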

The present invention also provides the speech recognition method, wherein the sentence database further includes, as declarative sentence data, per-topic answers that respond to requests and questions containing a topic of the predetermined classes, and the determining step further includes uttering the responding declarative sentence through a speaker.

The present invention also provides the speech recognition method, wherein the topics of the predetermined classes are email, house control, weather, and schedule, and the topic classifier further includes a topic adding unit that adds a new topic to the topics of the predetermined classes.
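A topic adding unit of this kind could be sketched as a small registry that holds the four named topics plus the catch-all class. The `TopicRegistry` class and its method names are hypothetical, and the sketch only manages the label set; a real system would presumably also need training data for a new topic and a retrained topic classifier.

```python
class TopicRegistry:
    """Holds the predetermined topic classes and lets new ones be added."""

    OTHER = "other"  # catch-all class that is never a response target

    def __init__(self) -> None:
        # The four topics named in the text.
        self._topics = ["email", "house control", "weather", "schedule"]

    def add_topic(self, name: str) -> bool:
        """Add a topic; return False if it is empty, reserved, or already present."""
        name = name.strip().lower()
        if not name or name == self.OTHER or name in self._topics:
            return False
        self._topics.append(name)
        return True

    def labels(self) -> list[str]:
        """Classifier label set: the predetermined topics plus the catch-all."""
        return self._topics + [self.OTHER]

    def is_response_target(self, label: str) -> bool:
        """True only for predetermined topics, never for the catch-all."""
        return label in self._topics
```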

The present invention also provides the speech recognition method, wherein the topic classifier includes a word database, the word database includes embedding data for each topic, and the word and similar-word data for each topic are updated and stored at predetermined intervals.
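As a rough illustration of a per-topic word database with embeddings and periodic updates, the sketch below stores one vector per word under each topic, finds similar words by cosine similarity, and reports when the predetermined update interval has elapsed. The `TopicWordDB` class, the similarity threshold, and the toy vectors are assumptions; the text does not describe the storage layout or how similar words are selected.

```python
import math
from datetime import datetime, timedelta
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TopicWordDB:
    """Per-topic word embeddings that are refreshed at a fixed interval."""

    def __init__(self, refresh_every: timedelta) -> None:
        self.refresh_every = refresh_every
        self.last_refresh: Optional[datetime] = None
        # topic -> word -> embedding vector
        self.vectors: dict[str, dict[str, list[float]]] = {}

    def needs_refresh(self, now: datetime) -> bool:
        return self.last_refresh is None or now - self.last_refresh >= self.refresh_every

    def refresh(self, vectors: dict[str, dict[str, list[float]]], now: datetime) -> None:
        """Replace the stored embeddings and record the update time."""
        self.vectors = vectors
        self.last_refresh = now

    def similar_words(self, topic: str, word: str, min_sim: float = 0.8) -> list[str]:
        """Words in the same topic whose embedding is close to that of `word`."""
        table = self.vectors.get(topic, {})
        if word not in table:
            return []
        return [w for w, v in table.items()
                if w != word and cosine(table[word], v) >= min_sim]
```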

The present invention also provides a computer-readable storage medium storing a speech recognition computer program programmed to determine whether to respond to a natural language sentence, the program comprising: a code portion programmed to convert speech uttered by a user into a digital speech signal at a speech input device; a code portion programmed to extract acoustic information from the converted digital speech signal using the OpenSmile Toolkit, an acoustic information extraction toolkit, and to recognize the text word by word using an embedded speech recognizer; a code portion programmed to send the extracted acoustic information and the recognized words to an intention classifier, which classifies the utterance as a request, a question, or a declarative sentence; a code portion programmed to determine non-response for speech that the intention classifier classifies as a declarative sentence, and to send requests and questions to a topic classifier; and a code portion programmed to classify, in the topic classifier, the requests and questions into topics of predetermined classes or "other", to determine non-response for "other", and to determine that topics of the predetermined classes are response targets, wherein the topic classifier and the intention classifier use the Linear Bag of Words Classifier, a sentence classification algorithm of the natural language processing toolkit Fasttext.

Because the speech recognition method of the present invention is a two-pass cascade of an intention classifier and a topic classifier, the topic classifier makes it possible to build a dedicated language model for each topic. At the same time, the intention classifier uses acoustic information as well as text information to handle utterances that are hard to classify as declarative or interrogative because the distinction depends on intonation, so that response and non-response can be determined more accurately.

FIG. 1 shows an exemplary structure of the speech recognition method according to the present invention for discriminating response and non-response sentences, in which an utterance passes through a speech recognizer and an acoustic information extractor and then through an intention classifier and a topic classifier in turn.
FIG. 2 is a conceptual flowchart of the method according to the present invention for discriminating response and non-response sentences using the intention classifier and the topic classifier.

Various aspects are disclosed with reference to the drawings. In the following description, for purposes of explanation, numerous specific details are set forth to provide a thorough understanding of one or more aspects. It will be appreciated, however, that these aspects may be practiced without each of these specific details. The following description and the accompanying drawings set forth certain illustrative aspects in detail. These aspects are illustrative, however; they represent only some of the various ways in which the principles of the various aspects may be employed, and the description is intended to include all such aspects and their equivalents.

Various aspects and features will be presented in terms of systems that may include a number of devices, modules, and the like. It should also be understood and appreciated that the various systems may include additional devices, parts, components, and the like, and/or may not include all of the devices, parts, components, and the like discussed in connection with the drawings.

As used herein, "embodiment", "example", "aspect", "illustration", and the like should not be construed to mean that any aspect or design described is better than, or has advantages over, other aspects or designs. The terms 'system', 'server', 'terminal', and the like used below generally refer to a computer-related entity, and may refer to, for example, hardware, a combination of hardware and software, or software.

In addition, the term "or" is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless otherwise specified or clear from context, "X employs A or B" is intended to mean any of the natural inclusive permutations. That is, "X employs A or B" is satisfied in any of the following cases: X employs A; X employs B; or X employs both A and B. It is also to be understood that the term "and/or" as used herein refers to and includes all possible combinations of one or more of the listed related items.

Also, the terms "comprises" and/or "comprising" mean that the stated feature, step, operation, module, and/or component is present, but do not exclude the presence or addition of one or more other features, steps, operations, modules, components, and/or groups thereof. In addition, although terms such as first and second may be used in this specification to describe various components, these components are not limited by these terms. That is, these terms are used only to distinguish between two or more components and should not be construed as implying an order or priority. Also, unless otherwise specified or clear from context to be directed to a singular form, the singular in this specification and the claims should generally be construed to mean "one or more". Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings.

Unlike a conventional speech recognition interface that relies on a wake word, robot speech recognition without a wake word judges the content of the user's utterance and can thereby provide a more natural conversational interface between a person and a robot. In addition, whereas the existing system performs (wake word recognition -> speech recognition -> task), the present invention performs (speech recognition -> task) and can therefore act on the user's request immediately. The present invention will be described in detail below with reference to the drawings.

FIG. 1 shows an exemplary structure of the speech recognition method according to the present invention for discriminating response and non-response sentences, in which an utterance passes through a speech recognizer and an acoustic information extractor and then through an intention classifier and a topic classifier in turn, and FIG. 2 is a conceptual flowchart of the method according to the present invention for discriminating response and non-response sentences using the intention classifier and the topic classifier. In one embodiment of the present invention, the speech recognition method includes: converting speech uttered by a user into a digital speech signal at a speech input device; extracting acoustic information from the converted digital speech signal using the OpenSmile Toolkit, an acoustic information extraction toolkit, and recognizing the text word by word using an embedded speech recognizer; sending the extracted acoustic information and the recognized words to an intention classifier, which classifies the utterance as a request, a question, or a declarative sentence; determining non-response for speech that the intention classifier classifies as a declarative sentence, and sending requests and questions to a topic classifier; and classifying, in the topic classifier, the requests and questions into topics of predetermined classes or "other", determining non-response for "other", and determining that topics of the predetermined classes are response targets.

In one embodiment of the present invention, the topic classifier and the intention classifier may use the Linear Bag of Words Classifier, a sentence classification algorithm of the natural language processing toolkit Fasttext. With this configuration, the present invention can determine whether to respond to a natural language sentence more accurately than the prior art. In one embodiment of the present invention, the intention classifier includes a sentence database for word recognition that contains pitch and formant information as acoustic information. In one embodiment of the present invention, this acoustic information makes it possible to reflect information such as intonation, stress, and tempo. Also, in one embodiment of the present invention, the sentence database includes sentence data, grouped into requests, declarative sentences, and questions and annotated with pitch and formant information, for classifying an input sentence as a request, a declarative sentence, or a question, and this sentence data can be updated and stored at predetermined intervals.

Also, in one embodiment of the present invention, the sentence database may further include, as declarative sentence data, per-topic answers that respond to requests and questions containing a topic of the predetermined classes, and the determining step may further include uttering the responding declarative sentence through a speaker.

In one embodiment of the present invention, the topics of the predetermined classes are email, house control, weather, and schedule, and the topic classifier may further include a topic adding unit that adds a new topic to the topics of the predetermined classes. This function allows the set of topics to be extended naturally. In one embodiment of the present invention, the topic classifier includes a word database, the word database includes embedding data for each topic, and the word and similar-word data for each topic can be updated and stored at predetermined intervals.

The speech recognition method of the present invention may be implemented in the form of a computer-readable storage medium, and the storage medium may store a speech recognition computer program programmed to determine whether to respond to a natural language sentence. The storage medium may include: a code portion programmed to convert speech uttered by a user into a digital speech signal at a speech input device; a code portion programmed to extract acoustic information from the converted digital speech signal using the LIBROSA python library, an acoustic information extraction toolkit, and to recognize the text word by word using an embedded speech recognizer; a code portion programmed to send the extracted acoustic information and the recognized words to an intention classifier, which classifies the utterance as a request, a question, or a declarative sentence; a code portion programmed to determine non-response for speech that the intention classifier classifies as a declarative sentence, and to send requests and questions to a topic classifier; and a code portion programmed to classify, in the topic classifier, the requests and questions into topics of predetermined classes or "other", to determine non-response for "other", and to determine that topics of the predetermined classes are response targets.
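To make the acoustic extraction step concrete without guessing at any toolkit's API, the following dependency-free sketch estimates the pitch (fundamental frequency) of one voiced frame by autocorrelation. It is a simplified stand-in for what a toolkit would compute, not the extraction procedure of the invention; the `estimate_f0` name, the 75 to 400 Hz search range, and the synthetic 200 Hz test tone are all assumptions.

```python
import math


def estimate_f0(samples: list[float], sample_rate: int,
                fmin: float = 75.0, fmax: float = 400.0) -> float:
    """Estimate the fundamental frequency (Hz) of one frame by autocorrelation.

    Returns 0.0 when no positive correlation peak is found (e.g. silence).
    """
    min_lag = int(sample_rate / fmax)                        # shortest period searched
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)  # longest period searched
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the frame with itself shifted by `lag`.
        score = sum(samples[n] * samples[n + lag] for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


# Synthetic voiced frame: a 200 Hz sine sampled at 8 kHz (50 ms).
sr = 8000
frame = [math.sin(2 * math.pi * 200.0 * n / sr) for n in range(400)]
f0 = estimate_f0(frame, sr)
```

Running the estimator frame by frame over an utterance would yield a pitch contour, from which cues such as a rising or falling sentence ending can be derived.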

In one embodiment of the present invention, the topic classifier and the intention classifier use the Linear Bag of Words Classifier, a sentence classification algorithm of the natural language processing toolkit Fasttext.
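fastText's supervised mode reads training data as plain text with one example per line, where each label is marked by the default `__label__` prefix. The helper below sketches how labeled utterances for the intention or topic classifier might be written in that format. The function names and example labels are hypothetical, and the text does not describe the actual training data or hyperparameters.

```python
def to_fasttext_line(label: str, words: list[str]) -> str:
    """Format one training example as '__label__<label> word word ...'."""
    safe_label = label.strip().replace(" ", "_")  # a label must be a single token
    return f"__label__{safe_label} " + " ".join(words)


def write_training_file(path: str, examples: list[tuple[str, list[str]]]) -> int:
    """Write examples to a fastText-style training file; return the line count."""
    with open(path, "w", encoding="utf-8") as fh:
        for label, words in examples:
            fh.write(to_fasttext_line(label, words) + "\n")
    return len(examples)
```

With the `fasttext` Python package installed, a file written this way would typically be passed to `fasttext.train_supervised(input=path)`, and the resulting model queried with `model.predict(text)`; that step is omitted here so the sketch stays dependency-free.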

As described above, the present invention relates to a speech recognition method for determining whether to respond to a natural language sentence. The invention can be applied, for example, in the automotive field to act on a user's request immediately while driving; in the home automation field to make user interfaces more convenient in an Internet of Things (IoT) environment; and in artificial intelligence assistant applications to make the user interfaces of smart speakers or robots more convenient.

The various embodiments described herein may be implemented in a medium readable by a computer or a similar device using, for example, software, hardware, or a combination thereof.

In a hardware implementation, the embodiments described herein may be implemented using at least one of ASICs (application specific integrated circuits), DSPs (digital signal processors), DSPDs (digital signal processing devices), PLDs (programmable logic devices), FPGAs (field programmable gate arrays), processors, controllers, micro-controllers, microprocessors, and other electrical units for performing functions. In some cases, the embodiments described herein may be implemented by the management server and/or the system itself.

In a software implementation, embodiments such as the procedures and functions described herein may be implemented as separate software modules. Each of the software modules may perform one or more of the functions and operations described herein. The software code may be implemented as a software application written in a suitable programming language. The software code may be stored in the management server and/or a database and executed by an app.

Meanwhile, the various embodiments presented herein may be implemented as a method, an apparatus, or an article of manufacture using standard programming and/or engineering techniques. The term "article of manufacture" includes a computer program, carrier, or media accessible from any computer-readable device. For example, computer-readable media include, but are not limited to, magnetic storage devices (e.g., hard disks, floppy disks, magnetic strips), optical disks (e.g., CDs, DVDs), smart cards, and flash memory devices (e.g., EEPROM, cards, sticks, key drives). Also, the various storage media presented herein include one or more devices and/or other machine-readable media for storing information. The term "machine-readable medium" includes, but is not limited to, wireless channels and various other media capable of storing, holding, and/or carrying instruction(s) and/or data.

The description of the presented embodiments is provided to enable any person of ordinary skill in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those of ordinary skill in the art, and the generic principles defined herein may be applied to other embodiments without departing from the scope of the invention. Thus, the present invention is not limited to the embodiments presented herein, but is to be accorded the widest scope consistent with the principles and novel features presented herein.

Claims

1. In a speech recognition method for determining whether to respond in a natural language sentence,
the method comprising the steps of: converting a voice uttered by a user into a digital voice signal in a voice input device;
extracting sound information from the converted digital voice signal using OpenSmile Toolkit, which is a sound information extraction toolkit, and recognizing text word by word using an embedded voice recognizer;
sending the extracted sound information and the recognized words to an intention classifier and classifying the utterance as a request sentence, an interrogative sentence, or a declarative sentence;
determining non-response for a voice classified as a declarative sentence by the intention classifier, and sending the request sentence and the interrogative sentence to a topic classifier; and
classifying, in the topic classifier, the request sentence and the interrogative sentence into topics of a predetermined class and others, determining others as non-response, and determining the topics of the predetermined class as response targets,
wherein the topic classifier and the intention classifier use Linear Bag of Words Classifier, a sentence classification algorithm of the natural language processing toolkit Fasttext,
A speech recognition method for determining whether to respond in a natural language sentence.
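
By way of illustration only, the two-pass cascade recited in claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `decide_response`, `toy_intent`, and `toy_topic` are hypothetical, the two toy classifiers are simple keyword rules standing in for the Fasttext-based classifiers named in the claim, and `sound_info[0]` is assumed to hold a final pitch slope.

```python
from typing import Callable, Optional, Sequence, Tuple

# Labels follow the claim wording; all identifiers here are hypothetical.
REQUEST, INTERROGATIVE, DECLARATIVE = "request", "interrogative", "declarative"
OTHERS = "others"

IntentClassifier = Callable[[Sequence[str], Sequence[float]], str]
TopicClassifier = Callable[[Sequence[str]], str]


def decide_response(
    words: Sequence[str],
    sound_info: Sequence[float],
    intent_clf: IntentClassifier,
    topic_clf: TopicClassifier,
) -> Tuple[bool, Optional[str]]:
    """Return (respond, topic) for one utterance using the two-pass cascade."""
    # Pass 1: intention classification from text plus sound information.
    intent = intent_clf(words, sound_info)
    if intent == DECLARATIVE:
        return False, None  # declarative sentences are treated as non-response
    # Pass 2: topic classification for request and interrogative sentences.
    topic = topic_clf(words)
    if topic == OTHERS:
        return False, None  # out-of-class topics are treated as non-response
    return True, topic


def toy_intent(words: Sequence[str], sound_info: Sequence[float]) -> str:
    # Assumption: sound_info[0] is a final pitch slope; rising pitch marks a question.
    if sound_info and sound_info[0] > 0:
        return INTERROGATIVE
    if words and words[0] in {"please", "turn", "send"}:
        return REQUEST
    return DECLARATIVE


def toy_topic(words: Sequence[str]) -> str:
    # Assumption: one keyword per predetermined topic, for illustration only.
    keywords = {
        "email": "email",
        "light": "house control",
        "weather": "weather",
        "schedule": "schedule",
    }
    for word in words:
        if word in keywords:
            return keywords[word]
    return OTHERS
```

In this sketch, an utterance reaches the topic classifier only if the intention classifier did not label it declarative, which mirrors the order of the determining steps in the claim.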

2. The method of claim 1,
wherein the intention classifier includes a sentence database for word recognition that includes pitch and formant information as the sound information,
the sentence database includes sentence data for each of the request sentence, the declarative sentence, and the interrogative sentence, including pitch and formant information, for classifying an input sentence as a request sentence, a declarative sentence, or an interrogative sentence, and this sentence data is updated and stored at predetermined intervals,
A speech recognition method for determining whether to respond in a natural language sentence.

3. The method of claim 2,
wherein the sentence database further includes, as declarative sentence data, answers by topic corresponding to request sentences and interrogative sentences that include a topic of the predetermined class, for use in the determining step,
and the determining step further comprises the step of uttering the corresponding declarative sentence through a speaker,
A speech recognition method for determining whether to respond in a natural language sentence.

4. The method of claim 1,
The topics of the predetermined class are email, house control, weather, and schedule,
The topic classifier further comprises a topic adding unit for adding a new topic to the topic of the predetermined class,
A speech recognition method for determining whether to respond in a natural language sentence.
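
As a purely illustrative sketch of the topic adding unit recited in claim 4, the predetermined topics can be held in a small registry that accepts new entries. The class name `TopicRegistry` and its methods are hypothetical and are not taken from the specification; how a newly added topic would be learned by the topic classifier is outside this sketch.

```python
from typing import List, Tuple


class TopicRegistry:
    """Holds the predetermined topic classes and allows new ones to be added."""

    # The four predetermined topics named in claim 4.
    DEFAULT_TOPICS = ("email", "house control", "weather", "schedule")

    def __init__(self) -> None:
        self._topics: List[str] = list(self.DEFAULT_TOPICS)

    def add_topic(self, name: str) -> bool:
        """Add a new topic; return False if it is empty or already present."""
        name = name.strip().lower()
        if not name or name in self._topics:
            return False
        self._topics.append(name)
        return True

    def is_response_target(self, label: str) -> bool:
        """A label is a response target only if it is a registered topic."""
        return label in self._topics

    @property
    def topics(self) -> Tuple[str, ...]:
        return tuple(self._topics)
```

Any label outside the registry corresponds to "others" in claim 1 and would be treated as non-response.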

5. The method of claim 4,
wherein the topic classifier includes a word database,
the word database includes embedding data for each topic, and the word data and similar-word data for each topic are updated and stored at predetermined intervals,
A speech recognition method for determining whether to respond in a natural language sentence.

6. A code part programmed to convert a voice uttered by a user into a digital voice signal in a voice input device;
a code part programmed to extract sound information from the converted digital voice signal using the LIBROSA python library, which is a sound information extraction toolkit, and to recognize text word by word using an embedded voice recognizer;
a code part programmed to send the extracted sound information and the recognized words to an intention classifier and classify the utterance as a request sentence, an interrogative sentence, or a declarative sentence;
a code part programmed to determine non-response for a voice classified as a declarative sentence by the intention classifier, and to send the request sentence and the interrogative sentence to a topic classifier; and
a code part programmed to classify, in the topic classifier, the request sentence and the interrogative sentence into topics of a predetermined class and others, to determine others as non-response, and to determine the topics of the predetermined class as response targets,
The topic classifier and the intention classifier use Linear Bag of Words Classifier, a sentence classification algorithm of the natural language processing toolkit Fasttext,
A computer-readable storage medium storing a speech recognition computer program programmed to determine whether to respond in a natural language sentence.