KR20200109841A

KR20200109841A - A speech recognition apparatus

Info

Publication number: KR20200109841A
Application number: KR1020190029565A
Authority: KR
Inventors: 오성조; 김은직; 이수익; 박준우
Original assignee: 주식회사 포지큐브
Priority date: 2019-03-14
Filing date: 2019-03-14
Publication date: 2020-09-23

Abstract

According to an embodiment of the present invention, a method of operating an automatic speech recognition apparatus comprises the steps of: receiving a voice signal of a speaker from a terminal device; obtaining speech data by analyzing the voice signal; and automatically recognizing the speech data obtained while a speech end timer window lasts, and converting it into text information for output. The speech end timer window is adaptively varied according to at least one of value information of the speech data detected by the analysis of the voice signal, classification information of the speech data, and disfluency information of the speech data.

Description

Automatic speech recognition device {A SPEECH RECOGNITION APPARATUS}

The present invention relates to an automatic speech recognition apparatus. More specifically, the present invention relates to a speech recognition apparatus capable of shortening response time and waiting time through an end-of-speech time detection process.

A voice signal is the most common, simple, easy, and quick means or medium by which users communicate what they intend to express.

However, medium- to long-distance communication is difficult in a natural state, and automatic speech recognition (ASR) technology has been proposed for automated recognition processing of such voice signals.

Such ASR technology detects a speech section from an audio signal and obtains and processes context data for communication from the speech section, thereby enabling an automated response.

Recently, such ASR technology has been built as a separate server-based learning system and offered as a service that interprets natural language from the user's voice input and outputs similar text data. The learning system may be built on, for example, a probability-based nearest-text search diagram using a language model based on a deep neural network (DNN), and various databases and training data are used for this purpose.

However, while current ASR systems may be useful for accurately extracting context data, they are inefficient at detecting the end-of-speech (EOS) time point.

More specifically, current ASR systems detect the EOS time point on their own, inside the automated speech recognition process, which causes the EOS time point to be determined too early or too late.

For example, silence caused by disfluency in the voice may follow, but a current ASR system either judges this silence to be the end of speech and returns meaningless data as output, or, conversely, fails to determine the end of speech properly and increases unnecessary waiting time.

Because of these problems, meaningful text data is not extracted properly, or user convenience is impaired by increased waiting time.

Korean Patent Laid-Open Publication No. 10-2009-0054642 (Title of invention: Voice recognition method and apparatus)

The present invention was devised to solve the problems described above. Its object is to provide an automatic speech recognition apparatus, and a method of operating the same, that detect the end of speech more quickly and accurately by adaptively varying the end-of-speech (EOS) time point, independently of the ASR system, according to the characteristics of speech data obtained by analyzing the voice signal.

To solve the problems described above, an automatic speech recognition apparatus according to an embodiment of the present invention includes: a voice receiver that receives a voice signal of a speaker from a terminal device; an audio analysis module that obtains speech data by analyzing the voice signal; an automatic speech recognition processing unit that automatically recognizes the speech data obtained while a speech end timer window lasts and converts it into text information for output; and an EOS matching unit that determines a speech end point by adaptively varying the speech end timer window according to at least one of value information of the speech data detected by the analysis of the voice signal, classification information of the speech data, and disfluency information of the speech data.

Meanwhile, a method of operating an automatic speech recognition apparatus according to an embodiment of the present invention includes the steps of: receiving a voice signal of a speaker from a terminal device; obtaining speech data by analyzing the voice signal; and automatically recognizing the speech data obtained while a speech end timer window lasts and converting it into text information for output, wherein the speech end timer window is adaptively varied according to at least one of value information of the speech data detected by the analysis of the voice signal, classification information of the speech data, and disfluency information of the speech data.

Meanwhile, the method according to an embodiment of the present invention may be implemented as a program for executing the method on a computer and as a computer-readable recording medium on which the program is recorded.

According to an embodiment of the present invention, the speech end timer window is adaptively varied according to the characteristics of speech data obtained by analyzing the voice signal, so that the end-of-speech (EOS) time point can be detected more quickly and accurately.

In addition, the characteristics of the speech data are determined according to at least one of the value information of the speech data detected by the analysis of the voice signal, the classification information of the speech data, and the disfluency information of the speech data. This shortens the waiting time for a speech recognition response while maintaining a high level of accuracy, and thereby greatly improves user convenience.

FIG. 1 is a block diagram schematically showing an entire system according to an embodiment of the present invention.
FIGS. 2 to 6 are diagrams for explaining the speech recognition and waiting time reduction effects according to an embodiment of the present invention by section of the voice data.
FIGS. 7 to 9 are flowcharts illustrating each process of a speech recognition method according to an embodiment of the present invention in more detail.
FIG. 10 is a ladder diagram for explaining the difference in waiting time between a speech recognition process according to an embodiment of the present invention and an existing process.

The following merely illustrates the principles of the present invention. Those skilled in the art will therefore be able to devise various apparatuses that embody the principles of the invention and fall within its concept and scope, even if they are not explicitly described or shown herein. In addition, all conditional terms and embodiments listed in this specification are, in principle, expressly intended only to aid understanding of the concept of the invention, and should be understood as not limiting the invention to the specifically listed embodiments and states.

It should also be understood that all detailed descriptions reciting the principles, aspects, and embodiments of the invention, as well as specific embodiments, are intended to include structural and functional equivalents thereof. Such equivalents should be understood to include not only currently known equivalents but also equivalents developed in the future, that is, all elements invented to perform the same function regardless of structure.

Thus, for example, the block diagrams herein should be understood as representing conceptual views of exemplary circuits embodying the principles of the invention. Similarly, all flowcharts, state transition diagrams, pseudocode, and the like should be understood as representing various processes that can be substantially represented in a computer-readable medium and performed by a computer or processor, whether or not the computer or processor is explicitly shown.

Furthermore, the explicit use of terms such as processor, control, or similar concepts should not be construed as referring exclusively to hardware capable of executing software, and should be understood to implicitly include, without limitation, digital signal processor (DSP) hardware as well as read-only memory (ROM), random-access memory (RAM), and non-volatile memory for storing software. Other well-known, commonly used hardware may also be included.

The above objects, features, and advantages will become more apparent from the following detailed description taken with the accompanying drawings, so that those of ordinary skill in the art to which the invention pertains can readily practice its technical idea. In describing the invention, a detailed description of known technology related to the invention is omitted where it is judged that it could unnecessarily obscure the gist of the invention.

Hereinafter, a preferred embodiment according to the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram schematically showing an entire system according to an embodiment of the present invention.

Referring to FIG. 1, a system according to an embodiment of the present invention may include a terminal device 100, an audio analysis module 120, a local index module 130, a matching module 140, and an automatic speech recognition (ASR) system 200.

As shown in FIG. 1, the terminal device 100 may be a user terminal having one or more voice receivers that receive a spoken voice from a speaker and output a voice signal. It transmits the voice signal corresponding to the input spoken voice to the audio analysis module 120, and may be any of various devices capable of user input and information display, such as a mobile phone, smartphone, laptop computer, digital broadcasting terminal, PDA (Personal Digital Assistant), PMP (Portable Multimedia Player), or navigation device.

The automatic speech recognition apparatus according to an embodiment of the present invention may be configured as a local recognition system including an audio analysis module 120, a local index module 130, a local database 110, and a matching module 140, which are locally connected to the terminal device 100.

Here, the local recognition system may be provided separately from the automatic speech recognition system 200, which is located remotely. The local recognition system may obtain audio or text information that can be matched with the voice signal through the local database 110 and provide it to the terminal device 100. Alternatively, it may be connected to the automatic speech recognition system 200 over a network, by wire or wirelessly, provide speech voice data extracted from the voice signal to the automatic speech recognition system 200, and provide recognition result data from the automatic speech recognition system 200 to the terminal device 100.

In particular, the local recognition system according to an embodiment of the present invention may obtain matchable audio and text information through local audio analysis and conversion and provide it to the terminal device 100 first. Furthermore, forced EOS processing may be performed for end-of-speech (EOS) handling, so that the response processing delay of the terminal device 100 is reduced and user convenience is improved.

Meanwhile, speech voice data that is not matched in the local recognition system may be transferred to the automatic speech recognition system 200. The automatic speech recognition system 200 may perform conventional speech recognition processing on the speech data based on training data and transmit speech recognition data according to the recognition result to the terminal device 100. The speech recognition data may include, for example, text-based context data.

The terminal device 100 may receive, from the matching module 140 or the automatic speech recognition system 200, the matched audio/text information or the speech recognition data of the automatic speech recognition system 200, and perform response processing corresponding to the received information.

For example, the terminal device 100 may identify instruction information corresponding to the user's voice input from the matched audio/text information or the speech recognition data, and then process a user interface output corresponding to the instruction information, send a message or place a call to another user, or output a response voice.

Also, for example, the terminal device 100 may determine the query requested by the speaker based on word information of the speech data and provide corresponding processing.

Also, for example, when the call record information stored in the local database 110 shows no previous usage history, the terminal device 100 may provide the speaker with initial guidance information for the speech-data-based conversation service.

As described above, the local recognition system connected to the terminal device 100 may be connected to the automatic speech recognition system 200 over a network, by wire or wirelessly. For communication across networks, each terminal device 100, the modules of the local recognition system, and the automatic speech recognition system 200 may transmit and receive data through the Internet, a LAN, a WAN, a PSTN (Public Switched Telephone Network), a PSDN (Public Switched Data Network), a cable TV network, Wi-Fi, a mobile communication network, or other wireless communication networks. In addition, each terminal device 100 and the automatic speech recognition system 200 may include a communication module for communicating with the protocol corresponding to each communication network.

The local database 110, audio analysis module 120, local index module 130, and matching module 140 of the local recognition system described herein may be provided as internal modules located inside the terminal device 100, or implemented as external modules locally connected to the terminal device 100 through an external interface. Accordingly, the automatic speech recognition apparatus according to an embodiment of the present invention may consist of the terminal device 100 alone, of a local recognition system connected as an external extension, or of the entire system including the automatic speech recognition system 200, and is not limited to any particular configuration.

More specifically, when a voice signal is output from the spoken voice received by the voice receiver of the terminal device 100, the audio analysis module 120 of the automatic speech recognition apparatus according to an embodiment of the present invention may obtain speech data by analyzing the voice signal.

Here, the speech data may include voice data classified as speech voice by a local classification process applied to the voice signal.

The audio analysis module 120 may first request local conversion of the voice data from the index module 130. The index module 130 may use the local database 110 to look up audio or text information that matches the voice data and pass the lookup result to the matching module 140.

When the lookup result for the voice data contains matched audio/text information, the matching module 140 may provide the matched audio/text to the terminal device 100 as the response result corresponding to the voice signal.

When the lookup yields no matched audio/text information, the matching module 140 may instead pass the voice data classified as speech voice to the automatic speech recognition system 200.
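
The local-first lookup with remote fallback described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function name `recognize`, the use of a plain dictionary as the local index, the string key standing in for analyzed voice data, and the `remote_asr` callable are all assumptions made for the example.

```python
from typing import Callable, Optional


def recognize(speech_key: str,
              local_index: dict[str, str],
              remote_asr: Callable[[str], str]) -> tuple[str, str]:
    """Return (source, result), trying the local index before the remote ASR.

    speech_key is a hypothetical stand-in for the analyzed voice data;
    local_index plays the role of the local database 110 lookup and
    remote_asr the role of the remote recognition system 200.
    """
    matched: Optional[str] = local_index.get(speech_key)
    if matched is not None:
        # Matched locally: answer at once, with no network round trip.
        return "local", matched
    # No local match: hand the speech data to the remote recognizer.
    return "remote", remote_asr(speech_key)
```

In this sketch the remote call is only made on a local miss, which is where the latency saving in the description would come from.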

Accordingly, the automatic speech recognition system 200 may provide the terminal device 100 with automatic speech recognition data produced by conventional automated speech recognition processing.

Meanwhile, the speech data that the audio analysis module 120 obtains by analyzing the voice signal may include speech analysis information. Here, the speech analysis information may include at least one of validity (Value) information of the speech data detected by the analysis of the voice signal, classification (Categorize) information of the speech data, and disfluency information of the speech data. Each kind of analysis information is described in detail later.

Accordingly, the audio analysis module 120 may diversify its analysis of the voice signal to generate both the voice information resulting from the speech voice classification and the speech analysis information. In particular, the speech data according to an embodiment of the present invention may be used by the matching module 140 in the process of determining the end-of-speech (EOS) timer window.

Here, the EOS timer window according to an embodiment of the present invention may represent time interval information that is set and managed by the matching module 140 in order to detect the end of speech locally, independently of the remote automatic speech recognition system 200.

Accordingly, the matching module 140 has the speech data obtained while the speech end timer window lasts recognized automatically through the local index module 130 or the automatic speech recognition system 200, so that a response is returned to the terminal device 100. When the speech end timer window expires, the matching module 140 generates forced EOS information and provides it to the terminal device 100, which enables local end-of-speech handling independent of the automatic speech recognition system 200.

In particular, the matching module 140 determines the speech end point by adaptively varying the speech end timer window according to at least one of the validity information of the speech data detected by the analysis of the voice signal, the classification information of the speech data, and the disfluency information of the speech data. This shortens the response delay quickly and effectively compared with an existing system that uses only the automatic speech recognition system 200.
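
As a rough illustration of how a timer window might be varied from the three kinds of analysis information, consider the sketch below. The text here does not specify the adjustment rule, so everything concrete in it is an assumption: the `SpeechAnalysis` fields, the base length, the per-category offsets, the direction of each adjustment, and the clamping bounds are hypothetical values chosen only to make the example runnable.

```python
from dataclasses import dataclass


@dataclass
class SpeechAnalysis:
    validity: float   # hypothetical 0.0-1.0 score for the validity (value) information
    category: str     # hypothetical label for the classification information
    disfluent: bool   # True if a disfluency (e.g. a hesitation) was detected


BASE_MS = 500                                       # assumed default window length
CATEGORY_MS = {"command": -200, "dictation": 300}   # assumed per-category offsets


def eos_window_ms(a: SpeechAnalysis, min_ms: int = 200, max_ms: int = 2000) -> int:
    """Return an adapted speech end timer window length in milliseconds."""
    window = BASE_MS + CATEGORY_MS.get(a.category, 0)
    if a.disfluent:
        window += 400   # assumption: wait longer, the speaker may not be finished
    if a.validity >= 0.8:
        window -= 100   # assumption: high-validity speech can be closed sooner
    # Clamp so the window never collapses to zero or grows without bound.
    return max(min_ms, min(max_ms, window))
```

A value computed this way could serve as the time limit of a silence timer, so that a short command ends quickly while a hesitant utterance is given more time.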

In addition, the local database 110 may store a probability model based on previous query and response information of the terminal device 100, and the index module 130 may look up response information matching the text information extracted by the audio analysis module 120 and pass it to the matching module 140.

Here, the local database 110 can further improve accuracy by accumulating input voice data and response command data and updating the probability model.
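
One simple way to picture a probability model that improves as queries and responses accumulate is a frequency table, sketched below. The patent does not state what kind of model is stored, so the class `LocalResponseModel`, its methods, and the relative-frequency estimate are illustrative assumptions rather than the described design.

```python
from collections import Counter, defaultdict


class LocalResponseModel:
    """Hypothetical frequency-based model of past query/response pairs."""

    def __init__(self):
        # Maps query text to a count of each response command seen for it.
        self._counts = defaultdict(Counter)

    def update(self, query_text, response):
        """Accumulate one observed (query, response) pair."""
        self._counts[query_text][response] += 1

    def best_response(self, query_text):
        """Return (response, estimated probability), or None if unseen."""
        counts = self._counts.get(query_text)
        if not counts:
            return None
        response, n = counts.most_common(1)[0]
        return response, n / sum(counts.values())
```

Each call to `update` shifts the estimated probabilities, so repeated use gradually sharpens which response is returned for a familiar query.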

The matching module 140 may provide the response information to the terminal device 100 when a match is confirmed. A response result can thus be provided early, without the user's spoken query being sent to the remote automatic speech recognition system 200, and the EOS can be determined early on the basis of the response result match.

Therefore, according to an embodiment of the present invention, commands can be processed much faster than with a lookup process that uses only the remote automatic speech recognition system 200.

FIGS. 2 to 6 are diagrams for explaining the speech recognition and waiting time reduction process according to an embodiment of the present invention by section of the voice data.

For example, because it is not easy to process a language model with the hardware specifications of a typical user terminal alone, an automatic speech recognition system provided as a remote server is used. Such a general automatic speech recognition system provides intermediate results corresponding to the speech voice data requested by the terminal, but the speech recognition system itself determines when end of speech (EOS) is called, and it provides final sentence information upon the EOS call.

A general automatic speech recognition system is configured to call EOS when a predetermined silence time frame is identified, and it can interpret the voice data up to the point EOS is called as text data. A delay caused by the system's processing time for EOS determination is therefore unavoidable, and user inconvenience may increase. In addition, since the voice may pause for any reason while the user is speaking a command, a false EOS may also be called.

For example, referring to FIG. 2, the total waiting time experienced by the user of a typical automatic speech recognition system may be expressed by an equation based on a log-linear model.

Here, (μ) may be a response variable representing the network delay 13, (α) may be an independent variable representing the automatic speech recognition (ASR) processing time 12, (β) may be a blocking variable representing the ASR processing time 11 for EOS detection, and (e) may represent an error rate for results that do not match what the user expected. (X) may represent the utterance length 14 determined by the EOS call time point. For example, if an utterance is repeated because of an unexpected result, the overall X may increase.

In such a conventional automatic speech recognition system, an EOS time-out may start when silence is detected in the audio input. For example, if audio is detected before the EOS time limit (100 seconds, section 11 in FIG. 2) elapses, the time limit is reset. If no audio is detected within the EOS time limit section 11, EOS may be called. Thereafter, a time point 15 arises at which the result of the audio processing by the automatic speech recognition system is provided to the terminal.
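
The conventional fixed time-out behaviour just described (start timing on silence, reset on audio, call EOS when the limit is reached) can be sketched as below. The frame-based representation, the function name `fixed_timeout_eos`, and the 20 ms frame length are assumptions made for the illustration.

```python
def fixed_timeout_eos(frames, timeout_ms, frame_ms=20):
    """Return the index of the frame at which EOS is called, or None.

    frames is a sequence of booleans, True where audio was detected.
    The frame representation and default frame length are assumptions.
    """
    silence_ms = 0
    for i, has_audio in enumerate(frames):
        if has_audio:
            silence_ms = 0              # audio detected: the time limit is reset
        else:
            silence_ms += frame_ms      # silence: the time-out keeps running
            if silence_ms >= timeout_ms:
                return i                # limit reached with no audio: EOS is called
    return None                         # input ended before the limit was reached
```

Because the limit is fixed, every utterance pays the full time-out after its last audible frame, and a pause longer than the limit triggers EOS in the middle of an utterance.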

FIG. 3 shows how a typical automatic speech recognition system examines voice input data. The voice data may include burst and silence patterns (21, 28, 22, 29) and may be organized as continuous data over a plurality of time frames. In a typical system, the end of speech (EOS) is called at a point (27) after the plurality of time frames has ended, so that a complete sentence can be constructed.

However, the matching module 140 according to an embodiment of the present invention can, based on the analysis process of the audio analysis module 120, call an early EOS quickly in an intermediate silence section (31), which shortens the waiting time experienced by the user. The matching module 140 can also compensate for interpretation errors caused by early EOS detection by providing the result of the index module 130 indexing the local database 110. To this end, the local recognition system may be provided as a module residing inside the terminal device 100, and the local database 110 may include a cache database storing the operation history of the terminal device 100 or a local TTS/STT conversion engine database.

When the early EOS is called quickly by the local process in this way, the waiting time perceived by the user can be greatly reduced because of the lower delay, yielding the ideal effect predicted by the model of Equation 1 above.

However, early EOS detection may cause misinterpretation, and re-querying may then add waiting time. To index the results a user wants in advance, the local database 110 according to an embodiment of the present invention may therefore record previous query information (QUERY) of the terminal device 100 and cache response information corresponding to frequently asked queries.

Accordingly, the local database 110 can build a probability model for reducing the likelihood of an incorrect response, and can provide the response information indexed through the index module 130 to the matching module 140.
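As a rough illustration of caching responses to frequent queries, the sketch below counts how often each query has been seen and serves a cached response only once a query is frequent enough. The class name `LocalQueryCache`, the `min_hits` threshold, and the use of relative frequency as the "probability model" are assumptions for the example; the description does not specify how the model is built.

```python
from collections import Counter
from typing import Dict, Optional


class LocalQueryCache:
    """Hypothetical stand-in for the cache part of the local database 110."""

    def __init__(self, min_hits: int = 2) -> None:
        self.min_hits = min_hits            # assumed threshold for "frequently asked"
        self.hits: Counter = Counter()      # query -> number of times recorded
        self.responses: Dict[str, str] = {}

    @staticmethod
    def _key(query: str) -> str:
        return query.strip().lower()

    def record(self, query: str, response: str) -> None:
        """Record a previous query together with the response it received."""
        key = self._key(query)
        self.hits[key] += 1
        self.responses[key] = response

    def probability(self, query: str) -> float:
        """Relative frequency of the query among all recorded queries."""
        total = sum(self.hits.values())
        return self.hits[self._key(query)] / total if total else 0.0

    def lookup(self, query: str) -> Optional[str]:
        """Return the cached response only for sufficiently frequent queries."""
        key = self._key(query)
        if self.hits[key] >= self.min_hits:
            return self.responses[key]
        return None
```

A `None` result from `lookup` corresponds to the case in which no local match exists and the speech data would have to be handled elsewhere.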

Meanwhile, to compensate for the potential error of misinterpreting noise and triggering an unwanted early EOS, the audio analysis module 120 may filter out non-contextual data through noise analysis and voice activity detection. Voice data not classified as speech data can thus be blocked before it is passed to the index module 130 or the automatic speech recognition system 200. This local filtering reduces unnecessary data transmission to the automatic speech recognition system 200 and optimizes resource usage.

For this processing, the audio analysis module 120 may first monitor the noise level of the incoming voice data to make an initial distinction between noise sections and voice sections, and may then classify and extract only the speech data from the voice signal by detecting burst patterns in the voice sections or by detecting continuous noise level data.

The audio analysis module 120 may also classify and extract only speech data that can be regarded as human speech by setting a range of voice parameters in the frequency domain.

Meanwhile, FIGS. 4 and 5 show voice data waveforms that include disfluency sections.

In the speech recognition system according to an embodiment of the present invention, so that no additional delay arises from returning meaningless data or from having to repeat a voice command, the matching module 140 may check for the presence of a disfluency section and, when disfluency is found, increase the importance weight of the next word to be recognized.

In particular, Korean follows a subject-object-verb order and the subject is often omitted entirely, so disfluency information can serve as ideal information for detecting an early EOS while still supporting proper text and context recognition. In three months of data collection and testing based on a system according to an embodiment of the present invention, more than 90% of utterances were confirmed to take the form "subject-object-verb" or "object-verb".

More specifically, referring to FIG. 4, when disfluency sections (400, 402) occur in a typical automatic speech recognition system, such as stuttering in which fillers like "아" ("ah") and "그" ("um") alternate with silence, the system produces an interpretation with little relevance to the situation. Depending on the preset size of the time frame, a false EOS may also be called in the silence sections (401, 403). Moreover, the system sometimes accepts no further input while the incorrect input or EOS is being processed and the final result returned. The user then has to repeat the intended utterance, and the overall waiting time increases.

To solve this, the matching module 140 according to an embodiment of the present invention may locally set an EOS time window for EOS detection and vary its size. In particular, when a disfluency section is detected, it may extend the EOS time window while increasing the recognition weight of the next word. For example, the matching module 140 may extend the EOS time window in the silence sections (401, 403) caused by the disfluent speech in FIG. 4, so that EOS detection is suppressed there.

The matching module 140 may also reset the end time of the EOS time window to the point at which the voice data of consecutive frames ends, so that the extended EOS call point (408) is reset accordingly.
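One way to picture a locally managed, variable EOS time window is as a deadline that is re-anchored each time a run of voiced frames ends and pushed further out when that run was disfluent. The sketch below assumes millisecond timestamps and a single fixed extension; `AdaptiveEosWindow`, `base_ms`, and `extension_ms` are hypothetical names and values, not taken from the description.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveEosWindow:
    base_ms: int            # hypothetical default window length
    extension_ms: int       # hypothetical extra time granted after a disfluency
    deadline_ms: int = 0    # absolute time at which EOS would be called

    def start(self, now_ms: int) -> None:
        """Open the window when silence is first detected."""
        self.deadline_ms = now_ms + self.base_ms

    def on_voice_end(self, voice_end_ms: int, disfluent: bool) -> None:
        """Re-anchor the window at the end of the latest run of voiced frames.

        If that run was a disfluency, the window is extended so that the
        silence following it does not trigger EOS.
        """
        extra = self.extension_ms if disfluent else 0
        self.deadline_ms = voice_end_ms + self.base_ms + extra

    def expired(self, now_ms: int) -> bool:
        return now_ms >= self.deadline_ms
```

For instance, with `base_ms=300` and `extension_ms=700`, a filler ending at 1000 ms moves the deadline to 2000 ms, whereas an ordinary word ending at the same time would leave it at 1300 ms.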

FIG. 5 illustrates how dynamically resizing the EOS time window shortens the delay. When the voice data includes burst data arising from disfluency, the matching module 140 may change the EOS time window dynamically according to the utterance length.

Referring to FIG. 5, typical EOS detection must wait for a fixed-length time frame section (410) after the utterance ends. In contrast, for voice data that includes disfluency, the matching module 140 according to an embodiment of the present invention applies a dynamic reduction (411) of the EOS time window corresponding to the utterance length, so the user's waiting time is reduced by the difference (412).

Meanwhile, FIG. 6 compares, against the prior art, the reduction in processing delay for voice data that includes disfluency according to an embodiment of the present invention.

As shown in FIG. 6(A), in the prior art, when voice sections containing disfluency (500, 501) and a silence section (503) occur in the speaker's speech, the speech recognition system detects an EOS section (515) because of the silence and, after the recognition processing time (505), returns an incorrect context recognition result (506).

As a result, the user receives no response to the actual query speech (502) and must speak again (504). A further EOS section (516) must then be detected and a further recognition time (514) spent, so the response arrives only after an additional delay (517).

In contrast, referring to FIG. 6(B), which shows the process according to an embodiment of the present invention, even when voice input containing disfluency sections (508, 509) is received from the speaker, the matching module 140 sets the EOS time window locally and extends its end time upon detecting disfluency, and can thus wait until the actual query speech (510) has been received.

An early EOS is then detected when the EOS window expires (511), and the recognition result can be provided to the user after only the corresponding processing time (512).

The waiting time is thus clearly shorter than in the prior art by the difference (513), which can be a major factor in improving user convenience.

FIGS. 7 to 9 are flowcharts describing each process of the speech recognition method according to an embodiment of the present invention in more detail.

FIG. 7 is a flowchart giving an overview of the local speech recognition process according to an embodiment of the present invention.

Referring to FIG. 7, the audio analysis module 120 first performs noise analysis on the voice signal received from the terminal device 100 (S101).

The audio analysis module 120 then detects voice activity information based on the noise analysis (S103).

Next, the audio analysis module 120 checks the continuity (successivity) of the speech data (S105).

The audio analysis module 120 then determines, from the combined analysis results, whether the voice signal can be classified as speech data (S107) and, if so, passes the voice signal to the index module 130 to enter the local automatic speech recognition process (S109).

Here, the matching module 140 may set a local EOS time window for the automatic speech recognition process.

If it is determined in step S107 that, owing to detected noise, the signal cannot be classified as speech data, the audio analysis module 120 may perform noise analysis on the voice signal of the next time frame.

In the local automatic speech recognition process, the index module 130 pre-categorizes the speech data (S111). Pre-categorization may include matching the speech data to preset category values.

The matching module 140 then determines from the analysis result of the audio analysis module 120 whether a disfluency section exists in the speech data (S113) and whether the speech data has content validity (S115). If there is no disfluency and the content is valid, it performs matching based on data analysis of the local database 110 through the index module 130 (S117).

Here, the matching may be based on history information cached in the local database 110 or may be performed by the TTS and STT engines installed in the local database 110, and the matching module 140 may check the response from the index module 130 to confirm whether a match was found (S119).

If a disfluency section exists in the speech data in step S113, the speech data lacks content validity in step S115, or no match is confirmed in step S119, the matching module 140 extends the EOS time window by a preset time period (S125).

The matching module 140 then checks whether the time window has expired (S127), calls EOS at the point of expiry and determines the end of the sentence based on the matching confirmation (S121), and, if output data exists for the ended sentence, provides it to the terminal device 100 so that the data processing appropriate to the sentence response is performed (S123).
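The branch structure of steps S113 to S125 can be sketched as a single function. This is only an illustration of the decision order described for FIG. 7: the function name, its parameters, and the representation of the local database lookup as a callable are assumptions for the example.

```python
from typing import Callable, Optional, Tuple


def local_match_or_extend(
    text: str,
    has_disfluency: bool,
    content_is_valid: bool,
    lookup: Callable[[str], Optional[str]],   # hypothetical local database query
    window_ms: int,
    extension_ms: int,                        # hypothetical preset extension (S125)
) -> Tuple[Optional[str], int]:
    """Return (response, window length) after one pass through S113 to S125."""
    if not has_disfluency and content_is_valid:      # S113, S115
        response = lookup(text)                      # S117
        if response is not None:                     # S119: match confirmed
            return response, window_ms
    # Disfluency, invalid content, or no match: extend the EOS time window (S125).
    return None, window_ms + extension_ms
```

A dictionary's `get` method can stand in for `lookup`; a returned response of `None` means the caller keeps waiting until the extended window expires (S127).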

Meanwhile, FIG. 8 is a flowchart showing the speech data classification process according to an embodiment of the present invention in more detail.

Referring to FIG. 8, the audio analysis module 120 first reads data from the voice buffer (S201).

The audio analysis module 120 then detects voice activity information for each data frame (S203), evaluates the per-frame activity information (S205), determines consecutive activity from it (S207), and determines from the consecutive activity information whether a continuous feature corresponding to a speech pattern is present (S209), thereby determining whether the data can be classified as speech data (S211).

Depending on that determination, the audio analysis module 120 classifies the data as speech data if classification is possible (S213) and ignores the data if it is not (S215).

Through this processing, the audio analysis module 120 can minimize the effect of invalid audio data on EOS calls before the voice data enters the local automatic speech recognition process or is passed to the automatic speech recognition system 200. To this end, the audio analysis module 120 may check the voice activity information of the data read from the voice buffer, detect voice activity on the basis of fixed time frames, and determine whether consecutive activity is detected in order to set an expectation for the composition of the next frame.

For example, the audio analysis module 120 may predict the likelihood of voice in the next frame from the voice activity detected in the current frame of a continuous waveform. Because speech data tends to contain silence sections between stretches of voice data, analyzing the time frame pattern of voice activity makes it possible to detect whether the voice data is noise or contains speech data.
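The frame-based classification of FIG. 8 might look roughly like the sketch below. It assumes that per-frame activity is decided by a mean-absolute-amplitude threshold and that speech is recognized by a sustained burst combined with at least one silent gap; these criteria, and the names `frame_is_active`, `longest_active_run`, and `is_speech`, are illustrative assumptions rather than the method fixed by the description.

```python
from typing import List, Sequence


def frame_is_active(frame: Sequence[float], threshold: float) -> bool:
    """S203/S205: a frame is active when its mean absolute amplitude exceeds the threshold."""
    if not frame:
        return False
    return sum(abs(sample) for sample in frame) / len(frame) > threshold


def longest_active_run(flags: Sequence[bool]) -> int:
    """S207: length of the longest run of consecutive active frames."""
    best = run = 0
    for active in flags:
        run = run + 1 if active else 0
        best = max(best, run)
    return best


def is_speech(frames: Sequence[Sequence[float]], threshold: float, min_run: int) -> bool:
    """S209/S211: speech-like data shows a sustained burst and at least one silent gap."""
    flags: List[bool] = [frame_is_active(frame, threshold) for frame in frames]
    has_burst = longest_active_run(flags) >= min_run
    has_gap = not all(flags)   # uninterrupted activity is treated as steady noise
    return has_burst and has_gap
```

Data for which `is_speech` returns `False` would be ignored (S215) rather than passed on for recognition.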

More specifically, the audio analysis module 120 may detect, from the voice signal, a silence pattern and burst voice data corresponding to speech prediction data, and may classify the speech prediction data as speech data according to a silence-to-sound level ratio computed for the speech prediction data.

The audio analysis module 120 may also determine a noise level from the detection of a continuously sustained level in the voice signal, and determine the content validity (valuable) of the speech data according to the ratio of non-silence and non-noise sections to the voice signal samples. The matching module 140 may then decide, according to the content validity analysis result, whether to perform matching based on data analysis.
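The two ratios mentioned above can be illustrated with a small sketch that operates on a list of per-sample levels. The thresholds, the tolerance around the noise level, and the function names are assumptions for the example, and the noise level is taken as an input rather than estimated.

```python
from typing import Sequence


def silence_to_sound_ratio(levels: Sequence[float], silence_threshold: float) -> float:
    """Ratio of silent samples to non-silent samples."""
    silent = sum(1 for level in levels if level < silence_threshold)
    sound = len(levels) - silent
    return silent / sound if sound else float("inf")


def content_validity(
    levels: Sequence[float],
    silence_threshold: float,
    noise_level: float,     # assumed to come from a sustained-level detector
    tolerance: float,       # hypothetical band around the noise level
) -> float:
    """Fraction of samples that are neither silence nor at the steady noise level."""
    if not levels:
        return 0.0
    useful = sum(
        1
        for level in levels
        if level >= silence_threshold and abs(level - noise_level) > tolerance
    )
    return useful / len(levels)
```

A matching step could then be skipped whenever `content_validity` falls below some chosen cut-off, which is again a design parameter not given in the text.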

Meanwhile, FIG. 9 is a flowchart showing a speech recognition process based on the presence or absence of disfluency according to an embodiment of the present invention.

Referring to FIG. 9, when voice data is received by the audio analysis module 120 (S301) and speech data is classified (S303), the matching module 140 sets an end-of-speech (EOS) timer window and starts the corresponding EOS timer (S305).

The matching module 140 then obtains text information corresponding to the speech data through the cache data of the local database 110 or the STT/TTS engine (S307).

Here, the matching module 140 determines, from the analysis by the audio analysis module 120, whether disfluency is present in the speech data (S309). If it is, the matching module extends the EOS timer (S311) and increases the word importance weight for the next word (S313).

If no disfluency is found, the matching module 140 performs response matching analysis through the index module 130 based on the importance weights of the text information (S315) and determines whether an unnecessary word was predicted (S317). If the word is not unnecessary, it determines whether the EOS timer expired before any further audio arrived (S319), returns the output corresponding to the response result, and causes the corresponding command data to be processed by the terminal device 100 (S321).

If an unnecessary word was predicted in step S317, or the EOS timer has not expired in step S319, the audio analysis module 120 receives and processes the next voice data.

The matching module 140 may also store and manage, in the local database 110, a disfluency word list and predicted keyword weight information used to obtain the disfluency information.

Accordingly, the matching module 140 may set the variable value of the speech end timer window corresponding to the disfluency information on the basis of pre-stored information, with the degree of variation for the disfluency information determined on the basis of the predicted keyword weight information.
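Steps S309 to S313, together with a stored disfluency word list and keyword weights, can be sketched as follows. The table contents, the fixed per-filler extension, and the multiplicative boost are all hypothetical; in particular, the text does not specify how the predicted keyword weights determine the amount by which the timer window varies, so this sketch does not model that relationship.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple


@dataclass
class DisfluencyTable:
    """Hypothetical locally stored disfluency list and keyword weights."""
    extension_ms: Dict[str, int] = field(
        default_factory=lambda: {"uh": 400, "um": 600}
    )
    keyword_weight: Dict[str, float] = field(
        default_factory=lambda: {"play": 1.5}
    )
    boost: float = 2.0   # assumed multiplier for the word following a filler


def process_words(
    words: Sequence[str], table: DisfluencyTable
) -> Tuple[List[Tuple[str, float]], int]:
    """Return the weighted non-filler words and the total EOS timer extension."""
    weighted: List[Tuple[str, float]] = []
    total_extension = 0
    boost_next = False
    for word in words:
        if word in table.extension_ms:                     # S309: disfluency found
            total_extension += table.extension_ms[word]    # S311: extend the timer
            boost_next = True                              # S313: weight next word
            continue
        weight = table.keyword_weight.get(word, 1.0)
        if boost_next:
            weight *= table.boost
        weighted.append((word, weight))
        boost_next = False
    return weighted, total_extension
```

The weighted words would then feed the response matching analysis (S315), with fillers themselves excluded from matching.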

FIG. 10 is a ladder diagram illustrating the difference in waiting time between the speech recognition process according to an embodiment of the present invention and the existing process.

Referring to FIG. 10(A), a prior-art terminal device transmits end-of-speech data when the speech ends, and the speech response is delayed by the EOS detection of the ASR system.

In contrast, referring to FIG. 10(B), the terminal device 100 according to an embodiment of the present invention can have EOS forced by the local recognition system even when the same end of speech is reached. Because the EOS time window is varied, the final recognition result is processed earlier and the speech response can be provided to the speaker faster than in the prior art.

Meanwhile, the methods according to the various embodiments of the present invention described above may be implemented as a program, stored in various non-transitory computer-readable media, and provided to servers or devices. The user terminal 100 may then connect to the server or device and download the program.

A non-transitory readable medium is not a medium that stores data only briefly, such as a register, cache, or memory, but one that stores data semi-permanently and can be read by a device. Specifically, the applications and programs described above may be provided stored on a non-transitory readable medium such as a CD, DVD, hard disk, Blu-ray disc, USB drive, memory card, or ROM.

Although preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described. Various modifications may of course be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed, and such modifications should not be understood separately from the technical spirit or outlook of the present invention.

Claims

An automatic speech recognition device, comprising:
a local speech recognition system configured to receive a speaker's voice signal from a user terminal, obtain speech data by analyzing the voice signal, and either provide response information corresponding to the speech data through local automatic speech recognition for as long as a speech end timer window is maintained,
or transmit speech data not matched by the local speech recognition to a remote automatic speech recognition system so that response information corresponding to the speech data is provided to the user terminal.

The device of claim 1,
wherein the speech end timer window is varied by a matching module driven by a separate local process distinct from the remote automatic speech recognition system.

The device of claim 1,
wherein the local speech recognition system comprises:
an audio analysis module configured to receive the speaker's voice signal and, by analyzing it, obtain voice data that includes speech data classified as speech by a local classification process;
a local index module configured to receive the voice data from the audio analysis module, search a local database for audio or text information matching the speech data, and output an index result; and
a matching module configured to provide the matched audio or text information to the terminal device as the response to the voice signal when the index result from the local index module contains matching audio or text information, and to transmit the voice data classified as speech data to the remote automatic speech recognition system when the index result contains no matching audio or text information.

The device of claim 3,
wherein the voice data obtained by the audio analysis module through analysis of the voice signal includes speech analysis information, and
the speech analysis information includes at least one of validity (Value) information of the speech data detected by the voice signal analysis, classification (Categorize) information of the speech data, and disfluency information of the speech data.

The device of claim 3,
wherein the matching module determines the end point of speech by adaptively varying the speech end timer window according to at least one of validity information of the speech data, classification information of the speech data, and disfluency information of the speech data detected by analysis of the voice signal.

The device of claim 1,
wherein the local speech recognition system causes the speech data to be recognized automatically, through local speech recognition or the remote automatic speech recognition system, for as long as the speech end timer window is maintained, and causes the response to be processed by the terminal device, and wherein, when the speech end timer window expires, the local speech recognition system generates EOS information and provides it to the terminal device so that speech can be ended locally, independently of the remote automatic speech recognition system.

The device of claim 3,
wherein the matching module extends the end time of the speech end timer window when the disfluency information indicates that a disfluency section exists in the speech data.

The device of claim 7,
wherein the matching module increases the importance weight for recognizing the next word when a disfluency section exists in the speech data.