KR20190062369A - Speech-controlled apparatus for preventing false detections of keyword and method of operating the same

Info

Publication number: KR20190062369A
Application number: KR1020190064068A
Authority: KR
Inventors: 김병열; 한익상; 권오혁; 이봉진; 오명우; 최민석; 이찬규; 임정희; 최지수; 강한용; 김수환; 최정아
Original assignee: 네이버 주식회사; 라인 가부시키가이샤
Priority date: 2019-05-30
Filing date: 2019-05-30
Publication date: 2019-06-05
Also published as: KR102061206B1

Abstract

Provided are a voice control device capable of preventing keyword misrecognition and a method for operating the same. According to the present invention, the method comprises the steps of: receiving an audio signal corresponding to a surrounding sound and generating audio stream data; detecting a candidate keyword corresponding to a predefined keyword from the audio stream data and determining a first section corresponding to first audio data in which the candidate keyword is detected from the audio stream data; extracting a first speaker feature vector for the first audio data; extracting a second speaker feature vector for second audio data corresponding to a second section preceding the first section from the audio stream data; and determining whether the keyword is included in the first audio data on the basis of the similarity between the first and second speaker feature vectors and determining whether to perform wake-up.

Description

Technical Field

The present disclosure relates to a voice control apparatus and, more particularly, to a voice control apparatus capable of preventing a keyword from being misrecognized and a method of operating the same.

Background

As the performance of computing devices such as portable communication devices, desktops, tablets, and entertainment systems advances, voice-controlled electronic devices equipped with a speech recognition function are being released to improve operability. The speech recognition function has the advantage that a device can be controlled easily by recognizing the user's voice, without operating a separate button or touching a touch module.

With such a speech recognition function, a portable communication device such as a smartphone can, for example, place a call or compose a text message without any button press, and various functions such as route finding, internet search, and alarm setting can be set up easily. However, such a voice control apparatus may misrecognize the user's speech and perform an unintended operation.

An embodiment may provide a voice control apparatus capable of preventing a keyword from being misrecognized, and a method of operating the same.

As a technical means for achieving the above technical object, a first aspect of the present disclosure may provide a voice control apparatus comprising: an audio processing unit that receives an audio signal corresponding to ambient sound and generates audio stream data; a keyword detection unit that detects, from the audio stream data, a candidate keyword corresponding to a predefined keyword and determines a first section of the audio stream data corresponding to first audio data in which the candidate keyword is detected; a speaker feature vector extraction unit that extracts a first speaker feature vector for the first audio data and extracts a second speaker feature vector for second audio data corresponding to a second section of the audio stream data that precedes the first section; and a wake-up determination unit that determines whether the keyword is included in the first audio data on the basis of the similarity between the first speaker feature vector and the second speaker feature vector.
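The relationship between the first section and the second section can be sketched as follows. This is a minimal illustration, not the apparatus's actual implementation: the `Section` type, the function names, and the `second_length` parameter (how much preceding audio to keep) are hypothetical, and the stream is modeled as a plain list of samples.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Section:
    """Half-open sample range [start, end) within the audio stream buffer."""
    start: int
    end: int


def preceding_section(first: Section, length: int) -> Section:
    """Second section: up to `length` samples ending where the first section starts.

    `length` is an assumed tuning parameter; the source does not fix it here.
    """
    start = max(0, first.start - length)  # clamp at the beginning of the buffer
    return Section(start, first.start)


def slice_sections(
    stream: Sequence[float], first: Section, second_length: int
) -> Tuple[List[float], List[float]]:
    """Return (first audio data, second audio data) cut from the stream."""
    second = preceding_section(first, second_length)
    first_audio = list(stream[first.start:first.end])
    second_audio = list(stream[second.start:second.end])
    return first_audio, second_audio
```

For a candidate keyword detected at samples 6 to 9 of a ten-sample stream with `second_length=4`, the second audio data is the four samples immediately before the keyword. If the keyword sits near the start of the buffer, the second section is simply shorter.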

A second aspect of the present disclosure may provide a method of operating a voice control apparatus, the method comprising: receiving an audio signal corresponding to ambient sound and generating audio stream data; detecting, from the audio stream data, a candidate keyword corresponding to a predefined keyword and determining a first section of the audio stream data corresponding to first audio data in which the candidate keyword is detected; extracting a first speaker feature vector for the first audio data; extracting a second speaker feature vector for second audio data corresponding to a second section of the audio stream data that precedes the first section; and determining whether the keyword is included in the first audio data on the basis of the similarity between the first speaker feature vector and the second speaker feature vector, and determining whether to wake up.
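The final step, deciding on wake-up from the similarity of the two speaker feature vectors, might look like the sketch below. Several points are assumptions rather than statements from this section: cosine similarity as the similarity measure, the threshold value `0.8`, and the decision direction (a high similarity is taken to mean the same speaker was already talking just before the candidate keyword, so the detection is treated as part of ordinary conversation and wake-up is suppressed). The function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("speaker feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # e.g. silence before the keyword yields no speaker evidence
    return dot / (norm_a * norm_b)


def should_wake_up(
    first_vec: Sequence[float],
    second_vec: Sequence[float],
    threshold: float = 0.8,  # assumed value, to be tuned on real data
) -> bool:
    """Assumed policy: wake up only if the preceding audio does NOT sound like
    the speaker of the candidate keyword."""
    return cosine_similarity(first_vec, second_vec) < threshold
```

Under this assumed policy, a keyword spoken after silence or after a different voice triggers wake-up, while a similar-sounding word in the middle of one person's continuous speech does not.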

A third aspect of the present disclosure may provide a computer-readable recording medium on which are recorded one or more programs including instructions that cause a processor of a voice control apparatus to execute the operating method according to the second aspect.

According to various embodiments of the present disclosure, the likelihood of misrecognizing a keyword is reduced, so malfunction of the voice control apparatus can be prevented.

FIG. 1 is a diagram illustrating an example of a network environment according to an embodiment.
FIG. 2 is a block diagram illustrating the internal configuration of an electronic device and a server according to an embodiment.
FIG. 3 is a diagram illustrating an example of functional blocks that a processor of a voice control apparatus according to an embodiment may include.
FIG. 4 is a flowchart illustrating an example of an operating method that a voice control apparatus may perform according to an embodiment.
FIG. 5 is a flowchart illustrating an example of an operating method that a voice control apparatus may perform according to another embodiment.
FIG. 6A shows an example in which a standalone command keyword is uttered when a voice control apparatus according to an embodiment executes the operating method of FIG. 5.
FIG. 6B shows an example in which ordinary conversational speech is uttered when a voice control apparatus according to an embodiment executes the operating method of FIG. 5.
FIG. 7 is a flowchart illustrating an example of an operating method that a voice control apparatus may perform according to yet another embodiment.
FIG. 8A shows an example in which a wake-up keyword and a natural-language voice command are uttered when a voice control apparatus according to an embodiment executes the operating method of FIG. 7.
FIG. 8B shows an example in which ordinary conversational speech is uttered when a voice control apparatus according to an embodiment executes the operating method of FIG. 7.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted to describe the present invention clearly, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case of being "directly connected" but also the case of being "electrically connected" with another element in between. Also, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

Phrases such as "in some embodiments" or "in an embodiment" appearing in various places in this specification do not necessarily all refer to the same embodiment.

Some embodiments may be represented by functional block configurations and various processing steps. Some or all of these functional blocks may be implemented with any number of hardware and/or software configurations that perform particular functions. For example, the functional blocks of the present disclosure may be implemented by one or more microprocessors or by circuit configurations for a given function. The functional blocks of the present disclosure may also be implemented in various programming or scripting languages, and may be implemented as algorithms running on one or more processors. In addition, the present disclosure may employ conventional techniques for electronic configuration, signal processing, and/or data processing. Terms such as "module" and "configuration" may be used broadly and are not limited to mechanical and physical configurations.

Also, the connecting lines or connecting members between the components shown in the drawings merely illustrate functional connections and/or physical or circuit connections. In an actual device, the connections between components may be represented by various functional, physical, or circuit connections that are replaceable or added.

In the present disclosure, a keyword refers to voice information that can wake up a specific function of the voice control apparatus. A keyword is based on the user's speech signal and may be either a standalone command keyword or a wake-up keyword. A wake-up keyword is a voice-based keyword that can switch a voice control apparatus in sleep mode to wake-up mode, for example a spoken keyword such as "Clova" or "Hi Computer". After uttering the wake-up keyword, the user may utter, in natural language, a command indicating the function or operation the user wants the voice control apparatus to perform. In this case, the voice control apparatus may perform speech recognition on the natural-language voice command and perform the function or operation corresponding to the recognition result. A standalone command keyword may be a spoken keyword that directly controls the operation of the voice control apparatus, such as "stop" while music is playing. The wake-up keyword referred to in the present disclosure may also be called a wake-up word, a hotword, a trigger word, and the like.
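The two keyword types described above could be handled along the following lines. This is only a sketch under stated assumptions: the `Mode` enum, the `VoiceControlState` class, the return strings, and the lowercase keyword sets are hypothetical names chosen for illustration, and keywords are modeled as already-recognized text rather than audio.

```python
from enum import Enum


class Mode(Enum):
    SLEEP = "sleep"
    WAKE_UP = "wake_up"


# Example keywords taken from the description; the sets themselves are illustrative.
WAKE_UP_KEYWORDS = {"clova", "hi computer"}
STANDALONE_COMMANDS = {"stop"}


class VoiceControlState:
    """Tracks the device mode and reacts to a confirmed keyword."""

    def __init__(self) -> None:
        self.mode = Mode.SLEEP
        self.music_playing = False

    def on_keyword(self, keyword: str) -> str:
        key = keyword.strip().lower()
        if key in WAKE_UP_KEYWORDS:
            # Wake-up keyword: switch modes and wait for a natural-language command.
            self.mode = Mode.WAKE_UP
            return "listening"
        if key in STANDALONE_COMMANDS and self.music_playing:
            # Standalone command keyword: act directly, no wake-up needed.
            self.music_playing = False
            return "stopped"
        return "ignored"
```

A wake-up keyword only changes the mode and leaves the actual request to the speech recognition that follows, whereas a standalone command keyword is acted on immediately and only makes sense in a matching context (here, while music is playing).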

In the present disclosure, candidate keywords include words whose pronunciation is similar to that of the keyword. For example, when the keyword is "Clova", candidate keywords may be "clover", "global", "club", and the like. A candidate keyword may be defined as whatever the keyword detection unit of the voice control apparatus detects as the keyword in audio data. A candidate keyword may be identical to the keyword, but it may also be a different word with a similar pronunciation. In general, a voice control apparatus may misrecognize a candidate keyword as the keyword and wake up even when the user utters a sentence that merely contains such a term. The voice control apparatus according to the present disclosure still reacts when such a candidate keyword is detected in the speech signal, but it can prevent itself from being woken up by the candidate keyword.

In the present disclosure, the speech recognition function refers to converting a user's speech signal into a character string (or text). The user's speech signal may include a voice command. A voice command may execute a specific function of the voice control apparatus.

In the present disclosure, a voice control apparatus refers to an electronic device equipped with a voice control function. Such a device may be a standalone electronic device such as a smart speaker or an AI speaker. It may also be a computing device equipped with a voice control function, such as a desktop or a notebook, or a portable computing device such as a smartphone. In this case, a program or application for executing the voice control function may be installed on the computing device. An electronic device equipped with a voice control function may also be an electronic product that mainly performs a specific function, such as a smart television, a smart refrigerator, a smart air conditioner, or a smart navigation device, or it may be an automotive infotainment system. Internet of Things devices that can be controlled by voice also fall into this category.

In the present disclosure, a specific function of the voice control apparatus may include, for example, running an application installed on the voice control apparatus, but is not limited thereto. For example, when the voice control apparatus is a smart speaker, its specific functions may include music playback, internet shopping, provision of spoken information, and control of electronic or mechanical devices connected to the smart speaker. When the voice control apparatus is a smartphone, running an application may include placing a call, finding a route, searching the internet, or setting an alarm. When the voice control apparatus is a smart television, running an application may include program search or channel search. When the voice control apparatus is a smart oven, running an application may include searching for cooking methods. When the voice control apparatus is a smart refrigerator, running an application may include checking the refrigeration and freezing status or setting the temperature. When the voice control apparatus is a smart car, running an application may include automatic starting, autonomous driving, and automatic parking. In the present disclosure, running an application is not limited to the foregoing.

In the present disclosure, a keyword may take the form of a word or a phrase. A voice command uttered after the wake-up keyword may take the form of a natural-language sentence, a word, or a phrase.

Hereinafter, the present disclosure is described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating an example of a network environment according to an embodiment.

The network environment shown in FIG. 1 is illustrated, by way of example, as including a plurality of electronic devices 100a-100f, a server 200, and a network 300.

The electronic devices 100a-100f are exemplary electronic devices that can be controlled by voice. Each of the electronic devices 100a-100f may execute specific functions in addition to the speech recognition function. Examples of the electronic devices 100a-100f include smart or AI speakers, smartphones, mobile phones, navigation devices, computers, notebooks, digital broadcasting terminals, PDAs (Personal Digital Assistants), PMPs (Portable Multimedia Players), tablet PCs, and smart electronic products. The electronic devices 100a-100f may communicate with the server 200 and/or the other electronic devices 100a-100f over the network 300 using a wireless or wired communication scheme. However, this is not limiting, and each of the electronic devices 100a-100f may also operate independently without being connected to the network 300. The electronic devices 100a-100f may be referred to collectively as the electronic device 100.

The communication scheme of the network 300 is not limited and may include not only schemes that use the communication networks the network 300 may comprise (for example, a mobile communication network, the wired internet, the wireless internet, or a broadcasting network) but also short-range wireless communication between the electronic devices 100a-100f. For example, the network 300 may include any one or more of a PAN (personal area network), a LAN (local area network), a CAN (campus area network), a MAN (metropolitan area network), a WAN (wide area network), a BBN (broadband network), the internet, and the like. The network 300 may also include any one or more network topologies, including a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network, but is not limited thereto.

The server 200 communicates with the electronic devices 100a-100f over the network 300 and may be implemented as a computer device, or a plurality of computer devices, that performs a speech recognition function. The server 200 may be distributed in cloud form and may provide commands, code, files, content, and the like.

For example, the server 200 may receive an audio file provided by the electronic devices 100a-100f, convert the speech signal in the audio file into a character string (or text), and provide the converted character string (or text) to the electronic devices 100a-100f. The server 200 may also provide the electronic devices 100a-100f connected over the network 300 with a file for installing an application that performs the voice control function. For example, the second electronic device 100b may install the application using a file provided by the server 200. Under the control of its installed operating system (OS) and/or at least one program (for example, the installed voice control application), the second electronic device 100b may connect to the server 200 and receive the speech recognition service that the server 200 provides.

FIG. 2 is a block diagram illustrating the internal configuration of an electronic device and a server according to an embodiment.

The electronic device 100 is one of the electronic devices 100a-100f of FIG. 1, and the electronic devices 100a-100f may have at least the internal configuration shown in FIG. 2. Although the electronic device 100 is shown as being connected over the network 300 to the server 200 that performs the speech recognition function, this is only an example, and the electronic device 100 may also perform the speech recognition function on its own. The electronic device 100 is an electronic device that can be controlled by voice and may be referred to as the voice control apparatus 100. The voice control apparatus 100 may be implemented so as to be included in, or connected by wire and/or wirelessly to, a smart or AI speaker, a computing device, a portable computing device, a smart home appliance, or the like.

The electronic device 100 and the server 200 may include memories 110, 210, processors 120, 220, communication modules 130, 230, and input/output interfaces 140, 240. The memories 110, 210 are computer-readable recording media and may include RAM (random access memory), ROM (read only memory), and a permanent mass storage device such as a disk drive. The memories 110, 210 may also store an operating system and at least one piece of program code (for example, code for a voice control application or a speech recognition application installed and run on the electronic device 100). These software components may be loaded into the memories 110, 210 through the communication modules 130, 230 rather than from a computer-readable recording medium. For example, at least one program may be loaded into the memories 110, 210 on the basis of a program installed from files provided over the network 300 by developers or by a file distribution system that distributes application installation files.

The processors 120, 220 may be configured to process the instructions of a computer program by performing basic arithmetic, logic, and input/output operations. Instructions may be provided to the processors 120, 220 by the memories 110, 210 or the communication modules 130, 230. For example, the processors 120, 220 may be configured to execute instructions received in accordance with program code stored in a recording device such as the memories 110, 210.

The communication modules 130, 230 may provide a function for the electronic device 100 and the server 200 to communicate with each other over the network 300, and may provide a function for communicating with the other electronic devices 100b-100f. For example, a request (such as a speech recognition service request) generated by the processor 120 of the electronic device 100 in accordance with program code stored in a recording device such as the memory 110 may be delivered to the server 200 over the network 300 under the control of the communication module 130. Conversely, a character string (text) or the like that is the speech recognition result provided under the control of the processor 220 of the server 200 may pass through the communication module 230 and the network 300 and be received by the electronic device 100 through its communication module 130. For example, the speech recognition result of the server 200 received through the communication module 130 may be delivered to the processor 120 or the memory 110. The server 200 may transmit control signals, commands, content, files, and the like to the electronic device 100; control signals and commands received through the communication module 130 are delivered to the processor 120 or the memory 110, and content and files may be stored in a separate storage medium that the electronic device 100 may further include.

The input/output interfaces 140, 240 may be means for interfacing with input/output devices 150. For example, input devices may include not only a microphone 151 but also devices such as a keyboard or a mouse, and output devices may include not only a speaker 152 but also devices such as a status LED (Light Emitting Diode) that indicates state and a display for showing an application's communication session. As another example, the input/output devices 150 may include a device in which input and output functions are integrated, such as a touchscreen.

The microphone 151 may convert ambient sound into an electrical audio signal. Instead of being mounted directly in the electronic device 100, the microphone 151 may be mounted in a communicably connected external device (for example, a smart watch), and the external signal it generates may be transmitted to the electronic device 100 by communication. Although FIG. 2 shows the microphone 151 as being included inside the electronic device 100, according to another embodiment the microphone 151 may be included in a separate device and connected to the electronic device 100 by wired or wireless communication.

In other embodiments, the electronic device 100 and the server 200 may include more components than those shown in FIG. 2. For example, the electronic device 100 may be configured to include at least some of the input/output devices 150 described above, or may further include other components such as a transceiver, a GPS (Global Positioning System) module, a camera, various sensors, and a database.

FIG. 3 illustrates an example of functional blocks that the processor of a voice control apparatus according to an embodiment may include, and FIG. 4 is a flowchart illustrating an example of an operating method that the voice control apparatus may perform according to an embodiment.

As shown in FIG. 3, the processor 120 of the voice control apparatus 100 may include an audio processing unit 121, a keyword detection unit 122, a speaker feature vector extraction unit 123, a wake-up determination unit 124, a speech recognition unit 125, and a function unit 126. The processor 120 and at least some of the functional blocks 121-126 may control the voice control apparatus 100 to perform steps S110 to S190 of the operating method shown in FIG. 4. For example, the processor 120 and at least some of its functional blocks 121-126 may be implemented to execute instructions according to the code of an operating system and at least one program code contained in the memory 110 of the voice control apparatus 100.

Some or all of the functional blocks 121-126 shown in FIG. 3 may be implemented as hardware and/or software components that perform particular functions. The functions performed by the functional blocks 121-126 may be implemented by one or more microprocessors or by circuit configurations dedicated to those functions. Some or all of the functional blocks 121-126 may be software modules written in various programming or scripting languages and executed on the processor 120. For example, the audio processing unit 121 and the keyword detection unit 122 may be implemented in a digital signal processor (DSP), while the speaker feature vector extraction unit 123, the wake-up determination unit 124, and the speech recognition unit 125 may be implemented as software modules.

The audio processing unit 121 receives an audio signal corresponding to ambient sound and generates audio stream data. The audio processing unit 121 may receive the audio signal from an input device such as the microphone 151. The microphone 151 may be included in a peripheral device communicatively connected to the voice control apparatus 100, in which case the audio processing unit 121 may receive the audio signal generated by the microphone 151 over that communication link. Ambient sound includes not only speech uttered by the user but also background sound. The audio signal therefore contains a background sound signal as well as a speech signal. The background sound signal may act as noise in keyword detection and speech recognition.

The audio processing unit 121 may generate audio stream data corresponding to the continuously received audio signal. The audio processing unit 121 may generate the audio stream data by filtering and digitizing the audio signal. By filtering the audio signal, it may remove noise and amplify the speech signal relative to the background sound signal. The audio processing unit 121 may also remove echo of the speech signal from the audio signal.

The audio processing unit 121 may operate at all times to receive the audio signal, even when the voice control apparatus 100 is in sleep mode. It may run at a low operating frequency while the voice control apparatus 100 is in sleep mode and at a high operating frequency while the voice control apparatus 100 is in normal mode.

The memory 110 may temporarily store the audio stream data generated by the audio processing unit 121. The audio processing unit 121 may buffer the audio stream data using the memory 110. The memory 110 stores not only the audio data containing the keyword but also the audio data preceding detection of the keyword. To make room for recent audio data, the oldest audio data stored in the memory 110 may be deleted. If the size allocated in the memory 110 stays the same, audio data covering the same length of time can always be stored. That length of time is preferably longer than the time needed to utter the keyword.
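The buffering behavior described above can be sketched as a fixed-capacity ring buffer. This is only a minimal illustration: the class name `AudioRingBuffer`, the frame-based capacity, and the absolute frame indexing are assumptions made for the example and do not appear in the source.

```python
from collections import deque


class AudioRingBuffer:
    """Hypothetical fixed-capacity buffer that keeps only the most recent frames."""

    def __init__(self, capacity_frames):
        # A deque with maxlen drops its oldest item automatically when full.
        self._frames = deque(maxlen=capacity_frames)
        self._next_index = 0  # absolute index of the next frame to be pushed

    def push(self, frame):
        """Store a new frame, evicting the oldest one if the buffer is full."""
        self._frames.append(frame)
        self._next_index += 1

    @property
    def oldest_index(self):
        """Absolute index of the oldest frame still held in the buffer."""
        return self._next_index - len(self._frames)

    def read(self, start, end):
        """Return frames with absolute indices in [start, end), clipped to what is buffered."""
        lo = max(start, self.oldest_index)
        hi = min(end, self._next_index)
        offset = self.oldest_index
        return [self._frames[i - offset] for i in range(lo, hi)]
```

Because the capacity is fixed, the buffer always covers the same span of time, so choosing a capacity longer than the keyword utterance leaves room for the audio that preceded the keyword.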

According to yet another embodiment of the present invention, speaker feature vectors may be extracted from the audio stream generated by the audio processing unit 121 and stored in the memory 110. In this case, each speaker feature vector may be extracted from, and stored for, an audio stream segment of a specific length. As described above, the oldest stored speaker feature vector may be deleted to make room for the speaker feature vector of the most recently generated audio stream.

The keyword detection unit 122 detects, from the audio stream data generated by the audio processing unit 121, a candidate keyword corresponding to a predefined keyword. The keyword detection unit 122 may detect the candidate keyword from the audio stream data temporarily stored in the memory 110. There may be a plurality of predefined keywords, and they may be stored in a keyword store 110a. The keyword store 110a may be included in the memory 110.

A candidate keyword is what the keyword detection unit 122 has detected as the keyword in the audio stream data. A candidate keyword may be identical to the keyword or may be a different word that is pronounced similarly. For example, when the keyword is “Clova” (클로바), the candidate keyword may be “global” (글로벌). That is, when the user utters a sentence containing “global”, the keyword detection unit 122 may mistake “global” for “Clova” in the audio stream data and detect it. The “global” detected in this way is a candidate keyword.

The keyword detection unit 122 may compare the audio stream data with known keyword data to calculate the likelihood that the audio stream data contains speech corresponding to the keyword. The keyword detection unit 122 may extract audio features such as filter bank energies or mel-frequency cepstral coefficients (MFCCs) from the audio stream data. It may process these audio features using classifying windows, for example with a support vector machine or a neural network. Based on this processing, the keyword detection unit 122 may calculate the likelihood that the keyword is contained in the audio stream data. When that likelihood is higher than a preset reference value, the keyword detection unit 122 may detect a candidate keyword by determining that the keyword is contained in the audio stream data.

The keyword detection unit 122 may generate an artificial neural network using speech samples corresponding to the keyword data, and may be trained to detect the keyword in the audio stream data using the generated network. For each frame in the audio stream data, the keyword detection unit 122 may calculate the probability of each phoneme making up the keyword, or the overall probability of the keyword. The keyword detection unit 122 may output, from the audio stream data, a sequence of phoneme probabilities or the probability of the keyword itself. Based on this sequence or probability, it may calculate the likelihood that the keyword is contained in the audio stream data, and may determine that a candidate keyword has been detected when that likelihood is at or above a preset reference value. The approach described above is only an example, and the operation of the keyword detection unit 122 may be implemented in various ways.
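As a rough illustration of the thresholding step, the sketch below takes per-frame keyword probabilities (as some classifier might output them) and reports the first window whose mean probability reaches a reference value. The sliding-window averaging, the function name `detect_candidate`, and the default values are assumptions for illustration; the source does not specify how frame probabilities are combined.

```python
def detect_candidate(frame_probs, window=3, threshold=0.8):
    """Return (start, end) frame indices of the first window whose mean
    keyword probability is at or above the threshold, or None if none is.

    `window` and `threshold` are hypothetical tuning parameters.
    """
    for start in range(len(frame_probs) - window + 1):
        chunk = frame_probs[start:start + window]
        if sum(chunk) / window >= threshold:
            # The window bounds serve as the candidate keyword's section.
            return start, start + window
    return None
```

For instance, `detect_candidate([0.1, 0.2, 0.9, 0.95, 0.9, 0.1])` would report the window covering frames 2 to 4 as the candidate keyword section.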

In addition, by extracting audio features for each frame in the audio stream data, the keyword detection unit 122 may compute the likelihood that the frame's audio data is human speech and the likelihood that it is background sound. By comparing the two likelihoods, the keyword detection unit 122 may determine whether the frame's audio data corresponds to human speech. For example, when the likelihood of human speech exceeds the likelihood of background sound by more than a preset reference value, the keyword detection unit 122 may determine that the frame's audio data corresponds to human speech.
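The per-frame speech decision above amounts to a margin test between two likelihoods. A minimal sketch follows, assuming the likelihoods are already available as numbers; the function names and the default margin of 0.2 are hypothetical.

```python
def is_voice_frame(p_voice, p_background, margin=0.2):
    """True when the speech likelihood exceeds the background likelihood
    by more than the preset margin (a hypothetical reference value)."""
    return (p_voice - p_background) > margin


def flag_voice_frames(scores, margin=0.2):
    """Map a list of (p_voice, p_background) pairs to per-frame speech flags."""
    return [is_voice_frame(v, b, margin) for v, b in scores]
```

The resulting list of booleans could be stored alongside the buffered frames so that later stages can skip non-speech frames.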

The keyword detection unit 122 may identify the section of the audio stream data in which the candidate keyword was detected and may determine the start point and end point of that section. This section may be referred to as the keyword detection section, the current section, or the first section. The audio data corresponding to the first section is referred to as the first audio data. The keyword detection unit 122 may take the end of the section in which the candidate keyword was detected as the end point. According to another example, after the candidate keyword is detected, the keyword detection unit 122 may wait until silence lasting a preset time (for example, 0.5 seconds) occurs, and then set the end point of the first section either so that the silent period is included in the first section or so that it is excluded.

The speaker feature vector extraction unit 123 reads from the memory 110 the second audio data, which corresponds to a second section of the audio stream data temporarily stored in the memory 110. The second section is the section immediately preceding the first section, and the end point of the second section may coincide with the start point of the first section. The second section may be referred to as the previous section. The length of the second section may be set variably according to the keyword corresponding to the detected candidate keyword. According to another example, the length may be fixed. According to yet another example, the length may be varied adaptively so as to optimize keyword detection performance. For example, when the audio signal output by the microphone 151 is “four-leaf clover” (네잎 클로버) and the candidate keyword is “clover” (클로버), the second audio data may correspond to the speech “four-leaf” (네잎).
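The relationship between the two sections can be sketched as simple index arithmetic. The per-keyword length table, the fallback length, and the frame-based units below are hypothetical values chosen only to illustrate the keyword-dependent and fixed-length options described above.

```python
# Hypothetical per-keyword lengths of the previous (second) section, in frames.
SECOND_SECTION_FRAMES = {"clova": 100}
DEFAULT_SECOND_SECTION_FRAMES = 80  # assumed fixed fallback length


def second_section_bounds(first_start, keyword):
    """Return (start, end) frame indices of the section just before the first section."""
    length = SECOND_SECTION_FRAMES.get(keyword, DEFAULT_SECOND_SECTION_FRAMES)
    start = max(0, first_start - length)  # clip at the beginning of the stream
    # The end point of the second section coincides with the first section's start.
    return start, first_start
```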

The speaker feature vector extraction unit 123 extracts a first speaker feature vector from the first audio data corresponding to the first section and a second speaker feature vector from the second audio data corresponding to the second section. The speaker feature vector extraction unit 123 may extract from audio data a speaker feature vector that is robust for speaker recognition. It may do so by converting a time-domain speech signal into a frequency-domain signal and transforming the frequency energies of the converted signal differently from one another. For example, the speaker feature vector may be extracted on the basis of mel-frequency cepstral coefficients (MFCCs) or filter bank energies, but the extraction is not limited to these, and the speaker feature vector may be extracted from the audio data in various ways.

The speaker feature vector extraction unit 123 may normally operate in sleep mode. When the keyword detection unit 122 detects a candidate keyword in the audio stream data, it may wake up the speaker feature vector extraction unit 123 by sending it a wake-up signal. The speaker feature vector extraction unit 123 may wake up in response to this signal, which indicates that the keyword detection unit 122 has detected a candidate keyword.

According to one embodiment, the speaker feature vector extraction unit 123 may extract a frame feature vector for each frame of the audio data, and may normalize and average the extracted frame feature vectors to obtain a speaker feature vector representing the audio data. L2 normalization may be used to normalize the extracted frame feature vectors. The averaging may be accomplished by normalizing the frame feature vector extracted for every frame in the audio data and computing the mean of the resulting normalized frame feature vectors.

For example, the speaker feature vector extraction unit 123 may extract a first frame feature vector for each frame of the first audio data, and may normalize and average the extracted first frame feature vectors to obtain the first speaker feature vector representing the first audio data. Likewise, the speaker feature vector extraction unit 123 may extract a second frame feature vector for each frame of the second audio data, and may normalize and average the extracted second frame feature vectors to obtain the second speaker feature vector representing the second audio data.
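The normalize-then-average step can be sketched as follows. The frame feature vectors are assumed to be plain lists of numbers produced by some upstream extractor that is not shown; the function names are illustrative only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def speaker_feature_vector(frame_vectors):
    """L2-normalize each frame feature vector, then average them element-wise."""
    if not frame_vectors:
        raise ValueError("at least one frame feature vector is required")
    normalized = [l2_normalize(v) for v in frame_vectors]
    count = len(normalized)
    dim = len(normalized[0])
    return [sum(v[i] for v in normalized) / count for i in range(dim)]
```

Normalizing before averaging keeps loud frames from dominating the mean, so each frame contributes only its direction in feature space.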

According to another embodiment, rather than extracting a frame feature vector for every frame in the audio data, the speaker feature vector extraction unit 123 may extract frame feature vectors for only some of the frames. These frames may be selected as speech frames, that is, frames whose audio data is likely to be the user's speech. The selection of speech frames may be performed by the keyword detection unit 122. For each frame of the audio stream data, the keyword detection unit 122 may calculate a first probability that the frame is human speech and a second probability that it is background sound. It may designate as a speech frame any frame whose first probability exceeds its second probability by more than a preset reference value. The keyword detection unit 122 may store in the memory 110 a flag or bit indicating whether each frame is a speech frame, in association with that frame of the audio stream data.

When reading the first and second audio data from the memory 110, the speaker feature vector extraction unit 123 may also read the flags or bits and thereby determine whether each frame is a speech frame.

The speaker feature vector extraction unit 123 may extract a frame feature vector for each frame of the audio data that was determined to be a speech frame, and may normalize and average the extracted frame feature vectors to obtain a speaker feature vector representing the audio data. For example, the speaker feature vector extraction unit 123 may extract a first frame feature vector for each speech frame in the first audio data, and may normalize and average the extracted first frame feature vectors to obtain the first speaker feature vector representing the first audio data. Likewise, the speaker feature vector extraction unit 123 may extract a second frame feature vector for each speech frame in the second audio data, and may normalize and average the extracted second frame feature vectors to obtain the second speaker feature vector representing the second audio data.
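The speech-frame variant differs only in that frames are filtered by their stored flags before normalization and averaging. A minimal sketch, assuming the flags arrive as a list of booleans aligned with the frame vectors; returning `None` when no speech frame exists is a choice made for this example, not something the source prescribes.

```python
import math


def masked_speaker_vector(frame_vectors, voice_flags):
    """Average the L2-normalized vectors of frames flagged as speech.

    Returns None when no frame is flagged (an assumption of this sketch).
    """
    selected = [v for v, is_voice in zip(frame_vectors, voice_flags) if is_voice]
    if not selected:
        return None
    normalized = []
    for vec in selected:
        norm = math.sqrt(sum(x * x for x in vec))
        # Leave zero vectors unchanged to avoid dividing by zero.
        normalized.append([x / norm for x in vec] if norm > 0 else list(vec))
    count = len(normalized)
    dim = len(normalized[0])
    return [sum(v[i] for v in normalized) / count for i in range(dim)]
```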

Based on the similarity between the first speaker feature vector and the second speaker feature vector extracted by the speaker feature vector extraction unit 123, the wake-up determination unit 124 determines whether the keyword is contained in the first audio data, that is, whether the keyword is contained in the audio signal of the first section. The wake-up determination unit 124 may compare the similarity between the first and second speaker feature vectors with a preset reference value and, when the similarity is at or below the reference value, determine that the keyword is contained in the first audio data of the first section.

A typical case in which the voice control apparatus 100 misrecognizes the keyword is when a word pronounced similarly to the keyword appears in the middle of the user's speech. For example, when the keyword is “Clova” (클로바), the voice control apparatus 100 may wake up in response to “clover” (클로버) even when the user is saying to another person, “How can I find a four-leaf clover?”, and may perform an operation the user did not intend. Even when a television news announcer says “The market capitalization of JN Global (제이엔 글로벌) is ...”, the voice control apparatus 100 may wake up in response to “global” (글로벌). To prevent such keyword misrecognition, according to one embodiment, the voice control apparatus 100 may respond only when the word pronounced similarly to the keyword is located at the very beginning of an utterance. Moreover, in an environment with heavy background noise, or where other people are talking, the fact that the user uttered the keyword at the very beginning may go undetected because of that noise or conversation, even though the user did so. According to one embodiment, the voice control apparatus 100 extracts the first speaker feature vector of the section in which the candidate keyword was detected and the second speaker feature vector of the previous section, and when the first and second speaker feature vectors differ from each other, it may determine that the user uttered the speech corresponding to the keyword at the very beginning.

To make this determination, when the similarity between the first and second speaker feature vectors is at or below a preset reference value, the wake-up determination unit 124 may determine that the user uttered the speech corresponding to the keyword at the very beginning. That is, the wake-up determination unit 124 may determine that the keyword is contained in the first audio data of the first section and may wake up some functions of the voice control apparatus 100. A high similarity between the first and second speaker feature vectors means that the person who spoke the speech corresponding to the first audio data and the person who spoke the speech corresponding to the second audio data are likely to be the same, in which case the candidate keyword is likely to lie in the middle of a continuing utterance rather than at its beginning.

When the second audio data corresponds to silence, the speaker feature vector extraction unit 123 may extract from the second audio data a second speaker feature vector corresponding to silence. Because the speaker feature vector extraction unit 123 will extract from the first audio data a first speaker feature vector corresponding to the user's speech, the similarity between the first and second speaker feature vectors may be low.
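The wake-up decision described above can be sketched as a similarity comparison against a reference value. The source does not name a similarity measure, so cosine similarity is used here purely as an assumption, as are the function names and the default threshold of 0.5. Treating an all-zero vector as having zero similarity is also an assumption, standing in for the silence case.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_wake_up(first_vec, second_vec, threshold=0.5):
    """Wake up when the keyword section and the preceding section are dissimilar,
    that is, when their similarity is at or below the reference value."""
    return cosine_similarity(first_vec, second_vec) <= threshold
```

Under this rule, a candidate keyword preceded by speech from the same speaker yields a high similarity and no wake-up, while one preceded by silence or by a different sound source yields a low similarity and triggers the wake-up.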

The speech recognition unit 125 may receive third audio data corresponding to a third section of the audio stream data generated by the audio processing unit 121 and may perform speech recognition on the third audio data. According to another example, the speech recognition unit 125 may transmit the third audio data to an external entity (for example, the server 200) so that speech recognition is performed externally, and may receive the speech recognition result.

The function unit 126 may perform a function corresponding to the keyword. For example, when the voice control apparatus 100 is a smart speaker, the function unit 126 may include a music playback unit, a voice information providing unit, a peripheral device control unit, and the like, and may perform the function corresponding to the detected keyword. When the voice control apparatus 100 is a smartphone, the function unit 126 may include a call connection unit, a text message transmission/reception unit, an Internet search unit, and the like, and may perform the function corresponding to the detected keyword. The function unit 126 may be configured in various ways depending on the type of the voice control apparatus 100. The function unit 126 collectively represents the functional blocks for performing the various functions that the voice control apparatus 100 can execute.

Although the voice control apparatus 100 shown in FIG. 3 is depicted as including the speech recognition unit 125, this is only an example; the voice control apparatus 100 may omit the speech recognition unit 125, and the server 200 shown in FIG. 2 may perform the speech recognition function instead. In this case, as shown in FIG. 1, the voice control apparatus 100 may connect through the network 300 to the server 200 that performs the speech recognition function. The voice control apparatus 100 may provide the server 200 with a voice file containing the speech signal that requires recognition, and the server 200 may perform speech recognition on the speech signal in the voice file to generate a character string corresponding to the speech signal. The server 200 may transmit the generated character string to the voice control apparatus 100 through the network 300. In the description below, however, the voice control apparatus 100 is assumed to include the speech recognition unit 125 that performs the speech recognition function.

The processor 120 may load into the memory 110 program code stored in a program file for the operating method. For example, a program may be installed on the voice control apparatus 100 from the program file. When the program installed on the voice control apparatus 100 is executed, the processor 120 may load the program code into the memory 110. At least some of the audio processing unit 121, the keyword detection unit 122, the speaker feature vector extraction unit 123, the wake-up determination unit 124, the speech recognition unit 125, and the function unit 126 included in the processor 120 may each be implemented to execute instructions according to the corresponding portion of the program code loaded into the memory 110, thereby carrying out steps S110 to S190 of FIG. 4.

Hereinafter, a statement that the functional blocks 121 to 126 of the processor 120 control the voice control apparatus 100 may be understood to mean that the processor 120 controls other components of the voice control apparatus 100. For example, the processor 120 may control the communication module 130 included in the voice control apparatus 100 so that the voice control apparatus 100 communicates with, for example, the server 200.

In step S110, the processor 120, for example the audio processing unit 121, receives an audio signal corresponding to ambient sound. The audio processing unit 121 may receive the audio signal corresponding to ambient sound continuously. The audio signal may be an electrical signal that an input device such as the microphone 151 generates in response to ambient sound.

In step S120, the processor 120, for example the audio processing unit 121, generates audio stream data based on the audio signal from the microphone 151. The audio stream data corresponds to the continuously received audio signal. The audio stream data may be data generated by filtering and digitizing the audio signal.

In step S130, the processor 120, for example the audio processing unit 121, temporarily stores the audio stream data generated in step S120 in the memory 110. Because the memory 110 has a limited size, only the portion of the audio stream data that corresponds to the audio signal of the most recent fixed period of time may be temporarily stored in the memory 110. When new audio stream data is generated, the oldest of the audio stream data stored in the memory 110 may be deleted, and the new audio stream data may be stored in the space freed by the deletion.
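The temporary storage described for step S130 behaves like a ring buffer. The following is a minimal sketch of that idea, not the implementation of the disclosure; the class name `AudioRingBuffer`, its methods, and the use of absolute sample positions are assumptions made only for illustration.

```python
from collections import deque


class AudioRingBuffer:
    """Hypothetical buffer that keeps only the most recent `capacity` samples."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen discards the oldest item when a new one is appended.
        self._samples = deque(maxlen=capacity)
        self._total = 0  # number of samples written since the stream started

    def write(self, chunk) -> None:
        """Append new audio stream data, overwriting the oldest data if full."""
        for sample in chunk:
            self._samples.append(sample)
            self._total += 1

    def read(self, start: int, end: int) -> list:
        """Return the samples at absolute stream positions [start, end)."""
        oldest = self._total - len(self._samples)
        if start < oldest or end > self._total or start > end:
            raise ValueError("requested section is not in the buffer")
        buffered = list(self._samples)
        return buffered[start - oldest:end - oldest]
```

Addressing sections by absolute stream position keeps the start and end points determined by the keyword detection unit valid even after older data has been overwritten.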

In step S140, the processor 120, for example the keyword detection unit 122, detects a candidate keyword corresponding to a predefined keyword from the audio stream data generated in step S120. The candidate keyword is a word pronounced similarly to the predefined keyword, and refers to the word that the keyword detection unit 122 detected as the keyword in step S140.

In step S150, the processor 120, for example the keyword detection unit 122, identifies the keyword detection section of the audio stream data in which the candidate keyword was detected, and determines the start point and end point of the keyword detection section. The keyword detection section may be referred to as the current section. The data in the audio stream data that corresponds to the current section may be referred to as first audio data.

In step S160, the processor 120, for example the speaker feature vector extraction unit 123, reads second audio data corresponding to a previous section from the memory 110. The previous section is the section immediately preceding the current section, and the end point of the previous section may coincide with the start point of the current section. The speaker feature vector extraction unit 123 may also read the first audio data from the memory 110.

In step S170, the processor 120, for example the speaker feature vector extraction unit 123, extracts a first speaker feature vector from the first audio data and a second speaker feature vector from the second audio data. The first speaker feature vector is an indicator for identifying the speaker of the voice corresponding to the first audio data, and the second speaker feature vector is an indicator for identifying the speaker of the voice corresponding to the second audio data. The processor 120, for example the wake-up determination unit 124, may determine whether the keyword is included in the first audio data based on the similarity between the first speaker feature vector and the second speaker feature vector. When the wake-up determination unit 124 determines that the keyword is included in the first audio data, it may wake up some components of the voice control apparatus 100.

In step S180, the processor 120, for example the wake-up determination unit 124, compares the similarity between the first speaker feature vector and the second speaker feature vector with a preset reference value.

When the similarity between the first speaker feature vector and the second speaker feature vector is less than or equal to the preset reference value, the speaker of the first audio data in the current section differs from the speaker of the second audio data in the previous section, so the wake-up determination unit 124 may determine that the keyword is included in the first audio data. In this case, as in step S190, the processor 120, for example the wake-up determination unit 124, may wake up some components of the voice control apparatus 100.

However, when the similarity between the first speaker feature vector and the second speaker feature vector is greater than the preset reference value, the speaker of the first audio data in the current section is the same as the speaker of the second audio data in the previous section, so the wake-up determination unit 124 may determine that the keyword is not included in the first audio data and may not proceed with the wake-up. In this case, the method returns to step S110 to receive the audio signal corresponding to ambient sound. The reception of the audio signal in step S110 continues while steps S120 to S190 are performed.
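The comparison in steps S180 and S190 can be sketched as follows. This passage does not specify the similarity measure or the reference value, so the use of cosine similarity, the default value of 0.7, and the function names are assumptions made only for illustration.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two vectors (assumed measure; 0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_wake_up(first_vec, second_vec, reference: float = 0.7) -> bool:
    """Wake up only when the current and previous sections appear to have different speakers.

    A similarity at or below the reference value is treated as "different
    speakers", meaning the candidate keyword was not part of ongoing speech.
    The value 0.7 is a placeholder, not a value taken from the disclosure.
    """
    return cosine_similarity(first_vec, second_vec) <= reference
```

For example, `should_wake_up([1.0, 0.0], [0.0, 1.0])` accepts the keyword because the two vectors are orthogonal, while two vectors pointing in the same direction are rejected as the same speaker continuing to talk.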

A plurality of predefined keywords may be stored in the keyword storage 110a of FIG. 3. Each of these keywords may be a wake-up keyword or a single-command keyword. A wake-up keyword is for waking up some functions of the voice control apparatus 100. Typically, the user utters the wake-up keyword and then utters the desired natural-language voice command. The voice control apparatus 100 may perform voice recognition on the natural-language voice command and perform the operation or function corresponding to it.

A single-command keyword is for causing the voice control apparatus 100 to directly perform a specific operation or function, and may be a simple predefined word such as “재생” (play) or “중지” (stop). When a single-command keyword is received, the voice control apparatus 100 may wake up the function corresponding to the single-command keyword and perform that function.
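The two kinds of keywords held in the keyword storage can be modeled as a simple lookup table. This is a minimal sketch; the names `KeywordType`, `KEYWORD_STORE`, and `keyword_type` and the table layout are hypothetical, and only the example keywords themselves come from this description.

```python
from enum import Enum
from typing import Optional


class KeywordType(Enum):
    WAKE_UP = "wake-up"
    SINGLE_COMMAND = "single-command"


# Hypothetical contents of the keyword storage 110a.
KEYWORD_STORE = {
    "클로바": KeywordType.WAKE_UP,        # wake-up keyword (Clova)
    "재생": KeywordType.SINGLE_COMMAND,   # play
    "중지": KeywordType.SINGLE_COMMAND,   # stop
}


def keyword_type(keyword: str) -> Optional[KeywordType]:
    """Return the type of a predefined keyword, or None if it is not stored."""
    return KEYWORD_STORE.get(keyword)
```

A detector could use such a lookup to decide which verification procedure to apply once it has matched a candidate to a predefined keyword.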

The following describes, in turn, the case in which a candidate keyword corresponding to a single-command keyword is detected from the audio stream data and the case in which a candidate keyword corresponding to a wake-up keyword is detected from the audio stream data.

FIG. 5 is a flowchart showing an example of an operation method that the voice control apparatus may perform according to another embodiment.

FIG. 6A shows an example in which a single-command keyword is uttered while the voice control apparatus according to an embodiment executes the operation method of FIG. 5, and FIG. 6B shows an example in which ordinary conversational speech is uttered while the voice control apparatus according to an embodiment executes the operation method of FIG. 5.

The operation method of FIG. 5 includes steps that are substantially the same as those of the operation method of FIG. 4. Steps of FIG. 5 that are substantially the same as steps of FIG. 4 are not described in detail. FIGS. 6A and 6B show audio signals corresponding to the audio stream data together with the user speech corresponding to those audio signals. FIG. 6A shows audio signals corresponding to the utterance “중지” (stop), and FIG. 6B shows audio signals corresponding to the utterance “여기서 정지해” (halt here).

Referring to FIG. 5 together with FIGS. 6A and 6B, in step S210, the processor 120, for example the audio processing unit 121, receives an audio signal corresponding to ambient sound.

In step S220, the processor 120, for example the audio processing unit 121, generates audio stream data based on the audio signal from the microphone 151.

In step S230, the processor 120, for example the audio processing unit 121, temporarily stores the audio stream data generated in step S220 in the memory 110.

In step S240, the processor 120, for example the keyword detection unit 122, detects a candidate keyword corresponding to a predefined single-command keyword from the audio stream data generated in step S220. A single-command keyword may be a voice keyword that can directly control the operation of the voice control apparatus 100. For example, the single-command keyword may be a word such as “중지” (stop), as shown in FIG. 6A. In this case, the voice control apparatus 100 may be playing, for example, music or a video.

In the example of FIG. 6A, the keyword detection unit 122 may detect the candidate keyword “중지” (stop) in the audio signals. In the example of FIG. 6B, the keyword detection unit 122 may detect in the audio signals the candidate keyword “정지” (halt), a word pronounced similarly to the keyword “중지” (stop).

In step S250, the processor 120, for example the keyword detection unit 122, identifies the keyword detection section of the audio stream data in which the candidate keyword was detected, and determines the start point and end point of the keyword detection section. The keyword detection section may be referred to as the current section. The data in the audio stream data that corresponds to the current section may be referred to as first audio data.

In the example of FIG. 6A, the keyword detection unit 122 may identify the section in which the candidate keyword “중지” (stop) was detected as the current section, and determine the start point and end point of the current section. The audio data corresponding to the current section may be referred to as first audio data AD1.

In the example of FIG. 6B, the keyword detection unit 122 may identify the section in which the candidate keyword “정지” (halt) was detected as the current section, and determine the start point and end point of the current section. The audio data corresponding to the current section may be referred to as first audio data AD1.

In addition, in step S250, the processor 120, for example the keyword detection unit 122, may determine whether the detected candidate keyword corresponds to a wake-up keyword or to a single-command keyword. In the examples of FIGS. 6A and 6B, the keyword detection unit 122 may determine that the detected candidate keywords, namely “중지” (stop) and “정지” (halt), are candidate keywords corresponding to a single-command keyword.

In step S260, the processor 120, for example the speaker feature vector extraction unit 123, reads second audio data corresponding to a previous section from the memory 110. The previous section is the section immediately preceding the current section, and the end point of the previous section may coincide with the start point of the current section. The speaker feature vector extraction unit 123 may also read the first audio data from the memory 110.

In the example of FIG. 6A, the speaker feature vector extraction unit 123 may read from the memory 110 the second audio data AD2 corresponding to the previous section, which immediately precedes the current section. In the example of FIG. 6B, the speaker feature vector extraction unit 123 may likewise read from the memory 110 the second audio data AD2 corresponding to the previous section. In the example of FIG. 6B, the second audio data AD2 may correspond to the sound “기서” (the latter part of “여기서”, here). The length of the previous section may be set variably according to the detected candidate keyword.

In step S270, the processor 120, for example the speaker feature vector extraction unit 123, receives from the audio processing unit 121 third audio data corresponding to a next section that follows the current section. The next section is the section immediately following the current section, and the start point of the next section may coincide with the end point of the current section.

In the example of FIG. 6A, the speaker feature vector extraction unit 123 may receive from the audio processing unit 121 the third audio data AD3 corresponding to the next section, which immediately follows the current section. In the example of FIG. 6B, the speaker feature vector extraction unit 123 may likewise receive from the audio processing unit 121 the third audio data AD3 corresponding to the next section. In the example of FIG. 6B, the third audio data AD3 may correspond to the sound “해” (the final syllable of “정지해”). The length of the next section may be set variably according to the detected candidate keyword.
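The relationship between the current section and its neighboring sections can be expressed with a small helper. This is a sketch under stated assumptions: sections are represented as half-open ranges of sample positions, and the function name and its parameters are hypothetical.

```python
def neighbor_sections(cur_start: int, cur_end: int, prev_len: int, next_len: int):
    """Return the (start, end) ranges of the previous and next sections.

    The previous section ends where the current section starts, and the next
    section starts where the current section ends. Lengths are passed in
    because they may vary with the detected candidate keyword.
    """
    previous = (max(0, cur_start - prev_len), cur_start)
    following = (cur_end, cur_end + next_len)
    return previous, following
```

The previous section is clamped at position 0 so that a keyword detected at the very beginning of the stream yields a shorter, possibly empty, previous section rather than a negative position.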

In step S280, the processor 120, for example the speaker feature vector extraction unit 123, extracts first to third speaker feature vectors from the first to third audio data, respectively. The first to third speaker feature vectors are indicators for identifying the speakers of the voices corresponding to the first to third audio data, respectively. The processor 120, for example the wake-up determination unit 124, may determine whether a single-command keyword is included in the first audio data based on the similarity between the first speaker feature vector and the second speaker feature vector and on the similarity between the first speaker feature vector and the third speaker feature vector. When the wake-up determination unit 124 determines that a single-command keyword is included in the first audio data, it may wake up some components of the voice control apparatus 100.

In the example of FIG. 6A, the first speaker feature vector corresponding to the first audio data AD1 is an indicator for identifying the speaker who uttered “중지” (stop). Because the second audio data AD2 and the third audio data AD3 are substantially silent, the second and third speaker feature vectors may be vectors corresponding to silence. Accordingly, the similarity between the first speaker feature vector and each of the second and third speaker feature vectors may be low.

As another example, when someone other than the speaker who uttered “중지” (stop) is speaking in the previous section and the next section, the second and third speaker feature vectors will be vectors corresponding to that other person, so the similarity between the first speaker feature vector and each of the second and third speaker feature vectors may be low.

In the example of FIG. 6B, one person uttered “여기서 정지해” (halt here). Accordingly, the first speaker feature vector extracted from the first audio data AD1 corresponding to “정지”, the second speaker feature vector extracted from the second audio data AD2 corresponding to “기서”, and the third speaker feature vector extracted from the third audio data AD3 corresponding to “해” are all vectors identifying substantially the same speaker, so the similarities among the first to third speaker feature vectors may be high.

In step S290, the processor 120, for example the wake-up determination unit 124, compares the similarity between the first speaker feature vector and the second speaker feature vector with a preset reference value, and compares the similarity between the first speaker feature vector and the third speaker feature vector with the preset reference value. When both similarities are less than or equal to the preset reference value, the speaker of the first audio data in the current section differs from both the speaker of the second audio data in the previous section and the speaker of the third audio data in the next section, so the wake-up determination unit 124 may determine that a single-command keyword is included in the first audio data. In this case, as in step S300, the processor 120, for example the wake-up determination unit 124, may provide the single-command keyword to the function unit 126, and the function unit 126 may perform the function corresponding to the single-command keyword in response to the determination by the wake-up determination unit 124 that the single-command keyword is included in the first audio data.

In the example of FIG. 6A, the first speaker feature vector is a vector corresponding to the speaker who uttered “중지” (stop), and the second and third speaker feature vectors are vectors corresponding to silence, so the similarities between the first speaker feature vector and the second and third speaker feature vectors may be lower than the preset reference value. In this case, the wake-up determination unit 124 may determine that the single-command keyword “중지” is included in the first audio data AD1. The function unit 126 may then be woken up in response to this determination and may perform the operation or function corresponding to the single-command keyword “중지”. For example, if the voice control apparatus 100 was playing music, the function unit 126 may stop the music playback in response to the single-command keyword “중지”.

However, when the similarity between the first speaker feature vector and the second speaker feature vector is greater than the preset reference value, or the similarity between the first speaker feature vector and the third speaker feature vector is greater than the preset reference value, the speaker of the first audio data in the current section is the same as the speaker of the second audio data in the previous section or the speaker of the third audio data in the next section, so the wake-up determination unit 124 may determine that the keyword is not included in the first audio data and may not proceed with the wake-up. In this case, the method returns to step S210 to receive the audio signal corresponding to ambient sound.
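The two-sided check of steps S290 and S300 can be sketched as follows. As before, cosine similarity, the placeholder reference value of 0.7, and the function names are assumptions made for illustration and are not specified by this passage.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two vectors (assumed measure; 0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_command_detected(first_vec, second_vec, third_vec,
                            reference: float = 0.7) -> bool:
    """Accept the single-command keyword only if BOTH neighbors differ from it.

    If the current section resembles either the previous or the next section,
    the candidate is treated as part of a longer utterance and is rejected.
    """
    differs_from_previous = cosine_similarity(first_vec, second_vec) <= reference
    differs_from_next = cosine_similarity(first_vec, third_vec) <= reference
    return differs_from_previous and differs_from_next
```

With three mutually dissimilar vectors, as in the isolated command of FIG. 6A, the function returns `True`; when all three vectors come from the same speaker, as in FIG. 6B, or when only one neighbor matches, it returns `False`.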

In the example of FIG. 6B, one person uttered “여기서 정지해” (halt here), so the similarities among the first to third speaker feature vectors will be high. Because the utterance “여기서 정지해” in the example of FIG. 6B does not contain a keyword for controlling or waking up the voice control apparatus, the wake-up determination unit 124 may determine that the first audio data AD1 does not include a single-command keyword, and may prevent the function unit 126 from performing the function or operation corresponding to “정지” (halt) or “중지” (stop).

With conventional technology, a voice control apparatus may detect the sound “정지” (halt) within the utterance “여기서 정지해” (halt here) and perform the function or operation corresponding to “정지”. Such a function or operation is not what the user intended, and the user may find the voice control apparatus inconvenient to use. According to an embodiment, however, the voice control apparatus 100 can accurately recognize a single-command keyword in the user's speech and therefore, unlike conventional technology, may avoid such a malfunction.

FIG. 7 is a flowchart showing an example of an operation method that the voice control apparatus may perform according to yet another embodiment.

FIG. 8A shows an example in which a wake-up keyword and a natural-language voice command are uttered while the voice control apparatus according to an embodiment executes the operation method of FIG. 7, and FIG. 8B shows an example in which ordinary conversational speech is uttered while the voice control apparatus according to an embodiment executes the operation method of FIG. 7.

The operation method of FIG. 7 includes steps that are substantially the same as those of the operation method of FIG. 4. Steps of FIG. 7 that are substantially the same as steps of FIG. 4 are not described in detail. FIGS. 8A and 8B show audio signals corresponding to the audio stream data together with the user speech corresponding to those audio signals. FIG. 8A shows audio signals corresponding to the wake-up keyword “클로바” (Clova) and the natural-language voice command “내일 날씨를 알려줘” (tell me tomorrow's weather), and FIG. 8B shows audio signals corresponding to the conversational utterance “네잎 클로버를 어떻게 찾을 수 있어” (how can I find a four-leaf clover).

Referring to FIG. 7 together with FIGS. 8A and 8B, in step S410, the processor 120, for example the audio processing unit 121, receives an audio signal corresponding to ambient sound. In step S420, the processor 120, for example the audio processing unit 121, generates audio stream data based on the audio signal from the microphone 151. In step S430, the processor 120, for example the audio processing unit 121, temporarily stores the audio stream data generated in step S420 in the memory 110.

In step S440, the processor 120, for example the keyword detection unit 122, detects a candidate keyword corresponding to a predefined wake-up keyword from the audio stream data generated in step S420. A wake-up keyword is a voice-based keyword that can switch a voice control apparatus in a sleep mode to a wake-up mode. For example, the wake-up keyword may be a voice keyword such as “클로바” (Clova) or “하이 컴퓨터” (Hi Computer).

In the example of FIG. 8A, the keyword detection unit 122 may detect the candidate keyword “클로바” (Clova) in the audio signals. In the example of FIG. 8B, the keyword detection unit 122 may detect in the audio signals the candidate keyword “클로버” (clover), a word pronounced similarly to the keyword “클로바” (Clova).

In step S450, the processor 120, for example the keyword detection unit 122, identifies the keyword detection section of the audio stream data in which the candidate keyword was detected, and determines the start point and end point of the keyword detection section. The keyword detection section may be referred to as the current section. The data in the audio stream data that corresponds to the current section may be referred to as first audio data.

In the example of FIG. 8A, the keyword detection unit 122 may identify the section in which the candidate keyword “클로바” (Clova) was detected as the current section, and determine the start point and end point of the current section. The audio data corresponding to the current section may be referred to as first audio data AD1. In the example of FIG. 8B, the keyword detection unit 122 may identify the section in which the candidate keyword “클로버” (clover) was detected as the current section, and determine the start point and end point of the current section. The audio data corresponding to this current section may likewise be referred to as first audio data AD1.

In addition, in step S450, the processor 120, for example the keyword detection unit 122, may determine whether the detected candidate keyword corresponds to a wake-up keyword or to a single-command keyword. In the examples of FIGS. 8A and 8B, the keyword detection unit 122 may determine that the detected candidate keywords, namely “클로바” (Clova) and “클로버” (clover), are candidate keywords corresponding to a wake-up keyword.

In step S460, the processor 120, for example the speaker feature vector extraction unit 123, reads second audio data corresponding to the previous section from the memory 110. The previous section is the section immediately preceding the current section, and the end point of the previous section may coincide with the start point of the current section. The speaker feature vector extraction unit 123 may also read the first audio data from the memory 110.

In the examples of FIGS. 8A and 8B alike, the speaker feature vector extraction unit 123 may read from the memory 110 the second audio data AD2 corresponding to the previous section, that is, the section immediately preceding the current section. In the example of FIG. 8B, the second audio data AD2 may correspond to the utterance “four-leaf” (네잎). The length of the previous section may be set variably according to the detected candidate keyword.
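The relationship between the current section and the previous section in steps S450 and S460 can be sketched as follows. This is only an illustration under stated assumptions, not the implementation described in the specification: the `Section` type, the use of sample indices, the `PREVIOUS_SECTION_SAMPLES` table, and the lengths in it are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    start: int  # inclusive sample index in the buffered audio stream
    end: int    # exclusive sample index


# Hypothetical lengths of the previous section per keyword type, in samples.
# The specification only says the length may vary with the detected keyword.
PREVIOUS_SECTION_SAMPLES = {"wakeup": 16000, "single_command": 8000}


def previous_section(current: Section, keyword_type: str) -> Section:
    """Return the section that ends exactly where the current section starts."""
    length = PREVIOUS_SECTION_SAMPLES[keyword_type]
    start = max(0, current.start - length)  # clamp at the start of the buffer
    return Section(start, current.start)


def read_audio(buffer, section: Section):
    """Read the samples of one section from the buffered audio stream."""
    return buffer[section.start:section.end]
```

For a current section starting at sample 20000, `previous_section(Section(20000, 30000), "wakeup")` yields a section covering samples 4000 to 20000, so its end point equals the start point of the current section.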

In step S470, the processor 120, for example the speaker feature vector extraction unit 123, extracts first and second speaker feature vectors from the first and second audio data, respectively. The processor 120, for example the wake-up determination unit 124, may determine whether the wake-up keyword is included in the first audio data based on the similarity between the first speaker feature vector and the second speaker feature vector. When the wake-up determination unit 124 determines that the wake-up keyword is included in the first audio data, it may wake up some components of the voice control device 100.

In the example of FIG. 8A, the first speaker feature vector corresponding to the first audio data AD1 is an indicator for identifying the speaker who uttered “Clova”. Because the second audio data AD2 is substantially silence, the second speaker feature vector may be a vector corresponding to silence. Accordingly, the similarity between the first speaker feature vector and the second speaker feature vector may be low.

As another example, when someone other than the speaker who uttered “Clova” speaks during the previous section, the second speaker feature vector will be a vector corresponding to that other person, so the similarity between the first speaker feature vector and the second speaker feature vector may be low.

In the example of FIG. 8B, one person uttered “How can I find a four-leaf clover?” The first speaker feature vector extracted from the first audio data AD1 corresponding to “clover” and the second speaker feature vector extracted from the second audio data AD2 corresponding to “four-leaf” are therefore both vectors identifying substantially the same speaker, so the similarity between the first and second speaker feature vectors may be high.
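One way to quantify the similarity between two speaker feature vectors is cosine similarity. The sketch below is a minimal illustration and an assumption made here: this passage does not state which similarity measure is used, and the function name and the handling of zero vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an all-zero vector is treated as maximally dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)
```

Under this measure, vectors pointing in the same direction score close to 1 (as expected for the same speaker in FIG. 8B), while unrelated vectors score close to 0 (as expected for speech against silence in FIG. 8A).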

In step S480, the processor 120, for example the wake-up determination unit 124, compares the similarity between the first and second speaker feature vectors with a preset reference value. If the similarity is greater than the preset reference value, this indicates that the speaker of the first audio data in the current section and the speaker of the second audio data in the previous section are the same, so the wake-up determination unit 124 may determine that the keyword is not included in the first audio data and may not proceed with wake-up. In this case, the process returns to step S410, where the processor 120, for example the audio processing unit 121, receives an audio signal corresponding to ambient sound.

In the example of FIG. 8B, one person uttered “four-leaf clover...”, so the similarity between the first and second speaker feature vectors will be high. The wake-up determination unit 124 may judge that the person who uttered “four-leaf clover” has no intention of waking up the voice control device 100, determine that the wake-up keyword is not included in the first audio data AD1, and not wake up the voice control device 100.

If the similarity between the first and second speaker feature vectors is equal to or less than the preset reference value, this indicates that the speaker of the first audio data in the current section and the speaker of the second audio data in the previous section are different, so the wake-up determination unit 124 may determine that the keyword is included in the first audio data. In this case, the wake-up determination unit 124 may wake up some components of the voice control device 100. For example, the wake-up determination unit 124 may wake up the speech recognition unit 125.

In the example of FIG. 8A, the first speaker feature vector corresponds to the speaker who uttered “Clova” and the second speaker feature vector corresponds to silence, so the similarity between the first and second speaker feature vectors may be lower than the preset reference value. In this case, the wake-up determination unit 124 may determine that the wake-up keyword “Clova” is included in the first audio data AD1, and the speech recognition unit 125 may be woken up to recognize a natural-language voice command.
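The comparison in step S480 can be sketched as a small decision function. This is a hedged illustration: the step labels S410 and S490 come from the description above, while the `NextStep` enum, the function name, and the example reference value of 0.5 are assumptions introduced here.

```python
from enum import Enum


class NextStep(Enum):
    KEEP_LISTENING = "S410"  # no wake-up; go back to receiving audio
    RECOGNIZE = "S490"       # wake up and receive the third audio data


def decide_wake_up(similarity: float, reference: float) -> NextStep:
    """Apply the step S480 rule to a speaker similarity score."""
    if similarity > reference:
        # Same speaker before and during the keyword: treat it as ordinary
        # speech that merely contains the keyword, so do not wake up.
        return NextStep.KEEP_LISTENING
    # Different speaker or silence before the keyword: accept the keyword.
    return NextStep.RECOGNIZE
```

With a hypothetical reference value of 0.5, a low score such as 0.1 (FIG. 8A, silence before “Clova”) leads to `NextStep.RECOGNIZE`, and a high score such as 0.9 (FIG. 8B, “four-leaf clover”) leads to `NextStep.KEEP_LISTENING`. A score exactly equal to the reference value also leads to wake-up, matching the "equal to or less than" condition.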

In step S490, the processor 120, for example the speech recognition unit 125, receives from the audio processing unit 121 third audio data corresponding to the next section following the current section. The next section is the section immediately following the current section, and the start point of the next section may coincide with the end point of the current section.

The speech recognition unit 125 may determine the end point of the next section when silence of a preset length is detected in the third audio data. The speech recognition unit 125 may perform speech recognition on the third audio data, and may do so in various ways. According to another example, to obtain a speech recognition result for the third audio data, the speech recognition unit 125 may transmit the third audio data to an external device, for example the server 200 with a speech recognition function shown in FIG. 2. The server 200 may receive the third audio data, generate a character string (text) corresponding to the third audio data by performing speech recognition on it, and transmit the generated character string (text) to the speech recognition unit 125 as the speech recognition result.
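Determining the end point of the next section from a run of silence can be sketched as follows. This is a minimal sketch under assumptions: the specification does not say how silence is detected, so the per-frame energy representation, the energy threshold, and the function and parameter names are hypothetical.

```python
def find_end_point(frame_energies, silence_threshold, min_silent_frames):
    """Return the index of the first frame of the closing silence run.

    A frame counts as silent when its energy is below `silence_threshold`.
    The section ends once `min_silent_frames` consecutive silent frames
    have been seen. Returns None if no such run exists yet.
    """
    run = 0  # length of the current run of consecutive silent frames
    for index, energy in enumerate(frame_energies):
        if energy < silence_threshold:
            run += 1
            if run == min_silent_frames:
                return index - min_silent_frames + 1
        else:
            run = 0  # speech resumed, so the run starts over
    return None
```

A short pause inside the command does not end the section, because the run counter resets as soon as a non-silent frame appears; only a pause of the preset length closes it.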

In the example of FIG. 8A, the third audio data of the next section is a natural-language voice command such as “Tell me tomorrow's weather”. The speech recognition unit 125 may perform speech recognition on the third audio data directly to generate a speech recognition result, or may transmit the third audio data to the outside (for example, the server 200) so that speech recognition is performed on it there.

In step S500, the processor 120, for example the function unit 126, may perform a function corresponding to the speech recognition result of the third audio data. In the example of FIG. 8A, the function unit 126 may be a voice information providing unit that searches for tomorrow's weather and provides the result; it may search for tomorrow's weather over the Internet and provide the result to the user. The function unit 126 may also provide the search result for tomorrow's weather as speech through the speaker 152. The function unit 126 may be woken up in response to the speech recognition result of the third audio data.

The embodiments of the present invention described above may be implemented in the form of a computer program that can be executed on a computer through various components, and such a computer program may be recorded on a computer-readable medium. The medium may store the computer-executable program permanently, or may store it temporarily for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or several pieces of hardware combined; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and the like, configured to store program instructions. Other examples include recording or storage media managed by app stores that distribute applications, or by sites and servers that supply or distribute various other software.

In this specification, a “unit”, “module”, or the like may be a hardware component such as a processor or circuit, and/or a software component executed by a hardware component such as a processor. For example, a “unit”, “module”, or the like may be implemented by components such as software components, object-oriented software components, class components, and task components, and by processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.

The foregoing description of the present invention is given by way of example, and a person of ordinary skill in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing the technical idea or essential features of the invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in a combined form.

The scope of the present invention is indicated by the claims that follow rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: Voice control device (electronic device)
110: Memory
120: Processor
121: Audio processing unit
122: Keyword detection unit
123: Speaker feature vector extraction unit
124: Wake-up determination unit
125: Speech recognition unit
126: Function unit

Claims

A voice control device comprising:
an audio processing unit that receives an audio signal corresponding to ambient sound and generates audio stream data;
a keyword detection unit that detects a candidate keyword corresponding to a predefined keyword from the audio stream data and determines a first section corresponding to first audio data in which the candidate keyword is detected in the audio stream data;
a speaker feature vector extraction unit that extracts a first speaker feature vector for the first audio data and extracts a second speaker feature vector for second audio data corresponding to a second section preceding the first section in the audio stream data; and
a wake-up determination unit that determines whether the keyword is included in the first audio data based on the similarity between the first speaker feature vector and the second speaker feature vector.

The voice control device according to claim 1,
wherein the wake-up determination unit determines that the keyword is included in the first audio data when the similarity between the first speaker feature vector and the second speaker feature vector is equal to or less than a preset reference value.

The voice control device according to claim 1,
wherein, when the keyword detection unit detects the candidate keyword corresponding to a single command keyword in the audio stream data,
the speaker feature vector extraction unit receives third audio data corresponding to a third section following the first section in the audio stream data and extracts a third speaker feature vector for the third audio data, and
the wake-up determination unit determines whether the single command keyword is included in the first audio data based on the similarity between the first speaker feature vector and the second speaker feature vector and the similarity between the first speaker feature vector and the third speaker feature vector.

The voice control device according to claim 3,
wherein the wake-up determination unit determines that the single command keyword is included in the first audio data when the similarity between the first speaker feature vector and the second speaker feature vector is equal to or less than a preset reference value and the similarity between the first speaker feature vector and the third speaker feature vector is equal to or less than a preset reference value.

The voice control device according to claim 1,
further comprising a speech recognition unit that, when the keyword detection unit detects the candidate keyword corresponding to a wake-up keyword in the audio stream data,
and in response to the wake-up determination unit determining that the wake-up keyword is included in the first audio data, receives third audio data corresponding to a third section following the first section in the audio stream data, and either performs speech recognition on the third audio data or transmits the third audio data to the outside so that speech recognition is performed on it.

A method of operating a voice control device, the method comprising:
receiving an audio signal corresponding to ambient sound and generating audio stream data;
detecting a candidate keyword corresponding to a predefined keyword from the audio stream data and determining a first section corresponding to first audio data in which the candidate keyword is detected in the audio stream data;
extracting a first speaker feature vector for the first audio data;
extracting a second speaker feature vector for second audio data corresponding to a second section preceding the first section in the audio stream data; and
determining whether the keyword is included in the first audio data based on the similarity between the first speaker feature vector and the second speaker feature vector, and thereby determining whether to perform wake-up.

The method according to claim 6,
wherein the determining of whether to perform wake-up comprises:
comparing the similarity between the first speaker feature vector and the second speaker feature vector with a preset reference value;
determining that the keyword is included in the first audio data and performing wake-up when the similarity is equal to or less than the preset reference value; and
determining that the keyword is not included in the first audio data and not performing wake-up when the similarity exceeds the preset reference value.

The method according to claim 6,
further comprising, when the detected candidate keyword is a candidate keyword corresponding to a single command keyword:
receiving third audio data corresponding to a third section following the first section in the audio stream data;
extracting a third speaker feature vector for the third audio data; and
determining that the single command keyword is included in the first audio data when the similarity between the first speaker feature vector and the second speaker feature vector is equal to or less than a preset reference value and the similarity between the first speaker feature vector and the third speaker feature vector is equal to or less than a preset reference value.

The method according to claim 6,
further comprising, when the detected candidate keyword is a candidate keyword corresponding to a wake-up keyword:
receiving third audio data corresponding to a third section following the first section in the audio stream data, in response to determining that the wake-up keyword is included in the first audio data; and
performing speech recognition on the third audio data, or transmitting the third audio data to the outside so that speech recognition is performed on it.

A computer program comprising instructions for causing a processor of a voice control device to perform the method of any one of claims 6 to 9.