KR101986354B1 - Speech-controlled apparatus for preventing false detections of keyword and method of operating the same

Info

Publication number: KR101986354B1
Application number: KR1020170062391A
Authority: KR
Inventors: 김병열; 한익상; 권오혁; 이봉진; 오명우; 최민석; 이찬규; 임정희; 최지수; 강한용; 김수환; 최정아
Original assignee: 네이버 주식회사; 라인 가부시키가이샤
Priority date: 2017-05-19
Filing date: 2017-05-19
Publication date: 2019-09-30
Also published as: KR20180127065A; JP2019133182A; JP2018194844A; JP2022033258A; JP6510117B2

Abstract

Provided are a voice control device capable of preventing keyword misrecognition and a method of operating the same. The method includes: receiving an audio signal corresponding to ambient sound and generating audio stream data; detecting a candidate keyword corresponding to a predefined keyword from the audio stream data, and determining a start point and an end point of a first section of the audio stream data corresponding to first audio data in which the candidate keyword is detected; extracting a first speaker feature vector for the first audio data; extracting a second speaker feature vector for second audio data corresponding to a second section of the audio stream data whose end point is the start point of the first section; and determining, based on a similarity between the first speaker feature vector and the second speaker feature vector, whether the first audio data includes the keyword, and thereby determining whether to wake up.

Description

Speech-controlled apparatus for preventing false detections of keyword and method of operating the same

The present disclosure relates to a voice control device and, more particularly, to a voice control device that can prevent a keyword from being misrecognized and to a method of operating the same.

As the performance of computing devices such as portable communication devices, desktops, tablets, and entertainment systems has advanced, electronic devices that are equipped with a voice recognition function and controlled by voice have been released to improve operability. The voice recognition function has the advantage that a device can be controlled easily by recognizing the user's voice, without operating a separate button or touching a touch module.

With such a voice recognition function, a portable communication device such as a smartphone can, for example, place a call or compose a text message without any button being pressed, and various functions such as route finding, Internet search, and alarm setting can be set up easily. However, such a voice control device may misrecognize the user's voice and perform an unintended operation.

One embodiment may provide a voice control device that can prevent misrecognition of a keyword, and a method of operating the same.

As a technical means for achieving the above technical objective, a first aspect of the present disclosure may provide a voice control device including: an audio processing unit that receives an audio signal corresponding to ambient sound and generates audio stream data; a keyword detection unit that detects a candidate keyword corresponding to a predefined keyword from the audio stream data, and determines a start point and an end point of a first section of the audio stream data corresponding to first audio data in which the candidate keyword is detected; a speaker feature vector extraction unit that extracts a first speaker feature vector for the first audio data, and extracts a second speaker feature vector for second audio data corresponding to a second section of the audio stream data whose end point is the start point of the first section; and a wake-up determination unit that determines, based on a similarity between the first speaker feature vector and the second speaker feature vector, whether the first audio data includes the keyword.

A second aspect of the present disclosure may provide a method of operating a voice control device, the method including: receiving an audio signal corresponding to ambient sound and generating audio stream data; detecting a candidate keyword corresponding to a predefined keyword from the audio stream data, and determining a start point and an end point of a first section of the audio stream data corresponding to first audio data in which the candidate keyword is detected; extracting a first speaker feature vector for the first audio data; extracting a second speaker feature vector for second audio data corresponding to a second section of the audio stream data whose end point is the start point of the first section; and determining, based on a similarity between the first speaker feature vector and the second speaker feature vector, whether the first audio data includes the keyword, and thereby determining whether to wake up.
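The steps of the operating method can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the disclosure: the names `Section`, `preceding_section`, `speaker_vector`, and `should_wake_up`, the `lookback` and `threshold` parameters, the use of a frame-wise mean as a stand-in for a real speaker embedding, and the use of cosine similarity. The decision rule (do not wake up when the audio just before the candidate keyword appears to come from the same speaker) is likewise an assumed reading of "based on the similarity".

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Section:
    start: int  # index of the first frame (inclusive)
    end: int    # index one past the last frame (exclusive)


def preceding_section(first: Section, lookback: int) -> Section:
    """Second section: its end point is the start point of the first section."""
    return Section(max(0, first.start - lookback), first.start)


def speaker_vector(frames: Sequence[Sequence[float]]) -> List[float]:
    """Stand-in speaker feature vector: the mean of the per-frame features."""
    if not frames:
        return []
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 for empty or all-zero vectors."""
    if not a or not b:
        return 0.0
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def should_wake_up(stream: Sequence[Sequence[float]], first: Section,
                   lookback: int = 50, threshold: float = 0.8) -> bool:
    """Decide whether to wake up for a candidate keyword found in `first`."""
    second = preceding_section(first, lookback)
    v1 = speaker_vector(stream[first.start:first.end])
    v2 = speaker_vector(stream[second.start:second.end])
    # Assumed rule: a high similarity means the same speaker was already
    # talking right before the candidate keyword, so it is treated as part
    # of ordinary speech and the device does not wake up.
    return cosine_similarity(v1, v2) < threshold
```

For example, a stream of silent frames followed by keyword frames would wake the device under these assumptions, while a stream in which the same voice features continue into the keyword section would not.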

A third aspect of the present disclosure may provide a computer-readable recording medium on which are recorded one or more programs including instructions that cause a processor of a voice control device to execute the operating method according to the second aspect.

According to various embodiments of the present disclosure, the likelihood of misrecognizing a keyword is reduced, so that malfunction of the voice control device can be prevented.

FIG. 1 is a diagram illustrating an example of a network environment according to an embodiment.
FIG. 2 is a block diagram illustrating the internal configuration of an electronic device and a server according to an embodiment.
FIG. 3 is a diagram illustrating an example of functional blocks that a processor of a voice control device may include according to an embodiment.
FIG. 4 is a flowchart illustrating an example of an operating method that a voice control device may perform according to an embodiment.
FIG. 5 is a flowchart illustrating an example of an operating method that a voice control device may perform according to another embodiment.
FIG. 6A illustrates an example in which a single command keyword is uttered when a voice control device according to an embodiment executes the operating method of FIG. 5.
FIG. 6B illustrates an example in which ordinary conversational speech is uttered when a voice control device according to an embodiment executes the operating method of FIG. 5.
FIG. 7 is a flowchart illustrating an example of an operating method that a voice control device may perform according to yet another embodiment.
FIG. 8A illustrates an example in which a wake-up keyword and a natural-language voice command are uttered when a voice control device according to an embodiment executes the operating method of FIG. 7.
FIG. 8B illustrates an example in which ordinary conversational speech is uttered when a voice control device according to an embodiment executes the operating method of FIG. 7.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can easily practice them. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted in order to describe the present invention clearly, and like reference numerals designate like parts throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element in between. In addition, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

Phrases such as "in some embodiments" or "in one embodiment" appearing in various places in this specification do not necessarily all refer to the same embodiment.

Some embodiments may be represented by functional block configurations and various processing steps. Some or all of these functional blocks may be implemented by any number of hardware and/or software components that perform particular functions. For example, the functional blocks of the present disclosure may be implemented by one or more microprocessors or by circuit configurations for a given function. The functional blocks of the present disclosure may also be implemented in various programming or scripting languages, and may be implemented as algorithms running on one or more processors. In addition, the present disclosure may employ conventional techniques for electronic configuration, signal processing, and/or data processing. Terms such as "module" and "configuration" are used broadly and are not limited to mechanical and physical configurations.

In addition, the connecting lines or connecting members between the components shown in the drawings merely illustrate functional connections and/or physical or circuit connections. In an actual device, the connections between components may be represented by various functional, physical, or circuit connections that are replaceable or added.

In the present disclosure, a keyword refers to voice information that can wake up a specific function of the voice control device. A keyword is based on the user's voice signal and may be either a single command keyword or a wake-up keyword. A wake-up keyword is a voice-based keyword that can switch a voice control device from sleep mode to wake-up mode, for example "Clova" or "Hi Computer". After uttering the wake-up keyword, the user may utter, in natural language, a command indicating the function or operation that the user wants the voice control device to perform. In this case, the voice control device may perform voice recognition on the natural-language voice command and perform the function or operation corresponding to the recognition result. A single command keyword is a voice keyword that can directly control the operation of the voice control device, for example "stop" while music is playing. The wake-up keyword referred to in the present disclosure may also be called a wake-up word, a hotword, a trigger word, and the like.
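The difference between the two keyword types can be sketched as a small state machine. This is an illustrative sketch only: the class `VoiceController`, the romanized keyword sets, the `log` list, and the choice to return to sleep mode after a single natural-language command are all assumptions, not details stated in the disclosure.

```python
from enum import Enum, auto


class Mode(Enum):
    SLEEP = auto()
    AWAKE = auto()


# Romanized example keywords from the text; the sets themselves are hypothetical.
WAKE_UP_KEYWORDS = {"clova", "hi computer"}
SINGLE_COMMAND_KEYWORDS = {"stop"}


class VoiceController:
    """Toy controller that reacts to already-recognized utterances."""

    def __init__(self) -> None:
        self.mode = Mode.SLEEP
        self.log = []  # records what the device would do

    def on_utterance(self, text: str) -> None:
        phrase = text.strip().lower()
        if phrase in SINGLE_COMMAND_KEYWORDS:
            # A single command keyword controls the device directly.
            self.log.append(f"execute:{phrase}")
        elif self.mode is Mode.SLEEP:
            # In sleep mode only a wake-up keyword has any effect.
            if phrase in WAKE_UP_KEYWORDS:
                self.mode = Mode.AWAKE
        else:
            # Awake: treat the utterance as a natural-language command,
            # then (by assumption) go back to sleep.
            self.log.append(f"recognize:{phrase}")
            self.mode = Mode.SLEEP
```

In this sketch, "play jazz" is ignored while the controller is asleep, is passed to recognition if it follows "Clova", and "stop" is executed in either mode.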

In the present disclosure, candidate keywords include words whose pronunciation is similar to that of a keyword. For example, if the keyword is "Clova", candidate keywords may be "clover", "global", "club", and the like. A candidate keyword may be defined as whatever the keyword detection unit of the voice control device detects as the keyword in the audio data. A candidate keyword may be identical to the keyword, but it may also be a different word with a similar pronunciation. In general, a voice control device may wake up even when the user utters a sentence containing a term that corresponds to a candidate keyword, because it misrecognizes that term as the keyword. The voice control device according to the present disclosure also reacts when such a candidate keyword is detected in the voice signal, but it can prevent itself from being woken up by the candidate keyword.
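To make the notion of a candidate keyword concrete, the following toy sketch flags words whose romanized spelling is close to the keyword. It is only a stand-in: a real keyword detection unit works on audio, not on spelling, and the function names, the use of `difflib.SequenceMatcher`, and the threshold of 0.4 are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import Iterable, List


def spelling_similarity(keyword: str, word: str) -> float:
    """Similarity in [0, 1] between two romanized words (spelling only)."""
    return SequenceMatcher(None, keyword.lower(), word.lower()).ratio()


def candidate_keywords(keyword: str, words: Iterable[str],
                       threshold: float = 0.4) -> List[str]:
    """Return the words that a loose detector might mistake for `keyword`."""
    return [w for w in words if spelling_similarity(keyword, w) >= threshold]


spoken_words = ["clova", "clover", "global", "club", "weather"]
print(candidate_keywords("clova", spoken_words))
```

With this threshold, the keyword itself and the three similar-sounding examples from the text are flagged, while an unrelated word such as "weather" is not. A loose threshold of this kind is what makes a later verification step necessary.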

In the present disclosure, the voice recognition function refers to converting the user's voice signal into a character string (or text). The user's voice signal may include a voice command, and the voice command may execute a specific function of the voice control device.

In the present disclosure, a voice control device refers to an electronic device equipped with a voice control function. Such an electronic device may be a standalone electronic device such as a smart speaker or an artificial intelligence speaker. It may also be a computing device equipped with a voice control function, such as a desktop or a laptop, or a portable computing device such as a smartphone. In this case, a program or application for executing the voice control function may be installed on the computing device. An electronic device equipped with a voice control function may also be an electronic product that mainly performs a specific function, such as a smart television, a smart refrigerator, a smart air conditioner, or a smart navigation device, or it may be an automotive infotainment system. Internet of Things devices that can be controlled by voice also fall into this category.

In the present disclosure, a specific function of the voice control device may include, for example, executing an application installed on the voice control device, but is not limited thereto. For example, when the voice control device is a smart speaker, its specific functions may include music playback, Internet shopping, provision of voice information, and control of electronic or mechanical devices connected to the smart speaker. When the voice control device is a smartphone, executing an application may include placing a call, finding a route, searching the Internet, or setting an alarm. When the voice control device is a smart television, executing an application may include searching for a program or a channel. When the voice control device is a smart oven, executing an application may include searching for a cooking method. When the voice control device is a smart refrigerator, executing an application may include checking the refrigeration and freezing status or setting the temperature. When the voice control device is a smart car, executing an application may include automatic starting, autonomous driving, automatic parking, and the like. In the present disclosure, executing an application is not limited to the above.

In the present disclosure, a keyword may take the form of a word or a phrase. A voice command uttered after the wake-up keyword may take the form of a natural-language sentence, a word, or a phrase.

Hereinafter, the present disclosure is described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating an example of a network environment according to an embodiment.

The network environment shown in FIG. 1 is illustrated, by way of example, as including a plurality of electronic devices 100a-100f, a server 200, and a network 300.

The electronic devices 100a-100f are exemplary electronic devices that can be controlled by voice. Each of the electronic devices 100a-100f may execute a specific function in addition to the voice recognition function. Examples of the electronic devices 100a-100f include smart or artificial intelligence speakers, smartphones, mobile phones, navigation devices, computers, laptops, digital broadcasting terminals, personal digital assistants (PDAs), portable multimedia players (PMPs), tablet PCs, and smart electronic products. The electronic devices 100a-100f may communicate with the server 200 and/or with other electronic devices 100a-100f through the network 300 using a wireless or wired communication scheme. However, the disclosure is not limited thereto, and each of the electronic devices 100a-100f may also operate independently without being connected to the network 300. The electronic devices 100a-100f may be collectively referred to as the electronic device 100.

The communication scheme of the network 300 is not limited and may include not only communication schemes that use a communication network that the network 300 may include (for example, a mobile communication network, the wired Internet, the wireless Internet, or a broadcasting network) but also short-range wireless communication between the electronic devices 100a-100f. For example, the network 300 may include any one or more of a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), a broadband network (BBN), the Internet, and the like. The network 300 may also include any one or more network topologies, including a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network, but is not limited thereto.

The server 200 communicates with the electronic devices 100a-100f through the network 300 and may be implemented as a computer device, or a plurality of computer devices, that performs a voice recognition function. The server 200 may be distributed in cloud form and may provide commands, code, files, content, and the like.

For example, the server 200 may receive an audio file provided by the electronic devices 100a-100f, convert the voice signal in the audio file into a character string (or text), and provide the converted character string (or text) to the electronic devices 100a-100f. The server 200 may also provide the electronic devices 100a-100f connected through the network 300 with a file for installing an application that performs the voice control function. For example, the second electronic device 100b may install the application using the file provided by the server 200. Under the control of an installed operating system (OS) and/or at least one program (for example, the installed voice control application), the second electronic device 100b may connect to the server 200 and receive the voice recognition service that the server 200 provides.
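The exchange described above, in which a device sends audio and the server returns text, can be sketched as a pair of message types and a handler. All names here (`RecognitionRequest`, `RecognitionResult`, `RecognitionServer`, `fake_transcribe`) are hypothetical, the network transport is omitted, and the transcription function is a placeholder rather than a real speech recognizer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RecognitionRequest:
    device_id: str  # e.g. "100b" for the second electronic device
    audio: bytes    # contents of the audio file sent by the device


@dataclass(frozen=True)
class RecognitionResult:
    device_id: str
    text: str       # character string produced by voice recognition


class RecognitionServer:
    """Stand-in for the server: turns a request into a text result."""

    def __init__(self, transcribe: Callable[[bytes], str]) -> None:
        self._transcribe = transcribe

    def handle(self, request: RecognitionRequest) -> RecognitionResult:
        return RecognitionResult(request.device_id,
                                 self._transcribe(request.audio))


def fake_transcribe(audio: bytes) -> str:
    """Placeholder recognizer; a real one would decode speech from audio."""
    return f"<{len(audio)} bytes of audio>"


server = RecognitionServer(fake_transcribe)
result = server.handle(RecognitionRequest("100b", b"\x00\x01"))
print(result.text)
```

Passing the recognizer in as a function keeps the sketch independent of any particular speech recognition engine.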

FIG. 2 is a block diagram illustrating the internal configuration of an electronic device and a server according to an embodiment.

The electronic device 100 is one of the electronic devices 100a-100f of FIG. 1, and the electronic devices 100a-100f may have at least the internal configuration shown in FIG. 2. Although the electronic device 100 is shown as being connected through the network 300 to a server 200 that performs a voice recognition function, this is exemplary, and the electronic device 100 may also perform the voice recognition function independently. The electronic device 100 is an electronic device that can be controlled by voice and may be referred to as the voice control device 100. The voice control device 100 may be included in a smart or artificial intelligence speaker, a computing device, a portable computing device, a smart home appliance, or the like, or may be implemented so as to be connected to such a device by wire and/or wirelessly.

The electronic device 100 and the server 200 may include memories 110 and 210, processors 120 and 220, communication modules 130 and 230, and input/output interfaces 140 and 240. The memories 110 and 210 are computer-readable recording media and may include random access memory (RAM), read-only memory (ROM), and a permanent mass storage device such as a disk drive. The memories 110 and 210 may also store an operating system and at least one piece of program code (for example, code for a voice control application or a voice recognition application installed and run on the electronic device 100). These software components may also be loaded into the memories 110 and 210 through the communication modules 130 and 230 rather than from a computer-readable recording medium. For example, at least one program may be loaded into the memories 110 and 210 on the basis of a program installed from files that developers, or a file distribution system that distributes application installation files, provide through the network 300.

The processors 120 and 220 may be configured to process the instructions of a computer program by performing basic arithmetic, logic, and input/output operations. The instructions may be provided to the processors 120 and 220 by the memories 110 and 210 or by the communication modules 130 and 230. For example, the processors 120 and 220 may be configured to execute instructions received in accordance with program code stored in a recording device such as the memories 110 and 210.

The communication modules 130 and 230 may provide a function that allows the electronic device 100 and the server 200 to communicate with each other through the network 300, and may provide a function for communicating with the other electronic devices 100b-100f. For example, a request (such as a voice recognition service request) that the processor 120 of the electronic device 100 generates in accordance with program code stored in a recording device such as the memory 110 may be delivered to the server 200 through the network 300 under the control of the communication module 130. Conversely, a character string (text) or the like that is the voice recognition result provided under the control of the processor 220 of the server 200 may pass through the communication module 230 and the network 300 and be received by the electronic device 100 through its communication module 130. For example, the voice recognition result of the server 200 received through the communication module 130 may be delivered to the processor 120 or the memory 110. The server 200 may transmit control signals, commands, content, files, and the like to the electronic device 100. Control signals and commands received through the communication module 130 may be delivered to the processor 120 or the memory 110, while content and files may be stored in a separate storage medium that the electronic device 100 may further include.

The input/output interfaces 140 and 240 may be means for interfacing with the input/output devices 150. For example, the input devices may include not only the microphone 151 but also devices such as a keyboard or a mouse, and the output devices may include not only the speaker 152 but also devices such as a status light-emitting diode (LED) that indicates a state and a display for showing a communication session of an application. As another example, the input/output devices 150 may include a device in which input and output functions are integrated, such as a touchscreen.

The microphone 151 may convert ambient sound into an electrical audio signal. The microphone 151 need not be mounted directly in the electronic device 100. It may instead be mounted in an external device (e.g., a smart watch) that is communicatively connected to the electronic device 100, and the signal generated there may be transmitted to the electronic device 100 by communication. Although FIG. 2 shows the microphone 151 as being included inside the electronic device 100, according to another embodiment the microphone 151 may be included in a separate device and connected to the electronic device 100 by wired or wireless communication.

In other embodiments, the electronic device 100 and the server 200 may include more components than those shown in FIG. 2. For example, the electronic device 100 may be configured to include at least some of the input/output devices 150 described above, or may further include other components such as a transceiver, a GPS (Global Positioning System) module, a camera, various sensors, and a database.

FIG. 3 is a diagram illustrating an example of functional blocks that a processor of a voice control device according to an embodiment may include, and FIG. 4 is a flowchart illustrating an example of an operation method that the voice control device may perform according to an embodiment.

As shown in FIG. 3, the processor 120 of the voice control device 100 may include an audio processor 121, a keyword detector 122, a speaker feature vector extractor 123, a wake-up determiner 124, a voice recognizer 125, and a function unit 126. The processor 120 and at least some of the functional blocks 121-126 may control the voice control device 100 to perform steps S110 to S190 of the operation method shown in FIG. 4. For example, the processor 120 and at least some of its functional blocks 121-126 may be implemented to execute instructions according to the code of an operating system and at least one program code contained in the memory 110 of the voice control device 100.

Some or all of the functional blocks 121-126 shown in FIG. 3 may be implemented as hardware and/or software components that perform particular functions. The functions performed by the functional blocks 121-126 may be implemented by one or more microprocessors or by circuit configurations dedicated to those functions. Some or all of the functional blocks 121-126 may be software modules written in various programming or scripting languages and executed on the processor 120. For example, the audio processor 121 and the keyword detector 122 may be implemented as a digital signal processor (DSP), while the speaker feature vector extractor 123, the wake-up determiner 124, and the voice recognizer 125 may be implemented as software modules.

The audio processor 121 receives an audio signal corresponding to ambient sound and generates audio stream data. The audio processor 121 may receive the audio signal from an input device such as the microphone 151. The microphone 151 may be included in a peripheral device that is communicatively connected to the voice control device 100, in which case the audio processor 121 may receive the audio signal generated by the microphone 151 by communication. The ambient sound includes not only the voice uttered by the user but also background sound. The audio signal therefore contains a background sound signal as well as a voice signal. The background sound signal may act as noise in keyword detection and voice recognition.

The audio processor 121 may generate audio stream data corresponding to the continuously received audio signal. The audio processor 121 may generate the audio stream data by filtering and digitizing the audio signal. By filtering the audio signal, the audio processor 121 may remove noise and amplify the voice signal relative to the background sound signal. The audio processor 121 may also remove echo of the voice signal from the audio signal.

The audio processor 121 may operate at all times so that it can receive the audio signal even while the voice control device 100 is in a sleep mode. The audio processor 121 may run at a low operating frequency while the voice control device 100 is in the sleep mode and at a high operating frequency while the voice control device 100 is in a normal mode.

The memory 110 may temporarily store the audio stream data generated by the audio processor 121. The audio processor 121 may buffer the audio stream data using the memory 110. The memory 110 stores not only the audio data that contains the keyword but also the audio data that precedes the detection of the keyword. To make room for the most recent audio data, the oldest audio data stored in the memory 110 may be deleted. As long as the size allocated in the memory 110 stays the same, audio data covering the same length of time is stored at all times. The length of time covered by the audio data stored in the memory 110 is preferably longer than the time it takes to utter the keyword.
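The buffering behavior described above resembles a fixed-capacity ring buffer. The following is a minimal sketch under that assumption. The class name `AudioRingBuffer`, the frame-based indexing, and the `read` method are hypothetical and are not taken from the patent.

```python
from collections import deque


class AudioRingBuffer:
    """Fixed-capacity frame buffer (hypothetical helper).

    Appending beyond the capacity silently drops the oldest frame, so the
    buffer always covers the same length of recent audio.
    """

    def __init__(self, capacity_frames):
        self._frames = deque(maxlen=capacity_frames)
        self._next_index = 0  # absolute index of the next frame to append

    def append(self, frame):
        self._frames.append(frame)
        self._next_index += 1

    @property
    def oldest_index(self):
        """Absolute index of the oldest frame still held in the buffer."""
        return self._next_index - len(self._frames)

    def read(self, start, end):
        """Return frames with absolute indices in [start, end),
        clipped to what is still buffered."""
        lo = max(start, self.oldest_index)
        hi = min(end, self._next_index)
        offset = self.oldest_index
        return [self._frames[i - offset] for i in range(lo, hi)]
```

The capacity would be chosen so that the buffer spans more than one keyword utterance, which leaves room for the audio that precedes a detected candidate keyword.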

According to another embodiment of the present invention, speaker feature vectors may be extracted for the audio stream generated by the audio processor 121 and stored in the memory 110. In this case, a speaker feature vector may be extracted and stored for each portion of the audio stream of a specific length. As described above, the oldest stored speaker feature vector may be deleted to make room for the speaker feature vector of the most recently generated audio stream.

The keyword detector 122 detects a candidate keyword corresponding to a predefined keyword from the audio stream data generated by the audio processor 121. The keyword detector 122 may detect the candidate keyword from the audio stream data temporarily stored in the memory 110. There may be a plurality of predefined keywords, and these may be stored in a keyword store 110a. The keyword store 110a may be included in the memory 110.

A candidate keyword is whatever the keyword detector 122 has detected as a keyword in the audio stream data. The candidate keyword may be identical to the keyword, or it may be a different word that is pronounced similarly to the keyword. For example, if the keyword is "Clova" (클로바), the candidate keyword may be "global" (글로벌). That is, when the user utters a sentence containing "global", the keyword detector 122 may mistake "global" for "Clova" in the audio stream data and detect it. The "global" detected in this way is a candidate keyword.

The keyword detector 122 may compare the audio stream data with known keyword data and calculate the likelihood that the audio stream data contains speech corresponding to the keyword. The keyword detector 122 may extract audio features such as filter bank energies or Mel-frequency cepstral coefficients (MFCCs) from the audio stream data. The keyword detector 122 may process these audio features using classifying windows, for example with a support vector machine or a neural network. Based on this processing, the keyword detector 122 may calculate the likelihood that the audio stream data contains the keyword. When this likelihood is higher than a preset reference value, the keyword detector 122 may determine that the audio stream data contains the keyword and thereby detect a candidate keyword.

The keyword detector 122 may generate an artificial neural network using voice samples corresponding to the keyword data, and may be trained to detect the keyword in the audio stream data using the generated neural network. For each frame in the audio stream data, the keyword detector 122 may calculate the probability of each phoneme that makes up the keyword, or the overall probability of the keyword. From the audio stream data, the keyword detector 122 may output a sequence of phoneme probabilities or the probability of the keyword itself. Based on this sequence or probability, the keyword detector 122 may calculate the likelihood that the audio stream data contains the keyword, and may determine that a candidate keyword has been detected when the likelihood is equal to or greater than a preset reference value. The approach described above is exemplary, and the keyword detector 122 may be implemented in various ways.

In addition, by extracting audio features for each frame in the audio stream data, the keyword detector 122 may calculate the likelihood that the audio data of the frame corresponds to a human voice and the likelihood that it corresponds to background sound. The keyword detector 122 may compare the two likelihoods to determine whether the audio data of the frame corresponds to a human voice. For example, when the likelihood that the audio data of the frame corresponds to a human voice exceeds the likelihood that it corresponds to background sound by more than a preset reference value, the keyword detector 122 may determine that the audio data of the frame corresponds to a human voice.
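The per-frame voice decision described above can be sketched as a simple margin test. This is only an illustration: the function names, the representation of each frame as a `(p_speech, p_background)` pair, and the default margin of 0.2 are assumptions, since the patent does not give a concrete reference value.

```python
def is_speech_frame(p_speech, p_background, margin=0.2):
    """Return True when the speech likelihood exceeds the background
    likelihood by more than `margin` (an assumed reference value)."""
    return (p_speech - p_background) > margin


def flag_speech_frames(frame_probs, margin=0.2):
    """Map a list of (p_speech, p_background) pairs to per-frame flags."""
    return [is_speech_frame(ps, pb, margin) for ps, pb in frame_probs]
```

For instance, a frame scored `(0.9, 0.1)` is flagged as speech, while frames scored `(0.55, 0.45)` or `(0.2, 0.8)` are not.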

The keyword detector 122 may identify the section of the audio stream data in which the candidate keyword was detected, and may determine the start point and end point of that section. The section in which the candidate keyword was detected may be referred to as the keyword detection section, the current section, or the first section. The audio data corresponding to the first section is referred to as the first audio data. The keyword detector 122 may take the end of the section in which the candidate keyword was detected as the end point. According to another example, after detecting the candidate keyword, the keyword detector 122 may wait until silence lasting a preset time (e.g., 0.5 seconds) occurs, and may then set the end point of the first section either so that the silent period is included in the first section or so that it is excluded.
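One possible reading of the silence-based end point rule is sketched below. It assumes that per-frame speech flags are available and that the preset silence time has been converted to a frame count (for example, 0.5 seconds at a hypothetical 10 ms frame step would be 50 frames). The function name and its arguments are illustrative and do not come from the patent.

```python
def first_section_end(speech_flags, keyword_end, silence_frames,
                      include_silence=True):
    """Find the end point (exclusive frame index) of the first section.

    speech_flags[i] is True when frame i is speech. keyword_end is the
    exclusive frame index at which the candidate keyword ends. The scan
    looks for the first run of `silence_frames` consecutive non-speech
    frames at or after keyword_end. Returns None if no such run has been
    observed yet, meaning the caller should keep waiting.
    """
    run = 0
    for i in range(keyword_end, len(speech_flags)):
        if speech_flags[i]:
            run = 0
            continue
        run += 1
        if run == silence_frames:
            silence_start = i - silence_frames + 1
            # Either keep the silent period inside the section or cut it off.
            return i + 1 if include_silence else silence_start
    return None
```

The `include_silence` switch corresponds to the two alternatives in the text: ending the first section after the silent period or just before it.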

The speaker feature vector extractor 123 reads from the memory 110 the second audio data, which corresponds to a second section of the audio stream data temporarily stored in the memory 110. The second section is the section immediately preceding the first section, and the end point of the second section may coincide with the start point of the first section. The second section may be referred to as the previous section. The length of the second section may be set variably depending on the keyword to which the detected candidate keyword corresponds. According to another example, the length of the second section may be fixed. According to yet another example, the length of the second section may be varied adaptively so that keyword detection performance is optimized. For example, when the audio signal output by the microphone 151 is "four-leaf clover" (네잎 클로버) and the candidate keyword is "clover" (클로버), the second audio data may correspond to the speech "four-leaf" (네잎).
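The relationship between the two sections can be expressed as simple index arithmetic. The sketch below assumes frame-indexed sections with exclusive end points. The function name, the `oldest_available` parameter (the oldest frame still buffered), and the example lengths are hypothetical.

```python
def second_section(first_start, length, oldest_available=0):
    """Return (start, end) of the second section as frame indices.

    The second section ends exactly where the first section starts and
    extends `length` frames backwards, clipped to the oldest frame that
    is still available in the buffer.
    """
    start = max(first_start - length, oldest_available)
    return start, first_start
```

With a fixed length of 100 frames and a first section starting at frame 120, the second section would be frames 20 to 120. A keyword-dependent or adaptive length, as mentioned in the text, would only change the `length` argument.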

The speaker feature vector extractor 123 extracts a first speaker feature vector from the first audio data corresponding to the first section and a second speaker feature vector from the second audio data corresponding to the second section. The speaker feature vector extractor 123 may extract from the audio data a speaker feature vector that is robust for speaker recognition. The speaker feature vector extractor 123 may extract the speaker feature vector by converting a time-domain voice signal into a frequency-domain signal and transforming the frequency energies of the converted signal in different ways. For example, the speaker feature vector may be extracted based on Mel-frequency cepstral coefficients (MFCCs) or filter bank energies, but the extraction is not limited thereto, and the speaker feature vector may be extracted from the audio data in various ways.

The speaker feature vector extractor 123 may normally operate in a sleep mode. When the keyword detector 122 detects a candidate keyword in the audio stream data, it may wake up the speaker feature vector extractor 123 by sending it a wake-up signal. The speaker feature vector extractor 123 may wake up in response to the wake-up signal, which indicates that the keyword detector 122 has detected a candidate keyword.

According to an embodiment, the speaker feature vector extractor 123 may extract a frame feature vector for each frame of the audio data, and may normalize and average the extracted frame feature vectors to obtain a speaker feature vector that represents the audio data. L2 normalization may be used to normalize the extracted frame feature vectors. The averaging may be achieved by normalizing the frame feature vector extracted for every frame in the audio data and computing the mean of the resulting normalized frame feature vectors.
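The normalize-then-average step described above can be sketched as follows. The function names are illustrative, the frame feature vectors are assumed to be plain lists of floats of equal length, and leaving an all-zero vector unchanged is an assumption made here to avoid division by zero.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm. An all-zero vector is returned
    unchanged (an assumption made here, not stated in the patent)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def speaker_feature_vector(frame_vectors):
    """L2-normalize every frame feature vector, then average them
    element-wise to get one vector representing the audio data."""
    if not frame_vectors:
        raise ValueError("at least one frame feature vector is required")
    normalized = [l2_normalize(v) for v in frame_vectors]
    count = len(normalized)
    dim = len(normalized[0])
    return [sum(v[d] for v in normalized) / count for d in range(dim)]
```

Normalizing before averaging keeps loud frames from dominating the mean, since every frame contributes a unit-length vector.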

For example, the speaker feature vector extractor 123 may extract a first frame feature vector for each frame of the first audio data, and may normalize and average the extracted first frame feature vectors to obtain the first speaker feature vector representing the first audio data. Likewise, the speaker feature vector extractor 123 may extract a second frame feature vector for each frame of the second audio data, and may normalize and average the extracted second frame feature vectors to obtain the second speaker feature vector representing the second audio data.

According to another embodiment, the speaker feature vector extractor 123 may extract frame feature vectors for only some of the frames in the audio data rather than for all of them. These frames may be selected as voice frames, that is, frames whose audio data is likely to be the user's voice data. The selection of voice frames may be performed by the keyword detector 122. For each frame of the audio stream data, the keyword detector 122 may calculate a first probability that the frame is a human voice and a second probability that it is background sound. The keyword detector 122 may determine as voice frames those frames for which the first probability exceeds the second probability by more than a preset reference value. The keyword detector 122 may store in the memory 110, in association with each frame of the audio stream data, a flag or bit indicating whether the frame is a voice frame.

When reading the first and second audio data from the memory 110, the speaker feature vector extractor 123 may read the flag or bit along with them and thereby know whether each frame is a voice frame.

The speaker feature vector extractor 123 may extract a frame feature vector for each of the frames in the audio data that have been determined to be voice frames, and may normalize and average the extracted frame feature vectors to obtain a speaker feature vector representing the audio data. For example, the speaker feature vector extractor 123 may extract a first frame feature vector for each voice frame in the first audio data, and may normalize and average the extracted first frame feature vectors to obtain the first speaker feature vector representing the first audio data. Likewise, the speaker feature vector extractor 123 may extract a second frame feature vector for each voice frame in the second audio data, and may normalize and average the extracted second frame feature vectors to obtain the second speaker feature vector representing the second audio data.

Based on the similarity between the first speaker feature vector and the second speaker feature vector extracted by the speaker feature vector extractor 123, the wake-up determiner 124 determines whether the first audio data contains the keyword, that is, whether the audio signal of the first section contains the keyword. The wake-up determiner 124 may compare the similarity between the first speaker feature vector and the second speaker feature vector with a preset reference value and, when the similarity is less than or equal to the reference value, determine that the first audio data of the first section contains the keyword.

A typical case in which the voice control device 100 misrecognizes the keyword is when a word pronounced similarly to the keyword appears in the middle of the user's speech. For example, if the keyword is "Clova" (클로바), the voice control device 100 may wake up in response to "clover" (클로버) even when the user is merely saying to another person, "How can I find a four-leaf clover?", and may then perform an operation the user did not intend. Even when a television news announcer says "The market capitalization of JN Global is ...", the voice control device 100 may wake up in response to "global" (글로벌). To prevent such keyword misrecognition, according to an embodiment, the voice control device 100 may respond to a word pronounced similarly to the keyword only when that word is at the very beginning of an utterance. However, in an environment with heavy background noise, or one in which other people are talking, the background noise or the other people's conversation may prevent the device from sensing that the user uttered the keyword at the very beginning, even when the user did so. According to an embodiment, the voice control device 100 extracts the first speaker feature vector of the section in which the candidate keyword was detected and the second speaker feature vector of the previous section, and, when the two vectors differ from each other, may determine that the user uttered the speech corresponding to the keyword at the very beginning.

To make this determination, the wake-up determiner 124 may determine that the user uttered the speech corresponding to the keyword at the very beginning when the similarity between the first speaker feature vector and the second speaker feature vector is less than or equal to a preset reference value. In that case, the wake-up determiner 124 may determine that the first audio data of the first section contains the keyword, and may wake up some functions of the voice control device 100. Conversely, a high similarity between the first speaker feature vector and the second speaker feature vector means that the person who spoke the speech corresponding to the first audio data and the person who spoke the speech corresponding to the second audio data are likely to be the same, which suggests that the candidate keyword occurred in the middle of that person's speech.

When the second audio data corresponds to silence, the speaker feature vector extractor 123 may extract from the second audio data a second speaker feature vector that corresponds to silence. Because the speaker feature vector extractor 123 extracts from the first audio data a first speaker feature vector that corresponds to the user's voice, the similarity between the first speaker feature vector and the second speaker feature vector may be low.
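The wake-up decision described above reduces to comparing a similarity score with a reference value. The patent does not name a similarity measure, so the sketch below uses cosine similarity as an assumption. The function names, the default threshold of 0.5, and the choice to score an all-zero vector as similarity 0.0 are likewise hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors. Returns 0.0 when
    either vector is all zeros (an assumption made for this sketch)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_wake_up(first_vec, second_vec, threshold=0.5):
    """Wake up when the keyword section and the preceding section look
    like different speakers, i.e. similarity <= threshold (assumed value)."""
    return cosine_similarity(first_vec, second_vec) <= threshold
```

Under this sketch, a keyword preceded by speech from the same speaker gives a high similarity and no wake-up, while a keyword preceded by silence or by a different voice gives a low similarity and a wake-up.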

The voice recognizer 125 may receive third audio data, which corresponds to a third section of the audio stream data generated by the audio processor 121, and may perform voice recognition on the third audio data. According to another example, the voice recognizer 125 may transmit the third audio data to an external entity (e.g., the server 200) so that voice recognition is performed there, and may receive the voice recognition result.

The function unit 126 may perform a function corresponding to the keyword. For example, when the voice control device 100 is a smart speaker, the function unit 126 may include a music player, a voice information provider, a peripheral device controller, and the like, and may perform the function corresponding to the detected keyword. When the voice control device 100 is a smartphone, the function unit 126 may include a call connection unit, a text message transceiver, an Internet search unit, and the like, and may perform the function corresponding to the detected keyword. The function unit 126 may be configured in various ways depending on the type of the voice control device 100. The function unit 126 collectively represents the functional blocks for performing the various functions that the voice control device 100 can execute.

Although the voice control device 100 shown in FIG. 3 is illustrated as including the voice recognizer 125, this is exemplary. The voice control device 100 may omit the voice recognizer 125, and the server 200 shown in FIG. 2 may perform the voice recognition function instead. In this case, as shown in FIG. 1, the voice control device 100 may be connected through the network 300 to the server 200 that performs the voice recognition function. The voice control device 100 may provide the server 200 with a voice file containing a voice signal that requires voice recognition, and the server 200 may perform voice recognition on the voice signal in the voice file to generate a character string corresponding to the voice signal. The server 200 may transmit the generated character string to the voice control device 100 through the network 300. The description below, however, assumes that the voice control device 100 includes the voice recognizer 125 that performs the voice recognition function.

The processor 120 may load, into the memory 110, program code stored in a program file for the operation method. For example, a program may be installed in the voice control device 100 according to the program file. When the program installed in the voice control device 100 is executed, the processor 120 may load the program code into the memory 110. At least some of the audio processor 121, the keyword detector 122, the speaker feature vector extractor 123, the wakeup determiner 124, the voice recognition unit 125, and the function unit 126 included in the processor 120 may each be implemented to execute steps S110 to S190 of FIG. 4 by executing instructions according to the corresponding portion of the program code loaded in the memory 110.

Hereinafter, a statement that the functional blocks 121 to 126 of the processor 120 control the voice control device 100 may be understood to mean that the processor 120 controls the other components of the voice control device 100. For example, the processor 120 may control the communication module 130 included in the voice control device 100 so that the voice control device 100 communicates with, for example, the server 200.

In step S110, the processor 120, for example, the audio processor 121, receives an audio signal corresponding to ambient sound. The audio processor 121 may receive the audio signal corresponding to the ambient sound continuously. The audio signal may be an electrical signal that an input device such as the microphone 151 generates in response to the ambient sound.

In step S120, the processor 120, for example, the audio processor 121, generates audio stream data based on the audio signal from the microphone 151. The audio stream data corresponds to the continuously received audio signal. The audio stream data may be data generated by filtering and digitizing the audio signal.

In step S130, the processor 120, for example, the audio processor 121, temporarily stores the audio stream data generated in step S120 in the memory 110. Because the memory 110 has a limited size, only the portion of the audio stream data that corresponds to the audio signal of the most recent predetermined period up to the present may be temporarily stored in the memory 110. When new audio stream data is generated, the oldest audio stream data stored in the memory 110 may be deleted, and the new audio stream data may be stored in the space of the memory 110 freed by the deletion.
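The temporary storage of step S130 behaves like a fixed-size ring buffer. The following is a minimal sketch under stated assumptions, not the implementation of the embodiment: the class name `AudioRingBuffer`, its methods, and the sample-count capacity are hypothetical, and Python's `collections.deque` with `maxlen` stands in for the memory 110.

```python
from collections import deque


class AudioRingBuffer:
    """Hypothetical stand-in for the temporary storage in the memory 110.

    Keeps only the most recent `capacity` samples of the audio stream data.
    """

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen silently discards the oldest items when full.
        self._samples = deque(maxlen=capacity)
        # Total number of samples ever written, including discarded ones.
        self.total_written = 0

    def write(self, chunk: list) -> None:
        """Append new audio stream data, overwriting the oldest data."""
        self._samples.extend(chunk)
        self.total_written += len(chunk)

    def snapshot(self) -> list:
        """Return the currently stored samples, oldest first."""
        return list(self._samples)


buf = AudioRingBuffer(capacity=4)
buf.write([1, 2, 3])
buf.write([4, 5, 6])  # samples 1 and 2 are the oldest and are dropped
print(buf.snapshot())
```

A real device would size the buffer so that it always holds at least the previous section needed in step S160.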

In step S140, the processor 120, for example, the keyword detector 122, detects a candidate keyword corresponding to a predefined keyword from the audio stream data generated in step S120. The candidate keyword is a word whose pronunciation is similar to that of the predefined keyword, and refers to the word that the keyword detector 122 detects as the keyword in step S140.

In step S150, the processor 120, for example, the keyword detector 122, identifies the keyword detection section of the audio stream data in which the candidate keyword was detected, and determines the start point and the end point of the keyword detection section. The keyword detection section may be referred to as the current section. The data in the audio stream data that corresponds to the current section may be referred to as first audio data.

In step S160, the processor 120, for example, the speaker feature vector extractor 123, reads second audio data corresponding to a previous section from the memory 110. The previous section is the section immediately before the current section, and the end point of the previous section may be the same as the start point of the current section. The speaker feature vector extractor 123 may also read the first audio data from the memory 110.
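Steps S150 and S160 amount to slicing two adjacent sections out of the buffered stream. The sketch below illustrates this under assumptions: the function name `read_sections`, the sample-index representation of the start and end points, and the `prev_len` parameter are hypothetical, since the description does not fix how section boundaries are encoded.

```python
def read_sections(stream: list, start: int, end: int, prev_len: int):
    """Return (first_audio_data, second_audio_data) from buffered samples.

    `start` and `end` are the start and end points of the current section
    (hypothetically expressed as sample indices). The previous section ends
    exactly where the current section starts.
    """
    if not (0 <= start < end <= len(stream)):
        raise ValueError("current section must lie inside the buffered stream")
    # Clamp so the previous section never reaches before the buffered data.
    prev_start = max(0, start - prev_len)
    second_audio_data = stream[prev_start:start]
    first_audio_data = stream[start:end]
    return first_audio_data, second_audio_data


stream = list(range(10))
first, second = read_sections(stream, start=6, end=9, prev_len=4)
print(first, second)
```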

In step S170, the processor 120, for example, the speaker feature vector extractor 123, extracts a first speaker feature vector from the first audio data and a second speaker feature vector from the second audio data. The first speaker feature vector is an indicator for identifying the speaker of the voice corresponding to the first audio data, and the second speaker feature vector is an indicator for identifying the speaker of the voice corresponding to the second audio data. The processor 120, for example, the wakeup determiner 124, may determine whether the keyword is included in the first audio data based on the similarity between the first speaker feature vector and the second speaker feature vector. When the wakeup determiner 124 determines that the keyword is included in the first audio data, it may wake up some components of the voice control device 100.

In step S180, the processor 120, for example, the wakeup determiner 124, compares the similarity between the first speaker feature vector and the second speaker feature vector with a preset reference value.

When the similarity between the first speaker feature vector and the second speaker feature vector is less than or equal to the preset reference value, the speaker of the first audio data in the current section differs from the speaker of the second audio data in the previous section, so the wakeup determiner 124 may determine that the keyword is included in the first audio data. In this case, as in step S190, the processor 120, for example, the wakeup determiner 124, may wake up some components of the voice control device 100.

However, when the similarity between the first speaker feature vector and the second speaker feature vector is greater than the preset reference value, the speaker of the first audio data in the current section is the same as the speaker of the second audio data in the previous section, so the wakeup determiner 124 may determine that the keyword is not included in the first audio data and may not proceed with the wake-up. In this case, the process returns to step S110 and receives the audio signal corresponding to the ambient sound. The reception of the audio signal in step S110 continues even while steps S120 to S190 are performed.
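The decision of steps S180 and S190 can be sketched as follows. This is an illustration only: the description does not specify how the similarity is computed or what the reference value is, so cosine similarity, the function names, and the threshold of 0.5 are all assumptions.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity, used here as an assumed similarity measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def should_wake_up(first_vec: list, second_vec: list,
                   reference: float = 0.5) -> bool:
    """Wake up only if the similarity is at or below the reference value.

    A low similarity suggests that the speaker of the current section
    differs from the speaker of the previous section (or that the previous
    section holds no speech), so the candidate keyword is accepted.
    """
    return cosine_similarity(first_vec, second_vec) <= reference


# Hypothetical feature vectors, not produced by any real extractor.
print(should_wake_up([1.0, 0.0], [0.0, 1.0]))  # dissimilar vectors
print(should_wake_up([1.0, 0.0], [2.0, 0.0]))  # same direction
```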

A plurality of predefined keywords may be stored in the keyword storage 110a of FIG. 3. Each of these keywords may be a wake-up keyword or a standalone command keyword. A wake-up keyword is used to wake up some functions of the voice control device 100. In general, a user utters the wake-up keyword and then utters the desired natural language voice command. The voice control device 100 may perform voice recognition on the natural language voice command and perform the operation and function corresponding to it.

A standalone command keyword causes the voice control device 100 to perform a specific operation or function directly, and may be a simple predefined word such as “재생” ("play") or “중지” ("stop"). When a standalone command keyword is received, the voice control device 100 may wake up the function corresponding to the standalone command keyword and perform that function.
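The two kinds of keywords could be represented as a simple lookup table. The sketch below is an assumption about one possible layout of the keyword storage 110a; the names `KeywordType`, `KEYWORD_STORE`, and `keyword_type` are hypothetical, while the keywords themselves are examples taken from this description.

```python
from enum import Enum
from typing import Optional


class KeywordType(Enum):
    WAKE_UP = "wake_up"
    STANDALONE_COMMAND = "standalone_command"


# Hypothetical contents of the keyword storage 110a.
KEYWORD_STORE = {
    "클로바": KeywordType.WAKE_UP,            # "Clova"
    "재생": KeywordType.STANDALONE_COMMAND,   # "play"
    "중지": KeywordType.STANDALONE_COMMAND,   # "stop"
}


def keyword_type(keyword: str) -> Optional[KeywordType]:
    """Return the type of a predefined keyword, or None if it is unknown."""
    return KEYWORD_STORE.get(keyword)


print(keyword_type("중지"))
```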

The following describes, in turn, the case where a candidate keyword corresponding to a standalone command keyword is detected from the audio stream data and the case where a candidate keyword corresponding to a wake-up keyword is detected from the audio stream data.

FIG. 5 is a flowchart illustrating an example of an operation method that the voice control device may perform according to another embodiment.

FIG. 6A illustrates an example in which a standalone command keyword is uttered while the voice control device according to an embodiment executes the operation method of FIG. 5, and FIG. 6B illustrates an example in which ordinary conversational speech is uttered while the voice control device according to an embodiment executes the operation method of FIG. 5.

The operation method of FIG. 5 includes steps that are substantially the same as those of the operation method of FIG. 4. Steps of FIG. 5 that are substantially the same as steps of FIG. 4 are not described in detail. FIGS. 6A and 6B show audio signals corresponding to the audio stream data and the user's speech corresponding to those audio signals. FIG. 6A shows audio signals corresponding to the utterance “중지” ("stop"), and FIG. 6B shows audio signals corresponding to the utterance “여기서 정지해” ("stop here"), which contains the similar-sounding word “정지” ("halt").

Referring to FIG. 5 together with FIGS. 6A and 6B, in step S210, the processor 120, for example, the audio processor 121, receives an audio signal corresponding to ambient sound.

In step S220, the processor 120, for example, the audio processor 121, generates audio stream data based on the audio signal from the microphone 151.

In step S230, the processor 120, for example, the audio processor 121, temporarily stores the audio stream data generated in step S220 in the memory 110.

In step S240, the processor 120, for example, the keyword detector 122, detects a candidate keyword corresponding to a predefined standalone command keyword from the audio stream data generated in step S220. A standalone command keyword may be a voice keyword that directly controls the operation of the voice control device 100. For example, the standalone command keyword may be a word such as “중지” ("stop"), as shown in FIG. 6A. In this case, the voice control device 100 may be playing, for example, music or a video.

In the example of FIG. 6A, the keyword detector 122 may detect the candidate keyword “중지” ("stop") in the audio signals. In the example of FIG. 6B, the keyword detector 122 may detect in the audio signals the candidate keyword “정지” ("halt"), a word whose pronunciation is similar to that of the keyword “중지”.

In step S250, the processor 120, for example, the keyword detector 122, identifies the keyword detection section of the audio stream data in which the candidate keyword was detected, and determines the start point and the end point of the keyword detection section. The keyword detection section may be referred to as the current section. The data in the audio stream data that corresponds to the current section may be referred to as first audio data.

In the example of FIG. 6A, the keyword detector 122 may identify the section in which the candidate keyword “중지” was detected as the current section and determine the start point and the end point of the current section. The audio data corresponding to the current section may be referred to as first audio data AD1.

In the example of FIG. 6B, the keyword detector 122 may identify the section in which the candidate keyword “정지” was detected as the current section and determine the start point and the end point of the current section. The audio data corresponding to the current section may be referred to as first audio data AD1.

Also in step S250, the processor 120, for example, the keyword detector 122, may determine whether the detected candidate keyword corresponds to a wake-up keyword or to a standalone command keyword. In the examples of FIGS. 6A and 6B, the keyword detector 122 may determine that the detected candidate keywords, that is, “중지” and “정지”, are candidate keywords corresponding to a standalone command keyword.

In step S260, the processor 120, for example, the speaker feature vector extractor 123, reads second audio data corresponding to a previous section from the memory 110. The previous section is the section immediately before the current section, and the end point of the previous section may be the same as the start point of the current section. The speaker feature vector extractor 123 may also read the first audio data from the memory 110.

In each of the examples of FIGS. 6A and 6B, the speaker feature vector extractor 123 may read from the memory 110 the second audio data AD2 corresponding to the previous section, which is the section immediately before the current section. In the example of FIG. 6B, the second audio data AD2 may correspond to the speech “기서”, the part of “여기서” ("here") that immediately precedes “정지”. The length of the previous section may be set variably depending on the detected candidate keyword.

In step S270, the processor 120, for example, the speaker feature vector extractor 123, receives from the audio processor 121 third audio data corresponding to a next section that follows the current section. The next section is the section immediately after the current section, and the start point of the next section may be the same as the end point of the current section.

In each of the examples of FIGS. 6A and 6B, the speaker feature vector extractor 123 may receive from the audio processor 121 the third audio data AD3 corresponding to the next section immediately after the current section. In the example of FIG. 6B, the third audio data AD3 may correspond to the speech “해”, the verb ending that follows “정지”. The length of the next section may be set variably depending on the detected candidate keyword.

In step S280, the processor 120, for example, the speaker feature vector extractor 123, extracts first to third speaker feature vectors from the first to third audio data, respectively. Each of the first to third speaker feature vectors is an indicator for identifying the speaker of the voice corresponding to the respective one of the first to third audio data. The processor 120, for example, the wakeup determiner 124, may determine whether the standalone command keyword is included in the first audio data based on the similarity between the first speaker feature vector and the second speaker feature vector and the similarity between the first speaker feature vector and the third speaker feature vector. When the wakeup determiner 124 determines that the standalone command keyword is included in the first audio data, it may wake up some components of the voice control device 100.

In the example of FIG. 6A, the first speaker feature vector corresponding to the first audio data AD1 is an indicator for identifying the speaker who uttered “중지”. Because the second audio data AD2 and the third audio data AD3 are substantially silent, the second and third speaker feature vectors may be vectors corresponding to silence. Accordingly, the similarity between the first speaker feature vector and each of the second and third speaker feature vectors may be low.

As another example, when a person other than the speaker who uttered “중지” speaks during the previous section and the next section, the second and third speaker feature vectors will be vectors corresponding to that other person, so the similarity between the first speaker feature vector and each of the second and third speaker feature vectors may be low.

In the example of FIG. 6B, one person uttered “여기서 정지해” ("stop here"). Accordingly, the first speaker feature vector extracted from the first audio data AD1 corresponding to “정지”, the second speaker feature vector extracted from the second audio data AD2 corresponding to “기서”, and the third speaker feature vector extracted from the third audio data AD3 corresponding to “해” all identify substantially the same speaker, so the similarity among the first to third speaker feature vectors may be high.

In step S290, the processor 120, for example, the wakeup determiner 124, compares the similarity between the first speaker feature vector and the second speaker feature vector with a preset reference value, and compares the similarity between the first speaker feature vector and the third speaker feature vector with the preset reference value. When both similarities are less than or equal to the preset reference value, the speaker of the first audio data in the current section differs from both the speaker of the second audio data in the previous section and the speaker of the third audio data in the next section, so the wakeup determiner 124 may determine that the standalone command keyword is included in the first audio data. In this case, as in step S300, the processor 120, for example, the wakeup determiner 124, provides the standalone command keyword to the function unit 126, and the function unit 126 may perform the function corresponding to the standalone command keyword in response to the determination by the wakeup determiner 124 that the standalone command keyword is included in the first audio data.

In the example of FIG. 6A, the first speaker feature vector corresponds to the speaker who uttered “중지”, and the second and third speaker feature vectors correspond to silence, so the similarities between the first speaker feature vector and the second and third speaker feature vectors may be lower than the preset reference value. In this case, the wakeup determiner 124 may determine that the first audio data AD1 includes the standalone command keyword “중지”. The function unit 126 may then be woken up in response to this determination and may perform the operation or function corresponding to the standalone command keyword “중지”. For example, if the voice control device 100 was playing music, the function unit 126 may stop the music playback in response to the standalone command keyword “중지”.

However, when the similarity between the first speaker feature vector and the second speaker feature vector is greater than the preset reference value, or the similarity between the first speaker feature vector and the third speaker feature vector is greater than the preset reference value, the speaker of the first audio data in the current section is the same as the speaker of the second audio data in the previous section or the speaker of the third audio data in the next section, so the wakeup determiner 124 may determine that the keyword is not included in the first audio data and may not proceed with the wake-up. In this case, the process returns to step S210 and receives the audio signal corresponding to the ambient sound.

In the example of FIG. 6B, one person uttered “여기서 정지해”, so the similarity among the first to third speaker feature vectors will be high. Because the utterance “여기서 정지해” in the example of FIG. 6B does not contain a keyword for controlling or waking up the voice control device, the wakeup determiner 124 may determine that the first audio data AD1 does not include a standalone command keyword and may prevent the function unit 126 from performing the function or operation corresponding to “정지” or “중지”.
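The two-sided check of step S290 can be sketched as a single predicate. The function name and the numeric similarities and reference value below are hypothetical values chosen for illustration; the description does not give concrete numbers.

```python
def is_standalone_command(sim_prev: float, sim_next: float,
                          reference: float) -> bool:
    """Accept the candidate only if BOTH similarities are at or below the
    reference value, i.e. the keyword speaker matches neither the previous
    section nor the next section.
    """
    return sim_prev <= reference and sim_next <= reference


REFERENCE = 0.5  # hypothetical preset reference value

# FIG. 6A-like case: the keyword is surrounded by silence, so both
# similarities are low and the command is accepted.
print(is_standalone_command(0.1, 0.2, REFERENCE))

# FIG. 6B-like case: the same speaker continues before and after the
# candidate keyword, so both similarities are high and it is rejected.
print(is_standalone_command(0.9, 0.8, REFERENCE))
```

Requiring both conditions means that a candidate embedded in a longer utterance is rejected even if only one neighbouring section matches the keyword speaker.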

With conventional techniques, a voice control device may detect the word “정지” within the utterance “여기서 정지해” and perform the function or operation corresponding to “정지”. Such a function or operation is not intended by the user, who may therefore find the voice control device inconvenient to use. According to an embodiment, however, the voice control device 100 can accurately recognize a standalone command keyword in the user's speech, so unlike conventional techniques it may avoid such erroneous operation.

FIG. 7 is a flowchart illustrating an example of an operation method that the voice control device may perform according to yet another embodiment.

FIG. 8A illustrates an example in which a wake-up keyword and a natural language voice command are uttered while the voice control device according to an embodiment executes the operation method of FIG. 7, and FIG. 8B illustrates an example in which ordinary conversational speech is uttered while the voice control device according to an embodiment executes the operation method of FIG. 7.

The operation method of FIG. 7 includes steps that are substantially the same as those of the operation method of FIG. 4. Steps of FIG. 7 that are substantially the same as steps of FIG. 4 are not described in detail. FIGS. 8A and 8B show audio signals corresponding to the audio stream data and the user's speech corresponding to those audio signals. FIG. 8A shows audio signals corresponding to the wake-up keyword “클로바” ("Clova") and the natural language voice command “내일 날씨를 알려줘” ("tell me tomorrow's weather"), and FIG. 8B shows audio signals corresponding to conversational speech meaning "How can I find a four-leaf clover?", which contains the similar-sounding word “클로버” ("clover").

Referring to FIG. 7 together with FIGS. 8A and 8B, in step S410, the processor 120, for example, the audio processor 121, receives an audio signal corresponding to ambient sound. In step S420, the processor 120, for example, the audio processor 121, generates audio stream data based on the audio signal from the microphone 151. In step S430, the processor 120, for example, the audio processor 121, temporarily stores the audio stream data generated in step S420 in the memory 110.

In step S440, the processor 120, for example, the keyword detector 122, detects a candidate keyword corresponding to a predefined wake-up keyword from the audio stream data generated in step S420. A wake-up keyword is a voice-based keyword that can switch the voice control device from a sleep mode to a wake-up mode. For example, the wake-up keyword may be a voice keyword such as “클로바” ("Clova") or “하이 컴퓨터” ("Hi Computer").

In the example of FIG. 8A, the keyword detector 122 may detect the candidate keyword “클로바” ("Clova") in the audio signals. In the example of FIG. 8B, the keyword detector 122 may detect in the audio signals the candidate keyword “클로버” ("clover"), a word whose pronunciation is similar to that of the keyword “클로바”.

In step S450, the processor 120, for example, the keyword detector 122, identifies the keyword detection section of the audio stream data in which the candidate keyword was detected, and determines the start point and the end point of the keyword detection section. The keyword detection section may be referred to as the current section. The data in the audio stream data that corresponds to the current section may be referred to as first audio data.

In the example of FIG. 8A, the keyword detector 122 may identify the section in which the candidate keyword "Clova" was detected as the current section and determine its start point and end point. In the example of FIG. 8B, the keyword detector 122 may identify the section in which the candidate keyword "clover" was detected as the current section and determine its start point and end point. In both examples, the audio data corresponding to the current section may be referred to as first audio data AD1.

Also in operation S450, the processor 120, for example the keyword detector 122, may determine whether the detected candidate keyword corresponds to a wake-up keyword or to a single command keyword. In the examples of FIGS. 8A and 8B, the keyword detector 122 may determine that the detected candidate keywords, "Clova" and "clover", are candidate keywords corresponding to the wake-up keyword.

In operation S460, the processor 120, for example the speaker feature vector extractor 123, reads second audio data corresponding to the previous section from the memory 110. The previous section is the section immediately preceding the current section, and its end point may be the same as the start point of the current section. The speaker feature vector extractor 123 may also read the first audio data from the memory 110.

In the examples of both FIG. 8A and FIG. 8B, the speaker feature vector extractor 123 may read from the memory 110 the second audio data AD2 corresponding to the previous section, that is, the section immediately preceding the current section. In the example of FIG. 8B, the second audio data AD2 may correspond to the spoken word "four-leaf". The length of the previous section may be set variably according to the detected candidate keyword.
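As an illustration of a previous section whose end point is the start point of the current section and whose length depends on the detected keyword, consider the sketch below. The table `PREVIOUS_SECTION_FRAMES`, the default length, and the frame counts are invented for the example; the specification states only that the length may vary with the candidate keyword.

```python
# Hypothetical per-keyword lengths of the previous section, in frames.
PREVIOUS_SECTION_FRAMES = {"clova": 50, "hi computer": 80}
DEFAULT_PREVIOUS_FRAMES = 50  # assumed fallback, not from the specification


def previous_section(keyword: str, current_start: int) -> tuple:
    """Return (start, end) of the previous section as frame indices.

    The end point equals the start point of the current section, and the
    start point is clamped so it never precedes the start of the stream.
    """
    length = PREVIOUS_SECTION_FRAMES.get(keyword, DEFAULT_PREVIOUS_FRAMES)
    start = max(0, current_start - length)
    return start, current_start
```

For instance, `previous_section("clova", 120)` yields the frame range `(70, 120)` under these assumed lengths.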

In operation S470, the processor 120, for example the speaker feature vector extractor 123, extracts first and second speaker feature vectors from the first and second audio data, respectively. The processor 120, for example the wake-up determiner 124, may determine whether the wake-up keyword is included in the first audio data based on the similarity between the first speaker feature vector and the second speaker feature vector. When the wake-up determiner 124 determines that the wake-up keyword is included in the first audio data, it may wake up some components of the voice control device 100.

In the example of FIG. 8A, the first speaker feature vector corresponding to the first audio data AD1 is an indicator for identifying the speaker who uttered "Clova". Because the second audio data AD2 is substantially silence, the second speaker feature vector may be a vector corresponding to silence. Therefore, the similarity between the first speaker feature vector and the second speaker feature vector may be low.

As another example, when someone other than the speaker who uttered "Clova" speaks during the previous section, the second speaker feature vector will be a vector corresponding to that other person, so the similarity between the first speaker feature vector and the second speaker feature vector may be low.

In the example of FIG. 8B, one person said "How can I find a four-leaf clover". The first speaker feature vector extracted from the first audio data AD1 corresponding to "clover" and the second speaker feature vector extracted from the second audio data AD2 corresponding to "four-leaf" are therefore both vectors identifying substantially the same speaker, so the similarity between the first and second speaker feature vectors may be high.
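The specification does not name a particular similarity measure. As one common choice, cosine similarity between two speaker feature vectors could be computed as in the sketch below; the function name and the convention of returning 0.0 for a zero-length vector are assumptions made for the example.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumed convention: an all-zero vector is treated as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)
```

With this measure, vectors pointing in the same direction score close to 1, while orthogonal vectors score 0.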

In operation S480, the processor 120, for example the wake-up determiner 124, compares the similarity between the first and second speaker feature vectors with a preset reference value. When the similarity is greater than the preset reference value, the speaker of the first audio data in the current section and the speaker of the second audio data in the previous section are taken to be the same, so the wake-up determiner 124 may determine that the keyword is not included in the first audio data and may not proceed with wake-up. In this case, the process returns to operation S410, and the processor 120, for example the audio processor 121, receives an audio signal corresponding to ambient sound.

In the example of FIG. 8B, because one person said "four-leaf clover...", the similarity between the first and second speaker feature vectors will be high. The wake-up determiner 124 may therefore judge that the person who said "four-leaf clover" did not intend to wake up the voice control device 100, determine that the wake-up keyword is not included in the first audio data AD1, and not wake up the voice control device 100.

When the similarity between the first and second speaker feature vectors is equal to or less than the preset reference value, the speaker of the first audio data in the current section and the speaker of the second audio data in the previous section are taken to be different, so the wake-up determiner 124 may determine that the keyword is included in the first audio data. In this case, the wake-up determiner 124 may wake up some components of the voice control device 100. For example, the wake-up determiner 124 may wake up the voice recognition unit 125.

In the example of FIG. 8A, the first speaker feature vector corresponds to the speaker who uttered "Clova" and the second speaker feature vector corresponds to silence, so the similarity between the first and second speaker feature vectors may be lower than the preset reference value. In this case, the wake-up determiner 124 may determine that the wake-up keyword "Clova" is included in the first audio data AD1, and the voice recognition unit 125 may wake up to recognize a natural-language voice command.
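The comparison of operation S480 reduces to a single threshold test, sketched below. The function name is hypothetical, and the reference value 0.7 in the usage note is an invented figure; the specification says only that the reference value is preset.

```python
def should_wake_up(similarity: float, reference: float) -> bool:
    """Decide wake-up from the similarity of the two speaker feature vectors.

    A similarity at or below the reference value suggests the candidate keyword
    was not a continuation of the same speaker's preceding speech, so the
    keyword is treated as present. A similarity above the reference value
    suggests the same speaker was already talking, so no wake-up occurs.
    """
    return similarity <= reference
```

With an assumed reference value of 0.7, a FIG. 8A style case (silence before the keyword, similarity of, say, 0.2) would wake the device, while a FIG. 8B style case (same speaker before the keyword, similarity of, say, 0.9) would not.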

In operation S490, the processor 120, for example the voice recognition unit 125, receives from the audio processor 121 third audio data corresponding to the next section after the current section. The next section is the section immediately following the current section, and its start point may be the same as the end point of the current section.

The voice recognition unit 125 may determine the end point of the next section when silence of a preset length is detected in the third audio data. The voice recognition unit 125 may perform voice recognition on the third audio data, and may do so in various ways. According to another example, to obtain a voice recognition result for the third audio data, the voice recognition unit 125 may transmit the third audio data to an external device, for example the server 200 having a voice recognition function shown in FIG. 2. The server 200 may receive the third audio data, generate a character string (text) corresponding to the third audio data by performing voice recognition on it, and transmit the generated character string (text) to the voice recognition unit 125 as the voice recognition result.
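Determining the end point of the next section when silence of a preset length is detected could be sketched as follows. The per-frame voice/silence flags, the function name, and the choice to place the end point at the first frame of the silent run are assumptions for illustration; the specification does not say how silence is detected or where exactly the end point falls.

```python
def find_section_end(voice_flags, start: int, min_silence_frames: int):
    """Return the index where the section ends, or None if no end is found yet.

    voice_flags[i] is True when frame i contains speech. The section ends at
    the first frame of the first run of at least min_silence_frames
    consecutive silent frames at or after start.
    """
    silent_run = 0
    for i in range(start, len(voice_flags)):
        if voice_flags[i]:
            silent_run = 0  # speech resets the silence counter
        else:
            silent_run += 1
            if silent_run == min_silence_frames:
                return i - min_silence_frames + 1
    return None
```

Returning `None` lets a caller keep appending frames and retry until the speaker actually stops.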

In the example of FIG. 8A, the third audio data of the next section is a natural-language voice command such as "Tell me tomorrow's weather". The voice recognition unit 125 may perform voice recognition on the third audio data directly to generate a voice recognition result, or may transmit the third audio data to the outside (for example, the server 200) so that it is voice recognized.

In operation S500, the processor 120, for example the function unit 126, may perform a function corresponding to the voice recognition result of the third audio data. In the example of FIG. 8A, the function unit 126 may be a voice information provider that searches for tomorrow's weather and provides the result; it may search for tomorrow's weather over the Internet and provide the result to the user. The function unit 126 may also provide the search result for tomorrow's weather as speech through the speaker 152. The function unit 126 may be woken up in response to the voice recognition result of the third audio data.

The embodiments of the present invention described above may be implemented in the form of a computer program executable on a computer through various components, and such a computer program may be recorded on a computer-readable medium. The medium may store the computer-executable program persistently, or may store it temporarily for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or several pieces of hardware combined; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and the like, configured to store program instructions. Other examples include recording or storage media managed by app stores that distribute applications, or by sites and servers that supply or distribute various other software.

In this specification, a "unit", "module", or the like may be a hardware component such as a processor or circuit, and/or a software component executed by a hardware component such as a processor. For example, a "unit", "module", or the like may be implemented by components such as software components, object-oriented software components, class components, and task components, and by processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables.

The foregoing description of the present invention is given by way of example, and those of ordinary skill in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present invention is defined by the claims that follow rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: voice control device (electronic device)
110: memory
120: processor
121: audio processor
122: keyword detector
123: speaker feature vector extractor
124: wake-up determiner
125: voice recognition unit
126: function unit

Claims

1. A voice control device comprising:
an audio processor configured to receive an audio signal corresponding to ambient sound and to generate audio stream data;
a keyword detector configured to detect, from the audio stream data, a candidate keyword corresponding to a predefined keyword, and to determine a start point and an end point of a first section corresponding to first audio data in which the candidate keyword is detected in the audio stream data;
a speaker feature vector extractor configured to, when the candidate keyword is detected, determine from the audio stream data second audio data corresponding to a second section whose end point is the start point of the first section, and to extract a first speaker feature vector for the first audio data and a second speaker feature vector for the second audio data; and
a wake-up determiner configured to determine whether the keyword is included in the first audio data based on a similarity between the first speaker feature vector and the second speaker feature vector.

2. The voice control device of claim 1,
wherein the wake-up determiner determines that the keyword is included in the first audio data when the similarity between the first speaker feature vector and the second speaker feature vector is equal to or less than a preset reference value.

3. The voice control device of claim 1,
further comprising a keyword storage configured to store a plurality of keywords including the predefined keyword,
wherein each of the keywords is a wake-up keyword or a single command keyword.

4. The voice control device of claim 3,
wherein, when the keyword detector detects in the audio stream data a candidate keyword corresponding to the single command keyword,
the speaker feature vector extractor receives, from the audio stream data, third audio data corresponding to a third section whose start point is the end point of the first section, and extracts a third speaker feature vector of the third audio data, and
the wake-up determiner determines whether the single command keyword is included in the first audio data based on the similarity between the first speaker feature vector and the second speaker feature vector and a similarity between the first speaker feature vector and the third speaker feature vector.

5. The voice control device of claim 4,
wherein the wake-up determiner determines that the single command keyword is included in the first audio data when the similarity between the first speaker feature vector and the second speaker feature vector is equal to or less than a preset reference value and the similarity between the first speaker feature vector and the third speaker feature vector is equal to or less than a preset reference value.

6. The voice control device of claim 3,
further comprising a voice recognition unit that, when the keyword detector detects in the audio stream data a candidate keyword corresponding to the wake-up keyword,
wakes up in response to the wake-up determiner determining that the wake-up keyword is included in the first audio data, receives third audio data corresponding to a third section of the audio stream data whose start point is the end point of the first section, and either performs voice recognition on the third audio data or transmits the third audio data to the outside so that the third audio data is voice recognized.

7. The voice control device of claim 6,
wherein the second section is variably determined according to the wake-up keyword.

8. The voice control device of claim 1,
wherein the speaker feature vector extractor
extracts a first frame feature vector for each frame of the first audio data, and extracts the first speaker feature vector representing the first audio data by normalizing and averaging the extracted first frame feature vectors, and
extracts a second frame feature vector for each frame of the second audio data, and extracts the second speaker feature vector representing the second audio data by normalizing and averaging the extracted second frame feature vectors.

9. The voice control device of claim 1,
wherein the keyword detector calculates, for each frame of the audio stream data, a first probability that the frame is a human voice and a second probability that the frame is background sound, and determines as a voice frame each frame whose first probability exceeds its second probability by more than a preset reference value, and
wherein the speaker feature vector extractor
extracts a first frame feature vector for each frame determined to be a voice frame among the frames in the first audio data, and extracts the first speaker feature vector representing the first audio data by normalizing and averaging the extracted first frame feature vectors, and
extracts a second frame feature vector for each frame determined to be a voice frame among the frames in the second audio data, and extracts the second speaker feature vector representing the second audio data by normalizing and averaging the extracted second frame feature vectors.

10. (Deleted)

11. A method of operating a voice control device, the method comprising:
receiving an audio signal corresponding to ambient sound and generating audio stream data;
detecting, from the audio stream data, a candidate keyword corresponding to a predefined keyword, and determining a start point and an end point of a first section corresponding to first audio data in which the candidate keyword is detected in the audio stream data;
when the candidate keyword is detected, determining from the audio stream data second audio data corresponding to a second section whose end point is the start point of the first section, and extracting a first speaker feature vector for the first audio data and a second speaker feature vector for the second audio data; and
determining whether the keyword is included in the first audio data based on a similarity between the first speaker feature vector and the second speaker feature vector, and determining whether to wake up.

12. The method of claim 11,
wherein determining whether to wake up comprises:
comparing the similarity between the first speaker feature vector and the second speaker feature vector with a preset reference value;
determining that the keyword is included in the first audio data and waking up when the similarity is equal to or less than the preset reference value; and
determining that the keyword is not included in the first audio data and not waking up when the similarity exceeds the preset reference value.

13. The method of claim 11, further comprising,
when the detected candidate keyword is a candidate keyword corresponding to a single command keyword:
receiving third audio data corresponding to a third section of the audio stream data whose start point is the end point of the first section;
extracting a third speaker feature vector of the third audio data; and
determining that the single command keyword is included in the first audio data when the similarity between the first speaker feature vector and the second speaker feature vector is equal to or less than a preset reference value and a similarity between the first speaker feature vector and the third speaker feature vector is equal to or less than a preset reference value.

14. The method of claim 13,
further comprising performing a function corresponding to the single command keyword in response to determining that the single command keyword is included in the first audio data.

15. The method of claim 11, further comprising,
when the detected candidate keyword is a candidate keyword corresponding to a wake-up keyword:
in response to determining that the wake-up keyword is included in the first audio data, receiving third audio data corresponding to a third section of the audio stream data whose start point is the end point of the first section; and
performing voice recognition on the third audio data, or transmitting the third audio data to the outside so that the third audio data is voice recognized.

16. The method of claim 11,
wherein extracting the first speaker feature vector and the second speaker feature vector comprises:
extracting a first frame feature vector for each frame of the first audio data;
extracting the first speaker feature vector representing the first audio data by normalizing and averaging the extracted first frame feature vectors;
extracting a second frame feature vector for each frame of the second audio data; and
extracting the second speaker feature vector representing the second audio data by normalizing and averaging the extracted second frame feature vectors.

17. The method of claim 11,
further comprising calculating, for each frame of the audio stream data, a first probability that the frame is a human voice and a second probability that the frame is background sound, and determining as a voice frame each frame whose first probability exceeds its second probability by more than a preset reference value,
wherein extracting the first speaker feature vector and the second speaker feature vector comprises:
extracting a first frame feature vector for each frame determined to be a voice frame among the frames in the first audio data;
extracting the first speaker feature vector representing the first audio data by normalizing and averaging the extracted first frame feature vectors;
extracting a second frame feature vector for each frame determined to be a voice frame among the frames in the second audio data; and
extracting the second speaker feature vector representing the second audio data by normalizing and averaging the extracted second frame feature vectors.

18. A computer-readable recording medium having recorded thereon one or more programs comprising instructions that cause a processor of a voice control device to execute the method of any one of claims 11 to 17.