KR20190064383A

KR20190064383A - Device and method for recognizing wake-up word using information related to speech signal

Info

Publication number: KR20190064383A
Application number: KR1020180055963A
Authority: KR
Inventors: 양태영
Original assignee: 주식회사 인텔로이드
Priority date: 2017-11-30
Filing date: 2018-05-16
Publication date: 2019-06-10
Also published as: KR102071867B1

Abstract

Disclosed is a voice recognition device that provides a service through wake-up word recognition. The voice recognition device comprises: a voice reception unit that obtains a voice signal; a processor that senses the motion of the user who uttered the voice corresponding to the voice signal, generates motion information representing that motion, detects a wake-up word from the voice signal based on the motion information, and generates output information based on the detection result; and an output unit that outputs the output information.

Description

Device and method for recognizing a wake-up word using information related to a voice signal

The present disclosure relates to a voice recognition device and a voice recognition method, and more particularly to a device and method for recognizing a wake-up word using information related to a voice signal.

A voice recognition device is a device that converts human speech into codes and inputs them to a computer. Voice recognition methods include the pattern matching method, which performs word-level recognition by comparing standard speech patterns with the input speech and finding the pattern closest to the input, and the discriminant function method, in which a function for distinguishing one word from another is prepared in advance and applied to the input speech to reach a decision. Methods that recognize speech in units such as phonemes rather than whole words, and then recognize context using the recognition information, have also been devised.

Among the various technologies in the voice recognition field, keyword spotting, which detects a wake-up word or keyword contained in speech acquired from a user, has recently drawn attention in many areas. For keyword spotting to work properly, the detection rate, that is, the rate at which keywords contained in speech are recognized and detected, must be high. Alongside the detection rate, an equally important issue in keyword spotting is keyword misrecognition. If a keyword detected from speech is mistaken for a different keyword, a terminal that uses keyword spotting may provide the user with an unwanted service or perform processing the user did not intend. A way to address the low detection rate or high misrecognition rate of existing keyword spotting technology is therefore needed.

Meanwhile, devices that recognize a wake-up word through voice recognition and provide a specific service when the recognition succeeds are being researched and released. Because wake-up word detection is performed in real time by embedded voice recognition, its misrecognition rate is relatively high. Technology relating to methods of recognizing a wake-up word is therefore needed.

The present disclosure has been devised to solve the problems described above, and its object is to provide a voice recognition device or voice recognition method that can improve the accuracy of wake-up word recognition. The present disclosure also provides a voice recognition device or voice recognition method that reduces the misrecognition rate of wake-up word recognition.

According to an embodiment of the present disclosure for solving the above problems, a voice recognition device may be provided that includes: a voice reception unit that acquires a voice signal; a processor that senses the motion of the user who uttered the voice corresponding to the voice signal, generates motion information representing the sensed motion, detects the wake-up word from the voice signal based on the motion information, and generates output information based on the detection result; and an output unit that outputs the output information.

A method according to an embodiment of the present disclosure may include: acquiring a voice signal; sensing the motion of the user who uttered the voice corresponding to the voice signal and generating motion information representing the sensed motion; detecting the wake-up word from the voice signal based on the motion information; and outputting output information generated based on the detection result.

According to yet another aspect, a recording medium readable by an electronic device may include a recording medium on which a program for executing the above-described method on the electronic device is recorded.

According to an embodiment of the present disclosure, the accuracy of wake-up word recognition can be improved and its misrecognition rate reduced. According to an embodiment of the present disclosure, output information can also be provided effectively to the user who uttered the voice. The present disclosure can reduce the misrecognition rate of wake-up word recognition by predicting the situation in which the voice signal was uttered or the intention behind the utterance.

An embodiment of the present disclosure can also recognize the wake-up word based on the characteristics of the environment in which the user's voice was acquired. In this way, the present disclosure can reduce device malfunctions caused by wake-up word misrecognition and improve the power efficiency of wake-up word recognition.

FIG. 1 is a schematic diagram showing a service providing system including a voice recognition device and a server according to an embodiment of the present disclosure.
FIG. 2 is a diagram showing the configuration of a voice recognition device according to an embodiment of the present invention.
FIG. 3 is a diagram showing wake-up word detection results according to the reference similarity, according to an embodiment of the present disclosure.
FIG. 4 is a diagram showing a method by which a voice recognition device according to an embodiment of the present disclosure determines the reference similarity based on the user's gaze direction.
FIG. 5 is a diagram showing a method by which a voice recognition device according to an embodiment of the present disclosure obtains the reference similarity based on information about the surrounding environment of the voice recognition device.
FIG. 6 is a diagram showing a method by which a voice recognition device according to an embodiment of the present disclosure determines the reference similarity based on information related to a person other than the user.
FIG. 7 is a flowchart showing the operation of a voice recognition device according to an embodiment of the present invention.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings so that a person of ordinary skill in the art to which the invention pertains can readily practice them. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described here. To describe the invention clearly, parts of the drawings unrelated to the description are omitted, and similar parts are given similar reference numerals throughout the specification.

When a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them.

The present disclosure relates to a voice recognition device and method that detect a preset wake-up word from a voice signal and provide output information. Specifically, the voice recognition device and method according to an embodiment of the present disclosure can use motion information to reduce the misrecognition rate, that is, the rate at which a voice signal that does not correspond to the wake-up word is wrongly recognized as corresponding to it. Here, the motion information may be information representing the motion of the user who uttered the voice corresponding to the voice signal subject to wake-up word detection. In the present disclosure, a wake-up word may denote a keyword set to trigger the service providing function of the voice recognition device. The present invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is a schematic diagram showing a service providing system including a voice recognition device (100) and a server (200) according to an embodiment of the present disclosure. As shown in FIG. 1, the service providing system may include at least one voice recognition device (100) and a server (200). The service providing system according to an embodiment of the present disclosure may provide a service based on a preset wake-up word (hereinafter, "wake-up word"). For example, the service providing system may recognize an acquired voice signal and provide a service corresponding to the recognition result. In doing so, the service providing system may determine whether the wake-up word is detected from the acquired voice signal. If the wake-up word is detected, the service providing system may provide a service corresponding to the recognition result. Conversely, if the wake-up word is not detected, the service providing system may not perform voice recognition or may not provide a service corresponding to the recognition result. The service providing system may provide output information corresponding to the recognition result through the voice recognition device (100).

According to an embodiment, the voice recognition device (100) may recognize the wake-up word and wake up its service providing function. For example, when the wake-up word is detected from an acquired voice signal, the voice recognition device (100) may wake up the voice recognition operation used to provide the service. The voice recognition device (100) may recognize the wake-up word through an embedded recognition module within the device. Here, wake-up word recognition may denote the operation of determining whether the wake-up word is detected from a voice signal.

Specifically, the voice recognition device (100) may divide the voice signal into units of a preset length (frames). The voice recognition device (100) may extract acoustic features from each of the divided voice signals. The voice recognition device (100) may calculate the similarity between the extracted acoustic features and the acoustic model corresponding to the wake-up word. The voice recognition device (100) may also determine whether the wake-up word is present based on the similarity between the extracted acoustic features and the acoustic model corresponding to the wake-up word or a model for voice recognition. Here, acoustic features may denote the information needed for voice recognition. For example, acoustic features may include formant information and pitch information. Formants are defined as the spectral peaks of the speech spectrum and can be quantified as amplitude peak values in a spectrogram. The formant with the lowest frequency is the first formant (F1), followed in order by F2, F3, and so on; F1 and F2 are mainly used when analyzing the speech a person produces in uttering a vowel. Pitch means the fundamental frequency of the speech and represents its periodic characteristics. The voice recognition device (100) may extract the acoustic features of the voice signal using at least one of LPC (Linear Predictive Coding) cepstrum, PLP (Perceptual Linear Prediction) cepstrum, MFCC (Mel Frequency Cepstral Coefficient), and filter bank energy analysis. For example, if the text data of the acoustic model or voice recognition model most similar to the acoustic features extracted from the voice signal is identical to the text corresponding to the wake-up word, the voice recognition device (100) may determine that the wake-up word has been detected from that voice signal. However, the wake-up word recognition process of the present disclosure is not limited to this.
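The framing, feature extraction, and similarity steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the recognition module of the disclosure: the two-element feature vector (mean energy and zero-crossing rate) is a toy stand-in for LPC cepstrum, PLP cepstrum, MFCC, or filter bank energies, the stored template stands in for the acoustic model of the wake-up word, and all function names are hypothetical.

```python
import math


def split_into_frames(samples, frame_len, hop):
    """Divide a sample sequence into fixed-length, possibly overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def frame_features(frame):
    """Toy acoustic feature vector: [mean energy, zero-crossing rate]."""
    energy = sum(s * s for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [energy, crossings / (len(frame) - 1)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity(signal_feats, template_feats):
    """Mean per-frame cosine similarity over the frames both sequences share."""
    n = min(len(signal_feats), len(template_feats))
    if n == 0:
        return 0.0
    return sum(cosine(s, t) for s, t in zip(signal_feats, template_feats)) / n


# Synthetic input: 80 samples of a sine wave, framed at 20 samples with hop 10.
signal = [math.sin(2 * math.pi * 5 * t / 80) for t in range(80)]
frames = split_into_frames(signal, frame_len=20, hop=10)
feats = [frame_features(f) for f in frames]
score = similarity(feats, feats)  # comparing a signal with itself
```

A real system would replace `frame_features` with a proper cepstral front end and `similarity` with scoring against a trained acoustic model; the sketch only shows how the three steps connect.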

When performing wake-up word recognition, the voice recognition device (100) may detect the wake-up word by referring to a reference similarity. Here, the reference similarity is the threshold similarity value that serves as the criterion for determining whether the wake-up word is detected from the acquired voice signal; it is described in more detail with reference to FIG. 3 below. The voice recognition device (100) may determine the reference similarity based on motion information representing the motion of the user (300) who uttered the voice corresponding to the voice signal. This is described in detail with reference to FIGS. 4 to 6 below. Hereinafter, unless otherwise specified in the present disclosure, the user denotes the user (300) who uttered the voice corresponding to the voice signal.
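The role of the reference similarity as a threshold can be sketched as below. This is only an illustration under assumptions: the passage states that the reference similarity is determined from motion information but does not say how, so the rule used here (lowering the threshold when the motion information suggests the user is addressing the device, raising it otherwise), the numeric values, and the names `reference_similarity` and `wake_word_detected` are all hypothetical.

```python
def reference_similarity(base, user_addressing_device, adjustment=0.1):
    """Return a threshold in [0, 1] adjusted by a motion-derived flag.

    Assumption: a user who appears to address the device gets a lower
    threshold; the disclosure's actual rule is given with FIGS. 4 to 6.
    """
    threshold = base - adjustment if user_addressing_device else base + adjustment
    return min(max(threshold, 0.0), 1.0)


def wake_word_detected(similarity, threshold):
    """The wake-up word counts as detected when similarity reaches the threshold."""
    return similarity >= threshold


# The same similarity score can pass or fail depending on the motion information.
score = 0.65
detected_when_addressing = wake_word_detected(score, reference_similarity(0.7, True))
detected_otherwise = wake_word_detected(score, reference_similarity(0.7, False))
```

Keeping the threshold computation separate from the comparison means the motion-based rule can be replaced without touching the detection step.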

According to another embodiment, wake-up word recognition may be performed by the server (200). In this case, the voice recognition device (100) may transmit the voice signal to the server (200) and request a recognition result. The voice recognition device (100) may then wake up its service providing function based on the recognition result received from the server (200).

In the present disclosure, a service may mean output information corresponding to a recognition result. The service or output information may also include functions specific to the voice recognition device (100). For example, if the voice recognition device (100) is a home appliance equipped with a voice recognition function, the output information may include starting or stopping the appliance. If the voice recognition device (100) is an information providing device, the output information may include information corresponding to a query from the user (300). For example, the voice recognition device (100) may acquire a voice signal corresponding to the voice uttered by the user (300). Through the voice signal, the user (300) may input the wake-up word and various types of requests to the voice recognition device (100). Here, a request may mean a service request to the voice recognition device (100). According to an embodiment, the voice recognition device (100) may be an IoT terminal attached to a wall, but is not limited to this. For example, the voice recognition device (100) may be an IoT terminal in the form of a light installed at an entrance. Alternatively, the voice recognition device (100) may be a home appliance equipped with a voice recognition function, such as a cooling/heating appliance, a set-top box, a refrigerator, or a TV.

When the voice recognition operation for providing a service has been activated, the voice recognition device (100) may recognize speech from a voice signal and provide the service requested by the user (300). Here, the voice signal may be a voice signal acquired within a predetermined time from when the voice signal corresponding to the wake-up word was acquired. The voice recognition device (100) may recognize speech from the voice signal through the embedded recognition module. The voice recognition device (100) may perform voice recognition for providing a service on the voice signal in the same or a similar way as the wake-up word recognition method described above.
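The time condition above, that the request must be acquired within a predetermined time after the wake-up word, can be sketched as a simple window check. The 5-second default and the function name are assumptions made for illustration; the passage does not specify the length of the predetermined time.

```python
def within_request_window(wake_time_s, signal_time_s, window_s=5.0):
    """Return True if a voice signal was acquired no earlier than the
    wake-up word and no later than window_s seconds after it."""
    elapsed = signal_time_s - wake_time_s
    return 0.0 <= elapsed <= window_s


# Wake-up word acquired at t = 10.0 s.
accepted = within_request_window(10.0, 12.5)   # 2.5 s later: inside the window
rejected = within_request_window(10.0, 16.0)   # 6.0 s later: outside the window
```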

Alternatively, the voice recognition device (100) may transmit the acquired voice signal to the server (200) so that speech is recognized from the voice signal there. This is because accurately recognizing the keywords contained in the acquired voice signal may require substantial computation or a large voice recognition database. For example, if the voice recognition device (100) is implemented as a small terminal, it may be more advantageous to keep a separate database externally and use it over a network connection than for the terminal to hold the database itself.

The server (200) according to an embodiment of the present disclosure may perform voice recognition in the same or a similar way as the voice recognition device (100) described above performs wake-up word recognition or voice recognition for providing a service. For example, the server (200) may perform voice recognition on the voice signal obtained from the voice recognition device (100). The server (200) may include a database for voice recognition. Here, the database may include at least one acoustic model or voice recognition model. However, the server (200) does not necessarily include the database, and the service providing system may include a separate storage (not shown) connected to the server (200). In that case, the server (200) may obtain at least one acoustic model or voice recognition model from the storage that contains the database. The server (200) may also transmit to the voice recognition device (100) the result of performing voice recognition on the voice signal obtained from the voice recognition device (100). Although the service providing system in FIG. 1 includes the server (200), the present disclosure is not limited to this. For example, the service providing system according to an embodiment of the present disclosure may not include the server (200). In this case, the functions of the server (200) may be performed by the voice recognition device (100) or by another device, other than the voice recognition device (100), that is equipped with a voice recognition function.

According to an embodiment, when the service providing system includes a plurality of voice recognition devices (100a, 100b), the wake-up word may be set differently for each device. The first voice recognition device (100a) may use '구름' ("cloud") as its wake-up word, and the second voice recognition device (100b) may use '하늘' ("sky"). In this case, when the first voice recognition device (100a) acquires a voice signal whose similarity to '구름' is at or above a certain level, it may wake up its service providing function. On the other hand, when the voice signal uttered by the user (300) contains '하늘', the first voice recognition device (100a) may not wake up its service providing function, because the voice signal does not correspond to the wake-up word used by the first voice recognition device (100a). Likewise, the second voice recognition device (100b) may wake up its service providing function only when it acquires a voice signal corresponding to '하늘'.
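The per-device wake-up word assignment above can be sketched as a small lookup. This sketch makes a simplifying assumption: it matches on already-recognized text by substring, whereas the passage describes matching by similarity against a threshold. The dictionary keys reuse the reference numerals 100a and 100b as device identifiers, and the function name is hypothetical.

```python
# Wake-up word configured for each device, following the example in the text.
WAKE_WORDS = {"100a": "구름", "100b": "하늘"}


def devices_to_wake(recognized_text, wake_words=WAKE_WORDS):
    """Return the devices whose own wake-up word appears in the recognized text."""
    return [device for device, word in wake_words.items()
            if word in recognized_text]


woken_by_sky = devices_to_wake("하늘 오늘 날씨 어때")   # only device 100b
woken_by_cloud = devices_to_wake("구름")                # only device 100a
woken_by_other = devices_to_wake("안녕")                # no device
```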

FIG. 2 is a diagram showing the configuration of the voice recognition device (100) according to an embodiment of the present invention. According to an embodiment, the voice recognition device (100) may include a voice reception unit (110), a processor (120), and an output unit (130). However, some of the components shown in FIG. 2 may be omitted, and components not shown in FIG. 2 may additionally be included. The voice recognition device (100) may also have two or more different components integrated into one. According to an embodiment, the voice recognition device (100) may be implemented as a single semiconductor chip.

The voice reception unit (110) may acquire a voice signal. The voice reception unit (110) may collect voice signals incident on it. According to an embodiment, the voice reception unit (110) may include at least one microphone. For example, the voice reception unit (110) may include a microphone array containing a plurality of microphones. The microphone array may include a plurality of microphones arranged in various shapes other than a circle or sphere, such as a cube or an equilateral triangle. According to another embodiment, the voice reception unit (110) may receive a voice signal corresponding to sound collected by an external sound collection device. For example, the voice reception unit (110) may include a voice signal input terminal to which a voice signal is input. Specifically, the voice reception unit (110) may include a voice signal input terminal that receives a voice signal transmitted by wire. Alternatively, the voice reception unit (110) may receive a voice signal transmitted wirelessly using Bluetooth or Wi-Fi communication.

The processor (120) may control the overall operation of the voice recognition device (100) described throughout the specification. The processor (120) may control each component of the voice recognition device (100). The processor (120) may perform computation and processing on various data and signals. The processor (120) may be implemented as hardware in the form of a semiconductor chip or electronic circuit, or as software that controls hardware. The processor (120) may also be implemented as a combination of hardware and such software. The processor (120) may control the operation of the voice recognition device (100) by executing at least one program contained in the software.

According to an embodiment, the processor (120) may recognize speech from the voice signal obtained through the voice reception unit (110) described above. The processor (120) may acquire the voice signal through the voice reception unit (110). The processor (120) may include the embedded voice recognition function described above. According to an embodiment, the processor (120) may detect the wake-up word from the voice signal using the embedded voice recognition function. The processor (120) may also obtain a voice recognition result from the server. In this case, the processor (120) may transmit the voice signal to the server through a communication unit (not shown) and request the recognition result for the voice signal from the server. The processor (120) may generate output information based on the detection result. When the wake-up word is detected, the processor (120) may wake up the service providing function. In this case, the processor (120) may generate output information that includes information indicating that the service providing function has been woken up. The processor (120) may also perform voice recognition and generate output information corresponding to the recognition result obtained. Conversely, when the wake-up word is not detected, the processor (120) may generate output information that includes information indicating that the wake-up word was not detected. Alternatively, in this case, the processor (120) may not provide output information to the user. The processor (120) may output the generated output information through the output unit (130) described below.

The output unit 130 may output information provided to the user. The output unit 130 may output the output information generated by the processor 120. The output unit 130 may also output the output information converted into a form such as light, sound, or vibration. According to an embodiment, the output unit 130 may be at least one of a speaker, a display, various light sources including an LED, and a monitor, but is not limited thereto. For example, the output unit 130 may output the output information generated based on the wake-up word detection result. Here, the output information may include the wake-up word detection result. The output unit 130 may output a detection signal that differs depending on whether or not the wake-up word is detected. For example, the output unit 130 may output 'blue' light through a light source when the wake-up word is detected and 'red' light when the wake-up word is not detected. The output unit 130 may also output a preset audio signal through a speaker only when the wake-up word is detected.

The output unit 130 may also perform a function specific to the voice recognition device 100. Specifically, when the voice recognition device 100 is an information providing device that includes a voice recognition function, the output unit 130 may provide information corresponding to a user's query in the form of an audio signal or a video signal. For example, the output unit 130 may output information corresponding to the user's query in a text format or a voice format. The output unit 130 may also transmit, to another device connected to the voice recognition device 100 by wire or wirelessly, a control signal for controlling the operation of that device. For example, when the voice recognition device 100 is an IoT terminal attached to a wall, the voice recognition device 100 may transmit a control signal for controlling the temperature of a heating device to the heating device.

Meanwhile, the voice recognition device 100 may malfunction by erroneously recognizing that the wake-up word has been detected from a voice signal even when the user did not utter the corresponding voice with the intention of calling the voice recognition device 100. In particular, when the user utters a word similar to the wake-up word, the voice recognition device 100 may likewise erroneously recognize that the wake-up word has been detected and malfunction. When the voice recognition device 100 is a home appliance equipped with a voice recognition function, false recognition of the wake-up word may cause unnecessary power consumption. The voice recognition device 100 according to an embodiment of the present disclosure may use motion information related to the user to predict the intention with which the user uttered the voice corresponding to the voice signal. When performing voice recognition, the voice recognition device 100 may then apply a different reference similarity according to the predicted utterance intention, thereby reducing false recognition of the wake-up word. This is because, by predicting the utterance intention, the voice recognition device 100 can determine whether the user is likely or unlikely to be calling the voice recognition device.

For example, the voice recognition device 100 may determine the reference similarity based on motion information representing the motion of the user 300. In this way, the voice recognition device 100 may reduce the false recognition rate of wake-up word recognition. The voice recognition device 100 may also reduce unnecessary power consumption caused by false recognition of the wake-up word. Here, the false recognition rate of the voice recognition device 100 is the rate at which the voice recognition device 100 erroneously recognizes the wake-up word as detected when the obtained voice signal does not correspond to the wake-up word. The false recognition rate may be expressed as Equation 1 below.

[Equation 1]

False recognition rate = 100 * (number of recognized words) / (number of non-wake-up-word input words) [%]

In Equation 1, the "number of non-wake-up-word input words" may represent the number of voice input words that are not the wake-up word. The "number of recognized words" may represent the number of those non-wake-up-word input words that were recognized as the wake-up word. Hereinafter, a method by which the voice recognition device 100 reduces the false recognition rate of wake-up word recognition is described in detail with reference to FIGS. 3 to 7.
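As a minimal sketch of Equation 1, the false recognition rate may be computed as follows. The function name, parameter names, and the sample counts are hypothetical and chosen only for illustration; they do not appear in the disclosure.

```python
def false_recognition_rate(recognized_count: int, non_wake_word_inputs: int) -> float:
    """Equation 1: percentage of non-wake-up-word inputs recognized as the wake-up word."""
    if non_wake_word_inputs <= 0:
        raise ValueError("at least one non-wake-up-word input is required")
    if not 0 <= recognized_count <= non_wake_word_inputs:
        raise ValueError("recognized_count must lie between 0 and the number of inputs")
    return 100.0 * recognized_count / non_wake_word_inputs


# Hypothetical test run: 200 non-wake-up words were spoken, 3 were falsely accepted.
print(false_recognition_rate(3, 200))  # 1.5 (percent)
```

A lower value indicates that fewer non-wake-up words trigger the device.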

According to an embodiment, the voice recognition device 100 may extract the wake-up word from the obtained voice signal with reference to the reference similarity. For example, when the similarity between the obtained voice signal and the wake-up word is equal to or greater than the reference similarity, the voice recognition device 100 may determine that the wake-up word has been detected from the voice signal. As described above, the voice recognition device 100 may determine whether the wake-up word is present based on the similarity between the extracted acoustic features and the acoustic model corresponding to the wake-up word. Specifically, the voice recognition device 100 may obtain the similarity between the extracted acoustic features and the acoustic model corresponding to the wake-up word, and compare the obtained similarity with the reference similarity. When the obtained similarity is greater than the reference similarity, the voice recognition device 100 may determine that the wake-up word has been detected from the voice signal. Conversely, when the obtained similarity is less than the reference similarity, the voice recognition device 100 may determine that the wake-up word has not been detected from the voice signal.

FIG. 3 is a diagram illustrating wake-up word detection results according to the reference similarity, according to an embodiment of the present disclosure. FIG. 3 illustrates a case in which the wake-up word is '소리야' (sori-ya) and the voice recognition device 100 recognizes the wake-up word. The similarity in FIG. 3 represents the similarity between each of a plurality of voice signals (voice A, voice A', and voice A'') and the wake-up word. The voice recognition device 100 may extract acoustic features from a voice signal in the manner described above and calculate the similarity between the voice signal and the wake-up word. In FIG. 3, the similarity between a voice signal and the wake-up word is expressed as a value from 0 to 1, but the present disclosure is not limited thereto. For example, the similarity may be expressed as a percentage (%).

The voice recognition device 100 may calculate the similarity between voice A ("소리야") and the wake-up word as 0.95, the similarity between voice A' ("서리야", seori-ya) and the wake-up word as 0.55, and the similarity between voice A'' ("술이야", sul-iya) and the wake-up word as 0.35. Here, voice A may be the voice signal obtained by the voice recognition device 100 when the user correctly pronounces the wake-up word "소리야". When the voice recognition device 100 obtains from the user a voice signal representing voice A' (for example, "서리야"), which is similar to voice A, the voice recognition device 100 may recognize voice A' as '소리야'. Voice A' may be a voice signal sensed when voice A contains noise or when the user's pronunciation differs slightly from voice A. Meanwhile, voice A'' (for example, "술이야") may be a voice signal with low similarity to the wake-up word.

FIG. 3(a) illustrates a case in which the reference similarity is 0.8. In FIG. 3(a), the voice recognition device 100 may determine that the wake-up word has been detected from voice A, because the similarity of voice A satisfies the reference similarity. When the reference similarity is high, the voice recognition device 100 determines that only voice signals with high similarity to the wake-up word correspond to the wake-up word, compared with when the reference similarity is low. In this case, the voice recognition device 100 may determine that a voice signal differing even slightly from the wake-up word does not correspond to the wake-up word. FIG. 3(b) illustrates a case in which the reference similarity is 0.5. In this case, voice A and voice A' satisfy the reference similarity. Accordingly, the voice recognition device 100 may determine that the wake-up word has been detected from voice A and voice A'. When the reference similarity is low, the voice recognition device 100 determines that even voice signals with lower similarity to the wake-up word correspond to the wake-up word, compared with when the reference similarity is high.
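The comparison illustrated in FIG. 3 may be sketched as follows, using the similarity values given above for voice A, voice A', and voice A''. The function names are hypothetical, and treating a similarity equal to the reference similarity as a detection is an assumption taken from the "equal to or greater than" wording used earlier in the description.

```python
# Similarity of each voice signal to the wake-up word, as given for FIG. 3.
SIMILARITIES = {"A": 0.95, "A'": 0.55, "A''": 0.35}


def detect_wake_word(similarity: float, reference_similarity: float) -> bool:
    """Return True when the similarity satisfies the reference similarity."""
    return similarity >= reference_similarity


def detected_voices(reference_similarity: float) -> list[str]:
    """List the voice signals from which the wake-up word would be detected."""
    return [
        name
        for name, similarity in SIMILARITIES.items()
        if detect_wake_word(similarity, reference_similarity)
    ]


print(detected_voices(0.8))  # FIG. 3(a): only voice A is detected
print(detected_voices(0.5))  # FIG. 3(b): voice A and voice A' are detected
```

Raising the reference similarity from 0.5 to 0.8 removes voice A' from the detected set, which is the sensitivity trade-off the two panels of FIG. 3 describe.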

Accordingly, the voice recognition device 100 may adjust the sensitivity of wake-up word recognition using the reference similarity. Here, the wake-up word recognition sensitivity may indicate the degree to which an obtained voice signal is converted into text data that is the voice recognition result. The reference similarity is also related to the voice recognition rate, and the wake-up word detection result for a particular voice signal may vary with the reference similarity. Whether the voice recognition device 100 performs wake-up word recognition may also be determined according to the reference similarity. For example, the voice recognition device 100 may not recognize the wake-up word under a preset condition. In this case, the voice recognition device 100 may set the reference similarity to a value of '1' or greater so that the wake-up word is not recognized. This is because, in an embodiment in which similarity is expressed as a value between 0 and 1, when the reference similarity is greater than '1', the voice recognition device 100 determines that the wake-up word has not been detected from the voice signal regardless of the similarity between the voice signal and the wake-up word. The voice recognition device 100 may turn off the power consumed for wake-up word recognition for some voice signals. Accordingly, the voice recognition device 100 may improve the power efficiency of wake-up word recognition.
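The following is a minimal sketch of how a reference similarity above the 0-to-1 scale can switch wake-up word recognition off. The constant `DISABLED_REFERENCE`, its value of 1.1, and the function names are assumptions made for illustration; the sketch uses a value strictly greater than 1 so that even a similarity of exactly 1.0 is rejected under a greater-than-or-equal comparison.

```python
DISABLED_REFERENCE = 1.1  # hypothetical value above the 0-to-1 similarity scale


def recognition_enabled(reference_similarity: float) -> bool:
    """Recognition can only succeed when the threshold lies within the scale."""
    return reference_similarity <= 1.0


def detect_wake_word(similarity: float, reference_similarity: float) -> bool:
    """Skip the comparison entirely when recognition is disabled."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie between 0 and 1")
    if not recognition_enabled(reference_similarity):
        return False  # the recognizer need not run, so no power is spent on it
    return similarity >= reference_similarity


print(detect_wake_word(1.0, DISABLED_REFERENCE))  # False, even for a perfect match
print(detect_wake_word(0.95, 0.8))                # True
```

Checking `recognition_enabled` before any feature extraction is one way the power saving described above could be realized.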

The reference similarity may also be a value indicating how aggressively text data, which is the result of voice recognition, is to be obtained from an obtained voice signal. The reference similarity may also be a value indicating the degree of tolerance with which an obtained voice signal may be matched to specific text data that is a voice recognition result. The reference similarity may not be applied to non-wake-up words. Alternatively, the reference similarity may be determined independently for wake-up words and non-wake-up words. In this way, the voice recognition device 100 may perform voice recognition strictly for the wake-up word and relatively tolerantly for voice recognition used to provide a service.

Hereinafter, a method by which the voice recognition device 100 according to an embodiment of the present disclosure obtains the reference similarity is described. The voice recognition device 100 according to an embodiment of the present disclosure may perform voice recognition that takes into account the user's motion and environmental conditions. In this way, the voice recognition device 100 may effectively provide output information to the user.

According to an embodiment, the voice recognition device 100 may obtain the reference similarity based on motion information representing the user's motion. For example, the voice recognition device 100 may sense the user's motion and generate motion information representing the sensed motion. Here, the motion information may represent motion sensed at the time the user uttered the voice corresponding to the voice signal. The voice recognition device 100 may generate the motion information based on user motion sensed during a time interval determined with reference to the time at which the voice signal was obtained. Specifically, the voice recognition device 100 may generate the motion information based on user motion sensed during a predetermined time before and after the time at which the voice signal was obtained. The motion information may include at least one of the user's movement direction, movement speed, gaze direction (view-point), the user's surrounding environment, and the surrounding environment of the voice recognition device 100. The voice recognition device 100 may determine the reference similarity based on the motion information. For example, when the motion information represents motion indicating a high likelihood of calling the voice recognition device 100, the voice recognition device 100 may decrease the reference similarity. When the motion information represents motion indicating a low likelihood of calling the voice recognition device, the voice recognition device 100 may increase the reference similarity. The voice recognition device 100 may then detect the wake-up word from the voice signal based on the reference similarity obtained in this manner, and may output the output information generated based on the wake-up word detection result.
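The adjustment described above may be sketched as follows. The base value of 0.7, the step of 0.1, and the function names are hypothetical; the disclosure states only the direction of the adjustment, not its magnitude.

```python
def adjust_reference_similarity(base: float, call_likely: bool, step: float = 0.1) -> float:
    """Lower the reference similarity when a call is likely, raise it otherwise."""
    adjusted = base - step if call_likely else base + step
    # Keep the result on the 0-to-1 similarity scale used in FIG. 3.
    return min(max(adjusted, 0.0), 1.0)


def detect_wake_word(similarity: float, reference_similarity: float) -> bool:
    """Return True when the similarity satisfies the reference similarity."""
    return similarity >= reference_similarity


# The same utterance (similarity 0.65) is accepted or rejected depending on the motion.
lowered = adjust_reference_similarity(0.7, call_likely=True)
raised = adjust_reference_similarity(0.7, call_likely=False)
print(detect_wake_word(0.65, lowered))  # True: motion suggests the user is calling
print(detect_wake_word(0.65, raised))   # False: motion suggests the user is not calling
```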

According to an embodiment, the voice recognition device 100 may identify the user's motion using the sensing result of a motion detection sensor (not shown). For example, the motion detection sensor may sense, relative to the voice recognition device 100, at least one of the user's movement direction, movement speed, gaze direction, the user's surrounding environment, and the surrounding environment of the voice recognition device 100. The motion detection sensor may be provided in the voice recognition device 100. Alternatively, the voice recognition device 100 may obtain the motion information using a motion detection sensor connected to the voice recognition device 100 by wire or wirelessly. For example, the motion detection sensor may be at least one of an ultrasonic distance sensor, a camera, a depth camera, and a microphone, but the present disclosure is not limited thereto.

According to an embodiment, the voice recognition device 100 may obtain the motion information based on image information that includes an object representing the user. For example, the voice recognition device 100 may obtain image information including an object representing the user based on information on the time at which the voice signal was obtained. Here, the object representing the user may be the portion of the image information that depicts the user. The voice recognition device 100 may obtain, from the obtained image information, information on the user's gaze direction, movement direction, movement distance, and surrounding environment. The obtained image information may be image information collected for a predetermined time from the time at which the voice signal was obtained. Specifically, the voice recognition device 100 may extract the object representing the user 300 from the image information and use the extracted object to obtain information on the user's gaze direction, movement direction, movement distance, and surrounding environment.

Meanwhile, according to an embodiment, the voice recognition device 100 may obtain the motion information by analyzing the voice signal. For example, the voice recognition device 100 may analyze the obtained voice signal to obtain information on at least one of the user's movement direction, movement speed, gaze direction, the user's surrounding environment, and the surrounding environment of the voice recognition device 100. Here, the voice recognition device 100 may use the time-domain and frequency-domain characteristics of the obtained voice signal. In this case, the voice signal may be a voice signal collected through a microphone array including a plurality of microphones. For example, the voice recognition device 100 may calculate at least one of the movement direction, movement distance, and gaze direction based on changes in the magnitude of the voice signal over time and frequency. The voice recognition device 100 may also extract timbre-related information contained in the voice signal to determine whether a plurality of persons uttered the voice. The formants and pitch described above with reference to FIG. 1 exhibit different characteristics for each person's voice. Accordingly, the voice recognition device 100 may determine, based on formant information and pitch information, whether the voice contained in the voice signal was produced by a plurality of speakers.

According to an embodiment, the voice recognition device 100 may determine the reference similarity described above based on gaze direction information representing the user's gaze direction. This is because, when the user's gaze is directed toward the voice recognition device 100, the probability that the user is calling the voice recognition device 100 may be higher than when the user's gaze is not directed toward it. In this case, the voice recognition device 100 may lower the reference similarity the more closely the user's gaze direction points toward the location where the voice recognition device 100 is installed. Conversely, the voice recognition device 100 may raise the reference similarity the farther the user's gaze direction deviates from the location where the voice recognition device 100 is installed. Hereinafter, a method by which the voice recognition device 100 according to an embodiment of the present disclosure detects the wake-up word using the user's gaze direction is described. For convenience of description, FIGS. 4 to 6 show the user 300, the voice recognition device 100, and the like as located on a plane, but the present disclosure is not limited thereto and may be extended to embodiments in three-dimensional space.

FIG. 4 is a diagram illustrating a method by which the voice recognition device 100 according to an embodiment of the present disclosure determines the reference similarity based on the gaze direction of the user 300. Referring to FIG. 4, the voice recognition device 100 may calculate a gaze angle value representing the gaze direction of the user 300 with reference to the location where the voice recognition device 100 is installed and the location of the user 300. Specifically, the voice recognition device 100 may determine the gaze angle representing the user's gaze direction based on the positional relationship between the location where the voice recognition device 100 is installed and the location of the user 300. The voice recognition device 100 may determine the gaze angle (for example, an azimuth and an elevation) by taking the shortest straight line (x) connecting the location of the user 300 and the center of the voice recognition device 100 as the reference angle (for example, '0' degrees). The voice recognition device 100 may determine the reference similarity based on the gaze angle. For example, when a first gaze angle is smaller than a second gaze angle and the user's gaze direction corresponding to a first voice signal is represented by the first gaze angle, the voice recognition device 100 may detect the wake-up word based on a smaller reference similarity than when the user's gaze direction corresponding to the voice signal is the second gaze angle.
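For the planar case shown in FIG. 4, the gaze angle relative to the straight line (x) may be sketched as follows. The coordinate representation, the function name, and the use of a gaze direction vector are assumptions made for illustration; only the azimuth is computed, and the elevation is omitted.

```python
import math


def gaze_angle_deg(user_pos, device_pos, gaze_vec) -> float:
    """Signed angle in degrees between line (x), from user to device, and the gaze.

    0 degrees means the user is looking straight at the device.
    """
    to_device = (device_pos[0] - user_pos[0], device_pos[1] - user_pos[1])
    if to_device == (0.0, 0.0) or tuple(gaze_vec) == (0.0, 0.0):
        raise ValueError("positions must differ and the gaze vector must be non-zero")
    reference = math.atan2(to_device[1], to_device[0])  # direction of line (x)
    gaze = math.atan2(gaze_vec[1], gaze_vec[0])         # direction of the gaze
    diff = math.degrees(gaze - reference)
    # Wrap into the range [-180, 180) so small deviations stay small.
    return (diff + 180.0) % 360.0 - 180.0


# Hypothetical layout: user at the origin, device 2 m away along the x-axis.
print(round(gaze_angle_deg((0.0, 0.0), (2.0, 0.0), (1.0, 0.0)), 6))  # looking at the device
print(round(gaze_angle_deg((0.0, 0.0), (2.0, 0.0), (1.0, 1.0)), 6))  # looking 45 degrees away
```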

According to an embodiment, the voice recognition device 100 may determine angle intervals for determining the reference similarity. For example, the voice recognition device 100 may use a reference similarity value mapped to each angle interval that contains a particular angle. The embodiment of FIG. 4 illustrates a case that includes a first angle interval within a preset angle range (r) and a second angle interval outside the preset angle range (r). In FIG. 4, the voice recognition device 100 may determine the first angle interval within the preset angle range (r) with reference to the straight line (x). The preset angle range (r) may be -30 degrees to 30 degrees. The voice recognition device 100 may determine the reference similarity applied to the obtained voice signal based on a first reference similarity and a second reference similarity corresponding to the first angle interval and the second angle interval, respectively. For example, the voice recognition device 100 may set the first reference similarity to '0.6' and the second reference similarity to '0.9'. When the gaze angle falls within the first angle interval, the voice recognition device 100 may detect the wake-up word from the voice signal based on the first reference similarity, '0.6'. When the gaze angle falls within the second angle interval, the voice recognition device 100 may detect the wake-up word from the voice signal based on the second reference similarity, '0.9'. Referring to FIG. 4, when the gaze angle is 20 degrees with the straight line (x) taken as 0 degrees (a), the voice recognition device 100 may set the reference similarity to '0.6', because the gaze direction of the user 300 falls within the first angle interval. Meanwhile, when the gaze angle is 45 degrees with reference to the straight line (x) (b), the voice recognition device 100 may set the reference similarity to '0.9', because the gaze direction of the user 300 falls within the second angle interval.
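The angle-interval mapping of FIG. 4 may be sketched as follows, using the values given above (a range of -30 to 30 degrees, with reference similarities of 0.6 and 0.9). The function and parameter names are hypothetical, and treating the boundary angles of exactly 30 and -30 degrees as belonging to the first angle interval is an assumption.

```python
def reference_similarity_for_gaze(
    gaze_angle_deg: float,
    half_range_deg: float = 30.0,  # preset angle range (r): -30 to 30 degrees
    first_reference: float = 0.6,  # applied inside the first angle interval
    second_reference: float = 0.9, # applied inside the second angle interval
) -> float:
    """Map a gaze angle, measured from line (x), to a reference similarity."""
    if abs(gaze_angle_deg) <= half_range_deg:
        return first_reference
    return second_reference


print(reference_similarity_for_gaze(20.0))  # 0.6, case (a) of FIG. 4
print(reference_similarity_for_gaze(45.0))  # 0.9, case (b) of FIG. 4
```

With these values, an utterance with similarity 0.7 would be accepted while the user looks toward the device and rejected while the user looks 45 degrees away.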

Meanwhile, according to an embodiment, the voice recognition device 100 may obtain the reference similarity based on the user's motion information and the surrounding environment information of the voice recognition device 100. For example, the surrounding environment information of the voice recognition device 100 may include information related to at least one other electronic device installed within a preset area defined with reference to the location where the voice recognition device 100 is installed. For example, the voice recognition device 100 may determine the reference similarity based on the location of at least one other electronic device installed within the preset area and the gaze direction of the user 300. This is because, when at least one other electronic device is located within the preset area, the user 300 may be using that other electronic device. In this case, the voice recognition device 100 may determine that the user 300 is unlikely to be calling the voice recognition device 100. The voice recognition device 100 may then detect the wake-up word based on a higher reference similarity than when no other electronic device is located within the preset area. The other electronic device may be a device equipped with a voice recognition function, but the present disclosure is not limited thereto.

Here, the predetermined area may refer to an area of a size set by the provider of the speech recognition apparatus 100. Alternatively, when the speech recognition apparatus 100 is a device installed in a building, the predetermined area may include at least one partitioned space of the building that includes the location where the speech recognition apparatus 100 is installed. For example, when the speech recognition apparatus 100 is a cooling device equipped with a voice recognition function, the speech recognition apparatus 100 may be installed on the ceiling or in a corner of the partitioned space. In this case, the predetermined area defined relative to the installed location may contain at least one other electronic device. For example, when the speech recognition apparatus 100 is an IoT terminal installed in a home, the speech recognition apparatus 100 may be installed in the living room together with a TV.

FIG. 5 is a diagram illustrating a method by which the speech recognition apparatus 100 according to an embodiment of the present disclosure obtains the reference similarity based on its surrounding environment information. According to one embodiment, the speech recognition apparatus 100 may determine another electronic device corresponding to the gaze direction of the user. For example, at least one other electronic device may be present in the predetermined area 500. In this case, the speech recognition apparatus 100 may determine, from among the at least one other electronic device, at least one other electronic device corresponding to the gaze direction of the user 300. The speech recognition apparatus 100 may then determine the reference similarity based on the operating state of the at least one other electronic device corresponding to the gaze direction. Here, the operating state of an electronic device may include its turn-on or turn-off state, its settings, and its power consumption state.

Referring to FIG. 5, a first electronic device 51 and a second electronic device 52 may be located within the predetermined area 500 defined relative to the location where the speech recognition apparatus 100 is installed. For example, the first electronic device 51 may be a display device (e.g., a TV), and the second electronic device 52 may be a lighting device. When the other electronic device corresponding to the gaze direction of the user 300 is the first electronic device 51, the speech recognition apparatus 100 may determine the reference similarity based on the operating state of the first electronic device 51. Although FIG. 5 illustrates a case in which a single other electronic device corresponds to the gaze direction of the user 300, the present disclosure is not limited thereto. Specifically, when the first electronic device 51 is turned on, the speech recognition apparatus 100 may detect the wake-up word based on a higher reference similarity than when the first electronic device 51 is turned off. This is because, when the user 300 is watching content provided through the first electronic device 51, the speech recognition apparatus 100 may determine that the user 300 is unlikely to be calling the speech recognition apparatus 100.

In addition, the reference similarity may vary depending on how long the first electronic device 51 has been turned on. For example, the speech recognition apparatus 100 may obtain information on the amount of power continuously consumed by the first electronic device 51 as of the time at which the voice signal is obtained. The speech recognition apparatus 100 may calculate the turn-on duration of the first electronic device 51 from the power consumption information. The speech recognition apparatus 100 may then detect the wake-up word based on a higher reference similarity when the turn-on duration is longer than when it is relatively short. This is because, as the time the user 300 spends watching content provided through the display device grows longer, the user 300 may call the speech recognition apparatus 100 for an additional activity other than watching the content. In addition, the method of determining the reference similarity according to the operating state of another electronic device may vary depending on the call history of the user 300 or of a plurality of other users who use similar electronic devices. The method by which the speech recognition apparatus 100 obtains the reference similarity based on the call history is described later.

Meanwhile, the method by which the speech recognition apparatus 100 determines the reference similarity according to the operating state of another electronic device may also vary depending on the function of that device. For example, when the other electronic device corresponding to the gaze direction of the user 300 is the second electronic device 52, the speech recognition apparatus 100 may determine the reference similarity based on the operating state of the second electronic device 52. In this case, when the second electronic device 52 is turned off, the speech recognition apparatus 100 may detect the wake-up word from the voice signal based on a lower reference similarity than when the second electronic device 52 is turned on. This is because the user may call the speech recognition apparatus 100 in order to turn on the second electronic device 52.
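The dependence on device function and on/off state described for the first electronic device 51 and the second electronic device 52 can be sketched as below. The description only states the direction of each adjustment; the base value 0.7, the step 0.1, the device labels, and all names in the code are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

BASE_REFERENCE_SIMILARITY = 0.7  # hypothetical default threshold
STEP = 0.1                       # hypothetical adjustment size


@dataclass
class GazedDevice:
    """The other electronic device that the user's gaze direction corresponds to."""
    kind: str     # "display" (e.g. a TV) or "lighting"; labels are hypothetical
    is_on: bool   # turn-on / turn-off state


def reference_similarity_for_device(device: Optional[GazedDevice]) -> float:
    """Pick a reference similarity from the function and state of the gazed device."""
    if device is None:
        return BASE_REFERENCE_SIMILARITY
    if device.kind == "display" and device.is_on:
        # The user is probably watching content, so a call is less likely: raise the threshold.
        return BASE_REFERENCE_SIMILARITY + STEP
    if device.kind == "lighting" and not device.is_on:
        # The user may be calling the apparatus to switch the light on: lower the threshold.
        return BASE_REFERENCE_SIMILARITY - STEP
    return BASE_REFERENCE_SIMILARITY


tv_on = reference_similarity_for_device(GazedDevice("display", True))
light_off = reference_similarity_for_device(GazedDevice("lighting", False))
print(tv_on > light_off)
```

The same on/off state thus moves the threshold in opposite directions for different device functions, which is the point the paragraph makes.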

According to one embodiment of the present disclosure, the speech recognition apparatus 100 may determine the reference similarity based on change information indicating a change in the gaze direction of the user. This is because, when the gaze direction of the user moves closer to the direction toward the speech recognition apparatus 100, the user may have uttered the voice corresponding to the voice signal in order to call the speech recognition apparatus 100. In this case, the speech recognition apparatus 100 may determine that the user is likely to be calling the speech recognition apparatus 100, and may detect the wake-up word based on a lower reference similarity than when the gaze direction moves away from the direction toward the speech recognition apparatus 100. For example, the speech recognition apparatus 100 may obtain change information indicating the change in the gaze direction of the user over a predetermined time starting from when the voice signal is obtained. During that predetermined time, the speech recognition apparatus 100 may track the gaze direction of the user using the motion detection sensor described above. For example, the predetermined time may be the duration of the voice signal.

In addition, the speech recognition apparatus 100 may determine the reference similarity based on the installed position of each of the at least one other electronic device described above and on the change information of the gaze direction. The speech recognition apparatus 100 may determine the reference similarity based on the positional relationship between the speech recognition apparatus 100 and the other electronic devices 51 and 52 and on the change information of the gaze direction. For example, when the change in the gaze direction corresponds to the positional relationship between the speech recognition apparatus 100 and the other electronic devices 51 and 52, the speech recognition apparatus 100 may determine different reference similarities according to the change in the gaze direction. When the gaze direction changes within a predetermined direction range, the speech recognition apparatus 100 may determine different reference similarities according to the change in the gaze direction. Here, the predetermined direction range may be a value set based on the positional relationship between the speech recognition apparatus 100 and the other electronic devices 51 and 52. As shown in FIG. 5, when the speech recognition apparatus 100 is installed on top of the first electronic device 51 and the gaze direction changes within a predetermined direction range relative to the first direction 511, the speech recognition apparatus 100 may determine different reference similarities according to the change in the gaze direction. Conversely, when the change in the gaze direction does not correspond to the positional relationship between the speech recognition apparatus 100 and the other electronic devices 51 and 52, the speech recognition apparatus 100 may leave the reference similarity unchanged regardless of the change in the gaze direction.

Specifically, as shown in FIG. 5, when the speech recognition apparatus 100 is installed on top of the first electronic device 51, the speech recognition apparatus 100 may detect the wake-up word based on different reference similarities according to upward or downward changes in the gaze direction. On the other hand, when the gaze direction changes to the left or right, the speech recognition apparatus 100 may detect the wake-up word based on the same reference similarity. In addition, when the change in the gaze direction is within a predetermined angle range, the speech recognition apparatus 100 may determine different reference similarities according to the change in the gaze direction. Here, the predetermined angle range may be a value set based on the positional relationship between the speech recognition apparatus 100 and the other electronic devices 51 and 52. For example, when the change in the gaze direction is within the angle range that, as seen from the position of the user 300, spans the installed positions of the speech recognition apparatus 100 and the first electronic device 51, the speech recognition apparatus 100 may determine different reference similarities according to the change in the gaze direction. On the other hand, when the change in the gaze direction falls outside that angle range, the speech recognition apparatus 100 may leave the reference similarity unchanged regardless of the change in the gaze direction.
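The direction-dependent handling of gaze changes can be illustrated with the following sketch, which assumes the FIG. 5 arrangement in which the apparatus sits directly above the first electronic device 51, so that only vertical gaze movement is informative. The step size, the minimum change of 5 degrees, the sign convention (positive vertical change means the gaze moved up, toward the apparatus), and the function name are all assumptions.

```python
def adjust_for_gaze_change(
    base: float,
    d_horizontal_deg: float,
    d_vertical_deg: float,
    step: float = 0.1,            # hypothetical adjustment size
    min_change_deg: float = 5.0,  # hypothetical minimum change treated as meaningful
) -> float:
    """Adjust the reference similarity only for gaze changes along the vertical axis.

    Assumes the apparatus is mounted above the display, so a left/right change
    does not match the positional relationship between the two devices.
    """
    mostly_vertical = abs(d_vertical_deg) > abs(d_horizontal_deg)
    if not mostly_vertical or abs(d_vertical_deg) < min_change_deg:
        return base  # change does not match the positional relationship: keep the threshold
    if d_vertical_deg > 0:
        return max(0.0, base - step)  # gaze moved toward the apparatus: lower threshold
    return min(1.0, base + step)      # gaze moved away from the apparatus: higher threshold


print(adjust_for_gaze_change(0.7, d_horizontal_deg=10.0, d_vertical_deg=0.0))
print(adjust_for_gaze_change(0.7, d_horizontal_deg=0.0, d_vertical_deg=10.0) < 0.7)
```

For a different installation, for example an apparatus placed beside the display, the informative axis would change accordingly.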

Meanwhile, according to one embodiment of the present disclosure, the speech recognition apparatus 100 may determine the reference similarity based on information related to a person other than the user. The speech recognition apparatus 100 may determine the reference similarity based on the motion information of the user and the information related to the other person. Here, the information related to the other person may include at least one of information indicating whether another person is present, position information indicating the position of the other person, and gaze direction information indicating the gaze direction of the other person. This is because, when a person other than the user is present within a predetermined area defined relative to the location where the speech recognition apparatus 100 is installed, the user may have uttered the voice corresponding to the voice signal in order to converse with that person. In this case, the speech recognition apparatus 100 may determine that the user 300 is unlikely to be calling the speech recognition apparatus 100, and may detect the wake-up word from the voice signal based on a higher reference similarity than when no other person is present.

FIG. 6 is a diagram illustrating a method by which the speech recognition apparatus 100 according to an embodiment of the present disclosure determines the reference similarity based on information related to a person 600 other than the user. According to one embodiment, the speech recognition apparatus 100 may use the motion detection sensor described above to determine whether another person 600 is present within a predetermined area defined relative to the position of the user 300 (step S602). Here, the predetermined area may be smaller than or equal in size to the predetermined area described with reference to FIG. 5. The predetermined area may also be an area centered on the position of the user 300 whose radius is the distance within which conversation with the user 300 is possible. For example, the speech recognition apparatus 100 may determine whether the image information described above includes an object corresponding to a person 600 other than the user 300. Here, the object corresponding to the other person 600 may be the portion of the image information that depicts the other person 600. When the image information includes an object corresponding to the other person 600, the speech recognition apparatus 100 may obtain position information indicating the position of the other person 600 based on the image information. The speech recognition apparatus 100 may then detect the wake-up word based on a reference similarity obtained using the position of the other person 600.

According to one embodiment, the speech recognition apparatus 100 may determine the reference similarity based on first gaze direction information indicating a first gaze direction of the user 300 and position information indicating the position of the other person 600 (step S604). Specifically, when the first gaze direction points toward the position of the other person 600, the speech recognition apparatus 100 may detect the wake-up word based on a relatively high reference similarity. This is because, in that case, the user may have uttered the voice corresponding to the voice signal in order to converse with the other person. Conversely, when the first gaze direction does not point toward the position of the other person 600, the speech recognition apparatus 100 may detect the wake-up word based on a relatively low reference similarity.

In addition, the speech recognition apparatus 100 may determine the reference similarity based on second gaze direction information indicating a second gaze direction of the other person 600 (step S606). This is because, when the gaze direction of the other person 600 points toward the user 300, the user 300 and the other person 600 are likely to be in conversation. Specifically, the speech recognition apparatus 100 may obtain the second gaze direction information from the image information described above. The speech recognition apparatus 100 may extract an object corresponding to the other person 600 from the image information, and may obtain the second gaze direction information using the extracted object.

Referring to FIG. 6, the speech recognition apparatus 100 may determine the reference similarity based on the results of the determinations in steps S602 to S606 (step S608). When no other person 600 is present in step S602, the speech recognition apparatus 100 may detect the wake-up word from the voice signal based on a reference similarity of 0.55. When, in steps S602 and S604, another person 600 is present and the first gaze direction points toward the position of the other person 600, the speech recognition apparatus 100 may determine the reference similarity according to the second gaze direction. In this case, when the second gaze direction points toward the position of the user 300 in step S606, the speech recognition apparatus 100 may detect the wake-up word from the voice signal based on a reference similarity of 0.95. This is because, when the first gaze direction and the second gaze direction face each other, the user is unlikely to be calling the speech recognition apparatus 100. When the second gaze direction does not point toward the position of the user 300 in step S606, the speech recognition apparatus 100 may detect the wake-up word from the voice signal based on a reference similarity of 0.85.

When, in steps S602 and S604, another person 600 is present and the first gaze direction does not point toward the position of the other person 600, the speech recognition apparatus 100 may likewise determine the reference similarity according to the second gaze direction. In this case, when the second gaze direction points toward the position of the user 300, the speech recognition apparatus 100 may detect the wake-up word from the voice signal based on a reference similarity of 0.75. When the second gaze direction does not point toward the position of the user 300, the speech recognition apparatus 100 may detect the wake-up word from the voice signal based on a reference similarity of 0.65. The first gaze direction information, which indicates the gaze direction of the user 300 who uttered the voice corresponding to the voice signal, may be weighted more heavily than the second gaze direction information in determining the reference similarity.
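The decisions of steps S602 to S608 amount to a small decision table, which can be sketched as follows. The five reference similarity values are the ones given in the description; the function and parameter names are illustrative assumptions.

```python
def reference_similarity_fig6(
    other_person_present: bool,
    user_looks_at_other: bool = False,
    other_looks_at_user: bool = False,
) -> float:
    """Return the reference similarity for the cases described with reference to FIG. 6."""
    if not other_person_present:          # S602: no other person in the area
        return 0.55
    if user_looks_at_other:               # S604: first gaze direction toward the other person
        return 0.95 if other_looks_at_user else 0.85   # S606
    return 0.75 if other_looks_at_user else 0.65       # S606


# Enumerate every case in which another person is present.
for user_looks in (True, False):
    for other_looks in (True, False):
        print(user_looks, other_looks,
              reference_similarity_fig6(True, user_looks, other_looks))
```

Switching the user's gaze changes the threshold by 0.2, while switching the other person's gaze changes it by 0.1, which reflects the heavier weight given to the first gaze direction information.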

According to one embodiment, the speech recognition apparatus 100 may also use reference similarities determined separately from the first gaze direction information and the second gaze direction information. For example, the speech recognition apparatus 100 may determine a first reference similarity based on the direction indicated by the first gaze direction information and the position of the other person 600. The speech recognition apparatus 100 may also determine a second reference similarity based on the direction indicated by the second gaze direction information and the position of the user 300. The speech recognition apparatus 100 may then detect the wake-up word from the voice signal based on the average of the first reference similarity and the second reference similarity.
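A minimal sketch of the averaging variant is given below. The description states only that the two reference similarities are determined separately and then averaged; the per-gaze values (0.9/0.6 and 0.8/0.6) and the names are hypothetical.

```python
def first_reference_similarity(user_looks_at_other: bool) -> float:
    """From the user's gaze and the other person's position (values are hypothetical)."""
    return 0.9 if user_looks_at_other else 0.6


def second_reference_similarity(other_looks_at_user: bool) -> float:
    """From the other person's gaze and the user's position (values are hypothetical)."""
    return 0.8 if other_looks_at_user else 0.6


def averaged_reference_similarity(user_looks_at_other: bool, other_looks_at_user: bool) -> float:
    """Average the two separately determined reference similarities."""
    first = first_reference_similarity(user_looks_at_other)
    second = second_reference_similarity(other_looks_at_user)
    return (first + second) / 2


print(averaged_reference_similarity(True, True))
print(averaged_reference_similarity(False, False))
```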

Meanwhile, as described above, the speech recognition apparatus 100 may also obtain the reference similarity by referring to a call history. Here, the call history may indicate the history of calls made to a particular speech recognition apparatus. Based on the call history, the speech recognition apparatus 100 may calculate a call frequency corresponding to at least one of the motion information of the user 300, the surrounding environment information of the speech recognition apparatus 100, and the information related to a person other than the user. Here, the call frequency may indicate the cumulative number of times the speech recognition apparatus 100 has been called in the corresponding situation. The speech recognition apparatus 100 may determine the reference similarity based on the calculated call frequency. For example, the speech recognition apparatus 100 may set the reference similarity to be inversely proportional to the call frequency. The speech recognition apparatus 100 may then detect the wake-up word based on the determined reference similarity.
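One way to realize a reference similarity that falls as the call frequency rises is sketched below. The description does not give a formula; this sketch assumes that the adjustable part of the threshold is inversely proportional to the call frequency and that the result stays between a hypothetical floor of 0.5 and ceiling of 0.95.

```python
FLOOR = 0.5     # hypothetical lowest reference similarity
CEILING = 0.95  # hypothetical highest reference similarity


def reference_similarity_from_frequency(call_frequency: int) -> float:
    """Map a cumulative call count for the current situation to a reference similarity.

    The part of the threshold above FLOOR is inversely proportional to the
    call frequency, so frequently observed calling situations get a lower threshold.
    """
    if call_frequency <= 0:
        return CEILING  # no recorded calls in this situation: strictest threshold
    return FLOOR + (CEILING - FLOOR) / call_frequency


for n in (1, 2, 10):
    print(n, round(reference_similarity_from_frequency(n), 3))
```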

In addition, the speech recognition apparatus 100 may use the call history of a speech recognition device other than the speech recognition apparatus 100. For example, the speech recognition apparatus 100 may determine the reference similarity based on the call history of another speech recognition device installed in a place similar to the place where the speech recognition apparatus 100 is installed. In this case, the speech recognition apparatus 100 may obtain information about the place where each speech recognition device connected to the speech recognition apparatus 100 or the server 200 is installed. Here, the information about the installed place may include the region in which the speech recognition device is installed and the usage characteristics of the place (e.g., home or office). Specifically, when the speech recognition apparatus 100 obtains a voice signal during a time period in which the call frequency of the other speech recognition devices is high, the speech recognition apparatus 100 may detect the wake-up word based on a lower reference similarity than during a time period in which the call frequency is low. This is because the speech recognition apparatus 100 is likely to be called during such a time period.

In addition, the speech recognition apparatus 100 may determine the reference similarity based on the call history of calls made to the speech recognition apparatus 100 by a user other than the user. Here, the other user may include a user whose life pattern is similar to that of the user. The speech recognition apparatus 100 may also determine the reference similarity based on the call history of another speech recognition device called by the same user. In this case, the speech recognition apparatus 100 may identify a specific user using the acoustic features contained in the voice signal. The speech recognition apparatus 100 may also determine the reference similarity based on a call history organized according to the motion information of the user.

According to one embodiment, the speech recognition apparatus 100 may determine the reference similarity based on movement information of the user. Here, the movement information may be information indicating at least one of the movement direction and the movement speed of the user. For example, the speech recognition apparatus 100 may calculate, based on the call history, the call frequency corresponding to the movement information. Here, the call history may be previously stored information. Specifically, the speech recognition apparatus 100 may calculate the call frequency mapped to movement information similar to the movement information corresponding to the obtained voice signal. For example, when the movement information corresponding to the obtained voice signal indicates a movement speed of 1 m/s, the speech recognition apparatus 100 may calculate the call frequency mapped to movement speeds of 0.8 to 1.3 m/s. When the calculated call frequency is 10, the speech recognition apparatus 100 may detect the wake-up word based on a lower reference similarity than when the call frequency is 1. Alternatively, when the calculated call frequency is 0, the speech recognition apparatus 100 may not recognize the wake-up word at all. In this case, the speech recognition apparatus 100 may set the reference similarity to a value greater than the maximum reference similarity value.
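The lookup of a call frequency from similar movement speeds can be sketched as follows. The window of 0.8 to 1.3 around a speed of 1 comes from the description; generalizing it to "0.2 below to 0.3 above the observed speed", the unit of metres per second, the floor and ceiling values, and the use of infinity as "a value greater than the maximum" are assumptions for illustration.

```python
import math

BELOW, ABOVE = 0.2, 0.3       # hypothetical window around the observed speed (assumed m/s)
FLOOR, CEILING = 0.5, 0.95    # hypothetical bounds on the reference similarity


def call_frequency_for_speed(history_speeds, speed):
    """Count past calls whose recorded movement speed is similar to the current speed."""
    lo, hi = speed - BELOW, speed + ABOVE
    return sum(1 for s in history_speeds if lo <= s <= hi)


def reference_similarity_for_speed(history_speeds, speed):
    """Lower the threshold when calls were frequent at similar speeds."""
    n = call_frequency_for_speed(history_speeds, speed)
    if n == 0:
        # No call was ever made at a similar speed: use a threshold that no
        # similarity score can reach, so the wake-up word is never recognized.
        return math.inf
    return FLOOR + (CEILING - FLOOR) / n


history = [0.9, 1.0, 1.2, 2.0]  # speeds recorded with past calls (made-up sample data)
print(call_frequency_for_speed(history, 1.0))
print(reference_similarity_for_speed(history, 3.0))
```

Returning infinity is only one way to express "greater than the maximum"; any value above the largest possible similarity score would have the same effect.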

FIG. 7 is a flowchart illustrating the operation of the speech recognition apparatus 100 according to an embodiment of the present invention. Referring to FIG. 7, in step S702, the speech recognition apparatus 100 may acquire a voice signal. In step S704, the speech recognition apparatus 100 may obtain a reference similarity based on motion information. The speech recognition apparatus 100 may generate the motion information by sensing the motion of the user who uttered the voice corresponding to the voice signal. For example, the speech recognition apparatus 100 may determine the reference similarity based on the gaze direction of the user who uttered the voice corresponding to the voice signal. In step S706, the speech recognition apparatus 100 may detect the call word from the acquired voice signal based on the reference similarity. When the similarity between the voice signal and the call word is equal to or greater than the reference similarity, the speech recognition apparatus 100 may determine that the call word has been detected from the voice signal. Conversely, when the similarity between the voice signal and the call word is below the reference similarity, the speech recognition apparatus 100 may determine that the call word has not been detected from the voice signal. In step S708, the speech recognition apparatus 100 may output the output information generated based on the call word detection result. Accordingly, the speech recognition apparatus 100 can recognize the call word in consideration of the user's utterance intention. In addition, the speech recognition apparatus 100 can reduce the false recognition rate of call word recognition.
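The flow of steps S704 to S708 can be sketched as below. This is a minimal sketch under stated assumptions: the similarity score is assumed to be computed elsewhere from the voice signal acquired in step S702, the reference similarity is assumed to be lowered when the user's gaze angle falls within a preset range, and the function names, angle range, and threshold values are hypothetical.

```python
def reference_from_gaze(gaze_angle_deg, preset_range_deg=30.0, low=0.6, high=0.85):
    """S704: choose a lower threshold when the user is looking toward the apparatus.

    The 30-degree range and both threshold values are illustrative assumptions.
    """
    return low if abs(gaze_angle_deg) <= preset_range_deg else high


def is_call_word_detected(similarity, reference):
    """S706: the call word is detected when the similarity reaches the reference."""
    return similarity >= reference


def run_once(similarity, gaze_angle_deg):
    """Run S704 to S708 for one utterance and return the output information."""
    reference = reference_from_gaze(gaze_angle_deg)           # S704
    detected = is_call_word_detected(similarity, reference)   # S706
    return "call word detected" if detected else "no call word"  # S708


facing = run_once(similarity=0.7, gaze_angle_deg=10.0)
looking_away = run_once(similarity=0.7, gaze_angle_deg=90.0)
```

The same similarity score leads to different outcomes depending on the gaze angle, which is how the motion information lets the apparatus take the user's utterance intention into account.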

Some embodiments may also be implemented in the form of a recording medium containing computer-executable instructions, such as program modules executed by a computer. A computer-readable medium may be any available medium that can be accessed by a computer, and may include both volatile and nonvolatile media and both removable and non-removable media. The computer-readable medium may also include computer storage media. Computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. In this specification, a "unit" may be a hardware component such as a processor or a circuit, and/or a software component executed by a hardware component such as a processor.

The foregoing description of the present disclosure is provided for illustration, and those of ordinary skill in the art to which the present disclosure pertains will understand that it can be readily modified into other specific forms without changing the technical spirit or essential features of the present disclosure. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single entity may be implemented in a distributed manner, and components described as distributed may likewise be implemented in a combined form.

Claims

1. A speech recognition apparatus for providing a service through call word recognition, the apparatus comprising:
a voice receiving unit configured to acquire a voice signal;
a processor configured to generate motion information indicating a motion of a user by sensing the motion of the user who uttered a voice corresponding to the voice signal,
detect the call word from the voice signal based on the motion information, and
generate output information based on the detection result; and
an output unit configured to output the output information.

2. The apparatus of claim 1,
wherein the motion information includes first gaze direction information indicating a gaze direction of the user, and
wherein the processor is configured to:
determine, based on the first gaze direction information, a reference similarity indicating a criterion for determining whether the call word is detected from the voice signal, and
detect the call word from the voice signal based on the reference similarity.

3. The apparatus of claim 2,
wherein the processor is configured to:
determine the reference similarity based on a gaze angle indicating the gaze direction of the user and a preset range, and
set the reference similarity to a lower value when the gaze angle is within the preset range than when the gaze angle is outside the preset range,
wherein the gaze angle is determined based on a position where the speech recognition apparatus is installed and a position of the user.

4. The apparatus of claim 2,
wherein the processor is configured to:
determine the reference similarity based on the gaze direction of the user and a position of at least one other electronic device installed in a predetermined area defined with respect to the position where the speech recognition apparatus is installed.

5. The apparatus of claim 4,
wherein the processor is configured to:
determine the reference similarity based on an operating state of the other electronic device, among the at least one other electronic device, that corresponds to the gaze direction of the user.

6. The apparatus of claim 4,
wherein the processor is configured to:
determine the reference similarity based on change information indicating a change in the gaze direction of the user during a predetermined period from the time the voice signal is acquired.

7. The apparatus of claim 6,
wherein the processor is configured to:
determine the reference similarity based on the change information when the change in the gaze direction corresponds to a positional relationship between the speech recognition apparatus and the at least one other electronic device.

8. The apparatus of claim 2,
wherein the processor is configured to:
acquire image information including an object representing the user, based on time information indicating when the voice signal was acquired, and
obtain the first gaze direction information from the image information.

9. The apparatus of claim 8,
wherein the processor is configured to:
determine whether the image information includes an object corresponding to a person other than the user,
acquire, when the image information includes the object corresponding to the other person, position information indicating a position of the other person based on the image information, and
determine the reference similarity based on the first gaze direction information and the position information.

10. The apparatus of claim 9,
wherein the processor is configured to:
acquire, from the image information, second gaze direction information indicating a gaze direction of the other person, and
determine the reference similarity based on the first gaze direction information and the second gaze direction information.

11. The apparatus of claim 1,
wherein the motion information includes movement information indicating at least one of a movement direction and a movement speed of the user, and
wherein the processor is configured to:
detect the call word from the voice signal based on the movement information.

12. The apparatus of claim 11,
wherein the processor is configured to:
calculate a call frequency corresponding to the movement information based on a stored call history,
determine, based on the call frequency, a reference similarity indicating a criterion for determining whether the call word is detected from the voice signal, and
detect the call word from the voice signal with reference to the reference similarity,
wherein the reference similarity is inversely proportional to the call frequency.

13. The apparatus of claim 12,
wherein the processor is configured to:
calculate at least one of the movement direction and the movement speed based on an acoustic characteristic of the voice signal,
wherein the acoustic characteristic includes at least one of a change in magnitude of the voice signal over time and a frequency-dependent change in magnitude of the voice signal.

14. A method of operating a speech recognition apparatus for providing a service through call word recognition, the method comprising:
obtaining a voice signal;
sensing a motion of a user who uttered a voice corresponding to the voice signal and generating motion information indicating the sensed motion of the user;
detecting the call word from the voice signal based on the motion information; and
outputting output information generated based on the detection result.

15. A recording medium readable by an electronic device, having recorded thereon a program for executing the method of claim 14 on the electronic device.