KR102071867B1

KR102071867B1 - Device and method for recognizing wake-up word using information related to speech signal

Info

Publication number: KR102071867B1
Application number: KR1020180055963A
Authority: KR
Inventors: 양태영
Original assignee: 주식회사 인텔로이드
Priority date: 2017-11-30
Filing date: 2018-05-16
Publication date: 2020-01-31
Also published as: KR20190064383A

Abstract

Disclosed is a speech recognition apparatus that provides a service through wake-up word recognition. The speech recognition apparatus includes: a voice receiver that acquires a speech signal; a processor that senses a motion of the user who uttered the voice corresponding to the speech signal, generates motion information representing the sensed motion, detects the wake-up word from the speech signal based on the motion information, and generates output information based on the detection result; and an output unit that outputs the output information.

Description

DEVICE AND METHOD FOR RECOGNIZING WAKE-UP WORD USING INFORMATION RELATED TO SPEECH SIGNAL

The present disclosure relates to a speech recognition apparatus and a speech recognition method, and more particularly, to an apparatus and a method for recognizing a wake-up word using information related to a speech signal.

A speech recognition apparatus is a device that converts human speech into codes and inputs them to a computer. Speech recognition methods that operate on word units include a pattern matching method, which compares the input speech with standard speech patterns and finds the pattern closest to the input, and a discriminant function method, which prepares in advance a function for distinguishing one word from another and applies that function to the input speech to make a decision. Methods have also been devised that recognize speech in units such as phonemes rather than whole words and use the resulting recognition information to recognize context.

Among the various technologies in the speech recognition field, keyword spotting, which detects a wake-up word or keyword contained in speech acquired from a user, has recently drawn attention in many areas. For keyword spotting to work properly, the detection rate, that is, the rate at which a keyword contained in the speech is recognized and detected, must be high. Alongside the detection rate, however, keyword misrecognition is an equally important problem in keyword spotting. If a keyword detected from speech is mistaken for a different keyword, a terminal that applies keyword spotting may provide a service the user does not want or perform processing the user did not intend. Accordingly, there is a need for a way to address the low detection rate or high misrecognition rate of existing keyword spotting techniques.

Meanwhile, devices that recognize a wake-up word through speech recognition and provide a specific service when the recognition succeeds are being researched and released. Because wake-up word detection is performed in real time by embedded speech recognition, its misrecognition rate tends to be relatively high. Accordingly, there is a need for techniques for recognizing wake-up words.

The present disclosure has been made to solve the above problems, and its object is to provide a speech recognition apparatus or a speech recognition method capable of increasing the accuracy of wake-up word recognition. The present disclosure also provides a speech recognition apparatus or a speech recognition method that reduces the misrecognition rate of wake-up word recognition.

According to an embodiment of the present disclosure for solving the above problems, there may be provided a speech recognition apparatus including: a voice receiver that acquires a speech signal; a processor that senses a motion of the user who uttered the voice corresponding to the speech signal, generates motion information representing the sensed motion, detects the wake-up word from the speech signal based on the motion information, and generates output information based on the detection result; and an output unit that outputs the output information.

In addition, a method according to an embodiment of the present disclosure may include: acquiring a speech signal; sensing a motion of the user who uttered the voice corresponding to the speech signal and generating motion information representing the sensed motion; detecting the wake-up word from the speech signal based on the motion information; and outputting output information generated based on the detection result.

According to yet another aspect, a recording medium readable by an electronic device may include a recording medium on which a program for executing the above-described method on an electronic device is recorded.

According to an embodiment of the present disclosure, the accuracy of wake-up word recognition may be increased, reducing its misrecognition rate. In addition, according to an embodiment of the present disclosure, output information may be provided effectively to the user who uttered the voice. The present disclosure may reduce the misrecognition rate of wake-up word recognition by predicting the situation in which the speech signal was uttered or the intention behind the utterance.

In addition, an embodiment of the present disclosure may recognize the wake-up word based on characteristics of the environment in which the user's voice was acquired. In this way, the present disclosure may reduce device malfunctions caused by wake-up word misrecognition and improve the power efficiency of wake-up word recognition.

FIG. 1 is a schematic diagram illustrating a service providing system including a speech recognition apparatus and a server according to an embodiment of the present disclosure.
FIG. 2 is a diagram illustrating the configuration of a speech recognition apparatus according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating wake-up word detection results according to the reference similarity, according to an embodiment of the present disclosure.
FIG. 4 is a diagram illustrating a method by which a speech recognition apparatus according to an embodiment of the present disclosure determines the reference similarity based on the user's gaze direction.
FIG. 5 is a diagram illustrating a method by which a speech recognition apparatus according to an embodiment of the present disclosure obtains the reference similarity based on information about the surrounding environment of the speech recognition apparatus.
FIG. 6 is a diagram illustrating a method by which a speech recognition apparatus according to an embodiment of the present disclosure determines the reference similarity based on information related to a person other than the user.
FIG. 7 is a flowchart illustrating the operation of a speech recognition apparatus according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can readily practice them. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted in order to describe the invention clearly, and like parts are given like reference numerals throughout the specification.

In addition, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

The present disclosure relates to a speech recognition apparatus and method that detect a preset wake-up word from a speech signal and provide output information. Specifically, the speech recognition apparatus and method according to an embodiment of the present disclosure may use motion information to reduce the misrecognition rate, that is, the rate at which a speech signal that does not correspond to the wake-up word is erroneously recognized as corresponding to it. Here, the motion information may be information representing a motion of the user who uttered the voice corresponding to the speech signal subject to wake-up word detection. In the present disclosure, a wake-up word may denote a keyword set to trigger a service providing function of the speech recognition apparatus. Hereinafter, the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a schematic diagram illustrating a service providing system including a speech recognition apparatus 100 and a server 200 according to an embodiment of the present disclosure. As shown in FIG. 1, the service providing system may include at least one speech recognition apparatus 100 and a server 200. The service providing system according to an embodiment of the present disclosure may provide a service based on a preset wake-up word (hereinafter, 'wake-up word'). For example, the service providing system may recognize an acquired speech signal and provide a service corresponding to the recognition result. In doing so, the service providing system may determine whether the wake-up word is detected from the acquired speech signal. When the wake-up word is detected, the service providing system may provide a service corresponding to the recognition result. Conversely, when the wake-up word is not detected, the service providing system may not perform speech recognition or may not provide a service corresponding to the recognition result. The service providing system may provide output information corresponding to the recognition result through the speech recognition apparatus 100.

According to an embodiment, the speech recognition apparatus 100 may recognize the wake-up word and wake up its service providing function. For example, when the wake-up word is detected from the acquired speech signal, the speech recognition apparatus 100 may wake up the speech recognition operation used for providing a service. The speech recognition apparatus 100 may recognize the wake-up word through an embedded recognition module within the speech recognition apparatus 100. Here, wake-up word recognition may denote the operation of determining whether the wake-up word is detected from a speech signal.

Specifically, the speech recognition apparatus 100 may divide the speech signal into units of a preset length (frames) and extract acoustic features from each of the divided segments. The speech recognition apparatus 100 may calculate the similarity between the extracted acoustic features and an acoustic model corresponding to the wake-up word. The speech recognition apparatus 100 may then determine whether the wake-up word is present based on the similarity between the extracted acoustic features and the acoustic model corresponding to the wake-up word, or a model for speech recognition. Here, the acoustic features may represent information needed for speech recognition. For example, the acoustic features may include formant information and pitch information. Formants are defined as the spectral peaks of the speech spectrum and may be quantified as amplitude peak values in a spectrogram. The formant with the lowest frequency is called the first formant (F1), and the following ones are labeled F2, F3, and so on in order; F1 and F2 are mainly used for analyzing speech when a person utters a vowel. Pitch means the fundamental frequency of the speech and represents its periodic characteristics. The speech recognition apparatus 100 may extract the acoustic features of the speech signal using at least one of LPC (Linear Predictive Coding) cepstrum, PLP (Perceptual Linear Prediction) cepstrum, MFCC (Mel Frequency Cepstral Coefficient), and filter bank energy analysis. For example, when the text data of the acoustic model or speech recognition model with the highest similarity to the acoustic features extracted from the speech signal is identical to the text corresponding to the wake-up word, the speech recognition apparatus 100 may determine that the wake-up word has been detected from that speech signal. However, the wake-up word recognition process of the present disclosure is not limited thereto.
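
The framing and similarity steps described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation disclosed in the patent: the per-frame feature (log energy) is a simple stand-in for the LPC, PLP, or MFCC features named in the text, cosine similarity against a stored template is an assumed scoring rule, and every function name here is hypothetical.

```python
import math

def split_into_frames(samples, frame_len, hop):
    """Split a signal into fixed-length frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def log_energy(frame):
    """Toy per-frame acoustic feature (assumption); real systems use MFCC or similar."""
    energy = sum(x * x for x in frame) / len(frame)
    return math.log(energy + 1e-10)

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def wake_word_similarity(samples, template, frame_len=4, hop=4):
    """Score an utterance against a hypothetical wake-up word feature template."""
    features = [log_energy(f) for f in split_into_frames(samples, frame_len, hop)]
    if len(features) != len(template):
        return 0.0  # assumption: no time alignment, so lengths must match
    return cosine_similarity(features, template)
```

A practical detector would also align utterances of different lengths (for example with dynamic time warping or a statistical acoustic model), which this sketch deliberately leaves out.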

In addition, when performing wake-up word recognition, the speech recognition apparatus 100 may detect the wake-up word by referring to a reference similarity. Here, the reference similarity is a threshold similarity value that serves as the criterion for determining whether the wake-up word is detected from the acquired speech signal; it is described in more detail with reference to FIG. 3 below. The speech recognition apparatus 100 may also determine the reference similarity based on motion information representing a motion of the user 300 who uttered the voice corresponding to the speech signal. This is described in detail with reference to FIGS. 4 to 6 below. Hereinafter, unless otherwise specified in the present disclosure, "the user" denotes the user 300 who uttered the voice corresponding to the speech signal.
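
As a rough illustration of a threshold that depends on motion information, the sketch below picks a reference similarity from a single boolean cue and compares a score against it. The cue (`user_is_facing_device`), the numeric values, the direction of the adjustment (a lower threshold when the user faces the device), and the `>=` comparison are all assumptions made for this example; this passage does not specify them.

```python
def choose_reference_similarity(user_is_facing_device, base=0.7, relaxed=0.5):
    """Pick the threshold from motion information (hypothetical rule and values)."""
    return relaxed if user_is_facing_device else base

def wake_word_detected(similarity, reference_similarity):
    """The wake-up word counts as detected when the score reaches the threshold."""
    return similarity >= reference_similarity
```

With these assumed values, a borderline score of 0.6 is accepted when the user faces the device and rejected otherwise.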

According to another embodiment, wake-up word recognition may be performed by the server 200. In this case, the speech recognition apparatus 100 may transmit the speech signal to the server 200 and request a recognition result. The speech recognition apparatus 100 may then wake up its service providing function based on the recognition result received from the server 200.

In the present disclosure, a service may mean output information corresponding to a recognition result. The service or output information may also include a function specific to the speech recognition apparatus 100. For example, when the speech recognition apparatus 100 is a home appliance equipped with a speech recognition function, the output information may include starting or stopping the operation of the appliance. When the speech recognition apparatus 100 is an information providing device, the output information may include information corresponding to a query from the user 300. For example, the speech recognition apparatus 100 may acquire a speech signal corresponding to a voice uttered by the user 300. Through the speech signal, the user 300 may input the wake-up word and various types of requests to the speech recognition apparatus 100. Here, a request may mean a service request directed to the speech recognition apparatus 100. According to an embodiment, the speech recognition apparatus 100 may be an IoT terminal attached to a wall, but is not limited thereto. For example, the speech recognition apparatus 100 may be an IoT terminal in the form of a light installed in an entryway. Alternatively, the speech recognition apparatus 100 may be a home appliance equipped with a speech recognition function, such as a cooling/heating device, a set-top box, a refrigerator, or a TV.

In addition, when the speech recognition operation for providing a service is activated, the speech recognition apparatus 100 may recognize speech from a speech signal and provide the service requested by the user 300. Here, the speech signal may be one acquired within a predetermined time from when the speech signal corresponding to the wake-up word was acquired. The speech recognition apparatus 100 may recognize speech from the speech signal through the embedded recognition module. The speech recognition apparatus 100 may perform speech recognition for providing a service on the speech signal in the same or a similar manner to the wake-up word recognition method described above.
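
The "within a predetermined time" condition above can be sketched as a small time window that opens when the wake-up word is detected. The class name, the use of seconds as the unit, and the inclusive upper bound are assumptions for illustration only.

```python
class WakeWindow:
    """Tracks whether a request arrives soon enough after the wake-up word."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._woke_at = None  # time of the most recent wake-up word, if any

    def on_wake_word(self, now):
        """Record the time at which the wake-up word was detected."""
        self._woke_at = now

    def accepts_request(self, now):
        """True only if a wake-up word was heard within the window before `now`."""
        if self._woke_at is None:
            return False
        return 0 <= now - self._woke_at <= self.window_seconds
```

For example, with a 5-second window and a wake-up word at t = 10, a request at t = 12 is accepted and one at t = 16 is ignored.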

Alternatively, the speech recognition apparatus 100 may transmit the acquired speech signal to the server 200 so that speech is recognized from the signal there. This is because accurately recognizing the keywords contained in the acquired speech signal may require substantial computation or a large speech recognition database. For example, when the speech recognition apparatus 100 is implemented as a small terminal, it may be more advantageous to keep a separate database externally and use it over a network connection than for the terminal to hold the database itself.

The server 200 according to an embodiment of the present disclosure may perform speech recognition in the same or a similar manner to the way the speech recognition apparatus 100 described above performs speech recognition for the wake-up word or for providing a service. For example, the server 200 may perform speech recognition on a speech signal obtained from the speech recognition apparatus 100. The server 200 may also include a database for speech recognition, and the database may include at least one acoustic model or speech recognition model. However, the server 200 does not necessarily include the database, and the service providing system may instead include separate storage (not shown) connected to the server 200. In that case, the server 200 may obtain at least one acoustic model or speech recognition model from the storage containing the database. The server 200 may also transmit to the speech recognition apparatus 100 the result of performing speech recognition on the speech signal obtained from the speech recognition apparatus 100. Although the service providing system in FIG. 1 includes the server 200, the present disclosure is not limited thereto. For example, a service providing system according to an embodiment of the present disclosure may not include the server 200. In this case, the functions of the server 200 may be performed by the speech recognition apparatus 100 or by another device, other than the speech recognition apparatus 100, that is equipped with a speech recognition function.
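
The division of work between the device and an optional server can be sketched as a simple dispatch. The function and parameter names are hypothetical, and the policy shown (use the server whenever one is configured, otherwise fall back to the embedded recognizer) is an assumption; the text only says that either side may perform recognition.

```python
def recognize(speech_signal, embedded_recognizer, server_recognizer=None):
    """Return the recognition result for a speech signal.

    Assumption: the server is preferred when present, and the embedded
    recognizer takes over the server's role when the system has no server.
    """
    if server_recognizer is not None:
        return server_recognizer(speech_signal)
    return embedded_recognizer(speech_signal)
```

Both recognizers are passed in as plain callables here so the sketch stays independent of any particular network or model library.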

According to an embodiment, when the service providing system includes a plurality of speech recognition apparatuses 100a and 100b, the wake-up word may be set differently for each apparatus. The first speech recognition apparatus 100a may use 'cloud' (구름) as its wake-up word, and the second speech recognition apparatus 100b may use 'sky' (하늘). In this case, when the first speech recognition apparatus 100a acquires a speech signal whose similarity to 'cloud' is at or above a certain level, it may wake up its service providing function. On the other hand, when the speech signal uttered by the user 300 contains 'sky', the first speech recognition apparatus 100a may not wake up its service providing function, because the speech signal does not correspond to the wake-up word that the first speech recognition apparatus 100a uses. Likewise, the second speech recognition apparatus 100b may wake up its service providing function only when it acquires a speech signal corresponding to 'sky'.
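
The per-device wake-up words in this example can be sketched as a lookup from device to wake-up word, with each device waking only when the utterance is similar enough to its own word. The similarity scores, the threshold of 0.8, and the function name are assumptions for illustration.

```python
def devices_to_wake(similarities, wake_words, reference_similarity=0.8):
    """Return the devices whose own wake-up word matches the utterance.

    similarities: maps a wake-up word to the utterance's similarity score.
    wake_words:   maps a device id to the wake-up word it listens for.
    """
    return [device for device, word in wake_words.items()
            if similarities.get(word, 0.0) >= reference_similarity]

# Hypothetical configuration mirroring the example in the text.
WAKE_WORDS = {"100a": "cloud", "100b": "sky"}
```

An utterance scored as `{"cloud": 0.9, "sky": 0.2}` wakes only device 100a under these assumptions.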

FIG. 2 is a diagram illustrating the configuration of the speech recognition apparatus 100 according to an embodiment of the present invention. According to an embodiment, the speech recognition apparatus 100 may include a voice receiver 110, a processor 120, and an output unit 130. However, some of the components shown in FIG. 2 may be omitted, and the apparatus may additionally include components not shown in FIG. 2. The speech recognition apparatus 100 may also integrate two or more different components into a single unit. According to an embodiment, the speech recognition apparatus 100 may be implemented as a single semiconductor chip.

The voice receiver 110 may acquire a speech signal. The voice receiver 110 may collect speech signals incident on it. According to an embodiment, the voice receiver 110 may include at least one microphone. For example, the voice receiver 110 may include a microphone array comprising a plurality of microphones. The microphone array may include microphones arranged in various shapes, such as a cube or an equilateral triangle, in addition to a circle or a sphere. According to another embodiment, the voice receiver 110 may receive a speech signal corresponding to a voice collected by an external sound collection device. For example, the voice receiver 110 may include a speech signal input terminal through which a speech signal is input. Specifically, the voice receiver 110 may include a speech signal input terminal that receives a speech signal transmitted over a wire. Alternatively, the voice receiver 110 may receive a speech signal transmitted wirelessly using Bluetooth or Wi-Fi communication.

The processor 120 may control the overall operation of the speech recognition apparatus 100 described throughout the specification. The processor 120 may control each component of the speech recognition apparatus 100 and may compute and process various data and signals. The processor 120 may be implemented as hardware in the form of a semiconductor chip or electronic circuit, as software that controls hardware, or as a combination of the hardware and the software. The processor 120 may control the operation of the speech recognition apparatus 100 by executing at least one program included in the software.

According to an embodiment, the processor 120 may recognize speech from the speech signal acquired through the voice receiver 110 described above. The processor 120 may acquire the speech signal through the voice receiver 110. The processor 120 may include the embedded speech recognition function described above. According to an embodiment, the processor 120 may detect the wake-up word from the speech signal using the embedded speech recognition function. The processor 120 may also obtain a speech recognition result from the server. In this case, the processor 120 may transmit the speech signal to the server through a communication unit (not shown) and request the recognition result for the speech signal from the server. The processor 120 may then generate output information based on the detection result. When the wake-up word is detected, the processor 120 may wake up the service providing function. In this case, the processor 120 may generate output information that includes information indicating that the service providing function has been woken up. The processor 120 may also generate output information corresponding to the recognition result obtained by performing speech recognition. Conversely, when the wake-up word is not detected, the processor 120 may generate output information that includes information indicating that the wake-up word was not detected. Alternatively, in this case, the processor 120 may not provide any output information to the user. The processor 120 may output the generated output information through the output unit 130 described below.

The output unit 130 may output information provided to the user. The output unit 130 may output the output information generated by the processor 120. The output unit 130 may also output the output information after it has been converted into a form such as light, sound, or vibration. According to an embodiment, the output unit 130 may be at least one of a speaker, a display, various light sources including an LED, and a monitor, but is not limited thereto. For example, the output unit 130 may output the output information generated based on the wake-up word detection result. In this case, the output information may include the wake-up word detection result. The output unit 130 may output detection signals that differ depending on whether the wake-up word is detected or not. For example, the output unit 130 may output 'blue' light through a light source when the wake-up word is detected and 'red' light when the wake-up word is not detected. The output unit 130 may also output a preset audio signal through the speaker only when the wake-up word is detected.

In addition, the output unit 130 may perform a function specific to the speech recognition apparatus 100. Specifically, when the speech recognition apparatus 100 is an information providing apparatus that includes a speech recognition function, the output unit 130 may provide information corresponding to a user's query in the form of an audio signal or a video signal. For example, the output unit 130 may output the information corresponding to the user's query in a text format or a speech format. The output unit 130 may also transmit, to another device connected to the speech recognition apparatus 100 by wire or wirelessly, a control signal for controlling the operation of that device. For example, when the speech recognition apparatus 100 is an IoT terminal attached to a wall, the speech recognition apparatus 100 may transmit a control signal for controlling the temperature of a heating device to the heating device.

Meanwhile, the speech recognition apparatus 100 may malfunction by incorrectly recognizing that the wake-up word has been detected from a speech signal even when the user did not utter the corresponding speech with the intention of calling the speech recognition apparatus 100. In particular, when the user utters a word similar to the wake-up word, the speech recognition apparatus 100 may incorrectly recognize that the wake-up word has been detected from the speech signal and malfunction. When the speech recognition apparatus 100 is a home appliance equipped with a speech recognition function, misrecognition of the wake-up word may cause unnecessary power consumption. The speech recognition apparatus 100 according to an embodiment of the present disclosure may use motion information related to the user to predict the intention with which the user uttered the speech corresponding to the speech signal. In addition, when performing speech recognition, the speech recognition apparatus 100 may apply a different reference similarity according to the predicted utterance intention, thereby reducing misrecognition of the wake-up word. This is because, by predicting the utterance intention, the speech recognition apparatus 100 can determine whether the user is likely or unlikely to be calling the speech recognition apparatus.

For example, the speech recognition apparatus 100 may determine the reference similarity based on motion information indicating a motion of the user 300. In this way, the speech recognition apparatus 100 may reduce the false recognition rate of wake-up word recognition. The speech recognition apparatus 100 may also reduce unnecessary power consumption caused by misrecognition of the wake-up word. Here, the false recognition rate of the speech recognition apparatus 100 denotes the rate at which the speech recognition apparatus 100 incorrectly recognizes that the wake-up word has been detected when the obtained speech signal does not correspond to the wake-up word. The false recognition rate may be expressed as Equation 1 below.

[Equation 1]

False recognition rate = 100 * (number of recognized words) / (number of non-wake-up-word input words) [%]

In Equation 1, the "number of non-wake-up-word input words" may denote the number of spoken input words that are not the wake-up word. The "number of recognized words" may denote the number of those non-wake-up-word input words that were recognized as the wake-up word. Hereinafter, a method by which the speech recognition apparatus 100 reduces the false recognition rate of wake-up word recognition is described in detail with reference to FIGS. 3 to 7.
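Equation 1 can be computed directly from the two counts it defines. The following is a minimal sketch; the function name and the input validation are illustrative assumptions and are not part of the disclosure.

```python
def false_recognition_rate(recognized_count: int, non_wake_word_count: int) -> float:
    """Equation 1: percentage of non-wake-up-word inputs recognized as the wake-up word.

    recognized_count    -- non-wake-up-word inputs that were recognized as the wake-up word
    non_wake_word_count -- all spoken input words that are not the wake-up word
    """
    if non_wake_word_count <= 0:
        raise ValueError("at least one non-wake-up-word input is required")
    if not 0 <= recognized_count <= non_wake_word_count:
        raise ValueError("recognized_count must lie between 0 and non_wake_word_count")
    return 100.0 * recognized_count / non_wake_word_count
```

For instance, if 5 of 200 non-wake-up-word inputs are recognized as the wake-up word, the function returns 2.5 [%].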

According to an embodiment, the speech recognition apparatus 100 may detect the wake-up word from the obtained speech signal with reference to a reference similarity. For example, when the similarity between the obtained speech signal and the wake-up word is equal to or greater than the reference similarity, the speech recognition apparatus 100 may determine that the wake-up word has been detected from that speech signal. As described above, the speech recognition apparatus 100 may determine whether the wake-up word is present based on the similarity between the extracted acoustic features and the acoustic model corresponding to the wake-up word. Specifically, the speech recognition apparatus 100 may obtain the similarity between the extracted acoustic features and the acoustic model corresponding to the wake-up word and compare the obtained similarity with the reference similarity. When the obtained similarity is greater than the reference similarity, the speech recognition apparatus 100 may determine that the wake-up word has been detected from the speech signal. Conversely, when the obtained similarity is smaller than the reference similarity, the speech recognition apparatus 100 may determine that the wake-up word has not been detected from the speech signal.

FIG. 3 is a diagram illustrating wake-up word detection results according to the reference similarity, according to an embodiment of the present disclosure. FIG. 3 illustrates a case in which the wake-up word is 'Soriya' (소리야) and the speech recognition apparatus 100 recognizes the wake-up word. The similarity in FIG. 3 denotes the similarity between each of a plurality of speech signals (voice A, voice A', and voice A'') and the wake-up word. The speech recognition apparatus 100 may extract acoustic features from a speech signal by the method described above and calculate the similarity between the speech signal and the wake-up word. In FIG. 3, the similarity between a speech signal and the wake-up word is expressed as a value from 0 to 1, but the present disclosure is not limited thereto. For example, the similarity may be expressed as a percentage (%).

The speech recognition apparatus 100 may calculate the similarity between voice A ("Soriya", 소리야) and the wake-up word as 0.95, the similarity between voice A' ("Seoriya", 서리야) and the wake-up word as 0.55, and the similarity between voice A'' ("Suriya", 술이야) and the wake-up word as 0.35. Here, voice A may be the speech signal obtained by the speech recognition apparatus 100 when the user pronounces the wake-up word "Soriya" correctly. When the speech recognition apparatus 100 obtains from the user a speech signal representing voice A' (for example, "Seoriya"), which is similar to voice A, the speech recognition apparatus 100 may recognize voice A' as 'Soriya'. Voice A' may be a speech signal detected when voice A contains noise or when the user's pronunciation differs slightly from voice A. Voice A'' (for example, "Suriya"), on the other hand, may be a speech signal with low similarity to the wake-up word.

FIG. 3(a) illustrates a case in which the reference similarity is 0.8. In FIG. 3(a), the speech recognition apparatus 100 may determine that the wake-up word has been detected from voice A, because the similarity of voice A satisfies the reference similarity. When the reference similarity is high, the speech recognition apparatus 100 determines that a speech signal corresponds to the wake-up word only if its similarity to the wake-up word is high, in contrast to the case where the reference similarity is low. In this case, the speech recognition apparatus 100 may determine that a speech signal that differs even slightly from the wake-up word does not correspond to the wake-up word. FIG. 3(b) illustrates a case in which the reference similarity is 0.5. In this case, voice A and voice A' satisfy the reference similarity. Accordingly, the speech recognition apparatus 100 may determine that the wake-up word has been detected from voice A and voice A'. When the reference similarity is low, the speech recognition apparatus 100 may determine that even a speech signal with relatively low similarity to the wake-up word corresponds to the wake-up word, in contrast to the case where the reference similarity is high.
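The two cases of FIG. 3 can be reproduced with a simple threshold comparison. The sketch below uses the similarity values given for voice A, voice A', and voice A''; the function names are illustrative, and treating a similarity equal to the reference similarity as a detection is an assumption based on the "equal to or greater than" wording of the description.

```python
# Similarities between each speech signal and the wake-up word, as given for FIG. 3.
FIG3_SIMILARITIES = {"A": 0.95, "A'": 0.55, "A''": 0.35}


def is_wake_word_detected(similarity: float, reference_similarity: float) -> bool:
    """Return True when the similarity satisfies the reference similarity (assumed >=)."""
    return similarity >= reference_similarity


def detected_voices(similarities: dict, reference_similarity: float) -> list:
    """List the speech signals from which the wake-up word would be detected."""
    return [
        name
        for name, similarity in similarities.items()
        if is_wake_word_detected(similarity, reference_similarity)
    ]


print(detected_voices(FIG3_SIMILARITIES, 0.8))  # FIG. 3(a): only voice A
print(detected_voices(FIG3_SIMILARITIES, 0.5))  # FIG. 3(b): voice A and voice A'
```

Lowering the reference similarity from 0.8 to 0.5 admits voice A' in addition to voice A, while voice A'' is rejected in both cases.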

Accordingly, the speech recognition apparatus 100 may adjust the sensitivity of wake-up word recognition using the reference similarity. Here, the wake-up word recognition sensitivity may denote the degree to which an obtained speech signal is converted into text data, which is the result of speech recognition. The reference similarity is also related to the speech recognition rate, and the wake-up word detection result for a particular speech signal may vary according to the reference similarity. Whether the speech recognition apparatus 100 performs wake-up word recognition at all may also be determined by the reference similarity. For example, the speech recognition apparatus 100 may not recognize the wake-up word under a preset condition. In this case, the speech recognition apparatus 100 may set the reference similarity to a value of '1' or more so that the wake-up word is not recognized. This is because, in an embodiment in which the similarity is expressed as a value between 0 and 1, when the reference similarity is greater than '1', the speech recognition apparatus 100 determines that the wake-up word has not been detected from the speech signal regardless of the similarity between the speech signal and the wake-up word. The speech recognition apparatus 100 may turn off the power consumed for wake-up word recognition for some speech signals. Accordingly, the speech recognition apparatus 100 may improve the power efficiency of wake-up word recognition.
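The effect of a reference similarity above 1 can be checked with a short sketch. The constant `DISABLED_REFERENCE_SIMILARITY` and its value of 1.1 are assumptions chosen for illustration; the description only requires a value that no similarity in the 0 to 1 range can reach.

```python
# Assumed sentinel: any value strictly greater than 1 disables detection
# when similarities are confined to the range [0, 1].
DISABLED_REFERENCE_SIMILARITY = 1.1


def detect(similarity: float, reference_similarity: float) -> bool:
    """Threshold test for a similarity expressed as a value between 0 and 1."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0 and 1")
    return similarity >= reference_similarity


# Even a perfect match is rejected while recognition is disabled.
print(detect(1.0, DISABLED_REFERENCE_SIMILARITY))
```

Note that under the assumed "equal to or greater than" comparison, a reference similarity of exactly 1 would still accept a similarity of exactly 1, which is why the sketch uses a value strictly above 1.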

The reference similarity may also be a numerical value indicating how readily the apparatus attempts to obtain text data, which is the result of speech recognition, from an obtained speech signal. The reference similarity may be a numerical value indicating the degree of tolerance with which an obtained speech signal may be matched to specific text data that is the result of speech recognition. The reference similarity may not be applied to non-wake-up words, that is, words other than the wake-up word. Alternatively, the reference similarity may be determined independently for the wake-up word and for non-wake-up words. In this way, the speech recognition apparatus 100 may perform strict speech recognition for the wake-up word and relatively tolerant speech recognition for the speech recognition used to provide a service.

Hereinafter, a method by which the speech recognition apparatus 100 according to an embodiment of the present disclosure obtains the reference similarity is described. The speech recognition apparatus 100 according to an embodiment of the present disclosure may perform speech recognition that takes the user's motion and environmental conditions into account. In this way, the speech recognition apparatus 100 may provide output information to the user effectively.

According to an embodiment, the speech recognition apparatus 100 may obtain the reference similarity based on motion information indicating a motion of the user. For example, the speech recognition apparatus 100 may sense the user's motion and generate motion information indicating the sensed motion. The motion information may indicate the motion sensed at the time the user uttered the speech corresponding to the speech signal. The speech recognition apparatus 100 may generate the motion information based on the user motion sensed during a time interval determined with reference to the time at which the speech signal was obtained. Specifically, the speech recognition apparatus 100 may generate the motion information based on the user motion sensed during a predetermined period before and after the time at which the speech signal was obtained. The motion information may include at least one of the user's moving direction, moving speed, gaze direction (view-point), the user's surrounding environment, and the surrounding environment of the speech recognition apparatus 100. The speech recognition apparatus 100 may determine the reference similarity based on the motion information. For example, when the motion information indicates a motion with a high likelihood of calling the speech recognition apparatus 100, the speech recognition apparatus 100 may decrease the reference similarity. When the motion information indicates a motion with a low likelihood of calling the speech recognition apparatus, the speech recognition apparatus 100 may increase the reference similarity. The speech recognition apparatus 100 may then detect the wake-up word from the speech signal based on the reference similarity obtained in this manner, and may output the output information generated based on the wake-up word detection result.
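The time-window selection and the resulting adjustment of the reference similarity can be sketched as follows. Everything specific here is a hypothetical choice for illustration: the `MotionSample` type, the `likely_calling` flag (standing in for whatever cue the motion sensor provides), the one-second window on each side, the majority rule, and the values 0.6, 0.8, and 0.9. The description states only that the reference similarity is decreased for motions with a high likelihood of calling and increased otherwise.

```python
from dataclasses import dataclass
from typing import Iterable, List

LOWERED = 0.6   # assumed reference similarity when a call is likely
DEFAULT = 0.8   # assumed reference similarity when no motion was sensed
RAISED = 0.9    # assumed reference similarity when a call is unlikely


@dataclass(frozen=True)
class MotionSample:
    timestamp: float       # seconds, on the same clock as the speech signal
    likely_calling: bool   # hypothetical cue, e.g. the user is approaching the apparatus


def samples_in_window(samples: Iterable[MotionSample], voice_time: float,
                      before: float = 1.0, after: float = 1.0) -> List[MotionSample]:
    """Keep the samples sensed within a period before and after the speech signal."""
    return [s for s in samples
            if voice_time - before <= s.timestamp <= voice_time + after]


def reference_similarity_from_motion(samples: Iterable[MotionSample],
                                     voice_time: float) -> float:
    """Lower the reference similarity when most samples in the window suggest a call."""
    window = samples_in_window(samples, voice_time)
    if not window:
        return DEFAULT
    likely = sum(s.likely_calling for s in window)
    return LOWERED if 2 * likely > len(window) else RAISED
```

The returned value would then be used as the reference similarity in the threshold comparison applied to the speech signal obtained at `voice_time`.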

According to an embodiment, the speech recognition apparatus 100 may identify the user's motion using the sensing result of a motion sensor (not shown). For example, the motion sensor may sense, relative to the speech recognition apparatus 100, at least one of the user's moving direction, moving speed, gaze direction, the user's surrounding environment, and the surrounding environment of the speech recognition apparatus 100. The motion sensor may be provided in the speech recognition apparatus 100. Alternatively, the speech recognition apparatus 100 may obtain the motion information using a motion sensor connected to the speech recognition apparatus 100 by wire or wirelessly. For example, the motion sensor may be at least one of an ultrasonic distance sensor, a camera, a depth camera, and a microphone, but the present disclosure is not limited thereto.

According to an embodiment, the speech recognition apparatus 100 may obtain the motion information based on image information that includes an object representing the user. For example, the speech recognition apparatus 100 may obtain image information including an object representing the user based on information on the time at which the speech signal was obtained. Here, the object representing the user may be the portion of the image information that depicts the user. The speech recognition apparatus 100 may obtain, from the obtained image information, information on the user's gaze direction, moving direction, moving distance, and surrounding environment. The obtained image information may be image information collected for a predetermined period from the time at which the speech signal was obtained. Specifically, the speech recognition apparatus 100 may extract the object representing the user 300 from the image information and use the extracted object to obtain the information on the user's gaze direction, moving direction, moving distance, and surrounding environment.

Meanwhile, according to an embodiment, the speech recognition apparatus 100 may obtain the motion information by analyzing the speech signal. For example, the speech recognition apparatus 100 may analyze the obtained speech signal to obtain information on at least one of the user's moving direction, moving speed, gaze direction, the user's surrounding environment, and the surrounding environment of the speech recognition apparatus 100. In doing so, the speech recognition apparatus 100 may use the time-domain and frequency-domain characteristics of the obtained speech signal. In this case, the speech signal may be a speech signal collected through a microphone array including a plurality of microphones. For example, the speech recognition apparatus 100 may calculate at least one of the moving direction, the moving distance, and the gaze direction based on changes in the magnitude of the speech signal over time and across frequencies. The speech recognition apparatus 100 may also extract information related to the timbre contained in the speech signal to determine whether a plurality of speakers are present. The formant and pitch described above with reference to FIG. 1 exhibit different characteristics for each person's voice. Accordingly, the speech recognition apparatus 100 may determine, based on formant information and pitch information, whether the speech contained in the speech signal was produced by a plurality of speakers.

According to an embodiment, the speech recognition apparatus 100 may determine the reference similarity described above based on gaze direction information indicating the user's gaze direction. This is because, when the user's gaze is directed toward the speech recognition apparatus 100, the probability that the user is calling the speech recognition apparatus 100 may be higher than when the user's gaze is not directed toward it. In this case, the speech recognition apparatus 100 may lower the reference similarity as the user's gaze direction comes closer to the location where the speech recognition apparatus 100 is installed. Conversely, the speech recognition apparatus 100 may raise the reference similarity as the user's gaze direction moves farther from the location where the speech recognition apparatus 100 is installed. Hereinafter, a method by which the speech recognition apparatus 100 according to an embodiment of the present disclosure detects the wake-up word using the user's gaze direction is described. For convenience of description, FIGS. 4 to 6 depict the user 300, the speech recognition apparatus 100, and other elements as located on a plane; however, the present disclosure is not limited thereto and may be extended to embodiments in three-dimensional space.

FIG. 4 is a diagram illustrating a method by which the speech recognition apparatus 100 according to an embodiment of the present disclosure determines the reference similarity based on the gaze direction of the user 300. Referring to FIG. 4, the speech recognition apparatus 100 may calculate a gaze angle value indicating the gaze direction of the user 300 with reference to the location where the speech recognition apparatus 100 is installed and the location of the user 300. Specifically, the speech recognition apparatus 100 may determine the gaze angle indicating the user's gaze direction based on the positional relationship between the location where the speech recognition apparatus 100 is installed and the location of the user 300. The speech recognition apparatus 100 may determine the gaze angle (for example, an azimuth and an elevation) by taking the shortest straight line (x) connecting the location of the user 300 and the center of the speech recognition apparatus 100 as the reference angle (for example, '0' degrees). The speech recognition apparatus 100 may determine the reference similarity based on the gaze angle. For example, suppose a first gaze angle is smaller than a second gaze angle. When the gaze direction of the user corresponding to a speech signal is given by the first gaze angle, the speech recognition apparatus 100 may detect the wake-up word based on a smaller reference similarity than when the gaze direction of the user corresponding to the speech signal is given by the second gaze angle.

According to an embodiment, the speech recognition apparatus 100 may determine angle intervals for determining the reference similarity. For example, the speech recognition apparatus 100 may use a reference similarity value mapped to each angle interval that contains a given angle. The embodiment of FIG. 4 illustrates a case that includes a first angle interval inside a preset angle range (r) and a second angle interval outside the preset angle range (r). In FIG. 4, the speech recognition apparatus 100 may determine the first angle interval within the preset angle range (r) with reference to the straight line (x). The preset angle range (r) may be -30 degrees to 30 degrees. The speech recognition apparatus 100 may determine the reference similarity applied to the obtained speech signal based on a first reference similarity and a second reference similarity corresponding to the first angle interval and the second angle interval, respectively. For example, the speech recognition apparatus 100 may set the first reference similarity to '0.6' and the second reference similarity to '0.9'. When the gaze angle falls within the first angle interval, the speech recognition apparatus 100 may detect the wake-up word from the speech signal based on the first reference similarity of '0.6'. When the gaze angle falls within the second angle interval, the speech recognition apparatus 100 may detect the wake-up word from the speech signal based on the second reference similarity of '0.9'. Referring to FIG. 4, when the gaze angle is 20 degrees with the straight line (x) taken as 0 degrees (a), the speech recognition apparatus 100 may set the reference similarity to '0.6', because the gaze direction of the user 300 falls within the first angle interval. When the gaze angle is 45 degrees with respect to the straight line (x) (b), the speech recognition apparatus 100 may set the reference similarity to '0.9', because the gaze direction of the user 300 falls within the second angle interval.
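The interval mapping of FIG. 4 can be sketched as a small lookup, using the values given above (a preset range of -30 to 30 degrees, '0.6' inside it, '0.9' outside it). The function name, the normalization of angles into the range -180 to 180 degrees, and the treatment of the boundary angles as belonging to the first interval are assumptions made for illustration; only the planar (azimuth) case is covered.

```python
def reference_similarity_for_gaze(gaze_angle_deg: float,
                                  half_range_deg: float = 30.0,
                                  first_reference: float = 0.6,
                                  second_reference: float = 0.9) -> float:
    """Map a gaze angle, measured from the straight line (x), to a reference similarity."""
    # Normalize to [-180, 180) so that, e.g., 350 degrees is treated as -10 degrees.
    normalized = (gaze_angle_deg + 180.0) % 360.0 - 180.0
    # Boundary angles (+/- half_range_deg) are assumed to belong to the first interval.
    if abs(normalized) <= half_range_deg:
        return first_reference
    return second_reference


print(reference_similarity_for_gaze(20.0))  # case (a) of FIG. 4: first angle interval
print(reference_similarity_for_gaze(45.0))  # case (b) of FIG. 4: second angle interval
```

A finer-grained variant could use more than two intervals, or a continuous function of the angle, in line with the statement that the reference similarity rises as the gaze moves away from the apparatus.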

Meanwhile, according to an embodiment, the speech recognition apparatus 100 may obtain the reference similarity based on the user's motion information and information on the surrounding environment of the speech recognition apparatus 100. For example, the surrounding environment information of the speech recognition apparatus 100 may include information related to at least one other electronic device installed within a preset area defined with reference to the location where the speech recognition apparatus 100 is installed. For example, the speech recognition apparatus 100 may determine the reference similarity based on the gaze direction of the user 300 and the location of at least one other electronic device installed within the preset area defined with reference to the location where the speech recognition apparatus 100 is installed. This is because, when at least one other electronic device is located within the preset area, the user 300 may be using that other electronic device. In this case, the speech recognition apparatus 100 may determine that the user 300 is unlikely to be calling the speech recognition apparatus 100, and may detect the wake-up word based on a higher reference similarity than when no other electronic device is located within the preset area. The other electronic device may be a device equipped with a speech recognition function, but the present disclosure is not limited thereto.

Here, the preset area may refer to an area of a size set by the provider of the speech recognition apparatus 100. Alternatively, when the speech recognition apparatus 100 is a device installed in a building, the preset area may include at least one partitioned space of the building that contains the location where the speech recognition apparatus 100 is installed. For example, when the speech recognition apparatus 100 is a cooling device equipped with a speech recognition function, it may be installed on the ceiling or in a corner of a partitioned space. In this case, the preset area defined relative to the installation location may include at least one other electronic device. For example, when the speech recognition apparatus 100 is an IoT terminal installed in a home, it may be installed in a living room together with a TV.

FIG. 5 is a diagram illustrating a method by which the speech recognition apparatus 100 according to an embodiment of the present disclosure obtains the reference similarity based on its surrounding environment information. According to an embodiment, the speech recognition apparatus 100 may determine another electronic device corresponding to the user's gaze direction. For example, at least one other electronic device may be present in the preset area 500. In this case, the speech recognition apparatus 100 may determine, from among the at least one other electronic device, at least one other electronic device corresponding to the gaze direction of the user 300. The speech recognition apparatus 100 may then determine the reference similarity based on the operating state of the device corresponding to the gaze direction. Here, the operating state of an electronic device may include its turn-on or turn-off state, its settings, and its power consumption state.

Referring to FIG. 5, a first electronic device 51 and a second electronic device 52 may be located within the preset area 500 defined relative to the location where the speech recognition apparatus 100 is installed. For example, the first electronic device 51 may be a display device (for example, a TV), and the second electronic device 52 may be a lighting device. When the other electronic device corresponding to the gaze direction of the user 300 is the first electronic device 51, the speech recognition apparatus 100 may determine the reference similarity based on the operating state of the first electronic device 51. FIG. 5 illustrates a case in which a single other electronic device corresponds to the gaze direction of the user 300, but the present disclosure is not limited thereto. Specifically, when the first electronic device 51 is turned on, the speech recognition apparatus 100 may detect the wake-up word based on a higher reference similarity than when the first electronic device 51 is turned off. This is because, when the user 300 is watching content provided through the first electronic device 51, the speech recognition apparatus 100 may determine that the user 300 is unlikely to be calling the speech recognition apparatus 100.

The reference similarity may also vary with the turn-on duration of the first electronic device 51. For example, the speech recognition apparatus 100 may obtain information on the amount of power that the first electronic device 51 has consumed continuously, relative to the time at which the speech signal was obtained. The speech recognition apparatus 100 may calculate the turn-on duration of the first electronic device 51 from this power information. The speech recognition apparatus 100 may then detect the wake-up word based on a higher reference similarity when the turn-on duration is long than when it is relatively short. This is because, as the time the user 300 spends watching content provided through the display device grows longer, the user 300 may call the speech recognition apparatus 100 for an additional activity other than watching content. In addition, the method of determining the reference similarity according to the operating state of another electronic device may vary according to the call history of the user 300 or of a plurality of other users who use similar electronic devices. How the speech recognition apparatus 100 obtains the reference similarity from the call history is described later.

Meanwhile, the method by which the speech recognition apparatus 100 determines the reference similarity according to the operating state of another electronic device may vary with the function of that device. For example, when the other electronic device corresponding to the gaze direction of the user 300 is the second electronic device 52, the speech recognition apparatus 100 may determine the reference similarity based on the operating state of the second electronic device 52. In this case, when the second electronic device 52 is turned off, the speech recognition apparatus 100 may detect the wake-up word from the speech signal based on a lower reference similarity than when the second electronic device 52 is turned on. This is because the user may call the speech recognition apparatus 100 in order to turn on the second electronic device 52.
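The device-dependent adjustment described for the display device and the lighting device could be sketched as follows. The base value, the step size, the `kind` labels, and the function and class names are all hypothetical. The text only fixes the direction of each adjustment (a turned-on display raises the reference similarity, a turned-off light lowers it).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GazedDevice:
    """Other electronic device in the user's gaze direction (hypothetical model)."""
    kind: str    # "display" or "light"; labels are assumptions for this sketch
    is_on: bool  # turn-on / turn-off state


def adjust_reference_similarity(base: float,
                                device: Optional[GazedDevice],
                                step: float = 0.1) -> float:
    """Shift an assumed base reference similarity by the gazed device's state."""
    if device is None:
        return base
    if device.kind == "display" and device.is_on:
        # The user is probably watching content, so a call is less likely.
        return min(1.0, base + step)
    if device.kind == "light" and not device.is_on:
        # The user may be about to ask for the light to be turned on.
        return max(0.0, base - step)
    return base


print(adjust_reference_similarity(0.7, GazedDevice("display", True)))
print(adjust_reference_similarity(0.7, GazedDevice("light", False)))
```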

According to an embodiment of the present disclosure, the speech recognition apparatus 100 may determine the reference similarity based on change information indicating a change in the user's gaze direction. This is because, when the user's gaze direction moves closer to the direction of the speech recognition apparatus 100, the user may have uttered the voice corresponding to the speech signal in order to call the speech recognition apparatus 100. In this case, the speech recognition apparatus 100 may determine that the user is likely to be calling it, and may detect the wake-up word based on a lower reference similarity than when the user's gaze direction moves away from the direction of the speech recognition apparatus 100. For example, the speech recognition apparatus 100 may obtain change information indicating the change in the user's gaze direction during a preset time starting from when the speech signal is obtained. During that preset time, the speech recognition apparatus 100 may track the user's gaze direction by using the motion detection sensor described above. For example, the preset time may be the duration of the speech signal.

In addition, the speech recognition apparatus 100 may determine the reference similarity based on the location where each of the at least one other electronic device described above is installed and on the change information of the gaze direction. The speech recognition apparatus 100 may determine the reference similarity based on the positional relationship between the speech recognition apparatus 100 and the other electronic devices 51 and 52 and on the change information of the gaze direction. For example, when the change in the gaze direction corresponds to the positional relationship between the speech recognition apparatus 100 and the other electronic devices 51 and 52, the speech recognition apparatus 100 may determine different reference similarities according to the change in the gaze direction. When the gaze direction changes within a preset direction range, the speech recognition apparatus 100 may determine different reference similarities according to the change in the gaze direction. Here, the preset direction range may be a value set based on the positional relationship between the speech recognition apparatus 100 and the other electronic devices 51 and 52. As shown in FIG. 5, when the speech recognition apparatus 100 is installed on top of the first electronic device 51 and the gaze direction changes within a preset direction range relative to the first direction 511, the speech recognition apparatus 100 may determine different reference similarities according to the change in the gaze direction. Conversely, when the change in the gaze direction does not correspond to the positional relationship between the speech recognition apparatus 100 and the other electronic devices 51 and 52, the speech recognition apparatus 100 may leave the reference similarity unchanged.

Specifically, as shown in FIG. 5, when the speech recognition apparatus 100 is installed on top of the first electronic device 51, the speech recognition apparatus 100 may detect the wake-up word based on different reference similarities according to an upward or downward change in the gaze direction. In contrast, when the gaze direction changes to the left or right, the speech recognition apparatus 100 may detect the wake-up word based on the same reference similarity. In addition, when the change in the gaze direction is within a preset angle range, the speech recognition apparatus 100 may determine different reference similarities according to the change in the gaze direction. Here, the preset angle range may be a value set based on the positional relationship between the speech recognition apparatus 100 and the other electronic devices 51 and 52. For example, when the change in the gaze direction is within the angle range that, as seen from the position of the user 300, spans the locations where the speech recognition apparatus 100 and the first electronic device 51 are installed, the speech recognition apparatus 100 may determine different reference similarities according to the change in the gaze direction. In contrast, when the change in the gaze direction falls outside that angle range, the speech recognition apparatus 100 may leave the reference similarity unchanged.
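For the arrangement of FIG. 5, where the apparatus sits on top of the display device, the rule that only up/down gaze changes affect the reference similarity might look like the sketch below. The sign convention (positive vertical change means the gaze moved up, toward the apparatus), the step size, and the minimum change of 5 degrees are assumptions made for illustration.

```python
def threshold_after_gaze_change(base: float,
                                d_horizontal_deg: float,
                                d_vertical_deg: float,
                                step: float = 0.1,
                                min_change_deg: float = 5.0) -> float:
    """Adjust the reference similarity for a gaze change.

    Assumes the apparatus is mounted above the other device, so only the
    vertical component of the change matches their positional relationship.
    d_horizontal_deg is accepted but deliberately ignored.
    """
    if abs(d_vertical_deg) < min_change_deg:
        # Left/right-only (or negligible) change: keep the same value.
        return base
    if d_vertical_deg > 0:
        # Gaze moved up, toward the apparatus: a call is more likely.
        return max(0.0, base - step)
    # Gaze moved down, away from the apparatus: a call is less likely.
    return min(1.0, base + step)


print(threshold_after_gaze_change(0.7, 30.0, 0.0))
print(threshold_after_gaze_change(0.7, 0.0, 10.0))
```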

Meanwhile, according to an embodiment of the present disclosure, the speech recognition apparatus 100 may determine the reference similarity based on information related to a person other than the user. The speech recognition apparatus 100 may determine the reference similarity based on the user's motion information and on the information related to the other person. Here, the information related to the other person may include at least one of information indicating whether another person is present, location information indicating the location of the other person, and gaze direction information indicating the gaze direction of the other person. This is because, when a person other than the user is present in a preset area defined relative to the location where the speech recognition apparatus 100 is installed, the user may have uttered the voice corresponding to the speech signal in order to talk with that person. In this case, the speech recognition apparatus 100 may determine that the user 300 is unlikely to be calling the speech recognition apparatus 100, and may detect the wake-up word from the speech signal based on a higher reference similarity than when no person other than the user is present.

FIG. 6 is a diagram illustrating a method by which the speech recognition apparatus 100 according to an embodiment of the present disclosure determines the reference similarity based on information related to another person 600 besides the user. According to an embodiment, the speech recognition apparatus 100 may use the motion detection sensor described above to determine whether another person 600 is present in a preset area defined relative to the position of the user 300 (step S602). Here, the preset area may be smaller than or equal in size to the preset area described with reference to FIG. 5. The preset area may also be an area centered on the position of the user 300 whose radius is the distance within which a conversation with the user 300 is possible. For example, the speech recognition apparatus 100 may determine whether the image information described above includes an object corresponding to a person 600 other than the user 300. Here, the object corresponding to the other person 600 may be the portion of the image information that depicts the other person 600. When the image information includes an object corresponding to the other person 600, the speech recognition apparatus 100 may obtain, based on the image information, location information indicating the location of the other person 600. The speech recognition apparatus 100 may then detect the wake-up word based on a reference similarity obtained using the location of the other person 600.

According to an embodiment, the speech recognition apparatus 100 may determine the reference similarity based on first gaze direction information indicating a first gaze direction of the user 300 and on location information indicating the location of the other person 600 (step S604). Specifically, when the first gaze direction points toward the location of the other person 600, the speech recognition apparatus 100 may detect the wake-up word based on a relatively high reference similarity. This is because, in that case, the user may have uttered the voice corresponding to the speech signal in order to talk with the other person. Conversely, when the first gaze direction does not point toward the location of the other person 600, the speech recognition apparatus 100 may detect the wake-up word based on a relatively low reference similarity.

In addition, the speech recognition apparatus 100 may determine the reference similarity based on second gaze direction information indicating a second gaze direction of the other person 600 (step S606). This is because, when the gaze direction of the other person 600 points toward the user 300, the user 300 and the other person 600 are likely to be in conversation. Specifically, the speech recognition apparatus 100 may obtain the second gaze direction information from the image information described above. The speech recognition apparatus 100 may extract an object corresponding to the other person 600 from the image information, and may obtain the second gaze direction information by using the extracted object.

Referring to FIG. 6, the speech recognition apparatus 100 may determine the reference similarity based on the results of the determinations in steps S602 through S606 (step S608). When it is determined in step S602 that no other person 600 is present, the speech recognition apparatus 100 may detect the wake-up word from the speech signal based on a reference similarity of '0.55'. When it is determined in steps S602 and S604 that another person 600 is present and the first gaze direction points toward the location of the other person 600, the speech recognition apparatus 100 may determine the reference similarity according to the second gaze direction. In this case, when it is determined in step S606 that the second gaze direction points toward the position of the user 300, the speech recognition apparatus 100 may detect the wake-up word from the speech signal based on a reference similarity of '0.95'. This is because, when the first gaze direction and the second gaze direction point toward each other, the speech recognition apparatus 100 is unlikely to be called. When it is determined in step S606 that the second gaze direction does not point toward the position of the user 300, the speech recognition apparatus 100 may detect the wake-up word from the speech signal based on a reference similarity of '0.85'.

When it is determined in steps S602 and S604 that another person 600 is present and the first gaze direction does not point toward the location of the other person 600, the speech recognition apparatus 100 may likewise determine the reference similarity according to the second gaze direction. In this case, when the second gaze direction points toward the position of the user 300, the speech recognition apparatus 100 may detect the wake-up word from the speech signal based on a reference similarity of '0.75'. When the second gaze direction does not point toward the position of the user 300, the speech recognition apparatus 100 may detect the wake-up word from the speech signal based on a reference similarity of '0.65'. That is, the first gaze direction information, which indicates the gaze direction of the user 300 who uttered the voice corresponding to the speech signal, may be weighted more heavily than the second gaze direction information in determining the reference similarity.
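The decisions of steps S602 through S608 amount to a small lookup over three yes/no determinations. The sketch below uses the five reference similarity values stated in the text. The function and parameter names are illustrative and not part of the disclosure.

```python
def reference_similarity_fig6(other_person_present: bool,
                              user_looks_at_other: bool,
                              other_looks_at_user: bool) -> float:
    """Return the reference similarity chosen in step S608."""
    # S602: no other person in the preset area around the user.
    if not other_person_present:
        return 0.55
    # S604 (first gaze direction) and S606 (second gaze direction).
    table = {
        (True, True): 0.95,    # both look at each other: likely a conversation
        (True, False): 0.85,
        (False, True): 0.75,
        (False, False): 0.65,
    }
    return table[(user_looks_at_other, other_looks_at_user)]


print(reference_similarity_fig6(True, True, False))
```

Changing the first gaze direction shifts the value by 0.2, while changing the second shifts it by 0.1, which reflects the heavier weight given to the speaking user's gaze.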

According to an embodiment, the speech recognition apparatus 100 may also use reference similarities determined separately from the first gaze direction information and the second gaze direction information. For example, the speech recognition apparatus 100 may determine a first reference similarity based on the direction indicated by the first gaze direction information and the location of the other person 600. The speech recognition apparatus 100 may also determine a second reference similarity based on the direction indicated by the second gaze direction information and the position of the user 300. The speech recognition apparatus 100 may then detect the wake-up word from the speech signal based on the average of the first reference similarity and the second reference similarity.
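Combining the two separately determined reference similarities can be sketched as a weighted mean. With the default weight of 0.5 this reduces to the plain average named in the text. The `first_weight` parameter is an assumption added only to show how the first gaze direction could be given more influence, and the input values in the usage line are placeholders.

```python
def combine_reference_similarities(first_ref: float,
                                   second_ref: float,
                                   first_weight: float = 0.5) -> float:
    """Combine the first and second reference similarities.

    first_weight = 0.5 gives the arithmetic mean; a larger value (hypothetical)
    lets the speaking user's gaze direction dominate.
    """
    if not 0.0 <= first_weight <= 1.0:
        raise ValueError("first_weight must lie in [0, 1]")
    return first_weight * first_ref + (1.0 - first_weight) * second_ref


# Placeholder inputs; the text does not give values for this embodiment.
print(combine_reference_similarities(0.9, 0.7))
```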

Meanwhile, as described above, the speech recognition apparatus 100 may also obtain the reference similarity by referring to a call history. Here, the call history may indicate the history of calls made to a particular speech recognition apparatus. Based on the call history, the speech recognition apparatus 100 may calculate a call frequency corresponding to at least one of the motion information of the user 300, the surrounding environment information of the speech recognition apparatus 100, and the information related to a person other than the user. Here, the call frequency may indicate the cumulative number of times the speech recognition apparatus 100 has been called in the corresponding situation. The speech recognition apparatus 100 may determine the reference similarity based on the calculated call frequency. For example, the speech recognition apparatus 100 may set the reference similarity to be inversely proportional to the call frequency. The speech recognition apparatus 100 may then detect the wake-up word based on the determined reference similarity.
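One way to read "inversely proportional to the call frequency" is the clamped mapping below. The constant `k` and the clamping bounds are assumptions chosen only to keep the result in a usable range. The text does not specify them, nor how a count of zero is treated in this embodiment.

```python
def reference_similarity_from_call_count(call_count: int,
                                         k: float = 0.9,
                                         floor: float = 0.5,
                                         ceiling: float = 0.95) -> float:
    """Map a cumulative call count to a reference similarity (k / count, clamped)."""
    if call_count < 1:
        # No recorded calls: fall back to the strictest value (assumption).
        return ceiling
    return min(ceiling, max(floor, k / call_count))


for count in (1, 2, 10):
    print(count, reference_similarity_from_call_count(count))
```

Situations in which the apparatus has been called often therefore get a more permissive reference similarity.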

In addition, the speech recognition apparatus 100 may use the call history of a speech recognition device other than the speech recognition apparatus 100 itself. For example, the speech recognition apparatus 100 may determine the reference similarity based on the call history of another speech recognition device installed in a place similar to the place where the speech recognition apparatus 100 is installed. To this end, the speech recognition apparatus 100 may obtain information about the place where each speech recognition device connected to the speech recognition apparatus 100 or to the server 200 is installed. Here, the information about the place of installation may include the region in which the speech recognition device is installed and the type of use of the place (for example, home or office). Specifically, when the speech recognition apparatus 100 obtains a speech signal at a time of day when the call frequency of the other speech recognition device is high, it may detect the wake-up word based on a lower reference similarity than at a time of day when the call frequency is low. This is because the speech recognition apparatus 100 is likely to be called at such a time.

In addition, the speech recognition apparatus 100 may determine the reference similarity based on a call history of calls made to the speech recognition apparatus 100 by users other than the user. Here, the other users may include a user whose life pattern is similar to that of the user. The speech recognition apparatus 100 may also determine the reference similarity based on the call history of another speech recognition device called by the same user. In this case, the speech recognition apparatus 100 may identify a specific user by using acoustic features contained in the speech signal. The speech recognition apparatus 100 may also determine the reference similarity based on a call history organized by the user's motion information.

According to an embodiment, the speech recognition apparatus 100 may determine the reference similarity based on movement information of the user. Here, the movement information may indicate at least one of the user's movement direction and movement speed. For example, the speech recognition apparatus 100 may calculate, based on the call history, a call frequency corresponding to the movement information. Here, the call history may be pre-stored information. Specifically, the speech recognition apparatus 100 may calculate the call frequency mapped to movement information similar to the movement information corresponding to the obtained speech signal. For example, when the movement information corresponding to the obtained speech signal indicates a movement speed of 1 m/s, the speech recognition apparatus 100 may calculate the call frequency mapped to movement speeds of 0.8 to 1.3 m/s. When the calculated call frequency is '10', the speech recognition apparatus 100 may detect the wake-up word based on a lower reference similarity than when the call frequency is '1'. Alternatively, when the calculated call frequency is '0', the speech recognition apparatus 100 may not recognize the wake-up word at all. In this case, the speech recognition apparatus 100 may set the reference similarity to a value greater than the maximum reference similarity value.
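The movement-speed example can be sketched as a window lookup over a stored call history. The data layout (a list of speeds in metres per second, one per past call), the window offsets taken from the 0.8 to 1.3 example, and the mapping from count to reference similarity are all assumptions. Only the zero-count rule, a value above the maximum so that nothing is recognized, comes from the text.

```python
from typing import Sequence

MAX_REFERENCE_SIMILARITY = 1.0  # assumed upper bound of the similarity scale


def call_count_near_speed(history_speeds: Sequence[float], speed: float,
                          below: float = 0.2, above: float = 0.3) -> int:
    """Count past calls whose recorded speed is close to the current speed."""
    low, high = speed - below, speed + above
    return sum(1 for s in history_speeds if low <= s <= high)


def reference_similarity_for_speed(history_speeds: Sequence[float],
                                   speed: float) -> float:
    count = call_count_near_speed(history_speeds, speed)
    if count == 0:
        # Never called in this situation: set the value above the maximum,
        # so no similarity score can reach it.
        return MAX_REFERENCE_SIMILARITY + 0.01
    return max(0.5, 0.9 / count)  # hypothetical decreasing mapping


history = [0.9, 1.0, 1.2, 2.0]  # made-up example data
print(call_count_near_speed(history, 1.0))
print(reference_similarity_for_speed([], 1.0))
```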

FIG. 7 is a flowchart illustrating the operation of the speech recognition apparatus 100 according to an embodiment of the present invention. Referring to FIG. 7, in operation S702, the speech recognition apparatus 100 may obtain a speech signal. In operation S704, the speech recognition apparatus 100 may obtain a reference similarity based on motion information. The speech recognition apparatus 100 may generate the motion information by detecting a motion of the user who uttered the voice corresponding to the speech signal. For example, the speech recognition apparatus 100 may determine the reference similarity based on the gaze direction of the user who uttered the voice corresponding to the speech signal. In operation S706, the speech recognition apparatus 100 may detect the call word from the obtained speech signal based on the reference similarity. When the similarity between the speech signal and the call word is greater than or equal to the reference similarity, the speech recognition apparatus 100 may determine that the call word is detected from the speech signal. Conversely, when the similarity between the speech signal and the call word is less than the reference similarity, the speech recognition apparatus 100 may determine that the call word is not detected from the speech signal. In operation S708, the speech recognition apparatus 100 may output output information generated based on the call word detection result. Accordingly, the speech recognition apparatus 100 may recognize the call word in consideration of the user's intention in speaking, and may thereby reduce the false recognition rate of call word recognition.
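As a rough illustration of operations S702 to S708, the following Python sketch takes an already computed similarity score and a gaze angle and returns the output information. The function names, the 30-degree preset range, the two threshold values, and the output strings are hypothetical. The sketch also treats a similarity exactly equal to the reference as a detection, which is one reading of the comparison described above.

```python
def reference_similarity_from_gaze(gaze_angle_deg, preset_range_deg=30.0,
                                   low=0.6, high=0.85):
    """S704: choose a reference similarity from the gaze angle.

    Assumption: a gaze within the preset range (user looking toward the
    device) gets the lower threshold, otherwise the higher one.
    """
    return low if abs(gaze_angle_deg) <= preset_range_deg else high


def detect_call_word(similarity, reference):
    """S706: the call word counts as detected when similarity >= reference."""
    return similarity >= reference


def run_pipeline(similarity, gaze_angle_deg):
    """Run S704 to S708 for a signal that was acquired and scored in S702.

    `similarity` stands in for the score between the speech signal and
    the call word; how it is computed is outside this sketch.
    """
    reference = reference_similarity_from_gaze(gaze_angle_deg)  # S704
    detected = detect_call_word(similarity, reference)          # S706
    # S708: produce output information from the detection result
    return "call word detected" if detected else "no call word"


looking_at_device = run_pipeline(similarity=0.7, gaze_angle_deg=10.0)
looking_away = run_pipeline(similarity=0.7, gaze_angle_deg=60.0)
```

The same utterance score of 0.7 is accepted when the user looks toward the device and rejected when the user looks away, which is how the motion information lowers the false recognition rate in this sketch.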

Some embodiments may also be implemented in the form of a recording medium containing computer-executable instructions, such as program modules executed by a computer. A computer-readable medium may be any available medium that can be accessed by a computer, and may include volatile and nonvolatile media as well as removable and non-removable media. In addition, the computer-readable medium may include a computer storage medium. Computer storage media may include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. In addition, in this specification, a "unit" may be a hardware component such as a processor or a circuit, and/or a software component executed by a hardware component such as a processor.

The foregoing description of the present disclosure is provided by way of example, and those of ordinary skill in the art to which the present disclosure pertains will understand that it may be easily modified into other specific forms without changing the technical spirit or essential features of the present disclosure. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise, components described as distributed may be implemented in a combined form.

Claims

A speech recognition device that provides a service through recognition of a call word, the device comprising:
a voice receiver configured to obtain a voice signal;
a processor configured to detect a motion of a user who uttered a voice corresponding to the voice signal and generate motion information indicating the detected motion of the user, wherein the motion information includes movement information indicating at least one of a moving direction and a moving speed of the user,
detect the call word from the voice signal based on the movement information, and
generate output information based on the detection result; and
an output unit configured to output the output information.

The speech recognition device of claim 1, wherein
the motion information includes first gaze direction information indicating a gaze direction of the user, and
the processor is configured to
determine, based on the first gaze direction information, a reference similarity indicating a criterion for determining whether the call word is detected from the voice signal, and
detect the call word from the voice signal based on the reference similarity.

The speech recognition device of claim 2, wherein
the processor is configured to
determine the reference similarity based on a gaze angle indicating the gaze direction of the user and a preset range,
wherein, when the gaze angle is within the preset range, the reference similarity is set to a lower value than when the gaze angle is outside the preset range, and
the gaze angle is determined based on a location where the speech recognition device is installed and a location of the user.

The speech recognition device of claim 2, wherein
the processor is configured to
determine the reference similarity based on the gaze direction of the user and a position of at least one other electronic device installed within a predetermined area defined relative to the location where the speech recognition device is installed.

The speech recognition device of claim 4, wherein
the processor is configured to
determine the reference similarity based on an operation state of the other electronic device, among the at least one other electronic device, that corresponds to the gaze direction of the user.

The speech recognition device of claim 4, wherein
the processor is configured to
determine the reference similarity based on change information indicating a change in the gaze direction of the user during a predetermined time from when the voice signal is obtained.

The speech recognition device of claim 6, wherein
the processor is configured to
determine the reference similarity based on the change information when the change in the gaze direction corresponds to a positional relationship between the speech recognition device and the at least one other electronic device.

The speech recognition device of claim 2, wherein
the processor is configured to
obtain image information including an object representing the user, based on time information about when the voice signal was obtained, and
obtain the first gaze direction information from the image information.

The speech recognition device of claim 8, wherein
the processor is configured to
determine whether the image information includes an object corresponding to a person other than the user,
when the image information includes an object corresponding to the other person, obtain location information indicating a location of the other person based on the image information, and
determine the reference similarity based on the first gaze direction information and the location information.

The speech recognition device of claim 9, wherein
the processor is configured to
obtain, from the image information, second gaze direction information indicating a gaze direction of the other person, and
determine the reference similarity based on the first gaze direction information and the second gaze direction information.

(Deleted)

The speech recognition device of claim 1, wherein
the processor is configured to
calculate a call frequency corresponding to the movement information based on a pre-stored call history,
determine, based on the call frequency, a reference similarity indicating a criterion for determining whether the call word is detected from the voice signal, and
detect the call word from the voice signal based on the reference similarity,
wherein the reference similarity is inversely proportional to the call frequency.

The speech recognition device of claim 12, wherein
the processor is configured to
calculate at least one of the moving direction and the moving speed based on acoustic characteristics of the voice signal,
wherein the acoustic characteristics include at least one of a change in magnitude of the voice signal and a change in frequency of the voice signal.

A method of operating a speech recognition device that provides a service through recognition of a call word, the method comprising:
obtaining a voice signal;
detecting a motion of a user who uttered a voice corresponding to the voice signal and generating motion information indicating the detected motion of the user, wherein the motion information includes movement information indicating at least one of a moving direction and a moving speed of the user;
detecting the call word from the voice signal based on the movement information; and
outputting output information generated based on the detection result.

A recording medium readable by an electronic device, having recorded thereon a program for executing the method of claim 14 on the electronic device.