KR102071865B1

KR102071865B1 - Device and method for recognizing wake-up word using server recognition result

Info

Publication number: KR102071865B1
Application number: KR1020180055968A
Authority: KR
Inventors: 양태영
Original assignee: 주식회사 인텔로이드
Priority date: 2017-11-30
Filing date: 2018-05-16
Publication date: 2020-01-31
Also published as: KR20190064384A

Abstract

Disclosed is a speech recognition apparatus that provides a service through wake-up word recognition. The speech recognition apparatus includes: a voice receiver that acquires a voice signal; a processor that generates a first recognition result indicating a wake-up word detection result for the voice signal, transmits at least a portion of the voice signal to a server based on the first recognition result and a call history corresponding to the reception environment in which the voice signal was acquired, and generates output information based on at least one of the first recognition result and a second recognition result indicating the server's recognition result for the at least a portion of the voice signal; and an output unit that outputs the output information.

Description

DEVICE AND METHOD FOR RECOGNIZING WAKE-UP WORD USING SERVER RECOGNITION RESULT

The present disclosure relates to a speech recognition apparatus and a speech recognition method, and more particularly, to an apparatus and a method that use a server recognition result to reduce the false recognition rate of wake-up word recognition.

Speech recognition technology is one of the key technologies for making interaction between a user and an electronic device smoother. Through speech recognition, an electronic device may listen to and understand the user's voice, and may provide an appropriate service to the user based on what it has understood. Accordingly, the user may request a desired service from the electronic device without any separate manipulation.

Among the various technologies in the speech recognition field, keyword spotting, which detects a wake-up word or a keyword included in speech acquired from a user, has recently attracted attention in many areas. For keyword spotting to work properly, the detection rate, that is, the ratio at which a keyword included in the speech is recognized and detected, must be high. Alongside the detection rate, however, keyword misrecognition is another important issue in keyword spotting. That is, when a keyword detected from speech is erroneously recognized as a different keyword, a terminal to which keyword spotting is applied may provide a service the user did not want or perform processing the user did not intend. Therefore, there is a need for a way to address the low detection rate or high false recognition rate of existing keyword spotting techniques.

Meanwhile, devices that recognize a wake-up word through speech recognition and provide a specific service when the wake-up word is successfully recognized are being researched and released. Because wake-up word detection is performed in real time by embedded speech recognition, its false recognition rate is relatively high. Accordingly, there is a need for a technique related to a method of recognizing a wake-up word.

The present disclosure has been made to solve the above problems, and an object of the present disclosure is to provide a speech recognition apparatus or a speech recognition method capable of increasing the accuracy of wake-up word recognition. Specifically, the present disclosure provides a speech recognition apparatus or a speech recognition method that reduces the false recognition rate of wake-up word recognition.

According to an embodiment of the present invention for solving the above problems, an apparatus may include: a voice receiver that acquires a voice signal; a processor that generates a first recognition result indicating a wake-up word detection result for the voice signal, transmits at least a portion of the voice signal to a server based on the first recognition result and a call history corresponding to the reception environment in which the voice signal was acquired, generates output information based on the first recognition result and a second recognition result indicating the server's recognition result for the at least a portion of the voice signal when the at least a portion of the voice signal has been transmitted to the server, and generates output information based on the first recognition result when the voice signal has not been transmitted to the server; and an output unit that outputs the generated output information.

A speech recognition method according to an embodiment may include: acquiring a voice signal; generating a first recognition result indicating a wake-up word detection result for the voice signal; transmitting at least a portion of the voice signal to a server based on the first recognition result and a call history corresponding to the reception environment in which the voice signal was acquired; generating output information based on the first recognition result and a second recognition result indicating the server's recognition result for the at least a portion of the voice signal when the at least a portion of the voice signal has been transmitted to the server, and generating output information based on the first recognition result when the voice signal has not been transmitted to the server; and outputting the generated output information.

A computer-readable recording medium according to another aspect may include a recording medium on which a program for executing the above-described method on a computer is recorded.

According to an embodiment of the present disclosure, the accuracy of wake-up word recognition may be increased, thereby reducing the false recognition rate of wake-up word recognition. In addition, according to an embodiment of the present disclosure, output information may be effectively provided to the user who uttered the speech.

In addition, the present disclosure may recognize the wake-up word based on the characteristics of the environment in which the user's voice was acquired. In this way, the present disclosure may reduce device malfunctions caused by wake-up word misrecognition and increase the energy efficiency of a speech recognition apparatus that provides a service using speech recognition.

FIG. 1 is a schematic diagram illustrating a service providing system including a speech recognition apparatus and a server according to an embodiment of the present disclosure.
FIG. 2 is a diagram illustrating the configuration of a speech recognition apparatus according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating a voice signal according to an embodiment of the present disclosure.
FIG. 4 is a diagram illustrating a voice signal including a wake-up word part and a non-wake-up word part according to an embodiment of the present disclosure.
FIG. 5 is a flowchart illustrating the operation of a speech recognition apparatus according to an embodiment of the present disclosure.
FIG. 6 is a diagram illustrating an example of a call history associated with a speech recognition apparatus according to an embodiment of the present disclosure.
FIG. 7 is a flowchart illustrating a method of operating a speech recognition apparatus according to an embodiment of the present disclosure.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains may easily implement them. However, the present invention may be embodied in many different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted in order to describe the present invention clearly, and like reference numerals designate like parts throughout the specification. Throughout the specification, when a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them.

The present disclosure relates to a speech recognition apparatus and method that detect a preset wake-up word from a voice signal and provide output information. Specifically, the speech recognition apparatus and method according to an embodiment of the present disclosure may use a recognition result produced by a server to reduce the false recognition rate, which indicates the rate at which a voice signal that does not correspond to the wake-up word is erroneously recognized as corresponding to the wake-up word. In the present disclosure, a wake-up word may refer to a keyword set to trigger the service providing function of the speech recognition apparatus. Hereinafter, the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a schematic diagram illustrating a service providing system including a speech recognition apparatus 100 and a server 200 according to an embodiment of the present disclosure. As shown in FIG. 1, the service providing system may include at least one speech recognition apparatus 100 and a server 200. The service providing system according to an embodiment of the present disclosure may provide a service based on a preset wake-up word (hereinafter, 'wake-up word'). For example, the service providing system may recognize an acquired voice signal and provide a service corresponding to the recognition result. In doing so, the service providing system may determine whether the wake-up word is detected from the acquired voice signal. When the wake-up word is detected from the acquired voice signal, the service providing system may provide a service corresponding to the recognition result. Conversely, when the wake-up word is not detected from the acquired voice signal, the service providing system may not perform speech recognition or may not provide a service corresponding to the recognition result. The service providing system may provide output information corresponding to the recognition result through the speech recognition apparatus 100.

The speech recognition apparatus 100 according to an embodiment of the present disclosure may be an IoT terminal attached to a wall, but is not limited thereto. For example, the speech recognition apparatus 100 may be an IoT terminal in the form of a light installed at an entrance. Alternatively, the speech recognition apparatus 100 may be a home appliance equipped with a speech recognition function, such as a cooling/heating device, a set-top box, a refrigerator, or a TV.

According to an embodiment, the speech recognition apparatus 100 may recognize the wake-up word and wake up the service providing function of the speech recognition apparatus 100. For example, when the wake-up word is detected from an acquired voice signal, the speech recognition apparatus 100 may wake up the speech recognition operation for providing a service. The speech recognition apparatus 100 may recognize the wake-up word through an embedded recognition module in the speech recognition apparatus 100. Here, wake-up word recognition may refer to the operation of determining whether the wake-up word is detected from the voice signal. How the speech recognition apparatus 100 performs speech recognition is described later with reference to FIG. 3.

Meanwhile, even when the user did not utter the speech corresponding to the voice signal with the intention of calling the speech recognition apparatus 100, the speech recognition apparatus 100 may malfunction by erroneously recognizing that the wake-up word was detected from the voice signal. In particular, when the user utters a word similar to the wake-up word, the speech recognition apparatus 100 may erroneously recognize that the wake-up word was detected from the corresponding voice signal and malfunction. When the speech recognition apparatus 100 is a home appliance equipped with a speech recognition function, misrecognition of the wake-up word may cause unnecessary power consumption.

According to an embodiment, wake-up word recognition may also be performed by the server 200. In this case, the speech recognition apparatus 100 may transmit the voice signal to the server 200 and request a recognition result. The speech recognition apparatus 100 may then generate output information based on the recognition result received from the server 200. In this way, the speech recognition apparatus 100 may reduce the false recognition rate of wake-up word recognition, because it can obtain a wake-up word recognition result from the server 200, which has higher computational capability than the speech recognition apparatus 100. The speech recognition apparatus 100 may also reduce unnecessary power consumption caused by misrecognition of the wake-up word. Here, the false recognition rate of the speech recognition apparatus 100 indicates the rate at which the speech recognition apparatus 100 erroneously recognizes that the wake-up word was detected when the acquired voice signal does not correspond to the wake-up word. The false recognition rate may be expressed as Equation 1 below.

[Equation 1]

False recognition rate = 100 * (number of recognized words) / (number of non-wake-up input words) [%]

In Equation 1, the "number of non-wake-up input words" may denote the number of spoken input words that are not the wake-up word. The "number of recognized words" may denote the number of those non-wake-up input words that were recognized as the wake-up word. However, when the speech recognition apparatus 100 transmits acquired voice signals to the server 200, data traffic on the network may increase. In that case, depending on the network environment, the speech recognition apparatus 100 may not receive recognition results from the server 200 smoothly. The speech recognition apparatus 100 according to an embodiment of the present disclosure may transmit a voice signal to the server 200 based on a call history corresponding to the reception environment in which the voice signal was acquired. For example, the speech recognition apparatus 100 may determine, based on the call history, at least a portion of the voice signal to transmit to the server 200, and may transmit the determined portion of the voice signal to the server 200. Here, the call history may be information indicating the history of the speech recognition apparatus having been called in a specific reception environment. This is described in detail with reference to FIGS. 5 and 6.
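As a worked illustration of Equation 1, the following minimal Python sketch computes the false recognition rate from the two counts defined above. The function name and parameter names are hypothetical and are not taken from the disclosure.

```python
def false_recognition_rate(non_wake_inputs: int, recognized_as_wake: int) -> float:
    """Equation 1: percentage of non-wake-up input words recognized as the wake-up word.

    Both parameter names are illustrative assumptions, not terms from the disclosure.
    """
    if non_wake_inputs <= 0:
        raise ValueError("at least one non-wake-up input word is required")
    if not 0 <= recognized_as_wake <= non_wake_inputs:
        raise ValueError("recognized count must lie between 0 and the input count")
    return 100.0 * recognized_as_wake / non_wake_inputs


# Example: 3 of 200 non-wake-up words were wrongly taken for the wake-up word.
print(false_recognition_rate(200, 3))  # 1.5
```

A lower value means fewer spurious wake-ups; the rate says nothing about how often a genuine wake-up word is missed.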

The server 200 according to an embodiment of the present disclosure may perform speech recognition in the same or a similar manner as the speech recognition apparatus 100 performs speech recognition for the wake-up word or for providing a service. For example, the server 200 may perform speech recognition on a voice signal obtained from the speech recognition apparatus 100. Having received at least a portion of the voice signal from the speech recognition apparatus 100, the server 200 may perform speech recognition and transmit the resulting recognition result to the speech recognition apparatus 100. The server 200 may also include a database for speech recognition. In this case, the database may include at least one acoustic model or speech recognition model. However, the server 200 does not necessarily include the database, and the service providing system may include a separate storage (not shown) connected to the server 200. In this case, the server 200 may obtain at least one acoustic model or speech recognition model from the storage that contains the database.

FIG. 2 is a diagram illustrating the configuration of the speech recognition apparatus 100 according to an embodiment of the present invention. According to an embodiment, the speech recognition apparatus 100 may include a voice receiver 110, a processor 120, and an output unit 130. However, some of the components shown in FIG. 2 may be omitted, and the apparatus may additionally include components not shown in FIG. 2. The speech recognition apparatus 100 may also integrate two or more different components into a single unit. According to an embodiment, the speech recognition apparatus 100 may be implemented as a single semiconductor chip.

The voice receiver 110 may acquire a voice signal. The voice receiver 110 may collect a voice signal incident on the voice receiver 110. According to an embodiment, the voice receiver 110 may include at least one microphone. For example, the voice receiver 110 may include a microphone array including a plurality of microphones. The microphone array may include a plurality of microphones arranged in various shapes other than a circle or a sphere, such as a cube or an equilateral triangle. According to another embodiment, the voice receiver 110 may receive a voice signal corresponding to speech collected by an external sound collection device. For example, the voice receiver 110 may include a voice signal input terminal to which a voice signal is input. Specifically, the voice receiver 110 may include a voice signal input terminal that receives a voice signal transmitted by wire. Alternatively, the voice receiver 110 may receive a voice signal transmitted wirelessly using Bluetooth or Wi-Fi communication.

The processor 120 may control the overall operation of the speech recognition apparatus 100 described throughout the specification. The processor 120 may control each component of the speech recognition apparatus 100. The processor 120 may perform computation and processing of various data and signals. The processor 120 may be implemented as hardware in the form of a semiconductor chip or an electronic circuit, or as software that controls hardware. The processor 120 may also be implemented as a combination of the hardware and the software. The processor 120 may control the operation of the speech recognition apparatus 100 by executing at least one program included in the software.

According to an embodiment, the processor 120 may recognize speech from the voice signal acquired through the voice receiver 110 described above. The processor 120 may include the embedded speech recognition module described above. According to an embodiment, the processor 120 may recognize the wake-up word from the voice signal using the embedded speech recognition module. The processor 120 may also request a recognition result for the voice signal from the server 200 through a transceiver (not shown). For example, the transceiver may transmit and receive information to and from an external communication device under the control of the processor 120. The transceiver may include physical hardware and intangible software for communicating with the outside. The processor 120 may transmit and receive data to and from an external device over a wired or wireless network through the transceiver. Here, the external device may include any external communication network, individual wired or wireless communication terminal, server, and access point (AP) other than the speech recognition apparatus 100. The external device may include another speech recognition apparatus and the server 200, but is not limited thereto. In addition, the processor 120 may transmit the voice signal to the server 200 through the transceiver (not shown). The processor 120 may also generate output information based on the speech recognition result obtained from the server 200.

The processor 120 may generate output information. For example, when the wake-up word is detected, the processor 120 may wake up the service providing function. In this case, the processor 120 may generate output information including information indicating that the service providing function has been woken up. The processor 120 may also generate output information corresponding to a recognition result obtained by performing speech recognition. Conversely, when the wake-up word is not detected, the processor 120 may generate output information including information indicating that the wake-up word was not detected. Alternatively, in this case, the processor 120 may not provide output information to the user. The processor 120 may output the generated output information through the output unit 130 described below.

The output unit 130 may output information provided to the user. The output unit 130 may output the output information generated by the processor 120. The output unit 130 may also output the output information converted into a form such as light, sound, or vibration. According to an embodiment, the output unit 130 may be at least one of a speaker, a display, various light sources including an LED, and a monitor, but is not limited thereto. For example, the output unit 130 may output the output information generated based on the wake-up word detection result. Here, the output information may include the wake-up word detection result. The output unit 130 may output detection signals that differ depending on whether or not the wake-up word was detected. For example, through a light source, the output unit 130 may output 'blue' light when the wake-up word is detected and 'red' light when the wake-up word is not detected. The output unit 130 may also output a preset audio signal through a speaker only when the wake-up word is detected.
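The detection signaling described above can be sketched as a simple mapping from the detection result to a light color and an optional audio cue. This is only an illustrative sketch: the `OutputSignal` type, the `detection_signal` function, and the clip name `"wake_chime"` are hypothetical names, and combining the light and speaker examples in one signal is an assumption made for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OutputSignal:
    """Hypothetical container for what the output unit emits."""
    light_color: str            # color shown by the light source
    audio_clip: Optional[str]   # preset audio cue, or None for silence


def detection_signal(wake_word_detected: bool) -> OutputSignal:
    # Blue light plus a preset audio cue when the wake-up word is detected.
    if wake_word_detected:
        return OutputSignal(light_color="blue", audio_clip="wake_chime")
    # Red light and no audio when the wake-up word is not detected.
    return OutputSignal(light_color="red", audio_clip=None)


print(detection_signal(True))
print(detection_signal(False))
```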

In addition, the output unit 130 may perform a function specific to the speech recognition apparatus 100. Specifically, when the speech recognition apparatus 100 is an information providing apparatus with a speech recognition function, the output unit 130 may provide information corresponding to the user's query in the form of an audio signal or a video signal. For example, the output unit 130 may output information corresponding to the user's query in a text format or a voice format. The output unit 130 may also transmit, to another device connected to the speech recognition apparatus 100 by wire or wirelessly, a control signal that controls the operation of that device. For example, when the speech recognition apparatus 100 is an IoT terminal attached to a wall, the speech recognition apparatus 100 may transmit a control signal for controlling the temperature of a heating device to the heating device.

According to an embodiment of the present disclosure, the processor 120 may obtain a voice signal through the voice receiver 110. The processor 120 may transmit at least a part of the voice signal to the server 200 based on the first recognition result and the call history corresponding to the reception environment in which the voice signal was obtained. Here, the first recognition result represents the recognition result of the speech recognition apparatus 100 for the voice signal. The first recognition result may include the determination by the speech recognition apparatus 100 of whether the voice signal includes the call word, and may also include the similarity between the voice signal and the call word as calculated by the speech recognition apparatus 100. When the processor 120 has transmitted at least a part of the voice signal to the server 200, the processor 120 may generate output information based on the first recognition result and the second recognition result. Here, the second recognition result represents the recognition result of the server 200 for the part of the voice signal that was transmitted to the server 200. The second recognition result may include whether the voice signal includes the call word, and may include a speech recognition result for providing a service. In addition, the processor 120 may generate the output information based on a final call word detection result, that is, a call word detection result obtained from the first recognition result and the second recognition result. On the other hand, when the processor 120 has not transmitted the voice signal to the server 200, the processor 120 may generate the output information based on the first recognition result. Hereinafter, the detailed operation of the speech recognition apparatus 100 is described with reference to FIGS. 3 to 6.

FIG. 3 is a diagram illustrating a voice signal according to an embodiment of the present disclosure. Referring to FIG. 3, the voice signal may be composed of at least one frame. Here, a frame means a section of the signal divided at a specific length. In FIG. 3, f1 to f9 denote the frames included in the voice signal. According to an embodiment, the speech recognition apparatus 100 may divide the voice signal into preset frames. The speech recognition apparatus 100 may extract an acoustic feature from each of the divided segments and calculate the similarity between the extracted acoustic feature and the acoustic model corresponding to the call word. The speech recognition apparatus 100 may also determine whether the call word is present based on the similarity between the extracted acoustic feature and either the acoustic model corresponding to the call word or a model for speech recognition. Here, an acoustic feature represents information required for speech recognition.
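The frame division described for FIG. 3 can be pictured with a minimal sketch. The function name `split_into_frames`, the non-overlapping layout, and the frame length of 10 samples are assumptions made only for illustration; the disclosure states only that the signal is divided into preset frames.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], frame_len: int) -> List[List[float]]:
    """Split a signal into consecutive, non-overlapping frames.

    Assumption: frames do not overlap and the last frame may be shorter.
    The disclosure does not specify either point.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    return [list(samples[i:i + frame_len]) for i in range(0, len(samples), frame_len)]


# Hypothetical 90-sample signal split into nine frames, analogous to f1..f9 in FIG. 3.
signal = [0.0] * 90
frames = split_into_frames(signal, 10)
```

Practical front ends often use overlapping, windowed frames; the non-overlapping form is used here only because it is the simplest layout consistent with the figure.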

For example, the acoustic feature may include formant information and pitch information. Formants are defined as the spectral peaks of the speech spectrum and can be quantified as amplitude peak values in a spectrogram. Pitch means the fundamental frequency of the voice and indicates its periodic characteristics. The speech recognition apparatus 100 may extract the acoustic feature of the voice signal using at least one of Linear Predictive Coding (LPC) cepstrum, Perceptual Linear Prediction (PLP) cepstrum, Mel Frequency Cepstral Coefficients (MFCC), and filter bank energy analysis. In addition, the speech recognition apparatus 100 may determine the similarity between the acoustic feature extracted from the voice signal and at least one acoustic model. The speech recognition apparatus 100 may determine that the acoustic model with the highest similarity to the extracted acoustic feature is the acoustic model corresponding to the voice signal. The speech recognition apparatus 100 may then determine whether the text data of that acoustic model includes the text corresponding to the call word. If it does, the speech recognition apparatus 100 may determine that the call word has been detected from the voice signal. For example, when the call word is '소리야', the speech recognition apparatus 100 may determine whether the text data of the acoustic model corresponding to the obtained voice signal includes '소리야'.
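The selection step described above (pick the most similar acoustic model, then check its text for the call word) can be sketched as follows. This is only an illustration: the function `detect_call_word`, the representation of each acoustic model by its text, and the similarity values are all hypothetical, and feature extraction itself is left out.

```python
from typing import Dict


def detect_call_word(similarity_by_text: Dict[str, float], call_word: str) -> bool:
    """Return True if the best-matching model's text contains the call word.

    Assumption: each acoustic model is identified by its text data, and
    similarity_by_text maps that text to an already computed similarity.
    """
    if not similarity_by_text:
        return False
    best_text = max(similarity_by_text, key=similarity_by_text.get)
    return call_word in best_text


# Hypothetical similarities for two candidate models.
scores = {"소리야 불 켜줘": 0.91, "안녕하세요": 0.23}
detected = detect_call_word(scores, "소리야")
```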

FIG. 4 is a diagram illustrating a voice signal that includes a call word part 401 and a non-call word part 402 according to an embodiment of the present disclosure. Referring to FIG. 4, the voice signal 400 may include the call word part 401 and the non-call word part 402. Here, the call word part 401 represents the portion of the voice signal that contains the speech corresponding to the call word. The non-call word part 402 represents the portion of the voice signal that contains speech other than the call word. In other words, the portion of the voice signal other than the call word part 401 may be the non-call word part 402. When the speech recognition apparatus 100 detects the call word from the voice signal, it may separate the voice signal into the call word part 401 and the non-call word part 402. As described above, the speech recognition apparatus 100 or the server 200 may recognize speech from the voice signal in units of at least one frame. According to an embodiment, the speech recognition apparatus 100 may transmit only some of the frames included in the voice signal to the server 200. For example, the speech recognition apparatus 100 may transmit the call word part 401, which includes at least one frame, to the server 200. In this case, the call word part 401 represents the frame or frames, among those included in the voice signal, that contain the speech corresponding to the call word. The speech recognition apparatus 100 may also transmit the non-call word part 402, which includes at least one frame, to the server 200. In this case, the non-call word part 402 represents the frame or frames of the voice signal other than those of the call word part 401.
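As a rough illustration of the separation in FIG. 4, the sketch below splits a list of frames into a call word part and a non-call word part. The function `split_parts` and the assumption that the call word occupies one contiguous frame range `[call_start, call_end)` are hypothetical; the disclosure does not say how the boundary is located.

```python
from typing import List, Tuple

Frame = List[float]


def split_parts(frames: List[Frame], call_start: int, call_end: int) -> Tuple[List[Frame], List[Frame]]:
    """Separate frames into (call word part, non-call word part).

    Assumption: the call word occupies the contiguous frame range
    [call_start, call_end); every other frame is the non-call word part.
    """
    call_part = frames[call_start:call_end]
    non_call_part = frames[:call_start] + frames[call_end:]
    return call_part, non_call_part


# Hypothetical nine-frame signal whose first three frames hold the call word.
all_frames = [[float(i)] for i in range(9)]
call_part, non_call_part = split_parts(all_frames, 0, 3)
```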

Hereinafter, a method by which the speech recognition apparatus 100 according to an embodiment of the present disclosure provides output information using at least one of the first recognition result and the second recognition result is described with reference to FIG. 5. FIG. 5 is a flowchart illustrating the operation of the speech recognition apparatus 100 according to an embodiment of the present disclosure. In step S502, the speech recognition apparatus 100 may obtain a voice signal. For example, the speech recognition apparatus 100 may obtain a voice signal corresponding to speech uttered by the user 300. Through the voice signal, the user 300 may input the call word and various types of requests to the speech recognition apparatus 100. When the speech recognition operation for providing a service is activated, the speech recognition apparatus 100 may recognize speech from the voice signal and provide the service requested by the user 300. In this case, the voice signal may be a voice signal obtained within a predetermined time from when the voice signal corresponding to the call word was obtained.

In step S504, the speech recognition apparatus 100 may generate a first recognition result indicating whether the call word is detected in the voice signal. In step S504, when the first recognition result indicates that the call word was detected from the obtained voice signal, the speech recognition apparatus 100 may transmit at least a part of the voice signal to the server 200. For example, as shown in FIG. 4, the speech recognition apparatus 100 may separate the obtained voice signal into the call word part 401 and the non-call word part 402. In this case, the speech recognition apparatus 100 may transmit at least one of the call word part 401 and the non-call word part 402 of the voice signal to the server 200 (step S506). On the other hand, when the first recognition result in step S504 indicates that the call word was not detected from the obtained voice signal, the voice signal may include the non-call word part 402. In this case, the speech recognition apparatus 100 may not transmit the voice signal to the server 200. This is because, when the speech recognition apparatus 100 determines that the call word was not detected from the voice signal, it may not wake up the service providing function described above. In addition, the speech recognition apparatus 100 may provide second output information, which is described later (step S518).

In step S506, the speech recognition apparatus 100 may determine whether to transmit the call word part 401 to the server 200. For example, the speech recognition apparatus 100 may determine whether to transmit the call word part 401 of the voice signal (that is, whether to have it re-recognized) based on the first recognition result and the call history described above. When the speech recognition apparatus 100 determines not to perform re-recognition of the call word part 401, it may not transmit the call word part 401 of the voice signal to the server 200. In this case, the speech recognition apparatus 100 may transmit the non-call word part 402 to the server 200 (step S508) and, in step S510, obtain a second recognition result from the server 200. Here, the second recognition result may not include a call word recognition result. The second recognition result may include a speech recognition result for the non-call word part 402. The speech recognition apparatus 100 may generate first output information based on the speech recognition result for the non-call word part 402.

On the other hand, when the speech recognition apparatus 100 determines to perform re-recognition of the call word part 401, it may transmit the call word part 401 and the non-call word part 402 together to the server 200 (step S512). When the speech recognition apparatus 100 has transmitted a voice signal including the call word part 401 to the server 200 in step S512, the second recognition result described above may include the call word recognition result of the server 200. Here, the call word recognition result of the server 200 represents the server's determination of whether the call word was detected from the voice signal. In step S514, the speech recognition apparatus 100 may determine, based on the second recognition result, whether the call word was detected in the voice signal. When the second recognition result indicates that the call word was detected from the voice signal, the speech recognition apparatus 100 may provide first output information (step S516). Here, the first output information may be output information generated based on a speech recognition result for providing a service. The speech recognition apparatus 100 may generate the output information based on the second recognition result. In this case, the second recognition result may include a speech recognition result for the non-call word part 402 of the voice signal, and the speech recognition apparatus 100 may generate the first output information based on that result. The speech recognition for the non-call word part 402 may be performed by the server 200 or by an external device connected to the server 200. The speech recognition result for the non-call word part 402 may also represent the speech recognition result for providing the service described above.

Conversely, when the second recognition result in step S514 indicates that the call word was not detected from the voice signal, the speech recognition apparatus 100 may provide second output information (step S518). Here, the second output information may be information indicating that the call word was not detected from the voice signal. For example, the second output information may be the detection signal described above with reference to FIG. 2. Alternatively, unlike FIG. 5, the speech recognition apparatus 100 may not provide any output information.
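The branching of steps S504 to S518 can be summarized in a short sketch. Everything named here is hypothetical: `run_flow`, the `ServerFn` callback that stands in for the server 200, and the returned strings are illustrative only and are not part of the disclosure.

```python
from typing import Callable, Optional, Tuple

# Hypothetical stand-in for the server 200: takes whether the call word part
# was sent, and returns (server call word decision or None, service result text).
ServerFn = Callable[[bool], Tuple[Optional[bool], str]]


def run_flow(first_detected: bool, recheck_call_word: bool, server: ServerFn) -> str:
    """Follow the branches of FIG. 5 and return which output is provided."""
    if not first_detected:                  # S504: device did not detect the call word
        return "second_output"              # S518
    if not recheck_call_word:               # S506: no re-recognition requested
        _, service = server(False)          # S508/S510: send only the non-call word part
        return f"first_output:{service}"
    server_detected, service = server(True)  # S512: send both parts
    if server_detected:                     # S514
        return f"first_output:{service}"    # S516
    return "second_output"                  # S518


def fake_server_accepts(sent_call_part: bool) -> Tuple[Optional[bool], str]:
    return (True if sent_call_part else None, "weather")


def fake_server_rejects(sent_call_part: bool) -> Tuple[Optional[bool], str]:
    return (False if sent_call_part else None, "weather")
```

In this sketch the server's decision overrides the device's decision whenever the call word part is sent, which mirrors the FIG. 5 flow rather than the combined-score variants.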

In FIG. 5, when the first recognition result and the second recognition result differ, the speech recognition apparatus 100 may select the second recognition result as the final call word detection result. For example, when the first recognition result indicates that the call word was detected and the second recognition result indicates that it was not, the speech recognition apparatus 100 may determine that the call word was not detected from the voice signal. In this case, the speech recognition apparatus 100 may not generate output information for providing a service. The second recognition result may be the result of a more precise similarity determination against an acoustic model than the first recognition result, and may therefore be more accurate. This is because the server 200 can use more resources to produce the second recognition result than are available for the first recognition result. Here, a resource may mean storage space used for speech recognition, such as memory or a buffer. A resource may also mean the time or the number of times that data operations are processed by a processor. For example, when more resources are allocated to speech recognition, a higher-order filter can be used when filtering the voice signal. As another example, when more resources are allocated to speech recognition, real-number or complex-number operations can yield finer-grained processing results. The server 200 according to an embodiment of the present disclosure may include at least one processor with higher data processing performance than the speech recognition apparatus 100, and may have larger storage space than the speech recognition apparatus 100. For example, the second recognition result may be a recognition result produced with an acoustic model that includes a larger number of Gaussian distributions than the model used for the first recognition result. Here, the Gaussian distributions represent the acoustic features included in the acoustic model used for call word detection. In this way, the speech recognition apparatus 100 may obtain from the server 200 a recognition result produced using more information than the first recognition result.

Meanwhile, the speech recognition apparatus 100 according to an embodiment of the present disclosure may also obtain the final recognition result for the call word by combining the first recognition result and the second recognition result in a manner different from FIG. 5. For example, the first recognition result and the second recognition result may each include a similarity between the voice signal and the acoustic model corresponding to the call word. When the first recognition result includes a similarity, the speech recognition apparatus 100 may combine the first recognition result and the second recognition result to determine whether the call word is finally detected in the voice signal. Specifically, when the first recognition result is '0.8' and the second recognition result is '0.5', the speech recognition apparatus 100 may obtain the result value '(0.8+0.5)/2=0.65'. The speech recognition apparatus 100 may then compare the result value with a reference similarity to obtain the final call word recognition result. When the reference similarity is '0.6', the speech recognition apparatus 100 may determine that the call word was detected from the voice signal. According to an embodiment, the speech recognition apparatus 100 may generate the final call word recognition result as a weighted sum of the first recognition result and the second recognition result. For example, the speech recognition apparatus 100 may determine the weighting parameter applied to each of the first recognition result and the second recognition result. The speech recognition apparatus 100 may apply weighting parameters of '3' and '7' to the first recognition result and the second recognition result, respectively, and obtain the result value '(2.4+3.5)/(3+7)=0.59'. When the reference similarity is '0.6', the speech recognition apparatus 100 may determine that the call word was not detected from the voice signal.
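The two worked figures above (a plain average of 0.65 and a 3:7 weighted value of 0.59, both compared with a reference similarity of 0.6) can be reproduced with a small sketch. The function names are hypothetical, and treating a score exactly equal to the reference as "not detected" is an assumption, since the disclosure does not address the boundary case.

```python
def combined_similarity(first: float, second: float,
                        w_first: float = 1.0, w_second: float = 1.0) -> float:
    """Weighted average of the device (first) and server (second) similarities."""
    total = w_first + w_second
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (w_first * first + w_second * second) / total


def call_word_detected(score: float, reference: float = 0.6) -> bool:
    """Assumption: detection requires the score to exceed the reference."""
    return score > reference


# Plain average from the text: (0.8 + 0.5) / 2, about 0.65, above the reference.
plain = combined_similarity(0.8, 0.5)

# Weighted sum from the text: (3*0.8 + 7*0.5) / (3 + 7), about 0.59, below the reference.
weighted = combined_similarity(0.8, 0.5, w_first=3, w_second=7)
```

The same pair of similarities yields opposite decisions depending on the weights, which shows how weighting toward the server result makes the final decision stricter in this example.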

Meanwhile, in step S506 described above, the speech recognition apparatus 100 may determine whether to transmit the call word part based on the call history corresponding to the reception environment in which the voice signal was obtained. The following describes, according to an embodiment, how the speech recognition apparatus 100 transmits at least a part of the voice signal to the server 200 based on the call history when the first recognition result for the voice signal indicates that the call word was detected. For example, the reception environment may include at least one of the time at which the speech recognition apparatus was called, the specific user who called it, and the illuminance (luminance) of the space in which the speech recognition apparatus was located when it was called. Based on the call history, the speech recognition apparatus 100 may calculate a call frequency corresponding to at least one of the time at which the voice signal was obtained, the user 300 who uttered the speech corresponding to the voice signal, and information about the environment surrounding the speech recognition apparatus 100. Here, the call frequency represents the cumulative number of times the speech recognition apparatus 100 has been called in the corresponding situation.

For example, the speech recognition apparatus 100 may transmit at least a part of the voice signal to the server based on the call history corresponding to the time at which the voice signal was obtained. Specifically, the call history may include a call frequency for each time, according to the times at which the speech recognition apparatus 100 was called. The speech recognition apparatus 100 may calculate the call frequency corresponding to the time at which the voice signal was obtained. FIG. 6 is a diagram illustrating an example of a call history associated with a speech recognition apparatus according to an embodiment of the present disclosure. FIG. 6 shows the call frequency by time associated with the speech recognition apparatus, for a case in which the speech recognition apparatus 100 obtains a first voice signal 61 at '6:30 pm' and a second voice signal 62 at '2:50 am'. The speech recognition apparatus 100 may transmit the call word part to the server 200 based on the call frequency by time (for example, 'call frequency (device)'). The speech recognition apparatus 100 may calculate a first call frequency corresponding to a first time 601 based on the time at which the first voice signal 61 was obtained, and may compare the first call frequency with a reference value. Here, the reference value may be a threshold used to decide whether to transmit the call word part to the server 200. For example, when the first call frequency is greater than the reference value, the speech recognition apparatus 100 may transmit to the server 200 the voice signal excluding the first call word part, where the first call word part is the call word part included in the first voice signal 61. On the other hand, when the reference value is '30', the speech recognition apparatus 100 may transmit to the server 200 the second call word part, which is the call word part included in the second voice signal 62. This is because the second call frequency corresponding to the time at which that voice signal was obtained is smaller than the reference value. The speech recognition apparatus 100 may calculate the second call frequency corresponding to a second time 602 based on the time at which the second voice signal 62 was obtained.
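The time-based decision of FIG. 6 can be sketched as a lookup against a reference value. The reference value of 30 and the two acquisition times come from the text; the hourly bucketing, the counts in `history`, the function name, and the handling of a count exactly equal to the reference value are assumptions made for illustration.

```python
from typing import Dict


def should_send_call_word_part(calls_by_hour: Dict[int, int], hour: int, reference: int) -> bool:
    """Decide whether the call word part goes to the server for re-recognition.

    Per the text, a call frequency greater than the reference value means the
    call word part is excluded. Assumption: a count equal to the reference
    value is treated like a smaller one, so the call word part is sent.
    """
    return calls_by_hour.get(hour, 0) <= reference


# Hypothetical call history keyed by hour of day (0-23); the counts are invented.
history = {18: 120, 2: 5}

send_first = should_send_call_word_part(history, 18, 30)   # 6:30 pm falls in hour 18
send_second = should_send_call_word_part(history, 2, 30)   # 2:50 am falls in hour 2
```

With these invented counts, the evening utterance skips re-recognition while the early-morning utterance is re-checked by the server, matching the two cases described for the first and second voice signals.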

In addition, the speech recognition apparatus 100 may use the 'call frequency (server)' 603 of FIG. 6 to determine whether to transmit the call word part. The 'call frequency (server)' 603 may include the call history of other speech recognition devices. That is, the speech recognition apparatus 100 may use a history of calls made to speech recognition devices other than the speech recognition apparatus 100 itself. Here, another speech recognition device may be a device that provides a speech recognition service using the server 200 to which the speech recognition apparatus 100 is connected. Another speech recognition device may also be a device installed in a place similar to the place where the speech recognition apparatus 100 is installed. For example, it may be a device installed within a preset area around the location where the speech recognition apparatus 100 is installed, or a device installed in the same geographic region as the speech recognition apparatus 100. Specifically, when the speech recognition apparatus 100 obtains a voice signal at a time when the call frequency of the other speech recognition devices is high, the speech recognition apparatus 100 may not transmit the call word part 401 to the server 200. This is because, in that case, the first recognition result indicating that the call word was detected from the voice signal is highly reliable. The speech recognition apparatus 100 may obtain information about the place where each speech recognition apparatus connected to the speech recognition apparatus 100 or to the server 200 is installed. This information may include the region in which the speech recognition apparatus is installed and the usage characteristics of the place (for example, home or office).

According to an embodiment, the speech recognition apparatus 100 may determine whether to transmit the wake-up word part based on a reliability. Here, the reliability reflects how likely the first recognition result, which indicates that the wake-up word was detected in the acquired voice signal, is to be erroneous: the higher the reliability, the less likely the first recognition result is an error. The speech recognition apparatus 100 may determine the reliability based on the call history. For example, it may set the reliability in proportion to the call frequency. The speech recognition apparatus 100 may then determine, based on the determined reliability, whether to transmit the wake-up word part to the server. Specifically, when a first call frequency corresponding to first time information is greater than a second call frequency corresponding to second time information, the speech recognition apparatus 100 may set a first reliability corresponding to the first time information to a higher value than a second reliability corresponding to the second time information. When the reliability is higher than a preset value, the speech recognition apparatus 100 may transmit to the server 200 only the portion of the acquired voice signal that excludes the wake-up word part. Conversely, when the reliability is lower than the preset value, the speech recognition apparatus 100 may transmit the entire voice signal, including the wake-up word part, to the server 200.
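As a rough illustration of the reliability rule, the following Python sketch sets each time slot's reliability in proportion to its share of all recorded calls and then compares it with a preset value. The normalization to the range 0 to 1, the slot labels, and the handling of equality (the full signal is sent) are assumptions of this sketch; the disclosure only states proportionality and the higher/lower comparison.

```python
def reliability_by_time(call_counts):
    """Map each time slot to a reliability proportional to its call frequency.

    `call_counts` maps a time-slot label to the number of recorded calls.
    Dividing by the total is an assumed normalization.
    """
    total = sum(call_counts.values())
    if total == 0:
        return {slot: 0.0 for slot in call_counts}
    return {slot: count / total for slot, count in call_counts.items()}


def parts_to_send(reliability, preset_value):
    """Choose which parts of the voice signal go to the server."""
    if reliability > preset_value:
        # Local detection is trusted: withhold the wake-up word part.
        return ("non_wake_word",)
    # Otherwise send the whole signal so the server can re-check the wake-up word.
    return ("wake_word", "non_wake_word")


reliability = reliability_by_time({"morning": 6, "night": 2})
print(parts_to_send(reliability["morning"], preset_value=0.5))
print(parts_to_send(reliability["night"], preset_value=0.5))
```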

Here, the preset value may be a value stored in advance in the service providing system. The preset value may be set by the speech recognition apparatus 100 or by the provider of the speech recognition service, or by a specific user who receives the service through the speech recognition apparatus 100. For example, the speech recognition apparatus 100 may obtain the preset value from the server 200, or it may use a value stored in advance inside the speech recognition apparatus 100. The preset value may also be determined according to the network environment. For example, when the network can accommodate a large amount of data traffic, the preset value may be set higher than when the available traffic capacity is small, because with sufficient capacity the speech recognition apparatus 100 can more easily transmit the voice signal to the server 200.

According to an embodiment, when the first recognition result includes a similarity between the voice signal and the wake-up word, the speech recognition apparatus 100 may calculate a result value based on the first recognition result and the reliability. When the result value is greater than or equal to a preset value, the speech recognition apparatus 100 may transmit to the server 200 only the non-wake-up word part, that is, the voice signal with the wake-up word part excluded. Conversely, when the result value is less than or equal to the preset value, the speech recognition apparatus 100 may transmit the entire voice signal, including the wake-up word part, to the server 200.
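The disclosure does not say how the similarity and the reliability are combined into a result value. The Python sketch below assumes, purely for illustration, that both lie between 0 and 1 and are multiplied; it also resolves the boundary case by treating a result value equal to the preset value as sufficient to withhold the wake-up word part.

```python
def result_value(similarity, reliability):
    """Combine similarity and reliability (the product is an assumed choice)."""
    return similarity * reliability


def send_wake_word_part(similarity, reliability, preset_value):
    """Return True when the full signal, including the wake-up word part, is sent."""
    return result_value(similarity, reliability) < preset_value


# Strong local match in a high-frequency context: only the command is sent.
print(send_wake_word_part(similarity=0.9, reliability=0.9, preset_value=0.6))
# Same match in a low-frequency context: the server re-checks the wake-up word.
print(send_wake_word_part(similarity=0.9, reliability=0.5, preset_value=0.6))
```

A weighted sum or a learned score would fit the same interface; the product is used here only because it keeps the example short.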

Meanwhile, according to an embodiment of the present disclosure, the call history may include a per-user call history for the users who have called the speech recognition apparatus. Here, a per-user call history means the history of calls made by a specific user to a specific speech recognition apparatus. In this case, the speech recognition apparatus 100 may transmit at least part of the voice signal to the server 200 based on the per-user call history. One reason is that when a call history exists for the user who uttered the speech corresponding to the voice signal, the reliability of a first recognition result indicating that the wake-up word was detected may be high. For example, the speech recognition apparatus 100 may recognize speech uttered by a user whose speech it has recognized before more accurately than speech it is collecting for the first time. For this purpose, the speech recognition apparatus 100 may use deep learning with a neural network. Another reason is that a user with an existing call history is more likely to have actually called the speech recognition apparatus 100 than a user without one.

For example, the speech recognition apparatus 100 may determine whether a call history exists for the user 300 who uttered the speech corresponding to the acquired voice signal. Specifically, the speech recognition apparatus 100 may identify the user 300 from the voice signal: it may extract a voice pattern from the voice signal to obtain user identification information for the user 300, and then obtain the per-user call history corresponding to that user based on the user identification information. The speech recognition apparatus 100 may then transmit at least part of the voice signal to the server 200 based on the determination result and the first recognition result. Specifically, when a call history corresponding to the user exists, the speech recognition apparatus 100 may transmit only the non-wake-up word part to the server. When no call history corresponding to the user exists, the speech recognition apparatus 100 may transmit both the wake-up word part and the non-wake-up word part to the server.
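The per-user decision can be sketched in Python as follows. The sketch follows the rule stated in the dependent claim on per-user call history: an existing history for the speaker means only the non-wake-up word part is sent, and a missing history means both parts are sent. Speaker identification from the voice pattern is outside the sketch; the `user_id` argument and the dictionary layout of the history are hypothetical.

```python
def select_parts_for_user(user_id, per_user_history):
    """Pick the parts of the voice signal to send, given the identified speaker.

    `user_id` is the identification result obtained from the voice pattern
    (None when the speaker could not be identified, an assumption here).
    `per_user_history` maps a user id to that user's recorded calls.
    """
    has_history = bool(per_user_history.get(user_id))
    if has_history:
        # Known caller: the local wake-up word detection is trusted.
        return ("non_wake_word",)
    # Unknown caller: let the server verify the wake-up word as well.
    return ("wake_word", "non_wake_word")


history = {"user-a": ["2018-05-01T08:00", "2018-05-02T08:05"]}
print(select_parts_for_user("user-a", history))
print(select_parts_for_user("user-b", history))
```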

For example, the speech recognition apparatus 100 may not transmit to the server 200 the wake-up word part of a voice signal corresponding to speech uttered by the user 300. This may be the case when the speech recognition apparatus 100 is installed in a space associated with the user 300. Specifically, a space associated with the user 300 may include at least one of the home where the user 300 lives and the office where the user 300 is based. Further, when the call frequency corresponding to a specific user is greater than a preset frequency, the speech recognition apparatus 100 may determine that the space in which it is installed is a space associated with that user. The call frequency corresponding to a specific user may include that user's call frequency for the speech recognition apparatus 100. It may also include that user's call frequency for other speech recognition devices installed within a preset area around the location of the speech recognition apparatus 100.

According to an embodiment of the present disclosure, the call history may include a call frequency for each illuminance level, that is, the number of calls made to the speech recognition apparatus at each illuminance. The speech recognition apparatus 100 may transmit at least part of the voice signal to the server 200 based on the per-illuminance call frequency. This is because the speech recognition apparatus 100 may be a device whose service depends on the illuminance of the space in which it is installed at the time it is called. For example, when the speech recognition apparatus 100 has a lighting function, the call frequency below a preset illuminance may be greater than the call frequency at or above that illuminance. The speech recognition apparatus 100 may acquire illuminance information indicating the illuminance of the space in which it is installed at the time the voice signal is acquired, and may calculate from the call history the call frequency corresponding to that illuminance information. The speech recognition apparatus 100 may then transmit the wake-up word part to the server 200 depending on the calculated call frequency. Specifically, when the call frequency corresponding to the illuminance is greater than a preset frequency, the speech recognition apparatus 100 may transmit to the server 200 at least part of the voice signal excluding the wake-up word part. When the call frequency is smaller than the preset frequency, the speech recognition apparatus 100 may transmit both the wake-up word part and the non-wake-up word part to the server 200.
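A minimal Python sketch of the illuminance-based rule is given below. It assumes illuminance is measured in lux and grouped into three bands; the band boundaries, the function names, and the handling of a frequency equal to the preset frequency (the wake-up word part is sent) are illustrative assumptions, not values from the disclosure.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical band boundaries in lux: dark / dim / bright.
BAND_EDGES_LUX = (50.0, 300.0)


def illuminance_band(lux):
    """Return the index of the band (0, 1, or 2) that contains `lux`."""
    return bisect_right(BAND_EDGES_LUX, lux)


def calls_per_band(call_lux_history):
    """Count past calls per illuminance band from the lux value at each call."""
    return Counter(illuminance_band(lux) for lux in call_lux_history)


def send_wake_word_part_for_lux(current_lux, call_lux_history, preset_frequency):
    """Return True when the wake-up word part should also go to the server."""
    frequency = calls_per_band(call_lux_history)[illuminance_band(current_lux)]
    # Frequent calls at this illuminance: trust local detection, withhold the part.
    return not frequency > preset_frequency


# A lamp-like device that is mostly called in the dark.
past_calls_lux = [10.0, 20.0, 30.0, 500.0]
print(send_wake_word_part_for_lux(15.0, past_calls_lux, preset_frequency=2))
print(send_wake_word_part_for_lux(400.0, past_calls_lux, preset_frequency=2))
```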

FIG. 7 is a flowchart illustrating a method of operating the speech recognition apparatus 100 according to an embodiment of the present disclosure. Referring to FIG. 7, in step S702, the speech recognition apparatus 100 may acquire a voice signal. In step S704, the speech recognition apparatus 100 may generate a first recognition result indicating whether the wake-up word is detected in the voice signal. In step S706, the speech recognition apparatus 100 may transmit at least part of the voice signal to the server 200 based on the call history and the first recognition result. Specifically, the speech recognition apparatus 100 may determine, based on the call history and the first recognition result, whether to transmit the wake-up word part to the server 200, and may then transmit at least part of the voice signal to the server 200 according to that determination. In step S708, the speech recognition apparatus 100 may generate output information based on at least one of the second recognition result obtained from the server and the first recognition result. In step S710, the speech recognition apparatus 100 may output the generated output information. For example, when the final wake-up word recognition result indicates that the wake-up word was not detected in the voice signal, the speech recognition apparatus 100 may provide output information indicating the wake-up word detection result. When the final wake-up word recognition result indicates that the wake-up word was detected, the speech recognition apparatus 100 may provide output information for providing the service. Through the method described above, the speech recognition apparatus 100 may reduce the false recognition rate of wake-up word recognition. In addition, because the speech recognition apparatus 100 selectively transmits the wake-up word part of the voice signal to the server 200 based on the call history, it can reduce the false recognition rate while using communication resources efficiently.
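The flow of FIG. 7 can be summarized in a short Python sketch. The local detector, the call-history decision, and the server are injected as callables; their names and signatures, the `VoiceSignal` type, and the output strings are hypothetical. The sketch also assumes that when the wake-up word part is sent, the server's result is taken as the final detection result.

```python
from dataclasses import dataclass


@dataclass
class VoiceSignal:
    wake_word_part: bytes
    non_wake_word_part: bytes


def handle_signal(signal, detect_locally, should_send_wake_word_part, server_recognize):
    """Run steps S704 to S710 on an already acquired signal (S702)."""
    # S704: first recognition result from the on-device detector.
    first_result = detect_locally(signal.wake_word_part)
    if not first_result:
        return "wake-up word not detected"

    # S706: decide from the call history whether the server re-checks the wake-up word.
    send_wake = should_send_wake_word_part()
    wake_payload = signal.wake_word_part if send_wake else None
    second_result, command = server_recognize(wake_payload, signal.non_wake_word_part)

    # S708: the final detection result comes from the server only if it was asked.
    detected = second_result if send_wake else first_result
    if not detected:
        return "wake-up word not detected"

    # S710: output information for providing the service.
    return f"service: {command}"


def fake_server(wake_part, command_part):
    """Stand-in server that rejects every wake-up word it is asked to verify."""
    verdict = False if wake_part is not None else None
    return verdict, command_part.decode()


signal = VoiceSignal(b"hey device", b"lights on")
print(handle_signal(signal, lambda part: True, lambda: False, fake_server))
print(handle_signal(signal, lambda part: True, lambda: True, fake_server))
```

With the stand-in server, the first call trusts the local result and yields a service response, while the second call lets the server overrule a local false detection.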

Some embodiments may also be implemented in the form of a recording medium containing computer-executable instructions, such as program modules executed by a computer. A computer-readable medium may be any available medium that can be accessed by a computer, and may include volatile and non-volatile media as well as removable and non-removable media. The computer-readable medium may also include a computer storage medium. Computer storage media may include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. In this specification, a "unit" may be a hardware component such as a processor or a circuit, and/or a software component executed by a hardware component such as a processor.

The foregoing description of the present disclosure is provided for illustration, and those of ordinary skill in the art to which the present disclosure pertains will understand that it can easily be modified into other specific forms without changing the technical idea or essential features of the present disclosure. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in a combined form.

Claims

A speech recognition apparatus that provides a service through wake-up word recognition, the apparatus comprising:
a voice receiver configured to acquire a voice signal, wherein the voice signal is divided into a wake-up word part corresponding to a wake-up word and a non-wake-up word part corresponding to speech other than the wake-up word;
a processor configured to generate a first recognition result indicating a wake-up word detection result for the wake-up word part,
when the first recognition result indicates that the wake-up word is detected in the voice signal, determine, based on the first recognition result and a call history corresponding to a reception environment in which the voice signal is acquired, whether to transmit the wake-up word part of the voice signal to an external server together with the non-wake-up word part,
transmit at least part of the voice signal to the server according to the determination,
when the wake-up word part of the voice signal is transmitted to the server together with the non-wake-up word part, generate output information based on a second recognition result, which includes the server's wake-up word detection result for the wake-up word part, and a recognition result of the non-wake-up word part, and
when the wake-up word part of the voice signal is not transmitted to the server, generate output information based on the first recognition result and the recognition result of the non-wake-up word part; and
an output unit configured to output the generated output information.

(Deleted)

The apparatus of claim 1,
wherein the call history includes a per-user call history for each user who calls the speech recognition apparatus, and
the processor
determines, based on the per-user call history corresponding to the user who uttered the speech corresponding to the voice signal, whether to transmit the wake-up word part of the voice signal to the server together with the non-wake-up word part.

The apparatus of claim 3, wherein
the processor
extracts a voice pattern from the voice signal to obtain user identification information identifying the user, and
obtains the per-user call history corresponding to the user based on the user identification information.

The apparatus of claim 4, wherein
the processor
determines whether a per-user call history corresponding to the user exists,
when the determination shows that a call history corresponding to the user exists, transmits the non-wake-up word part to the server, and
when no call history corresponding to the user exists, transmits the wake-up word part and the non-wake-up word part to the server.

The apparatus of claim 1, wherein
the processor
determines, based on the call history and time information indicating the time at which the voice signal is acquired, whether to transmit the wake-up word part of the voice signal to the server together with the non-wake-up word part.

The apparatus of claim 6,
wherein the call history includes a call frequency for each time at which the speech recognition apparatus is called, and
the processor
calculates a call frequency corresponding to the time information based on the call history, and
determines, based on the calculated call frequency, whether to transmit the wake-up word part of the voice signal to the server together with the non-wake-up word part.

The apparatus of claim 6, wherein
the processor
obtains, based on the call history, a reliability indicating the possibility of an error in the first recognition result, and
determines, based on a result of comparing the reliability with a preset value, whether to transmit the wake-up word part of the voice signal to the server together with the non-wake-up word part,
wherein, when a first call frequency corresponding to first time information is greater than a second call frequency corresponding to second time information, a first reliability corresponding to the first time information is set to a higher value than a second reliability corresponding to the second time information.

The apparatus of claim 8, wherein
the processor
transmits to the server the non-wake-up word part of the voice signal, excluding the wake-up word part, when the reliability is greater than or equal to the preset value.

The apparatus of claim 8, wherein
the processor
transmits the wake-up word part and the non-wake-up word part of the voice signal to the server when the reliability is less than or equal to the preset value.

The apparatus of claim 1, wherein
the processor
obtains from the server a call history corresponding to another speech recognition device other than the speech recognition apparatus, and
transmits at least part of the voice signal to the server based on the call history corresponding to the other speech recognition device,
wherein the other speech recognition device provides a speech recognition service through the same server as the server connected to the speech recognition apparatus.

The apparatus of claim 1,
wherein the call history includes a call frequency for each illuminance at which the speech recognition apparatus is called, and
the processor
calculates, based on the per-illuminance call frequency, the call frequency corresponding to illuminance information indicating the illuminance of the space in which the speech recognition apparatus is installed at the time the voice signal is acquired, and
determines, based on the calculated call frequency, whether to transmit the wake-up word part of the voice signal to the server together with the non-wake-up word part.

A method of operating a speech recognition apparatus that provides a service through wake-up word recognition, the method comprising:
acquiring a voice signal, wherein the voice signal is divided into a wake-up word part corresponding to a wake-up word and a non-wake-up word part corresponding to speech other than the wake-up word;
generating a first recognition result indicating a wake-up word detection result for the wake-up word part;
when the first recognition result indicates that the wake-up word is detected in the voice signal, determining, based on the first recognition result and a call history corresponding to a reception environment in which the voice signal is acquired, whether to transmit the wake-up word part of the voice signal to an external server together with the non-wake-up word part;
transmitting at least part of the voice signal to the server according to the determination;
when the wake-up word part of the voice signal is transmitted to the server together with the non-wake-up word part, generating output information based on a second recognition result, which includes the server's wake-up word detection result for the wake-up word part, and a recognition result of the non-wake-up word part;
when the wake-up word part of the voice signal is not transmitted to the server, generating output information based on the first recognition result and the recognition result of the non-wake-up word part; and
outputting the generated output information.

A recording medium readable by an electronic device, having recorded thereon a program for executing the method of claim 13 on the electronic device.