KR101752119B1 - Hotword detection on multiple devices

Info

Publication number: KR101752119B1
Application number: KR1020167021778A
Authority: KR
Inventors: Matthew Sharifi (매튜 샤리피)
Original assignee: Google Inc. (구글 인코포레이티드)
Priority date: 2014-10-09
Filing date: 2015-09-29
Publication date: 2017-06-28
Also published as: JP7022733B2; EP3627503B1; US20210118448A1; DE202015010012U1; CN106030699A; US10593330B2; US10134398B2; JP7354210B2; US11557299B2; EP4280210A2; US11915706B2; JP2019133198A; US20240169992A1; EP3171359A1; EP3139378B1; JP2022017569A; EP3084759B1; JP6893951B2; US9514752B2; KR101832648B1

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for hotword detection on multiple devices are disclosed. In one aspect, a method includes the actions of receiving, by a first computing device, audio data that corresponds to an utterance. The actions further include determining a first value corresponding to a likelihood that the utterance includes a hotword. The actions further include receiving a second value corresponding to a likelihood that the utterance includes the hotword, the second value being determined by a second computing device. The actions further include comparing the first value and the second value. The actions further include, based on comparing the first value and the second value, initiating speech recognition processing on the audio data.

Description

HOTWORD DETECTION ON MULTIPLE DEVICES

This specification generally relates to systems and techniques for recognizing the words that a person is speaking, otherwise referred to as speech recognition.

The reality of a speech-enabled home or other environment (that is, one in which a user need only speak a query or command out loud and a computer-based system will field and answer the query and/or cause the command to be performed) is upon us. A speech-enabled environment (e.g., home, workplace, school, etc.) can be implemented using a network of connected microphone devices distributed throughout the various rooms or areas of the environment. Through such a network of microphones, a user can orally query the system from essentially anywhere in the environment without needing a computer or other device in front of them or even nearby. For example, while cooking in the kitchen, a user might ask the system "how many milliliters in three cups?" and, in response, receive an answer from the system, e.g., in the form of synthesized voice output. Alternatively, a user might ask the system questions such as "when does my nearest gas station close?" or, while preparing to leave the house, "should I wear a coat today?"

Further, a user may ask a query of the system, and/or issue a command, that relates to the user's personal information. For example, a user might ask the system "when is my meeting with John?" or command the system to "remind me to call John when I get back home."

For a speech-enabled system, the users' manner of interacting with the system is designed to be primarily, if not exclusively, by means of voice input. Consequently, the system, which potentially picks up all utterances made in the surrounding environment, including those not directed to the system, must have some way of discerning when any given utterance is directed at the system rather than, e.g., at an individual present in the environment. One way to accomplish this is to use a hotword, which by agreement among the users in the environment is reserved as a predetermined word that is spoken to invoke the attention of the system. In an example environment, the hotword used to invoke the system's attention is the words "OK computer." Consequently, each time the words "OK computer" are spoken, they are picked up by a microphone and conveyed to the system, which performs speech recognition techniques to determine whether the hotword was spoken and, if so, awaits an ensuing command or query. Accordingly, utterances directed at the system take the general form [HOTWORD] [QUERY], where "HOTWORD" in this example is "OK computer" and "QUERY" can be any question, command, declaration, or other request that can be speech recognized, parsed, and acted on by the system, either alone or in conjunction with a server via the network.

According to one innovative aspect of the subject matter described in this specification, a user device receives an utterance that is spoken by a user. The user device determines whether the utterance includes a hotword and computes a hotword confidence score that indicates the likelihood that the utterance includes the hotword. The user device transmits this score to other user devices in the vicinity. The other user devices likely received the same utterance. The other user devices compute their own hotword confidence scores and transmit their scores to the user device. The user device compares the hotword confidence scores. If the user device has the highest hotword confidence score, then the user device remains active and prepares to process additional audio. If the user device does not have the highest hotword confidence score, then the user device does not process the additional audio.

In general, another innovative aspect of the subject matter described in this specification may be embodied in methods that include the actions of receiving, by a first computing device, audio data that corresponds to an utterance; determining a first value corresponding to a likelihood that the utterance includes a hotword; receiving a second value corresponding to a likelihood that the utterance includes the hotword, the second value being determined by a second computing device; comparing the first value and the second value; and, based on comparing the first value and the second value, initiating speech recognition processing on the audio data.
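The sequence of actions above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function name `handle_utterance`, the three callables it takes, and the rule that recognition starts only when the first value is strictly greater than the second (the specification says only that the decision is based on the comparison).

```python
from typing import Callable, Sequence


def handle_utterance(
    audio: Sequence[float],
    score_hotword: Callable[[Sequence[float]], float],
    receive_peer_value: Callable[[], float],
    start_recognition: Callable[[Sequence[float]], None],
) -> bool:
    """Run the five actions once and report whether recognition was started."""
    # Determine the first value: this device's hotword likelihood.
    first_value = score_hotword(audio)
    # Receive the second value, determined by a second computing device.
    second_value = receive_peer_value()
    # Compare the values. Assumption: a tie does not trigger recognition.
    if first_value > second_value:
        start_recognition(audio)
        return True
    return False


# Hypothetical stand-ins for the scorer, the transport, and the recognizer.
started = []
triggered = handle_utterance(
    audio=[0.0, 0.1, -0.1],
    score_hotword=lambda _audio: 0.85,
    receive_peer_value=lambda: 0.6,
    start_recognition=started.append,
)
```

Passing the scorer, the transport, and the recognizer in as callables keeps the sketch independent of any particular audio stack or network.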

These and other embodiments can each optionally include one or more of the following features. The actions further include determining that the first value satisfies a hotword score threshold. The actions further include transmitting the first value to the second computing device. The actions further include determining an activation state of the first computing device based on comparing the first value and the second value. The action of determining an activation state of the first computing device based on comparing the first value and the second value further includes determining that the activation state is an active state. The actions further include receiving, by the first computing device, additional audio data that corresponds to an additional utterance; determining a third value corresponding to a likelihood that the additional utterance includes the hotword; receiving a fourth value corresponding to a likelihood that the utterance includes the hotword, the fourth value being determined by a third computing device; comparing the first value and the second value; and, based on comparing the first value and the second value, determining that the activation state of the first computing device is an inactive state.

The action of transmitting the first value to the second computing device further includes transmitting the first value to a server, through a local network, or through a short range radio. The action of receiving a second value corresponding to a likelihood that the utterance includes the hotword, the second value being determined by a second computing device, further includes receiving, from the server, through the local network, or through the short range radio, the second value that was determined by the second computing device. The actions further include identifying the second computing device; and determining that the second computing device is configured to respond to utterances that include the hotword. The action of transmitting the first value to the second computing device further includes transmitting a first identifier for the first computing device. The action of receiving a second value corresponding to a likelihood that the utterance includes the hotword, the second value being determined by a second computing device, further includes receiving a second identifier for the second computing device. The action of determining that the activation state is an active state further includes determining that a particular amount of time has elapsed since receiving the audio data that corresponds to the utterance. The actions further include continuing, for a particular amount of time, to transmit the first value based on determining that the activation state is an active state.

Other embodiments of this aspect include corresponding systems, apparatus, and computer programs recorded on computer storage devices, each configured to perform the operations of the methods.

Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. Multiple devices can detect a hotword, and only one device will respond to the hotword.

The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

FIG. 1 is a diagram of an example system for hotword detection.
FIG. 2 is a diagram of an example process for hotword detection.
FIG. 3 shows an example of a computing device and a mobile computing device.
Like reference numbers and designations in the various drawings indicate like elements.

In the not too distant future, it is possible that many devices may be continuously listening for hotwords. When a single user has multiple devices trained to respond to their voice (e.g., a phone, tablet, TV, etc.), it may be desirable to suppress responding to hotwords on devices that are not likely to be the ones the user intends to address. For example, when a user speaks the hotword toward one device, if any of their other devices are nearby, it is likely that they will also trigger a voice search. In many cases, this is not the user's intention. Thus, it may be advantageous if only a single device triggers, specifically the device the user is speaking to. The present specification addresses the problem of selecting the correct device for reacting to a hotword and suppressing reaction to the hotword on other devices.

FIG. 1 is a diagram of an example system 100 for hotword detection. In general, system 100 illustrates a user 102 speaking an utterance 104 that is detected by microphones of computing devices 106, 108, and 110. The computing devices 106, 108, and 110 process the utterance 104 to determine a likelihood that the utterance 104 includes a hotword. The computing devices 106, 108, and 110 each transmit data to each other that indicates the likelihood that the utterance 104 includes a hotword. The computing devices 106, 108, and 110 each compare the data, and the computing device that computed the highest likelihood that the utterance 104 included a hotword initiates speech recognition on the utterance 104. The computing devices that did not compute the highest likelihood that the utterance 104 includes a hotword do not initiate speech recognition on the speech following the utterance 104.

Before transmitting, to another computing device, data that indicates a likelihood that the utterance 104 corresponds to a hotword, the computing devices that are located near each other identify each other. In some implementations, the computing devices identify each other by searching the local network for other devices that are configured to respond to the hotword. For example, computing device 106 may search the local area network for other devices that are configured to respond to the hotword and identify computing device 108 and computing device 110.

In some implementations, the computing devices identify other nearby computing devices that are configured to respond to the hotword by identifying the user who is logged into each device. For example, user 102 is logged into computing devices 106, 108, and 110. The user 102 has the computing device 106 in the user's hand. The computing device 108 sits on the table, and the computing device 110 is located on a nearby wall. Computing device 106 detects computing devices 108 and 110, and each computing device shares information that is related to the user who is logged into the computing device, such as a user identifier. In some implementations, the computing devices may identify other nearby computing devices that are configured to respond to the hotword by identifying, through speaker identification, computing devices that are configured to respond when the hotword is spoken by the same user. For example, the user 102 configured the computing devices 106, 108, and 110 each to respond to the voice of user 102 when user 102 speaks the hotword. The computing devices share the speaker identification information by providing a user identifier for user 102 to each other computing device. In some implementations, the computing devices may identify other computing devices that are configured to respond to the hotword through short range radio. For example, the computing device 106 may transmit a signal through short range radio searching for other computing devices that are configured to respond to the hotword. The computing devices may employ one of these techniques or a combination of them to identify other computing devices that are configured to respond to the hotword.

Once the computing devices 106, 108, and 110 have identified other computing devices that are configured to respond to the hotword, the computing devices 106, 108, and 110 share and store device identifiers for the identified computing devices. The identifiers may be based on a type of device, an IP address of the device, a MAC address, a name given to the device by a user, or any similar unique identifier. For example, the device identifier 112 for computing device 106 may be "phone." The device identifier 114 for computing device 108 may be "tablet." The device identifier 116 for computing device 110 may be "thermostat." The computing devices 106, 108, and 110 store the device identifiers for the other computing devices that are configured to respond to the hotword. Each computing device has a device group in which the computing device stores the device identifiers. For example, computing device 106 has device group 118 that lists "tablet" and "thermostat" as the two devices that will receive the likelihood, as computed by the computing device 106, that the audio data includes the hotword. The computing device 108 has device group 120 that lists "phone" and "thermostat" as the two devices that will receive the likelihood, as computed by the computing device 108, that the audio data includes the hotword. The computing device 110 has device group 122 that lists "phone" and "tablet" as the two devices that will receive the likelihood, as computed by the computing device 110, that the audio data includes the hotword.
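The device groups described above amount to a per-device list of peer identifiers. A minimal sketch of that bookkeeping follows; the function name `build_device_groups` and the dictionary representation are illustrative assumptions, not something the specification prescribes.

```python
from typing import Dict, List


def build_device_groups(device_ids: List[str]) -> Dict[str, List[str]]:
    """Map each device identifier to the identifiers of every other device."""
    return {
        own: [other for other in device_ids if other != own]
        for own in device_ids
    }


# The three identifiers used in the example of FIG. 1.
groups = build_device_groups(["phone", "tablet", "thermostat"])
```

With these inputs, `groups["phone"]` plays the role of device group 118, `groups["tablet"]` of device group 120, and `groups["thermostat"]` of device group 122.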

When the user 102 speaks the utterance 104, "OK computer," each computing device that has a microphone in the vicinity of the user 102 detects and processes the utterance 104. Each computing device detects the utterance 104 through an audio input device such as a microphone. Each microphone provides audio data to a respective audio subsystem. The respective audio subsystem buffers, filters, and digitizes the audio data. In some implementations, each computing device may also perform endpointing and speaker identification on the audio data. The audio subsystem provides the processed audio data to a hotworder. The hotworder compares the processed audio data to known hotword data and computes a confidence score that indicates the likelihood that the utterance 104 corresponds to a hotword. The hotworder may extract audio features from the processed audio data, such as filterbank energies or mel-frequency cepstral coefficients. The hotworder may use classifying windows to process these audio features, for example by using a support vector machine or a neural network. Based on the processing of the audio features, the hotworder 124 computes a confidence score of 0.85, hotworder 126 computes a confidence score of 0.6, and hotworder 128 computes a confidence score of 0.45. In some implementations, the confidence score may be normalized to a scale of zero to one, with a higher number indicating a greater confidence that the utterance 104 includes a hotword.
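To make the idea of classifying windows concrete, here is a deliberately simplified sketch. It is a toy stand-in, not the hotworder of the specification: it uses per-frame log energy instead of filterbank energies or mel-frequency cepstral coefficients, and a hand-set linear scorer squashed by a sigmoid instead of a trained support vector machine or neural network. All names, weights, and the bias are hypothetical.

```python
import math
from typing import List, Sequence


def frame_log_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Split samples into non-overlapping frames and return each frame's log energy."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        energies.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    return energies


def sigmoid(x: float) -> float:
    """Numerically stable logistic function; maps any real number into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def hotword_confidence(features: Sequence[float],
                       weights: Sequence[float],
                       bias: float) -> float:
    """Slide a window of len(weights) over the features and keep the best score."""
    window = len(weights)
    best = 0.0
    for start in range(0, len(features) - window + 1):
        chunk = features[start:start + window]
        raw = bias + sum(w * f for w, f in zip(weights, chunk))
        best = max(best, sigmoid(raw))
    return best


# Toy input: a loud burst followed by silence, then pure silence.
loud_then_quiet = [0.5] * 160 + [0.0] * 160
silence = [0.0] * 320
loud_score = hotword_confidence(frame_log_energies(loud_then_quiet, 80), [1.0, 1.0], 3.0)
quiet_score = hotword_confidence(frame_log_energies(silence, 80), [1.0, 1.0], 3.0)
```

The sigmoid keeps every score on the zero-to-one scale mentioned above, so scores computed on different devices are directly comparable.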

Each computing device transmits a respective confidence score data packet to the other computing devices in its device group. Each confidence score data packet includes the respective confidence score and the respective device identifier for the computing device. For example, the computing device 106 transmits the confidence score data packet 130, which includes the confidence score of 0.85 and the device identifier "phone," to the computing devices in device group 118, computing devices 108 and 110. The computing device 108 transmits the confidence score data packet 132, which includes the confidence score of 0.6 and the device identifier "tablet," to the computing devices in device group 120, computing devices 106 and 110. The computing device 110 transmits the confidence score data packet 134, which includes the confidence score of 0.45 and the device identifier "thermostat," to the computing devices in device group 122, computing devices 106 and 108.
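A confidence score data packet carries just two fields, which the following sketch models as a small record. The class name `ConfidenceScorePacket`, the field names, and the in-memory `inboxes` dictionary that stands in for the network are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ConfidenceScorePacket:
    device_id: str     # identifier of the sending device, e.g. "phone"
    confidence: float  # hotword confidence score computed by the sender


def send_to_group(packet: ConfidenceScorePacket,
                  device_group: List[str],
                  inboxes: Dict[str, List[ConfidenceScorePacket]]) -> None:
    """Deliver the packet to every peer listed in the sender's device group."""
    for peer in device_group:
        inboxes.setdefault(peer, []).append(packet)


# Hypothetical in-memory transport: one inbox list per receiving device.
inboxes: Dict[str, List[ConfidenceScorePacket]] = {}
send_to_group(ConfidenceScorePacket("phone", 0.85), ["tablet", "thermostat"], inboxes)
```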

In some implementations, a computing device may transmit the confidence score data packet only if the confidence score satisfies a hotword score threshold. For example, if the hotword score threshold is 0.5, then the computing device 110 would not transmit the confidence score data packet 134 to the other computing devices in device group 122. The computing devices 106 and 108 would still transmit the confidence score data packets 130 and 132 to the computing devices in device groups 118 and 120, respectively.
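The threshold gate can be sketched in a few lines. The specification does not say whether "satisfies" means greater than or greater than or equal to; the sketch assumes the latter, and the constant and function names are hypothetical.

```python
HOTWORD_SCORE_THRESHOLD = 0.5  # example threshold from the text


def satisfies_threshold(confidence: float,
                        threshold: float = HOTWORD_SCORE_THRESHOLD) -> bool:
    """Assumption: a score equal to the threshold counts as satisfying it."""
    return confidence >= threshold


# Scores computed by the three devices in the example of FIG. 1.
scores = {"phone": 0.85, "tablet": 0.6, "thermostat": 0.45}
senders = [device for device, score in scores.items() if satisfies_threshold(score)]
```

Here only the phone and the tablet would transmit, which matches the example in which computing device 110 stays silent.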

In some implementations, the computing device that transmits a confidence score data packet may transmit the confidence score data packet to the other computing devices directly. For example, computing device 106 may transmit the confidence score data packet 130 to computing devices 108 and 110 through a short range radio. The communication protocol used between two computing devices may be universal plug and play. In some implementations, a computing device that transmits a confidence score data packet may broadcast the confidence score data packet. In this instance, the confidence score data packet may be received by the computing devices in the device group and by other computing devices. In some implementations, a computing device that transmits a confidence score data packet may transmit the confidence score data packet to a server, and the server then transmits the confidence score data packet to the computing devices in the device group. The server may be located within the local area network of the computing devices or may be accessible through the Internet. For example, the computing device 108 sends the confidence score data packet 132 and the list of computing devices in device group 120 to a server. The server transmits the confidence score data packet 132 to computing devices 106 and 110. In instances where a computing device transmits the confidence score data packet to another computing device, the receiving computing device may send back a confirmation that it received the confidence score data packet.
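The server-relayed variant, together with the optional confirmation from receivers, might look like the sketch below. The function `relay_via_server`, the `deliver` callback, and the dictionary-shaped packet are hypothetical; a real deployment would replace `deliver` with an actual network call.

```python
from typing import Callable, Dict, List

Packet = Dict[str, object]


def relay_via_server(packet: Packet,
                     recipients: List[str],
                     deliver: Callable[[str, Packet], bool]) -> List[str]:
    """Forward the packet to each recipient and return those that confirmed receipt."""
    confirmed = []
    for recipient in recipients:
        # deliver() returns True when the recipient sends back a confirmation.
        if deliver(recipient, packet):
            confirmed.append(recipient)
    return confirmed


# Hypothetical in-memory delivery that always confirms receipt.
received: Dict[str, List[Packet]] = {}


def deliver(recipient: str, packet: Packet) -> bool:
    received.setdefault(recipient, []).append(packet)
    return True


# The tablet sends its packet plus its device group to the server.
acks = relay_via_server({"device_id": "tablet", "confidence": 0.6},
                        ["phone", "thermostat"], deliver)
```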

Each computing device uses a score comparer to compare its own hotword confidence score with the hotword confidence scores that it has received. For example, the computing device 106 computed a hotword confidence score of 0.85 and received hotword confidence scores of 0.6 and 0.45. In this instance, the score comparer 136 compares the three scores and identifies the score of 0.85 as the highest. For computing devices 108 and 110, the score comparers 138 and 140 reach similar conclusions, identifying the score of 0.85, which corresponds to computing device 106, as the highest.
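A score comparer of this kind reduces to taking a maximum over the device's own score and the scores it received. In the sketch below, the function name and the dictionary of received scores are assumptions, and ties are broken in favor of the peer that was received first, which the specification does not address.

```python
from typing import Dict


def highest_scoring_device(own_id: str,
                           own_score: float,
                           received: Dict[str, float]) -> str:
    """Return the identifier of the device with the highest confidence score."""
    scores = dict(received)      # copy so the caller's dictionary is untouched
    scores[own_id] = own_score   # add this device's own score last
    return max(scores, key=scores.get)


# Each of the three devices in FIG. 1 runs the same comparison locally.
winner_on_phone = highest_scoring_device("phone", 0.85, {"tablet": 0.6, "thermostat": 0.45})
winner_on_tablet = highest_scoring_device("tablet", 0.6, {"phone": 0.85, "thermostat": 0.45})
winner_on_thermostat = highest_scoring_device("thermostat", 0.45, {"phone": 0.85, "tablet": 0.6})
```

Because every device sees the same three scores, each one reaches the same conclusion without any central coordinator.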

The computing device that determines that its own hotword confidence score is the highest initiates speech recognition on the speech data that follows the hotword utterance. For example, the user may speak "OK computer," and computing device 106 may determine that it has the highest hotword confidence score. The computing device 106 will initiate speech recognition on audio data received after the hotword. If the user speaks "call Alice," then the computing device 106 will process the utterance and execute the appropriate command. In some implementations, receiving a hotword may cause the computing devices that receive the hotword to activate from a sleep state. In this instance, the computing device with the highest hotword confidence score remains in an awake state, while the other computing devices that do not have the highest hotword confidence score do not process speech data that follows the hotword utterance and enter a sleep state.

As illustrated in FIG. 1, the score comparer 136 identified the hotword confidence score corresponding to computing device 106 as the highest. Therefore, the device status 142 is "awake." The score comparers 138 and 140 also identified the hotword confidence score corresponding to computing device 106 as the highest. Therefore, the device statuses 144 and 146 are "asleep." In some implementations, the activation state of a computing device may be unaffected. For example, the user 102 may be watching a movie on the computing device 108 while holding the computing device 106 in the user's hand. When the user 102 speaks "OK computer," the computing device 106, by virtue of having the highest hotword confidence score, initiates speech recognition on the audio data that follows the hotword. The computing device 108 does not initiate speech recognition on the audio data that follows the hotword and continues to play the movie.

In some implementations, the computing device that determines that it has the highest hotword confidence score waits for a particular amount of time before beginning to perform speech recognition on the speech following the hotword. Bounding the wait in this way lets the computing device that calculated the highest hotword confidence score begin speech recognition on the speech following the hotword without waiting indefinitely for a higher hotword confidence score. To illustrate, score comparator 136 of computing device 106 received hotword confidence scores of 0.6 and 0.45 from computing devices 108 and 110, respectively, as well as the hotword confidence score of 0.85 from hotworder 124. From the time hotworder 124 calculates the hotword confidence score for the "Ok computer" audio data, computing device 106 waits five hundred milliseconds before performing speech recognition on the speech following the hotword. When a score comparator receives a higher score, the computing device need not wait for the particular amount of time before setting its device state to "sleep." For example, hotworder 126 of computing device 108 calculates a hotword confidence score of 0.6 and receives hotword confidence scores of 0.85 and 0.45. Once computing device 108 receives the hotword confidence score of 0.85, it can set device state 144 to "sleep." This assumes that computing device 108 receives the hotword confidence score of 0.85 within the particular amount of time after hotworder 126 calculates the hotword confidence score of 0.6.
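
The waiting behavior described above can be sketched as a small function. This is only an illustration under stated assumptions: the name `decide`, the event format, and the arrival times (120, 150, and 200 milliseconds) are hypothetical; only the scores and the 500 millisecond window come from the example.

```python
def decide(own_score, arrivals, window_ms=500):
    """Decide a device state from peer scores received inside a wait window.

    arrivals: iterable of (ms_after_own_score_was_computed, peer_score).
    Returns (state, decision_time_ms).
    """
    for t, score in sorted(arrivals):
        if t > window_ms:
            break  # scores arriving after the window are ignored
        if score > own_score:
            return ("sleep", t)  # a higher score ends the wait immediately
    return ("awake", window_ms)  # nothing higher arrived within the window


# Device 106 (0.85) hears 0.6 and 0.45; device 108 (0.6) hears 0.85 and 0.45.
result_106 = decide(0.85, [(120, 0.6), (200, 0.45)])
result_108 = decide(0.6, [(150, 0.85), (200, 0.45)])
print(result_106, result_108)
```

In this sketch device 106 waits out the full window before waking, while device 108 goes to sleep as soon as the 0.85 score arrives.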

In some implementations, when a computing device has the highest hotword confidence score, it may continue to broadcast its confidence score data packet for a particular amount of time to ensure that the other computing devices receive it. This strategy is most applicable when a computing device returns an acknowledgment upon receiving a confidence score data packet from another computing device. Therefore, if computing device 106 sends confidence score data packet 130 to the computing devices in device group 118 and receives an acknowledgment before the particular amount of time, such as five hundred milliseconds, has elapsed, computing device 106 may begin performing speech recognition on the speech following the hotword. When computing devices broadcast their confidence score data packets and do not expect an acknowledgment, a computing device may continue to broadcast its hotword confidence score for a particular amount of time, such as five hundred milliseconds, or until it receives a higher hotword confidence score, whichever comes first. For example, computing device 110 calculates a hotword confidence score of 0.45 and begins to broadcast confidence score data packet 134. After three hundred milliseconds, computing device 110 receives confidence score data packet 130 and stops broadcasting confidence score data packet 134, because the hotword confidence score of 0.85 from confidence score data packet 130 is higher than the hotword confidence score of 0.45. As another broadcast example, computing device 106 calculates a hotword confidence score of 0.85 and begins to broadcast confidence score data packet 130. After five hundred milliseconds, computing device 106 stops broadcasting confidence score data packet 130 and begins performing speech recognition on the speech following the hotword. Computing device 106 may receive confidence score data packets 132 and 134 before the five hundred milliseconds have elapsed, but because the hotword confidence scores in confidence score data packets 132 and 134 are lower than 0.85, computing device 106 continues to wait until the five hundred milliseconds have elapsed.
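
The "broadcast until timeout or until a higher score arrives, whichever comes first" rule can be sketched as below. The function name `broadcast_schedule` and the 100 millisecond rebroadcast interval are assumptions made for illustration; the description does not specify how often a packet is rebroadcast.

```python
def broadcast_schedule(own_score, incoming, timeout_ms=500, interval_ms=100):
    """Return (send_times_ms, stop_time_ms) for repeated score broadcasts.

    incoming: list of (arrival_ms, peer_score) pairs.
    Broadcasting stops at the timeout or at the first higher peer score.
    """
    higher_arrivals = [t for t, score in incoming if score > own_score]
    stop = min([timeout_ms] + higher_arrivals)
    return list(range(0, stop, interval_ms)), stop


# Device 110 (0.45) receives the 0.85 packet after 300 ms and stops early.
sends_110, stop_110 = broadcast_schedule(0.45, [(300, 0.85)])

# Device 106 (0.85) only receives lower scores and broadcasts for the full 500 ms.
sends_106, stop_106 = broadcast_schedule(0.85, [(100, 0.6), (200, 0.45)])
print(stop_110, sends_110, stop_106, sends_106)
```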

In some implementations, a computing device may begin performing speech recognition on the speech following the hotword and continue until it receives a higher hotword confidence score. The hotworder calculates a hotword confidence score and, if the score satisfies a threshold, the computing device performs speech recognition on the speech following the hotword. The computing device may perform speech recognition without displaying any indication of the speech recognition to the user. This may be desirable because it gives the user the impression that the computing device is not active, while allowing the computing device to display results based on the speech recognition more quickly than if it had waited to confirm that it calculated the highest hotword score. As an example, computing device 106 calculates a hotword confidence score of 0.85 and begins performing speech recognition on the speech following the hotword. Computing device 106 receives confidence score data packets 132 and 134 and determines that the hotword confidence score of 0.85 is the highest. Computing device 106 continues to perform speech recognition on the speech following the hotword and presents the results to the user. For computing device 108, hotworder 126 calculates a hotword confidence score of 0.6, and computing device 108 begins performing speech recognition on the speech following the hotword without displaying data to the user. Once computing device 108 receives confidence score data packet 130, which includes the hotword confidence score of 0.85, it stops performing speech recognition. No data is displayed to the user, and the user is likely to have the impression that computing device 108 remained in the "sleep" state.
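
A rough sketch of this "recognize silently, cancel on a higher score" behavior follows. The class name `SpeculativeRecognizer`, its attributes, and the threshold value of 0.5 are hypothetical; the description states only that the score must satisfy a threshold.

```python
class SpeculativeRecognizer:
    """Starts recognition silently and cancels if a peer reports a higher score."""

    def __init__(self, own_score, threshold=0.5):  # threshold value is assumed
        self.own_score = own_score
        self.recognizing = own_score >= threshold  # start without any UI indication
        self.results_shown = False

    def on_peer_score(self, peer_score):
        if peer_score > self.own_score:
            self.recognizing = False  # another device won; stop quietly

    def finish(self):
        # Results are shown only if recognition was never cancelled.
        self.results_shown = self.recognizing
        return self.results_shown


device_106 = SpeculativeRecognizer(0.85)
for score in (0.6, 0.45):
    device_106.on_peer_score(score)

device_108 = SpeculativeRecognizer(0.6)
device_108.on_peer_score(0.85)

shown_106 = device_106.finish()
shown_108 = device_108.finish()
print(shown_106, shown_108)
```

Device 106 presents its results, while device 108 never shows anything and so appears to have stayed asleep.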

In some implementations, to avoid any latency after the hotword is spoken, scores may be reported from the hotworder before the end of the hotword, for example, for a partial hotword. For example, as a user is saying "Ok computer," a computing device may calculate a partial hotword confidence score once the user has finished saying "OK comp." The computing device can then share the partial hotword confidence score with other computing devices. The computing device with the highest partial hotword confidence score can continue to process the user's speech.

In some implementations, a computing device may emit an audible or inaudible sound, for example, of a particular frequency or frequency pattern, when it determines that a hotword confidence score satisfies a threshold. The sound signals to other computing devices that the computing device will continue to process the audio data following the hotword. The other computing devices receive this sound and cease processing the audio data. For example, a user says "Ok computer." One of the computing devices calculates a hotword confidence score that is greater than or equal to a threshold. Once that computing device determines that the score is greater than or equal to the threshold, it emits a sound at eighteen kilohertz. Other computing devices in the vicinity of the user may also be calculating a hotword confidence score and may be in the middle of that calculation when they receive the sound. When the other computing devices receive the sound, they cease processing the user's speech. In some implementations, the computing device may encode the hotword confidence score in the audible or inaudible sound. For example, if the hotword confidence score is 0.5, the computing device may generate an audible or inaudible sound that includes a frequency pattern encoding the score of 0.5.
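
One way a score could be carried in a tone is sketched below. The description does not specify an encoding, so everything here is an assumption: scores are quantized to steps of 0.1 and mapped linearly onto 18.0 to 19.0 kHz, the tone is a 50 millisecond sine wave sampled at 48 kHz, and the receiver picks the candidate frequency with the most energy. The function names are hypothetical.

```python
import math

RATE = 48000       # samples per second (assumed)
N_SAMPLES = 2400   # 50 ms of audio (assumed)
BASE_HZ = 18000.0  # a score of 0.0 maps to 18 kHz (assumed mapping)
SPAN_HZ = 1000.0   # a score of 1.0 maps to 19 kHz (assumed mapping)


def score_to_freq(score):
    return BASE_HZ + SPAN_HZ * score


def make_tone(freq_hz):
    return [math.sin(2 * math.pi * freq_hz * i / RATE) for i in range(N_SAMPLES)]


def power_at(samples, freq_hz):
    # Energy of the signal at one frequency (a single DFT bin).
    re = sum(s * math.cos(2 * math.pi * freq_hz * i / RATE) for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq_hz * i / RATE) for i, s in enumerate(samples))
    return re * re + im * im


def decode_score(samples):
    # Try each quantized score level and keep the one with the most energy.
    levels = [k / 10 for k in range(11)]
    return max(levels, key=lambda level: power_at(samples, score_to_freq(level)))


decoded = decode_score(make_tone(score_to_freq(0.5)))
print(score_to_freq(0.5), decoded)
```

A real acoustic channel would add noise, echo, and speaker or microphone roll-off near 18 kHz, none of which this sketch models.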

In some implementations, the computing devices may use different audio metrics to select the computing device that continues processing the user's speech. For example, the computing devices may use loudness to determine which computing device will continue to process the user's speech. The computing device that detects the loudest speech may continue to process the user's speech. As another example, a computing device that is currently in use or has an active display may notify the other computing devices that it will continue to process the user's speech upon detecting a hotword.

In some implementations, each computing device in the vicinity of the user while the user is speaking receives the audio data and sends it to a server to improve speech recognition. Each computing device can receive the audio data that corresponds to the user's speech. Although only one computing device appears to the user to be processing the user's speech, each computing device can transmit the audio data to the server. The server can then use the audio data received from each computing device to improve speech recognition, because the server can compare different audio samples that correspond to the same utterance. For example, a user says "Ok computer, remind me to buy milk." Once the user finishes saying "Ok computer," the nearby computing devices will likely have determined which computing device has the highest hotword confidence score, and that computing device will process and respond to the words "remind me to buy milk" as the user speaks them. The other computing devices will also receive "remind me to buy milk." Although the other computing devices will not respond to the "remind me to buy milk" utterance, they can send audio data corresponding to "remind me to buy milk" to the server. The computing device that responds to "remind me to buy milk" can also send its audio data to the server. The server can process the audio data to improve speech recognition because it has different audio samples, from different computing devices, that correspond to the same "remind me to buy milk" utterance.

FIG. 2 is a diagram of an example process 200 for hotword detection. Process 200 may be performed by a computing device such as computing device 108 from FIG. 1. Process 200 computes a value that corresponds to a likelihood that an utterance includes a hotword and compares that value with other values computed by other computing devices to determine whether to perform speech recognition on the portion of the utterance after the hotword.

The computing device receives audio data that corresponds to an utterance (210). A user speaks the utterance, and a microphone of the computing device receives the audio data of the utterance. The computing device processes the audio data by buffering, filtering, endpointing, and digitizing it. As an example, the user may utter "Ok, computer," and the microphone of the computing device will receive the audio data that corresponds to "Ok, computer." An audio subsystem of the computing device will sample, buffer, filter, and endpoint the audio data for further processing by the computing device.

The computing device determines a first value that corresponds to a likelihood that the utterance includes a hotword (220). The computing device determines the first value, which may be referred to as a hotword confidence score, by comparing the audio data of the utterance to a group of audio samples that include the hotword or by analyzing the audio characteristics of the audio data of the utterance. The first value may be normalized to a scale from zero to one, where one indicates the highest likelihood that the utterance includes a hotword. In some implementations, the computing device identifies a second computing device and determines that the second computing device is configured to respond to utterances that include the hotword and has been configured by the user to be responsive to the hotword. The user may be logged into both the computing device and the second computing device. Both the computing device and the second computing device may be configured to respond to the user's voice. The computing device and the second computing device may be connected to the same local area network. Both the computing device and the second computing device may be located within a particular distance of each other, such as ten meters, as determined by GPS or signal strength. For example, the computing devices may communicate by a short-range radio. The computing device may detect the strength of a signal transmitted by the second computing device as five dBm and translate it to a corresponding distance, such as five meters.

The computing device receives a second value that corresponds to a likelihood that the utterance includes a hotword, the second value being determined by a second computing device (230). The second computing device receives the utterance through its microphone. It processes the received audio data that corresponds to the utterance and determines a second value, or second hotword confidence score. The second hotword confidence score reflects the likelihood, as calculated by the second computing device, that the utterance includes a hotword. In some implementations, the computing device transmits the first value to the second computing device using one or more of the following techniques. The computing device may transmit the first value to the second computing device through a server accessible through the Internet, through a server located on the local area network, or directly through the local area network or a short-range radio. The computing device may transmit the first value only to the second computing device, or it may broadcast the first value so that other computing devices may also receive it. The computing device may receive the second value from the second computing device using the same technique it used to transmit the first value or a different one.

In some implementations, the computing device may calculate a loudness score for the utterance or a signal-to-noise ratio for the utterance. The computing device may combine the loudness score, the signal-to-noise ratio, and the hotword confidence score to determine a new value for comparing with similar values from other computing devices. For example, the computing device may calculate a hotword confidence score and a signal-to-noise ratio. The computing device may then combine those two scores and compare the result with similarly calculated scores from other computing devices. In some implementations, the computing device may calculate different scores and transmit each score to other computing devices for comparison. For example, the computing device may calculate a loudness score for the utterance and a hotword confidence score. The computing device may then transmit those scores to other computing devices for comparison.
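
A combined value of this kind could be computed as a weighted sum, as in the sketch below. The description does not say how the scores are combined, so the weights (0.7 and 0.3), the 40 dB normalization ceiling, and the function name `combined_score` are all assumptions chosen for illustration.

```python
def combined_score(hotword_score, snr_db, w_hotword=0.7, w_snr=0.3, max_snr_db=40.0):
    """Blend a hotword confidence score (0..1) with a signal-to-noise ratio in dB."""
    # Clamp the SNR into 0..1 so both terms share the same scale.
    snr_norm = min(max(snr_db / max_snr_db, 0.0), 1.0)
    return w_hotword * hotword_score + w_snr * snr_norm


# Hypothetical readings for two devices hearing the same utterance.
near_device = combined_score(0.85, 20.0)
far_device = combined_score(0.6, 30.0)
print(near_device > far_device)
```

Every device would need to use the same weights and normalization for the comparison to be meaningful.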

In some implementations, the computing device may transmit a first identifier with the first value. The identifier may be based on one or more of an address of the computing device, a name of the computing device given by the user, or a location of the computing device. For example, an identifier may be "69.123.132.43" or "phone." Similarly, the second computing device may transmit a second identifier with the second value. In some implementations, the computing device may transmit the first identifier to particular computing devices that it previously identified as configured to respond to the hotword. For example, the computing device may have previously identified the second computing device as configured to respond to the hotword because, in addition to being able to respond to a hotword, the second computing device had the same user logged in as the computing device.
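
A value paired with an identifier could be represented and serialized as shown below. The wire format is not specified in this description; JSON, the class name `ConfidencePacket`, and its field names are assumptions made purely for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ConfidencePacket:
    device_id: str  # e.g. an address or a user-assigned name
    score: float    # hotword confidence score between 0 and 1

    def to_bytes(self):
        return json.dumps(asdict(self)).encode("utf-8")

    @classmethod
    def from_bytes(cls, data):
        return cls(**json.loads(data.decode("utf-8")))


sent = ConfidencePacket(device_id="phone", score=0.85)
received = ConfidencePacket.from_bytes(sent.to_bytes())
print(received)
```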

The computing device compares the first value and the second value (240). The computing device then initiates, based on the result of the comparison, speech recognition processing on the audio data (250). In some implementations, for example, the computing device initiates speech recognition when the first value is greater than or equal to the second value. If the user says "ok computer, call Carol," the computing device will begin processing "call Carol" by performing speech recognition on "call Carol" because the first value is greater than or equal to the second value. In some implementations, the computing device sets an activation state. When the first value is greater than or equal to the second value, the computing device sets the activation state to active, or "awake." In the "awake" state, the computing device displays results from the speech recognition.

In some implementations, the computing device compares the first value and the second value and determines that the first value is less than the second value. Based on that determination, the computing device sets the activation state to inactive, or "sleep." In the "sleep" state, the computing device does not appear to the user to be active or to be processing the audio data.

In some implementations, when the computing device determines that the first value is greater than or equal to the second value, it may wait for a particular amount of time before setting the activation state to active. The computing device may wait for the particular amount of time to increase the likelihood that it will not receive a higher value from another computing device. The particular amount of time may be fixed or may vary depending on the technique by which the computing devices transmit and receive values. In some implementations, when the computing device determines that the first value is greater than or equal to the second value, it may continue to transmit the first value for a particular amount of time. By continuing to transmit the first value for the particular amount of time, the computing device increases the likelihood that the first value is received by the other computing devices. When the computing device determines that the first value is less than the second value, it may stop transmitting the first value.

In some implementations, the computing device may consider additional information when determining whether to execute the command that follows the hotword. One example of additional information is the portion of the utterance that follows the hotword. Typically, the audio data that follows the hotword corresponds to a command for the computing device, such as "call Sally," "play Halloween Movie," or "set heat to 70 degrees." The computing device may identify a typical device that handles the type of request or that is capable of handling the request. A request to call a person would typically be handled by a phone, based on pre-programmed typical usages or on the usage patterns of the device's user. If the user typically watches movies on a tablet, the tablet may handle a request to play a movie. If a thermostat is capable of adjusting the temperature, the thermostat may handle temperature adjustments.

For the computing device to consider the portion of the utterance that follows the hotword, it would have to initiate speech recognition on the audio data once it likely identifies a hotword. The computing device may categorize the command portion of the utterance and compute the frequency of commands in that category. The computing device may transmit the frequency along with the hotword confidence score to other computing devices. Each computing device may use the frequencies and the hotword confidence scores to determine whether to execute the command that follows the hotword.

For example, if the user utters "OK computer, play Michael Jackson," then a computing device that is a phone the user uses to listen to music twenty percent of the time may transmit that information along with its hotword confidence score. A computing device such as a tablet that the user uses to listen to music five percent of the time may transmit that information along with its hotword confidence score to other computing devices. The computing devices may use a combination of the hotword confidence score and the percentage of time spent playing music to determine whether to execute the command.
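
The combination of hotword confidence score and usage percentage might look like the sketch below. The description gives only the usage figures (twenty percent for the phone, five percent for the tablet); the hotword scores of 0.6 and 0.7, the equal weights, and the function name `pick_device` are hypothetical.

```python
def pick_device(candidates, w_score=0.5, w_usage=0.5):
    """Return the name of the device with the best blended value.

    candidates: list of dicts with "name", "score" (0..1), and "usage" (0..1),
    where usage is the share of time the device is used for this command category.
    """
    def blended(candidate):
        return w_score * candidate["score"] + w_usage * candidate["usage"]

    return max(candidates, key=blended)["name"]


candidates = [
    {"name": "phone", "score": 0.6, "usage": 0.20},   # music 20% of the time
    {"name": "tablet", "score": 0.7, "usage": 0.05},  # music 5% of the time
]
chosen = pick_device(candidates)
print(chosen)
```

With these assumed numbers the phone is chosen even though the tablet has the higher hotword score, because the phone is used for music far more often.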

FIG. 3 shows an example of a computing device 300 and a mobile computing device 350 that can be used to implement the techniques described here. Computing device 300 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Mobile computing device 350 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant to be examples only and are not meant to be limiting.

Computing device 300 includes a processor 302, a memory 304, a storage device 306, a high-speed interface 308 connecting to memory 304 and multiple high-speed expansion ports 310, and a low-speed interface 312 connecting to a low-speed expansion port 314 and storage device 306. Each of processor 302, memory 304, storage device 306, high-speed interface 308, high-speed expansion ports 310, and low-speed interface 312 is interconnected using various buses and may be mounted on a common motherboard or in other manners as appropriate. Processor 302 can process instructions for execution within computing device 300, including instructions stored in memory 304 or on storage device 306, to display graphical information for a GUI on an external input/output device, such as a display 316 coupled to high-speed interface 308. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).

The memory 304 stores information within the computing device 300. In some implementations, the memory 304 is a volatile memory unit or units. In some implementations, the memory 304 is a non-volatile memory unit or units. The memory 304 may also be another form of computer-readable medium, such as a magnetic or optical disk.

The storage device 306 is capable of providing mass storage for the computing device 300. In some implementations, the storage device 306 may be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, a tape device, a flash memory or other similar solid-state memory device, or an array of devices, including devices in a storage area network or other configurations. Instructions can be stored in an information carrier. The instructions, when executed by one or more processing devices (for example, the processor 302), perform one or more methods, such as those described above. The instructions can also be stored by one or more storage devices such as computer- or machine-readable media (for example, the memory 304, the storage device 306, or memory on the processor 302).

The high-speed interface 308 manages bandwidth-intensive operations for the computing device 300, while the low-speed interface 312 manages lower bandwidth-intensive operations. This allocation of functions is an example only. In some implementations, the high-speed interface 308 is coupled to the memory 304, to the display 316 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 310, which may accept various expansion cards (not shown). In the implementation, the low-speed interface 312 is coupled to the storage device 306 and to the low-speed expansion port 314. The low-speed expansion port 314, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, or a scanner, or to a networking device such as a switch or router (e.g., through a network adapter).

The computing device 300 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 320, or multiple times in a group of such servers. In addition, it may be implemented in a personal computer such as a laptop computer 322. It may also be implemented as part of a rack server system 324. Alternatively, components from the computing device 300 may be combined with other components (not shown) in a mobile device, such as the mobile computing device 350. Each of such devices may contain one or more of the computing device 300 and the mobile computing device 350, and an entire system may be made up of multiple computing devices communicating with each other.

The mobile computing device 350 includes, among other components, a processor 352, a memory 364, an input/output device such as a display 354, a communication interface 366, and a transceiver 368. The mobile computing device 350 may also be provided with a storage device, such as a micro-drive or other device, to provide additional storage. The processor 352, the memory 364, the display 354, the communication interface 366, and the transceiver 368 are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.

The processor 352 can execute instructions within the mobile computing device 350, including instructions stored in the memory 364. The processor 352 may be implemented as a chipset of chips that include separate and multiple analog and digital processors. The processor 352 may provide, for example, for coordination of the other components of the mobile computing device 350, such as control of user interfaces, applications run by the mobile computing device 350, and wireless communication by the mobile computing device 350.

The processor 352 may communicate with a user through a control interface 358 and a display interface 356 coupled to the display 354. The display 354 may be, for example, a TFT (Thin-Film-Transistor Liquid Crystal Display) display, an OLED (Organic Light Emitting Diode) display, or other appropriate display technology. The display interface 356 may include appropriate circuitry for driving the display 354 to present graphical and other information to a user. The control interface 358 may receive commands from a user and convert them for submission to the processor 352. In addition, an external interface 362 may provide communication with the processor 352, so as to enable near-area communication of the mobile computing device 350 with other devices. The external interface 362 may provide, for example, wired communication in some implementations or wireless communication in other implementations, and multiple interfaces may also be used.

The memory 364 stores information within the mobile computing device 350. The memory 364 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units. An expansion memory 374 may also be provided and connected to the mobile computing device 350 through an expansion interface 372, which may include, for example, a SIMM (Single In Line Memory Module) card interface. The expansion memory 374 may provide extra storage space for the mobile computing device 350, or may also store applications or other information for the mobile computing device 350. Specifically, the expansion memory 374 may include instructions to carry out or supplement the processes described above, and may also include secure information. Thus, for example, the expansion memory 374 may be provided as a security module for the mobile computing device 350, and may be programmed with instructions that permit secure use of the mobile computing device 350. In addition, secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.

The memory may include, for example, flash memory and/or NVRAM memory (non-volatile random access memory), as discussed below. In some implementations, instructions are stored in an information carrier. The instructions, when executed by one or more processing devices (for example, the processor 352), perform one or more methods, such as those described above. The instructions can also be stored by one or more storage devices, such as one or more computer- or machine-readable media (for example, the memory 364, the expansion memory 374, or memory on the processor 352). In some implementations, the instructions can be received in a propagated signal, for example, over the transceiver 368 or the external interface 362.

The mobile computing device 350 may communicate wirelessly through the communication interface 366, which may include digital signal processing circuitry where necessary. The communication interface 366 may provide for communications under various modes or protocols, such as GSM voice calls (Global System for Mobile communications), SMS (Short Message Service), EMS (Enhanced Messaging Service), or MMS messaging (Multimedia Messaging Service), CDMA (code division multiple access), TDMA (time division multiple access), PDC (Personal Digital Cellular), WCDMA (Wideband Code Division Multiple Access), CDMA2000, or GPRS (General Packet Radio Service), among others. Such communication may occur, for example, through the transceiver 368 using a radio frequency. In addition, short-range communication may occur, for example, using Bluetooth, WiFi, or other such transceiver (not shown). In addition, a GPS (Global Positioning System) receiver module 370 may provide additional navigation- and location-related wireless data to the mobile computing device 350, which may be used as appropriate by applications running on the mobile computing device 350.

The mobile computing device 350 may also communicate audibly using an audio codec 360, which may receive spoken information from a user and convert it to usable digital information. The audio codec 360 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of the mobile computing device 350. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.), and may also include sound generated by applications operating on the mobile computing device 350.

The mobile computing device 350 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 380. It may also be implemented as part of a smartphone 382, a personal digital assistant, or other similar mobile device.

Various implementations of the systems and techniques described herein can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software, software applications, or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms machine-readable medium and computer-readable medium refer to any storage medium, including any apparatus and/or device (e.g., magnetic disks, optical disks, memory, programmable logic devices (PLDs)), used to provide machine instructions and/or data to a programmable processor, including a machine-readable storage medium that receives machine instructions as a machine-readable signal. The term machine-readable signal refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described herein can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described herein can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a client computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described herein), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.

The computing system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Although a few implementations have been described in detail above, other modifications are possible. For example, while a client application is described as accessing the delegate(s), in other implementations the delegate(s) may be employed by other applications implemented by one or more processors, such as an application executing on one or more servers. In addition, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. In addition, other actions may be provided, or actions may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other implementations are within the scope of the following claims.

Claims

A computer-implemented method for determining which of a plurality of computing devices is to perform automatic speech recognition, the method comprising:
Receiving, by a first computing device, audio data corresponding to an utterance;
Before starting automatic speech recognition processing on the audio data, processing the audio data using a classifier that classifies audio data as including a particular hotword or as not including the particular hotword;
Based on processing the audio data using the classifier that classifies audio data as including a particular hotword or as not including the particular hotword, determining a first value that reflects a first likelihood that the utterance includes the particular hotword;
Receiving a second value that reflects a second likelihood that the utterance includes the particular hotword, the second value being determined by a second computing device;
Comparing the first value that reflects the first likelihood that the utterance includes the particular hotword and the second value that reflects the second likelihood that the utterance includes the particular hotword; and
Based on comparing the first value that reflects the first likelihood that the utterance includes the particular hotword and the second value that reflects the second likelihood that the utterance includes the particular hotword, determining whether to start performing automatic speech recognition processing on the audio data.

The method according to claim 1, further comprising:
Determining that the first value satisfies a hotword score; and
Based on determining that the first value satisfies the hotword score, transmitting, to the second computing device, the first value that reflects the first likelihood that the utterance includes the particular hotword.
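
The gating step recited in this claim can be sketched as follows. This is only an illustrative sketch: the claim says the first value must "satisfy" a hotword score but does not say how, so the `>=` test, the constant `HOTWORD_SCORE_THRESHOLD`, the function `maybe_send_score`, and the `send` callback are all hypothetical names and assumptions rather than anything specified in the patent.

```python
from typing import Callable

# Assumed cutoff; the claim does not give a numeric value.
HOTWORD_SCORE_THRESHOLD = 0.5


def maybe_send_score(
    first_value: float,
    send: Callable[[float], None],
    threshold: float = HOTWORD_SCORE_THRESHOLD,
) -> bool:
    """Transmit first_value to the peer device only if it satisfies the threshold.

    `send` stands in for whatever transport reaches the second computing
    device (hypothetical). Returns True if the value was transmitted.
    """
    if first_value >= threshold:
        send(first_value)
        return True
    return False


if __name__ == "__main__":
    outbox = []
    maybe_send_score(0.8, outbox.append)  # transmitted
    maybe_send_score(0.2, outbox.append)  # suppressed
    print(outbox)
```

A device whose own score falls below the cutoff stays silent under this sketch, which keeps unlikely detections off the network.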

The method according to claim 1, further comprising:
Based on comparing the first value that reflects the first likelihood that the utterance includes the particular hotword and the second value that reflects the second likelihood that the utterance includes the particular hotword, determining an activation state of the first computing device.

The method of claim 3, wherein determining the activation state of the first computing device, based on comparing the first value that reflects the first likelihood that the utterance includes the particular hotword and the second value that reflects the second likelihood that the utterance includes the particular hotword, comprises:
Determining that the activation state of the first computing device is an active state.

The method according to claim 1, further comprising:
Receiving, by the first computing device, additional audio data corresponding to an additional utterance;
Before starting automatic speech recognition processing on the additional audio data, processing the additional audio data using the classifier that classifies audio data as including a particular hotword or as not including the particular hotword;
Based on processing the additional audio data using the classifier that classifies audio data as including a particular hotword or as not including the particular hotword, determining a third value that reflects a third likelihood that the additional utterance includes the particular hotword;
Receiving a fourth value that reflects a fourth likelihood that the additional utterance includes the particular hotword, the fourth value being determined by a third computing device;
Comparing the third value that reflects the third likelihood that the additional utterance includes the particular hotword and the fourth value that reflects the fourth likelihood that the additional utterance includes the particular hotword; and
Based on comparing the third value that reflects the third likelihood that the additional utterance includes the particular hotword and the fourth value that reflects the fourth likelihood that the additional utterance includes the particular hotword, determining whether to start performing automatic speech recognition processing on the additional audio data.

The method according to claim 1,
Wherein receiving the second value that reflects the second likelihood that the utterance includes the particular hotword, the second value being determined by the second computing device, comprises:
Receiving, from a server, through a local network, or through a short-range wireless communication channel, the second value that reflects the second likelihood that the utterance includes the particular hotword.

The method according to claim 1, further comprising:
Determining that the second computing device is configured to respond to utterances that include the particular hotword,
Wherein comparing the first value that reflects the first likelihood that the utterance includes the particular hotword and the second value that reflects the second likelihood that the utterance includes the particular hotword is performed in response to determining that the second computing device is configured to respond to utterances that include the particular hotword.

The method according to claim 1,
Wherein receiving the second value that reflects the second likelihood that the utterance includes the particular hotword, the second value being determined by the second computing device, comprises:
Receiving a second identifier of the second computing device.

The method of claim 4, wherein determining whether to start performing automatic speech recognition processing on the audio data further comprises determining that a particular amount of time has elapsed since receiving the audio data corresponding to the utterance.
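
One way to picture the elapsed-time condition is a short collection window during which the first device gathers scores from its peers before settling its activation state. The sketch below assumes that peer scores arrive on an in-process `queue.Queue`; the function names `collect_peer_scores` and `is_active`, and the rule that the device is active only when its score exceeds every received score, are illustrative assumptions and are not taken from the claims.

```python
import queue
import time
from typing import List


def collect_peer_scores(incoming: "queue.Queue", wait_seconds: float) -> List[float]:
    """Gather peer hotword scores until wait_seconds have elapsed."""
    deadline = time.monotonic() + wait_seconds
    scores: List[float] = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            # Block only for the time left in the window.
            scores.append(incoming.get(timeout=remaining))
        except queue.Empty:
            break
    return scores


def is_active(first_value: float, peer_scores: List[float]) -> bool:
    """Assumed rule: active only if the local score beats every peer score."""
    return all(first_value > score for score in peer_scores)


if __name__ == "__main__":
    inbox: "queue.Queue" = queue.Queue()
    inbox.put(0.4)
    inbox.put(0.6)
    peers = collect_peer_scores(inbox, wait_seconds=0.05)
    print(peers, is_active(0.9, peers))
```

Under this sketch a device that hears nothing from any peer within the window becomes active by default, since there is no competing score.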

The method of claim 4, further comprising:
Based on determining that the activation state is an active state, transmitting, for a particular amount of time, the first value that reflects the first likelihood that the utterance includes the particular hotword.

The method according to claim 1,
Wherein comparing the first value that reflects the first likelihood that the utterance includes the particular hotword and the second value that reflects the second likelihood that the utterance includes the particular hotword comprises determining that the first value that reflects the first likelihood that the utterance includes the particular hotword is greater than the second value that reflects the second likelihood that the utterance includes the particular hotword, and
Wherein determining whether to start performing automatic speech recognition processing on the audio data comprises:
Based on determining that the first value that reflects the first likelihood that the utterance includes the particular hotword is greater than the second value that reflects the second likelihood that the utterance includes the particular hotword, determining to start performing automatic speech recognition processing on the audio data.

The method according to claim 1,
Wherein comparing the first value that reflects the first likelihood that the utterance includes the particular hotword and the second value that reflects the second likelihood that the utterance includes the particular hotword comprises determining that the first value that reflects the first likelihood that the utterance includes the particular hotword is less than the second value that reflects the second likelihood that the utterance includes the particular hotword, and
Wherein determining whether to start performing automatic speech recognition processing on the audio data comprises:
Based on determining that the first value that reflects the first likelihood that the utterance includes the particular hotword is less than the second value that reflects the second likelihood that the utterance includes the particular hotword, determining not to start performing automatic speech recognition processing on the audio data.
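
The greater-than and less-than outcomes recited in the two preceding claims reduce to a single comparison, sketched below. The function name `should_start_asr` is hypothetical. The claims do not address the case where the two values are equal; this sketch treats a tie as "do not start", which is purely an assumption.

```python
def should_start_asr(first_value: float, second_value: float) -> bool:
    """Decide whether the first device starts automatic speech recognition.

    Greater local score -> start; smaller local score -> do not start.
    A tie returns False here, which is an assumption (not stated in the claims).
    """
    return first_value > second_value


if __name__ == "__main__":
    print(should_start_asr(0.85, 0.60))  # local device heard the hotword best
    print(should_start_asr(0.40, 0.75))  # the other device heard it better
```

Because each device applies the same rule to the same pair of values, at most one of two devices with distinct scores decides to start recognition.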

The method of claim 1, wherein processing the audio data using the classifier that classifies audio data as including a particular hotword or as not including the particular hotword comprises:
Extracting filterbank energies or mel-frequency cepstral coefficients from the audio data.
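
For readers unfamiliar with filterbank energies, the following is a minimal, standard-library-only sketch of computing log mel filterbank energies for one audio frame. It is a textbook-style illustration under stated assumptions, not the patent's implementation: the Hamming window, the naive DFT, the common mel formula `2595 * log10(1 + f / 700)`, the number of filters, and all function names are choices made here for illustration.

```python
import cmath
import math
from typing import List


def hz_to_mel(hz: float) -> float:
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame: List[float]) -> List[float]:
    """Naive O(n^2) DFT power spectrum of a Hamming-windowed frame (n >= 2)."""
    n = len(frame)
    windowed = [
        x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filterbank_energies(
    frame: List[float], sample_rate: int, num_filters: int = 10
) -> List[float]:
    """Log energies from triangular filters spaced evenly on the mel scale."""
    n = len(frame)
    spectrum = power_spectrum(frame)
    top_mel = hz_to_mel(sample_rate / 2.0)
    # num_filters + 2 edge points from 0 Hz to the Nyquist frequency.
    mel_points = [top_mel * i / (num_filters + 1) for i in range(num_filters + 2)]
    # Fractional DFT-bin positions of the filter edges.
    edges = [mel_to_hz(m) * n / sample_rate for m in mel_points]
    energies = []
    for j in range(1, num_filters + 1):
        left, center, right = edges[j - 1], edges[j], edges[j + 1]
        total = 0.0
        for k, power in enumerate(spectrum):
            if left < k <= center:
                total += power * (k - left) / (center - left)
            elif center < k < right:
                total += power * (right - k) / (right - center)
        energies.append(math.log(total + 1e-10))  # small floor avoids log(0)
    return energies


if __name__ == "__main__":
    rate = 8000
    tone = [math.sin(2.0 * math.pi * 1000.0 * t / rate) for t in range(64)]
    print(mel_filterbank_energies(tone, rate))
```

Mel-frequency cepstral coefficients, the other option named in the claim, are conventionally obtained by applying a discrete cosine transform to log energies like these; that step is omitted here.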

The method of claim 1, wherein processing the audio data using the classifier that classifies audio data as including a particular hotword or as not including the particular hotword comprises:
Processing the audio data using a support vector machine or a neural network.
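
As a rough illustration of how a classifier output could become a value reflecting likelihood, the sketch below evaluates a linear support-vector-machine decision function and squashes it with a logistic function. The weights, bias, feature vector, and the logistic mapping are all assumptions made for illustration; the claim names only the classifier family and does not say how a score is derived from it.

```python
import math
from typing import Sequence


def svm_decision(
    features: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Linear SVM decision value w . x + b (the weights here are hypothetical)."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def hotword_likelihood(
    features: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Map the decision value into (0, 1) with a logistic function (assumed)."""
    return 1.0 / (1.0 + math.exp(-svm_decision(features, weights, bias)))


if __name__ == "__main__":
    weights, bias = [1.0, -1.0], 0.0  # toy parameters, not trained
    print(hotword_likelihood([2.0, 0.0], weights, bias))
    print(hotword_likelihood([0.0, 2.0], weights, bias))
```

In practice the feature vector would come from acoustic features such as filterbank energies, and the parameters would be learned from labeled audio.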

A computing device comprising:
At least one storage device storing instructions that are operable, when executed by the computing device, to cause the computing device to perform operations for determining which of a plurality of computing devices is to perform automatic speech recognition, the operations comprising:
Receiving, by a first computing device, audio data corresponding to an utterance;
Before starting automatic speech recognition processing on the audio data, processing the audio data using a classifier that classifies audio data as including a particular hotword or as not including the particular hotword;
Based on processing the audio data using the classifier that classifies audio data as including a particular hotword or as not including the particular hotword, determining a first value that reflects a first likelihood that the utterance includes the particular hotword;
Receiving a second value that reflects a second likelihood that the utterance includes the particular hotword, the second value being determined by a second computing device;
Comparing the first value that reflects the first likelihood that the utterance includes the particular hotword and the second value that reflects the second likelihood that the utterance includes the particular hotword; and
Based on comparing the first value that reflects the first likelihood that the utterance includes the particular hotword and the second value that reflects the second likelihood that the utterance includes the particular hotword, determining whether to start performing automatic speech recognition processing on the audio data.

The computing device of claim 15, wherein the operations further comprise:
Determining that the first value satisfies a hotword score; and
Based on determining that the first value satisfies the hotword score, transmitting, to the second computing device, the first value that reflects the first likelihood that the utterance includes the particular hotword.

The computing device of claim 15, wherein the operations further comprise:
Based on comparing the first value that reflects the first likelihood that the utterance includes the particular hotword and the second value that reflects the second likelihood that the utterance includes the particular hotword, determining an activation state of the first computing device.

The computing device of claim 17, wherein the operation of determining the activation state of the first computing device, based on comparing the first value that reflects the first likelihood that the utterance includes the particular hotword and the second value that reflects the second likelihood that the utterance includes the particular hotword, comprises:
Determining that the activation state of the first computing device is an active state.

The computing device of claim 15, wherein the operations further comprise:
Receiving, by the first computing device, additional audio data corresponding to an additional utterance;
Before starting automatic speech recognition processing on the additional audio data, processing the additional audio data using the classifier that classifies audio data as including a particular hotword or as not including the particular hotword;
Based on processing the additional audio data using the classifier that classifies audio data as including a particular hotword or as not including the particular hotword, determining a third value that reflects a third likelihood that the additional utterance includes the particular hotword;
Receiving a fourth value that reflects a fourth likelihood that the additional utterance includes the particular hotword, the fourth value being determined by a third computing device;
Comparing the third value that reflects the third likelihood that the additional utterance includes the particular hotword and the fourth value that reflects the fourth likelihood that the additional utterance includes the particular hotword; and
Based on comparing the third value that reflects the third likelihood that the additional utterance includes the particular hotword and the fourth value that reflects the fourth likelihood that the additional utterance includes the particular hotword, determining whether to start performing automatic speech recognition processing on the additional audio data.

16. The method of claim 15,
Wherein receiving the second value determined by the second computing device, the second value reflecting the second possibility that the utterance includes the particular hot word, comprises:
Receiving, from a server, via a local network, or via a short-range wireless communication channel, the second value reflecting the second possibility that the utterance includes the particular hot word.

16. The method of claim 15, wherein the operations further comprise:
Determining that the second computing device is configured to respond to utterances that include the particular hot word,
Wherein comparing the first value reflecting the first possibility that the utterance includes the particular hot word and the second value reflecting the second possibility that the utterance includes the particular hot word is performed in response to determining that the second computing device is configured to respond to utterances that include the particular hot word.
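
As an illustration only, the condition in this claim can be sketched in Python: the first device compares its own value only against values from devices known to be configured to respond to the same hot word. The `PeerReport` type, its field names, and the function `values_to_compare` are hypothetical and do not appear in the claims.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PeerReport:
    """Hypothetical record describing what another device reported."""
    device_id: str
    hotword_value: float       # that device's possibility value for the hot word
    responds_to_hotword: bool  # whether that device is configured for this hot word


def values_to_compare(reports: Iterable[PeerReport]) -> List[float]:
    """Keep only values from devices configured to respond to the hot word.

    Assumption: a device that is not configured for the hot word is simply
    left out of the comparison.
    """
    return [r.hotword_value for r in reports if r.responds_to_hotword]


reports = [
    PeerReport("phone", 0.85, True),
    PeerReport("thermostat", 0.90, False),  # ignored: not configured for the hot word
    PeerReport("tablet", 0.60, True),
]
selected = values_to_compare(reports)
```

In this sketch the thermostat's value never enters the comparison, so it cannot prevent a configured device from responding.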

25. A computer readable storage medium storing software comprising instructions executable by one or more computers,
Wherein the instructions, when executed, cause the one or more computers to perform operations to determine which of a plurality of computing devices is to perform automatic speech recognition, the operations comprising:
Receiving, by the first computing device, audio data corresponding to the utterance;
Processing the audio data, before starting automatic speech recognition processing on the audio data, using a classifier that classifies audio data as including a particular hot word or not including the particular hot word;
Determining, based on processing the audio data using the classifier, a first value reflecting a first possibility that the utterance includes the particular hot word;
Receiving a second value determined by the second computing device, the second value reflecting the second possibility that the utterance includes the particular hot word;
Comparing the first value reflecting the first possibility that the utterance includes the particular hot word and the second value reflecting the second possibility that the utterance includes the particular hot word; and
Determining, based on comparing the first value and the second value, whether to start performing automatic speech recognition processing on the audio data.
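
The comparing and determining operations recited above can be sketched in Python as follows. This is a minimal illustration, not the claimed implementation: the claim requires only that the decision be based on the comparison, and the rule used here (start recognition only when the local value is strictly greater than every received value) is an assumption. The function name `should_start_asr` is hypothetical.

```python
from typing import Iterable


def should_start_asr(first_value: float, received_values: Iterable[float]) -> bool:
    """Decide whether this device starts automatic speech recognition.

    first_value: value computed locally by the hot word classifier.
    received_values: values determined by other computing devices.

    Assumption: the device with the strictly highest value proceeds;
    ties are not resolved in this sketch, so no device proceeds on a tie.
    """
    return all(first_value > other for other in received_values)


# A device that scored 0.85 hears from devices that scored 0.60 and 0.45.
start_here = should_start_asr(0.85, [0.60, 0.45])
# The device that scored 0.60 sees the 0.85 and stays idle.
start_there = should_start_asr(0.60, [0.85, 0.45])
```

With no received values, the function returns True, which corresponds to a single device handling the utterance on its own.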