KR102312218B1

KR102312218B1 - contextual hotwords

Info

Publication number: KR102312218B1
Application number: KR1020197021993A
Authority: KR
Inventors: Christopher Thaddeus Hughes; Ignacio Lopez Moreno; Aleksandar Kracun
Original assignee: Google LLC
Priority date: 2016-12-27
Filing date: 2017-08-01
Publication date: 2021-10-13
Also published as: US20190287528A1; US11430442B2; US10276161B2; JP2020503568A; CN116052661B; JP2021015281A; KR20210124531A; JP7078689B2; US10839803B2; KR102374519B1; US20210043210A1; WO2018125292A1; JP6780124B2; KR20190100334A; EP3485489B1; EP3485489A1; US20180182390A1; CN110140168B; CN116052661A; EP3627504A1

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for contextual hotwords are disclosed. In one aspect, a method, during a boot process of a computing device, includes the action of determining, by the computing device, a context associated with the computing device. The actions further include determining a hotword based on the context associated with the computing device. The actions further include, after determining the hotword, receiving audio data that corresponds to an utterance. The actions further include determining that the audio data includes the hotword. The actions further include, in response to determining that the audio data includes the hotword, performing an operation associated with the hotword.

Description

contextual hotwords

The present invention relates generally to automated speech processing.

The reality of a speech-enabled home or other environment, that is, one in which a user speaks a query or command aloud and a computer-based system fields and answers the query and/or causes the command to be performed, is upon us. A speech-enabled environment (e.g., home, workplace, school, etc.) can be implemented using a network of connected microphone devices distributed throughout the various rooms or areas of the environment. Through such a network of microphones, a user can verbally query the system from essentially anywhere in the environment without needing a computer or other device in front of or near them. For example, while cooking in the kitchen, a user might ask the system "How many milliliters are in three cups?" and, in response, receive an answer from the system, e.g., in the form of synthesized voice output. Alternatively, a user might ask the system a question such as "When can I reach the nearest gas station?" or, when preparing to leave the house, "Should I wear a coat today?"

Further, a user may ask the system a query, and/or issue a command, that relates to the user's personal information. For example, a user might ask the system "When is my meeting with John?" or command the system "Remind me to call John when I get back home."

For a speech-enabled system, the user's manner of interacting with the system is designed to be primarily, though not exclusively, by means of voice input. Consequently, the system, which potentially picks up all utterances made in the surrounding environment, including those not directed to the system, must have some way of discerning when a given utterance is directed at the system rather than, for example, at an individual present in the environment. One way to accomplish this is to use a hotword, which by agreement among the users in the environment is reserved as a predetermined word that is spoken to invoke the attention of the system. In an example environment, the hotword used to invoke the system's attention is the words "OK computer." Consequently, each time the words "OK computer" are spoken, they are picked up by a microphone and conveyed to the system, which performs speech modeling techniques to determine whether the hotword was spoken and, if so, awaits an ensuing command or query. Accordingly, utterances directed at the system take the general form [HOTWORD] [QUERY], where "HOTWORD" in this example is "OK computer" and "QUERY" can be any question, command, declaration, or other request that can be speech recognized, parsed, and acted on by the system, either alone or in conjunction with a server via a network.

When a user provides several hotword-based commands to a speech-enabled system, such as a mobile phone, the user's interaction with the phone can become awkward. The user may say, "Ok computer, play my homework playlist." The phone may begin to play the first song in the playlist. The user may wish to advance to the next song and say, "Ok computer, next." To advance to yet another song, the user may say "Ok computer, next" again. To reduce the need to keep repeating the hotword, the phone may be configured to recognize just "next" as both the hotword and the query, or in this case, the command. With this feature, the user's interaction with the phone and the music application becomes more natural, because the user only needs to say "next" to advance to the next song instead of "Ok computer, next."

To accomplish this, the speech-enabled system determines its current context and identifies the related hotwords. The context may be based on an application running on the system, the location of the system, the movement of the system, or any other similar situation. The system may use its context to identify additional hotwords. For example, when music is playing, the system may identify the hotwords "next," "stop," and "back" for controlling the music. The system may request a hotword model for each of the identified hotwords. The system may use the hotword models to recognize the new hotwords by processing the audio characteristics of audio data corresponding to the user's speech and applying the hotword models to those audio characteristics. The system recognizes a spoken hotword and performs the corresponding operation. If the system is playing music, "stop" is an active hotword, and the user says "stop," then the system may stop playing the music.

According to an innovative aspect of the subject matter disclosed in this application, a method for anti-rollback security includes:

determining, by a computing device, a context associated with the computing device; determining a hotword based on the context associated with the computing device; after determining the hotword, receiving audio data that corresponds to an utterance; determining that the audio data includes the hotword; and, in response to determining that the audio data includes the hotword, performing an operation associated with the hotword.

These and other implementations can each optionally include one or more of the following features. Determining that the audio data includes the hotword includes determining that the audio data includes the hotword without performing speech recognition on the audio data. Determining that the audio data includes the hotword includes: extracting audio features of the audio data that corresponds to the utterance; generating a hotword confidence score by processing the audio features; determining whether the hotword confidence score satisfies a hotword confidence threshold; and, based on determining whether the hotword confidence score satisfies the hotword confidence threshold, determining that the audio data that corresponds to the utterance includes the hotword. The method further includes, after determining the hotword, receiving a hotword model that corresponds to the hotword, and determining that the audio data includes the hotword includes determining, using the hotword model, that the audio data includes the hotword.
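The confidence-score feature described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `contains_hotword` function, the `toy_model` scorer, and the default threshold of 0.8 are all hypothetical, and "satisfies" is assumed to mean "greater than or equal to."

```python
from typing import Callable, Sequence

# Hypothetical types: one feature vector per audio frame, and a model that
# maps a sequence of frames to a confidence score in [0, 1].
Features = Sequence[Sequence[float]]
HotwordModel = Callable[[Features], float]


def contains_hotword(features: Features, model: HotwordModel,
                     threshold: float = 0.8) -> bool:
    """Return True when the model's confidence score satisfies the threshold.

    No speech recognition (transcription) is performed; the decision is made
    from the audio features alone.
    """
    score = model(features)
    return score >= threshold


def toy_model(features: Features) -> float:
    """Stand-in scorer (assumption): mean of the first coefficient, clamped."""
    if not features:
        return 0.0
    mean = sum(frame[0] for frame in features) / len(features)
    return max(0.0, min(1.0, mean))
```

A real hotword model would be a trained classifier; the toy scorer only makes the threshold comparison concrete.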

The method includes identifying, by the computing device, an application that is running on the computing device. The context is based on the application that is running on the computing device. The method further includes determining, by the computing device, that the context is no longer associated with the computing device; and determining that subsequently received audio data that includes the hotword does not trigger an operation. The method further includes providing, for output, data identifying the hotword. The method includes identifying, by the computing device, movement of the computing device. The context is based on the movement of the computing device. The method further includes identifying, by the computing device, a location of the computing device. The context is based on the location of the computing device. Performing the operation associated with the hotword includes performing speech recognition on a portion of the audio data that does not include the hotword. The operation is based on a transcription of the portion of the audio data that does not include the hotword. The audio data includes only the hotword. An initial portion of the audio data includes the hotword.
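The feature above, in which a hotword stops triggering an operation once its context no longer applies, might be modeled as below. The class name, method names, and the context label "music_playing" are hypothetical; the sketch assumes that hotword detection has already happened and only decides whether a detected hotword is still active.

```python
class ContextualHotwords:
    """Tracks which hotwords are active for the device's current contexts."""

    def __init__(self):
        # Maps a context name to a {hotword: action} dictionary.
        self._by_context = {}

    def enter_context(self, context, actions):
        """Activate the hotwords tied to a context."""
        self._by_context[context] = dict(actions)

    def leave_context(self, context):
        """Deactivate the hotwords tied to a context that no longer applies."""
        self._by_context.pop(context, None)

    def handle(self, detected_hotword):
        """Run the action for an active hotword; return None if inactive."""
        for actions in self._by_context.values():
            if detected_hotword in actions:
                return actions[detected_hotword]()
        return None
```

After `leave_context("music_playing")`, a later "next" is ignored rather than skipping a song.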

Other embodiments of this aspect include corresponding systems, apparatus, and computer programs recorded on computer storage devices, each configured to perform the operations of the methods.

The subject matter described in this application may have one or more of the following advantages. A computing device may be able to recognize and act in response to shorter commands than conventional systems, which typically require the user to speak a hotword followed by the command. Accordingly, less processing of the recognized command is required, and thus lower resource consumption (including power consumption and memory for storing input terms) is required for the computing device to respond, so the computing device can respond more quickly and efficiently. A computing device may be able to recognize and act in response to different commands without performing speech recognition on the commands. The computing resources and battery power required to recognize a query and command may also be reduced, because the computing device can recognize the query and command in a single term and does not need to process two or more different terms in separate operational steps.

The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

FIG. 1 illustrates an example system for identifying and processing contextual hotwords.
FIG. 2 illustrates an example process for identifying and processing contextual hotwords.
FIG. 3 illustrates an example of a computing device and a mobile computing device.
In the drawings, like reference numbers represent corresponding parts throughout.

FIG. 1 illustrates an example system 100 for identifying and processing contextual hotwords. Briefly, and as described in more detail below, the computing device 104 begins to play music in response to the utterance 106, "Ok computer, play music." Once the computing device 104 begins to play music, it is able to recognize the additional hotword "next" as an instruction to advance to the next song.

In more detail, and beginning at stage A of FIG. 1, the user 102 speaks the utterance 106, "Ok computer, play music," in the vicinity of the computing device 104. A microphone of the computing device 104 receives the utterance 106 and processes the audio data that corresponds to the utterance. The initial processing of the audio data may involve filtering the audio data and converting the audio data from an analog signal to a digital signal.

As the computing device 104 processes the audio data, the computing device may store the audio data in a buffer for additional processing. With the audio data in the buffer, the computing device 104 may identify whether the audio data includes any spoken words. One way the computing device 104 can identify spoken words is to use a hotword detector 108. The hotword detector 108 is configured to identify hotwords that are included in the audio data.

In some implementations, the hotword detector 108 may be configured to identify hotwords that are in the initial portion of the utterance 106. In this example, the hotword detector 108 may determine that the utterance 106, "Ok computer, play music," includes the hotword 110 "ok computer" if the hotword detector 108 detects acoustic features in the audio data that are characteristic of an active hotword 112. The acoustic features may be mel-frequency cepstral coefficients (MFCCs) that represent short-term power spectra of the utterance, or may be mel-scale filter bank energies for the utterance 106. For example, the hotword detector 108 may detect that the utterance 106 includes the hotword 110 "ok computer" based on generating MFCCs from the audio data and classifying that the MFCCs include MFCCs similar to those that are characteristic of the hotword "ok computer" as stored in the hotword models 114. As another example, the hotword detector 108 may detect that the utterance 106 includes the hotword 110 "ok computer" based on generating mel-scale filter bank energies from the audio data and classifying that the mel-scale filter bank energies include mel-scale filter bank energies similar to those that are characteristic of the hotword "ok computer" as stored in the hotword models 114.
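To make the mel-scale filter bank energies mentioned above concrete, the following sketch computes them for a single audio frame using only the standard library. It is an illustration under stated assumptions: the mel conversion uses the common 2595·log10(1 + f/700) convention, the filter count of 8 is arbitrary, and the naive DFT is only suitable for short frames. MFCCs would additionally take the logarithm of these energies and apply a discrete cosine transform, which this sketch omits.

```python
import cmath
import math


def hz_to_mel(hz: float) -> float:
    """Convert frequency in Hz to mels (one common convention)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (fine for a short frame)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filter_bank_energies(frame, sample_rate, num_filters=8):
    """Apply triangular filters spaced evenly on the mel scale."""
    spec = power_spectrum(frame)
    n = len(frame)
    top = hz_to_mel(sample_rate / 2.0)
    # num_filters + 2 edge points, evenly spaced in mel, mapped back to Hz.
    edges = [mel_to_hz(top * i / (num_filters + 1))
             for i in range(num_filters + 2)]
    energies = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        total = 0.0
        for k, p in enumerate(spec):
            f = k * sample_rate / n  # center frequency of DFT bin k
            if lo < f <= mid:
                total += p * (f - lo) / (mid - lo)   # rising slope
            elif mid < f < hi:
                total += p * (hi - f) / (hi - mid)   # falling slope
        energies.append(total)
    return energies
```

A classifier such as the hotword detector would consume a sequence of such per-frame vectors rather than a single frame.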

Once the hotword detector 108 determines that the audio data that corresponds to the utterance 106 includes the hotword 110, the computing device 104 may perform speech recognition or semantic interpretation on the audio data that corresponds to the utterance 106. The speech recognizer 116 may perform speech recognition on the portion of the audio data that follows the hotword 110. In this example, the speech recognizer 116 may identify the words 118 "play music."

In some implementations, the speech recognizer 116 may not be located on the computing device 104. Instead, the computing device 104 provides the audio data that corresponds to the portion of the utterance 106 after the hotword 110 to a server, for example, the server 120. The server 120 performs speech recognition and returns a transcription of the audio data to the computing device 104. The computing device 104 identifies the words in the utterance 106, performs semantic interpretation, and identifies any speech commands. The computing device 104 identifies the command and executes it. In this example, at stage B, the computing device 104 begins to play music 122 upon identifying the "play music" command 118.

While playing the music 122, the computing device 104 is running a music application in either the foreground or the background. The computing device 104 may include a context identifier 124 and an active hotword selector 126. The context identifier 124 may be configured to identify a current context of the computing device 104. The active hotword selector 126 may use the current context of the computing device 104 to select active hotwords. In this example, the context of the device may be related to playing the music 122 and running the music application. The active hotword selector 126 may examine the code of the music application to identify any hotwords that the developers of the music application want users to be able to speak to interact with the music application, along with the respective action for each hotword. The music application may identify hotwords such as "play," "next," "stop," and "back." Based on the context of music actively playing, the active hotword selector 126 may select the hotwords "next," "stop," and "back" and store them in the active hotwords 112.
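The selection step above can be sketched as a lookup from application state to the hotwords the application declares. The dictionary layout, the state names "idle" and "playing", and the idea that "ok computer" stays active alongside the contextual hotwords are assumptions made for illustration; the text does not specify how an application exposes its hotwords.

```python
# Hypothetical declaration by a music application: hotwords keyed by app state.
DECLARED_BY_MUSIC_APP = {
    "idle": {"play"},
    "playing": {"next", "stop", "back"},
}

# Assumption: the general-purpose hotword remains active in every context.
DEFAULT_HOTWORDS = {"ok computer"}


def select_active_hotwords(declared_by_state, app_state,
                           default=DEFAULT_HOTWORDS):
    """Return the default hotwords plus those declared for the current state."""
    return set(default) | set(declared_by_state.get(app_state, ()))
```

For example, `select_active_hotwords(DECLARED_BY_MUSIC_APP, "playing")` yields "ok computer" plus "next," "stop," and "back."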

In some implementations, the context identifier 124 may use the location of the computing device 104 to determine the context. For example, the context identifier 124 may determine that the location corresponds to the home of the user 102 of the computing device 104. The active hotword selector 126 may use the context that the device is at the home of the user 102 to identify hotwords such as "warmer" and "cooler" that allow the user to control the temperature of the home. Similar to the example above, the active hotword selector 126 may store the hotwords "warmer" and "cooler" in the active hotwords 112 while the user 102 is at home.

In some implementations, the context identifier 124 may use the motion of the computing device 104 to determine the context. For example, the context identifier 124 may determine that the movement of the computing device 104 corresponds to the speed and motion of a typical vehicle. The context identifier 124 may also compare the speed and movement of the computing device 104 to determine whether the computing device 104 is moving along a road, to increase the confidence that the computing device 104 is in a vehicle. In this example, the active hotword selector 126 may use the context of the computing device to identify the hotword "directions," which allows the user to request directions to a particular location. The active hotword selector 126 may store the hotword "directions" in the active hotwords 112 while the computing device 104 is in the vehicle.
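One way to picture the motion-based context above is a small confidence heuristic. Every number here is an assumption chosen for illustration (a 6 m/s mean-speed cutoff, a 0.6 base confidence, a 0.3 boost when the device appears to follow a road, and a 0.8 decision threshold); the text does not give concrete values, and the function names are hypothetical.

```python
def vehicle_confidence(speeds_mps, on_road):
    """Estimate confidence that the device is in a vehicle.

    speeds_mps: recent speed samples in meters per second.
    on_road: whether the device's movement follows a known road.
    """
    if not speeds_mps:
        return 0.0
    mean_speed = sum(speeds_mps) / len(speeds_mps)
    # Assumed cutoff: sustained speed above a typical running pace.
    confidence = 0.6 if mean_speed >= 6.0 else 0.0
    # Following a road increases confidence, as described above.
    if confidence > 0.0 and on_road:
        confidence += 0.3
    return confidence


def hotwords_for_motion(speeds_mps, on_road, threshold=0.8):
    """Activate "directions" only when vehicle confidence is high enough."""
    if vehicle_confidence(speeds_mps, on_road) >= threshold:
        return {"directions"}
    return set()
```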

In some implementations, the context identifier 124 may use the day of the week or the time of day, or both, to determine the context. For example, the context identifier 124 may determine that the context corresponds to the evening hours, such as between 9:00 pm and midnight. In this example, the active hotword selector 126 may use the evening context to identify the hotword "set alarm," which allows the user 102 to set an alarm. The active hotword selector 126 may store the hotword "set alarm" in the active hotwords 112 during the period from 9:00 pm to midnight.

In some implementations, the context identifier 124 may use the past actions and patterns of the user 102 to identify hotwords. The context identifier 124 may identify actions that the user typically performs, perhaps at different times of the day, and determine a context based on those actions. The active hotword selector 126 may identify related hotwords and store those hotwords in the active hotwords 112 during the corresponding time period. For example, the context identifier 124 may determine that the user 102 reads the news on the computing device 104 between 8:00 am and 10:00 am. The active hotword selector 126 may select "news" as an active hotword 112. With "news" as a hotword, the user 102 may say "news" to open the news application between 8:00 am and 10:00 am. The news application may have its own corresponding hotwords. The active hotword selector 126 may identify hotwords such as "sports," "local," and "national" as active hotwords 112 when the news application is open on the computing device 104.
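The two time-based examples above ("set alarm" from 9:00 pm to midnight, "news" from 8:00 am to 10:00 am) can be sketched as a table of time windows. The table format, the half-open interval convention, and the function name are assumptions; a real system would presumably learn the windows from usage rather than hard-code them.

```python
from datetime import time

# Each entry: (start minute of day, end minute of day, hotwords).
# Windows are half-open [start, end); 24 * 60 stands for midnight.
TIME_WINDOWS = [
    (21 * 60, 24 * 60, {"set alarm"}),  # 9:00 pm to midnight
    (8 * 60, 10 * 60, {"news"}),        # 8:00 am to 10:00 am
]


def hotwords_for_time(now: time) -> set:
    """Return the hotwords whose time window contains the given time."""
    minute_of_day = now.hour * 60 + now.minute
    active = set()
    for start, end, words in TIME_WINDOWS:
        if start <= minute_of_day < end:
            active |= words
    return active
```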

In some implementations, the context identifier 124 may identify past actions that are not necessarily time dependent. For example, the user 102 may habitually check the weather on the computing device 104. The context identifier 124 may determine that the context of the computing device almost always corresponds to a time when the user 102 has checked the weather. In this case, the active hotword selector 126 may identify the hotword "weather" and store the hotword in the active hotwords 112. With "weather" as an active hotword, the user 102 simply says "weather" to open the weather application and check the weather.

In some implementations, the context identifier 124 may determine the context based on a state of the computing device 104. For example, the state of the computing device 104 may be "locked." In this case, the active hotword selector 126 may identify the hotword "unlock" and store the hotword in the active hotwords 112 while the device is locked. With "unlock" as an active hotword, the user may unlock the phone by saying "unlock." To improve security, the computing device 104 may use speaker identification techniques to verify that the speaker is the user 102. In this case, the corresponding hotword model would be trained using the speech of the user 102. For example, the computing device 104 may prompt the user 102 to say "unlock" several times so that the computing device 104 or the server 120 can build a hotword model specific to the user 102 from the speech samples.

With the active hotwords 112 including a new hotword, the computing device 104 may check the hotword models 114 to determine whether it has a hotword model for the newly added hotword. For example, when the active hotword selector 126 stores "next" in the active hotwords 112, the computing device 104 determines whether the hotword models 114 include a hotword model for "next." If the hotword model for "next" is in the hotword models 114, then the hotword detector 108 can begin detecting the hotword "next," and the computing device 104 can skip stages C, D, and E. If the hotword model for "next" is not in the hotword models 114, then at stage C the computing device 104 sends a request 128 to the server 120 for the hotword model for "next."
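The check-then-request behavior above resembles a local cache with a fallback to the server. In this sketch the `HotwordModelStore` class, the `fetch_from_server` callable, and the `requests_sent` counter are hypothetical stand-ins; how the request and response are actually transported is not modeled.

```python
class HotwordModelStore:
    """Local store of hotword models with a server fallback on a miss."""

    def __init__(self, fetch_from_server):
        self._models = {}
        # Hypothetical callable: takes a hotword, returns its model.
        self._fetch = fetch_from_server
        self.requests_sent = 0

    def ensure_model(self, hotword):
        """Return the model for a hotword, requesting it only if missing."""
        if hotword in self._models:
            # Model already stored locally: no request to the server needed.
            return self._models[hotword]
        # Model missing: send a request to the server (stage C in the text).
        self.requests_sent += 1
        model = self._fetch(hotword)
        self._models[hotword] = model
        return model
```

Calling `ensure_model("next")` twice sends only one request, since the second call finds the model locally.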

In some implementations, the computing device 104 may request a hotword model for a hotword that already has a corresponding hotword model in the hotword models 114. The computing device 104 may request the hotword model periodically, for example once a week, to ensure that the locally stored hotword model is as accurate as possible. The computing device 104 may also request a hotword model in response to feedback from the user 102. In some instances, the user 102 may speak a hotword such as "next" and the computing device 104 may fail to advance the song. The hotword detector 108 may have processed the audio data corresponding to "next" but not identified the hotword because the hotword confidence score did not satisfy the threshold.

The user 102 may then advance the song by selecting the next-song button on the computing device 104. When the computing device 104 detects this series of actions, it may request an updated hotword model for "next." Alternatively, the computing device 104 may update the hotword model for "next" using the audio data whose hotword confidence score did not satisfy the threshold but was above a lower hotword confidence threshold. In some implementations, the computing device 104 may calculate a noise level for the audio data that did not produce a hotword confidence score satisfying the threshold. If the noise level is greater than a noise threshold, the audio data may contain too much background noise, and the computing device 104 may not update the hotword model with that audio data.
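
The update rule above combines two confidence thresholds with a noise check. The sketch below shows one way to express it. The function name and all default values are assumptions for illustration: 0.8 and 0.6 mirror example figures used elsewhere in this description, while the noise threshold of 0.5 is invented here.

```python
def should_update_model(score: float,
                        noise_level: float,
                        threshold: float = 0.8,
                        lower_threshold: float = 0.6,
                        noise_threshold: float = 0.5) -> bool:
    """Decide whether missed-detection audio may be used to update a model."""
    if score >= threshold:
        # The hotword was detected normally; this is not a near miss.
        return False
    if score <= lower_threshold:
        # Too unlike the hotword to be a trustworthy training sample.
        return False
    # Near miss: use the audio only if it is not too noisy.
    return noise_level <= noise_threshold
```

For example, a score of 0.7 with a noise level of 0.2 qualifies for an update, while the same score with a noise level of 0.9 does not.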

The server 120 receives the request 128 for the hotword model and, in stage D, identifies the corresponding hotword model 130. The server 120 may correspond to one or more servers accessible through a network 132 such as the Internet. Data accessible by one server may be accessible by the other servers. In addition to identifying and providing hotword models, the server 120 is configured to receive audio data and generate hotword models 130 based on the received audio data.

To generate the hotword models 130, the server 120 receives and collects speech data 134. The server 120 may receive the speech data 134 from servers that perform speech recognition. Those servers may perform the speech recognition and provide the audio data to the server 120 for generating hotword models. Using the collected speech data 134, the speech recognizer 136 identifies the words in the collected speech data.

The speech recognizer 136 provides the transcription and the speech data to the speech data tokenizer 138. The speech data tokenizer 138 splits the audio data into portions that correspond to different words. For example, if the collected speech data 134 corresponds to the words "play the next song," then the speech recognizer generates the transcription "play the next song" and the speech data tokenizer 138 tokenizes the audio data into four sections: one section for "play," another for "the," another for "next," and another for "song."
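
The tokenization step can be illustrated with a short sketch. It assumes, beyond what the description states, that the speech recognizer supplies a start and end time for each transcribed word; the function name `tokenize_audio` and the alignment tuple layout are hypothetical.

```python
from typing import List, Tuple


def tokenize_audio(samples: List[float],
                   sample_rate: int,
                   alignments: List[Tuple[str, float, float]]
                   ) -> List[Tuple[str, List[float]]]:
    """Split audio into one section per word.

    `alignments` holds (word, start_seconds, end_seconds) entries, which are
    assumed to come from the speech recognizer.
    """
    sections = []
    for word, start_s, end_s in alignments:
        start = int(start_s * sample_rate)
        end = int(end_s * sample_rate)
        # Slice out the samples spoken during this word.
        sections.append((word, samples[start:end]))
    return sections
```

With alignments for "play the next song," the function returns four sections, and the section labeled "next" can be handed to a model generator as one training sample for that word.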

The speech recognizer 136 and the speech data tokenizer 138 may tokenize many speech samples and provide the tokenized speech samples to the hotword model generator 140. The hotword model generator 140 processes multiple samples of the same word to generate a hotword model for that word. For example, the hotword model generator 140 may receive multiple speech samples corresponding to the word "next." The hotword model generator 140 extracts the audio characteristics of the speech samples to generate the hotword model for "next," and stores that model in the hotword models 130.

The hotword models 130 are ready for the server to provide to computing devices so that they can recognize those particular hotwords. In some implementations, the server 120 may not have the requested hotword in the hotword models 130. In this case, the server 120 may analyze the collected speech data 134 using the speech recognizer 136 and the speech data tokenizer 138 to identify audio samples that correspond to the requested word. Alternatively, the server 120 may ask the computing device 104 to collect samples. The computing device 104 may ask the user to repeat the word several times and then provide the audio data to the server 120 for processing into a hotword model. In some implementations, the requested hotword may be more than one word. In that case, the hotword model generator 140 may combine hotword models 130 to generate a hotword model for the multi-word hotword.

In some implementations, the hotword model generator 140 may generate hotword models that include noise specific to a particular context. Rather than selecting all speech tokens that correspond to the target hotword, the hotword model generator 140 selects the speech tokens that include background noise likely to be present in the corresponding context. For example, the hotword model generator 140 may generate a hotword model for "next" from speech tokens that include "next" and have background music. If the server 120 receives a request for the hotword model for "next" and the request indicates that the context is music playing, then the server 120 may provide the hotword model for "next" that is configured for background music. If the server 120 receives a request for the hotword model for "next" and the request indicates that the context is photo viewing, then the server 120 may provide the hotword model for "next" that is configured for no background noise.

In stage E, the server 120 provides a response 142 that includes the requested hotword model. In the example shown in FIG. 1, the server 120 provides the hotword model for "next" to the computing device 104. The computing device 104 stores the hotword model in the hotword models 114 and may update an indicator in the active hotwords 112 to show that a corresponding hotword model is stored on the computing device 104.

In some implementations, the active hotword selector 126 may remove hotwords from the active hotwords 112. When the context identifier 124 indicates that the context has changed, the active hotword selector 126 may update the active hotwords 112 based on the new context. Following some of the examples above, the active hotword selector 126 may remove the hotword "news" from the active hotwords 112 after 10:00 am. Similarly, the active hotword selector 126 may remove the hotword "set alarm" after midnight and add it back to the active hotwords 112 after 9:00 pm. In some implementations, when the active hotword selector 126 removes a hotword from the active hotwords 112, the corresponding hotword model remains in the hotword models 114.

In some implementations, the active hotword selector 126 may remove hotwords from the active hotwords 112 even while the same application is running on the computing device 104. When the computing device 104 launches the music application, the active hotword selector 126 may identify the hotwords "play," "next," "stop," and "back" and load the corresponding hotword models into the hotword models 114. In this case, a hotword need not be in the active hotwords 112 for the computing device 104 to request its hotword model. While music is playing, the active hotword selector 126 may include the hotwords "next," "stop," and "back" as the active hotwords. If the music stops and the music application remains open, the active hotword selector 126 may update the active hotwords 112 to "next," "play," and "back."
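
The context-driven selection described in the preceding paragraphs can be sketched as a pure function of the current context. This is an illustrative sketch only: the function name, the `music_state` encoding, and the choice to hard-code the rules are assumptions. The 8:00 am start of the "news" window is taken from the control interface example given elsewhere in this description.

```python
from datetime import time
from typing import Optional, Set


def select_active_hotwords(now: time, music_state: Optional[str]) -> Set[str]:
    """Return the active hotwords for a time of day and a music state.

    `music_state` is "playing", "stopped" (application open, music stopped),
    or None (music application not running).
    """
    active: Set[str] = set()
    if time(8, 0) <= now < time(10, 0):
        active.add("news")          # removed after 10:00 am
    if now >= time(21, 0):
        active.add("set alarm")     # active from 9:00 pm until midnight
    if music_state == "playing":
        active |= {"next", "stop", "back"}
    elif music_state == "stopped":
        active |= {"next", "play", "back"}
    return active
```

Recomputing the whole set whenever the context changes keeps the selection logic free of add and remove bookkeeping; the stored hotword models are untouched either way.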

In some implementations, the user interface generator 144 generates a user interface for display on the computing device 104. The user interface may indicate the active hotwords 112. For example, when a song begins to play, the user interface may indicate that the user 102 can say "next," "stop," or "back" to control the music. When the music stops, the user interface may indicate that the user 102 can say "next," "play," or "back" to control the music. The user interface generator 144 may also generate a notification when a hotword becomes active.

For example, the user interface may indicate that the hotword "set alarm" is active once the current time reaches 9:00 pm. Similarly, the user interface may indicate that the hotword "set alarm" is no longer active once the current time reaches midnight. The user interface may also indicate the effect of speaking each hotword. For example, when the music application is active and music is playing, the user interface may indicate that "next" advances to the next song in the playlist, "stop" stops the current song, and "back" returns to the previous song in the playlist.

In some implementations, the user interface generator 144 may also generate a user interface for controlling when different hotwords are active. This control interface may show the context in which each hotword is active and allow the user to update that context. Additionally or alternatively, the control interface may allow the user 102 to specify which hotwords are active for each context. For example, the control interface may indicate that the hotword "news" is active from 8:00 am to 10:00 am. The user 102 may adjust that context so that the hotword "news" is active from 8:00 am to noon.

The control interface may also indicate that the hotwords "next," "stop," and "back" are active when music is playing. The user 102 may update the hotwords for the music-playing context to only "next" and "stop." In some implementations, the control interface may also let the user 102 add custom hotwords for existing or custom contexts. For example, the user 102 may enter "call mom" into the control interface as a hotword, make the hotword always active, and specify that detecting the hotword calls the contact "mom." The user 102 may also add "up" and "down" to the music-playing context and specify that those hotwords control the volume. The user may also add a new context corresponding to the time period from 11:00 am to 1:00 pm, add the hotword "order lunch" to be active during that period, and indicate that the hotword opens a food ordering application.

In stage F, the user 102 speaks an utterance 146 that includes a hotword 148. The computing device 104 receives the utterance 146 through a microphone and processes the corresponding audio data. The hotword detector 108 compares the audio data with the hotword models 114 of the active hotwords 112 to identify whether the utterance 146 includes any active hotword. If the hotword detector 108 identifies a hotword, the computing device performs the corresponding command. In the example shown in FIG. 1, the user 102 says "next." The active hotwords 112 may be "stop," "next," and "back." The hotword detector 108 compares the audio data corresponding to the utterance 146 with the hotword models 114 for "stop," "next," and "back" and determines that the utterance 146 includes the hotword "next." Because the hotword "next" corresponds to the command to advance to the next song, the computing device advances to the next song 150 in stage G.

In some implementations, the hotword detector 108 may detect hotwords that are not among the active hotwords 112 but whose models are still stored in the hotword models 114. In this case, the hotword detector 108 may instruct the user interface generator 144 to generate a user interface indicating that the hotword is not currently active. For example, the user 102 may say "play" while music is playing. The hotword detector 108 may identify the hotword "play." Because the hotword is not active, the computing device 104 performs no action. The user interface generator 144 may, however, generate an interface indicating that the hotword "play" is not active and that the active hotwords are "stop," "next," and "back."
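
The branch between performing a command and notifying the user can be sketched as below. The function name, the tuple return convention, and the message wording are hypothetical; the description only states that an interface indicates which hotwords are active.

```python
from typing import Dict, Set, Tuple


def handle_detection(hotword: str,
                     active: Set[str],
                     actions: Dict[str, str]) -> Tuple[str, str]:
    """Map a detected hotword to either its command or a user notice."""
    if hotword in active:
        return ("perform", actions[hotword])
    # A model matched, but the hotword is not active: take no action and
    # tell the user which hotwords can be spoken instead.
    notice = f"'{hotword}' is not active; try: " + ", ".join(sorted(active))
    return ("notify", notice)
```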

In some implementations, the computing device 104 may still be configured to identify the default hotword "ok computer." In this case, the computing device 104 may process the audio data that follows "ok computer" using speech recognition and execute the appropriate action based on the transcription of that subsequent audio data. For example, if the user 102 says "ok computer, next" while music is playing, the computing device 104 identifies the "ok computer" hotword, transcribes the subsequent portion of the audio data that contains the command "next," and then advances to the next song.

Similarly, a contextual hotword may be followed by a command. Continuing the "order lunch" example, the user 102 may say "order lunch from Sandwich Deli" while the hotword is active. In this case, the hotword detector 108 identifies the "order lunch" hotword, and the speech recognizer 116 generates the transcription "from Sandwich Deli." The computing device 104 may open the food ordering application and open the Sandwich Deli menu.

In some implementations, the hotword detector 108 generates a hotword confidence score for each initial portion of processed audio data. If the hotword confidence score satisfies a threshold, the hotword detector 108 determines that the audio data includes the hotword. For example, if the hotword confidence score is 0.9 and the hotword confidence threshold is 0.8, the hotword detector 108 determines that the audio data includes the hotword.

In some implementations, if the hotword confidence score falls within a range below the threshold, the user interface generator 144 may generate an interface asking the user 102 to confirm that the hotword was spoken. For example, the hotword confidence score may be 0.7. If the range is 0.6 to 0.8, the user interface generator 144 may generate a user interface asking the user 102 to confirm or repeat the hotword. In some implementations, if the user 102 confirms having spoken the hotword, the computing device 104 may use the audio data to update the hotword model and improve future performance. The computing device 104 may decline to use the audio data if it contains too much noise.
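
The two thresholds above split the confidence score into three outcomes. A minimal sketch, using the example values 0.8 and 0.6 from the text; the function name, the outcome labels, and the treatment of a score of exactly 0.6 as falling inside the confirmation range are assumptions.

```python
def classify_score(score: float,
                   threshold: float = 0.8,
                   confirm_floor: float = 0.6) -> str:
    """Classify a hotword confidence score into one of three outcomes."""
    if score >= threshold:
        return "detected"   # perform the hotword's action
    if score >= confirm_floor:
        return "confirm"    # ask the user to confirm or repeat the hotword
    return "rejected"       # treat the audio as not containing the hotword
```

With these defaults, a score of 0.9 is detected, a score of 0.7 triggers a confirmation prompt, and a score of 0.5 is rejected.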

FIG. 2 illustrates an example process 200 for identifying and processing contextual hotwords. In general, the process 200 identifies hotwords based on the context of the device and assigns actions to the hotwords so that, when a user speaks a hotword, the process performs the corresponding action. The process 200 will be described as being performed by a computer system comprising one or more computers, for example, the system 100 shown in FIG. 1.

The system determines a context associated with the computing device (210). In some implementations, the system identifies an application running on the system and determines the context based on that application. For example, the application may be a music application, in which case the context may be playing music. The system may also distinguish between applications running in the background and in the foreground. For example, a music application that is running and playing music may have the same context of playing music whether it is in the background or the foreground, whereas an application such as a browser running in the background may not affect the context. In some implementations, the context may also relate to the state of the device, such as when the device is locked. The context may also relate to what is displayed on the screen, for example, "home screen."

In some implementations, the context may be based on the movement of the system. For example, if the system is moving at speeds similar to those of a car, the system may determine that the context is "in a car." In some implementations, the context may be based on the location of the computing device. For example, the system may be located at the user's home, in which case the context of the device may be "at home." In some implementations, the context may be a combination of contexts. For example, if the system is locked and at the user's home, the context may be "locked at home."
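
The context signals listed above can be combined as in the following sketch. The function name, the input parameters, and in particular the 30 km/h cutoff for "in a car" are assumptions made for illustration; the description does not specify how car-like speed is recognized.

```python
from typing import List


def determine_contexts(locked: bool,
                       at_home: bool,
                       speed_kmh: float,
                       music_playing: bool,
                       car_speed_kmh: float = 30.0) -> List[str]:
    """Collect every context label that currently applies to the device."""
    contexts: List[str] = []
    if music_playing:
        contexts.append("playing music")
    if speed_kmh >= car_speed_kmh:   # assumed threshold for car-like movement
        contexts.append("in a car")
    if at_home:
        contexts.append("at home")
    if locked:
        contexts.append("locked")
    return contexts
```

A locked device at the user's home yields both "at home" and "locked," which together correspond to the combined context "locked at home."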

The system determines a hotword based on the context associated with the system (220). In some implementations, the system determines the hotword based on hotwords identified in software. For example, the music application may identify hotwords for when the context is "music playing" and for when the music is stopped and the music application is open.

In some implementations, the system may determine hotwords based on prior use. For example, if the user typically reads the news during a certain time range, then the system may determine the hotword "news" when the context falls within that time range. In some implementations, the system may display the hotword and indicate on the display the action performed by the hotword. The system may provide a notification when a hotword becomes active and when the system deactivates it.

After determining the hotword, the system receives audio data that corresponds to an utterance (230). In some implementations, once the system has determined the hotword, it requests a hotword model for the hotword from a server. For example, if the system determines the hotword "next," then the system may request a hotword model for "next" from the server.

The system determines that the audio data includes the hotword (240). In some implementations, the system determines that the audio data includes the hotword without performing speech recognition on the audio data. In some implementations, the system makes this determination by extracting audio features from the audio data that corresponds to the utterance. The system generates a hotword confidence score by processing the audio features and, possibly, comparing them with those in the hotword model.

If the hotword confidence score satisfies a hotword confidence threshold, then the system determines that the audio data includes the hotword. If the hotword confidence score does not satisfy the hotword confidence threshold, then the system determines that the audio data does not include the hotword. For example, if the hotword confidence threshold is 0.8, then audio data with a hotword confidence score of 0.8 or higher is labeled as including the hotword, and audio data with a hotword confidence score below 0.8 is labeled as not including the hotword.

In some implementations, the audio data includes only the hotword. For example, the user may speak only "next," which is a hotword. In some implementations, an initial portion of the audio data includes the hotword. For example, the hotword may be "order lunch" and the user may say "order lunch from Sandwich Deli." In this case, the system identifies the hotword "order lunch" and uses speech recognition to process the portion of the audio data that follows the hotword.
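
The split between the hotword and the portion that follows it can be illustrated on text. Note the simplification: the system described here detects the hotword on audio without speech recognition and transcribes only the remainder, whereas this sketch operates on a full transcript as a stand-in so that it stays self-contained. The function name is hypothetical.

```python
from typing import Optional, Set, Tuple


def split_hotword(utterance: str, active: Set[str]) -> Optional[Tuple[str, str]]:
    """Return (hotword, remainder) if the utterance starts with an active hotword.

    Text stand-in for audio: real detection would run on audio features.
    """
    lowered = utterance.lower()
    # Try longer hotwords first so multi-word hotwords win over their prefixes.
    for hotword in sorted(active, key=len, reverse=True):
        if lowered == hotword or lowered.startswith(hotword + " "):
            return hotword, utterance[len(hotword):].strip()
    return None
```

An empty remainder corresponds to the case where the audio data includes only the hotword.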

In response to determining that the audio data includes the hotword, the system performs an action associated with the hotword (250). In some implementations, the system uses the hotword to identify the action. For example, when music is playing and the user says "next," the system advances the song.

In some implementations, the system removes a hotword from the list of active hotwords when the context is no longer valid. For example, if the user stops playing music, then the system removes the hotword "next" from the list of active hotwords. If the user then says "next," the system performs no action in response.

In some implementations, the action may be based on the hotword together with any audio data that follows it, or only on the audio data that follows the hotword. For example, an active hotword may be "directions," and the user may say "directions home." In this case, the system may open the mapping application and begin providing the user with directions to the user's home.

In some implementations, the action may depend on the audio data that follows the hotword. There may be cases where an active hotword has different actions depending on the audio data that follows it. For example, the user may be at home while the system is playing music. The user being at home may cause the system to activate the hotword "increase" for raising the temperature of the thermostat. The system playing music may also cause the system to activate the hotword "increase" for raising the volume of the music. In this case, the system may alleviate the conflict in one or more of the following ways.

The system may update the hotwords to "increase temperature" and "increase volume." If the user says "increase," the system may display an indication on the screen that the user should say "increase volume" to raise the music volume or "increase temperature" to raise the thermostat temperature. Alternatively, the system may keep "increase" as a hotword and require an argument after "increase," in which case the system performs speech recognition on any audio data after "increase," or it may request clarification from the user. For example, the user may say "increase." The system may then display a request, or play synthesized speech of the request, asking the user to clarify what to increase.
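
The second strategy above, keeping "increase" as the hotword and resolving it by its argument, can be sketched as follows. The function name, the nested dictionary layout, and the "clarify:" return convention are hypothetical.

```python
from typing import Dict


def resolve(hotword: str, argument: str,
            actions: Dict[str, Dict[str, str]]) -> str:
    """Pick the action for a hotword that may map to several actions."""
    candidates = actions[hotword]
    if len(candidates) == 1:
        # No conflict: the hotword has a single meaning in this context.
        return next(iter(candidates.values()))
    if argument in candidates:
        # The transcribed argument selects one of the conflicting actions.
        return candidates[argument]
    # Missing or unknown argument: ask the user what to act on.
    return "clarify: " + " or ".join(sorted(candidates))
```

A caller would display or synthesize the clarification request whenever the returned string starts with "clarify:".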

In some implementations, the system may detect an active hotword in the presence of an argument or suffix to the spoken hotword, and sometimes only in the presence of such an argument or suffix. For example, an active hotword may be "set an alarm." The system may recognize the hotword "set an alarm" only if it is followed by an argument such as "6 am" or "tomorrow morning." In this case, the system may perform speech recognition on the portion of the audio data that follows the hotword, and may decline to recognize the hotword unless additional spoken terms follow it. In some implementations, the additional spoken terms are arguments that the hotword accepts. The hotword "set an alarm" may accept an argument such as "6 am" but not "tuna sandwich."
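
The argument check described above might look like the following sketch. The regular expressions are hypothetical stand-ins for whatever grammar a real system would use to decide which arguments a hotword accepts, and the function assumes a transcript of the audio after the hotword is available.

```python
import re

# Hypothetical argument patterns, keyed by hotword.
ACCEPTED_ARGUMENTS = {
    "set an alarm": [
        re.compile(r"^(for )?\d{1,2}(:\d{2})? ?(am|pm)$"),
        re.compile(r"^(for )?tomorrow (morning|evening)$"),
    ],
}


def accept_hotword(hotword: str, trailing_transcript: str) -> bool:
    """Accept the hotword only if the words after it form an accepted argument."""
    patterns = ACCEPTED_ARGUMENTS.get(hotword)
    if patterns is None:
        # In this sketch, hotwords without an entry need no argument.
        return True
    text = trailing_transcript.strip().lower()
    return any(p.match(text) for p in patterns)
```

An empty transcript fails every pattern, so "set an alarm" spoken on its own is not recognized, matching the behavior described in the paragraph.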

In some implementations, the system may determine that the audio data includes more than one hotword. This may happen because the currently active hotwords sound similar. For example, two active hotwords may be "next" and "text." In some implementations, the system may determine that the audio data includes a hotword without necessarily determining which hotword. If the system determines that two or more hotword models match the audio data, the system may then perform speech recognition on the portion of the audio data that includes the hotword to determine which hotword the user spoke.
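
One way to picture this fallback is sketched below. It assumes each hotword model yields a confidence score that is compared against a threshold, as in the claims; the `recognize` callable standing in for the speech recognizer and the numeric values are hypothetical.

```python
def detect_hotword(scores, threshold, recognize):
    """Pick the spoken hotword, using speech recognition only on a tie.

    scores: dict mapping each active hotword to its model's confidence score.
    threshold: minimum score for a model to count as matching.
    recognize: zero-argument callable returning a transcript of the audio
        segment that contains the hotword (assumed to be costly to run).
    """
    candidates = [h for h, s in scores.items() if s >= threshold]
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    # Two or more models matched: transcribe the segment to disambiguate.
    transcript = recognize().strip().lower()
    return transcript if transcript in candidates else None
```

The recognizer is invoked only when several models match, so the common single-match case stays on the lightweight hotword-model path.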

FIG. 3 shows an example of a computing device 300 and a mobile computing device 350 that can be used to implement the techniques described herein. The computing device 300 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The mobile computing device 350 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit the implementations described and/or claimed in this document.

The computing device 300 includes a processor 302, a memory 304, a storage device 306, a high-speed interface 308 connected to the memory 304 and to multiple high-speed expansion ports 310, and a low-speed interface 312 connected to a low-speed expansion port 314 and to the storage device 306. The processor 302, the memory 304, the storage device 306, the high-speed interface 308, the high-speed expansion ports 310, and the low-speed interface 312 are interconnected using various buses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 302 can process instructions for execution within the computing device 300, including instructions stored in the memory 304 or on the storage device 306, to display graphical information for a GUI on an external input/output device, such as a display 316 coupled to the high-speed interface 308. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices may be connected, with each device providing portions of the necessary operations (for example, as a server bank, a group of blade servers, or a multi-processor system).

The memory 304 stores information within the computing device 300. In some implementations, the memory 304 is a volatile memory unit or units. In some implementations, the memory 304 is a non-volatile memory unit or units. The memory 304 may also be another form of computer-readable medium, such as a magnetic or optical disk.

The storage device 306 is capable of providing mass storage for the computing device 300. In some implementations, the storage device 306 may be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, a tape device, a flash memory or other similar solid-state memory device, or an array of devices, including devices in a storage area network or other configurations. Instructions can be stored in an information carrier. The instructions, when executed by one or more processing devices (for example, the processor 302), perform one or more methods, such as those described above.

The instructions can also be stored by one or more storage devices, such as computer-readable or machine-readable media (for example, the memory 304, the storage device 306, or memory on the processor 302).

The high-speed interface 308 manages bandwidth-intensive operations for the computing device 300, while the low-speed interface 312 manages less bandwidth-intensive operations. Such allocation of functions is an example only. In some implementations, the high-speed interface 308 is coupled to the memory 304, to the display 316 (for example, through a graphics processor or accelerator), and to the high-speed expansion ports 310, which may accept various expansion cards. In that implementation, the low-speed interface 312 is coupled to the storage device 306 and to the low-speed expansion port 314. The low-speed expansion port 314, which may include various communication ports (for example, USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, or a scanner, or to a networking device such as a switch or router, for example through a network adapter.

The computing device 300 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 320, or multiple times in a group of such servers. It may also be implemented in a personal computer such as a laptop computer 322, or as part of a rack server system 324. Alternatively, components from the computing device 300 may be combined with other components in a mobile device, such as the mobile computing device 350. Each of such devices may contain one or more of the computing device 300 and the mobile computing device 350, and an entire system may be made up of multiple computing devices communicating with each other.

The mobile computing device 350 includes a processor 352, a memory 364, an input/output device such as a display 354, a communication interface 366, and a transceiver 368, among other components. The mobile computing device 350 may also be provided with a storage device, such as a micro-drive or other device, to provide additional storage. The processor 352, the memory 364, the display 354, the communication interface 366, and the transceiver 368 are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.

The processor 352 can execute instructions within the mobile computing device 350, including instructions stored in the memory 364. The processor 352 may be implemented as a chipset of chips that include separate and multiple analog and digital processors. The processor 352 may provide, for example, for coordination of the other components of the mobile computing device 350, such as control of user interfaces, applications run by the mobile computing device 350, and wireless communication by the mobile computing device 350.

The processor 352 may communicate with a user through a control interface 358 and a display interface 356 coupled to the display 354. The display 354 may be, for example, a TFT LCD (Thin-Film-Transistor Liquid Crystal Display), an OLED (Organic Light Emitting Diode) display, or other appropriate display technology. The display interface 356 may comprise appropriate circuitry for driving the display 354 to present graphical and other information to a user. The control interface 358 may receive commands from a user and convert them for submission to the processor 352. In addition, an external interface 362 may provide communication with the processor 352, so as to enable near-area communication of the mobile computing device 350 with other devices. The external interface 362 may provide, for example, for wired communication in some implementations or for wireless communication in other implementations, and multiple interfaces may also be used.

The memory 364 stores information within the mobile computing device 350. The memory 364 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units. An expansion memory 374 may also be provided and connected to the mobile computing device 350 through an expansion interface 372, which may include, for example, a SIMM (Single In Line Memory Module) card interface. The expansion memory 374 may provide extra storage space for the mobile computing device 350, or may also store applications or other information for the mobile computing device 350. Specifically, the expansion memory 374 may include instructions to carry out or supplement the processes described above, and may also include secure information. Thus, for example, the expansion memory 374 may be provided as a security module for the mobile computing device 350, and may be programmed with instructions that permit secure use of the mobile computing device 350. In addition, secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.

The memory may include, for example, flash memory and/or NVRAM memory (non-volatile random access memory), as discussed below. In some implementations, instructions are stored in an information carrier such that the instructions, when executed by one or more processing devices (for example, the processor 352), perform one or more methods, such as those described above. The instructions can also be stored by one or more storage devices, such as one or more computer-readable or machine-readable media (for example, the memory 364, the expansion memory 374, or memory on the processor 352). In some implementations, the instructions can be received in a propagated signal, for example, over the transceiver 368 or the external interface 362.

The mobile computing device 350 may communicate wirelessly through the communication interface 366, which may include digital signal processing circuitry where necessary. The communication interface 366 may provide for communications under various modes or protocols, such as GSM (Global System for Mobile communications) voice calls, SMS (Short Message Service), EMS (Enhanced Messaging Service), or MMS (Multimedia Messaging Service) messaging, CDMA (code division multiple access), TDMA (time division multiple access), PDC (Personal Digital Cellular), WCDMA (Wideband Code Division Multiple Access), CDMA2000, or GPRS (General Packet Radio Service), among others. Such communication may occur, for example, through the transceiver 368 using a radio frequency. In addition, short-range communication may occur, such as using Bluetooth, Wi-Fi, or another such transceiver. In addition, a GPS (Global Positioning System) receiver module 370 may provide additional navigation-related and location-related wireless data to the mobile computing device 350, which may be used as appropriate by applications running on the mobile computing device 350.

The mobile computing device 350 may also communicate audibly using an audio codec 360, which may receive spoken information from a user and convert it to usable digital information. The audio codec 360 may likewise generate audible sound for a user, such as through a speaker, for example in a handset of the mobile computing device 350. Such sound may include sound from voice telephone calls, may include recorded sound (for example, voice messages, music files, and so on), and may also include sound generated by applications operating on the mobile computing device 350.

The mobile computing device 350 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular phone 380. It may also be implemented as part of a smartphone 382, a personal digital assistant, or another similar mobile device.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software, software applications, or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (for example, magnetic disks, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (for example, a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (for example, a data server), a middleware component (for example, an application server), or a front-end component (for example, a client computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), and the Internet.

The computing system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Although a few implementations have been described in detail above, other modifications are possible. For example, while a client application is described as accessing the delegate(s), in other implementations the delegate(s) may be employed by other applications implemented by one or more processors, such as an application executing on one or more servers. In addition, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. Moreover, other operations may be provided, or operations may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other implementations are within the scope of the following claims.

Claims

1. A computer-implemented method comprising:
determining, by a computing device, a context associated with the computing device;
determining a hotword based on the context associated with the computing device;
after determining the hotword, receiving a hotword model corresponding to the hotword, wherein the hotword model includes noise specific to the context associated with the computing device;
after determining the hotword, receiving audio data corresponding to an utterance;
determining, using the hotword model, that the audio data includes the hotword;
in response to determining that the audio data includes the hotword, performing an operation associated with the hotword;
determining, by the computing device, that the context is no longer associated with the computing device; and
determining that subsequently received audio data that includes the hotword does not trigger an operation.

2. The method of claim 1, wherein determining that the audio data includes the hotword comprises:
extracting audio features of the audio data corresponding to the utterance;
generating a hotword confidence score by processing the audio features;
determining whether the hotword confidence score satisfies a hotword confidence threshold; and
based on determining whether the hotword confidence score satisfies the hotword confidence threshold, determining that the audio data corresponding to the utterance includes the hotword.

3. The method of claim 1 or 2, further comprising:
providing, for output, data identifying the hotword.

4. The method of claim 1, further comprising:
identifying, by the computing device, movement of the computing device,
wherein the context is based on the movement of the computing device.

5. The method of claim 1, further comprising:
identifying, by the computing device, an application running on the computing device,
wherein the context is based on the application running on the computing device.

6. The method of claim 1, further comprising:
identifying, by the computing device, a location of the computing device,
wherein the context is based on the location of the computing device.

7. (Canceled)

8. The method of claim 1, wherein performing the operation associated with the hotword comprises:
performing speech recognition on a portion of the audio data that does not include the hotword,
wherein the operation is based on a transcription of the portion of the audio data that does not include the hotword.

9. The method of claim 1,
wherein the audio data includes only the hotword.

10. The method of claim 1,
wherein an initial portion of the audio data includes the hotword.

11. A system comprising:
one or more computers; and
one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising:
determining, by a computing device, a context associated with the computing device;
determining a hotword based on the context associated with the computing device;
after determining the hotword, receiving a hotword model corresponding to the hotword, wherein the hotword model includes noise specific to the context associated with the computing device;
after determining the hotword, receiving audio data corresponding to an utterance;
determining, using the hotword model, that the audio data includes the hotword;
in response to determining that the audio data includes the hotword, performing an operation associated with the hotword;
determining, by the computing device, that the context is no longer associated with the computing device; and
determining that subsequently received audio data that includes the hotword does not trigger an operation.

12. The system of claim 11, wherein determining that the audio data includes the hotword comprises:
extracting audio features of the audio data corresponding to the utterance;
generating a hotword confidence score by processing the audio features;
determining whether the hotword confidence score satisfies a hotword confidence threshold; and
based on determining whether the hotword confidence score satisfies the hotword confidence threshold, determining that the audio data corresponding to the utterance includes the hotword.

13. The system of claim 11 or 12, wherein the operations further comprise:
identifying, by the computing device, an application running on the computing device,
wherein the context is based on the application running on the computing device.

14. (Canceled)

15. The system of claim 11, wherein performing the operation associated with the hotword comprises:
performing speech recognition on a portion of the audio data that does not include the hotword,
wherein the operation is based on a transcription of the portion of the audio data that does not include the hotword.

16. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
determining, by a computing device, a context associated with the computing device;
determining a hotword based on the context associated with the computing device;
after determining the hotword, receiving a hotword model corresponding to the hotword, wherein the hotword model includes noise specific to the context associated with the computing device;
after determining the hotword, receiving audio data corresponding to an utterance;
determining, using the hotword model, that the audio data includes the hotword;
in response to determining that the audio data includes the hotword, performing an operation associated with the hotword;
determining, by the computing device, that the context is no longer associated with the computing device; and
determining that subsequently received audio data that includes the hotword does not trigger an operation.

17. A non-transitory computer-readable medium storing a computer program configured to perform the method of claim 1 when executed by a processor.

18. (Canceled)