KR20230010845A - Providing audio information with a digital assistant

Info

Publication number: KR20230010845A
Application number: KR1020237001029A
Authority: KR
Inventors: 라훌 네어; 골나즈 아브돌라히안; 아비 바르-지브; 니란잔 만주나스
Original assignee: 애플 인크.
Priority date: 2018-06-01
Filing date: 2019-04-24
Publication date: 2023-01-19
Also published as: US20210224031A1; KR20240042222A; CN112154412A; EP3782017A1; AU2019279597A1; KR102651249B1; WO2019231587A1; US20240086147A1; US20230229387A1; KR20210005200A; KR102488285B1; AU2022201037A1; US11609739B2; US11861265B2; AU2019279597B2; AU2022201037B2

Abstract

In an exemplary technique for providing audio information, an input is received, and audio information responsive to the received input is provided using a speaker. While the audio information is being provided, an external sound is detected. If the external sound is determined to be a communication of a first type, the provision of the audio information is stopped. If the external sound is determined to be a communication of a second type, the provision of the audio information continues.

Description

Providing audio information with a digital assistant

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Application No. 62/679,644, filed on June 1, 2018, and entitled "Providing Audio Information with a Digital Assistant," the entire disclosure of which is hereby incorporated by reference for all proper purposes.

TECHNICAL FIELD

The present disclosure relates generally to electronic devices that implement a digital assistant, and more specifically to electronic devices that provide audio information using the digital assistant.

BACKGROUND

A digital assistant interprets natural language input in spoken and/or textual form and determines a user request based on the input. The digital assistant then performs actions based on the user request. The actions include providing information responsive to the user request and/or performing tasks.

The present disclosure describes techniques for providing audio information using an electronic device that implements a digital assistant. According to some embodiments, the electronic device stops providing audio information in response to certain types of interruption. In addition, according to some embodiments, the electronic device waits to provide (or to resume providing) audio information until the audio information is not expected to be interrupted. In some exemplary embodiments, these techniques provide a more natural and efficient interaction with the digital assistant by allowing a user to speak without being interrupted or distracted by audio information from the digital assistant. The techniques can be applied to electronic devices such as desktop computers, laptops, tablets, televisions, speakers, entertainment systems, and smartphones.

According to some embodiments, a technique for providing audio information includes: providing, using a speaker, audio information responsive to a received input; while providing the audio information, detecting an external sound; in accordance with a determination that the external sound is a communication of a first type, stopping the provision of the audio information; and in accordance with a determination that the external sound is a communication of a second type, continuing the provision of the audio information. In some embodiments, the received input includes a triggering command.

In some embodiments, the technique further includes, after stopping the provision of the audio information: detecting one or more visual characteristics associated with the communication of the first type; detecting that the communication of the first type has stopped; in response to detecting that the communication of the first type has stopped, determining whether the one or more visual characteristics indicate that further communication of the first type is expected; in accordance with a determination that further communication of the first type is not expected, providing resumed audio information; and in accordance with a determination that further communication of the first type is expected, continuing to withhold the audio information.

In some embodiments, the one or more visual characteristics include eye gaze, facial expression, hand gesture, or a combination thereof. In some embodiments, stopping the provision of the audio information includes fading out the audio information. In some embodiments, the technique further includes, after stopping the provision of the audio information and in accordance with a determination that the communication of the first type has stopped, providing resumed audio information. In some embodiments, the audio information is divided into predefined segments, and the resumed audio information begins with the segment in which the audio information was stopped. In some embodiments, the resumed audio information includes a rephrased version of a previously provided segment of the audio information.
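As a non-limiting illustration of the segment-based resumption described above, the following Python sketch tracks which predefined segment is being provided and, on resumption, starts again from the segment that was interrupted. The class name `SegmentedPlayback`, its fields, and the example segment boundaries are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class SegmentedPlayback:
    """Hypothetical tracker for audio information split into predefined segments."""

    segments: list
    current: int = 0  # index of the segment currently being provided

    def advance(self) -> None:
        """Mark the current segment as fully provided and move to the next one."""
        if self.current < len(self.segments):
            self.current += 1

    def resume_segments(self) -> list:
        """Return the segments to provide on resumption.

        The interrupted segment is replayed from its beginning, followed by
        every segment that has not yet been provided.
        """
        return self.segments[self.current:]


# Example segmentation of the weather answer (boundaries are an assumption).
playback = SegmentedPlayback(
    [
        "Tomorrow is expected to be sunny,",
        "with a high of 75 degrees",
        "and a low of 60 degrees.",
    ]
)
playback.advance()  # the first segment was provided in full
# An interruption occurs while the second segment is being provided.
to_provide = playback.resume_segments()
```

In this sketch, replaying the interrupted segment from its start is only one possible choice; an embodiment that rephrases a previously provided segment would replace the first returned element with a reworded version.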

In some embodiments, the communication of the first type includes a directly-vocalized lexical utterance. In some embodiments, the directly-vocalized lexical utterance excludes silencing commands. In some embodiments, the technique further includes determining that the external sound is a directly-vocalized lexical utterance by determining a location corresponding to a source of the external sound. In some embodiments, the location is determined using a directional microphone array.

In some embodiments, the communication of the second type includes conversational sounds. In some embodiments, the communication of the second type includes compressed audio. In some embodiments, the communication of the second type includes a lexical utterance reproduced by an electronic device. In some embodiments, the technique further includes determining that the external sound is a lexical utterance reproduced by an electronic device by determining a location corresponding to a source of the external sound. In some embodiments, the location is determined using a directional microphone array.
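One way a source location could inform this determination is sketched below in Python. The sketch assumes that the microphone array yields a bearing (in degrees) for the sound source and that the bearings of sound-reproducing devices, such as a television, are already known; both assumptions, along with the function names and the 10-degree tolerance, are hypothetical and are not specified by the disclosure.

```python
def angular_difference(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two bearings, in degrees (0 to 180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def is_reproduced_by_device(
    source_bearing_deg: float,
    device_bearings_deg: list,
    tolerance_deg: float = 10.0,  # assumed tolerance, not from the disclosure
) -> bool:
    """Return True if the sound arrives from the direction of a known device.

    A sound coming from a known device's bearing is treated as reproduced by
    that device (second type); otherwise it may be a directly-vocalized
    utterance (first type).
    """
    return any(
        angular_difference(source_bearing_deg, bearing) <= tolerance_deg
        for bearing in device_bearings_deg
    )
```

The wraparound handling in `angular_difference` matters for bearings near 0 and 360 degrees, which would otherwise appear far apart.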

According to some embodiments, a technique for providing audio information includes: receiving speech input from a source, the speech input including one or more commands; detecting one or more visual characteristics associated with the source of the speech input; detecting that the speech input has stopped; in response to detecting that the speech input has stopped, determining whether the one or more visual characteristics associated with the source indicate that further speech input from the source is expected; in accordance with a determination that further speech input from the source is not expected, providing a response to the one or more commands; and in accordance with a determination that further speech input from the source is expected, withholding a response to the one or more commands.

In some embodiments, the one or more visual characteristics include eye gaze, facial expression, hand gesture, or a combination thereof. In some embodiments, the technique further includes: in accordance with the determination that further speech input from the source is expected, withholding the response to the one or more commands for a predetermined time; and after the predetermined time, and in accordance with a determination that the speech input from the source has not resumed, providing the response to the one or more commands.
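The withholding logic described above can be sketched in Python as follows. The names `Decision` and `decide_response` and the default wait of 2.0 seconds are assumptions for illustration; the disclosure refers only to a "predetermined time" and does not give a value.

```python
from enum import Enum


class Decision(Enum):
    RESPOND = "respond"
    WITHHOLD = "withhold"


def decide_response(
    more_speech_expected: bool,
    seconds_since_speech_stopped: float,
    speech_resumed: bool,
    wait_seconds: float = 2.0,  # assumed value for the predetermined time
) -> Decision:
    """Decide whether to respond to the commands after speech input stops."""
    if speech_resumed:
        # The source is speaking again, so any response stays withheld.
        return Decision.WITHHOLD
    if not more_speech_expected:
        # Visual characteristics do not suggest further speech: respond now.
        return Decision.RESPOND
    if seconds_since_speech_stopped >= wait_seconds:
        # Further speech was expected but none arrived within the wait time.
        return Decision.RESPOND
    return Decision.WITHHOLD
```

Here `more_speech_expected` stands in for the result of analyzing eye gaze, facial expression, or hand gestures; how that result is computed is outside the scope of this sketch.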

For a better understanding of the various described embodiments, reference should be made to the Detailed Description below, in conjunction with the following drawings, in which like reference numerals refer to corresponding parts throughout the figures.
FIGS. 1A and 1B illustrate an exemplary system for providing audio information to a user, according to various embodiments.
FIG. 2 illustrates an example of an electronic device providing audio information in an environment, according to various embodiments.
FIG. 3 illustrates a flow diagram of an exemplary process for providing audio information, according to various embodiments.
FIG. 4 illustrates a flow diagram of another exemplary process for providing audio information, according to various embodiments.

The following description sets forth exemplary methods, parameters, and the like. It should be recognized, however, that this description is not intended as a limitation on the scope of the present disclosure but is instead provided as a description of exemplary embodiments.

FIGS. 1A and 1B illustrate an exemplary system 100 for providing audio information to a user, according to various embodiments.

In some embodiments, as illustrated in FIG. 1A, system 100 includes device 100a. Device 100a includes various components, such as processor(s) 102, RF circuitry(s) 104, memory(s) 106, image sensor(s) 108, orientation sensor(s) 110, microphone(s) 112, location sensor(s) 116, speaker(s) 118, display(s) 120, and touch-sensitive surface(s) 122. These components optionally communicate over communication bus(es) 150 of device 100a.

In some embodiments, elements of system 100 are implemented in a base station device (e.g., a computing device such as a remote server, mobile device, or laptop), and other elements of system 100 are implemented in an auxiliary device (such as an audio playback device, television, monitor, or head-mounted display (HMD) device), where the auxiliary device is in communication with the base station device. In some embodiments, device 100a is implemented in a base station device or an auxiliary device.

As illustrated in FIG. 1B, in some embodiments, system 100 includes two (or more) devices in communication, such as through a wired connection or a wireless connection. First device 100b (e.g., a base station device) includes processor(s) 102, RF circuitry(s) 104, and memory(s) 106. These components optionally communicate over communication bus(es) 150 of device 100b. Second device 100c (e.g., an auxiliary device) includes various components, such as processor(s) 102, RF circuitry(s) 104, memory(s) 106, image sensor(s) 108, orientation sensor(s) 110, microphone(s) 112, location sensor(s) 116, speaker(s) 118, display(s) 120, and touch-sensitive surface(s) 122. These components optionally communicate over communication bus(es) 150 of device 100c.

System 100 includes processor(s) 102 and memory(s) 106. Processor(s) 102 include one or more general processors, one or more graphics processors, and/or one or more digital signal processors. In some embodiments, memory(s) 106 are one or more non-transitory computer-readable storage media (e.g., flash memory, random access memory) that store computer-readable instructions configured to be executed by processor(s) 102 to perform the techniques described below.

System 100 includes RF circuitry(s) 104. RF circuitry(s) 104 optionally include circuitry for communicating with electronic devices, with networks such as the Internet and intranets, and/or with wireless networks such as cellular networks and wireless local area networks (LANs). RF circuitry(s) 104 optionally include circuitry for communicating using near-field communication and/or short-range communication, such as Bluetooth®.

System 100 includes display(s) 120. In some embodiments, display(s) 120 include a first display (e.g., a left-eye display panel) and a second display (e.g., a right-eye display panel), each display for displaying images to a respective eye of the user. Corresponding images are displayed simultaneously on the first display and the second display. Optionally, the corresponding images include the same virtual objects and/or representations of the same physical objects from different viewpoints, resulting in a parallax effect that gives the user the illusion of depth of the objects on the displays. In some embodiments, display(s) 120 include a single display. Corresponding images are displayed simultaneously on a first area and a second area of the single display, one for each eye of the user. Optionally, the corresponding images include the same virtual objects and/or representations of the same physical objects from different viewpoints, resulting in a parallax effect that gives the user the illusion of depth of the objects on the single display.

In some embodiments, system 100 includes touch-sensitive surface(s) 122 for receiving user inputs, such as tap inputs and swipe inputs. In some embodiments, display(s) 120 and touch-sensitive surface(s) 122 form touch-sensitive display(s).

System 100 includes image sensor(s) 108. Image sensor(s) 108 optionally include one or more visible light image sensors, such as charge-coupled device (CCD) sensors and/or complementary metal-oxide-semiconductor (CMOS) sensors, operable to obtain images of physical objects from the real environment. Image sensor(s) 108 also optionally include one or more infrared (IR) sensors, such as a passive IR sensor or an active IR sensor, for detecting infrared light from the real environment. For example, an active IR sensor includes an IR emitter, such as an IR dot emitter, for emitting infrared light into the real environment. Image sensor(s) 108 also optionally include one or more event cameras configured to capture movement of physical objects in the real environment. Image sensor(s) 108 also optionally include one or more depth sensors configured to detect the distance of physical objects from system 100. In some embodiments, system 100 uses CCD sensors, event cameras, and depth sensors in combination to detect the physical environment around system 100. In some embodiments, image sensor(s) 108 include a first image sensor and a second image sensor. The first image sensor and the second image sensor are optionally configured to capture images of physical objects in the environment from two distinct perspectives. In some embodiments, system 100 uses image sensor(s) 108 to receive user inputs, such as hand gestures. In some embodiments, system 100 uses image sensor(s) 108 to detect the position and orientation of system 100 and/or display(s) 120 in the environment. For example, system 100 uses image sensor(s) 108 to track the position and orientation of one or more objects in the environment.

In some embodiments, system 100 includes microphone(s) 112. System 100 uses microphone(s) 112 to detect sound from the user and/or the user's environment. In some embodiments, microphone(s) 112 include an array of microphones (comprising a plurality of microphones) that optionally operate in tandem, for example, to identify ambient noise or to locate a sound source in the space of the environment.

System 100 includes orientation sensor(s) 110 for detecting orientation and/or movement of system 100 and/or display(s) 120. For example, system 100 uses orientation sensor(s) 110 to track changes in the position and/or orientation of system 100 and/or display(s) 120, such as with respect to physical objects in the environment. Orientation sensor(s) 110 optionally include one or more gyroscopes and/or one or more accelerometers.

In some embodiments, system 100 implements a digital assistant. The digital assistant interprets natural language input in spoken and/or textual form and determines one or more commands based on the input. The digital assistant then performs actions based on the commands. In some embodiments, the actions include providing audio information responsive to the commands and/or performing tasks. The term "digital assistant" can refer to any information processing system capable of interpreting natural language input and performing actions responsive to that input.

Typically, the natural language input requests either an informational answer or the performance of a task by the digital assistant. The digital assistant responds to the input by providing the requested information in an audio format and/or by performing the requested task. For example, when a user asks the digital assistant "What is the weather forecast for tomorrow?", the digital assistant may respond with the audio answer "Tomorrow is expected to be sunny, with a high of 75 degrees and a low of 60 degrees." As another example, when a user requests "Set an alarm for 6:00 a.m. tomorrow," the digital assistant performs the task of setting the corresponding alarm and provides the audio confirmation "An alarm has been set for 6:00 a.m. tomorrow."

In some embodiments, visual information (e.g., text, video, animations, etc.) is provided in addition to or instead of audio information. Furthermore, in some embodiments, the provided information includes media content (e.g., music or video content), and the digital assistant controls playback of the media content (e.g., starting and stopping the music or video content).

In some cases, it is advantageous to interrupt the provision of audio information by the digital assistant. For example, if a user begins speaking to another person while the digital assistant is providing audio information, the user may not hear the information being provided by the digital assistant. In this case, system 100 stops providing the audio information until the conversation between the user and the other person has concluded. In this way, system 100 provides audio information with the digital assistant in a more polite manner.

Furthermore, in some embodiments, before providing audio information (or before resuming the provision of stopped audio information), system 100 detects visual characteristics indicating that it is appropriate for the digital assistant to provide the audio information. For example, when a user provides a request but stops speaking to think (e.g., "Schedule a meeting with Tom for Monday at 9 a.m. and also ..."), system 100 detects that additional speech is expected and waits to provide audio information.

FIG. 2 illustrates an example of electronic device 200 providing audio information 202 in environment 210, according to various embodiments. In some embodiments, electronic device 200 is an embodiment of system 100, as described with reference to FIGS. 1A and 1B. Audio information 202 is provided using speaker(s) 218 in response to a received input. In some embodiments, the received input is a natural language input in spoken or textual form that includes one or more commands for a digital assistant implemented by electronic device 200. Electronic device 200 determines the one or more commands based on the received input and provides audio information 202 based on the one or more commands. In some embodiments, the received input includes a triggering command (e.g., "Hello Computer") that identifies the input as commands for the digital assistant.

In some embodiments, after the input from the user has stopped, electronic device 200 determines, before providing audio information 202, whether visual characteristics of the user indicate that further input is expected. Examples of visual characteristics include eye gaze, facial expressions, and/or hand gestures. For example, if electronic device 200 detects that a person's eyes gaze upward after the person has stopped speaking, electronic device 200 determines that further speech from the person is expected, because the upward gaze indicates that the person is thinking. In some embodiments, after determining that further input is expected, electronic device 200 waits for a predetermined time. If no further input is provided during the predetermined time, electronic device 200 proceeds with providing audio information 202. If the visual characteristics of the user do not indicate that further input is expected, electronic device 200 proceeds with providing audio information 202 after the input from the user has stopped.

If electronic device 200 detects external sound 206 from external sound source 204 while providing audio information 202, electronic device 200 determines, based on the type of external sound 206, whether external sound 206 warrants stopping the provision of audio information 202. For some types of external sounds 206, stopping audio information 202 is unnecessary. For example, conversational sounds that indicate a person is listening or thinking, such as "hmm", "um", "okay", "uh huh", "yes", "I see", and the like, would not warrant stopping the provision of audio information 202. Other types of external sounds 206, such as compressed audio (e.g., sounds from media content such as music or video) or speech reproduced by an electronic device (e.g., lexical utterances emitted by a television), also would not warrant stopping the provision of audio information 202.

In some embodiments, if the electronic device 200 determines that the external sound 206 has characteristics consistent with compressed audio, the electronic device 200 continues providing the audio information 202 (e.g., compressed audio is a type of external sound that does not warrant stopping the audio information 202). In other embodiments, when the electronic device 200 determines that the external sound 206 has characteristics consistent with compressed audio, the electronic device 200 further determines characteristics of the external sound source 204 and/or the content of the compressed audio. Based on the characteristics of the external sound source 204 emitting the compressed audio and/or the content of the compressed audio, the electronic device 200 may continue providing the audio information 202 or may stop the audio information 202. For example, if the electronic device 200 determines that the external sound source 204 is a television or other device emitting low-priority audio, the electronic device 200 continues providing the audio information 202. Examples of low-priority audio include pre-recorded audio such as music or movies, television programs, or radio broadcasts. However, if the electronic device 200 determines that the external sound source 204 is a telephone or other device emitting high-priority audio, the electronic device 200 may stop providing the audio information 202 so as not to distract from the high-priority audio. Examples of high-priority audio include audio of a person speaking in near real-time (e.g., a telephone conversation), an alarm, or a warning message.
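The two groups of embodiments described above for compressed audio can be illustrated with a minimal sketch. The content labels, the `inspect_source` flag, and the function name below are hypothetical and are not part of the disclosure; how the device recognizes the content of compressed audio is not specified here and is simply assumed to yield a label.

```python
# Hypothetical content labels; the description gives these only as examples.
LOW_PRIORITY_CONTENT = {"music", "movie", "television_program", "radio_broadcast"}
HIGH_PRIORITY_CONTENT = {"phone_conversation", "alarm", "warning_message"}


def should_stop_for_compressed_audio(content_label: str,
                                     inspect_source: bool = True) -> bool:
    """Return True if the audio information should be stopped.

    inspect_source=False models the embodiments in which compressed audio
    never warrants stopping. inspect_source=True models the embodiments in
    which the source and/or content is examined further.
    """
    if not inspect_source:
        return False
    # Assumption: content that is not known to be high-priority is treated
    # like low-priority content, so providing the audio information continues.
    return content_label in HIGH_PRIORITY_CONTENT
```

In this sketch, an alarm stops the audio information only when the source is inspected, and music never does.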

In general, external sounds 206 that convey more substantive information, that are conversations between people, or that otherwise include high-priority audio warrant stopping the provision of the audio information 202. These types of external sounds 206 include directly-vocalized lexical utterances (e.g., an external sound 206 emitted by a person speaking in the environment 210). For example, if a person begins speaking to another person in the environment 210 while the audio information 202 is being provided, the electronic device 200 may stop the provision of the audio information 202 upon detecting the speech. Stopping the provision of the audio information 202 allows the two people to have a conversation without being distracted or interrupted by the audio information 202. Similarly, a person in the environment 210 making a follow-up request to the digital assistant, or otherwise conveying substantive speech, would also warrant stopping the provision of the audio information 202. Notably, the audio information 202 is stopped without the user needing to say a silencing or triggering command such as "stop", "quiet", "end", and the like. In some embodiments, stopping the audio information 202 includes fading out the audio information 202.
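Fading out the audio information, as opposed to cutting it off abruptly, can be sketched as a gain ramp applied to the remaining samples. The linear ramp, the step count, and the function names below are assumptions chosen for illustration; the disclosure does not specify the shape or duration of the fade.

```python
from typing import List


def fade_out_gains(num_steps: int) -> List[float]:
    """Return a linear gain ramp that ends at silence (0.0)."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return [1.0 - (i + 1) / num_steps for i in range(num_steps)]


def apply_fade(samples: List[float], num_steps: int) -> List[float]:
    """Scale the first num_steps samples by the ramp; later samples are dropped."""
    gains = fade_out_gains(num_steps)
    return [s * g for s, g in zip(samples, gains)]
```

A real implementation would likely operate on audio buffers rather than Python lists; the list form only shows the arithmetic.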

In some embodiments, the electronic device 200 determines the type of the external sound 206 based at least in part on a location of the external sound source 204 in the environment 210. In some embodiments, the location of the external sound source 204 is determined using a microphone array capable of detecting the direction and/or distance of a sound source. If the location of the external sound source 204 corresponds to a person (and, optionally, the external sound 206 is not a conversational sound indicating that the person is listening or thinking), the electronic device 200 determines that the external sound 206 is substantive and stops the provision of the audio information 202. However, if the location of the external sound source 204 is determined to correspond to an electronic device (e.g., a television or loudspeaker), the electronic device 200 continues providing the audio information 202. In this way, the electronic device 200 does not stop providing the audio information 202 even when the external sound 206 emitted by the electronic device sounds like human speech (e.g., a lexical utterance spoken in a television program).
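The location-based determination can be sketched as a comparison between the estimated bearing of the sound source and the bearings of known people and known electronic devices. Everything in the sketch is an assumption made for illustration: the use of bearings in degrees, the tolerance, the function names, and the premise that the device already knows where people and other devices are.

```python
from typing import Iterable


def angular_difference(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two bearings, in degrees."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def source_kind(bearing_deg: float,
                device_bearings: Iterable[float],
                person_bearings: Iterable[float],
                tolerance_deg: float = 10.0) -> str:
    """Label the source as 'electronic_device', 'person', or 'unknown'."""
    if any(angular_difference(bearing_deg, d) <= tolerance_deg
           for d in device_bearings):
        return "electronic_device"
    if any(angular_difference(bearing_deg, p) <= tolerance_deg
           for p in person_bearings):
        return "person"
    return "unknown"


def should_stop(bearing_deg: float,
                device_bearings: Iterable[float],
                person_bearings: Iterable[float],
                is_conversational_sound: bool = False) -> bool:
    """Stop only for a person who is not merely signalling listening or thinking."""
    kind = source_kind(bearing_deg, device_bearings, person_bearings)
    return kind == "person" and not is_conversational_sound
```

Checking devices before people means that speech coming from the direction of a television is treated as reproduced speech even if it sounds human, which matches the behavior described above.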

In some embodiments, after stopping the provision of the audio information 202, the electronic device 200 waits until an appropriate time to resume the audio information 202. For example, if a person is speaking to another person in the environment 210, the electronic device 200 waits to resume the audio information 202 until further communication between the two people is no longer expected. In some embodiments, the electronic device 200 detects that further communication is expected based on visual characteristics of the one or more people making the external sounds 206, such as gaze, facial expressions, and/or hand gestures. For example, if the electronic device 200 detects that a person's eyes are gazing upward after the person has stopped speaking, the electronic device 200 determines that further speech from the person is expected, because the upward gaze indicates that the person is thinking.

Once the electronic device 200 determines that it is appropriate for the audio information 202 to continue, the electronic device 200 provides resumption audio information. In some embodiments, the electronic device 200 determines that it is appropriate for the audio information 202 to continue based on visual characteristics of one or more people, such as gaze, facial expressions, and/or hand gestures. For example, if the system detects that a person's eyes are gazing in the direction of the speaker(s) 218, the electronic device 200 determines that it is appropriate to provide the resumption audio information.
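The two gaze examples in this description (an upward gaze suggesting that the person is still thinking, and a gaze toward the speaker suggesting that resuming is appropriate) can be combined into one possible decision rule. The `VisualCues` structure and the way the two cues are combined are assumptions; the disclosure lists gaze, facial expressions, and hand gestures as inputs without prescribing a specific rule.

```python
from dataclasses import dataclass


@dataclass
class VisualCues:
    """Hypothetical summary of what a camera-based detector reports."""
    gaze_upward: bool = False
    gaze_toward_speaker: bool = False


def further_communication_expected(cues: VisualCues) -> bool:
    # An upward gaze after speech stops is taken to mean the person is thinking.
    return cues.gaze_upward


def appropriate_to_resume(cues: VisualCues, speech_stopped: bool) -> bool:
    """One possible policy: resume only after speech has stopped, no further
    speech is expected, and the person is looking toward the speaker."""
    if not speech_stopped or further_communication_expected(cues):
        return False
    return cues.gaze_toward_speaker
```

A less strict policy could resume whenever no further communication is expected, without requiring a gaze toward the speaker.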

In some embodiments, the audio information 202 is divided into predefined segments, and the resumption audio information begins with the segment in which the audio information 202 was stopped. In this way, the resumption audio information can begin with a complete phrase or word. In some embodiments, the resumption audio information includes a rephrased version of a previously provided segment of the audio information 202. The rephrased version of the previously provided segment reminds the listener of the point at which the audio information 202 was stopped, without repeating the same (e.g., verbatim) audio information.
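Resuming from the segment in which the audio information was stopped can be sketched as a lookup over cumulative segment end times. The representation of segments as text with per-segment durations, the optional `recap` argument standing in for a rephrased earlier segment, and the function names are all assumptions for illustration.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Optional


def resume_index(durations: List[float], stop_time: float) -> int:
    """Index of the segment that was playing when the audio was stopped.

    A stop exactly on a segment boundary resumes with the next segment.
    A stop at or past the end is clamped to the last segment.
    """
    if not durations:
        raise ValueError("at least one segment is required")
    ends = list(accumulate(durations))
    return min(bisect_right(ends, stop_time), len(durations) - 1)


def resumption_segments(segments: List[str],
                        durations: List[float],
                        stop_time: float,
                        recap: Optional[str] = None) -> List[str]:
    """Segments to present on resumption, optionally preceded by a recap."""
    rest = segments[resume_index(durations, stop_time):]
    return [recap] + rest if recap else rest
```

Because the interrupted segment is replayed from its start, the listener hears a complete phrase rather than the tail of one.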

Referring now to FIG. 3, a flow chart of an exemplary process 300 for providing audio information is depicted, according to various embodiments. Process 300 may be performed using a user device (e.g., 100a, 200). The electronic device is, for example, a desktop computer, a laptop computer, a handheld mobile device, an audio playback device, a television, a monitor, a head-mounted display (HMD) device, or a heads-up display device. It should be recognized that, in other embodiments, process 300 is performed using two or more electronic devices, such as a user device that is communicatively coupled to another device, such as a base device. In these embodiments, the operations of process 300 are distributed in any manner between the user device and the other device. Although the blocks of process 300 are depicted in a particular order in FIG. 3, it should be appreciated that these blocks can be performed in other orders. Further, one or more blocks of process 300 can be partially performed, optionally performed, or combined with other block(s), and/or additional blocks can be performed.

At block 302, audio information (e.g., 202) responsive to received input is provided using a speaker (e.g., 118, 218). In some embodiments, the received input includes a triggering command.

At block 304, while providing the audio information, an external sound (e.g., 206) is detected.

At block 306, in accordance with a determination that the external sound is a communication of a first type, the provision of the audio information is stopped. In some embodiments, stopping the provision of the audio information includes fading out the audio information. In some embodiments, the communication of the first type includes a directly-vocalized lexical utterance. Optionally, the directly-vocalized lexical utterance excludes silencing commands.

In some embodiments, the external sound is determined to be a directly-vocalized lexical utterance by determining a location corresponding to a source of the external sound (e.g., 204). In some embodiments, the location corresponding to the source of the external sound is determined using a directional microphone array.

At block 308, after stopping the provision of the audio information, one or more visual characteristics associated with the communication of the first type are detected. The one or more visual characteristics include gaze, facial expression, hand gesture, or a combination thereof.

At block 310, the communication of the first type is detected to have stopped.

At block 312, in response to detecting that the communication of the first type has stopped, a determination is made whether the one or more visual characteristics indicate that further communication of the first type is expected.

At block 314, in accordance with a determination that further communication of the first type is not expected, resumption audio information is provided. In some embodiments, the resumption audio information is provided after stopping the provision of the audio information and in accordance with a determination that the communication of the first type has stopped. In some embodiments, the audio information is divided into predefined segments, and the resumption audio information begins with the segment in which the audio information was stopped. In some embodiments, the resumption audio information includes a rephrased version of a previously provided segment of the audio information.

At block 316, in accordance with a determination that further communication of the first type is expected, the provision of the audio information continues to be stopped.

At block 318, in accordance with a determination that the external sound is a communication of a second type, the provision of the audio information continues. In some embodiments, the communication of the second type includes conversational sounds (sounds indicating that a person is listening or thinking, e.g., "hmm", "um", "okay", "uh-huh", "yes", "I see", and the like). In some embodiments, the communication of the second type includes compressed audio. In some embodiments, the communication of the second type includes a lexical utterance (e.g., speech) reproduced by an electronic device. In some embodiments, the external sound is determined to be a lexical utterance reproduced by an electronic device by determining a location corresponding to a source of the external sound (e.g., 204). In some embodiments, the location of the source of the external sound is determined using a directional microphone array.
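The decision flow of blocks 304 through 318 can be summarized as a small state machine. The string labels for sound categories and the `Presentation` class are hypothetical; the sketch assumes that some upstream classifier has already assigned one of these labels to the detected external sound.

```python
from enum import Enum, auto


class CommunicationType(Enum):
    FIRST = auto()   # warrants stopping the audio information
    SECOND = auto()  # does not warrant stopping


# Hypothetical labels produced by an upstream classifier.
_FIRST_TYPE = {"directly_vocalized_utterance"}
_SECOND_TYPE = {"conversational_sound", "compressed_audio", "reproduced_utterance"}


def communication_type(label: str) -> CommunicationType:
    if label in _FIRST_TYPE:
        return CommunicationType.FIRST
    if label in _SECOND_TYPE:
        return CommunicationType.SECOND
    raise ValueError(f"unknown sound label: {label}")


class Presentation:
    """Tracks whether audio information is currently being provided."""

    def __init__(self) -> None:
        self.playing = True  # block 302: audio information is being provided

    def on_external_sound(self, label: str) -> None:
        # Blocks 306 and 318: stop only for the first type of communication.
        if communication_type(label) is CommunicationType.FIRST:
            self.playing = False

    def on_communication_stopped(self, further_expected: bool) -> None:
        # Blocks 312 to 316: resume only if no further communication is expected.
        if not self.playing and not further_expected:
            self.playing = True
```

The sketch omits fading, segment selection, and detection of visual characteristics, which would feed the `further_expected` argument.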

Referring now to FIG. 4, a flow chart of an exemplary process 400 for providing audio information is depicted, according to various embodiments. Process 400 may be performed using a user device (e.g., 100a, 200). The electronic device is, for example, a desktop computer, a laptop computer, a handheld mobile device, an audio playback device, a television, a monitor, a head-mounted display (HMD) device, or a heads-up display device. It should be recognized that, in other embodiments, process 400 is performed using two or more electronic devices, such as a user device that is communicatively coupled to another device, such as a base device. In these embodiments, the operations of process 400 are distributed in any manner between the user device and the other device. Although the blocks of process 400 are depicted in a particular order in FIG. 4, it should be appreciated that these blocks can be performed in other orders. Further, one or more blocks of process 400 can be partially performed, optionally performed, or combined with other block(s), and/or additional blocks can be performed.

At block 402, speech input including one or more instructions is received from a source.

At block 404, one or more visual characteristics associated with the source of the speech input are detected. The one or more visual characteristics include gaze, facial expression, hand gesture, or a combination thereof.

At block 406, the speech input is detected to have stopped.

At block 408, in response to detecting that the speech input has stopped, a determination is made whether the one or more visual characteristics associated with the source indicate that further speech input from the source is expected.

At block 410, in accordance with a determination that further speech input from the source is not expected, a response to the one or more instructions is provided.

At block 412, in accordance with a determination that further speech input from the source is expected, a response to the one or more instructions is not provided. In some embodiments, in accordance with a determination that further speech input from the source is expected, the response to the one or more instructions is not provided for a predetermined time. After the predetermined time, and in accordance with a determination that the speech input from the source has not resumed, the response to the one or more instructions is provided.
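The response timing of blocks 408 through 412 can be sketched as a single decision function. The parameter names and the default hold time of two seconds are assumptions; the disclosure refers only to "a predetermined time" without giving a value.

```python
def should_respond(further_input_expected: bool,
                   elapsed_s: float,
                   speech_resumed: bool,
                   hold_s: float = 2.0) -> bool:
    """Decide whether to respond to the instructions after speech input stops.

    elapsed_s is the time since the speech input was detected to have stopped.
    hold_s is a hypothetical value for the predetermined time.
    """
    if speech_resumed:
        # The speaker continued, so the input is not yet complete.
        return False
    if not further_input_expected:
        # Block 410: respond without waiting.
        return True
    # Block 412: withhold the response until the predetermined time has passed.
    return elapsed_s >= hold_s
```

The function would be evaluated repeatedly while waiting, so that a response is eventually given even when visual characteristics wrongly suggested that more speech was coming.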

Executable instructions for performing the features of methods 300 and/or 400 described above are, optionally, included in a transitory or non-transitory computer-readable storage medium (e.g., memory(s) 106) or other computer program product configured for execution by one or more processors (e.g., processor(s) 102). Further, some operations in method 300 are, optionally, included in method 400, and some operations in method 400 are, optionally, included in method 300.

The foregoing descriptions of specific embodiments have been presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the scope of the claims to the precise forms disclosed, and it should be understood that many modifications and variations are possible in light of the above teaching.

Claims

1. A method, comprising:
receiving speech input from a source, the speech input comprising one or more instructions;
detecting one or more first visual characteristics associated with the source of the speech input;
detecting that the speech input has stopped;
in response to detecting that the speech input has stopped, determining whether the one or more first visual characteristics associated with the source indicate that further speech input from the source is expected;
in accordance with a determination that further speech input from the source is not expected, providing a response to the one or more instructions; and
in accordance with a determination that further speech input from the source is expected, forgoing providing a response to the one or more instructions.

2. The method of claim 1, wherein the one or more first visual characteristics include gaze, facial expression, hand gesture, or a combination thereof.

3. The method of claim 1, further comprising:
in accordance with a determination that further speech input from the source is expected, forgoing providing the response to the one or more instructions for a predetermined time; and
after the predetermined time, and in accordance with a determination that the speech input from the source has not resumed, providing the response to the one or more instructions.

4. The method of claim 1, further comprising:
while providing the response to the one or more instructions, detecting an external sound;
in accordance with a determination that the external sound is a communication of a first type, stopping the provision of the response to the one or more instructions; and
in accordance with a determination that the external sound is a communication of a second type, continuing the provision of the response to the one or more instructions.

5. The method of claim 4, further comprising:
after stopping the provision of the response to the one or more instructions:
detecting one or more second visual characteristics associated with the communication of the first type;
detecting that the communication of the first type has stopped;
in response to detecting that the communication of the first type has stopped, determining whether the one or more second visual characteristics indicate that further communication of the first type is expected;
in accordance with a determination that further communication of the first type is not expected, providing a resumption response to the one or more instructions; and
in accordance with a determination that further communication of the first type is expected, continuing to stop the provision of the response to the one or more instructions.

6. The method of claim 5, wherein the one or more second visual characteristics include gaze, facial expression, hand gesture, or a combination thereof.

7. The method of claim 4, wherein stopping the provision of the response to the one or more instructions comprises fading out the response to the one or more instructions.

8. The method of claim 4, further comprising providing a resumption response to the one or more instructions after stopping the provision of the response to the one or more instructions and in accordance with a determination that the communication of the first type has stopped.

9. The method of claim 8, wherein the response to the one or more instructions is divided into predefined segments, and wherein the resumption response to the one or more instructions begins with the segment in which the response to the one or more instructions was stopped.

10. The method of claim 9, wherein the resumption response to the one or more instructions comprises a rephrased version of a previously provided segment of the response to the one or more instructions.

11. The method of claim 4, wherein the communication of the first type comprises a directly-vocalized lexical utterance.

12. The method of claim 11, wherein the directly-vocalized lexical utterance excludes silencing commands.

13. The method of claim 11, further comprising:
determining that the external sound is a directly-vocalized lexical utterance by determining a location corresponding to a source of the external sound.

14. The method of claim 13, wherein the location is determined using a directional microphone array.

15. The method of claim 4, wherein the communication of the second type comprises conversational sounds.

16. The method of claim 4, wherein the communication of the second type comprises compressed audio.

17. The method of claim 4, wherein the communication of the second type comprises a lexical utterance reproduced by an electronic device.

18. The method of claim 17, further comprising:
determining that the external sound is a lexical utterance reproduced by an electronic device by determining a location corresponding to a source of the external sound.

19. The method of claim 18, wherein the location is determined using a directional microphone array.

20. A non-transitory computer-readable storage medium storing one or more programs configured to be executed by one or more processors of an electronic device, the one or more programs including instructions for performing the method of any one of claims 1 to 19.

21. An electronic device, comprising:
one or more processors; and
memory storing one or more programs configured to be executed by the one or more processors, the one or more programs including instructions for performing the method of any one of claims 1 to 19.