KR102140177B1 - Answering questions using environmental context

Info

Publication number: KR102140177B1
Application number: KR1020190119592A
Authority: KR
Inventors: Matthew Sharifi; Gheorghe Postelnicu
Original assignee: Google LLC
Priority date: 2012-09-10
Filing date: 2019-09-27
Publication date: 2020-08-03
Also published as: CN103714104B; CN106250508B; WO2014039106A1; KR20190113712A; CN106250508A; KR20140034034A; KR102241972B1; CN103714104A; KR102029276B1; KR20200093489A

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, are disclosed for receiving an utterance and environmental data, obtaining a transcription of the utterance, identifying an entity using the environmental data, submitting a query to a natural language query processing engine, the query including at least a portion of the transcription and data that identifies the entity, and obtaining one or more results of the query.

Description

Answering questions using environmental context

This application claims priority to U.S. Provisional Application No. 61/698,934, filed September 10, 2012; U.S. Provisional Application No. 61/698,949, filed September 10, 2012; U.S. Patent Application No. 13/626,439, filed September 25, 2012; U.S. Patent Application No. 13/626,351, filed September 25, 2012; and U.S. Patent Application No. 13/768,232, filed February 15, 2013, the entire contents of which are incorporated herein by reference.

This specification relates to identifying results of a query based on a natural language query and environmental information, for example to answer questions using the environmental information as context.

In general, a search query includes one or more terms that a user submits to a search engine when requesting that the search engine execute a search. Among other approaches, a user may enter the query terms of a search query by typing on a keyboard or, in the context of a voice query, by speaking the query terms into a microphone of a mobile device. Voice queries may be processed using speech recognition technology.

According to some innovative aspects of the subject matter described in this specification, environmental information, such as ambient noise, can help a query processing system answer a natural language query. For example, a user may ask a question about a television program that they are watching, such as "Who is the main character in this movie?" The user's mobile device detects the user's utterance and environmental information, which may include the soundtrack audio of the television program. The mobile computing device encodes the utterance and the environmental information as waveform data, and provides the waveform data to a server-based computing environment.

The computing environment separates the utterance from the environmental data of the waveform data, and then obtains a transcription of the utterance. The computing environment further identifies entity data relating to the environmental data and the utterance, for example by identifying the name of the movie. From the transcription and the entity data, the computing environment can then identify one or more results, for example results that respond to the user's question. In particular, the one or more results may include an answer to the user's question "What actor is in this movie?" (e.g., the name of the actor). The computing environment can provide these results to the user of the mobile computing device.

Innovative aspects of the subject matter described in this specification may be embodied in methods that include the actions of receiving audio data encoding an utterance and environmental data, obtaining a transcription of the utterance, identifying an entity using the environmental data, submitting a query to a natural language query processing engine, the query including at least a portion of the transcription and data that identifies the entity, and obtaining one or more results of the query.

Other embodiments of these aspects include corresponding systems, apparatus, and computer programs, encoded on computer storage devices, configured to perform the actions of the methods.

These and other embodiments may each optionally include one or more of the following features. For instance, a representation of at least one of the results is output. The entity is identified further using the utterance. The query is generated. Generating the query includes associating the transcription with the data that identifies the entity. The associating includes tagging the transcription with the data that identifies the entity. The associating includes substituting a portion of the transcription with the data that identifies the entity. The substituting includes substituting one or more words of the transcription with the data that identifies the entity. Receiving the environmental data includes receiving environmental audio data, environmental image data, or both. Receiving the environmental audio data includes receiving additional audio data that includes background noise.
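The two ways of associating a transcription with an entity described above (tagging and substitution) can be illustrated with a minimal Python sketch. The function names, the dictionary layout of the tagged query, and the list of deictic phrases are assumptions made for illustration only; the specification does not prescribe a query format.

```python
import re

# Hypothetical phrases that refer to "whatever is playing"; not from the specification.
DEICTIC_PHRASES = ("this movie", "this show", "this song")


def tag_transcription(transcription: str, entity: str) -> dict:
    """Associate by tagging: keep the transcription intact and attach the entity."""
    return {"text": transcription, "entity": entity}


def substitute_entity(transcription: str, entity: str) -> str:
    """Associate by substitution: replace one or more words with the entity name."""
    for phrase in DEICTIC_PHRASES:
        pattern = re.compile(re.escape(phrase), re.IGNORECASE)
        if pattern.search(transcription):
            # A callable replacement avoids interpreting backslashes in the entity name.
            return pattern.sub(lambda _match: entity, transcription, count=1)
    # No deictic phrase found: leave the transcription unchanged.
    return transcription
```

Either form could then be submitted to a natural language query processing engine; which form an engine expects is outside the scope of this sketch.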

According to some innovative aspects of the subject matter described in this specification, an item of media content is identified based on environmental audio data and a spoken natural language query. For example, a user may ask a question about a television program that they are watching, such as "What are we watching right now?" The question may include keywords (e.g., "watching") that suggest that the question is about a television show and not about some other type of media content. The user's mobile device detects the user's utterance and environmental data, which may include the background audio of the television program. The mobile computing device encodes the utterance and the environmental data as waveform data, and provides the waveform data to a server-based computing environment.

The computing environment separates the utterance from the environmental data of the waveform data, and then processes the utterance to obtain a transcription of the utterance. From the transcription, the computing environment detects any content type-specific keywords (e.g., the keyword "watching"). The computing environment can then identify items of media content based on the environmental data, and can select, from among the identified items, a particular item of media content that matches the particular content type associated with the keywords. The computing environment provides a representation of the particular item of media content to the user of the mobile computing device.

Innovative aspects of the subject matter described in this specification may be embodied in methods that include the actions of receiving (i) audio data that encodes a spoken natural language query and (ii) environmental audio data, obtaining a transcription of the spoken natural language query, determining a particular content type associated with one or more keywords in the transcription, providing at least a portion of the environmental audio data to a content recognition engine, and identifying a content item that has been output by the content recognition engine and that matches the particular content type.

Other embodiments of these aspects include corresponding systems, apparatus, and computer programs, encoded on computer storage devices, configured to perform the actions of the methods.

These and other embodiments may each optionally include one or more of the following features. For instance, the particular content type is a movie content type, a music content type, a television show content type, an audio podcast content type, a book content type, an artwork content type, a trailer content type, a video podcast content type, an Internet video content type, or a video game content type. Receiving the environmental audio data includes receiving additional audio data that includes background noise. The background noise is associated with the particular content type. Additional environmental data that includes video data or image data is received. The video data or the image data is associated with the particular content type. Providing at least the portion of the environmental audio data to the content recognition engine includes providing the portion of the environmental audio data to an audio fingerprinting engine. Determining the particular content type includes identifying the one or more keywords using one or more databases that, for each of multiple content types, map at least one of the keywords to at least one of the multiple content types. The multiple content types include the particular content type, and the mapping includes mapping at least one of the keywords to the particular content type. Data identifying the content item is output.

The features further include, for example, providing data identifying the particular content type to the content recognition engine, where identifying the content item includes receiving data identifying the content item from the content recognition engine. Two or more content recognition candidates are received from the content recognition system, and identifying the content item includes selecting a particular content recognition candidate based on the particular content type. Each of the two or more content recognition candidates is associated with a ranking score, and the method further includes adjusting the ranking scores of the two or more content recognition candidates based on the particular content type. The two or more content recognition candidates are ranked based on the adjusted ranking scores.
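The adjustment of ranking scores by content type can be sketched as follows. This is only one possible reading: the `Candidate` class, the multiplicative `boost` factor, and its default value are assumptions, since the specification says that scores are adjusted based on the content type but not how.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    """A hypothetical content recognition candidate."""
    title: str
    content_type: str
    score: float


def rerank(candidates: List[Candidate], target_type: str, boost: float = 1.5) -> List[Candidate]:
    """Boost candidates whose type matches the query's content type, then sort by score."""
    adjusted = [
        Candidate(
            c.title,
            c.content_type,
            c.score * boost if c.content_type == target_type else c.score,
        )
        for c in candidates
    ]
    # Highest adjusted score first; the first element is the selected candidate.
    return sorted(adjusted, key=lambda c: c.score, reverse=True)
```

With this scheme, a theme song and the TV show it belongs to may both be recognized from the same audio, and the keyword-derived content type decides which of the two is ranked first.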

The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other potential features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

FIG. 1 depicts an example system for identifying content item data based on environmental audio data and a spoken natural language query.
FIG. 2 depicts a flowchart of an example process for identifying content item data based on environmental audio data and a spoken natural language query.
FIGS. 3A and 3B depict portions of an example system for identifying a content item.
FIG. 4 depicts an example system for identifying media content items based on environmental image data and a spoken natural language query.
FIG. 5 depicts a system for identifying one or more results based on environmental audio data and an utterance.
FIG. 6 depicts a flowchart of an example process for identifying one or more results based on environmental data and an utterance.
FIG. 7 depicts a computer system and a mobile computer device that may be used to implement the techniques described in this specification.
Like reference symbols in the various drawings indicate like elements.

A computing environment that answers spoken natural language queries using environmental information as context may process the queries using multiple processes. In some example processes, illustrated in FIGS. 1 to 4, the computing environment can identify media content based on environmental information, such as ambient noise. In other example processes, illustrated in FIGS. 5 and 6, the computing environment can augment the spoken natural language query with context derived from the environmental information, such as data identifying media content, in order to provide a more satisfying answer to the spoken natural language query.

In more detail, FIG. 1 depicts a system 100 for identifying content item data based on environmental audio data and a spoken natural language query. Briefly, the system 100 can identify content item data that is based on the environmental audio data and that matches a particular content type associated with the spoken natural language query. The system 100 includes a mobile computing device 102, a disambiguation engine 104, a speech recognition engine 106, a keyword mapping engine 108, and a content recognition engine 110. The mobile computing device 102 communicates with the disambiguation engine 104 over one or more networks. The mobile computing device 102 may include a microphone, a camera, or other detection mechanisms for detecting utterances from a user 112 and/or environmental data associated with the user 112.

In some examples, the user 112 is watching a TV program. In the illustrated example, the user 112 would like to know who directed the TV program that is currently being displayed. In some examples, the user 112 may not know the name of the TV program that is currently being displayed, and may therefore ask "Who directed this show?" The mobile computing device 102 detects this utterance together with environmental audio data associated with the environment of the user 112.

In some examples, the environmental audio data associated with the environment of the user 112 can include background noise of the environment of the user 112. For example, the environmental audio data can include the sounds of the TV program. In some examples, the environmental audio data associated with the currently displayed TV program can include audio of the currently displayed TV program (e.g., dialogue of the currently displayed TV program, soundtrack audio associated with the currently displayed TV program, etc.).

In some examples, the mobile computing device 102 detects the environmental audio data after detecting the utterance, concurrently with detecting the utterance, or both. During operation (A), the mobile computing device 102 processes the detected utterance and the environmental audio data to generate waveform data 114 that represents the detected utterance and the environmental audio data, and transmits the waveform data 114 to the disambiguation engine 104 (e.g., over a network). In some examples, the environmental audio data is streamed from the mobile computing device 102.

The disambiguation engine 104 receives the waveform data 114 from the mobile computing device 102. During operation (B), the disambiguation engine 104 processes the waveform data 114, including separating (or extracting) the utterance from other portions of the waveform data 114, and transmits the utterance to the speech recognition engine 106 (e.g., over a network). For example, the disambiguation engine 104 separates the utterance ("Who directed this show?") from the background noise of the environment of the user 112 (e.g., the audio of the currently displayed TV program).

In some examples, the disambiguation engine 104 uses a voice detector to facilitate separating the utterance from the background noise, by identifying a portion of the waveform data 114 that includes voice activity, or voice activity associated with the user of the mobile computing device 102. In some examples, the utterance relates to a query (e.g., a query relating to the currently displayed TV program). In some examples, the waveform data 114 includes the detected utterance. In response, the disambiguation engine 104 can request the environmental audio data relating to the utterance from the mobile computing device 102.
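As a rough illustration of how a voice detector might split waveform data into an utterance portion and a background portion, the sketch below labels fixed-length frames by their mean energy. The specification does not say how the voice detector works; the energy threshold, the frame length, and the function names are assumptions, and a production system would more likely rely on a trained voice activity detector.

```python
from typing import List, Sequence, Tuple


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each complete, non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def split_speech(
    samples: Sequence[float], frame_len: int = 160, threshold: float = 0.01
) -> Tuple[List[float], List[float]]:
    """Return (speech, background) samples; frames at or above the threshold count as speech.

    Trailing samples that do not fill a whole frame are dropped.
    """
    speech: List[float] = []
    background: List[float] = []
    for idx, energy in enumerate(frame_energies(samples, frame_len)):
        frame = samples[idx * frame_len:(idx + 1) * frame_len]
        (speech if energy >= threshold else background).extend(frame)
    return speech, background
```

A simple energy threshold assumes the user speaks noticeably louder than the TV audio; it cannot distinguish the user's voice from dialogue in the program, which is one reason the sketch should not be read as the method of the specification.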

The speech recognition engine 106 receives the portion of the waveform data 114 that corresponds to the utterance from the disambiguation engine 104. During operation (C), the speech recognition engine 106 obtains a transcription of the utterance and provides the transcription to the keyword mapping engine 108. Specifically, the speech recognition engine 106 processes the utterance received from the disambiguation engine 104. In some examples, processing of the utterance by the speech recognition engine 106 includes generating a transcription of the utterance. Generating the transcription of the utterance can include transcribing the utterance into text or text-related data. In other words, the speech recognition engine 106 can provide a representation of the utterance in written form.

For example, the speech recognition engine 106 transcribes the utterance to generate the transcription "Who directed this show?" In some embodiments, the speech recognition engine 106 provides two or more transcriptions of the utterance. For example, the speech recognition engine 106 transcribes the utterance to generate the transcriptions "Who directed this show?" and "Who directed this shoe?"

The keyword mapping engine 108 receives the transcription from the speech recognition engine 106. During operation (D), the keyword mapping engine 108 identifies one or more keywords in the transcription that are associated with a particular content type, and provides the particular content type to the disambiguation engine 104. In some embodiments, the one or more content types can include 'movie', 'music', 'TV show', 'audio podcast', 'image', 'artwork', 'book', 'magazine', 'trailer', 'video podcast', 'Internet video', or 'video game'.

For example, the keyword mapping engine 108 identifies the keyword "directed" from the transcription "Who directed this show?" The keyword "directed" is associated with the 'TV show' content type. In some embodiments, a keyword of the transcription that is identified by the keyword mapping engine 108 is associated with two or more content types. For example, the keyword "directed" is associated with the 'TV show' and 'movie' content types.

In some embodiments, the keyword mapping engine 108 identifies two or more keywords in the transcription that are associated with a particular content type. For example, the keyword mapping engine 108 identifies the keywords "directed" and "show" as being associated with a particular content type. In some embodiments, the two or more identified keywords are associated with the same content type. For example, the identified keywords "directed" and "show" are both associated with the 'TV show' content type. In some embodiments, the two or more identified keywords are associated with differing content types. For example, the identified keyword "directed" is associated with the 'movie' content type, and the identified keyword "show" is associated with the 'TV show' content type. The keyword mapping engine 108 transmits the particular content type to the disambiguation engine 104 (e.g., over a network).

In some embodiments, the keyword mapping engine 108 identifies the one or more keywords in the transcription that are associated with a particular content type using one or more databases that, for each of multiple content types, map at least one of the keywords to at least one of the multiple content types. Specifically, the keyword mapping engine 108 includes (or is in communication with) a database (or multiple databases). The database includes, or is associated with, a mapping between keywords and content types. Specifically, the database provides a connection (e.g., a mapping) between the keywords and the content types such that the keyword mapping engine 108 is able to identify one or more keywords in the transcription that are associated with particular content types.

In some embodiments, one or more of the mappings between the keywords and the content types can be unidirectional, that is, a one-way mapping from the keywords to the content types. In some embodiments, one or more of the mappings between the keywords and the content types can be bidirectional, that is, a two-way mapping from the keywords to the content types and from the content types to the keywords. In some embodiments, the one or more databases map one or more of the keywords to two or more content types.

For example, the keyword mapping engine 108 uses one or more databases that map the keyword "directed" to the 'movie' and 'TV show' content types. In some embodiments, the mapping between the keywords and the content types includes mappings between multiple varying versions of a root keyword (i.e., the word family) and the content types. The varying versions of a keyword can include differing grammatical categories such as tense (e.g., past, present, future) and word class (e.g., noun, verb). For example, the database can include mappings of the word family of the root word "direct" (such as "directors", "direction", and "directed") to one or more content types.
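The keyword-to-content-type database and its word families can be illustrated with a small in-memory sketch. The table contents, the names `WORD_FAMILIES`, `ROOT_TO_TYPES`, `keyword_content_types`, and `most_supported_type`, and the idea of resolving conflicts by counting votes are all assumptions for illustration; the specification only states that keywords (including varying versions of a root keyword) are mapped to one or more content types.

```python
import re
from collections import Counter
from typing import Dict, Optional, Set

# Hypothetical word families: root keyword -> surface forms found in transcriptions.
WORD_FAMILIES: Dict[str, Set[str]] = {
    "direct": {"directed", "director", "directors", "direction"},
    "show": {"show", "shows"},
    "watch": {"watch", "watching", "watched"},
}

# Hypothetical mapping from root keyword to the content types it suggests.
ROOT_TO_TYPES: Dict[str, Set[str]] = {
    "direct": {"movie", "tv show"},
    "show": {"tv show"},
    "watch": {"movie", "tv show", "internet video"},
}

# Inverted index so every surface form resolves to its root keyword.
FORM_TO_ROOT = {form: root for root, forms in WORD_FAMILIES.items() for form in forms}


def keyword_content_types(transcription: str) -> Dict[str, Set[str]]:
    """Map each recognized keyword in the transcription to its content types."""
    tokens = re.findall(r"[a-z']+", transcription.lower())
    return {t: ROOT_TO_TYPES[FORM_TO_ROOT[t]] for t in tokens if t in FORM_TO_ROOT}


def most_supported_type(transcription: str) -> Optional[str]:
    """Pick the content type suggested by the most keywords (an assumed tie-break rule)."""
    votes: Counter = Counter()
    for types in keyword_content_types(transcription).values():
        votes.update(types)
    return votes.most_common(1)[0][0] if votes else None
```

For the transcription "Who directed this show?", "directed" suggests both 'movie' and 'TV show' while "show" suggests only 'TV show', so counting votes selects 'TV show'. A bidirectional mapping could be obtained by also inverting `ROOT_TO_TYPES`.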

The disambiguation engine 104 receives, from the keyword mapping engine 108, data identifying the particular content type associated with the transcription of the utterance. Furthermore, as mentioned above, the disambiguation engine 104 receives, from the mobile computing device 102, the waveform data 114 that includes the environmental audio data associated with the utterance. During operation (E), the disambiguation engine 104 then provides the environmental audio data and the particular content type to the content recognition engine 110.

For example, the disambiguation engine 104 transmits to the content recognition engine 110 the environmental audio data relating to the currently displayed TV program, which includes audio of the currently displayed TV program (e.g., dialogue of the currently displayed TV program, soundtrack audio of the currently displayed TV program, etc.), and the particular content type of the transcription of the utterance (e.g., the 'TV show' content type).

In some embodiments, the disambiguation engine 104 provides a portion of the environmental audio data to the content recognition engine 110. In some examples, the portion of the environmental audio data can include background noise detected by the mobile computing device 102 after detecting the utterance. In some examples, the portion of the environmental audio data can include background noise detected by the mobile computing device 102 concurrently with detecting the utterance.

In some embodiments, the background noise (of the waveform data 114) is associated with the particular content type that is associated with a keyword of the transcription. For example, the keyword "directed" in the transcription "Who directed this show?" is associated with the 'TV show' content type, and the background noise (e.g., the environmental audio data relating to the currently displayed TV program) is also associated with the 'TV show' content type.

The content recognition engine 110 receives the environmental audio data and the particular content type from the disambiguation engine 104. During operation (F), the content recognition engine 110 identifies content item data that is based on the environmental audio data and that matches the particular content type, and provides the content item data to the disambiguation engine 104. Specifically, the content recognition engine 110 appropriately processes the environmental audio data to identify content item data associated with the environmental audio data (e.g., the name of a TV show, the name of a song, etc.). Additionally, the content recognition engine 110 matches the identified content item data with the particular content type (e.g., the content type of the transcription of the utterance). The content recognition engine 110 transmits the identified content item data to the disambiguation engine 104 (e.g., over a network).

For example, the content recognition engine 110 identifies content item data that is based on the environmental audio data relating to the currently displayed TV program and that further matches the 'TV show' content type. To that end, depending on the portion of the environmental audio data it receives, the content recognition engine 110 identifies content item data based on the dialogue of the currently displayed TV program or on the soundtrack audio associated with that program.

In some embodiments, the content recognition engine 110 is an audio fingerprinting engine that uses wavelet-based content fingerprinting to identify the content item data. Specifically, the content recognition engine 110 converts the waveform data 114 into a spectrogram. From the spectrogram, the content recognition engine 110 extracts spectral images. The spectral images can be represented as wavelets. For each of the spectral images extracted from the spectrogram, the content recognition engine 110 extracts the "top" wavelets based on the respective magnitudes of the wavelets. For each spectral image, the content recognition engine 110 computes a wavelet signature of the image. In some examples, the wavelet signature is a truncated, quantized version of the wavelet decomposition of the image.

For example, describing an m × n image with wavelets returns m × n wavelets when no compression is applied. The content recognition engine 110 therefore uses the subset of wavelets that most characterizes the song. Specifically, the t "top" wavelets (by magnitude) are selected, where t << m × n. Furthermore, the content recognition engine 110 creates a compact representation of this sparse wavelet vector, for example by using MinHash to compute sub-fingerprints for the sparse bit vectors.
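The fingerprinting steps described above (wavelet decomposition of a spectral image, selection of the t "top" wavelets, and MinHash over the resulting sparse bit vector) can be illustrated with a minimal sketch. Several details here are assumptions and are not taken from this specification: the use of a Haar wavelet, the two-bits-per-coefficient sign encoding, the linear hash family, and every function and parameter name. Image dimensions are assumed to be powers of two.

```python
import random
from typing import List, Set


def haar_1d(values: List[float]) -> List[float]:
    """Full 1-D Haar decomposition (averages first, then differences)."""
    out = list(values)
    n = len(out)
    while n > 1:
        half = n // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        diff = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = avg + diff
        n = half
    return out


def haar_2d(image: List[List[float]]) -> List[List[float]]:
    """Standard 2-D decomposition: transform every row, then every column."""
    rows = [haar_1d(r) for r in image]
    cols = [haar_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def top_wavelet_bits(coeffs: List[List[float]], t: int) -> Set[int]:
    """Keep the t largest-magnitude coefficients and quantize each to its sign.

    Assumed encoding: coefficient i sets bit 2*i if positive, bit 2*i+1 if negative.
    """
    flat = [c for row in coeffs for c in row]
    top = sorted(range(len(flat)), key=lambda i: abs(flat[i]), reverse=True)[:t]
    bits: Set[int] = set()
    for i in top:
        if flat[i] > 0:
            bits.add(2 * i)
        elif flat[i] < 0:
            bits.add(2 * i + 1)
    return bits


def minhash(bits: Set[int], num_hashes: int = 8, seed: int = 7) -> List[int]:
    """Sub-fingerprint: the minimum of each hash function over the set bits."""
    prime = 2_147_483_647
    rng = random.Random(seed)  # fixed seed so all fingerprints share hash functions
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(num_hashes)]
    return [min(((a * b + c) % prime for b in bits), default=prime) for a, c in params]


def fingerprint(image: List[List[float]], t: int = 4) -> List[int]:
    return minhash(top_wavelet_bits(haar_2d(image), t))
```

Two spectral images that share most of their top wavelets produce signatures that agree in many positions, which is what makes the compact signature usable for approximate lookup.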

In some examples, when the environmental audio data includes at least the soundtrack audio associated with the currently displayed TV program, the content recognition engine 110 identifies content item data that is based on that soundtrack audio and that also matches the 'TV show' content type. Thus, in some examples, the content recognition engine 110 identifies content item data relating to the name of the currently displayed TV program. For example, the content recognition engine 110 can determine that a particular content item (e.g., a specific TV show) is associated with a theme song (e.g., the soundtrack audio) and that the particular content item (e.g., the specific TV show) matches the particular content type (e.g., the 'TV show' content type). Thus, the content recognition engine 110 can identify data (e.g., the name of the specific TV show) that relates to the particular content item (e.g., the currently displayed TV program), that is based on the environmental audio data (e.g., the soundtrack audio), and that further matches the particular content type (e.g., the 'TV show' content type).

The disambiguation engine 104 receives the identified content item data from the content recognition engine 110. At operation (G), the disambiguation engine 104 then provides the identified content item data to the mobile computing device 102. For example, the disambiguation engine 104 transmits to the mobile computing device 102 the identified content item data relating to the currently displayed TV program (e.g., the name of the currently displayed TV program).

In some examples, one or more of the mobile computing device 102, the disambiguation engine 104, the speech recognition engine 106, the keyword mapping engine 108, and the content recognition engine 110 can be in communication with a subset (or each) of the mobile computing device 102, the disambiguation engine 104, the speech recognition engine 106, the keyword mapping engine 108, and the content recognition engine 110. In some embodiments, one or more of the disambiguation engine 104, the speech recognition engine 106, the keyword mapping engine 108, and the content recognition engine 110 can be implemented using one or more computing devices, such as one or more computing servers, a distributed computing system, or a server farm or cluster.

In some embodiments, as described above, the environmental audio data can be streamed from the mobile computing device 102 to the disambiguation engine 104. When the environmental audio data is streamed, the processing described above (e.g., operations (A)-(H)) is performed as the environmental audio data is received by the disambiguation engine 104 (i.e., it is performed incrementally). In other words, as each portion of the environmental audio data is received by the disambiguation engine 104 (e.g., by streaming), operations (A)-(H) are performed iteratively until content item data is identified.
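The incremental behavior described above can be sketched as a loop that re-runs recognition each time a new portion of audio arrives and stops as soon as an item is identified. The function name, the byte-buffer representation, and the `recognize` callable are hypothetical and are used for illustration only.

```python
from typing import Callable, Iterable, Optional


def identify_from_stream(
    chunks: Iterable[bytes],
    recognize: Callable[[bytes], Optional[str]],
) -> Optional[str]:
    """Re-run recognition on the growing buffer until an item is identified."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk              # a new portion of environmental audio arrives
        result = recognize(buffer)   # repeat the recognition on what has been received
        if result is not None:
            return result            # stop once content item data is identified
    return None                      # stream ended without an identification
```

Stopping early means the remaining audio never has to be received or processed once the content item is known.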

FIG. 2 depicts a flowchart of an example process 200 for identifying content item data based on environmental audio data and a spoken natural language query. The example process 200 can be executed by one or more computing devices. For example, the mobile computing device 102, the disambiguation engine 104, the speech recognition engine 106, the keyword mapping engine 108, and/or the content recognition engine 110 can be used to execute the example process 200.

Audio data encoding a spoken natural language query, and environmental audio data, are received (202). For example, the disambiguation engine 104 receives the waveform data 114 from the mobile computing device 102. The waveform data 114 includes the user's spoken natural language query (e.g., "Who directed this show?") and the environmental audio data (e.g., audio of the currently displayed TV program). The disambiguation engine 104 separates the spoken natural language query ("Who directed this show?") from the background noise of the environment of the user 112 (e.g., the audio of the currently displayed TV program).

A transcription of the natural language query is obtained (204). For example, the speech recognition engine 106 transcribes the natural language query to generate a transcription of the natural language query (e.g., "Who directed this show?").

A particular content type that is associated with one or more keywords in the transcription is determined (206). For example, the keyword mapping engine 108 identifies one or more keywords (e.g., "directed") in the transcription (e.g., "Who directed this show?") that are associated with a particular content type (e.g., the 'TV show' content type). In some embodiments, the keyword mapping engine 108 identifies the one or more keywords in the transcription using one or more databases that, for each of multiple content types, map at least one of the keywords to at least one of the multiple content types. The database provides a connection (e.g., a mapping) between keywords (e.g., "directed") and content types (e.g., the 'TV show' content type).
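The keyword-to-content-type mapping described above can be pictured as a small lookup table. The following is a minimal sketch; the table contents, the tokenization, and the function name are assumptions made for illustration, and an actual database could be organized quite differently.

```python
from typing import Dict, Set

# Hypothetical mapping table; the entries are illustrative only.
KEYWORD_TO_CONTENT_TYPES: Dict[str, Set[str]] = {
    "directed": {"TV show"},
    "song": {"music", "singer"},
    "songs": {"music", "singer"},
}


def content_types_for(transcription: str) -> Set[str]:
    """Return every content type mapped to a keyword found in the transcription."""
    found: Set[str] = set()
    for raw_word in transcription.split():
        word = raw_word.strip("?.,!\"'").lower()  # naive tokenization (assumption)
        found |= KEYWORD_TO_CONTENT_TYPES.get(word, set())
    return found
```

For the transcription "Who directed this show?", the only keyword in the table is "directed", so the function returns the single content type 'TV show'. A keyword mapped to several content types returns all of them.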

At least a portion of the environmental audio data is provided to a content recognition engine (208). For example, the disambiguation engine 104 provides at least the portion of the environmental audio data encoded by the waveform data 114 (e.g., the audio of the currently displayed TV program) to the content recognition engine 110. In some examples, the disambiguation engine 104 also provides to the content recognition engine 110 the particular content type (e.g., the 'TV show' content type) that is associated with the one or more keywords (e.g., "directed") in the transcription.

A content item that is output by the content recognition engine and that matches the particular content type is identified (210). For example, the content recognition engine 110 identifies a content item, or content item data, that is based on the environmental audio data (e.g., the audio of the currently displayed TV program) and that matches the particular content type (e.g., the 'TV show' content type).
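Steps (202) through (210) can be read as a simple pipeline. The sketch below wires the steps together with caller-supplied functions. All names, the tuple-based candidate format, and the choice of returning the first matching item are assumptions for illustration, not the specification's implementation.

```python
from typing import Callable, List, Optional, Tuple


def run_process_200(
    waveform: bytes,
    separate: Callable[[bytes], Tuple[bytes, bytes]],
    transcribe: Callable[[bytes], str],
    content_type_of: Callable[[str], Optional[str]],
    recognize: Callable[[bytes], List[Tuple[str, str]]],
) -> Optional[str]:
    # (202) Receive the audio and split the spoken query from the background audio.
    query_audio, environmental_audio = separate(waveform)
    # (204) Obtain a transcription of the spoken query.
    transcription = transcribe(query_audio)
    # (206) Determine the content type associated with keywords in the transcription.
    content_type = content_type_of(transcription)
    # (208) Provide the environmental audio to the content recognition step.
    candidates = recognize(environmental_audio)  # list of (item name, item type)
    # (210) Identify the first recognized item whose type matches the content type.
    for name, item_type in candidates:
        if item_type == content_type:
            return name
    return None
```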

FIGS. 3A and 3B depict portions 300a and 300b, respectively, of a system for identifying content item data. In particular, FIGS. 3A and 3B include disambiguation engines 304a and 304b, respectively, and content recognition engines 310a and 310b, respectively. The disambiguation engines 304a and 304b are similar to the disambiguation engine 104 of the system 100 shown in FIG. 1, and the content recognition engines 310a and 310b are similar to the content recognition engine 110 of the system 100 shown in FIG. 1.

FIG. 3A depicts the portion 300a, which includes the content recognition engine 310a. The content recognition engine 310a can identify, based on environmental data, a content item that matches a particular content type. In other words, the content recognition engine 310a appropriately processes the environmental data to identify content item data based on the environmental data, and further selects one or more of the identified content item data such that the selected content item data matches the particular content type.

Specifically, during operation (A), the disambiguation engine 304a provides the environmental data and the particular content type to the content recognition engine 310a. In some embodiments, the disambiguation engine 304a provides a portion of the environmental data to the content recognition engine 310a.

The content recognition engine 310a receives the environmental data and the particular content type from the disambiguation engine 304a. Then, during operation (B), the content recognition engine 310a identifies, based on the environmental data, content item data that matches the particular content type, and provides the identified content item data to the disambiguation engine 304a. Specifically, the content recognition engine 310a identifies content item data (e.g., the name of a TV show, the title of a song, etc.) based on the environmental data. The content recognition engine 310a then selects one or more of the identified content item data that match the particular content type. In other words, the content recognition engine 310a filters the identified content item data based on the particular content type. The content recognition engine 310a transmits the identified content item data to the disambiguation engine 304a (e.g., over a network).

In some examples, as described above with respect to FIG. 1, when the environmental data includes at least soundtrack audio associated with the currently displayed TV program, the content recognition engine 310a identifies content item data based on that soundtrack audio. The content recognition engine 310a then filters the identified content item data based on the 'TV show' content type. For example, the content recognition engine 310a identifies a 'theme song name' and a 'TV show name' associated with the soundtrack audio. The content recognition engine 310a then filters the identified content item data such that the remaining content item data also matches the 'TV show' content type. For example, the content recognition engine 310a selects the 'TV show name' identifying data and transmits the 'TV show name' identifying data to the disambiguation engine 304a.

In some examples, the content recognition engine 310a selects a corpus (or index) based on the content type (e.g., the 'TV show' content type). Specifically, the content recognition engine 310a can have access to a first index relating to the 'TV show' content type and a second index relating to a 'movie' content type. The content recognition engine 310a appropriately selects the first index based on the 'TV show' content type. Thus, by selecting the first index (and not selecting the second index), the content recognition engine 310a can more efficiently identify the content item data (e.g., the name of the TV show).
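Selecting an index by content type before searching can be sketched as a two-level lookup. The index contents, the string fingerprint keys, and the function name below are hypothetical. In practice a fingerprint lookup would be approximate rather than an exact dictionary match.

```python
from typing import Dict, Optional

# Hypothetical per-content-type indexes mapping a fingerprint key to an item name.
INDEXES: Dict[str, Dict[str, str]] = {
    "TV show": {"fp-001": "Example TV Show"},
    "movie": {"fp-001": "Example Movie"},
}


def identify(fingerprint: str, content_type: str) -> Optional[str]:
    """Search only the index that corresponds to the requested content type."""
    index = INDEXES.get(content_type)
    if index is None:
        return None  # no index exists for this content type
    return index.get(fingerprint)
```

Because only one index is consulted, the same fingerprint resolves to the TV show rather than the movie when the query's content type is 'TV show', and the search space is smaller.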

The disambiguation engine 304a receives the content item data from the content recognition engine 310a. For example, the disambiguation engine 304a receives the 'TV show name' identifying data from the content recognition engine 310a. The disambiguation engine 304a then provides, during operation (C), the identifying data to a third party (e.g., the mobile computing device 102 of FIG. 1). For example, the disambiguation engine 304a provides the 'TV show name' identifying data to the third party.

FIG. 3B depicts the portion 300b, which includes the content recognition engine 310b. The content recognition engine 310b can identify content item data based on environmental data. In other words, the content recognition engine 310b appropriately processes the environmental data to identify content item data based on the environmental data, and provides the content item data to the disambiguation engine 304b. The disambiguation engine 304b selects at least one of the identified content item data such that the selected content item data matches the particular content type.

Specifically, during operation (A), the disambiguation engine 304b provides the environmental data to the content recognition engine 310b. In some embodiments, the disambiguation engine 304b provides a portion of the environmental data to the content recognition engine 310b.

The content recognition engine 310b receives the environmental data from the disambiguation engine 304b. Then, during operation (B), the content recognition engine 310b identifies content item data based on the environmental data and provides the identified content item data to the disambiguation engine 304b. Specifically, the content recognition engine 310b identifies, based on the environmental data, content item data associated with two or more content items (e.g., the name of a TV show, the title of a song, etc.). The content recognition engine 310b transmits two or more candidates representing the identified content item data to the disambiguation engine 304b (e.g., over a network).

In some examples, as described above with respect to FIG. 1, when the environmental data includes at least soundtrack audio associated with the currently displayed TV program, the content recognition engine 310b identifies, based on that soundtrack audio, content item data relating to two or more content items. For example, the content recognition engine 310b identifies a 'theme song name' and a 'TV show name' associated with the soundtrack audio, and transmits the 'theme song name' and 'TV show name' identifying data to the disambiguation engine 304b.

The disambiguation engine 304b receives the two or more candidates from the content recognition engine 310b. For example, the disambiguation engine 304b receives the 'theme song name' and 'TV show name' candidates from the content recognition engine 310b. Then, during operation (C), the disambiguation engine 304b selects one of the two or more candidates based on a particular content type and provides the selected candidate to a third party (e.g., the mobile computing device 102 of FIG. 1). Specifically, as described above with respect to FIG. 1, the disambiguation engine 304b can previously receive the particular content type (e.g., the content type associated with the utterance). The disambiguation engine 304b selects the particular candidate, of the two or more candidates, that matches the particular content type. For example, the disambiguation engine 304b selects the 'TV show name' candidate because the 'TV show name' candidate matches the 'TV show' content type.

In some embodiments, the two or more candidates from the content recognition engine 310b are associated with ranking scores. A ranking score can be associated with any scoring metric determined by the disambiguation engine 304b. The disambiguation engine 304b can further adjust the ranking scores of the two or more candidates based on the particular content type. Specifically, the disambiguation engine 304b can increase the ranking score of one or more of the candidates when the respective candidate matches the particular content type. For example, the ranking score of the 'TV show name' candidate can be increased because it matches the 'TV show' content type. Furthermore, the disambiguation engine 304b can decrease the ranking score of one or more of the candidates when the respective candidate does not match the particular content type. For example, the ranking score of the 'theme song name' candidate can be decreased because it does not match the 'TV show' content type.

In some embodiments, the two or more candidates can be ranked based on the respective ranking scores as adjusted by the disambiguation engine 304b. For example, the disambiguation engine 304b can rank the 'TV show name' candidate above the 'theme song name' candidate because the 'TV show name' candidate has a higher adjusted ranking score than the 'theme song name' candidate. In some examples, the disambiguation engine 304b selects the highest-ranked candidate (i.e., the candidate having the highest adjusted ranking score).
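The score adjustment and re-ranking described above can be sketched as follows. The additive boost and penalty, their values, and the `Candidate` structure are assumptions chosen for illustration. The specification does not state how the scores are adjusted numerically.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    content_type: str
    score: float


def rerank(
    candidates: List[Candidate],
    wanted_type: str,
    boost: float = 0.2,     # hypothetical increase for a matching content type
    penalty: float = 0.2,   # hypothetical decrease for a non-matching content type
) -> List[Candidate]:
    """Return new candidates with adjusted scores, highest adjusted score first."""
    adjusted = [
        Candidate(
            c.name,
            c.content_type,
            c.score + boost if c.content_type == wanted_type else c.score - penalty,
        )
        for c in candidates
    ]
    return sorted(adjusted, key=lambda c: c.score, reverse=True)
```

With a 'theme song name' candidate scored 0.6 (type 'music') and a 'TV show name' candidate scored 0.5 (type 'TV show'), re-ranking for the 'TV show' content type moves the 'TV show name' candidate to the top even though it started with the lower score.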

FIG. 4 depicts a system 400 for identifying content item data based on environmental image data and a spoken natural language query. In short, the system 400 can identify, based on the environmental data, content item data that matches a particular content type associated with the spoken natural language query. The system 400 includes a mobile computing device 402, a disambiguation engine 404, a speech recognition engine 406, a keyword mapping engine 408, and a content recognition engine 410, which are similar to the mobile computing device 102, the disambiguation engine 104, the speech recognition engine 106, the keyword mapping engine 108, and the content recognition engine 110, respectively, of the system 100 shown in FIG. 1.

In some examples, the user 112 is looking at a CD album cover of a movie soundtrack. In the illustrated example, the user 112 would like to know which songs are on the soundtrack. In some examples, the user 112 may not know the name of the movie soundtrack, and may therefore ask "What songs are on this?" or "What songs were played in this movie?" The mobile computing device 402 detects this utterance, as well as environmental image data associated with the environment of the user 112.

In some examples, the environmental image data associated with the environment of the user 112 can include image data of the environment of the user 112. For example, the environmental image data includes an image of the CD album cover, which depicts images related to the movie (e.g., an image of the movie poster of the associated movie). In some examples, the mobile computing device 402 detects the environmental image data by using a camera of the mobile computing device 402 to capture an image (or video) of the CD album cover.

During operation (A), the mobile computing device 402 processes the detected utterance to generate waveform data 414 representing the detected utterance, and transmits the waveform data 414 and the environmental image data to the disambiguation engine 404 (e.g., over a network).

During operation (B), the disambiguation engine 404 receives the waveform data 414 and the environmental image data from the mobile computing device 402. The disambiguation engine 404 processes the waveform data 414 and transmits the utterance to the speech recognition engine 406 (e.g., over a network). In some examples, the utterance relates to a query (e.g., a query relating to the movie soundtrack).

The speech recognition engine 406 receives the utterance from the disambiguation engine 404. During operation (C), the speech recognition engine 406 obtains a transcription of the utterance and provides the transcription to the keyword mapping engine 408. Specifically, the speech recognition engine 406 processes the utterance received from the disambiguation engine 404 by generating a transcription of the utterance.

For example, the speech recognition engine 406 transcribes the utterance to generate the transcription "What songs are on this?" In some embodiments, the speech recognition engine 406 provides two or more transcriptions of the utterance. For example, the speech recognition engine 406 transcribes the utterance to generate the transcriptions "What songs are on this?" and "What sinks are on this?"

The keyword mapping engine 408 receives the transcription from the speech recognition engine 406. During operation (D), the keyword mapping engine 408 identifies one or more keywords in the transcription that are associated with a particular content type and provides the particular content type to the disambiguation engine 404.

For example, the keyword mapping engine 408 identifies the keyword "songs" in the transcription "What songs are on this?" The keyword "songs" is associated with the 'music' content type. In some implementations, a keyword of the transcription identified by the keyword mapping engine 408 is associated with two or more content types. For example, the keyword "songs" is associated with the 'music' and 'singer' content types. The keyword mapping engine 408 transmits the particular content type to the disambiguation engine 404 (e.g., over a network).

In some implementations, similar to that described above, the keyword mapping engine 408 identifies the one or more keywords in the transcription that are associated with the particular content type using one or more databases that, for each of multiple content types, map at least one keyword to at least one of the multiple content types. For example, the keyword mapping engine 408 uses one or more databases that map the keyword "songs" to the 'music' and 'singer' content types.
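The keyword lookup described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the in-memory dictionary stands in for the "one or more databases", the `song` entry is added for convenience, and the function name `content_types_for` is hypothetical rather than taken from this specification.

```python
import re

# Hypothetical stand-in for the keyword databases. The specification gives
# only the example of "songs" mapping to the 'music' and 'singer' types.
KEYWORD_TO_CONTENT_TYPES = {
    "songs": {"music", "singer"},
    "song": {"music", "singer"},  # assumed singular form, not in the source
}


def content_types_for(transcription):
    """Return every content type associated with a keyword in the transcription."""
    words = re.findall(r"[a-z']+", transcription.lower())
    content_types = set()
    for word in words:
        # Words that are not known keywords contribute nothing.
        content_types |= KEYWORD_TO_CONTENT_TYPES.get(word, set())
    return content_types
```

With this table, the alternative transcription "What sinks are on this?" yields no content type, because "sinks" is not a mapped keyword.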

The disambiguation engine 404 receives, from the keyword mapping engine 408, the particular content type associated with the transcription of the utterance. In addition, as described above, the disambiguation engine 404 receives the environmental image data associated with the utterance. During operation (E), the disambiguation engine 404 provides the environmental image data and the particular content type to the content recognition engine 410.

For example, the disambiguation engine 404 transmits to the content recognition engine 410 the environmental image data relating to the movie soundtrack (e.g., an image of the movie poster CD album cover) and the particular content type of the transcription of the utterance (e.g., the 'music' content type).

The content recognition engine 410 receives the environmental image data and the particular content type from the disambiguation engine 404. During operation (F), the content recognition engine 410 then identifies content item data that is based on the environmental image data and that matches the particular content type, and provides the identified content item data to the disambiguation engine 404. Specifically, the content recognition engine 410 appropriately processes the environmental image data to identify content item data (e.g., a name of a content item). In addition, the content recognition engine 410 matches the identified content item with the particular content type (e.g., the content type of the transcription of the utterance). The content recognition engine 410 transmits the identified content item data to the disambiguation engine 404 (e.g., over a network).

For example, the content recognition engine 410 identifies data that is based on the environmental image data relating to the image of the movie poster CD album cover and that further matches the 'music' content type.

In some examples, when the environmental image data includes at least a movie poster associated with the CD album cover, the content recognition engine 410 identifies content item data that is based on the movie poster associated with the CD album cover and that also matches the 'music' content type. Thus, in some examples, the content recognition engine 410 identifies content item data relating to the title of the movie soundtrack. For example, the content recognition engine 410 can determine that a particular content item (e.g., a specific movie soundtrack) is associated with the movie poster and that the particular content item matches the particular content type (e.g., the 'music' content type). Accordingly, the content recognition engine 410 can identify data (e.g., the title of the specific movie soundtrack) relating to a particular content item (e.g., the specific movie soundtrack) that is based on the environmental image data (e.g., the image of the CD album cover) and that further matches the particular content type (e.g., the 'music' content type).
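A minimal Python sketch of the matching step, assuming that image recognition has already produced a list of candidate content items, each labeled with its content types. The candidate structure, the titles, and the function name `match_content_item` are hypothetical and serve only to show how the content type narrows the candidates.

```python
def match_content_item(candidates, content_type):
    """Return the title of the first candidate that has the requested content type.

    `candidates` is an assumed structure: a list of dicts with a "title"
    string and a "types" set, as might be produced by an image recognizer.
    """
    for item in candidates:
        if content_type in item["types"]:
            return item["title"]
    return None  # no candidate matches the requested content type


# Illustrative candidates for a movie poster image (made-up titles).
poster_candidates = [
    {"title": "Example Movie", "types": {"movie"}},
    {"title": "Example Movie Soundtrack", "types": {"music"}},
]
```

Here the 'music' content type selects the soundtrack rather than the movie itself, which mirrors the example in the text.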

The disambiguation engine 404 receives the identified content item data from the content recognition engine 410. Then, at operation (G), the disambiguation engine 404 provides the identified content item data to the mobile computing device 402. For example, the disambiguation engine 404 transmits the identified content item data relating to the movie soundtrack (e.g., the title of the movie soundtrack) to the mobile computing device 402.

As mentioned above, FIGS. 1 to 4 illustrate several example processes within a computing environment that can identify media content (or other content) based on environmental information such as ambient noise. Other processes for identifying content can also be used. In general, FIGS. 5 and 6 illustrate an example process by which a computing environment can augment a spoken natural language query with context derived from environmental information, such as data identifying media content, in order to provide a more satisfying answer to the spoken natural language query.

In more detail, FIG. 5 depicts a system 500 for identifying one or more results based on environmental audio and an utterance. In some examples, the one or more results can represent one or more answers to a natural language query. The system 500 includes a mobile computing device 502, a coordination engine 504, a speech recognition engine 506, a content identification engine 508, and a natural language query processing engine 510. The mobile computing device 502 communicates with the coordination engine 504 over one or more networks. The mobile computing device 502 can include a microphone, a camera, or other detection mechanisms for detecting utterances from a user 512 and/or environmental data associated with the user 512.

Similar to the system 100 of FIG. 1, the user 512 is watching a television program. In the illustrated example, the user 512 wants to know who directed the television program (e.g., an entity) that is currently playing. In some examples, the user 512 may not know the name of the television program that is currently playing, and may therefore ask the question "Who directed this show?" The mobile computing device 502 detects this utterance, as well as environmental data associated with the environment of the user 512.

In some examples, the environmental data associated with the environment of the user 512 can include background noise of that environment. For example, the environmental data includes the sounds of the television program (e.g., the entity). In some examples, the environmental data associated with the currently displayed television program can include audio of the currently displayed television program (e.g., dialogue of the currently displayed television program, soundtrack audio associated with the currently displayed television program, etc.). In some examples, the environmental data can include environmental audio data, environmental image data, or both. In some examples, the mobile computing device 502 detects the environmental audio data after detecting the utterance, detects the environmental audio data concurrently with detecting the utterance, or both. During operation (A), the mobile computing device 502 processes the detected utterance and environmental data to generate waveform data 514 representing the detected utterance and the detected environmental audio data (e.g., the sounds of the television program), and transmits the waveform data 514 to the coordination engine 504 (e.g., over a network).

The coordination engine 504 receives the waveform data 514 from the mobile computing device 502. During operation (B), the coordination engine 504 processes the waveform data 514, including separating (or extracting) the utterance from other portions of the waveform data 514, and transmits the portion of the waveform data 514 corresponding to the utterance to the speech recognition engine 506 (e.g., over a network). For example, the coordination engine 504 separates the utterance ("Who directed this show?") from the background noise of the environment of the user 512 (e.g., the audio of the currently displayed television program). In some examples, the coordination engine 504 can use a voice detector to facilitate separating the utterance from the background noise by identifying a portion of the waveform data 514 that includes voice activity. In some examples, the utterance relates to a query (e.g., a query about the currently displayed television program).
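The specification does not say how the voice detector works. As one possible illustration, the Python sketch below uses a simple frame-energy threshold, which assumes that the spoken utterance is noticeably louder than the background audio. The function names, the frame length, and the threshold are all hypothetical, and a practical detector would be considerably more robust.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each full, non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def split_utterance(samples, frame_len=4, threshold=0.25):
    """Split samples into (utterance, background) using an energy threshold.

    Everything between the first and last frame whose energy exceeds the
    threshold is treated as the utterance; the rest is background.
    """
    samples = list(samples)
    energies = frame_energies(samples, frame_len)
    voiced = [i for i, energy in enumerate(energies) if energy > threshold]
    if not voiced:
        return [], samples  # no voice activity found
    start = voiced[0] * frame_len
    end = (voiced[-1] + 1) * frame_len
    return samples[start:end], samples[:start] + samples[end:]
```

The utterance portion would go to the speech recognizer, while the background portion could serve as the environmental data used for content identification.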

The speech recognition engine 506 receives the portion of the waveform data 514 corresponding to the utterance from the coordination engine 504. During operation (C), the speech recognition engine 506 obtains a transcription of the utterance and provides the transcription to the coordination engine 504. Specifically, the speech recognition engine 506 appropriately processes the portion of the waveform data 514 corresponding to the utterance that it received from the coordination engine 504. In some examples, processing that portion of the waveform data 514 by the speech recognition engine 506 includes generating a transcription of the utterance. Generating the transcription can include transcribing the utterance into text or text-related data. In other words, the speech recognition engine 506 can provide a representation of the language of the utterance in written form.

For example, the speech recognition engine 506 transcribes the utterance to generate the transcription "Who directed this show?" In some implementations, the speech recognition engine 506 provides two or more transcriptions of the utterance. For example, the speech recognition engine 506 can transcribe the utterance to generate the transcriptions "Who directed this show?" and "Who directed this shoe?"

The coordination engine 504 receives the transcription of the utterance from the speech recognition engine 506. Furthermore, as mentioned above, the coordination engine 504 receives the waveform data 514, including the environmental audio data associated with the utterance, from the mobile computing device 502. The coordination engine 504 then identifies an entity using the environmental data. Specifically, the coordination engine 504 obtains data that identifies the entity from the content identification engine 508. To do so, during operation (D), the coordination engine 504 provides the environmental data and the portion of the waveform data 514 corresponding to the utterance to the content identification engine 508 (e.g., over a network).

For example, the coordination engine 504 transmits to the content identification engine 508 the environmental data relating to the currently displayed television program (e.g., the entity), which includes audio of the currently displayed television program (e.g., dialogue of the currently displayed television program, soundtrack audio associated with the currently displayed television program, etc.), and the portion of the waveform data 514 corresponding to the utterance ("Who directed this show?").

In some implementations, the coordination engine 504 provides a portion of the environmental data to the content identification engine 508. In some examples, the portion of the environmental data can include background noise detected by the mobile computing device 502 after detecting the utterance. In some examples, the portion of the environmental data can include background noise detected by the mobile computing device 502 concurrently with detecting the utterance.

The content identification engine 508 receives the environmental data and the portion of the waveform data 514 corresponding to the utterance from the coordination engine 504. During operation (E), the content identification engine 508 identifies data that identifies the entity (e.g., content item data) based on the environmental data and the utterance, and provides the data identifying the entity to the coordination engine 504 (e.g., over a network). Specifically, the content identification engine 508 appropriately processes the environmental data and the portion of the waveform data 514 corresponding to the utterance to identify data that identifies the entity (e.g., content item data) associated with the environmental data (e.g., a name of a television show, a name of a song, etc.).

For example, the content identification engine 508 processes the environmental audio data to identify content item data associated with the currently displayed television program. In some implementations, the content identification engine 508 is the system 100 of FIG. 1.

The coordination engine 504 receives the data identifying the entity (e.g., the content item data) from the content identification engine 508. Furthermore, as mentioned above, the coordination engine 504 receives the transcription from the speech recognition engine 506. During operation (F), the coordination engine 504 then provides a query including the transcription and the data identifying the entity to the natural language query processing engine 510 (e.g., over a network). For example, the coordination engine 504 submits to the natural language query processing engine 510 a query that includes the transcription of the utterance ("Who directed this show?") and the content item data ('television show name').

In some examples, the coordination engine 504 generates the query. In some examples, the coordination engine 504 obtains the query (e.g., from a third-party server). For example, the coordination engine 504 can submit the transcription of the utterance and the data identifying the entity to the third-party server, and receive back a query based on the transcription and the data identifying the entity.

In some implementations, generating the query by the coordination engine 504 can include associating the transcription of the utterance with the data identifying the entity (e.g., the content item data). In some examples, associating the transcription with the content item data can include tagging the transcription with the data identifying the entity. For example, the coordination engine 504 can tag the transcription "Who directed this show?" with the 'television show name' or other identifying information associated with the content item data (e.g., an identification (ID) number). In some examples, associating the transcription with the data identifying the entity includes substituting a portion of the transcription with the data identifying the entity. For example, the coordination engine 504 can substitute a portion of the transcription "Who directed this show?" with the 'television show name' or data identifying the 'television show name'. In some examples, substituting a portion of the transcription with the data identifying the entity can include substituting one or more words of the transcription with the data identifying the entity. For example, the coordination engine 504 can substitute the 'television show name', or data identifying the 'television show name', into the transcription "Who directed this show?" The substitution can result in a transcription that includes "Who directed 'television show name'?" or "Who directed 'ID number'?"
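The two ways of associating the transcription with the entity, tagging and substitution, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the dictionary layout of the tagged query, and the choice of "this show" as the phrase to replace are hypothetical, since the specification does not say how the words to be replaced are selected.

```python
def tag_query(transcription, entity_id):
    """Tagging: keep the transcription unchanged and attach the entity data."""
    return {"transcription": transcription, "entity": entity_id}


def substitute_query(transcription, entity_id, placeholder="this show"):
    """Substitution: replace the referring phrase with the entity data.

    The `placeholder` phrase is an assumption made for this sketch. If it
    does not occur, the transcription is returned unchanged.
    """
    return transcription.replace(placeholder, "'" + entity_id + "'", 1)


tagged = tag_query("Who directed this show?", "television show name")
substituted = substitute_query("Who directed this show?", "television show name")
```

Tagging leaves interpretation of "this show" to the natural language query processing engine, whereas substitution produces a query that stands on its own.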

The natural language query processing engine 510 receives, from the coordination engine 504, the query including the transcription and the data identifying the entity (e.g., the content item data). During operation (G), the natural language query processing engine 510 appropriately processes the query and, based on the processing, provides one or more results to the coordination engine 504 (e.g., over a network). In other words, the coordination engine 504 obtains one or more results for the query (e.g., from the natural language query processing engine 510).

Specifically, the natural language query processing engine 510 obtains, from a collection of information resources, information resources that are relevant to the query (the transcription of the utterance and the content item data). In some examples, the natural language query processing engine 510 matches the query against database information (e.g., text documents, images, audio, video, etc.), and a score is computed for how well each object in the database matches the query. The natural language query processing engine 510 identifies the one or more results based on the matched objects (e.g., objects having a score above a threshold score).

For example, the natural language query processing engine 510 receives the query that includes the transcription of the utterance "Who directed this show?" and the 'television show name' (or other identifying information). The natural language query processing engine 510 matches the query against the database information and provides one or more results that match the query. The natural language query processing engine 510 computes a score for each of the matched objects.
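The specification does not state how the match score is computed. The Python sketch below uses a deliberately simple stand-in, the fraction of query terms that also occur in a document, to show the score-then-threshold pattern described above. The function names, the threshold value, and the sample documents (including the show title and director name) are made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def match_score(query_terms, document):
    """Fraction of the query terms that appear in the document."""
    if not query_terms:
        return 0.0
    return len(query_terms & set(tokenize(document))) / len(query_terms)


def results_above_threshold(query, documents, threshold=0.5):
    """Return documents scoring above the threshold, best match first."""
    query_terms = set(tokenize(query))
    scored = [(match_score(query_terms, doc), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored if score > threshold]


# Made-up documents for illustration only.
documents = [
    "Example Show was directed by Jane Doe",
    "Recipe for apple pie",
]
results = results_above_threshold("Who directed Example Show", documents)
```

Three of the four query terms occur in the first document, giving it a score of 0.75, while the second document scores 0 and is filtered out.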

The coordination engine 504 receives the one or more results from the natural language query processing engine 510. At operation (H), the coordination engine 504 then provides the one or more results to the mobile computing device 502 (e.g., over a network). For example, the coordination engine 504 transmits the one or more results (e.g., the name of the director of the television show) to the mobile computing device 502.

In some examples, one or more of the mobile computing device 502, the coordination engine 504, the speech recognition engine 506, the content identification engine 508, and the natural language query processing engine 510 can be in communication with a subset (or each) of the mobile computing device 502, the coordination engine 504, the speech recognition engine 506, the content identification engine 508, and the natural language query processing engine 510. In some implementations, one or more of the coordination engine 504, the speech recognition engine 506, the content identification engine 508, and the natural language query processing engine 510 can be implemented using one or more computing devices, such as one or more computing servers, a distributed computing system, or a server farm or cluster.

FIG. 6 depicts a flowchart of an example process 600 for identifying one or more results based on environmental data and an utterance. The example process 600 can be executed using one or more computing devices. For example, the mobile computing device 502, the coordination engine 504, the speech recognition engine 506, the content identification engine 508, and/or the natural language query processing engine 510 can be used to execute the example process 600.

Audio data that encodes an utterance and environmental data is received (602). For example, the coordination engine 504 receives the waveform data 514 from the mobile computing device 502. The waveform data 514 includes the user's utterance (e.g., "Who directed this show?") and the environmental data (e.g., audio of the currently displayed television program). In some examples, receiving the environmental data can include receiving environmental audio data, environmental image data, or both. In some examples, receiving the environmental data includes receiving additional audio data that includes background noise.

A transcription of the utterance is obtained (604). For example, the coordination engine 504 obtains the transcription of the utterance using the speech recognition engine 506. The speech recognition engine 506 transcribes the utterance to generate the transcription (e.g., "Who directed this show?").

An entity is identified using the environmental data (606). For example, the coordination engine 504 obtains data identifying the entity using the content identification engine 508. The content identification engine 508 can appropriately process the environmental data (e.g., the environmental audio data associated with the displayed television program) to identify data that identifies the entity (e.g., content item data) associated with the environmental data (e.g., a name of a television show, a title of a song, etc.). In some examples, the content identification engine 508 can further process the portion of the waveform data 514 corresponding to the utterance (concurrently with or subsequent to processing the environmental data) to identify the entity.

In some examples, the coordination engine 504 generates a query. In some examples, generating the query by the coordination engine 504 can include associating the transcription of the utterance with the data identifying the entity. In some examples, associating the transcription with the content item data can include substituting a portion of the transcription with the data identifying the entity. In some examples, substituting a portion of the transcription can include substituting one or more words of the transcription with the data identifying the entity.

The query is submitted to a natural language query processing engine (608). For example, the coordination engine 504 submits the query to the natural language query processing engine 510. The query can include at least a portion of the transcription and the data identifying the entity (e.g., the content item data). For example, the coordination engine 504 submits to the natural language query processing engine 510 a query that includes the transcription of the utterance ("Who directed this show?") and the content item data ('television show name').

One or more results for the query are obtained (610). For example, the coordination engine 504 obtains one or more results for the query (e.g., the name of the director of the television show) from the natural language query processing engine 510. In some examples, the coordination engine 504 then provides the one or more results to the mobile computing device 502.
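Steps 602 to 610 can be summarized in a short Python sketch. It assumes that the received waveform has already been split into an utterance part and an environment part, and that the three engines are supplied as plain callables. The function name `run_process_600`, the dictionary keys, and the callable parameters are hypothetical and not part of this specification.

```python
def run_process_600(waveform, transcribe, identify_entity, submit_query):
    """Sketch of process 600 from the coordination engine's point of view.

    `waveform` is the received audio data (602), assumed here to be a dict
    with "utterance" and "environment" parts. The three callables stand in
    for the speech recognition engine, the content identification engine,
    and the natural language query processing engine.
    """
    transcription = transcribe(waveform["utterance"])    # step 604
    entity = identify_entity(waveform["environment"])    # step 606
    query = {"transcription": transcription, "entity": entity}
    results = submit_query(query)                        # steps 608 and 610
    return results
```

Passing the engines in as callables keeps the sketch independent of how each engine is reached, for example over a network.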

FIG. 7 shows an example of a generic computing device 700 and a generic mobile computing device 750 with which the techniques described here may be used. Computing device 700 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, PDAs, servers, blade servers, mainframes, and other appropriate computers. Mobile computing device 750 is intended to represent various forms of mobile devices, such as PDAs, cellular telephones, smartphones, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit embodiments of the technology described or claimed in this specification.

Computing device 700 includes a processor 702, a memory 704, a storage device 706, a high-speed interface 708 connecting to the memory 704 and high-speed expansion ports 710, and a low-speed interface 712 connecting to a low-speed bus 714 and the storage device 706. Each of the components 702, 704, 706, 708, 710, and 712 is interconnected using various buses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 702 can process instructions for execution within the computing device 700, including instructions stored in the memory 704 or on the storage device 706 to display graphical information for a GUI on an external input/output device, such as a display 716 coupled to the high-speed interface 708. In other embodiments, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. In addition, multiple computing devices 700 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).

The memory 704 stores information within the computing device 700. In one embodiment, the memory 704 is a volatile memory unit or units. In another embodiment, the memory 704 is a non-volatile memory unit or units. The memory 704 may also be another form of computer-readable medium, such as a magnetic or optical disk.

The storage device 706 is capable of providing mass storage for the computing device 700. In one embodiment, the storage device 706 may be a floppy disk device, a hard disk device, an optical disk device, a tape device, a flash memory or other similar solid-state memory device, or an array of devices, including devices in a storage area network or other configurations. A computer program product can be tangibly embodied in an information carrier. The computer program product may also contain instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 704, the storage device 706, or memory on the processor 702.

The high-speed controller 708 manages bandwidth-intensive operations for the computing device 700, while the low-speed controller 712 manages lower bandwidth-intensive operations. This allocation of functions is exemplary only. In one embodiment, the high-speed controller 708 is coupled to the memory 704, to the display 716 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 710, which may accept various expansion cards (not shown). In some embodiments, the low-speed controller 712 is coupled to the storage device 706 and to the low-speed expansion port 714. The low-speed expansion port, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, or a scanner, or to a networking device such as a switch or router, e.g., through a network adapter.

The computing device 700 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 720, or multiple times in a group of such servers. It may also be implemented as part of a rack server system 724. In addition, it may be implemented in a personal computer such as a laptop computer 722. Alternatively, components from the computing device 700 may be combined with other components in a mobile device (not shown), such as device 750. Each of such devices may contain one or more of the computing devices 700, 750, and an entire system may be made up of multiple computing devices 700, 750 communicating with each other.

Computing device 750 includes a processor 752, a memory 764, an input/output device such as a display 754, a communication interface 766, and a transceiver 768, among other components. The device 750 may also be provided with a storage device, such as a microdrive or other device, to provide additional storage. Each of the components 750, 752, 764, 754, 766, and 768 is interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.

The processor 752 can execute instructions within the computing device 750, including instructions stored in the memory 764. The processor may be implemented as a chipset of chips that include separate and multiple analog and digital processors. The processor may provide, for example, for coordination of the other components of the device 750, such as control of user interfaces, applications run by the device 750, and wireless communication by the device 750.

The processor 752 may communicate with a user through a control interface 758 and a display interface 756 coupled to the display 754. The display 754 may be, for example, a TFT LCD (Thin-Film-Transistor Liquid Crystal Display) or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology. The display interface 756 may comprise appropriate circuitry for driving the display 754 to present graphical and other information to a user. The control interface 758 may receive commands from a user and convert them for submission to the processor 752. In addition, an expansion interface 762 may be provided in communication with the processor 752, so as to enable near-area communication of the device 750 with other devices. The expansion interface 762 may provide, for example, for wired communication in some embodiments, or for wireless communication in other embodiments, and multiple interfaces may also be used.

The memory 764 stores information within the computing device 750. The memory 764 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units. Expansion memory 774 may also be provided and connected to the device 750 through an expansion interface 774, which may include, for example, a SIMM (Single In Line Memory Module) card interface. Such expansion memory 774 may provide extra storage space for the device 750, or may also store applications or other information for the device 750. Specifically, the expansion memory 774 may include instructions to carry out or supplement the processes described above, and may also include secure information. Thus, for example, the expansion memory 774 may be provided as a security module for the device 750, and may be programmed with instructions that permit secure use of the device 750. In addition, secure applications may be provided via the SIMM card, along with additional information, such as identifying information placed on the SIMM card in a non-hackable manner.

The memory may include, for example, flash memory and/or NVRAM memory, as discussed below. In one embodiment, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 764, the expansion memory 774, memory on the processor 752, or a propagated signal that may be received, for example, over the transceiver 768 or the expansion interface 762.

The device 750 may communicate wirelessly through the communication interface 766, which may include digital signal processing circuitry where necessary. The communication interface 766 may provide for communications under various modes or protocols, such as GSM voice calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS, among others. Such communication may occur, for example, through the radio-frequency transceiver 768. In addition, short-range communication may occur, such as using a Bluetooth, WiFi, or other such transceiver (not shown). In addition, a GPS (Global Positioning System) receiver module 770 may provide additional navigation- and location-related wireless data to the device 750, which may be used as appropriate by applications running on the device 750.

The device 750 may also communicate audibly using an audio codec 760, which may receive spoken information from a user and convert it to usable digital information. The audio codec 760 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of the device 750. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.), and may also include sound generated by applications operating on the device 750.

The computing device 750 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 780. It may also be implemented as part of a smartphone 782, a PDA, or other similar mobile device.

Various embodiments of the systems and methods described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These embodiments can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software, software applications, or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any apparatus and/or device (e.g., magnetic disks, optical disks, memory, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and methods described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback (e.g., visual, auditory, or tactile feedback), and input from the user can be received in any form, including acoustic, speech, or tactile input.

The systems and methods described here can be implemented in a computing system that includes a back-end component (e.g., a data server), a middleware component (e.g., an application server), a front-end component (e.g., a client computer having a graphical user interface (GUI) or a web browser through which a user can interact with an embodiment of the systems and methods described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), and the Internet.

The computing system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

While this specification contains many specifics, these should not be construed as limitations on the scope of the disclosure or of what may be claimed, but rather as descriptions of features of example embodiments of the disclosure. Certain features that are described in this specification in the context of separate embodiments can also be provided in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be provided in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Thus, particular embodiments of the invention have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. A number of embodiments have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. For example, the various forms of the flows shown above may be used with steps reordered, added, or removed. Accordingly, other embodiments are within the scope of the following claims.

104: disambiguation engine
106: speech recognition engine
108: keyword mapping engine
110: content recognition engine

Claims

A computer-implemented method comprising:
generating, by a mobile device, an audio recording of (i) a question about an unidentified media content item that another device is playing near the mobile device and (ii) environmental audio, the question being a natural language query uttered by a user;
in response to forwarding the audio recording to a front-end server of a natural language processing system, receiving an answer to the question, the answer being based on processing of different parts of the audio recording by a speech recognition engine server associated with the natural language processing system and a content identification engine server associated with the natural language processing system; and
providing, by the mobile device and in response to the question, the answer to the question about the unidentified media content item.

The method of claim 1, further comprising:
identifying, by the mobile device, one or more keywords corresponding to the question;
associating, by the mobile device, the one or more keywords with one or more types of media content; and
providing, by the mobile device, the answer based on the question and the one or more types of media content.

The method of claim 2, wherein the one or more types of media content include at least one of: movies, music, television shows, audio podcasts, images, artwork, books, magazines, trailers, videos, podcasts, internet videos, and video games.

The method of claim 2, wherein providing the answer based on the one or more types of media content comprises:
identifying two or more candidate answers to the question;
generating a ranking score for each of the two or more candidate answers, the ranking scores being based on the one or more types of media content; and
providing the answer based on the question and the ranking scores.

The method of claim 1, further comprising streaming the environmental audio.

The method of claim 1, wherein the speech recognition engine server associated with the natural language processing system and the content identification engine server associated with the natural language processing system are the same server.

The method of claim 1, further comprising:
detecting, by the mobile device, environmental image data associated with the media content item; and
providing, by the mobile device, the answer based on the question and the environmental image data.

The method of claim 7, further comprising:
identifying, by the mobile device, one or more types of media content based on the environmental image data; and
providing, by the mobile device, the answer based on the question, the environmental image data, and the one or more types of media content.

A system comprising:
one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising:
generating, by a mobile device, an audio recording of (i) a question about an unidentified media content item that another device is playing near the mobile device and (ii) environmental audio, the question being a natural language query uttered by a user;
in response to forwarding the audio recording to a front-end server of a natural language processing system, receiving an answer to the question, the answer being based on processing of different parts of the audio recording by a speech recognition engine server associated with the natural language processing system and a content identification engine server associated with the natural language processing system; and
providing, by the mobile device and in response to the question, the answer to the question about the unidentified media content item.

The system of claim 9, wherein the operations further comprise:
identifying one or more keywords corresponding to the question;
associating the one or more keywords with one or more types of media content; and
providing the answer based on the question and the one or more types of media content.

The system of claim 10, wherein the one or more types of media content include at least one of: movies, music, television shows, audio podcasts, images, artwork, books, magazines, trailers, videos, podcasts, internet videos, and video games.

The system of claim 10, wherein providing the answer based on the one or more types of media content further comprises:
identifying two or more candidate answers to the question;
generating a ranking score for each of the two or more candidate answers, the ranking scores being based on the one or more types of media content; and
providing the answer based on the question and the ranking scores.

The system of claim 9, wherein the operations further comprise streaming the environmental audio.

The system of claim 9, wherein the speech recognition engine server associated with the natural language processing system and the content identification engine server associated with the natural language processing system are the same server.

The system of claim 9, wherein the operations further comprise:
detecting environmental image data associated with the media content item; and
providing the answer based on the question and the environmental image data.

The system of claim 15, wherein the operations further comprise:
identifying one or more types of media content based on the environmental image data; and
providing the answer based on the question, the environmental image data, and the one or more types of media content.

A non-transitory computer-readable medium storing software executable by one or more computers, the software comprising instructions that, when executed, cause the one or more computers to perform operations comprising:
generating an audio recording of (i) a question about an unidentified media content item that another device is playing near a mobile device and (ii) environmental audio, the question being a natural language query uttered by a user;
in response to forwarding the audio recording to a front-end server of a natural language processing system, receiving an answer to the question, the answer being based on processing of different parts of the audio recording by a speech recognition engine server associated with the natural language processing system and a content identification engine server associated with the natural language processing system; and
providing, in response to the question, the answer to the question about the unidentified media content item.

The non-transitory computer-readable medium of claim 17, wherein the operations further comprise:
identifying one or more keywords corresponding to the question;
associating the one or more keywords with one or more types of media content; and
providing the answer based on the question and the one or more types of media content.

The non-transitory computer-readable medium of claim 17, wherein the operations further comprise streaming the environmental audio.

The non-transitory computer-readable medium of claim 17, wherein the operations further comprise:
detecting environmental image data associated with the media content item; and
providing the answer based on the question and the environmental image data.