JP7402322B2

JP7402322B2 - information processing system

Info

Publication number: JP7402322B2
Application number: JP2022521806A
Authority: JP
Inventors: 貴則野村
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2020-05-15
Filing date: 2021-04-23
Publication date: 2023-12-20
Anticipated expiration: 2041-04-23
Also published as: WO2021230048A1; JPWO2021230048A1

Description

One aspect of the present invention relates to an information processing system.

Patent Document 1 describes an image forming apparatus in which a command corresponding to a character string converted from an input audio signal is generated in each of the image forming apparatus and a mobile terminal device, and the image forming apparatus executes the command when the commands generated by the image forming apparatus and the mobile terminal device match.

Japanese Unexamined Patent Publication No. 2019-74608

In recent years, a technology has become known in which, for example, a terminal worn by a user executes, in response to voice input by the user, processing related to an image displayed on the terminal and processing different from the processing related to the image. However, when the content of the voice could apply both to the image-related processing and to the other processing, it can be difficult to determine from the voice alone which processing the user is requesting. In that case, for example, processing different from the image-related processing may be executed even though the user spoke with the intention of having image-related processing performed.

One aspect of the present invention has been made in view of the above circumstances, and relates to an information processing system that can perform appropriate processing in accordance with a user's request.

An information processing system according to one aspect of the present invention includes: an acquisition unit that acquires an image that is displayed on a terminal worn by a user and thereby viewed by the user, line-of-sight information of the user, and a user voice, which is a voice uttered by the user; a gesture recognition unit that recognizes a gesture of the user shown in the image acquired by the acquisition unit; a voice recognition unit that recognizes the user voice acquired by the acquisition unit; a determination unit that determines, based on the line-of-sight information and the recognition result of the gesture recognition unit, which to apply of a first mode, in which a first process related to the image is executed in accordance with the user voice recognized by the voice recognition unit, and a second mode, in which a second process different from the process related to the image is executed in accordance with the user voice recognized by the voice recognition unit; and a processing execution unit that executes the process of the first mode or the second mode that the determination unit has determined to apply.

In the information processing system according to one aspect of the present invention, a user voice, an image that is displayed on the terminal and thereby viewed by the user, and line-of-sight information of the user are acquired, and the user's gesture and the user voice are recognized. Which of the first mode and the second mode to apply is then determined based on the user's line-of-sight information and the gesture recognition result. The first mode is a mode in which a first process related to the image is executed in accordance with the user voice. The second mode is a mode in which a second process different from the process related to the image is executed in accordance with the user voice. If, for example, the information processing system decided between the first mode and the second mode by voice recognition alone, the system would recognize the user's voice but might have difficulty determining which process the voice relates to. In that case, a process different from the image-related process might be executed even though the voice relates to the image-related process. In contrast, the information processing system according to one aspect of the present invention determines whether to apply the mode in which the image-related process is executed or the mode in which a process other than the image-related process is executed based on the user's line-of-sight information and gesture, which are considered to reflect the user's intention. The system can therefore perform appropriate processing in accordance with the user's request.

According to the present invention, appropriate processing can be performed in accordance with a user's request.

FIG. 1 is a diagram illustrating an overview of an information processing system according to the present embodiment.
FIG. 2 is a block diagram showing the functional configuration of the object information server of the information processing system shown in FIG. 1.
FIG. 3 is a diagram illustrating an example of information display by the information processing system.
FIG. 4 is a diagram illustrating an example of information display by the information processing system.
FIG. 5 is a diagram illustrating an example of information display by the information processing system.
FIG. 6 is a diagram illustrating an example of information display by the information processing system.
FIG. 7 is a diagram illustrating an example of information display by the information processing system.
FIG. 8 is a sequence diagram showing processing performed by the information processing system.
FIG. 9 is a diagram showing the hardware configuration of the smart glasses, the object information server, and the voice recognition server included in the information processing system.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the description of the drawings, the same or equivalent elements are denoted by the same reference numerals, and redundant description is omitted.

FIG. 1 is a diagram illustrating an overview of an information processing system 1 according to the present embodiment. FIG. 2 is a block diagram showing the functional configuration of the information processing system. The information processing system 1 performs various kinds of information processing on smart glasses (terminal) 2 worn by a user, in accordance with the processing requested by the user. In the information processing system 1 according to the present embodiment, an image related to the processing requested by the user is displayed on the smart glasses 2. As shown in FIG. 1, the information processing system 1 includes the smart glasses 2, an object information server 10 (specifying unit, storage unit), and a voice recognition server 50 (voice recognition unit). In the information processing system 1, the smart glasses 2, the object information server 10, and the voice recognition server 50 are configured to be able to communicate with one another.

In the information processing system 1, the smart glasses 2 execute processing according to the user voice, taking into account the processing results of the object information server 10 and the voice recognition server 50, and display information generated according to the user voice. As an example, the object information server 10 identifies a target object, which is the object subject to the processing according to the user voice (first process), based on the range indicated by a user gesture (second gesture) recognized in a captured image captured by the smart glasses 2. The target object may be further identified (narrowed down) based on the result of recognizing the user voice by the voice recognition server 50. In the information processing system 1, the voice recognition server 50 recognizes the user voice. The smart glasses 2 then execute, on the target object identified by the object information server 10, processing corresponding to the processing content included in the user voice recognized by the voice recognition server 50. For example, in accordance with the user voice, the smart glasses 2 generate a superimposed image in which information about the target object is superimposed in association with the target object, and display the superimposed image on the screen.

FIG. 3 shows an example of an image P1 captured by the smart glasses 2. Objects such as a signboard H1 and a chair H2 appear in the image P1. In this case, the smart glasses 2 recognize a gesture HJ2 (second gesture), which is a hand gesture of the user. Based on the range indicated by the gesture HJ2, the object information server 10 identifies as target objects, among the objects included in the image P1, the signboard H1 and the chair H2, which are, for example, objects whose regions overlap the gesture HJ2 or that lie within a predetermined range from the gesture HJ2. Furthermore, when the voice recognition server 50 recognizes the user voice "signboard, information display", the smart glasses 2 narrow the target object candidates identified by the object information server 10 down to the signboard H1 alone. To display information about the signboard H1 (presentation information I, which is the information presented by the signboard H1), the smart glasses 2 generate an image P2 in which the presentation information I is superimposed on the image P1 (more specifically, in which the presentation information I is displayed in association with the signboard H1, the target object), and display the image P2 on the screen.

Through the above processing, in the information processing system 1, the processing requested by the user (specifically, the image-related processing that the user requested by voice) is executed on the smart glasses 2 worn by the user, and the processed image is displayed on the screen. Although FIGS. 1 and 2 show a single pair of smart glasses 2, there may be a plurality of smart glasses 2.

Returning to FIG. 1, the voice recognition server 50 functions as a voice recognition unit that recognizes the user voice. The voice recognition server 50 only needs to have a function of recognizing the user voice and converting it into a character string, and does not need to have functions such as identifying the user from the user voice. The voice recognition server 50 may use well-known voice recognition technology. The voice recognition server 50 transmits the voice recognition result (that is, the information obtained by converting the user voice into a character string) to the object information server 10. In the present embodiment, the object information server 10 is described as acquiring the voice recognition result from the voice recognition server 50, but the object information server 10 itself may, for example, function as the voice recognition unit that recognizes the user voice.

The object information server 10 is a server that identifies the target object and related information based on information acquired from the smart glasses 2 and the voice recognition server 50, and provides the identified information to the smart glasses 2. The target object is an object that is included in the captured image acquired from the smart glasses 2 and that is subject to the first process related to the image.

The object information server 10 stores the various kinds of information acquired from the smart glasses 2 and the voice recognition server 50, namely captured images, user voice recognition results, positioning results of the smart glasses 2, and the like. The object information server 10 also functions as a storage unit that stores, in advance, object information on a plurality of objects. Object information is information on physical objects that exist in real space. In the object information, for example, the following are stored in association with one another for each of the plurality of objects: an object ID, which is information that indicates (uniquely identifies) the object; type information, which is information that specifies the type of the object; position information on where the object exists; an image of the object; and detailed information on the object (information about the object). The type information may include the name of the object. Only some of the above items may be stored in association with one another in the object information. For example, only the type information, the image of the object, and the detailed information on the object may be stored in association with one another.
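The object information described above can be pictured as one record per object. The following is a minimal sketch in Python, not the storage format used by the object information server 10: the class names `ObjectInfo` and `ObjectStore`, the field names, and the latitude/longitude encoding of the position are all assumptions made for illustration. Every field except the object ID is optional, reflecting the statement that only some of the items may be stored.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class ObjectInfo:
    """One hypothetical object-information record (field names are assumptions)."""
    object_id: str                                  # uniquely identifies the object
    type_info: Optional[str] = None                 # type of the object; may include its name
    position: Optional[Tuple[float, float]] = None  # assumed (latitude, longitude) encoding
    image_ref: Optional[str] = None                 # reference to the stored image of the object
    details: Dict[str, str] = field(default_factory=dict)  # detailed information items


class ObjectStore:
    """In-memory stand-in for the storage unit; a real server would persist records."""

    def __init__(self) -> None:
        self._records: Dict[str, ObjectInfo] = {}

    def register(self, info: ObjectInfo) -> None:
        self._records[info.object_id] = info

    def get(self, object_id: str) -> Optional[ObjectInfo]:
        return self._records.get(object_id)


store = ObjectStore()
# A record that stores only some of the items (no position, no image).
store.register(ObjectInfo(object_id="H1", type_info="signboard",
                          details={"store_name": "XXXX"}))
```

Keeping the detailed information as a free-form mapping is one possible design choice; it lets a signboard and a product carry different items without changing the record type.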

The detailed information on an object is, for example, information about the content of the object. When the object is a store signboard, for example, the detailed information includes the name of the store, the store's business hours, the names of the products sold or provided at the store, the prices of the products and services sold at the store, the store's telephone number, the store's URL, and the like. When the object is a product itself, the detailed information on the object includes, for example, the fee for the product, the price of the product, the specifications of the product, a URL at which the product is described, and the like.

Each item of the detailed information on an object may be linked to a user voice. That is, the type of user voice and the items of detailed information may be linked to each other. For example, when the object is a store signboard, a user voice containing "store" may be linked to information about the store itself (the store's business hours, the store's telephone number, and the like), a user voice containing "(product name)" may be linked to information about the store's products (the price of the product, the specifications of the product, and the like), and the user voice "information display" may be linked to all items of the detailed information. The object information server 10 may also store, as object information, information on objects that exist in a virtual space.
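The linkage between types of user voice and items of detailed information can be sketched as a simple lookup. The snippet below is an illustrative assumption, not the method of the embodiment: the function name `select_detail_items`, the key names, and the substring matching on an English rendering of the utterance are hypothetical, and the business hours and telephone number are placeholder values (only the Bolognese price appears in the source).

```python
from typing import Dict, List

# Hypothetical detail items for a store signboard (placeholder values except the price).
SIGNBOARD_DETAILS: Dict[str, str] = {
    "business_hours": "11:00-22:00",   # placeholder
    "phone": "000-0000-0000",          # placeholder
    "bolognese_price": "1,000 yen",
}
STORE_KEYS: List[str] = ["business_hours", "phone"]            # items about the store itself
PRODUCT_KEYS: Dict[str, List[str]] = {"bolognese": ["bolognese_price"]}  # items per product name


def select_detail_items(utterance: str,
                        details: Dict[str, str],
                        store_keys: List[str],
                        product_keys: Dict[str, List[str]]) -> Dict[str, str]:
    """Return the detail items linked to the recognized utterance."""
    text = utterance.lower()
    if "information display" in text:
        return dict(details)                      # linked to every item
    selected: Dict[str, str] = {}
    if "store" in text:                           # information about the store itself
        selected.update({k: details[k] for k in store_keys if k in details})
    for product_name, keys in product_keys.items():
        if product_name in text:                  # information about a named product
            selected.update({k: details[k] for k in keys if k in details})
    return selected


store_items = select_detail_items("store", SIGNBOARD_DETAILS, STORE_KEYS, PRODUCT_KEYS)
```

An utterance that matches none of the linked types yields an empty selection in this sketch; how the actual system handles that case is not stated in the source.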

The object information server 10 functions as a specifying unit that identifies the target object. The object information server 10 identifies the target object based on the range indicated by the user's gesture (second gesture; the gesture HJ2 shown in FIG. 3) in the captured image acquired from the smart glasses 2. The range indicated by the user's gesture (second gesture) is a designated range that the user designates in the captured image, for example a range that overlaps the gesture or is close to the gesture. The object information server 10 acquires, for example, information on the range (designated range) indicated by the gesture (second gesture) from the smart glasses 2. The object information server 10 may acquire from the smart glasses 2 only the portion of the captured image within the range indicated by the user's gesture. Among the objects included in the captured image, the object information server 10 identifies as a target object, for example, an object whose region overlaps the gesture or that lies within a predetermined range from the gesture. The object information server 10 identifies the target object by using well-known image recognition processing. For example, the object information server 10 identifies the target object by matching the image of each object included in the stored object information against the image of the range (designated range) indicated by the user's gesture. In this case, the object information server 10 may compare the positioning result of the smart glasses 2 with the position information of the objects included in the object information, and match against the image of the user's designated range only the images of objects located near the smart glasses 2. Once the target object is identified, the object information server 10 may identify the name of the target object based on the object information.

A method for identifying the target object will be described with reference to FIG. 5. Assume that the smart glasses 2 have determined that the inside of a frame F is the range indicated by the gesture HJ2 (designated range A). In this case, the object information server 10 identifies the target objects within the designated range A based on the stored object information. Specifically, the object information server 10 identifies the target objects, for example, by matching the images of objects included in the object information against the portion of the captured image that corresponds to the designated range A. In the example shown in FIG. 5, the object information server 10 identifies the signboard H1 and the chair H2 as target objects. The object information server 10 transmits to the smart glasses 2 information indicating each object identified as a target object (information that allows the smart glasses 2 to determine which object is the target object), in association with the name of that object.
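The geometric part of this selection (an object qualifies if its region overlaps the designated range or lies within a predetermined range of it) can be sketched as follows. This is a simplified illustration under stated assumptions: objects are taken to be already detected as axis-aligned bounding boxes in pixel coordinates, and `Box`, `gap`, `find_candidates`, and the `margin` parameter are hypothetical names. The image matching against stored object images that the embodiment describes is not reproduced here.

```python
from typing import Dict, List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in image (pixel) coordinates."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def gap(a: Box, b: Box) -> float:
    """Shortest distance between two rectangles; 0.0 if they overlap or touch."""
    dx = max(a.x_min - b.x_max, b.x_min - a.x_max, 0.0)
    dy = max(a.y_min - b.y_max, b.y_min - a.y_max, 0.0)
    return (dx * dx + dy * dy) ** 0.5


def find_candidates(designated: Box, objects: Dict[str, Box], margin: float) -> List[str]:
    """Return names of objects that overlap the designated range or lie within `margin` of it."""
    return [name for name, box in objects.items() if gap(designated, box) <= margin]


designated_range_a = Box(100, 100, 300, 300)      # stands in for the inside of frame F
detected = {
    "signboard": Box(150, 50, 250, 150),          # overlaps the designated range
    "chair": Box(310, 200, 400, 300),             # 10 px to the right of it
    "tree": Box(500, 500, 600, 600),              # far away
}
candidates = find_candidates(designated_range_a, detected, margin=20)
```

With these made-up coordinates, the signboard and the chair are selected and the tree is not, mirroring the FIG. 5 example in which the signboard H1 and the chair H2 become target objects.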

When the user voice recognized by the voice recognition server 50 includes information indicating an object, the object information server 10 may identify as the target object, among the objects within the range indicated by the above-described gesture (second gesture), the object included in the user voice. For example, as shown in FIG. 6, when the voice recognition server 50 has recognized the user voice "signboard", the object information server 10 may identify only the signboard H1 as the target object out of the signboard H1 and the chair H2, which are the target object candidates. The object information server 10 transmits to the smart glasses 2 information indicating the object identified as the target object (information that allows the smart glasses 2 to determine which object is the target object).
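The narrowing step can be sketched as a filter over the candidates using the recognized text. The snippet is a minimal illustration: `narrow_by_voice` is a hypothetical name, candidates are assumed to be given as a mapping from an identifier to an object name, and plain substring matching stands in for whatever matching the actual system performs. Returning all candidates when no object name is spoken is also an assumption.

```python
from typing import Dict, List


def narrow_by_voice(candidates: Dict[str, str], recognized_text: str) -> List[str]:
    """Keep only candidates whose name occurs in the recognized text.

    If no candidate name is mentioned, all candidates are kept (an assumption).
    """
    text = recognized_text.lower()
    named = [obj_id for obj_id, name in candidates.items() if name.lower() in text]
    return named if named else list(candidates)


candidates = {"H1": "signboard", "H2": "chair"}   # candidates from the gesture range
targets = narrow_by_voice(candidates, "signboard, information display")
```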

For the identified target object, the object information server 10 further identifies the detailed information on the object (information about the object) based on the stored object information. For example, when the user voice recognized by the voice recognition server 50 includes processing content for the first process (specifically, display of the detailed information on the target object), the object information server 10 identifies the detailed information on the target object based on the stored object information. For example, when the signboard H1 has been identified as the target object as shown in FIG. 6 and the voice recognition server 50 recognizes the user voice "information display", the object information server 10 identifies the detailed information on the signboard H1 as presentation information based on the stored object information. In the example shown in FIG. 6, the object information server 10 identifies, as presentation information I, the name of the store ("XXXX") and the product names and prices ("Bolognese: 1,000 yen; Genovese: 1,100 yen; Margherita: 800 yen"). The object information server 10 transmits the detailed information on the target object identified as presentation information to the smart glasses 2.

When the user voice recognized by the voice recognition server 50 includes, for example, an instruction to change the display mode, the object information server 10 transmits a processing request corresponding to that instruction to the smart glasses 2. Specifically, when the voice recognition server 50 recognizes the user voice "enlarged display", for example, the object information server 10 transmits to the smart glasses 2 a request to enlarge the display of the presentation information.

The smart glasses 2 are a goggle-type wearable device worn by a user, and are a terminal configured to perform wireless communication. The smart glasses 2 are configured to be able to display an image that is viewed by the user. The smart glasses 2 have an imaging function and, for example, display a captured image in real time. Although the smart glasses 2 are described in the present embodiment as displaying a captured image, the smart glasses 2 may display an image other than a captured image. The smart glasses 2 also have a function of acquiring line-of-sight information of the wearing user and a function of acquiring the voice uttered by the wearing user (user voice).

The smart glasses 2 may perform positioning by themselves or by communicating with another server (not shown). In the present embodiment, the smart glasses 2 are described as performing positioning. The positioning method of the smart glasses 2 is not limited, and may be GPS (Global Positioning System) positioning, base station positioning, or positioning performed by matching a captured image against map data stored in another server (not shown). The smart glasses 2 continuously transmit positioning results to the object information server 10.

As shown in FIG. 2, the smart glasses 2 include an acquisition unit 21, a gesture recognition unit 22, a determination unit 23, and a generation unit 24 and an output unit 25 (processing execution unit). The smart glasses 2 may further include a storage unit (not shown) that stores the various kinds of information acquired by the acquisition unit 21.

The acquisition unit 21 acquires the captured image viewed by the user, the user's line-of-sight information, and the user voice. As described above, the captured image is an image captured by the smart glasses 2, and is displayed on the screen of the smart glasses 2 and viewed by the user. The user's line-of-sight information is information on the line of sight of the user wearing the smart glasses 2. The user voice is the voice uttered by the user wearing the smart glasses 2. The acquisition unit 21 transmits the acquired user voice to the voice recognition server 50.

The gesture recognition unit 22 recognizes the user's gesture shown in the captured image acquired by the acquisition unit 21. In the present embodiment, the gesture recognition unit 22 recognizes the user's gesture by using, for example, well-known image recognition technology. The gesture recognition unit 22 recognizes a first gesture, which is predetermined as the gesture related to the first process. The first gesture is the gesture used in determining whether to execute the first process related to the image or the second process different from the process related to the image. The first gesture is, for example, a hand gesture in which the user clenches a fist (the gesture HJ1 shown in FIG. 4).

The gesture recognition unit 22 further recognizes a second gesture that is predetermined as a gesture indicating a range in which the target object can be included. The second gesture is, for example, a series of hand movements in which the user's hand changes from a clenched fist to an open hand (gesture HJ2 shown in FIG. 5). The second gesture expresses the range that includes the target object by how far the fist is opened. That is, as shown in FIG. 5, when the gesture recognition unit 22 recognizes gesture HJ2, in which the user gradually opens a clenched fist, it specifies the range indicated by gesture HJ2 (the range that includes the target object) according to the degree to which the fist is opened. The range specified by the gesture recognition unit 22 is shown as a "frame F" (see FIG. 5) in the image generated by the generation unit 24, which is described later. Note that the first gesture and the second gesture may be other hand gestures, or gestures of other parts of the user's body. When the gesture recognition unit 22 recognizes the second gesture, it transmits the captured image in which the second gesture was recognized, together with information on the range indicated by the second gesture, to the object information server 10. Note that the gesture recognition unit 22 may transmit only the part of the captured image within the range indicated by the second gesture to the object information server 10.
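The description does not say how the degree of opening is converted into frame F. The following is only a minimal sketch under stated assumptions: the openness is already available as a number between 0.0 (clenched) and 1.0 (fully open), the frame is a square centered on the hand position, and the names `Frame`, `frame_from_openness`, and `max_half_size` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """Axis-aligned frame in pixel coordinates (hypothetical structure)."""
    left: int
    top: int
    right: int
    bottom: int


def frame_from_openness(center_x, center_y, openness,
                        image_w, image_h, max_half_size=200):
    """Map fist openness (0.0 clenched .. 1.0 open) to a square frame.

    Assumption: the frame grows linearly with openness around the hand
    position and is clipped to the bounds of the captured image.
    """
    openness = min(max(openness, 0.0), 1.0)
    half = round(openness * max_half_size)
    return Frame(
        left=max(center_x - half, 0),
        top=max(center_y - half, 0),
        right=min(center_x + half, image_w),
        bottom=min(center_y + half, image_h),
    )
```

With such a mapping, a wider opening of the hand yields a larger frame, so more objects in the captured image fall inside the range sent to the object information server 10.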

The determining unit 23 determines which of a first mode and a second mode to apply, based on the user's line-of-sight information and the recognition result of the gesture recognition unit 22. The first mode is a mode in which the first process related to the image is executed in response to the user voice recognized by the voice recognition server 50. The second mode is a mode in which a second process different from the image-related process is executed in response to the user voice recognized by the voice recognition server 50.

Specifically, based on the line-of-sight information and the recognition result of the first gesture by the gesture recognition unit 22, the determining unit 23 determines whether the user is gazing at the first gesture shown in the captured image. It decides to apply the first mode when the user is gazing at the first gesture, and decides to apply the second mode when the user is not. That is, the determining unit 23 first determines whether the first gesture has been recognized by the gesture recognition unit 22. Then, when the first gesture has been recognized, the determining unit 23 determines, based on the line-of-sight information, whether the user is gazing at the first gesture in the captured image. The determining unit 23 determines that the user is gazing at the first gesture when the deviation of the user's line of sight from the first gesture is within a predetermined range (for example, within 15°).
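The gaze check can be illustrated with a short sketch. It assumes that the line of sight and the direction toward the first gesture are both available as 3D direction vectors in the same coordinate system, which the description does not specify. The function names are hypothetical, and treating "first gesture not recognized" as the second mode is also an assumption. The 15° threshold is the example value given in the text.

```python
import math

GAZE_THRESHOLD_DEG = 15.0  # example value from the description


def angle_between_deg(a, b):
    """Return the angle in degrees between two non-zero 3D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("direction vectors must be non-zero")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cosine))


def decide_mode(first_gesture_recognized, gaze_dir, gesture_dir,
                threshold_deg=GAZE_THRESHOLD_DEG):
    """Return "first" if the user gazes at the first gesture, else "second".

    Assumption: when the first gesture is not recognized at all, the
    second mode applies.
    """
    if not first_gesture_recognized or gesture_dir is None:
        return "second"
    deviation = angle_between_deg(gaze_dir, gesture_dir)
    return "first" if deviation <= threshold_deg else "second"
```

For example, a gaze direction that deviates by 45° from the direction of the clenched fist would select the second mode under this sketch, while a deviation of 10° would select the first mode.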

A method for specifying the target object will be described with reference to FIG. 4. Suppose that the smart glasses 2 have received a message and are displaying an image P4 in which information indicating that a message has been received ("New message received") is superimposed on image P3, the captured image. Image P3 shows a signboard H1, a chair H2, and the user's gesture HJ1 (the first gesture). In this situation, the user may request processing related to the captured image (the first process), and the user may equally request processing that displays the message on the screen (a process included in the second process).

In the example shown in FIG. 4, gesture HJ1, in which the user clenches a fist, is shown, so the gesture recognition unit 22 recognizes the first gesture. When the determining unit 23 determines, based on the line-of-sight information, that the user is gazing at gesture HJ1, it judges that the user is requesting processing related to the captured image (the first process) and decides to apply the first mode. On the other hand, when the determining unit 23 determines, based on the line-of-sight information, that the user is not gazing at gesture HJ1, it judges that the user is requesting a second process other than processing related to the captured image (for example, processing that displays the message on the screen) and decides to apply the second mode.

Note that, after deciding to apply the first mode, the determining unit 23 continues to apply the first mode for as long as the gesture recognition unit 22 recognizes the second gesture related to the first process. This is because, while the user is making the second gesture related to the first process, the user is considered to be requesting the first process rather than the second process. On the other hand, if the gesture recognition unit 22 stops recognizing the second gesture after the determining unit 23 has decided to apply the first mode, the determining unit 23 decides to apply the second mode. This is because, once the user stops making the second gesture related to the first process, the user is considered to be no longer requesting the first process.
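The continuation rule above amounts to a small state machine. The sketch below is one possible reading, not the patented implementation: `Mode` and `ModeTracker` are hypothetical names, and it assumes the first mode is kept during the interval between the decision and the first recognition of the second gesture, a case the description leaves open.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    FIRST = "first"    # image-related processing
    SECOND = "second"  # processing other than image-related processing


@dataclass
class ModeTracker:
    """Tracks the applied mode across successive recognition results."""
    mode: Mode = Mode.SECOND
    second_gesture_seen: bool = False

    def update(self, gazing_at_first_gesture, second_gesture_recognized):
        if self.mode is Mode.FIRST:
            if second_gesture_recognized:
                # Second gesture in progress: keep the first mode.
                self.second_gesture_seen = True
            elif self.second_gesture_seen:
                # Second gesture was recognized and has now stopped.
                self.mode = Mode.SECOND
                self.second_gesture_seen = False
        elif gazing_at_first_gesture:
            # User gazes at the first gesture: switch to the first mode.
            self.mode = Mode.FIRST
        return self.mode
```

Calling `update` once per captured frame keeps the first mode active while the hand keeps opening and drops back to the second mode as soon as the second gesture disappears.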

The generation unit 24 generates information to be displayed (output) on the screen of the smart glasses 2, based on the information acquired from the object information server 10. When the generation unit 24 receives, from the object information server 10, information indicating the object specified as the target object and the name of that object, it generates a first image in which the name of the target object is superimposed on the captured image. When the gesture recognition unit 22 has recognized the second gesture, the generation unit 24 additionally superimposes on the first image a frame showing the range indicated by the second gesture. In the example shown in FIG. 5, based on the information acquired from the object information server 10 and the recognition result of the gesture recognition unit 22, the generation unit 24 generates an image P6 (first image) in which the frame F showing the range indicated by the second gesture, the name "signboard" of the signboard H1, and the name "chair" of the chair H2, both of which are target objects, are superimposed on image P5, the captured image. Note that the generation unit 24 may generate the first image so that the name of each target object is positioned near the corresponding target object.

When the generation unit 24 receives, from the object information server 10, information indicating the object specified as the target object and detailed information on the target object specified as presentation information, it generates a second image in which the presentation information is superimposed on the captured image. In the example shown in FIG. 6, based on the information acquired from the object information server 10, the generation unit 24 generates an image P8 (second image) in which the presentation information I presented by the signboard H1 is superimposed on image P7, the captured image, and in which the signboard H1 is highlighted. Note that the generation unit 24 may generate the second image so that the presentation information of the target object is positioned near the corresponding target object.

When the generation unit 24 receives, from the object information server 10, a request to change the display mode of the presentation information in the second image, it generates a third image in which the display mode of the presentation information in the second image has been changed. In the example shown in FIG. 7, the generation unit 24 has received a request for enlarged display of the presentation information from the object information server 10. In this case, the generation unit 24 generates an image P10 (third image) in which the presentation information I, with its characters enlarged, is superimposed on image P9, the captured image.

As described above, the generation unit 24 functions as a processing execution unit that executes the processing of the first mode that the determining unit 23 has decided to apply. That is, when the user voice recognized by the voice recognition server 50 includes processing content related to the first process (for example, information presentation), the generation unit 24 executes, as the first process, the generation of the second image and the like, which is processing corresponding to that processing content. Also as described above, the generation unit 24 generates output information based on the information on the target object specified by the object information server 10. More specifically, the generation unit 24 generates, as the output information, the second image, which is a superimposed image in which information on the target object is displayed superimposed in association with the target object.

The output unit 25 outputs the information generated by the generation unit 24 (displays it on the screen of the smart glasses 2). That is, the output unit 25 displays the first image, the second image, and the third image described above on the screen of the smart glasses 2. The output unit 25 thus functions as a processing execution unit that executes the processing of the first mode that the determining unit 23 has decided to apply. That is, when the user voice recognized by the voice recognition server 50 includes processing content related to the first process (for example, information presentation), the output unit 25 executes, as the first process, the output of the second image and the like, which is processing corresponding to that processing content.

Note that, when the determining unit 23 decides to execute the processing of the second mode (that is, to execute a second process other than processing related to the captured image), the generation unit 24 and the output unit 25 may function as a processing execution unit that executes the second process. For example, when the determining unit 23 decides to execute processing that displays a new message on the screen (the second process), the generation unit 24 may generate an image on which the new message is superimposed, and the output unit 25 may output that image.

Next, the processing performed by the information processing system 1 according to this embodiment will be described with reference to FIG. 8. FIG. 8 is a sequence diagram showing the processing performed by the information processing system 1.

As shown in FIG. 8, in the information processing system 1, the smart glasses 2 first decide to apply the first mode based on the user's line-of-sight information and the first gesture (step S1). Specifically, the smart glasses 2 decide to apply the first mode when the user is gazing at the first gesture.

Next, the smart glasses 2 recognize the second gesture and, based on the range indicated by the second gesture, specify the range in which the target object can exist (step S2). The smart glasses 2 then transmit the captured image to the object information server 10 (step S3). The smart glasses 2 may transmit to the object information server 10 only the part of the captured image within the range specified in step S2 in which the target object can exist.

Next, the object information server 10 specifies the target object and its name based on the captured image acquired from the smart glasses 2 (including the information on the range indicated by the second gesture) and the object information it stores (step S4). The object information server 10 then transmits the specified information to the smart glasses 2 (step S5).

Next, based on the information acquired from the object information server 10 (the target object and its name), the smart glasses 2 generate a first image in which the name of the target object is superimposed on the captured image (image P6 shown in FIG. 5) and display it on the screen (step S6). Image P6 shows the frame F indicating the range given by the second gesture, together with the name "signboard" of the signboard H1 and the name "chair" of the chair H2, which are the target objects.

In this state, the smart glasses 2 acquire user voice, which is voice uttered by the user, and transmit it to the voice recognition server 50 (step S7). Suppose that the smart glasses 2 have acquired, from the user viewing image P6, user voice that includes a name for narrowing down the target object ("signboard") and processing content ("information display"). In this case, the voice recognition server 50 recognizes the terms "signboard" and "information display" by voice recognition (step S8). The voice recognition server 50 then transmits the voice recognition result to the object information server 10 (step S9).

Next, the object information server 10 receives the voice recognition result. Based on the information indicating the object, "signboard", it narrows the target object down to the signboard H1. Based on the processing content "information display", it specifies detailed information (presentation information) on the signboard H1 from the object information it stores, and transmits the specified information to the smart glasses 2 (step S10).
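Step S10 can be sketched as a filter followed by a lookup. This is an illustrative sketch only: the data layout, the names `DetectedObject` and `handle_recognition_result`, and the placeholder detail strings used with it are assumptions, since the description does not define the format of the object information or of the voice recognition result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectedObject:
    """An object found inside the range indicated by the second gesture."""
    object_id: str  # e.g. "H1" (hypothetical identifier)
    name: str       # e.g. "signboard"


def handle_recognition_result(terms, candidates, object_info):
    """Narrow the candidates by spoken name, then attach detail info.

    terms:       recognized terms, e.g. ["signboard", "information display"]
    candidates:  objects inside the range indicated by the second gesture
    object_info: mapping from object_id to detailed (presentation) info
    """
    known_names = {obj.name for obj in candidates}
    spoken_names = {term for term in terms if term in known_names}
    # Assumption: with no object name in the utterance, all candidates remain.
    targets = [obj for obj in candidates
               if not spoken_names or obj.name in spoken_names]
    if "information display" in terms:
        return [(obj, object_info.get(obj.object_id)) for obj in targets]
    return [(obj, None) for obj in targets]
```

A call with the terms "signboard" and "information display" and the candidates H1 and H2 returns only the signboard together with whatever detail entry is stored for it.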

Next, based on the information acquired from the object information server 10 (the target object and the presentation information), the smart glasses 2 generate a second image in which the presentation information is superimposed on the captured image (image P8 shown in FIG. 6) and display it on the screen (step S11). In image P8, the presentation information I presented by the signboard H1 is displayed and the signboard H1 is highlighted.

In this state, the smart glasses 2 acquire further user voice and transmit it to the voice recognition server 50 (step S12). Suppose that the smart glasses 2 have acquired, from the user viewing image P8, user voice that includes further processing content ("enlarged display"). In this case, the voice recognition server 50 recognizes the term "enlarged display" by voice recognition (step S13). The voice recognition server 50 then transmits the voice recognition result to the object information server 10 (step S14).

Next, the object information server 10 receives the voice recognition result, specifies the processing content "enlarged display", and transmits an enlarged display request to the smart glasses 2 (step S15). Then, based on the information acquired from the object information server 10 (the enlarged display request), the smart glasses 2 generate an image in which the presentation information I, with its characters enlarged, is superimposed on the captured image (image P10 shown in FIG. 7) and display it on the screen (step S16).
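How the smart glasses might apply the enlarged display request in step S16 can be sketched as follows. The overlay structure, the request string `"enlarge"`, and the scale factor of 1.5 are all hypothetical; the description only states that the characters of the presentation information are enlarged.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Overlay:
    """Presentation information to be superimposed on the captured image."""
    text: str
    font_px: int


def apply_display_request(overlay, request, scale=1.5):
    """Return a new overlay reflecting the display-mode change request.

    Assumption: "enlarge" is the only request handled here; any other
    request leaves the overlay unchanged.
    """
    if request == "enlarge":
        return replace(overlay, font_px=round(overlay.font_px * scale))
    return overlay
```

Keeping the overlay immutable and returning a modified copy means the second image can still be regenerated from the original overlay if the user later asks for the normal size.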

Next, the operation and effects of the information processing system 1 according to this embodiment will be described.

The information processing system 1 according to this embodiment includes: an acquisition unit 21 that acquires an image that is displayed on the smart glasses 2 worn by the user and is thereby viewed by the user, the user's line-of-sight information, and user voice, which is voice uttered by the user; a gesture recognition unit 22 that recognizes the user's gesture shown in the image acquired by the acquisition unit 21; a voice recognition server 50 that recognizes the user voice acquired by the acquisition unit 21; a determining unit 23 that determines, based on the line-of-sight information and the recognition result of the gesture recognition unit 22, which to apply of a first mode, in which a first process related to the image is executed in response to the user voice recognized by the voice recognition server 50, and a second mode, in which a second process different from the image-related process is executed in response to that user voice; and a generation unit 24 and an output unit 25 that execute the processing of the first mode or the second mode that the determining unit 23 has decided to apply.

In the information processing system 1 according to this embodiment, the user voice, the image that is displayed on the smart glasses 2 and thereby viewed by the user (the captured image), and the user's line-of-sight information are acquired, and the user's gesture and the user voice are recognized. Then, which of the first mode and the second mode to apply is determined based on the user's line-of-sight information and the gesture recognition result. The first mode is a mode in which image-related processing is executed in response to the user voice. The second mode is a mode in which processing different from the image-related processing is executed in response to the user voice. Consider, by contrast, an information processing system that decides which of the first mode and the second mode to apply by voice recognition alone. Such a system first accepts voice input from the user and recognizes the user's voice, but it may be difficult to tell which process the voice relates to. In that case, even if the voice relates to image-related processing, processing different from the image-related processing may be executed.

In this respect, the information processing system 1 decides whether to apply the mode in which image-related processing is executed or the mode in which other processing is executed based on the user's line-of-sight information and gesture, which are considered to reflect the user's intention. Appropriate processing in line with the user's request can therefore be performed. In addition, because the information processing system 1 suppresses processing that does not match the user's intention (that is, unnecessary processing), it achieves the technical effect of reducing the processing load.

In the information processing system 1, the gesture recognition unit 22 recognizes a first gesture predetermined as a gesture related to the first process. The determining unit 23 determines, based on the line-of-sight information and the recognition result of the first gesture by the gesture recognition unit 22, whether the user is gazing at the first gesture shown in the image, and decides to apply the first mode when the user is gazing at the first gesture.

In general, when a user is gazing at a certain area, the user is considered to be interested in that area. When the user is gazing at the first gesture, which is predetermined as a gesture related to the first process, the user is therefore highly likely to be requesting the first process (processing on the image). In the information processing system 1, the first mode, in which processing on the image is executed in response to the user voice, is applied when the user is gazing at the first gesture. This increases the likelihood that the first mode is applied when the user is requesting processing on the image.

In the information processing system 1, when the user voice recognized by the voice recognition server 50 includes processing content related to the first process, the generation unit 24 and the output unit 25, which function as the processing execution unit, execute processing corresponding to that processing content as the first process. Because the processing content requested by the user is thus judged from the user voice, and an image corresponding to that processing content is generated and displayed (output), appropriate processing in line with the user's request can be performed.

The information processing system 1 includes an object information server 10 that executes processing to specify a target object, which is an object included in the image and is the target of the first process. The gesture recognition unit 22 further recognizes a second gesture predetermined as a gesture indicating a range in which the target object can be included, and the object information server 10 specifies the target object based on the range indicated by the second gesture in the image. Because the target object is specified based on the range indicated by a gesture that reflects the user's intention, the object that the user wants to make the target object (the target of processing) can be specified appropriately.

In the information processing system 1, the object information server 10 stores object information in which, for each of a plurality of objects, at least information indicating the object and information on the object are associated with each other, and further specifies information on the specified target object based on the object information. As the first process of the first mode, the generation unit 24 generates output information (such as image P8 in FIG. 6) based on the information on the target object specified by the object information server 10, and the output unit 25 displays the output information generated by the generation unit 24 on the screen of the smart glasses 2. With this configuration, information on the target object is easily acquired simply by the user making a gesture. That is, the user can obtain the desired information by a method that is simple for the user.

In the information processing system 1, the generation unit 24 generates, as the output information, a superimposed image (image P8 in FIG. 6) in which the information on the target object specified by the object information server 10 is displayed superimposed in association with the target object. Because the target object and the information on it are displayed in association with each other, the information on the target object can be displayed in a manner that is easier for the user to grasp.

In the information processing system 1, when the user voice recognized by the voice recognition server 50 includes information indicating an object, the object information server 10 specifies as the target object, from among the objects in the range indicated by the second gesture, the object included in the user voice. Because the target object is specified by additionally taking the user voice into account, the object that the user wants to make the target object can be specified more reliably and easily.

After deciding to apply the first mode, the determining unit 23 continues to apply the first mode for as long as the gesture recognition unit 22 recognizes the second gesture.

While the user continues the second gesture, which reflects the user's intention, the user is highly likely to be continuing to request image-related processing. Because the information processing system 1 continues the first mode in such a state, it can reliably execute the processing that the user requests.

If the gesture recognition unit 22 stops recognizing the second gesture after the determining unit 23 has decided to apply the first mode, the determining unit 23 decides to apply the second mode.

When the user interrupts the second gesture, which reflects the user's intention, the user is highly likely to be no longer requesting processing on the image. In such a state, the information processing system 1 switches from the first mode, in which the first process (image-related processing) is executed, to the second mode, in which the second process (processing different from the image-related processing) is executed, so it can reliably execute the processing that the user requests.

Next, the hardware configurations of the smart glasses 2, the voice recognition server 50, and the object information server 10 included in the information processing system 1 will be described with reference to FIG. 9. Physically, the smart glasses 2, the voice recognition server 50, and the object information server 10 described above may each be configured as a computer device including a processor 1001, a memory 1002, a storage 1003, a communication device 1004, an input device 1005, an output device 1006, a bus 1007, and the like.

In the following description, the word "device" can be read as a circuit, a device, a unit, or the like. The hardware configurations of the smart glasses 2, the voice recognition server 50, and the object information server 10 may include one or more of each of the devices shown in FIG. 9, or may omit some of the devices.

Each function of the smart glasses 2, the voice recognition server 50, and the object information server 10 is realized by loading predetermined software (a program) onto hardware such as the processor 1001 and the memory 1002, so that the processor 1001 performs computation and controls communication by the communication device 1004 and the reading and/or writing of data in the memory 1002 and the storage 1003.

プロセッサ１００１は、例えば、オペレーティングシステムを動作させてコンピュータ全体を制御する。プロセッサ１００１は、周辺装置とのインタフェース、制御装置、演算装置、レジスタなどを含む中央処理装置（ＣＰＵ：Central Processing Unit）で構成されてもよい。例えば、スマートグラス２の取得部２１等の制御機能はプロセッサ１００１で実現されてもよい。 The processor 1001, for example, operates an operating system to control the entire computer. The processor 1001 may be configured with a central processing unit (CPU) including an interface with peripheral devices, a control device, an arithmetic unit, registers, and the like. For example, the control function of the acquisition unit 21 and the like of the smart glasses 2 may be realized by the processor 1001.

また、プロセッサ１００１は、プログラム（プログラムコード）、ソフトウェアモジュールやデータを、ストレージ１００３及び／又は通信装置１００４からメモリ１００２に読み出し、これらに従って各種の処理を実行する。プログラムとしては、上述の実施の形態で説明した動作の少なくとも一部をコンピュータに実行させるプログラムが用いられる。 Further, the processor 1001 reads programs (program codes), software modules, and data from the storage 1003 and/or the communication device 1004 to the memory 1002, and executes various processes in accordance with these. As the program, a program that causes a computer to execute at least part of the operations described in the above embodiments is used.

例えば、スマートグラス２の取得部２１等の制御機能は、メモリ１００２に格納され、プロセッサ１００１で動作する制御プログラムによって実現されてもよく、他の機能ブロックについても同様に実現されてもよい。上述の各種処理は、１つのプロセッサ１００１で実行される旨を説明してきたが、２以上のプロセッサ１００１により同時又は逐次に実行されてもよい。プロセッサ１００１は、１以上のチップで実装されてもよい。なお、プログラムは、電気通信回線を介してネットワークから送信されても良い。 For example, the control functions of the acquisition unit 21 and the like of the smart glasses 2 may be realized by a control program stored in the memory 1002 and operated by the processor 1001, and other functional blocks may be similarly realized. Although the various processes described above have been described as being executed by one processor 1001, they may be executed by two or more processors 1001 simultaneously or sequentially. Processor 1001 may be implemented with one or more chips. Note that the program may be transmitted from a network via a telecommunications line.

メモリ１００２は、コンピュータ読み取り可能な記録媒体であり、例えば、ＲＯＭ（Read Only Memory）、ＥＰＲＯＭ（Erasable Programmable ＲＯＭ）、ＥＥＰＲＯＭ（Electrically Erasable Programmable ＲＯＭ）、ＲＡＭ（Random Access Memory）などの少なくとも１つで構成されてもよい。メモリ１００２は、レジスタ、キャッシュ、メインメモリ（主記憶装置）などと呼ばれてもよい。メモリ１００２は、本発明の一実施の形態に係る無線通信方法を実施するために実行可能なプログラム（プログラムコード）、ソフトウェアモジュールなどを保存することができる。 The memory 1002 is a computer-readable recording medium and may be configured with at least one of, for example, a ROM (Read Only Memory), an EPROM (Erasable Programmable ROM), an EEPROM (Electrically Erasable Programmable ROM), a RAM (Random Access Memory), and the like. The memory 1002 may be called a register, a cache, a main memory (main storage device), or the like. The memory 1002 can store executable programs (program codes), software modules, and the like for implementing a wireless communication method according to an embodiment of the present invention.

ストレージ１００３は、コンピュータ読み取り可能な記録媒体であり、例えば、ＣＤＲＯＭ（Compact Disc ＲＯＭ）などの光ディスク、ハードディスクドライブ、フレキシブルディスク、光磁気ディスク(例えば、コンパクトディスク、デジタル多用途ディスク、Ｂｌｕ－ｒａｙ（登録商標）ディスク)、スマートカード、フラッシュメモリ(例えば、カード、スティック、キードライブ)、フロッピー（登録商標）ディスク、磁気ストリップなどの少なくとも１つで構成されてもよい。ストレージ１００３は、補助記憶装置と呼ばれてもよい。上述の記憶媒体は、例えば、メモリ１００２及び／又はストレージ１００３を含むデータベース、サーバその他の適切な媒体であってもよい。 The storage 1003 is a computer-readable recording medium and may be configured with at least one of, for example, an optical disc such as a CD-ROM (Compact Disc ROM), a hard disk drive, a flexible disk, a magneto-optical disk (e.g., a compact disc, a digital versatile disc, a Blu-ray (registered trademark) disc), a smart card, a flash memory (e.g., a card, a stick, a key drive), a floppy (registered trademark) disk, a magnetic strip, and the like. The storage 1003 may also be called an auxiliary storage device. The storage medium described above may be, for example, a database, a server, or another suitable medium that includes the memory 1002 and/or the storage 1003.

通信装置１００４は、有線及び／又は無線ネットワークを介してコンピュータ間の通信を行うためのハードウェア（送受信デバイス）であり、例えばネットワークデバイス、ネットワークコントローラ、ネットワークカード、通信モジュールなどともいう。 The communication device 1004 is hardware (transmission/reception device) for communicating between computers via a wired and/or wireless network, and is also referred to as, for example, a network device, network controller, network card, communication module, or the like.

入力装置１００５は、外部からの入力を受け付ける入力デバイス（例えば、キーボード、マウス、マイクロフォン、スイッチ、ボタン、センサなど）である。出力装置１００６は、外部への出力を実施する出力デバイス（例えば、ディスプレイ、スピーカー、ＬＥＤランプなど）である。なお、入力装置１００５及び出力装置１００６は、一体となった構成（例えば、タッチパネル）であってもよい。 The input device 1005 is an input device (eg, keyboard, mouse, microphone, switch, button, sensor, etc.) that accepts input from the outside. The output device 1006 is an output device (for example, a display, a speaker, an LED lamp, etc.) that performs output to the outside. Note that the input device 1005 and the output device 1006 may have an integrated configuration (for example, a touch panel).

また、プロセッサ１００１やメモリ１００２などの各装置は、情報を通信するためのバス１００７で接続される。バス１００７は、単一のバスで構成されてもよいし、装置間で異なるバスで構成されてもよい。 Further, each device such as the processor 1001 and the memory 1002 is connected by a bus 1007 for communicating information. The bus 1007 may be configured as a single bus or may be configured as different buses between devices.

また、スマートグラス２、音声認識サーバ５０、及び物体情報サーバ１０は、マイクロプロセッサ、デジタル信号プロセッサ（ＤＳＰ：Digital Signal Processor）、ＡＳＩＣ（Application Specific Integrated Circuit）、ＰＬＤ（Programmable Logic Device）、ＦＰＧＡ（Field Programmable Gate Array）などのハードウェアを含んで構成されてもよく、当該ハードウェアにより、各機能ブロックの一部又は全てが実現されてもよい。例えば、プロセッサ１００１は、これらのハードウェアの少なくとも１つで実装されてもよい。 In addition, the smart glasses 2, the voice recognition server 50, and the object information server 10 may be configured to include hardware such as a microprocessor, a digital signal processor (DSP), an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), and an FPGA (Field Programmable Gate Array), and some or all of the functional blocks may be realized by that hardware. For example, the processor 1001 may be implemented with at least one of these pieces of hardware.

以上、本実施形態について詳細に説明したが、当業者にとっては、本実施形態が本明細書中に説明した実施形態に限定されるものではないということは明らかである。本実施形態は、特許請求の範囲の記載により定まる本発明の趣旨及び範囲を逸脱することなく修正及び変更態様として実施することができる。したがって、本明細書の記載は、例示説明を目的とするものであり、本実施形態に対して何ら制限的な意味を有するものではない。 Although this embodiment has been described in detail above, it is clear for those skilled in the art that this embodiment is not limited to the embodiment described in this specification. This embodiment can be implemented as modifications and changes without departing from the spirit and scope of the present invention as defined by the claims. Therefore, the description in this specification is for the purpose of illustrative explanation and does not have any restrictive meaning with respect to this embodiment.

例えば、情報処理システム１は、スマートグラス２、音声認識サーバ５０、及び物体情報サーバ１０を含んで構成されているとして説明したが、これに限定されず、情報処理システム１の各機能が、スマートグラス２のみによって実現されてもよい。また、情報処理システム１の各機能のうち、決定部２３による第１モードの決定処理、及び第２ジェスチャに基づいた指定範囲の画定処理が物体情報サーバ１０によって実現されてもよい。 For example, the information processing system 1 has been described as including the smart glasses 2, the voice recognition server 50, and the object information server 10, but it is not limited to this configuration, and each function of the information processing system 1 may be realized by the smart glasses 2 alone. Further, among the functions of the information processing system 1, the first-mode determination process performed by the determination unit 23 and the process of defining the designated range based on the second gesture may be realized by the object information server 10.

本明細書で説明した各態様／実施形態は、ＬＴＥ（Long Term Evolution）、ＬＴＥ－Ａ（LTE-Advanced）、ＳＵＰＥＲ３Ｇ、ＩＭＴ－Ａｄｖａｎｃｅｄ、４Ｇ、５Ｇ、ＦＲＡ（Future Radio Access）、Ｗ－ＣＤＭＡ（登録商標）、ＧＳＭ（登録商標）、ＣＤＭＡ２０００、ＵＭＢ（Ultra Mobile Broad-band）、ＩＥＥＥ８０２．１１（Ｗｉ－Ｆｉ）、ＩＥＥＥ８０２．１６（ＷｉＭＡＸ）、ＩＥＥＥ８０２．２０、ＵＷＢ（Ultra-Wide Band）、Ｂｌｕｅｔｏｏｔｈ（登録商標）、その他の適切なシステムを利用するシステム及び／又はこれらに基づいて拡張された次世代システムに適用されてもよい。 Each aspect/embodiment described in this specification may be applied to systems that use LTE (Long Term Evolution), LTE-A (LTE-Advanced), SUPER 3G, IMT-Advanced, 4G, 5G, FRA (Future Radio Access), W-CDMA (registered trademark), GSM (registered trademark), CDMA2000, UMB (Ultra Mobile Broadband), IEEE 802.11 (Wi-Fi), IEEE 802.16 (WiMAX), IEEE 802.20, UWB (Ultra-Wide Band), Bluetooth (registered trademark), or other appropriate systems, and/or to next-generation systems extended on the basis of these.

本明細書で説明した各態様／実施形態の処理手順、シーケンス、フローチャートなどは、矛盾の無い限り、順序を入れ替えてもよい。例えば、本明細書で説明した方法については、例示的な順序で様々なステップの要素を提示しており、提示した特定の順序に限定されない。 The order of the processing procedures, sequences, flowcharts, etc. of each aspect/embodiment described in this specification may be changed as long as there is no contradiction. For example, the methods described herein present elements of the various steps in an exemplary order and are not limited to the particular order presented.

入出力された情報等は特定の場所（例えば、メモリ）に保存されてもよいし、管理テーブルで管理してもよい。入出力される情報等は、上書き、更新、または追記され得る。出力された情報等は削除されてもよい。入力された情報等は他の装置へ送信されてもよい。 The input/output information may be stored in a specific location (eg, memory) or may be managed in a management table. Information etc. to be input/output may be overwritten, updated, or additionally written. The output information etc. may be deleted. The input information etc. may be transmitted to other devices.

判定は、１ビットで表される値（０か１か）によって行われてもよいし、真偽値（Boolean：trueまたはfalse）によって行われてもよいし、数値の比較（例えば、所定の値との比較）によって行われてもよい。 A determination may be made using a value represented by one bit (0 or 1), using a truth value (Boolean: true or false), or by comparing numerical values (for example, comparison with a predetermined value).
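The three forms of determination mentioned above can be illustrated, by way of example only, as follows. The function names, the threshold parameter, and the choice of "greater than or equal to" as the comparison are assumptions made for this sketch; the specification does not prescribe any of them.

```python
def judge_by_bit(flag: int) -> bool:
    """Determination from a value represented by one bit (0 or 1)."""
    if flag not in (0, 1):
        raise ValueError("a one-bit value must be 0 or 1")
    return flag == 1


def judge_by_boolean(flag: bool) -> bool:
    """Determination from a truth value (true or false)."""
    return flag


def judge_by_comparison(value: float, predetermined_value: float) -> bool:
    """Determination by comparing a numerical value with a predetermined value.

    The direction of the comparison (>=) is an assumption of this sketch.
    """
    return value >= predetermined_value
```

All three functions return the same kind of result, so a caller can use whichever representation the input happens to arrive in.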

本明細書で説明した各態様／実施形態は単独で用いてもよいし、組み合わせて用いてもよいし、実行に伴って切り替えて用いてもよい。また、所定の情報の通知（例えば、「Ｘであること」の通知）は、明示的に行うものに限られず、暗黙的（例えば、当該所定の情報の通知を行わない）ことによって行われてもよい。 Each aspect/embodiment described in this specification may be used alone, in combination, or switched during execution. In addition, notification of predetermined information (for example, notification of "being X") is not limited to explicit notification, and may be performed implicitly (for example, by not notifying the predetermined information).

ソフトウェアは、ソフトウェア、ファームウェア、ミドルウェア、マイクロコード、ハードウェア記述言語と呼ばれるか、他の名称で呼ばれるかを問わず、命令、命令セット、コード、コードセグメント、プログラムコード、プログラム、サブプログラム、ソフトウェアモジュール、アプリケーション、ソフトウェアアプリケーション、ソフトウェアパッケージ、ルーチン、サブルーチン、オブジェクト、実行可能ファイル、実行スレッド、手順、機能などを意味するよう広く解釈されるべきである。 Software includes instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, whether referred to as software, firmware, middleware, microcode, hardware description language, or by any other name. , should be broadly construed to mean an application, software application, software package, routine, subroutine, object, executable, thread of execution, procedure, function, etc.

また、ソフトウェア、命令などは、伝送媒体を介して送受信されてもよい。例えば、ソフトウェアが、同軸ケーブル、光ファイバケーブル、ツイストペア及びデジタル加入者回線（ＤＳＬ）などの有線技術及び／又は赤外線、無線及びマイクロ波などの無線技術を使用してウェブサイト、サーバ、又は他のリモートソースから送信される場合、これらの有線技術及び／又は無線技術は、伝送媒体の定義内に含まれる。 Additionally, software, instructions, and the like may be transmitted and received via a transmission medium. For example, when software is transmitted from a website, a server, or another remote source using wired technologies such as coaxial cable, fiber optic cable, twisted pair, and digital subscriber line (DSL) and/or wireless technologies such as infrared, radio, and microwave, these wired and/or wireless technologies are included within the definition of a transmission medium.

本明細書で説明した情報、信号などは、様々な異なる技術のいずれかを使用して表されてもよい。例えば、上記の説明全体に渡って言及され得るデータ、命令、コマンド、情報、信号、ビット、シンボル、チップなどは、電圧、電流、電磁波、磁界若しくは磁性粒子、光場若しくは光子、又はこれらの任意の組み合わせによって表されてもよい。 The information, signals, and the like described in this specification may be represented using any of a variety of different technologies. For example, data, instructions, commands, information, signals, bits, symbols, chips, and the like that may be referred to throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or magnetic particles, optical fields or photons, or any combination of these.

なお、本明細書で説明した用語及び／又は本明細書の理解に必要な用語については、同一の又は類似する意味を有する用語と置き換えてもよい。 Note that terms explained in this specification and/or terms necessary for understanding this specification may be replaced with terms having the same or similar meanings.

また、本明細書で説明した情報、パラメータなどは、絶対値で表されてもよいし、所定の値からの相対値で表されてもよいし、対応する別の情報で表されてもよい。 Further, the information, parameters, and the like described in this specification may be expressed as absolute values, as relative values from a predetermined value, or by other corresponding information.
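As one possible illustration, the same parameter can be recovered from each of the three representations mentioned above. The reference value, the lookup table, and the function names below are hypothetical and are introduced only for this sketch.

```python
# Hypothetical predetermined value used for the relative representation.
REFERENCE_VALUE = 100.0

# Hypothetical table for the "other corresponding information" case:
# an index stands in for the parameter value it corresponds to.
INDEX_TABLE = {0: 90.0, 1: 100.0, 2: 110.0}


def from_absolute(value: float) -> float:
    """The parameter is given directly as an absolute value."""
    return value


def from_relative(offset: float, reference: float = REFERENCE_VALUE) -> float:
    """The parameter is given as an offset from a predetermined value."""
    return reference + offset


def from_index(index: int, table: dict = INDEX_TABLE) -> float:
    """The parameter is given by other information (here, a table index)."""
    return table[index]
```

Under these assumed values, an absolute value of 110.0, an offset of 10.0, and index 2 all denote the same parameter.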

スマートグラス２は、当業者によって、移動通信端末、加入者局、モバイルユニット、加入者ユニット、ワイヤレスユニット、リモートユニット、モバイルデバイス、ワイヤレスデバイス、ワイヤレス通信デバイス、リモートデバイス、モバイル加入者局、アクセス端末、モバイル端末、ワイヤレス端末、リモート端末、ハンドセット、ユーザエージェント、モバイルクライアント、クライアント、またはいくつかの他の適切な用語で呼ばれる場合もある。 The smart glasses 2 can be described by a person skilled in the art as a mobile communication terminal, a subscriber station, a mobile unit, a subscriber unit, a wireless unit, a remote unit, a mobile device, a wireless device, a wireless communication device, a remote device, a mobile subscriber station, an access terminal. , a mobile terminal, a wireless terminal, a remote terminal, a handset, a user agent, a mobile client, a client, or some other suitable terminology.

本明細書で使用する「に基づいて」という記載は、別段に明記されていない限り、「のみに基づいて」を意味しない。言い換えれば、「に基づいて」という記載は、「のみに基づいて」と「に少なくとも基づいて」の両方を意味する。 As used herein, the phrase "based on" does not mean "based solely on" unless expressly stated otherwise. In other words, the phrase "based on" means both "based only on" and "based at least on."

本明細書で「第１の」、「第２の」などの呼称を使用した場合においては、その要素へのいかなる参照も、それらの要素の量または順序を全般的に限定するものではない。これらの呼称は、２つ以上の要素間を区別する便利な方法として本明細書で使用され得る。したがって、第１および第２の要素への参照は、２つの要素のみがそこで採用され得ること、または何らかの形で第１の要素が第２の要素に先行しなければならないことを意味しない。 Where designations such as "first" and "second" are used in this specification, any reference to those elements does not generally limit the quantity or order of the elements. These designations may be used herein as a convenient way of distinguishing between two or more elements. Thus, a reference to first and second elements does not mean that only two elements may be employed, or that the first element must precede the second element in some way.

「含む（include）」、「含んでいる（including）」、およびそれらの変形が、本明細書あるいは特許請求の範囲で使用されている限り、これら用語は、用語「備える(comprising)」と同様に、包括的であることが意図される。さらに、本明細書あるいは特許請求の範囲において使用されている用語「または（or）」は、排他的論理和ではないことが意図される。 To the extent that "include," "including," and variations thereof are used in this specification or in the claims, these terms are intended to be inclusive, in the same manner as the term "comprising." Furthermore, the term "or" as used in this specification or in the claims is intended not to be an exclusive OR.

本明細書において、文脈または技術的に明らかに１つのみしか存在しない装置である場合以外は、複数の装置をも含むものとする。 In this specification, a plurality of devices is also included unless it is clear from the context or technology that only one device exists.

本開示の全体において、文脈から明らかに単数を示したものではなければ、複数のものを含むものとする。 Throughout this disclosure, the plural is intended to be included unless the context clearly dictates otherwise.

１…情報処理システム、２…スマートグラス（端末）、１０…物体情報サーバ（特定部，記憶部）、２１…取得部、２２…ジェスチャ認識部、２３…決定部、２４…生成部（処理実行部）、２５…出力部（処理実行部）、５０…音声認識サーバ（音声認識部）、Ｈ１…看板（対象オブジェクト）、Ｈ２…椅子（対象オブジェクト）、ＨＪ１…ジェスチャ（第１ジェスチャ）、ＨＪ２…ジェスチャ（第２ジェスチャ）、Ｐ８…画像（重畳画像）。 DESCRIPTION OF SYMBOLS 1... Information processing system, 2... Smart glasses (terminal), 10... Object information server (specifying unit, storage unit), 21... Acquisition unit, 22... Gesture recognition unit, 23... Determination unit, 24... Generation unit (processing execution unit), 25... Output unit (processing execution unit), 50... Voice recognition server (voice recognition unit), H1... Signboard (target object), H2... Chair (target object), HJ1... Gesture (first gesture), HJ2... Gesture (second gesture), P8... Image (superimposed image).

Claims

An information processing system comprising:
an acquisition unit that acquires an image that is displayed on a terminal worn by a user and is thereby visually recognized by the user, line-of-sight information of the user, and user voice that is voice uttered by the user;
a gesture recognition unit that recognizes a gesture of the user shown in the image acquired by the acquisition unit;
a voice recognition unit that recognizes the user voice acquired by the acquisition unit;
a determining unit that determines, based on the line-of-sight information and the recognition result by the gesture recognition unit, which of a first mode and a second mode to apply, the first mode executing a first process related to the image in response to the user voice recognized by the voice recognition unit, and the second mode executing a second process different from the process related to the image in response to the user voice recognized by the voice recognition unit; and
a processing execution unit that executes processing in the first mode or the second mode determined to be applied by the determining unit.

The information processing system according to claim 1, wherein
the gesture recognition unit recognizes a first gesture predetermined as the gesture related to the first process, and
the determining unit determines, based on the line-of-sight information and the recognition result of the first gesture by the gesture recognition unit, whether the user is gazing at the first gesture shown in the image, and determines to apply the first mode when the user is gazing at the first gesture.

The information processing system according to claim 1 or 2, wherein, when the user voice recognized by the voice recognition unit includes processing content related to the first process, the processing execution unit executes processing according to the processing content as the first process.

further comprising a specifying unit that specifies a target object that is included in the image and is a target object of the first process;
The gesture recognition unit further recognizes a second gesture predetermined as the gesture indicating a range in which the target object can be included,
The information processing system according to claim 1, wherein the specifying unit specifies the target object based on a range indicated by the second gesture in the image.

Further comprising a storage unit that stores object information in which information indicating the object and information regarding the object are at least associated with each other for each of the plurality of objects,
The specifying unit further specifies information regarding the specified target object based on the object information,
5. The processing execution unit generates, as the first process in the first mode, output information based on the information regarding the target object specified by the specifying unit, and outputs the generated output information. The information processing system described in .

The information processing system according to claim 5, wherein the processing execution unit generates, as the output information, a superimposed image in which the information regarding the target object specified by the specifying unit is displayed superimposed in association with the target object.

The information processing system according to claim 4, wherein, when the user voice recognized by the voice recognition unit includes information indicating an object, the specifying unit specifies, as the target object, the object indicated in the user voice from among the objects within the range indicated by the second gesture.

The information processing system according to any one of claims 4 to 7, wherein the determining unit, after determining to apply the first mode, continues to apply the first mode while the gesture recognition unit recognizes the second gesture.

The information processing system according to any one of claims 4 to 8, wherein the determining unit, after determining to apply the first mode, determines to apply the second mode when the second gesture is no longer recognized by the gesture recognition unit.