JP6621595B2

JP6621595B2 - Information processing apparatus, information processing method, and program

Info

Publication number: JP6621595B2
Application number: JP2015088275A
Authority: JP
Inventors: 俊治栗栖; 真紀佐々木
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2015-04-23
Filing date: 2015-04-23
Publication date: 2019-12-18
Anticipated expiration: 2035-04-23
Also published as: JP2016206433A

Description

The present invention relates to a technique for performing processing according to an image and speech.

Patent Document 1 discloses an information processing apparatus that searches for information on an object in a displayed image. The apparatus extracts a region containing an object from an image obtained by a camera and displays a rectangular frame representing the extracted region together with the image. When the user designates a region in the displayed image and performs a search operation with a soft key, the apparatus searches for information related to the object within the rectangular frame and displays the search result. The information processing apparatus of Patent Document 1 can thus perform search processing on an object in an image.

Patent Document 1: JP 2010-205121 A

When obtaining information about an object shown in an image, what the user wants to look up may differ even for the same object. The information processing apparatus of Patent Document 1, however, only searches for what the object is and cannot provide a variety of information.

An object of the present invention is to provide a technique for identifying a user's instruction with respect to an image and performing processing according to the user's intention.

The present invention provides an information processing apparatus comprising: image data acquisition means for acquiring image data that represents an image being displayed and has a tag; audio data acquisition means for acquiring audio data representing speech uttered while the image is displayed; specifying means for specifying the content of the speech represented by the audio data acquired by the audio data acquisition means; and processing means for executing processing corresponding to the tag of the image data acquired by the image data acquisition means and to the content specified by the specifying means.

In the present invention, the image may have a plurality of tags, the apparatus may include selection means for selecting one or more tags from the plurality of tags, and the processing means may execute processing corresponding to the one or more tags selected by the selection means and the content specified by the specifying means.

In the present invention, the selection means may select one or more tags from the plurality of tags according to the content specified by the specifying means.
In the present invention, the apparatus may include a database storing a plurality of associations between tags and words corresponding to the tags, the specifying means may specify a word represented by the speech, and the selection means may select the tag that the database associates with the word specified by the specifying means.

In the present invention, the image data acquisition means may acquire coordinates representing a position in the image, the tag may be associated with a position in the image, and the selection means may select the tag associated with the position of the coordinates acquired by the image data acquisition means.

The present invention also provides an information processing method comprising: an image data acquisition step of acquiring image data that represents an image being displayed and has a tag; an audio data acquisition step of acquiring audio data representing speech uttered while the image is displayed; a specifying step of specifying the content of the speech represented by the audio data acquired in the audio data acquisition step; and a processing step of executing processing corresponding to the tag of the image data acquired in the image data acquisition step and to the content specified in the specifying step.

The present invention also provides a program for causing a computer to function as: image data acquisition means for acquiring image data that represents an image being displayed and has a tag; audio data acquisition means for acquiring audio data representing speech uttered while the image is displayed; specifying means for specifying the content of the speech represented by the audio data acquired by the audio data acquisition means; and processing means for executing processing corresponding to the tag of the image data acquired by the image data acquisition means and to the content specified by the specifying means.

According to the present invention, it is possible to identify a user's instruction with respect to an image and perform processing according to the user's intention.

FIG. 1 is a diagram showing the devices according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the hardware configuration of the terminal device 10.
FIG. 3 is a functional block diagram of the terminal device 10.
FIG. 4 is a block diagram showing the hardware configuration of the server device 20.
FIG. 5 is a functional block diagram of the server device 20.
FIG. 6 is a flowchart showing the flow of processing performed by the control unit 101.
FIG. 7 is a diagram showing an example of the screen of the terminal device 10.

[Embodiment]
<Overall configuration>
FIG. 1 is a diagram showing the devices included in a voice agent system 1 according to an embodiment of the present invention. The communication network 2 includes the Internet, a fixed telephone network, and a mobile communication network that provides communication services such as voice communication and data communication. The terminal device 10 is connected to the communication network 2 by wireless communication. The terminal device 10 according to the present embodiment is a smartphone. The terminal device 10 is not limited to a smartphone and may be a tablet terminal or feature phone capable of data communication, a wearable device such as a wristwatch-type terminal, a computer device specialized for data communication, or the like. Although many terminal devices are connected to the communication network 2, only one terminal device 10 is shown in FIG. 1 to keep the drawing from becoming cluttered. The server device 20 is a device that provides a voice agent service to the terminal device 10. The server device 20 is connected to the communication network 2.

The voice agent service in the present embodiment is a service that executes various processes in response to utterances by the user of the terminal device 10. Its distinguishing feature is that, when the user speaks about an image displayed on the terminal device 10, it executes processing according to both the content of the image and the content of the utterance. With this service, the user can, for example, search for information by speaking to the terminal device 10 without typing characters by hand.

(Configuration of terminal device 10)
FIG. 2 is a diagram showing an example of the hardware configuration of the terminal device 10. The communication unit 105 functions as a communication interface that performs wireless communication with a wireless base station of the communication network 2. The audio processing unit 107 has a microphone and a speaker. During a voice call between terminal devices, when a digital signal carrying the other party's voice is supplied from the communication unit 105, the audio processing unit 107 converts the supplied digital signal into an analog signal. This analog signal is supplied to the speaker, which emits the other party's voice. When the microphone picks up sound, the audio processing unit 107 converts the picked-up sound into a digital signal. During a voice call, the audio processing unit 107 supplies the digital signal obtained from the user's voice to the communication unit 105. This digital signal is sent from the communication unit 105 to the communication network 2 and on to the other party's terminal device.

The touch panel 103 is a device that combines a display device (for example, a liquid crystal display) with a sensor that detects finger contact on the display surface of the display device, and is an example of an operation unit operated by the user. The touch panel 103 displays characters, a GUI (Graphical User Interface), a menu screen for operating the terminal device 10, and the like on the display device. The control unit 101 detects, by means of the sensor, the position touched by the user's finger. The control unit 101 identifies the user's operation from the position detected by the touch panel 103 and the screen displayed on the touch panel 103, and controls each unit and executes various processes according to the identified operation.

The positioning unit 108 includes an antenna that receives radio waves from a satellite navigation system. Based on the received radio waves, it measures the latitude and longitude of the position of the terminal device 10 and outputs position information representing the measured position to the control unit 101. The imaging unit 109 includes an image sensor (such as a CMOS or CCD sensor) and an optical system that forms an image on the image sensor. The imaging unit 109 generates image data representing the captured image and outputs the generated image data.

The storage unit 102 has a nonvolatile memory and stores the operating system program, application programs, image data generated by the imaging unit 109, and the like. In the present embodiment, the application programs stored in the storage unit 102 include one that accesses the server device 20 to use the voice agent service.

The control unit 101 has a CPU (Central Processing Unit) and a RAM (Random Access Memory). When the CPU executes the operating system program stored in the storage unit 102, the operating system of the smartphone is realized. Once the operating system is running, application programs can be executed.

(Functional configuration of terminal device 10)
FIG. 3 is a block diagram showing, among the functions realized in the terminal device 10, the configuration of the functions related to the present invention. The image acquisition unit 1001 acquires the image data generated by the imaging unit 109. The audio acquisition unit 1002 acquires the audio data generated by the audio processing unit 107. The audio data is the digital signal that the audio processing unit 107 generates from sound picked up by the microphone, converted into data. The transmission unit 1003 controls the communication unit 105 to transmit the image data acquired by the image acquisition unit 1001 and the audio data acquired by the audio acquisition unit 1002 to the server device 20. The reception unit 1004 receives the content that the server device 20 transmits in response to the transmitted image data and audio data. The presentation unit 1005 controls the touch panel 103 so that the content received by the reception unit 1004 is displayed.

(Configuration of server device 20)
FIG. 4 is a diagram showing the hardware configuration of the server device 20. The display unit 203 includes a liquid crystal display and displays a menu screen and the like for operating the server device 20. The operation unit 204 has input devices such as a keyboard and a mouse. The server device 20 operates in response to operations performed on the keyboard and mouse. The communication unit 205 functions as a communication interface that performs data communication via the communication network 2.

The storage unit 202 has a hard disk device and stores the operating system program, application programs, and the like. In the present embodiment, the storage unit 202 stores an application program that implements the voice agent service: it acquires audio data and image data from the terminal device 10, executes processing corresponding to the acquired data, and transmits the execution result to the terminal device 10.

The control unit 201 has a CPU, a ROM storing a boot loader, and a RAM. When the CPU executes the operating system program, application programs can be executed.

(Functional configuration of server device 20)
FIG. 5 is a block diagram showing, among the functions realized in the server device 20, the configuration of the functions related to the present invention. The data reception unit 2001 controls the communication unit 205 to receive the image data and audio data transmitted by the terminal device 10. The data reception unit 2001 functions as audio data acquisition means for acquiring audio data. The image analysis unit 2002 analyzes the image data acquired by the data reception unit 2001. Through image analysis it identifies the content of the image represented by the image data and adds a tag corresponding to the identified content to the image data. The voice analysis unit 2003 performs speech recognition on the audio data received by the data reception unit 2001 and converts the audio data into text data. It then analyzes the text data and specifies the content of the speech represented by the audio data. The voice analysis unit 2003 thus functions as specifying means for specifying the content of the speech represented by the audio data. The data processing unit 2004 acquires the tagged image data from the image analysis unit 2002 and functions as image data acquisition means. Based on the tag added to the acquired image data and the content of the text data specified by the voice analysis unit 2003, the data processing unit 2004 identifies the intention of the utterance and executes processing according to the identified intention. In the present embodiment, the data processing unit 2004 accesses an external search engine connected to the communication network 2 and executes search processing according to the identified intention. The providing unit 2005 transmits the result of the processing by the data processing unit 2004 to the terminal device 10 that transmitted the image data and audio data.
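The flow of data between these functional units can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `handle_request` and its callable parameters are hypothetical stand-ins for the image analysis unit 2002, the voice analysis unit 2003, and the data processing unit 2004, and the actual image analysis and speech recognition are assumed to be supplied by the caller.

```python
from typing import Callable, List, Optional, Tuple

Intent = Tuple[str, str]  # (tag, utterance text); a hypothetical representation


def handle_request(
    image_data: bytes,
    audio_data: bytes,
    analyze_image: Callable[[bytes], List[str]],   # stands in for unit 2002
    recognize_speech: Callable[[bytes], str],      # stands in for unit 2003
    interpret: Callable[[List[str], str], Optional[Intent]],  # part of unit 2004
    execute: Callable[[Intent], str],              # part of unit 2004
) -> Optional[str]:
    """Run one request through tagging, recognition, interpretation, execution."""
    tags = analyze_image(image_data)       # tags describing the image content
    text = recognize_speech(audio_data)    # audio data converted to text
    intent = interpret(tags, text)         # combine tags and text into an intent
    if intent is None:
        return None                        # nothing to execute
    return execute(intent)                 # result to be returned to the terminal


if __name__ == "__main__":
    # Stub callables only; real analysis is outside the scope of this sketch.
    result = handle_request(
        b"image", b"audio",
        analyze_image=lambda img: ["Mt. Fuji"],
        recognize_speech=lambda snd: "I want to go",
        interpret=lambda tags, text: (tags[0], text) if tags else None,
        execute=lambda intent: " ".join(intent),
    )
    print(result)
```

Passing the analysis steps in as callables keeps the sketch independent of any particular recognition library, which the text does not specify.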

(Operation example of embodiment)
Next, an operation example of this embodiment will be described. When the user points the imaging unit 109 of the terminal device 10, which is running the application program, at a subject and performs an operation instructing the terminal device 10 to shoot, the control unit 101 controls the imaging unit 109 to photograph the subject. The imaging unit 109 generates image data of the photographed subject, and the control unit 101 acquires the image data generated by the imaging unit 109. The control unit 101 controls the touch panel 103 so that the image represented by the acquired image data is displayed.

If there is something the user wants to find out about the subject, the user says what he or she wants to know about what is shown in the displayed image. The user's speech is converted into an audio signal by the microphone of the audio processing unit 107, and the audio processing unit 107 converts this audio signal into digitized audio data. The audio data is supplied from the audio processing unit 107 to the control unit 101. On acquiring the audio data supplied from the audio processing unit 107, the control unit 101 controls the communication unit 105 to transmit the acquired image data and audio data to the server device 20.

The communication unit 205 receives the image data and audio data transmitted by the terminal device 10. The control unit 201 acquires the image data and audio data received by the communication unit 205. On acquiring the image data and audio data, the control unit 201 executes the processing shown in FIG. 6.

First, the control unit 201 analyzes the image represented by the acquired image data and adds tags corresponding to the objects shown in the image to the image data (step SA1). The tags are added to the image data as metadata. As techniques for analyzing an image and adding tags, known techniques can be used, such as those described at http://j-net21.smrj.go.jp/develop/digital/entry/001-20110223-01.html and http://www.rbbtoday.com/article/2011/05/13/76887.html, or the one disclosed in JP 2010-252236 A. When the image contains Mt. Fuji and an automobile with the model name "AAA", as shown in FIG. 7, a tag "Mt. Fuji" and a tag "AAA", for example, are added to the image data. Next, the control unit 201 performs speech recognition on the audio data and converts the user's speech into text data (step SA2).

When the processing of step SA2 is complete, the control unit 201 interprets the intention of the user's utterance based on the content of the tags added to the image data and the content of the utterance represented by the text data (step SA3). Specifically, the control unit 201 performs morphological analysis on the text data to identify the words it contains, and interprets the intention of the utterance from the combination of the identified words and the tag data added to the image data.

This interpretation can be performed, for example, by building in the storage unit 202 a database that associates each word used as a tag with the words likely to be spoken about it. In the present embodiment, for example, the database associates the tag word "Mt. Fuji" with words such as "I want to go" and "height", and associates each automobile model name used as a tag with words such as "price" and "store".

Suppose that the user's utterance about the image of FIG. 7 is, for example, "I want to go". In the database, "I want to go" is associated with the tag word "Mt. Fuji" but not with the tag word "AAA". In this case, the control unit 201 determines that the utterance refers to "Mt. Fuji", the tag associated with the utterance content, and interprets the user's intention as "I want to go to Mt. Fuji".

On the other hand, when the user's utterance is, for example, "What is the price?", the database associates "price" with the tag word "AAA" but not with the tag word "Mt. Fuji". In this case, the control unit 201 determines that the utterance refers to "AAA", the tag associated with the utterance content, and interprets the user's intention as "What is the price of AAA?".
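The tag selection described above can be sketched as a lookup against a small tag-to-word table. This is a minimal sketch under stated assumptions: the table contents mirror the examples in the text, the names `TAG_WORDS`, `Intent`, and `interpret_intent` are hypothetical, and the words are assumed to have already been extracted from the text data by morphological analysis.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set

# Hypothetical database contents, taken from the examples in the text.
TAG_WORDS: Dict[str, Set[str]] = {
    "Mt. Fuji": {"I want to go", "height"},
    "AAA": {"price", "store"},
}


@dataclass(frozen=True)
class Intent:
    tag: str   # the tag the utterance is judged to refer to
    word: str  # the spoken word that matched the tag


def interpret_intent(
    tags: Iterable[str],
    words: Iterable[str],
    tag_words: Dict[str, Set[str]] = TAG_WORDS,
) -> Optional[Intent]:
    """Return the first (tag, word) pair that the database associates."""
    tag_list = list(tags)
    for word in words:
        for tag in tag_list:
            if word in tag_words.get(tag, set()):
                return Intent(tag=tag, word=word)
    return None  # no tag in the image matches any spoken word


if __name__ == "__main__":
    image_tags = ["Mt. Fuji", "AAA"]
    print(interpret_intent(image_tags, ["I want to go"]))
    print(interpret_intent(image_tags, ["price"]))
```

Returning only the first match is a simplification; the summary of the invention also allows selecting more than one tag.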

Having interpreted the intention of the user's utterance, the control unit 201 executes processing according to the interpreted content (step SA4). For example, when the user's intention is "I want to go to Mt. Fuji", the control unit 201 uses an external search engine connected to the communication network 2 to search with, for example, the keywords "Mt. Fuji" and "I want to go". On receiving search results such as transportation access to Mt. Fuji from the external search engine, the control unit 201 controls the communication unit 205 to transmit the received search results to the terminal device 10 (step SA5). If the terminal device 10 is configured to transmit the position information generated by the positioning unit 108 together with the audio data and image data, a route from the user's current location to Mt. Fuji may be searched for.

When the user's intention is "What is the price of AAA?", the control unit 201 uses an external search engine connected to the communication network 2 to search with, for example, the keywords "AAA" and "price". On receiving search results for the price of the model "AAA" from the external search engine, the control unit 201 controls the communication unit 205 to transmit the received search results to the terminal device 10.
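As an illustration of how the selected tag and spoken word might be turned into a search request, the following sketch builds a query URL. The endpoint `https://search.example/search` and the parameter names `q`, `lat`, and `lon` are assumptions made for this example; the text does not name a search engine or its interface.

```python
from typing import Optional, Tuple
from urllib.parse import urlencode

# Hypothetical endpoint; the patent does not specify a search engine or API.
SEARCH_BASE = "https://search.example/search"


def build_search_url(
    tag: str,
    word: str,
    location: Optional[Tuple[float, float]] = None,
) -> str:
    """Combine the tag and the spoken word into one keyword query.

    If the terminal also sent position information, append it so the
    search side could, for instance, compute a route from that position.
    """
    params = {"q": f"{tag} {word}"}
    if location is not None:
        latitude, longitude = location
        params["lat"] = str(latitude)    # assumed parameter name
        params["lon"] = str(longitude)   # assumed parameter name
    return f"{SEARCH_BASE}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url("AAA", "price"))
    print(build_search_url("Mt. Fuji", "I want to go", location=(35.0, 139.0)))
```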

When the communication unit 105 receives the search results transmitted by the server device 20, the control unit 101 acquires them and controls the touch panel 103 so that they are displayed. For example, when the user says "I want to go" about the image of FIG. 7, search results for transportation access to Mt. Fuji are displayed on the touch panel 103. When the user says "What is the price?" about the image of FIG. 7, the price of the photographed automobile is displayed on the touch panel 103.

As described above, according to the present embodiment, combining the image analysis result with the speech analysis result makes it possible to identify the intention of the user's utterance accurately and to execute processing according to the utterance.

[Modifications]
Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment and can be implemented in various other forms. For example, the present invention may be implemented by modifying the embodiment as follows. The embodiment described above and the modifications below may be combined with one another.

In the embodiment described above, an image obtained by shooting is displayed, and when the user speaks about the displayed image, processing corresponding to the displayed image and the utterance is executed. The image displayed by the terminal device 10, however, is not limited to an image that has just been captured. For example, the terminal device 10 may display an image represented by image data that the user selects from image data previously captured and stored in the storage unit 102, and transmit to the server device 20 the image data of the displayed image and the audio data of the user's utterance about it. The terminal device 10 may also display an image from image data received from another computer device via the communication network 2, or from image data read from a recording medium removable from the terminal device 10, and transmit to the server device 20 the image data of the displayed image and the audio data of the user's utterance about it.

In the present invention, the tags attached to the image data are not limited to those described in the embodiment above. For example, the control unit 201 adds the name of a dish as a tag when the image shows a dish, and adds the name of a person as a tag when the image shows a person. When the captured image is a landscape, the control unit 201 adds the place name of the shooting location as a tag, and when a landmark appears in the captured image, it adds the name of that landmark as a tag.

When a dish appears in the image and the user's utterance is, for example, "eat", the server device 20 performs a search using the dish name from the tag and the keyword "eat". When a dish appears in the image and the user's utterance is, for example, "calories", the server device 20 performs a search using the dish name from the tag and the keyword "calories". If several kinds of dishes appear in the image, the calories of each dish may be searched for, and the calories of the dishes may be totaled and presented to the user. When a professional athlete appears in the image and the user's utterance is "What is the record?", the server device 20 performs a search using the athlete's name and the keyword "record". When a landmark appears in the image and the user's utterance is "access", the server device 20 performs a search using the name of the landmark and the keyword "access".

In the embodiment described above, the terminal device 10 displays a still image, but it may instead display a moving image. When the user speaks while a moving image is being displayed, the terminal device 10 may extract the frame of the moving image at the time of the utterance and transmit the image data of the extracted frame, together with the voice data of the user's utterance, to the server device 20.

In the embodiment described above, the server device 20 performs a search process according to the tag and the utterance content, but the processing performed by the server device 20 is not limited to search processing. For example, when a landscape image is tagged with the place name of the shooting location and the user's utterance is "correct", the control unit 201 may automatically correct the brightness, contrast, saturation, and the like of the landscape image data and transmit the image data of the corrected image to the terminal device 10.

In the present invention, the user may designate an object shown in the image on the terminal device 10, and the server device 20 may execute processing according to the tag attached to the designated object and the text data. For example, while a captured image is displayed, the user touches the position of an object displayed in the image with a finger and speaks. The control unit 101 identifies the position at which the finger is in contact with the touch panel 103 and identifies the coordinates of the corresponding position in the displayed image. When the user speaks, the control unit 101 transmits the image data, the voice data, and the coordinates of the position in the image touched by the user's finger to the server device 20. After tagging the image data, the server device 20 identifies the tag corresponding to the object at the touched position in the image and performs processing according to the identified tag and the utterance content. For example, in the image shown in FIG. 7, when the user touches the position of the automobile with a finger and says "What is the price?", the control unit 201, without referring to the database, analyzes the image of the automobile at the coordinates of the finger position, determines the model name of the automobile as the tag, and performs a search process using the determined tag and the keyword "price". According to this configuration, the intention of the user's utterance can be identified more accurately.
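As a non-limiting illustration, selecting a tag from designated coordinates can be sketched as a simple hit test, assuming that each tag is associated with a rectangular region of the image. The `RegionTag` class, the pixel values, the "building" tag, and the rule of preferring the smallest enclosing region are assumptions made for this sketch and are not taken from the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionTag:
    name: str
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive

    def contains(self, x, y):
        return self.left <= x < self.right and self.top <= y < self.bottom

    def area(self):
        return (self.right - self.left) * (self.bottom - self.top)


def tag_at(tags, x, y):
    """Return the tag whose region contains (x, y), or None if there is none.

    When several regions overlap at the point, the smallest one is chosen
    (an assumption of this sketch).
    """
    hits = [t for t in tags if t.contains(x, y)]
    return min(hits, key=lambda t: t.area(), default=None)


tags = [
    RegionTag("automobile", 100, 50, 300, 200),
    RegionTag("building", 0, 0, 400, 300),
]

selected = tag_at(tags, 120, 80)  # the point lies inside both regions
query = f"{selected.name} price"  # tag combined with the spoken keyword
```

The same function would apply unchanged when the coordinates come from line-of-sight detection rather than from a touch.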

In a terminal device 10 provided with an imaging unit capable of photographing the user's face from the display surface side of the touch panel 103, the face of the user looking at the touch panel 103 may be photographed to detect the user's line of sight, and the coordinates of the position at which the user is gazing in the image displayed on the touch panel 103 may be identified. In this configuration, when the user speaks, the control unit 101 transmits the image data, the voice data, and the coordinates of the position in the image at which the user gazed to the server device 20. For example, in the image shown in FIG. 7, when the user gazes at the automobile and says "What is the price?", the control unit 201, without referring to the database, analyzes the image of the automobile at the coordinates of the gazed position, determines the model name of the automobile as the tag, and performs a search process using the determined tag and the keyword "price". This configuration also allows the intention of the user's utterance to be identified more accurately.

When the terminal device 10 is a glasses-type terminal, the imaging unit captures the range of the user's field of view to generate image data. In this case, the user's line of sight may be detected, and the coordinates of the position at which the user is gazing in the image being captured may be identified. In this configuration, when the user speaks, the control unit 101 transmits the image data, the voice data, and the coordinates of the position in the image at which the user gazed to the server device 20. Upon receiving the search result transmitted from the server device 20, the terminal device 10 displays the search result. According to this configuration, the user can easily search for an object in the field of view simply by speaking while the image is being captured.

In the present invention, the tag corresponding to the utterance may also be selected according to the size of each object in the image and the focus state of each object. For example, when the objects shown in the image include an object whose area in the image is below a predetermined threshold, the tag corresponding to that object may be excluded when interpreting the intention of the user's utterance. Alternatively, the object that is most sharply in focus among the objects shown in the image may be identified, and a search process may be performed according to the tag corresponding to the identified object and the utterance content.
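As a non-limiting illustration, the two selection rules described above (excluding small objects and preferring the best-focused object) can be sketched as follows. The `DetectedObject` class, the `area_ratio` and `sharpness` measures, the sample values, and the threshold of 0.05 are hypothetical; the embodiment does not specify how area or focus is measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectedObject:
    tag: str
    area_ratio: float  # fraction of the image occupied by the object (assumed measure)
    sharpness: float   # assumed focus score; a higher value means better focus


def candidate_tags(objects, min_area_ratio):
    """Drop tags of objects whose area in the image is below the threshold."""
    return [o.tag for o in objects if o.area_ratio >= min_area_ratio]


def best_focused_tag(objects):
    """Return the tag of the object with the highest focus score, or None."""
    best = max(objects, key=lambda o: o.sharpness, default=None)
    return best.tag if best is not None else None


objects = [
    DetectedObject("automobile", 0.30, 0.9),
    DetectedObject("bird", 0.01, 0.2),
    DetectedObject("tree", 0.15, 0.4),
]

remaining = candidate_tags(objects, min_area_ratio=0.05)
focused = best_focused_tag(objects)
```

Either result could then be combined with the utterance content in the same way as any other tag.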

The image obtained by the image sensor of the imaging unit 109 may be displayed on the touch panel 103, and when the user speaks while the image is displayed, the image at the time the utterance ends may be extracted as a still image, and the image data of the extracted image and the voice data of the user's utterance may be transmitted to the server device 20. According to this configuration, the intention of the user's utterance can be identified and processing according to the utterance content can be executed without any shooting operation such as tapping the touch panel 103 or operating a button of the terminal device 10. In this modification as well, an object shown in the image may be designated with a finger or by line of sight, as in the modifications described above.
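As a non-limiting illustration, extracting the image that was being displayed when the utterance ended can be sketched as a lookup over timestamped frames. The function name, the timestamps in seconds, and the frame labels are hypothetical, and the sketch assumes the frames are buffered in ascending time order.

```python
import bisect


def frame_at(timestamps, frames, t):
    """Return the latest frame captured at or before time t, or None.

    `timestamps` must be sorted in ascending order and aligned with `frames`.
    """
    i = bisect.bisect_right(timestamps, t)
    return frames[i - 1] if i > 0 else None


# Hypothetical preview buffer: capture times in seconds and frame labels.
timestamps = [0.0, 0.5, 1.0, 1.5]
frames = ["f0", "f1", "f2", "f3"]

utterance_end = 1.2
still = frame_at(timestamps, frames, utterance_end)
```

The same lookup would serve for a moving image if `t` were the time of the utterance rather than its end.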

In the embodiment described above, the server device 20 performs speech recognition on the voice data, but the speech recognition may instead be performed in the terminal device 10. In this configuration, the terminal device 10 transmits the text data obtained by speech recognition to the server device 20 in place of the voice data. According to this configuration, the load on the server device 20 can be reduced.

In the embodiment described above, the server device 20 adds tags to the image data, but the tags may instead be added to the image data in the terminal device 10. When tags are added in the terminal device 10, the user may operate the touch panel 103 to add the tags to the image data. When the server device 20 acquires image data to which tags have already been added, it omits the process of step SA1 and executes the processes from step SA2 onward. According to this configuration, the load on the server device 20 can be reduced.
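As a non-limiting illustration, a server that skips tagging or speech recognition when the terminal has already done that work can be sketched as follows. The request keys (`image`, `audio`, `tags`, `text`), the function names, and the stub implementations are hypothetical and stand in for the actual processing of the embodiment.

```python
def handle_request(request, tag_image, recognize_speech, process):
    """Run only the stages that the terminal has not already performed."""
    tags = request.get("tags")
    if tags is None:
        # Tagging on the server (the stage omitted when tags are supplied).
        tags = tag_image(request["image"])
    text = request.get("text")
    if text is None:
        # Speech recognition on the server when only voice data was sent.
        text = recognize_speech(request["audio"])
    return process(tags, text)


calls = []  # records which server-side stages actually ran


def tag_image(image):
    calls.append("tagging")
    return ["Mt. Fuji"]


def recognize_speech(audio):
    calls.append("recognition")
    return "height"


def process(tags, text):
    return f"{tags[0]} {text}"


# Terminal already supplied tags and text: no server-side stage runs.
pre_done = handle_request(
    {"image": b"img", "tags": ["Mt. Fuji"], "text": "height"},
    tag_image, recognize_speech, process,
)
calls_after_pre_done = list(calls)

# Terminal sent raw image and voice data: both stages run on the server.
raw = handle_request(
    {"image": b"img", "audio": b"pcm"},
    tag_image, recognize_speech, process,
)
```

In this sketch the reduction in server load corresponds to the stages that never appear in `calls`.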

In the present invention, the user's voice may be recorded in advance, and when an image corresponding to the recorded voice is later obtained by the imaging unit 109, processing according to the image and the utterance content may be performed. For example, when an image of Mt. Fuji is obtained by the imaging unit 109 after the utterance "What is the height of Mt. Fuji?" has been recorded, a process of searching for the height of Mt. Fuji may be executed based on the image, to which "Mt. Fuji" has been added as a tag, and the content of the recorded voice.
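As a non-limiting illustration, pairing a previously recorded utterance with a later image can be sketched as follows. Matching by checking whether the tag name occurs in the recognized text is an assumption of this sketch; the embodiment does not state how correspondence between the voice and the image is determined, and the function name is hypothetical.

```python
def match_pending(pending_texts, image_tags):
    """Return (tag, text) pairs where a recorded utterance mentions a tag."""
    return [
        (tag, text)
        for text in pending_texts
        for tag in image_tags
        if tag in text
    ]


# Utterances recorded before any matching image was captured.
pending_texts = ["What is the height of Mt. Fuji?", "What is the price?"]

# Tags added to an image obtained later by the imaging unit.
image_tags = ["Mt. Fuji"]

matches = match_pending(pending_texts, image_tags)
```

Each resulting pair could then trigger the same tag-and-utterance processing used for an image that is displayed while the user speaks.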

A program that implements the functions of the present invention may be provided stored on a computer-readable recording medium, such as a magnetic recording medium (magnetic tape, or a magnetic disk such as an HDD (Hard Disk Drive) or FD (Flexible Disk)), an optical recording medium (such as an optical disc), a magneto-optical recording medium, or a semiconductor memory, and installed on each device. Alternatively, the program may be downloaded via a communication network and installed on each device.

DESCRIPTION OF SYMBOLS 1 ... voice agent system, 2 ... communication network, 10 ... terminal device, 20 ... server device, 101 ... control unit, 102 ... storage unit, 103 ... touch panel, 105 ... communication unit, 107 ... voice processing unit, 108 ... positioning unit, 109 ... imaging unit, 201 ... control unit, 202 ... storage unit, 203 ... display unit, 204 ... operation unit, 205 ... communication unit, 1001 ... image acquisition unit, 1002 ... voice acquisition unit, 1003 ... transmission unit, 1004 ... reception unit, 1005 ... presentation unit, 2001 ... data reception unit, 2002 ... image analysis unit, 2003 ... voice analysis unit, 2004 ... data processing unit, 2005 ... provision unit

Claims

An information processing apparatus comprising:
image data acquisition means for acquiring image data that represents an image being displayed and has a tag;
voice data acquisition means for acquiring voice data representing voice spoken while the image is displayed;
specifying means for specifying the content of the voice represented by the voice data acquired by the voice data acquisition means; and
processing means for executing processing corresponding to the tag of the image data representing the image being displayed, acquired by the image data acquisition means, and to the content specified by the specifying means.

The information processing apparatus according to claim 1, wherein
the image has a plurality of tags,
the apparatus further comprises selection means for selecting one or more tags from the plurality of tags, and
the processing means executes processing corresponding to the one or more tags selected by the selection means and to the content specified by the specifying means.

The information processing apparatus according to claim 2, wherein the selection means selects one or more tags from the plurality of tags according to the content specified by the specifying means.

The information processing apparatus according to claim 3, further comprising a database that stores a plurality of correspondences between tags and words corresponding to the tags, wherein
the specifying means specifies a word represented by the voice, and
the selection means selects a tag associated in the database with the word specified by the specifying means.

The information processing apparatus according to claim 2, wherein
the image data acquisition means acquires coordinates representing a position in the image,
each tag is associated with a position in the image, and
the selection means selects the tag associated with the position of the coordinates acquired by the image data acquisition means.

An information processing method comprising:
an image data acquisition step of acquiring image data that represents an image being displayed and has a tag;
a voice data acquisition step of acquiring voice data representing voice spoken while the image is displayed;
a specifying step of specifying the content of the voice represented by the voice data acquired in the voice data acquisition step; and
a processing step of executing processing corresponding to the tag of the image data representing the image being displayed, acquired in the image data acquisition step, and to the content specified in the specifying step.

A program for causing a computer to function as:
image data acquisition means for acquiring image data that represents an image being displayed and has a tag;
voice data acquisition means for acquiring voice data representing voice spoken while the image is displayed;
specifying means for specifying the content of the voice represented by the voice data acquired by the voice data acquisition means; and
processing means for executing processing corresponding to the tag of the image data representing the image being displayed, acquired by the image data acquisition means, and to the content specified by the specifying means.