JP2005250317A

JP2005250317A - Information processor

Info

Publication number: JP2005250317A
Application number: JP2004063505A
Authority: JP
Inventors: Fumitaka Matsumoto; 文隆松本
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2004-03-08
Filing date: 2004-03-08
Publication date: 2005-09-15

Abstract

PROBLEM TO BE SOLVED: To appropriately and easily add an index to time-series information in which image information, voice information and the like are recorded in time sequence. SOLUTION: The scene of a user 12 operating a copy machine 10 is recorded as time-series information including image information and voice information. The mouth of the user 12 operating the apparatus (the copy machine) 10 is photographed by a video camera 20, and an utterance detection section 22 detects the user's utterance from the obtained image based on the shape of the user's mouth. An index writing section 26 writes an index at the place (the time) in the time-series information, in which the scene of the user's operation is recorded, that corresponds to the time of the user's utterance. Moreover, a voice recognition section 28 performs voice recognition on the voice information captured while an utterance is detected, and the recognized language information is converted into character information by a telop generating section 30 and written into the time-series information by a telop writing section 32. COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to an information processing apparatus having a function of assigning an index to a predetermined point in time-series information such as images and sounds.

Techniques are known for assigning an index, in order to facilitate searching, to time-series information that includes at least one of image information and audio information and, in some cases, text information as well (such information is sometimes called multimedia information). For example, Patent Document 1 below discloses a technique in which, when time-series information on a conference is recorded, the remarks (utterances) of the conference attendees are acquired as voice information, the time at which each remark was made is determined, and the position (time point) in the time-series information corresponding to a remark is retrieved from that remark.

JP-A-11-53385

The apparatus described in the above publication extracts remarks from voice information, so it can be difficult to separate remarks from other sounds (noise) and to identify which attendee spoke.

The present invention appropriately detects utterances when assigning indexes based on them.

According to the present invention, an utterance is detected based on the shape of the speaker's mouth, and an index is assigned to the position (time point) in the time series at which the utterance was detected. Because the utterance is detected from image information, it can be detected without being affected by ambient noise or the like.

One aspect of the present invention is a usability test apparatus for equipment. Recent office and information equipment is designed so that functions are selected while viewing a screen: when one item is selected from a displayed group of functions, more detailed settings and choices for that function group are shown. This hierarchy often has many levels, and a single task often requires several functions to be set. A usability test confirms whether such function-selection operations are easy for users to understand. An apparatus that performs this test records, as time-series information, the display screen of the device under test, image information obtained by photographing the user operating it, and voice information obtained by picking up the user's utterances. A camera also captures the shape of the user's mouth, and based on this shape the apparatus detects whether the user is producing voice, that is, whether the user is speaking. An index is written at the position (time point) in the time-series information corresponding to the time at which the user spoke. Searching on this index makes the time-series information easier to access and the usability test easier to analyze.
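Index-based access to the recording can be illustrated with a minimal sketch. It assumes the indexes are available as a sorted list of time points in seconds; the function names `next_index` and `previous_index` are hypothetical and do not come from the patent.

```python
from bisect import bisect_right
from typing import List, Optional


def next_index(index_times: List[float], position: float) -> Optional[float]:
    """Return the first index strictly after the playback position, if any."""
    i = bisect_right(index_times, position)
    return index_times[i] if i < len(index_times) else None


def previous_index(index_times: List[float], position: float) -> Optional[float]:
    """Return the last index at or before the playback position, if any."""
    i = bisect_right(index_times, position) - 1
    return index_times[i] if i >= 0 else None


if __name__ == "__main__":
    times = [3.0, 17.5, 42.0]  # assumed utterance onsets, in seconds
    print(next_index(times, 10.0))      # jump forward to the next utterance
    print(previous_index(times, 10.0))  # jump back to the previous utterance
```

Because the index list is sorted, a binary search is enough to jump between utterances without scanning the recording itself.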

Further, voice recognition can be performed on the voice information captured while the user is speaking, the voice can be converted into character information, and this character information can be written into the time-series information.

Another aspect of the present invention is a recording apparatus for presentations and meetings. The appearance and utterances of the attendees of a meeting or the like are recorded as time-series information and used later, for example to prepare the minutes. In this apparatus too, an attendee's utterance is detected from the shape of that attendee's mouth, and an index is assigned to the time point in the time-series information corresponding to the time of detection. Searching on this index makes the time-series information easier to access, which helps when the content of the meeting is reviewed later. This apparatus can also convert the utterance into character information, based on the voice information captured while the attendee is speaking, and write it into the time-series information. Moreover, because the mouth shape at the time of the utterance is evaluated, the attendee who spoke can be identified, and this information can be included in the index.

Hereinafter, embodiments of the present invention (hereinafter referred to as embodiments) will be described with reference to the drawings. FIG. 1 shows the schematic configuration of the present embodiment. This embodiment is an example applied to a so-called usability test, which verifies whether a device can be operated and handled smoothly. The device 10 subject to the usability test is, for example, a copy machine, and the description below takes a usability test of a copy machine as its example. The copy machine 10 requires various settings when copying, such as the number of copies, single-sided or double-sided copying, the paper type, and the enlargement or reduction ratio. To make these settings, the copy machine 10 has a user interface that includes a display unit and an input unit. This interface can be, for example, a touch panel sensor in which the display unit and the input unit are integrated. The display unit shows the available selections and settings, and the user enters the desired settings through the input unit according to what is displayed. The display for selecting and setting the functions of the copy machine is hierarchical. For example, to make an enlarged copy, the user chooses “select magnification” on the initial screen, selects the desired magnification on the magnification selection screen that then appears, and makes the required settings. The user then displays further setting screens in turn, for example to select a tray (which sets the paper type) or to specify double-sided copying, and selects the required functions.

Thus, while recent copy machines have many functions, the operations for selecting those functions have become complicated. A usability test verifies whether these operations can be performed properly, that is, whether the screen contents, instructions and the like are appropriate.

The usability test verifies whether the user 12 was able to carry out the selection and setting operations of the copy machine 10 smoothly. For example, the user (subject) 12 is given a task such as “copy 8 copies from a double-sided original to a single-sided original at a magnification of 70 percent”, and while the user performs it, the display screen of the copy machine and the surrounding situation, such as the operation procedure (that is, the history of the display screen) and the behavior of the user, are recorded in time series as image, audio and other information. This time-series information is analyzed later to verify usability.

Information from the interface of the copy machine 10, in particular the screen of the input interface, is input to the scan converter 14, sent to the time-series information recording unit 16 as time-series image information, and written to and stored on a predetermined recording medium by the time-series information writing unit 18. In addition, the video camera 20 films the user 12 operating the copy machine 10, and the result is sent to the time-series information recording unit 16 as image and audio information and stored. This time-series information on the copy machine 10 itself and on its surroundings as filmed by the video camera can be played back later, and the reproduced images and sound are used to verify usability. The copy machine 10 also sends information indicating the operations performed by the user 12 to the operation recording unit 22, where it is stored sequentially. The information recorded in the operation recording unit 22 is the flow of operations in the order performed, that is, a sequence of operations.

The video camera 20 is positioned so as to film the mouth of the user 12. The captured image information is sent to the utterance detection unit 22, which determines from the shape of the user's mouth whether the user 12 is uttering words, that is, speaking. To detect the user's utterance from the mouth shape, the technique described in, for example, Japanese Patent Laid-Open No. 2001-51693 can be used. When an utterance by the user 12 is detected, the index creation unit 24 instructs the time-series information recording unit 16 to write an index at the time point in the time-series information corresponding to the beginning of the utterance. In the time-series information recording unit 16, the index writing unit 26 assigns the index to the specified time point of the time-series information stored on the storage medium.
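The flow from mouth-shape detection to an index at the head of each utterance can be sketched as follows. This is only an illustration under stated assumptions: the per-frame score `mouth_open`, the `threshold`, and the `min_frames` debounce are hypothetical, and the mouth-shape analysis itself (which the text delegates to Japanese Patent Laid-Open No. 2001-51693) is not reproduced here.

```python
from typing import List, Tuple


def detect_utterances(
    mouth_open: List[float],
    fps: float,
    threshold: float = 0.5,
    min_frames: int = 3,
) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) spans where the mouth score stays above threshold.

    mouth_open is an assumed per-frame mouth-opening score; how it is
    derived from the camera image is outside this sketch.
    """
    spans = []
    start = None
    for i, value in enumerate(mouth_open):
        speaking = value > threshold
        if speaking and start is None:
            start = i  # candidate onset frame
        elif not speaking and start is not None:
            if i - start >= min_frames:  # ignore very short blips
                spans.append((start / fps, i / fps))
            start = None
    # Close a span that runs to the end of the recording.
    if start is not None and len(mouth_open) - start >= min_frames:
        spans.append((start / fps, len(mouth_open) / fps))
    return spans


def onset_indexes(spans: List[Tuple[float, float]]) -> List[float]:
    """One index per utterance, placed at the head of the utterance."""
    return [start for start, _ in spans]


if __name__ == "__main__":
    scores = [0, 0, 0.9, 0.9, 0.9, 0.9, 0, 0, 0.8, 0, 0.7, 0.7, 0.7]
    spans = detect_utterances(scores, fps=10)
    print(spans)
    print(onset_indexes(spans))
```

The minimum-length check is a design choice of this sketch, added so that a single noisy frame does not produce an index; the source does not specify any such filtering.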

When the utterance detection unit 22 detects an utterance by the user 12, the voice recognition unit 28 performs voice recognition on the voice information for the period of the utterance and converts the utterance content into language information. The telop creation unit 30 then turns this language information into a telop of character information, and the time-series information recording unit 16 writes the telop into the time-series information at the time point of the corresponding utterance. When time-series information containing telops is played back, the utterance content is displayed as text on the image. The display time of a telop can be matched to the actual utterance period or, when the content is long, made longer than the utterance period so that it is easier to read.
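The rule for telop display time (match the utterance period, or extend it for long text) can be sketched as below. The reading-speed parameter `chars_per_sec` and the function name `telop_interval` are assumptions made for illustration; the source does not say how much longer a long telop should stay on screen.

```python
from typing import Tuple


def telop_interval(
    start: float, end: float, text: str, chars_per_sec: float = 8.0
) -> Tuple[float, float]:
    """Return (show_from, show_until) in seconds for one telop.

    The telop is shown for the utterance period, or for an assumed
    reading time if that is longer.
    """
    spoken = end - start
    needed = len(text) / chars_per_sec  # assumed time needed to read the text
    return (start, start + max(spoken, needed))


if __name__ == "__main__":
    print(telop_interval(10.0, 12.0, "abcd"))    # short text: utterance period
    print(telop_interval(10.0, 11.0, "x" * 40))  # long text: extended display
```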

The description above covers the case where the utterance content is recorded as character information, that is, where a telop is inserted into the image, but it is also possible to write only the index and omit the telop. It is also possible to first record video and audio with the video camera 20, play the recording back, and write the index and other data based on the reproduced information.

The video camera 20 may also be provided as two separate cameras: one that films only the user's mouth, and one that films a relatively wide area so that the user's operation can be recorded. Furthermore, direct recording of the display screen of the copy machine 10 can be omitted, in which case it is also suitable to film the operation screen with a video camera.

The information processing apparatus of the above embodiment can be realized by operating a computer with a predetermined program. The computer acquires the information from the video camera 20 and, under the program, functions as the utterance detection unit 22, the index creation unit 24, the voice recognition unit 28, and the telop creation unit 30. The time-series information recording unit 16 can be an external recording device or a recording device inside the computer, such as a hard disk. The computer instructs the recording device to write an index according to the detection result of the utterance detection unit 22.

FIG. 2 shows the schematic configuration of another embodiment of the present invention. This embodiment assumes the recording of a presentation; components that are the same as in the embodiment shown in FIG. 1 are given the same reference numerals, and their description is omitted.

The presenter 40 prepares images that follow the content of the presentation in advance and, by operating an information device such as the computer 42, displays the prepared images one after another on the presentation screen 44 as the presentation proceeds, showing them to the other attendees of the presentation (hereinafter referred to as the audience) 46. The presented images are sent to the time-series information recording unit 16 via the scan converter 14, and video cameras record images and sound of the attendees. The video cameras include a video camera 20A that films the presenter 40 and video cameras 20B and 20C that film the audience 46. The image and audio information captured by the video cameras 20A, 20B, and 20C is stored on a predetermined storage medium by the time-series information recording unit 16. The image information from each video camera and from the scan converter 14 can be recorded so that each corresponds to one part of a single display screen divided into, for example, four. In that case, the four pieces of image information are displayed simultaneously on the display screen when the time-series information is played back. Alternatively, each piece of image information can be recorded separately so that one or more of them can be selected and displayed at playback.
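The four-way split of one display screen can be illustrated with a small layout helper. The 2x2 arrangement, the frame size, and the source names used here are assumptions; the text only says that the screen is divided into, for example, four parts.

```python
from typing import Dict, List, Tuple


def quad_layout(
    width: int, height: int, sources: List[str]
) -> Dict[str, Tuple[int, int, int, int]]:
    """Map up to four source names to (x, y, w, h) quadrants of one frame."""
    if len(sources) > 4:
        raise ValueError("a 2x2 split holds at most four sources")
    w, h = width // 2, height // 2
    layout = {}
    for i, name in enumerate(sources):
        row, col = divmod(i, 2)  # fill left to right, then top to bottom
        layout[name] = (col * w, row * h, w, h)
    return layout


if __name__ == "__main__":
    # Hypothetical labels for the scan converter output and the three cameras.
    print(quad_layout(1280, 720, ["slides", "20A", "20B", "20C"]))
```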

The information acquired by each of the video cameras 20A, 20B, and 20C is processed by an utterance detection unit 22, an index creation unit 24, a voice recognition unit 28, and a telop creation unit 30 configured as in the embodiment shown in FIG. 1, and is sent to the time-series information recording unit 16. In the figure, these components are omitted for the video cameras 20B and 20C. It is also preferable to provide a separate video camera that films the presentation venue as a whole.

Indexes can be written in parallel with the writing of the time-series information; alternatively, the recorded time-series information can be played back and the indexes and telops written afterwards, based on the information being reproduced. A telop can be written in association with the image of the person who spoke. Then, where the images to be shown can be selected at playback as described above, the telops of the speakers appearing in the selected images can be displayed. Indexes and telops for the individual pieces of time-series information acquired by the multiple video cameras can thus be written independently for each piece of time-series information, or written in common to all of them. For example, an index or the like based on one attendee's utterance can be assigned only to the time-series information for that attendee. An index or the like can also be written into the time-series information of other attendees based on one attendee's utterance. Furthermore, indexes or the like based on the utterances of several attendees can be written into a single piece of time-series information, or into the time-series information for several attendees.
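The two writing policies (per-stream or common to all streams) can be sketched as follows. The `Stream` class, the `shared` flag, and the stream names are hypothetical; the index entry carries a speaker label, reflecting the statement elsewhere in the description that the speaker can be identified and included in the index.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Stream:
    """One recorded piece of time-series information (for example, one camera)."""
    name: str
    indexes: List[dict] = field(default_factory=list)


def write_index(
    streams: Dict[str, Stream], speaker: str, time_sec: float, shared: bool = False
) -> None:
    """Write an index for `speaker` at `time_sec`.

    shared=False: only the speaker's own stream receives the index.
    shared=True:  every stream receives it.
    """
    entry = {"time": time_sec, "speaker": speaker}
    targets = streams.values() if shared else [streams[speaker]]
    for stream in targets:
        stream.indexes.append(dict(entry))  # copy so streams do not share one dict


if __name__ == "__main__":
    streams = {n: Stream(n) for n in ("presenter", "audience_b", "audience_c")}
    write_index(streams, "presenter", 12.5)                 # own stream only
    write_index(streams, "audience_b", 40.0, shared=True)   # all streams
    for s in streams.values():
        print(s.name, s.indexes)
```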

When a presentation or meeting has many attendees, the attendees filmed for indexing can be limited to a subset. Also, the presented images need not be acquired via the scan converter 14; for example, a video camera filming the whole venue can film the presented images and record them.

The above embodiment can be applied not only to presentations, that is, gatherings in which one or a few presenters speak and the other participants listen, but also to meetings in which the participants exchange opinions on an equal footing, and to symposia in which several speakers state their views and then discuss them with the other participants.

Like the embodiment of FIG. 1, the embodiment of FIG. 2 can be realized by operating a computer with a predetermined program.

FIG. 1 is a block diagram showing the schematic configuration of an embodiment.
FIG. 2 is a block diagram showing the schematic configuration of another embodiment.

Explanation of symbols

10 copy machine (device), 12 user (subject), 14 scan converter, 16 time-series information recording unit, 18 time-series information writing unit, 20 video camera, 22 utterance detection unit, 24 index creation unit, 26 index writing unit, 28 voice recognition unit, 30 telop creation unit, 32 telop writing unit, 40 user (presenter), 42 computer (information device), 44 presentation screen, 46 audience.

Claims

1. An information processing apparatus comprising:
time-series information recording means for recording time-series information including image information obtained by photographing a user who operates a predetermined device according to the display on the display screen of the device, voice information obtained by collecting the user's utterance content, and image information of the display screen of the device;
utterance detection means for detecting the user's utterance based on the shape of the user's mouth in the image information obtained by photographing the user; and
index giving means for giving an index to a time point in the time-series information corresponding to the time point at which the utterance detection means detects an utterance.

2. The information processing apparatus according to claim 1, further comprising:
voice recognition means for performing voice recognition on the voice information while the utterance detection means is detecting the user's utterance and converting the voice information into character information,
wherein the time-series information recording means writes the character information converted by the voice recognition means at the time point in the time-series information at which the utterance corresponding to the character information was made.

3. An information processing apparatus comprising:
time-series information recording means for storing time-series information including image information obtained by photographing attendees of a presentation or meeting, and voice information obtained by collecting the utterance content of the attendees;
utterance detection means for detecting an attendee's utterance based on the shape of the attendee's mouth in the image information obtained by photographing the attendee; and
index giving means for giving an index to a time point in the time-series information corresponding to the time point at which the utterance detection means detects an utterance.

4. The information processing apparatus according to claim 3, further comprising:
voice recognition means for performing voice recognition on the voice information while the utterance detection means is detecting an attendee's utterance and converting the voice information into character information,
wherein the time-series information recording means writes the character information converted by the voice recognition means at the time point in the time-series information at which the utterance corresponding to the character information was made.

5. The information processing apparatus according to claim 3, wherein the index includes information for identifying an attendee who has made a speech related to the index.

6. A computer-readable program that causes a computer to execute:
a function of acquiring image information of a user who operates a predetermined device according to the display on the display screen of the device, and detecting the user's utterance based on the shape of the user's mouth in the image information; and
a function of instructing an apparatus, which writes time-series information including the image information obtained by photographing the user, voice information obtained by collecting the user's utterance content, and image information of the display screen of the device, to write an index at the time point in the time-series information at which the user's utterance was detected.

7. An information processing instruction method comprising:
acquiring image information of a user who operates a predetermined device according to the display on the display screen of the device, and detecting the user's utterance based on the shape of the user's mouth in the image information; and
instructing an apparatus, which writes time-series information including the image information obtained by photographing the user, voice information obtained by collecting the user's utterance content, and image information of the display screen of the device, to write an index at the time point in the time-series information at which the user's utterance was detected.

8. A computer-readable program that causes a computer to execute:
a function of acquiring image information of attendees of a presentation or meeting and detecting an attendee's utterance based on the shape of the attendee's mouth in the image information; and
a function of instructing an apparatus, which writes time-series information including the image information obtained by photographing the attendees and voice information obtained by collecting the user's utterance content, to write an index at the time point in the time-series information at which the attendee's utterance was detected.

9. An information processing instruction method comprising:
acquiring image information of attendees of a presentation or meeting and detecting an attendee's utterance based on the shape of the attendee's mouth in the image information; and
instructing an apparatus, which writes time-series information including the image information obtained by photographing the attendees and voice information obtained by collecting the user's utterance content, to write an index at the time point in the time-series information at which the attendee's utterance was detected.