JP4235635B2 - Data retrieval apparatus and control method thereof

Info

Publication number: JP4235635B2
Application number: JP2005265502A
Authority: JP
Inventors: 恵弘倉片
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2005-09-13
Filing date: 2005-09-13
Publication date: 2009-03-11
Anticipated expiration: 2025-09-13
Also published as: JP2007078985A

Description

The present invention relates to a data search apparatus that can search target data for desired data by specifying a search condition, and to a control method therefor.

In recent years, as video camera recording has shifted from analog to digital formats, users have come to perform various kinds of video editing on the video camera itself or on a personal computer. During editing, portions of the moving image data are frequently cut, copied, and pasted. These operations require locating and marking the beginning of each portion to be cut out. Effects such as superimposed captions or speech balloons are also added at points where a subject is speaking.

Editing thus involves searching the scenes in the moving image data for a desired scene. One approach is to search using a speaker's line as a keyword. For example, a greeting scene in the moving image data can be found by specifying the line "Good morning."

As one such scene search method, Patent Document 1 proposes entering a desired search keyword for a video content file created by a video content file generation system and searching within the caption text file.

Patent Document 2 proposes a method in which a keyword entered by voice is searched for using voice patterns.

Patent Document 1: JP 2002-374494 A
Patent Document 2: JP 2001-290496 A

However, when a conventional search is performed using keywords in audio or caption text, the speaker cannot be specified. In particular, when many people speak the same line (keyword) in the moving image data, a large number of search results are output and identifying the speaker takes time, which is inefficient.

Accordingly, an object of the present invention is to make it possible to efficiently search for data containing content spoken by a designated person.

The data search apparatus of the present invention comprises: first data generation means for identifying a person, using identification data for identifying the person, from data relating to the person included in search target data, and generating data indicating the person; second data generation means for extracting voice data of the person from the data relating to the person and generating, from the extracted voice data, data indicating the utterance content of the person; third data generation means for generating position data indicating the position, within the search target data, of the data relating to the person; fourth data generation means for generating presence data indicating whether image data of the person is included in the data relating to the person; third search condition input means for inputting, as a search condition, data specifying whether image data of the person to be searched for is included; determination means for determining, when a search condition is input by the third search condition input means, which of the data sets generated by the first, second, and fourth data generation means matches the input search condition; and data search means for retrieving from the search target data, based on the position data corresponding to the data set determined by the determination means, the data at the position indicated by that position data.
The control method of the present invention is a control method for a data search apparatus for searching for data, and comprises: a first data generation step of identifying a person, using identification data for identifying the person, from data relating to the person included in search target data, and generating data indicating the person; a second data generation step of extracting voice data of the person from the data relating to the person and generating, from the extracted voice data, data indicating the utterance content of the person; a third data generation step of generating position data indicating the position, within the search target data, of the data relating to the person; a fourth data generation step of generating presence data indicating whether image data of the person is included in the data relating to the person; a search condition input step of inputting, as a search condition, data specifying whether image data of the person to be searched for is included; a determination step of determining, when a search condition is input in the search condition input step, which of the data sets generated in the first, second, and fourth data generation steps matches the input search condition; and a data search step of retrieving from the search target data, based on the position data corresponding to the data set determined in the determination step, the data at the position indicated by that position data.
The program of the present invention causes a computer to execute the above control method of the data search apparatus.
The computer-readable recording medium of the present invention has the above program recorded on it.

According to the present invention, by specifying a person, that person's line, and whether image data of the speaker is included, it is possible to efficiently search for image data that contains content spoken by the designated person and in which the speaker appears in the image.
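The data sets described above can be pictured as records holding the outputs of the first to fourth data generation means. The following is a minimal sketch under stated assumptions, not the patented implementation: the names `SceneRecord` and `find_positions` are hypothetical, the position is held as an opaque string, and treating the person and the line as optional conditions is an illustrative choice.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SceneRecord:
    """One data set; field names are illustrative assumptions."""
    person: str            # data indicating the identified person (first means)
    utterance: str         # utterance content as text (second means)
    position: str          # position within the search target data (third means)
    person_in_image: bool  # presence data (fourth means)


def find_positions(records: List[SceneRecord],
                   person_in_image: bool,
                   person: Optional[str] = None,
                   utterance: Optional[str] = None) -> List[str]:
    """Return the positions of all records that match the search condition."""
    hits = []
    for record in records:
        if record.person_in_image != person_in_image:
            continue
        if person is not None and record.person != person:
            continue
        if utterance is not None and record.utterance != utterance:
            continue
        hits.append(record.position)
    return hits
```

The returned positions would then be used to retrieve the corresponding data from the search target data.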

Preferred embodiments to which the present invention is applied are described in detail below with reference to the accompanying drawings.

-First embodiment-
First, a first embodiment of the present invention is described. FIGS. 1 and 2 are block diagrams showing the configuration of the speaker-specific search apparatus according to the first embodiment. Reference numeral 100 denotes the internal structure of the moving image data. The moving image data 100 consists of speaker data 109, image data 101, audio data 102, and caption data 103, with the image data 101, audio data 102, and caption data 103 arranged in time series. Of the data in the moving image data 100 shown in FIG. 1 or FIG. 2, the image data 101 and audio data 102 are original data, whereas the speaker data 109 and caption data 103 are generated by analyzing the image data 101 and audio data 102 and are embedded afterward at the predetermined positions shown in FIG. 1 or FIG. 2. The internal structure shown here is an example and does not limit the present invention.

The caption data 103 consists of speaker information 106, caption text information 107, and utterance start information 108. The speaker data 109 stores data about the speakers in the original moving image data 100. For each speaker, the speaker data 109 holds, in association with one another, items such as data identifying the speaker (for example, data indicating the speaker's name), feature data for face identification, a face image, and feature data for voice identification. The speaker data 109 is read by the speaker data reading unit 120, which yields a list of the speakers present in the moving image data. In this embodiment the speaker data 109 is included in the moving image data 100. In another embodiment, it may instead be held outside the moving image data 100, for example in a recording medium inside or outside the speaker-specific search apparatus, and read as needed for use in the relevant processing.

The caption data 103 consists of speaker information 106, text information 107, and utterance start information 108. The speaker information 106 is data identifying a speaker, generated when the speaker is identified from the image data using the face identification feature data or from the audio data using the voice identification feature data. The text information 107 is text data obtained by analyzing the identified speaker's audio data and converting it to text, once that audio data has been identified using the voice identification feature data. The utterance start information 108 consists of information such as the time at which the identified speaker's audio data was identified.

Reference numeral 110 denotes a caption data reading unit, which sequentially reads only the caption data 103 from the moving image data 100. The caption data 103 that is read is sent to the speaker identification acquisition unit 111, the text acquisition unit 112, and the utterance start information acquisition unit 113, which obtain the caption's speaker information 106, caption text information 107, and utterance start information 108, respectively. The speaker information 106 and caption text information 107 obtained here are sent to the caption data comparison unit 115.

FIG. 3 shows an example of caption data. The caption data reading unit 110 reads only the caption data 103 from the moving image data 100; items 121 and 122 are examples of caption data read in this way. The following description uses caption data 122 as an example. The caption data 122 read by the caption data reading unit 110 is sent to the speaker identification acquisition unit 111, which reads the speaker information 106. In this embodiment, the speaker information 106 is the portion enclosed in Speaker tags. From caption data 122, the string "Ｂ子" enclosed in <Speaker>….</Speaker> is obtained as the speaker.

The caption data 122 read by the caption data reading unit 110 is also sent to the text acquisition unit 112, which reads the caption text information 107. In this embodiment, the caption text information 107 is the portion enclosed in SubTitle tags. From caption data 122, the string "おはようございます。" ("Good morning.") enclosed in <SubTitle>….</SubTitle> is obtained as the caption text.

The caption data 122 read by the caption data reading unit 110 is likewise sent to the utterance start information acquisition unit 113, which reads the utterance start information 108. In this embodiment, the utterance start information 108 is the portion enclosed in StartTimeCode tags. From caption data 122, the string "T01:12:03 11" enclosed in <StartTimeCode>….</StartTimeCode> is obtained as the utterance start time code. Although the caption data in this embodiment is written using tags, other formats may be used.
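The tag-based extraction performed by units 111 to 113 can be sketched as follows. This is an illustration under stated assumptions: the function name `parse_caption` is hypothetical, and the exact record layout of FIG. 3 (tag order, any enclosing element, whitespace) is not given in the text, so the sample string is only a plausible reconstruction from the three tag names and values quoted above.

```python
import re
from typing import Dict

# Tag names taken from the description; everything else is an assumption.
_TAGS = ("Speaker", "SubTitle", "StartTimeCode")


def parse_caption(record: str) -> Dict[str, str]:
    """Extract the content of each known tag from one caption record."""
    fields = {}
    for tag in _TAGS:
        match = re.search(r"<{0}>(.*?)</{0}>".format(tag), record, re.DOTALL)
        if match is None:
            raise ValueError("missing <%s> element" % tag)
        fields[tag] = match.group(1).strip()
    return fields


# Assumed serialization of caption data 122.
caption_122 = (
    "<Speaker>Ｂ子</Speaker>"
    "<SubTitle>おはようございます。</SubTitle>"  # "Good morning."
    "<StartTimeCode>T01:12:03 11</StartTimeCode>"
)
print(parse_caption(caption_122))
```

A regular expression is enough for this flat, non-nested layout; a real caption format with nesting or attributes would call for a proper parser.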

Reference numeral 114 denotes a search condition input unit, through which a speaker and the content the speaker uttered are entered as search conditions. The speaker may be entered by selecting a person's name, selecting a face, or selecting from an image. The uttered content may be entered as text or by voice. The entered search conditions are sent to the caption data comparison unit 115.

The caption data comparison unit 115 compares the search conditions with the speaker information 106 and the caption text information 107, and searches for and identifies the matching caption data 116. When the speaker entered at the search condition input unit 114 is "Ｂ子" and the uttered content is "おはようございます。" ("Good morning."), caption data 121 is skipped as non-matching because its speaker information is "Ａ子", while caption data 122 is judged to match and becomes the matched caption data 116. The matched caption data 116 is used together with the utterance start information 108 obtained from caption data 122, which serves as the utterance start information 117 of the matched caption data.
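The comparison performed by the caption data comparison unit 115 amounts to a two-field equality test. A minimal sketch, assuming a hypothetical `Caption` tuple and `is_match` function; the speaker, text, and time code values are those quoted in this description for caption data 121 and 122.

```python
from typing import NamedTuple


class Caption(NamedTuple):
    speaker: str  # speaker information 106
    text: str     # caption text information 107
    start: str    # utterance start information 108


def is_match(caption: Caption, speaker: str, line: str) -> bool:
    """True when both the speaker and the caption text equal the condition."""
    return caption.speaker == speaker and caption.text == line


caption_121 = Caption("Ａ子", "おはようございます。", "T01:11:50 03")
caption_122 = Caption("Ｂ子", "おはようございます。", "T01:12:03 11")

for caption in (caption_121, caption_122):
    if is_match(caption, "Ｂ子", "おはようございます。"):
        print("matched, utterance start:", caption.start)
```

Both captions carry the same line, so it is the speaker field that separates them, which is the case a text-only search cannot handle.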

The utterance start information 117 of the matched caption data is sent, as the utterance start time code, to the audio data search unit 118 and the moving image data search unit 119. FIG. 2 illustrates the operation of the audio data search unit 118 and the moving image data search unit 119.

In the example of caption data 122, "T01:12:03 11" is sent to the audio data search unit 118 and the moving image data search unit 119 as the utterance start time code. The audio data search unit 118 reads only the audio data 102 from the moving image data 100 and searches for the position of that time code. In FIG. 1 or FIG. 2, the position marked (1) is the audio data position found. The moving image data search unit 119 reads only the image data 101 from the moving image data 100 and searches for the position of that time code. In FIG. 1 or FIG. 2, the position marked (2) is the image data position found.
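Locating a time code in a frame-indexed stream can be sketched as a conversion to a frame number. The description does not define the fields of "T01:12:03 11", so this sketch assumes an hours:minutes:seconds frames layout and a non-drop rate of 30 frames per second; both are assumptions, as is the function name `timecode_to_frame`.

```python
import re

# Assumed layout: "T" hh ":" mm ":" ss " " ff
_TIMECODE = re.compile(r"^T(\d{2}):(\d{2}):(\d{2}) (\d{2})$")


def timecode_to_frame(timecode: str, fps: int = 30) -> int:
    """Convert a time code string to a zero-based frame index."""
    match = _TIMECODE.match(timecode)
    if match is None:
        raise ValueError("unrecognized time code: %r" % timecode)
    hours, minutes, seconds, frames = (int(g) for g in match.groups())
    if frames >= fps:
        raise ValueError("frame field exceeds the assumed frame rate")
    return (hours * 3600 + minutes * 60 + seconds) * fps + frames


print(timecode_to_frame("T01:12:03 11"))  # 129701 under the 30 fps assumption
```

Because the audio and image streams share the same time code, one conversion of this kind would serve both search units.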

FIG. 4 shows an example of caption display. Reference numeral 200 denotes the screen data, in which Ｂ子 201 appears and is saying "Good morning." The caption 202 is displayed in accordance with caption data 122.

FIG. 5 shows an example of the screen layout displayed by the search software on the speaker-specific search apparatus. Reference numeral 300 denotes the main operation screen. 301 denotes the image data display screen. 302 denotes the time code of the image data shown on the image display screen 301. 305 denotes a dial whose behavior is switched by the jog button 303 and the shuttle button 304. While the jog button 303 is pressed, the dial 305 functions as a jog dial: frames advance in the direction of rotation, and the frame advance speed changes with the rotation speed. While the shuttle button 304 is pressed, the dial 305 functions as a shuttle dial: the frame advance direction and speed change with the rotation direction and rotation angle. 306 denotes a button for moving to the forward mark point, 307 a rewind button, 308 a play button, 309 a fast-forward button, 310 a button for moving to the backward mark point, and 311 a search button.

FIG. 6 shows the search condition input screen 320, which is displayed together with the main operation screen when the search button 311 is pressed. Reference numeral 321 denotes a pull-down menu for selecting a speaker. 322 denotes a field for entering the line to search for. 323 denotes a forward search button, which searches back in time from the current time code. 324 denotes a backward search button, which searches the portion after the current time code. 325 denotes a cancel button, which aborts search condition input.

The flow of software processing in the first embodiment is described with reference to the flowchart of FIG. 7. The example used here is a search, within moving image data 100 containing caption data 103, for the scene in which Ｂ子 says "Good morning."

Pressing the search button 311 on the search software screen 300 of FIG. 5 starts the scene search step (step S100). On entering the scene search step (step S100), the search condition input screen 320 of FIG. 6 is displayed.

Next, the speaker specifying step (step S101) is executed. In this step, the speaker data reading unit 120 reads the speaker data 109 from the moving image data 100 and obtains the list of speakers in the moving image data 100. A speaker can be chosen from the obtained list using the speaker selection pull-down 321. In this embodiment, the pull-down 321 displays the speaker names contained in the speaker data 109, and the speaker is selected by name. It is also possible to display the speakers' faces and select by face instead of by name. In the example of FIG. 6, "Ｂ子" is selected.

After the speaker is specified, the utterance content input step (step S102) is executed. In this step, the line spoken by the speaker is entered in the line input field 322. In the example of FIG. 6, "おはようございます。" ("Good morning.") is specified. These steps set the search condition to the line "おはようございます。" spoken by "Ｂ子", and pressing the forward search button 323 or the backward search button 324 searches for a scene whose caption data satisfies that condition. In this embodiment the search conditions are set in the order of the speaker specifying step (step S101) and then the utterance content input step (step S102), but the order may be reversed.

The search starts when the forward search button 323 or the backward search button 324 is pressed. The search first executes the caption data reading step (step S103), in which the caption data reading unit 110 sequentially reads only the caption data 103 from the moving image data 100. In the speaker identification acquisition step (step S104), the speaker identification acquisition unit 111 obtains the speaker information 106 from the caption data 103 that was read. For example, "Ａ子" is obtained as the speaker for caption data 121 and "Ｂ子" for caption data 122. In the speaker matching step (step S105), the obtained speaker is compared with the search target speaker specified in the speaker specifying step (step S101). In the example of FIG. 6, "Ｂ子" is specified as the search target speaker, so caption data 121 is judged not to match and caption data 122 is judged to match. On a mismatch, the process returns to the caption data reading step (step S103) to read the next caption data. On a match, the process proceeds to the text acquisition step (step S106).

In the text acquisition step (step S106), the text acquisition unit 112 obtains the caption text information 107 from the caption data 103 read in the caption data reading step (step S103). For example, "おはようございます。" ("Good morning.") is obtained as the caption text for caption data 121, and the same text is obtained for caption data 122. In the text matching step (step S107), the obtained caption text is compared with the search target line specified in the utterance content input step (step S102). In the example of FIG. 6, "おはようございます。" is specified as the search target line, so caption data 122 is judged to match. On a mismatch, the process returns to the caption data reading step (step S103) to read the next caption data. On a match, caption data 122 is passed to the utterance start information acquisition step (step S108) as the matched caption data 116.

In the utterance start information acquisition step (step S108), the utterance start information acquisition unit 113 obtains the utterance start information 108 from the caption data 103 read in the caption data reading step (step S103). For example, "T01:11:50 03" is obtained as the utterance start information for caption data 121 and "T01:12:03 11" for caption data 122. Here, the caption data 122 of "Ｂ子" was passed on as the matched caption data 116, so "T01:12:03 11" is obtained as the matched utterance start information 117. The obtained utterance start information 117 is passed to the audio data search step (step S109).

In the audio data search step (step S109), the audio data search unit 118 sequentially reads only the audio data 102 from the moving image data 100 and searches for the position (1) specified by the matched utterance start information 117. Here, "T01:12:03 11" has been input as the matched utterance start information 117, so the audio data position for time code "01:12:03 11" is obtained.

The utterance start information 117 obtained in the utterance start information acquisition step (step S108) is also passed to the moving image data search step (step S110). In the moving image data search step (step S110), the moving image data search unit 119 sequentially reads only the image data 101 from the moving image data 100 and searches for the position (2) specified by the matched utterance start information 117. Here, "T01:12:03 11" has been input as the matched utterance start information 117, so the image data position for time code "01:12:03 11" is obtained.

In the search result OK step (step S111), the audio data and image data found are presented on the image display screen 301 and the time code display 302 so that the search result can be checked. If the result is acceptable, the search ends. To search further, pressing the button 306 (move to the forward mark point) or the button 310 (move to the backward mark point) reads the next caption data and repeats the search under the same conditions.

The flowchart shown here is one embodiment and does not limit the present invention. The order of the speaker matching step (step S105) and the text matching step (step S107) may be changed, as may the order of the audio data search step (step S109) and the moving image data search step (step S110).
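Steps S103 to S108 of the flowchart form a loop over the caption data, which can be sketched as follows. This is an illustrative reading, not the patented implementation: the function name `search_scene`, the tuple layout, and the use of a list index with a step of -1 or +1 to represent the forward search button 323 (earlier captions) and the backward search button 324 (later captions) are all assumptions.

```python
from typing import List, Optional, Tuple

# (speaker information, caption text, utterance start time code)
Caption = Tuple[str, str, str]


def search_scene(captions: List[Caption], current: int, step: int,
                 speaker: str, line: str) -> Optional[str]:
    """Scan from `current` in direction `step`; return the first matching start."""
    index = current + step
    while 0 <= index < len(captions):
        cap_speaker, cap_text, cap_start = captions[index]  # S103, S104
        if cap_speaker == speaker:                           # S105
            if cap_text == line:                             # S106, S107
                return cap_start                             # S108
        index += step                                        # back to S103
    return None


captions = [
    ("Ａ子", "おはようございます。", "T01:11:50 03"),  # caption data 121
    ("Ｂ子", "おはようございます。", "T01:12:03 11"),  # caption data 122
]
print(search_scene(captions, -1, +1, "Ｂ子", "おはようございます。"))
```

The returned time code is what steps S109 and S110 would then use to locate the audio and image data.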

In this embodiment the search is performed one result at a time. It is also possible to search the moving image data 100 for all caption data matching the search conditions and to display the results together on the image display screen 301 as multiple index images.

In this embodiment the speaker specifying step (step S101) is limited to one speaker, but two or more speakers may be specified for the search. Likewise, a single line is specified in the utterance content input step (step S102), but multiple lines may be specified.
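The two variations just described, collecting every match and accepting several speakers or lines, can be sketched with set membership tests. The function name `search_all` and the tuple layout are assumptions; the caption values are those quoted for caption data 121 and 122.

```python
from typing import List, Set, Tuple

# (speaker information, caption text, utterance start time code)
Caption = Tuple[str, str, str]


def search_all(captions: List[Caption], speakers: Set[str],
               lines: Set[str]) -> List[str]:
    """Return the start time codes of every caption matching any speaker and line."""
    return [start for speaker, text, start in captions
            if speaker in speakers and text in lines]


captions = [
    ("Ａ子", "おはようございます。", "T01:11:50 03"),  # caption data 121
    ("Ｂ子", "おはようございます。", "T01:12:03 11"),  # caption data 122
]
print(search_all(captions, {"Ａ子", "Ｂ子"}, {"おはようございます。"}))
```

Each returned time code could back one index image in a multi-result display.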

Furthermore, in this embodiment the text matching step (step S107) judges whether the search condition and the caption text of the caption data are identical, but this can be extended to known search methods such as regular expressions and fuzzy search.
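One way such an extension of the text matching step might look is sketched below using the Python standard library. The description names regular expressions and fuzzy search only in general terms, so the function name `text_matches`, the mode names, the use of `difflib.SequenceMatcher`, and the 0.8 similarity threshold are all illustrative assumptions.

```python
import difflib
import re


def text_matches(caption_text: str, query: str,
                 mode: str = "exact", threshold: float = 0.8) -> bool:
    """Compare caption text with a query using the selected matching mode."""
    if mode == "exact":
        return caption_text == query
    if mode == "regex":
        # The query is treated as a regular expression pattern.
        return re.search(query, caption_text) is not None
    if mode == "fuzzy":
        # Similarity ratio in [0, 1]; the threshold is an arbitrary choice.
        ratio = difflib.SequenceMatcher(None, caption_text, query).ratio()
        return ratio >= threshold
    raise ValueError("unknown mode: %r" % mode)


caption = "おはようございます。"  # "Good morning."
print(text_matches(caption, "おはよう", mode="regex"))
print(text_matches(caption, "おはようございます", mode="fuzzy"))
```

The fuzzy mode tolerates small differences such as the missing full stop in the second query, which exact matching would reject.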

As described above, this embodiment makes it easy to search for images by a specific line spoken by a specific speaker, so scene searches can be performed efficiently.

-Second embodiment-
Next, a second embodiment of the present invention is described. FIG. 9 shows the configuration of an imaging apparatus according to the second embodiment. Because an imaging apparatus such as that shown in FIG. 9 generally has no character input means such as a keyboard, search conditions need to be specified by image. This embodiment shows an example solution for such a case.

In the imaging apparatus of FIG. 9, reference numeral 400 denotes the imaging apparatus body (operation surface). The imaging apparatus has a lens unit (not shown), an imaging unit (not shown), a signal processing unit (not shown), and a recording medium (not shown). The image taken in through the lens unit is captured by the imaging unit, subjected to digital signal processing by the signal processing unit, and displayed on the liquid crystal display device 401 on the rear surface. When the recording button 405 is pressed, the image taken in through the lens unit is captured by the imaging unit, and the image data that has undergone digital signal processing by the signal processing unit is stored on the recording medium. In the following, this embodiment is described by reusing the configuration diagram of the moving image data 100 shown in FIG. 1 or FIG. 2.

図９の撮像装置４００において、４０１は撮影画像や再生画像の表示、各種設定画面の表示用の液晶表示装置である。４０２は表示されている画像データのタイムコードを示している。４０３は選択用の操作部材であり、上下左右方向のボタンにより構成されている。４０４は設定ボタンで選択用操作部材４０３により選択された結果を決定する際に押下する。４０５は録画ボタンであり、録画の開始、停止を行う。４０６は前方のマークポイントまでの移動ボタン、４０７は巻き戻しボタン、４０８は再生ボタン、４０９は早送りボタン、４１０は後方のマークポイントまでの移動ボタン、４１１は検索ボタン、４１２はメニューボタンである。４１３は音声入力用のマイクである。 In the imaging device 400 of FIG. 9, 401 is a liquid crystal display device for displaying captured images and reproduced images and for displaying various setting screens. Reference numeral 402 denotes a time code of the displayed image data. Reference numeral 403 denotes an operation member for selection, which includes buttons in the up / down / left / right directions. A setting button 404 is pressed to determine the result selected by the selection operation member 403. Reference numeral 405 denotes a recording button which starts and stops recording. 406 is a move button to the front mark point, 407 is a rewind button, 408 is a play button, 409 is a fast forward button, 410 is a move button to the back mark point, 411 is a search button, and 412 is a menu button. Reference numeral 413 denotes a microphone for voice input.

本実施形態においても動画データ１００は画像データ１０１、音声データ１０２、話者データ１０９及び字幕データ１０３によって構成される。字幕データ１０３のデータ構成及びそれを生成するための処理は上述した第１の実施形態と同様であり、話者データ１０９も第１の実施形態と同様に、例えば話者を識別するためのデータ（例えば、話者の名称を示すデータ等）、顔識別用特徴量データ及び音声識別用特徴量データ等が話者毎に対応付けられて格納されている。話者データ１０９は、図１又は図２に示すように動画データ１００内に含まれる構成でもよいし、撮像装置の内部又は外部の記録媒体内に保持され、必要に応じて読み込まれて該当する処理において使用されるような構成であってもよい。 Also in this embodiment, the moving image data 100 includes image data 101, audio data 102, speaker data 109, and caption data 103. The data structure of the caption data 103 and the process for generating it are the same as in the first embodiment described above, and the speaker data 109 is data for identifying a speaker, for example, as in the first embodiment. (For example, data indicating the name of a speaker, etc.), face identifying feature data, voice identifying feature data, and the like are stored in association with each speaker. The speaker data 109 may be included in the moving image data 100 as shown in FIG. 1 or FIG. 2, or may be held in a recording medium inside or outside the image pickup apparatus and read as necessary. It may be configured to be used in processing.
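The per-speaker association described above can be pictured with a small data structure. The following is only a minimal sketch: the class names, field names, and the use of plain float lists for the feature data are assumptions made for illustration, since the specification does not define a concrete layout for the speaker data 109.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SpeakerRecord:
    """One entry of the speaker data 109 (all field names are hypothetical)."""
    speaker_id: str                      # data that uniquely identifies the speaker
    name: Optional[str] = None           # speaker name, which may be registered later
    face_features: List[float] = field(default_factory=list)   # face identifying feature data
    voice_features: List[float] = field(default_factory=list)  # voice identifying feature data


class SpeakerData:
    """Keeps the records associated with each speaker, keyed by identifier."""

    def __init__(self) -> None:
        self._records: Dict[str, SpeakerRecord] = {}

    def register(self, record: SpeakerRecord) -> None:
        self._records[record.speaker_id] = record

    def get(self, speaker_id: str) -> Optional[SpeakerRecord]:
        return self._records.get(speaker_id)

    def display_name(self, speaker_id: str) -> str:
        # Fall back to the identifier until a name has been registered.
        record = self._records[speaker_id]
        return record.name if record.name else record.speaker_id
```

Whether such a store lives inside the moving image data 100 or on a separate recording medium does not change this interface; only the loading code would differ.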

ところで、音声識別用特徴量データを用いて音声データから話者が特定され、字幕データ１０３を作成するような場合、音声データから話者は特定されるが、当該音声データの台詞を発言した話者が画面内に存在せず、その音声データに対応する画像データ内に当該話者の画像データが含まれていない場合がある。 Incidentally, when a speaker is identified from the audio data using the voice identifying feature data and the caption data 103 is created, the speaker may be identified from the audio data even though the speaker who spoke the line is not present in the screen, so that the image data corresponding to the audio data does not include image data of that speaker.

本実施形態では、上記のような場合に鑑み、字幕データ１０３を作成する対象となる話者が画面内に存在するか否か（対応する画像データから当該話者が特定できるか否か）を示す画面内存在情報を生成している。これは、音声識別用特徴量データのみによって話者を特定することができ、顔識別用特徴量データによっては当該話者を特定できなかった場合、該当する音声データの台詞を発言した話者が画面内に存在しない旨の画面内存在情報が生成される。一方、字幕テキスト情報１０７が生成される場合（即ち、少なくとも音声識別用特徴量データによって音声データから話者が特定され、当該音声データが解析されてテキスト化された場合）であって、それ以外の場合には、当該音声データの台詞を発言した話者が画面内に存在する旨の画面内存在情報が生成される。 In view of such cases, this embodiment generates in-screen presence information indicating whether or not the speaker for whom the caption data 103 is created is present in the screen (that is, whether or not the speaker can be identified from the corresponding image data). When the speaker can be identified only by the voice identifying feature data and cannot be identified by the face identifying feature data, in-screen presence information indicating that the speaker who spoke the line of the audio data is not present in the screen is generated. In all other cases in which the caption text information 107 is generated (that is, cases in which the speaker is identified from the audio data by at least the voice identifying feature data and the audio data is analyzed and converted into text), in-screen presence information indicating that the speaker who spoke the line is present in the screen is generated.
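The rule above reduces to a two-input decision. The sketch below is one possible reading, not the patented implementation: the function name and the boolean inputs are hypothetical, and returning `None` when the speaker is not identified from the audio is an assumption based on the statement that caption text information is generated only when voice identification succeeds.

```python
from typing import Optional


def in_screen_presence(voice_identified: bool, face_identified: bool) -> Optional[str]:
    """Return the Existence value for one caption entry.

    voice_identified: the speaker was identified from the audio data.
    face_identified:  the same speaker was also identified in the image data.
    """
    if not voice_identified:
        # Assumption: no caption entry is created, so there is nothing to mark.
        return None
    if not face_identified:
        # Identified by voice only: the speaker is not present in the screen.
        return "N"
    # Every other case in which caption text is generated.
    return "Y"
```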

図８に、このようにして作成された字幕データ１０３の一例を示す。Speakerタグで囲われている部分は話者情報に対応する話者名称である。話者名称は、話者データ１０９内の例えば上述した話者を一意に特定するための識別情報によって生成される。後にこれを話者名に更新することも可能である。Existenceタグで囲われている部分は画面内存在情報を示している。字幕データ１３１では<Existence>….</Existence>で囲われた内容が"Y"なので、話者が画面内に存在している。字幕データ１３３では<Existence>….</Existence>で囲われた内容が"N"なので、話者が画面内に存在しない。SubTitleタグで囲われている部分は話者が発声している字幕のテキストである。StartTimeCodeタグで囲われている部分は話者が発声を開始したタイムコードであり、発声開始情報に対応する。字幕データ１３１は話者が"Ａ子"でタイムコード"T01:11:50 03"で示されるフレーム番号の画像にＡ子が映っており、その位置から"おはようございます。"を発声していることを示している。字幕データ１３２は話者が"Ｂ子"でタイムコード"T01:12:03 11"で示されるフレーム番号の画像にＢ子が映っており、その位置から"おはようございます。"を発声していることを示している。字幕データ１３３は話者が"Ｃ子"でタイムコード"T01:12:23 10"で示されるフレーム番号の画像にＣ子が映っておらず、その位置から"今日はいい天気ですね。"を発声していることを示している。 FIG. 8 shows an example of the caption data 103 created in this way. The part enclosed by the Speaker tag is the speaker name corresponding to the speaker information. The speaker name is generated from, for example, the identification information in the speaker data 109 that uniquely specifies the speaker described above, and it can later be updated to the speaker's actual name. The part enclosed by the Existence tag is the in-screen presence information. In the caption data 131, the content enclosed by <Existence>….</Existence> is "Y", so the speaker is present in the screen. In the caption data 133, the content enclosed by <Existence>….</Existence> is "N", so the speaker is not present in the screen. The part enclosed by the SubTitle tag is the text of the caption spoken by the speaker. The part enclosed by the StartTimeCode tag is the time code at which the speaker starts speaking, and corresponds to the utterance start information. The caption data 131 indicates that the speaker is "Ａ子" (child A), that child A appears in the image of the frame number indicated by the time code "T01:11:50 03", and that "おはようございます。" ("Good morning.") is spoken from that position. The caption data 132 indicates that the speaker is "Ｂ子" (child B), that child B appears in the image of the frame number indicated by the time code "T01:12:03 11", and that "おはようございます。" ("Good morning.") is spoken from that position. The caption data 133 indicates that the speaker is "Ｃ子" (child C), that child C does not appear in the image of the frame number indicated by the time code "T01:12:23 10", and that "今日はいい天気ですね。" ("It's nice weather today.") is spoken from that position.
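To make the tag layout concrete, the sketch below parses caption entries carrying the four tags named above (Speaker, Existence, SubTitle, StartTimeCode). FIG. 8 is not reproduced in the text, so the enclosing `CaptionData` and `Caption` elements, the `CaptionEntry` class, and the function name are assumptions introduced only so that the sample is well-formed XML; the values are the ones quoted for the caption data 131 to 133.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# Sample built from the values quoted in the text; wrapper elements are hypothetical.
SAMPLE = """
<CaptionData>
  <Caption>
    <Speaker>Ａ子</Speaker>
    <Existence>Y</Existence>
    <SubTitle>おはようございます。</SubTitle>
    <StartTimeCode>T01:11:50 03</StartTimeCode>
  </Caption>
  <Caption>
    <Speaker>Ｂ子</Speaker>
    <Existence>Y</Existence>
    <SubTitle>おはようございます。</SubTitle>
    <StartTimeCode>T01:12:03 11</StartTimeCode>
  </Caption>
  <Caption>
    <Speaker>Ｃ子</Speaker>
    <Existence>N</Existence>
    <SubTitle>今日はいい天気ですね。</SubTitle>
    <StartTimeCode>T01:12:23 10</StartTimeCode>
  </Caption>
</CaptionData>
"""


@dataclass
class CaptionEntry:
    speaker: str           # speaker name (Speaker tag)
    on_screen: bool        # in-screen presence information (Existence tag, "Y" or "N")
    text: str              # caption text (SubTitle tag)
    start_time_code: str   # utterance start information (StartTimeCode tag)


def parse_captions(xml_text: str) -> List[CaptionEntry]:
    """Read every Caption element into a CaptionEntry."""
    root = ET.fromstring(xml_text.strip())
    entries: List[CaptionEntry] = []
    for node in root.iter("Caption"):
        entries.append(
            CaptionEntry(
                speaker=node.findtext("Speaker", default=""),
                on_screen=node.findtext("Existence", default="") == "Y",
                text=node.findtext("SubTitle", default=""),
                start_time_code=node.findtext("StartTimeCode", default=""),
            )
        )
    return entries
```

Calling `parse_captions(SAMPLE)` yields three entries in document order, with only the third marked as not on screen.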

図１０は撮像装置４００における検索条件指定画面の一例を示す図である。４２０は検索対象話者の一覧表示である。４２１は選択中の話者を示す話者選択表示枠である。４２２はＡ子の顔、４２３はＢ子の顔、４２４はＣ子の顔である。本画面では、選択用の操作部材４０３を用いて話者の選択を行う。４２５は検索する台詞の表示である。 FIG. 10 is a diagram illustrating an example of a search condition designation screen in the imaging apparatus 400. Reference numeral 420 denotes a list of search target speakers. Reference numeral 421 denotes a speaker selection display frame indicating a speaker being selected. 422 is the face of child A, 423 is the face of child B, and 424 is the face of child C. In this screen, a speaker is selected using the operation member 403 for selection. Reference numeral 425 is a display of lines to be searched.

図１１は撮像装置４００における検索条件指定画面のもう一つの例を示す図である。４３０はＡ子であり、４３１はＡ子が話者として認識されていることを示す登録話者枠である。４３２はＢ子であり、４３３はＢ子が話者として認識されており且つ検索対象の話者として選択されていることを示す選択話者枠である。４３４はＣ子であり、４３５はＣ子が話者として認識されていることを示す登録話者枠である。４３６は検索する台詞の表示である。 FIG. 11 is a diagram illustrating another example of a search condition designation screen in the imaging apparatus 400. Reference numeral 430 denotes child A, and reference numeral 431 denotes a registered speaker frame indicating that child A is recognized as a speaker. Reference numeral 432 denotes child B, and reference numeral 433 denotes a selected speaker frame indicating that child B is recognized as a speaker and is selected as the speaker to be searched for. Reference numeral 434 denotes child C, and reference numeral 435 denotes a registered speaker frame indicating that child C is recognized as a speaker. Reference numeral 436 denotes the display of the dialogue to be searched for.

図１０及び図１１の検索画面はメニューより選択して切り替えることが可能である。また、検索ボタン４１１を一度押下することで図１０の検索画面が表示され、更に検索ボタン４１１を押下することで図１１の検索画面を表示することも可能である。 The search screens of FIG. 10 and FIG. 11 can be switched by selecting from the menu. Alternatively, the search screen of FIG. 10 can be displayed by pressing the search button 411 once, and the search screen of FIG. 11 can be displayed by pressing the search button 411 again.

図１２は本実施形態における処理の流れを示したフローチャートである。検索ボタン４１１を押下すると検索モードステップ（ステップＳ２０１）に入る。検索モードステップ（ステップＳ２０１）では、話者一覧選択画面（図１０）または画像からの話者選択画面（図１１）の何れかを表示する。本実施形態では、メニューボタン４１２を操作し、話者選択方法指定メニュー（図示せず）により最初に表示される画面を設定している。 FIG. 12 is a flowchart showing the flow of processing in this embodiment. When the search button 411 is pressed, a search mode step (step S201) is entered. In the search mode step (step S201), either a speaker list selection screen (FIG. 10) or a speaker selection screen from an image (FIG. 11) is displayed. In the present embodiment, the menu button 412 is operated to set a screen to be displayed first by a speaker selection method designation menu (not shown).

先ず、検索モードステップ（ステップＳ２０１）にて、話者一覧選択画面（図１０）が表示される場合を説明する。話者選択ステップ（ステップＳ２０２）では話者データより話者の顔の画像イメージデータを取得し、登録されている話者の一覧を表示する。本実施形態では話者としてＡ子、Ｂ子、Ｃ子の３人が登録されている。図１０で４２２はＡ子の顔、４２３はＢ子の顔、４２４はＣ子の顔である。話者データに話者の名称が登録されている場合には、顔の右隣に名称が表示されても良い。話者の一覧が表示されると、選択用操作部材４０３により話者選択表示枠４２１を移動させて話者を選択することができる。また、４人以上の話者が登録されている場合には、選択用操作部材４０３により話者選択表示枠４２１が移動すると共に話者一覧が検索対象話者一覧表示４２０内でスクロールする。検索対象の話者を選択し設定ボタン４０４で決定する。図１０ではＢ子が話者として選択されている状態を示している。話者が決定すると話者特定ステップ（ステップＳ２０４）へ進み、選択された話者の話者データが取得される。 First, the case where the speaker list selection screen (FIG. 10) is displayed in the search mode step (step S201) will be described. In the speaker selection step (step S202), the image data of the speaker's face is acquired from the speaker data, and a list of registered speakers is displayed. In this embodiment, three persons A, B, and C are registered as speakers. In FIG. 10, 422 is the face of child A, 423 is the face of child B, and 424 is the face of child C. When the speaker name is registered in the speaker data, the name may be displayed on the right side of the face. When the list of speakers is displayed, the speaker selection display frame 421 can be moved by the selection operation member 403 to select the speakers. When four or more speakers are registered, the selection operation member 403 moves the speaker selection display frame 421 and scrolls the speaker list in the search target speaker list display 420. A speaker to be searched is selected and set by the setting button 404. FIG. 10 shows a state where child B is selected as a speaker. When the speaker is determined, the process proceeds to the speaker specifying step (step S204), and the speaker data of the selected speaker is acquired.
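The selection frame and the scrolling of the list when four or more speakers are registered can be modeled as a cursor inside a fixed-size window. This is a minimal sketch under stated assumptions: the class name, the three visible rows, and the clamping at the ends of the list are illustrative choices and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpeakerListView:
    """Model of the search target speaker list display 420 (names are hypothetical)."""
    speakers: List[str]
    visible_rows: int = 3   # assumption: three faces fit in the list at once
    selected: int = 0       # index framed by the speaker selection display frame 421
    top: int = 0            # index of the first visible row

    def move(self, delta: int) -> None:
        """Move the selection frame up (negative) or down (positive)."""
        if not self.speakers:
            return
        last = len(self.speakers) - 1
        self.selected = max(0, min(last, self.selected + delta))
        # Scroll the window just enough to keep the selected row visible.
        if self.selected < self.top:
            self.top = self.selected
        elif self.selected >= self.top + self.visible_rows:
            self.top = self.selected - self.visible_rows + 1

    def visible(self) -> List[str]:
        return self.speakers[self.top:self.top + self.visible_rows]
```

With four registered speakers, moving down three times leaves the fourth speaker selected and scrolls the first one out of view.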

話者特定ステップ（ステップＳ２０４）により話者データが取得されると、発声内容入力ステップ（ステップＳ２０５）となる。本実施形態では発声内容入力ステップ（ステップＳ２０５）では音声によるテキスト入力を行う。音声入力用マイク４１３に向かい、検索したい台詞を喋ると音声認識が行われ自動的にテキスト化されて検索する台詞表示４２５へ入力される。図１０では"おはようございます。"が発声内容として指定されている。正しく入力されない場合には、選択用操作部材４０３の左ボタンを押下し、検索する台詞表示４２５の文字を削除し、入力し直すことも可能である。発声内容入力が正しく入力された場合には設定ボタン４０４で決定する。これらのステップにより検索条件として"Ｂ子"が発声した"おはようございます。"が設定される。 When the speaker data is acquired in the speaker specifying step (step S204), the process proceeds to the utterance content input step (step S205). In this embodiment, text is input by voice in the utterance content input step (step S205). When the user speaks the line to be searched for into the voice input microphone 413, speech recognition is performed and the line is automatically converted into text and entered in the search dialogue display 425. In FIG. 10, "おはようございます。" ("Good morning.") is designated as the utterance content. If the line is not entered correctly, the user can press the left button of the selection operation member 403 to delete characters in the search dialogue display 425 and enter the line again. When the utterance content has been entered correctly, it is confirmed with the setting button 404. Through these steps, "おはようございます。" ("Good morning.") spoken by "Ｂ子" (child B) is set as the search condition.

次に、検索モードステップ（ステップＳ２０１）にて、画像データからの話者選択画面（図１１）が表示される場合を説明する。話者選択ステップ（ステップＳ２０２）では話者データより話者の識別用顔特徴量データを取得し、話者検索ステップ（ステップＳ２０３）により液晶表示装置４０１に表示されている画像データから顔検出を行い、話者データに登録されている話者の顔の画像データに登録話者枠を表示する。 Next, a case where a speaker selection screen (FIG. 11) from image data is displayed in the search mode step (step S201) will be described. In the speaker selection step (step S202), face feature amount data for identifying the speaker is acquired from the speaker data, and face detection is performed from the image data displayed on the liquid crystal display device 401 in the speaker search step (step S203). The registered speaker frame is displayed on the image data of the speaker's face registered in the speaker data.

図１１の例では、話者検索ステップ（ステップＳ２０３）では、液晶表示装置４０１にＡ子４３０、Ｂ子４３２、Ｃ子４３４の３人が表示されており、それぞれの顔の画像データから顔検出を行い、顔特徴量を算出し、話者データに登録されている話者の顔特徴量データと比較を行う。比較した結果、それぞれ話者登録されているので顔の画像データに話者登録枠が表示され（Ａ子の登録話者枠４３１、Ｂ子の選択話者枠４３３、Ｃ子の登録話者枠４３５）、顔の画像データと各話者の話者データとが関連付けされる。話者の顔に登録話者枠、選択話者枠が表示されると、選択用操作部材４０３により選択話者枠を移動させることができる。話者の選択範囲は液晶表示装置４０１に表示されている話者からのみ選択されるため、図１１の場合に４人以上話者が登録されている場合であっても、上記３人のみから話者を選択する。検索対象の話者の顔の画像データが液晶表示装置４０１内に存在しない場合には、巻き戻しボタン４０７、早送りボタン４０９により表示画像を変えることで、他の話者の顔の画像データが映っている状態にすることにより、話者データに登録されている話者であれば、同じく顔の画像データに登録話者枠が表示され同じく検索の対象とすることができる。 In the example of FIG. 11, in the speaker search step (step S203), three persons, child A 430, child B 432, and child C 434, are displayed on the liquid crystal display device 401. Face detection is performed on the image data of each face, a face feature amount is calculated, and the result is compared with the face feature data of the speakers registered in the speaker data. As a result of the comparison, since each of the three is registered as a speaker, a frame is displayed on the image data of each face (the registered speaker frame 431 for child A, the selected speaker frame 433 for child B, and the registered speaker frame 435 for child C), and the face image data is associated with the speaker data of each speaker. Once the registered speaker frames and the selected speaker frame are displayed on the speakers' faces, the selected speaker frame can be moved with the selection operation member 403. Because a speaker can be selected only from among the speakers displayed on the liquid crystal display device 401, in the case of FIG. 11 the speaker is selected from the above three persons even if four or more speakers are registered. When the image data of the face of the speaker to be searched for is not shown on the liquid crystal display device 401, the displayed image can be changed with the rewind button 407 or the fast-forward button 409 so that the face of another speaker is shown; if that speaker is registered in the speaker data, a registered speaker frame is likewise displayed on the face image data and the speaker can likewise be made a search target.
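The comparison between a face feature amount computed from the displayed image and the registered face feature data can be sketched as a nearest-match lookup. The specification does not say how the features are compared, so the cosine similarity, the 0.9 threshold, and all function names below are assumptions chosen only to illustrate the step.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_face(
    features: Sequence[float],
    registered: Dict[str, Sequence[float]],
    threshold: float = 0.9,  # assumption: minimum similarity to count as a match
) -> Optional[str]:
    """Return the identifier of the best-matching registered speaker, or None."""
    best_id: Optional[str] = None
    best_score = threshold
    for speaker_id, reference in registered.items():
        score = cosine_similarity(features, reference)
        if score >= best_score:
            best_id, best_score = speaker_id, score
    return best_id


def label_detected_faces(
    detected: List[Sequence[float]],
    registered: Dict[str, Sequence[float]],
) -> List[Optional[str]]:
    """For each detected face, the speaker to frame, or None to draw no frame."""
    return [match_face(face, registered) for face in detected]
```

Only faces that map to a registered speaker would receive a registered speaker frame; a `None` result corresponds to a face that is shown on the display but cannot be selected.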

利用者は、検索対象の話者を選択し設定ボタン４０４で決定する。図１１ではＢ子が話者として選択されている状態を示している。話者が決定すると話者特定ステップ（Ｓ２０４）へ進み、選択された話者の話者データが検索条件として取得される。 The user selects a speaker to be searched and determines with the setting button 404. FIG. 11 shows a state where child B is selected as a speaker. When the speaker is determined, the process proceeds to a speaker specifying step (S204), and speaker data of the selected speaker is acquired as a search condition.

話者特定ステップ（ステップＳ２０４）により話者データが取得されると、発声内容入力ステップ（ステップＳ２０５）となる。本実施形態では発声内容入力ステップ（ステップＳ２０５）では音声によるテキスト入力を行う。音声入力用マイク４１３に向かい、検索したい台詞を喋ると音声認識が行われ自動的にテキスト化されて検索する台詞表示４３６へ入力される。図１１では"おはようございます。"が発声内容として指定されている。正しく入力されない場合には、選択用操作部材４０３の左ボタンを押下し、検索する台詞表示４３６の文字を削除し、入力し直すことも可能である。発声内容入力が正しく入力された場合には設定ボタン４０４で決定する。これらのステップにより検索条件として"Ｂ子"が発声した"おはようございます。"が設定される。 When the speaker data is acquired in the speaker specifying step (step S204), the process proceeds to the utterance content input step (step S205). In this embodiment, text is input by voice in the utterance content input step (step S205). When the user speaks the line to be searched for into the voice input microphone 413, speech recognition is performed and the line is automatically converted into text and entered in the search dialogue display 436. In FIG. 11, "おはようございます。" ("Good morning.") is designated as the utterance content. If the line is not entered correctly, the user can press the left button of the selection operation member 403 to delete characters in the search dialogue display 436 and enter the line again. When the utterance content has been entered correctly, it is confirmed with the setting button 404. Through these steps, "おはようございます。" ("Good morning.") spoken by "Ｂ子" (child B) is set as the search condition.

検索条件が発声内容入力ステップ（ステップＳ２０５）で決定すると、以下の動作は第１の実施形態の字幕データ読み出しステップ（ステップＳ１０３）以降の動作と同様である。 When the search condition is determined in the utterance content input step (step S205), the following operation is the same as the operation after the subtitle data reading step (step S103) of the first embodiment.

第２の実施形態では、図８に示すように話者の画面内存在を示すデータ（画面内存在情報）が字幕データ１０３に含まれている。検索のオプションとして話者の画面内存在を指定することで、話者と話者の画面内存在と話者の台詞でシーンを検索することが可能である。この場合、話者特定ステップ（ステップＳ２０４）にて、話者の画面内存在の有無を指定する。操作の例として選択用操作部材４０３の上下ボタンにより画面内存在の有無を選択し、設定ボタン４０４で決定する。 In the second embodiment, as shown in FIG. 8, the subtitle data 103 includes data indicating the presence of the speaker in the screen (in-screen presence information). By specifying the presence of the speaker in the screen as a search option, it is possible to search for a scene based on the speaker's and speaker's on-screen presence and the speaker's dialogue. In this case, in the speaker specifying step (step S204), the presence or absence of the speaker in the screen is designated. As an example of the operation, presence / absence in the screen is selected with the up and down buttons of the selection operation member 403, and determined with the setting button 404.

話者の画面内存在情報の読み出しは、字幕データ１０３を字幕データ読み出しステップ（ステップＳ２０６）により読み出し、話者特定取得ステップ（ステップＳ２０７）にて話者情報１０６を取得する。取得された話者情報１０６には、話者名称と画面内存在情報が含まれている。字幕データ１３１に対して話者特定取得ステップ（ステップＳ２０７）で話者情報１０６を取得すると話者として"Ａ子"が当該画像内に存在している情報"Y"が取得される。字幕データ１３２に対して話者特定取得ステップ（ステップＳ２０７）で話者情報１０６を取得すると話者として"Ｂ子"が当該画像内に存在している情報"Y"が取得される。字幕データ１３３に対して話者特定取得ステップ（ステップＳ２０７）で話者情報１０６を取得すると話者として"Ｃ子"が当該画像内に存在していない情報"N"が取得される。 The speaker's in-screen presence information is read by reading the caption data 103 in the caption data reading step (step S206) and acquiring the speaker information 106 in the speaker specifying acquisition step (step S207). The acquired speaker information 106 includes a speaker name and in-screen presence information. When the speaker information 106 is acquired in the speaker specifying acquisition step (step S207) for the caption data 131, information "Y" in which "A child" is present in the image as the speaker is acquired. When the speaker information 106 is acquired in the speaker identification acquisition step (step S207) with respect to the caption data 132, information "Y" in which "B child" exists in the image is acquired as a speaker. When the speaker information 106 is acquired in the speaker identification acquisition step (step S207) for the caption data 133, information “N” in which “C child” does not exist in the image is acquired as a speaker.

話者特定取得ステップ（ステップＳ２０７）で字幕データ１０３より取得された話者名称と画面内存在情報は話者一致ステップ（ステップＳ２０８）で話者特定ステップ（ステップＳ２０４）にて設定された検索条件と比較される。 The speaker name and the in-screen presence information acquired from the caption data 103 in the speaker specifying acquisition step (step S207) are compared, in the speaker matching step (step S208), with the search condition set in the speaker specifying step (step S204).
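The comparison of each caption entry against the search condition can be sketched as a simple filter over the caption data. This is an illustration under assumptions, not the flow of FIG. 12 itself: the class and function names are hypothetical, the dialogue is compared by substring containment, and the in-screen presence condition is treated as optional.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Caption:
    speaker: str
    on_screen: bool
    text: str
    start_time_code: str


@dataclass(frozen=True)
class SearchCondition:
    speaker: str
    text: str
    on_screen: Optional[bool] = None  # None means presence is not part of the condition


def find_first_match(captions: Iterable[Caption], cond: SearchCondition) -> Optional[Caption]:
    """Return the first caption whose speaker, presence, and text satisfy the condition."""
    for caption in captions:
        if caption.speaker != cond.speaker:
            continue
        if cond.on_screen is not None and caption.on_screen != cond.on_screen:
            continue
        if cond.text not in caption.text:  # assumption: substring match on the dialogue
            continue
        return caption
    return None


CAPTIONS = [
    Caption("Ａ子", True, "おはようございます。", "T01:11:50 03"),
    Caption("Ｂ子", True, "おはようございます。", "T01:12:03 11"),
    Caption("Ｃ子", False, "今日はいい天気ですね。", "T01:12:23 10"),
]
```

The start time code of the returned entry is what a subsequent step would use to locate the audio and image data.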

これらのステップにより検索された図１又は図２に示す（１）の位置の音声データ１０２と（２）の位置の画像データ１０１が液晶表示装置４０１に表示され、タイムコード表示領域４０２にタイムコードが表示される。本実施形態での検索結果は字幕データ１３２が該当するのでタイムコードとして"01:12:13 11"が表示される。再生ボタン４０８を押下した場合、Ｂ子が映った映像が開始され、"おはようございます。"の字幕スーパが表示されるとともに"おはようございます"とＢ子の声で再生される。 The audio data 102 at the position (1) and the image data 101 at the position (2) shown in FIG. 1 or FIG. 2 retrieved by these steps are displayed on the liquid crystal display device 401, and the time code is displayed in the time code display area 402. Is displayed. Since the subtitle data 132 corresponds to the search result in the present embodiment, “01:12:13 11” is displayed as the time code. When the play button 408 is pressed, the video showing the child B is started, the subtitle super “Good morning” is displayed, and “Good morning” is played with the child B's voice.
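Locating the frame for a time code of the form used in the caption data (for example "T01:11:50 03") could look like the sketch below. The reading of the four fields as hours, minutes, seconds, and frames, the non-drop 30 frames per second rate, and all names are assumptions; the specification only states that the time code indicates a frame number.

```python
import re
from typing import NamedTuple


class TimeCode(NamedTuple):
    hours: int
    minutes: int
    seconds: int
    frames: int


# Optional leading "T", then HH:MM:SS, a space, and a two-digit frame count.
_TIME_CODE = re.compile(r"^T?(\d{2}):(\d{2}):(\d{2}) (\d{2})$")


def parse_time_code(value: str) -> TimeCode:
    match = _TIME_CODE.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognized time code: {value!r}")
    return TimeCode(*(int(group) for group in match.groups()))


def to_frame_index(tc: TimeCode, fps: int = 30) -> int:
    """Frame offset from 00:00:00 00, assuming a constant non-drop frame rate."""
    total_seconds = (tc.hours * 60 + tc.minutes) * 60 + tc.seconds
    return total_seconds * fps + tc.frames


def display_text(tc: TimeCode) -> str:
    """Text for the time code display area, without the leading "T"."""
    return f"{tc.hours:02d}:{tc.minutes:02d}:{tc.seconds:02d} {tc.frames:02d}"
```

A real device would take the frame rate from the recording format rather than a default argument.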

このように本実施形態によれば、撮像装置をはじめとするキーボード等のテキスト入力手段や選択手段を持たない機器においても、特定話者の特定の台詞で画像を検索することが容易にできるため、シーン検索が効率よく行うことが可能となる。 As described above, according to this embodiment, even in a device such as an imaging apparatus that has no text input means or selection means such as a keyboard, an image can easily be searched for by a specific line of a specific speaker, so that scene search can be performed efficiently.

以上のように、上述した各実施形態によれば、話者を指定して台詞（キーワード）で検索することで、指定した話者が喋った内容が含まれる画像データ内のシーンを効率よく検索することが可能となる。 As described above, according to each of the embodiments described above, a scene in image data including the content spoken by the designated speaker can be efficiently searched by designating the speaker and searching with dialogue (keywords). It becomes possible to do.

また、当該話者の画面内存在を指定し、台詞（キーワード）で検索することで、指定した人物が喋った内容且つ、話者が画像内に映っている画像データ内のシーンを効率よく検索することが可能となる。 Also, by designating the in-screen presence of the speaker and searching by dialogue (keyword), it is possible to efficiently search the image data for scenes in which the designated person speaks the specified content and the speaker appears in the image.

また、本発明の目的は、前述した実施形態の機能を実現するソフトウェアのプログラムコードを記録した記憶媒体を、システム或いは装置に供給し、そのシステム或いは装置のコンピュータ（またはＣＰＵやＭＰＵ）が記憶媒体に格納されたプログラムコードを読み出し実行することによっても、達成されることは言うまでもない。 Needless to say, the object of the present invention can also be achieved by supplying a storage medium on which the program code of software that realizes the functions of the above-described embodiments is recorded to a system or apparatus, and having the computer (or CPU or MPU) of the system or apparatus read and execute the program code stored in the storage medium.

この場合、記憶媒体から読み出されたプログラムコード自体が前述した実施形態の機能を実現することになり、プログラムコード自体及びそのプログラムコードを記憶した記憶媒体は本発明を構成することになる。 In this case, the program code itself read from the storage medium realizes the functions of the above-described embodiments, and the program code itself and the storage medium storing the program code constitute the present invention.

プログラムコードを供給するための記憶媒体としては、例えば、フレキシブルディスク、ハードディスク、光ディスク、光磁気ディスク、ＣＤ−ＲＯＭ、ＣＤ−Ｒ、磁気テープ、不揮発性のメモリカード、ＲＯＭ等を用いることができる。 As a storage medium for supplying the program code, for example, a flexible disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a CD-R, a magnetic tape, a nonvolatile memory card, a ROM, or the like can be used.

また、コンピュータが読み出したプログラムコードを実行することにより、前述した実施形態の機能が実現されるだけでなく、そのプログラムコードの指示に基づき、コンピュータ上で稼動しているＯＳ(基本システム或いはオペレーティングシステム)などが実際の処理の一部又は全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。 Needless to say, the present invention covers not only the case where the functions of the above-described embodiments are realized by the computer executing the read program code, but also the case where an OS (basic system or operating system) or the like running on the computer performs part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiments are realized by that processing.

さらに、記憶媒体から読み出されたプログラムコードが、コンピュータに挿入された機能拡張ボードやコンピュータに接続された機能拡張ユニットに備わるメモリに書込まれた後、そのプログラムコードの指示に基づき、その機能拡張ボードや機能拡張ユニットに備わるＣＰＵ等が実際の処理の一部又は全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。 Further, after the program code read from the storage medium is written in a memory provided in a function expansion board inserted into the computer or a function expansion unit connected to the computer, the function is determined based on the instruction of the program code. It goes without saying that the CPU or the like provided in the expansion board or function expansion unit performs part or all of the actual processing and the functions of the above-described embodiments are realized by the processing.

【図１】本発明の第１の実施形態に係る話者特定検索装置の構成を示すブロック図である。 FIG. 1 is a block diagram showing the configuration of the speaker-specific search apparatus according to the first embodiment of the present invention.
【図２】本発明の第１の実施形態に係る話者特定検索装置の構成を示すブロック図である。 FIG. 2 is a block diagram showing the configuration of the speaker-specific search apparatus according to the first embodiment of the present invention.
【図３】字幕データの一例を示す図である。 FIG. 3 is a diagram showing an example of caption data.
【図４】字幕表示の一例を示す図である。 FIG. 4 is a diagram showing an example of a caption display.
【図５】話者特定検索装置上の検索ソフトウェアで表示される画面構成例を示す図である。 FIG. 5 is a diagram showing an example of the screen configuration displayed by the search software on the speaker-specific search apparatus.
【図６】メイン操作画面とともに、検索ボタンを押下した際の検索条件入力画面を示す図である。 FIG. 6 is a diagram showing the search condition input screen displayed when the search button is pressed, together with the main operation screen.
【図７】本発明の第１の実施形態における処理の流れを示すフローチャートである。 FIG. 7 is a flowchart showing the flow of processing in the first embodiment of the present invention.
【図８】字幕データの一例を示す図である。 FIG. 8 is a diagram showing an example of caption data.
【図９】本発明の第２の実施形態に係る撮像装置の構成を示す図である。 FIG. 9 is a diagram showing the configuration of the imaging apparatus according to the second embodiment of the present invention.
【図１０】撮像装置における検索条件指定画面の一例を示す図である。 FIG. 10 is a diagram showing an example of a search condition designation screen in the imaging apparatus.
【図１１】撮像装置における検索条件指定画面の一例を示す図である。 FIG. 11 is a diagram showing an example of a search condition designation screen in the imaging apparatus.
【図１２】本発明の第２の実施形態における処理の流れを示すフローチャートである。 FIG. 12 is a flowchart showing the flow of processing in the second embodiment of the present invention.

Explanation of symbols

１００：動画データ、１０１：画像データ、１０２：音声データ、１０３：字幕データ、１０６：字幕データ内の話者情報、１０７：字幕データ内の字幕テキスト情報、１０８：字幕データ内の発声開始情報、１０９：話者データ、１１０：字幕データ読み出し部、１１１：話者特定取得部、１１２：テキスト取得部、１１３：発声開始情報取得部、１１４：検索条件入力部、１１５：字幕データ比較部、１１６：一致した字幕データ、１１７：一致した字幕データの発声開始情報、１１８：音声データ検索部、１１９：動画像データ検索部、１２０：話者データ読み出し部、１２１：Ａ子の字幕データ、１２２：Ｂ子の字幕データ、１３１：Ａ子の字幕データ、１３２：Ｂ子の字幕データ、１３３：Ｃ子の字幕データ、２０１：Ｂ子の映像、２０２：字幕、３００：検索ソフトウェアメイン画面、３０１：画像表示画面、３０２：タイムコード表示、３０３：ジョグボタン、３０４：シャトルボタン、３０５：ジョグ、シャトルダイヤル、３０６：前方のマークポイントまでの移動ボタン、３０７：巻き戻しボタン、３０８：再生ボタン、３０９：早送りボタン、３１０：後方のマークポイントまでの移動ボタン、３１１：検索ボタン、３２０：検索条件入力画面、３２１：話者選択するプルダウン、３２２：検索する台詞を入力する画面、３２３：前方検索ボタン、３２４：後方検索ボタン、３２５：キャンセルボタン、４００：撮像装置本体（操作面）、４０１：液晶表示装置、４０２：タイムコード表示、４０３：選択用操作部材、４０４：設定ボタン、４０５：録画ボタン、４０６：前方のマークポイントまでの移動ボタン、４０７：巻き戻しボタン、４０８：再生ボタン、４０９：早送りボタン、４１０：後方のマークポイントまでの移動ボタン、４１１：検索ボタン、４１２：メニューボタン、４１３：音声入力用マイク、４２０：検索対象話者一覧表示、４２１：話者選択表示枠、４２２：Ａ子の顔、４２３：Ｂ子の顔、４２４：Ｃ子の顔、４３０：Ａ子の顔、４３１：Ａ子の登録話者枠、４３２：Ｂ子の顔、４３３：Ｂ子の選択話者枠、４３４：Ｃ子の顔、４３５：Ｃ子の登録話者枠、４３６：検索する台詞表示 100: moving image data, 101: image data, 102: audio data, 103: subtitle data, 106: speaker information in subtitle data, 107: subtitle text information in subtitle data, 108: utterance start information in subtitle data, 109: speaker data, 110: subtitle data reading unit, 111: speaker specifying acquisition unit, 112: text acquisition unit, 113: utterance start information acquisition unit, 114: search condition input unit, 115: subtitle data comparison unit, 116: matched subtitle data, 117: utterance start information of matched subtitle data, 118: audio data search unit, 119: moving image data search unit, 120: speaker data reading unit, 121: subtitle data of child A, 122: subtitle data of child B, 131: subtitle data of child A, 132: subtitle data of child B, 133: subtitle data of child C, 201: video of child B, 202: subtitle, 300: search software main screen, 301: image display screen, 302: time code display, 303: jog button, 304: shuttle button, 305: jog/shuttle dial, 306: move button to the front mark point, 307: rewind button, 308: play button, 309: fast-forward button, 310: move button to the rear mark point, 311: search button, 320: search condition input screen, 321: pull-down for selecting a speaker, 322: screen for entering the dialogue to search for, 323: forward search button, 324: backward search button, 325: cancel button, 400: imaging apparatus body (operation surface), 401: liquid crystal display device, 402: time code display, 403: selection operation member, 404: setting button, 405: recording button, 406: move button to the front mark point, 407: rewind button, 408: play button, 409: fast-forward button, 410: move button to the rear mark point, 411: search button, 412: menu button, 413: voice input microphone, 420: search target speaker list display, 421: speaker selection display frame, 422: face of child A, 423: face of child B, 424: face of child C, 430: face of child A, 431: registered speaker frame of child A, 432: face of child B, 433: selected speaker frame of child B, 434: face of child C, 435: registered speaker frame of child C, 436: search dialogue display

Claims

First data generating means for identifying the person from data relating to the person of the person included in the search target data using identification data for identifying the person, and generating data indicating the person;
Second data generating means for extracting voice data of the person from the data related to the person, and generating data indicating the utterance content of the person from the extracted voice data;
Third data generation means for generating position data indicating the position of data relating to the person in the search target data;
Fourth data generation means for generating presence data indicating whether or not the image data of the person is included in the data relating to the person;
A third search condition input means for inputting data specifying whether or not image data of a person to be searched is included as a search condition;
When a search condition is input by the third search condition input unit, among a set of data generated by the first data generation unit, the second data generation unit, and the fourth data generation unit, Determination means for determining a set of data that matches the input search condition;
A data search device comprising: data search means for searching data of a position indicated by the position data from the search target data based on position data corresponding to the data set determined by the determination means.

Display control means for displaying the image data of each person on the display means;
A selection means capable of selecting arbitrary image data from the image data of each person;
2. The data search apparatus according to claim 1, further comprising first search condition input means for inputting data for specifying a person corresponding to the image data selected by the selection means as a search condition.

Identification means for identifying a person corresponding to the image data displayed on the display means using the identification data;
A selection means capable of selecting arbitrary image data from the image data of each person identified by the identification means;
The data search device according to claim 1, further comprising second search condition input means for inputting data for specifying a person corresponding to the image data selected by the selection means as a search condition.

A method for controlling a data search apparatus for searching data, comprising:
A first data generation step of identifying the person from data related to the person included in the search target data using identification data for identifying the person, and generating data indicating the person;
A second data generation step of extracting voice data of the person from the data relating to the person, and generating data indicating the utterance content of the person from the extracted voice data;
A third data generation step of generating position data indicating a position of data related to the person in the search target data;
A fourth data generation step of generating presence data indicating whether or not the image data of the person is included in the data related to the person;
A search condition input step for inputting, as a search condition, data specifying whether or not image data of a person to be searched is included;
When a search condition is input in the search condition input step, it is input out of the data set generated in the first data generation step, the second data generation step, and the fourth data generation step. A determination step of determining a set of data that matches the search condition;
And a data search step of searching from the search target data for data at a position indicated by the position data based on the position data corresponding to the set of data determined by the determination step. Control method.

The program for making a computer perform the control method of the data search device of Claim 4.

A computer-readable recording medium on which the program according to claim 5 is recorded.