JP7241636B2 - Information processing equipment

Info

Publication number: JP7241636B2
Application number: JP2019139740A
Authority: JP
Inventors: 宏幸田中
Original assignee: Hitachi Kokusai Electric Inc
Current assignee: Hitachi Kokusai Electric Inc
Priority date: 2019-07-30
Filing date: 2019-07-30
Publication date: 2023-03-17
Anticipated expiration: 2039-07-30
Also published as: JP2021022895A

Description

The present invention relates to an information processing apparatus that handles a plurality of pieces of video data and audio data.

A video server stores various kinds of video data and distributes the video/audio data that a user requests (hereinafter, "video data, etc.") over a network. Such a server stores each item of video data, etc. together with metadata, which is information describing its content. Using this metadata makes operations such as managing and distributing the video data, etc. smoother.

The video server can also recognize the content of the video data, etc. automatically. For example, Patent Literature 1 describes a technique in which rendering processing (for example, blurring portions that should be hidden at distribution time) is performed by automatically identifying the portions to be processed. In this technique, the portions to be processed include, for example, time displays in the video, vehicle registration numbers, company names, and people's faces. For time displays, vehicle registration numbers, company names, and the like, the processing target recognition unit can recognize the characters in the image using a well-known character recognition method. Editing is applied locally, only to the portions recognized in this way, and the edited video data, etc. is then distributed.

Patent Literature 1: JP 2019-62381 A

Although video data can be managed using metadata as described above, the metadata is not always correct. For example, its content may be inadequate because of limits on the number of characters that can be stored as text. In addition, when metadata is created from input by a user (administrator), errors can occur.

In the technique described in Patent Literature 1, the video server performs recognition automatically, so the restrictions on the data to be recognized are loose. Moreover, where simple character recognition, face recognition, and the like are used as described above, recent advances in pattern recognition have raised the likelihood that text and people are recognized correctly. However, when the video is degraded, for example, text and people are sometimes not recognized correctly even with the latest pattern recognition techniques. Alternatively, recognizing this information more accurately requires auxiliary information, such as the category of the target video data or audio data. Because that information has to be entered, user operation is needed and processing takes longer. A technique that can automatically recognize the content of video data, etc. with high accuracy has therefore been desired.

The present invention has been made in view of these circumstances, and its object is to solve the above problem.

The present invention is an information processing apparatus that automatically recognizes content specifying information, which is information serving as a keyword corresponding to the content of video data. The apparatus comprises: an analysis unit that recognizes candidates for the content specifying information from images or character displays in the video data, from speech in audio data accompanying the video data, and from meta information accompanying the video data; and an information recognition unit. The information recognition unit causes the analysis unit to perform a primary analysis, in which candidates are selected from each of the image, the character display, the speech, and the meta information and the content specifying information is searched for on the basis of the plurality of candidates. If the content specifying information cannot be set from the candidates obtained by the primary analysis, the information recognition unit sets, for the analysis unit, secondary analysis information that specifies analysis conditions on the basis of the result of the primary analysis; causes the analysis unit to perform a secondary analysis, in which candidates are selected again from at least one of the image, the character display, the speech, and the meta information on the basis of the secondary analysis information; and searches for the content specifying information on the basis of the results of both the primary analysis and the secondary analysis.
When searching for the content specifying information, the information recognition unit may give high priority to the analysis results for the meta information in the primary analysis and the secondary analysis.
The information recognition unit may determine whether the candidates selected from each of the image, the character display, and the speech in the primary analysis match or do not match.
If the content specifying information cannot be determined after the secondary analysis, the information recognition unit may display a plurality of the candidates, each with a numerical confidence attached.
If the content specifying information cannot be determined after the secondary analysis, the information recognition unit may issue a warning.
The information recognition unit may preset a priority for the analysis result of each of the image, the character display, and the speech, and may determine the content specifying information on the basis of the results of the primary analysis and the secondary analysis together with those priorities.
The video data and the audio data may be divided into a plurality of blocks along the time series so that the content specifying information can be recognized for each block. Depending on the type of the video data, the information recognition unit may then switch between two operations: performing the secondary analysis, based on the secondary analysis information obtained from the primary analysis result of one block, only on that block; and performing that secondary analysis on that block and on other blocks as well.

According to the present invention, the content of video data, etc. can be recognized automatically with high accuracy.

FIG. 1 is a diagram showing the configuration of the information processing apparatus according to the embodiment.
A further figure shows, in time series over the video data, an example (sixth to tenth cases) of the primary analysis and secondary analysis in the information processing apparatus according to the embodiment.
A further figure is a flowchart showing the operation of the information processing apparatus according to the embodiment.

Next, an embodiment of the present invention is described in detail with reference to the drawings. The information processing apparatus according to the embodiment is a video server that stores and distributes video data and audio data. This video server automatically recognizes content specifying information, which is characteristic information representing the content of the stored video data and audio data. The video data may be distributed after editing processing (rendering processing, etc.), such as that in the technique of Patent Literature 1, is applied with respect to the content specifying information recognized in this way.

FIG. 1 shows the configuration of this video server 1. Only the components involved in the recognition described above are shown; components for distributing video data, etc. over a network, for example, are omitted. The video data and audio data to be handled are input over the network by a recording unit 11 and stored in a storage unit 12, which consists of a hard disk or similar device capable of storing large volumes of data. Metadata (meta information), which specifies the content as supplementary information for each of a plurality of items, is stored in association with the video data and audio data.

A control unit 10, which includes a CPU and the like, controls the overall operation of the video server 1 in response to operation of an operation unit 13 consisting of a keyboard and a touch panel. Necessary information is shown on a display unit 14, which is a display.

The video server 1 is provided with an information recognition unit 20 for recognizing the content of the video data, etc. stored in the storage unit 12. The information recognition unit 20 has four analysis units: an object analysis unit 21, which identifies objects in the video by recognizing objects (including people's faces) in the video (images); a subtitle analysis unit 22, which recognizes the character strings in subtitles by applying character recognition to the subtitles in the video; a statement content analysis unit 23, which recognizes the keyword phrases in what people appearing in the video say, by recognizing the speech in the audio data accompanying the video data; and a meta information analysis unit 24, which recognizes information in the corresponding metadata, in particular information corresponding to the targets recognized by the object analysis unit 21, the subtitle analysis unit 22, and the statement content analysis unit 23. The items recognized by these analysis units become candidates for the content specifying information, and the content specifying information is searched for on the basis of these multiple candidates.

In the information recognition unit 20, however, this analysis is carried out in two stages. The comprehensive analysis unit 25 has the first analysis (primary analysis) performed using a primary analysis processing unit 26, and then has a second analysis (secondary analysis) performed using a secondary analysis processing unit 27. In the primary analysis, the object analysis unit 21, the subtitle analysis unit 22, the statement content analysis unit 23, and the meta information analysis unit 24 each perform the above analysis independently, and each yields its own result (candidate). Each unit need not yield exactly one result; it may yield several. For example, the character strings (phrases) recognized by the subtitle analysis unit 22 and the statement content analysis unit 23 may include ones containing errors, as well as multiple synonyms or similar-sounding phrases.

From the candidates obtained by the primary analysis, the comprehensive analysis unit 25 can search for and determine the content specifying information. For example, if all the analysis units recognize a candidate with identical or common content, that candidate can be set as the content specifying information. Likewise, if each analysis unit yields multiple candidates and one candidate is common to all analysis units, that candidate can be used as the content specifying information.
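As an illustration only, the agreement check described above might be sketched in Python as follows. The function name `find_common_candidate`, the unit keys, and the sample candidates are hypothetical, and the sketch assumes that each unit's candidates have already been normalized to comparable labels (for example, that a recognized tower shape has already been mapped to "Skytree").

```python
from typing import Dict, Optional, Set


def find_common_candidate(candidates: Dict[str, Set[str]]) -> Optional[str]:
    """Return a candidate proposed by every analysis unit, or None."""
    # No agreement is possible if there are no units or a unit found nothing.
    if not candidates or any(not found for found in candidates.values()):
        return None
    common = set.intersection(*candidates.values())
    # Sorting makes the choice deterministic if several candidates are shared.
    return sorted(common)[0] if common else None


# Hypothetical primary-analysis output, keyed by analysis unit.
primary = {
    "object": {"Skytree", "Tokyo Tower"},
    "subtitle": {"Skytree"},
    "statement": {"Skytree", "sky fishing"},
    "meta": {"Skytree", "Sumida Ward"},
}
print(find_common_candidate(primary))
```

When one unit shares no candidate with the others, the function returns `None`, which corresponds to the situation in which a secondary analysis would be needed.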

If the content specifying information could not be determined in this way, or could not be determined because some analysis unit yielded no candidate, the comprehensive analysis unit 25 sets secondary analysis information, which is the information used when running the analysis again (secondary analysis), on the basis of the primary analysis results and stores it in a secondary analysis information storage unit 28. Using this secondary analysis information, the secondary analysis processing unit 27 runs the analysis again with at least one of the object analysis unit 21, the subtitle analysis unit 22, the statement content analysis unit 23, and the meta information analysis unit 24. Whereas the primary analysis produces multiple independent results, one set per analysis unit, the secondary analysis produces either the single result best suited to be the content specifying information or an indication that no such single result could be selected. When no single result can be selected, the multiple candidates obtained from the primary and secondary analyses can be displayed, each with a numerical confidence. The outcome after secondary analysis is therefore preferable to performing the primary analysis alone: either the result is more accurate or, even when sufficient confidence is not reached, the plausible candidates are presented appropriately.
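The two-stage flow above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `recognize`, `shared`, and `Analyzer` are hypothetical, each analysis unit is modeled as a function that takes an optional priority term, and choosing the most widely supported candidate as the priority term is an assumption made for the example.

```python
from collections import Counter
from typing import Callable, Dict, Optional, Set, Tuple

# An analyzer takes an optional priority term (None in the primary analysis)
# and returns its set of candidates.
Analyzer = Callable[[Optional[str]], Set[str]]


def shared(results: Dict[str, Set[str]]) -> Optional[str]:
    """Return a candidate common to all units, or None."""
    common = set.intersection(*results.values()) if results else set()
    return sorted(common)[0] if common else None


def recognize(analyzers: Dict[str, Analyzer]) -> Tuple[Optional[str], Optional[dict]]:
    # Primary analysis: every unit runs independently, with no preconditions.
    primary = {name: run(None) for name, run in analyzers.items()}
    label = shared(primary)
    if label is not None:
        return label, None  # settled; no secondary analysis information needed

    # Secondary analysis information (assumed form): the most widely supported
    # candidate becomes the priority term, and units that missed it are rerun.
    votes = Counter(c for found in primary.values() for c in found)
    if not votes:
        return None, None
    term = votes.most_common(1)[0][0]
    info = {"priority_term": term,
            "rerun": [n for n, found in primary.items() if term not in found]}

    # Secondary analysis: rerun only the listed units with the priority term.
    merged = dict(primary)
    for name in info["rerun"]:
        merged[name] = analyzers[name](term)
    return shared(merged), info


# Toy analyzers: the statement unit only finds "Skytree" when told to look for it.
analyzers = {
    "subtitle": lambda hint: {"Skytree"},
    "meta": lambda hint: {"Skytree"},
    "statement": lambda hint: {"Skytree"} if hint == "Skytree" else {"sky fishing"},
}
label, info = recognize(analyzers)
print(label, info)
```

A `None` label with non-empty `info` corresponds to the outcome in which no single result could be selected even after the secondary analysis.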

In the secondary analysis by an analysis unit, the primary analysis results of the other analysis units are reflected. The comprehensive analysis unit 25 can, however, also let an analysis unit use the results of other analysis units during the primary analysis, which allows more accurate and efficient analysis.

For example, when the statements of several people are mixed, the speech is stored together as audio data without distinguishing the speakers, but who made each statement can be identified by voiceprint or the like. Therefore, if in the primary analysis the subtitle analysis unit 22 recognizes the name of a particular person, or the object analysis unit 21 recognizes a particular person, the statement content analysis unit 23 can extract and analyze only that person's statements. Suppose, for example, that this person actually said "Skytree" (スカイツリー) but, because of an interruption in the recording or noise, only the fragment "itsu" (イツ) was recognized in the primary analysis. Through this operation, the secondary analysis can determine that the statement content analysis unit 23 also recognized "Skytree".

To this end, the statement content analysis unit 23 preferably first identifies the people (characters) whose statements are recognized in the audio data, and then performs the above analysis. Data such as each person's voiceprint can be stored in advance in the storage unit 12 as a database, and the analysis can be based on it. When extracting a particular person's statements as described above, the video data can also be consulted; for example, the audio from the moment that person's mouth starts moving can be taken as the target of analysis.

When speakers are identified by voiceprint as described above, similar voiceprints can prevent a clear identification. In such cases, the difference between the voiceprint recognized in the audio data and each pre-registered voiceprint can be quantified, and the priority of phrases recognized under the voiceprint estimated to be closest can be raised.
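One simple way to quantify the difference between voiceprints, shown here only as a sketch, is to treat each voiceprint as a numeric feature vector and use the Euclidean distance. The vector representation, the function name `closest_voiceprint`, and the sample values are assumptions for illustration; the source does not specify how the difference is computed.

```python
import math
from typing import Dict, Sequence, Tuple


def closest_voiceprint(
    observed: Sequence[float],
    registered: Dict[str, Sequence[float]],
) -> Tuple[str, Dict[str, float]]:
    """Return the registered speaker nearest to the observed voiceprint."""
    # Quantify the difference to each registered voiceprint as a distance.
    distances = {name: math.dist(observed, vec) for name, vec in registered.items()}
    return min(distances, key=distances.get), distances


# Two hypothetical, deliberately similar registered voiceprints.
registered = {"speaker_a": (1.0, 2.0), "speaker_b": (1.0, 3.0)}
best, distances = closest_voiceprint((1.0, 2.25), registered)
print(best, distances)
```

Phrases attributed to `best` could then be given higher priority, with the distances available for deciding how much weight to apply.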

If the video data is a movie, for example, the audio may be dubbed, in which case the character and the speaker do not match. The fact that the video data is a movie can be recognized, for example, by the meta information analysis unit 24 analyzing the meta file. In this case, the object analysis unit 21 can, for example, recognize the person who is speaking (whose mouth is moving) while a subtitle is displayed as the speaker, and the audio at that moment can be presumed to be that speaker's. The original language and the dubbing language can be recognized from the audio or from the metadata, and the above analysis can be based on them. In addition, when speech with a voiceprint not stored in the database is recognized, that voiceprint can be newly stored.

The target of analysis by the statement content analysis unit 23 can also be set according to the category of the video data (audio data). For example, that the video data relates to music can be recognized, as with movies, by the meta information analysis unit 24 analyzing the meta file. In this case the audio mixes vocals and narration, but the two can be distinguished because vocals have a melody (rising and falling pitch) whereas narration varies little in pitch. The statement content analysis unit 23 can therefore perform the above analysis on the vocals and the narration separately, and the phrase recognized most often in common can be treated as a high-priority phrase (candidate) as described above. Alternatively, only the narration, for example, can be made the target of analysis.
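The distinction between vocals and narration by pitch variation could be approximated, for illustration, by comparing the spread of a segment's pitch contour against a threshold. The function name `classify_segment`, the use of the population standard deviation, the 30 Hz threshold, and the sample contours are all assumptions chosen for the sketch; a practical system would need a real pitch tracker and tuned values.

```python
import statistics
from typing import Sequence


def classify_segment(pitch_hz: Sequence[float], threshold_hz: float = 30.0) -> str:
    """Label a segment 'vocal' if its pitch varies widely, else 'narration'."""
    # A melody moves across notes, so its pitch spread is large;
    # spoken narration stays within a narrow pitch range.
    return "vocal" if statistics.pstdev(pitch_hz) > threshold_hz else "narration"


# Hypothetical pitch contours in Hz.
sung = [220.0, 330.0, 440.0, 262.0]
spoken = [120.0, 125.0, 118.0, 122.0]
print(classify_segment(sung), classify_segment(spoken))
```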

When the video data relates to music, keywords such as the song title and performer names are often displayed in the on-screen captions (telops) at the beginning of the video or in the ending credits. In such cases, the subtitle analysis unit 22 can restrict its analysis to the opening captions and the ending credits. Alternatively, phrases recognized in the opening captions or the ending credits can be given higher priority as candidates. Furthermore, when the video data (audio data) relates to music, song melodies in particular can be stored in advance in the storage unit 12 as a database, and song titles and the like can be recognized from them.

As described above, analysis is performed by the object analysis unit 21, the subtitle analysis unit 22, the statement content analysis unit 23, and the meta information analysis unit 24, but the analysis units need not be treated equally. For example, metadata is especially likely to contain content corresponding to the content specifying information, so the results of the meta information analysis unit 24 may be given especially high priority. In that case, each candidate obtained in the primary and secondary analyses can be given a numerical confidence, and the confidence of candidates from the meta information analysis unit 24 can be weighted more heavily than that of other candidates. Alternatively, for example, when a candidate from the meta information analysis unit 24 matches a candidate from any one of the object analysis unit 21, the subtitle analysis unit 22, and the statement content analysis unit 23, that candidate can be selected as the content specifying information.
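The weighting of confidences described above might look like the following sketch. The weight table `UNIT_WEIGHTS`, the function name `rank_candidates`, the summing of weighted confidences, and all numeric values are hypothetical; the source only states that candidates from the meta information analysis unit 24 can be weighted more heavily.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical weights: the meta information unit counts double.
UNIT_WEIGHTS = {"object": 1.0, "subtitle": 1.0, "statement": 1.0, "meta": 2.0}


def rank_candidates(results: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Sum weighted confidences per candidate and sort from best to worst."""
    totals: Dict[str, float] = defaultdict(float)
    for unit, found in results.items():
        for term, confidence in found.items():
            totals[term] += UNIT_WEIGHTS[unit] * confidence
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-unit candidates with confidences between 0 and 1.
results = {
    "object": {"tower": 0.5},
    "subtitle": {"Skytree": 0.5},
    "statement": {"sky fishing": 0.75},
    "meta": {"Skytree": 0.5},
}
ranking = rank_candidates(results)
print(ranking)
```

With these sample values, "sky fishing" has the highest single confidence, but "Skytree" ranks first because it is supported by the more heavily weighted metadata as well as the subtitles.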

The setting that raises the priority of the meta information analysis unit 24 may also be applied only when, for example, the candidates obtained by the primary analysis show no commonality at all. In that case, for example, a candidate obtained by the meta information analysis unit 24 in the primary analysis can be set as the content specifying information without performing the secondary analysis, or multiple candidates obtained by the meta information analysis unit 24 can be displayed with confidences attached. It is also possible to create the secondary analysis information from the results of the meta information analysis unit 24 alone and then perform the secondary analysis. As these examples show, there are various ways to give the results of the meta information analysis unit 24 effectively higher priority.

A concrete example of this operation of the information recognition unit 20 is described below. Here the target video data (including the accompanying audio data) is assumed to be about "Tokyo Skytree".

The comprehensive analysis unit 25 first has the primary analysis performed using the primary analysis processing unit 26, in every case in which the content of video data is to be recognized. The primary analysis is run without setting any preconditions for the analysis units, for example without specifying that the item to be recognized is a building. The content of the secondary analysis that follows from the result, and the subsequent determination, are described below for several cases.

The first to fifth cases are ones in which the operation for determining the content specifying information is performed once per item of video data (and accompanying audio data). In these cases, even if the video data is long, the object analysis unit 21, the subtitle analysis unit 22, and the statement content analysis unit 23 analyze all of the video or audio within its duration.

In the first case, all of the analysis units recognize something related to "Tokyo Skytree". For example, the primary analysis processing unit 26 can have the object analysis unit 21 recognize, by a well-known pattern recognition method, that the image appearing for the longest time in the video data is a "tower-shaped high-rise building". In addition, "Skytree" can be recognized as the most characteristic keyword among the character strings recognized by the subtitle analysis unit 22, as the most characteristic phrase found by the statement content analysis unit 23, and as a keyword in the metadata by the meta information analysis unit 24. The comprehensive analysis unit 25 then recognizes that the subtitle analysis unit 22, the statement content analysis unit 23, and the meta information analysis unit 24 share the common phrase "Skytree".

The comprehensive analysis unit 25 can also recognize, for example from information about "Skytree" stored in the storage unit 12 or obtained over the network, that the "tower-shaped high-rise building" recognized by the object analysis unit 21 is consistent with "(Tokyo) Skytree". Because all the analysis units have thus recognized common content, the comprehensive analysis unit 25 can associate the information "(Tokyo) Skytree" with the video data as the content that identifies it. No secondary analysis is needed in this case, so the comprehensive analysis unit 25 does not create secondary analysis information.

In the example above, the meta information analysis unit 24 was also assumed to recognize "Skytree". If it instead recognizes content directly linked to "Skytree", such as its location "Sumida Ward" or its function "radio tower", then, as with the "tower-shaped high-rise building" from the object analysis unit 21, all the analysis units can be regarded as having recognized the common content "Skytree" overall.

In the second case, the statement content analysis unit 23 recognizes the phrase "sky fishing" (スカイ釣り, "sukai tsuri") because of noise, while the other analysis units produce the same results as in the first case. In other words, speech recognition judged "sky fishing" a better match than "Skytree". Unlike the first case, the content recognized by the analysis units is not common to all of them, so in the second case the primary analysis does not determine the content that identifies the video data.

However, as before, the subtitle analysis unit 22 and the meta information analysis unit 24 both recognize "Skytree", and this is consistent with the "tower-shaped high-rise building" recognized by the object analysis unit 21. The comprehensive analysis unit 25 can therefore recognize "Skytree" as the most likely information.

In this case, the comprehensive analysis unit 25 creates secondary analysis information that designates the statement content analysis unit 23, whose result did not match those of the other analysis units, as the unit to run again, and that designates "Skytree" as the content to search for preferentially in the rerun; it stores this in the secondary analysis information storage unit 28. The secondary analysis processing unit 27 reads this secondary analysis information and, unlike the primary analysis, which had no preconditions, has a secondary analysis performed that quantifies and outputs the degree to which the statements are recognized as containing "Skytree". In the primary analysis this value was higher for "sky fishing" than for "Skytree", but in the secondary analysis the phrase given priority in this way is the one searched for.

ここで、総合解析部２５は、例えばこの数値がある閾値を超えた場合には、発言内容解析部２３においては、二次解析によって「スカイツリー」が認識されたと判定することができる。この場合には、他の解析部においては一次解析の結果によって、発言内容解析部２３においては二次解析の結果によって、第１のケースと同様に「スカイツリー」という共通の情報（内容）を認識したと判定することができる。 Here, for example, when this value exceeds a certain threshold, the comprehensive analysis unit 25 can determine that the statement content analysis unit 23 recognized "Skytree" in the secondary analysis. In this case, it can determine that the common information (content) "Skytree" has been recognized, as in the first case, based on the primary analysis results of the other analysis units and the secondary analysis result of the statement content analysis unit 23.

一方、総合解析部２５は、上記の数値がある閾値以下であった場合には、発言内容解析部２３においては、二次解析によっても「スカイツリー」が認識されなかったと判定することができる。この場合、全ての解析部で共通の内容を認識することができなかったため、「スカイツリー」は、確度は高い内容ではあるが、上記の映像データを表す情報としては充分ではないと認識することができる。 On the other hand, when the value is equal to or less than the threshold, the comprehensive analysis unit 25 can determine that the statement content analysis unit 23 did not recognize "Skytree" even in the secondary analysis. In this case, because not all analysis units recognized common content, it can recognize that "Skytree", although highly probable, is not sufficient as information representing the video data.
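The agreement check and the threshold test of the first and second cases can be illustrated with a minimal Python sketch. It is not taken from the patent: the unit names, the functions `find_consensus` and `resolve`, the `rescore` callback, and the default threshold of 0.5 are all hypothetical, and each unit's result is assumed to have been normalized to a single label beforehand (for example, "tower-shaped high-rise building" mapped to "Skytree").

```python
from collections import Counter

def find_consensus(results, min_agree=2):
    """Return (candidate, dissenting_units) from one label per analysis unit.

    `results` maps a hypothetical unit name to the label it recognized in
    the primary analysis. Labels are assumed to be normalized already.
    """
    candidate, votes = Counter(results.values()).most_common(1)[0]
    if votes < min_agree:
        return None, sorted(results)  # no two units agree
    dissenters = sorted(u for u, label in results.items() if label != candidate)
    return candidate, dissenters

def resolve(results, rescore, threshold=0.5):
    """Return (candidate, confirmed) after a targeted secondary analysis.

    `rescore(unit, candidate)` is a hypothetical callback that re-runs one
    unit with the candidate as a search hint and returns a numeric score.
    The candidate is confirmed only if every dissenting unit scores above
    the threshold; a score equal to the threshold does not confirm it.
    """
    candidate, dissenters = find_consensus(results)
    if candidate is None:
        return None, False
    confirmed = all(rescore(u, candidate) > threshold for u in dissenters)
    return candidate, confirmed
```

When every unit already agrees, the list of dissenting units is empty and the candidate is confirmed without any re-analysis, which corresponds to the first case.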

第３のケースとして、字幕解析部２２、発言内容解析部２３、メタ情報解析部２４によって、第１のケースと同様にそれぞれ「スカイツリー」が認識されたが、物体解析部２１は、「スカイツリー」とは無関係の人物の顔が認識されたものとする。この場合、第２のケースと同様に、全ての解析部で認識された内容には共通性が認められない。しかしながら、字幕解析部２２、発言内容解析部２３、メタ情報解析部２４の結果には共通性が認められるため、第２のケースと同様に、総合解析部２５は、「スカイツリー」を最も可能性の高い情報として認識することができる。 In the third case, it is assumed that the subtitle analysis unit 22, the statement content analysis unit 23, and the meta-information analysis unit 24 each recognized "Skytree" as in the first case, but the object analysis unit 21 recognized the face of a person unrelated to "Skytree". In this case, as in the second case, the contents recognized by the analysis units do not all agree. However, because the results of the subtitle analysis unit 22, the statement content analysis unit 23, and the meta-information analysis unit 24 agree, the comprehensive analysis unit 25 can recognize "Skytree" as the most likely information, as in the second case.

この場合、総合解析部２５は、二次解析情報として、再度の解析を行う解析部として、他の解析部との間の合致が認められなかった物体解析部２１を特定し、かつ再度の解析においては、優先的にサーチする対象として「スカイツリー」が該当する「建造物」を特定し、これを二次解析情報記憶部２８に記憶させる。二次解析処理部２７は、この二次解析情報を読出して認識し、前提条件のなかった一次解析とは異なり、物体認識部２１は、「建造物」のみをサーチする二次解析を行わせる。 In this case, the comprehensive analysis unit 25 creates secondary analysis information that specifies the object analysis unit 21, whose result did not match those of the other analysis units, as the unit to perform the analysis again, and that specifies "building", the category to which "Skytree" belongs, as the target to be searched for preferentially in that re-analysis. It stores this information in the secondary analysis information storage unit 28. The secondary analysis processing unit 27 reads this secondary analysis information and, unlike in the primary analysis, which had no preconditions, causes the object analysis unit 21 to perform a secondary analysis that searches only for "buildings".

その結果、「建造物」として、「タワー状の高層建築物」を認識した場合には、物体解析部２１以外の解析部における一次解析の結果と、物体解析部２１においては二次解析の結果によって、第１のケースと同様に「スカイツリー」という共通の情報（内容）を認識したと判定することができる。 As a result, when a "tower-shaped high-rise building" is recognized as a "building", it can be determined that the common information (content) "Skytree" has been recognized, as in the first case, based on the primary analysis results of the analysis units other than the object analysis unit 21 and the secondary analysis result of the object analysis unit 21.

一方、総合解析部２５は、物体解析部２１においては、二次解析によっても「タワー状の高層建築物」等、「スカイツリー」に対応する建造物が認識されなかった場合、全ての解析部で共通の内容を認識することができなかったため、「スカイツリー」は、確度は高い内容ではあるが、上記の映像データを表す情報としては充分ではないと認識することができる。 On the other hand, when the object analysis unit 21 does not recognize a building corresponding to "Skytree", such as a "tower-shaped high-rise building", even in the secondary analysis, the comprehensive analysis unit 25 can recognize that, because not all analysis units recognized common content, "Skytree", although highly probable, is not sufficient as information representing the video data.

第４のケースとして、字幕解析部２２、メタ情報解析部２４によって、それぞれ「スカイツリー」が認識されたが、物体解析部２１は、「スカイツリー」とは無関係の人物の顔が認識され、かつ発言内容解析部２３においては「スカイツリー」とは全く関連性のない語句が認識されたものとする。この場合、第２、第３のケースと同様に、全ての解析部で認識された内容には共通性が認められないが、字幕解析部２２、メタ情報解析部２４の結果には共通性が認められるため、第２、第３のケースと同様に、総合解析部２５は、「スカイツリー」を最も可能性の高い情報として認識することができる。 As a fourth case, "Skytree" was recognized by the subtitle analysis unit 22 and the meta information analysis unit 24, but the object analysis unit 21 recognized a person's face unrelated to "Skytree", In addition, it is assumed that the statement content analysis unit 23 recognizes a phrase that is completely unrelated to "sky tree". In this case, as in the second and third cases, there is no commonality in the contents recognized by all the analysis units, but there is commonality in the results of the caption analysis unit 22 and the meta information analysis unit 24. Since it is recognized, the comprehensive analysis unit 25 can recognize "Skytree" as the information with the highest probability, as in the second and third cases.

この場合、総合解析部２５は、二次解析情報として、再度の解析を行う解析部として、「スカイツリー」との合致が認められなかった物体解析部２１において優先的にサーチする対象として第３のケースと同様に「建造物」を特定すると共に、発言内容解析部２３においては「スカイツリー」あるいはこれに類似又は関連したキーワードの有無のみを主眼としたサーチを行うことを特定した二次解析情報を作成し、これを二次解析情報記憶部２８に記憶させる。二次解析処理部２７は、この二次解析情報を読出して認識し、この二次解析情報に基づき、前提条件のなかった一次解析とは異なり、第３のケースと同様に物体認識部２１に「建造物」のみをサーチさせると共に、発言内容解析部２３においては「スカイツリー」に関連したサーチを行わせる。 In this case, the comprehensive analysis unit 25 creates secondary analysis information that, as in the third case, specifies "building" as the target to be searched for preferentially by the object analysis unit 21, whose result did not match "Skytree", and that specifies that the statement content analysis unit 23 is to perform a search focused solely on the presence or absence of "Skytree" or keywords similar or related to it. It stores this information in the secondary analysis information storage unit 28. The secondary analysis processing unit 27 reads this secondary analysis information and, based on it, unlike in the primary analysis, which had no preconditions, causes the object analysis unit 21 to search only for "buildings" as in the third case and causes the statement content analysis unit 23 to perform a search related to "Skytree".

その結果、総合解析部２５は、物体解析部２１により「タワー状の高層建築物」が、発言内容解析部２３により「スカイツリー」あるいはこれに関連した語句が認識された場合には、字幕解析部２２、メタ情報解析部２４における一次解析の結果と、これらの二次解析の結果によって、第１のケースと同様に「スカイツリー」という共通の情報（内容）を認識したと判定することができる。 As a result, when the object analysis unit 21 recognizes a "tower-shaped high-rise building" and the statement content analysis unit 23 recognizes "Skytree" or a related phrase, the comprehensive analysis unit 25 can determine that the common information (content) "Skytree" has been recognized, as in the first case, based on the primary analysis results of the subtitle analysis unit 22 and the meta-information analysis unit 24 and on these secondary analysis results.

一方、総合解析部２５は、二次解析によっても物体解析部２１により「タワー状の高層建築物」が、又は発言内容解析部２３により「スカイツリー」あるいはこれに関連した語句が認識されなかった場合には、「スカイツリー」は、確度は高い内容ではあるが、上記の映像データを表す情報としては充分ではないと認識することができる。 On the other hand, when, even in the secondary analysis, either the object analysis unit 21 does not recognize a "tower-shaped high-rise building" or the statement content analysis unit 23 does not recognize "Skytree" or a related phrase, the comprehensive analysis unit 25 can recognize that "Skytree", although highly probable, is not sufficient as information representing the video data.

第１～第４のケースにおいては、物体解析部２１、字幕解析部２２、発言内容解析部２３で、メタ情報解析部２４による解析結果が対等に取り扱われ、全ての結果が合致していないものの、その中で共通の結果が得られた２つ以上のものが存在した場合に、この結果が最も可能性の高いものであるとして二次解析が行われた。 In the first to fourth cases, the analysis results of the object analysis unit 21, the subtitle analysis unit 22, the statement content analysis unit 23, and the meta-information analysis unit 24 were treated equally, and when not all results matched but two or more of them agreed, the secondary analysis was performed on the assumption that the agreed result was the most likely one.

この場合、一次解析において、例えば全ての解析部において全く異なった解析結果が得られた場合には、最も可能性の高い内容を認識することができない。このため、総合解析部２５は、対象となった映像データの内容が特定できなかったと認識することができる。 In this case, in the primary analysis, if, for example, completely different analysis results are obtained in all the analysis units, the content with the highest probability cannot be recognized. Therefore, the integrated analysis unit 25 can recognize that the contents of the target video data could not be specified.

ここで、この４つの解析部を対等に取り扱わずに、解析部に優先順位を設定してもよい。第５のケースは、こうした場合に対応する。例えば、メタデータは、映像データの内容を反映したものとして作成されているため、上記の解析部による解析結果の中では、メタ情報解析部２４による解析結果の確度が最も高いと推定することもできる。この場合、メタ情報解析部２４による解析結果の優先度を他よりも高くすることができる。 Here, instead of treating these four analysis units equally, priorities may be assigned to the analysis units. The fifth case corresponds to this situation. For example, because metadata is created to reflect the content of the video data, the analysis result of the meta-information analysis unit 24 can be presumed to be the most accurate of the analysis results of the above analysis units. In this case, the analysis result of the meta-information analysis unit 24 can be given a higher priority than the others.

第５のケースにおいては、前記のように解析部に優先順位が設定され、特にメタ情報解析部２４の優先度が高く設定される。ここでは、メタ情報解析部２４によって「スカイツリー」が認識されたが、物体解析部２１、字幕解析部２２、発言内容解析部２３によっては、いずれも「スカイツリー」と無関係の内容が認識されたものとする。この場合、「スカイツリー」が認識されたのはメタ情報解析部２４のみであったとしても、総合解析部２５は、第２～第４のケースと同様に、「スカイツリー」を最も可能性の高い情報として認識することができる。 In the fifth case, priorities are assigned to the analysis units as described above, with the meta-information analysis unit 24 in particular given a high priority. Here, it is assumed that the meta-information analysis unit 24 recognized "Skytree", but the object analysis unit 21, the subtitle analysis unit 22, and the statement content analysis unit 23 all recognized content unrelated to "Skytree". In this case, even though only the meta-information analysis unit 24 recognized "Skytree", the comprehensive analysis unit 25 can recognize "Skytree" as the most likely information, as in the second to fourth cases.

この場合、総合解析部２５は、再度の解析を行う解析部として物体解析部２１、字幕解析部２２、発言内容解析部２３を指定し、前記のケースと同様に、これらにおいて「スカイツリー」を前提としたサーチを行わせる旨の二次解析情報を作成する。その結果、総合解析部２５は、物体解析部２１、字幕解析部２２、発言内容解析部２３の全てでも「スカイツリー」あるいはこれに対応した内容が認識された場合には、第１のケースと同様に「スカイツリー」という共通の情報（内容）を認識したと判定することができる。一方、総合解析部２５は、二次解析における物体解析部２１、字幕解析部２２、発言内容解析部２３のいずれかで「スカイツリー」あるいはこれに関連した語句が認識されなかった場合には、「スカイツリー」は、確度（優先度）は高い内容ではあるが、上記の映像データを表す情報としては充分ではないと認識することができる。 In this case, the comprehensive analysis unit 25 designates the object analysis unit 21, the subtitle analysis unit 22, and the statement content analysis unit 23 as the units to perform the analysis again and, as in the preceding cases, creates secondary analysis information instructing them to search on the premise of "Skytree". As a result, when all of the object analysis unit 21, the subtitle analysis unit 22, and the statement content analysis unit 23 recognize "Skytree" or content corresponding to it, the comprehensive analysis unit 25 can determine that the common information (content) "Skytree" has been recognized, as in the first case. On the other hand, when any of the object analysis unit 21, the subtitle analysis unit 22, and the statement content analysis unit 23 does not recognize "Skytree" or a related phrase in the secondary analysis, the comprehensive analysis unit 25 can recognize that "Skytree", although of high accuracy (priority), is not sufficient as information representing the video data.

この優先順位については、適宜設定が可能である。例えば、メタ情報解析部２４の解析結果を優先するが、他の２つの解析部でメタ情報解析部２４の解析結果とは異なる共通の解析結果が得られた場合には、この共通の解析結果を優先してもよい。また、総合解析部２５が、映像データのファイルサイズに応じて、どの解析部の解析結果を優先するかを設定してもよい。また、特定の解析部の組み合わせで共通の解析結果が得られた場合に、この解析結果を優先する設定としてもよい。 This priority can be set as appropriate. For example, the analysis result of the meta information analysis unit 24 is prioritized, but if the other two analysis units obtain a common analysis result different from the analysis result of the meta information analysis unit 24, this common analysis result may take precedence. Further, the comprehensive analysis unit 25 may set which analysis unit gives priority to the analysis result according to the file size of the video data. Moreover, when a common analysis result is obtained by a combination of specific analysis units, this analysis result may be prioritized.
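One of the priority rules mentioned above (prefer the metadata result unless two other units agree on something different) could be sketched as follows. This is only an illustration under assumptions: the function `pick_candidate`, the unit names, and the `min_agree` parameter are hypothetical, and labels are assumed to be directly comparable strings.

```python
from collections import Counter

def pick_candidate(results, priority_unit="meta", min_agree=2):
    """Choose the most likely label when one analysis unit has priority.

    `results` maps a hypothetical unit name to the label it recognized.
    The priority unit's label wins unless at least `min_agree` of the
    other units agree on a different label.
    """
    others = Counter(
        label for unit, label in results.items() if unit != priority_unit
    )
    if others:
        label, votes = others.most_common(1)[0]
        if votes >= min_agree and label != results[priority_unit]:
            return label  # agreement among the other units overrides
    return results[priority_unit]
```

With `min_agree` set higher than the number of non-priority units, the rule reduces to the plain fifth case, in which the metadata result is always taken as the candidate.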

このように、第２～第５のケースでは、一次解析の結果に基づいて設定された二次解析情報に基づいて二次解析が行われ、一次解析の結果と二次解析の結果に基づいて映像データを特定する内容が決定される。あるいは、このような特定の内容が決定できない旨が通知される。このような特定の内容が決定できない旨が通知されない場合には、各解析部による解析結果のうち、そのうちの一つを最も可能性の高いものと設定することもできる。 Thus, in the second to fifth cases, the secondary analysis is performed based on the secondary analysis information set based on the results of the primary analysis, and based on the results of the primary analysis and the results of the secondary analysis The content specifying the video data is determined. Alternatively, it is notified that such specific content cannot be determined. If no notification is given to the effect that such specific content cannot be determined, one of the analysis results by each analysis unit can be set as the most likely one.

第１～第５のケースでは、各解析部による解析は映像データの全体にわたり行われるものとした。しかしながら、特に映像データが長時間にわたるものである場合には、映像データを時系列で複数のブロックに分けてブロックごとに上記のように内容を解析することができる。この場合、メタ情報解析部２４による解析の対象となるメタデータは、このブロックに対応して作成されていればブロック毎に解析が行われ、ブロックとは無関係に映像データ全体に対するものとして作成されていれば、全てのブロックに対して共通のメタデータが解析の対象となる。物体解析部２１、字幕解析部２２、発言内容解析部２３の解析の対象は、ブロック毎の映像データあるいはこれに付随した音声データとなる。 In the first to fifth cases, analysis by each analysis unit is performed on the entire video data. However, especially when the video data is long, it is possible to divide the video data into a plurality of blocks in chronological order and analyze the contents of each block as described above. In this case, the metadata to be analyzed by the meta-information analysis unit 24 is analyzed for each block if it is created corresponding to this block, and is created for the entire video data regardless of the block. If so, common metadata for all blocks will be analyzed. The objects analyzed by the object analysis unit 21, caption analysis unit 22, and statement content analysis unit 23 are video data for each block or audio data associated therewith.

図２は、第６～第１０のケースを説明する図である。ここで、経過時間は図中右側に向けて進行するものとし、映像データ及びこれに付随する音声データは、時系列でＡ～Ｇの７つのブロックに分割して設定されるものとする。 FIG. 2 is a diagram for explaining sixth to tenth cases. Here, it is assumed that the elapsed time advances toward the right side in the drawing, and that the video data and accompanying audio data are divided into seven blocks A to G in chronological order and set.

ここで、第１～第５のケースと同様の解析を、図２における各ブロックで行うことができる。ここでは、初めに一次解析が全てのブロックにおいて行われたものとし、第６～第８のケースでは、このうち、ブロックＤのみで前記のような二次解析が必要となり、他のブロックでは第１のケースと同様に二次解析は不要となった（一次解析のみで映像データの内容が決定された）ものとする。 Here, the same analysis as for the first to fifth cases can be performed for each block in FIG. Here, it is assumed that the primary analysis was first performed on all blocks, and in the sixth to eighth cases, only block D requires the secondary analysis as described above, and the other blocks require the secondary analysis. As in case 1, the secondary analysis is no longer necessary (the contents of the video data are determined only by the primary analysis).

第６のケースにおいては、ブロックＤで前記のような二次解析が行われ、その解析結果を用いた前記のような結果が認識される。この手順は、第１～第５のケースにおける解析の対象が映像データ、これに付随する音声データが図２のブロックＤにおけるものになった場合と等価である。すなわち、ブロックＤにおける一次解析が行われ、これによって内容特定情報が定まらなかった場合には、一次解析の結果から二次解析情報が作成され、これを用いてブロックＤを対象とした二次解析が行われ、一次解析の結果と二次解析の結果を用いて第２～第５のケースと同様の結果が得られる。この結果は他のブロックにおける結果と無関係である。 In the sixth case, a secondary analysis as described above is performed in block D, and the result as described above using the results of the analysis is recognized. This procedure is equivalent to the case where the object of analysis in the first to fifth cases is the video data and the accompanying audio data is the one in block D in FIG. That is, when the primary analysis is performed on block D and the content specifying information is not determined by this, the secondary analysis information is created from the results of the primary analysis, and is used for the secondary analysis targeting block D. is performed, and results similar to those of the second to fifth cases are obtained using the results of the primary analysis and the results of the secondary analysis. This result is independent of results in other blocks.

第７のケースにおいても、ブロックＤで一次解析が行われた結果、二次解析が必要となったものとする。ここでは、前記のように二次解析情報が作成された場合において、その直前の２つのブロックであるブロックＣ、Ｂに対しても、この二次解析情報を用いた二次解析が行われる。すなわち、ブロックＤと共に、二次解析が不要とされたブロックＣ、Ｂに対しても、新たに二次解析が行われ、この結果を用いて、ブロックＣ、Ｂに対しても新たな結果が得られ、この結果が前回の結果と異なる場合には、結果が新たなものに書き換えられる。映像データが時系列的に連続的な場合においては、ブロックＤにおける内容とその直前のブロックＣ、Ｂにおける内容とは関連がある蓋然性が高いため、こうした処理は有効である。 In the seventh case as well, it is assumed that the primary analysis of block D showed that a secondary analysis was required. Here, once the secondary analysis information has been created as described above, a secondary analysis using it is also performed on the two blocks immediately preceding block D, namely blocks C and B. That is, together with block D, a secondary analysis is newly performed on blocks C and B, for which a secondary analysis had been judged unnecessary, and new results are obtained for blocks C and B. If a new result differs from the previous result, the result is overwritten with the new one. When the video data is continuous in time, the content of block D is highly likely to be related to the content of the immediately preceding blocks C and B, so this processing is effective.

図２において、第８のケースにおいては、このようにブロックＤで得られた二次解析情報を用いた再度の解析が行われる対象はブロックＤの直後のブロックＥ、Ｆとされる。 In FIG. 2, in the eighth case, blocks E and F immediately following block D are subjected to re-analysis using the secondary analysis information obtained in block D in this manner.

第９のケースでは、同様の対象がブロックＤの前後のブロックＣ、Ｅとされる。これらの場合においても、同様に、ブロックＤと関連性が高いブロックでも再度の解析が行われるため、こうした処理は同様に有効である。 In the ninth case, similar objects are blocks C and E before and after block D. In these cases as well, blocks that are highly relevant to block D are also analyzed again, so such processing is similarly effective.

図２において、第１０のケースでは、ブロックＡ以外の全てのブロックが再度の解析の対象となっている。他のブロックと比べて先頭のブロックＡの内容に特殊性がある場合(例えばイントロダクションとなっている場合等）には、こうした設定は有効である。 In FIG. 2, in the tenth case, all blocks other than block A are subject to re-analysis. Such a setting is effective when the contents of the top block A are special compared to other blocks (for example, when it is an introduction).

第７～第１０のケースにおいて、再度の解析を行う（あるいは行わない）ブロックの範囲は、映像データの種類等に応じて適宜設定が可能である。このため、例えば、総合解析部２５は、映像データのファイル名、属性、ファイルサイズ（時間等）、あるいはメタデータにおける特定の項目の内容等に応じて、このように再度の解析を行うブロックの範囲を設定してもよい。なお、図２において、全てのブロックを再度の解析の対象とする場合は、映像ファイル全体としての解析を行う第１～第５のケースと等価である。 In the seventh to tenth cases, the range of blocks to be reanalyzed (or not reanalyzed) can be appropriately set according to the type of video data. For this reason, for example, the comprehensive analysis unit 25 selects a block to be analyzed again in this way according to the file name, attribute, file size (time, etc.) of the video data, or the content of a specific item in the metadata. A range may be set. Note that, in FIG. 2, when all blocks are to be re-analyzed, this is equivalent to the first to fifth cases in which the video file as a whole is analyzed.
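The block ranges of the sixth to tenth cases can be expressed with a single helper, sketched below in Python. The function name `blocks_to_reanalyze` and its parameters are hypothetical, and the sketch assumes that the triggering block itself is always re-analyzed together with its neighbours.

```python
def blocks_to_reanalyze(blocks, trigger, before=0, after=0, exclude=()):
    """Return the blocks that receive the secondary analysis.

    `blocks` is the time-ordered list of block names, `trigger` is the
    block whose primary analysis required a secondary analysis, `before`
    and `after` are how many neighbouring blocks to include on each side,
    and `exclude` lists blocks that are never re-analyzed.
    """
    i = blocks.index(trigger)
    lo = max(0, i - before)
    hi = min(len(blocks) - 1, i + after)
    return [b for b in blocks[lo:hi + 1] if b not in exclude]

blocks = list("ABCDEFG")
case6 = blocks_to_reanalyze(blocks, "D")                       # D only
case7 = blocks_to_reanalyze(blocks, "D", before=2)             # B, C, D
case8 = blocks_to_reanalyze(blocks, "D", after=2)              # D, E, F
case9 = blocks_to_reanalyze(blocks, "D", before=1, after=1)    # C, D, E
case10 = blocks_to_reanalyze(blocks, "D", before=len(blocks),
                             after=len(blocks), exclude={"A"})  # B to G
```

Choosing `before`, `after`, and `exclude` from the file attributes or metadata, as the text suggests, would then be a matter of looking these values up per type of video data.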

また、このように、映像データが時系列でブロック毎に分割される場合には、効率的に上記の解析を行うために、各種の手法を適用することができる。例えば、この映像データのブロック区分(ブロックの境界の設定）を、内容に応じて行うことができる。例えば、この区分は、例えば映像のシーンチェンジのタイミング、音声（ナレーション）やテロップの表示の登場のタイミング等に応じて設定することもできる。あるいは、メタデータの内容によってもこのタイミングを設定することができる。 Also, when the video data is divided into blocks in time series in this manner, various techniques can be applied to efficiently perform the above analysis. For example, it is possible to divide the video data into blocks (set boundaries between blocks) according to the content. For example, this division can be set according to the timing of a scene change of video, the timing of appearance of voice (narration) or telop display, and the like. Alternatively, this timing can also be set according to the contents of the metadata.
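As a rough illustration of content-dependent block division, the following Python sketch cuts a video of a given duration at a list of boundary timestamps. The function `split_into_blocks` is hypothetical, and how the timestamps are detected (scene changes, narration onsets, telop appearances, or metadata) is assumed to be handled elsewhere.

```python
def split_into_blocks(duration, boundaries):
    """Split the interval [0, duration] into blocks at the given timestamps.

    `boundaries` are hypothetical timestamps in seconds. Duplicates and
    timestamps outside the open interval (0, duration) are ignored.
    Returns a list of (start, end) pairs in time order.
    """
    cuts = sorted({t for t in boundaries if 0 < t < duration})
    edges = [0, *cuts, duration]
    return list(zip(edges[:-1], edges[1:]))
```

For a 100-second clip with boundaries at 30 and 60 seconds, this yields three blocks covering 0 to 30, 30 to 60, and 60 to 100 seconds.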

この場合、処理の効率化のために、上記のようなブロック毎に行う解析を、上記のタイミングに応じて選定されたブロックにおいてのみ行わせることができる。この場合には、上記の境界の直前、直後のブロックのみで上記の解析を行い、内容特定情報を定めることができる。 In this case, the analysis performed for each block as described above can be performed only in the blocks selected according to the timing described above, in order to improve the efficiency of the processing. In this case, the content specifying information can be determined by performing the above analysis only on the blocks immediately before and after the boundary.

あるいは、時間的に長いデータの解析を行う際には、例えば上記のようなタイミングを認識し、その直後のブロックでのみ一次解析を行わせ、この結果に基づいた二次解析情報を作成し、この二次解析情報に基づいた二次解析は、全てのブロック（あるいは一次解析が行われたブロックを含む複数のブロック）行わせることができる。これによって、時間的に長いデータの解析を効率的に行い、内容特定情報を定めることができる。 Alternatively, when analyzing long data, for example, the above timing is recognized, primary analysis is performed only in the block immediately after that, secondary analysis information is created based on this result, Secondary analysis based on this secondary analysis information can be performed on all blocks (or a plurality of blocks including the block on which the primary analysis has been performed). As a result, it is possible to efficiently analyze long data and determine content specifying information.

図３は、このビデオサーバ―１の動作を示すフローチャートである。ビデオサーバ―１における動作は図における制御部１０によって制御されるが、ここでは、前記のような内容特定情報の設定に関する動作のみについて記載されており、この動作の大部分は実際には情報認識部２０（特に総合解析部２５）の制御により行われる。 FIG. 3 is a flowchart showing the operation of the video server 1. The operation of the video server 1 is controlled by the control unit 10 in the figure, but only the operation related to setting the content specifying information described above is described here, and most of this operation is actually performed under the control of the information recognition unit 20 (in particular, the comprehensive analysis unit 25).

まず、制御部１０は、このような内容特定情報を設定するタスクを受信する（Ｓ１）と、これによって対象とする映像データ、及びこれに付随する音声データ、メタデータを認識し、記憶部１２からこれらを読み出す（Ｓ２）。あるいは、制御部１０は、収録部１１によってこれらのデータを入手する。 First, when the control unit 10 receives a task for setting such content specifying information (S1), the control unit 10 recognizes the target video data and accompanying audio data and metadata, and stores the storage unit 12 (S2). Alternatively, the control unit 10 obtains these data by the recording unit 11 .

次に、情報認識部２０は、一次解析処理部２６に、各解析部を用いて一次解析を行わせる（Ｓ３）。この場合には、各解析部に独立に解析を行わせる、あるいは前記のようにある解析部の解析の際に他の解析部による結果を利用してこの解析を行わせることができる。その解析部の各々から得られた結果に基づき、上記のように内容特定情報の決定を試み、内容特定情報が特定できた場合（Ｓ４：Ｙｅｓ）には、制御部１０は、この結果を出力する（Ｓ５）。この結果は、例えば表示部１４に出力させることができるが、記憶部１２に記憶させてもよく、この際にメタデータをこれに応じて書き換えてもよい。これによって、内容特定情報を定める動作は終了する。その後、この内容特定情報に対応する映像データの内容に関して編集処理等が必要である場合には、この編集処理を行わせてもよい。 Next, the information recognition section 20 causes the primary analysis processing section 26 to perform primary analysis using each analysis section (S3). In this case, each analysis section can be made to perform an analysis independently, or, as described above, when an analysis is made by one analysis section, this analysis can be made to use the results of other analysis sections. Based on the results obtained from each of the analysis units, determination of the content specifying information is attempted as described above, and if the content specifying information can be specified (S4: Yes), the control unit 10 outputs this result. (S5). This result can be output to the display unit 14, for example, but may be stored in the storage unit 12, and at this time, the metadata may be rewritten accordingly. This completes the operation of determining the content specifying information. After that, if editing processing or the like is necessary for the content of the video data corresponding to this content specifying information, this editing processing may be performed.

一方、このように内容特定情報が特定できなかった場合（Ｓ４：Ｎｏ）には、総合解析部２５は、一次解析の結果に基づき二次解析情報を作成し、これを二次解析情報記憶部２８に記憶させる（Ｓ６）。その後、総合解析部２５は、二次解析処理部２７にこの二次解析情報に基づいて各解析部に二次解析を行わせる（Ｓ７）。その後、総合解析部２５は、一次解析による結果（候補）と二次解析による結果（候補）の両方に基づいて、前記のように内容特定情報の決定を試み（Ｓ８）、内容特定情報が決定できた場合（Ｓ９：Ｙｅｓ）には、一次解析（Ｓ３）後に内容特定事項が定まった場合と同様の処理（Ｓ５）が行われ、この動作は終了する。 On the other hand, if the content specifying information cannot be specified in this way (S4: No), the comprehensive analysis unit 25 creates secondary analysis information based on the result of the primary analysis, and stores it in the secondary analysis information storage unit. 28 (S6). After that, the integrated analysis unit 25 causes the secondary analysis processing unit 27 to perform secondary analysis on each analysis unit based on this secondary analysis information (S7). After that, the integrated analysis unit 25 attempts to determine the content specifying information as described above based on both the results (candidates) of the primary analysis and the results (candidates) of the secondary analysis (S8), and the content specifying information is determined. If it is possible (S9: Yes), the same processing (S5) as when the content specifying items are determined after the primary analysis (S3) is performed, and this operation ends.

二次解析を行った（Ｓ７）後においても内容特定情報が定まらなかった場合（Ｓ９：Ｎｏ）には、内容特定情報となる可能性のある複数の候補を定めるか否かの問い合わせが表示部１４を介して行われる（Ｓ１０）。このように複数の候補を定める必要がない旨の回答が得られた場合（Ｓ１０：Ｎｏ）には、このように内容特定情報を定めることができなかった旨の警告が表示部１４で行われ（Ｓ１１）、この動作は終了する。一方、複数の候補を定める旨の回答が得られた場合（Ｓ１０：Ｙｅｓ）には、前記のように、各候補に確度が数値化されて付与され（Ｓ１２）、表示され（Ｓ１３）、処理は終了する。 If the content specifying information is not determined even after the secondary analysis is performed (S7) (S9: No), an inquiry as to whether or not to determine a plurality of candidates that may become the content specifying information is displayed on the display unit. 14 (S10). If a reply to the effect that it is not necessary to define a plurality of candidates is obtained (S10: No), a warning is displayed on the display unit 14 to the effect that the content specifying information could not be determined in this way. (S11), this operation ends. On the other hand, when a reply is obtained to the effect that a plurality of candidates are determined (S10: Yes), as described above, the probability is quantified and given to each candidate (S12), displayed (S13), and processed. ends.
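The control flow of steps S3 to S13 can be summarized in a short Python sketch. Everything here is an assumption made for illustration: the function `determine_content_info`, its callback parameters, and the returned dictionaries are hypothetical stand-ins for the units described in the text, and steps S1 and S2 (receiving the task and reading the data) are assumed to have happened already.

```python
def determine_content_info(primary, make_secondary_info, secondary,
                           decide, want_candidates, rank):
    """Run the primary/secondary analysis flow and report the outcome.

    All arguments are hypothetical callbacks:
      primary()                  -> primary analysis results (S3)
      make_secondary_info(first) -> secondary analysis information (S6)
      secondary(info)            -> secondary analysis results (S7)
      decide(first, second)      -> content specifying info or None (S4, S8)
      want_candidates()          -> True if ranked candidates are wanted (S10)
      rank(first, second)        -> candidates with numeric accuracy (S12)
    """
    first = primary()                                   # S3
    info = decide(first, None)                          # S4
    if info is not None:
        return {"status": "determined", "info": info}   # S5
    hints = make_secondary_info(first)                  # S6
    second = secondary(hints)                           # S7
    info = decide(first, second)                        # S8, S9
    if info is not None:
        return {"status": "determined", "info": info}   # S5
    if not want_candidates():                           # S10
        return {"status": "warning"}                    # S11
    return {"status": "candidates",
            "ranked": rank(first, second)}              # S12, S13
```

Passing the steps in as callbacks keeps the sketch independent of how each analysis unit is actually implemented.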

上記のように一次解析と二次解析の結果から総合的に内容特定情報を定める際には、処理の効率化のために、総合解析部２５は、以下のような動作を行わせることができる。
（１）前記の通り、時間的に長い映像データにおいては、映像の変化やテロップの有無を物体解析部２１や字幕解析部２２が認識することができ、このタイミングに応じて図２におけるブロックの区分を行うことができる。すなわち、図２におけるブロックの区分も行うことができる。
（２）図２における第６～第１０のケースにおける第２解析の対象となるブロックの設定も、総合解析部２５が行うことができる。この場合、総合解析部２５は、映像データの種類等に応じてこの設定を行うことができ、映像データの種類は、例えばメタ情報解析部２４がメタデータを解析することによって認識することができる。
（３）また、総合解析部２５は、各解析部の結果に対する優先度の設定も行うことができる。この設定も映像データの種類等に応じて行わせることができる。例えば、映像データが字幕入りの映画（洋画）である場合には、字幕解析部２２に対する優先度を高めることができ、このための映像データの種類の認識も、例えばメタ情報解析部２４によって行うことができる。この場合には、例えば各解析部によって認識された候補に対して確度が数値化された場合に、この確度に対して重み付け係数を乗じた値を新たな確度として設定し、優先度の高い解析部に対してはこの重み付け係数の値を大きく設定すればよい。また、逆に、この場合に優先度を低くする解析部においては、この重み付け係数を低く設定すればよい。例えば、前記のように映像データが字幕入りの映画（洋画）である場合には、前記の通り字幕解析部２２に対する重み付け係数を大きくすると同時に、発言内容解析部２３に対する重み付け係数を小さくすることができる。この際、例えば発言内容解析部２３に対する重み付け係数を零としてもよく、この場合には、発言内容解析部２３による解析を行う必要がない。
（４）また、総合解析部２５は、一次解析の結果からもこのような優先度を設定することができる。例えば、一次解析によって一つの解析部により複数の候補が認識され、結果的に全ての解析部によって多くの候補が認識されたが、一次解析によっては内容特定情報を定めることができなかった場合において、内容特定情報は特定できなかったが候補となった事項の共通性（例えばカテゴリーが共通である等）が認識された場合には、これに応じて優先度を定めることができる。この場合、例えば、解析部のうちで一次解析においてこの共通性を有する候補が最も多く認識されたものに対する優先度を高め、この共通性を有する候補が認識されなかったものの優先度を低くすることができる。これによって、二次解析後における内容特定情報の決定が容易となる。こうした場合には、優先度（重み付け係数）も前記の二次解析情報に含ませることができる。 When comprehensively determining the content specifying information from the results of the primary analysis and the secondary analysis as described above, the comprehensive analysis unit 25 can cause the following operations to be performed in order to improve processing efficiency. .
(1) As described above, for long video data, the object analysis unit 21 and the subtitle analysis unit 22 can recognize changes in the video and the presence or absence of telops, and the blocks in FIG. 2 can be divided according to this timing. That is, the comprehensive analysis unit 25 can also perform the block division in FIG. 2.
(2) The comprehensive analysis unit 25 can also set the blocks to be subjected to the secondary analysis in the sixth to tenth cases in FIG. 2. In this case, the comprehensive analysis unit 25 can make this setting according to the type of video data or the like, and the type of video data can be recognized, for example, by the meta-information analysis unit 24 analyzing the metadata.
(3) The comprehensive analysis unit 25 can also set priorities for the results of the analysis units. This setting can also be made according to the type of video data or the like. For example, if the video data is a subtitled (foreign) movie, the priority of the subtitle analysis unit 22 can be raised, and the type of video data can be recognized for this purpose by, for example, the meta-information analysis unit 24. In this case, for example, when an accuracy has been quantified for each candidate recognized by each analysis unit, the value obtained by multiplying this accuracy by a weighting coefficient is set as the new accuracy, and a large weighting coefficient is set for a high-priority analysis unit. Conversely, a small weighting coefficient is set for an analysis unit whose priority is to be lowered. For example, when the video data is a subtitled (foreign) movie as above, the weighting coefficient for the subtitle analysis unit 22 can be increased while the weighting coefficient for the statement content analysis unit 23 is decreased. The weighting coefficient for the statement content analysis unit 23 may even be set to zero, in which case the statement content analysis unit 23 need not perform its analysis.
(4) The comprehensive analysis unit 25 can also set such priorities from the results of the primary analysis. For example, suppose that in the primary analysis a single analysis unit recognized multiple candidates, so that the analysis units as a whole recognized many candidates, but the primary analysis could not determine the content specifying information. If, although the content specifying information could not be determined, a commonality among the candidates (for example, a shared category) is recognized, the priorities can be set accordingly. In this case, for example, the priority of the analysis unit that recognized the most candidates having this commonality in the primary analysis can be raised, and the priority of an analysis unit that recognized no such candidate can be lowered. This makes it easier to determine the content specifying information after the secondary analysis. In such cases, the priorities (weighting coefficients) can also be included in the secondary analysis information.
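The weighting described in item (3) can be illustrated with a small Python sketch. The text only states that each accuracy is multiplied by a per-unit weighting coefficient; summing the weighted values per candidate across units is an added assumption, and the function `weighted_scores`, the unit names, and the numbers in the example are hypothetical.

```python
def weighted_scores(candidates, weights, default_weight=1.0):
    """Combine per-unit candidate accuracies using weighting coefficients.

    `candidates` maps a unit name to {label: accuracy}; `weights` maps a
    unit name to its coefficient. A unit with weight zero is skipped, which
    mirrors the remark that such a unit need not be run at all.
    Assumption: weighted accuracies for the same label are summed.
    """
    totals = {}
    for unit, scored in candidates.items():
        weight = weights.get(unit, default_weight)
        if weight == 0:
            continue
        for label, accuracy in scored.items():
            totals[label] = totals.get(label, 0.0) + weight * accuracy
    return totals

# Subtitled foreign film: trust subtitles more, ignore speech recognition.
example = weighted_scores(
    {"subtitle": {"Skytree": 0.5},
     "speech": {"sky fishing": 0.75},
     "meta": {"Skytree": 0.25}},
    {"subtitle": 2.0, "speech": 0.0},
)
```

In this example only "Skytree" remains as a candidate, because the speech unit's weight of zero removes "sky fishing" from consideration.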

Further, particularly when recognizing the content specifying information (candidates) from audio data, efficiency can be improved by the following operations.
(5) As described above, the statement content analysis unit 23 can identify speakers on the basis of a database of voiceprints and the like, and can thereby recognize the statement content more appropriately, or correctly recognize words and phrases that were only incompletely recognized. The speaker can also be recognized by, for example, the meta information analysis unit 24 analyzing the metadata (for example, the cast of a movie).
(6) Also as described above, the statement content can be recognized more appropriately by, for example, starting speech recognition from the point in the video data at which the speaker opens his or her mouth. The point at which the speaker opens his or her mouth can be recognized by the object analysis unit 21. If a telop is displayed at this time, a candidate recognized by the subtitle analysis unit 22 as a word or phrase in the telop is likely to coincide with a candidate recognized by the statement content analysis unit 23, and such a shared candidate is likely to be the content specifying information.
(7) A database stored in advance in the storage unit 12 is effective for identifying the speaker as described above. For this reason, when a speaker whose statement content has been confirmed is not registered in this database, or when the speaker H1 is recognized only with an accuracy L·h1 too low to be regarded as a match with the database, the comprehensive analysis unit 25 can newly register the speaker's voiceprint data V·h (audio data) and the like together with the accuracy L·H1 of the speaker being H1. In this case, if the video data is a movie and nationality information is recognized from the metadata or the like, it is preferable to store the nationality information in association with the entry at the same time. The database can then be used to make the analysis of other video data more efficient from then on. This operation can be performed regardless of whether the candidate recognized here by the statement content analysis unit 23 was appropriate as the content specifying information.
(8) When the voiceprint data V·h is detected on an occasion other than the one on which it was registered, the video is accompanied by the face of a speaker H2 or a telop showing a person's name, and an accuracy L·h2 is obtained as the analysis result, this accuracy is evaluated together with the accuracy L·h1. If sufficient accuracy is shown, it is determined that the speaker H1 or H2 can be associated with the voiceprint data V·h, and the association is registered.
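The registration and re-evaluation in (7) and (8) might look roughly like the sketch below. The description does not state how the accuracies L·h1 and L·h2 are "evaluated together" or what counts as "sufficient", so the threshold, the rule for combining two observations, and all class and function names here are assumptions made purely for illustration.

```python
# Hypothetical sketch of (7) and (8); the threshold and the combination rule
# are assumptions, since the description leaves both open.
from dataclasses import dataclass
from typing import Optional

MATCH_THRESHOLD = 0.8  # assumed accuracy needed to treat a speaker as identified


@dataclass
class VoiceprintEntry:
    voiceprint: bytes                  # voiceprint data V·h
    speaker: str                       # tentative speaker, e.g. "H1"
    accuracy: float                    # accuracy for that speaker, e.g. L·h1
    nationality: Optional[str] = None  # stored when metadata provides it
    confirmed: bool = False


def register_tentative(db, key, voiceprint, speaker, accuracy, nationality=None):
    """(7): store a voiceprint whose speaker could not be matched confidently."""
    if accuracy < MATCH_THRESHOLD:
        db[key] = VoiceprintEntry(voiceprint, speaker, accuracy, nationality)
    return db.get(key)


def reevaluate(db, key, speaker, accuracy):
    """(8): evaluate a later observation together with the stored accuracy."""
    entry = db[key]
    if speaker == entry.speaker:
        # Assumed rule: treat the two observations as independent evidence.
        combined = 1 - (1 - entry.accuracy) * (1 - accuracy)
    else:
        # Assumed rule: the names conflict, so keep the better-supported one.
        combined = max(entry.accuracy, accuracy)
        if accuracy > entry.accuracy:
            entry.speaker = speaker
    entry.accuracy = combined
    entry.confirmed = combined >= MATCH_THRESHOLD
    return entry


db = {}
register_tentative(db, "V_h", b"\x00\x01", "H1", 0.6, nationality="JP")
entry = reevaluate(db, "V_h", "H1", 0.7)
```

Under these assumptions, two weak observations of the same speaker (0.6 and 0.7) combine to an accuracy above the threshold, so the association is marked as confirmed; a different combination rule would give a different outcome.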

The present invention has been described above on the basis of an embodiment. Those skilled in the art will understand that this embodiment is illustrative, that various modifications can be made to the combinations of its constituent elements, and that such modifications also fall within the scope of the present invention.

1 Data server (information processing device)
10 Control unit
11 Recording unit
12 Storage unit
13 Operation unit
14 Display unit
20 Information recognition unit
21 Object analysis unit (analysis unit)
22 Subtitle analysis unit (analysis unit)
23 Statement content analysis unit (analysis unit)
24 Meta information analysis unit (analysis unit)
25 Comprehensive analysis unit
26 Primary analysis processing unit
27 Secondary analysis processing unit
28 Secondary analysis information storage unit

Claims

1. An information processing device for automatically recognizing content specifying information, which is information serving as a keyword corresponding to the content of video data, the information processing device comprising:
an analysis unit that recognizes candidates for the content specifying information from an image or character display in the video data, audio in audio data accompanying the video data, and meta information accompanying the video data; and
an information recognition unit that
causes the analysis unit to perform a primary analysis of selecting the candidates from each of the image, the character display, the audio, and the meta information, and searches for the content specifying information based on a plurality of the candidates,
sets, for the analysis unit, secondary analysis information specifying analysis conditions based on the results of the primary analysis when the content specifying information cannot be determined from the candidates obtained by the primary analysis,
causes the analysis unit to perform a secondary analysis of reselecting the candidates in at least one of the image, the character display, the audio, and the meta information based on the secondary analysis information, and
searches for the content specifying information based on the analysis result of the primary analysis and the analysis result of the secondary analysis.

2. The information processing device according to claim 1, wherein, when searching for the content specifying information, the information recognition unit gives a high priority to the analysis result of the meta information in the primary analysis and the secondary analysis.

3. The information processing device according to claim 2, wherein the information recognition unit determines whether the candidates selected from each of the image, the character display, and the audio in the primary analysis match or do not match.

4. The information processing device according to claim 3, wherein the information recognition unit displays a plurality of the candidates, each assigned a numerical degree of certainty, when the content specifying information cannot be determined after the secondary analysis.

5. The information processing device according to claim 3, wherein the information recognition unit issues a warning when the content specifying information cannot be determined after the secondary analysis.

6. The information processing device according to any one of claims 1 to 5, wherein the information recognition unit presets a priority for each of the analysis results of the image, the character display, and the audio, and determines the content specifying information based on the analysis result of the primary analysis, the analysis result of the secondary analysis, and the priorities.

7. The information processing device according to any one of claims 1 to 6, wherein
the video data and the audio data are divided into a plurality of blocks in time series, and the content specifying information can be recognized for each block, and
the information recognition unit switches, according to the type of the video data, between
an operation of performing the secondary analysis only in one block based on the secondary analysis information obtained from the result of the primary analysis in that block, and
an operation of performing the secondary analysis in the other blocks as well as in the one block, based on the secondary analysis information obtained from the result of the primary analysis in the one block.