JP2005332206A

JP2005332206A - Video event discrimination device, program thereof, learning data generation device for video event discrimination, and program thereof

Info

Publication number: JP2005332206A
Application number: JP2004149902A
Authority: JP
Inventors: Takahiro Mochizuki; 貴裕望月; Makoto Tadenuma; 眞蓼沼
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2004-05-20
Filing date: 2004-05-20
Publication date: 2005-12-02
Anticipated expiration: 2024-05-20
Also published as: JP4546762B2

Abstract

PROBLEM TO BE SOLVED: To provide a video event discrimination device that discriminates an event occurring within a video from each scene of the video, without using additional information. SOLUTION: A video event discrimination device 2 is provided with: a scene division means 12 for dividing an input video into scenes; a feature quantity extraction means 14 for extracting video feature quantities from a plurality of frame images included in each scene; a feature quantity digitizing means 21 for converting the video feature quantities to numerical data by referring to a feature quantity classification database in which representative values of similar video feature quantities are associated in advance with numerical data for classifying the representative values; and an event specifying means 22 for specifying the classification of the event corresponding to the data string of numerical data produced by the feature quantity digitizing means 21, by referring to an event database in which classifications of events are associated in advance with scene digitized strings that represent a plurality of consecutive scenes as data strings of numerical data. COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a video event discrimination device and a program therefor, which discriminate events occurring in a video, and to a learning data generation device for video event discrimination and a program therefor.

In recent years, various event discrimination techniques have been proposed for discriminating events that occur in video such as broadcast programs. For example, as a first event discrimination technique, a technique has been disclosed that detects the occurrence of an event in a video, and discriminates the type of that event, by detecting a change in character information displayed at a specific position on the screen (see Patent Document 1). This technique exploits the fact that, in baseball broadcast video, character information indicating the progress of the game, such as "inning", "score", and "number of outs", is displayed at a specific position on the screen. By detecting changes in that character information, it discriminates events such as a change of inning or the scoring of a run.

As a second event discrimination technique, a technique has been disclosed that discriminates events in sports broadcast video using relay data (score information and the like) distributed via the Internet (see Patent Document 2). In this technique, baseball broadcast video is recorded continuously, and when a scoring scene is recognized from the score information distributed via the Internet, the recorded video is played back from a point a fixed time (for example, 10 minutes) earlier. In this way, the second event discrimination technique discriminates scoring scenes, which are the events of a baseball broadcast, on the basis of relay data linked to the video.
JP 2000-132563 A (paragraphs 0048 to 0049, FIG. 5)
JP 2003-174609 A (paragraphs 0014 to 0026, FIGS. 1 to 4)

The first event discrimination technique described above discriminates events on the basis of character information specific to the video, so it cannot discriminate events unless that character information is presented on the screen.
The second event discrimination technique discriminates events occurring in a video by acquiring information linked to the video (relay data) from the Internet or the like. It therefore requires a means for acquiring information other than the video, which complicates the device configuration. Furthermore, although the second technique can discriminate events in video broadcast in real time by acquiring the corresponding relay data, it cannot discriminate events from video that has been stored by recording or the like.

Thus, the first and second event discrimination techniques discriminate events in a video on the basis of information added to the video (character information or relay data), and so cannot discriminate events when that additional information is unavailable. There is therefore growing demand for a technique that can discriminate events from the video scenes themselves.

The present invention has been made to solve the above problems. Its object is to provide a video event discrimination device and a program therefor, which discriminate events occurring in a video from the scenes of that video without using additional information, and a learning data generation device for video event discrimination and a program therefor.

The present invention was devised to achieve the above object. First, the video event discrimination device according to claim 1 discriminates the type of event occurring in an input video on the basis of the video feature quantity of each scene of the video, and comprises feature quantity classification database storage means, event database storage means, scene division means, feature quantity extraction means, feature quantity digitizing means, and event specifying means.

With this configuration, the video event discrimination device uses the scene division means to detect points at which the screen composition of the input video changes substantially (scene changes) and divides the video into scenes. The device then uses the feature quantity extraction means to extract, from the plurality of frame images included in each scene obtained by the scene division means, a video feature quantity representing the characteristics of the video in that scene.
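
The scene division step can be illustrated with a minimal sketch. The text above does not say how a scene change is measured, so the mean absolute pixel difference, the `threshold` parameter, and the names `frame_distance` and `split_into_scenes` are all assumptions made for illustration, not the method of the invention.

```python
from typing import List, Sequence

# Assumption: a frame is reduced to a flat sequence of pixel intensities.
Frame = Sequence[float]


def frame_distance(a: Frame, b: Frame) -> float:
    """Mean absolute difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def split_into_scenes(frames: List[Frame], threshold: float) -> List[List[int]]:
    """Group frame indices into scenes.

    A new scene starts wherever two consecutive frames differ by more
    than `threshold` (a hypothetical stand-in for "the screen
    composition changes substantially").
    """
    if not frames:
        return []
    scenes = [[0]]
    for i in range(1, len(frames)):
        if frame_distance(frames[i - 1], frames[i]) > threshold:
            scenes.append([i])      # scene change: open a new scene
        else:
            scenes[-1].append(i)    # the current scene continues
    return scenes


frames = [[0, 0, 0], [0, 1, 0], [9, 9, 9], [9, 8, 9]]
print(split_into_scenes(frames, threshold=3.0))
```

Each returned scene is a list of frame indices, which is the unit from which a per-scene feature quantity would then be extracted.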

Further, the device uses the feature quantity digitizing means to convert the video feature quantity extracted by the feature quantity extraction means into numerical data (a cluster value) that identifies the class into which video feature quantities have been classified (clustered) in advance. The numerical data for these classes is stored in advance in the feature quantity classification database storage means, as a feature quantity classification database in which it is associated with video feature quantities. By referring to this database, the feature quantity digitizing means can represent each scene of the video by simplified numerical data.
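
As a rough illustration of this digitizing step, the sketch below maps a feature vector to the cluster value of its nearest representative. Treating the feature quantity as a numeric vector, using Euclidean distance, and the names `quantize_feature` and `feature_db` are assumptions; the text only states that the database is consulted.

```python
import math
from typing import Dict, Sequence


def quantize_feature(feature: Sequence[float],
                     feature_db: Dict[int, Sequence[float]]) -> int:
    """Return the cluster value whose representative is closest to `feature`.

    `feature_db` maps cluster value -> representative feature vector
    (a hypothetical in-memory form of the feature quantity
    classification database).
    """
    return min(feature_db, key=lambda c: math.dist(feature, feature_db[c]))


# Hypothetical database with three classes.
feature_db = {1: (0.0, 0.0), 2: (10.0, 10.0), 3: (0.0, 10.0)}
print(quantize_feature((1.0, 9.0), feature_db))
```

A whole scene is thereby reduced to a single integer, which keeps the later comparison against the event database cheap.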

The device then uses the event specifying means to specify the type of event corresponding to the data string of numerical data produced by the feature quantity digitizing means. The event types are stored in advance in the event database storage means, as an event database in which each event type is associated with a scene digitized string, that is, a plurality of consecutive scenes represented as a data string of numerical data. By referring to this event database, the event specifying means can specify the type of event from a scene digitized string.
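
The event specifying step amounts to looking up a string of cluster values in the event database. The sketch below assumes an exact-match lookup and invents the cluster values and event labels; how near matches are handled is not stated in this passage.

```python
from typing import Dict, Optional, Sequence, Tuple

# Hypothetical form of the event database:
# scene digitized string -> event type.
EventDB = Dict[Tuple[int, ...], str]


def specify_event(cluster_values: Sequence[int],
                  event_db: EventDB) -> Optional[str]:
    """Return the event type registered for this string of cluster
    values, or None when the string is not in the database."""
    return event_db.get(tuple(cluster_values))


# Illustrative entries only; the numbers carry no meaning.
event_db: EventDB = {
    (3, 7, 2, 5): "double",
    (3, 7, 9): "home run",
}
print(specify_event([3, 7, 2, 5], event_db))
print(specify_event([1, 1], event_db))
```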

The video event discrimination device according to claim 2 is the video event discrimination device according to claim 1, further comprising reference image storage means, wherein the scene division means comprises event start detection means.

With this configuration, the video event discrimination device stores in advance, in the reference image storage means, a reference image that indicates the start of an event. The event start detection means of the scene division means then detects, among the frame images of the input video, a frame image similar to the reference image as the start of an event. As a result, in addition to scene changes, a frame image at which an event to be discriminated starts is also treated as the start of a scene.

For example, when the actions that follow a batter stepping into the batter's box in baseball broadcast video are to be discriminated as an event, the reference image is an image of the batter standing in the batter's box. With this reference image stored in the reference image storage means, the event start detection means can determine that a new scene begins, within what would otherwise be a single scene, at the point where the batter steps into the box.
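
Event start detection against a reference image can be sketched as follows. The similarity measure (mean absolute pixel difference), the `max_distance` threshold, and the choice to report only the first frame of each run of similar frames are assumptions made for illustration.

```python
from typing import List, Sequence

# Assumption: a frame is a flat sequence of pixel intensities.
Frame = Sequence[float]


def frame_distance(a: Frame, b: Frame) -> float:
    """Mean absolute difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def find_event_starts(frames: List[Frame], reference: Frame,
                      max_distance: float) -> List[int]:
    """Return indices of frames where the video becomes similar to the
    reference image, i.e. the first frame of each run of similar frames."""
    starts = []
    previously_similar = False
    for i, frame in enumerate(frames):
        similar = frame_distance(frame, reference) <= max_distance
        if similar and not previously_similar:
            starts.append(i)        # treat this frame as a scene start
        previously_similar = similar
    return starts


reference = [1, 1]  # stands in for "batter standing in the batter's box"
frames = [[5, 5], [5, 5], [1, 1], [1, 1], [5, 5], [1, 0]]
print(find_event_starts(frames, reference, max_distance=0.5))
```

The returned indices would be merged with ordinary scene-change boundaries so that each event begins at the head of a scene.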

The video event discrimination program according to claim 3 causes a computer to function as scene division means, feature quantity extraction means, feature quantity digitizing means, and event specifying means, in order to discriminate the type of event occurring in an input video on the basis of the video feature quantity of each scene of the video.

With this configuration, the video event discrimination program uses the scene division means to detect points at which the screen composition of the input video changes substantially (scene changes) and divides the video into scenes. The program then uses the feature quantity extraction means to extract, from the plurality of frame images included in each scene obtained by the scene division means, a video feature quantity representing the characteristics of the video in that scene.

Further, the program uses the feature quantity digitizing means to convert the video feature quantity extracted by the feature quantity extraction means into numerical data (a cluster value) identifying its class, by referring to a feature quantity classification database in which representative values of similar video feature quantities are associated in advance with numerical data for classifying those representative values.

The program then uses the event specifying means to specify the type of event corresponding to the data string of numerical data produced by the feature quantity digitizing means, by referring to an event database in which event types are associated in advance with scene digitized strings, each of which represents a plurality of consecutive scenes as a data string of numerical data.

The learning data generation device for video event discrimination according to claim 4 generates the feature quantity classification database and the event database, which are the learning data used by the video event discrimination device according to claim 1 or claim 2, and comprises scene division means, feature quantity extraction means, feature quantity classification means, scene digitizing means, scene video reproduction means, and event setting means.

With this configuration, the learning data generation device uses the scene division means to detect points at which the screen composition of the input video changes substantially (scene changes) and divides the video into scenes. The device then uses the feature quantity extraction means to extract, from the plurality of frame images included in each scene obtained by the scene division means, a video feature quantity representing the characteristics of the video in that scene.

Further, the learning data generation device uses the feature quantity classification means to classify the video feature quantities extracted by the feature quantity extraction means, associating numerical data with each representative value of similar video feature quantities. This representative value (the representative video feature quantity) stands for all the video feature quantities in the class into which they have been classified. It can be, for example, the mean of all the video feature quantities belonging to the class, or the video feature quantity closest to that mean. In this way, the feature quantity classification means associates representative values of similar video feature quantities with numerical data indicating their classes, and thereby generates the feature quantity classification database used by the video event discrimination device.
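
The two choices of representative value named above, the class mean or the member closest to the mean, can be sketched as follows. The clustering step that forms the class is not shown, and representing feature quantities as numeric tuples is an assumption; `mean_vector` and `closest_member` are hypothetical helper names.

```python
import math
from typing import List, Tuple

Feature = Tuple[float, ...]


def mean_vector(features: List[Feature]) -> Feature:
    """Component-wise mean of all feature vectors in one class."""
    n = len(features)
    return tuple(sum(column) / n for column in zip(*features))


def closest_member(features: List[Feature]) -> Feature:
    """The actual class member that lies nearest to the class mean."""
    centre = mean_vector(features)
    return min(features, key=lambda f: math.dist(f, centre))


# One hypothetical class of three 2-D feature vectors.
cluster = [(0.0, 0.0), (2.0, 0.0), (4.0, 3.0)]
print(mean_vector(cluster))     # representative as the mean
print(closest_member(cluster))  # representative as a real member
```

Using the closest member guarantees that the representative is a feature quantity actually observed in the training video, whereas the mean may not correspond to any real scene.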

The learning data generation device then uses the scene digitizing means to associate, for each scene, the frame image numbers that identify the frame images included in the scene with the numerical data of the video feature quantity classified by the feature quantity classification means. This establishes which scene is represented by which numerical data.
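
One way to picture the association made by the scene digitizing means is a per-scene record holding frame image numbers and a cluster value. The record layout and the names `SceneRecord` and `digitize_scenes` are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SceneRecord:
    scene_id: int
    frame_numbers: List[int]  # frame image numbers belonging to the scene
    cluster_value: int        # numerical data assigned to the scene


def digitize_scenes(scenes: Sequence[Sequence[int]],
                    cluster_values: Sequence[int]) -> List[SceneRecord]:
    """Pair each scene's frame numbers with its cluster value."""
    return [
        SceneRecord(scene_id, list(frame_numbers), value)
        for scene_id, (frame_numbers, value)
        in enumerate(zip(scenes, cluster_values))
    ]


records = digitize_scenes([[0, 1], [2, 3, 4]], [3, 7])
for record in records:
    print(record)
```

Keeping the frame numbers in the record is what later allows a scene to be played back for labelling.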

The learning data generation device also uses the scene video reproduction means to play back scenes on the basis of the per-scene frame image numbers associated by the scene digitizing means. When event identification information indicating the type of event is input for a plurality of consecutive played-back scenes, the event setting means associates that event identification information with the scene digitized string, that is, the data string of numerical data of the video feature quantities corresponding to those scenes. In this way, the event setting means associates event types (event identification information) with scene digitized strings, and thereby generates the event database used by the video event discrimination device.
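
The event setting step can be sketched as building a mapping from scene digitized strings to operator-supplied labels. The input shapes, the sample values, and the name `build_event_db` are hypothetical; playback and the operator interface are omitted.

```python
from typing import Dict, List, Sequence, Tuple


def build_event_db(
    scene_values: Dict[int, int],
    labels: Sequence[Tuple[str, List[int]]],
) -> Dict[Tuple[int, ...], str]:
    """Build an event database from operator labels.

    scene_values: scene id -> cluster value of that scene.
    labels: (event identification information, consecutive scene ids)
            as entered by the operator after watching the scenes.
    """
    event_db = {}
    for event_id, scene_ids in labels:
        digitized = tuple(scene_values[s] for s in scene_ids)
        event_db[digitized] = event_id
    return event_db


# Illustrative values only.
scene_values = {0: 3, 1: 7, 2: 2, 3: 5, 4: 3, 5: 7, 6: 9}
labels = [("double", [0, 1, 2, 3]), ("home run", [4, 5, 6])]
print(build_event_db(scene_values, labels))
```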

The learning data generation device for video event discrimination according to claim 5 is the learning data generation device according to claim 4, further comprising reference image storage means, wherein the scene division means comprises event start detection means.

With this configuration, the learning data generation device stores in advance, in the reference image storage means, a reference image that indicates the start of an event. The event start detection means of the scene division means then detects, among the frame images of the input video, a frame image similar to the reference image as the start of an event. As a result, in addition to scene changes, a frame image at which an event to be discriminated starts is also treated as the start of a scene.

The learning data generation program for video event discrimination according to claim 6 causes a computer to function as scene division means, feature quantity extraction means, feature quantity classification means, scene digitizing means, scene video reproduction means, and event setting means, in order to generate the feature quantity classification database and the event database, which are the learning data used by the video event discrimination device according to claim 1 or claim 2.

With this configuration, the learning data generation program uses the scene division means to detect points at which the screen composition of the input video changes substantially (scene changes) and divides the video into scenes. The program then uses the feature quantity extraction means to extract, from the plurality of frame images included in each scene obtained by the scene division means, a video feature quantity representing the characteristics of the video in that scene.

Further, the learning data generation program uses the feature quantity classification means to classify the video feature quantities extracted by the feature quantity extraction means, associating numerical data with each representative value of similar video feature quantities, and thereby generates the feature quantity classification database used by the video event discrimination device.
The program then uses the scene digitizing means to associate, for each scene, the frame image numbers that identify the frame images included in the scene with the numerical data of the video feature quantity classified by the feature quantity classification means.

The learning data generation program also uses the scene video reproduction means to play back scenes on the basis of the per-scene frame image numbers associated by the scene digitizing means. When event identification information indicating the type of event is input for a plurality of consecutive played-back scenes, the event setting means associates that event identification information with the scene digitized string, that is, the data string of numerical data of the video feature quantities corresponding to those scenes, and thereby generates the event database used by the video event discrimination device.

According to the invention of claim 1 or claim 3, a feature quantity is extracted for each scene of a video, the video is represented as a data string of numerical data indicating the classes into which those feature quantities are classified, and the type of event occurring in the video can be discriminated by referring to an event database in which data strings of numerical data are associated in advance with events in video. Because the invention discriminates the type of event from a simplified data string of numerical data that captures the features of the video, it can discriminate events at high speed and does not need information other than the video (additional information such as character information). The invention can therefore discriminate events in both real-time video and stored video.

According to the invention of claim 2, a scene can be divided, as a scene change, at the point where a frame image similar to the reference image indicating the start of an event appears in the input video. That is, the invention can identify the first frame image of the scene in which an event occurs. A data string of numerical data is therefore generated from the feature quantities of video that reliably begins at the frame image where the event starts, which improves the accuracy of event discrimination.

According to the invention of claim 4 or claim 6, a feature quantity classification database can be generated by extracting the video feature quantity of each scene of a video and classifying (clustering) those video feature quantities. The invention can also generate an event database in which the type of an event composed of a plurality of scenes is associated with a data string formed by concatenating the numerical data of the individual scenes. By using the feature quantity classification database and the event database, the video event discrimination device can represent a video as a data string of clustered numerical data and discriminate events on the basis of that data string.

According to the invention of claim 5, the learning data generation device for video event discrimination can divide a scene, as a scene change, at the point where a frame image similar to the reference image indicating the start of an event is input. That is, the invention can identify the first frame image of the scene in which an event occurs. The numerical string data is therefore generated from the feature quantities of video that reliably begins at the frame image where the event starts, which improves the accuracy of event discrimination in the video event discrimination device.

Embodiments of the present invention are described below with reference to the drawings.
[Outline of the video event discrimination method]
First, with reference to FIG. 1, an outline is given of the method by which the video event discrimination device according to the present invention discriminates, from a video, the events occurring in that video. FIG. 1 is an explanatory diagram outlining the video event discrimination method. Here, an event is a series of scenes in a video that carries a particular meaning, for example the scenes in baseball broadcast video in which a "home run" or a "double" occurs. FIG. 1 shows an example in which a "double" is discriminated as an event from baseball broadcast video.

Here, the video V consists of scenes V₁ to V₄, each delimited by a change in camera composition: V₁ is a scene in which the pitcher throws the ball, V₂ a scene in which the ball hit by the batter flies to the outfield, V₃ a scene in which the runner rounds first base, and V₄ a scene in which the runner stops on second base.

The video event discrimination method then extracts a video feature quantity Vc from each of the scenes V₁ to V₄. Here, the video feature quantity Vc is shown as information on rectangular regions that simplify the video V scene by scene. For example, scene V₁ is simplified to information on seven rectangular regions (rectangle 1 to rectangle 7).

Next, on the basis of the feature quantity classification database 10a, in which similar video feature quantities Vc have been divided into classes (clustered) in advance, the method digitizes the video feature quantity Vc of each scene as a cluster C_mn (1 ≤ C_mn ≤ N), which is a cluster number. Then, on the basis of the event database 11a, in which types of video event (identification information) are associated in advance with scene digitized strings that represent a plurality of consecutive scenes as data strings of numerical data, the method discriminates which event the consecutive scenes constitute. In this example, the event is shown to be a "double".
In this way, the video event discrimination method represents a video as a data string of numerical data simplified on the basis of video feature quantities, and discriminates the events occurring in the video (video events) on the basis of that data string.

Although an occurrence in baseball broadcast video (a double) is used here as the example of a video event, the present invention is not limited to events in baseball broadcast video. The invention can discriminate an occurrence in any video in which the camera work and composition are determined by the occurrence. For example, in a cooking program in which the camera work and composition are fixed for each step of the procedure, it is possible to discriminate "a scene in which cabbage is being cut", "a scene in which fish is being grilled", and so on.

As shown in FIG. 1, the video event discrimination method discriminates video events by referring to learning data prepared in advance (the feature quantity classification database 10a and the event database 11a). The learning data generation device that generates this learning data is therefore described first, followed by the video event discrimination device that implements the method shown in FIG. 1.

[Configuration of the learning data generation device]
First, the configuration of the learning data generation device according to the present invention (the learning data generation device for video event discrimination) is described with reference to FIG. 2. FIG. 2 is a block diagram showing the configuration of the learning data generation device.

As shown in FIG. 2, the learning data generation device 1 generates two kinds of learning data from an externally input video: a feature quantity classification database 10a, in which similar video feature quantities are grouped into classes, and an event database 11a, in which each type of video event is associated with a scene digitization sequence, that is, a data string of numerical data representing a plurality of consecutive scenes.

Here, the learning data generation device 1 includes a feature quantity classification DB storage means 10, an event DB storage means 11, a scene division means 12, a reference image storage means 13, a feature quantity extraction means 14, a video feature quantity storage means 15, a feature quantity classification means 16, a scene digitizing means 17, a scene classification DB storage means 18, a scene video playback means 19, and an event setting means 20.

The feature quantity classification DB (database) storage means 10 stores the feature quantity classification database 10a, in which the video feature quantities of the individual scenes are classified (clustered) into groups of similar video feature quantities. It is a general storage means such as a hard disk.

The event DB (database) storage means 11 stores the event database 11a, in which each event type is associated with a scene digitization sequence that represents a plurality of scenes as a data string of numerical data. It is a general storage means such as a hard disk.
The feature quantity classification database 10a and the event database 11a are generated within the learning data generation device 1.

The scene division means 12 receives a video from outside and divides it into scenes. It detects points at which the screen composition of the video changes substantially (scene changes) and divides the video at each such point. Scene changes can be detected by an existing method. For example, the scene division means 12 computes, for each of the time-sequential frame images making up the video, a numerical vector of color features (for example, the mean value of each of R, G, and B). When the sum of the absolute differences between the numerical vectors of consecutive frame images exceeds a predetermined threshold, it judges that the frame images are not continuous and regards a scene change as having occurred.
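The mean-RGB example of scene change detection given above can be sketched as follows. This is a minimal illustration, not the patented implementation: the frame representation (a flat list of `(r, g, b)` pixels), the function names, and the threshold value used in any call are assumptions introduced here.

```python
def mean_rgb(frame):
    """Return the per-channel mean (R, G, B) of a frame.

    Assumption: a frame is a non-empty list of (r, g, b) pixel tuples.
    """
    n = len(frame)
    return tuple(sum(pixel[c] for pixel in frame) / n for c in range(3))


def is_scene_change(prev_frame, cur_frame, threshold):
    """True when the sum of absolute differences of the mean-RGB vectors
    of two consecutive frames exceeds the threshold."""
    prev_vec = mean_rgb(prev_frame)
    cur_vec = mean_rgb(cur_frame)
    diff = sum(abs(a - b) for a, b in zip(prev_vec, cur_vec))
    return diff > threshold


def scene_boundaries(frames, threshold):
    """Return the indices of frames that start a new scene.

    Index 0 is always reported because the first frame opens the first scene.
    """
    boundaries = [0] if frames else []
    for i in range(1, len(frames)):
        if is_scene_change(frames[i - 1], frames[i], threshold):
            boundaries.append(i)
    return boundaries
```

The threshold trades missed cuts against false cuts, and a suitable value would have to be chosen for the footage at hand.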

The scene division means 12 sequentially outputs the frame images of each divided scene to the feature quantity extraction means 14. It detects (divides off) the next scene once the scene digitizing means 17, described later, has generated the numerical string data for one scene. Here, the scene division means 12 also includes an event start detection means 12a.

The event start detection means 12a detects, in the input video, the start point of a scene switch (a switching image). It compares each frame image with the reference image 13a stored in the reference image storage means 13 to detect that a frame image similar to the reference image 13a has been input, and regards that frame image as the start point of a scene switch. Similarity is judged, for example, by comparing the sum of the absolute differences between the reference image 13a and the frame image with a predetermined threshold.
If the reference image 13a is, for example, an image that marks the start of an event, the scene division means 12 can reliably divide scenes beginning at the first frame image of the event.
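The similarity test against the reference image 13a can be illustrated with a short sketch. The text only says that the sum of absolute differences is compared with a threshold; treating "at or below the threshold" as "similar", the pixel-list image format, and the function names are assumptions made for this example.

```python
def sum_abs_diff(image_a, image_b):
    """Sum of absolute per-channel differences between two images.

    Assumption: both images are equally long lists of (r, g, b) tuples.
    """
    return sum(
        abs(ca - cb)
        for pixel_a, pixel_b in zip(image_a, image_b)
        for ca, cb in zip(pixel_a, pixel_b)
    )


def is_event_start(frame, reference_image, threshold):
    """Treat a frame as the start of an event when it is close enough
    to the stored reference image (assumed rule: difference <= threshold)."""
    return sum_abs_diff(frame, reference_image) <= threshold
```

In a baseball broadcast, `reference_image` would play the role of the pitching or batter's-box shot mentioned in the description.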

The reference image storage means 13 stores the reference image 13a that marks the start of an event, and is a general storage means such as a hard disk. For example, when generating learning data for discriminating events in a baseball broadcast, using an image of the batter standing in the batter's box or of the pitcher throwing a pitch as the reference image 13a allows the scene division means 12 to detect the start of various baseball events (for example, a home run or a strikeout).

The feature quantity extraction means 14 extracts, for each scene divided by the scene division means 12, a video feature quantity from the frame images making up that scene. A general video feature quantity can be used, for example a numerical vector of the R, G, and B mean values over all frame images of the scene. Instead of processing every frame image of the scene, the feature quantity extraction means 14 may select frame images at a predetermined sampling interval and extract the video feature quantity from the selected frame images.
The extracted video feature quantity is stored in the video feature quantity storage means 15. Once it has stored the video feature quantity there, the feature quantity extraction means 14 notifies the feature quantity classification means 16 that a video feature quantity has been extracted.

In the present embodiment, the feature quantity extraction means 14 uses the technique of "Video feature information generation method, video feature information generation device, and video feature information generation program" (Japanese Patent Application No. 2003-73548), filed by the present applicant, to express the video feature quantity of a scene as rough rectangular regions together with the image feature quantity and motion of each rectangular region. For this purpose, the feature quantity extraction means 14 here includes a node tracking means 14a, a node classification means 14b, a cluster image feature quantity generation means 14c, and a scene feature quantity generation means 14d.

The node tracking means 14a sets reference points (nodes) for extracting the features of a frame image in a grid at predetermined intervals on the first frame image of a scene, and tracks the nodes from frame image to frame image on the basis of the feature quantity of the image region neighboring each node.
Node tracking by the node tracking means 14a is now described with reference to FIG. 7 (and FIG. 2 as appropriate). FIG. 7 visualizes node tracking by the node tracking means: (a) shows nodes placed on a frame image, and (b) shows nodes being tracked on a frame image.

As shown in FIG. 7(a), the node tracking means 14a sets nodes PT in a grid of Npx nodes horizontally by Npy nodes vertically (Npx and Npy are set in advance) on the first frame image of the scene. It then computes an image feature quantity from the neighborhood of each node PT (the neighborhood image: a square region of Rfv × Rfv pixels centered on the node) and associates it with that node. Any feature quantity common in the field of image processing may be used, for example the mean value of each RGB color component, the distribution of edge strength in the edge-detected image, or the fractal dimension, which indicates the complexity of the image.

In FIG. 7(a) the nodes PT are drawn on the frame image for convenience of explanation, but they merely indicate positions corresponding to grid points on the frame image.
As shown in FIG. 7(b), the node tracking means 14a moves each node to the position PT_B (marked × in the figure) in the current frame image at which the difference from the image feature quantity of the neighborhood of the node PT (marked ● in the figure) in the previous frame image is smallest while not exceeding a predetermined threshold. Regions with similar image feature quantities are thereby tracked throughout the scene.
The description now returns to FIG. 2.
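The grid placement and frame-to-frame tracking of nodes described for FIG. 7 can be sketched as below. Several details are assumptions not stated in the text: images are grayscale 2D lists, the neighborhood feature is the mean intensity of the Rfv × Rfv patch (with odd `rfv`), candidates are searched within a hypothetical `search` radius, and a node stays in place when no candidate is within the threshold.

```python
def place_nodes(width, height, npx, npy):
    """Place npx * npy nodes on an evenly spaced grid inside the frame."""
    xs = [(i + 1) * width // (npx + 1) for i in range(npx)]
    ys = [(j + 1) * height // (npy + 1) for j in range(npy)]
    return [(x, y) for y in ys for x in xs]


def patch_mean(image, x, y, rfv):
    """Mean intensity of the rfv x rfv square centred on (x, y).

    The square is clipped at the image border; rfv is assumed to be odd.
    """
    half = rfv // 2
    rows = range(max(0, y - half), min(len(image), y + half + 1))
    cols = range(max(0, x - half), min(len(image[0]), x + half + 1))
    values = [image[r][c] for r in rows for c in cols]
    return sum(values) / len(values)


def track_node(prev_image, cur_image, node, rfv, search, threshold):
    """Move a node to the position in cur_image whose patch feature is
    closest to the node's patch feature in prev_image.

    Only candidates whose difference is <= threshold are accepted;
    if none qualifies the node keeps its position (an assumption).
    """
    x0, y0 = node
    target = patch_mean(prev_image, x0, y0, rfv)
    best, best_diff = node, None
    for y in range(max(0, y0 - search), min(len(cur_image), y0 + search + 1)):
        for x in range(max(0, x0 - search), min(len(cur_image[0]), x0 + search + 1)):
            diff = abs(patch_mean(cur_image, x, y, rfv) - target)
            if diff <= threshold and (best_diff is None or diff < best_diff):
                best, best_diff = (x, y), diff
    return best
```

Calling `track_node` for every node on every consecutive frame pair yields one trajectory per node over the scene.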

The node classification means 14b classifies (clusters) the nodes within each frame image on the basis of each node's position and the image feature quantity of its neighborhood. It classifies nodes whose neighborhood image feature quantities are similar into the same class (cluster). With image feature quantities alone, however, nodes that are far apart would be classified into the same cluster CL1, as shown in FIG. 8(a). The node classification means 14b therefore splits off nodes that are at least a predetermined distance from every other node of the same cluster and classifies them as a separate cluster, as shown in FIG. 8(b) (CL1 and CL2).
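The two-stage classification described above (group by feature similarity, then split groups that are spatially disconnected) can be sketched as follows. The text does not specify the grouping algorithm, so the greedy grouping against each group's first member, the scalar features, and the connected-component split are assumptions chosen for brevity.

```python
import math


def group_by_feature(features, feat_threshold):
    """Greedily group node indices whose feature lies within
    feat_threshold of the first member of an existing group."""
    groups = []
    for i, value in enumerate(features):
        for group in groups:
            if abs(features[group[0]] - value) <= feat_threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


def split_by_distance(group, positions, max_dist):
    """Split one feature group into spatially connected components:
    nodes farther than max_dist from every node of a component
    end up in a different cluster."""
    remaining = set(group)
    clusters = []
    while remaining:
        start = min(remaining)
        remaining.discard(start)
        stack, component = [start], [start]
        while stack:
            i = stack.pop()
            xi, yi = positions[i]
            near = [
                j for j in remaining
                if math.hypot(positions[j][0] - xi, positions[j][1] - yi) <= max_dist
            ]
            for j in near:
                remaining.discard(j)
                component.append(j)
                stack.append(j)
        clusters.append(sorted(component))
    return clusters


def classify_nodes(positions, features, feat_threshold, max_dist):
    """Return clusters (lists of node indices) for one frame image."""
    clusters = []
    for group in group_by_feature(features, feat_threshold):
        clusters.extend(split_by_distance(group, positions, max_dist))
    return clusters
```

In the situation of FIG. 8, two patches with similar features on opposite sides of the frame would fall into one feature group and then be split into CL1 and CL2 by the distance step.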

The cluster image feature quantity generation means 14c generates the image feature quantity of a cluster (the cluster image feature quantity) once the nodes classified into that cluster by the node classification means 14b have been tracked by the node tracking means 14a from the first frame image to the last frame image of the scene. The cluster image feature quantity is, for example, the mean of the image feature quantities of the nodes of the cluster over all frame images from the first to the last frame image of the scene. Among the neighborhood images of the nodes, the image closest to this mean may also be taken as the representative texture image of the cluster and used as one of the image feature quantities.

The scene feature quantity generation means 14d generates the per-cluster image feature quantities over the whole scene as the feature quantity of that scene (the video feature quantity). It generates, as the scene feature quantity, the coordinate information of the rectangular region indicating the area of each cluster, the cluster image feature quantity generated by the cluster image feature quantity generation means 14c, and the motion of the cluster. The rectangular region indicating the area of a cluster is the largest region spanned by the coordinates of the nodes belonging to that cluster over the entire scene. The motion of a cluster is the motion vector of the cluster's centroid from the first frame image to the last frame image of the scene.
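The three components of a cluster's scene feature (bounding rectangle, mean image feature, centroid motion vector) can be computed as in the following sketch. The input layout (per-frame lists of node positions and scalar node features), the `(x_min, y_min, x_max, y_max)` rectangle encoding, and the dictionary keys are assumptions made for illustration; FIG. 9 lists four corner coordinates instead.

```python
def centroid(points):
    """Arithmetic mean position of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def cluster_scene_feature(tracks, features):
    """Summarise one cluster over a whole scene.

    tracks:   list over frames of the cluster's node positions [(x, y), ...]
    features: list over frames of the matching per-node feature values
    """
    all_points = [p for frame in tracks for p in frame]
    xs = [p[0] for p in all_points]
    ys = [p[1] for p in all_points]
    # Largest region covered by the cluster's nodes over the entire scene.
    rect = (min(xs), min(ys), max(xs), max(ys))

    # Mean feature of all nodes over all frames of the scene.
    all_values = [v for frame in features for v in frame]
    mean_feature = sum(all_values) / len(all_values)

    # Motion of the cluster centroid from the first to the last frame.
    (x_first, y_first) = centroid(tracks[0])
    (x_last, y_last) = centroid(tracks[-1])
    motion = (x_last - x_first, y_last - y_first)

    return {"rect": rect, "feature": mean_feature, "motion": motion}
```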

FIG. 9 shows an example of the video feature quantity data generated by the feature quantity extraction means 14. As shown in FIG. 9, for each scene, the information for one cluster consists of the first frame number Ns and the last frame number Ne of the scene, the coordinate information of the rectangular region {(x0, y0), (x1, y1), (x2, y2), (x3, y3)}, the image feature quantity {f(0), f(1), f(2), ..., f(N-1)}, and the x and y components of the motion vector {vx, vy}.
In this way, as shown in FIG. 10, the feature quantity extraction means 14 (FIG. 2) simplifies a scene of the video V into a plurality of rectangular regions R (R₁, R₂, R₃) and extracts, as the video feature quantity, the coordinate information (position and size), the image feature quantity, and the motion vector of each rectangular region R.
The description now returns to FIG. 2.

The video feature quantity storage means 15 stores the video feature quantity 15a of each scene extracted by the feature quantity extraction means 14, and is a storage means such as a hard disk. It functions as a buffer that temporarily holds the video feature quantities 15a, which are deleted once the scene digitizing means 17, described later, has generated the numerical string data for all scenes in the input video.

The feature quantity classification means 16 classifies (clusters) the video feature quantities 15a, extracted by the feature quantity extraction means 14 and stored in the video feature quantity storage means 15, into groups of similar video feature quantities, and associates each video feature quantity 15a with the value (cluster value) of the class (cluster) into which it was classified. When notified by the feature quantity extraction means 14 that a video feature quantity 15a has been extracted, it groups those video feature quantities 15a stored in the video feature quantity storage means 15 whose mutual difference is at or below a predetermined value (threshold) into one cluster. It also takes the mean of the video feature quantities in the same cluster as the video feature quantity representing that cluster (the representative video feature quantity), generates the feature quantity classification database 10a in which the representative video feature quantities are associated with cluster values, and stores it in the feature quantity classification DB storage means 10.
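A threshold-based clustering that yields a table of cluster values and representative (mean) feature vectors can be sketched as below. The text only requires that members of a cluster differ by no more than a threshold and that the representative is their mean; the sequential "leader" assignment, the L1 distance, and the numbering of cluster values from 1 are assumptions of this example.

```python
def l1_distance(a, b):
    """Sum of absolute component differences between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def build_feature_db(vectors, threshold):
    """Return {cluster_value: representative_vector}.

    Each vector joins the first cluster whose current representative is
    within the threshold; otherwise it opens a new cluster. The
    representative is the component-wise mean of the cluster members.
    """
    members = []          # members[k] holds the vectors of cluster k
    representatives = []  # representatives[k] is the mean of members[k]
    for vector in vectors:
        for k, rep in enumerate(representatives):
            if l1_distance(vector, rep) <= threshold:
                members[k].append(vector)
                n = len(members[k])
                representatives[k] = tuple(sum(col) / n for col in zip(*members[k]))
                break
        else:
            members.append([vector])
            representatives.append(tuple(float(x) for x in vector))
    return {k + 1: rep for k, rep in enumerate(representatives)}
```

The result of this kind of order-dependent clustering depends on the order in which scenes arrive; a batch method could be substituted without changing the shape of the resulting table.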

When the video feature quantity 15a is a feature vector made up of several feature quantities (for example, the coordinate information, image feature quantity, and motion vector shown in FIG. 9), the feature quantity classification means 16 computes the mean of each component feature quantity separately and uses the result as the representative video feature quantity.
As shown in the example structure diagram of FIG. 11, the feature quantity classification database 10a associates cluster values C (C₁, C₂, ...) one-to-one with representative video feature quantities CV (CV₁, CV₂, ...).

The scene digitizing means 17 associates each scene of the video with numerical values (cluster values) by converting the scene, on the basis of its video feature quantity, into the cluster values assigned by the feature quantity classification means 16. Because a scene is represented here by a plurality of rectangular regions, the scene digitizing means 17 converts each scene into a data string of several numerical values (cluster values). It then generates the scene classification database 18a by associating the digitized data with the frame numbers of each scene, and stores it in the scene classification DB storage means 18.

As shown in the example structure diagram of FIG. 12, the scene classification database 18a associates each scene number Sn, a serial number identifying a scene, with the frame numbers Fn (first number Fs to last number Fe) and one or more cluster values C.
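Scene digitization and the resulting scene classification table can be sketched as follows. Mapping each rectangular region to the cluster value of the nearest representative vector is an assumption (the text says only that the conversion is based on the video feature quantity), as are the L1 distance, the tuple layout of `scenes`, and the dictionary keys.

```python
def nearest_cluster(vector, feature_db):
    """Cluster value whose representative vector is closest (L1) to vector."""
    return min(
        feature_db,
        key=lambda c: sum(abs(x - y) for x, y in zip(vector, feature_db[c])),
    )


def digitize_scene(region_vectors, feature_db):
    """Convert the per-region feature vectors of one scene into a
    data string of cluster values."""
    return [nearest_cluster(v, feature_db) for v in region_vectors]


def build_scene_db(scenes, feature_db):
    """Build {scene_number: {"frames": (Fs, Fe), "clusters": [...]}}.

    scenes: list of (first_frame, last_frame, region_vectors) in video order.
    Scene numbers are serial numbers starting at 1.
    """
    return {
        number: {
            "frames": (first, last),
            "clusters": digitize_scene(vectors, feature_db),
        }
        for number, (first, last, vectors) in enumerate(scenes, start=1)
    }
```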

The scene classification DB (database) storage means 18 stores the scene classification database 18a generated by the scene digitizing means 17, and is a storage means such as a hard disk. The scene classification database 18a stored there is referred to by the scene video playback means 19 and the event setting means 20, described later.

The scene video playback means 19 plays back the video scene by scene by referring to the scene classification database 18a. It plays back the images of the corresponding frame numbers Fn in the order of the scene numbers Sn in the scene classification database 18a shown in FIG. 12 and displays the playback video on a display device (not shown), thereby presenting the video of each scene to the operator of the learning data generation device 1.
The scene video playback means 19 may also display, on the screen of the display device, a slide bar with which a position on the full time axis of the video can be specified. By operating the slide bar with an input means such as a mouse (not shown), the operator can play back the corresponding scene and check the events occurring in the video.

The event setting means 20 associates an event type with a scene digitization sequence, which represents a plurality of scenes as numerical string data, when the operator designates a plurality of consecutive scenes as an event via an input means (not shown). Identification information for the event (an event name or the like) is entered through the input means (not shown). The scenes corresponding to the event are specified either by entering scene numbers directly or by the position of the slide bar described above, in which case the event setting means 20 looks up the scene number in the scene classification database 18a.

The event setting means 20 can thus associate an event type with scenes (scene numbers). It then generates the event database 11a by associating the event identification information (event name or the like) with a scene digitization sequence, obtained by concatenating the cluster value data strings associated with the relevant scene numbers in the scene classification database 18a over the designated scenes, and stores the event database 11a in the event DB storage means 11.

As shown in the example structure diagram of FIG. 13, the event database 11a associates event identification information Ek with a scene digitization sequence Sd. In the example of FIG. 13, the event type "home run" (event identification information Ek) is associated with the cluster values (C₁₁, C₁₂, C₁₃, ...) of scene number Sn₁ shown in FIG. 12, the cluster values (C₂₁, C₂₂, C₂₃, ...) of scene number Sn₂, and so on.
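Registering an operator-labelled event in the event database can be sketched as follows. The dictionary layout of the scene classification table, the option of storing several example sequences under one event name, and the function names are assumptions for illustration only.

```python
def scene_number_at(scene_db, frame):
    """Look up the scene number whose frame range contains the given frame,
    as the slide-bar position would require. Returns None if not found."""
    for number, entry in scene_db.items():
        first, last = entry["frames"]
        if first <= frame <= last:
            return number
    return None


def register_event(event_db, scene_db, event_name, scene_numbers):
    """Associate an event name with the scene digitization sequence formed
    from the cluster value strings of the given consecutive scenes.

    Assumption: each event name maps to a list of example sequences, so
    the same event can be labelled more than once.
    """
    sequence = [list(scene_db[n]["clusters"]) for n in scene_numbers]
    event_db.setdefault(event_name, []).append(sequence)
    return event_db
```

Keeping the per-scene grouping inside each sequence mirrors the layout of FIG. 13, where the cluster values of each scene number are listed as separate groups.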

As described above, the learning data generation device 1 can generate, from an input video, the learning data used to discriminate events (video events): the feature quantity classification database 10a, in which similar video feature quantities are grouped into classes, and the event database 11a, in which each type of video event is associated with a scene digitization sequence representing a plurality of consecutive scenes as a data string of numerical data.

The event start detection means 12a of the scene division means 12 and the reference image storage means 13 may be omitted from the learning data generation device 1. Including them is preferable, however, because numerical data is then generated from the very start of each event, so that the event is converted appropriately into a data string of numerical data.

The learning data generation device 1 can be realized by having a general computer execute a program that operates the arithmetic unit and storage devices of the computer. This program (the learning data generation program for video event discrimination) can be distributed over a communication line or written to a recording medium such as a CD-ROM and distributed.

[Operation of learning data generation device]
Next, the operation of the learning data generation device (learning data generation device for video event discrimination) according to the present invention is described with reference to FIGS. 3 and 4. The operation is described in two parts: generating the feature quantity classification database and generating the event database. FIG. 3 is a flowchart showing the operation in which the learning data generation device generates the feature quantity classification database. FIG. 4 is a flowchart showing the operation in which the learning data generation device generates the event database.

(Feature quantity classification database generation operation)
First, the operation in which the learning data generation device 1 generates the feature quantity classification database 10a is described with reference to FIG. 3 (and FIG. 2 as appropriate).
The learning data generation device 1 first inputs the video frame image by frame image through the scene division means 12 (step S1). The event start detection means 12a then determines whether the input original frame image is similar to the reference image 13a stored in the reference image storage means 13 (step S2).

If the original frame image is not similar to the reference image 13a (No in step S2), the scene division means 12 compares the original frame image with the previous frame image, that is, the frame image input immediately before it in time, to determine whether the frame images are continuous (step S3). If it judges the frame images to be continuous (Yes in step S3), the scene division means 12 stores the frame image in a storage means (not shown) and returns to step S1 to input the next frame image.

If the original frame image is similar to the reference image 13a (Yes in step S2), or if the frame images are not continuous (No in step S3), the scene is regarded as switching at the original frame image (a scene change), and the original frame image is set as the switching image of the scene (step S4).
The scene division means 12 then determines whether the original frame image is the first switching image (step S5). If it is (Yes in step S5), the scene division means 12 stores the original frame image in a storage means (not shown) and returns to step S1 to input the next frame image.

If the original frame image is not the first switching image (No in step S5), the learning data generation device 1 uses the feature quantity extraction means 14 to extract the video feature quantity from the scene that runs from the preceding switching image to the frame image immediately before the original frame image (step S6). In this step, the node tracking means 14a tracks the nodes set in advance on the first frame image of the scene from frame image to frame image, on the basis of the feature quantity of the image region neighboring each node. The node classification means 14b classifies (clusters) the nodes within each frame image on the basis of each node's position and the image feature quantity of its neighborhood. Once the nodes of a cluster have been tracked from the first frame image to the last frame image of the scene, the cluster image feature quantity generation means 14c generates the image feature quantity of that cluster (the cluster image feature quantity). The scene feature quantity generation means 14d then generates the per-cluster image feature quantities over the whole scene as the feature quantity of the scene (the video feature quantity).

Next, the learning data generation device 1 uses the feature quantity classification means 16 to classify (cluster) the video feature quantities extracted in step S6 into groups of similar video feature quantities, associates each video feature quantity with the value (cluster value) of the class (cluster) into which it was classified, and generates the feature quantity classification database 10a (step S7).

The learning data generation device 1 then determines whether the input video has ended (step S8). If it has not (No in step S8), the device returns to step S1 and continues the operation. If the video has ended (Yes in step S8), the operation ends.
Although not shown in the figure, the process proceeds to step S6 when no further video can be input in step S1, so that the video feature quantity of the final scene of the video is also extracted.
As described above, the learning data generation device 1 can generate, from the input video, the feature quantity classification database 10a in which the video feature quantities of the individual scenes are clustered.

(Event database generation operation)
Next, the operation in which the learning data generation device 1 generates the event database 11a will be described with reference to FIG. 4 (and FIG. 2 as appropriate). Steps S11 to S16 in FIG. 4 are the same as steps S1 to S6 described with reference to FIG. 3, so their description is omitted, and the operation from step S17 onward is described.

After extracting the video feature quantity in step S16, the learning data generation device 1 uses the scene digitizing means 17 to generate the scene classification database 18a, in which each scene number is associated with frame numbers (first number and last number) and with a data string of a plurality of numerical values (cluster values) (step S17).
The learning data generation device 1 then determines whether the input video has ended (step S18). If it has not ended (No in step S18), the device returns to step S11 and continues the operation.
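
The record written in step S17 can be pictured as follows. This is only a sketch of one possible layout: the `SceneRecord` type, its field names, the `add_scene` helper and the sample values are hypothetical, and the actual structure of the scene classification database 18a is the one shown in FIG. 12.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SceneRecord:
    """One row of a hypothetical scene classification database."""
    first_frame: int       # first frame number of the scene
    last_frame: int        # last frame number of the scene
    cluster_values: tuple  # data string of cluster values for the scene

def add_scene(scene_db, first_frame, last_frame, cluster_values):
    """Store a scene under the next scene number and return that number."""
    scene_number = len(scene_db) + 1
    scene_db[scene_number] = SceneRecord(
        first_frame, last_frame, tuple(cluster_values)
    )
    return scene_number

scene_db = {}
add_scene(scene_db, 0, 119, [3, 7])  # scene 1: frames 0-119
add_scene(scene_db, 120, 301, [5])   # scene 2: frames 120-301
```

Keeping the frame range next to the cluster values is what later lets each scene be played back for the operator and matched against events by its numerical data alone.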

On the other hand, when the video has ended (Yes in step S18), the learning data generation device 1 plays back the video of each scene with the scene video playback means 19 (step S19). Then, when the operator designates a plurality of consecutive scenes as an event via an input means (not shown), the event setting means 20 associates the type of the event with the scene digitized string of those scenes and generates the event database 11a (step S20).

The learning data generation device 1 then determines whether the operator has input an instruction to end the event association (step S21). It ends the operation when the end is instructed (Yes in step S21); until then (No in step S21), it returns to step S19 and continues the operation.
As described above, the learning data generation device 1 can generate, from the input video, the event database 11a in which each event consisting of a plurality of consecutive scenes is associated with a data string of cluster values of the video feature quantities.
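
Steps S19 to S21 amount to letting the operator label runs of consecutive scenes. A minimal sketch of that bookkeeping follows; the function name `register_event`, the dictionary layout keyed by the scene digitized string, and the sample values are assumptions for illustration, while the event name "home run" is the example given in the description.

```python
def register_event(event_db, scene_db, event_name, first_scene, last_scene):
    """Associate an event name with the scene digitized string of a scene run.

    scene_db maps scene number -> tuple of cluster values for that scene.
    The scene digitized string is the concatenation, in scene order, of the
    per-scene cluster values from first_scene to last_scene inclusive.
    """
    sequence = tuple(
        scene_db[number] for number in range(first_scene, last_scene + 1)
    )
    event_db[sequence] = event_name
    return sequence

# Hypothetical cluster values for three consecutive scenes.
scene_db = {1: (3, 7), 2: (5,), 3: (2, 2)}
event_db = {}

# The operator watches scenes 1-2 and labels them as one event.
labelled = register_event(event_db, scene_db, "home run", 1, 2)
```

Keying the dictionary by the digitized string makes the later lookup a direct comparison of data strings, which matches how the event specifying means 22 is described as testing for equality.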

[Configuration of video event discrimination device]
Next, the configuration of the video event discrimination device according to the present invention will be described with reference to FIG. 5. FIG. 5 is a block diagram showing the configuration of the video event discrimination device.
As shown in FIG. 5, the video event discrimination device 2 discriminates the type of event from a video input from the outside. Here, the video event discrimination device 2 includes a feature quantity classification DB storage means 10, an event DB storage means 11, a scene division means 12, a reference image storage means 13, a feature quantity extraction means 14, a video feature quantity storage means 15, a feature quantity digitizing means 21, and an event specifying means 22.

The components other than the feature quantity digitizing means 21 and the event specifying means 22 are the same as those of the learning data generation device 1 described with reference to FIG. 2, so they are given the same reference numerals and their description is omitted. If the event start detection means 12a and the reference image storage means 13 are omitted from the learning data generation device 1 (FIG. 2), they are also omitted from the video event discrimination device 2.

The feature quantity classification database 10a (see FIG. 11) stored in the feature quantity classification DB storage means 10 and the event database 11a (see FIG. 13) stored in the event DB storage means 11 are generated in advance as learning data by the learning data generation device 1.

The feature quantity digitizing means 21 converts the video feature quantity extracted by the feature quantity extraction means 14 into a numerical value (cluster value) that classifies the video feature quantity, based on the feature quantity classification database 10a. When notified by the feature quantity extraction means 14 that a video feature quantity has been extracted, the feature quantity digitizing means 21 takes, as the cluster value of the video feature quantity 15a stored in the video feature quantity storage means 15, the cluster value C whose representative video feature quantity CV in the feature quantity classification database 10a shown in FIG. 11 is closest in data distance to the video feature quantity 15a. The converted cluster value is output to the event specifying means 22.
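
The conversion performed by the feature quantity digitizing means 21 is a nearest-representative lookup. The sketch below assumes that the "data distance" is the Euclidean distance between fixed-length vectors, which the source does not specify; the function name `digitize` and the sample database are likewise illustrative.

```python
import math

def digitize(feature, feature_db):
    """Return the cluster value C whose representative CV is nearest to feature.

    feature_db maps cluster value -> representative video feature quantity.
    Euclidean distance stands in for the unspecified "data distance".
    """
    return min(feature_db, key=lambda c: math.dist(feature, feature_db[c]))

# Hypothetical database with two representative values.
feature_db = {0: (0.1, 0.0), 1: (5.0, 5.0)}

cluster_value = digitize((4.0, 4.5), feature_db)
```

With these sample values the query vector lies much closer to the representative of cluster 1, so `cluster_value` is 1.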

The event specifying means 22 specifies, based on the event database 11a, which event corresponds to the data string of cluster values sequentially output from the feature quantity digitizing means 21. The event specifying means 22 finds the event identification information Ek (for example, an event name such as "home run") whose scene digitized string Sd in the event database 11a shown in FIG. 13 is equal to the input data string of cluster values, specifies it as the event of those consecutive scenes, and outputs the result (the discriminated event).

Here, when the feature quantity extraction means 14 outputs the video feature quantity of a scene to the feature quantity digitizing means 21, it also notifies the event specifying means 22 of the first frame number and the last frame number of the scene. The event specifying means 22 can then attach the first frame number of the first scene and the last frame number of the last scene of the consecutive scenes to the discriminated event, so that the first and last frame numbers of the entire event are output together with it.
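
The matching and frame-number bookkeeping described for the event specifying means 22 can be sketched as follows. The source only states that the input data string must be equal to a stored scene digitized string; comparing the most recent scenes against each stored string and preferring the longest match are assumptions made here, as are the function name, the one-cluster-value-per-scene simplification and the sample data.

```python
def identify_event(history, event_db):
    """Match the most recent scenes against the event database.

    history:  list of (cluster_value, first_frame, last_frame), oldest first.
    event_db: maps a tuple of cluster values -> event name.
    Returns (event_name, first_frame, last_frame) or None if nothing matches.
    """
    values = tuple(item[0] for item in history)
    # Try longer stored strings first so the most specific event wins.
    for sequence in sorted(event_db, key=len, reverse=True):
        n = len(sequence)
        if 0 < n <= len(values) and values[-n:] == sequence:
            first_frame = history[-n][1]  # first frame of the first scene
            last_frame = history[-1][2]   # last frame of the last scene
            return event_db[sequence], first_frame, last_frame
    return None

event_db = {(4, 1, 6): "home run"}
history = [(2, 0, 99), (4, 100, 180), (1, 181, 260), (6, 261, 400)]

result = identify_event(history, event_db)
```

In this example the last three scenes carry the cluster values 4, 1, 6, so the event is reported as spanning frames 100 to 400.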

As described above, the video event discrimination device 2 can discriminate an event (video event) from an input video. Because the discrimination is based on a numerical string that classifies the video feature quantities, events can be discriminated automatically from the video scenes themselves, which was not possible conventionally.

The video event discrimination device 2 can be realized by having a general-purpose computer execute a program that operates the arithmetic unit and storage devices in the computer. This program (the video event discrimination program) can be distributed via a communication line, or written to a recording medium such as a CD-ROM and distributed.

[Operation of video event discrimination device]
Next, the operation of the video event discrimination device according to the present invention will be described with reference to FIG. 6 (and FIG. 5 as appropriate). FIG. 6 is a flowchart showing the operation (event discrimination operation) of the video event discrimination device. Steps S31 to S36 are the same as steps S1 to S6 in the operation of the learning data generation device 1 (FIG. 2) described with reference to FIG. 3, so their description is omitted, and the operation from step S37 onward is described.

After step S36, the feature quantity digitizing means 21 of the video event discrimination device 2 refers to the feature quantity classification database 10a, which is learning data, and converts the video feature quantity extracted by the feature quantity extraction means 14 into the cluster value that classifies it (step S37).

The event specifying means 22 of the video event discrimination device 2 then refers to the event database 11a to specify which event corresponds to the data string of cluster values sequentially converted in step S37 (step S38), and outputs the event name, the first frame number, and the last frame number of that event as the discriminated event (step S39).

The video event discrimination device 2 then determines whether the input video has ended (step S40). If it has not ended (No in step S40), the device returns to step S31 and continues the operation. If the video has ended (Yes in step S40), the operation (event discrimination operation) ends.
Through the above operation, the video event discrimination device 2 can discriminate an event (video event) from a video based on the learning data (the feature quantity classification database 10a and the event database 11a).
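
The flow of steps S37 to S40 can be put together in one loop, as sketched below. Scene division and feature extraction (steps S31 to S36) are assumed to have already produced one feature vector per scene; the Euclidean distance, the matching of the most recent scenes, and every name and value in the example are illustrative assumptions rather than the implementation of the invention.

```python
import math

def discriminate_events(scenes, feature_db, event_db):
    """Run steps S37-S40 over per-scene feature vectors.

    scenes:     iterable of (feature_vector, first_frame, last_frame).
    feature_db: maps cluster value -> representative feature vector.
    event_db:   maps a tuple of cluster values -> event name.
    Returns a list of (event_name, first_frame, last_frame).
    """
    history, results = [], []
    for feature, first_frame, last_frame in scenes:
        # S37: convert the feature quantity to its nearest cluster value.
        cluster = min(feature_db, key=lambda c: math.dist(feature, feature_db[c]))
        history.append((cluster, first_frame, last_frame))
        values = tuple(item[0] for item in history)
        # S38: compare the most recent scenes with each stored string.
        for sequence, name in event_db.items():
            n = len(sequence)
            if 0 < n <= len(values) and values[-n:] == sequence:
                # S39: output the event name with its frame range.
                results.append((name, history[-n][1], history[-1][2]))
    # S40: the loop ends when the input video is exhausted.
    return results

feature_db = {0: (0.0,), 1: (10.0,)}
event_db = {(0, 1): "home run"}
scenes = [((9.0,), 0, 49), ((1.0,), 50, 99), ((11.0,), 100, 160)]

events = discriminate_events(scenes, feature_db, event_db)
```

The three scenes are digitized to 1, 0, 1; the last two match the stored string, so one event covering frames 50 to 160 is reported.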

FIG. 1 is an explanatory diagram for explaining the outline of the video event discrimination method according to the present invention.
FIG. 2 is a block diagram showing the configuration of the learning data generation device for video event discrimination according to the present invention.
FIG. 3 is a flowchart showing the operation in which the learning data generation device according to the present invention generates the feature quantity classification database.
FIG. 4 is a flowchart showing the operation in which the learning data generation device according to the present invention generates the event database.
FIG. 5 is a block diagram showing the configuration of the video event discrimination device according to the present invention.
FIG. 6 is a flowchart showing the operation of the video event discrimination device according to the present invention.
FIG. 7 is a diagram visualizing the tracking of nodes by the node tracking means, where (a) shows nodes placed on a frame image and (b) shows the nodes tracked on a frame image.
FIG. 8 is a conceptual diagram showing the concept of classifying nodes into the same class (cluster).
FIG. 9 is a diagram showing an example of the video feature quantity data generated by the feature quantity extraction means.
FIG. 10 is a conceptual diagram showing the concept of simplifying a video scene into a plurality of rectangular regions.
FIG. 11 is a structural diagram showing an example of the structure of the feature quantity classification database.
FIG. 12 is a structural diagram showing an example of the structure of the scene classification database.
FIG. 13 is a structural diagram showing an example of the structure of the event database.

Explanation of symbols

1 Learning data generation device (learning data generation device for video event discrimination)
2 Video event discrimination device
10 Feature quantity classification DB (database) storage means
10a Feature quantity classification database
11 Event DB (database) storage means
11a Event database
12 Scene division means
12a Event start detection means
13 Reference image storage means
13a Reference image
14 Feature quantity extraction means
15 Video feature quantity storage means
16 Feature quantity classification means
17 Scene digitizing means
18 Scene classification DB (database) storage means
18a Scene classification database
19 Scene video playback means
20 Event setting means
21 Feature quantity digitizing means
22 Event specifying means

Claims

A video event discrimination device that discriminates the type of event occurring in an input video based on the video feature quantity in each scene of the video, comprising:
feature quantity classification database storage means storing a feature quantity classification database in which representative values of similar video feature quantities are associated in advance with numerical data for classifying the representative values;
event database storage means storing an event database in which types of events are associated in advance with scene digitized strings, each representing a plurality of consecutive scenes by a data string of the numerical data;
scene division means for dividing the input video into scenes;
feature quantity extraction means for extracting the video feature quantity of a scene from a plurality of frame images included in the scene divided by the scene division means;
feature quantity digitizing means for converting the video feature quantity extracted by the feature quantity extraction means into the numerical data by referring to the feature quantity classification database stored in the feature quantity classification database storage means; and
event specifying means for specifying the type of event corresponding to the data string of the numerical data converted by the feature quantity digitizing means by referring to the event database stored in the event database storage means.

The video event discrimination device according to claim 1, further comprising reference image storage means for storing in advance a reference image that marks the start of an event occurring in the video,
wherein the scene division means includes event start detection means for detecting the start of an event in the video as the starting point of a scene switch by comparing the frame images with the reference image.

A video event discrimination program that, in order to discriminate the type of event occurring in an input video based on the video feature quantity in each scene of the video, causes a computer to function as:
scene division means for dividing the input video into scenes;
feature quantity extraction means for extracting the video feature quantity of a scene from a plurality of frame images included in the scene divided by the scene division means;
feature quantity digitizing means for converting the video feature quantity extracted by the feature quantity extraction means into numerical data by referring to a feature quantity classification database in which representative values of similar video feature quantities are associated in advance with the numerical data for classifying the representative values; and
event specifying means for specifying the type of event corresponding to the data string of the numerical data converted by the feature quantity digitizing means by referring to an event database in which types of events are associated in advance with scene digitized strings, each representing a plurality of consecutive scenes by a data string of the numerical data.

A learning data generation device for video event discrimination that generates a feature quantity classification database and an event database, which are learning data used in the video event discrimination device according to claim 1, comprising:
scene division means for dividing the input video into scenes;
feature quantity extraction means for extracting the video feature quantity of a scene from a plurality of frame images included in the scene divided by the scene division means;
feature quantity classification means for generating the feature quantity classification database by classifying the video feature quantities extracted by the feature quantity extraction means and associating each representative value of similar video feature quantities with numerical data;
scene digitizing means for associating, for each scene, frame image numbers identifying the frame images included in the scene with the numerical data of the video feature quantity classified by the feature quantity classification means;
scene video playback means for playing back each scene based on the frame image numbers associated with the scene by the scene digitizing means; and
event setting means for receiving, for a plurality of consecutive scenes played back by the scene video playback means, input of event identification information indicating the type of event, and generating the event database by associating the event identification information with a scene digitized string, which is the data string of the numerical data of the video feature quantities corresponding to the plurality of scenes.

The learning data generation device for video event discrimination according to claim 4, further comprising reference image storage means for storing in advance a reference image that marks the start of an event occurring in the video,
wherein the scene division means includes event start detection means for detecting the start of an event in the video as the starting point of a scene switch by comparing the frame images with the reference image.

A learning data generation program for video event discrimination that, in order to generate a feature quantity classification database and an event database, which are learning data used in the video event discrimination device according to claim 1 or 2, causes a computer to function as:
scene division means for dividing the input video into scenes;
feature quantity extraction means for extracting the video feature quantity of a scene from a plurality of frame images included in the scene divided by the scene division means;
feature quantity classification means for generating the feature quantity classification database by classifying the video feature quantities extracted by the feature quantity extraction means and associating each representative value of similar video feature quantities with numerical data;
scene digitizing means for associating, for each scene, frame image numbers identifying the frame images included in the scene with the numerical data of the video feature quantity classified by the feature quantity classification means;
scene video playback means for playing back each scene based on the frame image numbers associated with the scene by the scene digitizing means; and
event setting means for receiving, for a plurality of consecutive scenes played back by the scene video playback means, input of event identification information indicating the type of event, and generating the event database by associating the event identification information with a scene digitized string, which is the data string of the numerical data of the video feature quantities corresponding to the plurality of scenes.