JP6261222B2 - Identification device and identification program

Info

Publication number: JP6261222B2
Application number: JP2013152817A
Authority: JP
Inventors: 高橋　正樹; 苗村　昌秀; 藤井　真人
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2013-07-23
Filing date: 2013-07-23
Publication date: 2018-01-17
Anticipated expiration: 2033-07-23
Also published as: JP2015022702A

Description

The present invention relates to a device and a program for identifying an action shown in a video.

Conventionally, there is a technique for recognizing objects in still images by combining the SIFT (Scale-Invariant Feature Transform) algorithm, which detects feature points in an image and describes their features, with the BoF (Bag-of-Features) method. BoF aggregates the extracted local features and represents them as a histogram.

Patent Document 1 proposes a method that separates the foreground from the background using the spatiotemporal information of a video and treats signals arising on the background as noise. Patent Document 2 proposes a method for creating spatiotemporal features using the temporal and spatial derivatives obtained from several consecutive frames. Patent Document 3 proposes a method for detecting objects using the co-occurrence of luminance gradient features.

Patent Document 1: JP H09-81714 A
Patent Document 2: JP 2010-92293 A
Patent Document 3: JP 2009-301104 A

Methods that represent an image using statistics of a large number of local features (for example, SIFT) extracted from the image are used for recognizing objects in images, clustering images, and similar tasks. It is conceivable to extend this approach to moving images and recognize the action of a subject based on local features obtained from the video. However, methods such as MoSIFT (Motion SIFT) and STIP (Space-Time Interest Points), which extend the features used for still images into the time direction, do not carry a sufficient amount of information along the time axis.

Detecting a human action from video requires handling temporal information over a relatively long period (for example, several seconds) in addition to spatial information. Moreover, human actions differ greatly from person to person, and their appearance and movement vary greatly with the shooting direction. For these reasons, it has been difficult to describe features suited to analyzing human actions.

In addition, image and video representation methods such as BoF discard the position information of local features in order to remain invariant to changes in size. Methods such as Spatial Pyramid have been devised to embed the spatial positions of local features, but because they carry no information in the time direction, detecting a person's action from video has been difficult.

An object of the present invention is to provide an identification device and an identification program capable of accurately identifying an action shown in a video.

An identification device according to the present invention is a device that identifies an action shown in a video, and includes: a video dividing unit that divides the video into spatiotemporal subregions; a feature description unit that, for each of the spatiotemporal subregions, calculates local features relating to the trajectories of feature points within that subregion and describes them in a predetermined number of dimensions; a feature representation unit that statistically processes the set of local features to obtain a feature representation; and an identification unit that identifies the action with a predetermined classifier based on the set of feature representations for the respective spatiotemporal subregions.

With this configuration, the identification device uses the spatiotemporal subregions to create a feature representation that takes position and time information into account. The identification device is therefore robust to temporal and spatial variations in an action and can detect a specific action performed by a person in video content with high accuracy.

The feature representation unit may create feature pairs, each combining a partial feature representation obtained from a spatiotemporal subregion with a global feature representation obtained from a large region that contains that subregion and has the same time scale, and the identification unit may identify the action based on the set of feature pairs.

With this configuration, the identification device can improve the accuracy of action identification by using the feature pairs to generate a feature representation that includes the camera motion information in the large region.

The video dividing unit may divide the video so that the result includes a plurality of spatiotemporal subregions that overlap one another and have different time scales.

With this configuration, by providing a plurality of time scales for the spatiotemporal subregions, the identification device can cope with individual differences in the speed of an action and can improve the accuracy of action identification.

The identification unit may perform a first identification process that takes each of the feature pairs as input, followed by a second identification process that takes the combination of the results of the first identification process as input.

With this configuration, by identifying the feature representation with a two-stage classifier, the identification device can reduce the number of dimensions handled at each stage and can accurately identify human actions with a small number of dimensions.

The feature description unit may calculate features representing a motion feature, which indicates the distribution of the direction and magnitude of the motion vectors within the trajectory of a feature point; an appearance feature, which indicates the distribution of colors around the trajectory; and a shape feature, which indicates the duration and travel distance of the trajectory.

With this configuration, the identification device can efficiently express the characteristics of the action shown in the video using the appearance, motion, and shape features, and can improve the accuracy of action identification.

An identification program according to the present invention is a program for causing a computer to identify an action shown in a video, and causes the computer to execute: a video dividing step of dividing the video into spatiotemporal subregions; a feature description step of calculating, for each of the spatiotemporal subregions, local features relating to the trajectories of feature points within that subregion and describing them in a predetermined number of dimensions; a feature representation step of statistically processing the set of local features to obtain a feature representation; and an identification step of identifying the action with a predetermined classifier based on the set of feature representations for the respective spatiotemporal subregions.

According to the present invention, an action shown in a video can be identified with high accuracy.

FIG. 1 is a block diagram showing the configuration of the identification device according to the first embodiment.
FIG. 2 is a diagram showing the flow of the identification processing according to the first embodiment.
FIG. 3 is a diagram illustrating the appearance feature according to the first embodiment.
FIG. 4 is a diagram illustrating the motion feature according to the first embodiment.
FIG. 5 is a schematic diagram of the feature representation according to the first embodiment.
FIG. 6 is a diagram showing the flow of the identification processing according to the second embodiment.
FIG. 7 is a diagram showing the results of an evaluation experiment comparing the identification accuracy of the first and second embodiments with that of conventional methods.

[First Embodiment]
The first embodiment of the present invention is described below.
The identification device 1 according to the present embodiment identifies an action performed by a person or other subject in a video. Here, a video is content expressed as moving images and includes, for example, broadcast video, home video, and Internet video transmitted at a predetermined frame rate.

FIG. 1 is a block diagram showing the configuration of the identification device 1 according to the present embodiment.
The identification device 1 includes a control unit 10, a storage unit 20, a communication unit 30, an input unit 40, and a display unit 50.

The control unit 10 controls the identification device 1 as a whole and implements the functions of the present embodiment by reading and executing, as appropriate, the various programs stored in the storage unit 20. The control unit 10 may be a CPU (Central Processing Unit).
The control unit 10 also includes a video dividing unit 11, a feature description unit 12, a feature representation unit 13, and an identification unit 14, and executes the identification processing described later.

The storage unit 20 is a storage area for the various programs and data that cause the hardware to function as the identification device 1, and may be a ROM, a RAM, a flash memory, a hard disk drive (HDD), or the like. Specifically, the storage unit 20 stores the identification program that causes the control unit 10 to execute the functions of the present embodiment, as well as the feature descriptions, feature representations, and classifier training data described later.

The communication unit 30 is a network adapter used when the identification device 1 communicates with other devices. Specifically, under the control of the control unit 10, the communication unit 30 communicates with communication terminals, servers, and the like over a network, and thereby sends and receives video data, training data, identification result data, and so on.

The input unit 40 is an interface device that accepts instructions from the user to the identification device 1. The input unit 40 consists of, for example, a keyboard, a mouse, a touch panel, and operation buttons.

Under display control by the control unit 10, the display unit 50 displays screens that accept data input from the user and screens that show the results of processing by the control unit 10. The display unit 50 may be a display device such as a liquid crystal display (LCD) or an organic EL display.

FIG. 2 is a diagram showing the flow of the identification processing executed by the units of the control unit 10 according to the present embodiment.
The identification processing is executed in the following order: a video dividing step (step S1) by the video dividing unit 11; a local feature extraction step (step S2) and a feature description step (step S3) by the feature description unit 12; a feature representation step (step S4) by the feature representation unit 13; and an identification step (step S5) by the identification unit 14.
The function of each unit is described in detail below.

In step S1, the video dividing unit 11 divides the video into spatiotemporal subregions. A spatiotemporal subregion is a region obtained by dividing both space and time into predetermined units. In the case of broadcast video, for example, it is the region obtained by taking the part with the same coordinates from each of a predetermined number of consecutive frames.

Because BoF creates a single feature vector from the entire frame, it does not take into account the positions of the local features extracted from the video. In addition, because the time span over which trajectories are extracted is fixed, variation in the speed of an action is not tolerated. Both problems arise because the extraction range of the trajectory features is fixed spatially and temporally.
The present embodiment therefore uses a feature representation called spatio-temporal Multiscale Bags (stMB), which takes the spatiotemporal positions of local features into account. stMB defines a set of subregions divided spatially and temporally and treats the local features obtained inside each subregion as separate features, thereby embedding spatiotemporal position information in the feature representation. This yields a feature representation that is robust to temporal and spatial variations in an action and that performs well in identifying actions based on video analysis.

Specifically, the video dividing unit 11 divides the video so that the result includes a plurality of spatiotemporal subregions that overlap one another and have different time scales.
For example, for the video to be identified (for example, a 3-second video), the video dividing unit 11 provides nine spatial regions (three divisions each vertically and horizontally) and three temporal scales (1 second, 2 seconds, and 3 seconds).
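The division in this example can be sketched as follows. This is only a minimal illustration under stated assumptions, not the device's actual implementation: the text does not say where the 1-second and 2-second windows sit within the clip, so the sketch assumes every window starts at the first frame (which makes the windows nested and therefore overlapping). The names `Subregion` and `split_video`, the frame size, and the 30 frames-per-second rate are hypothetical.

```python
from typing import List, NamedTuple


class Subregion(NamedTuple):
    x0: int  # left edge in pixels (inclusive)
    y0: int  # top edge in pixels (inclusive)
    x1: int  # right edge in pixels (exclusive)
    y1: int  # bottom edge in pixels (exclusive)
    t0: int  # first frame (inclusive)
    t1: int  # end frame (exclusive)


def split_video(width: int, height: int, n_frames: int,
                grid: int = 3, n_scales: int = 3) -> List[Subregion]:
    """Enumerate grid x grid spatial cells at n_scales temporal scales.

    Assumption: every temporal window starts at frame 0, so the windows
    cover 1/3, 2/3 and 3/3 of the clip when n_scales is 3.
    """
    regions = []
    for s in range(1, n_scales + 1):
        t1 = n_frames * s // n_scales
        for row in range(grid):
            for col in range(grid):
                regions.append(Subregion(
                    x0=width * col // grid,
                    y0=height * row // grid,
                    x1=width * (col + 1) // grid,
                    y1=height * (row + 1) // grid,
                    t0=0,
                    t1=t1,
                ))
    return regions


# A 3-second clip at an assumed 30 frames per second.
regions = split_video(width=1920, height=1080, n_frames=90)
print(len(regions))  # 27
```

Each of the 27 entries is one spatiotemporal subregion: a fixed rectangle of the frame taken over a run of consecutive frames.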

In step S2, the feature description unit 12 extracts, for each spatiotemporal subregion, the feature points within that subregion.
Feature points are points, such as edges in the image, that can be tracked over time, and they are extracted using an existing image analysis method.

In step S3, the feature description unit 12 calculates local features relating to the trajectories of the extracted feature points and describes them in a predetermined number of dimensions. The trajectory of a feature point is the movement history of that feature point in the image and carries both spatial and temporal information.
Specifically, the feature description unit 12 calculates features representing a "motion feature", which indicates the distribution of the direction and magnitude of the motion vectors within the trajectory of a feature point; an "appearance feature", which indicates the distribution of colors around the trajectory; and a "shape feature", which indicates the duration and travel distance of the trajectory.

FIG. 3 is a diagram illustrating the appearance feature according to the present embodiment.
The feature description unit 12 assigns the color of each point around a feature point (within a square region) to one of the regions of an RGB color space divided into 27 parts and creates a 27-dimensional color histogram. The distribution of colors around a feature point is thereby expressed as a feature of a predetermined number of dimensions.
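A 27-bin color histogram of this kind might look like the sketch below. It assumes that the 27 regions come from splitting each of the R, G, and B axes into three equal ranges (3 x 3 x 3 = 27) and that the histogram is normalized to sum to 1; the text states neither detail. The function names and the 8-bit pixel format are hypothetical.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # 8-bit (R, G, B)


def color_bin(pixel: Pixel, levels: int = 3) -> int:
    """Map an 8-bit RGB pixel to one of levels**3 cells of the RGB cube."""
    r, g, b = (min(c * levels // 256, levels - 1) for c in pixel)
    return (r * levels + g) * levels + b


def appearance_histogram(patch: Sequence[Pixel], levels: int = 3) -> List[float]:
    """Normalized color histogram over the pixels of a square patch."""
    hist = [0.0] * levels ** 3
    for pixel in patch:
        hist[color_bin(pixel, levels)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Three pure-red pixels and one black pixel around a feature point.
hist = appearance_histogram([(255, 0, 0)] * 3 + [(0, 0, 0)])
print(len(hist), hist[18], hist[0])  # 27 0.75 0.25
```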

FIG. 4 is a diagram illustrating the motion feature according to the present embodiment.
The feature description unit 12 splits the trajectory of a feature point into frame-to-frame motion vectors, quantizes each vector according to its direction and magnitude to assign one of 25 labels, and creates a 25-dimensional histogram. The distribution of the direction and magnitude of a feature point's movement is thereby expressed as a feature of a predetermined number of dimensions. Because the information from every frame in the trajectory is used, loss of information useful for identifying the action is suppressed.
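The text does not say how the 25 labels are laid out. One plausible scheme, used here purely as an assumption, is a single "almost still" label plus 8 directions x 3 magnitude levels (1 + 24 = 25). The thresholds `still` and `mag_edges` and the function names below are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def motion_label(dx: float, dy: float, still: float = 0.5,
                 mag_edges: Tuple[float, float] = (2.0, 6.0)) -> int:
    """Quantize one frame-to-frame motion vector into a label in 0..24.

    Assumed scheme: label 0 means "almost still"; otherwise the label is
    1 + 3 * direction (8 sectors of 45 degrees) + magnitude level (0..2).
    """
    mag = math.hypot(dx, dy)
    if mag < still:
        return 0
    angle = math.atan2(dy, dx) % (2 * math.pi)
    direction = int((angle + math.pi / 8) // (math.pi / 4)) % 8
    level = sum(mag >= edge for edge in mag_edges)
    return 1 + 3 * direction + level


def motion_histogram(trajectory: Sequence[Point]) -> List[int]:
    """25-bin histogram over every frame-to-frame step of a trajectory."""
    hist = [0] * 25
    for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:]):
        hist[motion_label(x1 - x0, y1 - y0)] += 1
    return hist
```

Every step of the trajectory contributes one count, which matches the statement that information from all frames in the trajectory is used.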

In step S4, the feature representation unit 13 statistically processes the set of local features obtained in step S3 to obtain a feature representation.
The feature representation unit 13 projects each local feature into a feature space and then aggregates the projected features into a feature vector of fixed dimension. First, the feature representation unit 13 projects all the local features into the feature space and applies clustering according to their distribution. It then performs statistical processing such as quantization or averaging for each cluster and represents the image as a single feature vector. The number of feature points extracted differs from image to image, but through this step the set of features can always be converted into a feature vector of fixed dimension regardless of the number of feature points.

This step is often performed with BoF, but in the present embodiment the feature representation unit 13 uses VLAD (Vector of Locally Aggregated Descriptors).
Specifically, the feature representation unit 13 generates a partial feature representation from the set of local features obtained from the feature points in a region, as information whose size is a predetermined number of codewords (for example, 16) × the number of dimensions of the local feature (27 dimensions for the appearance feature + 25 dimensions for the motion feature + 2 dimensions for the shape feature).
For each codeword, the partial feature representation is calculated as the sum of the vectors from the codeword's center to the local features assigned to that codeword.
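The per-codeword vector sum described here can be sketched in plain Python as below. The sketch assumes the codebook (the codeword centers) has already been learned, for example by clustering, and that each local feature is assigned to its nearest center by Euclidean distance; normalization steps that VLAD implementations often add are left out. The function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_codeword(x: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codeword center closest to x (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - c) ** 2 for a, c in zip(x, codebook[k])))


def vlad(descriptors: Sequence[Vector], codebook: Sequence[Vector]) -> List[float]:
    """Per codeword, sum (descriptor - center); flatten to K * D values."""
    dim = len(codebook[0])
    sums = [[0.0] * dim for _ in codebook]
    for x in descriptors:
        k = nearest_codeword(x, codebook)
        for d in range(dim):
            sums[k][d] += x[d] - codebook[k][d]
    return [v for row in sums for v in row]


# Toy example: 2 codewords in 2 dimensions.
codebook = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 0.0), (0.0, 2.0), (9.0, 10.0)]
print(vlad(descriptors, codebook))  # [1.0, 2.0, -1.0, 0.0]
```

With 16 codewords and 54-dimensional local features, the same function would return 16 × 54 = 864 values for one region, however many feature points the region contains.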

Furthermore, the feature representation unit 13 uses stMB to combine the partial feature representations obtained from the spatiotemporal subregions into the overall feature representation.

FIG. 5 is a schematic diagram of the feature representation according to the present embodiment.
Because the feature representation unit 13 provides nine spatial regions and three temporal scales, the feature representation is a combination of 27 histograms in total. The dimension of each histogram depends on the feature description method: it is the number of codewords for BoF, and the number of codewords × the number of feature dimensions for the VLAD of the present embodiment.
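As a worked check of these sizes, the arithmetic below uses the example values given in the text (16 codewords; local features of 27 + 25 + 2 dimensions; 27 spatiotemporal subregions) and assumes the 27 partial representations are simply concatenated. The symbols are introduced here only for illustration.

```latex
\begin{align*}
D_{\mathrm{local}} &= 27 + 25 + 2 = 54 \\
D_{\mathrm{VLAD}} &= 16 \times D_{\mathrm{local}} = 864 \\
D_{\mathrm{stMB,\,VLAD}} &= 27 \times D_{\mathrm{VLAD}} = 23328 \\
D_{\mathrm{stMB,\,BoF}} &= 27 \times 16 = 432
\end{align*}
```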

In this way, by using stMB to obtain trajectory features for each subregion in space-time, the feature representation unit 13 can generate features that take both position and time into account.
Camera operators usually compose shots with the horizontal and vertical dividing lines of the so-called rule of thirds in mind, so this spatial division can be expected to be particularly effective for broadcast video shot by professional camera operators. In addition, by providing a plurality of time scales for the subregions, the feature representation unit 13 can cope with individual differences in the speed of an action.

In step S5, the identification unit 14 identifies the action with a predetermined classifier based on the set of feature representations for the respective spatiotemporal subregions.
Specifically, the identification unit 14 takes the fixed-dimension vector created in step S4 as input and uses a classifier such as a support vector machine (SVM) to determine which of a plurality of predefined action types the input corresponds to. The classifier is trained in advance by supervised learning using labeled ground-truth data.

[Second Embodiment]
The second embodiment of the present invention is described below.
Components that are the same as in the first embodiment are given the same reference numerals, and their description is omitted or simplified.
The identification device 1a according to the present embodiment differs from the first embodiment in the functions of the feature representation unit 13a and the identification unit 14a in the control unit 10a.

The feature representation unit 13a creates feature pairs, each combining a partial feature representation obtained from a spatiotemporal subregion with a global feature representation obtained from a large region that contains that subregion and has the same time scale.

Camera work in broadcast video generally follows certain patterns, and cameras are often operated according to those patterns. This tendency is especially pronounced in live sports coverage. In baseball coverage, for example, a pitch is shown with a long shot taken from behind the pitcher, and on a hit the picture immediately switches to a camera that follows the ball. In golf, a putt begins with a long shot that captures the whole green and then moves to a zoomed-in picture of the ball as it rolls. In basketball, possession passes to the opposing team after a shot, so the camera pans in the opposite direction. Camera operation thus follows typical patterns keyed to human actions.
Local features that arise from camera operation and are obtained from the background of the video are generally treated as noise, but they are also features that represent camera movement. Taking the co-occurrence of human action and camera motion into account is therefore expected to improve the accuracy of human action recognition.

The feature representation unit 13a therefore creates co-occurrence features between each spatial subregion and the entire image, expressing both the action of the subject and the movement of the camera.
Specifically, the feature representation unit 13a creates a pair consisting of a global feature representation (Global Representation) extracted from the entire frame and a partial feature representation (Local Representation) extracted from a subregion, and uses this feature pair (GPR: Glocal Pairwise Representation) as the basic unit of the co-occurrence features of the spatiotemporal subregions and the large region.
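The pairing can be illustrated with the short sketch below. It assumes that a pair is stored as the concatenation of the global and local vectors, which the text does not specify, and that the global representation for a given time scale has been computed over the whole frame. The dictionary layout and the name `build_feature_pairs` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
RegionKey = Tuple[int, int]  # (time-scale index, spatial-cell index)


def build_feature_pairs(local_reps: Dict[RegionKey, Vector],
                        global_reps: Dict[int, Vector]) -> Dict[RegionKey, List[float]]:
    """Pair each subregion's representation with the whole-frame
    representation computed over the same time scale.

    Assumption: a pair is stored as the concatenation [global | local].
    """
    return {
        (scale, cell): list(global_reps[scale]) + list(local)
        for (scale, cell), local in local_reps.items()
    }
```

With nine cells and three time scales, `local_reps` would hold 27 entries and the result would hold 27 feature pairs, one per spatiotemporal subregion.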

The identification unit 14a identifies the action based on the set of feature pairs.
Specifically, the identification unit 14a performs a first identification process that takes each feature pair as input, followed by a second identification process that takes the combination of the results of the first identification process as input.

FIG. 6 is a diagram showing the flow of the identification processing performed by the identification unit 14a according to the present embodiment.
In the first stage, the identification unit 14a takes a single feature pair (GPR) as input and, using SVM regression, outputs as a continuous value (a real number from 0 to 1) the confidence that each action to be identified (actions 1 to $N_A$) is contained in the spatiotemporal region corresponding to each input feature pair ($P_{1}^{1}$ to $P_{N_T}^{N_S}$).

For example, when the number of spatiotemporal divisions (= the number of feature pairs) is 27 (number of spatial divisions $N_S = 9$, number of temporal divisions $N_T = 3$) and there are $N_A$ types of action to be identified, $27 \times N_A$ first-stage SVM classifiers are required. After calculating the confidences with all of these SVM classifiers, the identification unit 14a concatenates them and expresses them as a feature vector of $27 \times N_A$ dimensions.

In the second stage, the identification unit 14a feeds the feature vector generated in the first stage into an $N_A$-class SVM classifier and identifies which of the target actions it corresponds to. This two-stage SVM allows the identification unit 14a to identify human actions without inflating the number of feature dimensions while still taking full account of the features extracted in each subregion.
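The data flow of the two stages can be sketched as follows. To stay self-contained, the sketch does not train any SVMs: `linear_scorer` is a hypothetical stand-in for one trained first-stage regressor, and `stage_two` uses a plain highest-linear-score rule in place of the second-stage $N_A$-class SVM. Only the shapes and the order of concatenation are meant to illustrate the text.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Scorer = Callable[[Vector], float]  # returns a confidence in [0, 1]


def linear_scorer(w: Vector, bias: float = 0.0) -> Scorer:
    """Hypothetical stand-in for one trained first-stage SVM regressor."""
    return lambda x: 1.0 / (1.0 + math.exp(-(sum(a * b for a, b in zip(w, x)) + bias)))


def stage_one(pairs: Sequence[Vector],
              scorers: Sequence[Sequence[Scorer]]) -> List[float]:
    """scorers[i][a] scores action a on the i-th feature pair.

    Returns len(pairs) * N_A confidences concatenated into one vector.
    """
    return [score(pair) for pair, row in zip(pairs, scorers) for score in row]


def stage_two(confidences: Vector, weights: Sequence[Vector]) -> int:
    """Stand-in N_A-class decision: index of the highest linear score."""
    scores = [sum(w * c for w, c in zip(row, confidences)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy run with 2 feature pairs and N_A = 2 actions (the text uses 27 pairs).
pairs = [(1.0, 0.0), (0.0, 1.0)]
scorers = [
    [linear_scorer((4.0, 0.0)), linear_scorer((-4.0, 0.0))],
    [linear_scorer((0.0, 4.0)), linear_scorer((0.0, -4.0))],
]
confidences = stage_one(pairs, scorers)  # 2 * 2 = 4 values
action = stage_two(confidences, [(1, 0, 1, 0), (0, 1, 0, 1)])
```

The point of the design is that the second stage only ever sees one confidence per (feature pair, action) combination, rather than the raw feature pairs themselves.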

In this way, 27 GPR pairs are generated from the 27 spatiotemporal regions that stMB defines. These could simply be concatenated into a single feature vector, but the result would have a very high dimension, and a large amount of training data would be needed for the classifier to reach sufficient accuracy. The identification unit 14a therefore uses two stages of SVM classifiers with different input features, so that each stage performs identification on low-dimensional features while the final identification accuracy is improved.
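To make the difference in dimension concrete, the rough arithmetic below assumes that each GPR concatenates one global and one local VLAD vector of $16 \times 54 = 864$ dimensions each (the example sizes from the first embodiment). The concatenation and the symbols are assumptions made for illustration only.

```latex
\begin{align*}
D_{\mathrm{GPR}} &= 2 \times 864 = 1728 \\
D_{\mathrm{simple\ concatenation}} &= 27 \times D_{\mathrm{GPR}} = 46656 \\
D_{\mathrm{stage\ 1\ input}} &= D_{\mathrm{GPR}} = 1728 \\
D_{\mathrm{stage\ 2\ input}} &= 27 \times N_A
\end{align*}
```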

[Experimental Results]
FIG. 7 is a diagram showing the results of an evaluation experiment comparing the identification accuracy of the identification device 1 according to the first embodiment and the identification device 1a according to the second embodiment with that of conventional methods.
It shows the results of comparing three existing methods (BoF with k = 16 codewords, BoF with k = 1000 codewords, and VLAD with k = 16 codewords) with the two new methods (stMB and GPR). The number of codewords for the two new methods is 16.

Broadcast video of two games was used for the evaluation. The actions to be identified were four types (jump, layup shot, jump shot, and set shot) plus "no target action". The classifiers were trained on one hour of video and evaluated on the remaining hour. Feature values obtained between the start and end of each action were used as positive examples, and feature values from periods in which no action occurred were used as negative examples. The occurrence timing of each action (ground truth) was annotated manually by visual inspection.
Accuracy was evaluated by the F-measure (%), the harmonic mean of precision (the proportion of identification results that are correct) and recall (the proportion of all correct instances that were identified correctly).
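As a minimal sketch of this metric, the following Python function computes the F-measure in percent from detection counts. The function name and the example counts are hypothetical and are not taken from the experiment.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """F-measure in percent: harmonic mean of precision and recall."""
    detected = true_positives + false_positives   # everything the classifier reported
    actual = true_positives + false_negatives     # everything in the ground truth
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 correct detections, 2 false alarms, 4 missed actions.
example_f = f_measure(8, 2, 4)
```

With these counts precision is 8/10 and recall is 8/12, so the harmonic mean penalizes the lower recall more than an arithmetic mean would.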

When VLAD was described using stMB (first embodiment), a higher overall identification accuracy (Total) was obtained than with the three conventional methods. This shows the effectiveness of a description method that takes the position of the trajectory features into account (first embodiment). In sports video in particular, specific actions (for example, a shot) tend to occur in specific regions of the screen, so using position information is highly effective. In addition, varying the scale in the time direction absorbed individual differences in the speed of an action, which further improved accuracy.
For some individual actions the existing VLAD is more accurate, but high accuracy can be expected by selectively using whichever method is more accurate for the identification target.

When identification was performed with the two-stage SVM using feature pairs (GPR) (second embodiment), accuracy improved further. In the basketball video used for the evaluation, camera operation followed certain patterns, such as panning the camera in the opposite direction immediately after a goal. Because GPR learns camera operation patterns keyed to human actions, high identification accuracy was obtained.

As described above, the identification device 1 uses stMB to create a feature expression that takes position and time information into account. As a result, the identification device 1 is robust to temporal and spatial variation in actions and can detect specific actions performed by a person in video content with high accuracy.
This method is broadly applicable to all kinds of video, such as broadcast video, video on the Internet, and surveillance video, and can be used for action-keyed search, monitoring for specific actions, and the like. It is particularly suited to the analysis of sports broadcast video, in which camera operation follows many fixed patterns.

Because the identification device 1 can efficiently express the features of an action shown in the video by using appearance, motion, and shape features, it improves the accuracy of action identification.

By providing multiple temporal scales in stMB, the identification device 1 can accommodate individual differences in the speed of an action, which improves the accuracy of action identification.

By using feature pairs (GPR) to generate a feature expression that includes camera motion information from the large region, the identification device 1a improves the accuracy of action identification.

By identifying a feature expression that combines stMB and GPR with a two-stage classifier, the identification device 1a improves the accuracy of detecting human actions.
Furthermore, if the histograms obtained from all regions were simply concatenated, the final number of dimensions would be very large, for example 54 (trajectory feature dimensions) × 16 (codewords) × 9 (spatial divisions) × 3 (temporal divisions) = 23328, and sufficient accuracy could not be expected from a small amount of training data. The two-stage identification method instead identifies human actions accurately with a small number of dimensions.
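The dimension count above can be checked with a few lines of Python. The first three quantities follow directly from the figures in the text. The last two rest on assumptions that the text does not state: that a feature pair is the concatenation of one partial and one global expression of equal size, and that the first stage passes one score per feature pair to the second stage. The constant names are illustrative.

```python
# Figures given in the text.
TRAJECTORY_DIM = 54        # dimensions of one trajectory feature
CODEWORDS = 16             # number of codewords
SPATIAL_DIVISIONS = 9      # spatial divisions of stMB
TEMPORAL_DIVISIONS = 3     # temporal divisions of stMB

regions = SPATIAL_DIVISIONS * TEMPORAL_DIVISIONS      # 27 spatio-temporal regions
per_region_dim = TRAJECTORY_DIM * CODEWORDS           # one region's feature expression
concatenated_dim = per_region_dim * regions           # all regions placed side by side

# Assumption: a feature pair concatenates a partial and a global expression.
pair_dim = 2 * per_region_dim
# Assumption: one first-stage score per feature pair is fed to the second stage.
second_stage_dim = regions

print(regions, per_region_dim, concatenated_dim, pair_dim, second_stage_dim)
```

Under these assumptions, neither stage handles a vector anywhere near the size of the full concatenation, which is the point the paragraph makes about training with limited data.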

Embodiments of the present invention have been described above, but the present invention is not limited to these embodiments. The effects described in the embodiments are merely a list of the most favorable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiments.

The number of spatio-temporal divisions in stMB can be set as appropriate.
The identification device may also be able to switch between groups of classifiers according to the type of target video (for example, basketball or American football), in which case the actions to be identified are set individually for each classifier.

The embodiments mainly describe the configuration and operation of the identification device, but the present invention is not limited to this and may be configured as a method or a program, comprising the respective components, for identifying an action in a video.

Furthermore, the functions of the identification device may be realized by recording a program for realizing them on a computer-readable recording medium and causing a computer system to read and execute the program recorded on that medium.

The "computer system" here includes an OS and hardware such as peripheral devices. A "computer-readable recording medium" refers to a portable medium such as a flexible disk, magneto-optical disk, ROM, or CD-ROM, or to a storage device such as a hard disk built into a computer system.

A "computer-readable recording medium" may further include a medium that holds the program dynamically for a short time, such as a communication line used when the program is transmitted over a network such as the Internet or over a communication line such as a telephone line, and a medium that holds the program for a certain period of time, such as volatile memory inside a computer system serving as the server or client in that case. The program may realize only part of the functions described above, or may realize those functions in combination with a program already recorded in the computer system.

DESCRIPTION OF SYMBOLS
1, 1a Identification device
10, 10a Control unit
11 Video dividing unit
12 Feature description unit
13, 13a Feature expression unit
14, 14a Identification unit

Claims

An identification device for identifying an action indicated by a video, comprising:
a video dividing unit that divides the video into spatio-temporal subregions;
a feature description unit that, for each of the spatio-temporal subregions, calculates local feature amounts relating to the trajectories of feature points in the region and describes them with a predetermined number of dimensions;
a feature expression unit that statistically processes the set of local feature amounts to obtain a feature expression; and
an identification unit that identifies the action with a predetermined classifier based on the set of feature expressions for the respective spatio-temporal subregions,
wherein the video dividing unit divides the video so as to include a plurality of spatio-temporal subregions that overlap each other and have different time scales.

An identification device for identifying an action indicated by a video, comprising:
a video dividing unit that divides the video into spatio-temporal subregions;
a feature description unit that, for each of the spatio-temporal subregions, calculates local feature amounts relating to the trajectories of feature points in the region and describes them with a predetermined number of dimensions;
a feature expression unit that statistically processes the set of local feature amounts to obtain a feature expression; and
an identification unit that identifies the action with a predetermined classifier based on the set of feature expressions for the respective spatio-temporal subregions,
wherein the feature expression unit creates a feature pair that combines a partial feature expression obtained from the spatio-temporal subregion with a global feature expression obtained from a large region that contains the spatio-temporal subregion and has the same time scale, and
the identification unit identifies the action based on the set of feature pairs.

The identification device according to claim 2, wherein the video dividing unit divides the video so as to include a plurality of spatiotemporal small regions that overlap each other and have different time scales.

The identification device described in 1, wherein the identification unit performs a first identification process that takes each of the feature pairs as input, followed by a second identification process that takes the combination of the results of the first identification process as input.

The identification device according to claim 1, wherein the feature description unit calculates feature amounts representing a motion feature indicating the distribution of the direction and intensity of motion vectors along the trajectory of a feature point, an appearance feature indicating the color distribution around the trajectory, and a shape feature indicating the time length and moving distance of the trajectory.

An identification program for causing a computer to identify an action indicated by a video, the program causing the computer to execute:
a video dividing step of dividing the video into spatio-temporal subregions;
a feature description step of, for each of the spatio-temporal subregions, calculating local feature amounts relating to the trajectories of feature points in the region and describing them with a predetermined number of dimensions;
a feature expression step of statistically processing the set of local feature amounts to obtain a feature expression; and
an identification step of identifying the action with a predetermined classifier based on the set of feature expressions for the respective spatio-temporal subregions,
wherein, in the video dividing step, the video is divided so as to include a plurality of spatio-temporal subregions that overlap each other and have different time scales.

An identification program for causing a computer to identify an action indicated by a video, the program causing the computer to execute:
a video dividing step of dividing the video into spatio-temporal subregions;
a feature description step of, for each of the spatio-temporal subregions, calculating local feature amounts relating to the trajectories of feature points in the region and describing them with a predetermined number of dimensions;
a feature expression step of statistically processing the set of local feature amounts to obtain a feature expression; and
an identification step of identifying the action with a predetermined classifier based on the set of feature expressions for the respective spatio-temporal subregions,
wherein, in the feature expression step, a feature pair is created that combines a partial feature expression obtained from the spatio-temporal subregion with a global feature expression obtained from a large region that contains the spatio-temporal subregion and has the same time scale, and
in the identification step, the action is identified based on the set of feature pairs.