JP2018181273A

JP2018181273A - Image processing apparatus, method thereof, and program

Info

Publication number: JP2018181273A
Application number: JP2017084778A
Authority: JP
Inventors: 敬正角田; Norimasa Kadota
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2017-04-21
Filing date: 2017-04-21
Publication date: 2018-11-15

Abstract

PROBLEM TO BE SOLVED: To recognize behavior of a plurality of persons from a video image.
SOLUTION: An image processing apparatus includes: acquisition means of acquiring a video image including time-series still images; detection means of detecting one or more objects in each still image from the video image; feature quantity extraction means of extracting feature quantities corresponding to each of the objects from the still images; object integration means of integrating feature quantities corresponding to each of the objects in the still images; time-series integration means of integrating the feature quantities integrated in the still images, for the time-series still images; and identification means of identifying behavior of the object in the video image, on the basis of the feature quantities integrated for the time-series still images.
SELECTED DRAWING: Figure 2

Description

The present invention relates to a technique for recognizing the behavior of an object such as a person from a moving image, and in particular to a technique for recognizing an action performed by a plurality of persons.

Recognition of human behavior by video analysis (hereinafter, action recognition) is useful in applications such as surveillance, marketing, and sports analysis. Examples include surveillance applications that detect a specific action of a subject in a video based on the action recognition result, video search keyed on an action label indicating the type of action, counting of actions of interest for marketing, and stats analysis of sports video.

Video captured for such purposes commonly shows many people in a single frame at the same time. As an action recognition method for such video, Patent Document 1 discloses a technique that stably recognizes the individual actions of multiple persons in a video, even in crowded scenes where many people appear and occlusion is frequent, by generating long trajectory information using motion prediction of each person.

Furthermore, in surveillance and sports analysis, recognizing coordinated actions that involve multiple persons in a video is expected to enable applications with high added value. For example, the situation of multiple people in a video can be annotated for an observer in more intuitive and easily understood terms (e.g., "a queue has formed" or "an argument is taking place"). Other examples include detecting crimes committed by several people acting together and, in sports, analyzing team play.

Non-Patent Document 1 recognizes the cooperative behavior of multiple persons. It first recognizes primitive individual actions, such as whether each person is standing or walking. It then uses a hierarchical graphical model to recognize the interaction between two persons (facing each other, standing in a line, etc.) and the collective behavior of the whole group (talking, walking side by side, gathering, etc.).

Patent Document 1: Japanese Patent No. 5285575

Non-Patent Document 1: Wongun Choi and Silvio Savarese, "A unified framework for multi-target tracking and collective activity recognition", ECCV 2012

As described above, recognition of actions involving multiple people is expected to find wide application in surveillance and sports. Patent Document 1 recognizes the individual actions of multiple persons, but does not recognize an action that multiple persons perform in cooperation. In Non-Patent Document 1, the individual action of a person is recognized by extracting spatio-temporal feature quantities computed from a cuboid in space-time and then using the discrimination scores obtained from a separately trained classifier. Because the feature quantity itself spans an interval of time, the identification result is temporally coarse. In addition, the feature extractor, the classifier, and the graphical model are independent modules, so they could not be optimized jointly from end to end.

According to one aspect of the present invention, an image processing apparatus comprises: acquisition means for acquiring a moving image including time-series still images; detection means for detecting one or more objects in each still image of the moving image; feature quantity extraction means for extracting, from the still image, a feature quantity corresponding to each of the objects; object integration means for integrating, within the still image, the feature quantities corresponding to the respective objects; time-series integration means for integrating, across the time-series still images, the feature quantities integrated within each still image; and identification means for identifying the behavior of the objects in the moving image based on the feature quantities integrated across the time-series still images.

According to the present invention, feature quantities representing the individual motion of each person appearing in each frame of a moving image are extracted, the feature quantities of the multiple persons are integrated, and the result is further integrated over time. This makes it possible to accurately identify an action whose meaning is given by the individual actions of multiple persons.

A diagram showing an example of a camera arrangement and examples of still images captured by two cameras.
A diagram showing an example of the system configuration at the time of recognition.
A flowchart showing an example of processing at the time of recognition.
A diagram showing an example of a person detection result and an example of the result of sorting the detected person areas.
A diagram showing an example of person detection results in two frames and an example of the result of sorting the detected person areas.
A diagram showing an example of the neural network structure used in processing at the time of recognition and learning.
A diagram in which the neural network shown in FIG. 6 is unrolled.
A diagram explaining the person series integration and time series integration realized by controlling the neural network shown in FIG. 7.
A diagram showing a neural network equivalent to the controlled neural network shown in FIG. 8.
A diagram showing an example of the system configuration at the time of learning.
A flowchart showing an example of processing at the time of learning.
A diagram showing a neural network structure that uses the person areas in a frame and the coordinate data of the persons.
A diagram showing an example of the neural network structure used in processing at the time of recognition and learning.
A diagram showing an example of person detection results for two frames captured at the same time by two cameras and an example of the result of sorting the detected person areas.
A diagram showing an example of the neural network structure used in processing at the time of recognition and learning.
A diagram showing an example of the system configuration at the time of recognition (part 2).
A diagram showing an example of the neural network structure used in processing at the time of recognition and learning.
A diagram showing the control state of the LSTM.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

(First Embodiment)
The first embodiment describes, taking futsal as an example, a method of recognizing from a moving image an action label whose meaning is given by the motions of multiple players.

FIG. 1 illustrates the moving images assumed in the present embodiment. In FIG. 1(a), reference numeral 102 denotes a futsal court and 103 to 112 denote cameras. X, Y, and Z denote the X, Y, and Z axes of three-dimensional coordinates (world coordinates) whose origin is defined at 113. As shown in FIG. 1(a), a plurality of cameras are arranged around the futsal court and capture moving images. FIG. 1(b) shows an example of one frame (still image) 201 of the futsal moving image acquired by the camera 103, in which 202 denotes the ball and 203 to 209 denote persons A to G, respectively. The moving image thus contains, as objects, a plurality of persons (including players and referees) and a ball. FIG. 1(c) shows an example of one frame 301 acquired by the camera 104, in which 302 denotes the ball and 303 to 311 denote persons A to K, respectively. The ball 302 and persons A to E (303 to 307) in FIG. 1(c) are the same object and persons as the ball 202 and persons A to E (203 to 207) in FIG. 1(b).

Each of the cameras 103 to 112 is assumed to have been calibrated, so that its intrinsic and extrinsic parameters are available. That is, a stereo method can be applied to the images acquired by the cameras, and corresponding points can be projected onto world coordinates whose origin is a reference point on the court. In the present embodiment, the center mark of the court is the origin 113, and a three-dimensional space with X, Y, and Z axes is set as the world coordinates. Combined with detection of the persons and the ball in the images, this allows the positions of the persons and the ball in world coordinates to be obtained.

In this futsal action recognition, five categories of action label are recognized: pass, shoot, dribble, keep, and clear. These action labels cannot be recognized from the motion of one individual alone; they must be identified by combining the individual motions of the multiple persons involved with the ball.

FIG. 2(a) shows the functional configuration of an action recognition apparatus 1000, which is the image processing apparatus of the present embodiment. The action recognition apparatus 1000 includes a moving image acquisition unit 1001, a person and object detection unit 1002, a person area extraction unit 1003, a person area sorting unit 1004, an integration control signal generation unit 1005, and an image feature extraction unit 1006. It further includes a person series integration unit 1007, a time series integration unit 1008, and an action label identification unit 1009. Details of these functions are described later with reference to FIG. 3 and other figures.

FIG. 3(a) is a flowchart showing an example of processing at the time of recognition in the present embodiment. An overview of the entire process is given below with reference to this flowchart.

First, in S1001, the moving image acquisition unit 1001 acquires a frame sequence of a moving image consisting of a plurality of still images. In S1002, the person and object detection unit 1002 detects the position and size of each person and of the ball appearing in the frames acquired in S1001. When a frame contains multiple persons, a position and size are detected for each of them. In S1003, the person area extraction unit 1003 acquires the area of each person (hereinafter simply called a person area) based on the position and size of the person detected in S1002. In S1004, the person area sorting unit 1004 sorts the person areas acquired in S1003 based on the positions of the ball and the persons detected in S1002. The sorting is described in detail later.

In S1005, the integration control signal generation unit 1005 generates, based on the number of persons present in each frame as detected in S1002, signals that control the person series integration unit 1007 and the time series integration unit 1008 described later. In S1006, the image feature extraction unit 1006 extracts an image feature quantity for each person area acquired in S1003. When a frame contains multiple persons, an image feature quantity is extracted for the area of each person. The image feature quantity used is described later. The extracted image feature quantity is called a person feature quantity.

In S1007, the person series integration unit 1007 performs object integration, which integrates the person feature quantities acquired in S1006, one for each person in the frame. Details of this integration are described later. The result is a multi-person feature quantity. In S1008, the time series integration unit 1008 further integrates over time the multi-person feature quantities obtained in S1007 for the respective frames acquired in S1001. Details of this integration are described later. The result is a multi-frame, multi-person feature quantity. In S1009, the action label identification unit 1009 identifies an action label based on the multi-frame, multi-person feature quantity integrated in S1008. This is done for each frame. The classifier used is described later.

Next, each process is described more specifically, following the flowchart shown in FIG. 3(a).

In S1001, moving images captured by the multiple cameras arranged as in FIG. 1(a) are acquired. In the present embodiment, however, the multi-view moving images are used only in the person and object detection of S1002; the other steps use a plurality of frames captured by one of the cameras. The present embodiment assumes a camera with Full HD resolution (1920 × 1080 pixels) and a frame rate of about 30 frames per second, and acquires 30 consecutive frames (one second). The functions of the present embodiment are not significantly impaired if the frames are acquired under different conditions, such as taking every few frames, using a slower camera, or covering a longer time, as long as the difference is within a factor of a few. The moving images captured by the cameras may be acquired directly, or they may be stored in an external storage device and the required frames acquired from there.

Next, in S1002, person detection and ball detection are performed on each frame of the multi-camera moving images acquired in S1001, and the detection results for the frames of the multiple cameras are used to obtain the final positions of the persons and the ball and the person areas. Person detection and ball detection in a single frame may use a known object detection method such as AdaBoost. For person detection, a detector trained specifically to detect a person's face may also be used.

Then, by searching for corresponding points between the detection result in a frame of one camera at a given instant and the detection result in a frame of another camera, and applying the stereo method as described above, the position of each detection in world coordinates (in the real world) can be obtained. The corresponding-point search may use known techniques: features such as ORB, corner detection such as FAST, a metric such as the Hamming distance, and an approximate nearest neighbor search such as a kd-tree. As a result, the positions of the ball and of each person's face are obtained as three-dimensional positions (X, Y, Z) in world coordinates. This processing is performed for each frame acquired in S1001, so that the positions of the ball and of each person's face are obtained for every frame.
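The Hamming-distance part of the corresponding-point search can be illustrated with a minimal sketch. It assumes binary descriptors stored as Python integers and uses an exhaustive search in place of the approximate nearest neighbor search mentioned in the text; the function names, the toy 8-bit descriptors, and the `max_distance` threshold are all hypothetical and not taken from the embodiment.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")


def match_descriptors(desc_a, desc_b, max_distance=2):
    """For each descriptor in desc_a, find the closest descriptor in desc_b.

    Returns (index_a, index_b, distance) tuples for matches whose distance
    does not exceed max_distance (an illustrative threshold).
    """
    matches = []
    for i, da in enumerate(desc_a):
        best_j, best_d = None, None
        for j, db in enumerate(desc_b):
            d = hamming(da, db)
            if best_d is None or d < best_d:
                best_j, best_d = j, d
        if best_j is not None and best_d <= max_distance:
            matches.append((i, best_j, best_d))
    return matches


# Toy 8-bit descriptors for detections seen by two cameras (made-up values).
camera_103 = [0b10110010, 0b00001111]
camera_104 = [0b00001110, 0b10110011, 0b11110000]
matches = match_descriptors(camera_103, camera_104)
```

Each matched pair would then be passed to the stereo method to obtain a world-coordinate position; that step depends on the camera parameters and is not sketched here.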

Next, in S1003, the area of each person in the frame (person area) is acquired as a bounding box from the world-coordinate face position of that person acquired in S1002. A bounding box is a rectangular area in a frame specified by four parameters: a position (X, Y) and a size (W, H). The bounding box is set so as to cover the person, with a margin of several meters to the front, back, left, and right based on the height of the person.

FIG. 4(a) depicts the results of person detection and ball detection on the frame of FIG. 1(b). Reference numeral 402 denotes the detected ball position, and 403 to 409 denote the person areas of persons A to G, respectively. A bounding box is set at each person detection position, and the position of the ball is also shown.

Next, each person area is cut out of the original frame (size: 1920 × 1080 pixels) and resized to a fixed size. Because person areas vary in size, the person area from the original frame is enlarged in some cases and reduced in others. Bicubic interpolation is applied for enlargement and nearest neighbor interpolation for reduction. In the present embodiment the resized person area is 256 × 256 pixels, and hereinafter the resized person area is simply called a person area. This processing is performed for each frame, yielding the person areas for every frame.
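The cropping and the reduction path can be sketched in a few lines of plain Python. This is only an illustration on a toy image stored as nested lists: the helper names are hypothetical, bicubic enlargement is left to an image library, and treating a box as "enlarged" only when both of its sides are below the target size is an assumption the text does not state.

```python
def crop(image, x, y, w, h):
    """Cut the bounding box (x, y, w, h) out of an image stored as rows of pixels."""
    return [row[x:x + w] for row in image[y:y + h]]


def resize_nearest(image, out_w, out_h):
    """Nearest neighbor resize: each output pixel copies the source pixel it maps to."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def interpolation_for(box_w, box_h, target=256):
    """Name the interpolation to use; the both-sides-smaller rule is an assumption."""
    return "bicubic" if box_w < target and box_h < target else "nearest"


# 4x4 toy frame whose pixel value encodes its position (10 * row + column).
frame = [[10 * r + c for c in range(4)] for r in range(4)]
region = crop(frame, x=0, y=0, w=4, h=4)
small = resize_nearest(region, out_w=2, out_h=2)
```

In practice the same decision would select the interpolation mode passed to a library resize routine with a 256 × 256 output.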

Next, in S1004, the person areas acquired in S1003 are sorted in descending order of the distance between the ball and each person. The distance is the Euclidean distance between the person and the ball, computed from their three-dimensional positions in world coordinates acquired in the person and object detection step S1002.

FIG. 4(b) shows the result of applying the resizing of S1003 and the sorting of this step to one frame. Reference numerals 502 to 508 denote the resized person areas of persons A to G. The person areas of various sizes in FIG. 4(a) are resized to a uniform resolution and sorted in descending order, from the area 502 of person G, who is farthest from the ball, to the area 508 of person A, who is closest. This processing is performed for each frame, yielding the sorted person areas for every frame. The per-frame sort results are then concatenated in frame order to obtain series data of person areas arranged in one dimension. The results are concatenated in frame order even when the number of persons differs from frame to frame. This series data is hereinafter called a person area series.

FIG. 5 shows an example in which the person areas of two frames containing different numbers of persons are sorted and concatenated. In FIG. 5(a), 601 is the frame at time I, corresponding to the first frame, and 602, 603, and 604 are the person areas of persons A, B, and C, respectively. In FIG. 5(b), 701 is the frame at time II, corresponding to the second frame, and 702 and 703 are the person areas of persons A and B, respectively. The person area 602 in FIG. 5(a) and the person area 702 in FIG. 5(b) correspond to the same person A, and the person area 603 in FIG. 5(a) and the person area 703 in FIG. 5(b) correspond to the same person B. Three persons A, B, and C are present in FIG. 5(a), and two persons A and B are present in FIG. 5(b). In this case, the person areas are sorted by distance within each frame, and the two person areas of the second frame are concatenated after the three person areas of the first frame. The result is the person area series shown in FIG. 5(c).

In FIG. 5(c), 802, 803, and 804 are the person areas of persons C, B, and A in the first frame, respectively, and 805 and 806 are the person areas of persons B and A in the second frame, respectively. The person area series (802 to 806) is thus one-dimensional series data. Although the sort criterion here is the Euclidean distance between each person and the ball, another criterion may be used. For example, when the person detection results carry a person ID that is consistent across all frames, the person areas may be sorted in ascending order of that ID.
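The sorting and concatenation described above can be sketched as follows. The world coordinates are made-up values chosen so that the output reproduces the order of FIG. 5(c); the function names and the `(frame, label)` representation of a person area are illustrative assumptions, not part of the embodiment.

```python
import math


def sort_frame(ball, persons):
    """Sort one frame's persons by Euclidean distance to the ball, farthest first.

    ball is an (X, Y, Z) world position; persons maps a label to its (X, Y, Z).
    """
    return sorted(persons, key=lambda name: math.dist(ball, persons[name]), reverse=True)


def build_sequence(frames):
    """Concatenate the per-frame sort results in frame order into one 1-D sequence."""
    sequence = []
    for index, (ball, persons) in enumerate(frames, start=1):
        sequence.extend((index, name) for name in sort_frame(ball, persons))
    return sequence


# Made-up world coordinates in meters: frame 1 has three persons, frame 2 has two.
frames = [
    ((0.0, 0.0, 0.0), {"A": (1.0, 0.0, 1.7), "B": (4.0, 1.0, 1.7), "C": (9.0, 3.0, 1.7)}),
    ((1.0, 0.0, 0.0), {"A": (1.5, 0.0, 1.7), "B": (5.0, 1.0, 1.7)}),
]
sequence = build_sequence(frames)
```

The resulting sequence lists persons C, B, A of the first frame followed by persons B, A of the second frame, so the number of entries per frame is free to vary.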

In the present embodiment, steps S1006 to S1009 are realized by a neural network structure that combines a convolutional neural network, a recurrent neural network, and a softmax classifier. Hereinafter, a convolutional neural network is also called a CNN, and a recurrent neural network is also called an RNN. The present embodiment uses an LSTM (long short-term memory), which is a type of RNN, as the RNN. In S1005, signals that control the recurrent neural network are generated.

FIG. 6 shows an overview of the network structure of the neural network that executes steps S1006 to S1009. The details of S1006 to S1009 are described first with reference to FIG. 6, followed by a description of the control signals generated in S1005 using a more concrete example.

The network structure 901 in FIG. 6 has the following modules: an input 902, a CNN 903, LSTM1 (904), LSTM2 (905), an FC 906, and a Softmax 907. The image feature extraction in S1006 is realized by the CNN 903, a neural network composed of many layers that is used for image recognition. The intermediate layers of a CNN are known to extract primitive geometric features such as lines, points, and patterns at lower levels and, at higher levels, complex features corresponding to parts and to objects made up of parts. The following paper by Donahue et al. shows that highly accurate classification is possible by applying the intermediate-layer features of a CNN trained on large-scale data to a different classification task.
J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell, "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition", arXiv 2013
In S1006, each person area in the person area series created in S1004 is input to the CNN 903 to obtain an image feature quantity. The CNN feature quantities may be taken from a plurality of intermediate layers, or only the feature quantities of some intermediate layers may be used.

Next, the person series integration processing in S1007 is realized by LSTM1 (904). LSTM (long short-term memory) is a type of recurrent neural network. In a recurrent neural network, in general, the current input vector $x_t$ and the hidden state vector of the previous step $h_{t-1}$ are input to the network, and the current hidden state vector $h_t$ is calculated and output. An LSTM is a neural network that internally controls input, forgetting, and output. Following the notation of the paper by Donahue et al. cited below, it has four gates (input gate, forget gate, output gate, input modulation gate) and a cell unit. By controlling input, forgetting, and output according to the input $x_t$ at a given time and the previous hidden state $h_{t-1}$, it can discriminate complex short-term and long-term time-series patterns.
Donahue J. et al., "Long-term recurrent convolutional networks for visual recognition and description", CVPR 2015
The hidden state $h_t$, the cell $c_t$, and the gate outputs $i_t$, $f_t$, $o_t$ that control input, forgetting, and output are updated from the input $x_t$ according to the following equation (1).
$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
g_t &= \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
c_t &= f_t \cdot c_{t-1} + i_t \cdot g_t \\
h_t &= o_t \cdot \tanh(c_t)
\end{aligned} \tag{1}
$$
Here, $\sigma()$ is the sigmoid function, $\tanh()$ is the hyperbolic tangent function, $g_t$ is the input to the cell, and $\cdot$ denotes the element-wise product. $W_{xi}$, $W_{hi}$, $W_{xf}$, $W_{hf}$, $W_{xo}$, $W_{ho}$, $W_{xc}$, $W_{hc}$ and $b_i$, $b_f$, $b_o$, $b_c$ are the weights and biases of the input gate, forget gate, output gate, and input modulation gate.
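As an illustration of equation (1), the following is a minimal sketch of one LSTM update in plain Python. The function names (`sigmoid`, `affine`, `lstm_step`) and the `params` layout are hypothetical choices made for this example and do not appear in the original description; vectors are plain lists and matrices are lists of rows.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W_x, x, W_h, h, b):
    """Return W_x x + W_h h + b for list-of-rows matrices."""
    return [
        sum(wx * xj for wx, xj in zip(row_x, x))
        + sum(wh * hj for wh, hj in zip(row_h, h))
        + bk
        for row_x, row_h, bk in zip(W_x, W_h, b)
    ]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM update following equation (1).

    params maps a gate name ("i", "f", "o", "c") to a tuple (W_x, W_h, b).
    This layout is an assumption made for the sketch.
    """
    def pre(name):
        W_x, W_h, b = params[name]
        return affine(W_x, x_t, W_h, h_prev, b)

    i_t = [sigmoid(z) for z in pre("i")]    # input gate
    f_t = [sigmoid(z) for z in pre("f")]    # forget gate
    o_t = [sigmoid(z) for z in pre("o")]    # output gate
    g_t = [math.tanh(z) for z in pre("c")]  # input modulation gate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

With all weights and biases set to zero, every sigmoid gate evaluates to 0.5 and $g_t$ to 0, so the cell is simply halved at each step, which gives a quick sanity check of the implementation.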

The LSTM described above has two states that can be controlled externally (hereinafter, control states). In this embodiment, these two control states are called "update" and "reset".

In this embodiment, operating the LSTM according to equation (1) is called "update". "Reset" is realized by forcing the output of the forget gate to $f_t = 0$ with an external signal, which gives the control state expressed by the following equation (2).
$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
g_t &= \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
c_t &= i_t \cdot g_t \\
h_t &= o_t \cdot \tanh(c_t)
\end{aligned} \tag{2}
$$

The person area series created in S1004 is passed through the CNN 903 to extract feature amounts (person feature amounts), which are input to LSTM1 (904). The person feature amounts are input to LSTM1 (904) recursively, and its hidden state is updated. The control state is set to "reset" at the start of the series for each frame and switched to "update" otherwise, so that the person feature amounts are integrated frame by frame. The switching of the control state is described later using a more specific example.

Next, the time-series integration processing in S1008 is realized by LSTM2 (905). In addition to "update" and "reset", LSTM2 (905) has one more control state, "hold". Relative to the LSTM update of equation (1), forcing the values of the forget gate, input gate, and output gate to $f_t = 1$, $i_t = 0$, $o_t = 1$ gives the following equation (3).
$$
\begin{aligned}
i_t &= 0 \\
f_t &= 1 \\
o_t &= 1 \\
g_t &= \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
c_t &= f_t \cdot c_{t-1} + i_t \cdot g_t \\
h_t &= o_t \cdot \tanh(c_t)
\end{aligned} \tag{3}
$$
In equation (3), regardless of the input to the LSTM,
$$
\begin{aligned}
c_t &= c_{t-1} \\
h_t &= \tanh(c_{t-1})
\end{aligned}
$$
This is a state in which the hidden state $h_t$ and the cell $c_t$ do not change for any input. In this embodiment, this is the third control state of the LSTM, "hold".
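The three control states can be sketched as an override of the gate values before the cell update. In the sketch below, the gate activations are assumed to have been computed beforehand as in equation (1); the function name `controlled_cell_update` and the string-valued `state` argument are hypothetical and chosen only for illustration.

```python
import math

def controlled_cell_update(i_t, f_t, o_t, g_t, c_prev, state):
    """Apply a control state to precomputed gate values, then update the cell.

    state is "update" (equation (1)), "reset" (equation (2)) or
    "hold" (equation (3)). All vectors are lists of equal length.
    """
    if state == "reset":
        f_t = [0.0] * len(f_t)   # forget gate forced to 0
    elif state == "hold":
        i_t = [0.0] * len(i_t)   # input gate forced to 0
        f_t = [1.0] * len(f_t)   # forget gate forced to 1
        o_t = [1.0] * len(o_t)   # output gate forced to 1
    elif state != "update":
        raise ValueError("unknown control state: %r" % (state,))
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

Under "hold" the returned cell equals `c_prev` whatever the gate inputs are, and under "reset" the previous cell contributes nothing to the result.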

In this step, while LSTM1 (904) integrates the person feature amounts, LSTM2 (905) is switched to "hold" so that its state (hidden state $h_t$, cell $c_t$) does not change. Then, at the end of the integration of the person feature amounts for each frame, LSTM2 is switched to "update". LSTM2 thereby receives, once per frame, the person feature amounts recursively integrated within that frame, updates its state, and performs temporal integration. The switching of the control state is described later using a more specific example.

Finally, the action label identification processing in S1009 is realized by FC 906 and Softmax 907. FC 906 takes the inner product of a weight matrix with the hidden state $h_t$ of LSTM2 (905) and obtains as many scores (inner-product scores) as there are action labels. Softmax 907 then converts the inner-product scores into probabilities (real numbers from 0 to 1) with the softmax function. This processing yields identification scores expressed as probabilities for the action labels.
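The FC and Softmax stage can be sketched as follows. The function name `fc_softmax` is hypothetical, and the bias term `b` is an assumption of this sketch (the description mentions only a weight matrix). Subtracting the maximum score before exponentiation is a common numerical-stability measure and does not change the resulting probabilities.

```python
import math

def fc_softmax(h_t, W, b):
    """Map a hidden state to one probability per action label.

    W has one row per action label; b is an assumed per-label bias.
    """
    scores = [sum(w * h for w, h in zip(row, h_t)) + bk for row, bk in zip(W, b)]
    m = max(scores)                          # stabilises exp() for large scores
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

For the five futsal labels of this embodiment, `W` would have five rows, and the returned list would hold five probabilities that sum to 1.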

The details of the image feature extraction processing (S1006), person series integration processing (S1007), time-series integration processing (S1008), and action label identification processing (S1009) have been described above with reference to FIG. 6. Next, the control signals created in the integrated control signal creation processing (S1005) are described using a more specific example. In S1005, to realize the integration of the individual motions of a plurality of persons and the time-series integration presented in this embodiment, signals are created that control LSTM1 (904), which performs the person series integration processing (S1007), and LSTM2 (905), which performs the time-series integration processing (S1008).

FIG. 7 shows an example in which the person area series created in S1004 is input to a neural network having the network structure shown in FIG. 6. Here, 1002 is the person area of person C at time I, 1003 is the person area of person B at time I, 1004 is the person area of person A at time I, 1005 is the person area of person B at time II, and 1006 is the person area of person A at time II. These person areas are the same as the person areas in FIG. 5(c). Further, 1007 is the CNN, 1008 is the first-layer LSTM, 1009 is the second-layer LSTM, 1010 is the FC, and 1011 is the Softmax. The modules 1007 to 1011 are the same as the modules 903 to 907 in FIG. 6.

FIG. 7 is also a diagram in which the recurrent neural network is unrolled in the time direction: the vertical lines 1012 represent the propagation paths of signals and errors between units, and the horizontal lines 1013 represent the propagation paths of signals and errors in the time direction. The person area series (1002 to 1006) created in S1004 is input to the CNN 1007 in order from the left of the series.

For a person area series spanning multiple frames (1002 to 1006 in FIG. 7), integration of the person series and integration of the time series are realized by the two LSTMs (1008, 1009). To this end, the control states of the LSTMs ("update", "hold", "reset") are switched by control signals. FIG. 18(a) shows the control states applied to the person area series 1002 to 1006 of FIG. 7.

At the start of the series (n = 1), both LSTMs are set to "reset". LSTM1 integrates the person feature amounts within a frame, and LSTM2 integrates, frame by frame, the person feature amounts integrated by LSTM1. For this purpose, at n = 2 LSTM1 is set to "update" and LSTM2 to "reset", and at n = 3 LSTM1 is set to "update" and LSTM2 to "reset". As a result, LSTM1 integrates the person feature amounts of the first frame, and LSTM2 takes in the first-frame person feature amounts integrated by LSTM1 as input only at n = 3. Although LSTM2 is set to "reset" at n = 1 and n = 2, what matters is that it is reset at n = 3, where the persons of the first frame are integrated; the control state of LSTM2 at n = 1 and n = 2 may be anything. Next, at n = 4, LSTM1 is set to "reset" again and LSTM2 is set to "hold". At n = 5, both LSTM1 and LSTM2 are set to "update", so that LSTM1 integrates the person feature amounts of the second frame and LSTM2 takes in the integrated second-frame person feature amounts as input at n = 5. As a result, LSTM2 integrates its input at n = 3 (first frame) with its input at n = 5 (second frame), which is the integration in the time direction.
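The schedule described above can be derived from the number of person areas in each frame. The following is a minimal sketch under stated assumptions: the function name `make_control_signals` is hypothetical, frames with no detected person are simply skipped, and LSTM2 is set to "reset" throughout the first frame (the description notes that its state before the last step of that frame is arbitrary).

```python
def make_control_signals(persons_per_frame):
    """Return one (lstm1_state, lstm2_state) pair per person area in the series.

    persons_per_frame lists the number of person areas in each frame,
    in the order the frames are fed to the network.
    """
    signals = []
    first_frame_done = False
    for count in persons_per_frame:
        for k in range(count):
            # LSTM1 restarts at the first person area of every frame.
            lstm1 = "reset" if k == 0 else "update"
            is_last = (k == count - 1)
            if not first_frame_done:
                lstm2 = "reset"    # nothing to carry over yet
            elif is_last:
                lstm2 = "update"   # take in the frame's integrated feature
            else:
                lstm2 = "hold"     # keep the state of earlier frames
            signals.append((lstm1, lstm2))
        if count > 0:
            first_frame_done = True
    return signals
```

For the example of FIG. 7, with three person areas at time I and two at time II, `make_control_signals([3, 2])` produces the five pairs listed in the paragraph above for n = 1 to n = 5.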

FIG. 8 shows the propagation paths (1123) of signals and errors when the control of FIG. 18(a) is applied to the network of FIG. 7. The person areas 1102 to 1106 are the same as 1002 to 1006 in FIG. 7. The CNN, LSTM1, LSTM2, FC, and Softmax are the same as CNN 1007, LSTM1 (1008), LSTM2 (1009), FC 1010, and Softmax 1011 in FIG. 7. For LSTM1 and LSTM2, "update" is drawn as a rectangle with a white background (1116, etc.), "reset" as a rectangle with a hatched pattern (1109, etc.), and "hold" as a rectangle with a dot pattern (1114, etc.).

1109 is the "reset" of LSTM1 at n = 1 in FIG. 18(a), and 1112 is the "reset" of LSTM2 at n = 1. 1110 is the "update" of LSTM1 at n = 2, and 1113 is the "reset" of LSTM2 at n = 2. 1124 is the "update" of LSTM1 at n = 3, and 1116 is the "reset" of LSTM2 at n = 3. 1111 is the "reset" of LSTM1 at n = 4, and 1114 is the "hold" of LSTM2 at n = 4. 1125 is the "update" of LSTM1 at n = 5, and 1117 is the "update" of LSTM2 at n = 5.

In LSTM1, the person feature amounts of the first frame (time I) are integrated by "reset" 1109, "update" 1110, and "update" 1124. Next, "reset" 1111 resets the internal state once, and "update" 1125 is then applied, so that no signal or error propagates along 1127 and only the person feature amounts of the second frame (time II) are integrated. LSTM2 is set to "reset" 1112, "reset" 1113, and "reset" 1116 so that, at n = 3, it receives a signal only from LSTM1 (1124), in which the person feature amounts of the first frame have been integrated. Next, "hold" 1114 is set to keep the internal state unchanged, and then "update" 1117 is set. In this way, LSTM2 receives the signal of LSTM2 (1116), which took in the integrated person feature amounts of the first frame, and the signal of LSTM1 (1125), in which the person feature amounts of the second frame have been integrated, and performs integration in the time direction.

The internal state of LSTM2 for each frame propagates to the FC and Softmax (1121, 1122), and identification scores are output. The Softmax rectangles with a shaded pattern (1118, 1119, 1120) indicate "ignore", meaning that the error is not evaluated; this is explained in detail in the description of the learning processing.

FIG. 9 shows a network whose structure is equivalent to the network obtained when the control shown in FIG. 8 is applied. The person areas 1202 to 1206 are the same as 1002 to 1006 in FIG. 7. CNN (1207), LSTM1 (1208, 1209), LSTM2 (1210), FC (1211), and Softmax (1212) are the same as the elements of the same names in FIG. 7.

The person C area 1202, person B area 1203, and person A area 1204 of the first frame (time I) are integrated by LSTM1 via the CNN, and the internal state of LSTM1, which integrates the person feature amounts of the first frame, is input to LSTM2 (1210). The internal state of LSTM1 is then reset, the person B area 1205 and person A area 1206 of the second frame (time II) are newly integrated by LSTM1, and the internal state of LSTM1, which integrates the person feature amounts of the second frame, is input to LSTM2 (1213). LSTM2 receives its own internal state from the first frame and the input from LSTM1, and integrates the information of the first frame with that of the second frame. The internal state of LSTM2 for each frame passes through the FC, and the Softmax outputs the identification scores of the action labels. In this way, the network configured as shown in FIG. 7, under the control shown in FIG. 18(a), executes the network of FIG. 9.

The above is the recognition-time processing, which performs action recognition on the 30 frames of video obtained in the moving image acquisition step S1001 based on the motions of a plurality of persons. The recognition-time processing may then be executed in the same way on the next 30 frames, or it may be executed so that some frames overlap. That is, after one run of the recognition-time processing identifies the 1st to 30th frames of a futsal video, the next run may process the 15th to 45th frames. In that case, the results obtained for a frame in the multiple runs are averaged to obtain the final result. With this redundant recognition processing, a frame is recognized multiple times in different series, which makes the result more robust.
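The averaging of overlapping recognition runs can be sketched as below. The function name `average_overlapping_scores` and the input layout (a start frame number paired with one score vector per frame of that run) are assumptions made for this example.

```python
def average_overlapping_scores(windows):
    """Average per-frame score vectors over all runs that cover each frame.

    windows is a list of (start_frame, scores), where scores[k] is the
    identification score vector for frame start_frame + k in that run.
    Returns a dict mapping frame number to the averaged score vector.
    """
    sums = {}
    counts = {}
    for start, scores in windows:
        for offset, vec in enumerate(scores):
            frame = start + offset
            if frame not in sums:
                sums[frame] = [0.0] * len(vec)
                counts[frame] = 0
            sums[frame] = [s + v for s, v in zip(sums[frame], vec)]
            counts[frame] += 1
    return {frame: [s / counts[frame] for s in sums[frame]]
            for frame in sorted(sums)}
```

Frames covered by a single run keep their original scores, while frames in the overlap receive the mean of the runs that include them.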

Next, the learning method for the person series integration unit used in the person series integration step, the time-series integration unit used in the time-series integration step, and the action label identification unit used in the action label identification step is described.

FIG. 10 is a diagram showing the functional configuration of the learning device 5000 in this embodiment. The learning device 5000 includes a person area extraction unit 5001, a person area sorting unit 5002, an integrated control signal learning label creation unit 5003, and a parameter optimization unit 5004. The learning device 5000 further includes, as storage units, a learning data holding unit 5005 and a network parameter holding unit 5006.

FIG. 11 is a flowchart showing an example of the learning processing in this embodiment. An outline of each step and the function of each unit shown in FIG. 10 are described here.

In S5001, the person area extraction unit 5001 extracts the areas of the persons present in the frames constituting a moving image, using the moving image and the person detection results stored in the learning data holding unit 5005. This is the same person area extraction processing as S1003 described for the recognition-time processing of this embodiment. The data stored in the learning data holding unit 5005 is described in detail later.

In S5002, the person areas set in S5001 are resized to a uniform size and sorted according to a fixed criterion. This step is the same as the person area sorting processing of S1004 described for the recognition-time processing of this embodiment, so a detailed description is omitted.

In S5003, the integrated control signal learning label creation unit 5003 creates control signals and learning labels based on the number of persons present in each frame detected in S5001 and the action label assigned to the frame. These are used to learn the parameters of the neural networks used by the image feature extraction unit 1006, the person series integration unit 1007, the time-series integration unit 1008, and the action label identification unit 1009 in the recognition-time processing.

In S5004, the parameters of the neural network are optimized, using the person series created in S5002 as input and the learning labels created in the integrated control signal learning label creation step S5003 as target values.

The above steps S5001 to S5004 are repeated for a preset number of iterations N. The final parameters and the parameters at intermediate iterations are stored in the network parameter holding unit 5006.

Next, of the steps in the flowchart shown in FIG. 11, the creation of the integrated control signals and learning labels (S5003) and the parameter update (S5004), which differ from the recognition-time processing, are described more specifically. The data stored in the learning data holding unit 5005 is also described.

The learning data holding unit 5005 stores moving images corresponding to the futsal action labels recognized in this embodiment, their correct labels (action labels), and the person detection results and ball detection results for each frame of the moving images. The action labels are "pass", "shoot", "dribble", "keep", and "clear". Each moving image consists of an arbitrary number of frames, and a correct label is assigned to each frame.

In S5001, 30 consecutive frames are selected at random from a moving image that consists of an arbitrary number of frames and to which an action label is assigned, and person areas are extracted using the person detection results of those frames. The extraction of person areas is the same processing as S1003 in the recognition-time processing.
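The random selection of 30 consecutive frames can be sketched as follows. The function name `sample_clip` and its parameters are hypothetical; the `rng` argument is added only so that the selection can be made reproducible with a seeded generator.

```python
import random

def sample_clip(frames, clip_length=30, rng=random):
    """Return clip_length consecutive frames starting at a random position."""
    if len(frames) < clip_length:
        raise ValueError("moving image is shorter than the clip length")
    # randrange excludes its upper bound, so the clip always fits.
    start = rng.randrange(len(frames) - clip_length + 1)
    return frames[start:start + clip_length]
```

Passing `rng=random.Random(0)`, for example, fixes the sequence of selected clips across runs.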

In S5003, in addition to creating the LSTM control signals as in S1005 of the recognition-time processing of this embodiment, learning labels for training the CNN, the LSTMs, and the Softmax classifier are created. The creation of the LSTM control signals is the same processing as in S1005, so a detailed description is omitted here. This processing yields integrated control signals like those in FIG. 18(a).

The Softmax classifier is given the learning label assigned to the moving image when, among the created LSTM control signals, the signal that sets LSTM2 to "update" occurs. Otherwise, the learning label is set to the "ignore" label. The "ignore" label is a special label that, when set, causes the Softmax loss function not to be evaluated.
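The "ignore" label can be sketched as a mask on the per-step loss. The sketch below assumes that a learning label is attached at the last person area of each frame, which is the step at which LSTM2 takes in that frame's integrated feature, and that all other steps receive the "ignore" label. The names `IGNORE`, `make_learning_labels`, and `masked_cross_entropy` are hypothetical, and labels are represented as class indices.

```python
import math

IGNORE = None  # hypothetical sentinel standing for the "ignore" label

def make_learning_labels(persons_per_frame, frame_labels):
    """Return one label per person area: IGNORE except at each frame's last area."""
    labels = []
    for count, label in zip(persons_per_frame, frame_labels):
        if count == 0:
            continue  # assumption: frames without persons contribute no step
        labels.extend([IGNORE] * (count - 1))
        labels.append(label)
    return labels

def masked_cross_entropy(probabilities, labels):
    """Mean cross-entropy over the steps whose label is not IGNORE."""
    losses = [-math.log(p[y])
              for p, y in zip(probabilities, labels)
              if y is not IGNORE]
    return sum(losses) / len(losses) if losses else 0.0
```

Steps carrying `IGNORE` contribute neither to the loss value nor, in a gradient-based implementation, to the error that is propagated back.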

As shown in the upper part of FIG. 8, the Softmax classifier is given learning labels (1121, 1122) when the person area of person A at time I (1104) and the person area of person A at time II (1106), at which time-series integration is executed, are input. At all other times, the "ignore" label is given (1118, 1119, 1120).

In S5004, the parameter optimization unit 5004 optimizes the parameters of the CNN, the LSTMs, and the Softmax classifier, which correspond to the image feature extraction unit 1006, the person series integration unit 1007, the time-series integration unit 1008, and the action label identification unit 1009.

Here, the integrated control signals created in S5003 are used to set each LSTM to one of the control states of reset, hold, or normal ("update") as appropriate, and the learning labels created at the same time are given to the Softmax classifier. The parameters are optimized by applying the BPTT (Backpropagation Through Time) method described in the following paper by Graves et al.
A. Graves and J. Schmidhuber, "Framewise Phoneme Classification with Bidirectional LSTM Networks", in Proc. International Joint Conference on Neural Networks IJCNN'05
The Softmax classifier uses the cross-entropy error as its loss function; it evaluates the loss function and computes the error when a label other than the ignore label is given. Each LSTM is switched among the three control states of reset, hold, and normal by the control signals, and its parameters are updated in all states other than hold. The CNN uses parameters trained on large-scale data as initial values and is fine-tuned in this step. However, the functions realized by this embodiment are not significantly impaired even if the CNN is not fine-tuned, so the fine-tuning of the CNN may be omitted. In that case, the CNN parameters are fixed to the parameters trained on large-scale data.

This embodiment has described a method of identifying action labels whose meaning is given by the individual actions of a plurality of persons, assuming the recognition of plays in sports such as soccer, futsal, and rugby.

In these sports, actions such as passing the ball around or a tackle cannot be identified from the motion of a single player alone. However, it is not necessary to always handle the motions of all players; such actions can be recognized by a framework that handles the individual motions of the few players involved with the ball. When actions whose meaning is given by the cooperative motions of several persons are recognized by video analysis, a framework that can cope with changes in the number of persons from frame to frame is required, as in Non-Patent Document 1. In addition, for applications to soccer, futsal, and rugby, a framework is required that in principle allows recognition at frame-level time resolution and that can integrate the individual motions of several persons to recognize the overall action. Furthermore, optimizing as a whole the units that realize feature extraction, integration of individual motions, time-series integration, and action label identification can be expected to improve accuracy beyond what is obtained by optimizing each unit separately.

As described above, according to this embodiment, the behavior recognition apparatus 1000 integrates the feature amounts representing the individual motions of a plurality of persons, further integrates them over time, and identifies an action label. This makes it possible to cope with changes in the number of persons from frame to frame, to perform recognition frame by frame, and to identify the overall action label by integrating the individual motions of several persons. Furthermore, optimizing as a whole the neural network that realizes feature extraction, integration of individual motions, time-series integration, and action label identification enables accurate identification of action labels.

(Modification 1 of Embodiment 1)
The first embodiment described a method of obtaining image feature amounts from the person areas of the still images constituting a moving image and identifying an action label. However, in the problem setting of the first embodiment, information other than the image features of the person areas is also available. For example, in the first embodiment, the ball position is detected in the person and object detection of S1002. An area of arbitrary size centered on the ball (hereinafter, the ball area) may therefore be used in addition to the person areas.

In the person and object detection of S1002, the three-dimensional positions of the persons and the ball are also obtained. The coordinate values of a person or the ball may be concatenated to the image feature amount of the corresponding person area or ball area. When the ball area is used in addition to the person areas, the ball area is appended to the end of the person area series obtained from the person detection results of a frame captured by a given camera at a given time. For example, a series is created by appending the ball area to the end of the person area sorting result (501) shown in FIG. 4(b). For frames at other times, the ball area is likewise appended to the end of the person area series to create the image series of partial areas that is input to the network.

In the first embodiment, the person area series input to the network is, as indicated by 1002 to 1006 in FIG. 7,
{time I person C area, time I person B area, time I person A area, time II person B area, time II person A area}
Here, a sequence enclosed in braces, “{x1, x2, x3, ..., xn}”, denotes sequence data input to the network. When the ball area is used in addition to the person areas, the input to the network becomes
{time I person C area, time I person B area, time I person A area, time I ball area, time II person B area, time II person A area, time II ball area}
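As a minimal sketch of how such an input series could be assembled, the following Python function appends the ball area after the person areas of each time step. The function name `build_input_series`, the tuple layout, and the string labels are assumptions made for illustration and do not appear in the embodiment.

```python
def build_input_series(frames):
    """Flatten per-time detections into one input series.

    frames: list of (time_label, person_labels, ball_detected) tuples, where
    person_labels is already sorted (e.g. by descending distance from the
    ball). All names here are hypothetical.
    """
    series = []
    for time_label, person_labels, ball_detected in frames:
        for person in person_labels:
            series.append(f"time {time_label} person {person} area")
        if ball_detected:
            # The ball area goes at the end of this time step's person areas.
            series.append(f"time {time_label} ball area")
    return series


frames = [("I", ["C", "B", "A"], True), ("II", ["B", "A"], True)]
series = build_input_series(frames)
```

Passing `False` as the third tuple element simply omits the ball area for that time step.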

For a frame in which no ball area is detected, the series of image areas may be created without the ball area. Alternatively, the ball position may be estimated by interpolation, such as linear interpolation, from the frames before and after it in which ball detection succeeded, or a moving image containing a frame in which the ball is not detected may be excluded from the recognition targets altogether.
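The linear interpolation option could look roughly like the sketch below. It assumes ball positions are given per frame as coordinate tuples, with `None` marking frames where detection failed; the function name and data layout are illustrative assumptions, and gaps at the very start or end of the sequence are left unfilled because there is no detection on both sides.

```python
def interpolate_ball_positions(positions):
    """Fill interior gaps (None) by linear interpolation between detections.

    positions: list of coordinate tuples or None, one entry per frame.
    Returns a new list; leading and trailing gaps stay None.
    """
    result = list(positions)
    known = [i for i, p in enumerate(positions) if p is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            t = (i - a) / (b - a)  # fraction of the way from frame a to frame b
            result[i] = tuple(
                pa + t * (pb - pa) for pa, pb in zip(positions[a], positions[b])
            )
    return result


track = [None, (0.0, 0.0, 0.0), None, None, None, (4.0, 8.0, 0.0), None]
filled = interpolate_ball_positions(track)
```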

In various sports such as futsal, soccer, and rugby, the ball generally moves faster than the players. Ball detection therefore tends to fail often when a moving image with an ordinary frame rate is used, and simple interpolation such as linear interpolation may produce large errors. To reduce the influence of errors caused by interpolation and the like, the extracted ball area may be made wider than the person area. Specifically, whereas the person area was set to an area of about 2 to 4 m in the first embodiment, the ball area is taken from a wider range of about 5 to 10 m. This increases the probability that the ball lies within the ball area even when ball detection contains an error.
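The effect of widening the ball area can be illustrated with a small sketch. It assumes, purely for illustration, that areas are axis-aligned squares on the ground plane whose side length is the stated size in meters; the text does not specify the shape, and the positions and the 2.5 m error below are made-up values.

```python
def square_region(center_xy, side_m):
    """Return (x_min, y_min, x_max, y_max) of a square centered on center_xy."""
    cx, cy = center_xy
    half = side_m / 2.0
    return (cx - half, cy - half, cx + half, cy + half)


def contains(region, point_xy):
    """True if point_xy lies inside the region (borders included)."""
    x_min, y_min, x_max, y_max = region
    x, y = point_xy
    return x_min <= x <= x_max and y_min <= y <= y_max


estimated_ball = (10.0, 5.0)  # interpolated position (illustrative)
actual_ball = (12.5, 5.0)     # true position, 2.5 m away (illustrative)

narrow = square_region(estimated_ball, 4.0)   # person-sized area
wide = square_region(estimated_ball, 10.0)    # widened ball area
```

With these values the narrow area misses the ball while the wide area still contains it.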

(Modified form 2 of Embodiment 1)
The first embodiment described a method of acquiring image feature quantities from the person areas of the still images constituting a moving image and identifying an action label. Modified form 1 of the first embodiment further described a method of concatenating the ball area with the person areas and identifying the action label from the ball and person areas. The present embodiment describes a method that uses metadata associated with the images in addition to the images themselves.

As already described in the first embodiment, three-dimensional coordinate values of the ball and the persons are obtained by object detection and the stereo method. These coordinate values can therefore be used together with the image feature quantities of the person areas or the ball area. In that case, the image feature quantities and the coordinate values are concatenated at the stage following the image feature extraction unit 1006, which is implemented by a CNN in the first embodiment, and the result is input to the person series integration unit 1007.

FIG. 12 shows an example of the network structure when the image areas and three-dimensional coordinate values of persons are used. 1702, 1704, and 1706 are the same as 1002, 1003, and 1004 in FIG. 7. 1703, 1705, and 1707 represent the coordinate data of person A, person B, and person C, respectively. CNN 1708, LSTM1 (1710), LSTM2 (1711), FC 1712, and Softmax 1713 are the same as CNN 903, LSTM1 (904), LSTM2 (905), FC 906, and Softmax 907 in FIG. 6. Concat 1709 is a concatenation module.

Here, the coordinate data (1703, 1705, 1707) may include, in addition to the person's position (X, Y, Z) in three-dimensional coordinates, values calculated from the distance to the ball and the distance to the camera. Velocity and acceleration calculated using data from earlier times, and other metadata such as a team ID, may also be used. The concatenation module 1709 concatenates the image feature quantity extracted by CNN 1708 with the coordinate data, and the concatenated data is input to LSTM1 (1710). The concatenation module 1709 may simply concatenate the two feature quantities in this way, or the concatenated vector may be multiplied by a weight matrix to reduce its dimensionality. In that case, an FC layer is added between Concat 1709 and LSTM1 (1710) and trained by the same procedure as described for the learning process of the first embodiment.
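The two variants of the concatenation module can be sketched in plain Python as follows. This is only an illustration under assumed names: `concat_features` and its `weights` argument are hypothetical, the weight matrix would in practice be learned as part of the FC layer, and the numbers are arbitrary.

```python
def concat_features(image_feature, coord_data, weights=None):
    """Concatenate an image feature vector with coordinate/metadata values.

    If weights (a list of rows) is given, the concatenated vector is projected
    by a matrix-vector product, which can reduce its dimensionality.
    """
    joined = list(image_feature) + list(coord_data)
    if weights is None:
        return joined  # simple concatenation
    # Each row of the weight matrix yields one output dimension.
    return [sum(w * v for w, v in zip(row, joined)) for row in weights]


image_feature = [0.5, 1.0]      # stand-in for a CNN output
coord_data = [2.0, 3.0, 4.0]    # e.g. (X, Y, Z) of a person

plain = concat_features(image_feature, coord_data)
projected = concat_features(
    image_feature,
    coord_data,
    weights=[[1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0, 1.0]],
)
```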

In the first embodiment, the coordinate data was acquired by applying the stereo method to the object detection results for multi-viewpoint images from calibrated multiple cameras. Alternatively, the coordinate data may be acquired by having the players wear GPS (Global Positioning System) devices.

(Modified form 3 of Embodiment 1)
The first embodiment described a method of identifying the action label by integrating the person series and integrating the time series. The two integrations were performed in the order of person series first and time series second, but the order is not limited to this. That is, the time series may be integrated first, followed by the person series.

In this case, for each person, the times at which that person appears are sorted to create the person area series. Consider again the case shown in FIGS. 5(a) and 5(b), in which three persons A, B, and C are present at time I and two persons A and B are present at time II. Person C is present only at time I, whereas persons B and A are each present at both times I and II, so the sorted person area series is as follows.
{time I person C area, time I person B area, time II person B area, time I person A area, time II person A area}
Here, the persons are sorted in descending order of distance from the ball (persons C, B, A), and the times are sorted in ascending order (times I, II). This series is input to Input (902) of the network already shown in FIG. 6, and LSTM1 and LSTM2 (904 and 905) are controlled so that time series integration and person series integration are performed in this order. The control states of LSTM1 and LSTM2 in this case are as shown in FIG. 18(b).

Here, at n = 1, LSTM1 is “reset” in order to integrate the time I person C area, and LSTM2 is also “reset”. Because person C does not exist at time II, LSTM2 integrates the first person at n = 1. The next person, B, is integrated at n = 2 and n = 3. At n = 2, LSTM1 is “reset” and LSTM2 is set to “hold”, so that LSTM1 integrates the person B area at time I. At n = 3, both LSTM1 and LSTM2 are “updated”: LSTM1 integrates the person B area at time II, and LSTM2 receives the result of integrating the person B areas at times I and II and updates its internal state. Likewise, at n = 4, LSTM1 is “reset” and LSTM2 is set to “hold”, so that LSTM1 integrates the person A area at time I. At n = 5, both LSTM1 and LSTM2 are “updated”: LSTM1 integrates the person A area at time II, and LSTM2 receives the result of integrating the person A areas at times I and II and updates its internal state.
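The control schedule walked through above can be reproduced with a short sketch. The function name, the dictionary layout, and the string state names are assumptions for illustration. One further assumption: before LSTM2 has integrated any person its state does not matter, and the sketch emits “reset” there.

```python
def time_first_schedule(presence, person_order):
    """Build the person-major series and per-step LSTM control states.

    presence: dict mapping person -> list of times (ascending) at which the
    person appears. person_order: persons in the desired order (e.g. by
    descending distance from the ball). Names are hypothetical.
    """
    series, lstm1, lstm2 = [], [], []
    integrated_any_person = False
    for person in person_order:
        times = presence[person]
        for i, time_label in enumerate(times):
            series.append((time_label, person))
            # LSTM1 (time series): restart for each person, then accumulate.
            lstm1.append("reset" if i == 0 else "update")
            is_last_time = i == len(times) - 1
            if is_last_time:
                # LSTM2 (person series) takes in the finished per-person result.
                lstm2.append("update" if integrated_any_person else "reset")
            else:
                lstm2.append("hold" if integrated_any_person else "reset")
        integrated_any_person = True
    return series, lstm1, lstm2


presence = {"C": ["I"], "B": ["I", "II"], "A": ["I", "II"]}
series, lstm1, lstm2 = time_first_schedule(presence, ["C", "B", "A"])
```

For this input, the states at n = 1 to 5 agree with the sequence described in the text.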

FIG. 13 shows the propagation path (1411) of signals and errors when the person area series is input to the network of FIG. 6 and the control of FIG. 18(b) is applied. 1402 is the same as the time I person C area shown at 802 in FIG. 5(c). 1403 is the same as the time I person B area shown at 803 in FIG. 5(c). 1404 is the same as the time II person B area shown at 805 in FIG. 5(c). 1405 is the same as the time I person A area shown at 804 in FIG. 5(c). 1406 is the same as the time II person A area shown at 806 in FIG. 5(c). CNN, LSTM1, LSTM2, FC, and Softmax are the same as CNN (903), LSTM1 (904), LSTM2 (905), FC (906), and Softmax (907) in FIG. 6.

For LSTM1 and LSTM2, “update” is shown as a white-background rectangle (1409), “reset” as a hatched rectangle (1407), and “hold” as a dotted rectangle (1408). A shaded rectangle (1410) indicates “ignore” for Softmax, and a white-background rectangle (1412) indicates its “normal operation” (operation other than “ignore”). With the control of FIG. 18(b), signals and errors propagate as indicated by 1411, and time series integration and person series integration are realized in this order.

By carrying out the above, the person series and the time series can be integrated in any order.

(Embodiment 2)
This embodiment describes a method of performing action recognition on futsal moving images captured by a plurality of cameras when the same person is captured by more than one camera. Three kinds of integration are performed: integration of the areas of the same person across the cameras (viewpoint integration), integration of person feature quantities representing the individual motions of a plurality of persons (object integration), and time series integration.

As in the first embodiment, the action labels identified in this embodiment are the five labels “pass”, “shoot”, “dribble”, “keep”, and “clear”.

As in the first embodiment, this embodiment uses futsal moving images captured by a plurality of cameras arranged around the futsal court. FIG. 1, already described, illustrates an example of this camera arrangement and example frames captured by two of the cameras.

FIG. 2(b) shows the functional configuration of the action recognition apparatus 2000 described in this embodiment. The action recognition apparatus 2000 includes a multi-camera moving image acquisition unit 2001, a human object detection unit 2002, a person area extraction unit 2003, a person area sorting unit 2004, an integrated control signal creation unit 2005, and an image feature quantity extraction unit 2006. It further includes a camera series integration unit 2007, a person series integration unit 2008, a time series integration unit 2009, and an action label identification unit 2010. The details of each of these functions are described below with reference to FIG. 3 and other figures.

FIG. 3(b) is a flowchart showing an example of the recognition processing in this embodiment.

In S2001, the multi-camera moving image acquisition unit 2001 acquires frame sequences of moving images, each consisting of a plurality of still images, captured by a plurality of cameras. The human object detection in S2002 and the person area extraction in S2003 are the same as S1002 and S1003 of the recognition processing in the first embodiment, so their description is omitted.

The person area sorting process (S2004), the integrated control signal creation process (S2005), and the image feature quantity extraction process (S2006) are similar to S1004 to S1006 of the recognition processing in the first embodiment. They differ in part, however, and the differences are described below together with the other processes.

In S2007, the camera series integration unit 2007 integrates the person areas of the same person captured by a plurality of cameras.

The person series integration process (S2008), the time series integration process (S2009), and the action label identification process (S2010) are similar to S1007 to S1009 of the recognition processing in the first embodiment. They differ in part, however, and the differences are described below together with the other processes.

Next, the specific content of each process is described following the flowchart shown in FIG. 3(b).

In the multi-camera moving image acquisition step S2001, multi-viewpoint moving images are acquired using multiple cameras arranged as shown in FIG. 1(a). The moving images of the cameras are assumed to be synchronized. As described above, FIGS. 1(b) and 1(c) are frames captured at the same moment by camera 103 (FIG. 1(a); hereinafter camera 1) and camera 104 (FIG. 1(a); hereinafter camera 2), and such synchronized frames are assumed to be obtained from each camera.

In the person area sorting step S2004, the person areas in each frame of the multi-viewpoint moving images acquired in the multi-camera moving image acquisition step S2001 are sorted.
In this step, the same person may be captured in a plurality of frames and by a plurality of cameras. For example, when person A is captured by camera 1, camera 2, and camera 4 in frame I, the series of person A's person areas arranged in camera number order,
{frame I person A camera 1, frame I person A camera 2, frame I person A camera 4}
is called a camera series.
Similarly, when person B is captured by camera 1 and camera 2, giving the camera series
{frame I person B camera 1, frame I person B camera 2}
the person series is the series that nests these camera series:
{{frame I person A camera 1, frame I person A camera 2, frame I person A camera 4}, {frame I person B camera 1, frame I person B camera 2}}
Further, when in frame II person A is captured by cameras 1 and 2 and person B is captured by cameras 2 and 3, the time series is the series that nests the camera series and the person series:
{{{frame I person A camera 1, frame I person A camera 2, frame I person A camera 4}, {frame I person B camera 1, frame I person B camera 2}}, {{frame II person A camera 1, frame II person A camera 2}, {frame II person B camera 2, frame II person B camera 3}}}
The person areas are sorted by arranging the nested time series created in this way in one dimension:
{frame I person A camera 1, frame I person A camera 2, frame I person A camera 4, frame I person B camera 1, frame I person B camera 2, frame II person A camera 1, frame II person A camera 2, frame II person B camera 2, frame II person B camera 3}
FIG. 14 shows the person detection results in two frames captured at the same time by camera 1 and camera 2 (FIGS. 14(a) and 14(b)) and the person area sorting result (FIG. 14(c)). Here, 1302 is the ball detection result on the camera 1 frame, and 1303 to 1309 are the person areas of persons A to G on the frame captured by camera 1. 1402 is the ball detection result on the camera 2 frame, and 1403 to 1413 are the person areas of persons A to K on the frame captured by camera 2. The ball 1302 and persons A to E (1303 to 1307) captured by camera 1 are the same object and persons as the ball 1402 and persons A to E (1403 to 1407) captured by camera 2.
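The nesting and one-dimensional arrangement described above can be sketched as follows, assuming each detection is available as a (frame, person, camera) tuple. The function name, the tuple layout, and the `person_rank` mapping (which stands in for whatever person ordering is used) are illustrative assumptions.

```python
from itertools import groupby


def nest_and_flatten(detections, person_rank):
    """Group detections as frames -> persons -> cameras, then flatten.

    detections: iterable of (frame, person, camera) tuples in any order.
    person_rank: dict mapping person -> sort position within a frame.
    Returns (nested, flat).
    """
    ordered = sorted(detections, key=lambda d: (d[0], person_rank[d[1]], d[2]))
    nested = []
    for _, in_frame in groupby(ordered, key=lambda d: d[0]):
        # One camera series (sorted by camera number) per person.
        persons = [list(cams) for _, cams in groupby(in_frame, key=lambda d: d[1])]
        nested.append(persons)
    flat = [region for persons in nested for cams in persons for region in cams]
    return nested, flat


# Frames I and II of the example, written as 1 and 2, in shuffled order.
detections = [
    (2, "B", 3), (1, "A", 4), (1, "B", 2), (2, "A", 1), (1, "A", 1),
    (1, "B", 1), (2, "B", 2), (1, "A", 2), (2, "A", 2),
]
nested, flat = nest_and_flatten(detections, {"A": 0, "B": 1})
```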

In this case, the nested person series is
{{person G camera 1}, {person K camera 2}, {person J camera 2}, {person F camera 1}, {person J camera 1}, {person H camera 2}, {person E camera 1, person E camera 2}, {person D camera 1, person D camera 2}, {person C camera 1, person C camera 2}, {person B camera 1, person B camera 2}, {person A camera 1, person A camera 2}}
The sorting result obtained by arranging the nested person series in one dimension is 1504 to 1519 in FIG. 14(c), that is,
{person G camera 1, person K camera 2, person J camera 2, person F camera 1, person J camera 1, person H camera 2, person E camera 1, person E camera 2, person D camera 1, person D camera 2, person C camera 1, person C camera 2, person B camera 1, person B camera 2, person A camera 1, person A camera 2}

Here, as in the first embodiment, the persons are sorted by distance from the ball. However, as described above, when a unified person ID is available from the previous frame, the persons may instead be sorted in the order of that person ID.

Next, the integrated control signal creation step S2005 generates the control signals used in the camera series integration step S2007, the person series integration step S2008, and the time series integration step S2009. These control signals control the camera series integration unit 2007, the person series integration unit 2008, and the time series integration unit 2009, respectively.

In the first embodiment, the person series integration unit 1007 and the time series integration unit 1008 were realized by a two-layer LSTM. In this embodiment, the camera series integration unit 2007, the person series integration unit 2008, and the time series integration unit 2009 are realized by a three-layer LSTM (LSTM1, LSTM2, LSTM3), one layer for each unit.

Each LSTM layer integrates one level of the three-level nested series. The nested series of person areas detected in the two frames from two cameras at a single time shown in FIG. 14 is
{{{person G camera 1}, {person K camera 2}, {person J camera 2}, {person F camera 1}, {person J camera 1}, {person H camera 2}, {person E camera 1, person E camera 2}, {person D camera 1, person D camera 2}, {person C camera 1, person C camera 2}, {person B camera 1, person B camera 2}, {person A camera 1, person A camera 2}}}
In this case, LSTM1, which integrates the camera series, is controlled to “reset” at the start of each camera series and to “update” when integrating, so the control signal is created to give the state sequence
{reset, reset, reset, reset, reset, reset, reset, update, reset, update, reset, update, reset, update, reset, update}
LSTM2, which integrates the person series, is “reset” at the start of the person series and “updated” when integrating, and is set to “hold” when the element is not the last element of the camera series one level below. That is,
{reset, update, update, update, update, update, hold, update, hold, update, hold, update, hold, update, hold, update}

Similarly, LSTM3, which integrates the time series, is “reset” at the start of the time series, so the last element is set to “reset”. The other elements may take any state, but are set to “reset” for convenience. That is,
{reset, reset, reset, reset, reset, reset, reset, reset, reset, reset, reset, reset, reset, reset, reset, reset}
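The rules stated above for the three layers can be written down as a small sketch that derives one state per element from a nested series (frames containing persons containing camera areas). The function names and state strings are assumptions for illustration. The sketch also assumes that a layer whose state does not yet matter is given “reset”, following the convenience choice described for LSTM3.

```python
def _state(lower_level_done, already_integrated):
    """State of a higher layer at one element (hypothetical helper)."""
    if lower_level_done:
        # The layer below has finished a group, so this layer takes it in.
        return "update" if already_integrated else "reset"
    # Nothing to take in yet: keep the state, or don't-care before first input.
    return "hold" if already_integrated else "reset"


def three_level_schedule(nested):
    """nested: list of frames, each a list of persons, each a list of cameras."""
    lstm1, lstm2, lstm3 = [], [], []
    frame_integrated = False
    for frame in nested:
        person_integrated = False
        for p_idx, cams in enumerate(frame):
            last_person = p_idx == len(frame) - 1
            for c_idx in range(len(cams)):
                last_cam = c_idx == len(cams) - 1
                lstm1.append("reset" if c_idx == 0 else "update")
                lstm2.append(_state(last_cam, person_integrated))
                lstm3.append(_state(last_cam and last_person, frame_integrated))
            person_integrated = True
        frame_integrated = True
    return lstm1, lstm2, lstm3


# Single time step of FIG. 14: six single-camera persons, then five persons
# seen by both cameras (labels are placeholders).
frame = [["G1"], ["K2"], ["J2"], ["F1"], ["J1"], ["H2"],
         ["E1", "E2"], ["D1", "D2"], ["C1", "C2"], ["B1", "B2"], ["A1", "A2"]]
lstm1, lstm2, lstm3 = three_level_schedule([frame])
```

With more than one frame, the same function emits “hold” for LSTM3 inside later frames and “update” at the last element of each later frame.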

FIG. 15 illustrates the network with the control states actually switched for the person area series of FIG. 14(c). The rectangles labeled CNN, LSTM1, LSTM2, LSTM3, FC, and Softmax in the figure are the CNN, the three LSTM layers, the FC layer, and the Softmax classifier, respectively. A hatched rectangle (1618) represents “reset” of an LSTM, a dotted rectangle (1619) represents “hold”, and a white-background rectangle (1620) represents “update”. A shaded rectangle (1621) represents “ignore” for the Softmax classifier, and a white-background rectangle (1622) represents its normal operation (a state other than “ignore”).

The subsequent steps (image feature quantity extraction step S2006, camera series integration step S2007, person series integration step S2008, time series integration step S2009, and action label identification step S2010) are realized in the same manner as in the first embodiment, that is, by a neural network combining a CNN, LSTMs, and a Softmax classifier. As already described, the network structure is that of FIG. 15. By executing the same recognition processing as in the first embodiment with the structure and control states shown in FIG. 15, according to the control signals created in the preceding step, the identification result of an action label that integrates the camera series, the person series, and the time series is obtained.

By executing the above, an action label can be recognized by integrating multimodal series information (in this embodiment, the camera series, the person series, and the time series).
In this embodiment, the integration was performed in the order of camera series, person series, and time series, but integration in the order of time series, person series, and camera series is also possible.

Furthermore, the procedure described in this embodiment can be applied to the integration of more general kinds of series information. For example, consider the case where the players wear heart rate sensors, acceleration sensors, and GPS sensors, from which heart rate data, velocity and acceleration data, and position data can be acquired. In this case, three kinds of integration can be performed by the same procedure as in this embodiment: integration of the multiple sensor data of each person, integration of the sensor data of multiple persons, and time series integration. When the players wear different sensor models, or when the number of sensors worn differs from player to player, integrating the multiple sensor data can be expected to absorb those differences.

(Embodiment 3)
In the first and second embodiments, person detection was performed on the frames acquired by the cameras, and the action label was identified using local person areas based on the detection result.
In these embodiments, if person detection does not work correctly, a non-person area may be input by mistake, which can lead to misidentification.

This embodiment describes a method that, in addition to the person areas, which are local parts of the image, extracts image features from the entire image and uses them for identifying the action label, thereby reducing misidentification caused by person detection errors.

As in the first and second embodiments, the action labels identified in this embodiment are the five labels “pass”, “shoot”, “dribble”, “keep”, and “clear”.

As in the first and second embodiments, this embodiment uses futsal moving images captured by a plurality of cameras arranged around the futsal court. FIG. 1, already described, illustrates an example of this camera arrangement and example frames captured by two of the cameras.

FIG. 16(a) shows the functional configuration of the action recognition apparatus 3000 described in this embodiment.

The action recognition apparatus 3000 of this embodiment includes a global feature quantity extraction unit 3010 in addition to the functional configuration of the action recognition apparatus 1000 of the first embodiment. The details of each of these functions are described below with reference to FIG. 3 and other figures.

FIG. 3(c) is a flowchart showing an example of the recognition processing in this embodiment.

Here, S3001 to S3005 are the same as S1001 to S1005 of the recognition processing in the first embodiment, so their description is omitted.

The image feature quantity extraction process (S3006) is similar to the image feature quantity extraction step S1006 of the recognition processing in the first embodiment, but differs in part. The differences are described below together with the other processes.

In S3007, the global feature quantity extraction unit 3010 extracts a global image feature quantity from the entire frame acquired in the moving image acquisition step S3001.

Next, the specific content is described following the flowchart shown in FIG. 3(c). This embodiment differs from the first embodiment in the processing from the image feature quantity extraction process (S3006) onward, and the other processing is the same as in the first embodiment. Only the processes that differ are therefore described.

In the first embodiment, the units operating in S1006 to S1009, namely the image feature quantity extraction unit 1006, the person series integration unit 1007, the time series integration unit 1008, and the action label identification unit 1009, were realized by combining a CNN, LSTMs, and a Softmax classifier. In this embodiment, the image feature quantity extraction unit 3006, the person series integration unit 3007, the time series integration unit 3008, and the action label identification unit 3009 are likewise realized by a CNN, LSTMs, and a Softmax classifier. In addition, the global feature quantity extraction unit 3010 used in the global image feature quantity extraction step S3007 is also realized by a CNN.

When drawn in the same manner as the diagram of the network structure of the first embodiment (FIG. 9), the structure of the network composed of these units is as shown in FIG. 17. A concatenation unit (1315) is inserted between LSTM1 and LSTM2 of the network structure of the first embodiment (FIG. 9). The whole image at each time (1302, 1308) is input to the CNN (1313), and the concatenation unit (1315) concatenates the global image feature quantity extracted there with the person-series integration result from LSTM1 (1314). The concatenated feature quantity is input to LSTM2 (1316), which performs time-series integration.
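
The concatenation step described above can be illustrated with a minimal Python sketch. The function name `concatenate_features` and the feature sizes are hypothetical; the source does not state the dimensions of either feature quantity.

```python
def concatenate_features(person_integrated, global_feature):
    """Join the LSTM1 person-series result with the global image feature.

    Both arguments are plain lists of floats, used here as hypothetical
    stand-ins for the outputs of LSTM1 (1314) and the CNN (1313).
    """
    return list(person_integrated) + list(global_feature)


# Illustrative sizes only; the source does not specify feature dimensions.
person_integrated = [0.1, 0.2, 0.3]  # LSTM1 result for one frame
global_feature = [0.5, 0.6]          # CNN feature of the whole frame
lstm2_input = concatenate_features(person_integrated, global_feature)
```

In this sketch the input to LSTM2 has a length equal to the sum of the two feature lengths, so the size of the LSTM2 input layer would have to be chosen accordingly.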

Here, the whole image at each time (1302, 1308) input to the CNN (1313) is the Full HD (1920 × 1080 pixel) image acquired by the moving image acquisition means 3001, resized without cropping to the input size of the CNN. The resized image is, for example, 227 × 227 pixels.
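
As a rough illustration of resizing without cropping, the following Python sketch maps the whole source frame onto the output grid. Nearest-neighbor sampling and the helper name `resize_nearest` are assumptions made for this example; the source does not specify an interpolation method.

```python
def resize_nearest(image, out_w, out_h):
    """Resize a row-major image (list of rows) by nearest-neighbor sampling.

    No cropping is done: the whole source frame is mapped onto the output,
    so the aspect ratio changes when the source is not square.
    """
    in_h = len(image)
    in_w = len(image[0])
    return [
        [image[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


# Scale factors for the example sizes in the text (Full HD to 227 x 227).
scale_x = 227 / 1920
scale_y = 227 / 1080

# Tiny stand-in frame (4 x 2 pixels) instead of a real 1920 x 1080 frame.
frame = [[0, 1, 2, 3],
         [4, 5, 6, 7]]
small = resize_nearest(frame, 2, 2)
```

Because the horizontal scale factor is smaller than the vertical one, the resized frame is compressed more strongly in width than in height, which is the consequence of resizing a 16:9 frame to a square input without cropping.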

The CNN (1312), which extracts image feature quantities from the person regions, and the CNN (1313), which extracts the global image feature quantity from the whole image, may have the same structure and share the same weight parameters. Alternatively, they may have the same structure with different weight parameters, or different structures and different weight parameters.

As shown in FIG. 17, the image feature quantities of the person regions at time I are recursively integrated by LSTM1, and the concatenation module concatenates the result with the global image feature quantity extracted from the whole image. The concatenated feature quantity is input to LSTM2, which performs time-series integration. Likewise, at time II, the person regions are recursively integrated by LSTM1, the result is concatenated with the global image feature quantity, and LSTM2 then recursively performs time-series integration. The action label scores are calculated and output by the Inner-product unit and the Softmax unit at each time-series integration by LSTM2.
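
The flow described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the weights are random toy values rather than trained parameters, bias terms are omitted, the CNN outputs are replaced by hand-written feature lists, and all names (`lstm_step`, `recognize`, `make_weights`) and dimensions are hypothetical.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def make_weights(n_in, n_hidden, seed):
    """Toy weights for four gates (input, forget, output, candidate)."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.5, 0.5) for _ in range(n_in + n_hidden)]
             for _ in range(n_hidden)] for _ in range(4)]


def lstm_step(x, h, c, weights):
    """One LSTM step without bias terms (a simplification for brevity)."""
    z = x + h  # concatenate the input with the previous hidden state
    pre = [[sum(w * v for w, v in zip(row, z)) for row in gate]
           for gate in weights]
    i = [sigmoid(a) for a in pre[0]]
    f = [sigmoid(a) for a in pre[1]]
    o = [sigmoid(a) for a in pre[2]]
    g = [math.tanh(a) for a in pre[3]]
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def recognize(frames, w1, w2, w_out, h1_size, h2_size):
    """frames: list of (person_features, global_feature), one per time."""
    h2, c2 = [0.0] * h2_size, [0.0] * h2_size
    outputs = []
    for person_features, global_feature in frames:
        # LSTM1 starts from a reset state for each frame and integrates
        # the person features one by one.
        h1, c1 = [0.0] * h1_size, [0.0] * h1_size
        for feat in person_features:
            h1, c1 = lstm_step(feat, h1, c1, w1)
        # Concatenate with the global feature, then update LSTM2 over time.
        h2, c2 = lstm_step(h1 + global_feature, h2, c2, w2)
        # Inner product followed by softmax gives the action label scores.
        scores = [sum(w * v for w, v in zip(row, h2)) for row in w_out]
        outputs.append(softmax(scores))
    return outputs


w1 = make_weights(4, 5, seed=0)        # person feature size 4, LSTM1 size 5
w2 = make_weights(5 + 3, 6, seed=1)    # global feature size 3, LSTM2 size 6
rng = random.Random(2)
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(6)] for _ in range(3)]
frames = [
    ([[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2]], [0.3, 0.3, 0.1]),  # time I
    ([[0.2, 0.1, 0.4, 0.3]], [0.2, 0.4, 0.1]),                        # time II
]
label_scores = recognize(frames, w1, w2, w_out, 5, 6)
```

The sketch produces one score vector per time step, and the number of persons may differ from frame to frame because LSTM1 consumes the person features sequentially.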

With the above processing, the feature quantity extracted from the whole image is concatenated with the feature quantities of the local person regions, which are obtained by person detection and therefore contain detection errors, and time-series integration is then performed. This makes it possible to identify action labels while reducing the influence of errors in person detection.

(Other Embodiments)
The present invention can also be realized by processing in which a program that implements one or more functions of the above-described embodiments is supplied to a system or apparatus via a network or a storage medium, and one or more processors in a computer of the system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.

1000 Action recognition apparatus
1001 Moving image acquisition unit
1002 Person/object detection unit
1003 Person region extraction unit
1004 Person region sorting unit
1005 Integration control signal creation unit
1006 Image feature quantity extraction unit
1007 Person series integration unit
1008 Time-series integration unit
1009 Action label identification unit

Claims

1. An image processing apparatus comprising:
acquisition means for acquiring a moving image including time-series still images;
detection means for detecting one or more objects in each still image of the moving image;
feature quantity extraction means for extracting, from the still image, a feature quantity corresponding to each of the objects;
object integration means for integrating the feature quantities corresponding to the respective objects within the still image;
time-series integration means for integrating, over the time-series still images, the feature quantities of the objects integrated within each still image; and
identification means for identifying an action of the objects in the moving image based on the feature quantities integrated over the time-series still images.

2. The image processing apparatus according to claim 1, further comprising:
region extraction means for extracting, from the still image, a region corresponding to each of the objects; and
creation means for creating a region series in which the regions extracted from each of the time-series still images are arranged,
wherein the feature quantity extraction means extracts the feature quantities from the regions corresponding to the respective objects in the region series.

3. The image processing apparatus according to claim 2, wherein the creation means sorts the regions of the plurality of objects based on the positions of the objects.

4. The image processing apparatus according to claim 2, further comprising control means for controlling the control states of the object integration means and the time-series integration means based on the detection result of the detection means.

5. The image processing apparatus according to claim 4, wherein the object integration means has at least two control states, reset and update, and
the control means sets the control state of the object integration means to reset at the beginning of the series of feature quantities of the objects in the still image and to update at other times, whereby the object integration means integrates the series of objects in the still image.

6. The image processing apparatus according to claim 1, wherein the time-series integration means has at least three control states, reset, hold, and update, and
the control means sets the control state of the time-series integration means to reset at the beginning of the series, to update at the end of the series of feature quantities of the objects integrated for each still image, and to hold at other times, whereby the time-series integration means integrates the series of feature quantities of the objects integrated for each still image.

7. The image processing apparatus according to claim 1, wherein the detection means detects a real-world position of the object.

8. The image processing apparatus according to claim 1, wherein the detection means detects, as the objects, a person and a predetermined object different from the person.

9. The image processing apparatus according to claim 1, wherein the acquisition means acquires moving images obtained by simultaneously photographing an object from a plurality of viewpoints.

10. The image processing apparatus according to claim 9, wherein the detection means detects, for each still image constituting the moving images taken from the plurality of viewpoints, the same object in the still images of the plurality of viewpoints, and detects the position of the object.

11. The image processing apparatus according to claim 10, further comprising viewpoint integration means for integrating the series of feature quantities of the objects from the plurality of viewpoints,
wherein the object integration means integrates the feature quantities of the objects that have been integrated over the plurality of viewpoints for each of the objects.

12. The image processing apparatus according to claim 1, further comprising global feature quantity extraction means for extracting a global image feature quantity from the whole of each still image constituting the moving image,
wherein the time-series integration means integrates the feature quantities of the objects integrated for each still image together with the global image feature quantity of each still image.

13. An image processing method comprising:
an acquisition step of acquiring a moving image including time-series still images;
a detection step of detecting one or more objects in each still image of the moving image;
a feature quantity extraction step of extracting, from the still image, a feature quantity corresponding to each of the objects;
an object integration step of integrating the feature quantities corresponding to the respective objects within the still image;
a time-series integration step of integrating, over the time-series still images, the feature quantities of the objects integrated within each still image; and
an identification step of identifying an action of the objects in the moving image based on the feature quantities integrated over the time-series still images.

14. A program for causing a computer to function as each means of the image processing apparatus according to any one of claims 1 to 12.