JP6904651B2

JP6904651B2 - Program, device, and method for recognizing a person's behavior using multiple recognition engines

Info

Publication number: JP6904651B2
Application number: JP2018028219A
Authority: JP
Inventors: 建鋒徐; 和之田坂; 柳原　広昌; 広昌柳原
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2018-02-20
Filing date: 2018-02-20
Publication date: 2021-07-21
Anticipated expiration: 2038-02-20
Also published as: JP2019144830A

Description

The present invention relates to a technique for recognizing a person's behavior from video data.

In recent years, deep learning has dramatically improved the accuracy of human behavior recognition.
Conventionally, there is a moving object recognition technique that detects the movement of an object from motion features (optical flow) (see, for example, Non-Patent Document 1). For example, Two-stream ConvNets use a spatial CNN (Spatial stream ConvNet) and a temporal CNN (Temporal stream ConvNet) to extract appearance features of the objects and background in an image, and motion features from the sequence of horizontal and vertical optical-flow components. Integrating both kinds of features allows behavior to be recognized with high accuracy.

There is also a technique that recognizes behavior with a 3D-CNN, using 64-frame segments as the processing unit (see, for example, Non-Patent Document 2). Compared with the technique of Non-Patent Document 1, this technique applies 3D convolution, which includes time-axis information, and trains a deep model on a large amount of teacher data.
Further, there is a technique that divides the video data into N (= 3) equal segments and integrates the scores of the segments (see, for example, Non-Patent Document 3). Compared with the technique of Non-Patent Document 1, this technique uses information over a longer time span and trains a deep model on a large amount of teacher data.

On the other hand, there is also a technique for estimating two-dimensional joint data of a person's skeleton from video data captured by an ordinary Web camera. This technique cannot estimate three-dimensional joints, but it does not require a depth sensor such as Kinect (registered trademark).

FIG. 1 is a configuration diagram of a system having a recognition device.

In the system of FIG. 1, the recognition device 1 functions as, for example, a server connected to the Internet. The recognition device 1 has a recognition engine whose learning model has been built in advance from teacher data. When the recognition engine recognizes a person's behavior, the teacher data consists of video data showing a person's behavior, associated in advance with the corresponding behavior target (context).

Each terminal 2 is equipped with a camera and transmits video data capturing a person's behavior to the recognition device 1. The terminal 2 is a smartphone or other mobile terminal carried by a user, and connects to an access network such as a mobile phone network or a wireless LAN.
Of course, the terminal 2 is not limited to a smartphone or the like, and may be, for example, a Web camera installed in a home. Alternatively, video data captured by a Web camera may be recorded on an SD card, and the recorded video data may be input to the recognition device 1.

Specifically, for example, a user is asked to record their own behavior with the camera of their own smartphone. The smartphone transmits the video data to the recognition device 1. The recognition device 1 estimates the person's behavior from the video data, and the estimation result is used by various applications.
Note that the functions of the recognition device 1 may instead be incorporated in the terminal 2.

Non-Patent Document 1: Karen Simonyan and Andrew Zisserman, "Two-Stream Convolutional Networks for Action Recognition in Videos," in NIPS 2014, [online], [retrieved January 24, 2018], Internet <URL: https://arxiv.org/abs/1406.2199.pdf>
Non-Patent Document 2: Joao Carreira, Andrew Zisserman, "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset," CVPR2017 (2017), [online], [retrieved January 24, 2018], Internet <URL: https://arxiv.org/abs/1705.07750>
Non-Patent Document 3: Wang, Limin, et al., "Temporal segment networks: Towards good practices for deep action recognition," European Conference on Computer Vision, Springer International Publishing, 2016, [online], [retrieved January 24, 2018], Internet <URL: http://www.eccv2016.org/files/posters/P-3B-42.pdf>
Non-Patent Document 4: Cao, Zhe, et al., "Realtime multi-person 2d pose estimation using part affinity fields," CVPR2017 (2017), [online], [retrieved January 24, 2018], Internet <URL: https://arxiv.org/abs/1611.08050>
Non-Patent Document 5: Soo Kim, Tae & Reiter, Austin, "Interpretable 3D Human Action Analysis with Temporal Convolutional Networks," CVPRW 2017, [online], [retrieved January 24, 2018], Internet <URL: http://ieeexplore.ieee.org/document/8014941/?reload=true>
Non-Patent Document 6: Gunnar Farneback, "Two-Frame Motion Estimation Based on Polynomial Expansion, Image Analysis," Volume 2749 of the series Lecture Notes in Computer Science, pp 363-370, June 2003, [online], [retrieved January 24, 2018], Internet <URL: http://liu.diva-portal.org/smash/record.jsf?pid=diva2%3A269471&dswid=-8845>
Non-Patent Document 7: OpenPose, [online], [retrieved January 24, 2018], Internet <URL: https://github.com/CMU-Perceptual-Computing-Lab/openpose>
Non-Patent Document 8: "I tried OpenPose, which can detect bones from videos and photos" (in Japanese), [online], [retrieved January 24, 2018], Internet <URL: http://hackist.jp/?p=8285>
Non-Patent Document 9: "OpenPose keeps being upgraded and now lets you try 3d pose estimation as well" (in Japanese), [online], [retrieved January 24, 2018], Internet <URL: http://izm-11.hatenablog.com/entry/2017/08/01/140945>

A recognition engine that recognizes a person's behavior is trained with video data showing people as teacher data, and is tuned to estimate behavior with high accuracy from the video data to be analyzed.
In a real environment, however, video areas other than the person can reduce the accuracy of context recognition.

In addition, the recognition accuracy of recognition engines generally ranks, from highest to lowest, as joint recognition -> moving object recognition -> object recognition.
However, joint recognition is not always the most accurate, because of factors such as shooting angle, illuminance, occlusion, and resolution. That is, even when an optimally trained recognition engine is used, a different type of recognition engine may give higher recognition accuracy depending on the shooting environment.

Furthermore, a recognition engine is a machine learning engine based on classification: it assigns a "behavior" (class) to the video data to be analyzed.
However, a high score from a recognition engine does not necessarily mean high recognition accuracy.
When the movement of a similar but different object is detected, an engine with high recognition accuracy may give only a moderate score, while an engine with low recognition accuracy may give an extremely high score.
For example, for the same behavior, a moving object recognition engine with relatively high recognition accuracy may compute a moderate score, while an object recognition engine with relatively low recognition accuracy outputs a relatively high score.
Seen this way, the score computed by an individual recognition engine cannot be taken as is as the recognition accuracy for that behavior.

Therefore, an object of the present invention is to provide a program, a device, and a method that can improve the recognition accuracy of behavior (context) from an overall viewpoint based on the scores of multiple recognition engines, without being affected by video areas other than the person in the video data.

According to the present invention, there is provided a behavior recognition program that causes a computer to function so as to recognize behavior from video data showing a person, the program causing the computer to function as:
a skeleton information extraction means that extracts skeleton information based on the person's joints from the video data in time series;
a joint recognition engine that recognizes behavior from the skeleton information of the video data;
a region cutting means that extracts the enclosed region of the skeleton information from the video data;
a moving object recognition engine that recognizes behavior from the enclosed region of the video data;
a score integration means that outputs, for each behavior, an integrated score obtained by weighting the scores of the joint recognition engine and the moving object recognition engine with the weight corresponding to each recognition engine and integrating them; and
a weight calculation means that takes as input multiple training data based on different behaviors, calculates a statistical value of the scores for each recognition engine, and assigns as the weight a value that becomes larger as the statistical value of the scores becomes lower.
According to another embodiment of the program of the present invention, it is also preferable that
the weight calculation means causes the computer to function so as to use, as the weight, the reciprocal of the statistical value of the scores (normalized so that the reciprocals of all recognition engines sum to 1).
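The weight calculation and score integration described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the text only says "statistical value", so the use of the mean is an assumption, as are the function names `compute_weights` and `integrate_scores` and the dictionary layout of the scores.

```python
from statistics import mean


def compute_weights(training_scores):
    """Return one weight per engine from scores observed on training data.

    training_scores maps an engine name to a list of positive scores.
    The weight is the reciprocal of the mean score (the mean is an assumed
    choice of statistic), normalized so that all weights sum to 1.
    """
    reciprocals = {name: 1.0 / mean(scores) for name, scores in training_scores.items()}
    total = sum(reciprocals.values())
    return {name: value / total for name, value in reciprocals.items()}


def integrate_scores(engine_scores, weights):
    """Return the weighted sum of per-engine scores for every behavior.

    engine_scores maps an engine name to a dict of behavior -> score.
    A behavior missing from an engine's output contributes 0 for that engine.
    """
    behaviors = set()
    for per_behavior in engine_scores.values():
        behaviors.update(per_behavior)
    return {
        behavior: sum(
            weights[engine] * scores.get(behavior, 0.0)
            for engine, scores in engine_scores.items()
        )
        for behavior in behaviors
    }


# Illustrative numbers only: the engine that tends to output lower scores
# receives the larger weight.
weights = compute_weights({"joint": [0.5, 0.5], "motion": [0.25, 0.25]})
integrated = integrate_scores(
    {"joint": {"drink": 0.6, "eat": 0.3}, "motion": {"drink": 0.3, "eat": 0.6}},
    weights,
)
```

Weighting by the normalized reciprocal compensates for an engine that systematically outputs high scores, which matches the stated concern that a high score does not by itself indicate high recognition accuracy.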

According to the present invention, there is also provided a behavior recognition program that causes a computer to function so as to recognize behavior from video data showing a person, the program causing the computer to function as:
a skeleton information extraction means that extracts skeleton information based on the person's joints from the video data in time series;
a joint recognition engine that recognizes behavior from the skeleton information of the video data;
a joint behavior determination means that determines whether the behavior scores calculated by the joint recognition engine satisfy a predetermined condition;
a region cutting means that extracts the enclosed region of the skeleton information from the video data when the joint behavior determination means determines the condition to be true;
a moving object recognition engine that recognizes behavior from the enclosed region of the video data; and
a score integration means that outputs, for each behavior, an integrated score that integrates the scores of the joint recognition engine and the moving object recognition engine.
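The control flow of this second configuration can be sketched as below. All callables are hypothetical placeholders supplied by the caller, and the sketch assumes that the joint scores alone are returned when the condition is not met; the text here does not state what is output in that case.

```python
def recognize_behavior(video, extract_skeleton, joint_engine, needs_motion,
                       crop_region, motion_engine, integrate):
    """Run joint recognition first and moving object recognition only on demand.

    Every argument after `video` is a caller-supplied function (hypothetical
    stand-ins for the means named in the text).
    """
    skeleton = extract_skeleton(video)
    joint_scores = joint_engine(skeleton)
    if not needs_motion(joint_scores):
        # Assumption: fall back to the joint scores when the condition is false.
        return joint_scores
    region = crop_region(video, skeleton)
    motion_scores = motion_engine(region)
    return integrate(joint_scores, motion_scores)


# Stub functions for illustration only.
result = recognize_behavior(
    video="video",
    extract_skeleton=lambda v: "skeleton",
    joint_engine=lambda s: {"drink": 0.4},
    needs_motion=lambda scores: max(scores.values()) <= 0.5,
    crop_region=lambda v, s: "region",
    motion_engine=lambda r: {"drink": 0.8},
    integrate=lambda a, b: {k: (a[k] + b[k]) / 2 for k in a},
)
```

Gating the second engine this way avoids the cost of region extraction and moving object recognition when the first stage does not call for it.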

According to another embodiment of the behavior recognition program of the present invention, it is also preferable that
the joint behavior determination means causes the computer to function so as to determine, as the predetermined condition, whether the maximum or the average of the scores over multiple behaviors is less than or equal to a predetermined threshold.
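A minimal sketch of this threshold condition follows. The function name `scores_at_or_below_threshold` and the `statistic` parameter are illustrative assumptions; only the comparison of the maximum or average score against a threshold comes from the text.

```python
def scores_at_or_below_threshold(scores, threshold, statistic="max"):
    """Return True when the max (or mean) behavior score is <= threshold.

    scores maps a behavior name to the score from the joint recognition engine.
    """
    values = list(scores.values())
    if statistic == "max":
        value = max(values)
    elif statistic == "mean":
        value = sum(values) / len(values)
    else:
        raise ValueError("statistic must be 'max' or 'mean'")
    return value <= threshold


# Illustrative values: no behavior scores above 0.5, so the condition is true.
uncertain = scores_at_or_below_threshold({"drink": 0.4, "eat": 0.3}, threshold=0.5)
```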

According to another embodiment of the behavior recognition program of the present invention, it is also preferable that
the joint behavior determination means causes the computer to function so as to determine, as the predetermined condition, whether the behavior with the maximum score is a predetermined target behavior.
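This target-behavior condition can be sketched as follows. The function name is hypothetical, and accepting a set of target behaviors rather than a single one is an assumption made for convenience.

```python
def top_behavior_is_target(scores, target_behaviors):
    """Return True when the highest-scoring behavior is one of the targets.

    scores maps a behavior name to the score from the joint recognition engine.
    """
    top_behavior = max(scores, key=scores.get)
    return top_behavior in target_behaviors


# Illustrative values: "drink" has the highest score and is a target behavior.
is_target = top_behavior_is_target({"drink": 0.7, "run": 0.2}, {"drink", "eat"})
```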

According to another embodiment of the behavior recognition program of the present invention, it is also preferable that
the joint recognition engine further outputs important joint parts that satisfy a predetermined condition, and
the region cutting means causes the computer to function so as to extract an enclosed region that includes the important joint parts output from the joint recognition engine.

According to another embodiment of the behavior recognition program of the present invention, it is also preferable that
the score integration means integrates the scores of the recognition engines after weighting each score with the weight corresponding to that recognition engine, and
the program further causes the computer to function as a weight calculation means that takes as input multiple training data based on different behaviors, calculates a statistical value of the scores for each recognition engine, and assigns as the weight a value that becomes larger as the statistical value of the scores becomes lower.

According to another embodiment of the behavior recognition program of the present invention, it is also preferable that
the weight calculation means causes the computer to function so as to use, as the weight, the reciprocal of the statistical value of the scores (normalized so that the reciprocals of all recognition engines sum to 1).

According to another embodiment of the behavior recognition program of the present invention, it is also preferable that
the moving object recognition engine is based on optical flow.

According to another embodiment of the behavior recognition program of the present invention, it is also preferable that
the moving object recognition engine consists of an object recognition engine based on RGB images and a moving object recognition engine based on optical flow, and
the score integration means causes the computer to function so as to integrate, for each behavior, the scores of the joint recognition engine, the moving object recognition engine, and the object recognition engine.

According to another embodiment of the behavior recognition program of the present invention, it is also preferable that
the program further has a skeleton information normalization means that normalizes the skeleton information by shifting and scaling it with respect to the time-series coordinate system, and
the region cutting means causes the computer to function so as to extract, as the enclosed region, the minimum region enclosing the normalized skeleton information.

According to the present invention, there is provided a device that recognizes behavior from video data showing a person, the device having:
a skeleton information extraction means that extracts skeleton information based on the person's joints from the video data in time series;
a joint recognition engine that recognizes behavior from the skeleton information of the video data;
a region cutting means that extracts the enclosed region of the skeleton information from the video data;
a moving object recognition engine that recognizes behavior from the enclosed region of the video data;
a score integration means that outputs, for each behavior, an integrated score obtained by weighting the scores of the joint recognition engine and the moving object recognition engine with the weight corresponding to each recognition engine and integrating them; and
a weight calculation means that takes as input multiple training data based on different behaviors, calculates a statistical value of the scores for each recognition engine, and assigns as the weight a value that becomes larger as the statistical value of the scores becomes lower.

According to the present invention, there is also provided a device that recognizes behavior from video data showing a person, the device having:
a skeleton information extraction means that extracts skeleton information based on the person's joints from the video data in time series;
a joint recognition engine that recognizes behavior from the skeleton information of the video data;
a joint behavior determination means that determines whether the behavior scores calculated by the joint recognition engine satisfy a predetermined condition;
a region cutting means that extracts the enclosed region of the skeleton information from the video data when the joint behavior determination means determines the condition to be true;
a moving object recognition engine that recognizes behavior from the enclosed region of the video data; and
a score integration means that outputs, for each behavior, an integrated score that integrates the scores of the joint recognition engine and the moving object recognition engine.

According to the present invention, there is provided a recognition method for a device that recognizes behavior from video data showing a person, wherein
the device executes:
a first step of extracting skeleton information based on the person's joints from the video data in time series;
a second step of recognizing behavior by joint recognition from the skeleton information of the video data;
a third step of extracting the enclosed region of the skeleton information from the video data;
a fourth step of recognizing behavior by moving object recognition from the enclosed region of the video data; and
a fifth step of outputting, for each behavior, an integrated score obtained by weighting the scores of the second step and the fourth step with the weight corresponding to each step and integrating them,
and wherein the device takes as input multiple training data based on different behaviors, calculates a statistical value of the scores for each of the second step and the fourth step, and assigns as the weight a value that becomes larger as the statistical value of the scores becomes lower.

According to the present invention, there is also provided a recognition method for a device that recognizes behavior from video data showing a person, wherein
the device executes:
a first step of extracting skeleton information based on the person's joints from the video data in time series;
a second step of recognizing behavior by joint recognition from the skeleton information of the video data;
a third step of determining whether the behavior scores calculated in the second step satisfy a predetermined condition;
a fourth step of extracting the enclosed region of the skeleton information from the video data when the third step determines the condition to be true;
a fifth step of recognizing behavior by moving object recognition from the enclosed region of the video data; and
a sixth step of outputting, for each behavior, an integrated score that integrates the scores of the second step and the fifth step.

According to the program, device, and method of the present invention, the recognition accuracy of behavior (context) can be improved from an overall viewpoint based on the scores of multiple recognition engines, without being affected by video areas other than the person in the video data.

FIG. 1 is a configuration diagram of a system having a recognition device.
FIG. 2 is a first functional configuration diagram of the recognition device according to the present invention.
FIG. 3 is a first image diagram showing skeleton information and an enclosed region.
FIG. 4 is a second image diagram showing an enclosed region.
FIG. 5 is a first functional configuration diagram, based on FIG. 2, configured with multiple moving object recognition engines.
FIG. 6 is a second functional configuration diagram of the recognition device according to the present invention.
FIG. 7 is a second functional configuration diagram, based on FIG. 6, configured with multiple moving object recognition engines.
FIG. 8 is a third image diagram showing an enclosed region.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 2 is a first functional configuration diagram of the recognition device according to the present invention.

The recognition device 1 uses multiple recognition engines to estimate a person's behavior (context) from video data showing the person.
According to FIG. 2, the recognition device 1 has a skeleton information extraction unit 11, a joint recognition engine 12, a region cutting unit 131, a moving object recognition engine 132, a score integration unit 14, and a weight calculation unit 15. It optionally also has a skeleton information normalization unit 111. These functional components can be realized by executing a program that causes a computer mounted on the device to function. The processing flow of these functional components can also be understood as a method of recognizing behavior (context) from video data.

[Skeleton information extraction unit 11]
The skeleton information extraction unit 11 extracts "skeleton information" based on the person's joints from the video data in time series (see, for example, Non-Patent Document 4). Skeleton information is joint data of a two-dimensional skeleton in which each joint is given a confidence (0 to 1). Because the joint data is two-dimensional, it can be extracted from video data captured with an ordinary Web camera.

Specifically, a skeleton model such as OpenPose (registered trademark) is used to extract feature points of human joints (see, for example, Non-Patent Documents 7 to 9). OpenPose is software that can detect key points of multiple people's bodies, hands, and faces from images in real time, and is publicly available on GitHub. For the whole body of a person in captured video, it can detect, for example, 15 key points.

FIG. 3 is a first image diagram showing skeleton information and an enclosed region.

In FIG. 3, one person appears in the video data. A confidence (0 to 1) is calculated for each joint (Nose, Neck, RShoulder, RElbow, ...). Skeleton information is the information that associates, in each frame, the two-dimensional coordinates of each of the 18 joints with its confidence.
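One possible in-memory representation of skeleton information is sketched below. The class names `Joint` and `SkeletonFrame` are illustrative assumptions; the per-joint two-dimensional coordinates and the confidence range of 0 to 1 follow the description of the skeleton information extraction unit 11. The example lists only two joints for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Joint:
    """One detected joint: 2D image coordinates plus a confidence in [0, 1]."""
    x: float
    y: float
    confidence: float

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")


@dataclass
class SkeletonFrame:
    """Skeleton information for one person in one video frame."""
    frame_index: int
    joints: Dict[str, Joint]


# A time series of frames; the coordinates are made-up illustrative values.
skeleton_sequence: List[SkeletonFrame] = [
    SkeletonFrame(0, {"Nose": Joint(120.0, 40.0, 0.9), "Neck": Joint(120.0, 80.0, 0.8)}),
    SkeletonFrame(1, {"Nose": Joint(121.0, 41.0, 0.9), "Neck": Joint(120.5, 80.5, 0.7)}),
]
```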

The derived skeleton information is output to the joint recognition engine 12 and the region cutting unit 131.
Optionally, the skeleton information may be output to the joint recognition engine 12 and the region cutting unit 131 via the skeleton information normalization unit 111.

[Skeleton information normalization unit 111] (optional)
The skeleton information normalization unit 111 normalizes the skeleton information by shifting and scaling it with respect to the time-series coordinate system.
The normalized skeleton information is output to the joint recognition engine 12 and the region cutting unit 131.

(S1) The origin coordinates and the scale factor are calculated from a reference frame.
The reference frame is the first frame in which "Neck", "LHip", and "RHip" are all detected simultaneously for a person in the video data with a confidence at or above a given level.
The origin is the coordinate p_i(Neck) of the Neck joint of person i in the reference frame.
The scale factor scale_i is chosen so that the average of the Neck-to-LHip distance and the Neck-to-RHip distance becomes 100, and is calculated as follows.
scale_i = 200 / (||p_i(Neck) - p_i(LHip)|| + ||p_i(Neck) - p_i(RHip)||)
scale_i: scale of person i
p_i(): two-dimensional coordinates of person i
(S2) Using the scale factor scale_i, the skeleton of each person detected in each frame is shifted and scaled as follows.
np_i^t(j) = scale_i * (p_i^t(j) - p_i(Neck))
t: frame number
j: joint number
p_i^t(j): coordinates of joint j of person i in frame t
np_i^t(j): normalized coordinates of joint j of person i in frame t
In the embodiment described above, the reference frame is selected using "Neck", "LHip", and "RHip", joints whose relative positions change little; "Neck", "LShoulder", and "RShoulder" may be used instead.
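Steps S1 and S2 can be sketched for a single person as follows. The function names, the default confidence threshold of 0.5, and the representation of a frame as a dict from joint name to an (x, y, confidence) tuple are assumptions made for illustration; the scale and shift follow the formulas given for scale_i and np_i^t(j).

```python
import math


def distance(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def find_reference_frame(frames, min_confidence, keys=("Neck", "LHip", "RHip")):
    """Return the index of the first frame where all key joints are confident."""
    for index, frame in enumerate(frames):
        if all(k in frame and frame[k][2] >= min_confidence for k in keys):
            return index
    return None


def normalize_skeleton(frames, min_confidence=0.5):
    """Shift every joint by the reference Neck and scale by scale_i (S1, S2)."""
    ref_index = find_reference_frame(frames, min_confidence)
    if ref_index is None:
        raise ValueError("no reference frame found")
    ref = frames[ref_index]
    neck = ref["Neck"][:2]
    # scale_i = 200 / (|Neck - LHip| + |Neck - RHip|)
    scale = 200.0 / (distance(neck, ref["LHip"][:2]) + distance(neck, ref["RHip"][:2]))
    return [
        {name: (scale * (x - neck[0]), scale * (y - neck[1]), c)
         for name, (x, y, c) in frame.items()}
        for frame in frames
    ]


# Illustrative frame: both hip distances are 50, so scale_i is 2.
frames = [{"Neck": (100.0, 100.0, 0.9), "LHip": (70.0, 140.0, 0.9), "RHip": (130.0, 140.0, 0.9)}]
normalized = normalize_skeleton(frames)
```

After normalization the Neck of the reference frame sits at the origin and the mean Neck-to-hip distance is 100, so skeletons of people at different distances from the camera become comparable.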

[Joint recognition engine 12]
The joint recognition engine 12 is a deep learning model built in advance from teacher data in which "behavior" is associated with the skeleton information (normalized skeleton information) of video data.
Using this learning model, the joint recognition engine 12 recognizes "behavior" from the skeleton information (normalized skeleton information) of the video data (see, for example, Non-Patent Document 5). The joint recognition engine 12 is based on classification, and calculates a score for each class (each behavior (context) that can be estimated).
The joint recognition engine 12 recognizes a person's behavior, such as "drinking", "eating", "running", and "folding", from the angles and positions of the person's joints.
The score for each behavior is then output to the score integration unit 14.

[Area cutout unit 131]
The area cutout unit 131 extracts the "enclosed area" of the skeleton information from the video data. The enclosed area is the minimum area that surrounds the skeleton information (normalized skeleton information). By inputting only the enclosed area in which the person appears to the downstream motion recognition engine 132, the recognition process can be sped up and the accuracy of action recognition can be improved.

As shown in FIG. 3 described above, first, the "joint bounding box", the smallest rectangle that contains the coordinate points of all the joints (18 points), is calculated (short dashed line). Then, an enlarged box obtained by enlarging the joint bounding box by a predetermined ratio is derived as the "enclosed area" (long dashed line).

FIG. 4 is a second image diagram showing the enclosed area.

In FIG. 4, two people appear in the video data.
First, the "joint bounding box", the smallest rectangle that contains the coordinate points of all the joints, is calculated. Next, the joint data is divided by person based on the connection structure of the links connecting the joints.
Then, for each person, an enlarged box obtained by enlarging that person's joint bounding box by a predetermined ratio is derived as the "enclosed area".
In FIG. 4, a single area that contains the enclosed areas of both people, within the bounds of the image, may also be re-derived as the "enclosed area".
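The derivation of the enclosed area can be sketched as below. The text only says the joint bounding box is enlarged "at a predetermined ratio", so enlarging about the box center, the ratio of 1.5, the clipping to the image, and all names and coordinates here are assumptions for illustration.

```python
def joint_bounding_box(points):
    """Smallest axis-aligned rectangle (x0, y0, x1, y1) containing all joint points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def enclosed_area(box, ratio, width, height):
    """Enlarge the box about its center by `ratio`, then clip it to the image."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w, half_h = (x1 - x0) * ratio / 2.0, (y1 - y0) * ratio / 2.0
    return (max(0.0, cx - half_w), max(0.0, cy - half_h),
            min(float(width), cx + half_w), min(float(height), cy + half_h))


def union_area(boxes):
    """Single area containing every per-person enclosed area."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


# Hypothetical joints of two people in a 640x480 image (already split per person).
person_a = [(100, 100), (140, 200), (120, 150)]
person_b = [(300, 120), (360, 220)]

area_a = enclosed_area(joint_bounding_box(person_a), 1.5, 640, 480)
area_b = enclosed_area(joint_bounding_box(person_b), 1.5, 640, 480)
combined = union_area([area_a, area_b])
```

The combined area corresponds to the optional re-derivation for FIG. 4, where one region covering both people is passed downstream instead of two separate crops.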

[Motion recognition engine 132]
The motion recognition engine 132 is likewise a deep learning model built in advance from teacher data in which an "action" is associated with video data.
Using this learning model, the motion recognition engine 132 recognizes an "action" from the enclosed area of the video data. The motion recognition engine 132 is also a classifier and calculates a score for each class (each estimable action (context)).
The score for each action calculated by each motion recognition engine 132 is then output to the score integration unit 14.

FIG. 5 is a first functional configuration diagram, based on FIG. 2, in which a plurality of motion recognition engines are used.

The motion recognition engine 132 may be composed of the following two recognition engines.
(1) An object recognition engine based on RGB images
(2) A motion recognition engine based on optical flow
Of course, the recognition engines included in the motion recognition engine 132 are not limited to two; three or more different types of recognition engines may be combined.
The types of the context "action" are preferably the same across the engines so that the downstream score integration unit 14 can integrate their scores.

(1) The object recognition engine based on RGB images specifically uses a neural network such as a CNN (Convolutional Neural Network) to estimate the objects appearing in the captured video (see, for example, Non-Patent Document 2).

(2) The motion recognition engine based on optical flow extracts locations where the same feature points move between frames and represents the motion of objects in the video data as "vectors" (see, for example, Non-Patent Document 6).

Given two adjacent frames of the video data (RGB images) as input, the motion recognition engine based on optical flow calculates two optical flow images of the same resolution. Each RGB image is converted into a luminance image Y by the following formula.
Y = 0.299 \times R + 0.587 \times G + 0.114 \times B
Y: luminance value of the pixel
R, G, B: R, G, and B values of the pixel
According to the technique of Non-Patent Document 6, within a small region of the first frame, the Y component of any pixel can be expressed on a quadratic polynomial basis as follows.
f_1(x) = x^T A_1 x + b_1^T x + c_1
f_1(x): Y component of the target pixel
x: position coordinates of the target pixel in the first frame
A_1, b_1, c_1: coefficients calculated for that region
Similarly, the corresponding region of the second frame is expressed as follows.
f_2(x) = f_1(x - d)
= (x - d)^T A_1 (x - d) + b_1^T (x - d) + c_1
= x^T A_1 x + (b_1 - 2 A_1 d)^T x + d^T A_1 d - b_1^T d + c_1
= x^T A_2 x + b_2^T x + c_2
d: optical flow of the target pixel x (difference in position coordinates)
A_2, b_2, c_2: coefficients calculated for that region
Comparing the linear terms gives b_2 = b_1 - 2 A_1 d, so the optical flow d is calculated by the following formula.
d = -\frac{1}{2} A_1^{-1} (b_2 - b_1)
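The two formulas above, the luminance conversion and d = -1/2 A_1^{-1} (b_2 - b_1), can be checked numerically with a small sketch. It assumes a 2x2 symmetric A_1 and hand-picked coefficients; estimating A_1, b_1, and b_2 from actual image regions is not shown, and all names and values are illustrative.

```python
def luminance(r, g, b):
    """Y = 0.299*R + 0.587*G + 0.114*B for one pixel."""
    return 0.299 * r + 0.587 * g + 0.114 * b


def optical_flow(a1, b1, b2):
    """d = -1/2 * A1^{-1} (b2 - b1) for a 2x2 matrix A1 given as ((p, q), (r, s))."""
    (p, q), (r, s) = a1
    det = p * s - q * r
    if det == 0:
        raise ValueError("A1 is singular; the flow is undefined for this region")
    inv = ((s / det, -q / det), (-r / det, p / det))
    db = (b2[0] - b1[0], b2[1] - b1[1])
    return (-0.5 * (inv[0][0] * db[0] + inv[0][1] * db[1]),
            -0.5 * (inv[1][0] * db[0] + inv[1][1] * db[1]))


# Hypothetical coefficients of the first frame and a known displacement.
a1 = ((2.0, 0.0), (0.0, 1.0))
b1 = (1.0, 1.0)
true_d = (0.5, -1.0)

# Linear coefficients of the second frame follow from b2 = b1 - 2 * A1 * d.
b2 = (b1[0] - 2.0 * (a1[0][0] * true_d[0] + a1[0][1] * true_d[1]),
      b1[1] - 2.0 * (a1[1][0] * true_d[0] + a1[1][1] * true_d[1]))

recovered_d = optical_flow(a1, b1, b2)
white = luminance(255, 255, 255)
```

Recovering the displacement that was used to build b_2 confirms that the closed-form expression is consistent with the expansion of f_1(x - d).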

[Score integration unit 14]
As shown in FIG. 2, the score integration unit 14 outputs, for each "action", an "integrated score" that integrates the statistical values of the scores of the joint recognition engine 12 and the motion recognition engine 132.
The statistical value of the scores may be the maximum value or the average value of the scores over a plurality of actions.
As shown in FIG. 5, when the motion recognition engine is composed of the object recognition engine 1321 and the motion recognition engine 1322, the score integration unit 14 integrates, for each action, the scores of the joint recognition engine 12, the object recognition engine 1321, and the motion recognition engine 1322.

Here, the score integration unit 14 applies to the score of each recognition engine a "weight" corresponding to that recognition engine, and calculates the integrated score SA^{all}(act) for an action act.
SA^{RGB}(act): score of the object recognition engine (RGB image)
SA^{flow}(act): score of the motion recognition engine (optical flow image)
SA^{skeleton}(act): score of the joint recognition engine (skeleton information)
w^{skeleton}: weight for the score of the joint recognition engine
w^{RGB}: weight for the score of the object recognition engine
w^{flow}: weight for the score of the motion recognition engine
SA^{all}(act) = w^{skeleton} SA^{skeleton}(act) + w^{RGB} SA^{RGB}(act) + w^{flow} SA^{flow}(act)
The action in the video data is then estimated by the following formula.
best\_act = \arg\max_{act} SA^{all}(act)
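The weighted integration and the arg max can be sketched as follows. The scores, weights, and action names are made-up values for illustration; only the weighted sum and the selection of the highest integrated score come from the formulas above.

```python
def integrate_scores(scores, weights):
    """SA_all(act) = sum over engines of w_engine * SA_engine(act)."""
    actions = next(iter(scores.values())).keys()
    return {act: sum(weights[engine] * scores[engine][act] for engine in scores)
            for act in actions}


# Hypothetical per-action scores from the three recognition engines.
scores = {
    "skeleton": {"drink": 0.6, "eat": 0.3, "run": 0.1},
    "RGB": {"drink": 0.2, "eat": 0.7, "run": 0.1},
    "flow": {"drink": 0.5, "eat": 0.4, "run": 0.1},
}
# Hypothetical weights (in the text these come from the weight calculation unit).
weights = {"skeleton": 0.5, "RGB": 0.2, "flow": 0.3}

integrated = integrate_scores(scores, weights)
best_act = max(integrated, key=integrated.get)  # arg max over actions
```

Note that all engines must score the same set of actions for the per-action sum to be meaningful, which matches the preference stated earlier that the action types be identical across engines.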

By assigning different "weights" to different recognition engines in this way, it is possible to avoid the unfairness that arises because a high score from a recognition engine does not necessarily mean high recognition accuracy.

[Weight calculation unit 15]
The weight calculation unit 15 receives a plurality of training data based on different actions, calculates a statistical value of the scores for each recognition engine, and assigns a larger "weight" the lower that statistical value is. Specifically, the reciprocal of the statistical value (for example, the average value) of the scores, normalized so that the reciprocals of all recognition engines sum to 1, may be used as the "weight". The "weight" is calculated, for example, as follows.
MSA^{skeleton} = \max_{act} SA^{skeleton}(act)
MSA^{RGB} = \max_{act} SA^{RGB}(act)
MSA^{flow} = \max_{act} SA^{flow}(act)
w^{skeleton} = \frac{e^{-MSA^{skeleton}}}{e^{-MSA^{skeleton}} + e^{-MSA^{RGB}} + e^{-MSA^{flow}}}
w^{RGB} = \frac{e^{-MSA^{RGB}}}{e^{-MSA^{skeleton}} + e^{-MSA^{RGB}} + e^{-MSA^{flow}}}
w^{flow} = \frac{e^{-MSA^{flow}}}{e^{-MSA^{skeleton}} + e^{-MSA^{RGB}} + e^{-MSA^{flow}}}
max: maximum value (the statistical value)
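The example weight formulas amount to a softmax over the negated maximum scores, which can be sketched as below. How scores from multiple training samples are pooled is not spelled out in the text, so pooling them into one list per engine is an assumption here, as are the names and sample values.

```python
import math


def engine_weights(training_scores):
    """w_engine = exp(-MSA_engine) / sum of exp(-MSA) over all engines."""
    # MSA: maximum score per engine (the statistical value used in the example).
    msa = {engine: max(scores) for engine, scores in training_scores.items()}
    exps = {engine: math.exp(-value) for engine, value in msa.items()}
    total = sum(exps.values())
    return {engine: value / total for engine, value in exps.items()}


# Hypothetical scores observed on training data, pooled per engine.
training_scores = {
    "skeleton": [0.90, 0.05, 0.05],
    "RGB": [0.70, 0.20, 0.10],
    "flow": [0.50, 0.30, 0.20],
}
weights = engine_weights(training_scores)
```

The weights sum to 1 and grow as the maximum score falls, so an engine that tends to output low scores is boosted non-linearly relative to one that tends to output high scores.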

Specifically, the scores calculated by the joint recognition engine 12 and the motion recognition engine 132 (RGB image recognition and optical flow image recognition) are combined in a complementary and fair manner. That is, to normalize the scale between the scores of the recognition engines, the reciprocal of each classifier's score statistic (for example, the average value) is used.
As a result, a recognition engine that tends to output high scores overall and a recognition engine that tends to output low scores overall can be combined, with their scores integrated at a ratio that reflects those tendencies.

For example, suppose that, for the input video data, the recognition accuracy of the optical flow recognition engine 1322 is higher than that of the RGB recognition engine 1321, while the score statistic for actions from the optical flow recognition engine 1322 is lower than that from the RGB recognition engine 1321. In that case, the weight for the score of the optical flow recognition engine 1322 is increased, and the weight for the score of the RGB recognition engine 1321 is decreased. In particular, it is effective to increase the "weight" for low scores non-linearly.

FIG. 6 is a second functional configuration diagram of the recognition device according to the present invention.

In FIG. 6, unlike FIG. 2, the skeleton information (normalized skeleton information) is input only to the joint recognition engine 12, and the joint behavior determination unit 16 operates on the score for each action from the joint recognition engine 12.

[Joint behavior determination unit 16]
The joint behavior determination unit 16 determines whether the action scores calculated by the joint recognition engine 12 satisfy a predetermined condition.
The area cutout unit 131, in turn, extracts the enclosed area of the skeleton information from the video data when the joint behavior determination unit 16 determines the condition to be true.

The joint behavior determination unit 16 applies, as the predetermined condition, one of the following two determinations.
(Determination 1) The joint behavior determination unit 16 determines whether the maximum or average score over a plurality of actions is at or below a predetermined threshold. This means that the downstream motion recognition engine 132 is run only when the score of the joint recognition engine 12 is low.

(Determination 2) The joint behavior determination unit 16 determines whether the action with the maximum score is a predetermined target action. This means that the downstream motion recognition engine 132 is run only when the action recognized by the joint recognition engine 12 is a predetermined target action.
The predetermined target action is one that the joint recognition engine 12 can recognize, for example "reading", "writing", or "typing".

When the joint behavior determination unit 16 determines true by Determination 1 or Determination 2, it outputs the video data to the area cutout unit 131. When it determines false, it outputs the score for each action from the joint recognition engine 12 to the score integration unit 14.
The area cutout unit 131 extracts the enclosed area of the skeleton information from the video data only when the joint behavior determination unit 16 determines true.
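The two determinations and the resulting routing can be sketched as follows. The threshold of 0.5, the score values, the target action set, and the string labels for the next stage are all assumptions chosen for the example.

```python
from statistics import mean


def determination_1(scores, threshold, statistic=max):
    """True when the max (or mean) action score is at or below the threshold."""
    return statistic(scores.values()) <= threshold


def determination_2(scores, target_actions):
    """True when the top-scoring action is one of the predetermined target actions."""
    return max(scores, key=scores.get) in target_actions


def next_stage(is_true):
    """Route to the area cutout unit on true, otherwise straight to score integration."""
    return "area_cutout_unit_131" if is_true else "score_integration_unit_14"


# Hypothetical per-action scores from the joint recognition engine.
uncertain = {"reading": 0.35, "writing": 0.33, "running": 0.32}
confident = {"reading": 0.05, "writing": 0.05, "running": 0.90}

stage_a = next_stage(determination_1(uncertain, threshold=0.5))
stage_b = next_stage(determination_1(confident, threshold=0.5))
stage_c = next_stage(determination_2(uncertain, {"reading", "writing", "typing"}))
stage_d = next_stage(determination_1(confident, threshold=0.5, statistic=mean))
```

In this arrangement the more expensive image-based engines run only when the skeleton-based result is weak (Determination 1) or when the recognized action is one that warrants a closer look (Determination 2).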

FIG. 7 is a second functional configuration diagram, based on FIG. 6, in which a plurality of motion recognition engines are used.

As in FIG. 5, the motion recognition engine 132 is composed of the object recognition engine 1321 based on RGB images and the motion recognition engine 1322 based on optical flow. In this case, the area cutout unit 131 inputs the enclosed area of the video data to both recognition engines.

FIG. 8 is a third image diagram showing the enclosed area.

In FIG. 8, the joint recognition engine 12 outputs an important joint part that satisfies a predetermined condition. Running the joint recognition engine 12 first makes it possible to recognize roughly what kind of action is taking place.
Joint parts are classified into "hand", "head", "upper body", "lower body", and "whole body" (see FIG. 8(a)). Among the classified joint parts, a part is judged to be more important the larger its displacement amount and/or number of displacements per unit time.

The area cutout unit 131, in turn, extracts an enclosed area that includes the important joint part output from the joint recognition engine 12.
When the important joint part is "hand", as shown in FIG. 8(b), the bounding box that is the smallest rectangle containing the "hand" and "elbow" joints in the skeleton information is extracted (short dashed line). Then, an enlarged box obtained by enlarging that bounding box by a predetermined ratio is derived as the "enclosed area" (long dashed line).
Alternatively, for the important joint part "hand", a circle of a fixed radius centered on the hand may be drawn, and the bounding box may be calculated from its edge.
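One way to picture the selection of the important joint part and the circle-based bounding box is the sketch below. It ranks parts by total displacement only (the number of displacements is not modelled), and the joint tracks, the radius, and every name are hypothetical.

```python
import math


def total_displacement(track):
    """Sum of frame-to-frame distances along one joint's trajectory."""
    return sum(math.dist(a, b) for a, b in zip(track, track[1:]))


def important_part(tracks_by_part):
    """Joint part whose joints moved the most over the observed frames."""
    return max(tracks_by_part,
               key=lambda part: sum(total_displacement(t) for t in tracks_by_part[part]))


def circle_bounding_box(center, radius):
    """Axis-aligned box (x0, y0, x1, y1) touching the edge of a circle around `center`."""
    cx, cy = center
    return (cx - radius, cy - radius, cx + radius, cy + radius)


# Hypothetical joint trajectories over three frames, grouped by joint part.
tracks_by_part = {
    "hand": [[(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]],
    "head": [[(10.0, 10.0), (10.0, 11.0), (10.0, 10.0)]],
}

part = important_part(tracks_by_part)
hand_position = tracks_by_part["hand"][0][-1]
hand_box = circle_bounding_box(hand_position, radius=20.0)
```

The resulting box could then be passed to the motion recognition engine in place of the whole-body enclosed area, focusing the image-based recognition on the part that is actually moving.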

As described in detail above, the program, device, and method of the present invention can improve the recognition accuracy of actions (contexts) from video data, from a comprehensive viewpoint based on a plurality of recognition engines and without being affected by regions other than the person.

In particular, according to the present invention, actions are recognized only for the person region of the video data to be estimated, which improves the recognition accuracy of the context.
In addition, using a plurality of recognition engines of different types makes it possible to improve recognition accuracy across various shooting environments without depending on the estimation result of any one recognition engine.
Furthermore, the mismatch between score and recognition accuracy across different types of recognition engines can be resolved by assigning each recognition engine a "weight" according to the statistical value of its scores.

Various changes, modifications, and omissions to the embodiments of the present invention described above can easily be made by those skilled in the art within the scope of the technical idea and viewpoint of the present invention. The above description is merely an example and is not intended to be limiting. The present invention is limited only by the claims and their equivalents.

1 Recognition device
11 Skeleton information extraction unit
111 Skeleton information normalization unit
12 Joint recognition engine
131 Area cutout unit
132 Motion recognition engine
1321 RGB recognition engine
1322 Optical flow recognition engine
14 Score integration unit
15 Weight calculation unit
16 Joint behavior determination unit
2 Terminal

Claims

An action recognition program that causes a computer to function so as to recognize actions from video data in which a person appears, the program causing the computer to function as:
skeleton information extraction means for extracting, in time series, skeleton information based on the person's joints from the video data;
a joint recognition engine that recognizes actions from the skeleton information of the video data;
area cutout means for extracting an enclosed area of the skeleton information from the video data;
a motion recognition engine that recognizes actions from the enclosed area of the video data;
score integration means for outputting, for each action, an integrated score in which the scores of the joint recognition engine and the motion recognition engine are integrated with weights corresponding to the respective recognition engines; and
weight calculation means for receiving a plurality of training data based on different actions, calculating a statistical value of the scores for each recognition engine, and assigning as the weight a value that becomes larger as the statistical value of the scores becomes lower.

The action recognition program according to claim 1, wherein the weight calculation means causes the computer to function so as to use, as the weight,
the reciprocal of the statistical value of the scores (normalized so that the reciprocals of all recognition engines sum to 1).

An action recognition program that causes a computer to function so as to recognize actions from video data in which a person appears, the program causing the computer to function as:
skeleton information extraction means for extracting, in time series, skeleton information based on the person's joints from the video data;
a joint recognition engine that recognizes actions from the skeleton information of the video data;
joint behavior determination means for determining whether the action score calculated by the joint recognition engine satisfies a predetermined condition;
area cutout means for extracting an enclosed area of the skeleton information from the video data when the joint behavior determination means determines the condition to be true;
a motion recognition engine that recognizes actions from the enclosed area of the video data; and
score integration means for outputting, for each action, an integrated score that integrates the scores of the joint recognition engine and the motion recognition engine.

The action recognition program according to claim 3, wherein the joint behavior determination means causes the computer to function so as to determine, as the predetermined condition, whether the maximum or average score over a plurality of actions is at or below a predetermined threshold.

The action recognition program according to claim 3, wherein the joint behavior determination means causes the computer to function so as to determine, as the predetermined condition, whether the action with the maximum score is a predetermined target action.

The action recognition program according to any one of claims 3 to 5, wherein the joint recognition engine further outputs an important joint part that satisfies a predetermined condition, and
the area cutout means causes the computer to function so as to extract an enclosed area that includes the important joint part output from the joint recognition engine.

The action recognition program according to any one of claims 3 to 6, wherein the score integration means integrates the scores of the recognition engines with weights corresponding to the respective recognition engines, and
the program further causes the computer to function as weight calculation means for receiving a plurality of training data based on different actions, calculating a statistical value of the scores for each recognition engine, and assigning as the weight a value that becomes larger as the statistical value of the scores becomes lower.

The action recognition program according to claim 7, wherein the weight calculation means causes the computer to function so as to use, as the weight, the reciprocal of the statistical value of the scores (normalized so that the reciprocals of all recognition engines sum to 1).

The action recognition program according to any one of claims 1 to 8, wherein the motion recognition engine causes the computer to function so as to be based on optical flow.

The action recognition program according to any one of claims 1 to 8, wherein the motion recognition engine includes an object recognition engine based on RGB images and a motion recognition engine based on optical flow, and
the score integration means causes the computer to function so as to integrate, for each action, the scores of the joint recognition engine, the motion recognition engine, and the object recognition engine.

The action recognition program according to any one of claims 1 to 10, further causing the computer to function as skeleton information normalization means for normalizing the skeleton information by shifting and scaling it with respect to the time-series coordinate system,
wherein the area cutout means causes the computer to function so as to extract, as the enclosed area, the minimum area surrounding the normalized skeleton information.

A device that recognizes actions from video data in which a person appears, the device comprising:
skeleton information extraction means for extracting, in time series, skeleton information based on the person's joints from the video data;
a joint recognition engine that recognizes actions from the skeleton information of the video data;
area cutout means for extracting an enclosed area of the skeleton information from the video data;
a motion recognition engine that recognizes actions from the enclosed area of the video data;
score integration means for outputting, for each action, an integrated score in which the scores of the joint recognition engine and the motion recognition engine are integrated with weights corresponding to the respective recognition engines; and
weight calculation means for receiving a plurality of training data based on different actions, calculating a statistical value of the scores for each recognition engine, and assigning as the weight a value that becomes larger as the statistical value of the scores becomes lower.

A device that recognizes actions from video data in which a person appears, the device comprising:
skeleton information extraction means for extracting, in time series, skeleton information based on the person's joints from the video data;
a joint recognition engine that recognizes actions from the skeleton information of the video data;
joint behavior determination means for determining whether the action score calculated by the joint recognition engine satisfies a predetermined condition;
area cutout means for extracting an enclosed area of the skeleton information from the video data when the joint behavior determination means determines the condition to be true;
a motion recognition engine that recognizes actions from the enclosed area of the video data; and
score integration means for outputting, for each action, an integrated score that integrates the scores of the joint recognition engine and the motion recognition engine.

A recognition method for a device that recognizes actions from video data in which a person appears,
the device executing:
a first step of extracting, in time series, skeleton information based on the person's joints from the video data;
a second step of recognizing an action by joint recognition from the skeleton information of the video data;
a third step of extracting an enclosed area of the skeleton information from the video data;
a fourth step of recognizing an action by motion recognition from the enclosed area of the video data; and
a fifth step of outputting, for each action, an integrated score in which the scores of the second step and the fourth step are integrated with weights corresponding to the respective steps,
wherein a plurality of training data based on different actions are input, a statistical value of the scores is calculated for each of the second step and the fourth step, and a value that becomes larger as the statistical value of the scores becomes lower is assigned as the weight.

A recognition method for a device that recognizes actions from video data in which a person appears,
the device executing:
a first step of extracting, in time series, skeleton information based on the person's joints from the video data;
a second step of recognizing an action by joint recognition from the skeleton information of the video data;
a third step of determining whether the action score calculated in the second step satisfies a predetermined condition;
a fourth step of extracting an enclosed area of the skeleton information from the video data when the third step determines the condition to be true;
a fifth step of recognizing an action by motion recognition from the enclosed area of the video data; and
a sixth step of outputting, for each action, an integrated score in which the scores of the second step and the fifth step are integrated.