JP2019096252A

JP2019096252A - Program, device and method for estimating context representing human action from captured video

Info

Publication number: JP2019096252A
Application number: JP2017227483A
Authority: JP
Inventors: Kazuyuki Tasaka; Hiromasa Yanagihara
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2017-11-28
Filing date: 2017-11-28
Publication date: 2019-06-20
Anticipated expiration: 2037-11-28
Also published as: JP6836985B2

Abstract

To provide a program, etc. for estimating a context that represents a human action as quickly and as accurately as possible, on the basis of the content of a captured video. SOLUTION: The present invention comprises: a first context recognition engine for recognizing a first context from a captured video and outputting the first context and a first score in association; first recognition determination means for determining whether or not a difference in first scores between a plurality of contexts recognized by the first context recognition engine is less than or equal to a prescribed threshold; a second context recognition engine for recognizing a second context from the captured video when the first recognition determination means determines true, and outputting the second context and a second score in association; and estimated context output means for outputting the second context recognized by the second context recognition engine when the first recognition determination means determines true. SELECTED DRAWING: Figure 2

Description

The present invention relates to a technique for estimating a context representing human behavior from captured video using a deep learning model.

FIG. 1 is a system configuration diagram including a behavior estimation device.

In the system of FIG. 1, the behavior estimation device 1 functions as a server connected to the Internet. The behavior estimation device 1 has a behavior estimation engine whose learning model has been built in advance from training videos. Each training video is a captured video showing a person's action, associated in advance with the target of that action.

Each terminal 2 is equipped with a camera and transmits captured video of a person's action to the behavior estimation device 1. The terminal 2 is a smartphone or other portable terminal carried by a user, and connects to an access network such as a mobile phone network or a wireless LAN.
The terminal 2 is of course not limited to a smartphone or the like, and may be, for example, a web camera installed in a home. Alternatively, video data captured by a web camera may be recorded on an SD card, and the recorded video data may then be input to the behavior estimation device 1.

In actual operation, for example, users participating in a monitor test are asked to record their own actions with the cameras of their own smartphones. Each smartphone transmits the video to the behavior estimation device 1. The behavior estimation device 1 estimates the person's behavior from the video, and the estimation result is used by various applications.

Various types of behavior estimation engine can be implemented in the behavior estimation device 1.

Conventionally, there is a technique that uses a motion feature (optical flow) in addition to RGB images in order to recognize the movement of a moving body from captured video (see, for example, Non-Patent Document 1). For example, Two-stream ConvNets use a spatial CNN (Spatial stream ConvNet) and a temporal CNN (Temporal stream ConvNet) to extract both the appearance features of objects and background in the image and the motion features in the sequence of horizontal and vertical optical flow components.
There is also a technique for extracting skeleton information on a person's joints and the parts connecting them in order to recognize the person's actions (see, for example, Non-Patent Document 2).

On the other hand, in order to speed up recognition processing, there is also a technique that cuts out candidate regions from a target image and identifies the object (see, for example, Patent Document 1). According to this technique, from among target images of a plurality of image sizes, the target image whose size is closest to the unified size used when the learning model was generated is selected.
There is also a technique for generating a predictor model that predicts the likelihood that a person will take a specific action type (see, for example, Patent Document 2). According to this technique, data including successful and failed instances of the action type is collected. A plurality of predictors of different types are generated from this data, and a predictor is selected based on its performance.

Patent Document 1: JP 2017-146840 A
Patent Document 2: JP 2016-510441 A

Non-Patent Document 1: Karen Simonyan and Andrew Zisserman, "Two-Stream Convolutional Networks for Action Recognition in Videos," in NIPS 2014, [online], [retrieved November 13, 2017], Internet <URL: https://arxiv.org/abs/1406.2199.pdf>
Non-Patent Document 2: Zhe Cao, Tomas Simon, Shih-En Wei, Yaser Sheikh, "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields," [online], [retrieved November 13, 2017], Internet <https://arxiv.org/pdf/1611.08050.pdf>
Non-Patent Document 3: "Score integration," [online], [retrieved November 13, 2017], Internet <https://image.slidesharecdn.com/170121stairlabslideshare-170119103908/95/-54-638.jpg?cb=1484822888>
Non-Patent Document 4: OpenPose, [online], [retrieved October 19, 2017], Internet <URL: https://github.com/CMU-Perceptual-Computing-Lab/openpose>
Non-Patent Document 5: "I tried OpenPose, which can detect bones from videos and photos," [online], [retrieved October 19, 2017], Internet <URL: http://hackist.jp/?p=8285>
Non-Patent Document 6: "OpenPose keeps being upgraded and now lets you try 3D pose estimation as well," [online], [retrieved October 19, 2017], Internet <URL: http://izm-11.hatenablog.com/entry/2017/08/01/140945>

With the prior art described above, the learning model that recognizes quickly and accurately must be chosen in advance according to the content of the captured video in which human behavior appears. Specifically, a context (human action) such as "drink", "eat", or "run" can be recognized by any of object recognition, moving object recognition, and recognition of a person's joint regions.

However, object recognition requires relatively few computational resources (processing load), but because it recognizes a context solely from the presence of an object, its recognition accuracy is inevitably low. Moving object recognition and joint region recognition, on the other hand, recognize contexts with high accuracy but inevitably require large computational resources. Because context recognition results must be output in real time for the captured video, the amount of computational resources becomes a problem.

Accordingly, an object of the present invention is to provide a program, a device, and a method for estimating a context representing human behavior as quickly and as accurately as possible based on the content of captured video.

According to the present invention, there is provided a context estimation program that causes a computer to function to estimate a context from captured video, the program causing the computer to function as:
a first context recognition engine that recognizes a first context from the captured video and outputs the first context and a first score in association with each other;
first recognition determination means for determining whether or not the difference between the first scores of a plurality of contexts recognized by the first context recognition engine is less than or equal to a predetermined threshold;
a second context recognition engine that, when the first recognition determination means determines true, recognizes a second context from the captured video and outputs the second context and a second score in association with each other; and
estimated context output means for outputting at least the second context when the first recognition determination means determines true.

According to another embodiment of the context estimation program of the present invention, it is also preferable that
the first recognition determination means determines whether or not the difference between the scores of the top two contexts recognized by the first context recognition engine is less than or equal to the predetermined threshold.

According to another embodiment of the context estimation program of the present invention, it is also preferable that
the first context recognition engine estimates, from the captured video, the first context as a target object by object recognition based on RGB images, and
the second context recognition engine estimates, from the captured video, the second context as a moving target by moving object recognition based on optical flow.

According to another embodiment of the context estimation program of the present invention, it is also preferable that
the first context recognition engine estimates, from the captured video, the first context as a moving target by moving object recognition based on optical flow, and
the second context recognition engine estimates, from the captured video, the second context as a person's joint regions by joint region recognition based on skeleton information.

According to another embodiment of the context estimation program of the present invention, it is also preferable that
the first context recognition engine estimates, from the captured video, the first context as a target object by object recognition based on RGB images, and
the second context recognition engine estimates, from the captured video, the second context as a person's joint regions by joint region recognition based on skeleton information.

According to another embodiment of the context estimation program of the present invention, it is also preferable that
the estimated context output means, instead of outputting the second context, outputs the context with the highest score based on the sum or the average of the first score of each of a plurality of first contexts and the second score of each of a plurality of second contexts.

According to another embodiment of the context estimation program of the present invention, it is also preferable that
the captured video is divided into predetermined unit times, and
for each predetermined unit time, the first context recognition engine and the first recognition determination means are executed at the initial stage of that unit time, and whether or not to subsequently execute the second context recognition engine is decided based on the determination of the first recognition determination means.

According to another embodiment of the context estimation program of the present invention, it is also preferable that
the second context recognition engine measures its processing time or processing time ratio (the proportion of each unit time taken up by that processing time), and
the first recognition determination means, when the processing time is greater than or equal to a predetermined threshold or the processing time ratio is greater than or equal to a predetermined threshold, outputs the first context to the estimated context output means and also executes the second context recognition engine.

According to another embodiment of the context estimation program of the present invention, it is also preferable that
the second context recognition engine is executed when the first recognition determination means determines false, and that the computer is caused to function as:
a third context recognition engine that, when the first recognition determination means determines true, recognizes a third context from the captured video and outputs the third context and a third score in association with each other; and
estimated context output means that, when the first recognition determination means determines true, outputs at least the third context instead of the second context.

According to another embodiment of the context estimation program of the present invention, it is also preferable that the program further comprises:
second recognition determination means for determining whether or not the difference between the scores of a plurality of contexts recognized by the second context recognition engine is less than or equal to a predetermined threshold; and
a third context recognition engine that, when the second recognition determination means determines true, recognizes a third context from the captured video,
wherein the estimated context output means, when the second recognition determination means determines true, outputs at least the third context instead of the second context.
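The staged escalation described in this embodiment can be sketched as a loop over engines ordered from cheapest to most expensive. This is only an illustrative sketch: the function name `cascade`, the representation of an engine as a callable returning a label-to-score dictionary, and the use of a single shared margin threshold for every stage are assumptions, not details taken from the specification.

```python
from typing import Callable, Dict, Sequence, Tuple

Scores = Dict[str, float]            # context label -> score (assumed representation)
Engine = Callable[[object], Scores]  # hypothetical engine interface


def cascade(video: object, engines: Sequence[Engine],
            margin_threshold: float) -> Tuple[str, int]:
    """Run engines in order and escalate while the top two scores are close.

    Returns the top-scoring context of the last engine that ran, together
    with the number of engines that were executed.
    """
    if not engines:
        raise ValueError("at least one engine is required")
    scores: Scores = {}
    used = 0
    for engine in engines:
        scores = engine(video)
        used += 1
        ranked = sorted(scores.values(), reverse=True)
        # "True" determination: the top two contexts are hard to tell apart.
        ambiguous = len(ranked) >= 2 and ranked[0] - ranked[1] <= margin_threshold
        if not ambiguous:
            break
    return max(scores, key=scores.get), used
```

With three engines, the third one runs only when both the first and the second determinations are true; if the last engine is still ambiguous, its top-scoring context is output anyway.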

According to another embodiment of the context estimation program of the present invention, it is also preferable that
the first context recognition engine estimates, from the captured video, the first context as a target object by object recognition based on RGB images,
the second context recognition engine estimates, from the captured video, the second context as a moving target by moving object recognition based on optical flow, and
the third context recognition engine estimates, from the captured video, the third context as a person's joint regions by joint region recognition based on skeleton information.

According to another embodiment of the context estimation program of the present invention, it is also preferable that
the estimated context output means, instead of outputting the third context, outputs the context with the highest score based on the sum or the average of the first score of each of a plurality of first contexts, the second score of each of a plurality of second contexts, and the third score of each of a plurality of third contexts.

According to another embodiment of the context estimation program of the present invention, it is also preferable that
the second context recognition engine and/or the third context recognition engine measures its processing time or processing time ratio (the proportion of each unit time taken up by that processing time), and
the first recognition determination means, when the processing time is greater than or equal to a predetermined threshold or the processing time ratio is greater than or equal to a predetermined threshold, outputs the first context to the estimated context output means and also executes the second context recognition engine and/or the third context recognition engine.

According to the present invention, there is provided a context estimation device for estimating a context from captured video, comprising:
a first context recognition engine that recognizes a first context from the captured video and outputs the first context and a first score in association with each other;
first recognition determination means for determining whether or not the difference between the first scores of a plurality of contexts recognized by the first context recognition engine is less than or equal to a predetermined threshold;
a second context recognition engine that, when the first recognition determination means determines true, recognizes a second context from the captured video and outputs the second context and a second score in association with each other; and
estimated context output means for outputting at least the second context when the first recognition determination means determines true.

According to the present invention, there is provided a context estimation method for a device that estimates a context from captured video, wherein the device executes:
a first step of recognizing a first context from the captured video and outputting the first context and a first score in association with each other;
a second step of determining whether or not the difference between the first scores of a plurality of contexts recognized in the first step is less than or equal to a predetermined threshold;
a third step of, when the second step determines true, recognizing a second context from the captured video and outputting the second context and a second score in association with each other; and
a fourth step of outputting at least the second context when the second step determines true.

According to the program, device, and method of the present invention, a context representing human behavior can be estimated as quickly and as accurately as possible based on the content of the captured video. Specifically, the context recognition engine serving as the learning model can be selected automatically based on the content of the captured video.

FIG. 1 is a system configuration diagram including a behavior estimation device.
FIG. 2 is a functional block diagram of a behavior estimation device having two context recognition engines.
FIG. 3 is a flow diagram showing the context estimation in FIG. 2.
FIG. 4 is an explanatory diagram showing estimation timing for captured video.
FIG. 5 is a flowchart showing combinations of two context recognition engines.
FIG. 6 is a functional block diagram of a behavior estimation device having three context recognition engines.
FIG. 7 is a flowchart showing combinations of three context recognition engines.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 2 is a functional block diagram of a behavior estimation device having two context recognition engines.
FIG. 3 is a flow diagram showing the context estimation in FIG. 2.

The behavior estimation device 1 receives captured video showing a person's action and estimates a context. As shown in FIG. 2, the main components of the behavior estimation device 1 are a first context recognition engine 11, a first recognition determination unit 12, a second context recognition engine 13, and an estimated context output unit 14. These functional components are realized by executing a program that causes a computer installed in the device to function. The processing flow of these functional components can also be understood as a behavior estimation method of the device.

The behavior estimation device 1 has a plurality of context recognition engines of different types and functions to automatically select a fast and accurate context recognition engine for each predetermined period into which the captured video is divided.

[First context recognition engine 11]
The first context recognition engine 11 recognizes first contexts from the captured video and outputs each first context in association with a first score (context recognition accuracy). The first context recognition engine 11 is assumed to have learned in advance contexts representing human actions, such as "drink", "eat", and "run".
Specifically, suppose that the first contexts are recognized from the captured video as follows.
[First context]: [First score]
Drink: 0.3
Eat: 0.2
Run: 0.1
The first contexts and first scores obtained as the recognition result are output to the first recognition determination unit 12.

[First recognition determination unit 12]
As an optional first step, when the score of the top-ranked first context recognized by the first context recognition engine 11 is extremely high, for example at or above a predetermined threshold (e.g., 90%), the first recognition determination unit 12 may output only that first context to the estimated context output unit 14 without executing the second context recognition engine 13.

According to the present invention, the first recognition determination unit 12 determines whether or not the difference between the first scores of a plurality of contexts recognized by the first context recognition engine 11 is less than or equal to a predetermined threshold.
Specifically, the first recognition determination unit 12 determines whether or not the difference between the scores of the top two contexts recognized by the first context recognition engine 11 is less than or equal to the predetermined threshold.
The larger the score difference, the more confidently the top-scoring context can be adopted. In that case, it is preferable to output the first context estimated by the first context recognition engine 11 alone.
Conversely, the smaller the score difference, the harder the top two contexts are to tell apart. In that case, it is preferable to additionally execute a context recognition engine of another type and use its context in the decision as well.
The predetermined threshold can be set by an operator. When the contexts to be recognized are motion-based, it is preferable to set the predetermined threshold (score difference) to a large value.

In the first context example above, the score difference between the top two recognition results is 0.1. If the predetermined threshold is 0.2, the score difference between the top two contexts is less than or equal to the predetermined threshold, so the determination is "true".
[First context]: [First score]
(Rank 1) Drink: 0.3
(Rank 2) Eat: 0.2 (score difference 0.1 = 0.3 - 0.2)
When the first recognition determination unit 12 determines true, it outputs the captured video to the second context recognition engine 13. When it determines false, it outputs the first context to the estimated context output unit 14.
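The top-two margin test described above can be sketched as follows. This is a minimal illustration rather than the specification's implementation: the function name `needs_second_engine`, the dictionary representation of scores, and the behavior when fewer than two contexts are present are all assumptions.

```python
from typing import Dict


def needs_second_engine(scores: Dict[str, float], margin_threshold: float) -> bool:
    """Return True when the gap between the top two scores is at most the threshold.

    A True result corresponds to the "true" determination, i.e. the top two
    contexts are considered hard to tell apart and another engine should run.
    """
    ranked = sorted(scores.values(), reverse=True)
    if len(ranked) < 2:
        # Assumption: with a single candidate there is nothing to compare.
        return False
    return ranked[0] - ranked[1] <= margin_threshold


# Scores taken from the example in the text.
first_scores = {"drink": 0.3, "eat": 0.2, "run": 0.1}
ambiguous = needs_second_engine(first_scores, margin_threshold=0.2)
```

With the example scores the gap is about 0.1, which is within the 0.2 threshold, so `ambiguous` is true and the captured video would be passed on to the second context recognition engine 13.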

[Second context recognition engine 13]
When the first recognition determination unit 12 determines true, the second context recognition engine 13 recognizes second contexts from the captured video and outputs each second context in association with a second score. The second context recognition engine 13 is likewise assumed to have learned in advance contexts representing human actions, such as "drink", "eat", and "run".
Specifically, suppose that the second contexts are recognized from the captured video as follows.
[Second context]: [Second score]
Drink: 0.5
Eat: 0.2
Run: 0.0
The second contexts and second scores obtained as the recognition result are output to the estimated context output unit 14.

[Estimated context output unit 14]
The estimated context output unit 14 outputs the second context when the first recognition determination unit 12 determines true. When the first recognition determination unit 12 determines false, it outputs the first context.
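The two-engine flow of FIG. 2 (run the first engine, test the margin, and run the second engine only when the determination is true) might be sketched as below. The function name `estimate_context`, the engine interface (a callable returning a label-to-score dictionary), and the stub engines that simply return the example scores from the text are assumptions made for illustration.

```python
from typing import Callable, Dict

Scores = Dict[str, float]            # context label -> score (assumed representation)
Engine = Callable[[object], Scores]  # hypothetical engine interface


def estimate_context(video: object, first_engine: Engine, second_engine: Engine,
                     margin_threshold: float) -> str:
    """Output the second engine's top context only when the first is ambiguous."""
    first = first_engine(video)
    ranked = sorted(first.values(), reverse=True)
    ambiguous = len(ranked) >= 2 and ranked[0] - ranked[1] <= margin_threshold
    if not ambiguous:
        # "False" determination: the first context is output as it is.
        return max(first, key=first.get)
    # "True" determination: the more expensive second engine is executed.
    second = second_engine(video)
    return max(second, key=second.get)


# Stub engines returning the example scores from the text (illustrative only).
def stub_first(video: object) -> Scores:
    return {"drink": 0.3, "eat": 0.2, "run": 0.1}


def stub_second(video: object) -> Scores:
    return {"drink": 0.5, "eat": 0.2, "run": 0.0}


result = estimate_context(None, stub_first, stub_second, margin_threshold=0.2)
```

The point of this structure is that the second engine's cost is paid only for video segments in which the first engine cannot separate its top two candidates.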

As another embodiment, it is also preferable that the estimated context output unit 14 outputs the context with the highest score (the integrated score value) based on the sum or the average of the first score of each of the plurality of first contexts and the second score of each of the plurality of second contexts.
Specifically, the estimated context is output as follows.
[Context]: [Score (average)]
Drink: (0.3 + 0.5) / 2 = 0.40
Eat: (0.2 + 0.2) / 2 = 0.20
Run: (0.1 + 0.0) / 2 = 0.05
In this case, the context "drink" is finally output to the application.
The scores may be integrated not only by a simple average but also by a weighted average or by using a support vector machine (see, for example, Non-Patent Document 3).
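The averaging example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `integrate_scores`, the weight parameters, and the treatment of a context missing from one engine as score 0.0 are not taken from the specification.

```python
def integrate_scores(first, second, w_first=0.5, w_second=0.5):
    """Combine two {context: score} mappings by a weighted average.

    Equal weights of 0.5 give the simple average used in the worked example.
    A context missing from one mapping is treated as score 0.0 (an assumption).
    """
    contexts = set(first) | set(second)
    return {
        c: w_first * first.get(c, 0.0) + w_second * second.get(c, 0.0)
        for c in contexts
    }


# Scores taken from the worked example in the text.
first_scores = {"drink": 0.3, "eat": 0.2, "run": 0.1}
second_scores = {"drink": 0.5, "eat": 0.2, "run": 0.0}

combined = integrate_scores(first_scores, second_scores)
best_context = max(combined, key=combined.get)  # context with the highest integrated score
print(best_context)
```

Passing unequal weights, for example a larger weight for the more accurate engine, gives the weighted-average variant mentioned in the text. The support vector machine variant is not sketched here.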

The estimated context output unit 14 outputs "context unrecognizable" when the score of the first context, the score of the second context, or the integrated value (sum or average) of the scores of the first and second contexts is less than or equal to a predetermined threshold.
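A minimal sketch of this rejection rule follows. It assumes that the score being compared is the highest score in the result and that the "unrecognizable" outcome is represented by `None`. The function name `output_context` and the threshold value 0.25 are hypothetical.

```python
def output_context(scores, min_score=0.25):
    """Return the top-scoring context, or None when recognition is not possible.

    min_score is a hypothetical value; the text only says "predetermined threshold".
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    # A top score at or below the threshold is reported as unrecognizable.
    return best if scores[best] > min_score else None


print(output_context({"drink": 0.40, "eat": 0.20, "run": 0.05}))
print(output_context({"drink": 0.20, "eat": 0.15, "run": 0.05}))
```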

FIG. 4 is an explanatory diagram showing the estimation timing for the captured video.

The captured video is divided into predetermined unit times. Within each predetermined unit time, the context recognition engine to be used thereafter is selected automatically at the initial stage required to recognize one action. That is, whether or not to execute the second context recognition engine 13 is decided for each predetermined unit time.

At the initial stage of the predetermined unit time, the first context recognition engine 11 and the first recognition determination unit 12 are executed, and whether or not to execute the second context recognition engine 13 thereafter is decided based on the determination of the first recognition determination unit 12. When the first recognition determination unit 12 determines "true", both the first context recognition engine 11 and the second context recognition engine 13 are executed within the subsequent predetermined time. When both are executed, the estimated context output unit 14 integrates both scores (sum or average) to determine the context.
On the other hand, when the first recognition determination unit 12 determines "false" at the initial stage of the predetermined unit time, only the first context recognition engine 11 is executed within the subsequent predetermined time.
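The per-unit-time decision can be sketched as follows. The sketch assumes that each unit time yields one set of first scores at its initial stage. The names `top2_gap` and `plan_unit` and the gap threshold 0.15 are hypothetical.

```python
def top2_gap(scores):
    """Difference between the highest and second-highest scores."""
    ranked = sorted(scores.values(), reverse=True)
    return ranked[0] - ranked[1] if len(ranked) >= 2 else float("inf")


def plan_unit(initial_first_scores, gap_threshold=0.15):
    """Decide at the initial stage of a unit time which engines run afterwards."""
    ambiguous = top2_gap(initial_first_scores) <= gap_threshold  # determination is "true"
    return ("first", "second") if ambiguous else ("first",)


# One entry per unit time: first scores observed at the initial stage.
initial_scores_per_unit = [
    {"drink": 0.3, "eat": 0.2, "run": 0.1},  # top two are close
    {"drink": 0.8, "eat": 0.1, "run": 0.0},  # top context is clear
]
plans = [plan_unit(scores) for scores in initial_scores_per_unit]
print(plans)
```

Making the decision once per unit time, rather than per frame, keeps the more expensive second engine idle for whole units in which the first engine is already confident.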

According to FIG. 2, as an optional configuration, the second context recognition engine 13 measures its processing time or processing time ratio.

In this case, when the processing time is greater than or equal to a predetermined threshold, or the processing time ratio (the proportion of the processing time of the second context recognition engine per unit time) is greater than or equal to a predetermined threshold, the first recognition determination unit 12 outputs the first context to the estimated context output unit 14 and also executes the second context recognition engine. A long processing time or a high processing time ratio in the second context recognition engine 13 means that the first context alone is not sufficient and that the second context is also needed. In this case, the first recognition determination unit 12 performs control so that both the first context and the second context are output to the estimated context output unit 14.
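One way to picture this optional check is shown below. It is a sketch only: the helper names `measure` and `needs_second_context`, the use of wall-clock time from `time.perf_counter`, and the ratio threshold 0.5 are assumptions, not details from the specification.

```python
import time


def measure(engine, video):
    """Run an engine once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = engine(video)
    return result, time.perf_counter() - start


def needs_second_context(elapsed, unit_time, time_threshold=None, ratio_threshold=0.5):
    """True when the processing time, or its share of the unit time, reaches a threshold."""
    if time_threshold is not None and elapsed >= time_threshold:
        return True
    return (elapsed / unit_time) >= ratio_threshold


# Hypothetical figures: 0.6 s of second-engine processing within a 1.0 s unit time.
print(needs_second_context(0.6, 1.0))
print(needs_second_context(0.2, 1.0))
```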

According to FIG. 2, for example, the following three types of context recognition engines are provided and used in combination.
Object recognition engine based on RGB recognition
Moving object recognition engine based on optical flow
Human joint region recognition engine based on skeleton information
These context recognition engines are learning models generated in advance from a large amount of learning video in order to estimate human actions from the captured video.

Specifically, the object recognition engine based on RGB recognition uses a neural network such as a CNN (Convolutional Neural Network) to estimate the objects appearing in the captured video.
The moving object recognition engine based on optical flow extracts the locations where the same feature points move between frames and represents the motion of objects in the captured video as vectors.
Specifically, the human joint region recognition engine based on skeleton information extracts feature points of human joints using a skeleton model such as OpenPose (registered trademark) (see, for example, Non-Patent Documents 7 to 9). OpenPose is software that can detect the body, hand, and face keypoints of multiple people from an image in real time, and it is published on GitHub. For the whole body of a person appearing in the captured video, for example, 15 keypoints can be detected.

The context recognition engines differ in their characteristics as follows.
[Engine]: [Amount of computation] / [Recognition accuracy]
Object recognition engine based on RGB recognition: small / low
Moving object recognition engine based on optical flow: medium / medium
Human joint region recognition engine based on skeleton information: large / high

In object recognition, moving object recognition, and joint region recognition, the contexts representing human actions generally differ from one another. According to the present invention, the recognized contexts are made common.
For example, object recognition first recognizes "plastic bottle" and "person", and then recognizes the context "drink" from the positions of the plastic bottle and the person's mouth.
Moving object recognition recognizes the context "drink" from the movement of the plastic bottle toward the person's mouth.
Joint region recognition recognizes the context "drink" from the angle of the person's arm and the position of the plastic bottle.
Thus, even when the same context is recognized, the deciding factors differ depending on the type of recognition engine. Even in this case, moving object recognition and joint region recognition have higher recognition accuracy than object recognition, and joint region recognition has higher recognition accuracy than moving object recognition.

FIG. 5 is a flowchart showing combinations of two context recognition engines.
[1] RGB recognition + optical flow recognition
[2] Optical flow recognition + skeleton information recognition
[3] RGB recognition + skeleton information recognition

[1] RGB recognition + optical flow recognition (FIG. 5(a))
(S11) The first context recognition engine 11 estimates a first context as an object from the captured video by object recognition based on RGB images.
(S12) It is then determined whether or not the difference between the scores of the top two contexts is less than or equal to a predetermined threshold.
(S13) If the determination in S12 is true, the second context recognition engine 13 estimates a second context as a moving object from the captured video by moving object recognition based on optical flow.
(S14) Then, an estimated context that integrates the two contexts of S11 and S13 (the context with the highest sum or average of scores) is output.
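Steps S11 to S14 can be sketched as a single function. The engines are modeled here as plain callables returning {context: score} mappings. The stub engines, the function names, and the gap threshold 0.15 are hypothetical stand-ins, not the trained models described in the specification.

```python
def top2_gap(scores):
    """Difference between the highest and second-highest scores."""
    ranked = sorted(scores.values(), reverse=True)
    return ranked[0] - ranked[1] if len(ranked) >= 2 else float("inf")


def estimate_context(video, first_engine, second_engine, gap_threshold=0.15):
    first = first_engine(video)                       # S11: first context and scores
    if top2_gap(first) > gap_threshold:               # S12 is false: first result is clear
        return max(first, key=first.get)
    second = second_engine(video)                     # S13: run the second engine
    contexts = set(first) | set(second)
    averaged = {c: (first.get(c, 0.0) + second.get(c, 0.0)) / 2 for c in contexts}
    return max(averaged, key=averaged.get)            # S14: highest averaged score


def rgb_stub(video):
    """Stand-in for RGB-based object recognition (hypothetical scores)."""
    return {"drink": 0.3, "eat": 0.2, "run": 0.1}


def flow_stub(video):
    """Stand-in for optical-flow-based moving object recognition (hypothetical scores)."""
    return {"drink": 0.5, "eat": 0.2, "run": 0.0}


result = estimate_context(None, rgb_stub, flow_stub)
print(result)
```

The same function covers combinations [2] and [3] by passing a different pair of engines.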

[2] Optical flow recognition + skeleton information recognition (FIG. 5(b))
(S11) The first context recognition engine 11 estimates a first context as a moving object from the captured video by moving object recognition based on optical flow.
(S12) It is then determined whether or not the difference between the scores of the top two contexts is less than or equal to a predetermined threshold.
(S13) If the determination in S12 is true, the second context recognition engine 13 estimates a second context as a joint region of the person from the captured video by human joint region recognition based on skeleton information.
(S14) Then, an estimated context that integrates the two contexts of S11 and S13 (the context with the highest sum or average of scores) is output.

[3] RGB recognition + skeleton information recognition (FIG. 5(c))
(S11) The first context recognition engine 11 estimates a first context as an object from the captured video by object recognition based on RGB images.
(S12) It is then determined whether or not the difference between the scores of the top two contexts is less than or equal to a predetermined threshold.
(S13) If the determination in S12 is true, the second context recognition engine 13 estimates a second context as a joint region of the person from the captured video by human joint region recognition based on skeleton information.
(S14) Then, an estimated context that integrates the two contexts of S11 and S13 (the context with the highest sum or average of scores) is output.

FIG. 6 is a functional configuration diagram of the behavior estimation device of the present invention having three context recognition engines.

According to FIG. 6(a), when the first recognition determination unit 12 determines true (the difference between the top two scores is less than or equal to a predetermined threshold), it outputs the captured video to the third context recognition engine 16. When it determines false, it outputs the captured video to the second context recognition engine 13. In this case, the first context recognition engine 11 is used only for switching between the second context recognition engine 13 and the third context recognition engine 16.

The smaller the score difference, the more easily the top two contexts are confused. In that case, the third context recognition engine 16 is used, which has relatively high recognition accuracy even though its recognition processing is relatively slow.
Conversely, the larger the score difference, the more confidently the top-scoring context can be adopted. In that case, the second context recognition engine 13 is used, which has relatively fast recognition processing even though its recognition accuracy is relatively low.
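The switching rule of FIG. 6(a) reduces to a single comparison, sketched below. The function name `select_engine`, the string labels, and the gap threshold 0.15 are hypothetical.

```python
def select_engine(first_scores, gap_threshold=0.15):
    """Pick the follow-up engine from the gap between the top two first scores."""
    ranked = sorted(first_scores.values(), reverse=True)
    gap = ranked[0] - ranked[1] if len(ranked) >= 2 else float("inf")
    # Small gap: the top two contexts are easily confused, so use the slower,
    # more accurate third engine. Large gap: the faster second engine suffices.
    return "third" if gap <= gap_threshold else "second"


print(select_engine({"drink": 0.3, "eat": 0.2, "run": 0.1}))
print(select_engine({"drink": 0.8, "eat": 0.1, "run": 0.0}))
```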

According to FIG. 6(b), when the first recognition determination unit 12 determines true (the difference between the top two scores is less than or equal to a predetermined threshold), it outputs the captured video to the second context recognition engine 13. When it determines false, the first context recognized by the first context recognition engine 11 is output to the estimated context output unit 14. This is the same as in FIG. 2 described above.
The second context recognition engine 13 then further outputs the recognized second context to the second recognition determination unit 15.

The second recognition determination unit 15 determines whether or not the difference between the second scores of the plurality of contexts recognized by the second context recognition engine 13 is less than or equal to a predetermined threshold.
Specifically, the second recognition determination unit 15 determines whether or not the difference between the scores of the top two contexts recognized by the second context recognition engine 13 is less than or equal to the predetermined threshold.
The larger the score difference, the more confidently the top-scoring context can be adopted. In that case, it is preferable to output the second context estimated by the second context recognition engine 13 alone.
Conversely, the smaller the score difference, the more easily the top two contexts are confused. In that case, it is preferable to further output the captured video to the third context recognition engine 16 and to make the decision using the third context as well.

The estimated context output unit 14 outputs the third context when the second recognition determination unit 15 determines true, and outputs the second context when it determines false. As in FIG. 2, when the first recognition determination unit 12 determines false, the first context is output.
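The three-stage arrangement of FIG. 6(b) can be sketched as a loop over engines ordered from cheapest to most expensive. The sketch stops at the first engine whose top two scores are clearly separated and otherwise falls through to the last engine. The names `cascade` and `make_stub`, the stub scores, and the gap threshold 0.15 are hypothetical.

```python
def top2_gap(scores):
    """Difference between the highest and second-highest scores."""
    ranked = sorted(scores.values(), reverse=True)
    return ranked[0] - ranked[1] if len(ranked) >= 2 else float("inf")


def cascade(video, engines, gap_threshold=0.15):
    """Run engines in order and output the context of the last engine that ran."""
    if not engines:
        raise ValueError("at least one engine is required")
    scores = {}
    for engine in engines:
        scores = engine(video)
        if top2_gap(scores) > gap_threshold:  # determination is false: result is clear
            break
    return max(scores, key=scores.get)


def make_stub(scores):
    """Build a stand-in engine that always returns the given scores."""
    return lambda video: scores


first = make_stub({"drink": 0.30, "eat": 0.25, "run": 0.10})   # ambiguous
second = make_stub({"drink": 0.40, "eat": 0.35, "run": 0.00})  # still ambiguous
third = make_stub({"drink": 0.20, "eat": 0.70, "run": 0.00})   # decisive

print(cascade(None, [first, second, third]))
```

This sketch covers only the embodiment that outputs a single engine's context. The score-integration embodiment would instead combine the scores of every engine that ran.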

As another embodiment, it is also preferable that the estimated context output unit 14 outputs the context with the highest score (the integrated score value) based on the sum or the average of the first score of each of the plurality of first contexts, the second score of each of the plurality of second contexts, and the third score of each of the plurality of third contexts.

Furthermore, as another embodiment, as in FIG. 2 described above, the second context recognition engine 13 and/or the third context recognition engine 16 may measure the processing time or the processing time ratio (the proportion of that processing time per unit time).
In that case, when the processing time is greater than or equal to a predetermined threshold, or the processing time ratio is greater than or equal to a predetermined threshold, the second recognition determination unit 15 outputs the first context to the estimated context output unit 14 and also executes the second context recognition engine 13 and/or the third context recognition engine 16. As described above, a long processing time or a high processing time ratio in the second context recognition engine 13 and/or the third context recognition engine 16 means that the first context alone is not sufficient and that the second context and/or the third context is also needed. In this case, the second recognition determination unit 15 performs control so that the second context and/or the third context is output to the estimated context output unit 14.

FIG. 7 is a flowchart showing combinations of three context recognition engines.
[4] RGB recognition + optical flow recognition or skeleton information recognition
[5] RGB recognition + optical flow recognition + skeleton information recognition

[4] RGB recognition + optical flow recognition or skeleton information recognition (FIG. 6(a), FIG. 7(a))
(S11) The first context recognition engine 11 estimates a first context as an object from the captured video by object recognition based on RGB images.
(S12) It is then determined whether or not the difference between the scores of the top two contexts is less than or equal to a predetermined threshold.
(S131) If the determination in S12 is false, the second context recognition engine 13 estimates a second context as a moving object from the captured video by moving object recognition based on optical flow.
(S132) If the determination in S12 is true, the third context recognition engine 16 estimates a third context as a joint region of the person by human joint region recognition based on skeleton information.
(S14) Then, an estimated context that integrates the two contexts, that of S11 and that of S131 or S132, is output (the context with the highest sum or average of scores).

[5] RGB recognition + optical flow recognition + skeleton information recognition (FIG. 6(b), FIG. 7(b))
(S11) The first context recognition engine 11 estimates a first context as an object from the captured video by object recognition based on RGB images.
(S12) It is then determined whether or not the difference between the scores of the top two contexts is less than or equal to a predetermined threshold.
(S131) If the determination in S12 is true, the second context recognition engine 13 estimates a second context as a moving object from the captured video by moving object recognition based on optical flow.
(S132) It is then determined whether or not the difference between the scores of the top two contexts is less than or equal to a predetermined threshold.
(S133) If the determination in S132 is true, the third context recognition engine 16 estimates a third context as a joint region of the person by human joint region recognition based on skeleton information.
(S14) Then, an estimated context that integrates the contexts of S11, S131, and S133 is output (the context with the highest sum or average of scores).

According to the present invention, the larger the change in the action of the person appearing in the captured video, the more moving object recognition and human joint region recognition are executed in addition to RGB recognition.

As described above in detail, according to the program, device, and method of the present invention, a context representing a human action can be estimated as quickly and as accurately as possible based on the content of the captured video. Specifically, the context recognition engine serving as the learning model can be selected automatically based on the content of the captured video.

Various changes, modifications, and omissions within the scope of the technical idea and aspects of the present invention can easily be made to the embodiments described above by those skilled in the art. The above description is merely an example and is not intended to be limiting in any way. The present invention is limited only as defined in the claims and their equivalents.

DESCRIPTION OF SYMBOLS
1 Behavior estimation device
11 First context recognition engine
12 First recognition determination unit
13 Second context recognition engine
14 Estimated context output unit
15 Second recognition determination unit
16 Third context recognition engine
2 Terminal

Claims

A context estimation program that causes a computer to function so as to estimate a context from a captured video, the program causing the computer to function as:
a first context recognition engine that recognizes a first context from the captured video and outputs the first context and a first score in association with each other;
first recognition determination means for determining whether or not a difference between the first scores of a plurality of contexts recognized by the first context recognition engine is less than or equal to a predetermined threshold;
a second context recognition engine that recognizes a second context from the captured video when the first recognition determination means determines true, and outputs the second context and a second score in association with each other; and
estimated context output means for outputting at least the second context when the first recognition determination means determines true.

The context estimation program according to claim 1, wherein the first recognition determination means causes the computer to function so as to determine whether or not the difference between the scores of the top two contexts recognized by the first context recognition engine is less than or equal to the predetermined threshold.

The context estimation program according to claim 1 or 2, wherein the computer is caused to function such that:
the first context recognition engine estimates the first context as an object from the captured video by object recognition based on RGB images; and
the second context recognition engine estimates the second context as a moving object from the captured video by moving object recognition based on optical flow.

The context estimation program according to claim 1 or 2, wherein the computer is caused to function such that:
the first context recognition engine estimates the first context as a moving object from the captured video by moving object recognition based on optical flow; and
the second context recognition engine estimates the second context as a joint region of a person from the captured video by human joint region recognition based on skeleton information.

The context estimation program according to claim 1 or 2, wherein the computer is caused to function such that:
the first context recognition engine estimates the first context as an object from the captured video by object recognition based on RGB images; and
the second context recognition engine estimates the second context as a joint region of a person from the captured video by human joint region recognition based on skeleton information.

The context estimation program according to any one of claims 1 to 5, wherein the estimated context output means causes the computer to function so as to output, instead of the second context, the context with the highest score based on a sum or an average of the first score of each of a plurality of first contexts and the second score of each of a plurality of second contexts.

The context estimation program according to any one of claims 1 to 6, wherein the captured video is divided into predetermined unit times, and
the computer is caused to function such that, for each predetermined unit time, the first context recognition engine and the first recognition determination means are executed at an initial stage of the predetermined unit time, and whether or not to execute the second context recognition engine thereafter is determined based on the determination of the first recognition determination means.

The context estimation program according to any one of claims 1 to 7, wherein the second context recognition engine measures a processing time or a processing time ratio (the proportion of the processing time per unit time), and
the computer is further caused to function such that, when the processing time is greater than or equal to a predetermined threshold or the processing time ratio is greater than or equal to a predetermined threshold, the first recognition determination means outputs the first context to the estimated context output means and executes the second context recognition engine.

The context estimation program according to any one of claims ... to 8, wherein the second context recognition engine is executed when the first recognition determination means determines false, and the computer is further caused to function as:
a third context recognition engine that recognizes a third context from the captured video when the first recognition determination means determines true, and outputs the third context and a third score in association with each other; and
the estimated context output means, which outputs at least the third context instead of the second context when the first recognition determination means determines true.

The context estimation program according to any one of claims 1 to 8, wherein the computer is further caused to function as:
second recognition determination means for determining whether or not a difference between the scores of a plurality of contexts recognized by the second context recognition engine is less than or equal to a predetermined threshold; and
a third context recognition engine that recognizes a third context from the captured video when the second recognition determination means determines true,
wherein the estimated context output means outputs at least the third context instead of the second context when the second recognition determination means determines true.

The context estimation program according to claim 9 or 10, wherein the computer is caused to function such that:
the first context recognition engine estimates the first context as an object from the captured video by object recognition based on RGB images;
the second context recognition engine estimates the second context as a moving object from the captured video by moving object recognition based on optical flow; and
the third context recognition engine estimates the third context as a joint region of a person from the captured video by human joint region recognition based on skeleton information.

The context estimation program according to any one of claims 9 to 11, wherein the estimated context output means causes the computer to function so as to output, instead of the third context, the context with the highest score based on a sum or an average of the first score of each of a plurality of first contexts, the second score of each of a plurality of second contexts, and the third score of each of a plurality of third contexts.

The context estimation program according to any one of claims 9 to 12, wherein the second context recognition engine and/or the third context recognition engine measures a processing time or a processing time ratio (the proportion of the processing time per unit time), and
the computer is further caused to function such that, when the processing time is greater than or equal to a predetermined threshold or the processing time ratio is greater than or equal to a predetermined threshold, the first recognition determination means outputs the first context to the estimated context output means and executes the second context recognition engine and/or the third context recognition engine.

A context estimation apparatus for estimating a context from a captured video, comprising:
a first context recognition engine that recognizes a first context from the captured video and outputs the first context and a first score in association with each other;
first recognition determination means for determining whether or not a difference between the first scores of a plurality of contexts recognized by the first context recognition engine is less than or equal to a predetermined threshold;
a second context recognition engine that recognizes a second context from the captured video when the first recognition determination means determines true, and outputs the second context and a second score in association with each other; and
estimated context output means for outputting at least the second context when the first recognition determination means determines true.

A context estimation method for an apparatus that estimates a context from a captured video, wherein the apparatus executes:
a first step of recognizing a first context from the captured video and associating the first context with a first score;
a second step of determining whether or not a difference between the first scores of a plurality of contexts recognized in the first step is less than or equal to a predetermined threshold;
a third step of recognizing a second context from the captured video when the second step determines true, and outputting the second context and a second score in association with each other; and
a fourth step of outputting at least the second context when the second step determines true.