JP6789601B2

JP6789601B2 - A learning video selection device, program, and method for selecting a captured video masking a predetermined image area as a learning video.

Info

Publication number: JP6789601B2
Application number: JP2017206712A
Authority: JP
Inventors: 和之田坂; 広昌柳原
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2017-10-26
Filing date: 2017-10-26
Publication date: 2020-11-25
Anticipated expiration: 2037-10-26
Also published as: JP2019079357A

Description

The present invention relates to a technique for collecting captured videos that are suitable as learning videos when constructing a learning model.

Conventionally, there are techniques for recognizing people and objects in captured video by using a learning model based on deep learning. A known problem is that the learning videos used to construct the learning model should not contain face images that can identify an individual, or other privacy images. For this reason, it is common to use, as learning videos, only captured videos that are fully publishable and contain neither face images nor privacy images.

For example, there is a technique that makes captured videos with processed face images searchable according to the searcher (see, for example, Patent Document 1). In this technique, after the searcher's identity has been verified, face images of persons from whom the searcher has obtained permission in advance are registered as search keys. Face images of anyone other than the registered persons are then processed for privacy protection.

There is also a technique that takes privacy images into account by capturing the subject with infrared light (see, for example, Patent Document 2). In this technique, data captured by infrared imaging of the subject is compared with pre-stored outline data of subjects. Presentation information is associated with each piece of outline data, and the presentation information of the outline data that matches the captured data is output.

There is also a technique for editing captured video that is to be disclosed to an unspecified number of third parties so that privacy and portrait rights are protected without impairing image quality (see, for example, Patent Document 3). In this technique, a specific subject is extracted from a video stream and the image of that subject is masked. Whether to mask the specific subject is decided according to an output condition based on the resolution of the video stream.

FIG. 1 is a system configuration diagram including a behavior analysis device.

In the system of FIG. 1, the behavior analysis device analyzes the behavior of a person appearing in video captured by a camera, and functions as a server by being connected to the Internet. The behavior analysis device has, for example, a behavior estimation engine whose learning model is constructed from the learning videos accumulated in a learning video storage unit. Each learning video associates a captured video showing a person's behavior with the corresponding action target.

The behavior estimation engine may be, for example, ActivityNet, which uses deep learning (see, for example, Non-Patent Document 1). With this technique, a learning model created from learning videos showing a wide variety of human behaviors (for example, "walking", "talking", and "holding") is used to analyze the "action target" of a person appearing in the captured video.
The behavior estimation engine may also be, for example, Two-stream ConvNets (see, for example, Non-Patent Document 2). This technique uses a CNN in the spatial direction (spatial stream ConvNet) and a CNN in the temporal direction (temporal stream ConvNet) to extract both the appearance features of objects and background in the image and the motion features in the sequence of horizontal and vertical optical-flow components, and thereby recognizes behavior with high accuracy.

In the system of FIG. 1, each terminal is equipped with a camera and transmits captured video of a person's behavior to the behavior analysis device 1. The terminal is a smartphone or other mobile terminal carried by a user, and connects to an access network such as a mobile phone network or a wireless LAN.
Of course, the terminal is not limited to a smartphone or the like, and may be, for example, a Web camera installed in a home. Alternatively, video data captured by the Web camera may be recorded on an SD card, and the recorded video data may then be input to the behavior analysis device 1.

In actual operation, for example, users participating in a monitor test are asked to record their own behavior with the cameras of their own smartphones. Each smartphone transmits the video to the behavior analysis device. The behavior analysis device estimates the user's behavior from the video, and the estimation result is used in various applications.

Patent Document 1: Japanese Unexamined Patent Publication No. 2014-89625
Patent Document 2: Japanese Unexamined Patent Publication No. 2016-169990
Patent Document 3: Japanese Unexamined Patent Publication No. 2014-42234

Non-Patent Document 1: Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem and Juan Carlos Niebles, "ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding," CVPR 2015, [online], [retrieved October 19, 2017], Internet <URL: http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Heilbron_ActivityNet_A_Large-Scale_2015_CVPR_paper.pdf>
Non-Patent Document 2: Karen Simonyan and Andrew Zisserman, "Two-Stream Convolutional Networks for Action Recognition in Videos," in NIPS 2014, [online], [retrieved October 19, 2017], Internet <URL: https://arxiv.org/abs/1406.2199>
Non-Patent Document 3: "FaceNet: A Unified Embedding for Face Recognition and Clustering," [online], [retrieved October 19, 2017], Internet <URL: https://arxiv.org/abs/1503.03832>
Non-Patent Document 4: "Let's identify 'regulars' from face images using AI!", [online], [retrieved October 19, 2017], Internet <URL: https://future-architect.github.io/articles/20170526/>
Non-Patent Document 5: "How far can they tell faces apart!? Digital camera face recognition showdown", [online], [retrieved October 19, 2017], Internet <URL: http://news.mynavi.jp/articles/2007/08/07/face/001.html>
Non-Patent Document 6: "World-recognized NEC face recognition technology", [online], [retrieved October 19, 2017], Internet <URL: http://jpn.nec.com/rd/research/DataAcquition/face.html>
Non-Patent Document 7: OpenPose, [online], [retrieved October 19, 2017], Internet <URL: https://github.com/CMU-Perceptual-Computing-Lab/openpose>
Non-Patent Document 8: "I tried OpenPose, which can detect bones from videos and photos", [online], [retrieved October 19, 2017], Internet <URL: http://hackist.jp/?p=8285>
Non-Patent Document 9: "OpenPose keeps being upgraded and now lets you try 3d pose estimation", [online], [retrieved October 19, 2017], Internet <URL: http://izm-11.hatenablog.com/entry/2017/08/01/140945>

It is preferable for the behavior analysis device to use, as learning videos suitable for constructing the learning model, a large number of captured videos from which privacy images have been removed by the techniques of Patent Documents 1 to 3 described above.
However, even when the context used by the learning model (for example, a person or an object) could be estimated from the original captured video, removing the privacy images often makes that context impossible to estimate. If such a captured video is used as a learning video, the recognition accuracy of the learning model built from it is reduced.

In addition, a large number of captured videos is needed to improve the recognition accuracy of the learning model, but collecting only captured videos that are free of privacy problems is costly and technically laborious.

Accordingly, an object of the present invention is to provide a learning video selection device, program, and method capable of selecting whether a captured video in which a predetermined image area has been masked can be used as a learning video.

According to the present invention, a learning video selection device that receives captured videos and selects learning videos from which a first context can be recognized comprises:
a first context recognition means for determining whether the first context can be recognized in a captured video; and
a captured video masking means for masking a predetermined image area in a captured video determined to be true by the first context recognition means,
wherein the first context recognition means recursively receives the masked captured video, determines whether the first context can be recognized, and selects only the captured videos determined to be true as learning videos.

According to another embodiment of the learning video selection device of the present invention, it is also preferable that the first context recognition means sequentially estimates a person's action target.

According to the present invention, a learning video selection device that receives captured videos and selects learning videos from which a first context can be recognized comprises:
a first context recognition means for determining whether the first context can be recognized in a captured video;
a captured video masking means for masking a predetermined image area in a captured video determined to be true by the first context recognition means; and
a second context recognition means for determining whether a second context can be recognized in the masked captured video,
wherein the first context recognition means recursively receives the masked captured video determined to be true by the second context recognition means, determines whether the first context can be recognized, and selects only the captured videos determined to be true as learning videos.

According to another embodiment of the learning video selection device of the present invention, it is also preferable that
the first context recognition means sequentially estimates a person's action target, and
the second context recognition means sequentially estimates a person's joint regions and/or sequentially estimates an object.

According to another embodiment of the learning video selection device of the present invention, it is also preferable that the captured video masking means represents the image area to be masked as a rectangular area and outputs separate unmasked captured videos for the upper, lower, left, and right sides, each extending from an outer edge of the rectangular area toward the corresponding outer edge of the captured video.
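The division into upper, lower, left, and right unmasked regions can be sketched as follows. This is a minimal illustration, not the implementation of the invention. It assumes that a frame is a nested list of pixels, that the mask rectangle is given as `(top, left, bottom, right)` with exclusive bottom and right bounds, and that the upper and lower regions span the full frame width while the left and right regions span the full frame height. The function name `split_around_mask` is hypothetical.

```python
from typing import Dict, List, Tuple

Frame = List[List[object]]        # rows of pixels (assumed representation)
Box = Tuple[int, int, int, int]   # (top, left, bottom, right), bottom/right exclusive


def split_around_mask(frame: Frame, box: Box) -> Dict[str, Frame]:
    """Return the unmasked regions above, below, left of, and right of `box`."""
    top, left, bottom, right = box
    regions = {
        "upper": [row[:] for row in frame[:top]],     # rows above the mask
        "lower": [row[:] for row in frame[bottom:]],  # rows below the mask
        "left": [row[:left] for row in frame],        # columns left of the mask
        "right": [row[right:] for row in frame],      # columns right of the mask
    }
    # Drop regions with no pixels, e.g. when the mask touches a frame edge.
    return {name: r for name, r in regions.items() if r and r[0]}


# Toy frame of 4 rows x 6 columns; each "pixel" is its own (row, column) index.
frame = [[(y, x) for x in range(6)] for y in range(4)]
parts = split_around_mask(frame, (1, 2, 3, 5))
```

Applying the same split to every frame of a video would yield up to four sub-videos, each of which could then be passed back to the first context recognition means.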

According to another embodiment of the learning video selection device of the present invention, it is also preferable that
the first context recognition means recursively inputs a captured video determined to be false to the captured video masking means, and
the captured video masking means narrows the image area to be masked under a predetermined condition.
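One way to picture this feedback loop is the sketch below. It is only an illustration under stated assumptions. The source does not specify the "predetermined condition", so a minimum mask size (`min_size`) and a fixed shrink step (`step`) are assumed here. `recognize` and `apply_mask` are hypothetical callables standing in for the first context recognition means and the captured video masking means.

```python
from typing import Callable, Optional, Tuple, TypeVar

Video = TypeVar("Video")
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def shrink_box(box: Box, step: int = 1) -> Box:
    """Narrow the mask rectangle by `step` pixels on every side."""
    top, left, bottom, right = box
    return (top + step, left + step, bottom - step, right - step)


def mask_with_narrowing(
    video: Video,
    box: Box,
    recognize: Callable[[Video], Optional[str]],
    apply_mask: Callable[[Video, Box], Video],
    min_size: int = 2,  # assumed condition: below this size the mask is abandoned
    step: int = 1,      # assumed shrink amount per retry
) -> Optional[Tuple[Video, Box]]:
    """Retry masking with a narrower box until the context is recognizable.

    Returns (masked_video, box) on success, or None if the mask cannot be
    narrowed any further without violating `min_size`.
    """
    while True:
        top, left, bottom, right = box
        if bottom - top < min_size or right - left < min_size:
            return None
        masked = apply_mask(video, box)
        if recognize(masked) is not None:
            return masked, box
        # Recognition failed: feed the video back with a narrower mask.
        box = shrink_box(box, step)
```

A real system would also need to confirm that the narrowed mask still protects privacy. That check is outside this sketch.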

According to another embodiment of the learning video selection device of the present invention, it is also preferable that the captured video masking means masks an image area of the captured video based on face detection.

According to another embodiment of the learning video selection device of the present invention, it is also preferable that the captured video masking means stores privacy images in advance and masks image areas of the captured video whose similarity to a stored privacy image meets or exceeds a predetermined condition.

According to the present invention, a learning video selection program causes a computer mounted on a device that receives captured videos and selects learning videos from which a first context can be recognized to function as:
a first context recognition means for determining whether the first context can be recognized in a captured video; and
a captured video masking means for masking a predetermined image area in a captured video determined to be true by the first context recognition means,
wherein the first context recognition means recursively receives the masked captured video, determines whether the first context can be recognized, and selects only the captured videos determined to be true as learning videos.

According to the present invention, a learning video selection program causes a computer mounted on a device that receives captured videos and selects learning videos from which a first context can be recognized to function as:
a first context recognition means for determining whether the first context can be recognized in a captured video;
a captured video masking means for masking a predetermined image area in a captured video determined to be true by the first context recognition means; and
a second context recognition means for determining whether a second context can be recognized in the masked captured video,
wherein the first context recognition means recursively receives the masked captured video determined to be true by the second context recognition means, determines whether the first context can be recognized, and selects only the captured videos determined to be true as learning videos.

According to the present invention, in a learning video selection method for a device that receives captured videos and selects learning videos from which a first context can be recognized, the device executes:
a first step of determining whether the first context can be recognized in a captured video;
a second step of masking a predetermined image area in a captured video determined to be true in the first step; and
a third step of recursively determining whether the first context can be recognized in the captured video masked in the second step, and selecting only the captured videos determined to be true as learning videos.

According to the present invention, in a learning video selection method for a device that receives captured videos and selects learning videos from which a first context can be recognized, the device executes:
a first step of determining whether the first context can be recognized in a captured video;
a second step of masking a predetermined image area in a captured video determined to be true in the first step;
a third step of determining whether a second context can be recognized in the masked captured video; and
a fourth step of recursively determining whether the first context can be recognized in the masked captured video determined to be true in the third step, and selecting only the captured videos determined to be true as learning videos.

According to the learning video selection device, program, and method of the present invention, it is possible to select whether a captured video in which a predetermined image area has been masked can be used as a learning video.

FIG. 1 is a system configuration diagram including a behavior analysis device.
FIG. 2 is a first functional configuration diagram of the learning video selection device according to the present invention.
FIG. 3 is an explanatory diagram showing the processing of each function of FIG. 1 on a captured video.
FIG. 4 is a second functional configuration diagram of the learning video selection device according to the present invention.
FIG. 5 is an explanatory diagram showing the contexts recognized by the second context recognition unit.
FIG. 6 is an explanatory diagram showing a captured video whose mask area has been narrowed by the captured video mask unit.
FIG. 7 is an explanatory diagram showing a plurality of captured videos divided around the mask area by the captured video mask unit.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 2 is a first functional configuration diagram of the learning video selection device according to the present invention.
FIG. 3 is an explanatory diagram showing the processing of each function of FIG. 1 on a captured video.

The behavior analysis device of FIG. 1 described above generates a learning model in advance from a large number of learning videos, as a prerequisite for estimating human behavior from captured video.
Here, the behavior analysis device of the present invention has a learning video selection function (device) that selects a large number of videos suitable as learning videos. The learning video selection function receives captured videos from the captured video storage unit 101 and outputs the selected learning videos to the learning video storage unit 102. As a result, the learning video storage unit 102 accumulates only the learning videos selected by the learning video selection function.
The existing behavior estimation engine then constructs a learning model from the learning videos accumulated in the learning video storage unit 102.
Note that the captured video storage unit 101 may hold a large number of captured videos showing human behavior that were acquired in advance from terminals via a communication network.

As shown in FIG. 2, the learning video selection function of the behavior analysis device 1 includes a first context recognition unit 11 and a captured video mask unit 12. These functional components are realized by executing a program that causes the computer mounted on the device to function. The processing flow of these functional components can also be understood as a learning video selection method of the device.

[First context recognition unit 11]
The first context recognition unit 11 determines whether the first context can be recognized in a captured video. A captured video determined to be recognizable (true) is output to the captured video mask unit 12, and a captured video determined to be unrecognizable (false) is discarded.
The first context recognition unit 11 is the same as the behavior estimation engine described above. Specifically, as described above, it may be ActivityNet or Two-stream ConvNets, which sequentially estimate a person's action target. To extract moving regions, optical flow may be used, which finds locations where the same feature point moves between frames and represents the movement of objects in the captured video as vectors.

FIG. 3(a) shows a captured video input to the first context recognition unit 11.
In FIG. 3(b), human behavior is estimated from the captured video of FIG. 3(a). Here, the specific behavior "folding the laundry" is estimated.

The first context recognition unit 11 also recursively receives the masked captured video from the captured video mask unit 12 described later, and determines whether the first context can be recognized. A masked captured video determined to be recognizable (true) is output to the learning video storage unit 102, and one determined to be unrecognizable (false) is discarded. In other words, if masking the predetermined image area makes the first context unrecognizable, that captured video is not used as a learning video.
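The two-pass check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention. `recognize` and `mask` are hypothetical callables standing in for the first context recognition unit 11 and the captured video mask unit 12, and a return value of `None` from `recognize` is assumed to mean "unrecognizable (false)". The toy recognizer and the cue-set representation of a video are likewise invented for the example.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

Video = TypeVar("Video")


def select_learning_videos(
    videos: Iterable[Video],
    recognize: Callable[[Video], Optional[str]],  # hypothetical stand-in for unit 11
    mask: Callable[[Video], Video],               # hypothetical stand-in for unit 12
) -> List[Video]:
    """Return the masked videos in which the first context is still recognizable."""
    selected: List[Video] = []
    for video in videos:
        # First pass: a video whose context cannot be recognized is discarded.
        if recognize(video) is None:
            continue
        masked = mask(video)
        # Second pass: the masked video is fed back into the same recognizer.
        if recognize(masked) is not None:
            selected.append(masked)
    return selected


# Toy stand-ins (assumptions): a "video" is modeled as a set of visual cues.
def toy_recognize(cues: frozenset) -> Optional[str]:
    if "towel" in cues:
        return "folding laundry"
    if "face" in cues:
        return "talking"
    return None


def toy_mask(cues: frozenset) -> frozenset:
    return cues - {"face"}


videos = [frozenset({"face", "towel"}), frozenset({"face"}), frozenset({"wall"})]
kept = select_learning_videos(videos, toy_recognize, toy_mask)
```

In the toy data, only the first video survives. The second is recognizable only through the face that is masked out, and the third is never recognizable.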

[Captured video mask unit 12]
The captured video mask unit 12 masks a predetermined image area in a captured video determined to be recognizable (true) by the first context recognition unit 11.

Here, the predetermined image area may be a face area. Face detection techniques include, for example, FaceNet from Google (registered trademark) (see, for example, Non-Patent Documents 3 and 4), the face detection function of digital cameras (see, for example, Non-Patent Document 5), and the face detection function of NEC (registered trademark) (see, for example, Non-Patent Document 6). For face detection, a generalized learning vector quantization method based on minimum classification error is used, and rectangular regions are searched in order from the edge of the captured video to extract the rectangular regions that match a face.

Of course, the image area to be masked by the present invention is not limited to areas found by face detection. It may be another privacy area, such as a car license plate or a house nameplate.
The captured video mask unit 12 stores the predetermined image areas to be masked in advance as local feature amounts of reference images, and extracts image areas similar to those reference images. Specifically, algorithms such as SIFT (Scale-Invariant Feature Transform) and SURF (Speeded Up Robust Features) can be used. The image areas extracted in this way are masked in the captured video.
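The idea of masking regions that resemble stored reference images can be sketched as below. This sketch deliberately swaps in a simpler technique: plain cosine similarity over precomputed feature vectors stands in for SIFT or SURF descriptor matching, which is not implemented here. The function names, the `(box, feature)` candidate format, and the threshold value of 0.9 are all assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def regions_to_mask(
    candidates: List[Tuple[Box, Sequence[float]]],  # candidate regions with features
    references: List[Sequence[float]],              # features of stored privacy images
    threshold: float = 0.9,                         # assumed "predetermined condition"
) -> List[Box]:
    """Return the boxes whose features are similar enough to any reference."""
    return [
        box
        for box, feature in candidates
        if any(cosine_similarity(feature, ref) >= threshold for ref in references)
    ]


references = [[1.0, 0.0]]
candidates = [((0, 0, 2, 2), [2.0, 0.0]), ((2, 2, 4, 4), [0.0, 1.0])]
boxes = regions_to_mask(candidates, references)
```

The returned boxes would then be filled in by the masking step.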

Masking means filling the area with a predetermined color (for example, black), an opaque pattern, or the like. This masking process protects privacy, for example by making it impossible to identify an individual.
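As a small illustration of this fill operation, the sketch below paints a rectangle of one frame with a solid color. It is a dependency-free sketch under assumptions: a frame is a nested list of RGB tuples, the box is `(top, left, bottom, right)` with exclusive bottom and right bounds, and the name `mask_region` is hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]     # (R, G, B)
Frame = List[List[Pixel]]        # rows of pixels (assumed representation)
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def mask_region(frame: Frame, box: Box, color: Pixel = (0, 0, 0)) -> Frame:
    """Return a copy of `frame` with every pixel inside `box` replaced by `color`."""
    top, left, bottom, right = box
    return [
        [
            color if top <= y < bottom and left <= x < right else pixel
            for x, pixel in enumerate(row)
        ]
        for y, row in enumerate(frame)
    ]


WHITE: Pixel = (255, 255, 255)
frame = [[WHITE] * 6 for _ in range(4)]       # 4 rows x 6 columns, all white
masked = mask_region(frame, (1, 2, 3, 5))     # black out rows 1-2, columns 2-4
```

Masking a whole video would apply the same function to each frame, using the box detected for that frame.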

FIG. 3(c) shows the face area detected by the captured video mask unit 12.
In FIG. 3(d), the detected face area is masked in the captured video.

The captured video mask unit 12 then feeds the masked captured video back to the first context recognition unit 11 (the behavior estimation engine).

As shown in FIG. 2 described above, the learning video selection function further includes a test video storage unit 103 that inputs test videos to the first context recognition unit 11, and an estimation result determination unit 14 that evaluates the estimation results for those test videos.
Here, captured videos that have been labeled in advance with an action target form a data set, which is divided into learning videos (training data) and test videos (test data). For example, 90% of the captured videos are selected as learning videos and the remaining 10% are assigned as test videos. The estimation result determination unit 14 compares the behavior estimation result for each test video with the action target assigned to that test video and determines whether the estimate is correct. By inputting a large number of test videos, the recognition accuracy of the first context recognition unit 11 based on the learning model can be calculated.
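The 90/10 split and the accuracy calculation can be sketched as follows. This is a minimal illustration, assuming that each item in the data set is a `(video, action_label)` pair and that `recognize` is a hypothetical callable returning the estimated action. The shuffling and the fixed seed are choices made for the example, not details from the source.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

Video = TypeVar("Video")
Labeled = Tuple[Video, str]  # (captured video, assigned action target)


def split_dataset(
    labeled: Sequence[Labeled], test_ratio: float = 0.1, seed: int = 0
) -> Tuple[List[Labeled], List[Labeled]]:
    """Shuffle the labeled videos and split them into (learning, test) sets."""
    items = list(labeled)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_ratio)
    return items[n_test:], items[:n_test]


def recognition_accuracy(
    recognize: Callable[[Video], str], test_set: Sequence[Labeled]
) -> float:
    """Fraction of test videos whose estimated action matches the assigned label."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    correct = sum(1 for video, label in test_set if recognize(video) == label)
    return correct / len(test_set)


# Toy data: the "video" is an integer and its label depends on parity.
data = [(i, "walk" if i % 2 else "talk") for i in range(20)]
learning, test = split_dataset(data)
accuracy = recognition_accuracy(lambda v: "walk" if v % 2 else "talk", test)
```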

FIG. 4 is a second functional configuration diagram of the learning video selection device according to the present invention.
FIG. 5 is an explanatory diagram showing the contexts recognized by the second context recognition unit.

In FIG. 4, compared with FIG. 2, the learning video selection device further includes a second context recognition unit 13.

[Second context recognition unit 13]
The second context recognition unit 13 determines whether a second context can be recognized in the masked captured video output from the captured video mask unit 12. The second context recognition unit 13 recognizes a context different from that of the first context recognition unit 11, for example as follows.
First context recognition unit 11 = sequentially estimates the "person's action target"
Second context recognition unit 13 = sequentially estimates the "person's joint regions" and/or sequentially estimates the "object"

<Estimation of human joint regions>
Specifically, the second context recognition unit 13 extracts feature points of a person's joints using a skeleton model such as OpenPose (registered trademark) (see, for example, Non-Patent Documents 7 to 9).
OpenPose is software that can detect body, hand, and face keypoints of multiple people in an image in real time, and it is publicly available on GitHub. For the whole body of a person appearing in a captured video, for example, 15 keypoints can be detected.
According to FIG. 5(a), a person's joint regions are estimated from the captured video.

<Estimation of object regions>
Specifically, the second context recognition unit 13 can estimate objects appearing in the captured video by using a neural network such as a CNN (Convolutional Neural Network).
According to FIG. 5(b), an object is estimated from the captured video. Specifically, a "towel" is recognized as the object.

The second context recognition unit 13 then feeds back the recognition result for the second context (recognizable / unrecognizable) for the masked captured video to the first context recognition unit 11.

According to FIG. 4, for each masked captured video that the second context recognition unit 13 has determined to be true (second context recognizable), the first context recognition unit 11 recursively determines whether or not the first context can be recognized. Only the captured videos determined to be true (first context recognizable) are output as learning videos to the learning video storage unit 102.
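The selection flow of FIG. 2 and FIG. 4 can be illustrated with the minimal sketch below. The function name `select_learning_videos` and the three callables are hypothetical stand-ins for the first context recognition unit 11, the captured video mask unit 12, and the second context recognition unit 13; the description does not prescribe this interface.

```python
def select_learning_videos(captured, recognize_first, mask, recognize_second=None):
    """Return the masked videos in which the first context is still recognizable.

    recognize_first  -- stand-in for unit 11, returns True/False for a video
    mask             -- stand-in for unit 12, returns a masked copy of a video
    recognize_second -- optional stand-in for unit 13 (FIG. 4 configuration)
    """
    selected = []
    for video in captured:
        # Only videos whose first context is recognizable are masked at all.
        if not recognize_first(video):
            continue
        masked = mask(video)
        # FIG. 4 only: the second context must survive masking.
        if recognize_second is not None and not recognize_second(masked):
            continue
        # Re-check the first context on the masked video.
        if recognize_first(masked):
            selected.append(masked)
    return selected
```

Omitting `recognize_second` corresponds to the configuration of FIG. 2, in which the masked video goes straight back to the first context recognition unit.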

<Recursive repetition of context recognition for masked captured videos>
FIG. 6 is an explanatory diagram showing a captured video whose mask area has been narrowed by the captured video mask unit.
In the embodiments of FIGS. 2 and 4 described above, when the first context recognition unit 11 determines that a masked captured video input from the captured video mask unit 12 is false (first context unrecognizable), it may recursively output that masked captured video back to the captured video mask unit 12.
In this case, the captured video mask unit 12 narrows the image area to be masked under a predetermined condition. Specifically, the predetermined condition is to narrow the rectangular range of the masked image area by a predetermined ratio. In other words, the area in which the context is to be recognized is enlarged. Even when the masked image area is narrowed, privacy must still be protected, for example by keeping individuals unidentifiable.
In the case of FIG. 2, the masked captured video whose mask area has been narrowed by the captured video mask unit 12 is recursively input to the first context recognition unit 11.
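The narrowing loop can be sketched as below. Everything here is an assumption made for illustration: the `Rect` type, shrinking about the rectangle's center, the `still_private` check, and the `max_rounds` limit are not specified in the description, which only states that the rectangle is narrowed by a predetermined ratio and that privacy must remain protected.

```python
from typing import Callable, NamedTuple, Optional


class Rect(NamedTuple):
    x: int
    y: int
    w: int
    h: int


def shrink(rect: Rect, ratio: float) -> Rect:
    """Narrow a mask rectangle by `ratio` per side length, keeping its center."""
    new_w = round(rect.w * (1 - ratio))
    new_h = round(rect.h * (1 - ratio))
    return Rect(rect.x + (rect.w - new_w) // 2,
                rect.y + (rect.h - new_h) // 2,
                new_w, new_h)


def narrow_until_recognized(rect: Rect,
                            recognizes: Callable[[Rect], bool],
                            still_private: Callable[[Rect], bool],
                            ratio: float = 0.1,
                            max_rounds: int = 5) -> Optional[Rect]:
    """Shrink the mask until the first context is recognized.

    Returns the mask rectangle that worked, or None when privacy would be
    lost or the round limit is reached without recognition.
    """
    current = rect
    for _ in range(max_rounds):
        if recognizes(current):
            return current
        candidate = shrink(current, ratio)
        if not still_private(candidate):
            return None  # a smaller mask would no longer protect privacy
        current = candidate
    return current if recognizes(current) else None
```

Returning `None` models the case in which the captured video is simply not selected as a learning video.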

<Masking method of another embodiment in the captured video mask unit 12>
FIG. 7 is an explanatory diagram showing a plurality of captured videos partitioned around the mask area by the captured video mask unit.
In the embodiment of FIG. 3(d) described above, a predetermined image area of the captured video is simply masked, for example by blacking it out.
By contrast, according to FIG. 7, the captured video mask unit 12 represents the image area to be masked as a rectangular area and outputs the unmasked captured videos partitioned into the upper, lower, left, and right sides, each extending from an outer edge of the rectangular area toward the corresponding outer edge of the captured video. Each of these captured images is input to the first context recognition unit 11 in the case of FIG. 2, and to the second context recognition unit 13 in the case of FIG. 4. As a result, only the masked captured videos whose context is recognized by the first context recognition unit 11 are stored in the learning video storage unit 102.
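One possible reading of the FIG. 7 partitioning is sketched below. The description does not state whether each side region spans the full frame or only the extent of the mask rectangle; this sketch assumes full-width (upper/lower) and full-height (left/right) strips, and the names `Rect` and `split_around_mask` are hypothetical.

```python
from typing import Dict, NamedTuple


class Rect(NamedTuple):
    x: int
    y: int
    w: int
    h: int


def split_around_mask(frame_w: int, frame_h: int, mask: Rect) -> Dict[str, Rect]:
    """Return the unmasked regions above, below, left of, and right of the mask.

    Each region runs from one edge of the mask rectangle to the matching
    edge of the frame. Empty regions (mask touching a frame edge) are dropped.
    """
    regions = {
        "upper": Rect(0, 0, frame_w, mask.y),
        "lower": Rect(0, mask.y + mask.h, frame_w, frame_h - (mask.y + mask.h)),
        "left": Rect(0, 0, mask.x, frame_h),
        "right": Rect(mask.x + mask.w, 0, frame_w - (mask.x + mask.w), frame_h),
    }
    return {name: r for name, r in regions.items() if r.w > 0 and r.h > 0}
```

Because none of the four regions overlaps the mask rectangle, each cropped video can be passed to a context recognition unit without exposing the masked area.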

As described in detail above, the learning video selection device, program, and method of the present invention make it possible to mask a predetermined image area in a captured video and to select whether or not that captured video can be used as a learning video.
In particular, according to the present invention, even if face regions and privacy regions are removed from a captured video, recognition of the context that was recognizable in the original captured video can be maintained. In particular, for captured images based on labeled action targets that are recognizable for deep learning, the context does not become unrecognizable, which enables re-learning to improve recognition accuracy.

Those skilled in the art can readily make various changes, modifications, and omissions to the embodiments of the present invention described above within the scope of the technical idea and viewpoint of the present invention. The foregoing description is merely an example and is not intended to be limiting in any way. The present invention is limited only as defined by the claims and their equivalents.

1 Behavior analysis device
101 Captured video storage unit
102 Learning video storage unit
103 Test video storage unit
11 First context recognition unit
12 Captured video mask unit
13 Second context recognition unit
14 Estimation result determination unit

Claims

A learning video selection device that receives captured videos and selects learning videos that enable recognition of a first context, the device comprising:
a first context recognition means for determining whether or not the first context can be recognized in a captured video; and
a captured video masking means for masking a predetermined image area in a captured video determined to be true by the first context recognition means,
wherein the first context recognition means recursively receives the masked captured video, determines whether or not the first context can be recognized, and selects only captured videos determined to be true as learning videos.

The learning video selection device according to claim 1, wherein the first context recognition means sequentially estimates a person's action target.

A learning video selection device that receives captured videos and selects learning videos that enable recognition of a first context, the device comprising:
a first context recognition means for determining whether or not the first context can be recognized in a captured video;
a captured video masking means for masking a predetermined image area in a captured video determined to be true by the first context recognition means; and
a second context recognition means for determining whether or not a second context can be recognized in the masked captured video,
wherein the first context recognition means recursively receives the masked captured video determined to be true by the second context recognition means, determines whether or not the first context can be recognized, and selects only captured videos determined to be true as learning videos.

The learning video selection device according to claim 3, wherein the first context recognition means sequentially estimates a person's action target, and
the second context recognition means sequentially estimates a person's joint regions and/or sequentially estimates an object.

The learning video selection device according to any one of claims 1 to 4, wherein the captured video masking means represents the image area to be masked as a rectangular area and outputs the unmasked captured videos partitioned into the upper, lower, left, and right sides, each extending from an outer edge of the rectangular area toward the corresponding outer edge of the captured video.

The learning video selection device according to any one of claims 1 to 5, wherein the first context recognition means recursively inputs a captured video determined to be false into the captured video masking means, and
the captured video masking means narrows the image area to be masked under a predetermined condition.

The learning video selection device according to any one of claims 1 to 6, wherein the captured video masking means masks an image region based on face detection from the captured video.

The learning video selection device according to any one of claims 1 to 7, wherein the captured video masking means stores privacy images in advance and masks, in the captured video, any image area whose similarity to a stored privacy image meets or exceeds a predetermined condition.

A learning video selection program for a computer mounted on a device that receives captured videos and selects learning videos that enable recognition of a first context, the program causing the computer to function as:
a first context recognition means for determining whether or not the first context can be recognized in a captured video; and
a captured video masking means for masking a predetermined image area in a captured video determined to be true by the first context recognition means,
wherein the first context recognition means recursively receives the masked captured video, determines whether or not the first context can be recognized, and selects only captured videos determined to be true as learning videos.

A learning video selection program for a computer mounted on a device that receives captured videos and selects learning videos that enable recognition of a first context, the program causing the computer to function as:
a first context recognition means for determining whether or not the first context can be recognized in a captured video;
a captured video masking means for masking a predetermined image area in a captured video determined to be true by the first context recognition means; and
a second context recognition means for determining whether or not a second context can be recognized in the masked captured video,
wherein the first context recognition means recursively receives the masked captured video determined to be true by the second context recognition means, determines whether or not the first context can be recognized, and selects only captured videos determined to be true as learning videos.

A learning video selection method for a device that receives captured videos and selects learning videos that enable recognition of a first context, wherein the device executes:
a first step of determining whether or not the first context can be recognized in a captured video; and
a second step of masking a predetermined image area in a captured video determined to be true in the first step,
and wherein the captured video masked in the second step is recursively subjected to the determination of whether or not the first context can be recognized, and only captured videos determined to be true are selected as learning videos.

A learning video selection method for a device that receives captured videos and selects learning videos that enable recognition of a first context, wherein the device executes:
a first step of determining whether or not the first context can be recognized in a captured video;
a second step of masking a predetermined image area in a captured video determined to be true in the first step;
a third step of determining whether or not a second context can be recognized in the masked captured video; and
a fourth step of recursively determining whether or not the first context can be recognized in the masked captured video determined to be true in the third step, and selecting only captured videos determined to be true as learning videos.