JP7240940B2 - Object image extraction device, method, and software program

Info

Publication number: JP7240940B2
Application number: JP2019078132A
Authority: JP
Inventors: 靖浩秋山; 英春服部
Original assignee: Hitachi Industry and Control Solutions Co Ltd
Current assignee: Hitachi Industry and Control Solutions Co Ltd
Priority date: 2019-04-16
Filing date: 2019-04-16
Publication date: 2023-03-16
Anticipated expiration: 2039-04-16
Also published as: JP2020177364A; WO2020213570A1

Description

本発明は、深層学習に用いることを目的とする人物画像を映像から抽出する技術に関する。 The present invention relates to a technique for extracting a human image from a video intended for use in deep learning.

コンピュータビジョンの分野で深層学習を用いて画像から物体を認識する手法が注目されている。 In the field of computer vision, a method of recognizing an object from an image using deep learning has attracted attention.

深層学習とは、人間が自然に行っている学習能力と同様の機能をコンピュータで実現しようとする手法のことである。深層学習には、従来のパタンマッチングなどの手法に比べて、特徴解析および特徴表現の柔軟性が高いことに加えて、検出の目標とする物体の特徴を人が定義しなくても良いこと等の優位性がある。 Deep learning is a technique that attempts to reproduce in a computer a learning ability similar to the one humans exercise naturally. Compared with conventional methods such as pattern matching, deep learning offers greater flexibility in feature analysis and feature representation, and it has the further advantage that a person need not define the features of the object to be detected.

一方、深層学習によって得られる識別モデルの識別精度は、学習時に使用する教師画像の量および品質から大きく影響を受ける。教師画像の数が少なければ、識別モデルは、学習で参照した教師画像に含まれる検出対象に酷似した物体のみにしか反応しないような検知率の低い識別モデルとなる傾向がある。教師画像の背景などに検出対象以外のノイズが多数映りこんでいた場合も、識別モデルの識別精度が低くなる傾向がある。 On the other hand, the accuracy of an identification model obtained by deep learning is strongly affected by the quantity and quality of the teacher images used for training. If there are few teacher images, the identification model tends to have a low detection rate, responding only to objects that closely resemble the detection targets in the teacher images it was trained on. If a large amount of noise other than the detection target appears in the background of the teacher images, the accuracy of the identification model also tends to be low.

このため、検出対象以外のノイズが少ない教師画像を効率良く自動抽出して効果的に識別モデルの深層学習に活用することを可能にする技術の開発が求められている。 For this reason, there is a demand for a technique that can efficiently and automatically extract teacher images containing little noise other than the detection target, so that they can be used effectively for deep learning of identification models.

特許文献１には、ユーザが任意に指定したオブジェクト分類（人物など）に基づき、保有画像データベースから、指定したオブジェクト分類に該当するオブジェクトが含まれる画像群を抽出し、保存する技術が開示されている。 Patent Document 1 discloses a technique that, based on an object classification (such as a person) arbitrarily specified by the user, extracts from a stored image database the group of images containing objects of the specified classification and saves them.

特開２００８－２９９６８１号公報 JP 2008-299681 A

特許文献１に開示された技術は、一般の映像群の中からユーザが任意に指定したオブジェクトを抽出し、その抽出結果をフレーム単位で返す画像抽出技術である。返されたフレームは、指定された種類のオブジェクト以外のノイズが背景として映りこんでいる可能性が高く、深層学習の教師データとして活用するには適さない可能性がある。 The technique disclosed in Patent Document 1 is an image extraction technique that extracts an object arbitrarily designated by a user from a general video group and returns the extraction result in units of frames. The returned frames are likely to contain background noise other than the specified type of object, and may not be suitable for use as training data for deep learning.

本開示のひとつの目的は、深層学習に好適な対象物の画像を抽出することを可能にする技術を提供することである。 One object of the present disclosure is to provide a technique that enables extraction of an image of an object suitable for deep learning.

ひとつの態様に係る対象物画像抽出装置は、映像に含まれる画像のフレームから対象物の画像を抽出する対象物画像抽出装置であって、前記映像の一部のフレームをキーフレームとし、前記キーフレームにおける対象物が表示された部分を含む矩形で指定された対象物領域の画像を取得するキーフレーム対象物指定部と、前記映像において前記キーフレームで指定された対象物を追跡し、前記映像における前記キーフレームでないフレームである中間フレームの前記対象物が表示された部分を含む矩形の対象物領域の画像を抽出する中間フレーム画像抽出部と、を有する。 An object image extraction device according to one aspect extracts an image of an object from the frames of images contained in a video, and includes: a key frame object designation unit that treats some frames of the video as key frames and acquires an image of an object region designated by a rectangle containing the portion of the key frame in which the object is displayed; and an intermediate frame image extraction unit that tracks, through the video, the object designated in the key frame and extracts an image of a rectangular object region containing the portion in which the object is displayed in each intermediate frame, an intermediate frame being a frame of the video that is not a key frame.

本開示によれば、深層学習に好適な対象物の画像を抽出できる。 According to the present disclosure, an image of an object suitable for deep learning can be extracted.

本実施形態に係る教師画像抽出装置の構成例を示す図である。 A diagram showing a configuration example of the teacher image extraction device according to this embodiment.
教師画像抽出装置の動作例を示す図である。 A diagram showing an operation example of the teacher image extraction device.
キーフレームと中間フレームとの関係を説明するための図である。 A diagram for explaining the relationship between key frames and intermediate frames.
キーフレーム及び中間フレームから人物画像を抽出する例を示す図である。 A diagram showing an example of extracting a person image from key frames and intermediate frames.
キーフレーム及び中間フレームから複数の人物画像を抽出する第１の例を示す図である。 A diagram showing a first example of extracting a plurality of person images from key frames and intermediate frames.
入力映像から抽出した人物画像と抽出画像セットとの関係を説明するための図である。 A diagram for explaining the relationship between person images extracted from an input video and an extracted image set.
キーフレーム及び中間フレームから複数の人物画像を抽出する第２の例を示す図である。 A diagram showing a second example of extracting a plurality of person images from key frames and intermediate frames.
中間フレーム画像抽出部及び人物領域特定部の詳細を示すブロック図である。 A block diagram showing details of the intermediate frame image extraction unit and the person region identification unit.
歩行者処理部に含まれる領域ベクトルグルーピング部の動作例を説明するための図である。 A diagram for explaining an operation example of the region vector grouping unit included in the pedestrian processing unit.
歩行者に対する人物領域の設定において見切れが生じる例を説明するための図である。 A diagram for explaining an example in which part of a pedestrian is cut off when the person region is set.
左足の先端位置を基準に人物領域を補正する例を示す図である。 A diagram showing an example of correcting the person region based on the tip position of the left foot.
右足の先端位置を基準に人物領域を補正する例を示す図である。 A diagram showing an example of correcting the person region based on the tip position of the right foot.
両足の先端位置を基準に人物領域を補正する例を示す図である。 A diagram showing an example of correcting the person region based on the tip positions of both feet.
静止者処理部に含まれる微動ベクトルグルーピング部の動作例を示す図である。 A diagram showing an operation example of the fine motion vector grouping unit included in the stationary person processing unit.
採用画像判定部の動作例を説明するための図である。 A diagram for explaining an operation example of the adopted image determination unit.
採用画像判定部が参照する学習エラー率曲線の例を示す図である。 A diagram showing an example of the learning error rate curve referred to by the adopted image determination unit.
教師画像抽出装置を含む教師画像抽出システムの第１例を示す図である。 A diagram showing a first example of a teacher image extraction system including the teacher image extraction device.
教師画像抽出装置を含む教師画像抽出システムの第２例を示す図である。 A diagram showing a second example of a teacher image extraction system including the teacher image extraction device.

以下、図面を参照して実施形態を説明する。 Embodiments will be described below with reference to the drawings.

図１は、本実施形態に係る教師画像抽出装置１０の構成例を示す図である。なお、教師画像抽出装置１０は、対象物の画像を抽出する対象物画像抽出装置の一例である。 FIG. 1 is a diagram showing a configuration example of a teacher image extraction device 10 according to this embodiment. Note that the teacher image extracting device 10 is an example of a target object image extracting device that extracts an image of a target object.

教師画像抽出装置１０は、キーフレーム人物指定部１０１、中間フレーム画像抽出部１０２、人物領域特定部１０３、採用画像判定部１０４、及び、教師画像保存部１０５を備える。 The teacher image extraction device 10 includes a key frame person designation unit 101, an intermediate frame image extraction unit 102, a person region identification unit 103, an adopted image determination unit 104, and a teacher image storage unit 105.

キーフレーム人物指定部１０１は、入力された映像（動画）１００を構成する複数のフレームのうちのキーフレームにおける、抽出対象の人物の指定を受け付ける。キーフレームは、複数のフレームのうち、所定間隔毎に位置するフレームである。例えば、ユーザは、キーフレームに対して、抽出対象の人物が含まれるように、手動で矩形の領域を指定する。 The key frame person designation unit 101 accepts the designation of a person to be extracted in a key frame, among the frames that make up the input video (moving image) 100. Key frames are the frames located at predetermined intervals among those frames. For example, the user manually designates, on a key frame, a rectangular region that contains the person to be extracted.

キーフレームの間隔は、任意に設定されてよい。キーフレームの間隔は、５秒または１０秒など、一定の間隔に設定されてよい。例えば、３０ｆｐｓの映像に対してキーフレームの間隔を５秒に設定した場合、５秒間のフレーム数は１５０枚（＝３０フレーム×５秒）である。そのうち、先頭の１枚をキーフレームと呼び、当該キーフレームに後続する１４９枚を中間フレームと呼ぶ。以下、キーフレームとそれに後続する中間フレームのセットを、キーフレームセットと呼んでもよい。 The key frame interval may be set arbitrarily. It may be set to a fixed interval such as 5 seconds or 10 seconds. For example, if the key frame interval is set to 5 seconds for a 30 fps video, each 5-second span contains 150 frames (= 30 frames per second × 5 seconds). The first of these frames is called a key frame, and the 149 frames that follow it are called intermediate frames. Hereinafter, a key frame together with the intermediate frames that follow it may be called a key frame set.
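
The frame arithmetic above can be sketched in a few lines of Python. This is only an illustration under assumed names: `split_into_keyframe_sets` and its parameters do not appear in the publication, and a fixed key frame interval is assumed.

```python
def split_into_keyframe_sets(total_frames, fps, interval_seconds):
    """Group frame indices into key frame sets.

    Each set holds one key frame (the first frame of the span) and the
    intermediate frames that follow it. Hypothetical helper; a fixed
    key frame interval is assumed.
    """
    frames_per_set = int(fps * interval_seconds)
    if frames_per_set < 1:
        raise ValueError("the interval must cover at least one frame")
    sets = []
    for start in range(0, total_frames, frames_per_set):
        end = min(start + frames_per_set, total_frames)
        sets.append({"key": start, "intermediate": list(range(start + 1, end))})
    return sets


# 10 seconds of 30 fps video with a 5-second key frame interval
sets = split_into_keyframe_sets(total_frames=300, fps=30, interval_seconds=5)
print(len(sets), sets[0]["key"], len(sets[0]["intermediate"]))
```

With these inputs each set contains 150 frames: one key frame and 149 intermediate frames, matching the example in the text.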

キーフレームの間隔は、一定の間隔でなくてもよく、例えば、異なるキーフレームの間隔を組み合わせてもよい。また、映像全体における先頭のフレームのみをキーフレームとし、後続する残りのフレームを中間フレームとしてもよい。 The keyframe intervals may not be constant intervals, and for example, different keyframe intervals may be combined. Alternatively, only the first frame in the entire video may be set as a key frame, and the following remaining frames may be set as intermediate frames.

中間フレーム画像抽出部１０２は、キーフレームに後続する各中間フレームにおいて、当該キーフレームに対して指定された人物と同一人物を追跡する。そして、中間フレーム画像抽出部１０２は、各中間フレームから、当該同一人物を含む領域を特定し、その特定した領域の画像を抽出する。以下、人物を含む領域を「人物領域」と呼び、人物領域を抽出した（切り出した）画像を「人物画像」という。 The intermediate frame image extraction unit 102 tracks the same person as the person specified for the key frame in each intermediate frame following the key frame. Then, the intermediate frame image extracting unit 102 specifies an area including the same person from each intermediate frame, and extracts an image of the specified area. Hereinafter, an area including a person will be referred to as a "person area", and an image obtained by extracting (cutting out) the person area will be referred to as a "person image".

人物領域特定部１０３は、中間フレーム画像抽出部１０２と連携し、中間フレームにおける人物領域を特定する。例えば、人物領域特定部１０３は、抽出対象の人物の身体全体が含まれるように、人物領域を特定する。別言すると、人物領域特定部１０３は、抽出対象の人物の身体の一部がはみ出ないように、人物領域を特定する。 The person region identification unit 103 works with the intermediate frame image extraction unit 102 to identify the person region in each intermediate frame. For example, the person region identification unit 103 identifies the person region so that it contains the entire body of the person to be extracted. In other words, it identifies the person region so that no part of that person's body extends outside the region.

採用画像判定部１０４は、中間フレーム画像抽出部１０２によって抽出された人物画像を、人物識別モデルの深層学習用の教師画像として採用するか否かを判定する。例えば、採用画像判定部１０４は、抽出された人物画像を、仮に人物識別モデルの深層学習の教師画像として用いた場合に当該人物識別モデルの精度向上が見込めるか否かについて、学習エラー率（テストエラー率）に基づいて判定する。そして、採用画像判定部１０４は、精度向上が見込めると判定した人物画像を、人物識別モデルの学習用の教師画像として採用する。 The adopted image determination unit 104 determines whether to adopt a person image extracted by the intermediate frame image extraction unit 102 as a teacher image for deep learning of a person identification model. For example, the adopted image determination unit 104 determines, based on the learning error rate (test error rate), whether the accuracy of the person identification model could be expected to improve if the extracted person image were used as a teacher image for its deep learning. The adopted image determination unit 104 then adopts the person images for which an accuracy improvement is expected as teacher images for training the person identification model.

教師画像保存部１０５は、採用画像判定部１０４において採用された人物画像を、人物識別モデルの学習のための教師画像１０６として保存する。教師画像保存部１０５は、中間フレームから抽出された人物画像に限らず、キーフレームに対して指定された領域から抽出された人物画像も教師画像として保存してよい。 The teacher image storage unit 105 stores the person images adopted by the adopted image determination unit 104 as teacher images 106 for training the person identification model. The teacher image storage unit 105 may store as teacher images not only person images extracted from intermediate frames but also person images extracted from the regions designated on key frames.

図２は、教師画像抽出装置１０の動作例を示す図である。 FIG. 2 is a diagram showing an operation example of the teacher image extraction device 10.

ユーザは、教師画像抽出装置１０に対して、人物画像の抽出に用いる映像（動画）１００を入力する。入力される映像１００は、ＨＤＤ（ＨａｒｄＤｉｓｋＤｒｉｖｅ）などの記録媒体に格納されたファイルであってよい。又は、入力される映像１００は、カメラで撮影中の映像、或いは、ネットワークを経由してストリーミング受信した映像であってもよい。又は、入力される映像１００は、１つの動画を構成する全てのフレームを展開した、複数の連続した画像ファイルの集合であってもよい。 A user inputs a video (moving image) 100 used for extracting a person image to the teacher image extraction device 10 . The input image 100 may be a file stored in a recording medium such as an HDD (Hard Disk Drive). Alternatively, the input image 100 may be an image being captured by a camera or an image received by streaming via a network. Alternatively, the input video 100 may be a collection of a plurality of continuous image files in which all frames forming one moving image are expanded.

ユーザは、キーフレーム人物指定部１０１を通じて、入力された映像１００におけるキーフレームに対して抽出対象の人物を指定する。 A user designates a person to be extracted for a keyframe in the input video 100 through the keyframe person designation unit 101 .

中間フレーム画像抽出部１０２は、人物領域特定部１０３と連携して、キーフレーム人物指定部１０１によって指定された人物と同一人物の画像（人物画像）を抽出し、抽出画像セット２０２として出力する。 The intermediate frame image extraction unit 102 extracts images (person images) of the same person as the person specified by the key frame person specification unit 101 in cooperation with the person area specification unit 103 , and outputs them as an extraction image set 202 .

採用画像判定部１０４は、学習エラー率に基づいて、抽出画像セット２０２を、人物識別モデルの学習用の教師画像として採用するか否かを判定する。採用画像判定部１０４は、教師画像として採用すると判定した抽出画像セットを、教師画像セット２０３として出力する。 Based on the learning error rate, the adopted image determination unit 104 determines whether or not to adopt the extracted image set 202 as teacher images for learning the person identification model. The adopted image determination unit 104 outputs the extracted image set determined to be adopted as the teacher image as the teacher image set 203 .

教師画像セット２０３は、人物識別モデルの学習用の教師画像として用いられる。なお、人物識別モデルの学習は、人物識別モデルを新たに生成するための学習と、生成済みの人物識別モデルの精度を向上させるための再学習と、の何れであってもよい。 The teacher image set 203 is used as teacher images for learning a person identification model. The learning of the person identification model may be either learning for generating a new person identification model or re-learning for improving the accuracy of the generated person identification model.

図３は、キーフレームと中間フレームとの関係を説明するための図である。なお、図３は、キーフレーム間隔３１２が６フレームの場合の例である。 FIG. 3 is a diagram for explaining the relationship between key frames and intermediate frames. Note that FIG. 3 is an example in which the keyframe interval 312 is 6 frames.

ユーザは、マウス等を操作して、キーフレーム３００内の人物３０８を囲む人物領域３１０を指定する。 The user operates a mouse or the like to designate a person area 310 surrounding the person 308 in the keyframe 300 .

このように、ユーザが、キーフレーム３００に対して人物３０８を囲む人物領域３１０を指定することにより、キーフレームに後続する中間フレームにおける、同一人物の追跡精度及び画像抽出精度が向上する。すなわち、教師画像抽出装置１０は、ユーザからのキーフレームに対する人物領域の指定を受け付けるキーフレーム人物指定部１０１と、各中間フレームから自動的に人物画像を抽出する中間フレーム画像抽出部１０２との連携により、入力された映像１００から、高品質な人物の教師画像を大量に取得できる。 In this way, the user designates the person region 310 surrounding the person 308 for the keyframe 300, thereby improving the tracking accuracy and image extraction accuracy of the same person in intermediate frames following the keyframe. That is, the teacher image extracting apparatus 10 includes a key frame person specifying unit 101 that accepts specification of a person region for a key frame from a user, and an intermediate frame image extracting unit 102 that automatically extracts a person image from each intermediate frame. Therefore, a large number of high-quality human teacher images can be obtained from the input video 100 .

なお、キーフレーム間隔が広い場合には、キーフレームに存在しない人物が途中の中間フレームから新たに出現する場合がある。このように、途中の中間フレームから新たに出現する人物は、中間フレーム画像抽出部１０２に含まれる動き推定人物検出部６０１（図８参照）によって検出されてよい。 Note that when the keyframe interval is wide, a person who does not exist in the keyframe may newly appear from an intermediate frame in the middle. In this way, a person newly appearing in an intermediate frame may be detected by the motion estimation person detection unit 601 (see FIG. 8) included in the intermediate frame image extraction unit 102 .

また、キーフレームに対する人物領域の指定は、上述した手動の場合に限られない。例えば、キーフレーム内の人物を、動き推定人物検出部６０１と同様の処理によって自動的に検出してもよい。なお、動き推定人物検出部６０１の詳細については後述する。 The designation of a person region on a key frame is not limited to the manual operation described above. For example, a person in a key frame may be detected automatically by processing similar to that of the motion estimation person detection unit 601. Details of the motion estimation person detection unit 601 are described later.

図４は、キーフレーム及び中間フレームから人物画像を抽出する例を示す図である。 FIG. 4 is a diagram showing an example of extracting a person image from key frames and intermediate frames.

ユーザは、キーフレーム３１３に対して、人物３１５を囲む人物領域３１６を指定する。この場合、中間フレーム画像抽出部１０２は、キーフレームに対して指定された人物領域３１６を基点に、後続する中間フレーム３１４から、人物３１５と同一人物３１７を自動的に追跡し、同一人物３１７を囲む人物領域３１８を特定する。そして、中間フレーム画像抽出部１０２は、特定した人物領域３１８から人物画像を抽出する。 The user designates, on key frame 313, a person region 316 surrounding person 315. In this case, the intermediate frame image extraction unit 102 starts from the person region 316 designated on the key frame, automatically tracks person 317 (the same person as person 315) in the subsequent intermediate frame 314, and identifies a person region 318 surrounding person 317. The intermediate frame image extraction unit 102 then extracts a person image from the identified person region 318.

図５は、キーフレーム及び中間フレームから複数の人物画像を抽出する第１の例を示す図である。 FIG. 5 is a diagram showing a first example of extracting a plurality of person images from key frames and intermediate frames.

ユーザは、キーフレーム３１９に対して、各人物３２１～３２３を指定する。中間フレーム画像抽出部１０２は、キーフレームに対して指定された各人物３２１～３２３を基点に、後続の中間フレーム３２０の各同一人物を自動的に追跡し、各同一人物を囲む人物領域３２４～３２６を特定する。そして、中間フレーム画像抽出部１０２は、特定した各人物領域３２４～３２６から人物画像を抽出する。 The user designates persons 321 to 323 on key frame 319. Starting from each of the persons 321 to 323 designated on the key frame, the intermediate frame image extraction unit 102 automatically tracks the same persons in the subsequent intermediate frame 320 and identifies person regions 324 to 326 surrounding them. The intermediate frame image extraction unit 102 then extracts a person image from each of the identified person regions 324 to 326.

図６は、入力映像から抽出した人物画像と抽出画像セットとの関係を説明するための図である。 FIG. 6 is a diagram for explaining the relationship between a person image extracted from an input video and an extracted image set.

図６に示すように、中間フレーム画像抽出部１０２は、中間フレーム４０１～４０５、４０７から複数の人物画像４０９～４１３、４１５を自動的に抽出し、記録媒体に、抽出画像セット２０２として保存する。 As shown in FIG. 6, the intermediate frame image extraction unit 102 automatically extracts a plurality of person images 409 to 413 and 415 from the intermediate frames 401 to 405 and 407, and saves them on a recording medium as the extracted image set 202.

また、中間フレーム画像抽出部１０２は、キーフレーム４００、４０６に対して指定された人物画像４０８、４１４も、中間フレーム４０１～４０５、４０７から抽出した人物画像４０９～４１３、４１５と共に保存する。 The intermediate frame image extraction unit 102 also stores the human images 408 and 414 specified for the key frames 400 and 406 together with the human images 409 to 413 and 415 extracted from the intermediate frames 401 to 405 and 407, respectively.

図７は、キーフレーム及び中間フレームから複数の人物画像を抽出する第２の例を示す図である。 FIG. 7 is a diagram showing a second example of extracting a plurality of person images from key frames and intermediate frames.

図５を参照して説明した第１の例は、キーフレーム及び中間フレームから、フレーム毎に、複数の人物を抽出する例であった。これに対して、図７を参照して説明する第２の例は、キーフレームに対して指定された人物毎に、後続する中間フレームから同一人物を追跡及び抽出する例である。これにより、同一人物をより高い精度で抽出でき得る。 The first example described with reference to FIG. 5 was an example of extracting a plurality of persons from key frames and intermediate frames for each frame. In contrast, a second example, described with reference to FIG. 7, is an example of tracking and extracting the same person from subsequent intermediate frames for each person specified for a keyframe. As a result, the same person can be extracted with higher accuracy.

図７に示すように、キーフレーム５００内の３人の人物５０３、５０６、５０９の各々を基点に、後続の中間フレーム５０１、５０２から、同一人物の人物画像を追跡及び抽出する。 As shown in FIG. 7, starting from each of the three persons 503, 506, 509 in the keyframe 500, from subsequent intermediate frames 501, 502, the person images of the same person are tracked and extracted.

例えば、キーフレーム５００の人物５０３を基準に同一人物の追跡を行い、１番目の中間フレーム５０１から同一人物５０４の人物画像を抽出し、２番目の中間フレーム５０２から同一人物５０５の人物画像を抽出する。 For example, the same person is tracked based on the person 503 in the key frame 500, the person image of the same person 504 is extracted from the first intermediate frame 501, and the person image of the same person 505 is extracted from the second intermediate frame 502. do.

次に、キーフレーム５００の人物画像５０６を基準に同一人物の追跡を行い、１番目の中間フレーム５０１から同一人物５０７の人物画像を抽出し、２番目の中間フレーム５０２から同一人物５０８の人物画像を抽出する。 Next, the same person is tracked based on the person image 506 of the key frame 500, the person image of the same person 507 is extracted from the first intermediate frame 501, and the person image of the same person 508 is extracted from the second intermediate frame 502. to extract

次に、キーフレーム５００の人物画像５０９を基準に同一人物の追跡を行い、１番目の中間フレーム５０１から同一人物５１０の人物画像を抽出し、２番目の中間フレーム５０２から同一人物５１１の人物画像を抽出する。 Next, the same person is tracked based on the person image 509 of the key frame 500, the person image of the same person 510 is extracted from the first intermediate frame 501, and the person image of the same person 511 is extracted from the second intermediate frame 502. to extract

図８は、中間フレーム画像抽出部１０２及び人物領域特定部１０３の詳細を示すブロック図である。 FIG. 8 is a block diagram showing the details of the intermediate frame image extraction unit 102 and the person region identification unit 103.

中間フレーム画像抽出部１０２は、動き推定人物検出部６０１、ベクトル安定化フィルタ６０２、領域補正部６０９、フレーム間領域差判定部６１０、及び、人物画像切り出し部６１１を含む。 The intermediate frame image extraction unit 102 includes a motion estimation person detection unit 601 , a vector stabilization filter 602 , an area correction unit 609 , an inter-frame area difference determination unit 610 and a person image extraction unit 611 .

人物領域特定部１０３は、領域ベクトルグルーピング部６０３、全身マップ生成部６０４、前後フレーム検証部６０５、微動ベクトルグルーピング部６０６、微動エッジ抽出部６０７、及び、時系列エッジ強度検証部６０８を含む。 Human region identification unit 103 includes region vector grouping unit 603 , whole body map generation unit 604 , front/back frame verification unit 605 , slight movement vector grouping unit 606 , slight movement edge extraction unit 607 , and time series edge strength verification unit 608 .

動き推定人物検出部６０１は、入力映像１００に対して、オプティカルフロー（ＯｐｔｉｃａｌＦｌｏｗ）に基づく動きベクトル演算を行い、人物の動きベクトルを検出する。動きベクトルは、フレームを所定単位で区切った各ブロックのフレーム間でのシフトの移動方向と移動量を示すベクトルである。ブロックの大きさ（ブロック間の距離）は、システムの運用に合わせて適切に設定される。また、オプティカルフローは、２つの画像間でエッジなどの複数の特徴点がどう動いたのかを計算して、対象物体の動きを推定したり、対象物体を認識したりする画像処理技術の１つである。 The motion estimation person detection unit 601 performs motion vector computation based on optical flow on the input video 100 and detects the motion vectors of a person. A motion vector indicates the direction and amount by which each block, obtained by dividing a frame into predetermined units, shifts between frames. The block size (the distance between blocks) is set appropriately for the operation of the system. Optical flow is an image processing technique that computes how multiple feature points, such as edges, move between two images in order to estimate the motion of a target object or to recognize it.

ベクトル安定化フィルタ６０２は、動き推定人物検出部６０１によって検出された人物の動きベクトルについてフレーム間のバラツキを抑制するためにカルマンフィルタ（Ｋａｌｍａｎｆｉｌｔｅｒ）を用い、動きベクトルを安定化させる。カルマンフィルタは、誤差を含む複数個の観測データを用いて、未来の状態を予測する状態推定手法の１つである。カルマンフィルタは、予測誤差を一定範囲に収束させる性質を有するため、ベクトル安定化フィルタ６０２は、この性質を利用して、動きベクトルの出力を安定化させることができる。 A vector stabilization filter 602 stabilizes the motion vector of the person detected by the motion estimation person detection unit 601 by using a Kalman filter to suppress variation between frames. The Kalman filter is one of state estimation techniques for predicting future states using a plurality of observation data containing errors. Since the Kalman filter has the property of converging the prediction error within a certain range, the vector stabilization filter 602 can utilize this property to stabilize the motion vector output.

ベクトル安定化フィルタ６０２は、安定化後のベクトルを、予測誤差が比較的大きい（例えば所定の閾値以上）場合、歩行者処理部６１３へ出力し、予測誤差が比較的小さい（例えば所定の閾値未満）場合、静止者処理部６１４へ出力する。 The vector stabilization filter 602 outputs the stabilized vector to the pedestrian processing unit 613 when the prediction error is relatively large (for example, a predetermined threshold or more), and the prediction error is relatively small (for example, less than a predetermined threshold). ), it outputs to the stationary person processing unit 614 .
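
The stabilization and routing steps described above can be sketched as follows. This is a minimal sketch, not the filter design of the publication: it assumes a one-dimensional Kalman filter with a constant-state model applied to one component of a block's motion vector, treats the innovation (observation minus prediction) as the "prediction error", and uses illustrative noise variances and threshold. The names `ScalarKalman` and `route` are hypothetical.

```python
class ScalarKalman:
    """Minimal 1-D Kalman filter with a constant-state model (an assumption)."""

    def __init__(self, process_var=0.01, measurement_var=1.0):
        self.q = process_var       # assumed process noise variance
        self.r = measurement_var   # assumed measurement noise variance
        self.x = None              # current state estimate
        self.p = 1.0               # variance of the current estimate

    def update(self, z):
        """Fold in observation z and return (filtered value, |prediction error|)."""
        if self.x is None:         # the first observation initialises the state
            self.x = z
            return self.x, 0.0
        p_pred = self.p + self.q   # predict: state unchanged, variance grows
        error = z - self.x         # prediction error (innovation)
        gain = p_pred / (p_pred + self.r)
        self.x += gain * error     # correct the estimate toward the observation
        self.p = (1.0 - gain) * p_pred
        return self.x, abs(error)


def route(prediction_error, threshold):
    """Large error goes to pedestrian processing, small error to stationary processing."""
    return "pedestrian" if prediction_error >= threshold else "stationary"


# Noisy x-components of one block's motion vector over seven frames
observed_dx = [2.0, 2.6, 1.4, 2.5, 1.5, 2.4, 1.6]
kf = ScalarKalman()
results = [kf.update(z) for z in observed_dx]
filtered = [value for value, _ in results]
routes = [route(err, threshold=0.5) for _, err in results]
print(filtered)
print(routes)
```

In practice one such filter would be kept per vector component and per block; the sketch only shows that the filtered sequence varies less than the raw observations and how the threshold splits the two processing paths.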

歩行者処理部６１３は、領域ベクトルグルーピング部６０３、及び、全身マップ生成部６０４を有する。 The pedestrian processing unit 613 has an area vector grouping unit 603 and a whole body map generation unit 604 .

領域ベクトルグルーピング部６０３は、ベクトル安定化フィルタ６０２によって安定化された複数の動きベクトルのうち、近接する同等の傾向の動きベクトルを束ね（グルーピングし）、ベクトルグループを生成する。 A region vector grouping unit 603 bundles (groups) adjacent motion vectors having the same tendency among the plurality of motion vectors stabilized by the vector stabilization filter 602 to generate a vector group.

全身マップ生成部６０４は、領域ベクトルグルーピング部６０３によって生成されたベクトルグループから、人物の全身を示すマップ（以下「全身マップ」という）を生成する。 A whole body map generation unit 604 generates a map showing the whole body of a person (hereinafter referred to as a “whole body map”) from the vector group generated by the region vector grouping unit 603 .

静止者処理部６１４は、微動ベクトルグルーピング部６０６、微動エッジ抽出部６０７、及び、時系列エッジ強度検証部６０８を有する。 The stationary person processing section 614 has a slight movement vector grouping section 606 , a slight movement edge extraction section 607 and a time series edge strength verification section 608 .

微動ベクトルグルーピング部６０６は、ベクトル安定化フィルタ６０２によって安定化された複数の動きベクトルのうち、近接する同じ傾向の動き量の小さい微動ベクトルを束ねた領域をマーカとする。そして、微動ベクトルグルーピング部６０６は、隣接する複数フレーム間で領域が重なるマーカ群を束ね、ベクトルグループを生成する。 Among the motion vectors stabilized by the vector stabilization filter 602, the fine motion vector grouping unit 606 bundles adjacent fine motion vectors that show the same tendency and a small amount of motion, and treats the resulting region as a marker. It then bundles the group of markers whose regions overlap across multiple adjacent frames to generate a vector group.

微動エッジ抽出部６０７は、時間方向の連続するフレームの各々からエッジ画像を抽出し、その抽出した複数のエッジ画像を平均化し、１つの平均エッジ画像を得る。 A fine movement edge extraction unit 607 extracts an edge image from each of successive frames in the time direction, averages the extracted edge images, and obtains one average edge image.

時系列エッジ強度検証部６０８は、微動エッジ抽出部６０７によって得られた平均エッジ画像について、ベクトルグループ枠内に、所定基準以上のエッジ成分（強度）が存在するか否かを判定する。これにより、静止人物の有無が判定される。 A time-series edge strength verification unit 608 determines whether or not the average edge image obtained by the fine movement edge extraction unit 607 has an edge component (strength) equal to or greater than a predetermined reference within the vector group frame. Thereby, the presence or absence of a stationary person is determined.

前後フレーム検証部６０５は、上記の歩行者処理部６１３から出力される全身マップ、又は、上記の静止者処理部６１４から出力されるエッジ画像について、所定範囲の前後フレームによる平準化を行い、注目フレームにおける人物領域を決定する。 The front/rear frame verification unit 605 smoothes the whole-body map output from the pedestrian processing unit 613 or the edge image output from the stationary person processing unit 614 using the front/rear frames in a predetermined range. Determine the person region in the frame.

領域補正部６０９は、歩行者処理６１３が行われた場合、人物の足先端位置を検出し、全身マップの領域を補正する。全身のうちの足先端の動きベクトル量は、相対的に大きく観測されるため、この補正により、一部が欠けた人物画像が抽出されることを抑制できる。 When the pedestrian processing 613 is performed, the area correction unit 609 detects the tip positions of the person's feet and corrects the area of the whole body map. Since the amount of motion vector at the tip of the foot in the whole body is observed to be relatively large, this correction can suppress the extraction of a partially missing person image.

フレーム間領域差判定部６１０は、フレーム毎に検出した人物の重心から、フレーム間の平均重心移動量を決定する。そして、フレーム間領域差判定部６１０は、この平均重心移動量と、注目フレームの人物重心移動量との差分を算出し、その差分が所定の閾値未満の場合、注目フレームの人物領域を選択する。一方、フレーム間領域差判定部６１０は、その差分が所定の閾値以上の場合、注目フレームの人物領域を選択しなくてよい。 The inter-frame region difference determination unit 610 determines an average center-of-gravity movement amount between frames from the center of gravity of the person detected for each frame. Then, the inter-frame region difference determining unit 610 calculates the difference between this average center-of-gravity movement amount and the human center-of-gravity movement amount in the frame of interest, and if the difference is less than a predetermined threshold, selects the human region of the frame of interest. . On the other hand, if the difference is equal to or greater than the predetermined threshold value, the inter-frame area difference determining section 610 does not have to select the person area of the frame of interest.
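
The selection rule of the inter-frame region difference determination unit 610 can be sketched as follows. The function name and data layout are assumptions: centroids are given as one (x, y) pair per frame, the movement of a frame is measured from the previous frame, and the average is taken over the whole sequence.

```python
import math


def select_frames_by_centroid_shift(centroids, threshold):
    """Return indices of frames whose centroid movement is close to the average.

    centroids: list of (x, y) person centroids, one per frame.
    A frame is selected when |its movement - average movement| < threshold.
    The first frame has no predecessor and is therefore never evaluated.
    """
    shifts = [math.dist(centroids[i], centroids[i - 1])
              for i in range(1, len(centroids))]
    if not shifts:
        return []
    mean_shift = sum(shifts) / len(shifts)
    return [i for i, shift in enumerate(shifts, start=1)
            if abs(shift - mean_shift) < threshold]


# The last frame jumps by 10 pixels while the others move by 1 pixel
centroids = [(0, 0), (1, 0), (2, 0), (3, 0), (13, 0)]
selected = select_frames_by_centroid_shift(centroids, threshold=3.0)
print(selected)
```

Frames whose centroid jumps far more than the average, which often indicates a tracking error, are left out of the output.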

人物画像切り出し部６１１は、フレーム間領域差判定部６１０において選択された人物領域から人物画像を切り出し、教師画像として出力６１２する。 The person image clipping unit 611 clips a person image from the person region selected by the inter-frame region difference determination unit 610 and outputs it as a teacher image (output 612).

図９は、歩行者処理部６１３に含まれる領域ベクトルグルーピング部６０３の動作例を説明するための図である。 FIG. 9 is a diagram for explaining an operation example of the region vector grouping unit 603 included in the pedestrian processing unit 613.

領域ベクトルグルーピング部６０３は、移動人物（例えば歩行者）を特定するために、近接する同等の移動方向及び移動量の動きベクトルを束ねて、移動人物の人物領域を予測する。なお、近接する動きベクトルは、２つの動きベクトルのブロックの位置が隣り合っていてよい。また、移動方向が同等の動きベクトルは、２つの動きベクトルのなす角が所定角度以下であってよい。ここで、所定角度は、システムの運用に合わせて適切に設定される。例えば、所定角度は、動きベクトルのなす角が当該所定角度以下であれば、人物の各部位の動きとして実質的に同一と見なせる角度に設定されてよい。また、移動量が同等の動きベクトルは、２つの動きベクトルの大きさの差が所定値以下であってよい。ここで、所定値は、システムの運用に合わせて適切に設定される。例えば、所定値は、動きベクトルの大きさの差がその所定値以下であれば人物の各部位の動きとして実質的に同一と見なせる値に設定されてよい。 To identify a moving person (for example, a pedestrian), the region vector grouping unit 603 bundles adjacent motion vectors with similar movement direction and movement amount, and predicts the person region of the moving person. Two motion vectors may be regarded as adjacent when their blocks are next to each other. Two motion vectors may be regarded as having a similar movement direction when the angle between them is at most a predetermined angle. The predetermined angle is set appropriately for the operation of the system; for example, it may be set so that motion vectors within that angle of each other can be regarded as substantially the same motion of the parts of a person. Two motion vectors may be regarded as having a similar movement amount when the difference between their magnitudes is at most a predetermined value. The predetermined value is likewise set appropriately for the operation of the system; for example, it may be set so that motion vectors whose magnitudes differ by no more than that value can be regarded as substantially the same motion of the parts of a person.
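
The three grouping criteria (adjacent blocks, angle within a predetermined angle, magnitude difference within a predetermined value) can be sketched as a flood fill over the block grid. This is an illustration only: the 4-neighbour definition of adjacency, the default thresholds, and the names `similar` and `group_vectors` are assumptions, and zero-length vectors are simply left ungrouped.

```python
import math
from collections import deque


def similar(v1, v2, max_angle_deg, max_mag_diff):
    """True if two motion vectors agree in direction and magnitude."""
    m1, m2 = math.hypot(*v1), math.hypot(*v2)
    if m1 == 0 or m2 == 0:             # direction is undefined for zero vectors
        return False
    if abs(m1 - m2) > max_mag_diff:    # movement amounts differ too much
        return False
    cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (m1 * m2)
    cos = max(-1.0, min(1.0, cos))     # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(cos)) <= max_angle_deg


def group_vectors(vectors, max_angle_deg=20.0, max_mag_diff=1.0):
    """Group block motion vectors by breadth-first search.

    vectors: dict mapping block position (bx, by) to motion vector (dx, dy).
    Blocks are linked when they are 4-neighbours and their vectors are similar.
    Returns a list of groups, each a sorted list of block positions.
    """
    seen, groups = set(), []
    for start in vectors:
        if start in seen:
            continue
        seen.add(start)
        group, queue = [start], deque([start])
        while queue:
            bx, by = queue.popleft()
            for nb in ((bx + 1, by), (bx - 1, by), (bx, by + 1), (bx, by - 1)):
                if (nb in vectors and nb not in seen
                        and similar(vectors[(bx, by)], vectors[nb],
                                    max_angle_deg, max_mag_diff)):
                    seen.add(nb)
                    group.append(nb)
                    queue.append(nb)
        groups.append(sorted(group))
    return groups


vectors = {
    (0, 0): (2.0, 0.0), (1, 0): (2.0, 0.2), (0, 1): (2.1, 0.0),  # moving right
    (1, 1): (-2.0, 0.0),                                         # moving left
    (3, 3): (0.0, -2.0),                                         # isolated block
}
groups = group_vectors(vectors)
print(groups)
```

The three rightward vectors form one group; the adjacent leftward vector is excluded by the angle test, and the isolated block is excluded by the adjacency test.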

例えば図９において（ａ）に示すように、人物７０１が方向７００に移動している場合、領域ベクトルグルーピング部６０３は、次の処理を行う。すなわち、領域ベクトルグルーピング部６０３は、オプティカルフローによって、各特徴点の方向７００とほぼ同じ向き及び移動量の動きベクトル群７０２を観測する。 For example, when person 701 is moving in direction 700 as shown in FIG. 9(a), the region vector grouping unit 603 performs the following processing. Using optical flow, it observes a group 702 of motion vectors at the feature points whose direction and movement amount are nearly the same as those of direction 700.

このとき、領域ベクトルグルーピング部６０３は、特徴点毎に、所定範囲内で近接する類似の動きベクトルを束ねる（７０４）。そして、領域ベクトルグルーピング部６０３は、図９において（ｂ）に示すように、束ねた動きベクトルを包含する仮想円７０５を設定する。 At this time, the region vector grouping unit 603 bundles similar motion vectors that are adjacent within a predetermined range for each feature point (704). Then, the area vector grouping unit 603 sets a virtual circle 705 containing the grouped motion vectors, as shown in FIG. 9(b).

領域ベクトルグルーピング部６０３は、人物７０１に含まれる全ての特徴点について、図９において（ｂ）の処理を実行する。そして、領域ベクトルグルーピング部６０３は、図９において（ｃ）に示すように設定した仮想円を重ねて、全身を表現した全身マップ７０６を得る。 The area vector grouping unit 603 executes the process (b) in FIG. 9 for all feature points included in the person 701 . Region vector grouping section 603 then superimposes virtual circles set as shown in FIG. 9(c) to obtain whole body map 706 representing the whole body.

領域ベクトルグルーピング部６０３は、図９において（ｄ）に示すように、全身マップ７０６を囲む矩形の人物領域７０７を設定する。そして、領域ベクトルグルーピング部６０３は、人物領域７０７から人物画像７０１を抽出し、教師画像として出力する。 The region vector grouping unit 603 sets a rectangular human region 707 surrounding the whole body map 706, as shown in FIG. 9(d). Then, the area vector grouping unit 603 extracts the person image 701 from the person area 707 and outputs it as a teacher image.
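The steps of FIG. 9(b) to FIG. 9(d) can be sketched as follows. This is an illustrative sketch under assumptions, not the patented implementation: the specification does not say how the virtual circle is computed, so the circle here is simply centered on the centroid of the bundled vectors' start and end points (it is not a minimum enclosing circle), and the function names and the `(left, top, right, bottom)` tuple layout are hypothetical.

```python
import math

def enclosing_circle(vectors):
    """Return (cx, cy, r) of a circle containing all bundled motion vectors.

    vectors is a list of ((x, y), (dx, dy)) pairs. The circle is centered on
    the centroid of all start and end points (an assumed construction).
    """
    pts = []
    for (x, y), (dx, dy) in vectors:
        pts.append((x, y))
        pts.append((x + dx, y + dy))
    cx = sum(p[0] for p in pts) / len(pts)
    cy = sum(p[1] for p in pts) / len(pts)
    r = max(math.hypot(px - cx, py - cy) for px, py in pts)
    return cx, cy, r

def person_rectangle(circles):
    """Return the rectangle (left, top, right, bottom) surrounding all circles.

    The union of the circles plays the role of the whole body map.
    """
    left = min(cx - r for cx, cy, r in circles)
    top = min(cy - r for cx, cy, r in circles)
    right = max(cx + r for cx, cy, r in circles)
    bottom = max(cy + r for cx, cy, r in circles)
    return left, top, right, bottom
```

The rectangle returned by `person_rectangle` would then be used to crop the frame, which corresponds to extracting the person image from the person region.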

FIGS. 10 to 13 are diagrams for explaining an operation example of the region correction unit 609 included in the intermediate frame image extraction unit 102.

For a pedestrian, the motion vectors at the tips of the feet are observed to be relatively large compared with the rest of the body. The region correction unit 609 therefore detects the foot tip positions and corrects the person region of the whole body map.

As illustrated in FIG. 10, the region vector grouping unit 603 may fail to observe valid feature points. In that case, a portion 802 arises where no virtual circle for generating the whole body map can be set, and the whole body map generation unit 804 generates a whole body map that is partially missing. That is, a rectangular person region 801 is set from which the portion 802 of the person 800 is cut off.

Here, the upper body of a walking person is highly likely to lie inside the tips of the feet. The region correction unit 609 therefore detects the foot tip position 804 and corrects the rectangular person region 801 on the assumption that the detected foot tip position 804 is the edge of the person 800. This increases the probability that the whole body of the person 800 is contained in the rectangular person region 801.

FIG. 11 shows an example in which the region correction unit 609 performs a correction that expands the left side of the person region 805 of the person 803 with reference to the detected left foot tip position 804.

FIG. 12 shows an example in which the region correction unit 609 performs a correction that expands the right side of the person region 808 of the person 806 with reference to the detected right foot tip position 807.

FIG. 13 shows an example in which the region correction unit 609 performs a correction that expands both the left side and the right side of the person region 812 of the person 809 with reference to the detected left foot tip position 810 and right foot tip position 811.

For example, taking the vertical central axis of the whole body map before correction as a reference, the region correction unit 609 expands the part of the person region to the left of the central axis if the foot tip position is to the left of the central axis. Likewise, the region correction unit 609 expands the part of the person region to the right of the central axis if the foot tip position is to the right of the central axis.

For a pedestrian who walks while leaning forward, for example, part of the upper body may protrude beyond the foot tip position. In this case, the region correction unit 609 may determine that the pedestrian is not walking upright and need not perform the correction of the person region based on the foot tip positions described above.
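The left/right expansion rule can be sketched in a few lines of Python. This is a hedged illustration only: the function name, the `(left, top, right, bottom)` layout, and the choice to move an edge only when the foot tip lies outside the current rectangle are assumptions, and the check for a pedestrian who is not walking upright is left out.

```python
def correct_person_region(region, foot_tips):
    """Expand a person rectangle so that detected foot tips are included.

    region is (left, top, right, bottom); foot_tips is a list of (x, y).
    The vertical central axis is taken from the region before correction.
    """
    left, top, right, bottom = region
    center_x = (left + right) / 2.0  # central axis of the uncorrected region
    new_left, new_right = left, right
    for x, _y in foot_tips:
        if x < center_x:
            new_left = min(new_left, x)    # foot tip left of the axis: extend left side
        elif x > center_x:
            new_right = max(new_right, x)  # foot tip right of the axis: extend right side
    return (new_left, top, new_right, bottom)
```

With one foot tip to the left of the axis only the left edge moves (the FIG. 11 case); with foot tips on both sides both edges move (the FIG. 13 case).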

FIG. 14 is a diagram showing an operation example of the fine motion vector grouping unit 606 included in the stationary person processing unit 614.

A person is generally almost never completely still, except during sleep, and moves slightly. This tendency is particularly strong in a standing person. The fine motion vector grouping unit 606 detects a stationary person by observing these slight movements in time series and integrating them.

FIG. 14(a) shows a stationary person 900. The fine motion vector grouping unit 606 applies to the stationary person the same processing as the processing for pedestrians shown in FIG. 9. When predetermined conditions are satisfied, a person region containing the entire person may be obtained even for a stationary person. With the frame of interest alone, however, usually only the part of the body that happens to be moving is observed. Therefore, as shown in FIG. 14(b), the fine motion vector grouping unit 606 observes vector groups continuously over a predetermined frame interval (Tn to Tn+4). For example, the fine motion vector grouping unit 606 observes five vector groups 906 to 910 for the same person.

In this case, as shown in FIG. 14(c), the fine motion vector grouping unit 606 sets a rectangular provisional region that contains the five observed vector groups. Then, as shown in FIG. 14(d), the fine motion vector grouping unit 606 takes this provisional region as a candidate region 912 for the person 900.

Next, as shown in FIG. 14(e), the fine motion edge extraction unit 607 performs edge extraction continuously over a predetermined frame interval Tm to Tm+2 and obtains edge images 916 to 918, one for each predetermined frame.

Next, as shown in FIG. 14(f), the fine motion edge extraction unit 607 obtains one average edge image 919 from the edge images 916 to 918 of the predetermined frame interval Tm to Tm+2.

Next, as shown in FIG. 14(g), the time-series edge strength verification unit 608 determines whether edge components with a strength at or above a predetermined criterion exist within the candidate region 912 of the average edge image 919. For example, the time-series edge strength verification unit 608 determines whether pixels with a predetermined luminance value occupy at least a predetermined area. If edge components at or above the predetermined criterion exist, the time-series edge strength verification unit 608 confirms the candidate region 912 shown in FIG. 14(g) as a person region 920, as shown in FIG. 14(h). If the edge components are below the predetermined criterion, the time-series edge strength verification unit 608 discards the candidate region 912 shown in FIG. 14(g).
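The averaging and verification steps of FIG. 14(f) to FIG. 14(h) can be sketched as follows. This is only an illustration under assumptions: edge images are represented as nested lists of luminance values, "pixels with a predetermined luminance value" is read as pixels at or above a threshold, the area is counted in pixels, and the names and default thresholds (`min_value=128`, `min_pixels=3`) are hypothetical.

```python
def average_edge_image(edge_images):
    """Average several equally sized edge images pixel by pixel."""
    n = len(edge_images)
    height, width = len(edge_images[0]), len(edge_images[0][0])
    return [
        [sum(img[y][x] for img in edge_images) / n for x in range(width)]
        for y in range(height)
    ]

def verify_candidate(avg_edge, region, min_value=128, min_pixels=3):
    """Return True if the candidate region holds enough strong edge pixels.

    region is (left, top, right, bottom) with right and bottom exclusive.
    True corresponds to confirming the candidate as a person region,
    False to discarding it.
    """
    left, top, right, bottom = region
    count = sum(
        1
        for y in range(top, bottom)
        for x in range(left, right)
        if avg_edge[y][x] >= min_value
    )
    return count >= min_pixels
```

Averaging over several frames means that an edge appearing in only one frame is attenuated, while edges that persist across the interval keep their strength, which is why the check is applied to the average image rather than to a single frame.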

FIG. 15 is a diagram for explaining an operation example of the adopted image determination unit 104.

The adopted image determination unit 104 determines, based on the learning error rate, whether an improvement in the accuracy of the person identification model can be expected. It then selects, as teacher images, the person images of the person regions for which such an improvement is determined to be expected.

The adopted image determination unit 104 includes an image reading unit 1001, a person identification test model learning unit 1002, an adoption determination unit 1003, and an image storage unit 1007.

The image reading unit 1001 reads a predetermined number of person images from the extracted image set 202. The number of person images read may be arbitrary. For example, the image reading unit 1001 may read 2,000 of the 10,000 person images included in the extracted image set 202.

The person identification test model learning unit 1002 performs deep learning of a person identification test model. Of the 2,000 person images read by the image reading unit 1001, the person identification test model learning unit 1002 may use 1,000 for the learning that updates the filter coefficients of the person identification model at each epoch, and the remaining 1,000 for evaluating the test error rate at each epoch. An epoch is a set of the minimum computation units for the filter coefficients of the identification model, and is the unit in which all input images for learning have been referenced. The epoch count is also referred to as the number of training runs.

When learning has progressed for a predetermined number of epochs, the adoption determination unit 1003 takes the epoch with the lowest test error rate and compares the test model error rate obtained with the filter coefficients learned at that epoch with a reference model error rate 1004. The reference model error rate 1004 is the learning error rate of the existing person identification model.

When the adoption determination unit 1003 determines that the test model error rate is lower than the reference model error rate 1004 (adoption determination 1005: YES), the image storage unit 1007 stores the person images read by the image reading unit 1001 in the teacher image set 203 as teacher images. The teacher image set 203 is used for training the formal person identification model.

On the other hand, when the adoption determination unit 1003 determines that the test model error rate is higher than the reference model error rate 1004 (adoption determination 1005: NO), it discards the person images read by the image reading unit 1001.

FIG. 16 is a diagram showing an example of a learning error rate curve referred to by the adopted image determination unit 104.

In FIG. 16, the vertical axis indicates the test error rate (in %), and the horizontal axis indicates the number of training runs (in epochs).

The test error rate is the proportion of failures (errors) that occur when unknown images for evaluating the test error rate are input to the learned model output at each epoch and the model fails to identify them. The unknown images are images different from those used for learning. The test error rate is expressed in the range from 0% to 100%, and in general, the closer it is to 0%, the higher the recognition performance of the identification model is considered to be.

In FIG. 16, ER1 indicates the best value of the reference model test error rate 1004. ER2 indicates the best value of the test error rate of the person identification test model trained by the person identification test model learning unit 1002.

In the example of FIG. 16, ER2 is smaller than ER1. This indicates that the identification performance of the person identification test model has improved. The person images used for training the person identification test model are therefore stored in the teacher image set.

Conversely, ER2 being larger than ER1 indicates that the identification performance of the person identification test model has degraded. In that case, the person images used for training the person identification test model are discarded without being stored in the teacher image set.

FIG. 17 shows a first example of a teacher image extraction system including the teacher image extraction device 10. In the first example, the teacher image extraction system is configured locally.

For example, as shown in FIG. 17, the teacher image extraction system has a camera 1200, a video storage device 1201, the teacher image extraction device 10, a monitor 1203, and a teacher image storage device 1204.

The camera 1200 captures video that includes a person.

The video storage device 1201 stores the video (moving images) captured by the camera 1200. The video captured by the camera 1200 may instead be input directly to the teacher image extraction device 10 without being stored in the video storage device 1201.

The teacher image extraction device 10 extracts teacher images, as described above, from the video input from the video storage device 1201. The video input to the teacher image extraction device 10 may be selected arbitrarily. The teacher image extraction device 10 may have a memory and a CPU (Central Processing Unit), and the processing of each unit may be realized by the CPU executing a software program stored in the memory. In this case, the teacher image extraction device 10 may be a personal computer (PC) that executes the software program.

The teacher image extraction device 10 may display the extracted teacher images on the monitor 1203. It may also store the extracted teacher images in the teacher image storage device 1204.

FIG. 18 shows a second example of a teacher image extraction system including the teacher image extraction device 10.

In the second example, the teacher image extraction system is provided as a cloud service over a network. The teacher image extraction system includes a control PC 1208, the teacher image extraction device 10, a video storage device 1209, and a teacher image storage device 1212.

The cloud provides various IT resources such as computing, databases, storage, and/or applications on demand via the Internet 1207.

For example, as shown in FIG. 18, a camera 1206, a host PC 1213, and a monitor 1214 are installed locally, and the teacher image extraction system is provided as a cloud via the network 1207.

The camera 1206 captures video that includes a person.

The host PC 1213 stores the video captured by the camera 1206 in the video storage device 1209 via the network 1207 and the control PC 1208.

The teacher image extraction device 10 extracts teacher images, as described above, from the video input from the video storage device 1209.

As shown in FIG. 18, a plurality of teacher image extraction devices 10 may be provided. In this case, the plurality of teacher image extraction devices 10 may extract teacher images by parallel processing. The functions and processing of the teacher image extraction device 10 described above may be realized by the CPU executing a computer program 1211 stored in the memory of each device 10. In this case, each teacher image extraction device 10 may be a sub-PC that executes the computer program.

The teacher images extracted by the teacher image extraction device 10 may be displayed on the local monitor 1214 via the network 1207 and the host PC 1213. The teacher image extraction device 10 may also store the extracted teacher images in the teacher image storage device 1212.

Although the above description uses an example in which the extraction target is a person, the extraction target is not limited to a person. For example, the extraction target may be something other than a person, such as a building, a vehicle, a home appliance, the sea, a mountain, the sky, a flower, or a tree.

When the extraction target is not a person, the constraint conditions for judging the presence of a person, used in the key frame person designation unit and in the intermediate frame image extraction unit including the region correction unit, may be changed as appropriate to suit the extraction target.

The above can be expressed as follows.

The teacher image extraction device 10, which extracts an image of a person, an example of a target object, from frames of images included in a video, has a key frame person designation unit 101 and an intermediate frame image extraction unit 102. The key frame person designation unit 101 takes some of the frames of the video as key frames and acquires an image of a person region, designated by a rectangle, that contains the portion of the key frame in which the target object is displayed. The intermediate frame image extraction unit 102 tracks through the video the person designated in the key frame, and extracts an image of a rectangular person region containing the portion in which the person is displayed in an intermediate frame, that is, a frame of the video that is not a key frame.

With this configuration, a rectangular partial image in which the person is displayed is extracted from the frames included in the video, so teacher images with reduced background noise can be extracted.

The teacher image extraction device 10 may further have a person region identification unit 103 that identifies the person region, that is, the portion of an intermediate frame in which the person is displayed, by applying motion vector processing that differs depending on whether the person is moving or stationary.

The motion vectors in the region of a person show different properties depending on whether the person is moving or stationary. With this configuration, the person region can be identified using motion vector processing suited to each case, so the person region can be identified well both when the person is moving and when the person is stationary.

When the person is moving, the person region identification unit 103 may identify the person region by combining regions obtained by grouping motion vectors that are close in position and equivalent in movement direction and movement amount.

When a person is moving, the parts of the person move in an equivalent manner. With this configuration, the person region can therefore be identified well by grouping neighboring equivalent motion vectors and combining the groups.

When the person is stationary, the person region identification unit 103 may identify the person region based on a region obtained by combining, over a plurality of consecutive frames including the intermediate frame of interest, the partial regions obtained in each of those frames by grouping neighboring motion vectors that indicate equivalent movement directions and movement amounts.

Even when a person is stationary, movement is generally seen in individual body parts. With this configuration, the person region can therefore be identified well by combining, over a plurality of frames, the partial regions obtained by grouping motion vectors.

When the person is stationary, the person region identification unit 103 may identify the person region based on the region obtained by combining the partial regions over a plurality of frames and on edge images extracted from the images of one or more frames including the intermediate frame.

With this configuration, the result of edge extraction is used in addition to the region obtained by combining the partial regions in which movement was seen, so the region of a stationary person can be identified even better.

The intermediate frame image extraction unit 102 may correct the person region of the intermediate frame based on the motion vectors of one or more frames before or after the intermediate frame.

With this configuration, the person region is corrected by processing that uses the frames before and after the intermediate frame of interest, so cases in which the person is cut off can be reduced.

The intermediate frame image extraction unit 102 may estimate a region in which the movement amount of the motion vectors is relatively large, in one or more frames before or after the intermediate frame, to be the foot tip position of the walking person, and may correct the person region so that the foot tip position is included.

With this configuration, the person region is corrected so as to include the region estimated to be the foot tip by processing that uses the motion vectors of the frames before and after the intermediate frame of interest, so cases in which the person's foot tips are cut off can be reduced.

The teacher image extraction device 10 may further have an adopted image determination unit 104 that builds a test model by using, for deep learning, a person image group including the person images of the key frames and the person images of the intermediate frames, evaluates the accuracy of person identification by the test model, and determines based on the evaluation result whether to adopt the person image group.

With this configuration, a test model is built from the extracted person image group, the accuracy of person identification is evaluated, and adoption or rejection of the person image group is determined, so person images that yield good accuracy in deep learning can be adopted.

When a plurality of persons are designated in a key frame, the intermediate frame image extraction unit 102 may track each of the persons and extract a person image of the person region for each person.

With this configuration, each of the plurality of persons is tracked and a person image of the person region is extracted for each, so a large number of person images can be extracted.

Reference Signs List: 10: teacher image extraction device; 100: video input; 101: key frame person designation unit; 102: intermediate frame image extraction unit; 103: person region identification unit; 104: adopted image determination unit; 105: teacher image storage unit

Claims

An object image extraction device that extracts an image of a person, who is an object, from frames of images included in a video, the device comprising:
a key frame person designation unit that takes some of the frames of the video as key frames and acquires an image of a person region designated by a rectangle containing the portion of the key frame in which a person is displayed;
an intermediate frame image extraction unit that tracks the person designated in the key frame through the video and extracts an image of a rectangular person region containing the portion in which the person is displayed in an intermediate frame, which is a frame of the video that is not the key frame; and
a person region identification unit that identifies the person region in the intermediate frame by applying motion vector processing that differs depending on whether the person is moving or stationary,
wherein the intermediate frame image extraction unit detects, for each block obtained by dividing a frame into predetermined units, a motion vector indicating the direction and amount of displacement between frames, and stabilizes the motion vectors, and
wherein the person region identification unit determines that the person is moving when the prediction error of the stabilized motion vectors is equal to or greater than a predetermined threshold, and identifies the person region by combining regions obtained by grouping motion vectors that are close in position and equivalent in movement direction and movement amount, and determines that the person is stationary when the prediction error is less than the threshold, and identifies the person region based on a region obtained by combining, over a plurality of consecutive frames including the intermediate frame of interest, partial regions obtained in each of the frames by grouping neighboring motion vectors that indicate equivalent movement directions and movement amounts.

The object image extraction device according to claim 1, wherein, when the person is stationary, the person region identification unit identifies the person region based on the region obtained by combining the partial regions over the plurality of frames and on an edge image extracted from the images of one or more frames including the intermediate frame.

The object image extraction device according to claim 1, wherein, when the person region identification unit determines that the person is moving, the intermediate frame image extraction unit estimates a region in which the movement amount of the motion vectors is relatively large, in one or more frames before or after the intermediate frame, to be the foot tip position of the walking person, and corrects the person region so that the foot tip position is included.

The object image extraction device according to claim 1, further comprising an adopted image determination unit that builds a test model by using, for deep learning, a person image group including the image of the person region of the key frame and the image of the person region of the intermediate frame, evaluates the accuracy of person identification by the test model, and determines based on the evaluation result whether to adopt the person image group.

The object image extraction device according to claim 1, wherein, when a plurality of persons are designated in the key frame, the intermediate frame image extraction unit tracks each of the plurality of persons and extracts an image of the person region for each person.

A target object image extraction method for extracting an image of a person, which is a target object, from image frames included in a video, wherein
a computer:
obtains, using some of the frames of the video as key frames, an image of a person region specified by a rectangle including a portion in which a person is displayed in the key frame;
tracks the person specified in the key frame through the video, and extracts an image of a rectangular person region including a portion in which the person is displayed in an intermediate frame, which is a frame of the video that is not a key frame;
detects a motion vector indicating the direction and amount of shift between frames for each block obtained by dividing a frame into predetermined units, and stabilizes the motion vector;
when the prediction error of the stabilized motion vector is equal to or greater than a predetermined threshold, determines that the person is moving, and identifies the person region by synthesizing partial regions obtained by grouping motion vectors that are close to each other and have the same movement direction and movement amount; and
when the prediction error is less than the threshold, determines that the person is stationary, and identifies the person region based on a region obtained by synthesizing, over a plurality of consecutive frames including the target intermediate frame, partial regions obtained in each of those frames by grouping motion vectors that are close to each other and indicate similar movement directions and movement amounts;
Object image extraction method.
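The moving and stationary branches of this method can be sketched as follows. The sketch takes the prediction error as an input, since this claim does not define how it is computed, and assumes that grouping means connecting 4-neighbouring blocks whose vectors differ by at most a tolerance, and that synthesis means taking the bounding rectangle of all grouped blocks. `tol`, `min_motion`, `window`, and `block` are hypothetical parameters.

```python
from math import hypot


def group_blocks(vectors, tol=1.0, min_motion=0.5):
    """Group adjacent blocks whose motion vectors are nearly equal.

    vectors: dict (col, row) -> (dx, dy) for one frame.
    Returns a list of sets of (col, row) cells.
    """
    moving = {c for c, (dx, dy) in vectors.items() if hypot(dx, dy) >= min_motion}
    groups, seen = [], set()
    for start in sorted(moving):
        if start in seen:
            continue
        group, stack = set(), [start]
        seen.add(start)
        while stack:
            cell = stack.pop()
            group.add(cell)
            col, row = cell
            for nb in ((col + 1, row), (col - 1, row), (col, row + 1), (col, row - 1)):
                if nb in moving and nb not in seen:
                    ax, ay = vectors[cell]
                    bx, by = vectors[nb]
                    if hypot(ax - bx, ay - by) <= tol:
                        seen.add(nb)
                        stack.append(nb)
        groups.append(group)
    return groups


def bounding_rect(cells, block=16):
    """Pixel rectangle (x0, y0, x1, y1) covering the given grid cells."""
    cols = [c for c, _ in cells]
    rows = [r for _, r in cells]
    return (min(cols) * block, min(rows) * block,
            (max(cols) + 1) * block, (max(rows) + 1) * block)


def identify_person_region(frames, target, prediction_error, threshold,
                           window=1, block=16):
    """frames: list of per-frame vector dicts; target: intermediate frame index."""
    if prediction_error >= threshold:
        used = [frames[target]]  # moving: the target frame alone is enough
    else:
        lo = max(0, target - window)
        hi = min(len(frames), target + window + 1)
        used = frames[lo:hi]  # stationary: accumulate consecutive frames
    cells = set()
    for vectors in used:
        for group in group_blocks(vectors):
            cells |= group
    return bounding_rect(cells, block) if cells else None
```

Accumulating over neighbouring frames in the stationary branch reflects the idea that a person who is nearly still produces only a few moving blocks per frame, so a single frame may not cover the whole body.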

A software program for causing a computer to extract an image of a person, which is a target object, from image frames included in a video, the program causing the computer to execute:
obtaining, using some of the frames of the video as key frames, an image of a person region specified by a rectangle including a portion in which a person is displayed in the key frame;
tracking the person specified in the key frame through the video, and extracting an image of a rectangular person region including a portion in which the person is displayed in an intermediate frame, which is a frame of the video that is not a key frame;
detecting a motion vector indicating the direction and amount of shift between frames for each block obtained by dividing a frame into predetermined units, and stabilizing the motion vector;
when the prediction error of the stabilized motion vector is equal to or greater than a predetermined threshold, determining that the person is moving, and identifying the person region by synthesizing partial regions obtained by grouping motion vectors that are close to each other and have the same movement direction and movement amount; and
when the prediction error is less than the threshold, determining that the person is stationary, and identifying the person region based on a region obtained by synthesizing, over a plurality of consecutive frames including the target intermediate frame, partial regions obtained in each of those frames by grouping motion vectors that are close to each other and indicate similar movement directions and movement amounts;
A software program that causes a computer to execute the above processing.