JP2018185724A

JP2018185724A - Device, program and method for tracking object using pixel change processing image

Info

Publication number: JP2018185724A
Application number: JP2017088267A
Authority: JP
Inventors: Yuki Nagai (永井 有希); Tatsuya Kobayashi (小林 達也)
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2017-04-27
Filing date: 2017-04-27
Publication date: 2018-11-22
Anticipated expiration: 2037-04-27
Also published as: JP6789876B2

Abstract

PROBLEM TO BE SOLVED: To provide a device capable of tracking a plurality of objects more reliably. SOLUTION: An object tracking device uses a time-series image group that can include a plurality of tracking targets and tracks each tracking target by determining its position in each image based on the output of a discriminator to which an image or an image region is input. The device includes: mask processing means that performs pixel change processing on a processing target region in an image, or in an image region, of the image group, the processing target region being determined based on a position that was determined at a past time point, or given as a correct answer, for another tracking target other than one tracking target, and the pixel change processing changing the region to a pixel pattern in which the features of the other tracking target are eliminated or reduced; and image output means that outputs the image or image region subjected to the pixel change processing to the discriminator for learning and/or discrimination of the one tracking target. The discriminator may include a plurality of convolutional neural networks that receive images or image regions at a plurality of time points. SELECTED DRAWING: Figure 2

Description

The present invention relates to object tracking technology that analyzes a time-series image group that can include a tracking target and tracks that target.

At present, tracking technology has been put to practical use for purposes such as surveillance and marketing; it analyzes time-series image data captured by a camera and successively determines the real-space position of a moving tracking target. Various objects, such as people and vehicles, can be set as tracking targets as long as they can be photographed. Techniques for tracking a plurality of objects concurrently have also been developed and are being actively improved. Such techniques are intended, for example, to capture more accurately the flow lines of the many staff members and customers who stay in and move around a store.

Object tracking techniques that use such a time-series image group are generally divided into (a) those based on a discriminator that outputs information on the position of the tracked object while learning the object's appearance online from the successive images, and (b) those based on a discriminator that does not learn appearance online.

For example, Non-Patent Document 1 discloses a tracking technique of the online-learning type (a). Each time a frame at a new time is acquired, this technique performs detection and then tracking processing for each ID (identifier) assigned to a tracked object. Specifically, the appearance of the object associated with each ID is learned online, and an appearance discrimination model is built for each ID. These per-ID appearance models are then used to associate new detection results with the trajectories of the objects tracked so far.

Non-Patent Document 2, on the other hand, discloses a technique of type (b), without online learning, that tracks a single object. This technique uses deep learning: from a pair of image regions consisting of the object image region at the previous time and a candidate image region cut out of the current image around the location corresponding to that previous region, it estimates the image region in which the object exists at the current time, and thereby tracks the object.

Non-Patent Document 1: S.-H. Bae and K.-J. Yoon, "Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1218-1225
Non-Patent Document 2: D. Held, S. Thrun, and S. Savarese, "Learning to track at 100 fps with deep regression networks", arXiv:1604.01802 (cs.CV), 2016

However, the conventional techniques described above still leave problems unsolved, particularly when a plurality of objects are tracked concurrently.

For example, because the technique described in Non-Patent Document 1 learns object appearance online, it consumes a large amount of memory when a plurality of objects are present in the image data, and that consumption is difficult to control. If the technique is applied where many people are present and the people in the video change from moment to moment, as in an exhibition hall or on a public road, memory consumption becomes enormous and tracking may eventually become impossible. Specifically, as tracking analysis runs over a long period, the number of person IDs to be tracked grows, the number of person models to be learned grows with it, and memory resources eventually run short.

In contrast, the technique described in Non-Patent Document 2 does not need to learn object appearance online, so the memory consumption problem described above does not arise. However, Non-Patent Document 2 sets only a single object as the tracking target and does not consider tracking a plurality of objects. If the technique were applied as is to multiple-object tracking, the memory problem would not arise, but a phenomenon called drift would become more likely whenever an object of similar appearance is near the object being tracked.

Here, drift is a phenomenon in which the image region of a nearby object of similar appearance is taken to be the correct region, so that the tracker starts following the wrong object. It can occur frequently, for example, when the tracking targets are several people who overlap one another in the video. Non-Patent Document 2, which addresses single-object tracking in the first place, proposes no method for dealing with drift.

Incidentally, in multiple-object tracking, drift has not been completely eliminated even in object tracking that uses online learning, and a countermeasure is still needed.

Accordingly, an object of the present invention is to provide a device, a program, and a method capable of tracking a plurality of objects more reliably.

According to the present invention, there is provided an object tracking device that uses a time-series image group that can include a plurality of tracking targets and that tracks each tracking target by determining its position in each image based on the output of a discriminator to which the image, or an image region within the image, is input, the device comprising:
mask processing means for performing pixel change processing on a processing target region in an image included in the image group or in an image region within the image, the processing target region being determined based on a position determined, at a past time point or as a correct answer, for another tracking target other than one tracking target, and the pixel change processing changing the region to a pixel pattern in which the features of the other tracking target are eliminated or reduced; and
image output means for outputting the image or image region subjected to the pixel change processing to the discriminator for learning and/or discrimination of the one tracking target.

As one embodiment of the pixel change processing in the object tracking device of the present invention, the mask processing means preferably changes the processing target region of at least one other tracking target either to a pixel pattern determined from image regions other than the tracking targets in the image or image region, or to a predetermined pixel pattern.

As another embodiment of the pixel change processing, the mask processing means preferably performs, on the processing target region of at least one other tracking target, processing that blurs the appearance of that other tracking target.

As yet another embodiment of the pixel change processing, when the processing target region of the other tracking target has an overlapping region that overlaps the tracking target region containing the position determined for the one tracking target, the mask processing means preferably does not perform the pixel change processing on that overlapping region.
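The overlap handling in this embodiment can be illustrated with a minimal sketch. It assumes axis-aligned rectangular regions, a grayscale image held as nested lists, and a constant fill value as the pixel change processing; the names `intersection` and `mask_except_overlap` are hypothetical and do not appear in the specification.

```python
from typing import List, Optional, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def intersection(a: Box, b: Box) -> Optional[Box]:
    """Return the overlapping rectangle of two boxes, or None if they are disjoint."""
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    if left < right and top < bottom:
        return (left, top, right, bottom)
    return None


def mask_except_overlap(img: Image, other_box: Box, target_box: Box,
                        fill: float = 0.0) -> Image:
    """Fill other_box with `fill`, leaving pixels that also lie in target_box untouched.

    Assumes both boxes lie inside the image (an assumption of this sketch).
    """
    overlap = intersection(other_box, target_box)
    out = [row[:] for row in img]  # copy so the input frame is not modified
    for y in range(other_box[1], other_box[3]):
        for x in range(other_box[0], other_box[2]):
            in_overlap = (overlap is not None
                          and overlap[0] <= x < overlap[2]
                          and overlap[1] <= y < overlap[3])
            if not in_overlap:
                out[y][x] = fill
    return out
```

Skipping the overlapping pixels keeps the part of the one tracking target that lies inside the other target's region visible to the discriminator.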

The object tracking device according to the present invention preferably further includes a tracking target region determination unit that, when the processing target region of the other tracking target has an overlapping region that overlaps the tracking target region containing the position determined for the one tracking target, changes the tracking target region of the one tracking target to a region that excludes the overlapping region.

Furthermore, as one embodiment of the object tracking device according to the present invention, the image output means preferably outputs the image or image region subjected to the pixel change processing at at least one time point to a trained discriminator, and the device preferably further includes target position determination means for determining the position of the one tracking target at that time point based on the output of the trained discriminator.

In this embodiment, the image output means preferably outputs to the trained discriminator both the image or image region subjected to the pixel change processing at one time point and the image or image region subjected to the pixel change processing at at least one earlier time point.

Furthermore, the discriminator according to the present invention preferably includes a plurality of first neural networks, each of which receives the image or image region at one time point and outputs feature information on that image or image region, and a second neural network that receives the plurality of pieces of feature information output from the first neural networks and outputs information on the position of the one tracking target.

The image output means preferably also outputs to the second neural network, in association with each piece of feature information, information on the time point of the image or image region to which that feature information corresponds.

Furthermore, when training the discriminator according to the present invention, the image output means preferably outputs to the discriminator at least a pair consisting of the image or image region subjected to the pixel change processing and information on the position of the tracking target region given as the correct answer in that image or image region.

As yet another embodiment of the pixel change processing in the object tracking device of the present invention, when the discriminator is being trained, the mask processing means preferably performs the pixel change processing on a shifted processing target region whose position is displaced from the determined processing target region.
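One way to obtain such a shifted processing target region is sketched below. The uniform random shift, its bound `max_shift`, and the function name `jitter_box` are assumptions made for illustration; this passage does not say how the displacement is chosen. A plausible reading is that training on displaced masks makes the discriminator tolerant of imperfectly placed masks at tracking time, when the positions of the other targets are themselves estimates.

```python
import random
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def jitter_box(box: Box, max_shift: int, width: int, height: int,
               rng: random.Random) -> Box:
    """Return `box` displaced by a random offset of at most `max_shift` pixels per axis.

    The offset is clamped so that the shifted box stays inside a width x height
    image; the box size is unchanged. Assumes the input box lies inside the image.
    """
    left, top, right, bottom = box
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    dx = max(-left, min(dx, width - right))    # keep 0 <= left and right <= width
    dy = max(-top, min(dy, height - bottom))   # keep 0 <= top and bottom <= height
    return (left + dx, top + dy, right + dx, bottom + dy)
```

During training, the mask would then be applied to `jitter_box(other_box, ...)` rather than to `other_box` itself, while the correct-answer position of the one tracking target is left unchanged.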

As another embodiment of the discriminator, the image output means preferably outputs at least the image or image region at one time point, subjected to the pixel change processing, to a discriminator that learns online, and the device preferably further includes target position determination means for determining the position of the one tracking target at that time point based on the output of the online-learning discriminator.

According to the present invention, there is also provided an object tracking program that causes a computer to function, the computer being mounted on a device that uses a time-series image group that can include a plurality of tracking targets and that tracks each tracking target by determining its position in each image based on the output of a discriminator to which the image, or an image region within the image, is input, the program causing the computer to function as:
mask processing means for performing pixel change processing on a processing target region in an image included in the image group or in an image region within the image, the processing target region being determined based on a position determined, at a past time point or as a correct answer, for another tracking target other than one tracking target, and the pixel change processing changing the region to a pixel pattern in which the features of the other tracking target are eliminated or reduced; and
image output means for outputting the image or image region subjected to the pixel change processing to the discriminator for learning and/or discrimination of the one tracking target.

According to the present invention, there is further provided an object tracking method performed by a computer mounted on a device that uses a time-series image group that can include a plurality of tracking targets and that tracks each tracking target by determining its position in each image based on the output of a discriminator to which the image, or an image region within the image, is input, the method comprising:
a step of performing pixel change processing on a processing target region in an image included in the image group or in an image region within the image, the processing target region being determined based on a position determined, at a past time point or as a correct answer, for another tracking target other than one tracking target, and the pixel change processing changing the region to a pixel pattern in which the features of the other tracking target are eliminated or reduced; and
a step of outputting the image or image region subjected to the pixel change processing to the discriminator for learning and/or discrimination of the one tracking target.

According to the object tracking device, program, and method of the present invention, a plurality of objects can be tracked more reliably.

FIG. 1 is a schematic diagram showing an embodiment of an object tracking system including an object tracking device according to the present invention.
FIG. 2 is a functional block diagram showing the functional configuration of an embodiment of the object tracking device according to the present invention.
FIG. 3 is a schematic diagram showing an embodiment of the pixel change processing (mask processing) according to the present invention.
FIG. 4 is a schematic diagram showing another embodiment of the pixel change processing (mask processing) according to the present invention.
FIG. 5 is a schematic diagram showing yet another embodiment of the pixel change processing (mask processing) according to the present invention.
FIG. 6 is a schematic diagram showing an embodiment of the discriminator according to the present invention.
FIG. 7 is a schematic diagram showing another embodiment of the discriminator according to the present invention.

Embodiments of the present invention are described in detail below with reference to the drawings.

[Object Tracking System]
FIG. 1 is a schematic diagram showing an embodiment of an object tracking system including an object tracking device according to the present invention.

The object tracking system of this embodiment, shown in FIG. 1, includes:
(a) one or more cameras 2 that can photograph the objects to be tracked and can transmit information on the captured images in time series over a communication network; and
(b) an object tracking device 1 that can track an object by predicting its position information from the time-series image group acquired from the cameras 2 over the communication network.

The objects to be tracked may be anything that can be photographed, such as people, animals, vehicles, and other physical objects that can move. The place photographed is not particularly limited either. It may be outdoors, where spectators, commuters, shoppers, workers, pedestrians, runners, and the like can appear as tracking targets, or indoors, such as inside a company, school, home, or store.

This embodiment assumes an environment that conventional techniques have handled poorly: one in which the objects (people) to be tracked are several or many, and they walk together, pass one another, or overlap in the image.

The communication network that carries the image information may be, for example, a wireless LAN (Local Area Network) such as Wi-Fi (registered trademark). Alternatively, the cameras 2 and the object tracking device 1 may be connected over the Internet through a wireless access network such as LTE (Long Term Evolution), WiMAX (Worldwide Interoperability for Microwave Access), or 3G (3rd Generation).

The cameras 2 and the object tracking device 1 may also be connected over the Internet through a fixed access network such as an optical fiber network or ADSL (Asymmetric Digital Subscriber Line), or through a private network. As a variation, the cameras 2 and the object tracking device 1 may be connected directly by wire. A camera control device (not shown) may also be provided that collects the image information output from a plurality of cameras 2 and transmits it to the object tracking device 1.

As also shown in FIG. 1, the object tracking device 1 tracks each tracking target by determining its position in each image based on the output of a discriminator 11 to which an image of the acquired time-series image group, or an image region (image patch) within that image, is input. The device is characterized by having:
(A) a mask processing unit 111 that performs, on a "processing target region" in an image of the time-series image group or in an image region within that image, "pixel change processing" that changes the region to a pixel pattern in which the features of tracking targets other than one tracking target are eliminated or reduced; and
(B) an image output unit 113 that outputs the image or image region subjected to the "pixel change processing" to the discriminator 11 for learning and/or discrimination of the one tracking target.

Here, the "processing target region" in (A) is determined based on a position determined, at a past time point or as a correct answer, for a tracking target other than the one tracking target. Specific examples of the "pixel change processing" in (A) include the following:
(A1) The "processing target region" may be changed to a region having a predetermined pixel pattern. For example, it can be filled with a single color such as black.
(A2) The "processing target region" may be changed to a region having a pixel pattern determined from image regions other than the tracking targets. For example, a background color for the tracking targets can be determined and the region filled with that color.
(A3) Processing that blurs the appearance of the other tracking target may be applied to the "processing target region". For example, the region can be blurred using a color calculated from the surrounding colors.
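The three forms (A1) to (A3) can be sketched as follows. This is an illustration under simplifying assumptions, not the implementation of mask processing unit 111: the image is a grayscale nested list, the processing target region is an axis-aligned rectangle inside the image, the background value in (A2) is approximated by the mean of all pixels outside the rectangle, and (A3) uses a plain box blur. All function names are hypothetical.

```python
from typing import List, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def fill_constant(img: Image, box: Box, value: float = 0.0) -> Image:
    """(A1) Replace the processing target region with a fixed value (0.0 = black)."""
    left, top, right, bottom = box
    out = [row[:] for row in img]  # copy so the input frame is not modified
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = value
    return out


def fill_background(img: Image, box: Box) -> Image:
    """(A2) Fill the region with the mean of the pixels outside it.

    Simplification: every pixel outside `box` is treated as background.
    """
    left, top, right, bottom = box
    outside = [v for y, row in enumerate(img) for x, v in enumerate(row)
               if not (left <= x < right and top <= y < bottom)]
    if not outside:                # region covers the whole image: nothing to sample
        return [row[:] for row in img]
    return fill_constant(img, box, sum(outside) / len(outside))


def blur_region(img: Image, box: Box, radius: int = 1) -> Image:
    """(A3) Box-blur the region: each pixel becomes the mean of its neighbourhood."""
    left, top, right, bottom = box
    height, width = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(top, bottom):
        for x in range(left, right):
            window = [img[j][i]
                      for j in range(max(0, y - radius), min(height, y + radius + 1))
                      for i in range(max(0, x - radius), min(width, x + radius + 1))]
            out[y][x] = sum(window) / len(window)
    return out
```

Each function returns a new image, so the same frame can be masked differently for each tracking target.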

Naturally, the "pixel change processing" is not limited to forms (A1) to (A3); any processing can be adopted as long as it changes the region to a pixel pattern in which the features of the other tracking target are eliminated or reduced.

Thus, when tracking a plurality of objects, the object tracking device 1 applies the "pixel change processing" to the objects other than the one object currently being processed, thereby eliminating or reducing the features of those other objects in the image or image region (image patch) input to the discriminator 11. Using images or image regions processed in this way to train the discriminator 11 and/or to have it perform discrimination makes it possible to keep identifying the one object more reliably, without confusing it with the others.
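The per-target flow described above can be sketched as a loop over tracked IDs. The sketch assumes rectangular regions taken from the previous time point and a constant fill; `track_step`, `mask_boxes`, and the `Discriminator` callable type are hypothetical names, and `brightest_pixel_stub` is only a toy stand-in for the discriminator 11, included so that the loop can be run.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive
Discriminator = Callable[[Image, Box], Box]  # (masked image, previous box) -> new box


def mask_boxes(img: Image, boxes: Iterable[Box], fill: float = 0.0) -> Image:
    """Return a copy of `img` with every box in `boxes` filled with `fill`."""
    out = [row[:] for row in img]
    for left, top, right, bottom in boxes:
        for y in range(top, bottom):
            for x in range(left, right):
                out[y][x] = fill
    return out


def track_step(frame: Image, prev_boxes: Dict[int, Box],
               discriminator: Discriminator) -> Dict[int, Box]:
    """For each ID, mask all other IDs' previous regions, then locate that ID."""
    new_boxes: Dict[int, Box] = {}
    for target_id, target_box in prev_boxes.items():
        others = [box for other_id, box in prev_boxes.items() if other_id != target_id]
        masked = mask_boxes(frame, others)
        new_boxes[target_id] = discriminator(masked, target_box)
    return new_boxes


def brightest_pixel_stub(img: Image, prev_box: Box) -> Box:
    """Toy stand-in for discriminator 11: moves the box to the brightest pixel."""
    height, width = len(img), len(img[0])
    y, x = max(((j, i) for j in range(height) for i in range(width)),
               key=lambda p: img[p[0]][p[1]])
    box_w, box_h = prev_box[2] - prev_box[0], prev_box[3] - prev_box[1]
    return (x, y, x + box_w, y + box_h)
```

With two similar bright blobs in a frame, the toy stub applied to the unmasked frame jumps to the brighter blob regardless of which ID is being processed, which mimics drift; inside `track_step` the other blob is masked first, so each ID stays on its own blob.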

In other words, using the "image object mask method" of the present invention described above makes it possible to suppress drift, a long-standing problem, even when an object of similar appearance is near the object being tracked. Here, drift is the phenomenon in which the image region of a nearby similar object is taken to be the correct region and the tracker starts following the wrong object.

The "image object mask method" of the present invention can be applied both to tracking processing that does not perform online learning, adopted to avoid growth in memory consumption (computational cost), and to tracking processing that performs online learning, which has been the mainstream approach. When applied to multiple-object tracking without online learning, it makes it possible both to limit the growth in memory consumption (computational cost) and to suppress drift. When applied to tracking with online learning, it further reduces drift.

In either case, the "image object mask method" of the present invention suppresses mistracking and makes multiple-object tracking more reliable; it is particularly important for realizing the former case, multi-target tracking without online learning.

The time-series image group handled by the device 1 is not limited to image data generated by camera photography as in this embodiment. Any data related to the actual position or appearance of a tracking target qualifies. For example, depth value information (for each pixel of the target) generated by a depth camera can also be used as image data.

In this embodiment, the object tracking device 1 calculates information on real-space position from the image information showing a tracked object, using a coordinate transformation that converts position coordinates (u, v) in the image coordinate system u-v defined on each successively acquired image into position coordinates (gx, gy, gz) in the world coordinate system Gx-Gy-Gz defined in real space. For example, if the position of a tracked object in the image changes from (u, v) at the previous time t-1 to (u', v') at the current time t, the object is estimated to have moved in real space (the observed space) from (gx, gy, gz) at time t-1 to (gx', gy', gz') at time t, and the change in real-space position since time t-1 can be obtained.

The times used here are set at each elapse of a unit time, taken as 1, so the time immediately before time t is t-1. The coordinate transformation from the image coordinate system to the world coordinate system described above can be determined by setting, through calibration in advance, the external parameters for the installation position and shooting direction of each camera 2. Even when images are acquired from each of a plurality of cameras 2, the images can be integrated into a single image space to which the image coordinate system is applied.

In this way, in the present embodiment, the object tracking device 1 can estimate the position information of the tracking target object in real space (position information in the world coordinate system Gx-Gy-Gz) based on the successively acquired image information (position information in the image coordinate system u-v).
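The conversion from image coordinates to world coordinates can be illustrated with a minimal sketch. It assumes, purely for illustration, that the tracked points lie on the ground plane gz = 0 and that calibration has produced a 3x3 homography H; the names image_to_world and world_displacement are hypothetical and do not appear in this description.

```python
def image_to_world(u, v, H):
    """Map pixel (u, v) to world (gx, gy, gz), assuming the ground plane gz = 0.

    H is a hypothetical 3x3 homography (list of rows) obtained by calibration.
    """
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    s = H[2][0] * u + H[2][1] * v + H[2][2]
    if s == 0:
        raise ValueError("pixel maps to a point at infinity")
    return (x / s, y / s, 0.0)


def world_displacement(prev_uv, curr_uv, H):
    """Real-space change between the positions at time t-1 and time t."""
    prev_world = image_to_world(prev_uv[0], prev_uv[1], H)
    curr_world = image_to_world(curr_uv[0], curr_uv[1], H)
    return tuple(c - p for p, c in zip(prev_world, curr_world))
```

With several cameras, one such mapping per camera (or one mapping for the integrated image space) would be needed; that choice is left open here.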

[Device configuration, object tracking method]
FIG. 2 is a functional block diagram showing the functional configuration of an embodiment of the object tracking device according to the present invention.

According to FIG. 2, the object tracking device 1 includes a communication interface 101 that can be communicatively connected to one or more cameras 2, an image storage unit 102, a mask-processed image storage unit 103, an identification model storage unit 104, a target information storage unit 105, and a processor/memory. Here, the processor/memory realizes the object tracking function by executing a program that causes the computer of the object tracking device 1 to function.

Furthermore, the processor/memory includes, as functional components, a mask processing unit 111, a tracking region determination unit 112, a discriminator 11, an image output unit 113 including a learning unit 113a and an identification unit 113b, a target position determination unit 114, a tracking target management unit 115, and a communication control unit 121. The processing flow shown in FIG. 2 by the arrows connecting the functional components of the object tracking device 1 can also be understood as an embodiment of the object tracking method according to the present invention.

Also in FIG. 2, the camera 2 is an imaging device for visible light, near-infrared light, or infrared light that includes a solid-state image sensor such as a CCD image sensor or a CMOS image sensor. As described above, a depth camera can also be used as the camera 2. The camera 2, or a camera control device (not shown), has a function of generating captured image data that includes images of objects captured by the camera 2 and transmitting the data to the object tracking device 1 in time series or in batches. It is also preferable that the camera 2 be movable so that its installation position, shooting direction, and height can be changed, and that it have a function of receiving and processing control signals for such changes.

The communication interface 101 receives captured image data, which is a time-series image group, from the camera 2 or the camera control device via a communication network. Transmission and reception using the communication interface 101 and the processing of communication data are controlled by the communication control unit 121, and the acquired captured image data (image files) are stored in the image storage unit 102. Here, the captured image data may be retrieved from the camera 2 or the camera control device in chronological order, or may be obtained by sequentially acquiring images captured in real time at fixed time intervals.

The mask processing unit 111 performs mask processing (pixel change processing) on a processing target region in an image included in the time-series image group read from the image storage unit 102, or in an image region within that image. The mask processing changes the processing target region to a pixel pattern in which the features of objects other than the one object currently being tracked are eliminated or reduced.

Here, the processing target region to be masked is determined based on the position determined for these other objects (a) at a past time point or (b) as a correct answer. For example, the processing target region at time t may be the correct image region of the other object at time t−1. Incidentally, the mask processing unit 111 can acquire information on the tracking target regions determined in the image (or image region) at a past time from the target position determination unit 114 described later.

In any case, the mask processing of the present embodiment is processing that prevents the appearance of other objects from showing up at all, or from showing up clearly, in the image or image region (image patch) input to the discriminator 11. It is also preferable that the mask-processed images or image regions produced by the mask processing unit 111 be stored in the mask-processed image storage unit 103 and read out for use as appropriate.

Here, the mask processing performed by the mask processing unit 111 includes the following three types, which have already been described briefly:
(A1) processing that changes the processing target region to a region having a predetermined pixel pattern;
(A2) processing that changes the processing target region to a region having a pixel pattern determined based on the image region that forms the background of the tracking targets; and
(A3) processing that blurs the appearance of the other tracking targets in the processing target region.

As a specific example of the mask processing (A1), the processing target region may be filled with a single color such as black. Incidentally, when the discriminator 11 is trained using mask-processed images or image regions of this kind (as training data), this can be regarded as actively performing negative learning, namely that the portion with the predetermined pixel pattern (for example a single-color pattern such as black) is not the region of the tracking target object. In other words, the discriminator can be said to learn that the region of the tracking target object lies within the image region excluding the portion with the predetermined pixel pattern.

As a specific example of the mask processing (A2), a background model of the image may be learned using a known background modeling technique, and the processing target region may be filled with the learned background color. Here, the background model may, for example, take the color of each pixel to be its average color over a plurality of times, or may model the distribution of the background color at each pixel with a Gaussian mixture distribution.

Background modeling techniques using such a Gaussian mixture distribution are described, for example, in the non-patent literature: P. KaewTraKulPong and R. Bowden, "An improved adaptive background mixture model for real-time tracking with shadow detection", Video-Based Surveillance Systems, Computer Vision and Distributed Processing, 2002, pp. 134-144, and in the non-patent literature: T. Bouwmans, F. El Baf, B. Vachon, "Background Modeling using Mixture of Gaussians for Foreground Detection - A Survey", Recent Patents on Computer Science, 1, 2008, pp. 219-237.

Furthermore, in the mask processing (A3), the region may be blurred with a color calculated from the colors surrounding the other tracking targets, and the mask processing target region can be blurred by the same kind of technique as image smoothing. Specifically, the smoothing may be performed by setting a predetermined kernel and convolving the mask processing target region with that kernel.
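The three kinds of mask processing (A1) to (A3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: images are grayscale lists of rows, a region is given as pixel bounds (u0, v0, u1, v1) with exclusive upper bounds, the background model is the simple per-pixel mean, and the blur uses a uniform (box) kernel. All function names are hypothetical.

```python
def mask_fill(image, box, value=0):
    """(A1) Paint the box with a single predetermined value (e.g. black)."""
    u0, v0, u1, v1 = box
    out = [row[:] for row in image]
    for v in range(v0, v1):
        for u in range(u0, u1):
            out[v][u] = value
    return out


def mean_background(frames):
    """Per-pixel mean over several frames, used as a simple background model."""
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [[sum(f[v][u] for f in frames) / n for u in range(width)]
            for v in range(height)]


def mask_background(image, box, background):
    """(A2) Replace the box with the corresponding background-model pixels."""
    u0, v0, u1, v1 = box
    out = [row[:] for row in image]
    for v in range(v0, v1):
        for u in range(u0, u1):
            out[v][u] = background[v][u]
    return out


def mask_blur(image, box, radius=1):
    """(A3) Smooth the box by convolving with a uniform (2r+1)x(2r+1) kernel."""
    u0, v0, u1, v1 = box
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for v in range(v0, v1):
        for u in range(u0, u1):
            # Average over the kernel window, truncated at the image border.
            window = [image[y][x]
                      for y in range(max(0, v - radius), min(height, v + radius + 1))
                      for x in range(max(0, u - radius), min(width, u + radius + 1))]
            out[v][u] = sum(window) / len(window)
    return out
```

Each function returns a new image and leaves its input untouched, so the unmasked image remains available when the next object is processed.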

FIG. 3 is a schematic diagram showing an embodiment of the pixel change processing (mask processing) according to the present invention.

As shown in FIG. 3(A), the following describes the mask processing used when tracking is performed with two images taken from the acquired time-series image group: the image at the current time t, which is the present point in the tracking process, and the image at the immediately preceding time t' (= t−1). A specific method of inputting these two images to the discriminator 11 to perform tracking will be described in detail later with reference to FIG. 6. As will also be described in detail later, tracking with mask processing can likewise be performed using three, or four or more, images including an image at a time t'' earlier than time t' (t'' < t').

In the present embodiment, the tracking process is performed by estimating, from the two images at time t and time t', the tracking target region (c_u^t, c_v^t, w^t, h^t), which is a rectangular image region for the object i that is one tracking target at time t. Here, c_u^t and c_v^t are respectively the u coordinate and the v coordinate of the center of the tracking target region (the center of the object i) in the image coordinate system, and w^t and h^t are respectively the width and the height of the tracking target region.

First, as shown in FIG. 3(B), in each of the images at time t and time t' (= t−1), the mask processing target region, which is the image region to be masked, is determined based on the tracking target region (c_u^{t'}, c_v^{t'}, w^{t'}, h^{t'}) of each object other than the object i at time t'. For example, the mask processing target region can be the region (c_u^{t'}, c_v^{t'}, A・w^{t'}, B・h^{t'}), which contains the tracking target region of the object other than the object i.

Here, if A = B = 1, the mask processing target region is exactly the tracking target region of the object other than the object i. How large (in area) the mask processing target region is made is a design matter that can greatly affect tracking accuracy. For example, it is also preferable to set a larger area within the range in which the probability of overlapping the tracking target region of the object i remains at or below a predetermined value.
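The scaling of a tracking target region by the factors A and B can be sketched as follows. The helper names scale_region and to_bounds are hypothetical, and a region is assumed to be a tuple (c_u, c_v, w, h) of center coordinates, width, and height.

```python
def scale_region(region, a=1.0, b=1.0):
    """Return (c_u, c_v, A*w, B*h): same center, scaled width and height."""
    cu, cv, w, h = region
    return (cu, cv, a * w, b * h)


def to_bounds(region):
    """Convert (c_u, c_v, w, h) to corner bounds (u_min, v_min, u_max, v_max)."""
    cu, cv, w, h = region
    return (cu - w / 2, cv - h / 2, cu + w / 2, cv + h / 2)
```

With a = b = 1 the region is returned unchanged, which corresponds to masking exactly the other object's tracking target region.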

Incidentally, in the embodiment described above that uses three or more images, the mask processing target region can likewise be determined based on the tracking target region of each object other than the object i at one particular time (for example, the most recent time other than the current time t, an intermediate time, or the oldest time). In any case, the mask processing unit 111 (FIG. 2) performs the mask processing (pixel change processing) described above on the mask processing target region determined in this way.

Next, as also shown in FIG. 3(B), a reference image region is set for each of the mask-processed images at time t and time t' (= t−1), and a crop target region is determined from this reference image region. The image portion in this crop target region is later cropped from each image and becomes the image region (image patch) input to the discriminator 11.

Here, when the reference image region is the tracking target region (c_x^{t'}, c_y^{t'}, w^{t'}, h^{t'}) at time t' (= t−1), the crop target region may be, for example, (c_x^{t'}, c_y^{t'}, C・w^{t'}, D・h^{t'}). If C = D = 2, the crop target region contains the tracking target region and has four times its area.
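Cropping an image patch from the crop target region might look like the following sketch. It assumes an image stored as a list of rows and clamps the crop to the image border; how the embodiment handles regions that extend past the border is not stated, so the clamping is an assumption, as are the names crop_region and crop_patch.

```python
def crop_region(track_region, c=2.0, d=2.0):
    """Crop target region (c_x, c_y, C*w, D*h) around the tracking target region."""
    cx, cy, w, h = track_region
    return (cx, cy, c * w, d * h)


def crop_patch(image, region):
    """Cut the region out of the image, clamped to the image border (assumption)."""
    cx, cy, w, h = region
    height, width = len(image), len(image[0])
    x0 = max(0, int(cx - w / 2))
    y0 = max(0, int(cy - h / 2))
    x1 = min(width, int(cx + w / 2))
    y1 = min(height, int(cy + h / 2))
    return [row[x0:x1] for row in image[y0:y1]]
```

For C = D = 2 the width and height both double, so the patch covers four times the area of the tracking target region when it lies fully inside the image.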

Incidentally, in the embodiment described above that uses three or more images, the crop target region can likewise be determined based on the tracking target region of an object other than the object i at one particular time (for example, the most recent time other than the current time t, an intermediate time, or the oldest time). In any case, the image portion (image patch) is then cropped from the determined crop target region and input to the discriminator 11, and the output of the discriminator 11 is used to determine the tracking target region of the object i at time t or to train the discriminator 11.

Naturally, the reference image region that serves as the basis for the mask processing target region and the crop target region is not limited to the tracking target region at a time before the current time t. That is, the reference image region may be the same image region (the same coordinate range) in the image at every time, as described above, but it can also be set so as to differ from time to time. For example, in the image at each time, the tracking target region at an earlier time (for example, the immediately preceding time) may be used as the reference image region, or the image region at that time predicted from the trajectory information of the object position may be used as the reference image region.

FIG. 4 is a schematic diagram showing another embodiment of the pixel change processing (mask processing) according to the present invention.

In the present embodiment, as shown in FIG. 4(A), measures are taken to maintain or improve identification accuracy when, in the mask-processed image at one time, the tracking target region of the object i, which is one tracking target, overlaps the mask processing target region of another object. With such an overlap, part of the image information for the object i is replaced by the pixel information produced by the mask processing, so the discriminator 11 either fails to receive part of the correct information for identification or learning from the input mask-processed image (or image patch) or, conversely, takes in incorrect information.

Therefore, in the present embodiment, as shown in FIG. 4(B), when even one of the one or more mask processing target regions has an overlapping region that overlaps the tracking target region of the object i, the mask processing (pixel change processing) is not applied to that overlapping region (or those overlapping regions). This allows the discriminator 11 to receive all of the correct information within the tracking target region.
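Leaving the overlapping region unmasked can be sketched as below. Regions are assumed to be integer pixel bounds (u0, v0, u1, v1) with exclusive upper bounds, and the mask is a single-value fill; the names intersect and mask_excluding_target are hypothetical.

```python
def intersect(a, b):
    """Overlap of two boxes (u0, v0, u1, v1), or None if they do not overlap."""
    u0, v0 = max(a[0], b[0]), max(a[1], b[1])
    u1, v1 = min(a[2], b[2]), min(a[3], b[3])
    return (u0, v0, u1, v1) if u0 < u1 and v0 < v1 else None


def mask_excluding_target(image, mask_boxes, target_box, value=0):
    """Fill every mask box except pixels inside the target's tracking region."""
    tu0, tv0, tu1, tv1 = target_box
    out = [row[:] for row in image]
    for u0, v0, u1, v1 in mask_boxes:
        for v in range(v0, v1):
            for u in range(u0, u1):
                inside_target = tu0 <= u < tu1 and tv0 <= v < tv1
                if not inside_target:
                    out[v][u] = value
    return out
```

The intersect helper is only needed to report which overlapping regions exist; the masking itself skips target pixels directly.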

As a modification, as shown in FIG. 4(C), when even one of the one or more mask processing target regions has an overlapping region that overlaps the tracking target region of the object i, it is also preferable to change the tracking target region of the object i to a region that excludes that overlapping region (or those overlapping regions). Incidentally, this region change processing is performed by the tracking region determination unit 112 (FIG. 2).

Here, the crop target region that determines the image patch input to the discriminator 11 is set based on this changed tracking target region. When an image patch from which the overlapping region has been excluded in this way is input to the discriminator 11, the entire tracking target region of the object i may be determined from the region estimated from the output of the discriminator 11 by undoing the earlier change.

FIG. 5 is a schematic diagram showing yet another embodiment of the pixel change processing (mask processing) according to the present invention.

In the present embodiment, measures are taken so that the mask processing target region does not deviate greatly from the actual image position of an object other than the object i when the times of the images are separated from each other by a predetermined amount or more, as is particularly the case when three or more images are used.

For example, as shown in FIG. 5(A), consider the case where a group of two or more images, including the image at the current time t and an image at a time t'' (t'' < t) that precedes the current time t by a predetermined amount or more, is taken from the time-series image group and subjected to mask processing. In this case, in the image at time t'', the mask processing may be performed using, for example, the tracking target region of each object other than the object i as the mask processing target region, whereas in the image at time t it is also preferable to perform the mask processing on the approximate position range of each of these other objects.

Here, as the approximate position range of each other object at time t, for example, as shown in FIG. 5(B), the average velocity vector v(t'') of each other object around time t'' may be calculated, the estimated movement vector r from time t'' to time t may be determined by the following equation
(1) r = v(t'')・(t''−t)
and the region obtained by translating the tracking target region of each other object at time t'' by that object's estimated movement vector r may be used as the mask processing target region of that object in the image at time t.
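Translating a mask processing target region by an estimated movement can be sketched as follows. The sketch multiplies the average velocity by the elapsed time t − t'', so that the region is shifted forward along the velocity; this sign convention, the way the average velocity is taken from the first and last samples of a short track, and the names average_velocity and predict_mask_region are all assumptions made for illustration.

```python
def average_velocity(track):
    """Average velocity from a track of (time, c_u, c_v) samples sorted by time."""
    t0, u0, v0 = track[0]
    t1, u1, v1 = track[-1]
    dt = t1 - t0
    if dt == 0:
        raise ValueError("track must span a non-zero time interval")
    return ((u1 - u0) / dt, (v1 - v0) / dt)


def predict_mask_region(region_at_t2, velocity, t2, t):
    """Translate the region at time t2 (= t'') to its estimated position at time t."""
    elapsed = t - t2  # assumption: positive elapsed time, shift along the velocity
    cu, cv, w, h = region_at_t2
    return (cu + velocity[0] * elapsed, cv + velocity[1] * elapsed, w, h)
```

Only the center of the region moves; the width and height are carried over unchanged from time t''.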

As a modification, the position of each other object at time t may instead be predicted based on trajectory information for its position from time t'' to time t−1, and the mask processing target region may be determined from that prediction.

As a further modification, when training the discriminator 11, the mask processing unit 111 (FIG. 2) may perform the mask processing on a shifted processing target region whose position is displaced from the mask processing target region determined as described above. Here, the displacement may be a random amount within a predetermined range in a random direction, or may be a predetermined amount in a predetermined direction set in advance. By learning in advance from mask-processed images (image patches) with such shifted processing target regions, the discriminator 11 can perform more appropriate detection (identification) at actual detection time on mask-processed images (image patches) in which the masked region may well be displaced to some extent from the actual position of the object that should be masked.
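A random displacement of the mask processing target region for training can be sketched as below. The uniform sampling of direction and amount, the parameter max_shift, and the name jitter_region are assumptions; the description only requires a random amount within a predetermined range in a random direction.

```python
import math
import random


def jitter_region(region, max_shift, rng=random):
    """Shift (c_u, c_v, w, h) by a random amount up to max_shift in a random direction."""
    cu, cv, w, h = region
    angle = rng.uniform(0.0, 2.0 * math.pi)   # random direction
    amount = rng.uniform(0.0, max_shift)      # random shift within the range
    return (cu + amount * math.cos(angle), cv + amount * math.sin(angle), w, h)
```

Passing a seeded random.Random instance as rng makes the training-time shifts reproducible.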

Returning to FIG. 2, the image output unit 113 outputs to the discriminator 11 the images or image regions at one or more times that have undergone the mask processing (pixel change processing) described above. Here, the learning unit 113a of the image output unit 113 outputs the mask-processed image or mask-processed image region (image patch) to the discriminator 11 for learning of the object i, which is one tracking target.

Meanwhile, the identification unit 113b of the image output unit 113 outputs the mask-processed image or mask-processed image patch to the discriminator 11 for identification of the object i. When tracking a plurality of objects (multi-tracking), mask processing is performed in which each object to be tracked is treated in turn as the object i, and the discriminator 11 is made to identify each object in sequence (that is, to determine the position of its image region) using the resulting mask-processed image (image patch).
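The sequential treatment of each object as the object i can be sketched as a loop. The callables mask_fn and discriminator stand in for the mask processing unit 111 and the discriminator 11; their signatures, and the name track_all, are hypothetical.

```python
def track_all(frame, prev_regions, mask_fn, discriminator):
    """Track every object in turn, masking all the other objects each time.

    prev_regions maps an object id to its tracking target region at the
    previous time. mask_fn(frame, other_regions) returns a masked frame, and
    discriminator(masked_frame, prev_region) returns the new region.
    """
    new_regions = {}
    for obj_id, prev_region in prev_regions.items():
        # Regions of all objects other than the current object i.
        others = [r for k, r in prev_regions.items() if k != obj_id]
        masked = mask_fn(frame, others)
        new_regions[obj_id] = discriminator(masked, prev_region)
    return new_regions
```

Each object sees a differently masked copy of the same frame, so mask_fn should not modify the frame in place.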

Incidentally, mask-processed images (mask-processed image patches) may be input only when the discriminator 11 performs identification. In that case, the discriminator 11 is trained by inputting images (image patches) that have not undergone mask processing. Mask-processed images (mask-processed image patches) may also be input only when training the discriminator 11, but it is also preferable that they be input for both learning and identification. That is, it is also preferable to perform identification on mask-processed images (image patches) using a discriminator 11 that has been trained on mask-processed images (mask-processed image patches).

Next, two embodiments of the discriminator 11 (FIG. 2), a non-online learning type and an online learning type, will be described with reference to FIG. 6 and FIG. 7, respectively.

FIG. 6 is a schematic diagram showing an embodiment of the discriminator according to the present invention.

As shown in FIG. 6, the discriminator 11 in the present embodiment includes:
(a) a plurality of convolutional layer portions (Convolutional Layers), serving as first neural networks, each of which receives the image or image region (image patch) at one time point and outputs feature information on its features; and
(b) a fully connected layer portion (Fully-Connected Layers), serving as a second neural network, which receives the plurality of pieces of feature information output from these convolutional layer portions and outputs information on the position of one tracking target.

Here, the convolutional layer portions in (a) have a function modeled on the behavior of simple cells in the animal visual cortex, and execute convolution processing that generates a feature map by sliding a kernel (a weighting matrix filter) over the image. This convolution processing extracts basic features such as edges and gradients while reducing the image resolution step by step, and yields information on local correlation patterns.

For example, AlexNet, which uses five convolutional layers, can be used as the convolutional layer portion. In AlexNet, each convolutional layer is paired with a pooling layer, and the convolution processing and the pooling processing are repeated. Here, the pooling processing is modeled on the behavior of complex cells in the animal visual cortex; it summarizes the feature map output from a convolutional layer (the responses of the convolution filter within a given area) by its maximum value, average value, or the like, thereby securing local translation invariance while reducing the number of parameters to be adjusted. AlexNet is described, for example, in Krizhevsky, A., Sutskever, I., and Hinton, G. E., "Imagenet classification with deep convolutional neural networks", Advances in Neural Information Processing Systems 25, 2012, pp. 1106-1114.

Furthermore, the neural network described in Non-Patent Document 2, for example, may be adopted as the convolutional layer portions and the fully connected layer portion described above. Incidentally, in the present embodiment the number of convolutional layer portions (first neural networks) may be two, but it is also preferable to use three or more, as shown in FIG. 6, in order to achieve higher identification accuracy (tracking accuracy).

The discriminator 11 is trained offline in advance, before performing the tracking target detection (identification) processing at each time point. Specifically, it is trained by receiving from the learning unit 113a of the image output unit 113 a large number of data sets, each including:
(a) a set of two, or three or more, images (or image patches) that precede and follow one another in time; and
(b) a four-dimensional vector (CU_C, CV_C, CW, CH) that defines the position and extent of the correct tracking target region in each image (or each image patch).

Here, CU_C and CV_C are respectively the object center in the u-axis direction and the object center in the v-axis direction of the correct image region, and CW and CH are respectively the width and the height of the rectangle of the correct image region. When the input is an image patch, the coordinate system on which the CU_C and CV_C values are based is the local coordinate system of that image patch. It is also preferable that the input image (image patch) be a mask-processed image (image patch) that has undergone the mask processing described above.

It is also preferable that the identification model generated by the discriminator 11 trained in this way be stored in the identification model storage unit 104 (FIG. 2). For example, the stored identification model may be ported to an external discriminator and used there.

Next, in detecting (identifying) the plurality of objects to be tracked, the plurality of convolutional layer portions of the discriminator 11 receive, from the identification unit 113b (FIG. 2) of the image output unit 113, a group of mask-processed images (image patches) including the mask-processed image (image patch) at time t. Here, it is also preferable that the convolutional layer portions receive the images at a plurality of times separated by the time intervals set during learning, in the same temporal order (assignment of times) as during learning.

In FIG. 6, the mask-processed image (image patch) at the current time t is input to the uppermost convolutional layer portion in the figure, the mask-processed image (image patch) at time t' (t' < t) is input to the second convolutional layer portion from the top, and the mask-processed image (image patch) at time t'' (t'' < t') is input to the third convolutional layer portion from the top. In the present embodiment, during learning, the uppermost, second, and third convolutional layer portions in the figure are trained by receiving images (image patches) or mask-processed images (image patches) at a certain time T, time (T−t+t'), and time (T−t+t''), respectively.

Further, in detecting (identifying) the plurality of objects to be tracked, the discriminator 11 (the plurality of convolution layers) realizes multi-tracking by sequentially receiving images (image patches) that have been mask-processed with each of the plurality of objects in turn as the identification target, and outputting an identification result for each.
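The per-object masking loop described here can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: regions are given as (u, v, w, h) with (u, v) the top-left corner, the mask is a constant fill value, the image is a plain 2-D list, and `discriminator` is any callable standing in for the discriminator 11.

```python
def mask_others(image, boxes, target_id, fill=0):
    """Return a copy of `image` in which the regions of every target other
    than `target_id` are overwritten with `fill`.

    image: 2-D list of pixel values (rows of columns).
    boxes: dict mapping target id -> (u, v, w, h), top-left corner and size
           (an assumed convention for this sketch).
    """
    masked = [row[:] for row in image]
    height, width = len(image), len(image[0])
    for other_id, (u, v, w, h) in boxes.items():
        if other_id == target_id:
            continue
        # Clip the region to the image bounds before overwriting pixels.
        for y in range(max(0, v), min(height, v + h)):
            for x in range(max(0, u), min(width, u + w)):
                masked[y][x] = fill
    return masked


def track_all(image, boxes, discriminator):
    """Feed one masked image per target to the discriminator in turn."""
    return {tid: discriminator(mask_others(image, boxes, tid)) for tid in boxes}


# Illustrative use with a stand-in discriminator that just sums the pixels.
image = [[1] * 6 for _ in range(4)]
boxes = {"A": (0, 0, 2, 2), "B": (3, 1, 2, 2)}
results = track_all(image, boxes, discriminator=lambda img: sum(map(sum, img)))
```

Because each target gets its own masked copy, the original image is left untouched and the same discriminator can be reused for every target.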

Next, the fully connected layer, which receives the feature values output from the convolution layers to which the mask-processed images (image patches) were input as described above, finally outputs a four-dimensional vector (U_C, V_C, W, H). Here, U_C and V_C are the object center in the u-axis direction and the object center in the v-axis direction, respectively, of the tracking target region for the object to be identified in the input mask-processed image (image patch). W and H are the width and height, respectively, of this tracking target region rectangle. When the input is an image patch, the coordinate system on which the U_C and V_C values are based is the local coordinate system of that image patch.

Further, as a modification also shown in FIG. 6, during learning and identification by the discriminator 11, the image output unit 113 (FIG. 2) may output to the fully connected layer unit (second neural network) information on the time points of the images (image patches) corresponding to the feature information output from the convolution layer units (in FIG. 6, the time interval Δt between images), in association with that feature information. This allows the discriminator 11 to also learn information on the times of the plurality of input images (image patches) and, based on it, to perform more accurate detection (identification). It also makes it possible to ensure variety in the times of the plurality of images (image patches) input to the discriminator 11.
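One way to picture "feature information associated with time information" is to append each branch's time interval to that branch's feature vector before the fully connected layer. The sketch below assumes this concatenation scheme, one Δt per branch, and a single linear layer with made-up weights; none of these details are specified in the text, and the helper names are hypothetical.

```python
def build_fc_input(branch_features, time_intervals):
    """Concatenate each branch's feature vector with its time interval."""
    fc_input = []
    for features, dt in zip(branch_features, time_intervals):
        fc_input.extend(features)
        fc_input.append(dt)  # time information tied to this branch's features
    return fc_input


def fully_connected(inputs, weights, biases):
    """One linear layer: out[j] = sum_i inputs[i] * weights[j][i] + biases[j]."""
    return [sum(x * w for x, w in zip(inputs, row)) + b
            for row, b in zip(weights, biases)]


# Three branches with two features each (illustrative numbers only).
branch_features = [[0.5, 1.0], [0.25, 0.75], [0.0, 1.0]]
time_intervals = [0.0, 5.0, 15.0]  # assumed: interval of each image from time t

fc_input = build_fc_input(branch_features, time_intervals)

# Placeholder weights; a trained network would supply these.
weights = [[0.1] * len(fc_input) for _ in range(4)]
biases = [0.0, 0.0, 0.0, 0.0]
u_c, v_c, w, h = fully_connected(fc_input, weights, biases)
```

With this layout the fully connected layer can weight the same appearance features differently depending on how far apart in time the images are.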

In any of the above cases, the discriminator 11 receives from the image output unit 113 images (image patches) in which objects other than the object being tracked at that stage have been mask-processed, so the identification accuracy for the tracked object improves and the occurrence of drift can be suppressed.

FIG. 7 is a schematic diagram showing another embodiment of the discriminator according to the present invention.

According to FIG. 7, the discriminator 11' of this embodiment has a configuration in which a support vector machine (SVM) capable of machine learning is connected to the output side of a convolutional neural network (CNN) that includes convolution layers.

The discriminator 11' learns online from the mask-processed image regions (image patches) (at the current time t) that are input moment by moment from the image output unit 113, and at the same time makes a binary determination, for each mask-processed image patch, as to whether what appears in the patch is the tracking target object. Specifically, while generating and updating a decision boundary surface in the feature space, it calculates the signed distance d from this decision boundary surface as a confidence value and determines whether the confidence value is equal to or greater than a predetermined threshold. Object tracking using such a discriminator is described, for example, in S. Hare, A. Saffari and P. H. S. Torr, "Struck: Structured Output Tracking with Kernels", Publications of International Conference on Computer Vision (ICCV), 2011, pp. 263-270.

Note that the mask-processed image patches (at the current time t) that the image output unit 113 outputs moment by moment can be a plurality of image patch candidates, each of which may be the correct tracking target region at time t. In this case, the discriminator 11' determines the correct tracking target region from among these candidates.
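The confidence test and candidate selection can be illustrated with a linear decision boundary. This is a simplified sketch: it assumes a fixed hyperplane w . x + b = 0 over CNN feature vectors, omits the online update of the boundary, and does not reproduce the structured-output kernel method of the cited Struck paper. The names `signed_distance` and `pick_candidate` and all numbers are illustrative.

```python
import math


def signed_distance(features, weights, bias):
    """Signed distance d from the hyperplane weights . x + bias = 0."""
    norm = math.sqrt(sum(w * w for w in weights))
    return (sum(w * x for w, x in zip(weights, features)) + bias) / norm


def pick_candidate(candidates, weights, bias, threshold):
    """Return (index, d) of the most confident candidate patch, or None if
    no candidate reaches the confidence threshold."""
    scored = [(signed_distance(f, weights, bias), i)
              for i, f in enumerate(candidates)]
    d, index = max(scored)
    return (index, d) if d >= threshold else None


# Illustrative boundary and candidate feature vectors.
weights, bias = [3.0, 4.0], -5.0
candidates = [[1.0, 0.5], [3.0, 4.0], [0.0, 0.0]]

best = pick_candidate(candidates, weights, bias, threshold=1.0)
none_found = pick_candidate(candidates, weights, bias, threshold=5.0)
```

Returning `None` when every candidate falls below the threshold leaves it to the caller to decide how to handle a frame in which the target is not confidently found.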

As described above, the discriminator 11' receives from the image output unit 113 images (image patches) in which objects other than the object being tracked at that stage have been mask-processed, so the identification accuracy for the tracked object improves and the occurrence of drift can be suppressed.

Returning to FIG. 2, the target position determination unit 114 determines the position of the tracking target object at the current time t based on the output from the trained discriminator 11. Specifically, for each of the plurality of objects to be tracked, it finally calculates the tracking target region (c_u^t, c_v^t, w^t, h^t) in the image at time t based on the four-dimensional vector (U_C, V_C, W, H) output from the discriminator 11. As a modification, the target position determination unit 114 may determine the position of the tracking target object at the current time t based on the output from the discriminator 11' (FIG. 7), which learns online.
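When the discriminator's input is an image patch, its output is in the patch's local coordinate system and has to be mapped back to the full image. The text does not give the mapping; the sketch below assumes the patch is an axis-aligned crop that may have been resized, so the mapping is an offset plus a per-axis scale. The function name `to_image_region` and its parameters are hypothetical.

```python
def to_image_region(output, patch_origin, scale=(1.0, 1.0)):
    """Map (U_C, V_C, W, H) in patch-local coordinates to
    (c_u, c_v, w, h) in image coordinates.

    patch_origin: (u0, v0), image coordinates of the patch's top-left corner.
    scale: (s_u, s_v), image pixels per patch pixel along each axis
           (assumed; 1.0 means the patch was not resized).
    """
    u_c, v_c, w, h = output
    u0, v0 = patch_origin
    s_u, s_v = scale
    return (u0 + s_u * u_c, v0 + s_v * v_c, s_u * w, s_v * h)


# A patch cropped at (200, 100) and shrunk by half before being input.
region = to_image_region((32, 40, 20, 50), patch_origin=(200, 100),
                         scale=(2.0, 2.0))
```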

From the information on the tracking target region (c_u^t, c_v^t, w^t, h^t) of each of the plurality of objects to be tracked, as determined by the target position determination unit 114, the tracking target management unit 115 generates and manages, for each of these objects, tracking history information (flow line information) that associates each time with a position in the world coordinate system Gx-Gy-Gz (spanning real space). As one application example, this makes it possible to grasp more accurately the flow lines of the many store clerks and customers who stay and move in a store.
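The tracking history can be pictured as a per-object, time-ordered list of world positions. The following sketch assumes a caller-supplied `to_world` function for the image-to-world conversion, since the text here does not specify one; the `foot_point_to_ground` mapping used in the example is a placeholder, not a real camera calibration, and the class name `TrackHistory` is hypothetical.

```python
from collections import defaultdict


class TrackHistory:
    """Per-object list of (time, world position) entries."""

    def __init__(self, to_world):
        # to_world maps (c_u, c_v, w, h) -> (gx, gy, gz); supplied by the caller.
        self._to_world = to_world
        self._tracks = defaultdict(list)

    def update(self, time, regions):
        """regions: dict mapping object id -> (c_u, c_v, w, h) at `time`."""
        for object_id, region in regions.items():
            self._tracks[object_id].append((time, self._to_world(region)))

    def flow_line(self, object_id):
        """Time-ordered world positions recorded for one object."""
        return [pos for _, pos in sorted(self._tracks[object_id])]


def foot_point_to_ground(region, scale=0.5):
    """Placeholder mapping: bottom-center of the region, scaled onto Gz = 0."""
    c_u, c_v, w, h = region
    return (c_u * scale, (c_v + h / 2) * scale, 0.0)


history = TrackHistory(foot_point_to_ground)
history.update(1, {"person_1": (100, 150, 40, 100)})
history.update(2, {"person_1": (110, 150, 40, 100)})
line = history.flow_line("person_1")
```

Keeping the coordinate conversion as an injected function lets the same history structure be used with whatever camera model a deployment actually has.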

The generated tracking history information (flow line information) is also preferably stored in the target information storage unit 105 each time it is generated or updated, or as appropriate. It may also be transmitted to the external information processing device 3 via the communication control unit 121 and the communication interface 101.

As described above in detail, when tracking a plurality of objects as tracking targets, the present invention performs "pixel change processing" on the objects other than the one object currently being processed for tracking, thereby eliminating or reducing the features of the other objects in the image or image region input to the discriminator. As a result, by using an image or image region processed in this way to train the discriminator and/or to have the discriminator perform identification, the one object can continue to be identified more reliably without being confused with the other objects.

In other words, by using the "image object mask method" of the present invention described above, the drift that has conventionally been a problem can be suppressed even when an object of similar appearance exists near the object being tracked.

Incidentally, this "image object mask method" is a very important technique that makes it possible both to suppress an increase in memory consumption (computational cost) and to suppress the occurrence of drift, particularly when it is applied to multi-tracking processing that does not perform online learning.

The configuration and method of the present invention are applicable to various systems, for example, monitoring systems that watch places where many people move, stay, and enter and exit, and marketing research systems that investigate how people enter stores, take breaks, watch games, take part in events, and move around in shopping streets and commercial or service facilities.

In the various embodiments of the present invention described above, various changes, modifications, and omissions within the scope of the technical idea and viewpoint of the present invention can easily be made by those skilled in the art. The foregoing description is merely an example and is not intended to be limiting. The present invention is limited only as defined by the claims and their equivalents.

DESCRIPTION OF SYMBOLS
1 Object tracking device
101 Communication interface
102 Image storage unit
103 Mask-processed image storage unit
104 Discrimination model storage unit
105 Target information storage unit
11, 11' Discriminator
111 Mask processing unit
112 Tracking region determination unit
113 Image output unit
113a Learning unit
113b Identification unit
114 Target position determination unit
115 Tracking target management unit
121 Communication control unit
2 Camera
3 Information processing device

Claims

A device for tracking a tracking target by using a time-series image group that can include a plurality of tracking targets and determining the position of the tracking target in each image based on an output from a discriminator to which the image or an image region in the image is input, the device comprising:
mask processing means for performing, on a processing target region that is a region in an image included in the image group or in an image region in the image and that is determined based on a position determined at a past time, or given as a correct answer, for another tracking target other than one tracking target, pixel change processing that changes the region to a pixel pattern in which the features of the other tracking target are eliminated or reduced; and
image output means for outputting the image or the image region subjected to the pixel change processing to the discriminator for learning and/or identification of the one tracking target.

The object tracking device according to claim 1, wherein the mask processing means changes at least one processing target region related to the other tracking target to a pixel pattern determined based on an image region other than the tracking targets in the image or the image region, or to a predetermined pixel pattern.

The object tracking device according to claim 2, wherein the mask processing means performs, as the pixel change processing, processing that blurs the appearance of the other tracking target in at least one processing target region related to the other tracking target.

The object tracking device according to claim 1, wherein, when the processing target region related to the other tracking target has an overlap region that overlaps the tracking target region containing the position determined for the one tracking target, the mask processing means does not perform the pixel change processing.

The object tracking device according to claim 1, further comprising tracking target region determining means for, when the processing target region related to the other tracking target has an overlap region that overlaps the tracking target region containing the position determined for the one tracking target, changing the tracking target region related to the one tracking target to a region excluding the overlap region.

The object tracking device according to any one of claims 1 to 5, wherein the image output means outputs the image or the image region subjected to the pixel change processing at at least one time point to the trained discriminator, and
the object tracking device further comprises target position determining means for determining the position of the one tracking target at the one time point based on an output from the trained discriminator.

The object tracking device according to claim 6, wherein the image output means outputs, to the trained discriminator, the image or the image region subjected to the pixel change processing at one time point and the image or the image region subjected to the pixel change processing at at least one time point before the one time point.

The object tracking device according to claim 6, wherein the discriminator includes: a plurality of first neural networks, each of which receives the image or the image region at a respective time point and outputs feature information on the features of that image or image region; and a second neural network that receives the plurality of pieces of feature information output from the plurality of first neural networks and outputs information on the position of the one tracking target.

The image output means outputs, to the second neural network, information on the time point of the image or the image region corresponding to the feature information, in association with that feature information. Object tracking device as described in.

The object tracking device according to claim 1, wherein, when training the discriminator, the image output means outputs to the discriminator at least a set consisting of the image or the image region subjected to the pixel change processing and information on the position of the tracking target region given as the correct answer in that image or image region.

The object tracking device according to any one of the above claims, wherein, when training the discriminator, the mask processing means performs the pixel change processing on a shifted processing target region whose position is shifted from the determined processing target region.

The object tracking device according to any one of claims 1 to 5, wherein the image output means outputs at least the image or the image region at one time point subjected to the pixel change processing to the discriminator, which performs online learning, and
the object tracking device further comprises target position determining means for determining the position of the one tracking target at the one time point based on an output from the discriminator that is learning online.

A program for operating a computer mounted on a device that tracks a tracking target by using a time-series image group that can include a plurality of tracking targets and determining the position of the tracking target in each image based on an output from a discriminator to which the image or an image region in the image is input, the object tracking program causing the computer to function as:
mask processing means for performing, on a processing target region that is a region in an image included in the image group or in an image region in the image and that is determined based on a position determined at a past time, or given as a correct answer, for another tracking target other than one tracking target, pixel change processing that changes the region to a pixel pattern in which the features of the other tracking target are eliminated or reduced; and
image output means for outputting the image or the image region subjected to the pixel change processing to the discriminator for learning and/or identification of the one tracking target.

An object tracking method performed in a computer mounted on a device that tracks a tracking target by using a time-series image group that can include a plurality of tracking targets and determining the position of the tracking target in each image based on an output from a discriminator to which the image or an image region in the image is input, the method comprising:
a step of performing, on a processing target region that is a region in an image included in the image group or in an image region in the image and that is determined based on a position determined at a past time, or given as a correct answer, for another tracking target other than one tracking target, pixel change processing that changes the region to a pixel pattern in which the features of the other tracking target are eliminated or reduced; and
a step of outputting the image or the image region subjected to the pixel change processing to the discriminator for learning and/or identification of the one tracking target.