
JP2020149641A - Object tracking device and object tracking method

Info

Publication number: JP2020149641A
Application number: JP2019049168A
Authority: JP
Inventors: 宏奥田; Hiroshi Okuda; 信二高橋; Shinji Takahashi
Original assignee: Omron Corp; Omron Tateisi Electronics Co
Current assignee: Omron Corp
Priority date: 2019-03-15
Filing date: 2019-03-15
Publication date: 2020-09-17
Anticipated expiration: 2039-03-15
Also published as: JP7334432B2

Abstract

To provide an object tracking device that reduces drift to the background and tracks objects accurately. SOLUTION: In a monitoring system 2, a person tracking device 1 has an image input unit that acquires the position of an object in a first frame image, and a tracking unit that determines the position of the object from a second frame image. The tracking unit includes: a feature quantity extraction unit that extracts feature quantities from a target area of the second frame image; a likelihood map creation unit that determines, based on the feature quantities, a likelihood map representing the likelihood of the presence of the object for the target area of the second frame image; and a position specification unit that, when the likelihood map has one peak, specifies the position of the peak as the position of the object, and when the likelihood map has a plurality of peaks, specifies, as the position of the object, the position of a peak selected in consideration of image similarity representing the similarity between an image area in the vicinity of the position of the object in the first frame image and image areas in the vicinity of the peaks in the second frame image. SELECTED DRAWING: Figure 2

Description

The present invention relates to a technique for tracking an object in a moving image.

Object tracking, which tracks an object detected in one frame of a moving image (time-series images), is an important technique in the field of computer vision.

Background subtraction, a common tracking method, loses the target when the target stops moving. For example, when the target is a person, the person is lost as soon as he or she sits down in a chair, so the method is unsuitable for monitoring an office. Template matching, in turn, loses the target when the object deforms and its difference from the template reaches or exceeds a predetermined threshold. A person's movements cause large deformations relative to the template, so tracking fails.

In contrast, Non-Patent Document 1 determines the position of the tracking target based on a composite likelihood obtained by combining a likelihood based on luminance gradients (HOG features) with a likelihood based on color features (a color histogram). It reports that using shape-related and color-related features in a complementary manner in this way enables robust tracking.

Patent Document 1 discloses detecting a scene change and performing tracking with a feature amount selected to give the best tracking performance for the changed scene.

In the fields of building automation (BA) and factory automation (FA), there is a need for applications that use image sensors to automatically measure the number, positions, and flow lines of people and to optimally control equipment such as lighting and air conditioning. To acquire image information over as wide a range as possible, such applications often use an ultra-wide-angle camera equipped with a fisheye lens (called a fisheye camera, an omnidirectional camera, or a spherical camera, all of which mean the same thing; this specification uses the term "fisheye camera"). For the same reason, the camera is mounted at a high place such as a ceiling so that its viewpoint is a top view. With a camera arranged in this way, a person is captured in a front view when located at the periphery of the image and in a top view when located at the center of the image.

In an image captured by a fisheye camera, the appearance of a subject is deformed by distortion depending on its position in the image plane. In addition, when the camera viewpoint is a top view, the appearance changes with the position of the tracking target. Furthermore, in environments with limited processing power, such as embedded devices, the frame rate is likely to be low, so the movement of an object and the change in its feature amounts between frames are large. Conventional tracking methods may therefore fail to track accurately.

Patent Document 1: JP-A-2015-79502

Non-Patent Document 1: Bertinetto, Luca, et al. "Staple: Complementary learners for real-time tracking." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

The present invention has been made in view of the above circumstances, and its object is to provide an object tracking technique that is more accurate than conventional techniques.

To achieve the above object, the present invention adopts the following configuration.

A first aspect of the present invention provides an object tracking device comprising: an acquisition means for acquiring the position of an object in a first frame image; and a tracking means for obtaining the position of the object from a second frame image, which is a frame image after the first frame image, wherein the tracking means includes: a feature amount extraction means for extracting a feature amount from a target area of the second frame image; a likelihood calculation means for obtaining, based on the feature amount, a likelihood map representing the likelihood that the object exists in the target area of the second frame image; and a position determination means that, when the likelihood map has one peak, specifies the position of that peak as the position of the object and, when the likelihood map has a plurality of peaks, specifies as the position of the object the position of a peak selected in consideration of an image similarity representing the similarity between an image area near the position of the object in the first frame image and an image area near each peak in the second frame image.

The object to be tracked, that is, the "object", may be any object; examples include a human body, a face, an animal, and a vehicle. The "target area" is the area of the second frame image that is searched for the object, and is typically a partial area determined based on the position of the object in the first frame image. The "image similarity" is an index of the similarity between images and is evaluated, for example, by the difference in average color or average lightness within a region. The image area near the position of the object in the first frame image and the image area near the position of a peak in the second frame image are preferably compared using the same feature amount, are preferably partial areas of the object (foreground), and are particularly preferably partial areas at the center of the object (foreground).

The likelihood map obtained by the likelihood calculation means is expected to take its maximum value at the position of the object, but it may instead take its maximum value at the position of a different object. Simply taking the position of the maximum value in the likelihood map as the position of the tracked object therefore causes a tracking error known as drift. In the present invention, when the likelihood map contains a plurality of peaks (local peaks), a peak is selected in consideration of the image similarity near the position of the object, and the position of the selected peak is specified as the position of the object. Selecting the peak in consideration of image similarity in this way improves tracking accuracy.

When the likelihood map has a plurality of peaks, the position determination means of the present invention may, for example, specify as the position of the object the position of the peak with the highest image similarity among the peaks whose likelihood value is equal to or greater than a threshold. The threshold may be determined for each peak according to its image similarity.
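The selection rule described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `select_peak`, its argument layout, and the choice to return `None` when no peak passes the threshold are assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple

Position = Tuple[int, int]


def select_peak(
    peaks: Sequence[Position],
    likelihoods: Sequence[float],
    similarities: Sequence[float],
    threshold: float,
) -> Optional[Position]:
    """Pick the object position from the peaks of a likelihood map.

    peaks[i] is a peak position, likelihoods[i] its likelihood value and
    similarities[i] its image similarity to the object in the previous
    frame (larger means more similar). All names are hypothetical.
    """
    if not peaks:
        return None
    # A single peak is used directly as the object position.
    if len(peaks) == 1:
        return peaks[0]
    # Keep only peaks whose likelihood is at or above the threshold.
    candidates = [i for i, value in enumerate(likelihoods) if value >= threshold]
    if not candidates:
        return None  # assumption: report "no reliable peak" to the caller
    # Among the candidates, choose the one with the highest image similarity.
    best = max(candidates, key=lambda i: similarities[i])
    return peaks[best]
```

With peaks at (10, 10), (20, 5), and (3, 7), likelihoods 0.9, 0.6, and 0.2, similarities 0.3, 0.8, and 0.95, and a threshold of 0.5, the second peak is chosen: the third is the most similar but falls below the threshold, and the first has the highest likelihood but the lowest similarity.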

The way in which the likelihood calculation means of the present invention obtains the likelihood map is not particularly limited. For example, the likelihood map may be obtained by focusing on a first feature amount, which relates to shape, and a second feature amount, which relates to color or luminance. Examples of shape-related feature amounts include at least one of the HOG, LBP, SIFT, and SURF feature amounts. Examples of color-related feature amounts include at least one of a color histogram, a luminance histogram, and the Color Names feature amount. The likelihood calculation means of the present invention may obtain a first likelihood based on the first feature amount and a second likelihood based on the second feature amount, and generate a map of the composite likelihood obtained by combining them.
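One way to combine the two likelihoods is a per-pixel weighted sum. The sketch below assumes this linear form, which the text does not specify; the function name `composite_likelihood` and the default weight are likewise hypothetical.

```python
import numpy as np


def composite_likelihood(shape_map, color_map, color_weight=0.3):
    """Combine a shape-based and a color-based likelihood map.

    Assumption: the maps have the same size and are merged linearly,
    (1 - w) * shape + w * color. The weight value is illustrative only.
    """
    shape_map = np.asarray(shape_map, dtype=float)
    color_map = np.asarray(color_map, dtype=float)
    if shape_map.shape != color_map.shape:
        raise ValueError("likelihood maps must have the same shape")
    if not 0.0 <= color_weight <= 1.0:
        raise ValueError("color_weight must lie in [0, 1]")
    return (1.0 - color_weight) * shape_map + color_weight * color_map
```

A weight of 0 uses only the shape-based likelihood and a weight of 1 only the color-based likelihood, so the weight controls how strongly each cue contributes.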

In the present invention, the image similarity considered in peak selection can be determined based on, for example, at least one of the difference, the absolute value of the difference, and the square of the difference of image information including at least one of the average color, the average luminance, and a representative color. The image similarity can also be determined based on at least one of the histogram intersection, the Bhattacharyya coefficient, and the Earth Mover's Distance of at least one of a shape-related feature amount such as HOG and a color-related feature amount such as a color histogram. In addition, the image similarity can be determined by template matching. Instead of a similarity, a dissimilarity measured from at least one of the sum of squared differences and the sum of absolute differences may be used. The image similarity is any measure that indicates how similar two images are: it may be an index whose value grows as the images become more similar, such as the histogram intersection, or an index whose value shrinks as the images become more similar, such as the absolute value of a difference.
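Two of the histogram-based measures named above can be written in a few lines. This is a sketch under the assumption that both histograms are first normalized to sum to 1; the function names are hypothetical.

```python
import numpy as np


def _normalize(hist):
    """Scale a non-negative histogram so that its bins sum to 1."""
    hist = np.asarray(hist, dtype=float)
    total = hist.sum()
    if total <= 0:
        raise ValueError("histogram must have a positive sum")
    return hist / total


def histogram_intersection(h1, h2):
    """Sum of bin-wise minima: 1.0 for identical, 0.0 for disjoint histograms."""
    p, q = _normalize(h1), _normalize(h2)
    return float(np.minimum(p, q).sum())


def bhattacharyya_coefficient(h1, h2):
    """Sum of sqrt(p * q) over bins: 1.0 for identical, 0.0 for disjoint histograms."""
    p, q = _normalize(h1), _normalize(h2)
    return float(np.sqrt(p * q).sum())
```

Both measures increase as the histograms become more similar, so they can be used directly where the highest similarity is selected. A difference-based measure would need the opposite comparison.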

The image processed in the present invention may be a fisheye image obtained by a fisheye camera. A "fisheye camera" is a camera equipped with a fisheye lens and can capture images at an ultra-wide angle compared with an ordinary camera. Omnidirectional cameras, spherical cameras, and fisheye cameras are all types of ultra-wide-angle camera and all mean the same thing here. The fisheye camera only needs to be installed so as to look down on the detection target area from above. It is typically installed with its optical axis pointing vertically downward, but the optical axis may be tilted with respect to the vertical direction. Because fisheye images are heavily distorted, the features of an object change greatly between frames, particularly at low frame rates, and drift to the background occurs frequently. Furthermore, when the camera is installed with its optical axis pointing vertically downward, the viewpoint from which the object is captured changes with the position of the object in the image, so the object deforms greatly and tracking often fails, particularly at low frame rates. According to the present invention, however, accurate tracking is possible even for such fisheye images and even when the camera is installed with its optical axis pointing vertically downward. The image processed by the present invention is not limited to a fisheye image and may be an ordinary image (an image with little distortion or an image with a high frame rate).

A second aspect of the present invention provides an object tracking method comprising: an acquisition step of acquiring the position of an object in a first frame image; and a tracking step of obtaining the position of the object from a second frame image, which is a frame image after the first frame image, wherein the tracking step includes: a feature amount extraction step of extracting a feature amount from a target area of the second frame image; a likelihood calculation step of obtaining, based on the feature amount, a likelihood map representing the likelihood that the object exists in the target area of the second frame image; and a position determination step of, when the likelihood map has one peak, specifying the position of that peak as the position of the object and, when the likelihood map has a plurality of peaks, specifying as the position of the object the position of a peak selected in consideration of an image similarity representing the similarity between an image area near the position of the object in the first frame image and an image area near each peak in the second frame image.

The present invention may be regarded as an object tracking device having at least some of the above means, or as an image processing device or a monitoring system. The present invention may also be regarded as an object tracking method, an image processing method, or a monitoring method including at least some of the above processes. The present invention can further be regarded as a program for implementing such a method, or as a non-transitory recording medium on which the program is recorded. The above means and processes can be combined with one another wherever possible to constitute the present invention.

According to the present invention, object tracking can be performed with higher accuracy than before.

FIG. 1 is a diagram showing an application example of the person tracking device according to the present invention.
FIG. 2 is a diagram showing the configuration of a monitoring system including the person tracking device.
FIG. 3 is a flowchart of the overall process performed by the person tracking device.
FIG. 4 is a flowchart of the learning process.
FIG. 5 is a flowchart of the tracking process.
FIG. 6 is a flowchart of the peak selection process in the tracking process.
FIG. 7 is a diagram illustrating generation of the composite likelihood map in the tracking process.
FIG. 8 is a diagram showing an example of a composite likelihood map.
FIG. 9 is a diagram illustrating the peak selection process when the composite likelihood map has a plurality of peaks.

<Application example>
An application example of the object tracking device according to the present invention will be described with reference to FIG. 1. The person tracking device 1 analyzes a fisheye image obtained by a fisheye camera 10 installed above a tracking target area 11 (for example, on a ceiling 12) and detects and tracks a person 13 present in the tracking target area 11. The person tracking device 1 detects, recognizes, and tracks persons 13 passing through the tracking target area 11 in, for example, an office or a factory. In the example of FIG. 1, the regions of four human bodies detected in the fisheye image are each indicated by a bounding box. The detection results of the person tracking device 1 are output to an external device and used, for example, for counting people, controlling equipment such as lighting and air conditioning, monitoring suspicious persons, and analyzing flow lines.

Object tracking is performed by locating, within a target region of the current frame near the position of the object specified in the previous frame image, a region having features similar to those of the object. A plurality of likelihood peaks representing object-likeness may appear in the target region. In that case, the person tracking device 1 does not simply take the peak with the highest likelihood as the position of the object. Instead, it determines as the position of the object the peak for which the difference between the average color near the center position of the object in the previous frame image and the average color near the peak position in the current frame is smallest. Specifying the peak, and hence the object position, in consideration of the average color in this way suppresses drift to the background and enables accurate tracking. Because calculating the average color has a relatively light computational load, high-speed tracking can also be achieved.

<Monitoring system>
An embodiment of the present invention will be described with reference to FIG. 2. FIG. 2 is a block diagram showing the configuration of a monitoring system to which the person tracking device according to the embodiment of the present invention is applied. The monitoring system 2 includes a fisheye camera 10 and a person tracking device 1.

The fisheye camera 10 is an imaging device having an optical system that includes a fisheye lens and an image sensor (such as a CCD or CMOS sensor). As shown in FIG. 1, for example, the fisheye camera 10 is preferably installed on the ceiling 12 of the tracking target area 11 with its optical axis pointing vertically downward so that it captures an omnidirectional (360-degree) image of the tracking target area 11. The fisheye camera 10 is connected to the person tracking device 1 by wire (a USB cable, a LAN cable, or the like) or wirelessly (WiFi or the like), and the image data captured by the fisheye camera 10 is taken into the person tracking device 1. The image data may be either a monochrome image or a color image, and its resolution, frame rate, and format are arbitrary. This embodiment assumes color (RGB) images captured at 10 fps (10 frames per second).

The person tracking device 1 of this embodiment includes an image input unit 20, a human body detection unit 21, a learning unit 22, a storage unit 23, a tracking unit 24, and an output unit 28.

The image input unit 20 has a function of capturing image data from the fisheye camera 10. The captured image data is passed to the human body detection unit 21 and the tracking unit 24. The image data may also be stored in the storage unit 23.

The human body detection unit 21 has a function of detecting human bodies in the fisheye image using a human body detection algorithm. A human body detected by the human body detection unit 21 becomes the target of the tracking process performed by the tracking unit 24. The human body detection unit 21 may detect only persons who newly appear in the image, and may exclude the vicinity of positions where tracked persons already exist from the detection process. Alternatively, a tracking-by-detection scheme may be adopted in which the human body detection unit 21 detects persons over the entire image at fixed time or frame intervals, after which the tracking unit 24 performs the tracking process.

The learning unit 22 learns the features of the human body to be tracked from the image of the human body detected by the human body detection unit 21 or specified by the tracking unit 24, and stores the learning result in the storage unit 23. Here, the learning unit 22 obtains a correlation filter for evaluation based on shape features, a color histogram for evaluation based on color features, and the average color at the center position. The learning unit 22 performs learning on every frame and updates the past learning result by blending in the learning result obtained from the current frame with a predetermined coefficient.
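A per-frame update of this kind is often written as an exponential moving average. The sketch below assumes that form, which the text does not state explicitly; the function name `update_model` and the coefficient value are hypothetical, and the same update could be applied to a correlation filter, a color histogram, or an average color stored as an array.

```python
import numpy as np


def update_model(previous, current, learning_rate=0.1):
    """Blend the current frame's learning result into the stored one.

    Assumption: new = (1 - rate) * previous + rate * current, where
    `learning_rate` plays the role of the predetermined coefficient.
    """
    if not 0.0 <= learning_rate <= 1.0:
        raise ValueError("learning_rate must lie in [0, 1]")
    previous = np.asarray(previous, dtype=float)
    current = np.asarray(current, dtype=float)
    return (1.0 - learning_rate) * previous + learning_rate * current
```

A small coefficient makes the stored model change slowly and resist momentary appearance changes, while a large one lets it follow the target's appearance more quickly at the cost of absorbing background when tracking drifts.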

The storage unit 23 stores the learning results obtained by the learning unit 22. The storage unit 23 also stores the hyperparameters of the learning and tracking processes, such as the feature amounts to be used, the parameters of each feature amount, the learning coefficient, the weighting coefficients used in combining likelihoods, and the initial value of the threshold used in peak selection.

The tracking unit 24 specifies the position of the tracked person in the current frame image. Initially, the tracking unit 24 takes a region including the position detected by the human body detection unit 21 as the target region and specifies, within that target region, the position of an object having features similar to those of the detected person. Thereafter, it takes the vicinity of the position that the tracking unit 24 specified for the previous frame image as the target region and specifies the position of the tracked person in the current frame image.

The feature amount extraction unit 25 extracts a feature amount related to the shape of the object and a feature amount related to its color from the target region. It extracts the HOG feature amount as the shape-related feature amount and a color histogram as the color-related feature amount.

The likelihood map generation unit 26 uses the extracted feature amounts together with the correlation filter and color histogram stored in the storage unit 23 to generate a likelihood map representing, for each position in the target region, the likelihood that the tracked object exists there. The likelihood map generation unit 26 generates a map of the composite likelihood obtained by combining the likelihood based on the shape features and the correlation filter with the likelihood based on the color features and the color histogram. The likelihood map is also called a response map.

The position specifying unit 27 specifies the position of the tracked object in the current frame image based on the composite likelihood map. Specifically, when the composite likelihood map has one peak, the position specifying unit 27 specifies that position as the position of the tracked object. When there are a plurality of peaks, the position specifying unit 27 specifies as the position of the tracked object the position of the peak for which the difference between the average color near the center position of the object in the previous frame image and the average color near the peak position in the current frame image is smallest. A minimum average color difference means, in other words, a maximum image similarity based on the average color.
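The behavior of the position specifying unit can be sketched as below. Several points are assumptions made for illustration: peaks are taken to be strict local maxima in an 8-neighborhood, the likelihood map is assumed to have the same resolution as the cropped frame, the color difference is a sum of absolute channel differences, and all function names and the `radius` parameter are hypothetical.

```python
import numpy as np


def find_peaks(likelihood):
    """Return (row, col) positions of strict local maxima (8-neighborhood)."""
    lm = np.asarray(likelihood, dtype=float)
    padded = np.pad(lm, 1, mode="constant", constant_values=-np.inf)
    peaks = []
    for r in range(lm.shape[0]):
        for c in range(lm.shape[1]):
            window = padded[r:r + 3, c:c + 3]  # 3x3 window centered on (r, c)
            value = lm[r, c]
            if value == window.max() and (window == value).sum() == 1:
                peaks.append((r, c))
    return peaks


def mean_color(image, center, radius=1):
    """Average color of the square patch around `center`, clipped to the image."""
    r, c = center
    r0, r1 = max(r - radius, 0), min(r + radius + 1, image.shape[0])
    c0, c1 = max(c - radius, 0), min(c + radius + 1, image.shape[1])
    return image[r0:r1, c0:c1].reshape(-1, image.shape[2]).mean(axis=0)


def locate_target(likelihood, frame, prev_mean_color, radius=1):
    """One peak: use it. Several peaks: pick the smallest average color difference."""
    peaks = find_peaks(likelihood)
    if not peaks:
        return None
    if len(peaks) == 1:
        return peaks[0]
    prev = np.asarray(prev_mean_color, dtype=float)
    diffs = [np.abs(mean_color(frame, p, radius) - prev).sum() for p in peaks]
    return peaks[int(np.argmin(diffs))]
```

Because only a small patch is averaged per peak, the extra cost over taking the global maximum is low, which matches the lightweight processing the text attributes to the average color.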

The output unit 28 has a function of outputting information such as the fisheye image, detection results, and tracking results to an external device. For example, the output unit 28 may display the information on a display serving as the external device, transfer the information to a computer serving as the external device, or send information or control signals to a lighting device, an air conditioner, or an FA device serving as the external device.

The person tracking device 1 can be implemented, for example, by a computer including a CPU (processor), memory, and storage. In that case, the configuration shown in FIG. 2 is realized by loading a program stored in the storage into the memory and having the CPU execute it. The computer may be a general-purpose computer such as a personal computer, a server computer, a tablet terminal, or a smartphone, or an embedded computer such as an onboard computer. Alternatively, all or part of the configuration shown in FIG. 2 may be implemented with an ASIC, an FPGA, or the like, or realized by cloud computing or distributed computing.

<Overall processing>
FIG. 3 is an overall flowchart of the person tracking process performed by the monitoring system 2. The overall flow of the person tracking process will be described with reference to FIG. 3.

First, in step S101, the user sets the learning and tracking hyperparameters for the person tracking device 1. Examples of hyperparameters include the features to be used, the parameters of each feature, the learning coefficient, the weighting coefficient used when merging, and the initial value of the threshold used in peak selection. The entered hyperparameters are stored in the storage unit 23.

Next, in step S102, the person tracking device 1 acquires the target area. The target area combines the area in which the person to be tracked exists with its surroundings, and is an area in which the person to be tracked is likely to exist. It can also be described as the area processed by the tracking unit 24. In the present embodiment, the initial position of the person to be tracked is detected by the human body detection unit 21. However, the initial position may instead be supplied in another way, for example entered by the user.

Thereafter, the processing of steps S104 to S107 is repeated. The processing ends when the end condition is satisfied in the end determination of step S103. The end condition can be, for example, loss of the person to be tracked (the person leaving the frame) or the end of the moving image.

In step S104, the image input unit 20 inputs one frame of the fisheye image from the fisheye camera 10. At this point, a plane-developed image in which the distortion of the fisheye image has been corrected could be created and used for the subsequent processing, but the monitoring system 2 of the present embodiment uses the fisheye image as it is (still distorted) for detection and tracking.

In step S105, it is determined whether the current frame is the first image. Here, the first image is the frame image for which the initial position of the person to be tracked is given, typically the frame image in which the human body detection unit 21 detected that person.

If the current frame is an image later than the first image, the process proceeds to step S106, where the tracking unit 24 executes the tracking process. The tracking process is described in detail later.

In step S107, the learning unit 22 executes the learning process based on the area in which the target person exists in the current frame image. The learning process is described in detail later.

In this way, tracking is achieved by having the tracking process S106 identify the position of the person to be tracked in every frame. The tracking method of the present embodiment uses a sequential-learning tracking algorithm that learns the features of the person to be tracked in every frame.
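
The loop of FIG. 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: `run_tracking`, `track`, and `learn` are hypothetical names, and the tracking and learning steps are passed in as callables so that only the control flow of steps S103 to S107 is shown.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Position = Tuple[int, int]


def run_tracking(
    frames: Iterable[object],
    initial_position: Position,
    track: Callable[[object, Position], Optional[Position]],
    learn: Callable[[object, Position], None],
) -> List[Position]:
    """Control flow of FIG. 3; tracking and learning are supplied by the caller."""
    position = initial_position              # S102: initial position of the target
    trajectory: List[Position] = []
    for index, frame in enumerate(frames):   # S104; the loop ends with the video (S103)
        if index > 0:                        # S105: no tracking on the first image
            found = track(frame, position)   # S106
            if found is None:                # assumed convention: None means the target was lost
                break
            position = found
        learn(frame, position)               # S107: learn from the current frame
        trajectory.append(position)
    return trajectory
```

Passing the two steps as functions keeps the sketch independent of how the likelihood map is computed.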

<Learning process>
FIG. 4 is a flowchart showing the details of the learning process in step S107. FIG. 7 is a diagram illustrating the learning process and the tracking process that uses the learning result. The learning process is described below with reference to FIGS. 4 and 7.

First, the learning unit 22 cuts out the target area 74 from the current frame image (S201). As shown in FIG. 7, the target area 74 includes the foreground area 72 and the background area 73 of the person. The foreground area 72 is the area in which the person to be tracked exists, and the background area 73 is the area in which that person does not exist. The size of the background area 73 is determined according to the size of the foreground area 72. For example, the size of the background area 73 is set so that the size of the foreground area 72 is a predetermined fraction (for example, 1/3) of the overall size of the target area 74. Because the target area is updated at the end of the tracking process so that its center is at the position of the person to be tracked (step S308 in FIG. 5), the center of the target area 74 coincides with the center position of that person.

The learning unit 22 extracts the average color near the center position 71 of the target area 74 and stores it in the storage unit 23 (S202). Here, the vicinity of the center position 71 is an area that contains the center position 71 and is smaller than the foreground area 72, typically a rectangular area centered on the center position 71. The size of this neighborhood may be fixed (for example, 3 × 3) or may depend on the size of the foreground area 72 (for example, half its size).
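
The average color of step S202 could be computed along the following lines. This is only a sketch: the image representation (a list of rows of RGB tuples), the function name `mean_color`, and the clipping at the image border are assumptions, since the text does not restrict how the average is obtained.

```python
from typing import List, Sequence, Tuple

Color = Tuple[float, float, float]


def mean_color(image: List[List[Sequence[float]]], center: Tuple[int, int],
               half_size: int = 1) -> Color:
    """Average RGB over a (2*half_size+1)-pixel square around center=(x, y).

    The window is clipped to the image, so pixels outside it are ignored.
    half_size=1 gives the fixed 3 x 3 neighborhood mentioned in the text.
    """
    cx, cy = center
    height, width = len(image), len(image[0])
    x0, x1 = max(0, cx - half_size), min(width - 1, cx + half_size)
    y0, y1 = max(0, cy - half_size), min(height - 1, cy + half_size)
    totals = [0.0, 0.0, 0.0]
    count = 0
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            for channel in range(3):
                totals[channel] += image[y][x][channel]
            count += 1
    return (totals[0] / count, totals[1] / count, totals[2] / count)
```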

The learning unit 22 also acquires the HOG features in the target area 74 (S203). A HOG feature is a histogram of the luminance gradient directions in a local area and can be regarded as a feature representing the shape and contour of an object. HOG features are used here, but other features that represent the shape and contour of an object, such as LBP, SIFT, or SURF features, may be used instead.
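
To make "a histogram of luminance gradient directions" concrete, the sketch below builds the orientation histogram of a single cell. It is a deliberately simplified illustration, not a full HOG descriptor: there is no block normalization or vote interpolation, and the 9 unsigned bins and the name `cell_orientation_histogram` are assumptions.

```python
import math
from typing import List


def cell_orientation_histogram(gray: List[List[float]], bins: int = 9) -> List[float]:
    """Magnitude-weighted histogram of unsigned gradient directions (0 to 180 degrees).

    gray is a 2D list of luminance values; the one-pixel border is used only
    to compute central differences for the interior pixels.
    """
    histogram = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            gx = gray[y][x + 1] - gray[y][x - 1]   # horizontal gradient
            gy = gray[y + 1][x] - gray[y - 1][x]   # vertical gradient
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            index = min(int(angle / bin_width), bins - 1)
            histogram[index] += magnitude
    return histogram
```

A horizontal luminance ramp puts all votes in the first bin, and a vertical ramp puts them in the middle bin, which is why such histograms capture contour direction.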

The learning unit 22 obtains a correlation filter 76 whose response has a peak at the center of the target (S204). Specifically, after the HOG features are extracted, the correlation filter 76 is obtained by finding the filter whose output, when applied to the correlation of the features with themselves, comes closest to an ideal response that has a peak only at the center. When the correlation filter is computed in Fourier space, the features may be multiplied by a window function. The HOG features are stored in the storage unit 23 because they are used when the correlation filter is applied in the tracking process for the next frame.
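
The text does not give the filter equations. As a point of reference only, a closed form commonly used for single-channel correlation filters (the MOSSE formulation, assumed here rather than taken from this document) is shown below. In it, $\hat{F}$ is the Fourier transform of the windowed feature patch, $\hat{G}$ is the Fourier transform of the ideal response with a single central peak, $\odot$ is element-wise multiplication, ${}^{*}$ is complex conjugation, and $\lambda$ is a small regularization constant.

```latex
\hat{H}^{*} = \frac{\hat{G} \odot \hat{F}^{*}}{\hat{F} \odot \hat{F}^{*} + \lambda},
\qquad
\hat{R} = \hat{Z} \odot \hat{H}^{*}
```

Here $\hat{Z}$ is the transform of the features taken from the new frame, and the inverse transform of $\hat{R}$ gives the response map whose peaks are examined during tracking.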

The learning unit 22 also acquires the color histogram 77 in the target area 74 (S205). Specifically, it acquires a color histogram for each of the foreground area 72 and the background area 73. The color histogram is a feature representing color; Color Names (CN) features can be used as another feature representing color. A luminance histogram, which represents luminance rather than color, may also be used.

If this is the first learning pass (S206-YES), the correlation filter and color histogram generated in steps S204 and S205 are stored in the storage unit 23 as they are. If this is the second or a later learning pass (S206-NO), the process proceeds to step S207.

In step S207, the learning unit 22 obtains a new correlation filter by merging the previously obtained correlation filter (the one stored in the storage unit 23) with the correlation filter obtained in step S204 this time, and stores it in the storage unit 23. In step S208, the learning unit 22 likewise obtains a new color histogram by merging the previously obtained color histogram (the one stored in the storage unit 23) with the color histogram obtained in step S205 this time, and stores it in the storage unit 23. The weight used for merging (the learning coefficient) may be chosen as appropriate.
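
One simple way to merge the stored model with the newly computed one is a linear blend controlled by the learning coefficient. The text leaves the merging rule open, so the linear form and the name `blend_model` below are assumptions; the same function applies to a flattened filter or to histogram bins.

```python
from typing import List, Sequence


def blend_model(previous: Sequence[float], current: Sequence[float],
                learning_rate: float) -> List[float]:
    """Return (1 - learning_rate) * previous + learning_rate * current, element by element."""
    if len(previous) != len(current):
        raise ValueError("models must have the same length")
    return [(1.0 - learning_rate) * p + learning_rate * c
            for p, c in zip(previous, current)]
```

A small learning coefficient keeps the model stable, while a large one adapts quickly to appearance changes at the cost of learning the background sooner if tracking slips.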

<Tracking process>
FIG. 5 is a flowchart showing the details of the tracking process in step S106. FIG. 7 is a diagram illustrating the learning process and the tracking process that uses the learning result. The tracking process is described below with reference to FIGS. 5 and 7.

The tracking unit 24 cuts out the target area 75 from the current frame image (S301). Because the target area was updated at the end of the previous tracking process so that its center is at the position of the person to be tracked (step S308 in FIG. 5), the center of the target area 74 coincides with the center position of that person. In FIG. 7, when the tracking unit 24 is processing the image of frame T+1, the target area 75 that is cut out corresponds to the target area 74 centered on the position of the person to be tracked identified in frame T.

The feature amount extraction unit 25 extracts HOG features from each cell in the target area 75 (S302). The likelihood map generation unit 26 applies the correlation filter 76 to the correlation between the HOG features in the target area 75 and the HOG features stored in the storage unit 23 to obtain the likelihood map 78 (response map) (S303). Graph 81 in FIG. 8A and graph 84 in FIG. 8B are examples of the likelihood map 78 based on HOG features. The likelihood map 78 represents, for each position in the target area 75, the likelihood that the position is the person to be tracked.

From the color of each pixel in the target area 75 and the color histogram 77 stored in the storage unit 23, the likelihood map generation unit 26 generates the likelihood map 79 (response map), which represents the likelihood that each cell in the target area 75 is the person to be tracked (foreground). More specifically, the likelihood map generation unit 26 obtains the foreground likelihood of each pixel of interest based on the color histogram 77 stored in the storage unit 23 and the color of that pixel. The likelihood that a cell is the person to be tracked is then obtained by averaging the foreground likelihoods of the pixels in the cell. Graph 82 in FIG. 8A and graph 85 in FIG. 8B are examples of the likelihood map 79 based on the color histogram.
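
The per-pixel foreground likelihood and the per-cell average can be sketched as below. The text does not state the formula, so the ratio of the foreground bin to the sum of foreground and background bins is an assumption (one common choice), as are the 8 bins per channel, the value 0.5 for unseen colors, and all function names.

```python
from typing import Dict, List, Sequence, Tuple

Bin = Tuple[int, int, int]


def color_bin(pixel: Sequence[int], bins_per_channel: int = 8) -> Bin:
    """Quantize an 8-bit RGB pixel into a coarse histogram bin."""
    step = 256 // bins_per_channel
    return (pixel[0] // step, pixel[1] // step, pixel[2] // step)


def build_histogram(pixels: List[Sequence[int]]) -> Dict[Bin, float]:
    """Normalized color histogram of a pixel list (foreground or background)."""
    counts: Dict[Bin, float] = {}
    for pixel in pixels:
        key = color_bin(pixel)
        counts[key] = counts.get(key, 0.0) + 1.0
    total = max(len(pixels), 1)
    return {key: value / total for key, value in counts.items()}


def foreground_likelihood(pixel: Sequence[int], fg: Dict[Bin, float],
                          bg: Dict[Bin, float]) -> float:
    """Assumed rule: fg / (fg + bg) for the pixel's bin, 0.5 if the color was never seen."""
    key = color_bin(pixel)
    f, b = fg.get(key, 0.0), bg.get(key, 0.0)
    return 0.5 if f + b == 0.0 else f / (f + b)


def cell_likelihood(cell_pixels: List[Sequence[int]], fg: Dict[Bin, float],
                    bg: Dict[Bin, float]) -> float:
    """Average the per-pixel foreground likelihoods over one cell."""
    values = [foreground_likelihood(p, fg, bg) for p in cell_pixels]
    return sum(values) / len(values)
```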

The likelihood map generation unit 26 merges the likelihood map 78 based on the correlation filter 76 and the likelihood map 79 based on the color histogram 77, obtained as described above, to generate the composite likelihood map 80. The merging method is not particularly limited: the two likelihoods may be simply averaged or averaged with weights. Graph 83 in FIG. 8A and graph 86 in FIG. 8B are examples of the composite likelihood map 80 (composite response map).
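
The weighted average mentioned above can be written as a short sketch. The name `merge_maps` and the parameter `color_weight` are hypothetical, and both maps are assumed to be 2D lists of the same size, one value per cell.

```python
from typing import List


def merge_maps(shape_map: List[List[float]], color_map: List[List[float]],
               color_weight: float = 0.5) -> List[List[float]]:
    """Cell-wise weighted average; color_weight=0.5 is the simple average."""
    if len(shape_map) != len(color_map):
        raise ValueError("maps must have the same number of rows")
    return [
        [(1.0 - color_weight) * s + color_weight * c for s, c in zip(row_s, row_c)]
        for row_s, row_c in zip(shape_map, color_map)
    ]
```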

The position specifying unit 27 selects one peak from the composite likelihood map and takes the position of that peak as the center position of the person to be tracked in the current frame image (S307). When the composite likelihood map has only one peak, as shown in FIG. 8A, the position of that peak can be taken as the position of the person to be tracked. When the composite response map has a plurality of peaks, as shown in FIG. 8B, however, simply selecting the peak with the highest value (composite likelihood) may cause drift and a tracking error. The position specifying unit 27 therefore selects the peak by the process shown in the flowchart of FIG. 6, which achieves highly accurate tracking.

An outline of the process in the flowchart of FIG. 6 is given first with reference to FIG. 9. In FIG. 9, frame T is the previous frame and frame T+1 is the current frame. Image 91 shows the target area in the previous frame image, and its center 92 is the center of the position where the target person exists. Image 94 shows the target area in the current frame image, in which a plurality of peaks 95 have been extracted.

Among the plurality of peaks 95 extracted in the current frame image, the position specifying unit 27 selects the peak for which the difference between the average color in the area 93 near the center position 92 of the person to be tracked in the previous frame image and the average color in the area 96 near the peak 95 in the current frame image is smallest.

For brevity, the "average color in the area near the center position of the person to be tracked in the previous frame image" is hereinafter called the "center average color of the previous frame image", and the "average color in the area near the peak position in the current frame image" is called the "peak position average color of the current frame image".

A more detailed description follows with reference to FIG. 6. The position specifying unit 27 extracts local peaks from the composite likelihood map (S401). A local peak is a position at which the composite likelihood map takes a local maximum. Local peaks may be extracted, for example, by checking whether the value of the target pixel is greater than or equal to the values of its neighboring pixels. The processing from step S402 onward is performed for each local peak detected here.
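
The neighbor comparison of step S401 can be sketched as follows. The 8-connected neighborhood and the name `local_peaks` are assumptions. Note that with a "greater than or equal" test every cell of a flat plateau is reported, so a practical implementation would probably add a minimum value or a strict comparison.

```python
from typing import List, Tuple


def local_peaks(score_map: List[List[float]]) -> List[Tuple[int, int, float]]:
    """Return (x, y, value) for every cell that is >= all of its 8-connected neighbors."""
    height, width = len(score_map), len(score_map[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            value = score_map[y][x]
            neighbors = [
                score_map[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            if all(value >= other for other in neighbors):
                peaks.append((x, y, value))
    return peaks
```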

The position specifying unit 27 extracts the average color in the area 96 near the peak 95 of the current frame image 94 and stores it temporarily (S402). The method of obtaining the average color is not particularly limited.

The position specifying unit 27 determines whether the peak currently being processed is the first peak (S403). If it is, the process proceeds to step S408 and this peak is selected. This selection is provisional; the peak that is selected when the loop finishes is the final result. The position specifying unit 27 stores the position of the selected peak in the storage unit 23. It also stores in the storage unit 23 the difference between the peak position average color of the current frame image and the center average color of the previous frame image.

If the peak currently being processed is not the first peak, the process proceeds to step S404. The position specifying unit 27 determines whether the peak value is greater than or equal to the threshold A (S404). The threshold A may be a fixed value set in advance or a value that is updated each time a frame is tracked.

If the peak value is greater than or equal to the threshold A (S404-YES), the position specifying unit 27 corrects the threshold according to the difference between the center average color of the previous frame image and the peak position average color of the current frame image (step S405). The corrected threshold is called the threshold B. Specifically, the threshold is preferably made smaller as the average color difference becomes larger and larger as the average color difference becomes smaller. The threshold is corrected in order to improve robustness to lighting changes between frames.

The position specifying unit 27 determines whether the value at the peak being processed is greater than or equal to the corrected threshold B (S406). If it is, the position specifying unit 27 further determines whether the difference between the center average color of the previous frame image and the peak position average color of the current frame image is the smallest among the peaks whose value is greater than or equal to the corrected threshold B (S409). This determination may be replaced by checking whether the average color difference at the current peak is smaller than that at the provisionally selected peak. If the average color difference is the smallest (S409-YES), the position specifying unit 27 selects the peak being processed and stores its position and average color difference in the storage unit 23.

If, on the other hand, the value at the peak being processed is less than the threshold A in step S404, or less than the corrected threshold B in step S406, the process proceeds to step S407. In step S407, the position specifying unit 27 determines whether the value of this peak is the largest so far and, if it is, selects this peak in step S408.

By performing the processing of steps S402 to S410 above for all peaks extracted in step S401, the peak whose value is greater than or equal to the threshold and whose difference between the center average color of the previous frame image and the peak position average color of the current frame image is smallest is selected.
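
The loop of FIG. 6 can be sketched as follows. This is an illustration under assumptions rather than the implementation of the embodiment: the color difference is taken to be the Euclidean distance between mean RGB colors, step S409 uses the permitted comparison against the provisionally selected peak, and `example_correction` is a purely hypothetical rule that only respects the stated direction (a larger color difference gives a smaller threshold). All names are invented for the example.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Position = Tuple[int, int]
Peak = Tuple[Position, float, Sequence[float]]   # (position, peak value, mean color near the peak)


def color_difference(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed measure: Euclidean distance between two mean colors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def example_correction(threshold_a: float, color_diff: float,
                       scale: float = 100.0, gain: float = 0.2) -> float:
    """Hypothetical S405 rule: threshold B falls from 1.2*A to 0.8*A as the difference grows."""
    ratio = min(color_diff / scale, 2.0)
    return threshold_a * (1.0 + gain * (1.0 - ratio))


def select_peak(peaks: List[Peak], previous_center_color: Sequence[float],
                threshold_a: float,
                correct_threshold: Callable[[float, float], float] = example_correction
                ) -> Optional[Position]:
    selected: Optional[Tuple[Position, float]] = None   # (position, color difference)
    best_value = float("-inf")                           # largest peak value seen so far
    for index, (position, value, color) in enumerate(peaks):
        diff = color_difference(color, previous_center_color)        # S402
        if index == 0:                                               # S403
            selected = (position, diff)                              # S408 (provisional)
        elif value >= threshold_a and value >= correct_threshold(threshold_a, diff):  # S404 to S406
            if diff < selected[1]:                                   # S409
                selected = (position, diff)                          # S410
        elif value > best_value:                                     # S407
            selected = (position, diff)                              # S408
        best_value = max(best_value, value)
    return selected[0] if selected is not None else None
```

With a strong peak on a differently colored background object and a slightly weaker peak whose color matches the previous frame, the function returns the second peak, which is the behavior the text relies on to avoid drift. Passing `lambda a, d: a` as `correct_threshold` gives the fixed-threshold variant.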

In the flowchart above, step S404 determines whether the peak value is greater than or equal to the threshold A, but this step may be omitted so that only the determination based on the corrected threshold B (S406) is made. Also, when the peak value is below the threshold (S404-NO, S406-NO), the peak is selected if its value is the largest so far (S407-S408), but such a peak may instead never be selected regardless of its value. This is because a peak whose value is below the threshold is assumed to have a large average color difference, so another peak is expected to be selected afterward (S410). Finally, the threshold correction based on the average color difference (S405) improves robustness to lighting changes, but it may be omitted and a fixed threshold used.

The description now returns to the flowchart of FIG. 5. When the peak selection process of step S307 is completed as described above, the position specifying unit 27 updates the center of the target area to the position of the selected peak (S308) and updates the size of the target area (S309). Thus, after the tracking process is completed, the center of the target area is updated to the center position of the person to be tracked, and the size of the target area is updated according to the tracking result. The updated size of the target area may be estimated by a method that uses an image pyramid, such as DSST (Discriminative Scale Space Tracking), or may be determined based on at least one of the size of the target area in the previous frame, the lens distortion characteristics, the camera viewpoint, the camera placement, and the position of the target area in the image. After the tracking process is completed, the center of the target area is the center position of the person to be tracked, and the foreground area within the target area is the area in which that person exists (the bounding box).

<Advantageous effects of this embodiment>
In the present embodiment, a person tracking device that uses fisheye images without plane development can suppress drift to the background and achieve highly accurate person tracking. Drift is a tracking failure caused by mistakenly learning features of something other than the tracking target during sequential learning. It occurs, for example, when an object (background) similar to the person to be tracked is present in the image, when the background is complex, and when occlusion is present. In general, a plurality of peaks appear in the likelihood map in these situations. If a peak corresponding to an object other than the tracking target is then selected by mistake, drift occurs. In the present embodiment, when a plurality of peaks appear in the composite likelihood map, the peak is selected in consideration of the average color at the center position rather than simply by taking the peak with the largest likelihood. This reduces erroneous selection of peaks that correspond to objects other than the tracking target, that is, it reduces the occurrence of drift. Reducing drift means fewer errors in the tracking results, and therefore highly accurate tracking.

In addition, because computing the average color requires relatively little computation and memory, the method of the present embodiment can be implemented even on embedded devices with limited computational resources.

<Others>
The above embodiment merely illustrates an example configuration of the present invention. The present invention is not limited to the specific form described above, and various modifications are possible within the scope of its technical idea.

For example, in the above embodiment, when a plurality of peaks appear in the composite likelihood map, the peak to select is determined in consideration of the average color. More generally, however, it suffices to select the peak with the highest image similarity between the area near the center position of the tracking object in the previous frame image and the area near the peak position in the current frame image. As ways of evaluating image similarity other than by average color, one can, for example, take as the feature at least one piece of image information expressed as a scalar, such as average luminance or a representative color, and use at least one of the difference, the absolute value of the difference, and the square of the difference as the similarity measure. Alternatively, at least one of a shape feature vector such as HOG and a color feature vector such as a color histogram can be extracted, and the similarity measured based on at least one of histogram intersection, the Bhattacharyya coefficient, and the Earth Mover's Distance. Similarity can also be measured by template matching. Instead of a similarity, a dissimilarity can be measured based on at least one of the sum of squared differences and the sum of absolute differences.
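
Two of the histogram-based measures named above have short standard definitions, sketched here for normalized histograms given as equal-length lists. The function names are chosen for the example only; both measures equal 1 for identical normalized histograms and 0 for histograms with no overlapping bins.

```python
import math
from typing import Sequence


def histogram_intersection(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum of bin-wise minima."""
    return sum(min(a, b) for a, b in zip(p, q))


def bhattacharyya_coefficient(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum of the square roots of the bin-wise products."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))
```

Either function could replace the average color comparison when choosing among peaks, with the peak of highest similarity being selected.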

The above embodiment performs tracking based on the method described in Non-Patent Document 1 (called Staple), but the algorithm that computes the likelihood map representing the likelihood that the tracking object exists in the current frame image is not limited to the method of the above embodiment. For example, the likelihood map may be computed from shape features alone or from color features alone. Besides applying a correlation filter, the likelihood map may also be computed with a deep learning model such as a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), or an LSTM (Long Short-Term Memory). The present invention selects one of the peaks in consideration of an image similarity, such as the center average color, when a plurality of peaks appear in the likelihood map, and is applicable regardless of the algorithm used to compute the likelihood map.

The above embodiment processes the fisheye image without plane development, but the processing target may instead be a plane-developed fisheye image or an image captured by an ordinary camera.

<Additional notes>
(1) An object tracking device (1) comprising:
an acquisition means (21) for acquiring the position of an object in a first frame image; and
a tracking means (24) for determining the position of the object from a second frame image, which is a frame image after the first frame image,
wherein the tracking means comprises:
a feature amount extraction means (25) for extracting a feature amount from a target area of the second frame image;
a likelihood calculation means (26) for obtaining, based on the feature amount, a likelihood map representing the likelihood that the object exists, for the target area of the second frame image; and
a position determining means (27) for specifying, when the likelihood map has one peak, the position of that peak as the position of the object, and specifying, when the likelihood map has a plurality of peaks, as the position of the object, the position of a peak selected in consideration of an image similarity representing the similarity between an image area near the position of the object in the first frame and an image area near each peak in the second frame.

(2) An object tracking method comprising:
an acquisition step (S102) of acquiring the position of an object in a first frame image; and
a tracking step (S106) of obtaining the position of the object from a second frame image, which is a frame image after the first frame image,
wherein the tracking step comprises:
a feature amount extraction step (S302, S304) of extracting a feature amount from a target region of the second frame image;
a likelihood calculation step (S303, S305, S306) of obtaining, based on the feature amount, a likelihood map representing the likelihood that the object is present in the target region of the second frame image; and
a position determination step (S307) of, when the likelihood map has one peak, specifying the position of that peak as the position of the object, and, when the likelihood map has a plurality of peaks, specifying, as the position of the object, the position of a peak selected in consideration of an image similarity representing the similarity between an image region near the position of the object in the first frame and an image region near each peak in the second frame.

1: Human tracking device
2: Monitoring system
10: Fisheye camera
11: Tracking target area
12: Ceiling
13: Person

Claims

An object tracking device comprising:
an acquisition means for acquiring the position of an object in a first frame image; and
a tracking means for obtaining the position of the object from a second frame image, which is a frame image after the first frame image,
wherein the tracking means comprises:
a feature amount extraction means for extracting a feature amount from a target region of the second frame image;
a likelihood calculation means for obtaining, based on the feature amount, a likelihood map representing the likelihood that the object is present in the target region of the second frame image; and
a position determination means that, when the likelihood map has one peak, specifies the position of that peak as the position of the object, and, when the likelihood map has a plurality of peaks, specifies, as the position of the object, the position of a peak selected in consideration of an image similarity representing the similarity between an image region near the position of the object in the first frame image and an image region near each peak in the second frame image.

The object tracking device according to claim 1, wherein, when the likelihood map has a plurality of peaks, the position determination means specifies, as the position of the object, the position of the peak having the highest image similarity among the peaks whose likelihood value is equal to or greater than a threshold.
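The selection rule of this claim can be sketched as follows. This is an illustration only: the function name `select_peak`, the dictionary keys, and the fallback used when no peak reaches the threshold are assumptions, since the claim does not specify what happens in that case.

```python
def select_peak(peaks, threshold):
    """Pick the object position from candidate peaks.

    `peaks` is a list of dicts with the hypothetical keys 'pos',
    'likelihood' and 'similarity' (higher similarity = more alike).
    """
    if not peaks:
        return None
    if len(peaks) == 1:
        # Single peak: its position is the object position.
        return peaks[0]["pos"]
    # Keep only the peaks whose likelihood reaches the threshold.
    candidates = [p for p in peaks if p["likelihood"] >= threshold]
    if not candidates:
        # Assumption (not in the claim): fall back to the strongest peak.
        return max(peaks, key=lambda p: p["likelihood"])["pos"]
    # Among the remaining peaks, take the one most similar to the
    # image region around the object in the first frame.
    return max(candidates, key=lambda p: p["similarity"])["pos"]

demo_candidates = [
    {"pos": (10, 12), "likelihood": 0.90, "similarity": 0.20},  # background
    {"pos": (30, 31), "likelihood": 0.75, "similarity": 0.85},  # true target
    {"pos": (50, 5), "likelihood": 0.10, "similarity": 0.95},   # too weak
]
demo_position = select_peak(demo_candidates, threshold=0.5)
```

In this example the highest-likelihood peak is rejected in favour of a slightly weaker peak whose surroundings look more like the tracked object, which is how the selection can reduce drift to the background.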

The object tracking device according to claim 2, wherein the threshold is determined for each peak according to the image similarity.

The object tracking device according to any one of claims 1 to 3, wherein the feature amount extraction means extracts a first feature amount, which is a feature amount relating to shape, and a second feature amount, which is a feature amount relating to color or luminance, and
the likelihood calculation means obtains, as the likelihood map, a composite likelihood map obtained by combining a first likelihood based on the first feature amount and a second likelihood based on the second feature amount.
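One simple way to combine the two likelihoods is a per-pixel weighted sum. The sketch below assumes this form; the claim only requires that the two likelihoods be combined, and the function name `composite_likelihood` and the default `color_weight` are hypothetical choices made for illustration.

```python
import numpy as np

def composite_likelihood(shape_map, color_map, color_weight=0.3):
    """Blend a shape-based and a color-based likelihood map.

    Assumption: both maps cover the same target region at the same
    resolution and are on comparable scales. `color_weight` is an
    illustrative value, not one taken from the source.
    """
    shape_map = np.asarray(shape_map, dtype=float)
    color_map = np.asarray(color_map, dtype=float)
    if shape_map.shape != color_map.shape:
        raise ValueError("likelihood maps must have the same shape")
    if not 0.0 <= color_weight <= 1.0:
        raise ValueError("color_weight must lie in [0, 1]")
    # Per-pixel convex combination of the two likelihoods.
    return (1.0 - color_weight) * shape_map + color_weight * color_map

demo_composite = composite_likelihood([[1.0, 0.0]], [[0.0, 1.0]], color_weight=0.25)
```

Setting `color_weight` to 0 or 1 reduces the composite to the shape-only or color-only map.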

The object tracking device according to claim 4, wherein the first feature amount is at least one of a HOG feature amount, an LBP feature amount, a SIFT feature amount, and a SURF feature amount, and
the second feature amount is at least one of a luminance histogram, a color histogram, and a Color Names feature amount.

The object tracking device according to any one of claims 1 to 5, wherein the image similarity is:
determined based on at least one of a difference, the absolute value of the difference, and the square of the difference in image information of the image regions, the image information including at least one of an average color, an average luminance, and a representative color;
determined based on at least one of a histogram intersection, a Bhattacharyya coefficient, and an Earth Mover's Distance of at least one of a first feature amount, which is a feature amount relating to shape, and a second feature amount, which is a feature amount relating to color or luminance, in the image regions; or
determined by template matching on the image regions.
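Three of the measures named in this claim can be sketched in a few lines. The sketch assumes that image regions are H x W x 3 arrays and that histograms are non-negative 1-D arrays with a positive sum; all function names are hypothetical.

```python
import numpy as np

def mean_color_distance(patch_a, patch_b):
    """Euclidean distance between the average colors of two patches.

    Smaller values mean more similar regions (a dissimilarity measure).
    """
    mean_a = np.asarray(patch_a, dtype=float).reshape(-1, 3).mean(axis=0)
    mean_b = np.asarray(patch_b, dtype=float).reshape(-1, 3).mean(axis=0)
    return float(np.linalg.norm(mean_a - mean_b))

def _normalize(hist):
    """Scale a histogram so its bins sum to 1."""
    hist = np.asarray(hist, dtype=float)
    return hist / hist.sum()

def histogram_intersection(hist_a, hist_b):
    """Sum of bin-wise minima of two normalized histograms (1 = identical)."""
    return float(np.minimum(_normalize(hist_a), _normalize(hist_b)).sum())

def bhattacharyya_coefficient(hist_a, hist_b):
    """Sum of sqrt(p * q) over bins of two normalized histograms (1 = identical)."""
    return float(np.sqrt(_normalize(hist_a) * _normalize(hist_b)).sum())

# Two uniform-color patches whose mean colors differ by (3, 4, 0).
dark_patch = np.zeros((4, 4, 3))
tinted_patch = np.tile(np.array([3.0, 4.0, 0.0]), (4, 4, 1))
demo_distance = mean_color_distance(dark_patch, tinted_patch)
```

Note the differing conventions: the two histogram measures grow with similarity, while the mean color distance shrinks, so a distance would need to be negated or inverted before peaks are ranked by "highest similarity".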

The object tracking device according to any one of claims 1 to 6, wherein the first frame image and the second frame image are fisheye images obtained by a fisheye camera.

An object tracking method comprising:
an acquisition step of acquiring the position of an object in a first frame image; and
a tracking step of obtaining the position of the object from a second frame image, which is a frame image after the first frame image,
wherein the tracking step comprises:
a feature amount extraction step of extracting a feature amount from a target region of the second frame image;
a likelihood calculation step of obtaining, based on the feature amount, a likelihood map representing the likelihood that the object is present in the target region of the second frame image; and
a position determination step of, when the likelihood map has one peak, specifying the position of that peak as the position of the object, and, when the likelihood map has a plurality of peaks, specifying, as the position of the object, the position of a peak selected in consideration of an image similarity representing the similarity between an image region near the position of the object in the first frame image and an image region near each peak in the second frame image.

A program for causing a computer to perform each step of the method according to claim 8.