JP2021149687A - Device, method and program for object recognition

Info

Publication number
JP2021149687A
Authority
JP
Japan
Prior art keywords
congestion
photographing means
degree
weighting
image
Legal status
Pending
Application number
JP2020050235A
Other languages
Japanese (ja)
Inventor
Goji Mito (豪二 水戸)
Takumi Munekata (匠 宗片)
Current Assignee
Secom Co Ltd
Original Assignee
Secom Co Ltd
Application filed by Secom Co Ltd
Priority to JP2020050235A
Publication of JP2021149687A

Abstract

To provide an object recognition technique capable of effectively preventing a decrease in the accuracy of object recognition caused by congestion.

SOLUTION: An object recognition device 1 recognizes an object on the basis of images captured by a plurality of photographing means 10a, 10b, 10c having a common field of view. A congestion degree estimating means 130 estimates, for each photographing means 10a, 10b, 10c, the degree of congestion of the objects captured in the captured image. An individual recognition means 131 analyzes the captured image of each photographing means 10a, 10b, 10c, recognizes all or part of the object in the captured image, and generates an individual recognition result. A weighting determination means 132 determines the weight of each photographing means 10a, 10b, 10c according to the degree of congestion at the position where the individual recognition means 131 recognized the object in the image captured by that photographing means. An integrated recognition means 133 integrates the individual recognition results of the photographing means 10a, 10b, 10c on the basis of the weights.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for recognizing an object based on images, and more particularly to a technique for recognizing an object based on images captured by a plurality of photographing means having a common field of view.

For purposes such as security, object recognition is performed on images captured by cameras, for example detecting and tracking objects such as people or recognizing their posture. In such cases, recognition accuracy can be improved by giving a plurality of cameras a common field of view and shooting from a plurality of directions.

For example, Patent Document 1 describes a moving object tracking device that tracks moving objects such as people with a plurality of cameras having a common field of view: template matching is performed on the image captured by each camera, and the moving object positions obtained for each camera are weighted by their likelihoods and integrated in a common coordinate system. In this way, even if moving objects overlap one another in the image of some camera and the likelihood for that camera decreases, the information from the other cameras can compensate, so highly accurate tracking can continue.

Japanese Unexamined Patent Publication No. 2010-049296

However, the prior art has a problem in that it cannot effectively prevent the decrease in recognition accuracy caused by congestion around the object of interest. For example, in the moving object tracking device described in Patent Document 1, as long as objects of the same type overlap, erroneous matching can accidentally produce a high likelihood, and the likelihood obtained after the fact does not reveal in which camera's image the erroneous matching occurred, so it is difficult to exclude positions obtained by erroneous matching from the integration. Moreover, the higher the degree of congestion, the more likely erroneous matching becomes.

The present invention has been made in view of the above problems, and an object of the present invention is to provide an object recognition device, an object recognition method, and an object recognition program capable of effectively preventing a decrease in the accuracy of object recognition caused by congestion.

(1) An object recognition device according to the present invention is an object recognition device that recognizes an object based on captured images taken by a plurality of photographing means having a common field of view, comprising: congestion degree estimating means for estimating, for each photographing means, the degree of congestion of objects captured in the captured image; individual recognition means for analyzing the captured image of each photographing means, recognizing all or part of the object in the captured image, and generating an individual recognition result; weighting determination means for determining a weight for each photographing means according to the degree of congestion at the position where the individual recognition means recognized the object in the image captured by that photographing means; and integrated recognition means for integrating the individual recognition results of the photographing means based on the weights.

(2) In the object recognition device according to (1) above, the congestion degree estimating means estimates the degree of congestion at arbitrary positions in the captured image by inputting the captured image into an estimator trained in advance to output, given a captured image, the degree of congestion at an arbitrary position in that image, and the weighting determination means determines the weight of each photographing means for each region of the captured image according to the degree of congestion.

(3) In the object recognition device according to (1) or (2) above, the individual recognition means analyzes the captured image of each photographing means to obtain position information of the object in the captured image at the current time, and the integrated recognition means integrates the position information of the photographing means based on the weights to determine the position of the object at the current time.

According to the present invention, it is possible to provide an object recognition device, an object recognition method, and an object recognition program that can effectively prevent a decrease in the accuracy of object recognition caused by congestion.

FIG. 1 is a block diagram showing the schematic configuration of the three-dimensional position estimation device.
FIG. 2 is a diagram showing the relationship between a person and a crowd and the images captured by each photographing means.
FIG. 3 is an enlarged view of the person 200 in FIG. 2.
FIG. 4 is a schematic flowchart showing the overall processing of the three-dimensional position estimation device in Embodiment 1.
FIG. 5 is a sub-flowchart showing the three-dimensional position estimation process.
FIG. 6 is a block diagram showing the schematic configuration of the three-dimensional tracking device.
FIG. 7 is a diagram showing the relationship between a tracked person and a crowd and the images captured by each photographing means.
FIG. 8 is an explanatory diagram of the hypotheses, likelihoods, and weights of a tracked person.
FIG. 9 is a flowchart showing the overall processing of the three-dimensional tracking device.
FIG. 10 is an explanatory diagram illustrating another example of the object recognition device.

[Embodiment 1]
Hereinafter, a three-dimensional position estimation device, which is an example of the object recognition device according to an embodiment of the present invention (hereinafter referred to as Embodiment 1), will be described. The three-dimensional position estimation device estimates the three-dimensional position of a person within a common field of view based on captured images taken by a plurality of photographing means having the common field of view.

FIG. 1 is a block diagram showing the schematic configuration of the three-dimensional position estimation device 1. The three-dimensional position estimation device 1 comprises photographing means 10a, 10b, 10c, a communication unit 11, a storage unit 12, an image processing unit 13, and a display unit 14.

The photographing means 10a, 10b, and 10c are cameras that acquire images, which are collections of target data; in the present embodiment they are surveillance cameras. The photographing means 10a, 10b, and 10c have a common field of view and are synchronized with one another. They are connected to the image processing unit 13 via the communication unit 11, capture the monitored space at predetermined time intervals to generate images, and sequentially input the generated images to the image processing unit 13. For example, the photographing means 10a, 10b, and 10c are installed on the walls of an indoor monitored space with predetermined fixed fields of view overlooking that space, and capture the space at intervals of 1/5 second to generate color or monochrome images. Although Embodiment 1 shows an example with three photographing means, at least two photographing means suffice. To increase the chance that an image with a low degree of congestion is captured, the more photographing means the better, and the larger the difference in direction from the center of gravity of the common field of view to the installation position of each photographing means, the better.

The photographing means 10a, 10b, and 10c are calibrated in advance, and a common three-dimensional coordinate system (a so-called world coordinate system) is defined. Hereinafter, this coordinate system is referred to as the XYZ coordinate system. The two-dimensional coordinate system specific to the captured image of each photographing means 10a, 10b, 10c (a so-called camera coordinate system) is referred to as the xy coordinate system.

The communication unit 11 is a communication circuit, one end of which is connected to the image processing unit 13 and the other end of which is connected to the photographing means 10a, 10b, 10c and the display unit 14. The communication unit 11 acquires images from the photographing means 10a to 10c and inputs them to the image processing unit 13. The communication unit 11 also outputs the object recognition result from the image processing unit 13 to the display unit 14.

The photographing means 10a to 10c, the communication unit 11, the storage unit 12, the image processing unit 13, and the display unit 14 are connected in a form appropriate to the installation location of each unit. For example, when the photographing means 10a to 10c are installed remotely from the communication unit 11 and the image processing unit 13, the photographing means 10a to 10c and the communication unit 11 can be connected via an Internet line. The communication unit 11 and the image processing unit 13 can be connected by a bus. In addition, a LAN (Local Area Network), various cables, and the like can be used as connection means.

The storage unit 12 is a memory device such as a ROM (Read Only Memory) or RAM (Random Access Memory), and stores various programs and various data. For example, the storage unit 12 stores training data and information on an estimator, which is a trained model, and exchanges this information with the image processing unit 13. That is, information used for training the estimator, information generated in the course of that processing, and the like are input to and output from the storage unit 12 and the image processing unit 13.

The image processing unit 13 is composed of arithmetic devices such as a CPU (Central Processing Unit), DSP (Digital Signal Processor), MCU (Micro Control Unit), or GPU (Graphics Processing Unit). The image processing unit 13 operates as various processing and control means by reading programs from the storage unit 12 and executing them, reads various data from the storage unit 12 as necessary, and stores generated data in the storage unit 12. For example, the image processing unit 13 trains and generates an estimator and stores the generated estimator in the storage unit 12 via the communication unit 11.

The display unit 14 is a liquid crystal display, an organic EL (Electro-Luminescence) display, or the like, and displays the recognition result of the moving object input from the image processing unit 13 via the communication unit 11.

The image processing unit 13 functions as congestion degree estimating means 130, two-dimensional position estimating means (individual recognition means) 131, weighting determination means 132, three-dimensional position estimating means (integrated recognition means) 133, and estimation result output means 134.

The congestion degree estimating means 130 estimates, for each photographing means 10a, 10b, 10c, the degree of congestion of the objects captured in the captured image. In the present embodiment, the congestion degree estimating means 130 estimates the degree of congestion at arbitrary positions in the captured image by inputting the captured image into an estimator trained in advance to output, given a captured image, the degree of congestion at an arbitrary position in that image. Specifically, the congestion degree estimating means 130 inputs the captured image into an estimator trained in advance to output, given an image, a congestion degree map in which the degree of congestion is estimated for each pixel, causes the estimator to output the congestion degree map of the captured image, and stores the obtained congestion degree map in the storage unit 12.

The estimator can be realized concretely using deep learning techniques. That is, the estimator can be modeled as a CNN (convolutional neural network) that, given an image, outputs the congestion degree map of that image. For training, for example, a large number of training images in which crowds are photographed are prepared, together with a congestion degree map for each training image obtained by setting, for each person's head, a probability density function whose mean is the position of the center of gravity of that head and whose variance depends on the size of the head, and summing the values of these functions pixel by pixel. The model is then trained in advance so that its output for each training image approaches the congestion degree map corresponding to that image. The trained model obtained in this way is stored in the storage unit 12 as an estimator forming part of the program of the congestion degree estimating means 130. For example, the MCNN (multi-column convolutional neural network) described in "Single image crowd counting via multi-column convolutional neural network", Zhang, Y., Zhou et al., CVPR 2016 is an example of such an estimator, and the crowd density map described in that paper is an example of a congestion degree map. In the present embodiment, the congestion degree estimating means 130 predetermines an upper limit T0 of the degree of congestion at which a decrease in recognition accuracy is tolerable, divides the degree of congestion output by the estimator by the upper limit T0, and normalizes the quotient to 1.0 when it is 1.0 or more. That is, in the present embodiment the range of the degree of congestion is [0, 1].

The congestion degree estimating means 130 extracts, from each congestion degree map, the regions whose degree of congestion is equal to or greater than a predetermined threshold T1 as high congestion regions. The congestion degree estimating means 130 outputs to the weighting determination means 132 congestion information in which the photographing means ID identifying each of the photographing means 10a to 10c is associated with the high congestion regions in the image captured by that photographing means.
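As a minimal sketch of the normalization to [0, 1] and the extraction of high congestion regions described above (the numeric values of T0 and T1 are placeholders; the patent only states that they are predetermined):

```python
import numpy as np

T0 = 4.0   # assumed upper limit of tolerable congestion output by the estimator
T1 = 0.5   # assumed threshold on the normalized congestion degree

def normalize_congestion(raw_map: np.ndarray) -> np.ndarray:
    """Divide the estimator output by T0 and cap the result at 1.0, giving values in [0, 1]."""
    return np.clip(raw_map / T0, 0.0, 1.0)

def high_congestion_mask(congestion_map: np.ndarray) -> np.ndarray:
    """Boolean mask of pixels whose normalized congestion degree is at least T1."""
    return congestion_map >= T1
```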

The two-dimensional position estimating means 131, which is the individual recognition means, analyzes the captured image of each photographing means, recognizes all or part of the object in the captured image, and generates an individual recognition result. Specifically, the captured images from each of the photographing means 10a to 10c are input to a detector that has been trained in advance to detect the region of a person's image (person region) in an image, and the detector outputs (detects) the person regions in each captured image. An individual recognition result is then generated in which the photographing means ID of the photographing means 10a to 10c, the detected person regions, and the centroid positions of those person regions are associated with one another, and the generated individual recognition result is output to the weighting determination means 132 and the three-dimensional position estimating means 133.

The detector is, for example, a trained model obtained by deep learning of a CNN using training data consisting of a large number of training images and ground-truth data indicating person regions enclosing the images of people in those training images. An example of such a CNN is described in "Faster R-CNN: Towards real-time object detection with region proposal networks", Shaoqing Ren et al., NIPS, 2015.
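For illustration only, an individual recognition result for one photographing means could be assembled from generic detector output roughly as follows (the detector itself, the bounding-box format, and all field names are assumptions, not taken from the patent):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PersonDetection:
    bbox: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in xy image coordinates
    centroid: Tuple[float, float]             # centroid of the person region

@dataclass
class IndividualResult:
    camera_id: str                            # photographing means ID
    detections: List[PersonDetection]

def make_individual_result(camera_id: str,
                           boxes: List[Tuple[float, float, float, float]]) -> IndividualResult:
    """Associate a photographing means ID with the detected person regions and their centroids."""
    detections = []
    for (x0, y0, x1, y1) in boxes:
        detections.append(PersonDetection(bbox=(x0, y0, x1, y1),
                                          centroid=((x0 + x1) / 2.0, (y0 + y1) / 2.0)))
    return IndividualResult(camera_id=camera_id, detections=detections)
```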

The weighting determination means 132 determines the weight of each photographing means according to the degree of congestion at the position where the individual recognition means recognized the object in the image captured by that photographing means 10a, 10b, 10c.

Specifically, the weighting determination means 132 refers to the individual recognition results input from the two-dimensional position estimating means 131 and sets the upper third of each person region contained in the individual recognition result of each photographing means (hereinafter also referred to as the head region) as "the position where the individual recognition means recognized the object". The weighting determination means 132 then refers to the congestion information input from the congestion degree estimating means 130, calculates for each photographing means, as its weight, the proportion of the head region not occupied by the high congestion region of that photographing means, and outputs the individual recognition result with the weight included to the three-dimensional position estimating means 133.

For example, the weight is determined from the non-overlap rate between the head region and the high congestion region by the following formula:

weight = 1.0 - (area of overlap between the head region and the high congestion region) / (area of the head region)

Alternatively, the weight may be determined from the degree of quietness within the head region by the following formula. In this case, the congestion degree estimating means 130 outputs congestion information in which the photographing means ID and the congestion degree map are associated:

weight = 1.0 - (sum of the congestion degrees within the head region) / (area of the head region)

In other words, the higher the degree of congestion within the head region, the smaller the weight of that individual recognition result; this means that the reliability of the individual recognition result is low due to the influence of the crowd behind. Conversely, the lower the degree of congestion within the head region, the larger the weight; this means that the influence of the crowd is small and the reliability of the individual recognition result is high. Such differences in weight arise because the positional relationship on the captured image between the object to be recognized and the crowd behind it differs depending on the positional relationship with the photographing means. Therefore, by determining the weight of each photographing means according to the degree of congestion at the position where the individual recognition means recognized the object, the reliability of the individual recognition result for that position, which varies with the influence of the crowd, can be evaluated.

The storage unit 12 stores the camera parameters 120 of the photographing means 10a to 10c in order to back-project the centroid positions of the person regions obtained in the xy coordinate systems of the captured images into the XYZ coordinate system. The camera parameters 120 include external parameters such as the installation positions and imaging directions of the photographing means 10a to 10c in the actual monitored space, and internal parameters such as the focal lengths, angles of view, lens distortions and other lens characteristics of the photographing means 10a to 10c, and the numbers of pixels of their image sensors.
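Returning to the weight formulas above, a minimal sketch assuming the head region and the high congestion region are represented as boolean pixel masks and the normalized congestion map has values in [0, 1] (the mask representation is an illustrative choice, not something the patent prescribes):

```python
import numpy as np

def camera_weight(head_mask: np.ndarray, high_congestion_mask: np.ndarray) -> float:
    """weight = 1.0 - (overlap area of head region and high congestion region) / (head region area)."""
    head_area = head_mask.sum()
    if head_area == 0:
        return 0.0
    overlap = np.logical_and(head_mask, high_congestion_mask).sum()
    return 1.0 - overlap / head_area

def camera_weight_from_map(head_mask: np.ndarray, congestion_map: np.ndarray) -> float:
    """Alternative: weight = 1.0 - (sum of congestion degrees inside the head region) / (head region area)."""
    head_area = head_mask.sum()
    if head_area == 0:
        return 0.0
    return 1.0 - float(congestion_map[head_mask].sum()) / head_area
```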

The three-dimensional position estimating means 133, which is the integrated recognition means, integrates the individual recognition results of the photographing means based on the weights. In Embodiment 1, the position information of each photographing means is integrated based on the weights to determine the position of the object, and the determined position is output to the estimation result output means 134. The position information for each photographing means is the centroid position contained in the individual recognition result of that photographing means 10a, 10b, 10c, and the position determined for the object is its three-dimensional position.

Specifically, the three-dimensional position estimating means 133 first refers to the individual recognition results of the photographing means 10a, 10b, 10c input from the two-dimensional position estimating means 131 and to the camera parameters 120 of the photographing means 10a, 10b, 10c stored in the storage unit 12 and, for each photographing means, back-projects each object centroid position contained in the individual recognition result of that photographing means into the XYZ coordinate system using the camera parameters 120 of that photographing means, thereby deriving a line-of-sight vector passing through the centroid position of each object.
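A minimal sketch of such a back-projection under the usual pinhole-camera assumption, where the camera parameters are taken to provide an intrinsic matrix K, a world-to-camera rotation R, and a camera center c (this particular parameterization is an assumption; the patent only lists the kinds of internal and external parameters stored):

```python
import numpy as np

def backproject_to_ray(u: float, v: float,
                       K: np.ndarray, R: np.ndarray, c: np.ndarray):
    """Return the origin and unit direction, in the XYZ world system, of the line of sight
    through image point (u, v) of a pinhole camera with intrinsics K, rotation R, and center c."""
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])   # viewing direction in camera coordinates
    d_world = R.T @ d_cam                              # rotate into the XYZ (world) coordinate system
    return c, d_world / np.linalg.norm(d_world)
```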

Next, referring to the weights of the photographing means 10a, 10b, 10c input from the weighting determination means 132, it calculates, for each object, the three-dimensional position at which the weighted sum of the distances to the line-of-sight vectors from the photographing means is minimized, and takes that position as the three-dimensional position of the object.

The three-dimensional position P of each object is obtained as follows. Let W_C be the weight, with respect to that object, of photographing means C (the photographing means whose photographing means ID is C is denoted photographing means C), let V_C be the line-of-sight vector from photographing means C passing through the centroid position of the object, and let D(V_C, P) be the distance between V_C and the three-dimensional position P. Then the three-dimensional position P that minimizes Σ W_C × D(V_C, P) is found by the least squares method, where Σ denotes the sum over C.
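The patent states that P is found by the least squares method; the sketch below uses the common closed-form variant that minimizes the weighted sum of squared point-to-line distances (an assumption about the exact objective), with each line of sight given by an origin and a unit direction:

```python
import numpy as np

def triangulate_weighted(origins, directions, weights):
    """Return the 3D point P minimizing sum_C W_C * dist(P, line_C)^2, together with the
    weighted sum of (unsquared) distances at P, which can be compared against the threshold TD."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, d, w in zip(origins, directions, weights):
        M = np.eye(3) - np.outer(d, d)     # projector onto the plane perpendicular to the line
        A += w * M
        b += w * (M @ c)
    P = np.linalg.solve(A, b)
    dist_sum = sum(w * np.linalg.norm((np.eye(3) - np.outer(d, d)) @ (P - c))
                   for c, d, w in zip(origins, directions, weights))
    return P, dist_sum
```

The returned weighted distance sum can then be compared against the threshold TD when testing brute-force combinations of line-of-sight vectors, as described next.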

Note that it is difficult to specify in advance which combination of a line-of-sight vector from photographing means 10a, one from photographing means 10b, and one from photographing means 10c corresponds to the same object. Therefore, for example, the three-dimensional position estimating means 133 attempts to calculate a three-dimensional position for every combination (by brute force), discards combinations for which the minimized weighted sum of distances is equal to or greater than a predetermined threshold TD, and regards only combinations for which the minimized weighted sum of distances is less than the threshold TD as belonging to the same object.

That is, the three-dimensional position is determined by integration in which centroid positions from photographing means with larger weights are given more importance and centroid positions from photographing means with smaller weights are given less importance. This makes possible highly accurate integration in which the influence of errors arising in the individual recognition results of each photographing means due to the presence of the crowd is reduced. The object can therefore be recognized with high accuracy.

The estimation result output means 134 generates an estimation result and outputs it to the outside of the image processing unit 13. It generates an image combining the captured image with a projection obtained by drawing an × mark representing the three-dimensional position of the person in the virtual space of the XYZ coordinate system and projecting it into two dimensions, and outputs this image to the communication unit 11. The image is transmitted by the communication unit 11 and displayed on the display unit 14.

Next, a processing example of the three-dimensional position estimation device 1 in Embodiment 1 will be described. As shown in FIG. 2, each of the photographing means 10a, 10b, and 10c captures a person 200 and a crowd 210 present in the common field of view as captured images 221, 222, and 223.

The two-dimensional position estimating means 131 generates an individual recognition result at least for the person 200. That is, for photographing means 10a it generates a person region 231 surrounding the person 200 in the captured image 221 and its centroid position 241. For photographing means 10b it generates a person region 232 surrounding the person 200 in the captured image 222 and its centroid position 242; the person region 232 is detected larger than the true person region due to the influence of the image of the crowd, and the centroid position 242 is also shifted from the true centroid position. For photographing means 10c it generates a person region 233 surrounding the person 200 in the captured image 223 and its centroid position 243. The congestion degree estimating means 130 extracts high congestion regions 251, 252, and 253 from the captured images 221, 222, and 223.

The weighting determination means 132 calculates a weight according to the non-overlap rate between the upper third of the person region (the head region) and the high congestion region. For photographing means 10a and 10c, the upper thirds of the person regions 231 and 233 do not overlap the high congestion regions 251 and 253, so the weight is 1.0. For photographing means 10b, the upper third of the person region 232 overlaps the high congestion region 252, and the weight is 0.2.

The three-dimensional position estimating means 133 uses the camera parameters 120 of the photographing means 10a, 10b, and 10c to derive line-of-sight vectors V1, V2, and V3 passing through the centroid positions 241, 242, and 243, respectively. For photographing means 10b, since the person region 232 and the centroid position 242 deviate from the true ones, the line-of-sight vector V2 deviates relative to the line-of-sight vectors V1 and V3.

FIG. 3 is an enlarged view of the area around the person 200 in FIG. 2. The three-dimensional position 360 is the three-dimensional position of the person 200 that would be obtained if it were determined, without weighting, so as to minimize the distances to the line-of-sight vectors V1, V2, and V3; it is shifted from the actual centroid position of the person 200.

The three-dimensional position 361 is the position determined so that the weighted sum of the distances to the line-of-sight vectors V1, V2, and V3 is minimized, and it indicates approximately the centroid position of the actual person 200. The distance D1 from the line-of-sight vector V1 to the three-dimensional position 361 and the distance D3 from the line-of-sight vector V3 to the three-dimensional position 361 are shorter than the distance D2 from the line-of-sight vector V2 to the three-dimensional position 361. This shows that the distances D1 and D3 were evaluated with large weights and the distance D2 with a small weight. In this way, the weighting of the photographing means 10a, 10b, and 10c reduces the contribution of the line-of-sight vector V2 to the three-dimensional position 361 and increases the contributions of the line-of-sight vectors V1 and V3, so that the calculation of the three-dimensional position 361 becomes highly accurate.

[Operation of the three-dimensional position estimation device 1]
FIG. 4 is a flowchart showing the overall processing of the three-dimensional position estimation device 1 in Embodiment 1. Steps S100 to S150 of FIG. 4 are repeated each time captured images are input from the photographing means 10a, 10b, and 10c.

The captured images from the photographing means 10a, 10b, and 10c are input to the image processing unit 13 (S100). The image processing unit 13 operates as the congestion degree estimating means 130, inputs each of the captured images from the photographing means 10a, 10b, and 10c into the estimator to generate a congestion degree map for each photographing means, and extracts from each congestion degree map the high congestion regions whose degree of congestion is equal to or greater than the threshold T1 (S110).

The image processing unit 13 operates as the two-dimensional position estimating means 131, inputs each of the captured images from the photographing means 10a, 10b, and 10c into the detector to detect person regions, and generates individual recognition results in which the photographing means ID, the person regions, and the centroid positions of the person regions are associated (S120). The image processing unit 13 then operates as the weighting determination means 132, takes the high congestion regions and the individual recognition results as input, and determines weights according to the non-overlap rate between the head region (the upper third of the person region) and the high congestion region (S130).

The image processing unit 13 operates as the three-dimensional position estimating means 133, takes the individual recognition results and the weights as input, and estimates the three-dimensional positions (S140). FIG. 5 is a sub-flowchart showing the processing of the three-dimensional position estimating means 133.

The three-dimensional position estimating means 133 reads the camera parameters 120 from the storage unit 12, back-projects the centroid position of each person for each photographing means contained in the individual recognition results, and calculates the line-of-sight vector from that photographing means passing through that centroid position (S141). Under the condition that one line-of-sight vector is selected for each of the photographing means 10a, 10b, and 10c, the three-dimensional position estimating means 133 generates combinations of line-of-sight vectors by brute force and sets the generated combinations in turn as the combination to be processed (S142).

For the combination being processed, the three-dimensional position estimating means 133 derives the three-dimensional position at which the weighted sum of the distances from the line-of-sight vectors constituting the combination is minimized (S143). The three-dimensional position estimating means 133 then determines whether the minimized weighted sum of distances is less than the predetermined threshold TD (S144). If the weighted sum of distances is less than the threshold TD, the process proceeds to S145; if it is equal to or greater than the threshold TD, S145 is skipped and the process proceeds to S146. If the weighted sum of distances is less than the threshold TD, the combination of line-of-sight vectors is regarded as belonging to the same object, and the three-dimensional position is temporarily stored in the storage unit 12 (S145).

The three-dimensional position estimating means 133 checks whether all of the combinations generated in step S142 have been processed (S146). If all of the combinations have been processed, the process proceeds to S147; if there is an unprocessed combination, the process returns to S142 and the next combination is processed.

Among the three-dimensional positions temporarily stored in step S145, positions that are close to one another are merged into one as relating to the same person (S147). That is, since multiple three-dimensional positions may be calculated for a single person, these duplicates are eliminated. This prevents false detections caused by multiple person regions being detected for one person in the processing of the two-dimensional position estimating means 131. It also prevents false detections caused by combinations of line-of-sight vectors of different objects, among those generated by the three-dimensional position estimating means 133 in step S142, happening to remain because their weighted sum of distances falls below the threshold TD. For example, the three-dimensional position estimating means 133 clusters the three-dimensional positions using a method such as the group average method or Ward's method and takes the representative value of each cluster as the three-dimensional position of one person. The three-dimensional position estimating means 133 then erases the temporarily stored three-dimensional positions and proceeds to step S150 of FIG. 4.
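A minimal sketch of this merging step, using SciPy's average-linkage (group average) hierarchical clustering as one possible realization; the distance cutoff is an assumed parameter:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def merge_positions(positions: np.ndarray, cutoff: float = 0.3) -> np.ndarray:
    """Cluster nearby 3D positions (group average linkage) and return one representative
    position (the cluster mean) per person."""
    if len(positions) <= 1:
        return positions
    Z = linkage(positions, method='average')        # group average method; 'ward' would give Ward's method
    labels = fcluster(Z, t=cutoff, criterion='distance')
    return np.array([positions[labels == k].mean(axis=0) for k in np.unique(labels)])
```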

The image processing unit 13 operates as the estimation result output means 134, takes as input the three-dimensional positions integrated in step S147, generates a display image indicating those positions, and causes the display unit 14 to display the display image via the communication unit 11 (S150).

[Modifications of Embodiment 1]
(1-1) In Embodiment 1, the two-dimensional position estimating means 131, which is the individual recognition means, generates the individual recognition result using the person regions output by the detector as they are; however, the individual recognition result may instead be generated by first merging person regions with a high degree of overlap into one and then calculating the centroid position. Possible ways of merging include selecting the person region with the highest likelihood at detection time, or averaging the regions weighted by their likelihoods at detection time.

(1-2) In Embodiment 1, an example with three photographing means 10a, 10b, and 10c was described, but four or more photographing means may be used. When four or more photographing means are used, the combinations of line-of-sight vectors generated by the three-dimensional position estimating means 133, which is the integrated recognition means, may contain fewer line-of-sight vectors than the number of photographing means. For example, combinations that select the line-of-sight vectors of three photographing means from among the line-of-sight vectors of four photographing means are generated by brute force.

(1-3) In Embodiment 1, an example was shown in which the two-dimensional position estimating means 131, which is the individual recognition means, detects person regions from the captured image at each time (a still image, so to speak); however, person regions may also be detected by tracking each person using the captured images at successive times (a moving image, so to speak). In that case, once the combination of line-of-sight vectors belonging to the same object has been identified for a person, the brute-force trial of combinations can be omitted for that person thereafter.

[Embodiment 2]
In Embodiment 2, a three-dimensional tracking device, which is another example of the object recognition device, will be described. The three-dimensional tracking device in Embodiment 2 tracks a person within a common field of view based on captured images taken by a plurality of photographing means having the common field of view.

In Embodiment 2, tracking is performed by a method based on the particle filter. At each time, for each object being tracked, a plurality of candidates for the position of the object are set, a hypothesis corresponding to each candidate is set, and the position of the object is determined by integrating the hypotheses. In this specification, the single position determined for each tracked object at each time is referred to as the object position, and the plurality of candidates set for each tracked object at each time are referred to as candidate positions. That is, the candidates for the object position are the candidate positions.

In Embodiment 1, the "position where the individual recognition means recognized the object" referred to by the weighting determination means 132 when determining the weights was the upper third of the person region in which the two-dimensional position estimating means 131, the individual recognition means, detected the object, and the target of the weighting was the centroid position. In Embodiment 2, the "position where the individual recognition means recognized the object" referred to by the weighting determination means 532 when determining the weights is the position at which the candidate position setting and evaluation means 531, the individual recognition means, calculated the likelihood of the object, that is, the head projection region determined by the candidate position, and the target of the weighting is the likelihood. Hereinafter, the likelihood calculated by the candidate position setting and evaluation means 531 is referred to as the individual likelihood, and the likelihood obtained by integrating the individual likelihoods is referred to as the integrated likelihood.

FIG. 6 is a block diagram showing the configuration of the three-dimensional tracking device 5 in Embodiment 2. The photographing means 50a, 50b, 50c, the communication unit 51, and the display unit 54 are the same as the photographing means 10a, 10b, 10c, the communication unit 11, and the display unit 14 of Embodiment 1. The image processing unit 53 functions as congestion degree estimating means 530, candidate position setting and evaluation means (individual recognition means) 531, weighting determination means 532, object position determining means (integrated recognition means) 533, and tracking result output means 534. In addition to the camera parameters 520, the storage unit 52 stores object information 521.

The congestion degree estimating means 530 of Embodiment 2 is the same as the congestion degree estimating means 130 of Embodiment 1, except that its output destinations are the weighting determination means 532 and the tracking result output means 534. The camera parameters 520 are the same as the camera parameters 120 of Embodiment 1, but in Embodiment 2 they are used to project candidate positions and the like in the XYZ coordinate system onto the xy coordinate systems.

The object information 521 stores a three-dimensional shape model of the moving object and information on the moving objects being tracked. Specifically, the three-dimensional shape model of the moving object is a model formed by connecting three spheroids imitating the three-dimensional shapes of the head, torso, and legs of a standing person. Alternatively, a model imitating the three-dimensional shape of the whole body of a standing person with a single spheroid may be used.

The information on the moving objects being tracked is stored in association with an object ID identifying each person being tracked, and comprises the template of that person associated with the photographing means ID of each photographing means, the object position of that person in the XYZ coordinate system, and the hypotheses of that person. For each hypothesis, a hypothesis ID and a candidate position in the XYZ coordinate system are stored. Furthermore, for each hypothesis and in association with the photographing means ID of each photographing means, the following are stored: the whole-body projection region and the head projection region, onto the xy coordinate system of that photographing means, of the three-dimensional shape model placed at the candidate position; the individual likelihood of the candidate position calculated using the image captured by that photographing means; and the weight of the candidate position for that photographing means.
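For illustration, the tracked-object record described above might be organized as follows (all field names and container types are assumptions made for this sketch; the patent only specifies which pieces of information are associated with one another):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Hypothesis:
    hypothesis_id: int
    candidate_position: Tuple[float, float, float]                         # XYZ coordinate system
    body_projection: Dict[str, object] = field(default_factory=dict)       # camera ID -> whole-body region (xy)
    head_projection: Dict[str, object] = field(default_factory=dict)       # camera ID -> head region (xy)
    individual_likelihood: Dict[str, float] = field(default_factory=dict)  # camera ID -> individual likelihood L
    weight: Dict[str, float] = field(default_factory=dict)                 # camera ID -> weight W

@dataclass
class TrackedObject:
    object_id: int
    templates: Dict[str, object]                                           # camera ID -> appearance template
    object_position: Tuple[float, float, float]                            # XYZ coordinate system
    hypotheses: List[Hypothesis] = field(default_factory=list)
```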

The candidate position setting and evaluation means 531, which is the individual recognition means, analyzes the captured image of each photographing means, recognizes all or part of the object in the captured image, and generates an individual recognition result. In Embodiment 2, for each object being tracked, the candidate positions at the current time are predicted from past position information (object positions or candidate positions), the regions determined by each candidate position and the object shape (the whole-body projection region and the head projection region) are calculated in the image captured by each photographing means, and hypotheses containing the candidate position, both projection regions, and the individual likelihood, which is the degree to which the image features of the object appear in the whole-body projection region, are generated as individual recognition results and stored in the object information 521 of the storage unit 52.

Specifically, the candidate position setting and evaluation means 531 first refers to the object information 521 stored in the storage unit 52 and, for each person being tracked, extrapolates the current object position (an estimate of the object position) from the past object positions and randomly sets a plurality of candidate positions in the vicinity of the current object position. Alternatively, the current candidate positions may be extrapolated from the past candidate positions. For a person for whom there are not two or more time steps' worth of past object positions or past candidate positions, candidate positions are set in the vicinity of the object position one time step earlier. The object position and the candidate positions at this stage are coordinate values in the XYZ coordinate system.

Next, the candidate position setting and evaluation means 531 refers to the three-dimensional shape model in the object information 521 stored in the storage unit 52 and to the camera parameters 520 and, for each candidate position, projects the three-dimensional shape model placed at that candidate position onto the xy coordinate systems of the photographing means 50a, 50b, and 50c. For each candidate position, it likewise projects the three-dimensional shape model of the head placed at that candidate position onto the xy coordinate systems of the photographing means 50a, 50b, and 50c. Subsequently, for each candidate position of each person being tracked, the candidate position setting and evaluation means 531 generates a hypothesis containing the candidate position and the whole-body projection region and head projection region for each photographing means, and adds it to the object information 521. Then, for each candidate position of each person being tracked, the candidate position setting and evaluation means 531 extracts the image features of the whole-body projection region in the images captured by the photographing means 50a, 50b, and 50c, calculates individual likelihoods La, Lb, and Lc based on the similarity to the image features of the template of that person, and updates the object information 521 by adding the calculated individual likelihoods La, Lb, and Lc to the corresponding hypothesis. The upper third of the whole-body projection region may approximately be used as the head projection region; likewise, when the three-dimensional shape of the whole body is a single spheroid, the upper third of the whole-body projection region may be used as the head projection region.
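A minimal sketch of the candidate-position prediction step; linear extrapolation plus Gaussian scatter is one common particle-filter-style choice, and the particle count and noise scale are assumed parameters:

```python
from typing import Optional

import numpy as np

def predict_candidates(prev_position: np.ndarray,
                       prev_prev_position: Optional[np.ndarray] = None,
                       n_candidates: int = 50,
                       noise_std: float = 0.1) -> np.ndarray:
    """Extrapolate the current object position from past object positions and randomly
    scatter candidate positions (XYZ) around it."""
    if prev_prev_position is None:
        predicted = prev_position                            # fewer than two past positions available
    else:
        predicted = prev_position + (prev_position - prev_prev_position)
    rng = np.random.default_rng()
    return predicted + rng.normal(scale=noise_std, size=(n_candidates, 3))
```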

The weighting determination means 532 determines the weight W of each photographing means according to the degree of congestion at the position where the individual recognition means recognized the object on the image captured by that photographing means. In the second embodiment, for each candidate position, the weights Wa, Wb, Wc applied to the individual likelihoods La, Lb, Lc of the photographing means 10a, 10b, 10c are determined according to the degree of congestion of the head projection region determined by that candidate position and the object shape on the image captured by each photographing means.

Specifically, the weighting determination means 532 refers to the object information 521 stored in the storage unit 52 and to the congestion degree information input from the congestion degree estimation means 530, calculates, for each candidate position, the degree of non-overlap of the high-congestion region with the head projection region for each of the photographing means 10a, 10b, 10c as the weights Wa, Wb, Wc, appends the calculated weights W to the corresponding hypotheses, and updates the object information 521. The degree of sparseness may be used as the weight W instead of the degree of non-overlap.
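A minimal sketch of this non-overlap weight, assuming the head projection region and the high-congestion region are available as boolean masks over the captured image (the names are illustrative):

    import numpy as np

    def non_overlap_weight(head_mask, high_congestion_mask):
        """Weight W = fraction of the head projection region NOT covered by the high-congestion region."""
        area = np.count_nonzero(head_mask)
        if area == 0:
            return 0.0
        overlap = np.count_nonzero(head_mask & high_congestion_mask)
        return 1.0 - overlap / area

    # Example for one candidate position: one (head mask, congestion mask) pair per photographing means
    # weights = [non_overlap_weight(h, c) for h, c in zip(head_masks, congestion_masks)]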

つまり、頭部投影領域内の混雑度が高い個別認識結果ほど重みWは小さくなる。これは背後の群集の影響で個別認識結果の信頼度が低くなることを意味する。他方、頭部投影領域内の混雑度が低い個別認識結果ほど重みWは高くなる。これは群集の影響が少なく個別認識結果の信頼度が高くなることを意味する。このような重みWの違いは、認識対象の物体と背後の群集の撮影画像上での位置関係が撮影手段との位置関係によって異なることで生じる。そのため、認識対象の物体の領域における混雑度に応じて各撮影手段の重みWを決定することで、撮影手段と群集の位置関係により変わる個別認識結果の信頼度を評価できる。 That is, the weight W becomes smaller as the individual recognition result has a higher degree of congestion in the head projection area. This means that the reliability of the individual recognition result is low due to the influence of the crowd behind. On the other hand, the weight W becomes higher as the individual recognition result has a lower degree of congestion in the head projection area. This means that the influence of the crowd is small and the reliability of the individual recognition result is high. Such a difference in weight W occurs because the positional relationship between the object to be recognized and the crowd behind it on the captured image differs depending on the positional relationship with the photographing means. Therefore, by determining the weight W of each photographing means according to the degree of congestion in the area of the object to be recognized, the reliability of the individual recognition result that changes depending on the positional relationship between the photographing means and the crowd can be evaluated.

統合認識手段である物体位置決定手段533は、重み付けに基づいて撮影手段ごとの個別認識結果を統合する。換言すると物体位置決定手段533は、各移動物体における複数の候補位置に基づいて、現時刻における移動物体の物体位置を求める。 The object position determining means 533, which is an integrated recognition means, integrates the individual recognition results for each photographing means based on the weighting. In other words, the object position determining means 533 obtains the object position of the moving object at the current time based on a plurality of candidate positions in each moving object.

In the present embodiment, the object position determination means 533 calculates, in the XYZ coordinate system and for each moving object, the object position of that moving object by integrating, based on the weights W, the per-photographing-means individual likelihoods of each candidate position of the moving object, and then taking a weighted average of the candidate positions using the integrated likelihoods as weights U. The calculated object position in the XYZ coordinate system is associated with the moving object and stored in the object information 521 of the storage unit 52.
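One way to read this integration step is the following sketch. The integrated likelihood of each candidate is taken here as the weighted sum of its per-camera individual likelihoods (the patent does not fix the exact combination rule, so this is an assumption), and the object position is the average of the candidate positions weighted by the integrated likelihoods U.

    import numpy as np

    def integrate_and_localize(candidates, individual_likelihoods, camera_weights):
        """Estimate the object position from weighted candidate positions.

        candidates: (N, 3) candidate positions in the XYZ coordinate system.
        individual_likelihoods: (N, C) likelihoods La, Lb, Lc, ... per candidate and camera.
        camera_weights: (N, C) weights Wa, Wb, Wc, ... per candidate and camera.
        """
        L = np.asarray(individual_likelihoods, dtype=float)
        W = np.asarray(camera_weights, dtype=float)
        # integrated likelihood U of each candidate (weighted combination across cameras)
        U = (W * L).sum(axis=1)
        U = U / (U.sum() + 1e-9)
        # object position = weighted average of the candidate positions
        return (np.asarray(candidates, dtype=float) * U[:, None]).sum(axis=0)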

The object position determination means 533 updates the object position, hypotheses and template for each object being tracked, determines whether a new object exists and registers object information for that new object, and performs processing for disappeared objects. The processing for objects being tracked, the processing for new objects, and the processing for disappeared objects are described in turn below.

[Moving object being tracked]
For an object whose object position has been determined by the object position determination means 533, the determined object position is additionally stored, a shape model is placed at each object position at the current time and projected onto each captured image, the image features of the whole-body projection region are extracted, and the template of the object for each photographing means is updated with the image features at the current time. The update may replace the stored image features with the extracted image features, or may take a weighted average of the extracted image features and the stored image features.
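The weighted-average variant of the template update can be written, for example, as an exponential blend; the blending factor alpha below is an assumption, not a value specified in the patent.

    def update_template(stored_features, extracted_features, alpha=0.5):
        """Blend the newly extracted image features into the stored template.

        alpha = 1.0 replaces the template entirely with the current features;
        smaller values keep more of the stored template.
        """
        return (1.0 - alpha) * stored_features + alpha * extracted_features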

[New object]
The object position determination means 533 detects background subtraction regions by performing subtraction processing between each captured image and a background image captured when no object (person) to be tracked was present in the monitored space, and also places a shape model at each object position at the current time, projects it onto each captured image, and extracts the background subtraction regions that do not overlap any whole-body projection region. If a non-overlapping background subtraction region has an area effective for an object to be tracked (the area TS), the object position determination means 533 determines that a new object exists in that non-overlapping background subtraction region. When it is determined that a new object exists, the three-dimensional position of the non-overlapping background subtraction region is estimated by the same method as in the first embodiment to derive the object position in the XYZ coordinate system. The template of the object and the object position of the object are stored in the object information 521 of the storage unit 52 in association with an object ID. The object position determination means 533 also stores captured images taken when no object to be tracked is present in the storage unit 52 as background images, and updates the background image with the captured image of the regions in which no background subtraction region was detected.
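A rough sketch of this new-object check based on background subtraction follows; the difference threshold, the use of grayscale images and the pixel-count area test are illustrative assumptions.

    import numpy as np

    def detect_new_object(frame, background, body_masks, diff_threshold=30, area_threshold_ts=500):
        """Detect a new object from the background subtraction region not covered by any tracked object.

        frame, background: grayscale images of identical shape.
        body_masks: list of boolean whole-body projection masks of the tracked objects.
        area_threshold_ts: minimum area (in pixels) regarded as valid for a tracking target.
        """
        diff = np.abs(frame.astype(int) - background.astype(int)) >= diff_threshold
        covered = np.zeros_like(diff)
        for m in body_masks:
            covered |= m
        non_overlapping = diff & ~covered
        new_object_found = np.count_nonzero(non_overlapping) >= area_threshold_ts
        return new_object_found, non_overlapping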

[Disappeared object]
The object position determination means 533 determines that an object whose individual likelihoods L have all fallen to or below the threshold TL, for example because the object has been hidden by an occluding object or has moved out of the captured images, is a disappeared object with no object position, and deletes the object information of that object.

The tracking result output means 534 generates, for example, a movement trajectory image in which the time series of object positions of each tracked object is plotted in the XYZ coordinate system, and projects it onto the xy coordinate system of the photographing means 10a, 10b, 10c. In addition, colors corresponding to the degrees of congestion are defined in advance, and a congestion degree image is generated in which each pixel corresponding to a pixel of the congestion degree map is assigned the pixel value of the color corresponding to the congestion degree of that pixel. An image obtained by alpha-blending the movement trajectory image of each photographing means 10a, 10b, 10c with the congestion degree image of that photographing means is output to the display unit 54. The captured image at the current time may further be superimposed.
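For the display step, a colorized congestion image and its alpha blend with the trajectory image might be produced as below; OpenCV is used only as an example, and the specific color map and blending ratio are assumptions.

    import cv2
    import numpy as np

    def render_overlay(trajectory_image, congestion_map, alpha=0.5):
        """Colorize a congestion degree map and alpha-blend it with the trajectory image.

        trajectory_image: BGR uint8 image of the plotted trajectories (per photographing means).
        congestion_map: float array in [0, 1] with the per-pixel degree of congestion.
        """
        congestion_u8 = np.clip(congestion_map * 255, 0, 255).astype(np.uint8)
        congestion_color = cv2.applyColorMap(congestion_u8, cv2.COLORMAP_JET)  # one color per congestion level
        return cv2.addWeighted(trajectory_image, 1.0 - alpha, congestion_color, alpha, 0.0)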

次に、図7、図8に基づいて本実施形態2における三次元追跡装置5の処理例を説明する。図7は、追跡人物および群衆と各撮影手段の撮影画像の関係を示す図である。図7に示すように、撮影手段10a,10b,10cそれぞれにおいて、共通視野に存在する追跡中の人物600及び群衆610を撮影画像621,622,623として撮影する。 Next, a processing example of the three-dimensional tracking device 5 in the second embodiment will be described with reference to FIGS. 7 and 8. FIG. 7 is a diagram showing the relationship between the tracking person and the crowd and the captured image of each photographing means. As shown in FIG. 7, each of the photographing means 10a, 10b, and 10c photographs the tracking person 600 and the crowd 610 existing in the common field of view as captured images 621, 622, 623.

To determine the position of the person 600 to be tracked in three-dimensional space, a plurality of candidate positions 630 are set around the head of the person 600 in three-dimensional space. The congestion degree estimation means 530 extracts high-congestion regions 651, 652, 653 from the captured images 621, 622, 623. On the captured images 621 and 623 of the photographing means 10a and 10c, the tracked persons 641 and 643 do not overlap the high-congestion regions 651 and 653, but on the captured image 622 of the photographing means 10b, the tracked person 642 overlaps the high-congestion region 652. Consequently, the weights W of the candidate positions for the photographing means 10a and 10c become large, while the weights W of the candidate positions for the photographing means 10b become small.

図8(a)は追跡中の人物について設定された候補位置の一つに対して撮影手段10bの重みWを決定する様子を示す図である。図8(a)に示すように、三次元空間上の追跡中の人物600と群衆610を撮影手段10bで撮影する。人物600に対して候補位置700が設定されたとすると、撮影手段10bの撮影画像622において対応する位置710を頭部中心とする頭部投影領域720が得られる。また、群衆610の位置が高混雑度領域652として設定される。撮影手段毎、仮説毎に頭部投影領域720と高混雑度領域652との非重複率に応じて重みWが決定される。撮影手段10bに関する候補位置710についての頭部投影領域720は高混雑度領域652と重複している(非重複率が低い)ため、重みWが小さくなる。 FIG. 8A is a diagram showing how the weight W of the photographing means 10b is determined for one of the candidate positions set for the person being tracked. As shown in FIG. 8A, the tracking person 600 and the crowd 610 in the three-dimensional space are photographed by the photographing means 10b. Assuming that the candidate position 700 is set for the person 600, a head projection region 720 centered on the head at the corresponding position 710 in the captured image 622 of the photographing means 10b can be obtained. Also, the position of the crowd 610 is set as the high congestion area 652. The weight W is determined according to the non-overlapping rate between the head projection region 720 and the high congestion region 652 for each imaging means and each hypothesis. Since the head projection region 720 for the candidate position 710 with respect to the photographing means 10b overlaps with the high congestion degree region 652 (the non-overlapping rate is low), the weight W becomes small.

FIG. 8(b) shows the individual likelihoods for the photographing means 10a, 10b, 10c before weighting. A plurality of candidate positions are set for the person 600, and all of these candidate positions are evaluated for likelihood on the captured image of each photographing means. The square 730, the triangle 731, and the pentagon 732 represent the same candidate position, and the position of each symbol indicates the candidate position. The size of the square 730 indicates the magnitude of the individual likelihood obtained from the captured image of the photographing means 10a, the size of the triangle 731 indicates that obtained from the captured image of the photographing means 10b, and the size of the pentagon 732 indicates that obtained from the captured image of the photographing means 10c. The candidate positions 730 and 732 for the photographing means 10a and 10c are not affected by the high-congestion region 652, so their likelihoods are evaluated correctly. For the candidate positions 731 for the photographing means 10b, the upper right side is affected by the high-congestion region 652, the likelihood cannot be evaluated correctly, and the individual likelihoods there come out erroneously high.

FIG. 8(c) shows the weighted individual likelihoods obtained by multiplying the individual likelihoods of FIG. 8(b) by the weights W based on the degree of congestion. The candidate positions 740 and 742 for the photographing means 10a and 10c have a low degree of congestion and therefore a large weight W, so their points are drawn large. The candidate positions 741 for the photographing means 10b have a high degree of congestion and therefore a small weight W, so their points are drawn small. As a result, the influence of the hypotheses for the photographing means 10b, whose individual likelihoods could not be calculated correctly because of the crowd 610 (high-congestion region 652), is reduced. Therefore, when the object position is obtained as a weighted average based on the candidate positions, the weights W and the individual likelihoods, the influence of the hypotheses for the photographing means 10b can be kept small and the object position can be determined with high accuracy.

[Operation example of the three-dimensional tracking device 5]
The operation of the three-dimensional tracking device 5 is described below. FIG. 9 is an overall flow diagram of the operation of the three-dimensional tracking device 5. When the operation of the three-dimensional tracking device 5 starts, the photographing means 10a, 10b, 10c sequentially output captured images to the image processing unit 53. Each time captured images are input (step S500), the image processing unit 53 repeats the series of processes of steps S501 to S510.

The image processing unit 53 causes the congestion degree estimation means 530 to output a congestion degree map for each of the captured images acquired by the photographing means 10a, 10b, 10c, and extracts regions whose congestion degree is equal to or greater than a predetermined threshold T1 as high-congestion regions (step S501).

For each person recorded in the object information 521 of the storage unit 52, the image processing unit 53 performs tracking processing on the input captured images and estimates the current object position (steps S502 to S508). The image processing unit 53 selects the tracked persons recorded in the object information 521 of the storage unit 52 one by one as the target of the tracking processing; when the tracking processing has been completed for all tracked persons, the image processing unit 53 advances the processing to step S509, whereas it continues the tracking processing while unprocessed tracked persons remain (step S508).

The tracking processing of steps S502 to S508 is described in more detail below. The image processing unit 53 functions as the candidate position setting / evaluation means 531, sets hypotheses for each tracked person in the XYZ coordinate system, and projects the three-dimensional shape model placed at the candidate position indicated by each hypothesis onto the xy coordinate system of the photographing means 10a, 10b, 10c (step S502). That is, the candidate position setting / evaluation means 531 predicts the current candidate positions from the past tracking information and sets the candidate positions in the hypotheses.

The image processing unit 53 functions as the weighting determination means 532, refers to the object information 521 stored in the storage unit 52 and the congestion degree information input from the congestion degree estimation means 530, calculates, for each candidate position, the degree of non-overlap of the high-congestion region with the head projection region for each of the photographing means 10a, 10b, 10c as the weights Wa, Wb, Wc, appends the calculated weights Wa, Wb, Wc to the corresponding hypotheses, and updates the object information 521 (step S503).

The image processing unit 53 functions as the candidate position setting / evaluation means 531 and, for each hypothesis set in step S502, calculates the individual likelihoods La, Lb, Lc based on the similarity between the image features of the whole-body projection region in the captured images of the photographing means 10a, 10b, 10c and the image features of the template of the person (step S504). Note that a separate template is held for each photographing means.

The image processing unit 53 then functions as the object position determination means 533, determines whether tracking can be continued based on the individual likelihoods of the hypotheses calculated in step S504 (step S505), and performs tracking end processing when it determines that tracking cannot be continued (step S506). Tracking of a person determined to be untrackable is thereby terminated, and the object position determination means 533 deletes the information on that person from the object information 521 of the storage unit 52. Here, a person for whom all individual likelihoods are below the threshold TL is determined to be untrackable; information on persons who no longer appear in the captured images is thus deleted.

When it is determined in step S505 that tracking can be continued, the object position determination means 533 calculates integrated likelihoods based on the candidate positions of the hypothesis group set in step S502, the weights W calculated in step S503, and the individual likelihoods calculated in step S504, and estimates the object position of the tracked person based on the integrated likelihoods and the candidate positions (step S507).

When the above tracking processes S502 to S507 have been performed for all persons registered in the object information 521 of the storage unit 52, the image processing unit 53 advances the processing to step S509 as described above, and the object position determination means 533 detects persons in the captured images for which tracking has not yet been set and, if any are detected, adds them as new tracked persons (step S509). For a person added as a new tracked person, the object position is obtained by the method of the first embodiment.

ステップS500で入力された撮影画像に対し上述した処理S501〜S509により人物の追跡が完了すると、画像処理部53は追跡結果を表示部54へ出力する(ステップS510)。例えば、画像処理部53は追跡結果として全人物の物体位置を表示部54の表示装置等に表示させる。 When the tracking of the person is completed by the above-mentioned processes S501 to S509 for the captured image input in step S500, the image processing unit 53 outputs the tracking result to the display unit 54 (step S510). For example, the image processing unit 53 causes the display device or the like of the display unit 54 to display the object positions of all persons as a tracking result.

[Modified examples of Embodiment 2]
(2-1) In the second embodiment, the weighting determination means 532 calculates the weight W using a three-dimensional shape model, but the weight W can also be calculated without using a three-dimensional shape model. For example, a relational expression that yields a larger weight W for a lower degree of congestion is defined in advance, the degree of congestion at the projected point of the candidate position is obtained from the congestion degree map, and the relational expression is applied to the obtained degree of congestion to calculate the weight W.

Alternatively, the degrees of congestion in a neighborhood region (for example, 5 × 5 pixels) centered on the projected point of the candidate position are obtained from the congestion degree map, and the relational expression is applied to a representative value of the obtained degrees of congestion to calculate the weight W. The representative value is, for example, the maximum, the mean, or the mode. In this modification, the "position where the individual recognition means recognized the object" is the "projected point of the candidate position" or the "neighborhood region centered on the projected point of the candidate position".
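A sketch of this modification follows; the decreasing relational expression exp(-k * congestion), the neighborhood size and the choice of representative statistic are assumptions chosen only to illustrate the idea.

    import numpy as np

    def weight_from_congestion_map(congestion_map, point_xy, half_window=2, k=3.0, reducer=np.max):
        """Weight W from the congestion degree around the projected point of a candidate position.

        point_xy: (x, y) projected point in image coordinates.
        reducer: representative value over the neighborhood (e.g. np.max or np.mean).
        k: steepness of the predefined relation giving a larger W for a lower congestion degree.
        """
        x, y = int(round(point_xy[0])), int(round(point_xy[1]))
        h, w = congestion_map.shape
        y0, y1 = max(0, y - half_window), min(h, y + half_window + 1)
        x0, x1 = max(0, x - half_window), min(w, x + half_window + 1)
        congestion = float(reducer(congestion_map[y0:y1, x0:x1]))
        return float(np.exp(-k * congestion))   # lower congestion -> larger weight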

(2-2) The second embodiment has shown an example in which the weighting determination means 532 determines the weight W for each combination of the photographing means 10a, 10b, 10c and a candidate position, but the weight W may instead be determined, approximately, for each combination of the photographing means 10a, 10b, 10c and an object. That is, one weight W is determined for the group of candidate positions of an object.

(2-2-1) For example, for each object, a three-dimensional shape model of the head is placed at each of the plural candidate positions of the object in the XYZ coordinate system, and the plural placed three-dimensional shape models are projected together onto the xy coordinate system of the photographing means 10a, 10b, 10c. The projection region of these plural three-dimensional shape models is regarded as the "position where the individual recognition means recognized the object". The weight W for each combination of the photographing means 10a, 10b, 10c and the object is then calculated based on the degree of congestion within the projection region of that object for each of the photographing means 10a, 10b, 10c.

(2-2-2) Alternatively, for example, for each object, the smallest possible sphere or ellipsoid enclosing the plural candidate positions of the object is derived in the XYZ coordinate system, and the derived sphere or ellipsoid is projected onto the xy coordinate system of the photographing means 10a, 10b, 10c. As in the preceding example, the projection region of this small sphere or ellipsoid is regarded as the "position where the individual recognition means recognized the object". The weight W for each combination of the photographing means 10a, 10b, 10c and the object is then calculated based on the degree of congestion within the projection region of that object for each of the photographing means 10a, 10b, 10c.

(2-2-3) Alternatively, for example, for each object, the current object position is predicted by extrapolating from the past object positions of the object in the XYZ coordinate system, and a three-dimensional shape model of the head placed at the predicted position is projected onto the xy coordinate system of the photographing means 10a, 10b, 10c. This projection region can be regarded as representative of the projection regions of the two preceding examples, and the projection region for each photographing means is regarded as the "position where the individual recognition means recognized the object". The weight W for each combination of the photographing means 10a, 10b, 10c and the object is then calculated based on the degree of congestion within the projection region of that object for each of the photographing means 10a, 10b, 10c.

As in modification (2-1), in modifications (2-2-1) and (2-2-3) the weight W may be calculated based on the degree of congestion at the projected point of the candidate position itself, or in its neighborhood region, instead of the projection region of the three-dimensional shape model. In these cases, the same weight W is assigned to all hypotheses of the same object.

(2-3) In the second embodiment and its modifications, the weighting determination means 532 determines the weight W using only the degree of congestion, but the weight W can also be determined by additionally judging how suitable each photographing means is for tracking from various other factors, such as the distance from the photographing means to the tracking target and the degree of occlusion by other persons or obstacles.

(2-4) In the second embodiment and its modifications, the candidate position setting / evaluation means 531 calculates the individual likelihood of each hypothesis (that is, performs individual recognition) for all photographing means, but a single photographing means may instead be assigned to each hypothesis and the individual likelihood calculated for that photographing means only. In this case there is no integration of likelihoods, and the object position determination means 533 can be configured to take a weighted average of the candidate positions using the product of the weight W and the individual likelihood; in that configuration, the target of weighting by the weight W is the candidate position. Alternatively, the weighting can be realized through the number of hypotheses. For example, the candidate position setting / evaluation means 531 predicts the object position as in modification (2-2-3), calculates the weight W for each combination of a photographing means and the object at the predicted position, and sets a number of candidate positions for each combination of a photographing means and the object according to the weight W of that combination. If N candidate positions are allotted per object and the weight of the object of interest for photographing means C is W_C, the number of candidate positions of the object for photographing means C is N × W_C / Σ W_C. In that configuration too, the target of weighting by the weight W is the candidate position.
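The allotment of candidate positions in proportion to the weights W_C could be realized, for example, as follows; this is a sketch under the assumption that fractional allotments are rounded down and the remainder is given to the most strongly weighted cameras.

    import numpy as np

    def allot_candidates(weights_per_camera, total_candidates):
        """Distribute N candidate positions over the cameras in proportion to W_C / sum(W_C)."""
        w = np.asarray(weights_per_camera, dtype=float)
        if w.sum() <= 0:
            # degenerate case: spread the candidates evenly
            w = np.ones_like(w)
        counts = np.floor(total_candidates * w / w.sum()).astype(int)
        # hand out the remainder to the most strongly weighted cameras
        for i in np.argsort(-w)[: total_candidates - counts.sum()]:
            counts[i] += 1
        return counts

    # Example: allot_candidates([0.8, 0.1, 0.6], 100) -> array([54, 6, 40])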

(2-5) In the second embodiment and its modifications, an example was shown in which the object position determination means 533 detects new objects based on background subtraction processing. Instead, new objects may be detected using a trained model obtained by machine learning on images of a large, unspecified number of objects of the type to be tracked (for example, by deep learning on images of a large, unspecified number of people). In that case, the object position determination means 533 inputs the captured image into the trained model to detect object regions, and determines that a new object exists in an object region whose portion not overlapping any shape model has a size equal to or larger than the threshold TS.

[Modified examples common to Embodiments 1 and 2]
(3-1) In the first and second embodiments and their modifications, the weighting determination means calculates the weight W based simply on the degree of congestion at the position of the object, but the weight W may also be calculated taking into account the degree of congestion in the region along the viewing direction toward the object.

In the example shown in FIG. 10(a), for the person 800, the degree of congestion in the region 831 on the captured image 821 of the photographing means 10a and the degree of congestion in the region 832 on the captured image 822 of the photographing means 10b are about the same. However, seen from the photographing means 10a the person 800 is in front of the crowd 810 and is not occluded, whereas seen from the photographing means 10b the person 800 is behind the crowd 810 and is partially occluded. The individual recognition result for the photographing means 10a is therefore more reliable than that for the photographing means 10b.

To this end, the weighting determination means 532 of the second embodiment calculates, in addition to the head projection region 850 obtained by placing the three-dimensional head shape model at the candidate position, a head projection region 851 for a position closer to the photographing means than the candidate position on the straight line connecting the candidate position and the position of the photographing means, and a head projection region 852 for a position farther from the photographing means than the candidate position on the same line, and takes the degree of congestion in each of these head projection regions into account. In the example shown in FIG. 10(b), the indices (degree of non-overlap, degree of sparseness or degree of congestion) are calculated for the head projection region 851 on the side closer to the photographing means 10a and for the head projection region 852 on the farther side.

The weighting determination means 132 of the first embodiment performs this approximately. For example, if the photographing means is a wide-angle camera installed looking down, the person region is shifted toward the bottom of the screen to obtain a person region at a position closer to the photographing means than the candidate position, and shifted toward the top of the screen to obtain a person region at a position farther from the photographing means than the candidate position. As another example, if the photographing means is a fisheye camera installed looking down, the person region is shifted toward the center along the radial line from the center of the screen to obtain a person region at a position closer to the photographing means than the candidate position, and shifted away from the center along the same radial line to obtain a person region at a position farther from the photographing means than the candidate position.

The shift amount may be adjusted according to the mounting position, angle and so on of the photographing means, and may be, for example, an amount such that the shifted region overlaps roughly half of the original region. The weighting determination means 132, 532 then obtains the average of the index at the candidate position, the index at the position closer to the photographing means, and the index at the position farther from the photographing means, and determines the weight W according to this average. Here, it is preferable to use a weighted average in which the index at the position closer to the photographing means is weighted more heavily than the index at the position farther from the photographing means.
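A small sketch of this weighted average of the three indices follows; the blend coefficients are assumptions that only express the requirement that the near-side index is weighted more heavily than the far-side index, and the weight W would then be determined from the returned average.

    def line_of_sight_index(index_near, index_candidate, index_far,
                            coeff_near=0.4, coeff_candidate=0.4, coeff_far=0.2):
        """Combine the indices (non-overlap, sparseness or congestion) measured at the
        near-side region, the candidate-position region and the far-side region.

        The near-side index gets a larger coefficient than the far-side index, reflecting
        that a crowd in front of the object affects the recognition result more strongly.
        """
        total = coeff_near + coeff_candidate + coeff_far
        return (coeff_near * index_near
                + coeff_candidate * index_candidate
                + coeff_far * index_far) / total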

(3-2) An example was shown in which the congestion degree estimation means 130, 530 use an estimator that outputs continuous values, but an estimator that outputs discrete degrees of congestion can also be used.

For example, the estimator is modeled as a multi-class SVM (Support Vector Machine), and the model is trained in advance using training images classified and labeled into four classes according to the degree of congestion: "background (unoccupied)", "low congestion", "medium congestion", and "high congestion". The congestion degree estimation means 130, 530 then set a window centered on each pixel of the captured image, input the feature amount of the image within the window into the estimator, and identify the class of each pixel. When the degree of non-overlap described above is used, the congestion degree estimation means 130, 530 treat the set of pixels labeled "high congestion" as the high-congestion region; when the degree of sparseness described above is used, each label is replaced by a predetermined numerical value according to its degree of congestion to form a discrete-valued congestion degree map.
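A possible sketch of such a window-based multi-class estimator using scikit-learn follows; the feature extractor extract_features is a hypothetical placeholder, and the window size, stride, class labels and SVM settings are assumptions rather than values given in the patent.

    import numpy as np
    from sklearn.svm import SVC

    CLASSES = ["background", "low", "medium", "high"]   # 4 congestion classes

    def extract_features(window):
        """Placeholder feature extractor for an image patch (here a simple intensity histogram)."""
        hist, _ = np.histogram(window, bins=32, range=(0, 255))
        return hist / (hist.sum() + 1e-9)

    def train_estimator(patches, labels):
        """patches: list of image windows; labels: indices into CLASSES."""
        X = np.stack([extract_features(p) for p in patches])
        return SVC(kernel="rbf").fit(X, labels)

    def classify_pixels(estimator, image, win=32, stride=8):
        """Slide a window over a grayscale image and assign its pixels a congestion class
        (a coarse approximation of per-pixel classification)."""
        h, w = image.shape
        label_map = np.zeros((h, w), dtype=int)
        for y in range(0, h - win, stride):
            for x in range(0, w - win, stride):
                feat = extract_features(image[y:y + win, x:x + win])
                label = estimator.predict(feat[None, :])[0]
                label_map[y:y + win, x:x + win] = label
        return label_map   # pixels labeled 3 ("high") form the high-congestion region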

Besides the multi-class SVM, the estimator can also be realized with various multi-class classifiers trained by a decision-tree-based random forest method, a multi-class AdaBoost method, a multi-class logistic regression method, or the like. Alternatively, the estimator can be realized by a classification-type CNN (in the case of a CNN, window scanning is unnecessary). Even when class-labeled training images are used, an estimator that outputs continuous congestion degrees can also be realized by adopting a regression-type model that regresses the degree of congestion from the feature amounts. In that case, the parameters of the regression function for obtaining the degree of congestion from the feature amounts are learned by ridge regression, support vector regression, a regression-tree-based random forest method, Gaussian process regression, or the like. Alternatively, the estimator may use a regression-type CNN (again, window scanning is unnecessary for a CNN).

(3−3)本発明は、車両、動物等、混雑状態をなし得る人以外の物体にも適用できる。 (3-3) The present invention can be applied to objects other than humans, such as vehicles and animals, which can be in a congested state.

1…三次元位置推定装置(物体認識装置)、10a,10b,10c,50a,50b,50c…撮影手段、11,51…通信部、12,52…記憶部、13,53…画像処理部、14,54…表示部、120,520…カメラパラメータ、130、530…混雑度推定手段、131…二次元位置推定手段(個別認識手段)、132、532…重付決定手段、133…三次元位置推定手段(統合認識手段)、134…推定結果出力手段、5…三次元追跡装置、521…物体情報、531…候補位置設定・評価手段(個別認識手段)、533…物体位置決定手段(統合認識手段)、534…追跡結果出力手段 1 ... Three-dimensional position estimation device (object recognition device), 10a, 10b, 10c, 50a, 50b, 50c ... Imaging means, 11,51 ... Communication unit, 12,52 ... Storage unit, 13,53 ... Image processing unit, 14,54 ... Display unit, 120,520 ... Camera parameters, 130, 530 ... Congestion degree estimation means, 131 ... Two-dimensional position estimation means (individual recognition means), 132, 532 ... Overload determination means, 133 ... Three-dimensional position Estimating means (integrated recognition means), 134 ... Estimating result output means, 5 ... 3D tracking device, 521 ... Object information, 331 ... Candidate position setting / evaluation means (individual recognition means), 533 ... Object position determination means (integrated recognition) Means) 534 ... Tracking result output means

Claims (5)

An object recognition device that recognizes an object based on captured images captured by a plurality of photographing means having a common field of view, the object recognition device comprising:
a congestion degree estimation means that estimates, for each photographing means, the degree of congestion of the objects captured in the captured image;
an individual recognition means that analyzes the captured image of each photographing means and recognizes all or part of the object on the captured image to generate an individual recognition result;
a weighting determination means that determines the weighting of each photographing means according to the degree of congestion at the position where the individual recognition means recognized the object on the captured image captured by that photographing means; and
an integrated recognition means that integrates the individual recognition results of the photographing means based on the weighting.
The object recognition device according to claim 1, wherein the congestion degree estimation means inputs the captured image into an estimator trained in advance to output, when given a captured image, the degree of congestion at any position within that captured image, and thereby estimates the degree of congestion at any position within the captured image, and
the weighting determination means determines the weighting of the photographing means for each region of the captured image according to the degree of congestion.
The object recognition device according to claim 1 or 2, wherein the individual recognition means analyzes, for each photographing means, the captured image to obtain position information of the object on the captured image at the current time, and
the integrated recognition means integrates the position information of the photographing means based on the weighting to determine the position of the object at the current time.
An object recognition method performed by an object recognition device that recognizes an object based on captured images captured by a plurality of photographing means having a common field of view, wherein:
a congestion degree estimation means estimates, for each photographing means, the degree of congestion of the objects captured in the captured image;
an individual recognition means analyzes the captured image of each photographing means and recognizes all or part of the object on the captured image to generate an individual recognition result;
a weighting determination means determines the weighting of each photographing means according to the degree of congestion at the position where the individual recognition means recognized the object on the captured image captured by that photographing means; and
an integrated recognition means integrates the individual recognition results of the photographing means based on the weighting.
An object recognition program executed in an object recognition device that recognizes an object based on captured images captured by a plurality of photographing means having a common field of view, the program causing the device to execute:
a process in which a congestion degree estimation means estimates, for each photographing means, the degree of congestion of the objects captured in the captured image;
a process in which an individual recognition means analyzes the captured image of each photographing means and recognizes all or part of the object on the captured image to generate an individual recognition result;
a process in which a weighting determination means determines the weighting of each photographing means according to the degree of congestion at the position where the individual recognition means recognized the object on the captured image captured by that photographing means; and
a process in which an integrated recognition means integrates the individual recognition results of the photographing means based on the weighting.
JP2020050235A 2020-03-19 2020-03-19 Device, method and program for object recognition Pending JP2021149687A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2020050235A JP2021149687A (en) 2020-03-19 2020-03-19 Device, method and program for object recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2020050235A JP2021149687A (en) 2020-03-19 2020-03-19 Device, method and program for object recognition

Publications (1)

Publication Number Publication Date
JP2021149687A true JP2021149687A (en) 2021-09-27

Family

ID=77849042

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2020050235A Pending JP2021149687A (en) 2020-03-19 2020-03-19 Device, method and program for object recognition

Country Status (1)

Country Link
JP (1) JP2021149687A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023218761A1 (en) * 2022-05-09 2023-11-16 日立Astemo株式会社 Abnormality diagnosis device
WO2024024048A1 (en) * 2022-07-28 2024-02-01 日本電信電話株式会社 Object detection device, object detection method, and object detection program


