JP2021149687A - Device, method and program for object recognition

Info

Publication number
JP2021149687A
Authority
JP
Japan
Prior art keywords
congestion
photographing means
degree
weighting
image
Legal status
Pending
Application number
JP2020050235A
Other languages
Japanese (ja)
Inventor
Goji Mito (豪二 水戸)
Takumi Munekata (匠 宗片)
Current Assignee
Secom Co Ltd
Original Assignee
Secom Co Ltd
Application filed by Secom Co Ltd
Priority to JP2020050235A
Publication of JP2021149687A

Abstract

To provide an object recognition technique capable of effectively preventing a decrease in the accuracy of object recognition caused by congestion.

SOLUTION: An object recognition device 1 recognizes an object on the basis of images captured by a plurality of photographing means 10a, 10b, 10c having a common field of view. A congestion degree estimating means 130 estimates, for each photographing means 10a, 10b, 10c, the degree of congestion of the objects captured in the captured image. An individual recognition means 131 analyzes the captured image of each photographing means 10a, 10b, 10c, recognizes all or part of the object in the captured image, and generates an individual recognition result. A weighting determination means 132 determines the weight of each photographing means 10a, 10b, 10c according to the degree of congestion at the position where the individual recognition means 131 recognized the object in the image captured by that photographing means. An integrated recognition means 133 integrates the individual recognition results of the photographing means 10a, 10b, 10c on the basis of the weights.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for recognizing an object based on images, and more particularly to a technique for recognizing an object based on images captured by a plurality of photographing means having a common field of view.

For purposes such as security, object recognition is performed on images captured by cameras, for example detecting and tracking objects such as people or recognizing their posture. In such cases, recognition accuracy can be improved by giving a plurality of cameras a common field of view and shooting from a plurality of directions.

For example, Patent Document 1 describes a moving object tracking device that tracks moving objects such as people with a plurality of cameras having a common field of view: template matching is performed on the image captured by each camera, and the moving object positions obtained for each camera are weighted by their likelihoods and integrated in a common coordinate system. In this way, even if moving objects overlap one another in the image of some camera and the likelihood for that camera decreases, the information from the other cameras can compensate, so highly accurate tracking can continue.

Japanese Unexamined Patent Publication No. 2010-049296

However, the prior art has a problem in that it cannot effectively prevent the decrease in recognition accuracy caused by congestion around the object of interest. For example, in the moving object tracking device described in Patent Document 1, as long as objects of the same type overlap, erroneous matching can accidentally produce a high likelihood, and the likelihood obtained after the fact does not reveal in which camera's image the erroneous matching occurred, so it is difficult to exclude positions obtained by erroneous matching from the integration. Moreover, the higher the degree of congestion, the more likely erroneous matching becomes.

The present invention has been made in view of the above problems, and an object of the present invention is to provide an object recognition device, an object recognition method, and an object recognition program capable of effectively preventing a decrease in the accuracy of object recognition caused by congestion.

(1) An object recognition device according to the present invention is an object recognition device that recognizes an object based on captured images taken by a plurality of photographing means having a common field of view, comprising: congestion degree estimating means for estimating, for each photographing means, the degree of congestion of objects captured in the captured image; individual recognition means for analyzing the captured image of each photographing means, recognizing all or part of the object in the captured image, and generating an individual recognition result; weighting determination means for determining a weight for each photographing means according to the degree of congestion at the position where the individual recognition means recognized the object in the image captured by that photographing means; and integrated recognition means for integrating the individual recognition results of the photographing means based on the weights.

(2) In the object recognition device according to (1) above, the congestion degree estimating means estimates the degree of congestion at arbitrary positions in the captured image by inputting the captured image into an estimator trained in advance to output, given a captured image, the degree of congestion at an arbitrary position in that image, and the weighting determination means determines the weight of each photographing means for each region of the captured image according to the degree of congestion.

(3) In the object recognition device according to (1) or (2) above, the individual recognition means analyzes the captured image of each photographing means to obtain position information of the object in the captured image at the current time, and the integrated recognition means integrates the position information of the photographing means based on the weights to determine the position of the object at the current time.

According to the present invention, it is possible to provide an object recognition device, an object recognition method, and an object recognition program that can effectively prevent a decrease in the accuracy of object recognition caused by congestion.

FIG. 1 is a block diagram showing the schematic configuration of the three-dimensional position estimation device.
FIG. 2 is a diagram showing the relationship between a person and a crowd and the images captured by each photographing means.
FIG. 3 is an enlarged view of the person 200 in FIG. 2.
FIG. 4 is a schematic flowchart showing the overall processing of the three-dimensional position estimation device in Embodiment 1.
FIG. 5 is a sub-flowchart showing the three-dimensional position estimation process.
FIG. 6 is a block diagram showing the schematic configuration of the three-dimensional tracking device.
FIG. 7 is a diagram showing the relationship between a tracked person and a crowd and the images captured by each photographing means.
FIG. 8 is an explanatory diagram of the hypotheses, likelihoods, and weights of a tracked person.
FIG. 9 is a flowchart showing the overall processing of the three-dimensional tracking device.
FIG. 10 is an explanatory diagram illustrating another example of the object recognition device.

[Embodiment 1]
Hereinafter, a three-dimensional position estimation device, which is an example of the object recognition device according to an embodiment of the present invention (hereinafter referred to as Embodiment 1), will be described. The three-dimensional position estimation device estimates the three-dimensional position of a person within a common field of view based on captured images taken by a plurality of photographing means having the common field of view.

FIG. 1 is a block diagram showing the schematic configuration of the three-dimensional position estimation device 1. The three-dimensional position estimation device 1 comprises photographing means 10a, 10b, 10c, a communication unit 11, a storage unit 12, an image processing unit 13, and a display unit 14.

The photographing means 10a, 10b, and 10c are cameras that acquire images, which are collections of target data; in the present embodiment they are surveillance cameras. The photographing means 10a, 10b, and 10c have a common field of view and are synchronized with one another. They are connected to the image processing unit 13 via the communication unit 11, capture the monitored space at predetermined time intervals to generate images, and sequentially input the generated images to the image processing unit 13. For example, the photographing means 10a, 10b, and 10c are installed on the walls of an indoor monitored space with predetermined fixed fields of view overlooking that space, and capture the space at intervals of 1/5 second to generate color or monochrome images. Although Embodiment 1 shows an example with three photographing means, at least two photographing means suffice. To increase the chance that an image with a low degree of congestion is captured, the more photographing means the better, and the larger the difference in direction from the center of gravity of the common field of view to the installation position of each photographing means, the better.

The photographing means 10a, 10b, and 10c are calibrated in advance, and a common three-dimensional coordinate system (a so-called world coordinate system) is defined. Hereinafter, this coordinate system is referred to as the XYZ coordinate system. The two-dimensional coordinate system specific to the captured image of each photographing means 10a, 10b, 10c (a so-called camera coordinate system) is referred to as the xy coordinate system.

The communication unit 11 is a communication circuit, one end of which is connected to the image processing unit 13 and the other end of which is connected to the photographing means 10a, 10b, 10c and the display unit 14. The communication unit 11 acquires images from the photographing means 10a to 10c and inputs them to the image processing unit 13. The communication unit 11 also outputs the object recognition result from the image processing unit 13 to the display unit 14.

The photographing means 10a to 10c, the communication unit 11, the storage unit 12, the image processing unit 13, and the display unit 14 are connected in a form appropriate to the installation location of each unit. For example, when the photographing means 10a to 10c are installed remotely from the communication unit 11 and the image processing unit 13, the photographing means 10a to 10c and the communication unit 11 can be connected via an Internet line. The communication unit 11 and the image processing unit 13 can be connected by a bus. In addition, a LAN (Local Area Network), various cables, and the like can be used as connection means.

The storage unit 12 is a memory device such as a ROM (Read Only Memory) or RAM (Random Access Memory), and stores various programs and various data. For example, the storage unit 12 stores training data and information on an estimator, which is a trained model, and exchanges this information with the image processing unit 13. That is, information used for training the estimator, information generated in the course of that processing, and the like are input to and output from the storage unit 12 and the image processing unit 13.

The image processing unit 13 is composed of arithmetic devices such as a CPU (Central Processing Unit), DSP (Digital Signal Processor), MCU (Micro Control Unit), or GPU (Graphics Processing Unit). The image processing unit 13 operates as various processing and control means by reading programs from the storage unit 12 and executing them, reads various data from the storage unit 12 as necessary, and stores generated data in the storage unit 12. For example, the image processing unit 13 trains and generates an estimator and stores the generated estimator in the storage unit 12 via the communication unit 11.

The display unit 14 is a liquid crystal display, an organic EL (Electro-Luminescence) display, or the like, and displays the recognition result of the moving object input from the image processing unit 13 via the communication unit 11.

The image processing unit 13 functions as congestion degree estimating means 130, two-dimensional position estimating means (individual recognition means) 131, weighting determination means 132, three-dimensional position estimating means (integrated recognition means) 133, and estimation result output means 134.

The congestion degree estimating means 130 estimates, for each photographing means 10a, 10b, 10c, the degree of congestion of the objects captured in the captured image. In the present embodiment, the congestion degree estimating means 130 estimates the degree of congestion at arbitrary positions in the captured image by inputting the captured image into an estimator trained in advance to output, given a captured image, the degree of congestion at an arbitrary position in that image. Specifically, the congestion degree estimating means 130 inputs the captured image into an estimator trained in advance to output, given an image, a congestion degree map in which the degree of congestion is estimated for each pixel, causes the estimator to output the congestion degree map of the captured image, and stores the obtained congestion degree map in the storage unit 12.

The estimator can be realized concretely using deep learning techniques. That is, the estimator can be modeled as a CNN (convolutional neural network) that, given an image, outputs the congestion degree map of that image. For training, for example, a large number of training images in which crowds are photographed are prepared, together with a congestion degree map for each training image obtained by setting, for each person's head, a probability density function whose mean is the position of the center of gravity of that head and whose variance depends on the size of the head, and summing the values of these functions pixel by pixel. The model is then trained in advance so that its output for each training image approaches the congestion degree map corresponding to that image. The trained model obtained in this way is stored in the storage unit 12 as an estimator forming part of the program of the congestion degree estimating means 130. For example, the MCNN (multi-column convolutional neural network) described in "Single image crowd counting via multi-column convolutional neural network", Zhang, Y., Zhou et al., CVPR 2016 is an example of such an estimator, and the crowd density map described in that paper is an example of a congestion degree map. In the present embodiment, the congestion degree estimating means 130 predetermines an upper limit T0 of the degree of congestion at which a decrease in recognition accuracy is tolerable, divides the degree of congestion output by the estimator by the upper limit T0, and normalizes the quotient to 1.0 when it is 1.0 or more. That is, in the present embodiment the range of the degree of congestion is [0, 1].

The congestion degree estimating means 130 extracts, from each congestion degree map, the regions whose degree of congestion is equal to or greater than a predetermined threshold T1 as high congestion regions. The congestion degree estimating means 130 outputs to the weighting determination means 132 congestion information in which the photographing means ID identifying each of the photographing means 10a to 10c is associated with the high congestion regions in the image captured by that photographing means.
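As a minimal sketch of the normalization to [0, 1] and the extraction of high congestion regions described above (the numeric values of T0 and T1 are placeholders; the patent only states that they are predetermined):

```python
import numpy as np

T0 = 4.0   # assumed upper limit of tolerable congestion output by the estimator
T1 = 0.5   # assumed threshold on the normalized congestion degree

def normalize_congestion(raw_map: np.ndarray) -> np.ndarray:
    """Divide the estimator output by T0 and cap the result at 1.0, giving values in [0, 1]."""
    return np.clip(raw_map / T0, 0.0, 1.0)

def high_congestion_mask(congestion_map: np.ndarray) -> np.ndarray:
    """Boolean mask of pixels whose normalized congestion degree is at least T1."""
    return congestion_map >= T1
```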

The two-dimensional position estimating means 131, which is the individual recognition means, analyzes the captured image of each photographing means, recognizes all or part of the object in the captured image, and generates an individual recognition result. Specifically, the captured images from each of the photographing means 10a to 10c are input to a detector that has been trained in advance to detect the region of a person's image (person region) in an image, and the detector outputs (detects) the person regions in each captured image. An individual recognition result is then generated in which the photographing means ID of the photographing means 10a to 10c, the detected person regions, and the centroid positions of those person regions are associated with one another, and the generated individual recognition result is output to the weighting determination means 132 and the three-dimensional position estimating means 133.

The detector is, for example, a trained model obtained by deep learning of a CNN using training data consisting of a large number of training images and ground-truth data indicating person regions enclosing the images of people in those training images. An example of such a CNN is described in "Faster R-CNN: Towards real-time object detection with region proposal networks", Shaoqing Ren et al., NIPS, 2015.
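For illustration only, an individual recognition result for one photographing means could be assembled from generic detector output roughly as follows (the detector itself, the bounding-box format, and all field names are assumptions, not taken from the patent):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PersonDetection:
    bbox: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in xy image coordinates
    centroid: Tuple[float, float]             # centroid of the person region

@dataclass
class IndividualResult:
    camera_id: str                            # photographing means ID
    detections: List[PersonDetection]

def make_individual_result(camera_id: str,
                           boxes: List[Tuple[float, float, float, float]]) -> IndividualResult:
    """Associate a photographing means ID with the detected person regions and their centroids."""
    detections = []
    for (x0, y0, x1, y1) in boxes:
        detections.append(PersonDetection(bbox=(x0, y0, x1, y1),
                                          centroid=((x0 + x1) / 2.0, (y0 + y1) / 2.0)))
    return IndividualResult(camera_id=camera_id, detections=detections)
```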

The weighting determination means 132 determines the weight of each photographing means according to the degree of congestion at the position where the individual recognition means recognized the object in the image captured by that photographing means 10a, 10b, 10c.

Specifically, the weighting determination means 132 refers to the individual recognition results input from the two-dimensional position estimating means 131 and sets the upper third of each person region contained in the individual recognition result of each photographing means (hereinafter also referred to as the head region) as "the position where the individual recognition means recognized the object". The weighting determination means 132 then refers to the congestion information input from the congestion degree estimating means 130, calculates for each photographing means, as its weight, the proportion of the head region not occupied by the high congestion region of that photographing means, and outputs the individual recognition result with the weight included to the three-dimensional position estimating means 133.

For example, the weight is determined from the non-overlap rate between the head region and the high congestion region by the following formula:

weight = 1.0 - (area of overlap between the head region and the high congestion region) / (area of the head region)

Alternatively, the weight may be determined from the degree of quietness within the head region by the following formula. In this case, the congestion degree estimating means 130 outputs congestion information in which the photographing means ID and the congestion degree map are associated:

weight = 1.0 - (sum of the congestion degrees within the head region) / (area of the head region)

In other words, the higher the degree of congestion within the head region, the smaller the weight of that individual recognition result; this means that the reliability of the individual recognition result is low due to the influence of the crowd behind. Conversely, the lower the degree of congestion within the head region, the larger the weight; this means that the influence of the crowd is small and the reliability of the individual recognition result is high. Such differences in weight arise because the positional relationship on the captured image between the object to be recognized and the crowd behind it differs depending on the positional relationship with the photographing means. Therefore, by determining the weight of each photographing means according to the degree of congestion at the position where the individual recognition means recognized the object, the reliability of the individual recognition result for that position, which varies with the influence of the crowd, can be evaluated.

The storage unit 12 stores the camera parameters 120 of the photographing means 10a to 10c in order to back-project the centroid positions of the person regions obtained in the xy coordinate systems of the captured images into the XYZ coordinate system. The camera parameters 120 include external parameters such as the installation positions and imaging directions of the photographing means 10a to 10c in the actual monitored space, and internal parameters such as the focal lengths, angles of view, lens distortions and other lens characteristics of the photographing means 10a to 10c, and the numbers of pixels of their image sensors.
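Returning to the weight formulas above, a minimal sketch assuming the head region and the high congestion region are represented as boolean pixel masks and the normalized congestion map has values in [0, 1] (the mask representation is an illustrative choice, not something the patent prescribes):

```python
import numpy as np

def camera_weight(head_mask: np.ndarray, high_congestion_mask: np.ndarray) -> float:
    """weight = 1.0 - (overlap area of head region and high congestion region) / (head region area)."""
    head_area = head_mask.sum()
    if head_area == 0:
        return 0.0
    overlap = np.logical_and(head_mask, high_congestion_mask).sum()
    return 1.0 - overlap / head_area

def camera_weight_from_map(head_mask: np.ndarray, congestion_map: np.ndarray) -> float:
    """Alternative: weight = 1.0 - (sum of congestion degrees inside the head region) / (head region area)."""
    head_area = head_mask.sum()
    if head_area == 0:
        return 0.0
    return 1.0 - float(congestion_map[head_mask].sum()) / head_area
```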

The three-dimensional position estimating means 133, which is the integrated recognition means, integrates the individual recognition results of the photographing means based on the weights. In Embodiment 1, the position information of each photographing means is integrated based on the weights to determine the position of the object, and the determined position is output to the estimation result output means 134. The position information for each photographing means is the centroid position contained in the individual recognition result of that photographing means 10a, 10b, 10c, and the position determined for the object is its three-dimensional position.

Specifically, the three-dimensional position estimating means 133 first refers to the individual recognition results of the photographing means 10a, 10b, 10c input from the two-dimensional position estimating means 131 and to the camera parameters 120 of the photographing means 10a, 10b, 10c stored in the storage unit 12 and, for each photographing means, back-projects each object centroid position contained in the individual recognition result of that photographing means into the XYZ coordinate system using the camera parameters 120 of that photographing means, thereby deriving a line-of-sight vector passing through the centroid position of each object.
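A minimal sketch of such a back-projection under the usual pinhole-camera assumption, where the camera parameters are taken to provide an intrinsic matrix K, a world-to-camera rotation R, and a camera center c (this particular parameterization is an assumption; the patent only lists the kinds of internal and external parameters stored):

```python
import numpy as np

def backproject_to_ray(u: float, v: float,
                       K: np.ndarray, R: np.ndarray, c: np.ndarray):
    """Return the origin and unit direction, in the XYZ world system, of the line of sight
    through image point (u, v) of a pinhole camera with intrinsics K, rotation R, and center c."""
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])   # viewing direction in camera coordinates
    d_world = R.T @ d_cam                              # rotate into the XYZ (world) coordinate system
    return c, d_world / np.linalg.norm(d_world)
```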

Next, referring to the weights of the photographing means 10a, 10b, 10c input from the weighting determination means 132, it calculates, for each object, the three-dimensional position at which the weighted sum of the distances to the line-of-sight vectors from the photographing means is minimized, and takes that position as the three-dimensional position of the object.

The three-dimensional position P of each object is obtained as follows. Let W_C be the weight, with respect to that object, of photographing means C (the photographing means whose photographing means ID is C is denoted photographing means C), let V_C be the line-of-sight vector from photographing means C passing through the centroid position of the object, and let D(V_C, P) be the distance between V_C and the three-dimensional position P. Then the three-dimensional position P that minimizes Σ W_C × D(V_C, P) is found by the least squares method, where Σ denotes the sum over C.
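The patent states that P is found by the least squares method; the sketch below uses the common closed-form variant that minimizes the weighted sum of squared point-to-line distances (an assumption about the exact objective), with each line of sight given by an origin and a unit direction:

```python
import numpy as np

def triangulate_weighted(origins, directions, weights):
    """Return the 3D point P minimizing sum_C W_C * dist(P, line_C)^2, together with the
    weighted sum of (unsquared) distances at P, which can be compared against the threshold TD."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, d, w in zip(origins, directions, weights):
        M = np.eye(3) - np.outer(d, d)     # projector onto the plane perpendicular to the line
        A += w * M
        b += w * (M @ c)
    P = np.linalg.solve(A, b)
    dist_sum = sum(w * np.linalg.norm((np.eye(3) - np.outer(d, d)) @ (P - c))
                   for c, d, w in zip(origins, directions, weights))
    return P, dist_sum
```

The returned weighted distance sum can then be compared against the threshold TD when testing brute-force combinations of line-of-sight vectors, as described next.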

Note that it is difficult to specify in advance which combination of a line-of-sight vector from photographing means 10a, one from photographing means 10b, and one from photographing means 10c corresponds to the same object. Therefore, for example, the three-dimensional position estimating means 133 attempts to calculate a three-dimensional position for every combination (by brute force), discards combinations for which the minimized weighted sum of distances is equal to or greater than a predetermined threshold TD, and regards only combinations for which the minimized weighted sum of distances is less than the threshold TD as belonging to the same object.

That is, the three-dimensional position is determined by integration in which centroid positions from photographing means with larger weights are given more importance and centroid positions from photographing means with smaller weights are given less importance. This makes possible highly accurate integration in which the influence of errors arising in the individual recognition results of each photographing means due to the presence of the crowd is reduced. The object can therefore be recognized with high accuracy.

The estimation result output means 134 generates an estimation result and outputs it to the outside of the image processing unit 13. It generates an image combining the captured image with a projection obtained by drawing an × mark representing the three-dimensional position of the person in the virtual space of the XYZ coordinate system and projecting it into two dimensions, and outputs this image to the communication unit 11. The image is transmitted by the communication unit 11 and displayed on the display unit 14.

Next, a processing example of the three-dimensional position estimation device 1 in Embodiment 1 will be described. As shown in FIG. 2, each of the photographing means 10a, 10b, and 10c captures a person 200 and a crowd 210 present in the common field of view as captured images 221, 222, and 223.

The two-dimensional position estimating means 131 generates an individual recognition result at least for the person 200. That is, for photographing means 10a it generates a person region 231 surrounding the person 200 in the captured image 221 and its centroid position 241. For photographing means 10b it generates a person region 232 surrounding the person 200 in the captured image 222 and its centroid position 242; the person region 232 is detected larger than the true person region due to the influence of the image of the crowd, and the centroid position 242 is also shifted from the true centroid position. For photographing means 10c it generates a person region 233 surrounding the person 200 in the captured image 223 and its centroid position 243. The congestion degree estimating means 130 extracts high congestion regions 251, 252, and 253 from the captured images 221, 222, and 223.

The weighting determination means 132 calculates a weight according to the non-overlap rate between the upper third of the person region (the head region) and the high congestion region. For photographing means 10a and 10c, the upper thirds of the person regions 231 and 233 do not overlap the high congestion regions 251 and 253, so the weight is 1.0. For photographing means 10b, the upper third of the person region 232 overlaps the high congestion region 252, and the weight is 0.2.

The three-dimensional position estimating means 133 uses the camera parameters 120 of the photographing means 10a, 10b, and 10c to derive line-of-sight vectors V1, V2, and V3 passing through the centroid positions 241, 242, and 243, respectively. For photographing means 10b, since the person region 232 and the centroid position 242 deviate from the true ones, the line-of-sight vector V2 deviates relative to the line-of-sight vectors V1 and V3.

FIG. 3 is an enlarged view of the area around the person 200 in FIG. 2. The three-dimensional position 360 is the three-dimensional position of the person 200 that would be obtained if it were determined, without weighting, so as to minimize the distances to the line-of-sight vectors V1, V2, and V3; it is shifted from the actual centroid position of the person 200.

The three-dimensional position 361 is the position determined so that the weighted sum of the distances to the line-of-sight vectors V1, V2, and V3 is minimized, and it indicates approximately the centroid position of the actual person 200. The distance D1 from the line-of-sight vector V1 to the three-dimensional position 361 and the distance D3 from the line-of-sight vector V3 to the three-dimensional position 361 are shorter than the distance D2 from the line-of-sight vector V2 to the three-dimensional position 361. This shows that the distances D1 and D3 were evaluated with large weights and the distance D2 with a small weight. In this way, the weighting of the photographing means 10a, 10b, and 10c reduces the contribution of the line-of-sight vector V2 to the three-dimensional position 361 and increases the contributions of the line-of-sight vectors V1 and V3, so that the calculation of the three-dimensional position 361 becomes highly accurate.

[Operation of the three-dimensional position estimation device 1]
FIG. 4 is a flowchart showing the overall processing of the three-dimensional position estimation device 1 in Embodiment 1. Steps S100 to S150 of FIG. 4 are repeated each time captured images are input from the photographing means 10a, 10b, and 10c.

The captured images from the photographing means 10a, 10b, and 10c are input to the image processing unit 13 (S100). The image processing unit 13 operates as the congestion degree estimating means 130, inputs each of the captured images from the photographing means 10a, 10b, and 10c into the estimator to generate a congestion degree map for each photographing means, and extracts from each congestion degree map the high congestion regions whose degree of congestion is equal to or greater than the threshold T1 (S110).

The image processing unit 13 operates as the two-dimensional position estimating means 131, inputs each of the captured images from the photographing means 10a, 10b, and 10c into the detector to detect person regions, and generates individual recognition results in which the photographing means ID, the person regions, and the centroid positions of the person regions are associated (S120). The image processing unit 13 then operates as the weighting determination means 132, takes the high congestion regions and the individual recognition results as input, and determines weights according to the non-overlap rate between the head region (the upper third of the person region) and the high congestion region (S130).

The image processing unit 13 operates as the three-dimensional position estimating means 133, takes the individual recognition results and the weights as input, and estimates the three-dimensional positions (S140). FIG. 5 is a sub-flowchart showing the processing of the three-dimensional position estimating means 133.

The three-dimensional position estimating means 133 reads the camera parameters 120 from the storage unit 12, back-projects the centroid position of each person for each photographing means contained in the individual recognition results, and calculates the line-of-sight vector from that photographing means passing through that centroid position (S141). Under the condition that one line-of-sight vector is selected for each of the photographing means 10a, 10b, and 10c, the three-dimensional position estimating means 133 generates combinations of line-of-sight vectors by brute force and sets the generated combinations in turn as the combination to be processed (S142).

For the combination being processed, the three-dimensional position estimating means 133 derives the three-dimensional position at which the weighted sum of the distances from the line-of-sight vectors constituting the combination is minimized (S143). The three-dimensional position estimating means 133 then determines whether the minimized weighted sum of distances is less than the predetermined threshold TD (S144). If the weighted sum of distances is less than the threshold TD, the process proceeds to S145; if it is equal to or greater than the threshold TD, S145 is skipped and the process proceeds to S146. If the weighted sum of distances is less than the threshold TD, the combination of line-of-sight vectors is regarded as belonging to the same object, and the three-dimensional position is temporarily stored in the storage unit 12 (S145).

The three-dimensional position estimating means 133 checks whether all of the combinations generated in step S142 have been processed (S146). If all of the combinations have been processed, the process proceeds to S147; if there is an unprocessed combination, the process returns to S142 and the next combination is processed.

Among the three-dimensional positions temporarily stored in step S145, positions that are close to one another are merged into one as relating to the same person (S147). That is, since multiple three-dimensional positions may be calculated for a single person, these duplicates are eliminated. This prevents false detections caused by multiple person regions being detected for one person in the processing of the two-dimensional position estimating means 131. It also prevents false detections caused by combinations of line-of-sight vectors of different objects, among those generated by the three-dimensional position estimating means 133 in step S142, happening to remain because their weighted sum of distances falls below the threshold TD. For example, the three-dimensional position estimating means 133 clusters the three-dimensional positions using a method such as the group average method or Ward's method and takes the representative value of each cluster as the three-dimensional position of one person. The three-dimensional position estimating means 133 then erases the temporarily stored three-dimensional positions and proceeds to step S150 of FIG. 4.
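A minimal sketch of this merging step, using SciPy's average-linkage (group average) hierarchical clustering as one possible realization; the distance cutoff is an assumed parameter:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def merge_positions(positions: np.ndarray, cutoff: float = 0.3) -> np.ndarray:
    """Cluster nearby 3D positions (group average linkage) and return one representative
    position (the cluster mean) per person."""
    if len(positions) <= 1:
        return positions
    Z = linkage(positions, method='average')        # group average method; 'ward' would give Ward's method
    labels = fcluster(Z, t=cutoff, criterion='distance')
    return np.array([positions[labels == k].mean(axis=0) for k in np.unique(labels)])
```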

The image processing unit 13 operates as the estimation result output means 134, takes as input the three-dimensional positions integrated in step S147, generates a display image indicating those positions, and causes the display unit 14 to display the display image via the communication unit 11 (S150).

[Modifications of Embodiment 1]
(1-1) In Embodiment 1, the two-dimensional position estimating means 131, which is the individual recognition means, generates the individual recognition result using the person regions output by the detector as they are; however, the individual recognition result may instead be generated by first merging person regions with a high degree of overlap into one and then calculating the centroid position. Possible ways of merging include selecting the person region with the highest likelihood at detection time, or averaging the regions weighted by their likelihoods at detection time.

(1-2) In Embodiment 1, an example with three photographing means 10a, 10b, and 10c was described, but four or more photographing means may be used. When four or more photographing means are used, the combinations of line-of-sight vectors generated by the three-dimensional position estimating means 133, which is the integrated recognition means, may contain fewer line-of-sight vectors than the number of photographing means. For example, combinations that select the line-of-sight vectors of three photographing means from among the line-of-sight vectors of four photographing means are generated by brute force.

(1-3) In Embodiment 1, an example was shown in which the two-dimensional position estimating means 131, which is the individual recognition means, detects person regions from the captured image at each time (a still image, so to speak); however, person regions may also be detected by tracking each person using the captured images at successive times (a moving image, so to speak). In that case, once the combination of line-of-sight vectors belonging to the same object has been identified for a person, the brute-force trial of combinations can be omitted for that person thereafter.

[Embodiment 2]
In Embodiment 2, a three-dimensional tracking device, which is another example of the object recognition device, will be described. The three-dimensional tracking device in Embodiment 2 tracks a person within a common field of view based on captured images taken by a plurality of photographing means having the common field of view.

In Embodiment 2, tracking is performed by a method based on the particle filter. At each time, for each object being tracked, a plurality of candidates for the position of the object are set, a hypothesis corresponding to each candidate is set, and the position of the object is determined by integrating the hypotheses. In this specification, the single position determined for each tracked object at each time is referred to as the object position, and the plurality of candidates set for each tracked object at each time are referred to as candidate positions. That is, the candidates for the object position are the candidate positions.

In Embodiment 1, the "position where the individual recognition means recognized the object" referred to by the weighting determination means 132 when determining the weights was the upper third of the person region in which the two-dimensional position estimating means 131, the individual recognition means, detected the object, and the target of the weighting was the centroid position. In Embodiment 2, the "position where the individual recognition means recognized the object" referred to by the weighting determination means 532 when determining the weights is the position at which the candidate position setting and evaluation means 531, the individual recognition means, calculated the likelihood of the object, that is, the head projection region determined by the candidate position, and the target of the weighting is the likelihood. Hereinafter, the likelihood calculated by the candidate position setting and evaluation means 531 is referred to as the individual likelihood, and the likelihood obtained by integrating the individual likelihoods is referred to as the integrated likelihood.

FIG. 6 is a block diagram showing the configuration of the three-dimensional tracking device 5 in Embodiment 2. The photographing means 50a, 50b, 50c, the communication unit 51, and the display unit 54 are the same as the photographing means 10a, 10b, 10c, the communication unit 11, and the display unit 14 of Embodiment 1. The image processing unit 53 functions as congestion degree estimating means 530, candidate position setting and evaluation means (individual recognition means) 531, weighting determination means 532, object position determining means (integrated recognition means) 533, and tracking result output means 534. In addition to the camera parameters 520, the storage unit 52 stores object information 521.

The congestion degree estimating means 530 of Embodiment 2 is the same as the congestion degree estimating means 130 of Embodiment 1, except that its output destinations are the weighting determination means 532 and the tracking result output means 534. The camera parameters 520 are the same as the camera parameters 120 of Embodiment 1, but in Embodiment 2 they are used to project candidate positions and the like in the XYZ coordinate system onto the xy coordinate systems.

The object information 521 stores a three-dimensional shape model of the moving object and information on the moving objects being tracked. Specifically, the three-dimensional shape model of the moving object is a model formed by connecting three spheroids imitating the three-dimensional shapes of the head, torso, and legs of a standing person. Alternatively, a model imitating the three-dimensional shape of the whole body of a standing person with a single spheroid may be used.

The information on the moving objects being tracked is stored in association with an object ID identifying each person being tracked, and comprises the template of that person associated with the photographing means ID of each photographing means, the object position of that person in the XYZ coordinate system, and the hypotheses of that person. For each hypothesis, a hypothesis ID and a candidate position in the XYZ coordinate system are stored. Furthermore, for each hypothesis and in association with the photographing means ID of each photographing means, the following are stored: the whole-body projection region and the head projection region, onto the xy coordinate system of that photographing means, of the three-dimensional shape model placed at the candidate position; the individual likelihood of the candidate position calculated using the image captured by that photographing means; and the weight of the candidate position for that photographing means.
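For illustration, the tracked-object record described above might be organized as follows (all field names and container types are assumptions made for this sketch; the patent only specifies which pieces of information are associated with one another):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Hypothesis:
    hypothesis_id: int
    candidate_position: Tuple[float, float, float]                         # XYZ coordinate system
    body_projection: Dict[str, object] = field(default_factory=dict)       # camera ID -> whole-body region (xy)
    head_projection: Dict[str, object] = field(default_factory=dict)       # camera ID -> head region (xy)
    individual_likelihood: Dict[str, float] = field(default_factory=dict)  # camera ID -> individual likelihood L
    weight: Dict[str, float] = field(default_factory=dict)                 # camera ID -> weight W

@dataclass
class TrackedObject:
    object_id: int
    templates: Dict[str, object]                                           # camera ID -> appearance template
    object_position: Tuple[float, float, float]                            # XYZ coordinate system
    hypotheses: List[Hypothesis] = field(default_factory=list)
```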

The candidate position setting and evaluation means 531, which is the individual recognition means, analyzes the captured image of each photographing means, recognizes all or part of the object in the captured image, and generates an individual recognition result. In Embodiment 2, for each object being tracked, the candidate positions at the current time are predicted from past position information (object positions or candidate positions), the regions determined by each candidate position and the object shape (the whole-body projection region and the head projection region) are calculated in the image captured by each photographing means, and hypotheses containing the candidate position, both projection regions, and the individual likelihood, which is the degree to which the image features of the object appear in the whole-body projection region, are generated as individual recognition results and stored in the object information 521 of the storage unit 52.

Specifically, the candidate position setting and evaluation means 531 first refers to the object information 521 stored in the storage unit 52 and, for each person being tracked, extrapolates the current object position (an estimate of the object position) from the past object positions and randomly sets a plurality of candidate positions in the vicinity of the current object position. Alternatively, the current candidate positions may be extrapolated from the past candidate positions. For a person for whom there are not two or more time steps' worth of past object positions or past candidate positions, candidate positions are set in the vicinity of the object position one time step earlier. The object position and the candidate positions at this stage are coordinate values in the XYZ coordinate system.

Next, the candidate position setting and evaluation means 531 refers to the three-dimensional shape model in the object information 521 stored in the storage unit 52 and to the camera parameters 520 and, for each candidate position, projects the three-dimensional shape model placed at that candidate position onto the xy coordinate systems of the photographing means 50a, 50b, and 50c. For each candidate position, it likewise projects the three-dimensional shape model of the head placed at that candidate position onto the xy coordinate systems of the photographing means 50a, 50b, and 50c. Subsequently, for each candidate position of each person being tracked, the candidate position setting and evaluation means 531 generates a hypothesis containing the candidate position and the whole-body projection region and head projection region for each photographing means, and adds it to the object information 521. Then, for each candidate position of each person being tracked, the candidate position setting and evaluation means 531 extracts the image features of the whole-body projection region in the images captured by the photographing means 50a, 50b, and 50c, calculates individual likelihoods La, Lb, and Lc based on the similarity to the image features of the template of that person, and updates the object information 521 by adding the calculated individual likelihoods La, Lb, and Lc to the corresponding hypothesis. The upper third of the whole-body projection region may approximately be used as the head projection region; likewise, when the three-dimensional shape of the whole body is a single spheroid, the upper third of the whole-body projection region may be used as the head projection region.
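A minimal sketch of the candidate-position prediction step; linear extrapolation plus Gaussian scatter is one common particle-filter-style choice, and the particle count and noise scale are assumed parameters:

```python
from typing import Optional

import numpy as np

def predict_candidates(prev_position: np.ndarray,
                       prev_prev_position: Optional[np.ndarray] = None,
                       n_candidates: int = 50,
                       noise_std: float = 0.1) -> np.ndarray:
    """Extrapolate the current object position from past object positions and randomly
    scatter candidate positions (XYZ) around it."""
    if prev_prev_position is None:
        predicted = prev_position                            # fewer than two past positions available
    else:
        predicted = prev_position + (prev_position - prev_prev_position)
    rng = np.random.default_rng()
    return predicted + rng.normal(scale=noise_std, size=(n_candidates, 3))
```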

The weighting determination means 532 determines the weight W of each photographing means according to the degree of congestion at the position where the individual recognition means recognized the object on the image captured by that photographing means. In the second embodiment, for each candidate position, the weights Wa, Wb, Wc applied to the individual likelihoods La, Lb, Lc of the photographing means 10a, 10b, 10c are determined according to the degree of congestion of the head projection region determined by that candidate position and the object shape on the image captured by each photographing means.

Specifically, the weighting determination means 532 refers to the object information 521 stored in the storage unit 52 and to the congestion degree information input from the congestion degree estimation means 530, calculates, for each candidate position, the degree of non-overlap of the high-congestion region with the head projection region for each of the photographing means 10a, 10b, 10c as the weights Wa, Wb, Wc, appends the calculated weights W to the corresponding hypotheses, and updates the object information 521. The degree of sparseness may be used as the weight W instead of the degree of non-overlap.
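A minimal sketch of this non-overlap weight, assuming the head projection region and the high-congestion region are available as boolean masks over the captured image (the names are illustrative):

    import numpy as np

    def non_overlap_weight(head_mask, high_congestion_mask):
        """Weight W = fraction of the head projection region NOT covered by the high-congestion region."""
        area = np.count_nonzero(head_mask)
        if area == 0:
            return 0.0
        overlap = np.count_nonzero(head_mask & high_congestion_mask)
        return 1.0 - overlap / area

    # Example for one candidate position: one (head mask, congestion mask) pair per photographing means
    # weights = [non_overlap_weight(h, c) for h, c in zip(head_masks, congestion_masks)]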

つまり、頭部投影領域内の混雑度が高い個別認識結果ほど重みWは小さくなる。これは背後の群集の影響で個別認識結果の信頼度が低くなることを意味する。他方、頭部投影領域内の混雑度が低い個別認識結果ほど重みWは高くなる。これは群集の影響が少なく個別認識結果の信頼度が高くなることを意味する。このような重みWの違いは、認識対象の物体と背後の群集の撮影画像上での位置関係が撮影手段との位置関係によって異なることで生じる。そのため、認識対象の物体の領域における混雑度に応じて各撮影手段の重みWを決定することで、撮影手段と群集の位置関係により変わる個別認識結果の信頼度を評価できる。 That is, the weight W becomes smaller as the individual recognition result has a higher degree of congestion in the head projection area. This means that the reliability of the individual recognition result is low due to the influence of the crowd behind. On the other hand, the weight W becomes higher as the individual recognition result has a lower degree of congestion in the head projection area. This means that the influence of the crowd is small and the reliability of the individual recognition result is high. Such a difference in weight W occurs because the positional relationship between the object to be recognized and the crowd behind it on the captured image differs depending on the positional relationship with the photographing means. Therefore, by determining the weight W of each photographing means according to the degree of congestion in the area of the object to be recognized, the reliability of the individual recognition result that changes depending on the positional relationship between the photographing means and the crowd can be evaluated.

統合認識手段である物体位置決定手段533は、重み付けに基づいて撮影手段ごとの個別認識結果を統合する。換言すると物体位置決定手段533は、各移動物体における複数の候補位置に基づいて、現時刻における移動物体の物体位置を求める。 The object position determining means 533, which is an integrated recognition means, integrates the individual recognition results for each photographing means based on the weighting. In other words, the object position determining means 533 obtains the object position of the moving object at the current time based on a plurality of candidate positions in each moving object.

In the present embodiment, the object position determination means 533 calculates, in the XYZ coordinate system and for each moving object, the object position of that moving object by integrating, based on the weights W, the per-photographing-means individual likelihoods of each candidate position of the moving object, and then taking a weighted average of the candidate positions using the integrated likelihoods as weights U. The calculated object position in the XYZ coordinate system is associated with the moving object and stored in the object information 521 of the storage unit 52.
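One way to read this integration step is the following sketch. The integrated likelihood of each candidate is taken here as the weighted sum of its per-camera individual likelihoods (the patent does not fix the exact combination rule, so this is an assumption), and the object position is the average of the candidate positions weighted by the integrated likelihoods U.

    import numpy as np

    def integrate_and_localize(candidates, individual_likelihoods, camera_weights):
        """Estimate the object position from weighted candidate positions.

        candidates: (N, 3) candidate positions in the XYZ coordinate system.
        individual_likelihoods: (N, C) likelihoods La, Lb, Lc, ... per candidate and camera.
        camera_weights: (N, C) weights Wa, Wb, Wc, ... per candidate and camera.
        """
        L = np.asarray(individual_likelihoods, dtype=float)
        W = np.asarray(camera_weights, dtype=float)
        # integrated likelihood U of each candidate (weighted combination across cameras)
        U = (W * L).sum(axis=1)
        U = U / (U.sum() + 1e-9)
        # object position = weighted average of the candidate positions
        return (np.asarray(candidates, dtype=float) * U[:, None]).sum(axis=0)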

The object position determination means 533 updates the object position, hypotheses and template for each object being tracked, determines whether a new object exists and registers object information for that new object, and performs processing for disappeared objects. The processing for objects being tracked, the processing for new objects, and the processing for disappeared objects are described in turn below.

[Moving object being tracked]
For an object whose object position has been determined by the object position determination means 533, the determined object position is additionally stored, a shape model is placed at each object position at the current time and projected onto each captured image, the image features of the whole-body projection region are extracted, and the template of the object for each photographing means is updated with the image features at the current time. The update may replace the stored image features with the extracted image features, or may take a weighted average of the extracted image features and the stored image features.
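The weighted-average variant of the template update can be written, for example, as an exponential blend; the blending factor alpha below is an assumption, not a value specified in the patent.

    def update_template(stored_features, extracted_features, alpha=0.5):
        """Blend the newly extracted image features into the stored template.

        alpha = 1.0 replaces the template entirely with the current features;
        smaller values keep more of the stored template.
        """
        return (1.0 - alpha) * stored_features + alpha * extracted_features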

[New object]
The object position determination means 533 detects background subtraction regions by performing subtraction processing between each captured image and a background image captured when no object (person) to be tracked was present in the monitored space, and also places a shape model at each object position at the current time, projects it onto each captured image, and extracts the background subtraction regions that do not overlap any whole-body projection region. If a non-overlapping background subtraction region has an area effective for an object to be tracked (the area TS), the object position determination means 533 determines that a new object exists in that non-overlapping background subtraction region. When it is determined that a new object exists, the three-dimensional position of the non-overlapping background subtraction region is estimated by the same method as in the first embodiment to derive the object position in the XYZ coordinate system. The template of the object and the object position of the object are stored in the object information 521 of the storage unit 52 in association with an object ID. The object position determination means 533 also stores captured images taken when no object to be tracked is present in the storage unit 52 as background images, and updates the background image with the captured image of the regions in which no background subtraction region was detected.
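A rough sketch of this new-object check based on background subtraction follows; the difference threshold, the use of grayscale images and the pixel-count area test are illustrative assumptions.

    import numpy as np

    def detect_new_object(frame, background, body_masks, diff_threshold=30, area_threshold_ts=500):
        """Detect a new object from the background subtraction region not covered by any tracked object.

        frame, background: grayscale images of identical shape.
        body_masks: list of boolean whole-body projection masks of the tracked objects.
        area_threshold_ts: minimum area (in pixels) regarded as valid for a tracking target.
        """
        diff = np.abs(frame.astype(int) - background.astype(int)) >= diff_threshold
        covered = np.zeros_like(diff)
        for m in body_masks:
            covered |= m
        non_overlapping = diff & ~covered
        new_object_found = np.count_nonzero(non_overlapping) >= area_threshold_ts
        return new_object_found, non_overlapping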

[Disappeared object]
The object position determination means 533 determines that an object whose individual likelihoods L have all fallen to or below the threshold TL, for example because the object has been hidden by an occluding object or has moved out of the captured images, is a disappeared object with no object position, and deletes the object information of that object.

The tracking result output means 534 generates, for example, a movement trajectory image in which the time series of object positions of each tracked object is plotted in the XYZ coordinate system, and projects it onto the xy coordinate system of the photographing means 10a, 10b, 10c. In addition, colors corresponding to the degrees of congestion are defined in advance, and a congestion degree image is generated in which each pixel corresponding to a pixel of the congestion degree map is assigned the pixel value of the color corresponding to the congestion degree of that pixel. An image obtained by alpha-blending the movement trajectory image of each photographing means 10a, 10b, 10c with the congestion degree image of that photographing means is output to the display unit 54. The captured image at the current time may further be superimposed.
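For the display step, a colorized congestion image and its alpha blend with the trajectory image might be produced as below; OpenCV is used only as an example, and the specific color map and blending ratio are assumptions.

    import cv2
    import numpy as np

    def render_overlay(trajectory_image, congestion_map, alpha=0.5):
        """Colorize a congestion degree map and alpha-blend it with the trajectory image.

        trajectory_image: BGR uint8 image of the plotted trajectories (per photographing means).
        congestion_map: float array in [0, 1] with the per-pixel degree of congestion.
        """
        congestion_u8 = np.clip(congestion_map * 255, 0, 255).astype(np.uint8)
        congestion_color = cv2.applyColorMap(congestion_u8, cv2.COLORMAP_JET)  # one color per congestion level
        return cv2.addWeighted(trajectory_image, 1.0 - alpha, congestion_color, alpha, 0.0)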

次に、図7、図8に基づいて本実施形態2における三次元追跡装置5の処理例を説明する。図7は、追跡人物および群衆と各撮影手段の撮影画像の関係を示す図である。図7に示すように、撮影手段10a,10b,10cそれぞれにおいて、共通視野に存在する追跡中の人物600及び群衆610を撮影画像621,622,623として撮影する。 Next, a processing example of the three-dimensional tracking device 5 in the second embodiment will be described with reference to FIGS. 7 and 8. FIG. 7 is a diagram showing the relationship between the tracking person and the crowd and the captured image of each photographing means. As shown in FIG. 7, each of the photographing means 10a, 10b, and 10c photographs the tracking person 600 and the crowd 610 existing in the common field of view as captured images 621, 622, 623.

To determine the position of the person 600 to be tracked in three-dimensional space, a plurality of candidate positions 630 are set around the head of the person 600 in three-dimensional space. The congestion degree estimation means 530 extracts high-congestion regions 651, 652, 653 from the captured images 621, 622, 623. On the captured images 621 and 623 of the photographing means 10a and 10c, the tracked persons 641 and 643 do not overlap the high-congestion regions 651 and 653, but on the captured image 622 of the photographing means 10b, the tracked person 642 overlaps the high-congestion region 652. Consequently, the weights W of the candidate positions for the photographing means 10a and 10c become large, while the weights W of the candidate positions for the photographing means 10b become small.

図8(a)は追跡中の人物について設定された候補位置の一つに対して撮影手段10bの重みWを決定する様子を示す図である。図8(a)に示すように、三次元空間上の追跡中の人物600と群衆610を撮影手段10bで撮影する。人物600に対して候補位置700が設定されたとすると、撮影手段10bの撮影画像622において対応する位置710を頭部中心とする頭部投影領域720が得られる。また、群衆610の位置が高混雑度領域652として設定される。撮影手段毎、仮説毎に頭部投影領域720と高混雑度領域652との非重複率に応じて重みWが決定される。撮影手段10bに関する候補位置710についての頭部投影領域720は高混雑度領域652と重複している(非重複率が低い)ため、重みWが小さくなる。 FIG. 8A is a diagram showing how the weight W of the photographing means 10b is determined for one of the candidate positions set for the person being tracked. As shown in FIG. 8A, the tracking person 600 and the crowd 610 in the three-dimensional space are photographed by the photographing means 10b. Assuming that the candidate position 700 is set for the person 600, a head projection region 720 centered on the head at the corresponding position 710 in the captured image 622 of the photographing means 10b can be obtained. Also, the position of the crowd 610 is set as the high congestion area 652. The weight W is determined according to the non-overlapping rate between the head projection region 720 and the high congestion region 652 for each imaging means and each hypothesis. Since the head projection region 720 for the candidate position 710 with respect to the photographing means 10b overlaps with the high congestion degree region 652 (the non-overlapping rate is low), the weight W becomes small.

FIG. 8(b) shows the individual likelihoods for the photographing means 10a, 10b, 10c before weighting. A plurality of candidate positions are set for the person 600, and all of these candidate positions are evaluated for likelihood on the captured image of each photographing means. The square 730, the triangle 731, and the pentagon 732 represent the same candidate position, and the position of each symbol indicates the candidate position. The size of the square 730 indicates the magnitude of the individual likelihood obtained from the captured image of the photographing means 10a, the size of the triangle 731 indicates that obtained from the captured image of the photographing means 10b, and the size of the pentagon 732 indicates that obtained from the captured image of the photographing means 10c. The candidate positions 730 and 732 for the photographing means 10a and 10c are not affected by the high-congestion region 652, so their likelihoods are evaluated correctly. For the candidate positions 731 for the photographing means 10b, the upper right side is affected by the high-congestion region 652, the likelihood cannot be evaluated correctly, and the individual likelihoods there come out erroneously high.

FIG. 8(c) shows the weighted individual likelihoods obtained by multiplying the individual likelihoods of FIG. 8(b) by the weights W based on the degree of congestion. The candidate positions 740 and 742 for the photographing means 10a and 10c have a low degree of congestion and therefore a large weight W, so their points are drawn large. The candidate positions 741 for the photographing means 10b have a high degree of congestion and therefore a small weight W, so their points are drawn small. As a result, the influence of the hypotheses for the photographing means 10b, whose individual likelihoods could not be calculated correctly because of the crowd 610 (high-congestion region 652), is reduced. Therefore, when the object position is obtained as a weighted average based on the candidate positions, the weights W and the individual likelihoods, the influence of the hypotheses for the photographing means 10b can be kept small and the object position can be determined with high accuracy.

[Operation example of the three-dimensional tracking device 5]
The operation of the three-dimensional tracking device 5 is described below. FIG. 9 is an overall flow diagram of the operation of the three-dimensional tracking device 5. When the operation of the three-dimensional tracking device 5 starts, the photographing means 10a, 10b, 10c sequentially output captured images to the image processing unit 53. Each time captured images are input (step S500), the image processing unit 53 repeats the series of processes of steps S501 to S510.

The image processing unit 53 causes the congestion degree estimation means 530 to output a congestion degree map for each of the captured images acquired by the photographing means 10a, 10b, 10c, and extracts regions whose congestion degree is equal to or greater than a predetermined threshold T1 as high-congestion regions (step S501).

For each person recorded in the object information 521 of the storage unit 52, the image processing unit 53 performs tracking processing on the input captured images and estimates the current object position (steps S502 to S508). The image processing unit 53 selects the tracked persons recorded in the object information 521 of the storage unit 52 one by one as the target of the tracking processing; when the tracking processing has been completed for all tracked persons, the image processing unit 53 advances the processing to step S509, whereas it continues the tracking processing while unprocessed tracked persons remain (step S508).

The tracking processing of steps S502 to S508 is described in more detail below. The image processing unit 53 functions as the candidate position setting / evaluation means 531, sets hypotheses for each tracked person in the XYZ coordinate system, and projects the three-dimensional shape model placed at the candidate position indicated by each hypothesis onto the xy coordinate system of the photographing means 10a, 10b, 10c (step S502). That is, the candidate position setting / evaluation means 531 predicts the current candidate positions from the past tracking information and sets the candidate positions in the hypotheses.

The image processing unit 53 functions as the weighting determination means 532, refers to the object information 521 stored in the storage unit 52 and the congestion degree information input from the congestion degree estimation means 530, calculates, for each candidate position, the degree of non-overlap of the high-congestion region with the head projection region for each of the photographing means 10a, 10b, 10c as the weights Wa, Wb, Wc, appends the calculated weights Wa, Wb, Wc to the corresponding hypotheses, and updates the object information 521 (step S503).

The image processing unit 53 functions as the candidate position setting / evaluation means 531 and, for each hypothesis set in step S502, calculates the individual likelihoods La, Lb, Lc based on the similarity between the image features of the whole-body projection region in the captured images of the photographing means 10a, 10b, 10c and the image features of the template of the person (step S504). Note that a separate template is held for each photographing means.

The image processing unit 53 then functions as the object position determination means 533, determines whether tracking can be continued based on the individual likelihoods of the hypotheses calculated in step S504 (step S505), and performs tracking end processing when it determines that tracking cannot be continued (step S506). Tracking of a person determined to be untrackable is thereby terminated, and the object position determination means 533 deletes the information on that person from the object information 521 of the storage unit 52. Here, a person for whom all individual likelihoods are below the threshold TL is determined to be untrackable; information on persons who no longer appear in the captured images is thus deleted.

When it is determined in step S505 that tracking can be continued, the object position determination means 533 calculates integrated likelihoods based on the candidate positions of the hypothesis group set in step S502, the weights W calculated in step S503, and the individual likelihoods calculated in step S504, and estimates the object position of the tracked person based on the integrated likelihoods and the candidate positions (step S507).

When the above tracking processes S502 to S507 have been performed for all persons registered in the object information 521 of the storage unit 52, the image processing unit 53 advances the processing to step S509 as described above, and the object position determination means 533 detects persons in the captured images for which tracking has not yet been set and, if any are detected, adds them as new tracked persons (step S509). For a person added as a new tracked person, the object position is obtained by the method of the first embodiment.

ステップS500で入力された撮影画像に対し上述した処理S501〜S509により人物の追跡が完了すると、画像処理部53は追跡結果を表示部54へ出力する(ステップS510)。例えば、画像処理部53は追跡結果として全人物の物体位置を表示部54の表示装置等に表示させる。 When the tracking of the person is completed by the above-mentioned processes S501 to S509 for the captured image input in step S500, the image processing unit 53 outputs the tracking result to the display unit 54 (step S510). For example, the image processing unit 53 causes the display device or the like of the display unit 54 to display the object positions of all persons as a tracking result.

[Modified examples of Embodiment 2]
(2-1) In the second embodiment, the weighting determination means 532 calculates the weight W using a three-dimensional shape model, but the weight W can also be calculated without using a three-dimensional shape model. For example, a relational expression that yields a larger weight W for a lower degree of congestion is defined in advance, the degree of congestion at the projected point of the candidate position is obtained from the congestion degree map, and the relational expression is applied to the obtained degree of congestion to calculate the weight W.

Alternatively, the degrees of congestion in a neighborhood region (for example, 5 × 5 pixels) centered on the projected point of the candidate position are obtained from the congestion degree map, and the relational expression is applied to a representative value of the obtained degrees of congestion to calculate the weight W. The representative value is, for example, the maximum, the mean, or the mode. In this modification, the "position where the individual recognition means recognized the object" is the "projected point of the candidate position" or the "neighborhood region centered on the projected point of the candidate position".
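A sketch of this modification follows; the decreasing relational expression exp(-k * congestion), the neighborhood size and the choice of representative statistic are assumptions chosen only to illustrate the idea.

    import numpy as np

    def weight_from_congestion_map(congestion_map, point_xy, half_window=2, k=3.0, reducer=np.max):
        """Weight W from the congestion degree around the projected point of a candidate position.

        point_xy: (x, y) projected point in image coordinates.
        reducer: representative value over the neighborhood (e.g. np.max or np.mean).
        k: steepness of the predefined relation giving a larger W for a lower congestion degree.
        """
        x, y = int(round(point_xy[0])), int(round(point_xy[1]))
        h, w = congestion_map.shape
        y0, y1 = max(0, y - half_window), min(h, y + half_window + 1)
        x0, x1 = max(0, x - half_window), min(w, x + half_window + 1)
        congestion = float(reducer(congestion_map[y0:y1, x0:x1]))
        return float(np.exp(-k * congestion))   # lower congestion -> larger weight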

(2-2) The second embodiment has shown an example in which the weighting determination means 532 determines the weight W for each combination of the photographing means 10a, 10b, 10c and a candidate position, but the weight W may instead be determined, approximately, for each combination of the photographing means 10a, 10b, 10c and an object. That is, one weight W is determined for the group of candidate positions of an object.

(2-2-1) For example, for each object, a three-dimensional shape model of the head is placed at each of the plural candidate positions of the object in the XYZ coordinate system, and the plural placed three-dimensional shape models are projected together onto the xy coordinate system of the photographing means 10a, 10b, 10c. The projection region of these plural three-dimensional shape models is regarded as the "position where the individual recognition means recognized the object". The weight W for each combination of the photographing means 10a, 10b, 10c and the object is then calculated based on the degree of congestion within the projection region of that object for each of the photographing means 10a, 10b, 10c.

(2-2-2) Alternatively, for example, for each object, the smallest possible sphere or ellipsoid enclosing the plural candidate positions of the object is derived in the XYZ coordinate system, and the derived sphere or ellipsoid is projected onto the xy coordinate system of the photographing means 10a, 10b, 10c. As in the preceding example, the projection region of this small sphere or ellipsoid is regarded as the "position where the individual recognition means recognized the object". The weight W for each combination of the photographing means 10a, 10b, 10c and the object is then calculated based on the degree of congestion within the projection region of that object for each of the photographing means 10a, 10b, 10c.

(2-2-3) Alternatively, for example, for each object, the current object position is predicted by extrapolating from the past object positions of the object in the XYZ coordinate system, and a three-dimensional shape model of the head placed at the predicted position is projected onto the xy coordinate system of the photographing means 10a, 10b, 10c. This projection region can be regarded as representative of the projection regions of the two preceding examples, and the projection region for each photographing means is regarded as the "position where the individual recognition means recognized the object". The weight W for each combination of the photographing means 10a, 10b, 10c and the object is then calculated based on the degree of congestion within the projection region of that object for each of the photographing means 10a, 10b, 10c.

As in modification (2-1), in modifications (2-2-1) and (2-2-3) the weight W may be calculated based on the degree of congestion at the projected point of the candidate position itself, or in its neighborhood region, instead of the projection region of the three-dimensional shape model. In these cases, the same weight W is assigned to all hypotheses of the same object.

(2-3) In the second embodiment and its modifications, the weighting determination means 532 determines the weight W using only the degree of congestion, but the weight W can also be determined by additionally judging how suitable each photographing means is for tracking from various other factors, such as the distance from the photographing means to the tracking target and the degree of occlusion by other persons or obstacles.

(2-4) In the second embodiment and its modifications, the candidate position setting / evaluation means 531 calculates the individual likelihood of each hypothesis (that is, performs individual recognition) for all photographing means, but a single photographing means may instead be assigned to each hypothesis and the individual likelihood calculated for that photographing means only. In this case there is no integration of likelihoods, and the object position determination means 533 can be configured to take a weighted average of the candidate positions using the product of the weight W and the individual likelihood; in that configuration, the target of weighting by the weight W is the candidate position. Alternatively, the weighting can be realized through the number of hypotheses. For example, the candidate position setting / evaluation means 531 predicts the object position as in modification (2-2-3), calculates the weight W for each combination of a photographing means and the object at the predicted position, and sets a number of candidate positions for each combination of a photographing means and the object according to the weight W of that combination. If N candidate positions are allotted per object and the weight of the object of interest for photographing means C is W_C, the number of candidate positions of the object for photographing means C is N × W_C / Σ W_C. In that configuration too, the target of weighting by the weight W is the candidate position.
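The allotment of candidate positions in proportion to the weights W_C could be realized, for example, as follows; this is a sketch under the assumption that fractional allotments are rounded down and the remainder is given to the most strongly weighted cameras.

    import numpy as np

    def allot_candidates(weights_per_camera, total_candidates):
        """Distribute N candidate positions over the cameras in proportion to W_C / sum(W_C)."""
        w = np.asarray(weights_per_camera, dtype=float)
        if w.sum() <= 0:
            # degenerate case: spread the candidates evenly
            w = np.ones_like(w)
        counts = np.floor(total_candidates * w / w.sum()).astype(int)
        # hand out the remainder to the most strongly weighted cameras
        for i in np.argsort(-w)[: total_candidates - counts.sum()]:
            counts[i] += 1
        return counts

    # Example: allot_candidates([0.8, 0.1, 0.6], 100) -> array([54, 6, 40])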

(2-5) In the second embodiment and its modifications, an example was shown in which the object position determination means 533 detects new objects based on background subtraction processing. Instead, new objects may be detected using a trained model obtained by machine learning on images of a large, unspecified number of objects of the type to be tracked (for example, by deep learning on images of a large, unspecified number of people). In that case, the object position determination means 533 inputs the captured image into the trained model to detect object regions, and determines that a new object exists in an object region whose portion not overlapping any shape model has a size equal to or larger than the threshold TS.

[Modified examples common to Embodiments 1 and 2]
(3-1) In the first and second embodiments and their modifications, the weighting determination means calculates the weight W based simply on the degree of congestion at the position of the object, but the weight W may also be calculated taking into account the degree of congestion in the region along the viewing direction toward the object.

In the example shown in FIG. 10(a), for the person 800, the degree of congestion in the region 831 on the captured image 821 of the photographing means 10a and the degree of congestion in the region 832 on the captured image 822 of the photographing means 10b are about the same. However, seen from the photographing means 10a the person 800 is in front of the crowd 810 and is not occluded, whereas seen from the photographing means 10b the person 800 is behind the crowd 810 and is partially occluded. The individual recognition result for the photographing means 10a is therefore more reliable than that for the photographing means 10b.

To this end, the weighting determination means 532 of the second embodiment calculates, in addition to the head projection region 850 obtained by placing the three-dimensional head shape model at the candidate position, a head projection region 851 for a position closer to the photographing means than the candidate position on the straight line connecting the candidate position and the position of the photographing means, and a head projection region 852 for a position farther from the photographing means than the candidate position on the same line, and takes the degree of congestion in each of these head projection regions into account. In the example shown in FIG. 10(b), the indices (degree of non-overlap, degree of sparseness or degree of congestion) are calculated for the head projection region 851 on the side closer to the photographing means 10a and for the head projection region 852 on the farther side.

The weighting determination means 132 of the first embodiment performs this approximately. For example, if the photographing means is a wide-angle camera installed looking down, the person region is shifted toward the bottom of the screen to obtain a person region at a position closer to the photographing means than the candidate position, and shifted toward the top of the screen to obtain a person region at a position farther from the photographing means than the candidate position. As another example, if the photographing means is a fisheye camera installed looking down, the person region is shifted toward the center along the radial line from the center of the screen to obtain a person region at a position closer to the photographing means than the candidate position, and shifted away from the center along the same radial line to obtain a person region at a position farther from the photographing means than the candidate position.

The shift amount may be adjusted according to the mounting position, angle and so on of the photographing means, and may be, for example, an amount such that the shifted region overlaps roughly half of the original region. The weighting determination means 132, 532 then obtains the average of the index at the candidate position, the index at the position closer to the photographing means, and the index at the position farther from the photographing means, and determines the weight W according to this average. Here, it is preferable to use a weighted average in which the index at the position closer to the photographing means is weighted more heavily than the index at the position farther from the photographing means.
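A small sketch of this weighted average of the three indices follows; the blend coefficients are assumptions that only express the requirement that the near-side index is weighted more heavily than the far-side index, and the weight W would then be determined from the returned average.

    def line_of_sight_index(index_near, index_candidate, index_far,
                            coeff_near=0.4, coeff_candidate=0.4, coeff_far=0.2):
        """Combine the indices (non-overlap, sparseness or congestion) measured at the
        near-side region, the candidate-position region and the far-side region.

        The near-side index gets a larger coefficient than the far-side index, reflecting
        that a crowd in front of the object affects the recognition result more strongly.
        """
        total = coeff_near + coeff_candidate + coeff_far
        return (coeff_near * index_near
                + coeff_candidate * index_candidate
                + coeff_far * index_far) / total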

(3-2) An example was shown in which the congestion degree estimation means 130, 530 use an estimator that outputs continuous values, but an estimator that outputs discrete degrees of congestion can also be used.

For example, the estimator is modeled as a multi-class SVM (Support Vector Machine), and the model is trained in advance using training images classified and labeled into four classes according to the degree of congestion: "background (unoccupied)", "low congestion", "medium congestion", and "high congestion". The congestion degree estimation means 130, 530 then set a window centered on each pixel of the captured image, input the feature amount of the image within the window into the estimator, and identify the class of each pixel. When the degree of non-overlap described above is used, the congestion degree estimation means 130, 530 treat the set of pixels labeled "high congestion" as the high-congestion region; when the degree of sparseness described above is used, each label is replaced by a predetermined numerical value according to its degree of congestion to form a discrete-valued congestion degree map.
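A possible sketch of such a window-based multi-class estimator using scikit-learn follows; the feature extractor extract_features is a hypothetical placeholder, and the window size, stride, class labels and SVM settings are assumptions rather than values given in the patent.

    import numpy as np
    from sklearn.svm import SVC

    CLASSES = ["background", "low", "medium", "high"]   # 4 congestion classes

    def extract_features(window):
        """Placeholder feature extractor for an image patch (here a simple intensity histogram)."""
        hist, _ = np.histogram(window, bins=32, range=(0, 255))
        return hist / (hist.sum() + 1e-9)

    def train_estimator(patches, labels):
        """patches: list of image windows; labels: indices into CLASSES."""
        X = np.stack([extract_features(p) for p in patches])
        return SVC(kernel="rbf").fit(X, labels)

    def classify_pixels(estimator, image, win=32, stride=8):
        """Slide a window over a grayscale image and assign its pixels a congestion class
        (a coarse approximation of per-pixel classification)."""
        h, w = image.shape
        label_map = np.zeros((h, w), dtype=int)
        for y in range(0, h - win, stride):
            for x in range(0, w - win, stride):
                feat = extract_features(image[y:y + win, x:x + win])
                label = estimator.predict(feat[None, :])[0]
                label_map[y:y + win, x:x + win] = label
        return label_map   # pixels labeled 3 ("high") form the high-congestion region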

Besides the multi-class SVM, the estimator can also be realized with various multi-class classifiers trained by a decision-tree-based random forest method, a multi-class AdaBoost method, a multi-class logistic regression method, or the like. Alternatively, the estimator can be realized by a classification-type CNN (in the case of a CNN, window scanning is unnecessary). Even when class-labeled training images are used, an estimator that outputs continuous congestion degrees can also be realized by adopting a regression-type model that regresses the degree of congestion from the feature amounts. In that case, the parameters of the regression function for obtaining the degree of congestion from the feature amounts are learned by ridge regression, support vector regression, a regression-tree-based random forest method, Gaussian process regression, or the like. Alternatively, the estimator may use a regression-type CNN (again, window scanning is unnecessary for a CNN).

(3−3)本発明は、車両、動物等、混雑状態をなし得る人以外の物体にも適用できる。 (3-3) The present invention can be applied to objects other than humans, such as vehicles and animals, which can be in a congested state.

1…三次元位置推定装置(物体認識装置)、10a,10b,10c,50a,50b,50c…撮影手段、11,51…通信部、12,52…記憶部、13,53…画像処理部、14,54…表示部、120,520…カメラパラメータ、130、530…混雑度推定手段、131…二次元位置推定手段(個別認識手段)、132、532…重付決定手段、133…三次元位置推定手段(統合認識手段)、134…推定結果出力手段、5…三次元追跡装置、521…物体情報、531…候補位置設定・評価手段(個別認識手段)、533…物体位置決定手段(統合認識手段)、534…追跡結果出力手段 1 ... Three-dimensional position estimation device (object recognition device), 10a, 10b, 10c, 50a, 50b, 50c ... Imaging means, 11,51 ... Communication unit, 12,52 ... Storage unit, 13,53 ... Image processing unit, 14,54 ... Display unit, 120,520 ... Camera parameters, 130, 530 ... Congestion degree estimation means, 131 ... Two-dimensional position estimation means (individual recognition means), 132, 532 ... Overload determination means, 133 ... Three-dimensional position Estimating means (integrated recognition means), 134 ... Estimating result output means, 5 ... 3D tracking device, 521 ... Object information, 331 ... Candidate position setting / evaluation means (individual recognition means), 533 ... Object position determination means (integrated recognition) Means) 534 ... Tracking result output means

Claims (5)

An object recognition device that recognizes an object based on captured images captured by a plurality of photographing means having a common field of view, the object recognition device comprising:
a congestion degree estimation means that estimates, for each photographing means, the degree of congestion of the objects captured in the captured image;
an individual recognition means that analyzes the captured image of each photographing means and recognizes all or part of the object on the captured image to generate an individual recognition result;
a weighting determination means that determines the weighting of each photographing means according to the degree of congestion at the position where the individual recognition means recognized the object on the captured image captured by that photographing means; and
an integrated recognition means that integrates the individual recognition results of the photographing means based on the weighting.
The object recognition device according to claim 1, wherein the congestion degree estimation means inputs the captured image into an estimator trained in advance to output, when given a captured image, the degree of congestion at any position within that captured image, and thereby estimates the degree of congestion at any position within the captured image, and
the weighting determination means determines the weighting of the photographing means for each region of the captured image according to the degree of congestion.
The object recognition device according to claim 1 or 2, wherein the individual recognition means analyzes, for each photographing means, the captured image to obtain position information of the object on the captured image at the current time, and
the integrated recognition means integrates the position information of the photographing means based on the weighting to determine the position of the object at the current time.
An object recognition method performed by an object recognition device that recognizes an object based on captured images captured by a plurality of photographing means having a common field of view, wherein:
a congestion degree estimation means estimates, for each photographing means, the degree of congestion of the objects captured in the captured image;
an individual recognition means analyzes the captured image of each photographing means and recognizes all or part of the object on the captured image to generate an individual recognition result;
a weighting determination means determines the weighting of each photographing means according to the degree of congestion at the position where the individual recognition means recognized the object on the captured image captured by that photographing means; and
an integrated recognition means integrates the individual recognition results of the photographing means based on the weighting.
An object recognition program executed in an object recognition device that recognizes an object based on captured images captured by a plurality of photographing means having a common field of view, the program causing the device to execute:
a process in which a congestion degree estimation means estimates, for each photographing means, the degree of congestion of the objects captured in the captured image;
a process in which an individual recognition means analyzes the captured image of each photographing means and recognizes all or part of the object on the captured image to generate an individual recognition result;
a process in which a weighting determination means determines the weighting of each photographing means according to the degree of congestion at the position where the individual recognition means recognized the object on the captured image captured by that photographing means; and
a process in which an integrated recognition means integrates the individual recognition results of the photographing means based on the weighting.
JP2020050235A 2020-03-19 2020-03-19 Device, method and program for object recognition Pending JP2021149687A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2020050235A JP2021149687A (en) 2020-03-19 2020-03-19 Device, method and program for object recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2020050235A JP2021149687A (en) 2020-03-19 2020-03-19 Device, method and program for object recognition

Publications (1)

Publication Number Publication Date
JP2021149687A true JP2021149687A (en) 2021-09-27

Family

ID=77849042

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2020050235A Pending JP2021149687A (en) 2020-03-19 2020-03-19 Device, method and program for object recognition

Country Status (1)

Country Link
JP (1) JP2021149687A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023218761A1 (en) * 2022-05-09 2023-11-16 日立Astemo株式会社 Abnormality diagnosis device
WO2024024048A1 (en) * 2022-07-28 2024-02-01 日本電信電話株式会社 Object detection device, object detection method, and object detection program


