JP7349290B2

JP7349290B2 - Object recognition device, object recognition method, and object recognition program

Info

Publication number: JP7349290B2
Application number: JP2019149092A
Authority: JP
Inventors: 友彦中村
Original assignee: Secom Co Ltd
Current assignee: Secom Co Ltd
Priority date: 2019-08-15
Filing date: 2019-08-15
Publication date: 2023-09-22
Anticipated expiration: 2039-08-15
Also published as: JP2021033374A

Description

The present invention relates to a technique for recognizing an object by detecting predetermined information about the object, such as its position (for example, its parts or the region it occupies), from measurement data such as images.

Detecting multiple parts of a person appearing in a captured image by means of machine learning is an active area of research.

For example, in the technique described in Non-Patent Document 1, a model is trained by deep learning with a large number of learning images showing people as inputs and, as target outputs, annotations that record the type and position of each human body part in those learning images. A captured image is then input to the trained model, which outputs the type and position of each body part of the people in the captured image. The annotations are created for the parts that appear in the learning images. The information on each part recorded in an annotation, and the information on each part output by the trained model, is referred to as a keypoint.

If the parts needed for the various kinds of recognition of a person can be detected, it becomes possible to recognize not only the person's posture but also the region the person occupies and attributes such as whether the person is an adult or a child, judged from body proportions.

“Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields.”, Z. Cao, T. Simon, S. Wei and Y. Sheikh (2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1302-1310)

However, conventional techniques estimate parts that do not appear in the captured image with low accuracy, so when objects occlude one another it becomes difficult to recognize the posture, occupied region, attributes, and so on of the objects.

For example, when two people overlap in a captured image, the partially hidden person may fail to be detected. If the region occupied by people is then recognized from that detection result, the partially hidden person is recognized as occupying no region. It is also difficult to recognize the posture and attributes of a partially hidden person.

That is, conventional techniques did not explicitly learn the relationship between a learning image and the parts that do not appear in that image, which made it difficult to detect parts that do not appear in a captured image. As a result, recognizing the posture, occupied region, attributes, and so on of an occluded object was sometimes difficult.

This problem arises not only with two-dimensional measurement data (images) but also with three-dimensional measurement data, and likewise with time series of two-dimensional or three-dimensional measurement data.

The present invention has been made in view of the above problem, and its object is to provide an object recognition device, an object recognition method, and an object recognition program that can recognize, with high accuracy, even an object that was measured while partially hidden by another object.

(1) An object recognition device according to the present invention is a device that recognizes, from measurement data in which a plurality of objects have been measured, predetermined information about each of the objects, and comprises: detector storage means that stores a detector generated in advance by learning in which the inputs are data obtained by masking, in sample data in which a sample of the object has been measured, a mask region that includes part of the region of the sample, together with information on that mask region, and the target output is the predetermined information about the sample; mask means that generates, for the measurement data, mask-processed data consisting of data obtained by masking an already-detected region of the measurement data in which an object has been detected, and information on that already-detected region; and detection means that inputs the mask-processed data to the detector and detects the predetermined information about an object in the measurement data.

(2) In the object recognition device according to (1) above, the detector storage means may store a detector for which the learning was also performed with the sample data, in its state before the masking, as the input, and the detection means may, when there is no already-detected region, input the measurement data to the detector in place of the mask-processed data and detect the predetermined information about an object in the measurement data.

(3) In the object recognition device according to (1) or (2) above, the predetermined information may be information on the positions of the parts that make up the object.

(4) The object recognition device according to any of (1) to (3) above may further comprise learning means that generates the detector by the learning based on the sample data.

(5) An object recognition method according to the present invention is a method of recognizing, from measurement data in which a plurality of objects have been measured, predetermined information about each of the objects, and comprises: a step of preparing a detector generated in advance by learning in which the inputs are data obtained by masking, in sample data in which a sample of the object has been measured, a mask region that includes part of the region of the sample, together with information on that mask region, and the target output is the predetermined information about the sample; a step of generating, for the measurement data, mask-processed data consisting of data obtained by masking an already-detected region of the measurement data in which an object has been detected, and information on that already-detected region; and a step of inputting the mask-processed data to the detector and detecting the predetermined information about an object in the measurement data.

(6) An object recognition program according to the present invention is a program that causes a computer to perform processing for recognizing, from measurement data in which a plurality of objects have been measured, predetermined information about each of the objects, and causes the computer to function as: detector storage means that stores a detector generated in advance by learning in which the inputs are data obtained by masking, in sample data in which a sample of the object has been measured, a mask region that includes part of the region of the sample, together with information on that mask region, and the target output is the predetermined information about the sample; mask means that generates, for the measurement data, mask-processed data consisting of data obtained by masking an already-detected region of the measurement data in which an object has been detected, and information on that already-detected region; and detection means that inputs the mask-processed data to the detector and detects the predetermined information about an object in the measurement data.

According to the object recognition device, object recognition method, and object recognition program of the present invention, even an object measured while partially hidden by another object can be recognized with high accuracy.

FIG. 1 is a block diagram showing the schematic configuration of an object recognition device according to an embodiment of the present invention.
FIG. 2 is a schematic functional block diagram of the object recognition device according to the embodiment for the learning stage.
FIG. 3 is a schematic diagram illustrating an example of the learning data used in the learning stage of the object recognition device according to the embodiment.
FIG. 4 is a schematic functional block diagram of the object recognition device according to the embodiment for the recognition stage.
FIG. 5 is a schematic flow diagram of the operation of the object recognition device according to the embodiment in the learning stage.
FIG. 6 is a schematic flow diagram of the operation of the object recognition device according to the embodiment in the recognition stage.
FIG. 7 is a schematic diagram for explaining an example of processing in the recognition stage of the object recognition device according to the embodiment.

An object recognition device 1 according to an embodiment of the present invention (hereinafter, "the embodiment") is described below with reference to the drawings. An object recognition device according to the present invention obtains predetermined information about a predetermined object from measurement data. The object recognition device 1 presented as an example in this embodiment detects, from a captured image of a monitored space, the positions of the parts that make up each person appearing in the monitored space and the region that person occupies (the object region). That is, in this embodiment the measurement data is a two-dimensional image, the object is a person, and the predetermined information about the object is information on the position of the object, specifically the positions of the person's body parts and the region the person occupies. The object recognition device 1 detects, as the object region of each person, a region surrounding the parts detected for that person. In the measurement data, some parts may be hidden by another object. In the present invention, "detection of parts" includes estimating hidden parts, and "detection of the object region" includes estimating the object region using the estimated hidden parts.

The parts used for the object recognition described above are called detection-target parts, and the representative point of each detection-target part is called a keypoint. The information on a keypoint is expressed at least as a combination of the type and position of the corresponding part, and data containing this combination is called part data. Detecting each keypoint yields the position of the corresponding detection-target part. Which types of parts are treated as detection-target parts is determined in advance according to the object and the purpose of recognition.
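As a rough illustration of part data and of an object region that surrounds the detected parts, the following minimal sketch may help. The class name `PartData`, the function `object_region`, the part labels, and the choice of an axis-aligned rectangle as the surrounding region are all assumptions made for illustration; the description does not prescribe a data layout or a region shape.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PartData:
    """One keypoint: the part type plus its (x, y) position in the image."""
    part_type: str  # hypothetical label such as "head" or "left_knee"
    x: float
    y: float


def object_region(parts: List[PartData]) -> Tuple[float, float, float, float]:
    """Return the box (x_min, y_min, x_max, y_max) enclosing all keypoints."""
    if not parts:
        raise ValueError("at least one keypoint is required")
    xs = [p.x for p in parts]
    ys = [p.y for p in parts]
    return (min(xs), min(ys), max(xs), max(ys))


# Three keypoints of one person (illustrative values only).
person = [
    PartData("head", 52.0, 10.0),
    PartData("left_knee", 40.0, 95.0),
    PartData("right_ankle", 61.0, 130.0),
]
region = object_region(person)
```

Because hidden parts are estimated as well, a region computed this way would also cover the occluded portion of the person.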

The object recognition device 1 learns and stores a detector that detects part data, using learning images and annotations (assigned data) for those images. Here, assigned data is the part data assigned to an object appearing in measurement data used for learning. The object recognition device 1 then uses the stored detector to detect part data in captured images. In particular, the device performs learning with part of each learning image masked, which yields a detector that can detect parts including hidden ones. At recognition time, the device masks the region of an object in the captured image that may be hiding another object and then applies the detector, so that detection covers hidden parts as well.
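The recognition-time masking can be pictured with the sketch below, which builds the pair of a masked image and the information on the masked region from a captured image. It assumes, purely for illustration, that already-detected regions are given as rectangles `(x1, y1, x2, y2)` and are filled with a constant colour; the function name `make_mask_processing_data` is hypothetical.

```python
import numpy as np


def make_mask_processing_data(image, detected_boxes, mask_value):
    """Mask already-detected regions; return (masked image, mask region info).

    image:          H x W x 3 array
    detected_boxes: list of (x1, y1, x2, y2), covering x1 <= x < x2, y1 <= y < y2
    mask_value:     length-3 fill colour (an assumed constant)
    """
    masked = image.copy()
    mask_info = np.zeros(image.shape[:2], dtype=np.uint8)
    for x1, y1, x2, y2 in detected_boxes:
        masked[y1:y2, x1:x2] = mask_value  # hide the detected object
        mask_info[y1:y2, x1:x2] = 1        # 1 marks masked pixels
    return masked, mask_info


image = np.full((6, 8, 3), 200, dtype=np.uint8)
boxes = [(1, 1, 4, 5), (3, 0, 6, 2)]  # two overlapping detected regions
masked, info = make_mask_processing_data(image, boxes, (128, 128, 128))
```

Both outputs would then be passed to the detector together, so that it can tell which pixels to ignore.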

[Configuration of object recognition device 1]
FIG. 1 is a block diagram showing the schematic configuration of the object recognition device 1. The object recognition device 1 comprises a photographing unit 2, a communication unit 3, a storage unit 4, an image processing unit 5, and a display unit 6.

The photographing unit 2 is measurement means that acquires the measurement data, and in this embodiment it is a surveillance camera. The photographing unit 2 is connected to the image processing unit 5 via the communication unit 3; it photographs the monitored space at predetermined time intervals to generate captured images and inputs them sequentially to the image processing unit 5. For example, the photographing unit 2 is mounted on a pole in a corner of an event venue that serves as the monitored space, with a predetermined fixed field of view overlooking that space, and photographs the space at a frame period of one second to generate color images. The photographing unit 2 may generate monochrome images instead of color images. The image processing unit 5 is installed, for example, at an image analysis center.

The communication unit 3 is a communication circuit, one end of which is connected to the image processing unit 5 and the other end to the photographing unit 2 and the display unit 6. The communication unit 3 acquires captured images from the photographing unit 2 and inputs them to the image processing unit 5, and it receives the object recognition results from the image processing unit 5 and outputs them to the display unit 6.

The photographing unit 2, communication unit 3, storage unit 4, image processing unit 5, and display unit 6 are connected in whatever manner suits where each unit is installed. For example, when the photographing unit 2 is installed remotely from the communication unit 3 and the image processing unit 5, the photographing unit 2 and the communication unit 3 can be connected over an Internet line. The communication unit 3 and the image processing unit 5 can be connected by a bus. A LAN (Local Area Network), various cables, and the like can also be used as connection means.

The storage unit 4 is a memory device such as a ROM (Read Only Memory) or RAM (Random Access Memory) and stores various programs and data. For example, the storage unit 4 stores the learning images, the assigned data for the learning images, and the information of the detector, which is a trained model. The storage unit 4 is connected to the image processing unit 5 and exchanges this information with it. That is, the information needed to recognize objects and the information produced in the course of recognition processing are input and output between the storage unit 4 and the image processing unit 5.

The image processing unit 5 is a measurement data processing unit that processes the measurement data, and is made up of an arithmetic device such as a CPU (Central Processing Unit), DSP (Digital Signal Processor), MCU (Micro Control Unit), or GPU (Graphics Processing Unit). By reading programs from the storage unit 4 and executing them, the image processing unit 5 operates as various processing means and control means; as needed, it reads data from the storage unit 4 and stores the data it generates in the storage unit 4. For example, the image processing unit 5 learns and generates the detector, and stores the generated detector in the storage unit 4 via the communication unit 3. The image processing unit 5 also uses the detector to recognize objects in captured images.

The display unit 6 is a liquid crystal display, an organic EL (Electro-Luminescence) display, or the like, and displays the recognition results input from the communication unit 3. A monitoring operator judges from the displayed results whether a response is needed and, where necessary, takes action such as dispatching response personnel.

In this embodiment the storage unit 4 and the image processing unit 5 are located at the image analysis center, but they may instead be located on the photographing unit 2 side.

The configuration of the object recognition device 1 is described below, first for the learning stage, in which the detector is trained, and then for the recognition stage, in which the detector is used to recognize objects.

[Configuration of object recognition device 1 for the learning stage]
FIG. 2 is a schematic functional block diagram of the object recognition device 1 for the learning stage. The storage unit 4 functions as learning data storage means 40 and learning model storage means 41, and the image processing unit 5 functions as learning mask means 50 and learning means 51.

The learning data storage means 40 is learning image storage means that stores a large number of learning images, and is also assigned data storage means that stores the assigned data for those learning images. The learning images and assigned data include both items stored in advance before learning and items generated during learning. The learning data storage means 40 holds each learning image in association with the assigned data of each person photographed in it. Hereinafter, a person photographed in a learning image is called a sample. Different people are different samples, and the same person appearing in different images also counts as different samples. A learning image is measurement data in which samples have been measured (sample data). Specifically, each sample is given a sample ID that distinguishes it from the others, each learning image is given an image ID, and the learning data storage means 40 stores the correspondence between these IDs. The assigned data contains information on each keypoint of each sample; that is, the assigned data gives, for each sample, the position of the representative point of each type of detection-target part. The learning data storage means 40 also stores the learning data generated by the learning mask means 50.

The learning mask means 50 acquires a learning image from the learning data storage means 40 and creates a masked learning image by masking part of it. It then outputs information representing the masked region (mask region information) and the masked learning image to the learning data storage means 40. The mask region information is, for example, a binary image with the same width and height as the masked learning image, holding 1 in the masked region and 0 elsewhere.

The learning mask means 50 sets the position and size of the mask region at random. Specifically, it draws two distinct random integers from the range 0 to the image width in pixels and names them x1 and x2 (x1 < x2), and draws two distinct random integers from the range 0 to the image height in pixels and names them y1 and y2 (y1 < y2). The region of the image satisfying x1 ≤ x < x2 and y1 ≤ y < y2 is the region to be masked and is filled with a predetermined value (the mask value). In this embodiment the mask value is the mean color over all pixels of all learning images.
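The random mask generation described above can be sketched in NumPy as follows. The function names `mean_color` and `random_rect_mask` are hypothetical, and reading the ranges as inclusive of the image width and height (so that a mask can reach the image edge) is an interpretation rather than something the text states outright.

```python
import numpy as np


def mean_color(images):
    """Mean colour over every pixel of every image (each image is H x W x 3)."""
    pixels = np.concatenate([im.reshape(-1, 3) for im in images], axis=0)
    return pixels.mean(axis=0)


def random_rect_mask(image, mask_value, rng):
    """Return (masked image, binary mask region info, (x1, y1, x2, y2))."""
    h, w = image.shape[:2]
    # Two distinct integers in [0, w] and in [0, h], sorted so x1 < x2, y1 < y2.
    x1, x2 = sorted(rng.choice(w + 1, size=2, replace=False))
    y1, y2 = sorted(rng.choice(h + 1, size=2, replace=False))
    masked = image.copy()
    masked[y1:y2, x1:x2] = mask_value   # fill x1 <= x < x2, y1 <= y < y2
    mask_info = np.zeros((h, w), dtype=np.uint8)
    mask_info[y1:y2, x1:x2] = 1         # 1 inside the mask, 0 elsewhere
    return masked, mask_info, (int(x1), int(y1), int(x2), int(y2))


rng = np.random.default_rng(0)
images = [np.full((4, 6, 3), 100, dtype=np.uint8),
          np.full((4, 6, 3), 200, dtype=np.uint8)]
mask_value = mean_color(images)
masked, mask_info, box = random_rect_mask(images[0], mask_value, rng)
```

Calling `random_rect_mask` repeatedly on the same learning image yields the varied masked learning images that the learning stage relies on.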

Depending on the random numbers, the mask region may not overlap the image of the object. The learning mask means 50 therefore repeats the generation process a sufficient number of times for each learning image, so that masked learning images in which the mask region overlaps part of the object image are certain to be produced. Alternatively, the learning mask means 50 may repeat, a fixed number of times for each learning image, a process that refers to the assigned data and generates a masked learning image only when at least one keypoint is confirmed to lie inside the mask region.
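The second alternative, keeping a mask only when it contains at least one keypoint, might look like the sketch below. The helper names and the representation of keypoints as `(x, y)` integer pairs are assumptions for illustration.

```python
import numpy as np


def sample_box(width, height, rng):
    """Draw a random rectangle (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    x1, x2 = sorted(int(v) for v in rng.choice(width + 1, size=2, replace=False))
    y1, y2 = sorted(int(v) for v in rng.choice(height + 1, size=2, replace=False))
    return x1, y1, x2, y2


def covers_keypoint(box, keypoints):
    """True if any keypoint (x, y) satisfies x1 <= x < x2 and y1 <= y < y2."""
    x1, y1, x2, y2 = box
    return any(x1 <= x < x2 and y1 <= y < y2 for x, y in keypoints)


def sample_boxes_covering_keypoints(width, height, keypoints, n_trials, rng):
    """Run a fixed number of trials; keep only boxes containing a keypoint."""
    kept = []
    for _ in range(n_trials):
        box = sample_box(width, height, rng)
        if covers_keypoint(box, keypoints):
            kept.append(box)
    return kept


rng = np.random.default_rng(0)
keypoints = [(10, 5), (12, 20), (9, 30)]  # illustrative positions only
kept = sample_boxes_covering_keypoints(32, 40, keypoints, 200, rng)
```

Only the boxes in `kept` would go on to produce masked learning images, so every generated image hides at least one annotated part.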

FIG. 3 is a schematic diagram illustrating an example of learning data. It shows the learning data generated from a learning image 100 in which a sample appears and the assigned data 101 of that sample.

The learning image 100 is an example of a learning image stored in advance in the learning data storage means 40. A learning image prepared in advance need not be a real image actually taken with a camera; it may be, for example, an image created by computer graphics (CG). The assigned data 101 is assigned data stored in advance in the learning data storage means 40 in association with the learning image 100. Assigned data prepared in advance may be created manually, may be created by having a person check and, where necessary, correct what a machine has extracted, or may be a mixture of both. The assigned data shown here is an example with 17 detection-target parts, drawn as a diagram of the topology of a person's keypoints: 17 white circles represent the keypoint positions and 16 line segments represent the connections between keypoints.

Based on the learning image 100 prepared in advance in the learning data storage means 40, the learning mask means 50 generates multiple masked learning images, each associated with its mask region information. FIG. 3 shows masked learning images 110, 120, and 130 as three examples of the varied masked learning images that the learning mask means 50 generates from the learning image 100. The hatched area in each is the mask region: in the masked learning image 110 only the person's lower body is masked, in the masked learning image 120 most of the person's central part is masked, and in the masked learning image 130 most of the person's left half is masked.

Mask region information 111, 121, and 131 is the mask region information that the learning mask means 50 generates for the masked learning images 110, 120, and 130, respectively. Assigned data 112, 122, and 132 is the assigned data associated with the masked learning images 110, 120, and 130, respectively; each is a copy of the assigned data 101.

In the above example the learning mask means 50 generates mask regions at random, but the size and position of the mask regions may instead be set systematically and exhaustively. Likewise, although the learning mask means 50 generates rectangular mask regions in the above example, the mask region may be an ellipse, may take the shape of an expected occluding object, or may have any other suitable shape.

The learning means 51 trains the detector using at least the masked learning images, mask region information, and assigned data obtained from the learning mask means 50. That is, the learning means 51 generates the detector by learning in which the masked learning image and the mask region information are the inputs and the assigned data is the target output. Preferably, the learning means 51 also uses the original learning images (unmasked learning images) to train the detector. In that case, the learning means 51 generates, for each unmasked learning image, mask region information whose elements are all 0 and uses it in learning. In other words, the learning model of the detector stored in the learning model storage means 41 is trained to take as input both sets of a masked learning image with its mask region information and sets of an unmasked learning image with its mask region information, and to output the keypoints of each sample contained in each set. Learning here means determining the parameters of the detector from the given data.
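How the two kinds of input sets share one target can be sketched as follows. The function name `build_training_examples` and the tuple layout are hypothetical; the point illustrated is only that the unmasked image is paired with an all-zero mask and that every variant keeps the annotation of the original image as its target.

```python
import numpy as np


def build_training_examples(image, annotation, masked_variants):
    """Return a list of (input image, mask region info, target) triples.

    masked_variants: list of (masked image, mask region info) pairs made
    from `image`. The target is always the annotation of the original image.
    """
    h, w = image.shape[:2]
    # Unmasked learning image paired with mask region info that is all zeros.
    examples = [(image, np.zeros((h, w), dtype=np.uint8), annotation)]
    for masked_image, mask_info in masked_variants:
        examples.append((masked_image, mask_info, annotation))
    return examples


image = np.zeros((4, 4, 3), dtype=np.uint8)
annotation = [("head", 1, 0), ("left_knee", 2, 3)]  # illustrative keypoints
mask_info = np.zeros((4, 4), dtype=np.uint8)
mask_info[2:4, 0:4] = 1                              # lower half masked
masked_image = image.copy()
masked_image[mask_info == 1] = 128
examples = build_training_examples(image, annotation, [(masked_image, mask_info)])
```

Here the keypoint at (2, 3) lies inside the masked lower half yet remains in the target, which is what pushes the detector to infer hidden parts.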

本実施形態では検出器を、畳み込み層、線形変換処理、活性化関数などから構成される畳み込みニューラルネットワーク（Convolutional Neural Networks;ＣＮＮｓ）を用いてモデル化する。活性化関数としてはＲｅＬＵ関数を用いる。学習手段５１は、ＣＮＮｓを構成する各要素のパラメータについて、その出力と対応する付与データとの乖離度を定量化する誤差関数を最小化することで学習を行う。誤差関数は事前に定めておく。最小化には確率的最急降下法などを用いる。 In this embodiment, the detector is modeled using convolutional neural networks (CNNs), which are composed of convolutional layers, linear transformation processing, activation functions, and the like. The ReLU function is used as the activation function. The learning means 51 learns the parameters of each element constituting the CNNs by minimizing an error function that quantifies the degree of deviation between the CNN output and the corresponding assigned data. The error function is determined in advance. A method such as stochastic gradient descent (stochastic steepest descent) is used for the minimization.

ここで、検出器ではマスク領域情報を手掛かりとして、マスク済み画像についてはマスクされていない部分のみを用いて、出力を計算する。このようにマスキングされた部分を用いずにマスキングされる前の目標値を検出するよう学習を行うことで、検出器は、マスクされた領域内にキーポイントがあった場合でも、その周囲のマスクされていない部分の情報を用いて、マスクされた領域内のキーポイントを検出できるようになる。 Here, the detector uses the mask area information as a clue and, for a masked image, calculates its output using only the unmasked portion. By learning in this way to detect the pre-masking target values without using the masked portion, the detector becomes able to detect a key point that lies inside the masked area by using the information in the surrounding unmasked portion.

検出器として、マスク領域情報を手掛かりとしマスク済み学習用画像のマスクされていない部分を用いて、出力を計算するためのモジュールであればどんなものを用いてもよいが、本実施形態では下記文献で提案されたパーシャルコンボリューション（Partial Convolution）を畳み込み層として用いることでこのモジュールを構成する。 Any module may be used as the detector as long as it calculates its output from the unmasked portion of the masked learning image using the mask area information as a clue. In this embodiment, the module is constructed by using the partial convolution proposed in the following reference as the convolutional layer.

Guilin Liu, Fitsum A. Reda, Kevin J. Shih, Ting-Chun Wang, Andrew Tao and Bryan Catanzaro, “Image Inpainting for Irregular Holes Using Partial Convolutions,” The European Conference on Computer Vision (ECCV), pp. 85-100, 2018.
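
The behaviour of a partial convolution layer can be illustrated with a minimal NumPy sketch. It is not the detector of this embodiment: it assumes a single channel, stride 1 and no padding, and the function name `partial_conv2d` is hypothetical. Note that the cited paper marks valid pixels with 1, whereas the mask area information in this description stores 1 inside the mask area, so the sketch inverts the mask first.

```python
import numpy as np

def partial_conv2d(image, masked, kernel, bias=0.0):
    """Single-channel partial convolution (illustrative sketch only).

    image  : (H, W) float array
    masked : (H, W) array, 1 inside the mask area and 0 elsewhere
    kernel : (k, k) float array
    Returns (output, new_masked), both of shape (H-k+1, W-k+1).
    """
    valid = 1.0 - masked.astype(float)      # 1 = usable pixel, as in Liu et al.
    k = kernel.shape[0]
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    output = np.zeros((out_h, out_w))
    new_masked = np.ones((out_h, out_w))    # stays 1 where no valid pixel was seen
    for y in range(out_h):
        for x in range(out_w):
            v = valid[y:y + k, x:x + k]
            n_valid = v.sum()
            if n_valid > 0:
                patch = image[y:y + k, x:x + k]
                # Use only unmasked pixels and rescale by (window size / valid count).
                scale = v.size / n_valid
                output[y, x] = (kernel * patch * v).sum() * scale + bias
                new_masked[y, x] = 0.0      # this output position is now valid
    return output, new_masked
```

With an averaging kernel, a masked pixel holding an arbitrary value does not influence the result, and the updated mask shrinks layer by layer as valid information propagates into the masked area.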

ここで、ＣＮＮｓに入力する際に、マスク済み学習用画像は各画素値に対して全学習用画像の全画素値から求めた平均色を引くことで正規化される。つまり、マスクされた領域の画素値が０となるように正規化される。 Here, when inputting to CNNs, the masked learning images are normalized by subtracting the average color obtained from all pixel values of all learning images from each pixel value. In other words, the pixel values in the masked area are normalized to zero.
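
The effect of this normalization can be sketched as follows, assuming (as in this embodiment) that the mask value is the mean color of the learning images. The function name `mask_and_normalize` and the array layout are illustrative assumptions.

```python
import numpy as np

def mask_and_normalize(image, masked, mean_color):
    """Fill the mask area with the mean color, then subtract the mean color.

    image      : (H, W, 3) float array
    masked     : (H, W) array, 1 inside the mask area and 0 elsewhere
    mean_color : (3,) float array, mean color over all learning images
    """
    filled = image.copy()                 # leave the input image untouched
    filled[masked == 1] = mean_color      # mask value = mean color
    return filled - mean_color            # masked pixels become exactly 0
```

Because the masked pixels are set to the same value that is subtracted, they are exactly zero after normalization, while unmasked pixels keep their mean-centered values.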

学習モデル記憶手段４１は、検出器についての学習モデルを記憶する。具体的には、学習手段５１によって得られた検出器のパラメータを記憶する。また、検出器として用いるＣＮＮｓの構造が格納される。学習手段５１による学習処理に伴い、学習モデル記憶手段４１に記憶される学習モデルは更新される。そして、学習が完了すると、学習モデル記憶手段４１は検出器の学習済みモデルを記憶し、検出器記憶手段４３として機能する。 The learning model storage means 41 stores a learning model for the detector. Specifically, the detector parameters obtained by the learning means 51 are stored. Also, the structure of CNNs used as a detector is stored. Along with the learning process by the learning means 51, the learning model stored in the learning model storage means 41 is updated. When the learning is completed, the learning model storage means 41 stores the trained model of the detector and functions as the detector storage means 43.

［認識段階に関する対象物認識装置１の構成］ [Configuration of the object recognition device 1 for the recognition stage]
図４は認識段階に関する対象物認識装置１の概略の機能ブロック図であり、記憶部４が対象物領域記憶手段４２および検出器記憶手段４３として機能し、画像処理部５が既検出領域マスク手段５３、キーポイント検出手段５４および対象物領域算出手段５５として機能し、通信部３が画像処理部５と協働し、撮影画像取得手段３０および認識結果出力手段３１として機能する。 FIG. 4 is a schematic functional block diagram of the object recognition device 1 for the recognition stage. The storage section 4 functions as the object area storage means 42 and the detector storage means 43; the image processing section 5 functions as the detected area masking means 53, the key point detection means 54, and the object area calculation means 55; and the communication section 3 cooperates with the image processing section 5 to function as the photographed image acquisition means 30 and the recognition result output means 31.

撮影画像取得手段３０は撮影部２から撮影画像を順次取得して画像処理部５に出力する。 The photographed image acquisition means 30 sequentially acquires photographed images from the photographing section 2 and outputs them to the image processing section 5.

検出器記憶手段４３は上述したように、学習段階で生成された検出器を記憶している。 As described above, the detector storage means 43 stores the detectors generated in the learning stage.

対象物領域記憶手段４２は、対象物領域算出手段５５により算出された対象物領域の情報を記憶する。当該情報には当該対象物領域の元として検出された部位データ（検出部位データ）を含む。 The object area storage means 42 stores information on the object area calculated by the object area calculation means 55. The information includes part data (detected part data) detected as the source of the target object area.

既検出領域マスク手段５３は、撮影画像に対して、当該撮影画像にて対象物が検出された既検出領域をマスク処理して得られるマスク済み撮影画像と既検出領域の情報（既検出領域情報）とからなるマスク処理データを生成する。具体的には、既検出領域マスク手段５３は、対象物領域記憶手段４２から対象物領域を読み出して既検出領域を定め、これをマスク領域としてマスク済み撮影画像を生成する。また、既検出領域マスク手段５３はマスクした領域を示す既検出領域情報を生成する。既検出領域情報は、撮影画像と同一の幅および高さの二値画像であり、例えば、既検出領域内に１、それ以外に０が格納されている。マスク済み撮影画像のマスクした領域の画素値は、学習段階の学習用マスク手段５０で用いたマスク値で置き換える。 For a captured image, the detected area masking means 53 generates mask processing data consisting of a masked captured image, obtained by masking the detected area in which an object has already been detected in that captured image, and information on the detected area (detected area information). Specifically, the detected area masking means 53 reads the object area from the object area storage means 42, determines the detected area, and generates the masked captured image using it as the mask area. The detected area masking means 53 also generates detected area information indicating the masked area. The detected area information is a binary image with the same width and height as the captured image; for example, 1 is stored inside the detected area and 0 elsewhere. The pixel values of the masked area in the masked captured image are replaced with the mask value used by the learning masking means 50 in the learning stage.

なお、対象物領域記憶手段４２に対象物領域が１つも格納されていない場合は、既検出領域マスク手段５３は、既検出領域情報の全ての要素を０とし、マスク済み撮影画像に代えてマスク無し撮影画像である元の撮影画像を出力する。 Note that if no object area is stored in the object area storage means 42, the detected area masking means 53 sets every element of the detected area information to 0 and outputs the original captured image, that is, an unmasked captured image, in place of a masked captured image.

なお、既検出領域は対象物領域そのものであるとは限らない。例えば、本実施形態では、後述する外接矩形を対象物領域とするが、当該外接矩形の隅部分は背景であることが多いということに着目し、既検出領域として当該外接矩形に内接する楕円領域を用いてもよい。本実施形態では、既検出領域として、矩形の対象物領域に内接する楕円の短径および長径を定数倍して得られる楕円領域を用いる。また、後述するように、１つの撮影画像にて複数の対象物領域が検出された場合には、既検出領域はそれらについての合成を行った結果の領域となる。 Note that the detected area is not necessarily the object area itself. For example, in this embodiment the circumscribed rectangle described later is used as the object area, but since the corners of the circumscribed rectangle are often background, an elliptical area inscribed in the circumscribed rectangle may be used as the detected area. In this embodiment, the detected area is the elliptical area obtained by multiplying the minor and major axes of the ellipse inscribed in the rectangular object area by a constant. Furthermore, as described later, when multiple object areas are detected in one captured image, the detected area is the area obtained by combining them.
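
As a rough illustration of how the detected area information and the masked captured image might be produced, the following sketch rasterizes the scaled inscribed ellipse of each rectangular object area and takes their union. The function names, the `(x0, y0, x1, y1)` box format, and the `scale` parameter are assumptions made for this example.

```python
import numpy as np

def detected_area_info(height, width, boxes, scale=1.0):
    """Binary image with 1 inside the union of scaled inscribed ellipses.

    boxes : list of (x0, y0, x1, y1) rectangular object areas
    scale : constant applied to both ellipse axes
    """
    ys, xs = np.mgrid[0:height, 0:width]
    info = np.zeros((height, width), dtype=np.uint8)
    for x0, y0, x1, y1 in boxes:
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        rx = scale * (x1 - x0) / 2.0
        ry = scale * (y1 - y0) / 2.0
        if rx <= 0 or ry <= 0:
            continue                      # skip degenerate rectangles
        inside = ((xs - cx) / rx) ** 2 + ((ys - cy) / ry) ** 2 <= 1.0
        info[inside] = 1                  # union over all object areas
    return info

def apply_mask(image, info, mask_value):
    """Replace pixels inside the detected area with the mask value."""
    out = image.copy()
    out[info == 1] = mask_value
    return out
```

With an empty `boxes` list the function returns all zeros, which corresponds to the case where no object area has been stored yet.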

キーポイント検出手段５４は検出器記憶手段４３に格納されている検出器を読み出し、既検出領域マスク手段５３から得た撮影画像、またはマスク済み撮影画像と既検出領域情報とを当該検出器に入力する。検出器は当該画像に写っている対象物ごとの部位データ（検出部位データ）を出力し、キーポイント検出手段５４は検出部位データを対象物領域算出手段５５へ出力する。 The key point detection means 54 reads the detector stored in the detector storage means 43 and inputs to it either the captured image or the masked captured image obtained from the detected area masking means 53, together with the detected area information. The detector outputs part data (detected part data) for each object appearing in the image, and the key point detection means 54 outputs the detected part data to the object area calculation means 55.

対象物領域算出手段５５はキーポイント検出手段５４から検出部位データを入力され、検出部位データが示す要検出部位を基準とした所定範囲を計測データにおける対象物領域として算出し、算出した対象物領域の情報を対象物領域記憶手段４２に格納する。例えば、対象物領域算出手段５５は、各人物のキーポイント群に外接する外接矩形を、当該人物に関する対象物領域として算出することができる。 The object area calculation means 55 receives the detected part data from the key point detection means 54, calculates a predetermined range based on the detection-required parts indicated by the detected part data as the object area in the measurement data, and stores the information on the calculated object area in the object area storage means 42. For example, the object area calculation means 55 can calculate the rectangle circumscribing each person's group of key points as the object area for that person.

また、外接矩形を予め定めた比率で拡大して対象物領域としてもよい。つまり対象物領域の設定に際し、キーポイントが真の領域のやや内側に検出されることや検出誤差を考慮して上下左右にマージンを設ける。また、各キーポイントの定義や各キーポイントの検出誤差の見積もりに応じて上下左右の各方向に対する比率を異なるものとしてもよい。 Alternatively, the circumscribed rectangle may be expanded at a predetermined ratio to form the object area. That is, when setting the target object area, margins are provided on the top, bottom, left, and right sides, taking into consideration the fact that key points are detected slightly inside the true area and detection errors. Furthermore, the ratios for the up, down, left, and right directions may be different depending on the definition of each key point and the estimation of the detection error of each key point.
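
The circumscribed rectangle with per-direction margins could be computed along the following lines. This is a minimal sketch; the function name `object_area` and the convention that margins are ratios of the rectangle's width and height are assumptions for illustration.

```python
def object_area(keypoints, margin=(0.0, 0.0, 0.0, 0.0)):
    """Circumscribed rectangle of one object's key points, optionally enlarged.

    keypoints : list of (x, y) positions of the detected parts
    margin    : (left, top, right, bottom) ratios of the rectangle's
                width (left/right) and height (top/bottom)
    Returns (x0, y0, x1, y1).
    """
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
    w, h = x1 - x0, y1 - y0
    left, top, right, bottom = margin
    # A separate ratio per direction allows, e.g., a larger margin above the head.
    return (x0 - left * w, y0 - top * h, x1 + right * w, y1 + bottom * h)
```

For example, `object_area(points, margin=(0.1, 0.2, 0.1, 0.05))` widens the rectangle by 10% of its width on each side, 20% of its height at the top, and 5% at the bottom.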

または、人の部位データから対象物領域への変換を学習した変換器に各入力画像から検出された人物ごとの検出部位データを入力することにより、各検出部位データを対象物領域に変換してもよい。この場合、サンプルごとの対象物領域の正解データを付与データに含ませておき、変換器は、サンプルの付与データのうちの部位データを入力とし当該サンプルの付与データのうちの対象物領域を出力の目標値とする学習によって生成された学習済みモデルとすることができる。記憶部４は不図示の変換器記憶手段としても機能し、変換器を予め記憶する。そして、対象物領域算出手段５５は変換器記憶手段から変換器を読み出して利用する。つまり、対象物領域算出手段５５は、キーポイント検出手段５４が検出した検出部位データを変換器に入力して、計測データにおける対象物領域を検出する。 Alternatively, each set of detected part data may be converted into an object area by inputting the detected part data of each person detected in an input image into a converter that has learned the conversion from human part data to an object area. In this case, ground-truth object areas for each sample are included in the assigned data, and the converter can be a trained model generated by learning that takes the part data in a sample's assigned data as input and the object area in that sample's assigned data as the target output. The storage section 4 also functions as a converter storage means (not shown) and stores the converter in advance. The object area calculation means 55 then reads the converter from the converter storage means and uses it. That is, the object area calculation means 55 inputs the detected part data detected by the key point detection means 54 into the converter and detects the object area in the measurement data.

以上のようにして検出される検出部位データにおいては、計測データ上で隠れているキーポイントが補完されているため、隠れによって極端に小さな対象物領域が検出されてしまう不具合や、隠れによって１つの対象物に係る対象物領域が複数に分かれて検出されてしまう不具合を格段に低減できる。 In the detected part data obtained as described above, key points that are hidden in the measurement data are complemented. This greatly reduces both the problem of an extremely small object area being detected because of occlusion and the problem of the object area of a single object being split into several detected areas because of occlusion.

ここで、隠蔽されている対象物の部位が、当該対象物を隠蔽している対象物の領域をマスクする前に検出される場合がある。このようにして検出された部位は対象物の一部分のみであることが多く、対象物領域が本来よりも小さく誤認識される。さらに、場合によっては、例えば１人の人物の上半身の一部と下半身の一部が別々に検出されるというように、本来１つの対象物である領域が複数の対象物の領域と誤認識される。 Here, a part of an occluded object may be detected before the area of the object occluding it has been masked. Parts detected in this way often cover only a portion of the object, so the object area is misrecognized as smaller than it really is. Furthermore, in some cases an area that is really a single object is misrecognized as the areas of multiple objects, for example when part of the upper body and part of the lower body of one person are detected separately.

そこで、対象物領域算出手段５５は、キーポイント検出手段５４の出力（検出器の出力）のうち部位数が所定数（予め定めた下限数）未満である対象物のものを除き、部位数が下限数以上である対象物の対象物領域のみを算出する。このようにすることで誤認識の防止効果をさらに高めることができる。 Therefore, the object area calculation means 55 excludes the output of the key point detection means 54 (output of the detector) for which the number of parts is less than a predetermined number (a predetermined lower limit number), and calculates the number of parts. Only the target object areas of the target objects whose number is equal to or greater than the lower limit number are calculated. By doing so, the effect of preventing erroneous recognition can be further enhanced.

具体的には、例えば、対象物領域算出手段５５は、１７個の要検出部位のうちの半数以上（９個以上）が検出された人物に対してのみ対象物領域を算出して対象物領域記憶手段４２に追記させる。なお、下限数を半数とするのは一例であり、認識の目的に応じて他の値を適宜設定すればよい。 Specifically, for example, the object area calculation means 55 calculates an object area only for a person for whom at least half (9 or more) of the 17 detection-required parts have been detected, and appends it to the object area storage means 42. Note that setting the lower limit to half is only an example, and another value may be set as appropriate for the purpose of recognition.

また、部位数以外を対象物領域の算出条件としてもよい。例えば、対象物領域算出手段５５は、特定の要検出部位（例えば顔）が検出された対象物に対してのみ対象物領域を算出する。また、例えば、予め部位の種類ごとに重要度を定めておき、対象物領域算出手段５５は、検出された要検出部位の重要度の和が予め定めた閾値以上である対象物に対してのみ対象物領域を算出する。 Further, conditions other than the number of parts may be used as a condition for calculating the target object area. For example, the object area calculation means 55 calculates the object area only for objects for which a specific detection-required part (for example, a face) has been detected. Further, for example, the degree of importance is determined in advance for each type of part, and the object area calculation means 55 only applies to objects for which the sum of the degrees of importance of the detected parts to be detected is equal to or greater than a predetermined threshold. Calculate the object area.
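
The calculation conditions described above can be summarized in a small predicate. The text presents the part count, the required part, and the importance sum as alternative conditions; this sketch simply lets any subset of them be enabled at once, which is an assumption, as are the function name and the part names used in the example.

```python
def should_compute_area(detected_parts, min_parts=9, required_part=None,
                        importance=None, min_importance=None):
    """Decide whether an object area is calculated for one detected object.

    detected_parts : set of names of the parts detected for the object
    min_parts      : lower limit on the number of parts (0 disables the check)
    required_part  : name of a part that must be present, or None
    importance     : dict mapping part name to importance, or None
    min_importance : threshold on the summed importance, or None
    """
    if len(detected_parts) < min_parts:
        return False
    if required_part is not None and required_part not in detected_parts:
        return False
    if importance is not None and min_importance is not None:
        total = sum(importance.get(p, 0.0) for p in detected_parts)
        if total < min_importance:
            return False
    return True
```

With the defaults, the predicate reproduces the example of requiring 9 or more of 17 parts; passing `min_parts=0, required_part="face"` corresponds to the face-only condition.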

認識結果出力手段３１は、対象物領域算出手段５５が出力した対象物領域の情報を表示部６に出力する。例えば、認識結果出力手段３１は、撮影画像に対象物領域を表す矩形を描画した画像を生成して表示部６に出力する。なお、対象物領域が検出されなかった場合、認識結果は対象物無しであるとして撮影画像をそのまま出力してもよい。 The recognition result output means 31 outputs the information on the object area outputted by the object area calculation means 55 to the display section 6. For example, the recognition result output means 31 generates an image in which a rectangle representing the object area is drawn on the photographed image and outputs it to the display unit 6. Note that if the target object area is not detected, the recognition result may be that there is no target object, and the captured image may be output as is.

［対象物認識装置１の動作］ [Operation of the object recognition device 1]
次に、対象物認識装置１の動作を、学習段階と認識段階とに分けて説明する。 Next, the operation of the object recognition device 1 will be explained separately for the learning stage and the recognition stage.

［学習段階での対象物認識装置１の動作］ [Operation of the object recognition device 1 in the learning stage]
図５は学習段階での対象物認識装置１の動作に関する概略のフロー図である。 FIG. 5 is a schematic flow diagram of the operation of the object recognition device 1 in the learning stage.

対象物認識装置１は撮影画像に現れる対象物を認識する動作に先立って、検出器を学習する動作を行う。 The object recognition device 1 performs an operation of learning the detector prior to an operation of recognizing an object appearing in a photographed image.

当該学習の動作が開始されると、画像処理部５は検出器の学習を行うために、学習モデル記憶手段４１から検出器の学習モデルを読み出す。この時点での学習モデルのパラメータは初期値である（ステップＳ１０）。 When the learning operation is started, the image processing unit 5 reads the learning model of the detector from the learning model storage means 41 in order to perform the learning of the detector. The parameters of the learning model at this point are initial values (step S10).

続いて画像処理部５は学習用マスク手段５０として機能し、学習用データ記憶手段４０から、学習用画像および当該画像内のサンプル群に対する付与データを読み込む（ステップＳ１１）。学習用マスク手段５０は、当該画像に対しマスクする領域をランダムに設定する（ステップＳ１２）。さらに、学習用マスク手段５０は、設定した領域をマスクすることでマスク済み画像を作成する（ステップＳ１３）。学習用画像から生成されたマスク済み画像およびマスク領域情報は、当該学習用画像についての付与データと対応付けて学習用データ記憶手段４０に保存される。 Subsequently, the image processing section 5 functions as the learning mask means 50, and reads the learning image and the data assigned to the sample group within the image from the learning data storage means 40 (step S11). The learning masking means 50 randomly sets regions to be masked in the image (step S12). Furthermore, the learning masking means 50 creates a masked image by masking the set area (step S13). The masked image and mask area information generated from the learning image are stored in the learning data storage means 40 in association with the assigned data for the learning image.

次に画像処理部５は学習手段５１として機能し、学習用マスク手段５０で生成されたマスク済み学習用画像とマスク領域情報を入力とし、キーポイントを検出する（ステップＳ１４）。そして、検出されたキーポイントと、当該マスク済み学習用画像に対応付けられているマスクする前の学習用画像の付与データとの乖離度を、誤差関数を用いて計算し（ステップＳ１５）、誤差を用いて学習モデルのパラメータを更新する（ステップＳ１６）。 Next, the image processing section 5 functions as the learning means 51, which takes the masked learning image and mask area information generated by the learning masking means 50 as input and detects key points (step S14). It then uses the error function to calculate the degree of deviation between the detected key points and the assigned data of the pre-masking learning image associated with that masked learning image (step S15), and updates the parameters of the learning model using the error (step S16).

さらに、画像処理部５は学習手段５１として機能し、反復終了条件が満たされているかを判定する（ステップＳ１７）。満たされた場合（ステップＳ１７にて「ＹＥＳ」の場合）は、学習済みモデルを検出器として学習モデル記憶手段４１に格納する（ステップＳ１８）。一方、満たされない場合（ステップＳ１７にて「ＮＯ」の場合）は、反復終了条件が満たされるまでステップＳ１１からステップＳ１７の動作を反復する。反復終了条件は、例えば、誤差関数の値やその変化量が事前に定めた閾値よりも小さくなったことや、事前に定めた反復回数に達したことなどを用いることができる。 Furthermore, the image processing unit 5 functions as a learning means 51, and determines whether the repetition end condition is satisfied (step S17). If the condition is satisfied ("YES" in step S17), the learned model is stored as a detector in the learned model storage means 41 (step S18). On the other hand, if the condition is not satisfied ("NO" in step S17), the operations from step S11 to step S17 are repeated until the repetition end condition is satisfied. The repetition end condition may be, for example, that the value of the error function or its amount of change has become smaller than a predetermined threshold, or that a predetermined number of repetitions has been reached.
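
The loop structure of steps S10 to S18 can be sketched with a toy model. In the sketch below, a linear model trained by stochastic gradient descent stands in for the CNN detector, one-dimensional vectors stand in for learning images, and a scalar stands in for the key point target; all of these, along with the function name and parameters, are assumptions made only to keep the example short and runnable.

```python
import numpy as np

def train_detector(images, targets, lr=0.05, max_iters=5000, tol=1e-8, seed=0):
    """Toy version of steps S10-S18 with a linear model in place of the CNN.

    images  : (N, D) array, each row a flattened toy "learning image"
    targets : (N,) array, toy target value per image
    Returns (parameters, number of iterations performed).
    """
    rng = np.random.default_rng(seed)
    n, d = images.shape
    w = np.zeros(2 * d)                          # S10: initial parameters
    prev_err = np.inf
    it = 0
    for it in range(1, max_iters + 1):
        i = rng.integers(n)                      # S11: read one learning sample
        start = rng.integers(d)                  # S12: random mask area
        length = rng.integers(0, d - start + 1)  # length 0 gives an unmasked sample
        mask = np.zeros(d)
        mask[start:start + length] = 1.0
        # S13: masked image (masked entries set to 0) plus mask area information
        x = np.concatenate([images[i] * (1.0 - mask), mask])
        pred = w @ x                             # S14: detection
        err = 0.5 * (pred - targets[i]) ** 2     # S15: error against pre-masking target
        w -= lr * (pred - targets[i]) * x        # S16: parameter update
        # S17: stop when the error or its change falls below the threshold
        if err < tol or abs(prev_err - err) < tol:
            break
        prev_err = err
    return w, it                                 # S18: parameters to be stored
```

The iteration cap `max_iters` plays the role of the predetermined number of repetitions, and `tol` the role of the threshold on the error value and its change.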

［認識段階での対象物認識装置１の動作］ [Operation of the object recognition device 1 in the recognition stage]
図６は認識段階での対象物認識装置１の動作に関する概略のフロー図である。 FIG. 6 is a schematic flow diagram of the operation of the object recognition device 1 in the recognition stage.

対象物認識装置１は上述の学習段階にて生成した検出器を用いて、撮影画像に現れる対象物を認識する動作を行う。 The object recognition device 1 performs an operation of recognizing an object appearing in a photographed image using the detector generated in the above-described learning stage.

対象物認識装置１が当該動作を開始すると、撮影部２は所定時間おきに監視空間を撮影して撮影画像を順次、画像処理部５が設置されている画像解析センター宛に送信する。画像処理部５は通信部３と協働して、撮影部２から撮影画像を受信するたびに図６のフロー図に示す動作を繰り返す。 When the object recognition device 1 starts the operation, the photographing section 2 photographs the monitoring space at predetermined intervals and sequentially transmits the photographed images to an image analysis center where the image processing section 5 is installed. The image processing section 5 cooperates with the communication section 3 to repeat the operation shown in the flowchart of FIG. 6 every time it receives a photographed image from the photographing section 2.

通信部３は撮影画像取得手段３０として機能し、撮影画像を受信すると当該撮影画像を画像処理部５に出力する（ステップＳ２０）。 The communication section 3 functions as a photographed image acquisition means 30, and upon receiving a photographed image, outputs the photographed image to the image processing section 5 (step S20).

画像処理部５は既検出領域マスク手段５３として機能し、全ての要素が０である既検出領域情報を生成して撮影画像と共にキーポイント検出手段５４に入力する（ステップＳ２１）。 The image processing unit 5 functions as a detected area masking unit 53, generates detected area information in which all elements are 0, and inputs it to the key point detection unit 54 together with the captured image (step S21).

続いて、画像処理部５はキーポイント検出手段５４として機能し、キーポイント検出手段５４は、検出器記憶手段４３に記憶されている検出器を読み出し、入力された撮影画像および既検出領域情報を検出器に入力してキーポイントを人ごとに検出し検出部位データとして出力する（ステップＳ２２）。 Subsequently, the image processing section 5 functions as the key point detection means 54. The key point detection means 54 reads the detector stored in the detector storage means 43, inputs the received captured image and detected area information into the detector, detects key points for each person, and outputs them as detected part data (step S22).

次に、画像処理部５は対象物領域算出手段５５として機能し、キーポイント検出手段５４により検出された検出部位データを入力として、各人のキーポイントの対象物領域を算出する対象物認識処理を行う（ステップＳ２３）。なお、検出部位データが空である場合は、対象物領域算出手段５５は、対象物領域も空として出力する。 Next, the image processing unit 5 functions as an object area calculation means 55, and performs object recognition processing to calculate the object area of each person's key points using the detected part data detected by the key point detection means 54 as input. (Step S23). Note that when the detected part data is empty, the object area calculation means 55 outputs the object area as also empty.

対象物領域算出手段５５は、キーポイント検出手段５４による検出状況が反復終了条件を満たすか否かを判定する（ステップＳ２４）。対象物領域算出手段５５は、反復回数が予め定めた上限を越えた、またはキーポイント検出手段５４が検出した検出部位データに要検出部位数の半分以上の部位が検出された対象物のものが存在しない場合に、反復終了条件を満たしたと判定し、反復回数が上限以下であり且つ半分以上の要検出部位が検出された対象物が存在する場合は反復終了条件を満たさないと判定する。 The object area calculation means 55 determines whether the detection status of the key point detection means 54 satisfies the repetition end condition (step S24). It determines that the condition is satisfied when the number of repetitions has exceeded a predetermined upper limit, or when the detected part data from the key point detection means 54 contains no object for which at least half of the detection-required parts were detected. It determines that the condition is not satisfied when the number of repetitions is at or below the upper limit and there is an object for which at least half of the detection-required parts were detected.

反復終了条件を満たさない場合（ステップＳ２４にて「ＮＯ」の場合）、画像処理部５は以下に述べるステップＳ２５からＳ２７の処理の後、処理をステップＳ２２に戻す。 If the repetition end condition is not satisfied ("NO" in step S24), the image processing unit 5 returns the process to step S22 after processing steps S25 to S27 described below.

ステップＳ２５からＳ２７の処理にて、まず、画像処理部５は対象物領域算出手段５５として機能し、要検出部位数の半分以上の部位が検出された対象物の対象物領域の情報を対象物領域記憶手段４２に格納する（ステップＳ２５）。続いて画像処理部５は既検出領域マスク手段５３として機能し、対象物領域算出手段５５を介して、撮影画像についてこれまでに検出され対象物領域記憶手段４２に格納されている対象物領域を全て読み出し、当該既検出の対象物領域を合成して既検出領域情報を生成する（ステップＳ２６）。例えば、複数の対象物領域それぞれに対して設定される楕円の既検出領域を合成し、複数の楕円の和領域をマスク領域とする。その後、既検出領域マスク手段５３は、撮影画像に対して既検出領域情報が示す領域にマスク処理を施してマスク済み画像を生成し（ステップＳ２７）、既検出領域情報と共にキーポイント検出手段５４に入力する。つまり、ステップＳ２２にて既検出領域マスク手段５３からキーポイント検出手段５４に入力されるのは、初回のステップＳ２２では、マスク無し撮影画像、およびマスク領域がないことを示す既検出領域情報であったが、２回目以降のステップＳ２２ではマスク済み撮影画像、およびそのマスクされた領域を示す既検出領域情報である。 In steps S25 to S27, the image processing section 5 first functions as the object area calculation means 55 and stores, in the object area storage means 42, the object area information of each object for which at least half of the detection-required parts were detected (step S25). Next, the image processing section 5 functions as the detected area masking means 53, which reads out, via the object area calculation means 55, all object areas detected so far for the captured image and stored in the object area storage means 42, and combines these already detected object areas to generate the detected area information (step S26). For example, the elliptical detected areas set for the individual object areas are combined, and the union of the ellipses is used as the mask area. The detected area masking means 53 then masks the area of the captured image indicated by the detected area information to generate a masked image (step S27) and inputs it, together with the detected area information, to the key point detection means 54. In other words, what the detected area masking means 53 inputs to the key point detection means 54 in step S22 is, the first time, the unmasked captured image and detected area information indicating that there is no mask area, whereas from the second time onward it is the masked captured image and detected area information indicating its masked area.

このマスク済み撮影画像を用いたステップＳ２２，Ｓ２３の処理はステップＳ２４の反復終了条件を満たすまで繰り返される。 The processing in steps S22 and S23 using this masked photographic image is repeated until the repetition termination condition of step S24 is satisfied.

反復終了条件が満たされた場合（ステップＳ２４にて「ＹＥＳ」の場合）、対象物領域算出手段５５は、対象物領域記憶手段４２に格納された対象物領域を全て読み出す。読み出された対象物領域は、認識処理結果として通信部３を介して表示部６に出力される（ステップＳ２８）。具体的には、画像処理部５と通信部３とが協働して認識結果出力手段３１として機能し、対象物領域算出手段５５から入力された対象物領域などの情報から認識画像を作成し、これを表示部６に出力する。なお、対象物領域記憶手段４２に格納されていた検出部位データと対象物領域は、次に取得される撮影画像の処理前に消去する。 If the repetition end condition is satisfied (“YES” in step S24), the object area calculation means 55 reads out all the object areas stored in the object area storage means 42. The read object area is output to the display unit 6 via the communication unit 3 as a recognition processing result (step S28). Specifically, the image processing section 5 and the communication section 3 cooperate to function as the recognition result output means 31, and create a recognized image from information such as the object area inputted from the object area calculation means 55. , and outputs this to the display section 6. Note that the detected body part data and object area stored in the object area storage means 42 are deleted before processing the next captured image.
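
The control flow of steps S21 to S28 can be outlined as below. The detector, the masking step, and the area calculation are passed in as callables because their concrete form is outside this sketch; the function name `recognize`, its parameters, and the representation of detected part data as a dict per object are all illustrative assumptions.

```python
def recognize(image, detect, mask_image, compute_area, min_parts=9, max_iters=10):
    """Iterative detect-and-mask loop for one captured image (sketch).

    detect(masked, info)      -> list of detected part data, one dict per object
    mask_image(image, areas)  -> (masked image, detected area information)
    compute_area(part_data)   -> object area for one object
    """
    stored = []                                    # object area storage
    masked, info = mask_image(image, stored)       # S21: nothing detected yet
    for _ in range(max_iters):                     # upper limit on repetitions
        detections = detect(masked, info)          # S22: key point detection
        # S23/S24: keep only objects with enough detected parts
        accepted = [d for d in detections if len(d) >= min_parts]
        if not accepted:
            break                                  # repetition end condition met
        stored.extend(compute_area(d) for d in accepted)   # S25: store areas
        masked, info = mask_image(image, stored)   # S26/S27: mask detected areas
    return stored                                  # S28: recognition result
```

Each pass masks everything found so far, so an object that was occluded in an earlier pass gets a chance to be detected in a later one.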

図７は、対象物認識装置１の認識段階での処理例を説明するための模式図であり、図７を用いて、既検出領域マスク手段５３、キーポイント検出手段５４、対象物領域算出手段５５の動作を説明する。 FIG. 7 is a schematic diagram illustrating an example of processing in the recognition stage of the object recognition device 1. The operation of the detected area masking means 53, the key point detection means 54, and the object area calculation means 55 is explained with reference to FIG. 7.

図７に例として示す撮影画像２００には、２人の人物２０１，２０２が写っている。この撮影画像２００に対するキーポイント検出、対象物領域算出の１回目の処理では、既検出領域マスク手段５３からキーポイント検出手段５４にマスク無しの撮影画像２００が入力され、処理結果２１０には、人物２０２に対応するキーポイント２１１が検出され、対象物領域２１２としてキーポイント２１１の外接矩形が算出される。 A photographed image 200 shown as an example in FIG. 7 includes two people 201 and 202. In the first process of key point detection and target object area calculation for this photographed image 200, the photographed image 200 without a mask is input from the detected area masking means 53 to the key point detection means 54, and the processing result 210 includes a person A key point 211 corresponding to 202 is detected, and a circumscribed rectangle of the key point 211 is calculated as the object region 212.

つまり、検出器は画像から部位に関する情報を抽出し検出を行うため、撮影画像２００のように人物同士にて隠蔽が生じている状況では、手前にいる人物２０２と奥にいる人物２０１とが重なっている領域に関して、主に手前にいる人物２０２の特徴を捉える。その結果、そのような画像に対して、検出器は奥にいる人物２０１のキーポイントを検出することができない。 That is, because the detector performs detection by extracting part-related information from the image, in a situation where people occlude one another, as in the captured image 200, it mainly captures the features of the person 202 in the foreground for the area where the person 202 in the foreground and the person 201 in the background overlap. As a result, for such an image the detector cannot detect the key points of the person 201 in the background.

この点に関し、対象物認識装置１は、一度で全ての人物を検出するのではなく、複数回キーポイント検出を行うことで人物を検出することを目指す。図７の例の撮影画像２００に対するキーポイント検出、対象物領域算出の２回目の処理では、既検出領域マスク手段５３からキーポイント検出手段５４にマスク済み撮影画像２２０が入力される。当該画像２２０では、対象物領域２１２に基づいて生成された既検出領域２２１がマスク処理される。この画像２２０に対する処理結果２３０には、人物２０１に対応するキーポイント２３１が検出され、対象物領域２３２としてキーポイント２３１の外接矩形が算出される。 In this regard, the object recognition device 1 aims to detect people by performing key point detection multiple times rather than detecting all people at once. In the second round of key point detection and object area calculation for the captured image 200 in the example of FIG. 7, the masked captured image 220 is input from the detected area masking means 53 to the key point detection means 54. In the image 220, the detected area 221 generated based on the object area 212 is masked. In the processing result 230 for this image 220, the key points 231 corresponding to the person 201 are detected, and the circumscribed rectangle of the key points 231 is calculated as the object area 232.

つまり、一度検出された人物領域についてはそれを保管する記憶手段さえあれば、次回以降に同一撮影画像に対して検出を行う際には、検出する必要がなく、よって、これまでに検出された人物領域をマスクしたマスク済み撮影画像に対してキーポイント検出を行えばよい。図７の例では、マスク無し撮影画像２００から手前の人物２０２が検出されてしまえば、次回の検出ではマスク済み撮影画像２２０のように、手前の人物２０２が検出された処理結果２１０に基づいた既検出領域２２１を適切にマスクした画像を用いればよい。マスク済み画像２２０とその既検出領域情報を用いることで、マスクされていない領域から推測、補完できる範囲でマスクされた領域内にある部位を検出しようと検出器が動作するため、処理結果２３０のように奥の人物２０１が検出できると期待される。 In other words, as long as there is a storage device to store the human area that has been detected once, there is no need to perform detection next time on the same captured image. Key point detection may be performed on a masked captured image in which a human region is masked. In the example of FIG. 7, once the person 202 in the foreground is detected from the unmasked photographed image 200, the next detection will be based on the processing result 210 in which the person 202 in the foreground was detected, such as the masked photographed image 220. An image in which the detected area 221 is appropriately masked may be used. By using the masked image 220 and its detected area information, the detector operates to detect parts within the masked area to the extent that it can be inferred and complemented from the unmasked area. It is expected that the person 201 in the background can be detected.

すなわち、撮影画像をキーポイント検出手段５４に入力し得られた検出部位データから対象物領域算出手段５５により対象物領域を算出し、それを既検出領域マスク手段５３に入力することで既に検出された人物領域をマスクしたマスク済み撮影画像を作成する。そのマスク済み撮影画像に対して、再度、キーポイント検出手段５４を適用することで、未検出のサンプルの部位データを検出する。新たに検出部位データが得られなくなるまで、マスク済み撮影画像を作成しキーポイント検出を行うことを繰り返すことで、複数人による隠蔽に対しても対応できる。 That is, a captured image is input to the key point detection means 54, the object area calculation means 55 calculates the object area from the obtained detected part data, and the object area is inputted to the already detected area masking means 53, thereby detecting the area that has already been detected. Create a masked photographed image with the human area masked. By applying the key point detection means 54 again to the masked photographed image, the part data of the undetected sample is detected. By repeating creating masked captured images and detecting key points until no new detection part data can be obtained, it is possible to cope with concealment by multiple people.

［変形例］ [Modifications]
（１）上記実施形態では、人の全身を対象物とする例を示したが、対象物は、人の上半身などの人体の一部としてもよいし、車両や椅子などの人以外の物体としてもよい。 (1) In the above embodiment, the object is the whole body of a person, but the object may be a part of the human body, such as the upper body, or an object other than a person, such as a vehicle or a chair.

（２）上記実施形態では、対象物が計測される計測データが二次元画像であり、計測データを取得する計測手段は撮影部２とし二次元画像を撮影するカメラである例を示したが、計測データ、計測手段はこの例に限られない。例えば、計測データは三次元空間を計測したものであってもよい。三次元計測データの例として、距離画像センサを計測手段に用いて得られる距離画像や、多視点カメラで撮影した画像から構築した三次元データや、ＬiＤＡＲ（Light Detection and Ranging）で計測した点群データを挙げることができる。また、計測データは、二次元画像の時系列（二次元計測データの時系列）、三次元計測データの時系列とすることもできる。なお、点群データの場合のマスク処理はデータを欠落させる処理となる。 (2) In the above embodiment, the measurement data in which the object is measured is a two-dimensional image, and the measurement means that acquires the measurement data is the photographing section 2, a camera that captures two-dimensional images; however, the measurement data and measurement means are not limited to this example. For example, the measurement data may be obtained by measuring a three-dimensional space. Examples of three-dimensional measurement data include distance images obtained using a distance image sensor as the measurement means, three-dimensional data constructed from images taken with a multi-view camera, and point cloud data measured with LiDAR (Light Detection and Ranging). The measurement data can also be a time series of two-dimensional images (a time series of two-dimensional measurement data) or a time series of three-dimensional measurement data. Note that for point cloud data, mask processing is a process of removing the data.
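
For point cloud data, masking by removing data might look like the following sketch. Describing the mask area as an axis-aligned box is an assumption made for this example, as are the function name and argument layout.

```python
import numpy as np

def mask_point_cloud(points, box_min, box_max):
    """Remove the points that fall inside an axis-aligned box mask area.

    points  : (N, 3) array of x, y, z coordinates
    box_min : (3,) lower corner of the mask area
    box_max : (3,) upper corner of the mask area
    Returns the remaining points as an (M, 3) array with M <= N.
    """
    points = np.asarray(points, dtype=float)
    lo = np.asarray(box_min, dtype=float)
    hi = np.asarray(box_max, dtype=float)
    inside = np.all((points >= lo) & (points <= hi), axis=1)
    return points[~inside]                # masked points are simply dropped
```

Unlike the image case, no mask value is written; the masked points are absent from the data passed to the detector.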

(3) In the above embodiment, the mean of the pixel values of the learning images is used as the mask value when generating masked learning images and masked captured images. In other embodiments, a mask value suited to the configuration of the detector can be used, for example by setting the pixel values of the masked region to a predetermined single value or to random values (random noise).
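
The three kinds of mask value mentioned in this modification can be compared with a small sketch such as the one below. The function name `fill_mask`, the mode names, and the default values are assumptions for illustration; in particular, `mean_value` is expected to be supplied by the caller (for example, a mean precomputed over the learning images).

```python
import numpy as np

def fill_mask(image, mask, mode="mean", mean_value=0.5, constant=0.0, rng=None):
    """Return a float copy of `image` with the masked pixels replaced."""
    out = image.astype(np.float32)                 # astype returns a new array
    if mode == "mean":
        out[mask] = mean_value                     # e.g. mean of the learning images
    elif mode == "constant":
        out[mask] = constant                       # predetermined single value
    elif mode == "noise":
        if rng is None:
            rng = np.random.default_rng()
        # uniform random values in [0, 1), one per masked pixel (and channel)
        out[mask] = rng.random(out[mask].shape, dtype=np.float32)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out
```

`mask` is a boolean array with the same height and width as `image`; the same call works for single-channel and multi-channel images.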

(4) In the above embodiment, a detector whose target values are the part data of each sample is used, but the invention can also be applied to tasks with other recognition targets.

For example, using as the detector an object detector that detects a circumscribed rectangle for each sample from an image enables object detection that is robust to occlusion. In this case, the circumscribed rectangle replaces the part data as the target value for detector learning, and the output of the keypoint detection means 54 is the detected circumscribed rectangle of each sample, so the object region calculation means 55 uses the output of the keypoint detection means 54 directly as the object region.

The detector may also detect, in addition to information on the position of the object, attributes such as age and gender, or states such as smiling.
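
The difference between the two kinds of detector output in modification (4) can be illustrated as follows. When the detector outputs a circumscribed rectangle, it is passed through unchanged as the object region; when it outputs part positions, a region has to be derived from them. How the embodiment derives the region from part data is not restated here, so the tight bounding box of the keypoints used below is an assumption, as are the function name `object_region` and the `is_box` flag.

```python
import numpy as np

def object_region(detection, is_box=False):
    """Return an object region (x0, y0, x1, y1) for one detected sample."""
    if is_box:
        # Detector already outputs a circumscribed rectangle: use it as is.
        return tuple(detection)
    # Assumption: region = tight bounding box of the detected (x, y) keypoints.
    pts = np.asarray(detection, dtype=float)
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    return (float(x0), float(y0), float(x1), float(y1))
```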

(5) In the above embodiment, the keypoint detection means 54 uses the same detector throughout the iterative detection, but different detectors may be used for the first iteration and for the second and subsequent iterations. This is because, in the first iteration, the already-detected region information is entirely zero-filled, so a detector that takes only the image as input may be used. The detector used in the first iteration is trained in advance with only images as input.

Furthermore, within the second and subsequent iterations, if multiple detectors are trained and prepared in advance, the detector used may be switched according to the iteration count. For example, using a detector trained with larger mask regions as the iteration count increases yields more appropriate detected part data even when much of the image is masked.
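
Switching detectors by iteration count, as described in modification (5), amounts to a simple lookup. In the sketch below, `detectors` is a hypothetical list prepared in advance: index 0 is assumed to hold the image-only detector for the first iteration, and later entries hold detectors trained with progressively larger mask regions. The zero-based `iteration` index is also an assumption.

```python
def pick_detector(iteration, detectors):
    """Select the detector for a zero-based iteration index.

    Once the iteration count exceeds the number of prepared detectors,
    the last one (trained with the largest mask regions) is reused.
    """
    return detectors[min(iteration, len(detectors) - 1)]
```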

(6) In the above embodiment, when the already-detected region masking means 53 creates a mask region from an object region, it uses a constant multiple of the ellipse inscribed in the object region. Alternatively, an estimator that estimates a per-sample person region (instance mask) from the object region or the detected part data may be trained in advance and used to generate the mask region. In that case, the detected part data is also input, as necessary, from the object region storage means 42 to the already-detected region masking means 53 via the object region calculation means 55. In particular, the detected part data provides more detailed positional information on the parts than the object region does, so estimating the instance mask from the detected part data allows it to be estimated more accurately. In addition, the person region can be estimated in an arbitrary shape rather than only in a fixed shape such as a rectangle or an ellipse, which makes it easier to compute an appropriate mask region even when the occlusion is complex. The estimator may also take the image as an additional input.

The instance mask may also be detected by the keypoint detection means 54, rather than by the already-detected region masking means 53, at the same time as the part data. In that case, the detector takes the image as input and detects part data together with the instance mask corresponding to that part data; that is, part data and its associated instance mask are obtained for each sample. The learning means 51 trains the detector with the instance mask as a target value in addition to the annotation data, and the learning data storage means 40 stores the instance masks in addition to the learning images and the annotation data.
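
For comparison with the instance-mask alternatives in modification (6), the baseline mask shape used in the embodiment (the ellipse inscribed in the object region, scaled by a constant) can be sketched as below. The function name `ellipse_mask`, the `(x0, y0, x1, y1)` box convention, and the default `scale` are assumptions; the box is assumed to have non-zero width and height.

```python
import numpy as np

def ellipse_mask(shape, box, scale=1.0):
    """Boolean mask of the ellipse inscribed in `box`, scaled about its centre."""
    h, w = shape
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0   # centre of the object region
    ax = scale * (x1 - x0) / 2.0                # horizontal semi-axis
    ay = scale * (y1 - y0) / 2.0                # vertical semi-axis
    ys, xs = np.mgrid[0:h, 0:w]                 # pixel row and column coordinates
    return ((xs - cx) / ax) ** 2 + ((ys - cy) / ay) ** 2 <= 1.0
```

A `scale` above 1 enlarges the mask beyond the inscribed ellipse, and a value below 1 shrinks it.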

(7) In the above embodiment, the termination conditions for the iteration are an upper limit on the iteration count and a lower limit on the number of parts requiring detection, but the condition on the number of parts requiring detection may be omitted. In that case some iterations are wasted, but the amount of computation becomes uniform, which simplifies control.
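
The termination conditions in modification (7) can be expressed as a small predicate. This is a sketch under assumptions: the parameter names and default thresholds are hypothetical, and how the number of parts requiring detection is counted is defined by the embodiment rather than here, so it is simply passed in as `parts_to_detect`.

```python
def should_stop(iteration, parts_to_detect, max_iterations=5,
                min_parts=1, use_part_condition=True):
    """Return True when the iterative detection should end."""
    if iteration >= max_iterations:       # upper limit on the iteration count
        return True
    # Optional condition: stop once too few parts remain to be detected.
    return use_part_condition and parts_to_detect < min_parts
```

Setting `use_part_condition=False` corresponds to dropping the second condition, so the loop always runs for `max_iterations` iterations.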

(8) In the above embodiment, the ReLU function is used as the activation function in the detector, but a tanh function, a sigmoid function, or the like may be used instead. A configuration having a shortcut structure such as that used in ResNet (residual network) may also be adopted.
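
The activation functions named in modification (8), and the idea of a shortcut structure, can be illustrated with a toy NumPy block. This is not the detector of the embodiment: the two-layer fully connected block, the weight matrices `w1` and `w2`, and the function names are assumptions chosen only to show how the activation is swapped and where the shortcut adds the input back.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

ACTIVATIONS = {"relu": relu, "tanh": np.tanh, "sigmoid": sigmoid}

def residual_block(x, w1, w2, activation="relu"):
    """Two linear layers with a shortcut: output = f(x + f(x @ w1) @ w2)."""
    f = ACTIVATIONS[activation]
    branch = f(x @ w1) @ w2      # transformation branch
    return f(x + branch)         # shortcut adds the block input back
```

`w1` must map the input width to a hidden width and `w2` must map back, so that `branch` has the same shape as `x`.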

1 object recognition device, 2 imaging unit, 3 communication unit, 4 storage unit, 5 image processing unit, 6 display unit, 7 operation input unit, 30 captured image acquisition means, 31 recognition result output means, 40 learning data storage means, 41 learning model storage means, 42 object region storage means, 43 detector storage means, 50 learning masking means, 51 learning means, 53 already-detected region masking means, 54 keypoint detection means, 55 object region calculation means.

Claims

1. An object recognition device that recognizes, from measurement data in which a plurality of objects are measured, predetermined information that is information on the position of each of the objects, the device comprising:
detector storage means storing a detector generated in advance by learning that takes as input masked sample data, obtained by masking a mask region including part of the region of a sample in sample data in which the sample of the object is measured, and information indicating the position of the mask region on the sample data, and that takes as its output target value the predetermined information of the sample in a region including the mask region;
masking means that generates mask-processed data consisting of masked measurement data, obtained by masking an already-detected region, which is a region in which a first object has been detected, in the measurement data in which the first object and a second object partially occluded by the first object are measured, and information indicating the position of the already-detected region on the measurement data; and
detection means that inputs the mask-processed data to the detector and detects the predetermined information of the second object in a region including the already-detected region in the measurement data.

2. The object recognition device according to claim 1, wherein
the detector storage means stores the detector for which the learning has also been performed with the sample data in its state before the mask processing as input, and
the detection means, when there is no already-detected region, inputs the measurement data to the detector instead of the mask-processed data and detects the predetermined information of the object in the measurement data.

3. The object recognition device according to claim 1, wherein the predetermined information is information regarding a position of a part constituting the object.

4. The object recognition device according to any one of claims 1 to 3, further comprising learning means that generates the detector through the learning based on the sample data.

5. An object recognition method that recognizes, from measurement data in which a plurality of objects are measured, predetermined information that is information on the position of each of the objects, the method comprising:
a step of preparing a detector generated in advance by learning that takes as input masked sample data, obtained by masking a mask region including part of the region of a sample in sample data in which the sample of the object is measured, and information indicating the position of the mask region on the sample data, and that takes as its output target value the predetermined information of the sample in a region including the mask region;
a step of generating mask-processed data consisting of masked measurement data, obtained by masking an already-detected region, which is a region in which a first object has been detected, in the measurement data in which the first object and a second object partially occluded by the first object are measured, and information indicating the position of the already-detected region on the measurement data; and
a step of inputting the mask-processed data to the detector and detecting the predetermined information of the second object in a region including the already-detected region in the measurement data.

6. An object recognition program that causes a computer to perform processing of recognizing, from measurement data in which a plurality of objects are measured, predetermined information that is information on the position of each of the objects, the program causing the computer to function as:
detector storage means storing a detector generated in advance by learning that takes as input masked sample data, obtained by masking a mask region including part of the region of a sample in sample data in which the sample of the object is measured, and information indicating the position of the mask region on the sample data, and that takes as its output target value the predetermined information of the sample in a region including the mask region;
masking means that generates mask-processed data consisting of masked measurement data, obtained by masking an already-detected region, which is a region in which a first object has been detected, in the measurement data in which the first object and a second object partially occluded by the first object are measured, and information indicating the position of the already-detected region on the measurement data; and
detection means that inputs the mask-processed data to the detector and detects the predetermined information of the second object in a region including the already-detected region in the measurement data.