JP2023172115A

JP2023172115A - Object detection model training device, object detecting device, and object detection model training method

Info

Publication number: JP2023172115A
Application number: JP2022083698A
Authority: JP
Inventors: 洋登永吉; Hirotaka Nagayoshi; 拓実會下; Takumi EGE
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2022-05-23
Filing date: 2022-05-23
Publication date: 2023-12-06
Also published as: WO2023228558A1

Abstract

To provide an object detection model training device capable of improving the training accuracy of an object detection model. SOLUTION: In an object detection model training device, a training data generating unit 100 generates training data by using computer graphics to synthesize images of a detection target object, and generates a first teaching signal related to the detection target object and a second teaching signal corresponding to a state of the detection target object during imaging. An object detection model comprises a first inferring unit 161 for outputting a first inferred result relating to the detection target object, and a second inferring unit 162 for outputting a second inferred result relating to the state of the detection target object during imaging. The first inferring unit 161 obtains the first inferred result using an input image and the second inferred result. An object detection training unit 150 trains the second inferring unit 162 by providing the training data to the second inferring unit 162, and trains the first inferring unit 161 by providing the training data and the second inferred result to the first inferring unit 161. SELECTED DRAWING: Figure 1

Description

The present invention relates to an object detection model learning device, an object detection device, and an object detection model learning method.

Conventionally, computer graphics (CG) has been used to generate learning data when training object detection based on image recognition. For example, Patent Document 1 states: "To provide a learning data generation system capable of acquiring, in a short time, a large amount of the learning data required for learning processing in order to obtain a trained model used when executing object detection processing, posture detection processing, and the like," and "The learning data generation system acquires a background image obtained by imaging a three-dimensional space. It also acquires CG object generation data, which is data for computer graphics processing that includes at least one of the shape and the texture of an object. It generates a CG object image based on the acquired CG object generation data. It acquires, as a learning image, a rendered image obtained by compositing the CG object image with the background image such that the CG object is placed at a predetermined position in the three-dimensional space."

Patent Document 1: Japanese Patent Application Publication No. 2020-119127

According to the conventional technology, it is possible to achieve learning accuracy equivalent to that obtained when real images, captured by actually imaging the object to be detected, are used as learning data. However, the conventional technology does not achieve higher learning accuracy than learning with real images.

An object of the present invention is to improve the learning accuracy of an object detection model.

To achieve the above object, one representative object detection model learning device of the present invention comprises an object detection learning unit that trains an object detection model for detecting an object from an input image, and a learning data generation unit that generates learning data used for the training. The learning data generation unit generates the learning data by synthesizing an image of the object to be detected using computer graphics, generates, for the learning data, a first teacher signal relating to the object to be detected, and generates, for the learning data, a second teacher signal corresponding to the state of the object to be detected at the time of imaging. The object detection model comprises a first inference unit that outputs, for the input image, a first inference result relating to the object to be detected, and a second inference unit that outputs, for the input image, a second inference result relating to the state of the object to be detected at the time of imaging; the first inference unit obtains the first inference result using the input image and the second inference result. The object detection learning unit comprises a second error calculation unit that obtains, as a second error, the difference between the second teacher signal and the second inference result obtained by giving the learning data to the second inference unit; a first error calculation unit that obtains, as a first error, the difference between the first teacher signal and the first inference result obtained by giving the learning data and the second inference result to the first inference unit; and an inference parameter update unit that updates the parameters of the second inference unit based on the second error and updates the parameters of the first inference unit based on the first error.
Further, one representative object detection device of the present invention comprises a first inference unit that outputs, for an input image, a first inference result relating to an object to be detected, and a second inference unit that outputs, for the input image, a second inference result relating to the state of the object to be detected at the time of imaging, wherein the first inference unit obtains the first inference result using the input image and the second inference result.
Further, an object detection learning method for training an object detection model that detects an object from an input image comprises the steps of: generating learning data by synthesizing an image of the object to be detected using computer graphics; generating, for the learning data, a first teacher signal relating to the object to be detected; generating, for the learning data, a second teacher signal corresponding to the state of the object to be detected at the time of imaging; giving the learning data to a second inference unit of the object detection model to obtain a second inference result relating to the state of the object to be detected at the time of imaging; giving the learning data and the second inference result to a first inference unit of the object detection model to obtain a first inference result relating to the object to be detected; obtaining the difference between the first inference result and the first teacher signal as a first error; obtaining the difference between the second inference result and the second teacher signal as a second error; and updating the parameters of the second inference unit based on the second error and updating the parameters of the first inference unit based on the first error.

According to the present invention, the learning accuracy of an object detection model can be improved.

FIG. 1 is a configuration diagram of an object detection model learning device.
FIG. 2 is a configuration diagram of an object detection device.
FIG. 3 is an explanatory diagram of CG synthesis parameter constraint conditions.
FIG. 4 is a specific example of a first teacher signal selection condition.
FIG. 5 is a specific example of a second teacher signal selection condition.
FIG. 6 is a flowchart showing the learning processing procedure.
FIG. 7 is a flowchart showing the object detection processing procedure.
FIG. 8 is a specific example of teacher signals.
FIG. 9 is a specific example of a CG-synthesized image.
FIG. 10 is an explanatory diagram of inference results.
FIG. 11 is a configuration diagram of an object detection model learning device according to a modification.
FIG. 12 is a configuration diagram of an object detection device according to a modification.

Embodiments of the present invention will be described below with reference to the drawings. The embodiments described below do not limit the invention according to the claims, and not all of the elements and combinations thereof described in the embodiments are necessarily essential to the solution provided by the invention. In addition, illustration and description of configurations that are essential to the invention but well known may be omitted.

In the following description, information from which an output is obtained for an input may be described using an expression such as "xxx table," but this information may be data of any structure. Accordingly, an "xxx table" can also be called "xxx information."

In the following description, the configuration of each table is an example; one table may be divided into two or more tables, and all or part of two or more tables may be combined into one table.

In the following description, processing may be described with a "program" as the subject. Because a program, when executed by the processor unit, performs the defined processing while using the storage unit and/or the interface unit as appropriate, the subject of the processing may instead be the processor unit (or a device, such as a controller, that has the processor unit).

A program may be installed in a device such as a computer, or may reside on, for example, a program distribution server or a computer-readable (e.g., non-transitory) recording medium. In the following description, two or more programs may be implemented as one program, and one program may be implemented as two or more programs.

The "processor unit" is one or more processors. A processor is typically a microprocessor such as a CPU (Central Processing Unit), but may be another type of processor such as a GPU (Graphics Processing Unit). A processor may be single-core or multi-core. A processor may also be a processor in the broad sense, such as a hardware circuit that performs part or all of the processing (for example, an FPGA (Field-Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit)).

In the following description, a reference numeral (or the common part of the reference numerals) may be used when elements of the same kind are described without distinction, and an element's identification number (or reference numeral) may be used when elements of the same kind are described with distinction. The number of each element shown in each figure is an example and is not limited to what is illustrated.

FIG. 1 is a configuration diagram of an object detection model learning device. The object detection model learning device shown in FIG. 1 includes an object detection learning unit 150 and a learning data generation unit 100. The object detection learning unit 150 trains an object detection model that detects objects from input images. The learning data generation unit 100 generates the learning data that the object detection learning unit 150 uses for training.

The learning data generation unit 100 generates learning data by synthesizing images of the object to be detected using computer graphics. For the learning data, the learning data generation unit 100 also generates a first teacher signal relating to the object to be detected and a second teacher signal corresponding to the state of the object to be detected at the time of imaging.

The object detection learning unit 150 includes an object detection unit 160, a first error calculation unit 170, a second error calculation unit 171, and an inference parameter update unit 180.
The object detection unit 160 is a processing unit corresponding to the object detection model, and includes a first inference unit 161 and a second inference unit 162.

The second inference unit 162 outputs, for the input image, a second inference result relating to the state of the object to be detected at the time of imaging. The first inference unit 161 outputs, for the input image, a first inference result relating to the object to be detected. In doing so, the first inference unit 161 obtains the first inference result using the input image and the second inference result. For example, if the second inference unit 162 outputs the camera position, the illumination direction, and the like as its inference result, the first inference unit 161 infers the presence or absence of the object to be detected and the like while taking the camera position and the illumination direction into account.

Known object detection techniques can be used for the first inference unit 161 and the second inference unit 162, for example SSD (Single Shot MultiBox Detector) or YOLO (You Only Look Once: Unified, Real-Time Object Detection).
These techniques use neural networks: given an input image, they output the type of the detection target and its range in the image as numerical information. Because the numerical information that can be output is not limited to these kinds, the techniques can be used not only for the first inference unit 161 but also for the second inference unit 162.

The second error calculation unit 171 obtains, as a second error, the difference between the second teacher signal and the second inference result obtained by giving the learning data to the second inference unit 162 as an input image.
The first error calculation unit 170 obtains, as a first error, the difference between the first teacher signal and the first inference result obtained by giving the learning data and the second inference result to the first inference unit 161.

The inference parameter update unit 180 trains the object detection unit 160 by updating the parameters of the second inference unit 162 based on the second error and updating the parameters of the first inference unit 161 based on the first error.
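The two-error update described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each inference unit is a single linear layer trained by gradient descent on a squared error; the function name `train_step`, the matrices `W1` and `W2`, and all numeric values are hypothetical and do not appear in the specification, which contemplates neural networks such as SSD or YOLO.

```python
import numpy as np

def train_step(W1, W2, x, y_teacher, s_teacher, lr=0.05):
    """One update of both inference units on a single training sample."""
    # Second inference unit: estimate the imaging state from the input.
    s_hat = W2 @ x
    # First inference unit: uses the input AND the second inference result.
    z = np.concatenate([x, s_hat])
    y_hat = W1 @ z
    # Second error (vs. second teacher signal) and first error (vs. first).
    e2 = s_hat - s_teacher
    e1 = y_hat - y_teacher
    # Gradient of 0.5*||e||^2 with respect to each unit's own weights:
    # W2 is updated from the second error only, W1 from the first error only.
    W2_new = W2 - lr * np.outer(e2, x)
    W1_new = W1 - lr * np.outer(e1, z)
    return W1_new, W2_new, e1, e2

# Hypothetical toy sample: 4 input features, 2 state values, 3 detection outputs.
x = np.array([1.0, 0.5, -0.5, 2.0])
s_teacher = np.array([0.3, -0.2])
y_teacher = np.array([1.0, 0.0, 0.5])
W2 = np.zeros((2, 4))
W1 = np.zeros((3, 6))
for _ in range(200):
    W1, W2, e1, e2 = train_step(W1, W2, x, y_teacher, s_teacher)
print(np.linalg.norm(e1), np.linalg.norm(e2))
```

In this sketch the second inference result is treated as a fixed input when the first unit is updated, which is one possible reading of updating each unit based on its own error.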

The learning data generation unit 100 includes a condition input unit 110, a CG synthesis parameter constraint condition storage unit 120, a CG synthesis parameter generation unit 121, a learning CG generation unit 122, a teacher signal selection condition storage unit 130, a teacher signal selection unit 131, a first teacher signal generation unit 132, and a second teacher signal generation unit 133.

The condition input unit 110 is an interface that accepts input of various conditions relating to learning. These conditions include CG synthesis parameter constraint conditions, which are constraints on computer graphics parameters, and teacher signal selection conditions, which are conditions for selecting teacher signals. The teacher signal selection conditions include a selection condition for the first teacher signal and a selection condition for the second teacher signal.

The CG synthesis parameter constraint condition storage unit 120 stores the input CG synthesis parameter constraint conditions. The teacher signal selection condition storage unit 130 stores the input teacher signal selection conditions. The CG synthesis parameter constraint condition storage unit 120 and the teacher signal selection condition storage unit 130 may be implemented with any storage medium.

The CG synthesis parameter generation unit 121 generates a plurality of CG synthesis parameters within the range of the CG synthesis parameter constraint conditions and outputs them to the learning CG generation unit 122.
The learning CG generation unit 122 generates learning data for each of the plurality of CG synthesis parameters generated by the CG synthesis parameter generation unit 121.

For example, when a device is to be detected, the CG synthesis parameter generation unit 121 generates CG of the external appearance of the device based on CG synthesis parameters that specify the device type, orientation, illumination direction, and the like, and superimposes it on a background image to generate the learning data. The background image may be a real image or CG.

The teacher signal selection unit 131 selects the items to be included in the teacher signals based on the CG synthesis parameters and the teacher signal selection conditions. The first teacher signal generation unit 132 generates the first teacher signal for the teacher signal items selected by the teacher signal selection unit 131. The second teacher signal generation unit 133 generates the second teacher signal for the teacher signal items selected by the teacher signal selection unit 131.

For example, if the CG synthesis parameters hold values specifying the device type, orientation, illumination direction, and the like, and the type of detection target is designated for the first teacher signal, the value of the device type becomes the first teacher signal. Similarly, if the illumination direction is designated for the second teacher signal, the value of the illumination direction becomes the second teacher signal.
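The selection just described amounts to copying designated items out of the CG synthesis parameters. A minimal sketch follows, assuming the parameters and selection conditions are held as dictionaries; the function name `build_teacher_signals` and all keys and values are hypothetical illustrations, not part of the specification.

```python
def build_teacher_signals(cg_params, first_selection, second_selection):
    """Copy the selected CG synthesis parameter items into each teacher signal."""
    # An item is included only when its selection flag is True ("yes").
    first = {k: cg_params[k] for k, on in first_selection.items() if on}
    second = {k: cg_params[k] for k, on in second_selection.items() if on}
    return first, second

# Hypothetical CG synthesis parameters for one synthesized image.
cg_params = {
    "target_type": "device A",
    "target_orientation_deg": (20.0, 0.0),
    "light_direction_deg": (45.0, 120.0),
}
first_selection = {"target_type": True}
second_selection = {"light_direction_deg": True, "target_orientation_deg": False}

first_signal, second_signal = build_teacher_signals(
    cg_params, first_selection, second_selection
)
print(first_signal, second_signal)
```

Because the image is synthesized from known parameters, these teacher signals are available without manual annotation.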

Although details will be described later, the first teacher signal preferably includes the type of the object to be detected and the position of the object to be detected in the learning data. The second teacher signal preferably includes at least one of the positional relationship between the object to be detected and the camera, the state of illumination on the object to be detected, and the state of deformation of the object to be detected.
In other words, the first teacher signal is information used as the output of object detection, such as what is located where. In contrast, the second teacher signal is auxiliary information that contributes to improving the accuracy of object detection.

Note that the second teacher signal may further include the type of the object to be detected and the position of the object to be detected in the learning data. That is, the second teacher signal may include the same information as the first teacher signal. In this case, the second inference unit outputs, as its inference result, the information that the first inference unit should output, and the first inference unit performs inference again in light of the second inference result.

FIG. 2 is a configuration diagram of the object detection device. The object detection device has a configuration in which an image input unit 200 and an inference result output unit 210 are connected to the object detection unit 160 shown in FIG. 1. In this configuration, the image input unit 200 gives the same input image to the first inference unit 161 and the second inference unit 162. The second inference unit 162 infers the state at the time of imaging from the input image and outputs it to the first inference unit 161.

The first inference unit 161 performs image recognition on the input image while taking into account the state at the time of imaging inferred by the second inference unit 162, infers the presence or absence and the position of the object to be detected, and outputs the result to the inference result output unit 210.
Note that in FIG. 2, the second inference unit 162 also outputs to the inference result output unit 210. External output from the second inference unit 162 is not essential, but it is possible when the state at the time of imaging is required as an output.
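The data flow of the object detection device can be sketched as follows. This is only an illustration of the wiring: the function `detect`, the stub inference units, and their return values are hypothetical stand-ins for trained models and are not taken from the specification.

```python
def detect(image, first_unit, second_unit, output_state=False):
    """Run the second unit, then the first unit on the same input image."""
    # The second inference unit infers the state at the time of imaging.
    state = second_unit(image)
    # The first inference unit receives the same image plus that state.
    detection = first_unit(image, state)
    # Outputting the inferred state externally is optional.
    return (detection, state) if output_state else (detection, None)

# Hypothetical stand-ins for trained inference units.
def stub_second_unit(image):
    return {"light_direction_deg": 30.0}

def stub_first_unit(image, state):
    return {"type": "device A", "bbox": (10, 20, 110, 220), "used_state": state}

image = [[0, 0], [0, 0]]  # placeholder for image data
detection, state = detect(image, stub_first_unit, stub_second_unit, output_state=True)
print(detection, state)
```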

FIG. 3 is an explanatory diagram of the CG synthesis parameter constraint conditions. In the example shown in FIG. 3, the CG synthesis parameter constraint conditions 300 indicate the range of values each item can take by associating a minimum and a maximum input value with the item name. Some items have sub-item names, in which case a minimum and a maximum input value are associated with each sub-item name.

The item names include the type of detection target, the number of detection targets, the position of the detection target, the orientation of the detection target, the deformation parameters of the detection target, the camera position, the camera direction, the number of lights, the light position, the light direction, the illumination range, the illuminance, the light color, and the like.

The type of detection target is, for example, "device A."
The number of detection targets is the number of detection targets included in one item of learning data. In FIG. 3, a range of 1 to 3 is specified.

The position of the detection target indicates the range of spatial coordinates in which the object to be detected is placed. When the position of the detection target is expressed in an XYZ coordinate system, X, Y, and Z are used as sub-item names. In FIG. 3, the X coordinate range is 50 to 200, the Y coordinate range is 30 to 100, and the Z coordinate range is 0 to 0.

The orientation of the detection target indicates the range of directions in which the detection target faces. For example, two sub-item names, horizontal angle φ and vertical angle θ, are used, with the range of φ specified as 10 degrees to 30 degrees and the range of θ as 0 degrees to 0 degrees.
The deformation parameters of the detection target have the movable parts of the detection target as sub-item names. The deformation parameters differ depending on the type of detection target. For example, if device A has a rotating arm as a deformable part, the sub-item name may be the arm angle, with a range of 10 degrees to 50 degrees. The deformation parameters may also relate to telescopic members, sliding members, and the like. If there are multiple deformable parts, the number of sub-item names is increased and a range is set for each.

The camera position indicates the range of spatial coordinates of the virtual camera, that is, of the viewpoint from which the CG-synthesized object is viewed. When the camera position is expressed in an XYZ coordinate system, X, Y, and Z are used as sub-item names. In FIG. 3, the X coordinate range is 0.0 m to 5.0 m, the Y coordinate range is 5.0 m to 10.0 m, and the Z coordinate range is 3.0 m to 3.5 m.

The camera direction indicates the range of directions in which the camera faces. For example, two sub-item names, horizontal angle φ and vertical angle θ, are used, with the range of φ specified as 0 degrees to 360 degrees and the range of θ as 110 degrees to 135 degrees.

The number of lights indicates the range of the number of lights used in CG synthesis.
The light position, light direction, illumination range, illuminance, and light color items are specified for each light. The light position corresponds to the position of the light source that illuminates the CG-synthesized object and, like the position of the detection target and the camera position, is specified in XYZ coordinates. The light direction is expressed by a horizontal angle φ and a vertical angle θ, as with the position of the detection target and the direction of the camera.
The illumination range is the angle of the region illuminated by the light, for example 30 degrees to 45 degrees. The illuminance indicates the brightness of the light, for example 1000 lm. The light color may be expressed, for example, in RGB.
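As a minimal sketch of how CG synthesis parameters might be drawn within such constraints, the following assumes each item is stored as a (minimum, maximum) pair per sub-item and sampled uniformly at random. The dictionary layout, the key names, the function `sample_parameters`, and the choice of uniform sampling are all assumptions; the ranges are those given in the description of FIG. 3.

```python
import random

# Hypothetical constraint table: item -> sub-item -> (min, max).
CONSTRAINTS = {
    "target_count": {"n": (1, 3)},
    "target_position": {"X": (50.0, 200.0), "Y": (30.0, 100.0), "Z": (0.0, 0.0)},
    "target_orientation_deg": {"phi": (10.0, 30.0), "theta": (0.0, 0.0)},
    "camera_position_m": {"X": (0.0, 5.0), "Y": (5.0, 10.0), "Z": (3.0, 3.5)},
    "camera_direction_deg": {"phi": (0.0, 360.0), "theta": (110.0, 135.0)},
}

def sample_parameters(constraints, rng):
    """Draw one set of CG synthesis parameters inside the given ranges."""
    params = {}
    for item, subitems in constraints.items():
        params[item] = {}
        for name, (lo, hi) in subitems.items():
            if isinstance(lo, int) and isinstance(hi, int):
                # Integer ranges (e.g. counts) are sampled as integers.
                params[item][name] = rng.randint(lo, hi)
            else:
                # Continuous ranges are sampled uniformly; lo == hi fixes the value.
                params[item][name] = rng.uniform(lo, hi)
    return params

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_parameters(CONSTRAINTS, rng) for _ in range(100)]
print(samples[0])
```

Setting the minimum equal to the maximum, as for the Z coordinate of the detection target, pins that parameter to a single value.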

FIG. 4 is a specific example of the first teacher signal selection condition. The first teacher signal selection condition 400 shown in FIG. 4 specifies, for each item that can be learned as an output of the first inference, whether the item is selected. The items that can be learned as outputs of the first inference are the type of detection target and the range of the detection target in the image. If the selection value for an item is "yes," a teacher signal is generated for that item.

Specifically, if the type of detection target is set to "yes," the CG synthesis parameters are referenced and the value of the type of detection target is used as the teacher signal as is.
If the range of the detection target in the image is set to "yes," the CG synthesis parameters are referenced and the teacher signal is generated by projecting the image of the detection target onto the plane of the camera's angle of view, based on the spatial coordinates of the object to be detected, the spatial coordinates of the camera, and the orientation of the camera.
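One way to picture this projection is with a pinhole camera model, sketched below. The specification does not state the camera model, so the pinhole model, the function `project_bbox`, the use of a world-to-camera rotation matrix `R_wc` (rather than the angles φ and θ), the focal length `f` in pixels, and the principal point `(cx, cy)` are all assumptions made for illustration. The detection target is approximated by an axis-aligned 3D box whose eight corners are projected.

```python
import itertools
import numpy as np

def project_bbox(box_min, box_max, cam_pos, R_wc, f, cx, cy):
    """Project a 3D box into the image and return (u_min, v_min, u_max, v_max)."""
    # Enumerate the 8 corners of the axis-aligned box in world coordinates.
    corners = np.array(list(itertools.product(*zip(box_min, box_max))), dtype=float)
    # World -> camera coordinates: translate by the camera position, then rotate.
    cam = (corners - np.asarray(cam_pos, dtype=float)) @ np.asarray(R_wc, dtype=float).T
    if np.any(cam[:, 2] <= 0):
        raise ValueError("object is not entirely in front of the camera")
    # Pinhole projection; the camera is assumed to look along its +Z axis.
    u = f * cam[:, 0] / cam[:, 2] + cx
    v = f * cam[:, 1] / cam[:, 2] + cy
    return u.min(), v.min(), u.max(), v.max()

# Hypothetical example: camera at the origin with no rotation.
bbox = project_bbox(
    box_min=(-1.0, -1.0, 4.0), box_max=(1.0, 1.0, 6.0),
    cam_pos=(0.0, 0.0, 0.0), R_wc=np.eye(3), f=100.0, cx=320.0, cy=240.0,
)
print(bbox)
```

The resulting rectangle can serve as the "range of the detection target in the image" item of the first teacher signal.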

FIG. 5 is a specific example of the second teacher signal selection condition. The second teacher signal selection condition 500 shown in FIG. 5 specifies, for each item that can be learned as an output of the second inference, whether or not the item is selected. Examples of items that can be learned as outputs of the second inference are given below.

"Type of detection target": Same as for the first teacher signal, so the explanation is omitted.
"Range of detection target on image": Same as for the first teacher signal, so the explanation is omitted.
"Deformation parameters of detection target": The CG synthesis parameters are referred to and the value of the corresponding item is used as the teacher signal as it is.
"Relative position and relative direction of detection target with respect to camera": Can be calculated by referring to the CG synthesis parameters, based on the spatial coordinates of the object to be detected, the spatial coordinates of the camera, and the orientation of the camera.
"Relative position of illumination and relative direction of irradiation with respect to camera": Can be calculated by referring to the CG synthesis parameters, based on the spatial coordinates of the camera, the orientation of the camera, the spatial coordinates of the illumination, and the direction of the illumination.
"Relative position of illumination and relative direction of irradiation with respect to detection target": Can be calculated by referring to the CG synthesis parameters, based on the spatial coordinates of the object to be detected, the spatial coordinates of the illumination, and the direction of the illumination.
"Irradiation range": Can be obtained by referring to the CG synthesis parameters and integrating the information on the camera and the multiple lights.
"Illuminance": Can be obtained by referring to the CG synthesis parameters and integrating the information on the camera and the multiple lights.
"Illumination color": Can be obtained by referring to the CG synthesis parameters and integrating the information on the camera and the multiple lights.

FIG. 6 is a flowchart showing the learning processing procedure. First, the condition input unit 110 receives input of the CG synthesis parameter constraint conditions and stores them in the CG synthesis parameter constraint condition storage unit 120 (S600). The condition input unit 110 also receives input of the first teacher signal selection condition and stores it in the teacher signal selection condition storage unit 130 (S601). Similarly, the condition input unit 110 receives input of the second teacher signal selection condition and stores it in the teacher signal selection condition storage unit 130 (S602).

The CG synthesis parameter generation unit 121 randomly generates each CG synthesis parameter in accordance with the CG synthesis parameter constraint conditions (S603).
The learning CG generation unit 122 uses the generated CG synthesis parameters to generate a learning CG, which serves as the learning data (S604).
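As a rough illustration of S603, the sketch below draws one set of CG synthesis parameters at random from a constraint table. The table layout, the item names, and all ranges other than the 30 to 45 degree irradiation range are assumptions made for this example; the actual format of the constraint conditions is not reproduced here.

```python
import random

# Hypothetical constraint table: a (min, max) tuple is a numeric range,
# a list is a set of allowed discrete choices.
CONSTRAINTS = {
    "target_type": ["type_a", "type_b"],
    "target_x": (-1.0, 1.0),
    "camera_phi_deg": (0.0, 360.0),
    "camera_theta_deg": (-30.0, 30.0),
    "light1_irradiation_range_deg": (30.0, 45.0),
    "light1_illuminance_lm": (500.0, 1500.0),
}

def generate_cg_params(constraints, rng):
    """Randomly pick one value per item while respecting its constraint."""
    params = {}
    for name, spec in constraints.items():
        if isinstance(spec, list):
            params[name] = rng.choice(spec)   # discrete item
        else:
            low, high = spec
            params[name] = rng.uniform(low, high)  # numeric range
    return params

# Usage: a seeded generator makes the sampled parameter sets reproducible.
cg_params = generate_cg_params(CONSTRAINTS, random.Random(0))
```

Because the teacher signals are later derived from these same parameters, keeping the sampled dictionary alongside each rendered image is sufficient to generate both teacher signals.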

The first teacher signal generation unit 132 generates the first teacher signal in accordance with the first teacher signal selection condition, using the CG synthesis parameters and the learning CG (S605). This teacher signal relates directly to object detection, for example the type of the detection target and the range of the detection target on the image.
The second teacher signal generation unit 133 generates the second teacher signal in accordance with the second teacher signal selection condition, using the CG synthesis parameters and the learning CG (S606). This teacher signal includes teacher signals that are not directly related to object detection.
For example, teacher signals relating to the detection target include the deformation parameters of the detection target and the relative position and relative direction of the detection target with respect to the camera. Teacher signals relating to illumination include the relative position of each light and the relative direction of its irradiation with respect to the camera, and the relative position of each light and the relative direction of its irradiation with respect to the detection target. They also include the irradiation range, the illuminance, and the illumination color.
Furthermore, when the second teacher signal includes a teacher signal relating to the detection target, it is desirable that it also include the first teacher signal. For example, when the second teacher signal contains the deformation parameters of the detection target, the type of the detection target is also needed as part of the second teacher signal, because the deformation differs for each type of detection target. In addition, the range of the detection target on the image is needed, because the teacher signal must indicate where on the image the deformation parameters take effect. Hereinafter, the second teacher signal excluding the portion that overlaps with the first teacher signal is referred to as the second teacher signal in the narrow sense.
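The relationship between the first teacher signal, the second teacher signal, and the second teacher signal in the narrow sense can be sketched with plain dictionaries. The key names and values below are hypothetical and serve only to show the set relationship.

```python
def build_second_teacher(first_teacher, state_teacher):
    """Second teacher signal: state-related items plus the first teacher signal."""
    second = dict(first_teacher)
    second.update(state_teacher)
    return second

def narrow_second_teacher(first_teacher, second_teacher):
    """Second teacher signal in the narrow sense: drop items shared with the first."""
    return {k: v for k, v in second_teacher.items() if k not in first_teacher}

# Usage with made-up values.
first = {"type": "type_a", "range_on_image": (308, 228, 332, 252)}
state = {"deformation": 0.2, "light1_illuminance_lm": 1000.0}
second = build_second_teacher(first, state)
narrow = narrow_second_teacher(first, second)  # contains only the state items
```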

Subsequently, the second inference unit 162 calculates a second inference result using the first inference parameters held at that time and the learning CG (S607). Here, the second inference result is numerical information corresponding to the second teacher signal. Since the second inference result corresponds to the input image, it can also be output as a two-dimensional map. This output will be described later.

The first inference unit 161 calculates a first inference result using the second inference result and the learning CG (S608). Alternatively, the first inference result may be calculated using the learning CG and only the portion of the second inference result that corresponds to the second teacher signal in the narrow sense. The first inference result is numerical information corresponding to the first teacher signal.

Next, the first error calculation unit 170 calculates a first error using the first inference result and the first teacher signal (S609). The second error calculation unit 171 calculates a second error using the second inference result and the second teacher signal (S610).

The inference parameter update unit 180 determines whether a termination condition is satisfied, for example whether the errors have become sufficiently small or whether the parameter update described below has been performed a predetermined number of times (S611). If the condition is satisfied, the process ends. Otherwise, the parameters of the first inference unit 161 and the second inference unit 162 are updated, for example by error backpropagation, so that the first error and the second error each become smaller (S612). The process then returns to S603.
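The loop from S603 to S612 can be sketched with a deliberately tiny stand-in model. Everything here is an assumption made for illustration: the "image" is a single number `x`, both inference units are linear functions, the errors are squared differences, and the termination test uses a running average of the errors. Each unit is updated only from its own error, with the second inference result treated as a fixed input to the first unit; this is one possible reading, not necessarily how an actual implementation would propagate gradients.

```python
import random

def second_unit(p, x):
    # Second inference unit: estimates the imaging state from the input.
    return p["a"] * x + p["b"]

def first_unit(p, x, s_hat):
    # First inference unit: uses the input and the second inference result.
    return p["w_x"] * x + p["w_s"] * s_hat + p["c"]

def make_sample(rng):
    # Stand-in for S603 to S606: toy input plus both teacher signals.
    x = rng.uniform(0.0, 1.0)
    state = 0.5 * x + 0.1           # second teacher signal (toy state)
    label = 1.0 * x + 2.0 * state   # first teacher signal (toy target)
    return x, label, state

def train(max_updates=2000, tol=1e-6, lr=0.1, seed=0):
    rng = random.Random(seed)
    p1 = {"w_x": 0.0, "w_s": 0.0, "c": 0.0}
    p2 = {"a": 0.0, "b": 0.0}
    avg1 = avg2 = None
    for _ in range(max_updates):                 # predetermined number of updates
        x, label, state = make_sample(rng)
        s_hat = second_unit(p2, x)               # S607
        y_hat = first_unit(p1, x, s_hat)         # S608
        d1, d2 = y_hat - label, s_hat - state
        e1, e2 = d1 * d1, d2 * d2                # S609, S610
        avg1 = e1 if avg1 is None else 0.99 * avg1 + 0.01 * e1
        avg2 = e2 if avg2 is None else 0.99 * avg2 + 0.01 * e2
        if avg1 < tol and avg2 < tol:            # S611: errors sufficiently small
            break
        # S612: gradient step on each unit from its own squared error.
        p1["w_x"] -= lr * 2 * d1 * x
        p1["w_s"] -= lr * 2 * d1 * s_hat
        p1["c"] -= lr * 2 * d1
        p2["a"] -= lr * 2 * d2 * x
        p2["b"] -= lr * 2 * d2
    return p1, p2

def mean_errors(p1, p2, n=200, seed=1):
    """Average first and second errors over n fresh samples."""
    rng = random.Random(seed)
    t1 = t2 = 0.0
    for _ in range(n):
        x, label, state = make_sample(rng)
        s_hat = second_unit(p2, x)
        t1 += (first_unit(p1, x, s_hat) - label) ** 2
        t2 += (s_hat - state) ** 2
    return t1 / n, t2 / n
```

Calling `train()` and then `mean_errors(...)` on the returned parameters gives a quick check that both errors have dropped relative to the untrained parameters.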

FIG. 7 is a flowchart showing the processing procedure for object detection. First, the image input unit 200 receives an image to be inferred as an input image (S700).
The second inference unit 162 calculates a second inference result using the image to be inferred (S701).
The first inference unit 161 calculates a first inference result using the second inference result and the image to be inferred (S702).
The inference result output unit 210 outputs the first inference result. If the result is to be viewed by a person, it can be output as a screen; if the inference result is to be used by a system, it can be output to that system via a network.
At this time, the second inference result may also be output. Doing so provides the person or system with detailed information about the detection target.
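The order of S700 to S702 can be illustrated with toy stand-ins for the two inference units. In this sketch, which is purely hypothetical, the second unit estimates overall brightness as the imaging state, and the first unit compensates for that brightness before deciding the type, so a dark and a bright rendering of the same pattern receive the same result.

```python
def second_infer(image):
    """Toy second inference: overall brightness as the imaging state."""
    flat = [p for row in image for p in row]
    return {"brightness": sum(flat) / len(flat)}

def first_infer(image, state):
    """Toy first inference: brightness-compensated comparison of the top row."""
    brightness = state["brightness"] or 1.0  # avoid dividing by zero
    top = sum(image[0]) / len(image[0]) / brightness
    return {"type": "A" if top > 1.0 else "B"}

def detect(image, with_state=False):
    state = second_infer(image)           # S701
    result = first_infer(image, state)    # S702
    # Optionally return the second inference result as extra detail.
    return (result, state) if with_state else result

# Usage: the same pattern under bright and dark illumination.
bright = [[200, 200], [100, 100]]
dark = [[20, 20], [10, 10]]
same_type = detect(bright) == detect(dark)
```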

FIG. 8 shows a specific example of the teacher signals. In the figure, the type of the detection target and the range of the detection target on the image are generated as the first teacher signal. As the second teacher signal, the following are generated: the type of the detection target, the range of the detection target on the image, the deformation parameters of the detection target, the relative position of the detection target with respect to the camera, the relative direction of the detection target with respect to the camera, the relative position of illumination 1 with respect to the camera, the relative direction of illumination 1 with respect to the camera, the relative position of illumination 1 with respect to the detection target, the relative direction of irradiation of illumination 1 with respect to the detection target, the illuminance of illumination 1, the illumination color of illumination 1, and so on.

FIG. 9 is a specific example of a CG-composited image. The image 900 shown in FIG. 9 includes a CG composite image 910 and a CG composite image 911. The CG composite image 910 and the CG composite image 911 are images of the detection target generated by CG composition.

The CG composite image 910 and the CG composite image 911 show the same type of detection target, but they look different because of how they are illuminated. The CG composite image 910 is dark, and the CG composite image 911 is bright.

It is difficult for the object detection unit 160 to learn that objects whose appearance differs this much are objects of the same type. However, the difficulty decreases if the way in which the appearance differs is given as an input. Therefore, the second inference unit 162 infers the difference in appearance and gives the result to the first inference unit 161, which improves the accuracy of object detection.
The same applies to the orientation of the object with respect to the camera and to the deformation parameters of the object.

FIG. 10 is an explanatory diagram of the inference results. In the two-dimensional map 1000 of FIG. 10, the element at each position on the map consists of a plurality of fixed-length vectors corresponding to the respective second inference results. Inference results that concern the entire image, such as the relative position and the relative direction of the illumination with respect to the camera, are stored in a certain dimension of the vector, and that value extends over the entire two-dimensional map. In addition, once learning has progressed sufficiently, the inference result corresponding to the CG composite image 910 is stored in yet another dimension of the vector and is confined to the region 1010 of the two-dimensional map. Similarly, the inference result corresponding to the CG composite image 911 is confined to the region 1011 of the two-dimensional map.
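One way to picture such a map is as a height x width grid with one fixed-length vector per position. The sketch below is an assumed layout, not the one in FIG. 10: the leading dimensions hold image-wide values at every position, and the remaining dimensions hold per-object values only inside each object's region. The function name, box format, and values are hypothetical.

```python
def build_state_map(height, width, global_values, objects):
    """Build a height x width map of fixed-length vectors.

    global_values: values valid for the whole image (stored at every position).
    objects: list of ((top, left, bottom, right), object_values); the values are
             written only inside the box (bottom and right are exclusive).
    """
    n_global = len(global_values)
    n_object = len(objects[0][1]) if objects else 0
    depth = n_global + n_object
    grid = [[[0.0] * depth for _ in range(width)] for _ in range(height)]
    for r in range(height):
        for c in range(width):
            grid[r][c][:n_global] = global_values
    for (top, left, bottom, right), values in objects:
        for r in range(top, bottom):
            for c in range(left, right):
                grid[r][c][n_global:] = values
    return grid

# Usage: two image-wide lighting values and one per-object value in two regions.
state_map = build_state_map(
    4, 6,
    global_values=[0.3, 0.7],
    objects=[((0, 0, 2, 2), [1.0]), ((2, 3, 4, 6), [5.0])],
)
```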

Next, a modified example in which the object detection unit has a shared inference unit will be described.
FIG. 11 is a configuration diagram of an object detection model learning device according to the modified example. FIG. 11 differs from FIG. 1 in that the object detection unit 160 further includes a shared inference unit 1100. The output of the learning CG generation unit 122 is input to the shared inference unit 1100 as learning data.

The shared inference unit 1100 performs predetermined inference processing on the input learning data and then outputs the result to the first inference unit 161 and the second inference unit 162.
That is, in this configuration, preprocessing that is useful to both the first inference unit 161 and the second inference unit 162 is executed once in the shared inference unit 1100, which makes the processing more efficient.
The other configurations and operations are the same as those in FIG. 1, so their explanation is omitted.
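The data flow of this modified example can be sketched as a small wrapper in which the shared processing runs once per input and its output feeds both inference units. The class name and the toy stand-in functions are hypothetical; in practice the shared part would typically be a common feature extractor.

```python
class SharedDetector:
    """Runs a shared stage once, then the second and first inference units."""

    def __init__(self, shared, second, first):
        self.shared = shared   # shared inference unit (common preprocessing)
        self.second = second   # second inference unit (imaging state)
        self.first = first     # first inference unit (detection result)

    def __call__(self, image):
        features = self.shared(image)          # computed only once
        state = self.second(features)          # second inference result
        result = self.first(features, state)   # uses features and state
        return result, state

# Usage with toy stand-ins: the shared stage flattens the image.
calls = {"shared": 0}

def flatten(image):
    calls["shared"] += 1
    return [p for row in image for p in row]

detector = SharedDetector(
    shared=flatten,
    second=lambda f: sum(f) / len(f),
    first=lambda f, s: max(f) / s if s else 0.0,
)
result, state = detector([[20, 20], [10, 10]])
```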

FIG. 12 is a configuration diagram of an object detection device according to the modified example. FIG. 12 differs from FIG. 1 in that the object detection unit 160 further includes the shared inference unit 1100.
The shared inference unit 1100 receives an input image from the image input unit 200, performs predetermined inference processing on the input image, and then outputs the result to the first inference unit 161 and the second inference unit 162.
That is, in this configuration, preprocessing that is useful to both the first inference unit 161 and the second inference unit 162 is executed once in the shared inference unit 1100, which makes the processing more efficient.
The other configurations and operations are the same as those in FIG. 2, so their explanation is omitted.

As described above, the object detection model learning device disclosed in the embodiment includes an object detection learning unit 150 that performs learning of an object detection model for detecting an object from an input image, and a learning data generation unit 100 that generates the learning data used for the learning.
The learning data generation unit 100 generates the learning data by synthesizing images of the object to be detected using computer graphics, generates, for the learning data, a first teacher signal regarding the object to be detected, and generates, for the learning data, a second teacher signal corresponding to the state of the object to be detected at the time of imaging.
The object detection model includes a first inference unit 161 that outputs, for the input image, a first inference result regarding the object to be detected, and a second inference unit 162 that outputs, for the input image, a second inference result regarding the state of the object to be detected at the time of imaging. The first inference unit 161 obtains the first inference result using the input image and the second inference result.
The object detection learning unit 150 includes: a second error calculation unit 171 that obtains, as a second error, the difference between the second teacher signal and a second inference result obtained by giving the learning data to the second inference unit 162; a first error calculation unit 170 that obtains, as a first error, the difference between the first teacher signal and a first inference result obtained by giving the learning data and the second inference result to the first inference unit 161; and an inference parameter update unit 180 that updates the parameters of the second inference unit based on the second error and updates the parameters of the first inference unit based on the first error.
According to this configuration and operation, since the state of the object to be detected at the time of imaging is used, the learning accuracy of the object detection model can be improved, and higher learning accuracy can be expected than with learning using real images.

Further, the learning data generation unit 100 receives constraint conditions regarding the computer graphics parameters, a selection condition for the first teacher signal, and a selection condition for the second teacher signal; generates a plurality of computer graphics synthesis parameters within the range of the constraint conditions; generates the learning data based on the computer graphics synthesis parameters; generates the first teacher signal using the computer graphics synthesis parameters and the selection condition for the first teacher signal; and generates the second teacher signal using the computer graphics synthesis parameters and the selection condition for the second teacher signal.
According to this configuration and operation, the parameters of the CG synthesis can be used to learn the state at the time of imaging, which improves the learning accuracy of the object detection model.

Note that the first teacher signal preferably includes the type of the object to be detected and the position of the object to be detected in the learning data.
Further, the second teacher signal preferably includes at least one of the positional relationship between the object to be detected and the camera, the state of illumination of the object to be detected, and the state of deformation of the object to be detected.
By performing learning with these parameters as teacher signals, the state at the time of imaging can be learned efficiently, and the learning accuracy of object detection can be improved more effectively.

Note that the second teacher signal may further include the type of the object to be detected and the position of the object to be detected in the learning data.
This is because the parameters targeted by the second inference may be affected by the type and position of the object to be detected.

The object detection model may further include a shared inference unit 1100, and after inference processing by the shared inference unit is performed on the input image, processing by the first inference unit 161 and the second inference unit 162 may be performed on the inference result of the shared inference unit.
In this way, preprocessing that is useful to both the first inference unit 161 and the second inference unit 162 is executed in the shared inference unit 1100, which makes the processing more efficient.

Note that the present invention is not limited to the above-described embodiments and includes various modifications. For example, the embodiments above are described in detail in order to explain the present invention in an easy-to-understand manner, and the present invention is not necessarily limited to configurations having all of the elements described. Furthermore, in addition to deleting elements of a configuration, elements may be replaced or added.

Further, each of the above configurations, functions, processing units, processing means, and the like may be realized partly or entirely in hardware, for example by designing them as an integrated circuit. The present invention can also be realized by software program code that implements the functions of the embodiments. In this case, a recording medium on which the program code is recorded is provided to a computer, and a processor included in the computer reads the program code stored on the recording medium. The program code read from the recording medium then itself realizes the functions of the embodiments described above, and the program code and the recording medium storing it constitute the present invention. Examples of recording media for supplying such program code include flexible disks, CD-ROMs, DVD-ROMs, hard disks, SSDs (Solid State Drives), optical disks, magneto-optical disks, CD-Rs, magnetic tapes, non-volatile memory cards, and ROMs.

Furthermore, the program code that implements the functions described in this embodiment can be written in a wide range of programming or scripting languages, such as assembler, C/C++, Perl, Shell, PHP, and Java (registered trademark).

In the above-described embodiments, the control lines and information lines shown are those considered necessary for the explanation, and not all control lines and information lines of a product are necessarily shown. All of the components may be interconnected.

100: Learning data generation unit, 110: Condition input unit, 120: CG synthesis parameter constraint condition storage unit, 121: CG synthesis parameter generation unit, 122: Learning CG generation unit, 130: Teacher signal selection condition storage unit, 131: Teacher signal selection unit, 132: First teacher signal generation unit, 133: Second teacher signal generation unit, 150: Object detection learning unit, 160: Object detection unit, 161: First inference unit, 162: Second inference unit, 170: First error calculation unit, 171: Second error calculation unit, 180: Inference parameter update unit, 200: Image input unit, 210: Inference result output unit, 1100: Shared inference unit

Claims

an object detection learning unit that learns an object detection model that detects objects from input images;
a learning data generation unit that generates learning data used for the learning,
The learning data generation unit includes:
Generate the learning data by synthesizing images of the object to be detected using computer graphics,
generating a first teacher signal regarding the object to be detected for the learning data;
For the learning data, generate a second teacher signal corresponding to the state of the object to be detected at the time of imaging;
The object detection model is
a first inference unit that outputs a first inference result regarding the object to be detected with respect to the input image;
a second inference unit that outputs, regarding the input image, a second inference result regarding the state of the object to be detected at the time of imaging;
The first inference unit determines the first inference result using the input image and the second inference result,
The object detection learning unit includes:
a second error calculation unit that calculates, as a second error, a difference between the second teacher signal and a second inference result obtained by giving the learning data to the second inference unit;
a first error calculation unit that calculates, as a first error, a difference between the first teacher signal and a first inference result obtained by giving the learning data and the second inference result to the first inference unit; and
an inference parameter update unit that updates parameters of the second inference unit based on the second error and updates parameters of the first inference unit based on the first error. An object detection model learning device.

The object detection model learning device according to claim 1,
The learning data generation unit includes:
accepting constraints regarding the computer graphics parameters, selection conditions for the first teacher signal, and selection conditions for the second teacher signal;
generating a plurality of computer graphics synthesis parameters within the range of the constraint conditions, and generating the learning data based on the computer graphics synthesis parameters;
generating the first teacher signal using the computer graphics synthesis parameter and the selection condition for the first teacher signal;
An object detection model learning device characterized in that the second teacher signal is generated using the computer graphics synthesis parameter and the selection condition for the second teacher signal.

The object detection model learning device according to claim 1,
The object detection model learning device is characterized in that the first teacher signal includes a type of the object to be detected and a position of the object to be detected in the learning data.

The object detection model learning device according to claim 1,
The object detection model learning device is characterized in that the second teacher signal includes at least one of a positional relationship between the object to be detected and a camera, a state of illumination of the object to be detected, and a state of deformation of the object to be detected.

The object detection model learning device according to claim 4,
The object detection model learning device is characterized in that the second teacher signal further includes a type of the object to be detected and a position of the object to be detected in the learning data.

The object detection model learning device according to claim 1,
The object detection model learning device is characterized in that the object detection model further includes a shared inference unit, and after the input image is subjected to inference processing by the shared inference unit, processing by the first inference unit and the second inference unit is performed on the inference result of the shared inference unit.

a first inference unit that outputs a first inference result regarding the object to be detected with respect to the input image;
a second inference unit that outputs, regarding the input image, a second inference result regarding the state of the object to be detected at the time of imaging;
The object detection device is characterized in that the first inference unit obtains the first inference result using the input image and the second inference result.

An object detection learning method for learning an object detection model that detects an object from an input image, the method comprising:
generating learning data by synthesizing images of an object to be detected using computer graphics;
generating a first teacher signal regarding the object to be detected with respect to the learning data;
generating, for the learning data, a second teacher signal corresponding to the state of the object to be detected at the time of imaging;
providing the learning data to a second inference unit included in the object detection model to obtain a second inference result regarding the state of the object to be detected at the time of imaging;
providing the learning data and the second inference result to a first inference unit included in the object detection model to obtain a first inference result regarding the object to be detected;
obtaining a difference between the first inference result and the first teacher signal as a first error;
obtaining a difference between the second inference result and the second teacher signal as a second error;
updating parameters of the second inference unit based on the second error, and updating parameters of the first inference unit based on the first error;
An object detection model learning method characterized by comprising: