JP2020038572A

JP2020038572A - Image learning program, image learning method, image recognition program, image recognition method, creation program for learning data set, creation method for learning data set, learning data set, and image recognition device

Info

Publication number: JP2020038572A
Application number: JP2018166350A
Authority: JP
Inventors: 俊菅原; Takashi Sugawara
Original assignee: Kyocera Corp
Current assignee: Kyocera Corp
Priority date: 2018-09-05
Filing date: 2018-09-05
Publication date: 2020-03-12

Abstract

To provide an image learning program and the like that can improve the efficiency of learning with teacher images. SOLUTION: An image learning program causes an image recognition device to execute learning by using a learning data set including learning images and teacher images. The image recognition device comprises a first image recognition unit and a second image recognition unit. The first image recognition unit performs image segmentation of the learning images. The second image recognition unit performs finer image segmentation than the first image recognition unit. The image learning program causes the image recognition device to execute a first step of inputting the learning image to the first image recognition unit to acquire a first output image, a second step of inputting the learning image and the first output image to the second image recognition unit to acquire a second output image, a third step of acquiring a second error of the second output image with respect to the teacher image, and a fourth step of correcting the image segmentation processing of the second image recognition unit on the basis of the second error. SELECTED DRAWING: Figure 9

Description

The present invention relates to an image learning program, an image learning method, an image recognition program, an image recognition method, a learning data set generation program, a learning data set generation method, a learning data set, and an image recognition device.

Semantic segmentation using a Fully Convolutional Network (FCN) is known as an image recognition technology (see, for example, Non-Patent Document 1). Semantic segmentation performs class classification (class inference) on a digital image pixel by pixel. That is, semantic segmentation infers a class for each pixel of the digital image and labels each pixel with the inferred class, thereby dividing the digital image into regions.
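The per-pixel class inference described above can be illustrated with a minimal sketch. It assumes that some network has already produced one score per class for every pixel; the class names and score values below are hypothetical and are not taken from the patent.

```python
from typing import List

# Hypothetical class list, for illustration only.
CLASSES = ["road", "car", "person"]


def label_pixels(scores: List[List[List[float]]]) -> List[List[int]]:
    """Return a label map holding, for each pixel, the index of its best-scoring class.

    scores[y][x] is a list with one score per class for the pixel at (x, y).
    """
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in scores
    ]


# A 2x2 "image" of made-up per-pixel class scores.
scores = [
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.6, 0.3, 0.1], [0.2, 0.1, 0.7]],
]
label_map = label_pixels(scores)
for row in label_map:
    print([CLASSES[i] for i in row])
```

Pixels that share a label form one region, which is how labeling each pixel yields a region division of the image.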

Zhao, Hengshuang, et al. "Pyramid scene parsing network." IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 2017

In semantic segmentation, the accuracy of image recognition is increased through deep learning using a learning data set. The learning data set includes learning images, which are the images to be learned, and region-divided teacher images, which serve as the answers for the learning images. Teacher images used for semantic segmentation are generated by annotation work, and because the workload of that work is heavy, the annotation cost is high. Furthermore, accurate image recognition across a variety of scenes requires a large number of teacher images for semantic segmentation, which increases the annotation cost further.

An object of the present invention is to provide an image learning program, an image learning method, an image recognition program, an image recognition method, a learning data set generation program, a learning data set generation method, a learning data set, and an image recognition device that can improve the efficiency of learning with teacher images.

An image learning program according to one aspect is an image learning program executed by an image recognition device that performs image segmentation. A learning data set used for learning of the image recognition device includes a learning image, which is an image to be learned by the image recognition device, and a teacher image corresponding to the learning image. The image recognition device includes a first image recognition unit that performs image segmentation of the learning image, and a second image recognition unit that performs image segmentation of the learning image so as to produce a finer region division than the first image recognition unit. The program causes the image recognition device to execute: a first step of inputting the learning image to the first image recognition unit, performing image segmentation of the learning image with the first image recognition unit, and obtaining a first output image; a second step of inputting the learning image and the first output image to the second image recognition unit, performing image segmentation of the learning image with the second image recognition unit using the first output image, and obtaining a second output image; a third step of obtaining a second error of the second output image with respect to the teacher image; and a fourth step of correcting the image segmentation processing of the second image recognition unit on the basis of the second error.

An image learning method according to one aspect is an image learning method executed by an image recognition device that performs image segmentation. A learning data set used for learning of the image recognition device includes a learning image, which is an image to be learned by the image recognition device, and a teacher image corresponding to the learning image. The image recognition device includes a first image recognition unit that performs image segmentation of the learning image, and a second image recognition unit that performs image segmentation of the learning image so as to produce a finer region division than the first image recognition unit. The method includes: a first step of inputting the learning image to the first image recognition unit, performing image segmentation of the learning image with the first image recognition unit, and obtaining a first output image; a second step of inputting the learning image and the first output image to the second image recognition unit, performing image segmentation of the learning image with the second image recognition unit using the first output image, and obtaining a second output image; a third step of obtaining a second error of the second output image with respect to the teacher image; and a fourth step of correcting the image segmentation processing of the second image recognition unit on the basis of the second error.
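The four learning steps can be sketched as follows. This is only an illustration under strong assumptions: the first unit is a stand-in that marks the bounding box of bright pixels, the second unit is a tiny per-pixel logistic classifier, the second error is a mean binary cross-entropy, and the correction is one gradient-descent update. None of these choices is stated in the text above, and all names are hypothetical.

```python
import math
from typing import List

Image = List[List[float]]


def first_unit(image: Image) -> Image:
    """Stand-in first unit (assumption): coarse mask = bounding box of bright pixels."""
    h, w = len(image), len(image[0])
    bright = [(y, x) for y in range(h) for x in range(w) if image[y][x] > 0.5]
    if not bright:
        return [[0.0] * w for _ in range(h)]
    y0, y1 = min(y for y, _ in bright), max(y for y, _ in bright)
    x0, x1 = min(x for _, x in bright), max(x for _, x in bright)
    return [[1.0 if y0 <= y <= y1 and x0 <= x <= x1 else 0.0 for x in range(w)]
            for y in range(h)]


class SecondUnit:
    """Stand-in second unit (assumption): per-pixel logistic classifier."""

    def __init__(self) -> None:
        self.w = [0.0, 0.0, 0.0]  # image weight, first-output weight, bias

    def forward(self, image: Image, first_output: Image) -> Image:
        return [[1.0 / (1.0 + math.exp(-(self.w[0] * p + self.w[1] * c + self.w[2])))
                 for p, c in zip(row_i, row_c)]
                for row_i, row_c in zip(image, first_output)]


def second_error(output: Image, teacher: Image) -> float:
    """Mean binary cross-entropy of the second output against the teacher image."""
    eps = 1e-12
    terms = [-(t * math.log(o + eps) + (1 - t) * math.log(1 - o + eps))
             for row_o, row_t in zip(output, teacher) for o, t in zip(row_o, row_t)]
    return sum(terms) / len(terms)


def train_step(unit: SecondUnit, image: Image, teacher: Image, lr: float = 0.5) -> float:
    first_output = first_unit(image)                   # first step
    second_output = unit.forward(image, first_output)  # second step
    err = second_error(second_output, teacher)         # third step
    # Fourth step: gradient of the mean cross-entropy with respect to the weights.
    n = len(image) * len(image[0])
    grad = [0.0, 0.0, 0.0]
    for row_i, row_c, row_o, row_t in zip(image, first_output, second_output, teacher):
        for p, c, o, t in zip(row_i, row_c, row_o, row_t):
            d = (o - t) / n
            grad[0] += d * p
            grad[1] += d * c
            grad[2] += d
    unit.w = [w - lr * g for w, g in zip(unit.w, grad)]
    return err


learning_image = [[0.1, 0.1, 0.1, 0.1],
                  [0.1, 0.9, 0.1, 0.1],
                  [0.1, 0.9, 0.9, 0.1],
                  [0.1, 0.1, 0.1, 0.1]]
teacher_image = [[1.0 if p > 0.5 else 0.0 for p in row] for row in learning_image]

unit = SecondUnit()
errors = [train_step(unit, learning_image, teacher_image) for _ in range(200)]
print(f"second error: {errors[0]:.3f} -> {errors[-1]:.3f}")
```

In this sketch only the second unit is updated, mirroring the fourth step, which corrects the second image recognition unit on the basis of the second error.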

An image recognition program according to one aspect is an image recognition program executed by an image recognition device that performs image segmentation of an input image. The image recognition device includes a first image recognition unit that performs image segmentation of the input image, and a second image recognition unit that performs image segmentation of the input image so as to produce a finer region division than the first image recognition unit. The program causes the image recognition device to execute: an eighth step of inputting the input image to the first image recognition unit, performing image segmentation of the input image with the first image recognition unit, and obtaining a first output image; and a ninth step of inputting the input image and the first output image to the second image recognition unit, performing image segmentation of the input image with the second image recognition unit using the first output image, and obtaining a second output image.

An image recognition method according to one aspect is an image recognition method executed by an image recognition device that performs image segmentation of an input image. The image recognition device includes a first image recognition unit that performs image segmentation of the input image, and a second image recognition unit that performs image segmentation of the input image so as to produce a finer region division than the first image recognition unit. The method includes: an eighth step of inputting the input image to the first image recognition unit, performing image segmentation of the input image with the first image recognition unit, and obtaining a first output image; and a ninth step of inputting the input image and the first output image to the second image recognition unit, performing image segmentation of the input image with the second image recognition unit using the first output image, and obtaining a second output image.

A learning data set generation program according to one aspect is executed by an image recognition device that performs image segmentation of an input image, and generates a learning data set used by the image recognition device. The image recognition device includes a first image recognition unit that performs image segmentation of the input image, and a second image recognition unit that performs image segmentation of the input image so as to produce a finer region division than the first image recognition unit. The learning data set has a first learning data set with which the first image recognition unit learns and a second learning data set with which the second image recognition unit learns. The first learning data set includes a first learning image and a first teacher image corresponding to the first learning image. The second learning data set includes a second learning image and a second teacher image corresponding to the second learning image. The second teacher image is an image divided into regions more finely than the first teacher image. The program causes the image recognition device to execute: an eleventh step of inputting the first learning image and the first teacher image to the second image recognition unit, performing image segmentation of the first learning image with the second image recognition unit using the first teacher image, and obtaining a second output image; and a twelfth step of acquiring the first learning image as the second learning image, acquiring the second output image as the second teacher image, and generating a second learning data set including the second learning image and the second teacher image.

A learning data set generation method according to one aspect is executed by an image recognition device that performs image segmentation of an input image, and generates a learning data set used by the image recognition device. The image recognition device includes a first image recognition unit that performs image segmentation of the input image, and a second image recognition unit that performs image segmentation of the input image so as to produce a finer region division than the first image recognition unit. The learning data set has a first learning data set with which the first image recognition unit learns and a second learning data set with which the second image recognition unit learns. The first learning data set includes a first learning image and a first teacher image corresponding to the first learning image. The second learning data set includes a second learning image and a second teacher image corresponding to the second learning image. The second teacher image is an image divided into regions more finely than the first teacher image. The method includes: an eleventh step of inputting the first learning image and the first teacher image to the second image recognition unit, performing image segmentation of the first learning image with the second image recognition unit using the first teacher image, and obtaining a second output image; and a twelfth step of acquiring the first learning image as the second learning image, acquiring the second output image as the second teacher image, and generating a second learning data set including the second learning image and the second teacher image.
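The eleventh and twelfth steps amount to a loop that turns each coarse sample into a fine one. The sketch below assumes the second image recognition unit can be treated as a callable taking an image and a coarse label map; `toy_second_unit` is a hypothetical stand-in that simply keeps a coarse label where the pixel is bright, not the patent's network.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]
LabelMap = List[List[int]]
Sample = Tuple[Image, LabelMap]


def generate_second_dataset(
    first_dataset: List[Sample],
    second_unit: Callable[[Image, LabelMap], LabelMap],
) -> List[Sample]:
    """Build the second learning data set from the first one."""
    second_dataset: List[Sample] = []
    for first_learning_image, first_teacher_image in first_dataset:
        # Eleventh step: segment the first learning image using the first teacher image.
        second_output = second_unit(first_learning_image, first_teacher_image)
        # Twelfth step: reuse the image, and adopt the output as the second teacher image.
        second_dataset.append((first_learning_image, second_output))
    return second_dataset


def toy_second_unit(image: Image, coarse: LabelMap) -> LabelMap:
    """Hypothetical refinement: keep the coarse label only where the pixel is bright."""
    return [[c if p > 0.5 else 0 for p, c in zip(row_i, row_c)]
            for row_i, row_c in zip(image, coarse)]


first_dataset = [
    ([[0.1, 0.9, 0.2],
      [0.1, 0.8, 0.9]],
     [[0, 1, 1],
      [0, 1, 1]]),
]
second_dataset = generate_second_dataset(first_dataset, toy_second_unit)
print(second_dataset[0][1])
```

The learning image is shared between the two data sets; only the teacher image changes, from the coarse first teacher image to the finer second output image.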

An image recognition device according to one aspect includes a first image recognition unit that performs image segmentation of an input image, and a second image recognition unit that performs image segmentation of the input image so as to produce a finer region division than the first image recognition unit. When the input image is input, the first image recognition unit performs image segmentation of the input image to generate a first output image, and outputs the generated first output image to the second image recognition unit. When the input image and the first output image are input, the second image recognition unit performs image segmentation of the input image using the first output image, and outputs a second output image.
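The data flow of the device at recognition time can be sketched as a composition of two callables. The class name and the two toy units below are hypothetical stand-ins chosen only to make the flow runnable; they are not the networks described in the patent.

```python
from typing import Callable, List

Image = List[List[float]]
LabelMap = List[List[int]]


class ImageRecognitionDevice:
    """Two-stage recognizer: a coarse first unit feeds a finer second unit."""

    def __init__(
        self,
        first_unit: Callable[[Image], LabelMap],
        second_unit: Callable[[Image, LabelMap], LabelMap],
    ) -> None:
        self.first_unit = first_unit
        self.second_unit = second_unit

    def recognize(self, input_image: Image) -> LabelMap:
        first_output = self.first_unit(input_image)          # eighth step
        return self.second_unit(input_image, first_output)   # ninth step


def coarse_unit(image: Image) -> LabelMap:
    """Toy first unit: label whole rows that contain any bright pixel."""
    return [[1 if max(row) > 0.5 else 0 for _ in row] for row in image]


def fine_unit(image: Image, coarse: LabelMap) -> LabelMap:
    """Toy second unit: inside the coarse region, keep only the bright pixels."""
    return [[1 if c == 1 and p > 0.5 else 0 for p, c in zip(row_i, row_c)]
            for row_i, row_c in zip(image, coarse)]


device = ImageRecognitionDevice(coarse_unit, fine_unit)
input_image = [[0.1, 0.2, 0.1],
               [0.1, 0.9, 0.8],
               [0.1, 0.1, 0.1]]
second_output = device.recognize(input_image)
print(second_output)
```

The second unit receives both the original input image and the first output image, so the coarse result acts as a positional hint rather than a replacement for the image.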

FIG. 1 is a diagram illustrating an outline of the image recognition device according to the embodiment.
FIG. 2 is a diagram illustrating an outline of the image recognition unit of the image recognition device according to the embodiment.
FIG. 3 is a diagram illustrating an example of an input image input to the image recognition device.
FIG. 4 is a diagram illustrating an example of an output image output from the image recognition device.
FIG. 5 is a diagram illustrating an example of the first learning data set.
FIG. 6 is a diagram illustrating an example of the second learning data set.
FIG. 7 is a diagram illustrating an example of processing related to image learning by the image recognition device.
FIG. 8 is a diagram illustrating an example of processing related to image learning by the image recognition device.
FIG. 9 is a diagram illustrating an example of processing related to image learning by the image recognition device.
FIG. 10 is a diagram illustrating an example of processing related to image recognition by the image recognition device.
FIG. 11 is a diagram illustrating an example of processing related to image recognition by the image recognition device.
FIG. 12 is a diagram illustrating an example of processing related to image recognition by the image recognition device.
FIG. 13 is a diagram illustrating an example of processing related to generation of a learning data set by the image recognition device.

An embodiment according to the present application will be described in detail with reference to the drawings. In the following description, like components may be denoted by the same reference numerals, and duplicate description may be omitted. Matters that are not closely related to the description of the embodiment may be omitted from the description and the drawings.

(Embodiment)
FIG. 1 is a diagram illustrating an outline of the image recognition device according to the embodiment. The image recognition device 1 recognizes objects included in an input image I and outputs the recognition result as an output image O. A captured image taken by an imaging device such as a camera is input to the image recognition device 1 as the input image I. The image recognition device 1 performs image segmentation on the input image I. Image segmentation means labeling the divided image regions of a digital image with classes, and is also called class inference (class classification). That is, image segmentation determines which class each predetermined image region of the divided digital image belongs to, and attaches an identifier (label) that identifies the class of that region. The image recognition device 1 outputs, as the output image O, the image obtained by image segmentation (class inference) of the input image I.

The image recognition device 1 is provided, for example, in an in-vehicle recognition camera of a car. The in-vehicle recognition camera captures the driving situation of the car in real time at a predetermined frame rate and inputs the captured images to the image recognition device 1. The image recognition device 1 acquires the captured images input at the predetermined frame rate as input images I. The image recognition device 1 classifies the objects included in each input image I and outputs the classified image as an output image O at the predetermined frame rate. Note that the image recognition device 1 is not limited to being mounted on an in-vehicle recognition camera and may be provided in another device.

First, the input image I will be described with reference to FIG. 3. FIG. 3 is a diagram illustrating an example of the input image I input to the image recognition device 1. The input image I is a digital image composed of a plurality of pixels. The input image I is, for example, an image generated by an image sensor provided in an imaging device such as a camera, with a resolution corresponding to the number of pixels of the image sensor. That is, the input image I is the high-resolution original image, to which neither upsampling (increasing the number of pixels of the image) nor downsampling (decreasing the number of pixels of the image) has been applied.

Next, the output image O will be described with reference to FIG. 4. FIG. 4 is a diagram illustrating an example of the output image O output from the image recognition device 1. The output image O is divided into regions by class. The classes include, for example, the objects contained in the input image I, such as a person, a car, a road, and a building. In the output image O, each pixel is classified into the class of an object and labeled with that class, so that the image is divided into regions by class. FIG. 4 shows, for example, an image region Oa classified as the person class, an image region Ob classified as the car class, and an image region Oc classified as the road class. Note that the output image O in FIG. 4 is an example, and the class classification is not limited to it. The output image O has the same resolution as the input image I.

Referring again to FIG. 1, the image recognition device 1 will be described. The image recognition device 1 includes a control unit 5, a storage unit 6, and an image recognition unit 7.

The storage unit 6 stores programs and data. The storage unit 6 may also be used as a work area that temporarily stores processing results of the control unit 5. The storage unit 6 may include any storage device, such as a semiconductor storage device or a magnetic storage device, and may include multiple types of storage devices. The storage unit 6 may also include a combination of a portable storage medium, such as a memory card, and a reading device for the storage medium.

The storage unit 6 holds, as programs, an image learning program P1, an image recognition program P2, and a learning data set generation program P3. The image learning program P1 is a program for causing the image recognition unit 7 to perform learning. The image recognition program P2 is a program for causing the image recognition unit 7 to perform image recognition. The learning data set generation program P3 is a program for generating the learning data set used for learning of the image recognition unit 7. The storage unit 6 also holds, as data, various images and the learning data set. The various images include the input image I input to the image recognition device 1 and the output image O output from the image recognition device 1. The learning data set is the data used for learning of the image recognition unit 7.

The control unit 5 comprehensively controls the operation of the image recognition device 1 to realize various functions. The control unit 5 includes an integrated circuit such as a CPU (Central Processing Unit). Specifically, the control unit 5 realizes the various functions by executing the instructions contained in the programs stored in the storage unit 6 and thereby controlling the image recognition unit 7 and other components.

For example, by executing the image learning program P1, the control unit 5 causes the image recognition unit 7 to learn using the learning data set. By executing the image recognition program P2, the control unit 5 causes the image recognition unit 7 to perform image recognition of the input image I. Further, by executing the generation program P3, the control unit 5 causes the image recognition unit 7 to generate a learning data set.

Next, the image recognition unit 7 will be described with reference to FIG. 2. FIG. 2 is a diagram illustrating an outline of the image recognition unit of the image recognition device according to the embodiment. The image recognition unit 7 includes an integrated circuit such as a GPU (Graphics Processing Unit). The image recognition unit 7 includes a first image recognition unit 11 and a second image recognition unit 12. When the input image I is input, the image recognition unit 7 inputs it to each of the first image recognition unit 11 and the second image recognition unit 12.

The first image recognition unit 11 executes the task of finding the positions of the objects included in the input image I. The first image recognition unit 11 performs, for example, image segmentation using bounding boxes. A bounding box is a rectangular image region that encloses an object included in the input image I. When the input image I is input, the first image recognition unit 11 extracts the bounding boxes enclosing the objects from the input image I and outputs, as a first output image O1, an image classified into classes per bounding box. That is, the first image recognition unit 11 divides the input image I into regions more roughly (coarsely) than the second image recognition unit 12 described later, and outputs the first output image O1, which contains information on the positions of the objects.
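A first output image of this kind, classified per bounding box, can be pictured as a label map in which every pixel inside a box carries that box's class. The sketch below assumes the boxes and class ids are already known; the box format and the values are hypothetical.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, class_id) with x1 and y1 exclusive; a hypothetical box format.
Box = Tuple[int, int, int, int, int]


def rasterize_boxes(
    width: int, height: int, boxes: List[Box], background: int = 0
) -> List[List[int]]:
    """Paint each bounding box into a label map; later boxes overwrite earlier ones."""
    label_map = [[background] * width for _ in range(height)]
    for x0, y0, x1, y1, class_id in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                label_map[y][x] = class_id
    return label_map


# Class 1 could stand for "person" and class 2 for "car" (made-up assignment).
coarse = rasterize_boxes(6, 4, [(1, 1, 3, 4, 1), (4, 0, 6, 2, 2)])
for row in coarse:
    print(row)
```

Because a rectangle rarely matches an object's outline, such a map locates objects but labels some background pixels as object, which is the coarse division that the second image recognition unit is meant to refine.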

The first image recognition unit 11 performs image segmentation using a neural network that includes convolution layers (hereinafter also simply called a network), such as a CNN (Convolutional Neural Network) or an FCN (Fully Convolutional Network). The first image recognition unit 11 has an encoder 22 and a decoder 23.

The encoder 22 performs encoding on the input image I. Encoding generates feature maps in which the features of the input image I are extracted, while performing downsampling (also called pooling) that lowers the resolution of the feature maps. Specifically, in encoding, the input image I is processed by convolution layers and pooling layers. In a convolution layer, a kernel (filter) for extracting the features of the input image I is moved over the input image I at a predetermined stride. The convolution layer then performs a convolution calculation based on its weights to extract the features of the input image I, and generates a feature map containing the extracted features. As many feature maps are generated as there are channels in the kernel. A pooling layer shrinks the feature map containing the extracted features and generates a lower-resolution feature map. By repeating the convolution layer processing and the pooling layer processing multiple times, encoding generates feature maps with downsampled features.
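One convolution layer followed by one pooling layer can be sketched in plain Python. The sketch assumes a single channel, no padding, and max pooling; the kernel values are made up, and the pooling type is an assumption because the text does not specify it. As in most CNN libraries, the "convolution" here is implemented as a cross-correlation (the kernel is not flipped).

```python
from typing import List

Matrix = List[List[float]]


def conv2d(image: Matrix, kernel: Matrix, stride: int = 1) -> Matrix:
    """Slide the kernel over the image at the given stride (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    return [
        [
            sum(image[y * stride + i][x * stride + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def max_pool(fmap: Matrix, size: int = 2) -> Matrix:
    """Shrink a feature map by keeping the maximum of each size x size block."""
    return [
        [max(fmap[y + i][x + j] for i in range(size) for j in range(size))
         for x in range(0, len(fmap[0]) - size + 1, size)]
        for y in range(0, len(fmap) - size + 1, size)
    ]


# A 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
edge_kernel = [[-1, 1], [-1, 1]]  # responds to dark-to-bright transitions

feature_map = conv2d(image, edge_kernel)  # 4x4 feature map
pooled = max_pool(feature_map)            # 2x2 after downsampling
print(feature_map)
print(pooled)
```

The edge response survives pooling while the map shrinks from 4x4 to 2x2, which illustrates how repeating the two layers yields progressively lower-resolution feature maps.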

The decoder 23 performs decoding on the feature maps produced by encoding. Decoding performs upsampling (also called unpooling) that raises the resolution of the feature maps. Specifically, in decoding, the feature maps are processed by deconvolution layers and unpooling layers. An unpooling layer enlarges a low-resolution feature map containing the features and generates a higher-resolution feature map. A deconvolution layer performs a deconvolution calculation, based on its weights, to restore the features contained in the feature map, and generates a feature map in which the features are restored. By repeating the unpooling layer processing and the deconvolution layer processing multiple times, decoding generates the first output image O1, which is an upsampled, region-divided image. The first output image O1 is upsampled until it has the same resolution as the input image I input to the image recognition unit 7.
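The two decoder operations can be sketched in the same single-channel style. The sketch assumes nearest-neighbour unpooling (real unpooling layers may instead reuse the positions recorded during max pooling) and implements the deconvolution as a transposed convolution; the kernel values are made up.

```python
from typing import List

Matrix = List[List[float]]


def unpool(fmap: Matrix, factor: int = 2) -> Matrix:
    """Nearest-neighbour unpooling: repeat every value factor x factor times."""
    out: Matrix = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def conv_transpose2d(fmap: Matrix, kernel: Matrix, stride: int = 1) -> Matrix:
    """Transposed convolution: each input value stamps a scaled kernel onto the output."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(fmap) - 1) * stride + kh
    out_w = (len(fmap[0]) - 1) * stride + kw
    out = [[0.0] * out_w for _ in range(out_h)]
    for y, row in enumerate(fmap):
        for x, v in enumerate(row):
            for i in range(kh):
                for j in range(kw):
                    out[y * stride + i][x * stride + j] += v * kernel[i][j]
    return out


low_res = [[1, 2], [3, 4]]
upsampled = unpool(low_res)                               # 2x2 -> 4x4
stamped = conv_transpose2d(low_res, [[1, 0], [0, 1]])     # 2x2 -> 3x3
decoded = conv_transpose2d(upsampled, [[0.25, 0.25], [0.25, 0.25]])  # 4x4 -> 5x5
print(upsampled)
print(stamped)
```

With stride 1 and no padding, a transposed convolution grows each side by the kernel size minus one, so a practical decoder would choose padding or cropping so that the final output matches the resolution of the input image I.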

As described above, the first image recognition unit 11 performs image segmentation of the input image I by executing the encoding process and the decoding process on the input image I and performing class inference (class classification) on a pixel-by-pixel basis. The first image recognition unit 11 then outputs, as the first output image O1, an image in which the input image I is divided into regions for each class.

Although the first image recognition unit 11 has been described as performing image segmentation using bounding boxes, it is not limited to this. As long as it can execute the task of determining the positions of objects included in the input image I, the first image recognition unit 11 may, for example, perform image segmentation using a different network. In addition, although the first image recognition unit 11 executes the encoding process and the decoding process, the pooling layers in the encoding process and the unpooling layers in the decoding process may be omitted as long as the task of determining the positions of objects included in the input image I can still be executed.

As will be described in detail later, the first image recognition unit 11 is trained using a large number of first teacher images T1. Because the first image recognition unit 11 can thereby learn the various factors that cause the input image I to vary, it executes image recognition with robustness ensured. For an in-vehicle recognition camera, for example, robustness means tolerance to the causes of various kinds of variation, including illuminance variation such as backlight and dark places, weather variation, lens variation such as raindrops, mud, and scratches, and driving-environment variation such as snow and road-surface reflection.

第１の画像認識部１１は、その出力側が、第２の画像認識部１２の入力側に接続されている。このため、第１の画像認識部１１は、第１の出力画像Ｏ１を、第２の画像認識部１２に入力する。また、第１の画像認識部１１は、第１の出力画像Ｏ１を、中間画像として外部に出力している。 The output side of the first image recognition unit 11 is connected to the input side of the second image recognition unit 12. For this reason, the first image recognition unit 11 inputs the first output image O1 to the second image recognition unit 12. In addition, the first image recognition unit 11 outputs the first output image O1 to the outside as an intermediate image.

The second image recognition unit 12 executes a task of dividing the input image I into regions more finely than the first image recognition unit 11. The second image recognition unit 12 performs image segmentation using, for example, semantic segmentation. Semantic segmentation performs class inference for each pixel of the input image I and labels each pixel with a class as the inference result, thereby dividing the input image I into regions. The input image I and the first output image O1 are input to the second image recognition unit 12. When they are input, the second image recognition unit 12 uses the first output image O1 to output, as the second output image O2, an image in which each pixel of the input image I is classified into a class. In other words, using the first output image O1 as a hint, the second image recognition unit 12 divides the input image I into regions more finely than the first image recognition unit 11 and outputs the second output image O2.

第２の画像認識部１２は、ＣＮＮ（Convolution Neural Network）またはＦＣＮ（Fully Convolutional Network）等の畳み込み層を含むニューラル・ネットワーク（以下、単にネットワークともいう）を用いた画像セグメンテーションを行っている。また、第２の画像認識部１２は、入力画像Ｉの特徴量を抽出する特徴量抽出処理を実行する。さらに、第２の画像認識部１２は、第１の出力画像Ｏ１と特徴量抽出処理が行われる画像とを統合するフュージョン処理を実行して、ピクセル単位のクラス推論を行っている。 The second image recognition unit 12 performs image segmentation using a neural network including a convolutional layer such as a CNN (Convolution Neural Network) or an FCN (Fully Convolutional Network) (hereinafter, simply referred to as a network). In addition, the second image recognition unit 12 performs a feature amount extraction process of extracting a feature amount of the input image I. Further, the second image recognition unit 12 performs a fusion process for integrating the first output image O1 and the image on which the feature amount extraction process is performed, and performs class inference on a pixel-by-pixel basis.

特徴量抽出処理は、複数の畳み込み層において入力画像Ｉの特徴量を抽出する処理であり、エンコーダ２２における畳み込み層の処理とほぼ同様である。また、特徴量抽出処理では、プーリング層を省いた処理となっている。畳み込み層では、入力画像Ｉの特徴量を抽出するための畳み込み計算が、畳み込み層の重みに基づいて実行され、この計算により特徴量が抽出された特徴マップを生成する。特徴量抽出処理では、入力画像Ｉに対して畳み込み計算が複数回実行されることで、特徴マップを生成する。 The feature amount extraction process is a process of extracting the feature amount of the input image I in a plurality of convolution layers, and is substantially the same as the processing of the convolution layer in the encoder 22. Further, the feature amount extraction processing is processing in which the pooling layer is omitted. In the convolutional layer, a convolution calculation for extracting a feature amount of the input image I is executed based on the weight of the convolutional layer, and a feature map from which the feature amount is extracted by this calculation is generated. In the feature extraction process, a convolution calculation is performed on the input image I a plurality of times to generate a feature map.

The fusion process uses the first output image O1 as a hint, merges the feature maps produced by the feature amount extraction process, and performs class inference, thereby generating an image divided into regions for each class, namely the second output image O2 having the same resolution as the input image I.
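One possible reading of the fusion process is sketched below. The description does not state how the hint and the feature maps are merged, so everything here is an assumption: the first output image is turned into one-hot planes, concatenated with the feature planes, and each pixel is classified by a per-pixel weighted sum (equivalent to a 1x1 convolution). The names `one_hot` and `fuse` and the weight values are hypothetical.

```python
def one_hot(label_map, num_classes):
    """Turn an HxW map of class ids into num_classes binary HxW planes."""
    return [[[1.0 if v == k else 0.0 for v in row] for row in label_map]
            for k in range(num_classes)]


def fuse(feature_maps, hint_labels, num_classes, weights):
    """Concatenate feature planes with hint planes and classify each pixel.

    weights[k] holds one coefficient per concatenated channel for class k.
    """
    channels = feature_maps + one_hot(hint_labels, num_classes)
    height, width = len(hint_labels), len(hint_labels[0])
    out = []
    for r in range(height):
        row = []
        for c in range(width):
            scores = [sum(wk[ch] * channels[ch][r][c]
                          for ch in range(len(channels)))
                      for wk in weights]
            row.append(scores.index(max(scores)))  # class with the top score
        out.append(row)
    return out


features = [[[0.9, 0.1], [0.2, 0.8]]]   # one toy feature plane
hint = [[1, 1], [0, 1]]                 # coarse first output: class 1 box
# Channel order: [feature, hint class 0, hint class 1]
weights = [[0.0, 1.0, 1.0],             # class 0 (background)
           [2.0, 0.0, 0.5]]             # class 1 gets a bonus inside the box
second_output = fuse(features, hint, 2, weights)
```

With these toy weights, a pixel inside the coarse box is kept as class 1 only when its feature response is strong, which is the sense in which the coarse result acts as a hint rather than as the final answer.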

As described above, the second image recognition unit 12 performs image segmentation of the input image I by executing the feature amount extraction process and the fusion process on the input image I and performing class inference (class classification) on a pixel-by-pixel basis using the first output image O1 as a hint. The second image recognition unit 12 outputs the segmented input image I as the second output image O2.

Although the second image recognition unit 12 has been described as performing image segmentation using semantic segmentation, it is not limited to this. As long as it can execute the task of dividing the input image I into regions more finely than the first image recognition unit 11, the second image recognition unit 12 may, for example, perform image segmentation using a different network.

As will be described in detail later, the second image recognition unit 12 is trained using the first output image O1 and high-resolution second teacher images that are finely divided into regions. Because the second image recognition unit 12 can therefore divide the input image I finely into regions, it executes image recognition with recognition performance (recognition accuracy) ensured.

第２の画像認識部１２は、第２の出力画像Ｏ２を外部に出力する。また、第２の画像認識部１２は、第２の出力画像Ｏ２の生成時に用いた第１の出力画像Ｏ１を、第２の出力画像Ｏ２に関連付けて出力可能となっている。 The second image recognition unit 12 outputs the second output image O2 to the outside. In addition, the second image recognition unit 12 can output the first output image O1 used when generating the second output image O2 in association with the second output image O2.

As described above, the first image recognition unit 11 does not need to perform region division as fine as that of the second image recognition unit 12, so its task has low difficulty and a low calculation load. Moreover, because the second image recognition unit 12 divides the input image I into regions using the first output image O1, its task has lower difficulty and a lower calculation load than semantic segmentation that segments the image from the input image I alone.

次に、画像認識装置１の学習について説明する。画像認識装置１の学習には、学習データセットが用いられる。学習データセットは、学習対象となる画像である学習画像と、学習画像に対応する教師画像と、を含む。学習画像は、入力画像と同様に、デジタル画像である。教師画像は、学習画像に対応する画像セグメンテーションされた回答となる画像、つまり、領域分割された画像となっている。教師画像は、アノテーション作業により生成される画像となっている。 Next, learning of the image recognition device 1 will be described. The learning of the image recognition device 1 uses a learning data set. The learning data set includes a learning image, which is an image to be learned, and a teacher image corresponding to the learning image. The learning image is a digital image, like the input image. The teacher image is an image that is an answer that has been subjected to image segmentation corresponding to the learning image, that is, an image obtained by region segmentation. The teacher image is an image generated by the annotation work.

図５は、第１の学習データセットの一例を示す図である。図６は、第２の学習データセットの一例を示す図である。学習データセットは、第１の画像認識部１１の学習に用いられる第１の学習データセットＤ１と、第２の画像認識部１２の学習に用いられる第２の学習データセットＤ２とを含む。 FIG. 5 is a diagram illustrating an example of the first learning data set. FIG. 6 is a diagram illustrating an example of the second learning data set. The learning data set includes a first learning data set D1 used for learning of the first image recognizing unit 11 and a second learning data set D2 used for learning of the second image recognizing unit 12.

図５に示すように、第１の学習データセットＤ１は、第１の学習画像Ｇ１と、第１の教師画像Ｔ１とを含む。第１の学習画像Ｇ１は、第１の画像認識部１１の学習対象となる画像であり、入力画像と同様に、デジタル画像である。第１の教師画像Ｔ１は、バウンディング・ボックスを用いてクラスごとに領域分割された画像となっている。図５に示す第１の教師画像Ｔ１では、例えば、人のクラスに分類された矩形状の画像領域Ｔ１ａと、車のクラスに分類された矩形状の画像領域Ｔ１ｂと含んでいる。 As shown in FIG. 5, the first learning data set D1 includes a first learning image G1 and a first teacher image T1. The first learning image G1 is an image to be learned by the first image recognition unit 11, and is a digital image, like the input image. The first teacher image T1 is an image that is divided into regions for each class using a bounding box. The first teacher image T1 illustrated in FIG. 5 includes, for example, a rectangular image region T1a classified into a person class and a rectangular image region T1b classified into a car class.
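A bounding-box teacher image of the kind shown in FIG. 5 can be thought of as a label map in which every pixel inside a box carries the class of that box. The sketch below illustrates this under assumed conventions: the class ids, the `(class_id, top, left, bottom, right)` box format, and the name `boxes_to_label_map` are hypothetical and not taken from the description.

```python
BACKGROUND, PERSON, CAR = 0, 1, 2   # hypothetical class ids


def boxes_to_label_map(height, width, boxes):
    """Rasterize bounding boxes into a coarse per-pixel label map.

    boxes: list of (class_id, top, left, bottom, right); bottom and right
    are exclusive. Later boxes overwrite earlier ones where they overlap.
    """
    label_map = [[BACKGROUND] * width for _ in range(height)]
    for class_id, top, left, bottom, right in boxes:
        for r in range(max(top, 0), min(bottom, height)):
            for c in range(max(left, 0), min(right, width)):
                label_map[r][c] = class_id
    return label_map


# A toy 4x6 teacher image with one person box and one car box.
t1 = boxes_to_label_map(4, 6, [(PERSON, 0, 0, 3, 2), (CAR, 1, 3, 4, 6)])
```

Annotating such an image only requires drawing rectangles, which is why it is cheaper to produce than a teacher image labeled pixel by pixel.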

The second learning data set D2 includes a second learning image G2 and a second teacher image T2. The second learning image G2 is an image to be learned by the second image recognition unit 12 and, like the input image and the first learning image G1, is a digital image. In FIGS. 5 and 6, the first learning image G1 and the second learning image G2 are shown as the same image to simplify the description, but they may be different images. The second teacher image T2 is an image divided into regions for each class on a pixel-by-pixel basis. The second teacher image T2 shown in FIG. 6 includes, for example, an image region T2a classified into the person class, an image region T2b classified into the car class, and an image region T2c classified into the road class.

第１の教師画像Ｔ１は、第２の教師画像Ｔ２に比して粗い領域分割となる画像セグメンテーションが行われた画像となっている。換言すれば、第２の教師画像Ｔ２は、第１の教師画像Ｔ１に比して緻密な領域分割となる画像セグメンテーションが行われた画像となっている。第１の教師画像Ｔ１及び第２の教師画像Ｔ２は、アノテーション作業により生成される画像となっている。具体的に、第１の教師画像Ｔ１は、第１の学習画像Ｇ１に含まれるオブジェクトをバウンディング・ボックスにより囲んでクラス分類を行うアノテーション作業により生成される。第２の教師画像Ｔ２は、第１の学習画像Ｇ１のピクセル単位でクラス分類を行うアノテーション作業により生成される。このため、第１の教師画像Ｔ１は、第２の教師画像Ｔ２に比して作業負荷が低く、アノテーションコストが低いものとなっている。換言すれば、第２の教師画像Ｔ２は、第１の教師画像Ｔ１に比して作業負荷が高く、アノテーションコストが高いものとなっている。 The first teacher image T1 is an image that has been subjected to image segmentation that results in coarser area division than the second teacher image T2. In other words, the second teacher image T2 is an image that has been subjected to image segmentation that results in a finer area division than the first teacher image T1. The first teacher image T1 and the second teacher image T2 are images generated by the annotation work. Specifically, the first teacher image T1 is generated by an annotation operation of classifying the object included in the first learning image G1 by surrounding the object with a bounding box. The second teacher image T2 is generated by an annotation work of performing class classification on a pixel-by-pixel basis in the first learning image G1. Therefore, the first teacher image T1 has a lower work load and a lower annotation cost than the second teacher image T2. In other words, the second teacher image T2 has a higher work load and a higher annotation cost than the first teacher image T1.

In addition, more first teacher images T1 than second teacher images T2 are prepared for the learning of the image recognition device 1. In other words, fewer second teacher images T2 than first teacher images T1 are prepared. That is, a large number of first learning data sets D1, which include the first teacher images T1 with low annotation cost, are prepared to train the first image recognition unit 11, while only a small number of second learning data sets D2, which include the second teacher images T2 with high annotation cost, are prepared to train the second image recognition unit 12.

次に、図７から図９を参照して、第１の学習データセットＤ１及び第２の学習データセットＤ２を用いた画像認識装置１の学習に関する処理について説明する。図７から図９は、画像認識装置の画像学習に関する処理の一例を示す図である。画像認識装置１の学習では、第１の画像認識部１１の学習を行ってから、第２の画像認識部１２の学習を行っている。 Next, with reference to FIG. 7 to FIG. 9, a process related to learning of the image recognition device 1 using the first learning data set D1 and the second learning data set D2 will be described. 7 to 9 are diagrams illustrating an example of processing related to image learning of the image recognition device. In the learning of the image recognition device 1, the learning of the second image recognition unit 12 is performed after the learning of the first image recognition unit 11 is performed.

図７を参照して、第１の学習データセットＤ１を用いて、第１の画像認識部１１の学習を行う処理について説明する。第１の画像認識部１１の学習を行う処理では、第１の学習画像Ｇ１を第１の画像認識部１１に入力し、第１の画像認識部１１により第１の学習画像Ｇ１の画像セグメンテーションを行って、第１の出力画像Ｏ１を取得するステップ（第５のステップ）を実行する。 With reference to FIG. 7, a process of performing learning of the first image recognition unit 11 using the first learning data set D1 will be described. In the process of performing learning of the first image recognition unit 11, the first learning image G1 is input to the first image recognition unit 11, and the first image recognition unit 11 performs image segmentation of the first learning image G1. Then, the step of acquiring the first output image O1 (fifth step) is performed.

Specifically, the first learning image G1 of the first learning data set D1 is input to the first image recognition unit 11 of the image recognition device 1 (step S1). When the first learning image G1 is input, the first image recognition unit 11 executes the encoding process on the first learning image G1, treating it as the input image (step S2). By executing the encoding process, the first image recognition unit 11 generates a low-resolution feature map containing downsampled feature amounts. The first image recognition unit 11 then executes the decoding process on this feature map (step S3). By executing the decoding process, the first image recognition unit 11 upsamples the feature map while restoring its feature amounts until it has the same resolution as the first learning image G1. The first image recognition unit 11 then executes class inference that divides the image into regions for each class on a pixel-by-pixel basis (step S4), and acquires the first output image O1 as the result of the class inference (step S5).

Next, the process of training the first image recognition unit 11 executes a step of acquiring a first error of the first output image O1 with respect to the first teacher image T1 (step S6: sixth step).

Specifically, in step S6, upon acquiring the first output image O1, the first image recognition unit 11 acquires the first teacher image T1 of the first learning data set D1. From the acquired first teacher image T1 and the first output image O1, the first image recognition unit 11 calculates the amount of error between them as the first error. The amount of error is calculated using the cross-entropy function.
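The error calculation can be illustrated with a per-pixel cross entropy averaged over the image. This is a sketch under assumptions: the description only names the cross-entropy function, so the use of raw class scores passed through a softmax, the averaging over pixels, and the names `softmax` and `pixel_cross_entropy` are illustrative choices.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pixel_cross_entropy(score_maps, teacher_labels):
    """Mean cross entropy between per-pixel scores and a teacher label map.

    score_maps[k][r][c] is the raw score of class k at pixel (r, c);
    teacher_labels[r][c] is the correct class id at that pixel.
    """
    height, width = len(teacher_labels), len(teacher_labels[0])
    total = 0.0
    for r in range(height):
        for c in range(width):
            probs = softmax([plane[r][c] for plane in score_maps])
            total -= math.log(probs[teacher_labels[r][c]])
    return total / (height * width)


scores = [[[5.0, 0.0]], [[0.0, 5.0]]]   # two classes, one row of two pixels
teacher = [[0, 1]]                      # teacher image agrees with the scores
first_error = pixel_cross_entropy(scores, teacher)
```

The error is small when the predicted class of each pixel matches the teacher image and grows as the prediction puts probability on the wrong class.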

そして、第１の画像認識部１１の学習を行う処理では、第１の誤差に基づいて、第１の画像認識部１１による画像セグメンテーションを修正するステップ（第７のステップ）を実行する。 Then, in the process of learning the first image recognition unit 11, a step (seventh step) of correcting the image segmentation by the first image recognition unit 11 based on the first error is executed.

Specifically, upon acquiring the first error, the first image recognition unit 11 trains the weights of the convolution layers and deconvolution layers of the network so that the error in the network is corrected by error backpropagation based on the amount of error, and updates the network (step S7). By executing step S7, the first image recognition unit 11 completes learning with that first learning data set D1. The first image recognition unit 11 repeats steps S1 to S7 according to the number of first learning data sets D1.

次に、図８及び図９を参照して、第２の学習データセットＤ２を用いて、第２の画像認識部１２の学習を行う処理について説明する。第２の画像認識部１２の学習を行う処理では、第１の画像認識部１１は学習済みとなっており、第１の画像認識部１１から出力される第１の出力画像Ｏ１が用いられる。第２の画像認識部１２の学習を行う処理では、第２の学習画像Ｇ２を第１の画像認識部１１に入力し、第１の画像認識部１１により第２の学習画像Ｇ２の画像セグメンテーションを行って、第１の出力画像Ｏ１を取得するステップ（第１のステップ）を実行する。 Next, with reference to FIGS. 8 and 9, a process of learning the second image recognition unit 12 using the second learning data set D2 will be described. In the process of learning the second image recognition unit 12, the first image recognition unit 11 has already learned, and the first output image O1 output from the first image recognition unit 11 is used. In the process of learning the second image recognizing unit 12, the second learning image G2 is input to the first image recognizing unit 11, and the first image recognizing unit 11 performs image segmentation of the second learning image G2. Then, the step of obtaining the first output image O1 (first step) is performed.

Specifically, as shown in FIG. 8, the second learning image G2 of the second learning data set D2 is input to the first image recognition unit 11 of the image recognition device 1 (step S11). When the second learning image G2 is input, the first image recognition unit 11 executes the encoding process on the second learning image G2, treating it as the input image (step S12). By executing the encoding process, the first image recognition unit 11 generates a low-resolution feature map containing downsampled feature amounts. The first image recognition unit 11 then executes the decoding process on this feature map (step S13). By executing the decoding process, the first image recognition unit 11 upsamples the low-resolution feature map while restoring its feature amounts until it has the same resolution as the second learning image G2. The first image recognition unit 11 then executes class inference that divides the image into regions for each class on a pixel-by-pixel basis (step S14), and acquires the first output image O1 as the result of the class inference (step S15).

Next, the process of training the second image recognition unit 12 executes a step (second step) of inputting the second learning image G2 and the first output image O1 to the second image recognition unit 12, performing image segmentation of the second learning image G2 with the second image recognition unit 12 using the first output image O1, and acquiring the second output image O2.

Specifically, as shown in FIG. 9, the second learning image G2 of the second learning data set D2 is input to the second image recognition unit 12 of the image recognition device 1 (step S21). When the second learning image G2 is input, the second image recognition unit 12 executes the feature amount extraction process on the second learning image G2, treating it as the input image (step S22). By executing the feature amount extraction process, the second image recognition unit 12 generates a feature map containing feature amounts. The second image recognition unit 12 then executes the fusion process on this feature map (step S23). By executing the fusion process, the second image recognition unit 12 restores the feature map produced by the feature amount extraction process, using the first output image O1 as a hint. The second image recognition unit 12 then executes class inference that divides the feature map into regions for each class on a pixel-by-pixel basis (step S24), and acquires the second output image O2 as the result of the class inference (step S25).

Next, the process of training the second image recognition unit 12 executes a step of acquiring a second error of the second output image O2 with respect to the second teacher image T2 (step S26: third step).

Specifically, in step S26, upon acquiring the second output image O2, the second image recognition unit 12 acquires the second teacher image T2 of the second learning data set D2. From the acquired second teacher image T2 and the second output image O2, the second image recognition unit 12 calculates the amount of error between them as the second error. The amount of error is calculated using the cross-entropy function.

そして、第２の画像認識部１２の学習を行う処理では、第２の誤差に基づいて、第２の画像認識部１２による画像セグメンテーションを修正するステップ（第４のステップ）を実行する。 Then, in the process of learning by the second image recognition unit 12, a step (fourth step) of correcting the image segmentation by the second image recognition unit 12 based on the second error is executed.

具体的に、第２の画像認識部１２は、第２の誤差を取得すると、誤差量に基づいて誤差逆伝播法によりネットワークにおける誤差が修正されるように、ネットワークの畳み込み層の重みを学習させ、ネットワークを更新する（ステップＳ２７）。ここで、ステップＳ２７において、第２の誤差に基づく学習では、第２の画像認識部１２の学習を行う一方で、第１の画像認識部１１の学習を遮断している。すなわち、第２の誤差は、第２の画像認識部１２へ誤差逆伝播させる一方で、第１の画像認識部１１へ誤差逆伝播させない。このため、ステップＳ２７では、第２の画像認識部１２におけるネットワークが誤差修正される一方で、第１の画像認識部１１におけるネットワークが誤差修正されない。第２の画像認識部１２は、ステップＳ２７の実行により、第２の学習データセットＤ２を用いた学習を終了する。そして、第２の画像認識部１２は、ステップＳ２１からステップＳ２７を、第２の学習データセットＤ２のセット数に応じて繰り返し実行する。 Specifically, when the second image recognition unit 12 acquires the second error, the second image recognition unit 12 learns the weight of the convolutional layer of the network so that the error in the network is corrected by the error backpropagation method based on the error amount. The network is updated (step S27). Here, in step S27, in the learning based on the second error, the learning of the first image recognition unit 11 is interrupted while the learning of the second image recognition unit 12 is performed. That is, while the second error is backpropagated to the second image recognition unit 12, it is not backpropagated to the first image recognition unit 11. For this reason, in step S27, while the network in the second image recognition unit 12 is corrected for errors, the network in the first image recognition unit 11 is not corrected for errors. The second image recognition unit 12 ends the learning using the second learning data set D2 by executing step S27. Then, the second image recognition unit 12 repeatedly executes steps S21 to S27 according to the number of the second learning data sets D2.
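The idea of backpropagating the second error to the second image recognition unit 12 but not to the first image recognition unit 11 can be shown with a deliberately tiny stand-in. This is not the network of the description: each unit is reduced to a single scalar weight (`w1`, `w2`), a squared error replaces the cross entropy, and the function name `train_second_unit` is hypothetical. Only the gradient-blocking pattern is the point.

```python
def train_second_unit(w1, w2, samples, lr=0.1, epochs=50):
    """Train only the second unit; the first unit is used forward-only.

    samples: list of (x, target) pairs. Returns (w1, w2); w1 is unchanged.
    """
    for _ in range(epochs):
        for x, target in samples:
            o1 = w1 * x                # first unit (already trained), forward pass
            hint = o1                  # treated as a constant: no gradient to w1
            o2 = w2 * (x + hint)       # second unit sees the image and the hint
            error = o2 - target        # d/d(o2) of 0.5 * (o2 - target) ** 2
            w2 -= lr * error * (x + hint)   # update the second unit only
    return w1, w2


w1_after, w2_after = train_second_unit(0.5, 0.0, [(1.0, 3.0)])
```

Because no update is ever computed for `w1`, the first unit keeps whatever it learned from the large first learning data set, while `w2` adapts to the small second learning data set.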

このように、画像認識装置１の学習では、アノテーションコストの低い多量の第１の学習データセットＤ１を用いて、第１の画像認識部１１を学習させている。また、画像認識装置１の学習では、アノテーションコストの高い少量の第２の学習データセットＤ２を用いて、第２の画像認識部１２を学習させている。 As described above, in the learning of the image recognition device 1, the first image recognition unit 11 is trained by using a large amount of the first learning data set D1 with low annotation cost. In the learning of the image recognition device 1, the second image recognition unit 12 is trained by using a small amount of the second learning data set D2 having a high annotation cost.

Next, image recognition by the trained image recognition device 1 will be described with reference to FIGS. 10 and 11. FIGS. 10 and 11 are diagrams illustrating an example of processing related to image recognition by the image recognition device. The image recognition processing of the image recognition device 1 executes a step (eighth step) of inputting the input image I to the first image recognition unit 11, performing image segmentation of the input image I with the first image recognition unit 11, and acquiring the first output image O1.

Specifically, as shown in FIG. 10, the input image I is input to the image recognition device 1 (step S31). When the input image I is input, the first image recognition unit 11 executes the encoding process on the input image I (step S32). By executing the encoding process, the first image recognition unit 11 generates a low-resolution feature map containing downsampled feature amounts. The first image recognition unit 11 then executes the decoding process on this feature map (step S33). By executing the decoding process, the first image recognition unit 11 upsamples the low-resolution feature map while restoring its feature amounts until it has the same resolution as the input image I. The first image recognition unit 11 then executes class inference that divides the image into regions for each class on a pixel-by-pixel basis (step S34), and acquires the first output image O1 as the result of the class inference (step S35).

次に、画像認識装置１の画像認識に関する処理では、入力画像Ｉと第１の出力画像Ｏ１とを第２の画像認識部１２に入力し、第１の出力画像Ｏ１を用いて第２の画像認識部１２により入力画像Ｉの画像セグメンテーションを行って、第２の出力画像Ｏ２を取得するステップ（第９のステップ）を実行する。 Next, in the process related to image recognition of the image recognition device 1, the input image I and the first output image O1 are input to the second image recognition unit 12, and the second image is input using the first output image O1. A step (a ninth step) of performing the image segmentation of the input image I by the recognition unit 12 to obtain the second output image O2 is executed.

具体的に、図１１に示すように、入力画像Ｉが、画像認識装置１の第２の画像認識部１２に入力される（ステップＳ４１）。入力画像Ｉが入力されると、第２の画像認識部１２は、入力画像Ｉに対して特徴量抽出処理を実行する（ステップＳ４２）。第２の画像認識部１２は、特徴量抽出処理を実行することで、入力画像Ｉから特徴量を含む特徴マップを生成する。また、第２の画像認識部１２は、特徴量を含む特徴マップに対してフュージョン処理を実行する（ステップＳ４３）。第２の画像認識部１２は、フュージョン処理を実行することで、第１の出力画像Ｏ１をヒントとして、特徴量抽出処理が行われる特徴マップを復元する。そして、第２の画像認識部１２は、特徴マップから、ピクセル単位でクラスごとに領域分割するクラス推論を実行する（ステップＳ４４）。第２の画像認識部１２は、クラス推論の結果として、第２の出力画像Ｏ２を取得する（ステップＳ４５）。 Specifically, as shown in FIG. 11, the input image I is input to the second image recognition unit 12 of the image recognition device 1 (Step S41). When the input image I is input, the second image recognition unit 12 performs a feature amount extraction process on the input image I (Step S42). The second image recognizing unit 12 generates a feature map including the feature amount from the input image I by executing the feature amount extracting process. In addition, the second image recognition unit 12 performs a fusion process on the feature map including the feature amount (Step S43). The second image recognizing unit 12 executes the fusion process to restore the feature map on which the feature amount extraction process is performed using the first output image O1 as a hint. Then, the second image recognition unit 12 executes a class inference that divides the area for each class on a pixel-by-pixel basis from the feature map (step S44). The second image recognition unit 12 acquires the second output image O2 as a result of the class inference (Step S45).

In this way, in the image recognition of the image recognition device 1, the first image recognition unit 11 performs recognition that secures robustness. Because the first image recognition unit 11 does not need to divide regions as finely as the second image recognition unit 12, it performs recognition as a task with a low computational load. The second image recognition unit 12 performs fine region division and thereby secures recognition accuracy. Because the second image recognition unit 12 divides the input image I into regions using the first output image O1, it also performs recognition as a task with a low computational load.

The image recognition device 1 also performs the processing shown in FIG. 12 as part of image recognition. FIG. 12 is a diagram illustrating an example of processing related to image recognition by the image recognition device. In the processing shown in FIG. 12, a step (tenth step) is executed in which the first output image O1 obtained by image recognition and the second output image O2 corresponding to the first output image O1 are acquired in association with each other.

Specifically, as shown in FIG. 12, the first image recognition unit 11 obtains the first output image O1 as an intermediate image (step S51). The second image recognition unit 12 obtains the second output image O2 corresponding to the first output image O1 (step S52). The image recognition device 1 acquires the first output image O1 and the second output image O2 in association with each other (step S53).

The acquired first output image O1 and second output image O2 are used when evaluating or analyzing image recognition by the image recognition device 1. For example, when image recognition by the image recognition device 1 shows a defect such as misrecognition, comparing the first output image O1 with the second output image O2 makes it possible to estimate whether the abnormality lies in the first image recognition unit 11 or in the second image recognition unit 12. That is, when the second output image O2 contains a misrecognition and the first output image O1 does not, the abnormality can be attributed to the second image recognition unit 12. Conversely, when the second output image O2 contains a misrecognition and the first output image O1 also contains one, the abnormality can be attributed to the first image recognition unit 11.
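The attribution rule described above is a two-way decision on the associated pair of outputs. A minimal sketch follows; the record type, the function name, and the returned strings are hypothetical, and how a misrecognition is detected in each image (for example by manual review) is outside the scope of the sketch.

```python
from dataclasses import dataclass
from typing import List

LabelMap = List[List[int]]


@dataclass
class RecognitionRecord:
    """O1 and O2 stored together, as in step S53 (hypothetical container)."""
    first_output: LabelMap    # O1, intermediate image from unit 11
    second_output: LabelMap   # O2, final image from unit 12


def locate_fault(first_has_error: bool, second_has_error: bool) -> str:
    """Estimate which unit is abnormal from the flags of one associated pair."""
    if not second_has_error:
        return "no misrecognition in O2"
    if first_has_error:
        # O1 is already wrong, so the error originates upstream.
        return "first image recognition unit 11"
    # O1 is correct but O2 is wrong, so the refinement stage is suspect.
    return "second image recognition unit 12"


if __name__ == "__main__":
    record = RecognitionRecord(first_output=[[1, 1]], second_output=[[0, 0]])
    # Suppose review shows O1 is correct and O2 is wrong for this record.
    print(locate_fault(first_has_error=False, second_has_error=True))
```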

Next, generation of a learning data set by the trained image recognition device 1 will be described with reference to FIG. 13. FIG. 13 is a diagram illustrating an example of processing related to generation of a learning data set by the image recognition device. In this processing, the second learning data set D2 is generated using the first learning data set D1 that has already been prepared. A step (eleventh step) is executed in which the first learning image G1 and the first teacher image T1 are input to the second image recognition unit 12, and the second image recognition unit 12 performs image segmentation of the first learning image G1 using the first teacher image T1 to obtain the second output image O2.

Specifically, as shown in FIG. 13, the first learning image G1 is input to the second image recognition unit 12 of the image recognition device 1 (step S61). When the first learning image G1 is input, the second image recognition unit 12 executes a feature extraction process on the first learning image G1 (step S62). By executing the feature extraction process, the second image recognition unit 12 generates a feature map containing feature values from the first learning image G1. The second image recognition unit 12 then executes a fusion process on the feature map containing the feature values (step S63). By executing the fusion process, the second image recognition unit 12 restores the feature map produced by the feature extraction process, using the first teacher image T1 as a hint. Next, the second image recognition unit 12 executes class inference, which divides the feature map into regions for each class on a pixel-by-pixel basis (step S64). The second image recognition unit 12 obtains the second output image O2 as the result of the class inference (step S65).

Next, in the processing related to generation of the learning data set, the image recognition device 1 acquires the first learning image G1 as the second learning image G2. The image recognition device 1 also acquires the second output image O2 as the second teacher image T2. The image recognition device 1 then generates the second learning data set D2 including the second learning image G2 and the second teacher image T2 (step S66: twelfth step). After executing step S66, the image recognition device 1 ends generation of that second learning data set D2. By repeating steps S61 to S66 with different first learning data sets D1, the image recognition device 1 generates a plurality of second learning data sets D2. A step of selecting usable second learning data sets D2 from among the generated second learning data sets D2 may be added.
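The generation loop of steps S61 to S66, including the optional selection step, can be sketched as follows. The trained second image recognition unit is passed in as a callable; its signature, the `is_usable` predicate, and the list-of-pairs representation of a data set are assumptions made for illustration only.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Image = List[List[float]]
LabelMap = List[List[int]]
Pair = Tuple[Image, LabelMap]                       # (learning image, teacher image)
SecondUnit = Callable[[Image, LabelMap], LabelMap]  # (image, hint) -> O2


def generate_second_datasets(
    first_datasets: Iterable[Pair],
    second_unit: SecondUnit,
    is_usable: Optional[Callable[[Image, LabelMap], bool]] = None,
) -> List[Pair]:
    """Build (G2, T2) pairs from (G1, T1) pairs with a trained second unit."""
    second_datasets: List[Pair] = []
    for g1, t1 in first_datasets:
        o2 = second_unit(g1, t1)   # S61 to S65: segment G1 with T1 as the hint
        g2, t2 = g1, o2            # S66: G1 becomes G2, O2 becomes T2
        # Optional selection step: keep only pairs judged usable.
        if is_usable is None or is_usable(g2, t2):
            second_datasets.append((g2, t2))
    return second_datasets


if __name__ == "__main__":
    # Stand-in for the trained unit 12: keep the hinted label on bright pixels.
    def toy_unit(image: Image, hint: LabelMap) -> LabelMap:
        return [
            [lab if px > 0.5 else 0 for px, lab in zip(i_row, h_row)]
            for i_row, h_row in zip(image, hint)
        ]

    d1 = [([[0.9, 0.1]], [[1, 1]]), ([[0.2, 0.3]], [[2, 2]])]
    print(generate_second_datasets(d1, toy_unit))
```

Passing a predicate such as "the teacher image contains at least one labelled pixel" as `is_usable` would drop the second pair in the usage example, which is one way the optional selection step could be realized.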

As described above, learning of the image recognition device 1 according to the embodiment can be divided into learning of the first image recognition unit 11 and learning of the second image recognition unit 12. The first image recognition unit 11 can be trained with a large number of first learning data sets D1, whose annotation cost is low. The second image recognition unit 12 can be trained with a small number of second learning data sets D2, whose annotation cost is high. Because only a small number of the costly second teacher images T2 are needed, the annotation cost can be reduced. As a result, the efficiency of learning with the teacher images T1 and T2 can be improved.

Because the first image recognition unit 11 can be trained with a large number of first learning data sets D1, it can learn highly robust image recognition. Because the second image recognition unit 12 is trained to perform fine image segmentation, it can learn image recognition with high recognition accuracy. The image recognition device 1 according to the embodiment can therefore learn image recognition that is both highly robust and highly accurate.

Because the first image recognition unit 11 is trained with the first learning data set D1, it can be trained accurately in a manner suited to the first image recognition unit 11. Similarly, because the second image recognition unit 12 is trained with the second learning data set D2, it can be trained accurately in a manner suited to the second image recognition unit 12.

In the learning of the second image recognition unit 12, the acquired second error is not propagated back to the first image recognition unit 11, so the learning of the second image recognition unit 12 has no effect on the first image recognition unit 11.
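The effect of not propagating the second error to the first unit can be shown with a deliberately tiny model. The sketch below is not the patented network: both units are reduced to scalar linear maps, the loss is a squared error, and the names `w1`, `w2`, `v2`, and `lr` are hypothetical. The point is only that the first unit's output is treated as a constant, so no gradient is computed for `w1` and it stays unchanged.

```python
from typing import Tuple


def second_unit_training_step(
    x: float, t: float, w1: float, w2: float, v2: float, lr: float = 0.1
) -> Tuple[float, float, float, float]:
    """One gradient step on the second unit only.

    x: learning image (scalar stand-in), t: teacher value,
    w1: parameter of the first unit, w2 and v2: parameters of the second unit.
    Returns (w1, new_w2, new_v2, loss_before_update).
    """
    o1 = w1 * x              # first output, treated as a constant hint
    o2 = w2 * x + v2 * o1    # second output uses the input and the hint
    err = o2 - t             # signed second error
    loss = err * err

    # Gradients of the squared error with respect to the second unit only.
    grad_w2 = 2.0 * err * x
    grad_v2 = 2.0 * err * o1
    # No gradient is computed for w1: the second error stops at o1.

    return w1, w2 - lr * grad_w2, v2 - lr * grad_v2, loss


if __name__ == "__main__":
    w1, w2, v2 = 0.5, 0.0, 0.0
    for _ in range(3):
        w1, w2, v2, loss = second_unit_training_step(1.0, 2.0, w1, w2, v2)
        print(f"w1={w1} loss={loss:.4f}")
```

In a deep learning framework the same effect is typically obtained by detaching the first output from the computation graph before feeding it to the second unit.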

The image recognition of the image recognition device 1 according to the embodiment can likewise be divided into image recognition by the first image recognition unit 11 and image recognition by the second image recognition unit 12. The first image recognition unit 11 can perform highly robust image recognition, and the second image recognition unit 12 can perform image recognition with high recognition accuracy. The image recognition device 1 according to the embodiment can therefore perform image recognition that is both highly robust and highly accurate.

In the image recognition of the first image recognition unit 11, executing a task that finds the positions of objects contained in the input image I keeps the computational load low. In the image recognition of the second image recognition unit 12, executing fine image segmentation of the input image I with the first output image O1 as a hint also keeps the computational load low. Because the computational load is low, the image recognition device 1 according to the embodiment can perform image recognition at high speed.

In the image recognition of the image recognition device 1, the first output image O1 and the second output image O2 can be acquired in association with each other. This makes it possible to estimate abnormalities in the first image recognition unit 11 and the second image recognition unit 12 when evaluating or analyzing the image recognition of the image recognition device 1.

The image recognition device 1 can also generate the second learning data set D2 from the first learning data set D1. Because the second learning data set D2 can be generated automatically, the annotation cost of producing the second learning data set D2 can be reduced.

In the second image recognition unit 12 of the embodiment, the number of kernel channels and the number of channels of the feature map input to the convolutional layer may be made smaller than in the first image recognition unit 11. The product of the number of kernel channels and the number of channels of the feature map input to the convolutional layer represents the expressive power of the image recognition. Making the expressive power of the second image recognition unit 12 lower than that of the first image recognition unit 11 reduces the computational load of the second image recognition unit 12. The image recognition device 1 according to the embodiment can therefore further lower the computational load and perform image recognition faster.
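The relationship between channel counts and computational load can be checked with simple arithmetic. The sketch below assumes that the "number of kernel channels" corresponds to the number of output filters of a convolutional layer, and it uses the standard multiply-accumulate count for a stride-1 convolution with same-size output. The concrete sizes (32 x 32 feature map, 3 x 3 kernel, 64 versus 16 channels) are hypothetical and do not come from the patent.

```python
def expressiveness(kernel_channels: int, input_channels: int) -> int:
    """Product of kernel channels and input feature-map channels."""
    return kernel_channels * input_channels


def conv_macs(
    height: int, width: int, kernel_size: int,
    input_channels: int, kernel_channels: int,
) -> int:
    """Multiply-accumulate count of one stride-1, same-size convolution."""
    return height * width * kernel_size * kernel_size * input_channels * kernel_channels


if __name__ == "__main__":
    first = conv_macs(32, 32, 3, input_channels=64, kernel_channels=64)
    second = conv_macs(32, 32, 3, input_channels=16, kernel_channels=16)
    print(expressiveness(64, 64), expressiveness(16, 16))  # 4096 256
    print(first // second)                                  # 16
```

With these assumed sizes, quartering both channel counts cuts the expressiveness product and the per-layer multiply-accumulate count by the same factor of 16, which is the kind of saving the paragraph above describes.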

Reference Signs List
1 image recognition device
5 control unit
6 storage unit
7 image recognition unit
11 first image recognition unit
12 second image recognition unit
22 encoder
23 decoder
I input image
O output image
P1 image learning program
P2 image recognition program
P3 learning data set generation program
D1 first learning data set
D2 second learning data set

Claims

An image learning program executed by an image recognition device that performs image segmentation, wherein
a learning data set used for learning of the image recognition device includes
a learning image that is an image to be learned by the image recognition device, and
a teacher image corresponding to the learning image,
the image recognition device includes
a first image recognition unit that performs image segmentation of the learning image, and
a second image recognition unit that performs image segmentation of the learning image with finer region division than the first image recognition unit, and
the image learning program causes the image recognition device to execute:
a first step of inputting the learning image to the first image recognition unit, performing image segmentation of the learning image by the first image recognition unit, and acquiring a first output image;
a second step of inputting the learning image and the first output image to the second image recognition unit, performing image segmentation of the learning image by the second image recognition unit using the first output image, and acquiring a second output image;
a third step of acquiring a second error of the second output image with respect to the teacher image; and
a fourth step of correcting the image segmentation process of the second image recognition unit based on the second error.

The image learning program according to claim 1, wherein
the learning data set includes a first learning data set for learning of the first image recognition unit and a second learning data set for learning of the second image recognition unit,
the first learning data set includes a first learning image and a first teacher image corresponding to the first learning image,
the second learning data set includes a second learning image and a second teacher image corresponding to the second learning image,
the second teacher images are fewer in number than the first teacher images and are more finely divided into regions than the first teacher images,
before the first step is executed, the image learning program further causes the image recognition device to execute:
a fifth step of inputting the first learning image to the first image recognition unit, performing image segmentation of the first learning image by the first image recognition unit, and acquiring the first output image;
a sixth step of acquiring a first error of the first output image with respect to the first teacher image; and
a seventh step of correcting the image segmentation process of the first image recognition unit based on the first error,
in the first step, the second learning image is input to the first image recognition unit to acquire the first output image,
in the second step, the second learning image and the first output image are input to the second image recognition unit to acquire the second output image, and
in the third step, the second error of the second output image with respect to the second teacher image is acquired.

The image learning program according to claim 1, wherein, in the fourth step, correction of image segmentation processing by the first image recognition unit based on the second error is blocked.

An image learning method executed by an image recognition device that performs image segmentation, wherein
a learning data set used for learning of the image recognition device includes
a learning image that is an image to be learned by the image recognition device, and
a teacher image corresponding to the learning image,
the image recognition device includes
a first image recognition unit that performs image segmentation of the learning image, and
a second image recognition unit that performs image segmentation of the learning image with finer region division than the first image recognition unit, and
the image learning method includes:
a first step of inputting the learning image to the first image recognition unit, performing image segmentation of the learning image by the first image recognition unit, and acquiring a first output image;
a second step of inputting the learning image and the first output image to the second image recognition unit, performing image segmentation of the learning image by the second image recognition unit using the first output image, and acquiring a second output image;
a third step of acquiring a second error of the second output image with respect to the teacher image; and
a fourth step of correcting the image segmentation process of the second image recognition unit based on the second error.

An image recognition program executed by an image recognition device that performs image segmentation of an input image, wherein
the image recognition device includes
a first image recognition unit that performs image segmentation of the input image, and
a second image recognition unit that performs image segmentation of the input image with finer region division than the first image recognition unit, and
the image recognition program causes the image recognition device to execute:
an eighth step of inputting the input image to the first image recognition unit, performing image segmentation of the input image by the first image recognition unit, and acquiring a first output image; and
a ninth step of inputting the input image and the first output image to the second image recognition unit, performing image segmentation of the input image by the second image recognition unit using the first output image, and acquiring a second output image.

The image recognition program according to claim 5, further causing the image recognition device to execute a tenth step of acquiring the first output image and the second output image corresponding to the first output image in association with each other.

An image recognition method executed by an image recognition device that performs image segmentation of an input image, wherein
the image recognition device includes
a first image recognition unit that performs image segmentation of the input image, and
a second image recognition unit that performs image segmentation of the input image with finer region division than the first image recognition unit, and
the image recognition method includes:
an eighth step of inputting the input image to the first image recognition unit, performing image segmentation of the input image by the first image recognition unit, and acquiring a first output image; and
a ninth step of inputting the input image and the first output image to the second image recognition unit, performing image segmentation of the input image by the second image recognition unit using the first output image, and acquiring a second output image.

A learning data set generation program that is executed by an image recognition device that performs image segmentation of an input image, and that generates a learning data set used by the image recognition device, wherein
the image recognition device includes
a first image recognition unit that performs image segmentation of the input image, and
a second image recognition unit that performs image segmentation of the input image with finer region division than the first image recognition unit,
the learning data set includes a first learning data set for learning of the first image recognition unit and a second learning data set for learning of the second image recognition unit,
the first learning data set includes a first learning image and a first teacher image corresponding to the first learning image,
the second learning data set includes a second learning image and a second teacher image corresponding to the second learning image,
the second teacher image is more finely divided into regions than the first teacher image, and
the learning data set generation program causes the image recognition device to execute:
an eleventh step of inputting the first learning image and the first teacher image to the second image recognition unit, performing image segmentation of the first learning image by the second image recognition unit using the first teacher image, and acquiring a second output image; and
a twelfth step of acquiring the first learning image as the second learning image, acquiring the second output image as the second teacher image, and generating the second learning data set including the second learning image and the second teacher image.

A learning data set generation method that is executed by an image recognition device that performs image segmentation of an input image, and that generates a learning data set used by the image recognition device, wherein
the image recognition device includes
a first image recognition unit that performs image segmentation of the input image, and
a second image recognition unit that performs image segmentation of the input image with finer region division than the first image recognition unit,
the learning data set includes a first learning data set for learning of the first image recognition unit and a second learning data set for learning of the second image recognition unit,
the first learning data set includes a first learning image and a first teacher image corresponding to the first learning image,
the second learning data set includes a second learning image and a second teacher image corresponding to the second learning image,
the second teacher image is more finely divided into regions than the first teacher image, and
the learning data set generation method includes:
an eleventh step of inputting the first learning image and the first teacher image to the second image recognition unit, performing image segmentation of the first learning image by the second image recognition unit using the first teacher image, and acquiring a second output image; and
a twelfth step of acquiring the first learning image as the second learning image, acquiring the second output image as the second teacher image, and generating the second learning data set including the second learning image and the second teacher image.

A learning data set generated by the learning data set generation program according to claim 8 or by the learning data set generation method according to claim 9.

An image recognition device comprising:
a first image recognition unit that performs image segmentation of an input image; and
a second image recognition unit that performs image segmentation of the input image with finer region division than the first image recognition unit, wherein
when the input image is input, the first image recognition unit performs image segmentation of the input image to generate a first output image and outputs the generated first output image to the second image recognition unit, and
when the input image and the first output image are input, the second image recognition unit performs image segmentation of the input image using the first output image and outputs a second output image.

The image recognition device according to claim 11, wherein the first image recognition unit performs image segmentation using a bounding box.

The image recognition device according to claim 11, wherein the second image recognition unit performs image segmentation by semantic segmentation.