JP2020038574A

JP2020038574A - Image learning program, image learning method, image recognition program, image recognition method, and image recognition device

Info

Publication number: JP2020038574A
Application number: JP2018166366A
Authority: JP
Inventors: 俊菅原; Takashi Sugawara
Original assignee: Kyocera Corp
Current assignee: Kyocera Corp
Priority date: 2018-09-05
Filing date: 2018-09-05
Publication date: 2020-03-12

Abstract

PROBLEM TO BE SOLVED: To propose an image learning program and the like that can improve the recognition accuracy of image segmentation. SOLUTION: An image learning program causes an image recognition device to execute learning using a learning data set that includes learning images and teacher images. The image recognition device comprises a first image recognition unit and a second image recognition unit. The first image recognition unit reduces the resolution of the learning image and performs image segmentation on the low-resolution learning image. The second image recognition unit performs image segmentation on the learning image at a higher resolution than the low-resolution learning image. The image learning program causes the device to execute a first step of inputting the learning image to the first image recognition unit to acquire a first output image, a second step of inputting the learning image and the first output image to the second image recognition unit to acquire a second output image, a third step of acquiring a second error of the second output image with respect to the teacher image, and a fourth step of correcting the image segmentation processing of the second image recognition unit on the basis of the second error. SELECTED DRAWING: Figure 8

Description

The present invention relates to an image learning program, an image learning method, an image recognition program, an image recognition method, and an image recognition device.

Semantic segmentation using a fully convolutional network (FCN) is known as an image recognition technology (see, for example, Non-Patent Document 1). Semantic segmentation performs class classification (class inference) on a digital image pixel by pixel. That is, semantic segmentation infers a class for each pixel of the digital image and labels each pixel with the inferred class, thereby dividing the digital image into regions.

Non-Patent Document 1: Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

The FCN used for semantic segmentation generally includes an encoder. The encoder performs downsampling that lowers the resolution of the input image. Downsampling causes local information contained in the input image (for example, small objects) to be lost. This makes it difficult to perform image segmentation on local information contained in the input image.

An object of the present invention is to provide an image learning program, an image learning method, an image recognition program, an image recognition method, and an image recognition device that can improve the recognition accuracy of image segmentation.

An image learning program according to one aspect is an image learning program executed by an image recognition device that performs image segmentation. A learning data set used for learning of the image recognition device includes a learning image, which is the image to be learned by the image recognition device, and a teacher image corresponding to the learning image. The image recognition device includes a first image recognition unit that executes downsampling to lower the resolution of the learning image, generates a low-resolution learning image, and performs image segmentation of the generated low-resolution learning image, and a second image recognition unit that performs image segmentation of the learning image at a higher resolution than the low-resolution learning image. The program causes the image recognition device to execute: a first step of inputting the learning image to the first image recognition unit, generating the low-resolution learning image, and performing image segmentation of the generated low-resolution learning image to acquire a first output image; a second step of inputting the high-resolution learning image and the first output image to the second image recognition unit and performing image segmentation of the high-resolution learning image by the second image recognition unit using the first output image to acquire a second output image; a third step of acquiring a second error of the second output image with respect to the teacher image; and a fourth step of correcting the image segmentation processing performed by the second image recognition unit on the basis of the second error.

An image learning method according to one aspect is an image learning method executed by an image recognition device that performs image segmentation. A learning data set used for learning of the image recognition device includes a learning image, which is the image to be learned by the image recognition device, and a teacher image corresponding to the learning image. The image recognition device includes a first image recognition unit that executes downsampling to lower the resolution of the learning image, generates a low-resolution learning image, and performs image segmentation of the generated low-resolution learning image, and a second image recognition unit that performs image segmentation of the learning image at a higher resolution than the low-resolution learning image. The method includes: a first step of inputting the learning image to the first image recognition unit, generating the low-resolution learning image, and performing image segmentation of the generated low-resolution learning image to acquire a first output image; a second step of inputting the high-resolution learning image and the first output image to the second image recognition unit and performing image segmentation of the high-resolution learning image by the second image recognition unit using the first output image to acquire a second output image; a third step of acquiring a second error of the second output image with respect to the teacher image; and a fourth step of correcting the image segmentation processing performed by the second image recognition unit on the basis of the second error.
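The four learning steps described above can be sketched as a toy, two-class example. This is only an illustration under stated assumptions, not the implementation of the embodiment: the function names, the thresholding used in place of the first image recognition unit, the per-pixel logistic classifier used in place of the second image recognition unit, the cross-entropy used as the second error, and the single gradient-descent update are all hypothetical choices that the text does not specify.

```python
import math

def downsample2x(img):
    """Average each 2x2 block (assumed method); needs even height and width."""
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
             for x in range(0, len(img[0]), 2)]
            for y in range(0, len(img), 2)]

def upsample2x(img):
    """Nearest-neighbour enlargement back to the original size."""
    return [[img[y // 2][x // 2] for x in range(2 * len(img[0]))]
            for y in range(2 * len(img))]

def first_unit(image):
    """Hypothetical stand-in for the first unit: segment the low-resolution image."""
    low = downsample2x(image)
    coarse = [[1.0 if v > 0.5 else 0.0 for v in row] for row in low]
    return upsample2x(coarse)

def second_unit(image, o1, w):
    """Hypothetical stand-in for the second unit: logistic model on (pixel, O1 hint, bias)."""
    return [[1.0 / (1.0 + math.exp(-(w[0] * p + w[1] * h + w[2])))
             for p, h in zip(row_i, row_h)]
            for row_i, row_h in zip(image, o1)]

def second_error(o2, teacher):
    """Mean binary cross-entropy of O2 with respect to the teacher image (assumed error)."""
    eps = 1e-12
    n = len(o2) * len(o2[0])
    total = 0.0
    for row_q, row_t in zip(o2, teacher):
        for q, t in zip(row_q, row_t):
            total -= t * math.log(q + eps) + (1 - t) * math.log(1 - q + eps)
    return total / n

def correct_second_unit(image, o1, o2, teacher, w, lr=0.5):
    """One gradient-descent update of the second unit's weights (assumed correction)."""
    n = len(o2) * len(o2[0])
    grad = [0.0, 0.0, 0.0]
    for row_i, row_h, row_q, row_t in zip(image, o1, o2, teacher):
        for p, h, q, t in zip(row_i, row_h, row_q, row_t):
            d = (q - t) / n
            grad[0] += d * p
            grad[1] += d * h
            grad[2] += d
    return [wi - lr * gi for wi, gi in zip(w, grad)]

# Learning image with a large object (top left) and a one-pixel object at row 2, column 3.
image = [[0.9, 0.8, 0.1, 0.2],
         [0.7, 0.9, 0.2, 0.1],
         [0.1, 0.2, 0.1, 0.9],
         [0.2, 0.1, 0.2, 0.1]]
teacher = [[1, 1, 0, 0],
           [1, 1, 0, 0],
           [0, 0, 0, 1],
           [0, 0, 0, 0]]

w = [0.0, 0.0, 0.0]
o1 = first_unit(image)                               # first step
o2 = second_unit(image, o1, w)                       # second step
err_before = second_error(o2, teacher)               # third step
w = correct_second_unit(image, o1, o2, teacher, w)   # fourth step
err_after = second_error(second_unit(image, o1, w), teacher)
```

In this toy data the one-pixel object is averaged away by the downsampling, so the first output image misses it; the second unit still sees the full-resolution pixel and can be corrected toward the teacher image.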

An image recognition program according to one aspect is an image recognition program executed by an image recognition device that performs image segmentation of an input image. The image recognition device includes a first image recognition unit that executes downsampling to lower the resolution of the input image, generates a low-resolution input image, and performs image segmentation of the generated low-resolution input image, and a second image recognition unit that performs image segmentation of the input image at a higher resolution than the low-resolution input image. The program causes the image recognition device to execute: an eighth step of inputting the input image to the first image recognition unit, generating the low-resolution input image, and performing image segmentation of the generated low-resolution input image to acquire a first output image; and a ninth step of inputting the high-resolution input image and the first output image to the second image recognition unit and performing image segmentation of the high-resolution input image by the second image recognition unit using the first output image to acquire a second output image.

An image recognition method according to one aspect is an image recognition method executed by an image recognition device that performs image segmentation of an input image. The image recognition device includes a first image recognition unit that executes downsampling to lower the resolution of the input image, generates a low-resolution input image, and performs image segmentation of the generated low-resolution input image, and a second image recognition unit that performs image segmentation of the input image at a higher resolution than the low-resolution input image. The method includes: an eighth step of inputting the input image to the first image recognition unit, generating the low-resolution input image, and performing image segmentation of the generated low-resolution input image to acquire a first output image; and a ninth step of inputting the high-resolution input image and the first output image to the second image recognition unit and performing image segmentation of the high-resolution input image by the second image recognition unit using the first output image to acquire a second output image.

An image recognition device according to one aspect includes a first image recognition unit that executes downsampling to lower the resolution of an input image, generates a low-resolution input image, and performs image segmentation of the generated low-resolution input image, and a second image recognition unit that performs image segmentation of the input image at a higher resolution than the low-resolution input image. When the input image is input, the first image recognition unit generates the low-resolution input image, performs image segmentation of the generated low-resolution input image to generate a first output image, and outputs the generated first output image to the second image recognition unit. When the high-resolution input image and the first output image are input, the second image recognition unit performs image segmentation of the high-resolution input image using the first output image and outputs a second output image.

FIG. 1 is a diagram illustrating an outline of an image recognition device according to the embodiment.
FIG. 2 is a diagram illustrating an outline of an image recognition unit of the image recognition device according to the embodiment.
FIG. 3 is a diagram illustrating an example of an input image input to the image recognition device.
FIG. 4 is a diagram illustrating an example of an output image output from the image recognition device.
FIG. 5 is a diagram illustrating an example of a learning data set.
FIG. 6 is a diagram illustrating an example of a process related to image learning of the image recognition device.
FIG. 7 is a diagram illustrating an example of a process related to image learning of the image recognition device.
FIG. 8 is a diagram illustrating an example of a process related to image learning of the image recognition device.
FIG. 9 is a diagram illustrating an example of a process related to image recognition of the image recognition device.
FIG. 10 is a diagram illustrating an example of a process related to image recognition of the image recognition device.
FIG. 11 is a diagram illustrating an example of a process related to image recognition of the image recognition device.

Hereinafter, the present invention will be described in detail with reference to the drawings. The present invention is not limited by the following modes for carrying out the invention (hereinafter referred to as embodiments). The constituent elements in the following embodiments include those that a person skilled in the art could easily conceive of, those that are substantially the same, and those within a so-called range of equivalents.

(Embodiment)
FIG. 1 is a diagram illustrating an outline of an image recognition device according to the embodiment. The image recognition device 1 recognizes objects included in an input image I and outputs the recognition result as an output image O. A captured image taken by an imaging device such as a camera is input to the image recognition device 1 as the input image I. The image recognition device 1 performs image segmentation on the input image I. Image segmentation means labeling the divided image regions of a digital image with classes, and is also called class inference (class classification). That is, image segmentation determines which class each predetermined divided image region of the digital image belongs to, and attaches an identifier (label) that identifies the class of that image region. The image recognition device 1 outputs the image obtained by performing image segmentation (class inference) on the input image I as the output image O.

The image recognition device 1 is provided, for example, in an in-vehicle recognition camera of a vehicle. The in-vehicle recognition camera captures the driving situation of the vehicle in real time at a predetermined frame rate and inputs the captured images to the image recognition device 1. The image recognition device 1 acquires the captured images input at the predetermined frame rate as input images I. The image recognition device 1 classifies the objects included in each input image I into classes and outputs the classified image as an output image O at the predetermined frame rate. Note that the image recognition device 1 is not limited to being mounted on an in-vehicle recognition camera and may be provided in another device.

First, the input image I will be described with reference to FIG. 3. FIG. 3 is a diagram illustrating an example of an input image I input to the image recognition device 1. The input image I is a digital image composed of a plurality of pixels. The input image I is, for example, an image generated by an image sensor provided in an imaging device such as a camera, with a resolution corresponding to the number of pixels of the image sensor. That is, the input image I is the high-resolution original image, to which neither upsampling processing that increases the number of pixels nor downsampling processing that decreases the number of pixels has been applied.

Next, the output image O will be described with reference to FIG. 4. FIG. 4 is a diagram illustrating an example of an output image O output from the image recognition device 1. The output image O is divided into regions by class. The classes correspond, for example, to objects included in the input image I, such as a person, a car, a road, and a building. In the output image O, each pixel is classified into an object class and labeled with that class, so that the image is divided into regions by class. FIG. 4 illustrates, for example, an image region Oa classified into the person class, an image region Ob classified into the car class, and an image region Oc classified into the road class. Note that the output image O in FIG. 4 is an example, and the class classification is not limited to this one. The output image O has the same resolution as the input image I.
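As a minimal sketch of what per-pixel labeling produces, the following assumes that a network yields one score per class for every pixel and that the label is the highest-scoring class. The score representation, the `label_map` helper, and the sample values are assumptions for illustration; the text only states that each pixel is labeled with a class.

```python
# Example classes taken from the description; the index order is arbitrary.
CLASSES = ["person", "car", "road", "building"]

def label_map(scores):
    """scores[y][x] is a list of per-class scores; return the best class index per pixel."""
    return [[max(range(len(px)), key=px.__getitem__) for px in row]
            for row in scores]

# Hypothetical class scores for a 2 x 2 image.
scores = [
    [[0.1, 0.2, 0.6, 0.1], [0.7, 0.1, 0.1, 0.1]],
    [[0.1, 0.8, 0.05, 0.05], [0.2, 0.2, 0.5, 0.1]],
]
labels = label_map(scores)
names = [[CLASSES[i] for i in row] for row in labels]
```

The resulting label map has one entry per input pixel, which matches the statement that the output image O has the same resolution as the input image I.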

Referring again to FIG. 1, the image recognition device 1 will be described. The image recognition device 1 includes a control unit 5, a storage unit 6, and an image recognition unit 7.

The storage unit 6 stores programs and data. The storage unit 6 may also be used as a work area that temporarily stores processing results of the control unit 5. The storage unit 6 may include any storage device, such as a semiconductor storage device or a magnetic storage device, and may include a plurality of types of storage devices. The storage unit 6 may also include a combination of a portable storage medium, such as a memory card, and a reading device for the storage medium.

The storage unit 6 holds, as programs, an image learning program P1 and an image recognition program P2. The image learning program P1 is a program for causing the image recognition unit 7 to perform learning. The image recognition program P2 is a program for causing the image recognition unit 7 to perform image recognition. The storage unit 6 also holds, as data, various images and a learning data set. The various images include the input image I input to the image recognition device 1 and the output image O output from the image recognition device 1. The learning data set is the data used for learning of the image recognition unit 7.

The control unit 5 comprehensively controls the operation of the image recognition device 1 to realize various functions. The control unit 5 includes, for example, an integrated circuit such as a CPU (Central Processing Unit). Specifically, the control unit 5 executes instructions included in the programs stored in the storage unit 6 and controls the image recognition unit 7 and other components to realize the various functions.

For example, by executing the image learning program P1, the control unit 5 causes the image recognition unit 7 to perform learning using the learning data set. Likewise, by executing the image recognition program P2, the control unit 5 causes the image recognition unit 7 to perform image recognition of the input image I.

Next, the image recognition unit 7 will be described with reference to FIG. 2. FIG. 2 is a diagram illustrating an outline of the image recognition unit of the image recognition device according to the embodiment. The image recognition unit 7 includes an integrated circuit such as a GPU (Graphics Processing Unit). The image recognition unit 7 includes a first image recognition unit 11 and a second image recognition unit 12. When the input image I is input, the image recognition unit 7 inputs the input image I to both the first image recognition unit 11 and the second image recognition unit 12.

The first image recognition unit 11 executes a task of lowering the resolution of the input image I, capturing the input image I broadly, and dividing it into regions. The first image recognition unit 11 performs, for example, image segmentation using semantic segmentation. Semantic segmentation infers a class for each pixel of the input image I and labels each pixel with the inferred class, thereby dividing the input image I into regions. When the input image I is input, the first image recognition unit 11 classifies each pixel of the input image I into a class and outputs the classified image as a first output image O1. In other words, compared with the second image recognition unit 12 described later, the first image recognition unit 11 treats the input image I globally when dividing it into regions and outputting the first output image O1.

The first image recognition unit 11 performs image segmentation using a neural network that includes convolutional layers (hereinafter also simply called a network), such as a CNN (Convolutional Neural Network) or an FCN (Fully Convolutional Network). The first image recognition unit 11 has a downsampling layer 21, an encoder 22, and a decoder 23.

The downsampling layer 21 executes downsampling that lowers the resolution of the input image I to generate a low-resolution input image I. The downsampling layer 21 outputs the low-resolution input image I to the encoder 22. For example, the downsampling layer 21 halves the number of pixels in both the height H and the width W of the input image I, reducing the resolution of the input image I to 1/4.
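The arithmetic of halving H and W can be illustrated with a small sketch. The text does not say how the downsampling layer 21 computes the reduced pixels, so averaging each 2 x 2 block is an assumption here, and `downsample2x` is a hypothetical helper name.

```python
def downsample2x(image):
    """Halve height and width by averaging each 2x2 block (assumed method).

    Assumes the image is a list of equal-length rows with even H and W.
    """
    return [[(image[y][x] + image[y][x + 1]
              + image[y + 1][x] + image[y + 1][x + 1]) / 4.0
             for x in range(0, len(image[0]), 2)]
            for y in range(0, len(image), 2)]

image = [[1, 3, 5, 7],
         [1, 3, 5, 7],
         [2, 2, 6, 6],
         [2, 2, 6, 6]]
low = downsample2x(image)

pixels_before = len(image) * len(image[0])
pixels_after = len(low) * len(low[0])
ratio = pixels_after / pixels_before
```

Halving both dimensions leaves one quarter of the pixels, which is the 1/4 resolution mentioned above.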

The encoder 22 executes encoding processing on the low-resolution input image I. The encoding processing generates a feature map by extracting feature amounts from the low-resolution input image I, while executing downsampling (also called pooling) that lowers the resolution of the feature map. Specifically, in the encoding processing, the low-resolution input image I is processed by convolutional layers and pooling layers. In a convolutional layer, a kernel (filter) for extracting feature amounts of the input image I is moved across the input image I at a predetermined stride. A convolution calculation for extracting the feature amounts of the input image I is then performed based on the weights of the convolutional layer, and this calculation generates a feature map containing the extracted feature amounts. The number of feature maps generated corresponds to the number of channels of the kernel. In a pooling layer, the feature map containing the extracted feature amounts is reduced to generate a lower-resolution feature map. In the encoding processing, the processing in the convolutional layer and the processing in the pooling layer are repeated a plurality of times to generate a feature map having downsampled feature amounts.
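One convolution followed by one pooling operation can be sketched for a single channel as follows. The kernel values, the absence of padding, and the use of 2 x 2 max pooling are assumptions chosen for illustration; the text specifies only that a kernel is moved at a stride and that pooling reduces the feature map.

```python
def conv2d(image, kernel, stride=1):
    """Single-channel convolution without padding (cross-correlation, as in CNN layers)."""
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    return [[sum(kernel[i][j] * image[y * stride + i][x * stride + j]
                 for i in range(k) for j in range(k))
             for x in range(out_w)]
            for y in range(out_h)]

def max_pool2x2(fmap):
    """Reduce a feature map by keeping the maximum of each 2x2 block (assumed pooling)."""
    return [[max(fmap[y][x], fmap[y][x + 1], fmap[y + 1][x], fmap[y + 1][x + 1])
             for x in range(0, len(fmap[0]) - 1, 2)]
            for y in range(0, len(fmap) - 1, 2)]

# 5 x 5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical kernel that responds to a dark-to-bright transition from left to right.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = conv2d(image, kernel)   # 4 x 4 feature map
pooled = max_pool2x2(feature_map)     # 2 x 2 feature map
```

The pooled map still records that an edge exists in the left half, but no longer records its exact column, which is the loss of local information that the description attributes to downsampling.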

The decoder 23 executes decoding processing on the feature map produced by the encoding processing. The decoding processing executes upsampling (also called unpooling) that raises the resolution of the feature map. Specifically, in the decoding processing, the feature map is processed by deconvolution layers and unpooling layers. In an unpooling layer, the low-resolution feature map containing the feature amounts is enlarged to generate a higher-resolution feature map. In a deconvolution layer, a deconvolution calculation for restoring the feature amounts contained in the feature map is executed based on the weights of the deconvolution layer, and this calculation generates a feature map with restored feature amounts. In the decoding processing, the processing in the unpooling layer and the processing in the deconvolution layer are repeated a plurality of times to generate the first output image O1, which is an upsampled, region-divided image. The first output image O1 is upsampled until it has the same resolution as the input image I input to the image recognition unit 7.
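The repeated enlargement performed by the unpooling layers can be sketched as below. Nearest-neighbour copying is an assumed enlargement rule, the deconvolution calculation is omitted, and `unpool2x` and `upsample_to` are hypothetical names; the sketch also assumes the target size is the starting size multiplied by a power of two.

```python
def unpool2x(fmap):
    """Double height and width by copying each value into a 2x2 block (assumed rule)."""
    return [[fmap[y // 2][x // 2] for x in range(2 * len(fmap[0]))]
            for y in range(2 * len(fmap))]

def upsample_to(fmap, target_h, target_w):
    """Repeat the enlargement until the map reaches the input image size."""
    while len(fmap) < target_h or len(fmap[0]) < target_w:
        fmap = unpool2x(fmap)
    return fmap

small = [[1, 2],
         [3, 4]]
restored = upsample_to(small, 8, 8)   # two enlargements: 2x2 -> 4x4 -> 8x8
```

Each value of the small map ends up covering a 4 x 4 block of the restored map, so region boundaries recovered this way are coarse.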

As described above, the first image recognition unit 11 downsamples the input image I to obtain a low-resolution input image I. The first image recognition unit 11 then executes the encoding processing and the decoding processing on the low-resolution input image I and performs class inference (class classification) pixel by pixel, thereby performing image segmentation of the low-resolution input image I. The first image recognition unit 11 outputs the image obtained by dividing the low-resolution input image I into regions by class as the first output image O1.

Although the first image recognition unit 11 has been described as performing image segmentation using semantic segmentation, it is not limited to this. As long as it can execute the task of lowering the resolution of the input image I, capturing the input image I broadly, and dividing it into regions, the first image recognition unit 11 may, for example, perform image segmentation using a different network.

Here, the calculation amount of the image recognition performed by the first image recognition unit 11 will be described. The calculation amount of a convolutional layer is given by Expression (1). K is the kernel size used in the convolutional layer. H × W is the image size of the input image I. C is the number of channels of the kernel. C' is the number of channels of the feature map input to the convolutional layer. ST is the stride with which the kernel moves across the input image I.

Calculation amount of convolutional layer = (K^2 × H × W × C × C') / ST ... (1)

In the first image recognition unit 11, the downsampling layer 21 lowers the resolution of the input image I. The image size (H × W) in Expression (1) therefore becomes smaller, and the calculation amount of the convolutional layers becomes smaller accordingly. Thus, by lowering the resolution of the input image I, the image recognition task performed by the first image recognition unit 11 on the input image I becomes a task with a low computational load.
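Expression (1) can be evaluated directly to see the effect of the downsampling layer. The function below follows the expression as written in the description; the concrete values of K, H, W, C, C' and ST are illustrative assumptions and do not come from the text.

```python
def conv_layer_cost(k, h, w, c, c_in, st):
    """Calculation amount of one convolutional layer per Expression (1)."""
    return (k ** 2 * h * w * c * c_in) / st

# Hypothetical layer: 3x3 kernel, 64 kernel channels, 3 input channels, stride 1.
full_res = conv_layer_cost(k=3, h=512, w=1024, c=64, c_in=3, st=1)
# Same layer after H and W are each halved by the downsampling layer.
low_res = conv_layer_cost(k=3, h=256, w=512, c=64, c_in=3, st=1)

ratio = low_res / full_res
```

Because H × W enters the expression as a plain product, halving both dimensions cuts the calculation amount of the layer to one quarter, whatever the other parameters are.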

The output side of the first image recognition unit 11 is connected to the input side of the second image recognition unit 12. The first image recognition unit 11 therefore inputs the first output image O1 to the second image recognition unit 12. The first image recognition unit 11 also outputs the first output image O1 to the outside as an intermediate image.

The second image recognition unit 12 executes a task of dividing the input image I into regions while capturing it more locally than the first image recognition unit 11 does. Like the first image recognition unit 11, the second image recognition unit 12 performs image segmentation using, for example, semantic segmentation. The input image I and the first output image O1 are input to the second image recognition unit 12. The second image recognition unit 12 performs image segmentation without lowering the resolution of the input image I. When the input image I and the first output image O1 are input, the second image recognition unit 12 uses the first output image O1 to classify each pixel of the input image I, and outputs the classified image as a second output image O2. That is, the second image recognition unit 12 divides the input image I into regions using the first output image O1 as a hint, and outputs the second output image O2.

The second image recognition unit 12 executes a feature amount extraction process that extracts feature amounts of the input image I. The second image recognition unit 12 further executes a fusion process that integrates the first output image O1 with the image subjected to the feature amount extraction process, and performs class inference on a pixel-by-pixel basis.

The feature amount extraction process extracts feature amounts of the input image I in a plurality of convolutional layers, and is substantially the same as the processing of the convolutional layers in the encoder 22, except that the pooling layers are omitted. In each convolutional layer, a convolution calculation for extracting feature amounts of the input image I is executed based on the weights of the convolutional layer, and this calculation generates a feature map in which the feature amounts have been extracted. In the feature amount extraction process, the convolution calculation is executed on the input image I a plurality of times to generate the feature map.

The fusion process merges the feature map subjected to the feature amount extraction process, using the first output image O1 as a hint, and performs class inference to generate an image divided into regions for each class, thereby producing the second output image O2 at the same resolution as the input image I.

As described above, the second image recognition unit 12 executes the feature amount extraction process and the fusion process on the input image I, and performs class inference (class classification) on a pixel-by-pixel basis using the first output image O1 as a hint, thereby performing image segmentation of the input image I. The second image recognition unit 12 then outputs the segmented input image I as the second output image O2.
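The embodiment does not specify how the fusion process combines the first output image O1 with the feature map. The following Python sketch assumes one possible form: the class label from O1 is one-hot encoded, concatenated to the feature vector of each pixel, and passed to a linear per-pixel classifier. The names `one_hot` and `fuse_and_classify`, the weights, and the tiny 1 × 2 image are all hypothetical.

```python
def one_hot(label, num_classes):
    """Encode a class id from O1 as a one-hot vector."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def fuse_and_classify(features, hint_labels, weights, num_classes):
    """features: H x W x F nested lists from feature extraction.
    hint_labels: H x W class ids taken from the first output image O1.
    weights: one row per class, each of length F + num_classes (assumed linear classifier)."""
    output = []
    for feat_row, hint_row in zip(features, hint_labels):
        out_row = []
        for feat, hint in zip(feat_row, hint_row):
            fused = feat + one_hot(hint, num_classes)  # assumed fusion: channel concatenation
            scores = [sum(w * x for w, x in zip(row, fused)) for row in weights]
            out_row.append(scores.index(max(scores)))  # per-pixel class inference
        output.append(out_row)
    return output

features = [[[0.9], [0.1]]]    # 1 x 2 image with one feature channel
hint = [[0, 0]]                # O1 labels both pixels as class 0
weights = [[-1.0, 0.5, 0.0],   # class 0
           [1.0, 0.0, 0.5]]    # class 1
o2 = fuse_and_classify(features, hint, weights, num_classes=2)
print(o2)  # [[1, 0]]
```

In this toy case the hint favors class 0 for both pixels, but the strong local feature at the first pixel overrides it, which is the kind of local refinement the second image recognition unit 12 is described as providing.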

Here, in the second image recognition unit 12, the number of channels C of the kernel and the number of channels C' of the feature map input to the convolutional layer are made smaller than in the first image recognition unit 11. The product of C and C' represents the expressive power of the image recognition. That is, the expressive power of the second image recognition unit 12 is smaller than that of the first image recognition unit 11. This is acceptable because the second image recognition unit 12 uses the first output image O1 as a hint when segmenting the input image I, so the accuracy of image recognition can be ensured even when the expressive power is small.

Although the second image recognition unit 12 has been described as performing image segmentation by semantic segmentation, it is not limited to this. As long as it can execute the task of dividing the input image I into regions while capturing it more locally than the first image recognition unit 11 does, the second image recognition unit 12 may, for example, perform image segmentation using a different network.

Here, the calculation amount for image recognition by the second image recognition unit 12 will be described. The second image recognition unit 12 has a smaller expressive power than the first image recognition unit 11. The expressive power (C × C') in Expression (1) therefore becomes small, and so does the calculation amount of the convolutional layers. By reducing the expressive power of the image recognition, the image recognition task performed by the second image recognition unit 12 thus becomes a task with a low calculation load.

The second image recognition unit 12 outputs the second output image O2 to the outside. The second image recognition unit 12 can also output the first output image O1 that was used to generate the second output image O2, in association with that second output image O2.

In summary, the first image recognition unit 11 downsamples the input image I to a low-resolution image in order to divide it into regions while capturing it more globally than the second image recognition unit 12 does, and its task therefore has a low calculation load. The second image recognition unit 12, when dividing the input image I into regions while capturing it more locally than the first image recognition unit 11 does, uses the first output image O1. Image recognition with a lower expressive power than that of the first image recognition unit 11 therefore suffices for the second image recognition unit 12, and its task also has a low calculation load.
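To make the combined effect concrete, the following Python sketch compares, for a single convolutional layer, a hypothetical full-resolution, full-channel network against the sum of a downsampled layer (first image recognition unit 11) and a reduced-channel layer (second image recognition unit 12). The downsampling factor, channel counts, and the assumption that both units have comparable layer counts are illustrative only and are not taken from the embodiment.

```python
def conv_cost(k, h, w, c, c_in, stride=1):
    """Expression (1): (K^2 * H * W * C * C') / ST."""
    return (k ** 2 * h * w * c * c_in) / stride

H, W = 512, 1024  # hypothetical input image size

# Hypothetical reference: full resolution with full expressive power.
baseline = conv_cost(3, H, W, 64, 64)

# First unit: resolution reduced by 4 per side, full channel counts.
unit1 = conv_cost(3, H // 4, W // 4, 64, 64)

# Second unit: full resolution, channel counts C and C' reduced to a quarter.
unit2 = conv_cost(3, H, W, 16, 16)

combined_ratio = (unit1 + unit2) / baseline
print(combined_ratio)  # 0.125
```

With these assumed values each unit costs one sixteenth of the reference layer, so running both still requires only one eighth of the reference calculation amount.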

Next, learning of the image recognition device 1 will be described. A learning data set is used for the learning of the image recognition device 1. FIG. 5 is a diagram illustrating an example of the learning data set. The learning data set includes a learning image, which is an image to be learned, and a teacher image corresponding to the learning image. The learning image is a digital image, like the input image I. The teacher image is the correct segmentation result for the learning image, that is, an image divided into regions. The teacher image is generated by annotation work.

The learning data set includes a first learning data set D1 used for learning of the first image recognition unit 11 and a second learning data set D2 used for learning of the second image recognition unit 12.

As shown in FIG. 5, the first learning data set D1 includes a first learning image G1 and a first teacher image T1. The first learning image G1 is an image to be learned by the first image recognition unit 11 and, like the input image I, is a digital image. The first teacher image T1 is an image divided into regions for each class on a pixel-by-pixel basis. The first teacher image T1 shown in FIG. 5 includes, for example, an image region T1a classified into the person class, an image region T1b classified into the car class, and an image region T1c classified into the road class.

The second learning data set D2 includes a second learning image G2 and a second teacher image T2. The second learning image G2 is an image to be learned by the second image recognition unit 12 and, like the input image and the first learning image G1, is a digital image. In FIG. 5, the first learning image G1 and the second learning image G2 are the same image for simplicity of explanation, but they may be different images. Like the first teacher image T1, the second teacher image T2 is an image divided into regions for each class on a pixel-by-pixel basis. Like the first teacher image T1, the second teacher image T2 shown in FIG. 5 includes an image region T2a classified into the person class, an image region T2b classified into the car class, and an image region T2c classified into the road class.

Here, the first teacher image T1 of the first learning data set D1 has a lower resolution than the second teacher image T2 of the second learning data set D2. In other words, although both the first teacher image T1 and the second teacher image T2 are images divided into regions on a pixel-by-pixel basis, the first teacher image T1 has a small image size and the second teacher image T2 has a large image size.
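The embodiment does not state how the lower-resolution first teacher image T1 is prepared. As one possibility, the following Python sketch assumes that it is derived from a higher-resolution label image by nearest-neighbour subsampling, which keeps every class id valid (averaging would produce meaningless intermediate ids). The function name `downsample_labels` and the 4 × 4 label grid are hypothetical.

```python
def downsample_labels(labels, factor):
    """Nearest-neighbour subsampling: keep every `factor`-th pixel in each dimension."""
    return [row[::factor] for row in labels[::factor]]

# Hypothetical high-resolution teacher image (class ids: 0, 1, 2).
t2 = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 2, 2],
    [2, 2, 2, 2],
]

# Hypothetical low-resolution teacher image derived from it.
t1 = downsample_labels(t2, 2)
print(t1)  # [[0, 1], [2, 2]]
```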

In the embodiment, the first teacher image T1 has a lower resolution than the second teacher image T2, but this is not a limitation. The first teacher image T1 and the second teacher image T2 may have the same resolution. That is, the first learning data set D1 and the second learning data set D2 may be the same learning data set. In other words, the first image recognition unit 11 and the second image recognition unit 12 may both be trained using a single learning data set.

Next, processing related to learning of the image recognition device 1 using the first learning data set D1 and the second learning data set D2 will be described with reference to FIGS. 6 to 8. FIGS. 6 to 8 are diagrams illustrating an example of processing related to image learning of the image recognition device. In the learning of the image recognition device 1, the first image recognition unit 11 is trained first, and the second image recognition unit 12 is trained afterwards.

With reference to FIG. 6, the process of training the first image recognition unit 11 using the first learning data set D1 will be described. In this process, a step (fifth step) is executed in which the first learning image G1 is input to the first image recognition unit 11, the first image recognition unit 11 performs image segmentation of the first learning image G1, and the first output image O1 is acquired.

Specifically, the first learning image G1 of the first learning data set D1 is input to the first image recognition unit 11 of the image recognition device 1 (step S1). When the first learning image G1 is input, the first image recognition unit 11 treats it as the input image and downsamples it (step S2). The first image recognition unit 11 executes the encoding process on the first learning image G1 whose resolution has been lowered (step S3). By executing the encoding process, the first image recognition unit 11 generates a low-resolution feature map containing the downsampled feature amounts. The first image recognition unit 11 executes the decoding process on the feature map containing the downsampled feature amounts (step S4). By executing the decoding process, the first image recognition unit 11 upsamples the low-resolution feature map while restoring it, bringing it to the same resolution as the first learning image G1. The first image recognition unit 11 then executes class inference that divides the image into regions for each class on a pixel-by-pixel basis (step S5). The first image recognition unit 11 acquires the first output image O1 as the result of the class inference (step S6).
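The resolution flow of steps S2 to S5 can be sketched in Python as follows. This is only a toy illustration: the encoder and decoder networks are replaced by a hypothetical per-pixel scoring function `score`, downsampling is assumed to be average pooling, and upsampling is assumed to be nearest-neighbour repetition. None of these choices or names comes from the embodiment.

```python
def downsample(image, f):
    """Average-pool an H x W grayscale image by factor f (stand-in for step S2)."""
    h, w = len(image), len(image[0])
    return [[sum(image[y + dy][x + dx] for dy in range(f) for dx in range(f)) / (f * f)
             for x in range(0, w, f)]
            for y in range(0, h, f)]

def score(pixel):
    """Hypothetical stand-in for the encoder and decoder: two class scores per pixel."""
    return [1.0 - pixel, pixel]

def upsample_nearest(grid, f):
    """Repeat every cell f times in each dimension to restore the input resolution."""
    return [[cell for cell in row for _ in range(f)] for row in grid for _ in range(f)]

def first_unit(image, f=2):
    low = downsample(image, f)                             # S2: lower the resolution
    scores = [[score(p) for p in row] for row in low]      # S3-S4: placeholder for encode/decode
    up = upsample_nearest(scores, f)                       # S4: back to the input resolution
    return [[s.index(max(s)) for s in row] for row in up]  # S5: per-pixel class inference

image = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0]]
o1 = first_unit(image)
print(o1)  # [[0, 0, 1, 1], [0, 0, 1, 1]]
```

The output has the same height and width as the input even though the scoring ran on a grid with one quarter as many pixels.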

Next, in the process of training the first image recognition unit 11, a step of acquiring a first error of the first output image O1 with respect to the first teacher image T1 (step S7: sixth step) is executed.

Specifically, in step S7, upon acquiring the first output image O1, the first image recognition unit 11 acquires the first teacher image T1 of the first learning data set D1. From the acquired first teacher image T1 and the first output image O1, the first image recognition unit 11 calculates the amount of error between them as the first error. The error amount is calculated using the Cross Entropy function.
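A minimal Python sketch of a per-pixel cross entropy error is given here for illustration. It assumes that the network output is available as per-pixel class probabilities and that the error is averaged over all pixels; the embodiment names only the Cross Entropy function, so the averaging, the function name `pixel_cross_entropy`, and the sample values are assumptions.

```python
import math

def pixel_cross_entropy(probs, teacher):
    """Mean cross entropy over all pixels.
    probs: H x W x num_classes predicted probabilities.
    teacher: H x W class ids from the teacher image."""
    total, count = 0.0, 0
    for prob_row, label_row in zip(probs, teacher):
        for p, label in zip(prob_row, label_row):
            total += -math.log(p[label])  # penalize low probability on the correct class
            count += 1
    return total / count

# Hypothetical 1 x 2 image with two classes.
probs = [[[0.5, 0.5], [0.25, 0.75]]]
teacher = [[0, 1]]
error = pixel_cross_entropy(probs, teacher)
print(round(error, 4))  # 0.4904
```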

Then, in the process of training the first image recognition unit 11, a step (seventh step) of correcting the image segmentation performed by the first image recognition unit 11 based on the first error is executed.

Specifically, upon acquiring the first error, the first image recognition unit 11 trains the weights of the convolutional layers and deconvolutional layers of the network and updates the network so that the error in the network is corrected by error backpropagation based on the error amount (step S8). With the execution of step S8, the first image recognition unit 11 ends the learning for that first learning data set D1. The first image recognition unit 11 repeats steps S1 to S8 as many times as there are first learning data sets D1.

Next, with reference to FIGS. 7 and 8, the process of training the second image recognition unit 12 using the second learning data set D2 will be described. In this process, the first image recognition unit 11 has already been trained, and the first output image O1 output from the first image recognition unit 11 is used. In the process of training the second image recognition unit 12, a step (first step) is executed in which the second learning image G2 is input to the first image recognition unit 11, the first image recognition unit 11 performs image segmentation of the second learning image G2, and the first output image O1 is acquired.

Specifically, as shown in FIG. 7, the second learning image G2 of the second learning data set D2 is input to the first image recognition unit 11 of the image recognition device 1 (step S11). When the second learning image G2 is input, the first image recognition unit 11 treats it as the input image and downsamples it (step S12). The first image recognition unit 11 executes the encoding process on the second learning image G2 whose resolution has been lowered (step S13). By executing the encoding process, the first image recognition unit 11 generates a low-resolution feature map containing the downsampled feature amounts. The first image recognition unit 11 executes the decoding process on the feature map containing the downsampled feature amounts (step S14). By executing the decoding process, the first image recognition unit 11 upsamples the low-resolution feature map while restoring it, bringing it to the same resolution as the second learning image G2. The first image recognition unit 11 then executes class inference that divides the image into regions for each class on a pixel-by-pixel basis (step S15). The first image recognition unit 11 acquires the first output image O1 as the result of the class inference (step S16).

Next, in the process of training the second image recognition unit 12, a step (second step) is executed in which the second learning image G2 and the first output image O1 are input to the second image recognition unit 12, the second image recognition unit 12 performs image segmentation of the second learning image G2 using the first output image O1, and the second output image O2 is acquired.

Specifically, as shown in FIG. 8, the second learning image G2 of the second learning data set D2 is input to the second image recognition unit 12 of the image recognition device 1 (step S21). When the second learning image G2 is input, the second image recognition unit 12 treats it as the input image and executes the feature amount extraction process on it (step S22). By executing the feature amount extraction process, the second image recognition unit 12 generates a feature map containing the feature amounts. The second image recognition unit 12 also executes the fusion process on the feature map containing the feature amounts (step S23). By executing the fusion process, the second image recognition unit 12 restores the feature map subjected to the feature amount extraction process, using the first output image O1 as a hint. The second image recognition unit 12 then executes, from the feature map, class inference that divides the image into regions for each class on a pixel-by-pixel basis (step S24). The second image recognition unit 12 acquires the second output image O2 as the result of the class inference (step S25).

Next, in the process of training the second image recognition unit 12, a step of acquiring a second error of the second output image O2 with respect to the second teacher image T2 (step S26: third step) is executed.

Specifically, in step S26, upon acquiring the second output image O2, the second image recognition unit 12 acquires the second teacher image T2 of the second learning data set D2. From the acquired second teacher image T2 and the second output image O2, the second image recognition unit 12 calculates the amount of error between them as the second error. The error amount is calculated using the Cross Entropy function.

Then, in the process of training the second image recognition unit 12, a step (fourth step) of correcting the image segmentation performed by the second image recognition unit 12 based on the second error is executed.

Specifically, upon acquiring the second error, the second image recognition unit 12 trains the weights of the convolutional layers of the network and updates the network so that the error in the network is corrected by error backpropagation based on the error amount (step S27). In step S27, the learning based on the second error trains the second image recognition unit 12 while blocking the learning of the first image recognition unit 11. That is, the second error is backpropagated to the second image recognition unit 12 but is not backpropagated to the first image recognition unit 11. In step S27, therefore, the network of the second image recognition unit 12 is corrected, while the network of the first image recognition unit 11 is not. With the execution of step S27, the second image recognition unit 12 ends the learning for that second learning data set D2. The second image recognition unit 12 repeats steps S21 to S27 as many times as there are second learning data sets D2.
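The blocking of backpropagation in step S27 can be illustrated with a toy Python sketch. It assumes a scalar model in which the first unit computes o1 = w1 * x and the second unit computes o2 = w2 * x + v * o1, and it uses a squared error rather than the Cross Entropy function to keep the gradients simple. The function name `train_second_unit` and all parameter names and values are hypothetical.

```python
def train_second_unit(x, target, w1, w2, v, lr=0.1, steps=50):
    """Update only the second unit's parameters (w2, v); the first unit's w1 stays fixed."""
    for _ in range(steps):
        o1 = w1 * x                    # first unit forward pass, treated as a constant hint
        o2 = w2 * x + v * o1           # second unit uses the input and the hint
        grad_o2 = 2.0 * (o2 - target)  # derivative of the squared error (o2 - target)^2
        w2 -= lr * grad_o2 * x         # gradient step for the second unit
        v -= lr * grad_o2 * o1         # gradient step for the second unit
        # w1 is deliberately not updated: the error is not propagated into the first unit.
    return w1, w2, v

w1, w2, v = train_second_unit(x=1.0, target=2.0, w1=0.5, w2=0.0, v=0.0)
final_error = (w2 * 1.0 + v * (w1 * 1.0) - 2.0) ** 2
print(w1)  # 0.5
```

In a deep learning framework the same effect is typically obtained by detaching the first unit's output from the computation graph or by excluding its parameters from the optimizer; which mechanism the embodiment uses is not stated.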

In this way, in the learning of the image recognition device 1, the first image recognition unit 11 is trained using the first learning data set D1 so that it captures the first learning image G1 globally. The second image recognition unit 12 is trained using the second learning data set D2 so that it captures the second learning image G2 locally.

Next, image recognition by the trained image recognition device 1 will be described with reference to FIGS. 9 and 10. FIGS. 9 and 10 are diagrams illustrating an example of processing related to image recognition of the image recognition device. In the processing related to image recognition of the image recognition device 1, a step (eighth step) is executed in which the input image I is input to the first image recognition unit 11, the first image recognition unit 11 performs image segmentation of the input image I, and the first output image O1 is acquired.

Specifically, as shown in FIG. 9, the input image I is input to the image recognition device 1 (step S31). When the input image I is input, the first image recognition unit 11 downsamples the input image I (step S32). The first image recognition unit 11 executes the encoding process on the input image I whose resolution has been lowered (step S33). By executing the encoding process, the first image recognition unit 11 generates a low-resolution feature map containing the downsampled feature amounts. The first image recognition unit 11 executes the decoding process on the feature map containing the downsampled feature amounts (step S34). By executing the decoding process, the first image recognition unit 11 upsamples the low-resolution feature map while restoring it, bringing it to the same resolution as the input image I. The first image recognition unit 11 then executes class inference that divides the image into regions for each class on a pixel-by-pixel basis (step S35). The first image recognition unit 11 acquires the first output image O1 as the result of the class inference (step S36).

Next, in the processing related to image recognition of the image recognition device 1, a step (ninth step) is executed in which the input image I and the first output image O1 are input to the second image recognition unit 12, the second image recognition unit 12 performs image segmentation of the input image I using the first output image O1, and the second output image O2 is acquired.

Specifically, as shown in FIG. 10, the input image I is input to the second image recognition unit 12 of the image recognition device 1 (step S41). When the input image I is input, the second image recognition unit 12 executes the feature amount extraction process on the input image I (step S42). By executing the feature amount extraction process, the second image recognition unit 12 generates a feature map containing the feature amounts from the input image I. The second image recognition unit 12 also executes the fusion process on the feature map containing the feature amounts (step S43). By executing the fusion process, the second image recognition unit 12 restores the feature map subjected to the feature amount extraction process, using the first output image O1 as a hint. The second image recognition unit 12 then executes, from the feature map, class inference that divides the image into regions for each class on a pixel-by-pixel basis (step S44). The second image recognition unit 12 acquires the second output image O2 as the result of the class inference (step S45).

In this way, in the image recognition of the image recognition device 1, the first image recognition unit 11, unlike the second image recognition unit 12, can downsample the input image I and therefore performs image recognition as a task with a low calculation load. The second image recognition unit 12 divides the input image I into regions using the first output image O1 and therefore also performs image recognition as a task with a low calculation load.

As part of the processing related to image recognition, upon acquiring the second output image O2, the image recognition device 1 executes a step of outputting the second output image O2 at a predetermined frame rate. To reduce the calculation load, the image recognition device 1 may output a mixture of acquired first output images O1 and acquired second output images O2 at the predetermined frame rate. That is, for frames in which the first output image O1 is output, the image recognition device 1 may perform image recognition without executing image recognition by the second image recognition unit 12.
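The embodiment does not specify how first and second output images are interleaved. The following Python sketch assumes one simple policy: the second image recognition unit 12 runs on every N-th frame and the first output image O1 is emitted on all other frames. The function name `schedule_outputs` and the parameter `refine_every` are hypothetical.

```python
def schedule_outputs(num_frames, refine_every):
    """Return, for each frame, which output image is emitted.
    'O2': both units run. 'O1': only the first unit runs, the second unit is skipped."""
    return ["O2" if i % refine_every == 0 else "O1" for i in range(num_frames)]

plan = schedule_outputs(num_frames=6, refine_every=3)
print(plan)  # ['O2', 'O1', 'O1', 'O2', 'O1', 'O1']
```

Under this assumed policy the second unit runs on only one third of the frames while the output frame rate stays constant.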

The image recognition device 1 also performs the processing shown in FIG. 11 as processing related to image recognition. FIG. 11 is a diagram illustrating an example of processing related to image recognition by the image recognition device. In the processing shown in FIG. 11, a step (tenth step) is executed in which the first output image O1 acquired by image recognition and the second output image O2 corresponding to the first output image O1 are acquired in association with each other.

Specifically, the first image recognition unit 11 acquires the first output image O1 as an intermediate image (step S51). The second image recognition unit 12 acquires the second output image O2 corresponding to the first output image O1 (step S52). The image recognition device 1 then acquires the first output image O1 and the second output image O2 in association with each other (step S53).

The acquired first output image O1 and second output image O2 are used when evaluating or analyzing image recognition by the image recognition device 1. For example, when image recognition by the image recognition device 1 exhibits a defect such as misrecognition, comparing the first output image O1 with the second output image O2 makes it possible to estimate whether the abnormality lies in the first image recognition unit 11 or in the second image recognition unit 12. That is, when the second output image O2 contains a misrecognition and the first output image O1 does not, it can be estimated that the second image recognition unit 12 has an abnormality. Conversely, when the second output image O2 contains a misrecognition and the first output image O1 also contains one, it can be estimated that the first image recognition unit 11 has an abnormality.
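
The estimation rule above can be written as a small decision function. This is only a sketch: it assumes a reference label map `truth` at the same resolution as both outputs and treats any error rate above a tolerance `tol` as a misrecognition. The record type, function names, and tolerance are hypothetical and not part of the embodiment.

```python
from dataclasses import dataclass
from typing import List

Labels = List[List[int]]


@dataclass
class RecognitionRecord:
    """First and second output images stored in association (cf. steps S51 to S53)."""
    o1: Labels
    o2: Labels


def error_rate(pred: Labels, truth: Labels) -> float:
    """Fraction of pixels whose label differs from the reference."""
    total = sum(len(row) for row in truth)
    wrong = sum(p != t for pr, tr in zip(pred, truth) for p, t in zip(pr, tr))
    return wrong / total


def suspect_unit(rec: RecognitionRecord, truth: Labels, tol: float = 0.0) -> str:
    """Apply the rule in the text: O2 wrong and O1 right points at the second unit,
    O2 wrong and O1 wrong points at the first unit."""
    o1_bad = error_rate(rec.o1, truth) > tol
    o2_bad = error_rate(rec.o2, truth) > tol
    if not o2_bad:
        return "none"
    return "first image recognition unit" if o1_bad else "second image recognition unit"
```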

As described above, learning by the image recognition device 1 according to the embodiment can be divided into learning by the first image recognition unit 11 and learning by the second image recognition unit 12. In the learning of the first image recognition unit 11, a task of performing region division while capturing the learning images G1 and G2 globally can be executed. In the learning of the second image recognition unit 12, a task of performing region division while capturing the learning image G2 locally can be executed. Therefore, even when local information is lost in the first image recognition unit 11, the second image recognition unit 12 can capture it. Local information is, for example, a small object or a distant object. The learning of the image recognition device 1 according to the embodiment can thus train image recognition in which loss of local information is suppressed.

The learning of the first image recognition unit 11 can be executed as a task with a low computational load by reducing the learning images G1 and G2 to low resolution. The learning of the second image recognition unit 12 can be executed as a task with a low computational load by reducing the expressive power of the image recognition. The image recognition device 1 according to the embodiment can therefore learn efficiently with a low computational load.

Because the first image recognition unit 11 can be trained using the first learning data set D1, accurate learning suited to the first image recognition unit 11 can be performed. Similarly, because the second image recognition unit 12 can be trained using the second learning data set D2, accurate learning suited to the second image recognition unit 12 can be performed.

In the second image recognition unit 12, the first output image O1 can be given the same resolution as the second learning image G2 or the input image I. The second image recognition unit 12 can therefore handle the first output image O1 in the same way as the second learning image G2.

In the learning of the second image recognition unit 12, the acquired second error is not propagated back to the first image recognition unit 11, so the learning of the second image recognition unit 12 has no effect on the first image recognition unit 11.
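
The effect of blocking the second error can be seen in a deliberately tiny scalar model. The sketch below is an assumption-laden illustration, not the networks of the embodiment: each unit is reduced to a single weight (`w1`, `w2`), the fusion is modeled as a sum, and the second error is a squared error. In an automatic-differentiation framework, the same effect is typically obtained with a stop-gradient or detach operation on the first output.

```python
from typing import Iterable, Tuple


def train_second_unit(w1: float, w2: float,
                      samples: Iterable[Tuple[float, float]],
                      lr: float = 0.05, epochs: int = 200) -> Tuple[float, float]:
    """Update only the second unit's weight w2; w1 is never touched."""
    samples = list(samples)
    for _ in range(epochs):
        for x, t in samples:
            o1 = w1 * x                   # first step: first output, treated as a constant
            fused = x + o1                # second step: input combined with the hint
            o2 = w2 * fused               # second output
            err = o2 - t                  # third step: second error (E2 = err ** 2)
            grad_w2 = 2.0 * err * fused   # dE2/dw2
            w2 -= lr * grad_w2            # fourth step: correct the second unit only
            # No gradient is computed for w1: the second error is not
            # propagated to the first unit.
    return w1, w2


# Targets follow t = 3x; with w1 = 0.5 the fused input is 1.5x, so w2 tends to 2.
w1_after, w2_after = train_second_unit(0.5, 0.0, [(1.0, 3.0), (2.0, 6.0)])
```

After training, `w1_after` is still exactly 0.5, while `w2_after` has moved close to 2.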

Image recognition by the image recognition device 1 according to the embodiment can likewise be divided into image recognition by the first image recognition unit 11 and image recognition by the second image recognition unit 12. In the image recognition of the first image recognition unit 11, a task of performing region division while capturing the input image I globally can be executed. In the image recognition of the second image recognition unit 12, a task of performing region division while capturing the input image I locally can be executed. Therefore, even when local information is lost in the first image recognition unit 11, the second image recognition unit 12 can capture it. The image recognition of the image recognition device 1 according to the embodiment can thus suppress loss of local information.

The image recognition of the first image recognition unit 11 can be executed as a task with a low computational load by reducing the input image I to low resolution. The image recognition of the second image recognition unit 12 can be executed as a task with a low computational load by reducing the expressive power of the image recognition. Because the computational load is low, the image recognition device 1 according to the embodiment can perform image recognition at high speed.

In image recognition by the image recognition device 1, the first output image O1 and the second output image O2 can be acquired in association with each other. This makes it possible to estimate abnormalities in the first image recognition unit 11 and the second image recognition unit 12 when evaluating or analyzing the image recognition of the image recognition device 1.

In image recognition by the image recognition device 1, the first output image O1 and the second output image O2 can also be mixed and output at a predetermined frame rate. When the first output image O1 is output, image recognition by the first image recognition unit 11 can be executed without executing image recognition by the second image recognition unit 12, which further reduces the computational load of image recognition.

In the embodiment, the image recognition unit 7 includes the first image recognition unit 11 and the second image recognition unit 12, but the configuration is not limited to this. The image recognition unit 7 need only include at least the first image recognition unit 11 and the second image recognition unit 12, and may include three or more image recognition units that recognize images of different resolutions.

REFERENCE SIGNS LIST
1 Image recognition device
5 Control unit
6 Storage unit
7 Image recognition unit
11 First image recognition unit
12 Second image recognition unit
21 Downsampling layer
22 Encoder
23 Decoder
I Input image
O Output image
P1 Image learning program
P2 Image recognition program
D1 First learning data set
D2 Second learning data set

Claims

An image learning program executed by an image recognition device that performs image segmentation, wherein
a learning data set used for learning by the image recognition device includes
a learning image to be learned by the image recognition device, and
a teacher image corresponding to the learning image,
the image recognition device includes
a first image recognition unit that performs downsampling to lower the resolution of the learning image, thereby generating a low-resolution learning image, and performs image segmentation of the generated low-resolution learning image, and
a second image recognition unit that performs image segmentation of a high-resolution learning image having a higher resolution than the low-resolution learning image, and
the image learning program causes the image recognition device to execute:
a first step of inputting the learning image to the first image recognition unit, generating the low-resolution learning image, performing image segmentation of the generated low-resolution learning image, and obtaining a first output image;
a second step of inputting the high-resolution learning image and the first output image to the second image recognition unit, and performing, by the second image recognition unit, image segmentation of the high-resolution learning image using the first output image to obtain a second output image;
a third step of obtaining a second error of the second output image with respect to the teacher image; and
a fourth step of correcting the image segmentation process by the second image recognition unit based on the second error.

The image learning program according to claim 1, wherein
the learning data set includes a first learning data set for learning by the first image recognition unit and a second learning data set for learning by the second image recognition unit,
the first learning data set includes a first learning image and a first teacher image corresponding to the first learning image,
the second learning data set includes a second learning image and a second teacher image corresponding to the second learning image,
the second teacher image has a higher resolution than the first teacher image,
before the first step is executed, the image learning program further causes execution of
a fifth step of inputting the first learning image to the first image recognition unit, performing image segmentation of the first learning image by the first image recognition unit, and obtaining the first output image,
a sixth step of obtaining a first error of the first output image with respect to the first teacher image, and
a seventh step of correcting the image segmentation process by the first image recognition unit based on the first error,
in the first step, the second learning image is input to the first image recognition unit to obtain the first output image,
in the second step, the second learning image and the first output image are input to the second image recognition unit to obtain the second output image, and
in the third step, the second error of the second output image with respect to the second teacher image is obtained.

The image learning program, wherein, in the second step, the first output image is given the same resolution as the high-resolution learning image and is input to the second image recognition unit.

The image learning program according to claim 1, wherein, in the fourth step, correction of the image segmentation process by the first image recognition unit based on the second error is blocked.

An image learning method performed by an image recognition device that performs image segmentation, wherein
a learning data set used for learning by the image recognition device includes
a learning image to be learned by the image recognition device, and
a teacher image corresponding to the learning image,
the image recognition device includes
a first image recognition unit that performs downsampling to lower the resolution of the learning image, thereby generating a low-resolution learning image, and performs image segmentation of the generated low-resolution learning image, and
a second image recognition unit that performs image segmentation of a high-resolution learning image having a higher resolution than the low-resolution learning image, and
the image learning method includes:
a first step of inputting the learning image to the first image recognition unit, generating the low-resolution learning image, performing image segmentation of the generated low-resolution learning image, and obtaining a first output image;
a second step of inputting the high-resolution learning image and the first output image to the second image recognition unit, and performing, by the second image recognition unit, image segmentation of the high-resolution learning image using the first output image to obtain a second output image;
a third step of obtaining a second error of the second output image with respect to the teacher image; and
a fourth step of correcting the image segmentation process by the second image recognition unit based on the second error.

An image recognition program executed by an image recognition device that performs image segmentation of an input image, wherein
the image recognition device includes
a first image recognition unit that performs downsampling to lower the resolution of the input image, thereby generating a low-resolution input image, and performs image segmentation of the generated low-resolution input image, and
a second image recognition unit that performs image segmentation of a high-resolution input image having a higher resolution than the low-resolution input image, and
the image recognition program causes the image recognition device to execute:
an eighth step of inputting the input image to the first image recognition unit, generating the low-resolution input image, performing image segmentation of the generated low-resolution input image, and obtaining a first output image; and
a ninth step of inputting the high-resolution input image and the first output image to the second image recognition unit, and performing, by the second image recognition unit, image segmentation of the high-resolution input image using the first output image to obtain a second output image.

The image recognition program according to claim 6, further comprising: executing a tenth step of associating the acquired first output image with the second output image corresponding to the first output image.

The image recognition program, wherein, in the ninth step, the first output image is given the same resolution as the high-resolution input image and is input to the second image recognition unit.

The image recognition program according to claim 6, further causing execution of an eleventh step of mixing the acquired first output image and the acquired second output image and outputting them at a predetermined frame rate.

An image recognition method performed by an image recognition device that performs image segmentation of an input image, wherein
the image recognition device includes
a first image recognition unit that performs downsampling to lower the resolution of the input image, thereby generating a low-resolution input image, and performs image segmentation of the generated low-resolution input image, and
a second image recognition unit that performs image segmentation of a high-resolution input image having a higher resolution than the low-resolution input image, and
the image recognition method includes:
an eighth step of inputting the input image to the first image recognition unit, generating the low-resolution input image, performing image segmentation of the generated low-resolution input image, and obtaining a first output image; and
a ninth step of inputting the high-resolution input image and the first output image to the second image recognition unit, and performing, by the second image recognition unit, image segmentation of the high-resolution input image using the first output image to obtain a second output image.

An image recognition device comprising:
a first image recognition unit that performs downsampling to lower the resolution of an input image, thereby generating a low-resolution input image, and performs image segmentation of the generated low-resolution input image; and
a second image recognition unit that performs image segmentation of a high-resolution input image having a higher resolution than the low-resolution input image, wherein
when the input image is input, the first image recognition unit generates the low-resolution input image, performs image segmentation of the generated low-resolution input image, generates a first output image, and outputs the generated first output image to the second image recognition unit, and
when the high-resolution input image and the first output image are input, the second image recognition unit performs image segmentation of the high-resolution input image using the first output image and outputs a second output image.

The image recognition device according to claim 11, wherein the first image recognition unit and the second image recognition unit perform image segmentation by semantic segmentation.