JP2019028650A

JP2019028650A - Image identification device, learning device, image identification method, learning method and program

Info

Publication number: JP2019028650A
Application number: JP2017146337A
Authority: JP
Inventors: 雅人青葉; Masahito Aoba
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2017-07-28
Filing date: 2017-07-28
Publication date: 2019-02-21

Abstract

To provide an image identification device, a learning device, an image identification method, a learning method, and a program capable of performing image recognition with high accuracy without depending on human intuition. SOLUTION: An image processing device includes acquisition means 1100 for acquiring an input image composed of sensor values of an imaging device, and identification means 1300 for identifying the acquired input image using a discriminator having a conversion unit. At least the conversion unit of the discriminator is trained on the basis of learning images composed of sensor values of an imaging device and correct answer data attached to the learning images. SELECTED DRAWING: Figure 1

Description

The present invention relates to image identification techniques such as classifying an image into predetermined classes and dividing an image into regions of a plurality of classes.

Classification of images into predetermined classes has been widely studied, and in recent years tasks that classify images into a very large number of classes have also been studied. Non-Patent Document 1 discloses an image classification technique based on deep learning.

Region segmentation, which divides an image into a plurality of regions, has also been widely studied, and the problem of cutting out semantic regions from an image, such as person regions, automobile regions, road regions, building regions, and sky regions, is being actively researched. This problem is called semantic segmentation and is considered applicable to image correction adapted to the type of subject, scene interpretation, and the like.

For semantic segmentation, many methods have been proposed that identify the class label at each position of an image in units of small regions (superpixels). Superpixels are cut out from the image mainly as small regions having similar features, and various methods have been proposed, including that of Non-Patent Document 2. The class label of each superpixel obtained in this way is identified using the feature amounts inside it, possibly together with context feature amounts from its surroundings. Usually, region identification is performed by training such a local region discriminator using a variety of learning images.

In recent years, region segmentation using deep learning has also been studied. In Non-Patent Document 3, the outputs of intermediate layers of a CNN (Convolutional Neural Network) are used as feature amounts, and the per-pixel class determination results based on a plurality of intermediate-layer features are integrated to perform semantic segmentation of an image. The method of Non-Patent Document 3 determines the class directly for each pixel without using a superpixel segmentation result such as that described above.

In image identification such as image classification and semantic segmentation, the input data given to the discriminator is usually an image developed either by internal camera processing or by the user after shooting. Because a developed image is originally intended to be viewed and enjoyed by the user, the development method is determined on the basis of visual appeal. Such an ordinary development method, however, is not necessarily suitable for an image identification task. For example, in an image that is slightly overexposed so that a white wall looks attractive, it is difficult to distinguish the wall from a textureless overcast sky. An image shot darker, in which the texture of the wall is visible, is considered more suitable for distinguishing the wall from the sky.

Patent Document 1 proposes selecting a correction function from among a plurality of gamma correction functions according to the average luminance value of a RAW image obtained from an imaging device, thereby generating, separately from the display image, an image with suppressed underexposure and overexposure, and using that image for object detection processing.

Patent Document 1: Japanese Patent Laid-Open No. 2014-11767

Non-Patent Document 1: "ImageNet Classification with Deep Convolutional Neural Networks", A. Krizhevsky, I. Sutskever, G. E. Hinton, Proc. Neural Information Processing Systems, 2012.
Non-Patent Document 2: "SLIC Superpixels", R. Achanta, A. Shaji, K. Smith, A. Lucchi, EPFL Technical Report, 2010.
Non-Patent Document 3: "Fully Convolutional Networks for Semantic Segmentation", Long, Shelhamer, and Darrell, CVPR 2015.
Non-Patent Document 4: "Hypercolumns for Object Segmentation and Fine-Grained Localization", B. Hariharan, P. Arbelaez, R. Girshick and J. Malik, CVPR 2015.

However, the correction functions prepared in Patent Document 1 have parameters set intuitively by a person, so the quality of the correction depends on that intuition, and in some cases high-accuracy image identification cannot be performed. An object of the present invention is therefore to enable high-accuracy image identification.

The present invention includes acquisition means for acquiring an input image composed of sensor values of an imaging device, and identification means for identifying the acquired input image using a discriminator having a conversion unit, wherein at least the conversion unit of the discriminator has been trained on the basis of learning images composed of sensor values of an imaging device and correct answer data attached to the learning images.

According to the present invention, highly accurate image identification can be performed.

FIG. 1 is a schematic block diagram showing the functional configuration of the image processing apparatus in each embodiment.
FIG. 2 is a flowchart showing the learning processing and the identification processing in the first embodiment.
FIG. 3 is a diagram explaining class labels in semantic segmentation in the first embodiment.
FIG. 4 is a diagram explaining the structure of the discriminator in the first embodiment.
FIG. 5 is a diagram explaining the gamma correction function and its approximate functions in the first embodiment.
FIG. 6 is a diagram explaining the configuration of the conversion unit in the first embodiment.
FIG. 7 is a diagram explaining the configuration of the discriminator in each embodiment.
FIG. 8 is a diagram explaining the configuration of the adjustment unit in the first embodiment.
FIG. 9 is a diagram explaining the configuration of the discriminator in each embodiment.

[First Embodiment]
The first embodiment of the present invention is described in detail below with reference to the drawings. FIG. 1 is a schematic block diagram showing the functional configuration of the image processing apparatus according to each embodiment, and FIG. 1(a) is the schematic block diagram for the present embodiment. The image processing apparatus functions as a learning device during learning and as an image identification device during identification.

Such an image processing apparatus has a hardware configuration including a CPU, ROM, RAM, HDD, and the like, and the functional configurations and flowchart processing described later are realized, for example, by the CPU executing a program stored in the ROM, HDD, or the like. The RAM has a storage area that functions as a work area in which the CPU loads and executes programs. The ROM has a storage area for storing programs and the like executed by the CPU. The HDD has a storage area for storing various programs required when the CPU executes processing and various data, including data on threshold values.

First, an outline of the apparatus configuration during learning in the present embodiment is described. Here, learning means generating, from learning images prepared in advance, the discriminator used for the identification processing described later. Details of the processing are given later.

First learning data is prepared in advance in the first learning data storage unit 5100. The first learning data consists of a plurality of developed learning images and the class labels assigned to each learning image. The first learning data acquisition unit 2100 reads the learning images and class labels from the first learning data storage unit 5100. The first learning unit 2200 trains the first discriminator from the error between the class labels and the identification results obtained by inputting the learning images into the discriminator. The first discriminator obtained by learning is stored in the first discriminator storage unit 5200.

Second learning data is prepared in advance in the second learning data storage unit 5300. The second learning data consists of learning images composed of sensor values before development, obtained by an imaging device such as a digital camera, and the class labels assigned to each learning image. The second learning data acquisition unit 2300 reads the learning images and class labels from the second learning data storage unit 5300. The conversion unit addition unit 2400 reads the first discriminator from the first discriminator storage unit 5200 and generates a second discriminator by adding a conversion unit to its input side. The second learning unit 2500 trains the second discriminator from the error between the class labels and the identification results obtained by inputting the learning images of the second learning data into the second discriminator. The second discriminator obtained by learning is stored in the second discriminator storage unit 5400.

Next, an outline of the apparatus configuration during identification is described. Here, identification means performing image identification on an unknown input image. Details of the processing are given later.

The input data acquisition unit 1100 reads an input image composed of sensor values before development, obtained by an imaging device, and the shooting information corresponding to that input image. The discriminator setting unit 1200 reads the second discriminator, obtained in advance by learning, from the second discriminator storage unit 5400 and sets it. The identification unit 1300 inputs the acquired input image to the conversion unit of the set second discriminator and obtains an identification result. The identification result is sent to the identification result output unit 1400 and presented to the user or to another device.

In the present embodiment, the functional configuration for learning and that for identification are described as being realized by the same apparatus (the image processing apparatus), but they may be realized by separate apparatuses. Likewise, the first learning data acquisition unit 2100, the first learning unit 2200, the second learning data acquisition unit 2300, the conversion unit addition unit 2400, and the second learning unit 2500 are described as all being realized on the same apparatus, but each may be an independent module, or they may be realized as programs running on the apparatus. The first learning data storage unit 5100, the first discriminator storage unit 5200, the second learning data storage unit 5300, and the second discriminator storage unit 5400 are realized as storage inside or outside the apparatus.

The input data acquisition unit 1100, the discriminator setting unit 1200, the identification unit 1300, and the identification result output unit 1400 may all be realized on the same apparatus or may each be an independent module. They may be realized as programs running on the apparatus, or implemented as circuits or programs inside an imaging apparatus such as a camera. When learning and identification are realized by separate apparatuses, the second discriminator storage unit 5400 may be a different storage in each. In that case, the discriminator obtained during learning is copied or moved to the storage of the identification apparatus before use.

Next, the processing according to the present embodiment is described in detail. Here, semantic segmentation of an image is used as the example of image identification. First, the processing during learning is described. FIG. 2(a) is a flowchart showing the processing during learning in the present embodiment.

In the first learning data acquisition step S2100, the first learning data acquisition unit 2100 reads learning images and class labels from the first learning data storage unit 5100 as learning data. A plurality of developed learning images and their class labels are prepared in advance in the first learning data storage unit 5100. Specifically, a learning image is image data captured by a digital camera or the like and developed by a development program inside or outside the camera. Such images are usually provided in a format such as JPEG, PNG, or BMP, but the present embodiment is not limited to any particular format of learning image. Let the number of prepared first learning images be N_1, and denote the n-th first learning image by I_n (n = 1, ..., N_1). A class label in semantic segmentation is an identification class assigned as a label to each region of a learning image.

FIG. 3 shows an example of class labels in semantic segmentation. Reference numeral 500 in FIG. 3(a) is an example of a learning image, and 540 in FIG. 3(b) is an example of the class labels corresponding to the learning image 500. Correct answer data in which correct class labels are given for an image in this way is generally called GT (Ground Truth) in image identification. Class labels such as sky 541, hair 542, face 543, clothes 544, flower 545, and leaves and stem 546 are given to the respective regions. Semantic class labels are used as the example here, but class labels based on region attributes, such as glossy surface, matte surface, or high-frequency region, may be given instead. A class in which a plurality of types of objects appear together, such as sky and tree branches, may also be defined. Assume that there are M region classes in total. The set of class labels corresponding to all pixels of the learning image I_n is denoted GT_n.

In the first learning step S2200, the first learning unit 2200 trains the first discriminator using the first learning data. Here, a CNN (Convolutional Neural Network) is used as the first discriminator. A CNN is a neural network that performs identification tasks: by repeating convolution layers and pooling layers many times, local features of the input signal are progressively aggregated and converted into features that are robust to deformation and positional shift.

FIG. 4 is a diagram explaining the structure of the CNN used in the present embodiment. The CNN consists of 610 and 620 in FIG. 4(a), which are called the convolution layers and the fully connected layers, respectively. Their respective roles are feature extraction and pattern classification, and so as not to lose generality they are referred to below as the feature extraction unit 610 and the classification unit 620. Reference numeral 611 is the input layer, which receives as signals the results of convolution operations at each position of the input image 630. Reference numerals 612 and 613 are intermediate layers, and signals are sent to the final layer 615 via a plurality of intermediate layers. In each layer, a convolution layer and a pooling layer are arranged in sequence, and convolution operations and signal selection by pooling are repeated. The output signal of the final layer 615 of the feature extraction unit 610 is sent to the classification unit 620. In the classification unit 620, the elements of each layer are fully connected to the preceding and following layers, and signals are sent to the output layer 640 by product-sum operations with weight coefficients. The output layer 640 has as many output elements as the number of classes M; the signal strengths of the elements are compared, and the class corresponding to the element that outputs the largest signal becomes the output class label of that pixel.
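The decision rule at the output layer, choosing for each pixel the class whose output element carries the largest signal, can be sketched as follows. This is a minimal illustration in Python with NumPy rather than the patent's implementation; the function name `predict_labels` and the (H, W, M) array layout are assumptions made for this example.

```python
import numpy as np

def predict_labels(scores):
    """Pick the per-pixel class label from output-layer signals.

    scores: array of shape (H, W, M), one signal per class per pixel
            (this layout is an assumption of the sketch).
    Returns an (H, W) integer array of labels in 1..M.
    """
    scores = np.asarray(scores, dtype=float)
    # argmax gives a 0-based index; the text numbers classes from 1 to M.
    return np.argmax(scores, axis=-1) + 1

# Toy example: a 1 x 2 image with M = 3 classes.
scores = np.array([[[0.1, 0.7, 0.2],
                    [0.6, 0.3, 0.1]]])
labels = predict_labels(scores)
```

In this toy case the first pixel is assigned class 2 and the second pixel class 1.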

Learning is performed by comparing the value of the output signal obtained at the output layer when the learning image I_n is input to the CNN with a teacher signal. Here, the teacher signal t(n, i, j, m) of the output element corresponding to class m at pixel (i, j) of the learning image I_n is defined as in Formula 1 below when the class label of pixel (i, j) in GT_n is C (C = 1, ..., M).
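Formula 1 itself does not appear in this text. A one-hot definition consistent with the surrounding description (one teacher value per class m, determined by the correct label C) would take the following form; this is an assumption based on common practice, not a confirmed reproduction of the patent's formula.

```latex
t(n,i,j,m) =
\begin{cases}
1 & \text{if } m = C \\
0 & \text{otherwise}
\end{cases}
\qquad \text{(Formula 1, assumed form)}
```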

Let y(n, i, j, m) be the output signal of the output element (i, j, m) corresponding to class m at position (i, j) of the output layer, obtained by inputting the input image I_n to the discriminator. The error at the output element (i, j, m) is then calculated as in Formula 2 below.

E_{(n,i,j,m)} = (y(n,i,j,m) - t(n,i,j,m))^2 (Formula 2)
The error is propagated backward in sequence from the output layer to the input layer by the error backpropagation method, and the weight coefficients of each layer of the CNN are updated by stochastic gradient descent or a similar method.
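The per-element error of Formula 2 can be sketched as below. The sketch assumes a one-hot teacher signal (1 for the correct class, 0 otherwise), which is a common choice but is not confirmed by this text; the function names and the toy output values are hypothetical.

```python
import numpy as np

def teacher_signal(gt, num_classes):
    """One-hot teacher signal t(i, j, m) from a label map with values in 1..M.

    ASSUMPTION: t = 1 for the correct class and 0 otherwise.
    """
    gt = np.asarray(gt, dtype=int)
    return np.eye(num_classes)[gt - 1]          # shape (H, W, M)

def squared_error(y, t):
    """Formula 2: E = (y - t)^2, evaluated for every output element."""
    return (np.asarray(y, dtype=float) - t) ** 2

gt = np.array([[1, 3]])                  # 1 x 2 label map, M = 3
t = teacher_signal(gt, num_classes=3)
y = np.array([[[0.8, 0.1, 0.1],
               [0.2, 0.2, 0.6]]])        # hypothetical CNN outputs
E = squared_error(y, t)
```

Backpropagation would then differentiate these errors with respect to the weight coefficients; that step is omitted here.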

The weight coefficients of the CNN may start from random initial values in learning, but weight coefficients already learned for some other task may also be given as initial values. For example, class labels for a region segmentation task must be given for each pixel, whereas for an image classification task a single label per image suffices. The labor required for a person to enter class labels as GT in advance therefore differs by a factor of tens to hundreds. As a result, large numbers of learning images for image classification tasks are publicly available and easy to obtain. For example, ILSVRC (ImageNet Large-scale Visual Recognition Challenge) publishes 1.2 million learning images for image classification. The CNN may therefore first be trained on such an image classification task, and learning of the region segmentation task may then be started using the resulting weight coefficients as initial values. The weight coefficients of the CNN learned in the first learning step in this way are stored in the first discriminator storage unit as the parameters of the discriminator.

Next, in the second learning data acquisition step S2300, the second learning data acquisition unit 2300 reads undeveloped learning images, shooting information, and class labels from the second learning data storage unit 5300 as learning data.

A plurality of undeveloped learning images and their class labels are prepared in advance in the second learning data storage unit 5300. Shooting information for the learning images is also available. Specifically, a learning image is an array of sensor values of an image sensor such as a CCD or CMOS sensor, captured by a digital camera or the like and not yet subjected to development processing, and is generally called a RAW image. From the values in a RAW image, the absolute amount of luminance of each color channel at each pixel can be obtained using the shooting information, as follows.

Assume that the luminance Bv value of the entire image and the proper level value opt of the sensor are obtained as shooting information. When the sensor measurement value of the R channel in the Bayer array of the RAW image corresponding to pixel position (i, j) of an image is R_RAW, the value R_Bv, which is the absolute amount of luminance in the R channel of pixel (i, j), can be obtained by Formula 3 below.

G_Bv and B_Bv, the absolute luminance amounts of the G channel and the B channel, are likewise obtained from Formula 4 and Formula 5.

By using these conversion formulas, a map of absolute luminance amounts can be obtained for each second learning image given as a RAW image.

Let the number of prepared second learning images be N_2, and denote by J_n (n = 1, ..., N_2) the learning image given by the map of absolute luminance amounts obtained by converting the n-th second learning image. The class labels have the same definition as those of the first learning images. The set of class labels corresponding to all pixels of the learning image J_n is denoted GT_n. Because images accompanied by RAW data are harder to collect than images without it, the number of second learning images N_2 is generally smaller than N_1. In fact, many amateur photographers do not publish RAW images, so most images that can be collected from the web and similar sources are not accompanied by RAW data. The learning images used in the second learning data can also be developed and used in the first learning data.

In the conversion unit addition step S2400, the conversion unit addition unit 2400 reads the first discriminator from the first discriminator storage unit 5200 and adds a conversion unit to the input layer side of the first discriminator that was read.

First, the weight coefficients of the CNN learned in the first learning step S2200 are read from the first discriminator storage unit 5200 and set in the CNN. A conversion unit 650 is then added to the configured CNN as shown in FIG. 4(b). The learning image 651 of absolute luminance amounts, prepared in the second learning data acquisition step S2300, is input to the input side of the conversion unit 650. The image 652 converted by passing through the conversion unit 650 is input to the input layer 611 of the CNN in the same way as a developed image.
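The structural idea of this step, placing a conversion unit in front of an already trained discriminator so that the converted image is fed to the original input layer, can be sketched as a simple composition. The class name `SecondDiscriminator` and the two toy stand-in functions are hypothetical and only serve to make the sketch runnable without a real CNN.

```python
import numpy as np

class SecondDiscriminator:
    """First discriminator with a conversion unit placed on its input side."""

    def __init__(self, conversion_unit, first_discriminator):
        self.conversion_unit = conversion_unit          # plays the role of unit 650
        self.first_discriminator = first_discriminator  # pre-trained CNN

    def __call__(self, luminance_map):
        converted = self.conversion_unit(luminance_map)  # corresponds to image 652
        return self.first_discriminator(converted)

# Hypothetical stand-ins so the sketch runs without a real CNN.
def toy_conversion(x):
    return np.tanh(x)

def toy_discriminator(image):
    # "Class 2" where the converted value exceeds 0.5, otherwise "class 1".
    return np.where(image > 0.5, 2, 1)

second = SecondDiscriminator(toy_conversion, toy_discriminator)
result = second(np.array([[0.1, 2.0]]))
```

Keeping the two parts as separate objects mirrors the description, in which the first discriminator's learned weights are reused and the conversion unit is the newly added part.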

The conversion unit is added as a new layer of the CNN. Normally, conversion from a RAW image to a developed image involves gamma correction and white balance adjustment. The gamma correction function is defined as in Formula 6 below.

y = x^γ (Formula 6)
Here, x is the value of the RAW image at an arbitrary pixel, that is, the sensor value of the imaging device, and y is the luminance value of that pixel after development. The value of the control parameter γ differs depending on the camera, the manufacturer, the shooting mode, and so on. FIG. 5 is a diagram explaining the gamma correction function and its approximate functions in the present embodiment, and FIG. 5(a) shows examples of the gamma correction function. In FIG. 5(a), 701 is the gamma correction curve for γ = 1, 702 for γ = 0.5, and 703 for γ = 2. White balance corresponds to weighting these corrected luminance values per channel. Now consider approximating the gamma correction function of the input signal as in Formula 7 below.
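As a quick numerical illustration of Formula 6, the three curves 701, 702, and 703 can be evaluated as below. The sketch assumes sensor values normalised to the range [0, 1], which the text does not state; the function name `gamma_correct` is hypothetical.

```python
import numpy as np

def gamma_correct(x, gamma):
    """Formula 6: y = x ** gamma.

    ASSUMPTION: x holds sensor values normalised to [0, 1].
    """
    x = np.asarray(x, dtype=float)
    return np.power(x, gamma)

x = np.array([0.0, 0.25, 1.0])
y_701 = gamma_correct(x, 1.0)   # gamma = 1: identity
y_702 = gamma_correct(x, 0.5)   # gamma = 0.5: brightens mid-tones
y_703 = gamma_correct(x, 2.0)   # gamma = 2: darkens mid-tones
```

For the mid-tone value 0.25, γ = 0.5 yields 0.5 and γ = 2 yields 0.0625, while both end points stay fixed at 0 and 1.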

y = w_2 tanh(w_1 x + b_1 · z_1) + b_2 · z_2 (Formula 7)
Here, z_1 and z_2 are variables that vary with the shooting environment, and w_1, w_2, b_1, and b_2 are weight coefficients. As shown in FIG. 5(b), this function can serve as an approximation of the gamma correction function. Curves 711, 712, and 713 in FIG. 5(b) are the functions given by Formula 8, Formula 9, and Formula 10 below, respectively.

ｙ＝１．１ｔａｎｈ（ｘ−０．５）＋０．５（数式８）
ｙ＝５ｔａｎｈ（ｘ＋１）−３．８（数式９）
ｙ＝５ｔａｎｈ（ｘ−２）＋４．８（数式１０）
図６は本実施形態における変換部の構成を示す図であり、数式７における形式は、図６（ａ）のような素子の組み合わせで表現することができる。輝度絶対量による学習画像６５１の任意の画素と、変換後の画像６５２における対応画素は、素子６５３および素子６５４によって結合される。学習画像６５１における輝度絶対値は、数式７ではｘに相当し、重みｗ_１で重みづけされて、入出力関数として非線形関数のｔａｎｈを持つ素子６５３に入力される。素子６５３の出力信号は重みｗ_２で重みづけされ、単調増加の線形関数を入出力関数として持つ素子６５４に入力される。 y = 1.1 tanh (x−0.5) +0.5 (Formula 8)
y = 5 tanh (x + 1) -3.8 (Formula 9)
y = 5 tanh (x-2) +4.8 (Formula 10)
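As a rough numerical check of Formulas 8 to 10, the following sketch compares each tanh function with the gamma curve it appears to approximate in FIG. 5. The pairing (Formula 8 with γ = 1, Formula 9 with γ = 0.5, Formula 10 with γ = 2) and the input range 0 ≤ x ≤ 1 are assumptions made for illustration; the text does not state them explicitly.

```python
import numpy as np

def gamma_curve(x, gamma):
    """Formula 6: y = x ** gamma."""
    return np.power(x, gamma)

# Formulas 8, 9 and 10, each paired with an assumed target gamma.
approximations = {
    1.0: lambda x: 1.1 * np.tanh(x - 0.5) + 0.5,   # Formula 8
    0.5: lambda x: 5.0 * np.tanh(x + 1.0) - 3.8,   # Formula 9
    2.0: lambda x: 5.0 * np.tanh(x - 2.0) + 4.8,   # Formula 10
}

x = np.linspace(0.0, 1.0, 101)  # assumed normalized sensor range

# Largest absolute deviation of each tanh form from its gamma curve.
max_errors = {
    gamma: float(np.max(np.abs(f(x) - gamma_curve(x, gamma))))
    for gamma, f in approximations.items()
}

for gamma, err in max_errors.items():
    print(f"gamma={gamma}: max |tanh approximation - x**gamma| = {err:.3f}")
```

Under these assumptions the deviation is largest for γ = 0.5 near x = 0, where the slope of the square-root curve is unbounded and a tanh curve cannot follow it.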
FIG. 6 is a diagram showing the configuration of the conversion unit in the present embodiment. The form of Formula 7 can be expressed by a combination of elements as shown in FIG. 6A. An arbitrary pixel of the learning image 651, which is expressed in absolute luminance, and the corresponding pixel in the converted image 652 are connected through an element 653 and an element 654. The absolute luminance value in the learning image 651 corresponds to x in Formula 7; it is weighted by the weight w₁ and input to the element 653, whose input/output function is the nonlinear function tanh. The output signal of the element 653 is weighted by the weight w₂ and input to the element 654, whose input/output function is a monotonically increasing linear function.

学習画像６５１からは、シーン特徴抽出器６５５を通して、画像のシーンを記述する特徴量６５６が抽出される。シーン記述特徴量６５６は、ＨＯＧやＦｉｓｈｅｒＶｅｃｔｏｒ、色ヒストグラムなどを想定することができるが、本実施形態はその特徴量の種類によって限定されるものではない。また、シーン記述特徴は上記のように画像特徴だけに限らない。例えば、撮像画像における付帯情報として、地軸に対する撮像装置の向き情報としてのジャイロセンサ値や、時計による時刻情報から特徴量を抽出してもよい。その例を図６（ｂ）に示す。例えば、ジャイロセンサの値から地面方向を３軸の値で得ることができるため、これは正規化などすれば３次元のシーン記述特徴ベクトルとして利用することができる。また、時計による時刻情報は、１時間を１５ｄｅｇとして対応付けてｓｉｎ、ｃｏｓによる循環関数にすれば、１日を１周期とした特徴量として利用できる。カレンダーに関しても同様に１ヵ月を１５ｄｅｇとして１年を１周期とした特徴量として利用することができる。シーン記述特徴量６５６は、数式７ではｚ_１およびｚ_２に相当する。そして、重みベクトルｂ_１による積和演算により重みづけされて素子６５３へと入力され、重みベクトルｂ_２による積和演算により重みづけされて、素子６５４へと入力される。 From the learning image 651, a feature quantity 656 describing the scene of the image is extracted through the scene feature extractor 655. The scene description feature quantity 656 may be, for example, HOG, Fisher Vector, or a color histogram, but this embodiment is not limited by the type of the feature quantity. Further, the scene description feature is not limited to image features as described above. For example, a feature quantity may be extracted from supplementary information of the captured image, such as a gyro sensor value serving as orientation information of the imaging device with respect to the earth's axis, or time information from a clock. An example is shown in FIG. 6B. For example, since the ground direction can be obtained as three-axis values from the gyro sensor, it can be used as a three-dimensional scene description feature vector after normalization or the like. Time information from a clock can be used as a feature quantity with one day as one cycle by mapping one hour to 15 deg and converting it into cyclic functions with sin and cos. Similarly, the calendar can be used as a feature quantity with one year as one cycle by mapping one month to 15 deg. The scene description feature quantity 656 corresponds to z₁ and z₂ in Formula 7. 
Then, it is weighted by the product-sum operation using the weight vector b ₁ and input to the element 653, is weighted by the product-sum operation using the weight vector b ₂ , and is input to the element 654.
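The supplementary-information features described above can be sketched as follows. This is a minimal illustration, not the method of the embodiment itself: the function names, the example sensor reading, and the choice to concatenate the two features into one vector z are all assumptions.

```python
import numpy as np

def time_of_day_feature(hour):
    """Map clock time to a point on the unit circle (15 deg per hour, 24 h per cycle)."""
    angle = np.deg2rad(15.0 * hour)
    return np.array([np.sin(angle), np.cos(angle)])

def ground_direction_feature(gyro_xyz):
    """Normalize a 3-axis ground-direction reading to a unit vector."""
    v = np.asarray(gyro_xyz, dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Hypothetical scene description vector z built from supplementary information.
z = np.concatenate([
    ground_direction_feature([0.0, -9.8, 0.3]),  # made-up sensor reading
    time_of_day_feature(18.5),                   # 18:30
])
```

Using the sin and cos pair rather than the raw hour keeps times just before and just after midnight close together in feature space.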

ここでは、ｚ_１＝ｚ_２として説明したが、シーン記述特徴は、例えばｚ_１をＨＯＧ、ｚ_２をＦｉｓｈｅｒＶｅｃｔｏｒといったように、別々のものとして分けてもよい。シーン記述特徴は、ｔａｎｈカーブのバイアスを調整するための特徴であって、重み係数ベクトルｂ_１およびｂ_２で重み付けすることは、一種のシーン識別を行うことに相当する。例えば、晴れた日の屋外のシーンと、白い壁に囲まれた屋内のシーンでは、画像中に写っている物体の相違と、輝度絶対量の違いにより、異なるシーン記述特徴が得られるため、画像変換のバイアスとして異なる補正量をかけることになる。また、シーン記述特徴６５６を素子６５３および６５４に送る際に、重み係数ｂ_１およびｂ_２による線形和ではなく、多層構造のニューラルネットワークを加えてもよい。図６（ｃ）はその例を示しており、６５７および６５８はそれぞれ、入力層にシーン記述特徴６５６を入力し、１つの出力信号ｆ（ｚ）およびｇ（ｚ）を出力する多層ニューラルネットワークである。この場合、数式７は以下の数式１１のようになる。また、以降の説明では、各ニューラルネットワークｆおよびｇの結合係数をｂ_１およびｂ_２と置き換えて読めばよい。 The description here assumes z₁ = z₂, but the scene description features may be separate, for example with z₁ being HOG and z₂ being Fisher Vector. The scene description feature is a feature for adjusting the bias of the tanh curve, and weighting it with the weight coefficient vectors b₁ and b₂ corresponds to performing a kind of scene identification. For example, an outdoor scene on a sunny day and an indoor scene surrounded by white walls yield different scene description features because of the difference in the objects appearing in the image and the difference in absolute luminance, so different correction amounts are applied as the bias of the image conversion. Further, when sending the scene description feature 656 to the elements 653 and 654, a multilayer neural network may be used instead of the linear sum with the weighting factors b₁ and b₂. FIG. 6C shows an example, in which 657 and 658 are multilayer neural networks that each receive the scene description feature 656 at the input layer and output a single output signal, f(z) and g(z), respectively. In this case, Formula 7 becomes Formula 11 below. In the following description, b₁ and b₂ may be read as the coupling coefficients of the neural networks f and g.

ｙ＝ｗ_２ｔａｎｈ（ｗ_１ｘ＋ｆ（ｚ_１））＋ｂ_２・ｇ（ｚ_２）（数式１１）
素子６５４の出力信号は、そのままＣＮＮの入力層へ渡す画像６５２の対応画素の値として扱われる。このような結合をＲ_Ｂｖ、Ｇ_Ｂｖ、Ｂ_Ｂｖの各チャネルに対して持たせたとき、ｗ_２およびｂ_２の値は各チャネルのバランスを表現しており、これはホワイトバランスの値を近似するものである。このようにして、現像前の学習画像６５１は、輝度絶対量から現像処理と近似された変換により、画像変換されることになる。 y = w ₂ tanh (w ₁ x + f (z ₁ )) + b ₂ · g (z ₂ ) (Formula 11)
The output signal of the element 654 is used directly as the value of the corresponding pixel of the image 652 that is passed to the input layer of the CNN. When such connections are provided for each of the R_Bv, G_Bv, and B_Bv channels, the values of w₂ and b₂ represent the balance among the channels, which approximates the white balance values. In this way, the learning image 651 before development is converted from absolute luminance by a transformation that approximates the development processing.
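A minimal sketch of the per-channel conversion described by Formula 7 is given below, assuming z₁ = z₂ = z and NumPy arrays with the shapes noted in the docstring. The function name, argument layout, and example values are illustrative assumptions, not part of the embodiment.

```python
import numpy as np

def conversion_unit(x, z, w1, w2, B1, B2):
    """Apply Formula 7 per pixel and per channel.

    x  : (H, W, C) absolute luminance map
    z  : (D,) scene description feature (z1 = z2 = z assumed)
    w1 : (C,) input weights, w2 : (C,) output weights
    B1 : (C, D) and B2 : (C, D) bias weight vectors, one row per channel
    """
    inner_bias = B1 @ z   # b1 . z1 for each channel, shape (C,)
    outer_bias = B2 @ z   # b2 . z2 for each channel, shape (C,)
    return w2 * np.tanh(w1 * x + inner_bias) + outer_bias

# Example values chosen so that every channel reduces to Formula 8.
x = np.linspace(0.0, 1.0, 48).reshape(4, 4, 3)
z = np.array([1.0])
y = conversion_unit(
    x, z,
    w1=np.ones(3), w2=np.full(3, 1.1),
    B1=np.full((3, 1), -0.5), B2=np.full((3, 1), 0.5),
)
```

Giving each channel its own w₂ and its own row of B2 is what lets the learned values play the role of a white balance.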

第２の学習ステップＳ２５００では、第２の学習部２５００が、変換部追加ステップＳ２４００で追加された変換部とともに、識別器を学習する。変換部追加ステップＳ２４００で設定された画像変換を定義する重み係数ｗ_１、ｗ_２、ｂ_１、ｂ_２は、第２の学習データ取得部によって取得された学習画像とクラスラベルによって学習される。学習画像Ｊ_ｎが図４（ｂ）の変換部６５０に入力され、特徴抽出６１０と分類部６２０を介して出力信号が得られたら、その値をクラスラベルと比較することにより、ＣＮＮ全体と変換部の重み係数を学習する。 In the second learning step S2500, the second learning unit 2500 learns the discriminator together with the conversion unit added in the conversion unit addition step S2400. The weighting factors w₁, w₂, b₁, and b₂ that define the image conversion set in the conversion unit addition step S2400 are learned from the learning images and class labels acquired by the second learning data acquisition unit. When the learning image J_n is input to the conversion unit 650 in FIG. 4B and an output signal is obtained via the feature extraction unit 610 and the classification unit 620, that value is compared with the class label to learn the weighting factors of the entire CNN and the conversion unit.

特徴抽出部６１０と分類部６２０の結合係数は、第１の学習ステップＳ２２００で得られた値を初期値とする。変換部６５０における重み係数は、ランダムな初期値から学習させてもよい。あるいは、変換部６５０だけＣＮＮとは独立に学習させ、その状態を初期値としてＣＮＮと一緒に学習させてもよい。変換部６５０だけを初期学習させるためには、変換部６５０を３層ニューラルネットワークとみなして、現像前の輝度絶対量マップによる学習画像６５１を入力とし、素子６５３の出力信号をネットワークの出力信号とみなす。教師信号として、適正露出による現像後画像の輝度値を与えることにより、誤差逆伝搬により回帰学習を行えばよい。変換部６５０、特徴抽出部６１０、分類部６２０の重み係数の初期値が決定されたら、全てを通して学習を行う。このようにして、変換部６５０と特徴抽出部６１０、分類部６２０をすべて通して学習画像で学習させることにより、変換部６５０のパラメータも、学習画像に対して識別誤差を軽減させる方向に修正することができる。これは、画像の現像方法を、見た目の良さではなく、識別し易いように修正していることに相当する。変換部６５０と畳み込み層６１０、完全結合層６２０が学習されたら、得られた重み係数を第２の識別器記憶部５４００に記憶させる。 The coupling coefficients of the feature extraction unit 610 and the classification unit 620 are initialized with the values obtained in the first learning step S2200. The weighting factors in the conversion unit 650 may be learned from random initial values. Alternatively, the conversion unit 650 alone may be learned independently of the CNN, and the resulting state may be used as the initial value for learning together with the CNN. To perform initial learning of the conversion unit 650 alone, the conversion unit 650 is regarded as a three-layer neural network, the learning image 651 given as the pre-development absolute luminance map is used as the input, and the output signal of the element 653 is regarded as the output signal of the network. Regression learning may then be performed by error backpropagation, giving the luminance values of the developed image with proper exposure as the teacher signal. Once the initial values of the weighting factors of the conversion unit 650, the feature extraction unit 610, and the classification unit 620 have been determined, learning is performed through all of them. By learning with the learning images through the conversion unit 650, the feature extraction unit 610, and the classification unit 620 in this way, the parameters of the conversion unit 650 can also be corrected in the direction that reduces the identification error on the learning images. 
This corresponds to modifying the image development method so that images are easy to identify, rather than so that they look good. Once the conversion unit 650, the convolution layer 610, and the fully connected layer 620 have been learned, the obtained weighting factors are stored in the second discriminator storage unit 5400.
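The stand-alone initial learning of the conversion unit can be illustrated with a small regression example. This is a sketch under several assumptions: the parameters are scalars, c1 and c2 stand in for b₁·z₁ and b₂·z₂ with the scene feature held fixed, the output of the final linear element is taken as the network output, a square-root curve stands in for the developed luminance, and plain gradient descent with hand-written gradients replaces whatever optimizer would be used in practice.

```python
import numpy as np

def pretrain_conversion(x, target, steps=2000, lr=0.1):
    """Fit y = w2 * tanh(w1 * x + c1) + c2 to target by gradient descent on MSE."""
    w1, c1, w2, c2 = 1.0, 0.0, 1.0, 0.0
    losses = []
    for _ in range(steps):
        h = np.tanh(w1 * x + c1)           # output of the tanh element
        y = w2 * h + c2                    # output of the linear element
        err = y - target
        losses.append(float(np.mean(err ** 2)))
        dy = 2.0 * err / x.size            # dLoss/dy
        da = dy * w2 * (1.0 - h ** 2)      # backpropagate through tanh
        g_w2, g_c2 = np.sum(dy * h), np.sum(dy)
        g_w1, g_c1 = np.sum(da * x), np.sum(da)
        w1, c1 = w1 - lr * g_w1, c1 - lr * g_c1
        w2, c2 = w2 - lr * g_w2, c2 - lr * g_c2
    return (w1, c1, w2, c2), losses

x = np.linspace(0.0, 1.0, 64)              # assumed normalized absolute luminance
target = x ** 0.5                          # stand-in for developed luminance
params, losses = pretrain_conversion(x, target)
```

The fitted parameters would then serve only as initial values; the end-to-end learning through the CNN is free to move them away from the appearance-oriented development curve.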

このようにして学習された識別器を用いて実際の入力画像を識別する工程を、以下に詳細説明する。図２（ｂ）は、本実施形態に係る識別時の処理を示すフローチャートである。 The step of identifying an actual input image using the classifier learned in this way will be described in detail below. FIG. 2B is a flowchart showing processing at the time of identification according to the present embodiment.

まず、入力データ取得ステップＳ１１００では、入力データ取得部１１００が、撮像装置から得られた現像前の画像データが取得される。入力データの方式は、第２の学習データ取得ステップＳ２３００における現像前画像と同様であるとする。すなわち、撮像装置で得られたセンサ値によるＲＡＷ画像から、撮像情報を利用して、数式３、数式４、数式５を使って輝度絶対量のマップに変換したものである。 First, in input data acquisition step S1100, the input data acquisition unit 1100 acquires image data before development obtained from the imaging apparatus. Assume that the method of input data is the same as the pre-development image in the second learning data acquisition step S2300. That is, a RAW image based on sensor values obtained by the imaging apparatus is converted into a map of absolute luminance using Formula 3, Formula 4, and Formula 5 using imaging information.

識別器設定ステップＳ１２００では、識別器設定部１２００が、第２の識別器記憶部５４００から学習済みの識別器を読み込む。なお、ここでは入力データ取得ステップＳ１１００の後に識別器設定ステップＳ１２００を行うようにしているが、この２つのステップの手順は逆でもよい。識別器を常にメモリに確保して入力画像を次々に処理する場合には、識別器設定ステップＳ１２００の後で入力データ取得ステップＳ１１００以降の処理を繰り返し行うとしてもよい。識別器設定ステップＳ１２００で設定される識別器は、図４（ｂ）で表わされる変換部とＣＮＮで構成される識別器である。 In the discriminator setting step S1200, the discriminator setting unit 1200 reads the learned discriminator from the second discriminator storage unit 5400. Here, the discriminator setting step S1200 is performed after the input data acquisition step S1100, but the order of these two steps may be reversed. When the discriminator is kept resident in memory and input images are processed one after another, the processing from the input data acquisition step S1100 onward may be repeated after the discriminator setting step S1200. The discriminator set in the discriminator setting step S1200 is composed of the conversion unit and the CNN shown in FIG. 4B.

識別ステップＳ１３００では、識別部１３００が、識別器設定ステップＳ１２００で設定された識別器を用いて、入力データ取得ステップＳ１１００で取得された入力画像の識別処理を行う。輝度絶対量のマップとして取得された入力画像は、図４（ｂ）における識別器の変換部６５０に入力され、変換された画像はＣＮＮの特徴抽出部６１０における入力層６１１へと入力される。畳み込み層６１０では入力画像の信号が各層に順伝搬され、変換された信号は全結合層６２０を介して、各識別クラスに割り当てられた出力素子６２１の出力信号になる。信号が最も大きい出力素子に対応するクラスラベルが、その画素のクラス識別結果となる。 In the identification step S1300, the identification unit 1300 performs identification processing on the input image acquired in the input data acquisition step S1100, using the discriminator set in the discriminator setting step S1200. The input image acquired as the absolute luminance map is input to the conversion unit 650 of the discriminator in FIG. 4B, and the converted image is input to the input layer 611 of the feature extraction unit 610 of the CNN. In the convolution layer 610, the signal of the input image is propagated forward through each layer, and the converted signal passes through the fully connected layer 620 to become the output signals of the output elements 621 assigned to the respective identification classes. The class label corresponding to the output element with the largest signal is the class identification result for that pixel.

識別結果出力ステップＳ１４００では、識別結果出力部１４００が、識別ステップＳ１３００で得られた識別結果を出力する。識別結果出力ステップＳ１４００で行われる処理は、識別結果を利用するアプリケーションに依存するものであって、本実施形態を限定するものではない。例えば、領域ごとに与える画像処理を、領域クラスによって変更するような画像補正アプリケーションであれば、各画素のクラスラベルを画像補正プログラムに出力すればよい。その際、各クラスの曖昧さによって処理の重み付けなどが必要であるなら、各クラスラベルに対応する出力素子６２１の出力信号値をクラス尤度としてそのまま出力してもよい。特定のクラスに関する識別結果だけが必要であるなら、他のクラスに関する結果を捨てて、必要なクラスの識別結果だけを出力すればよい。 In the identification result output step S1400, the identification result output unit 1400 outputs the identification result obtained in the identification step S1300. The processing performed in the identification result output step S1400 depends on the application that uses the identification result, and does not limit the present embodiment. For example, in the case of an image correction application in which the image processing given for each area is changed depending on the area class, the class label of each pixel may be output to the image correction program. At this time, if processing weighting is necessary due to the ambiguity of each class, the output signal value of the output element 621 corresponding to each class label may be output as it is as the class likelihood. If only the identification results for a specific class are needed, the results for other classes may be discarded and only the identification results for the necessary classes may be output.
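The identification and output steps above can be sketched as follows, assuming the output signals of the output elements 621 are held in an array of shape (height, width, number of classes). The function names and the example values are hypothetical.

```python
import numpy as np

def identify(scores):
    """Return the per-pixel class label: the index of the largest output signal."""
    return np.argmax(scores, axis=-1)

def select_class(scores, class_index):
    """Keep only the output signals (class likelihoods) of one class of interest."""
    return scores[..., class_index]

# Hypothetical output signals for a 2x2 image and 3 classes.
scores = np.array([
    [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]],
    [[0.2, 0.2, 0.6], [0.3, 0.4, 0.3]],
])
labels = identify(scores)                # per-pixel class labels
likelihood_map = select_class(scores, 2) # likelihoods for class 2 only
```

An application that needs hard labels would consume `labels`, while one that weights its processing by class ambiguity would consume the raw signals or a single-class map such as `likelihood_map`.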

以上の説明では、画像識別処理として、画像の領域分割を例に説明したが、画像分類タスクに対しても、本実施形態は適用可能である。図９は各実施形態における識別器の構造を説明する図であり、画像分類タスクの場合は、図９（ａ）のようにＣＮＮの特徴抽出部６１０の最終層６１５の全画素における出力信号を、分類部６２０に入力すればよい。 In the above description, region segmentation of an image has been described as an example of image identification processing, but the present embodiment is also applicable to an image classification task. FIG. 9 is a diagram for explaining the structure of the discriminator in each embodiment. In the case of an image classification task, the output signals at all pixels of the final layer 615 of the CNN feature extraction unit 610 may be input to the classification unit 620, as shown in FIG. 9A.

以上のように、本実施形態によれば、識別器への入力画像を変換する変換部を学習することにより、識別に適した画像が得られ、高精度な画像識別を実現することが可能になる。また、識別器の画像入力側に変換部を加えて現像前の学習画像を用いて追加学習することにより、現像処理も含めた識別器の学習を行うことができる。これにより、見た目重視の現像処理を介した画像で無理に識別処理を行うことなく、より高精度な識別が行えることが期待できる。また、変換部以外の部分を大量の現像後画像で事前学習し、それを初期値に変換部を含めた識別器の学習を行うプロセスにより、比較的少ない現像前画像による学習画像で、画像変換を含めた識別器を学習することができる。これは、現像後画像に比べて現像前画像による大量の学習画像を揃えることが困難な場合に有効である。 As described above, according to the present embodiment, learning the conversion unit that converts the input image to the discriminator yields an image suitable for identification and makes highly accurate image identification possible. Further, by adding the conversion unit to the image input side of the discriminator and performing additional learning with pre-development learning images, the discriminator can be learned including the development processing. As a result, more accurate identification can be expected, without forcing identification on images that have passed through appearance-oriented development processing. In addition, by pre-learning the parts other than the conversion unit with a large number of developed images and then, using the result as initial values, learning the discriminator including the conversion unit, the discriminator including the image conversion can be learned with a relatively small number of pre-development learning images. This is effective when it is difficult to collect as many pre-development learning images as developed images.

［第２の実施形態］
次に、本発明の第２の実施形態について説明する。第１の実施形態では、識別器にＣＮＮを用いた場合の例を示したが、識別器にＣＮＮを用いた場合、畳み込み処理の繰り返しによって、順伝搬信号からはエッジ情報などが強く残ることになり、輝度値などの絶対値情報は徐々に情報が薄れていく傾向がある。色や明るさが有効な特徴であるような場合、そのような情報が失われることは識別精度低下の原因になる。例えば、パステルカラーの無地な家の壁や、太陽のランプのように光る物体などは、各色チャネルの輝度があればより識別精度の向上が期待できる。 [Second Embodiment]
Next, a second embodiment of the present invention will be described. The first embodiment showed an example in which a CNN is used as the discriminator. When a CNN is used, however, repeated convolution tends to leave edge information and the like strongly in the forward-propagated signal, while absolute value information such as luminance values gradually fades. When color or brightness is an effective feature, losing such information lowers identification accuracy. For example, for a plain pastel-colored house wall or a shining object such as a sun lamp, identification accuracy can be expected to improve if the luminance of each color channel is available.

また、撮像装置によっては、像面位相差ＡＦなどによって距離情報が得られている場合もあり、距離情報との併用によって、被写体との距離によって発生するスケーリングに対するロバスト性が向上することも期待できる。本実施形態では、入力画像の各画素に対して与えられる絶対値情報を識別器に取り込む形態について説明する。 Further, depending on the imaging device, distance information may be available through image plane phase difference AF or the like, and using it together with the image can be expected to improve robustness against the scaling that arises from the distance to the subject. In the present embodiment, a form will be described in which absolute value information given for each pixel of the input image is taken into the discriminator.

なお、第１の実施形態で既に説明をした構成については同一の符号を付し、その説明を省略する。本実施形態における装置構成は、第１の実施形態と同じく図１（ａ）で表わされるため、詳細な説明は省略する。 Note that components already described in the first embodiment are denoted by the same reference numerals, and their description is omitted. The apparatus configuration in this embodiment is shown in FIG. 1A, as in the first embodiment, and a detailed description is therefore omitted.

まず、本実施形態の学習処理について説明する。そのフローチャートは、図２（ａ）で示される第１の実施形態の学習処理のフローチャートと同じであるが、一部のステップにおける処理の内容が第１の実施形態とは異なる。 First, the learning process of this embodiment will be described. The flowchart is the same as the flowchart of the learning process of the first embodiment shown in FIG. 2A, but the contents of the process in some steps are different from those of the first embodiment.

第１の学習データ取得ステップＳ２１００，第１の学習ステップＳ２２００、および第２の学習データ取得ステップＳ２３００に関しては、第１の実施形態と同様であるため、説明は省略する。 Since the first learning data acquisition step S2100, the first learning step S2200, and the second learning data acquisition step S2300 are the same as those in the first embodiment, description thereof is omitted.

変換部追加ステップＳ２４００では、まず、変換部追加部２４００が、第１の識別器記憶部５２００から、第１の学習ステップＳ２２００にて学習されたＣＮＮを読み込む。そして、変換部追加部２４００は、このＣＮＮに対して、図４（ｃ）のように変換部６５０と、伝達部６６０を追加する。変換部６５０に関しては第１の実施形態と同様であるため、説明は省略する。伝達部６６０は、現像前の画像が持つ、各画素における絶対値情報を伝達するための多層ネットワークである。各層６６１、６６２、６６３、…、６６５は、ＣＮＮの各畳み込み層６１１、６１２，６１３、…、６１５に対応する層である。各層のチャネル数は、画像の各画素における情報の数に相当する。例えば、絶対値情報として第１の実施形態で示した輝度絶対量（Ｒ_Ｂｖ、Ｇ_Ｂｖ、Ｂ_Ｂｖ）を用いる場合には、各層はＲ_Ｂｖ、Ｇ_Ｂｖ、Ｂ_Ｂｖに対応する３つのチャネルを持つことになる。層の各チャネルにおける特徴面のサイズは、ＣＮＮの対応層の特徴面のサイズと等しいものとする。伝達部６６０の各層は単純にスケーリングの関係にあり、線形補間やバイキュービック補間、最近傍法などによる手法でリサイズされる。演算時間を重視するのであれば、間引きによるテーブル参照でリサイズを行ってもよい。これら伝達部における層間の結合部分には学習によって修正される結合係数は割り振られない。また、ＣＮＮの入力層６１１に対応する伝達部６６０の層６６１のサイズは入力画像サイズそのものであるため、入力画像の画素情報がそのまま直接設定されることになる。伝達部６６０の最も入力側の層６１１における、出力部の画素位置に対応する位置６６９の値は、分類部６２０の入力層にそのまま入力される。 In the conversion unit addition step S2400, the conversion unit addition unit 2400 first reads the CNN learned in the first learning step S2200 from the first discriminator storage unit 5200. The conversion unit addition unit 2400 then adds the conversion unit 650 and the transmission unit 660 to this CNN, as shown in FIG. 4C. The conversion unit 650 is the same as in the first embodiment, and its description is omitted. The transmission unit 660 is a multilayer network for transmitting the absolute value information of each pixel of the pre-development image. The layers 661, 662, 663, ..., 665 correspond to the convolution layers 611, 612, 613, ..., 615 of the CNN, respectively. The number of channels in each layer corresponds to the number of pieces of information at each pixel of the image. For example, when the absolute luminance (R_Bv, G_Bv, B_Bv) shown in the first embodiment is used as the absolute value information, each layer has three channels corresponding to R_Bv, G_Bv, and B_Bv. The size of the feature plane in each channel of a layer is equal to the size of the feature plane of the corresponding CNN layer. 
The layers of the transmission unit 660 are related simply by scaling, and are resized by a technique such as linear interpolation, bicubic interpolation, or the nearest neighbor method. If computation time is a priority, resizing may be performed by table lookup with thinning. No coupling coefficients to be corrected by learning are assigned to the connections between layers in the transmission unit. Also, since the size of the layer 661 of the transmission unit 660 corresponding to the input layer 611 of the CNN is the input image size itself, the pixel information of the input image is set directly as it is. The value at the position 669 corresponding to the pixel position of the output unit, in the layer 611 on the most input side of the transmission unit 660, is input as it is to the input layer of the classification unit 620.
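The resizing performed inside the transmission unit can be sketched with a nearest-neighbor index lookup, which corresponds to the thinning-by-table-reference option mentioned above. The function names and the list of layer sizes are hypothetical; in practice the sizes would be taken from the feature planes of the corresponding CNN layers.

```python
import numpy as np

def resize_nearest(channel_map, out_h, out_w):
    """Resize an (H, W, C) map by nearest-neighbor index lookup (no learnable weights)."""
    in_h, in_w = channel_map.shape[:2]
    rows = (np.arange(out_h) * in_h) // out_h   # source row for each output row
    cols = (np.arange(out_w) * in_w) // out_w   # source column for each output column
    return channel_map[rows[:, None], cols[None, :]]

def build_transmission_layers(abs_map, layer_sizes):
    """Produce one resized copy of the absolute-value map per CNN layer size."""
    return [resize_nearest(abs_map, h, w) for (h, w) in layer_sizes]

# Example: an 8x8 three-channel absolute luminance map and three assumed layer sizes.
abs_map = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
layers = build_transmission_layers(abs_map, [(8, 8), (4, 4), (2, 2)])
```

Because the lookup tables `rows` and `cols` depend only on the layer sizes, they could be computed once and reused for every input image.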

伝達部６６０の各層における値は、対応するＣＮＮの畳み込み層に対して、バイアス係数とともにバイアス値として入力される。第ｌ番目の畳み込み層におけるチャネルｍの位置（ｉ，ｊ）の素子に対する入力信号ｕ_ｌｍ（ｉ，ｊ）は、以下の数式１２のように表わされる。 The value in each layer of the transmission unit 660 is input as a bias value together with the bias coefficient to the corresponding CNN convolution layer. An input signal u _lm (i, j) for the element at the position (i, j) of the channel m in the l-th convolutional layer is expressed as Equation 12 below.

ここで右辺第１項はＣＮＮにおける結合を表わしており、ＫはＣＮＮの第ｌ−１層におけるチャネル数、Ｈは第ｌ−１層と第ｌ層の間における畳み込みフィルタの幅である。ｈ_{ｌｐｑｋｍ}は、第ｌ層の第ｍチャネルと第ｌ−１層の第ｋチャネルを結合する畳み込みフィルタの、フィルタ中心座標における位置（ｐ，ｑ）の値である。また、ｚ_ｌ−１（ｉ，ｊ，ｋ）は、第ｌ−１層における位置（ｉ，ｊ）の出力信号、ｂ_ｌｍは、第ｌ層の第ｍチャネルにおけるバイアス係数である。右辺第２項は伝達部６６０からの結合を表わしており、Ｒは画素情報のチャネル数、Ｊ_ｌ（ｉ，ｊ，ｒ）は画素情報伝達部の第ｌ層の第ｒチャネルにおける値、ｂ_ｌｒは同バイアス係数である。これらの中で、学習によって修正されるパラメータは、ｈ_{ｌｐｑｋｍ}、ｂ_ｌｍ、ｂ_ｌｒである。 Here, the first term on the right side represents the connections within the CNN, K is the number of channels in the (l−1)-th layer of the CNN, and H is the width of the convolution filter between the (l−1)-th layer and the l-th layer. h_{lpqkm} is the value at position (p, q), in filter-center coordinates, of the convolution filter that connects the m-th channel of the l-th layer and the k-th channel of the (l−1)-th layer. z_{l−1}(i, j, k) is the output signal at position (i, j) in the (l−1)-th layer, and b_{lm} is the bias coefficient of the m-th channel of the l-th layer. The second term on the right side represents the connections from the transmission unit 660, where R is the number of channels of pixel information, J_l(i, j, r) is the value in the r-th channel of the l-th layer of the pixel information transmission unit, and b_{lr} is the corresponding bias coefficient. Among these, the parameters corrected by learning are h_{lpqkm}, b_{lm}, and b_{lr}.
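Formula 12 itself is not reproduced in this text. From the term definitions above, one plausible reconstruction is the following. The summation limits over p and q (a filter of odd width H centered on the element), the index offsets i+p and j+q, and the placement of b_{lm} are assumptions and are not taken from the source.

```latex
u_{lm}(i,j) =
  \sum_{k=1}^{K}
  \sum_{p=-\lfloor H/2 \rfloor}^{\lfloor H/2 \rfloor}
  \sum_{q=-\lfloor H/2 \rfloor}^{\lfloor H/2 \rfloor}
    h_{lpqkm}\, z_{l-1}(i+p,\, j+q,\, k)
  + b_{lm}
  + \sum_{r=1}^{R} b_{lr}\, J_{l}(i,j,r)
\qquad \text{(Formula 12, reconstructed)}
```

In this reading, the first sum together with b_{lm} is the ordinary convolution of the CNN, and the final sum injects the resized absolute-value maps of the transmission unit as an additional, position-dependent bias.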

第２の学習ステップＳ２５００では、第２の学習部２５００が、第２の学習データ取得ステップＳ２３００で取得した学習画像を用いて、ＣＮＮの内部結合係数を学習する。また、第２の学習部２５００は、ＣＮＮの内部結合係数とともに、学習画像を用いて、変換部追加ステップＳ２４００で追加された変換部６５０および伝達部６６０とＣＮＮを結合する係数を学習する。上述した結合係数が学習されるパラメータとして追加されたことを除けば、学習に関する基本的なアルゴリズムは第１の実施形態と同様であるため、その説明は省く。学習によって修正されたパラメータは、第２の識別器記憶部５４００に記憶される。 In the second learning step S2500, the second learning unit 2500 learns the CNN internal coupling coefficient using the learning image acquired in the second learning data acquisition step S2300. The second learning unit 2500 learns the coefficient that combines the CNN with the conversion unit 650 and the transmission unit 660 added in the conversion unit addition step S2400, using the learning image together with the CNN internal coupling coefficient. Except that the above-described coupling coefficient is added as a learned parameter, the basic algorithm related to learning is the same as that in the first embodiment, and a description thereof will be omitted. The parameters corrected by learning are stored in the second discriminator storage unit 5400.

次に、本実施形態の識別処理について説明する。そのフローチャートは、図２（ｂ）で示される第１の実施形態の識別処理のフローチャートと同じであるが、一部のステップにおける処理の内容が第１の実施形態とは異なる。 Next, the identification process of this embodiment will be described. The flowchart is the same as the flowchart of the identification process of the first embodiment shown in FIG. 2B, but the contents of the process in some steps are different from those of the first embodiment.

入力データ取得ステップＳ１１００、識別器設定ステップＳ１２００の処理は、第１の実施形態と同様であるため、説明は省略する。 Since the processing of the input data acquisition step S1100 and the discriminator setting step S1200 is the same as that of the first embodiment, description thereof is omitted.

識別ステップＳ１３００では、識別部１３００が、現像前の輝度絶対値による入力画像を図４（ｃ）に示すネットワークに入力することにより、識別結果を得る。識別時における順伝搬方向の信号の伝達に関しては、学習時と同じであるため、詳細な説明は省略する。 In the identification step S1300, the identification unit 1300 obtains an identification result by inputting the input image based on the absolute luminance value before development into the network shown in FIG. 4C. Since the transmission of the signal in the forward propagation direction at the time of identification is the same as that at the time of learning, detailed description is omitted.

識別結果出力ステップＳ１４００における処理は、第１の実施形態と同様であるため、説明は省略する。 Since the processing in the identification result output step S1400 is the same as that in the first embodiment, description thereof is omitted.

絶対値情報としては、上記輝度絶対量以外にも、撮像系の像面位相差ＡＦなどによって得られる距離情報を与えてもよい。距離情報は、対象物体の絶対的なサイズや立体形状に関する情報を与えるため、スケーリングによる類似物や、実物と写真や絵画などとの区別がつきやすくなる。例えば、看板に描かれた人物や巨大人物像と、実物の人間を区別する場合などで有効である。 As absolute value information, distance information obtained by image plane phase difference AF or the like of the imaging system may be given in addition to the absolute luminance value. Since the distance information gives information on the absolute size and three-dimensional shape of the target object, it is easy to distinguish between similar objects by scaling, real objects, and photographs and paintings. For example, this is effective in distinguishing between a person or a giant figure drawn on a signboard and a real person.

距離情報用の伝達部６７０を追加した場合の構成を図４（ｄ）に示す。６５３は、各画素の距離情報を持つ距離マップである。画像の画素密度に対して測距点が疎である場合には、線形補間やバイキュービック補間、あるいは最近傍法などによって、各画素の距離を補うことで、各画素に対する距離マップを算出すればよい。この場合、チャネル数は、Ｒ_Ｂｖ、Ｇ_Ｂｖ、Ｂ_Ｂｖ、距離情報の４チャネルになる。図４（ｄ）の構成による学習処理や識別時の処理は、図４（ｃ）の輝度絶対量による例と同様であるため、説明は省略する。さらに、輝度情報と距離情報を併用する場合には、図７（ｂ）のように伝達部を２つ並列に用意すればよい。 FIG. 4D shows the configuration when a transmission unit 670 for distance information is added. Reference numeral 653 denotes a distance map holding distance information for each pixel. When the distance measurement points are sparse relative to the pixel density of the image, a distance map for every pixel may be calculated by filling in the distance of each pixel using linear interpolation, bicubic interpolation, the nearest neighbor method, or the like. In this case, there are four channels: R_Bv, G_Bv, B_Bv, and distance information. The learning processing and identification processing with the configuration of FIG. 4D are the same as in the absolute luminance example of FIG. 4C, and their description is omitted. Further, when luminance information and distance information are used together, two transmission units may be provided in parallel as shown in FIG. 7B.
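Filling a dense distance map from sparse ranging points can be sketched as below, using the nearest neighbor method because it needs nothing beyond NumPy. The function name, the ranging-point coordinates, and the distance values are made up for illustration, and the brute-force search is only suitable for small examples.

```python
import numpy as np

def densify_distance(points_rc, distances, height, width):
    """Fill a dense (H, W) distance map from sparse ranging points by nearest neighbor."""
    points_rc = np.asarray(points_rc, dtype=float)          # (N, 2) row/col of ranging points
    distances = np.asarray(distances, dtype=float)          # (N,) measured distances
    rr, cc = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    pixels = np.stack([rr, cc], axis=-1).reshape(-1, 1, 2)  # (H*W, 1, 2)
    d2 = np.sum((pixels - points_rc[None, :, :]) ** 2, axis=-1)
    nearest = np.argmin(d2, axis=1)                         # closest ranging point per pixel
    return distances[nearest].reshape(height, width)

luminance = np.zeros((4, 6, 3))                              # placeholder R_Bv, G_Bv, B_Bv map
dense = densify_distance([(0, 0), (3, 5)], [1.5, 10.0], 4, 6)
four_channel = np.concatenate([luminance, dense[..., None]], axis=-1)
```

The resulting `four_channel` array corresponds to the four-channel case (R_Bv, G_Bv, B_Bv, distance) described above.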

なお、距離情報を利用する場合には、画像情報とは異なる勾配特徴を追加することも可能である。距離の勾配を特徴に組み込むことにより、写真と立体物の区別が容易にできるようになる。その場合は、図９（ｂ）のように、特徴抽出部をもう１つ並列に並べる構成となる。この場合、学習時の変換部追加ステップＳ２４００では、変換部６５０と伝達部６６０だけでなく、距離情報用の特徴抽出部７１０を追加することになる。距離情報用特徴抽出部７１０では、距離マップが入力層７１１に入力され、最終層７１５における出力信号は、分類部６２０への入力信号として与えられる。第２の学習ステップＳ２５００では、学習画像に対する誤差信号が逆伝搬され、分類部６２０、画像用および距離情報用の特徴抽出部６１０と７１０、変換部６５０、画像用および距離情報用の伝達部６６０と６７０の重み係数が、学習によって更新される。 When distance information is used, gradient features different from those of the image information can also be added. Incorporating the gradient of distance into the features makes it easy to distinguish a photograph from a three-dimensional object. In that case, another feature extraction unit is arranged in parallel, as shown in FIG. 9B. In the conversion unit addition step S2400 during learning, not only the conversion unit 650 and the transmission unit 660 but also a feature extraction unit 710 for distance information is then added. In the distance information feature extraction unit 710, the distance map is input to the input layer 711, and the output signal of the final layer 715 is given as an input signal to the classification unit 620. In the second learning step S2500, the error signal for the learning image is backpropagated, and the weighting factors of the classification unit 620, the feature extraction units 610 and 710 for image and distance information, the conversion unit 650, and the transmission units 660 and 670 for image and distance information are updated by learning.

以上のように、本実施形態では、輝度値情報または距離情報の少なくとも一方をニューラルネットワークの中間層に入力して識別精度の向上を図ることができる。特に、ＣＮＮの出力層に向けて薄れがちな絶対値による情報が、特徴抽出の中間層に入れ込まれることによって、色や明るさが重要な情報である対象物体の識別に対して識別精度の向上が期待できる。また、像面位相差などによって得ることのできる距離情報も、同様な方法にて識別器に利用することができ、さらに識別精度の向上が期待できる。 As described above, in this embodiment, it is possible to improve the identification accuracy by inputting at least one of the luminance value information and the distance information to the intermediate layer of the neural network. In particular, information with absolute values that tend to fade toward the output layer of the CNN is inserted into the intermediate layer of feature extraction, so that the accuracy of identification can be improved for the identification of target objects whose color and brightness are important information. Improvement can be expected. Further, the distance information that can be obtained by the image plane phase difference or the like can also be used for the discriminator by a similar method, and further improvement in discrimination accuracy can be expected.

［第３の実施形態］
次に、本発明の第３の実施形態について説明する。第１の実施形態では、入力画像のシーン特徴によって、入力画像全体に対して同じ画像変換を行う方法について説明をした。これは通常の現像方法を変換部で近似しつつ識別精度を向上させるための現像方法を学習によって得ることを意味する。ここで、さらに識別精度を向上させるために、領域によって異なる現像を行ってもよい。本実施形態では、領域ごとに変換部の変換パラメータを修正する形態について説明する。なお、第１、第２の実施形態で既に説明をした構成については同一の符号を付し、その説明は省略する。 [Third Embodiment]
Next, a third embodiment of the present invention will be described. The first embodiment described a method of performing the same image conversion on the entire input image according to the scene features of the input image. This means that a development method that improves identification accuracy, while approximating an ordinary development method in the conversion unit, is obtained by learning. To improve identification accuracy further, different development may be performed for different regions. This embodiment describes a form in which the conversion parameters of the conversion unit are modified for each region. Components already described in the first and second embodiments are denoted by the same reference numerals, and their description is omitted.

図１（ｂ）は、本実施形態に係る画像処理装置の機能構成を示す概略ブロック図である。まず、学習時の学習装置として機能する際の装置構成について説明する。第１の学習データ取得部２１００、第１の学習部２２００、第２の学習データ取得部２３００、変換部追加部２４００、および第２の学習部２５００は、第１の実施と同様であるため、説明を省略する。また、第１の学習データ記憶部５１００、第１の識別器記憶部５２００、第２の学習データ記憶部５３００、および第２の識別器記憶部５４００についても、第１の実施形態と同様であるため、説明は省略する。 FIG. 1B is a schematic block diagram illustrating the functional configuration of the image processing apparatus according to the present embodiment. First, the apparatus configuration when functioning as a learning device during learning will be described. The first learning data acquisition unit 2100, the first learning unit 2200, the second learning data acquisition unit 2300, the conversion unit addition unit 2400, and the second learning unit 2500 are the same as in the first embodiment, and their description is omitted. The first learning data storage unit 5100, the first discriminator storage unit 5200, the second learning data storage unit 5300, and the second discriminator storage unit 5400 are also the same as in the first embodiment, and their description is omitted.

調整部追加部２６００は、一次識別結果により変換部を調整する調整部を、識別器に追加する。第３の学習部２７００は、第２の学習部２５００によって学習された変換部と識別器を使って第２の学習データに対して識別処理を行い、そこで発生する誤差をもとに、調整部を学習する。学習された調整部のパラメータは、調整部記憶部５５００に記憶される。 The adjustment unit addition unit 2600 adds, to the discriminator, an adjustment unit that adjusts the conversion unit based on the primary identification result. The third learning unit 2700 performs identification processing on the second learning data using the conversion unit and the discriminator learned by the second learning unit 2500, and learns the adjustment unit based on the resulting error. The learned parameters of the adjustment unit are stored in the adjustment unit storage unit 5500.

Next, the device configuration when the apparatus functions as an image identification device at the time of identification will be described. The input data acquisition unit 1100, the discriminator setting unit 1200, and the identification unit 1300 are the same as in the first embodiment, so their description is omitted. The adjustment unit setting unit 1500 reads the parameters of the adjustment unit from the adjustment unit storage unit 5500 and sets the adjustment unit that adjusts the conversion unit. The re-identification unit 1600 performs identification processing with the discriminator again, using the identification result of the identification unit 1300 and the adjustment unit. The obtained re-identification result is sent to the identification result output unit 1400 and presented to the user or to another device.

Next, the learning process according to the present embodiment will be described with reference to FIG. 2C. The first learning data acquisition step S2100 through the second learning step S2500 are the same as in the first embodiment, so their description is omitted.

In the adjustment unit addition step S2600, the adjustment unit addition unit 2600 adds an adjustment unit that adjusts the development processing performed by the conversion unit based on the primary identification result. FIG. 7 is a diagram illustrating the configuration of the discriminator according to each embodiment, and FIG. 7A shows a configuration example according to the present embodiment.

First, in this step, a learning image is input to the discriminator learned up to the second learning step S2500, and a class identification result is obtained at the element 621. Based on the obtained identification result, regions of the second learning images whose error from the class label in the ground truth (GT) is small, for example regions with an error of 0.2 or less, are excluded from the learning data. For the remaining learning data, the output signal vector 691 for all channels in the final layer 615 of the feature extraction unit 610 is extracted. Although the vector 691 is described here as the output signal of the final layer 615, values from other layers may be used. For example, the values of the input layer 611 may be used, or the values of all the layers 611 to 615 may be concatenated and used. These values are input to the adjustment unit 690, and the signal output from the adjustment unit 690 is supplied to the conversion unit 650 to adjust the image conversion processing.
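The exclusion of well-classified regions described above can be sketched as follows. This is only an illustration under assumptions: the record layout (`id`, `error`, `features`) and the function name `select_hard_regions` are hypothetical, and how the error against the GT label is measured is not specified in this passage.

```python
def select_hard_regions(regions, threshold=0.2):
    """Drop regions whose error against the GT class label is `threshold` or less.

    `regions` is a hypothetical list of dicts; the patent does not define a
    data layout, so this structure is an assumption made for illustration.
    """
    return [r for r in regions if r["error"] > threshold]


# Toy data: "features" stands in for the output signal vector 691 of a region.
regions = [
    {"id": 0, "error": 0.05, "features": [0.1, 0.9]},
    {"id": 1, "error": 0.60, "features": [0.7, 0.2]},
    {"id": 2, "error": 0.20, "features": [0.4, 0.4]},
    {"id": 3, "error": 0.35, "features": [0.3, 0.8]},
]

hard_regions = select_hard_regions(regions)
kept_ids = [r["id"] for r in hard_regions]
vectors_691 = [r["features"] for r in hard_regions]
```

Only the regions that the already-trained discriminator gets wrong remain, so the adjustment unit is later trained on the cases where the unadjusted conversion was insufficient.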

Here, the configuration of the adjustment unit 690 according to the present embodiment will be described with reference to FIG. 8. The output signal of the final layer 615 in the feature extraction unit 610 and the output signal of each class in the output layer 640 are concatenated into a feature quantity 696. The feature quantity 696 is weighted by the weight vector b_3 and input to the element 693. The feature quantity 696 is also weighted by the weight vector b_4 and input to the element 654. The adjustment value obtained by this structure is, as in the first embodiment, an approximation of a gamma correction function. Because the input signal to the element 654 is the sum of the conversion function of the conversion unit and the conversion function of the adjustment unit, it approximates development processing by a combination of two gamma correction functions. The output signal y of the element 654, corrected by both the conversion unit and the adjustment unit, is given by Formula 13 below.

$$y = w_2 \tanh\bigl(w_1 x + f(z_1)\bigr) + b_2 \cdot g(z_2) + w_3 \tanh(b_3 \cdot z_3) + b_4 \cdot z_3 \qquad \text{(Formula 13)}$$

Here, z_3 is the feature vector, indicated by 696, obtained by concatenating the output signal of the final convolution layer with the output signal of each class in the output layer, and w_3, b_3, and b_4 are weight coefficients. The part of the right-hand side that comes from the adjustment unit, w_3 tanh(b_3 · z_3) + b_4 · z_3, is referred to here as the adjustment term.
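As a rough numerical illustration of Formula 13, the sketch below evaluates the output of the element 654 for one pixel. The function and argument names are assumptions introduced here: `base_shift` stands for the already-computed value f(z_1) and `base_linear` for b_2 · g(z_2), whose definitions belong to the first embodiment and are not reproduced in this section.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def converted_output(x, base_shift, base_linear, z3, w1, w2, w3, b3, b4):
    """Evaluate Formula 13 for a single pixel value x.

    Returns (y, adjustment), where adjustment is the contribution of the
    adjustment unit: w3 * tanh(b3 . z3) + b4 . z3.
    """
    conversion = w2 * math.tanh(w1 * x + base_shift) + base_linear
    adjustment = w3 * math.tanh(dot(b3, z3)) + dot(b4, z3)
    return conversion + adjustment, adjustment


# With w3 = 0 and b4 = 0 the adjustment unit contributes nothing, so the
# output equals that of the conversion unit alone.
y_plain, adj_plain = converted_output(
    x=0.3, base_shift=0.1, base_linear=0.05, z3=[1.0, 2.0],
    w1=1.5, w2=0.8, w3=0.0, b3=[0.5, 0.5], b4=[0.0, 0.0])

# Non-zero adjustment weights shift the output region by region, because z3
# differs from one region to another.
y_adj, adj = converted_output(
    x=0.3, base_shift=0.1, base_linear=0.05, z3=[1.0, 2.0],
    w1=1.5, w2=0.8, w3=0.0, b3=[0.5, 0.5], b4=[0.1, 0.0])
```

Since z_3 depends on the region, the same weights w_3, b_3, and b_4 produce a different correction for each region, which is what allows region-dependent development.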

Next, in the third learning step S2700, the third learning unit 2700 learns the weight coefficients of the adjustment unit 690. The third learning unit 2700 fixes all weight coefficients other than w_3, b_3, and b_4 of the adjustment unit 690 by setting their learning coefficients to 0, and corrects only w_3, b_3, and b_4 by error backpropagation. The parameters obtained by learning are stored in the adjustment unit storage unit 5500.
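The idea of updating only the adjustment-unit weights can be sketched as a single gradient step. This is a simplified illustration, not the procedure of the embodiment: it assumes the gradient of the loss with respect to the output y of the element 654 (`grad_y`) has already been backpropagated from the discriminator output, and the dictionary layout and function name are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def third_learning_update(params, z3, grad_y, lr=0.1):
    """One gradient step that changes only w3, b3 and b4.

    All other entries of `params` (for example w1, w2) are copied unchanged,
    which corresponds to giving them a learning coefficient of 0.
    """
    s = math.tanh(dot(params["b3"], z3))
    new = dict(params)
    # dy/dw3 = tanh(b3 . z3)
    new["w3"] = params["w3"] - lr * grad_y * s
    # dy/db3 = w3 * (1 - tanh^2(b3 . z3)) * z3
    new["b3"] = [b - lr * grad_y * params["w3"] * (1.0 - s * s) * z
                 for b, z in zip(params["b3"], z3)]
    # dy/db4 = z3
    new["b4"] = [b - lr * grad_y * z for b, z in zip(params["b4"], z3)]
    return new


params = {"w1": 1.0, "w2": 1.0, "w3": 0.5, "b3": [0.0, 0.0], "b4": [0.0, 0.0]}
updated = third_learning_update(params, z3=[1.0, 2.0], grad_y=1.0)
```

In a deep learning framework the same effect is usually obtained by excluding the frozen weights from the optimizer; the explicit derivatives are written out here only to keep the example dependency-free.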

The learning performed here uses the output signal of the feature extraction unit 610 and the class identification signal of the classification unit as input features, so it can be interpreted as learning error patterns. That is, it learns under what combinations of class identification result and internal CNN state errors occur, and what image conversion reduces the error in those cases. For example, if white regions were blown out at the time of primary identification and the identification result was incorrect, then in this embodiment, when a similar error pattern occurs, the conversion of the adjustment unit is corrected so that contrast is strengthened in the high-luminance portions in order to reduce that error.

Next, the identification process according to the present embodiment will be described with reference to FIG. 2D. First, the input data acquisition step S1100 through the identification step S1300 are performed in the same way as in the first embodiment to obtain a primary identification result.

Next, in the adjustment unit setting step S1500, the adjustment unit setting unit 1500 reads the weight coefficients of the adjustment unit from the adjustment unit storage unit 5500 and sets the adjustment unit 690.

In the re-identification step S1600, the re-identification unit 1600 inputs to the adjustment unit 690 the feature vector 696, obtained by concatenating the class identification signal of the primary identification result with the output signal of the feature extraction unit 610, and thereby obtains, for each region, an image conversion function with the adjustment term added. The input image undergoes image conversion adjusted for each region, through the image conversion modified by the adjustment term, and the conversion result is input to the feature extraction unit 610. A re-identification result is then obtained at the output layer 640 via the classification unit 620. Although not illustrated here, the result of the re-identification step S1600 may be used further to repeat re-identification with the adjustment term. In that case, the calculation may be terminated after a suitable number of repetitions, or when the change in the adjustment-term signal becomes small.
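The optional repeated re-identification, with its two stopping rules (a fixed number of repetitions, or a small change in the adjustment-term signal), might be organized as in the sketch below. The callables `identify` and `adjustment_of`, the scalar adjustment signal, and the default values of `max_passes` and `tol` are all assumptions made for illustration; the toy functions merely stand in for the discriminator and the adjustment unit 690.

```python
def reidentify_iteratively(identify, adjustment_of, image, max_passes=5, tol=1e-3):
    """Repeat re-identification until the adjustment signal stops changing.

    identify(image, adjustment) -> identification result
    adjustment_of(result)       -> adjustment-term signal (a scalar here)
    Returns (result, adjustment, number_of_re_identification_passes).
    """
    adjustment = 0.0
    result = identify(image, adjustment)  # primary identification
    passes = 0
    while passes < max_passes:
        new_adjustment = adjustment_of(result)
        if abs(new_adjustment - adjustment) < tol:
            break  # the adjustment term has effectively converged
        adjustment = new_adjustment
        result = identify(image, adjustment)  # re-identification
        passes += 1
    return result, adjustment, passes


def toy_identify(image, adjustment):
    """Stand-in for conversion + discriminator: a scalar 'result'."""
    return image + adjustment


def toy_adjustment(result):
    """Stand-in for the adjustment unit: pulls the result toward a fixed point."""
    return 0.5 * (1.0 - result)


result, adjustment, passes = reidentify_iteratively(
    toy_identify, toy_adjustment, 0.0, max_passes=50)
_, _, capped_passes = reidentify_iteratively(
    toy_identify, toy_adjustment, 0.0, max_passes=2)
```

The first call stops on the convergence test, the second on the repetition limit, which mirrors the two termination options mentioned above.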

The identification result output step S1400 is the same processing as in the first embodiment, so its description is omitted.

In this way, the present embodiment adjusts the image conversion processing based on the primary identification result obtained first, using a feature quantity that reflects the identification result by combining the output signal of the convolution layers with the class identification signal. Because the adjustment is learned from the learning data in the direction that reduces the error, identification more accurate than the primary identification result can be expected.

[Other Embodiments]
In each of the above embodiments, identification by the CNN was described in a form in which the final layer 615 of the feature extraction unit 610 is connected to the classification unit 620. However, when the fine texture of the subject is an effective feature, the signal from the final layer alone may be insufficient for identification. For example, fine texture is important information when distinguishing a white mortar wall from a textureless overcast sky. In such cases, signals may be taken from all layers of the feature extraction unit 610 and passed to the classification unit 620. This is called a hypercolumn structure and is a known technique described in, for example, Non-Patent Document 4.

The same processing as in each of the above embodiments can be performed when a hypercolumn structure is adopted in the configurations described so far. FIG. 7C shows a CNN with a hypercolumn structure. Reference numerals 681, 682, 683, ..., 685 denote the output signals of the layers 611, 612, 613, ..., 615 of the feature extraction unit 610 at the position of the pixel 621 of the output layer 640. These signal values are treated as a feature vector and input to the classification unit 620.
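Building the hypercolumn feature vector for one pixel amounts to concatenating the channel values of every layer at that position. The sketch below assumes, for simplicity, that every feature map has already been brought to the resolution of the output layer and is stored as nested lists indexed `[channel][row][col]`; the function name and data layout are hypothetical.

```python
def hypercolumn(layer_outputs, row, col):
    """Concatenate the channel values of all layers at one pixel position.

    layer_outputs: list of feature maps, each indexed as [channel][row][col].
    Assumes all maps share the same spatial resolution.
    """
    vector = []
    for feature_map in layer_outputs:
        vector.extend(channel[row][col] for channel in feature_map)
    return vector


# Two toy layers on a 2x2 grid: one channel in the first, two in the second.
layer_a = [[[1.0, 2.0],
            [3.0, 4.0]]]
layer_b = [[[10.0, 20.0],
            [30.0, 40.0]],
           [[100.0, 200.0],
            [300.0, 400.0]]]

feature_vector = hypercolumn([layer_a, layer_b], row=0, col=1)
```

The resulting vector has one entry per channel across all layers, so early layers contribute fine texture and later layers contribute more abstract features to the classification unit.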

A similar structure can also be incorporated into FIGS. 4B and 4C and FIGS. 5D and 5B. Even with a hypercolumn structure as described above, the learning and identification processing can be carried out with the same algorithms, so detailed description is omitted.

The present invention can also be realized by processing in which a program that implements one or more functions of the above-described embodiments is supplied to a system or apparatus via a network or a storage medium, and one or more processors in a computer of the system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.

DESCRIPTION OF SYMBOLS
1100 Input data acquisition unit
1200 Discriminator setting unit
1300 Identification unit
1400 Identification result output unit

Claims

An image identification device comprising:
acquisition means for acquiring an input image based on sensor values of an imaging device; and
identification means for identifying the acquired input image based on the sensor values by using a discriminator having a conversion unit,
wherein at least the conversion unit of the discriminator has been learned based on a learning image based on sensor values of the imaging device and correct answer data assigned to the learning image.

The image identification device according to claim 1, wherein the discriminator is generated by adding the conversion unit to a discriminator learned based on first learning data, which includes a developed learning image and correct answer data assigned to that learning image, and then learning the resulting discriminator based on second learning data, which includes a learning image based on sensor values of the imaging device and correct answer data assigned to that learning image.

The image discriminating apparatus according to claim 1, wherein the discriminator is a neural network.

4. The convolution neural network including a feature extraction unit that extracts image features by a multi-layer convolution operation, and a classification unit that performs pattern classification from the extracted image features. The image identification device described in 1.

The said discriminator inputs at least one of the luminance value or distance information in each pixel of the input image based on the sensor value as a weighted signal to each layer of the feature extraction unit. Image identification device.

The image identification device according to claim 3, wherein the conversion unit comprises:
a first layer that outputs an output signal obtained by converting, with a nonlinear function, an input signal composed of a linear sum, weighted by weight coefficients, of the absolute luminance amount at each pixel of the input image and a scene feature extracted from the entire input image; and
a second layer that outputs an output signal obtained by converting, with a linear function, an input signal composed of a linear sum, weighted by weight coefficients, of the scene feature and the output signal of the first layer.

The image identification apparatus according to claim 6, wherein the scene feature is an image feature acquired from the entire input image.

The image identification apparatus according to claim 6, wherein the scene feature is an image feature acquired from the entire input image and imaging information when the input image is captured.

The image identification device according to claim 1, further comprising adjustment means for adjusting the conversion unit based on the identification result of the identification means,
wherein the identification means re-identifies the input image using the discriminator having the adjusted conversion unit.

The image identification device according to claim 1, wherein an input image based on a sensor value of the imaging device is a RAW image.

The identification unit executes an identification task for classifying the input image into a plurality of classes or an identification task for dividing the input image into regions for a plurality of classes. The image identification device according to item 1.

A learning device for learning a discriminator used for identifying an input image based on sensor values of an imaging device, the learning device comprising:
first learning means for learning a first discriminator based on first learning data including a learning image obtained by developing an image based on sensor values and correct answer data assigned to the learning image;
adding means for adding a conversion unit to the learned first discriminator to generate a second discriminator; and
second learning means for generating the discriminator used by the identification means by learning the second discriminator based on second learning data including a learning image based on sensor values and correct answer data assigned to the learning image.

An image identification method comprising:
acquiring an input image based on sensor values of an imaging device; and
identifying the acquired input image based on the sensor values by using a discriminator having a conversion unit,
wherein at least the conversion unit of the discriminator has been learned based on a learning image based on sensor values of the imaging device and correct answer data assigned to the learning image.

A learning method for learning a discriminator used for identifying an input image based on sensor values of an imaging device, the learning method comprising:
learning a first discriminator based on first learning data including a learning image obtained by developing an image based on sensor values and correct answer data assigned to the learning image;
adding a conversion unit to the learned first discriminator to generate a second discriminator; and
generating the discriminator used by the identification means by learning the second discriminator based on second learning data including a learning image based on sensor values and correct answer data assigned to the learning image.

A program for causing a computer to function as the image identification device according to any one of claims 1 to 11.

A program for causing a computer to function as the learning device according to claim 12.