JP2022143211A - Image processing device and learning method

Info

Publication number: JP2022143211A
Application number: JP2021043611A
Authority: JP
Inventors: 暢小倉; Toru Kokura
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2021-03-17
Filing date: 2021-03-17
Publication date: 2022-10-03

Abstract

To make it possible to output a more suitable demosaiced image.

SOLUTION: An image processing device that performs image processing using a neural network (NN) includes: acquisition means for acquiring teacher image data; generation means for generating student image data obtained by mosaicking the teacher image data; weight acquisition means for acquiring a loss weight map indicating a weight of a loss value at each pixel position of the teacher image data; and learning means for performing learning of demosaic processing based on a difference between result image data obtained by inputting the student image data to the NN and the teacher image data, and the loss weight map. In the loss weight map, the weight of the loss value at a pixel position of a specific phase in the teacher image data is different from the weight of the loss value at a pixel position other than the specific phase.

SELECTED DRAWING: Figure 4

Description

The present invention relates to image processing technology using machine learning.

A digital imaging apparatus such as a digital camera uses an image sensor fitted with a color filter having, for example, an RGB arrangement. As a result, light of a specific wavelength is incident on each pixel of the image sensor. The Bayer array is widely used as the RGB arrangement of such color filters.

A captured image obtained by an image sensor having a Bayer array color filter is a so-called mosaic image, in which each pixel holds a pixel value for only one of the R, G, and B colors. The development processing unit of the camera generates a color image by applying various kinds of signal processing to this mosaic image, including demosaic processing that interpolates the remaining two color components at each pixel. One conventional demosaicing method applies a linear filter to the sparse data of each RGB color and linearly interpolates the surrounding pixel values of the same color to calculate the pixel values (RGB components) for each pixel.

Because this method has low interpolation accuracy, many nonlinear interpolation methods have been proposed, but every one of them produces false colors and artifacts in specific image regions.

Meanwhile, data-driven interpolation methods that apply deep learning have been proposed in recent years. Non-Patent Document 1 discloses a technique for training a CNN-based demosaicing network. Once learning is complete, the learning result is used to perform inference (a regression task on the input data), in which a mosaic image is input to the CNN and converted into an RGB image.

Non-Patent Document 1: M. Gharbi, G. Chaurasia, S. Paris, F. Durand, "Deep Joint Demosaicking and Denoising", SIGGRAPH Asia 2016, ACM Transactions on Graphics (TOG), November 2016

The pixels in the mosaic image described above are obtained by actual observation. It is therefore desirable that the RGB image resulting from demosaicing contain the pixel values obtained by actual observation (observed values) with the same values as in the mosaic image. In other words, when a mosaic image is input to a CNN and a demosaic result image is obtained as output, the observed values should be output without being changed by the CNN processing. However, the output values of a CNN can differ from its input values. That is, there is a problem that the observed values may change between the mosaic image and the demosaic result image.

The present invention has been made in view of this problem, and its object is to provide a technique that makes it possible to output a more suitable demosaiced image.

To solve the above problem, an image processing apparatus according to the present invention has the following configuration. That is, an image processing apparatus that performs image processing using a neural network (NN) comprises:
acquisition means for acquiring teacher image data;
generation means for generating student image data obtained by mosaicking the teacher image data;
weight acquisition means for acquiring a loss weight map indicating the weight of the loss value at each pixel position of the teacher image data; and
learning means for learning demosaic processing based on the loss weight map and on the difference between the teacher image data and result image data obtained by inputting the student image data to the NN,
wherein, in the loss weight map, the weight of the loss value at pixel positions of a specific phase in the teacher image data differs from the weight of the loss value at pixel positions other than the specific phase.

According to the present invention, a technique that makes it possible to output a more suitable demosaiced image can be provided.

FIG. 1 is a block diagram showing the hardware configuration of the image processing apparatus.
FIG. 2 is a diagram explaining the process by which the CNN generates an inference result image.
FIG. 3 is a diagram explaining the process by which the CNN calculates a loss value.
FIG. 4 is a block diagram showing the functional configuration of the image processing apparatus in the first embodiment.
FIG. 5 is a flowchart showing the flow of image conversion processing in the first embodiment.
FIG. 6 is a block diagram showing the functional configuration of the image processing apparatus in the second embodiment.
FIG. 7 is a flowchart showing the flow of image conversion processing in the second embodiment.
FIG. 8 is a diagram explaining the flow of processing in the teacher image generation unit.
FIG. 9 is a diagram explaining the loss weight map.
FIG. 10 is a diagram explaining the flow of generating a mosaic image.
FIG. 11 is a diagram explaining the flow of processing in the demosaicing network (prior art).
FIG. 12 is a diagram explaining the flow of learning processing in the second embodiment.

Embodiments are described in detail below with reference to the accompanying drawings. The following embodiments do not limit the invention according to the claims. Although multiple features are described in the embodiments, not all of them are essential to the invention, and the features may be combined arbitrarily. In the accompanying drawings, identical or similar configurations are given the same reference numerals, and redundant description is omitted.

(First embodiment)
As a first embodiment of the image processing apparatus according to the present invention, learning processing in an image processing apparatus that uses a convolutional neural network (CNN) is described below as an example. Inference (demosaic processing) based on the learning result (the network parameters of the CNN) is also described. In particular, the first embodiment describes a form in which changes in observed values before and after demosaic processing are suppressed by calculating the loss function with loss weights.

<Device configuration>
FIG. 1 is a block diagram showing the hardware configuration of the image processing apparatus. The information processing apparatus 100 includes a CPU 101, a RAM 102, a ROM 103, a secondary storage device 104, an input interface 105, an output interface 106, an imaging device 111, and a GPU 112. The components of the information processing apparatus 100 are interconnected by a system bus 107. The information processing apparatus 100 is connected to an external storage device 108 and an operation unit 110 via the input interface 105, and to the external storage device 108 and a display device 109 via the output interface 106.

The CPU 101 executes programs stored in the ROM 103 using the RAM 102 as work memory, and comprehensively controls each component of the information processing apparatus 100 via the system bus 107. The various processes described later are executed in this way. The GPU 112 performs computations on data received from the CPU 101 and outputs the results to the CPU 101.

The secondary storage device 104 stores the various data handled by the information processing apparatus 100; an HDD is used in the first embodiment. The CPU 101 writes data to and reads data from the secondary storage device 104 via the system bus 107. Besides an HDD, various storage devices such as an optical disk drive or flash memory can be used as the secondary storage device 104.

The input interface 105 is a serial bus interface such as USB or IEEE 1394. The information processing apparatus 100 receives data, commands, and the like from external devices via the input interface 105. In the first embodiment, the information processing apparatus 100 acquires data from the external storage device 108 (for example, a storage medium such as a hard disk, memory card, CF card, SD card, or USB memory) via the input interface 105. Also in the first embodiment, the information processing apparatus 100 acquires, via the input interface 105, user instructions entered on the operation unit 110. The operation unit 110 is an input device such as a mouse or keyboard that accepts user instructions.

The output interface 106, like the input interface 105, is a serial bus interface such as USB or IEEE 1394. The output interface 106 may instead be a video output terminal such as DVI or HDMI (registered trademark). The information processing apparatus 100 outputs data and the like to external devices via the output interface 106. In the first embodiment, the information processing apparatus 100 outputs data processed by the CPU 101 (for example, image data) to the display device 109 (any of various image display devices such as a liquid crystal display) via the output interface 106. The information processing apparatus 100 has other components as well, but they are not the focus of the present invention and their description is omitted.

The imaging device 111 captures the input image to be processed by the information processing apparatus 100. The first embodiment describes a mode in which, based on a command from the CPU 101, the image processing apparatus 100 inputs Bayer data (a mosaic image) to an image processing application and outputs demosaiced image data (a demosaiced image).

<CNN>
First, the CNN, which is used throughout image processing techniques that apply deep learning, is described. Then, demosaic processing using a CNN and the reason observed values change before and after demosaic processing are described.

A CNN is a learning-based image processing technique that repeatedly convolves an image with filters generated by learning (training) and then applies a nonlinear operation. A filter is also called a local receptive field (LRF). The image obtained by convolving filters with an image and then applying the nonlinear operation is called a feature map. Learning is performed using training data (training images or datasets) consisting of pairs of input and output images. Put simply, "learning" means generating, from the training data, filter values that can convert an input image into the corresponding output image with high accuracy. The details of learning are described later.

When the image has RGB color channels, or when the feature map consists of multiple images, the filters used for convolution have a corresponding number of channels. That is, the convolution filters are represented by a four-dimensional array whose dimensions are the vertical size, the horizontal size, the number of filters, and the number of channels. The processing of convolving filters with an image (or feature map) and then applying the nonlinear operation is expressed in units called layers; one speaks, for example, of the n-th layer feature map or the n-th layer filter. A CNN that repeats filter convolution and the nonlinear operation three times, for example, is said to have a three-layer network structure. This processing can be formulated as Equation (1) below.
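A form of Equation (1) consistent with the symbol definitions that accompany it is sketched here. The summation index k and the channel count K of the (n-1)-th layer feature map are assumptions introduced for illustration and are not named in the text.

```latex
\[
X_n^{(l)} = G\!\left( \sum_{k=1}^{K} W_n^{(l)} * X_{n-1}^{(k)} + b_n^{(l)} \right)
\tag{1}
\]
```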

In Equation (1), W_n is the n-th layer filter and b_n is the n-th layer bias. G is the nonlinear operator, X_n is the n-th layer feature map, and * is the convolution operator. The superscript (l) denotes the l-th filter or feature map. The filters and biases are generated by the learning described later and are collectively called network parameters. The sigmoid function or ReLU (Rectified Linear Unit), for example, is used as the nonlinear operation. ReLU is given by Equation (2) below.
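ReLU, as described in the text (negative elements set to zero, positive elements unchanged), can be written element-wise as follows.

```latex
\[
G(X) =
\begin{cases}
X & \text{if } 0 \le X \\
0 & \text{otherwise}
\end{cases}
\tag{2}
\]
```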

That is, ReLU is a nonlinear operation that sets the negative elements of the input vector X to zero and leaves the positive elements unchanged.

Next, learning of the CNN is described. A CNN is generally trained by minimizing the objective function expressed by Equation (3) below over training data consisting of pairs of an input training image (student image) and the corresponding output training image (teacher image).
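A form of Equation (3) consistent with the surrounding definitions (L as the loss function, F as the CNN, θ as the network parameters, n as the number of training pairs, and the L2 norm) is sketched below. The squared norm and the 1/n normalization are assumptions chosen to agree with the per-image loss discussed later in this section.

```latex
\[
L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left\| F(X_i; \theta) - Y_i \right\|_2^2
\tag{3}
\]
```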

Here, L is the loss function that measures the error between the ground truth and its estimate. Y_i is the i-th output training image and X_i is the i-th input training image. F is a function that collectively represents the operations (Equation (1)) performed in each layer of the CNN, and θ denotes the network parameters (filters and biases). Furthermore, ||Z||₂ is the L2 norm, which is simply the square root of the sum of the squares of the elements of the vector Z.

Also, n is the total number of training data items used for learning. Because this number is generally large, stochastic gradient descent (SGD) randomly selects a subset of the training images for each learning step. This reduces the computational load of learning with a large amount of training data.
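The random selection of a subset of training images can be sketched as follows. This is a minimal illustration, not the procedure of the source: the function name `sample_minibatch`, the batch size, and the string placeholders standing in for image pairs are all hypothetical.

```python
import random

def sample_minibatch(pairs, batch_size, rng=random):
    """Pick batch_size (student, teacher) pairs at random, without replacement."""
    return rng.sample(pairs, batch_size)

# Placeholder training set: strings stand in for (student, teacher) image pairs.
pairs = [(f"student_{i}", f"teacher_{i}") for i in range(10)]

# A seeded generator makes the selection reproducible.
batch = sample_minibatch(pairs, 4, random.Random(0))
```

Each learning iteration would draw a fresh batch in this way, so the gradient is computed on only `batch_size` pairs rather than on all n.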

Various methods are known for minimizing (that is, optimizing) the objective function, such as the momentum method, AdaGrad, AdaDelta, and Adam. The Adam method is given by Equation (4) below.
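The standard Adam update, written with the symbols the text defines for Equation (4) (θ_i^t, g, m, v, α, β₁, β₂, ε), is sketched below. The bias-corrected step size is the commonly published form and is assumed here.

```latex
\[
\begin{aligned}
g &= \frac{\partial L}{\partial \theta_i^{t}} \\
m &= \beta_1 m + (1 - \beta_1)\, g \\
v &= \beta_2 v + (1 - \beta_2)\, g^2 \\
\theta_i^{t+1} &= \theta_i^{t} - \alpha \,
  \frac{\sqrt{1 - \beta_2^{\,t}}}{1 - \beta_1^{\,t}} \,
  \frac{m}{\sqrt{v} + \varepsilon}
\end{aligned}
\tag{4}
\]
```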

In Equation (4), θ_i^t is the i-th network parameter at the t-th iteration, and g is the gradient of the loss function L with respect to θ_i^t. Also, m and v are moment vectors, α is the base learning rate, β₁ and β₂ are hyperparameters, and ε is a small constant. There is no established guideline for selecting an optimization method for learning, so basically any method may be used; however, convergence differs from method to method, which is known to produce differences in learning time.
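One Adam update step with the quantities named above can be sketched in plain Python. The function name `adam_step`, the default hyperparameter values, and the toy objective L(θ) = θ² are assumptions for illustration only.

```python
import math

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a flat list of parameters (t counts from 1)."""
    # Update the first and second moment estimates.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias-corrected step size.
    scale = alpha * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    theta = [th - scale * mi / (math.sqrt(vi) + eps)
             for th, mi, vi in zip(theta, m, v)]
    return theta, m, v

# Toy problem: minimise L(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = [1.0], [0.0], [0.0]
for t in range(1, 2001):
    grad = [2 * theta[0]]
    theta, m, v = adam_step(theta, grad, m, v, t, alpha=0.01)
```

In CNN training, `theta` would hold all filter coefficients and biases, and `grad` would come from backpropagation of the loss.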

Well-known networks that use CNNs include ResNet in the field of image recognition and RED-Net, its application in the field of super-resolution. Both aim at higher processing accuracy by making the CNN multi-layered and performing filter convolution many times. ResNet, for example, is characterized by a network structure with paths that shortcut the convolution layers, which enables a network as deep as 152 layers and achieves recognition accuracy approaching the human recognition rate. Put simply, a multi-layer CNN improves accuracy because repeating the nonlinear operation many times allows it to express a nonlinear relationship between input and output.

Next, the reason observed values change before and after demosaic processing in the prior art (Non-Patent Document 1) is described, followed by a measure for suppressing this change. FIG. 2 is a diagram explaining the process by which the CNN generates an inference result image, and FIG. 3(a) is a diagram explaining the process by which the CNN calculates a loss value.

At inference time, the CNN receives a mosaic image 201 as input and outputs a demosaic result image 202 in RGB format. The mosaic image is assumed here to follow a Bayer array color filter, but it may be a mosaic image of another array such as X-Trans.

The mosaic image contains the pixel values R₀₀, G₀₁, G₁₀, B₁₁, ... obtained by actual observation. The corresponding observed values R'₀₀, G'₀₁, G'₁₀, B'₁₁, ... in the demosaic result image are shown with a white background. The other pixel values, shown with a gray background, are values calculated by interpolation from the observed values (non-observed values). Here, R, G, and B denote red, green, and blue pixels (components), respectively. In each of the R plane 203, G plane 204, and B plane 205 that make up the demosaiced image, the observed values are arranged at a specific phase.

FIG. 3(a) shows the learning-time processing needed to perform such inference. During learning as well, inference is performed on a mosaic image to obtain the demosaic result 202. This result is compared with the teacher image 302, which is the ground-truth demosaic data, the inference error is calculated according to a loss function, and learning proceeds so as to reduce that error (called the loss value).

The loss value l_i between the i-th teacher image Y_i and the i-th demosaic result image Y'_i (= F(X_i; θ)) is generally calculated with the L2 norm, as expressed by Equation (5) below.
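A form of this per-image loss consistent with the surrounding text, which describes it as the term inside the summation of Equation (3) expanded over pixel positions and channels, is sketched below. The squared form is assumed so that it agrees with that objective.

```latex
\[
l_i = \left\| Y_i - Y'_i \right\|_2^2
    = \sum_{x} \sum_{y} \sum_{c}
      \bigl( Y_i(x, y, c) - Y'_i(x, y, c) \bigr)^2
\tag{5}
\]
```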

Here, Y_i(x, y, c) denotes the value of the c-th channel at position (x, y) in the teacher image Y_i. This equation rewrites the term inside the summation symbol Σ of Equation (3) in a different form.

Thus, in the loss value calculation of the prior art, every phase is treated equally, as in Equation (5), even though the phases in the image differ in nature: some hold observed values and others hold non-observed values. As a result, the CNN is thought to have been unable to perform to its potential, which caused the observed values to change. Besides the L2 norm shown in Equation (5), other loss functions such as the L1 norm, PSNR, and SSIM are also used, but they likewise treat every phase equally.

The first embodiment therefore introduces a weight w(x, y, c) for each phase and calculates the loss value as in Equation (6) below.
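A form of Equation (6) consistent with the unweighted L2 loss and the weight w(x, y, c) introduced in the text is sketched below. Applying the weight to each squared difference is an assumption about where the weight enters.

```latex
\[
l_i = \sum_{x} \sum_{y} \sum_{c}
      w(x, y, c)\, \bigl( Y_i(x, y, c) - Y'_i(x, y, c) \bigr)^2
\tag{6}
\]
```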

This weight is called the loss weight. In the first embodiment, changes in observed values are suppressed by setting the loss weights so that observed values are emphasized over non-observed values.
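The effect of the loss weight can be illustrated with a small pure-Python sketch. The function name `weighted_l2_loss`, the `image[y][x][c]` nested-list layout, and the sample values are assumptions for illustration. The weight pattern follows the R₀₀, G₀₁ layout described for the mosaic image, with weight 5 on observed values and 1 elsewhere.

```python
def weighted_l2_loss(teacher, result, weights):
    """Sum of w(x, y, c) * (Y - Y')^2 over all positions and channels."""
    total = 0.0
    for t_row, r_row, w_row in zip(teacher, result, weights):
        for t_px, r_px, w_px in zip(t_row, r_row, w_row):
            for t, r, w in zip(t_px, r_px, w_px):
                total += w * (t - r) ** 2
    return total

# A 1x2 image: pixel (0, 0) observes R, pixel (1, 0) observes G.
teacher = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]
result  = [[[0.5, 0.0, 0.0], [0.0, 1.0, 0.5]]]   # errors: observed R, non-observed B
weights = [[[5.0, 1.0, 1.0], [1.0, 5.0, 1.0]]]   # 5 on observed values, 1 elsewhere
uniform = [[[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]]

weighted = weighted_l2_loss(teacher, result, weights)    # 5 * 0.25 + 1 * 0.25 = 1.5
unweighted = weighted_l2_loss(teacher, result, uniform)  # 0.25 + 0.25 = 0.5
```

Both errors have the same magnitude, but the error on the observed R value contributes five times as much to the weighted loss, so training is pushed harder to preserve observed values.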

<Device operation>
FIG. 4 is a block diagram showing the functional configuration of the image processing apparatus in the first embodiment. More specifically, it shows the functional configuration of the image processing apparatus when performing learning processing. The configuration shown in FIG. 4(a) can be modified or changed as appropriate. For example, one functional unit may be divided into multiple functional units by function, or two or more functional units may be integrated into one. The configuration of FIG. 4(a) may also be implemented by two or more devices. In that case, the devices are connected via a circuit or a wired or wireless network, and they realize each of the processes described below as being performed by the image processing apparatus by communicating data with one another and operating cooperatively.

Each functional unit shown in FIG. 4(a) may be implemented as hardware such as an application-specific integrated circuit (ASIC), as software by having the CPU 101 execute a computer program, or as a combination of hardware and software.

FIG. 5 is a flowchart showing the flow of image conversion processing in the first embodiment. In step S501, the teacher image acquisition unit 401 acquires a teacher image (teacher image data) in RGB format. The acquired teacher image is output to the learning data generation unit 402. The teacher image is generated, for example, according to the method described in Non-Patent Document 1.

FIG. 8 is a diagram explaining the flow of processing in the teacher image generation unit. Specifically, the imaging device 111 acquires a mosaic image 801 (color mosaic image data), and simple demosaicing is applied to it to generate an RGB image 802. Finally, image reduction is applied to generate a teacher image 803.

Bilinear interpolation is used for the simple demosaicing, but other demosaicing methods may be used. A Bayer array is assumed here as the color arrangement of the color filter, but other arrays such as X-Trans may be used.

The teacher images take the form of small images (patches) of a fixed size. If the teacher images are not of a fixed size, or are larger than the size that can be input to the CNN, they are divided into patches in advance.
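Dividing an image into fixed-size patches can be sketched as follows. The function name `split_into_patches`, the non-overlapping tiling, and the choice to discard incomplete patches at the right and bottom borders are assumptions; the source does not specify how borders are handled.

```python
def split_into_patches(image, patch_size):
    """Split image[y][x] into non-overlapping patch_size x patch_size tiles.

    Rows and columns that do not fill a complete tile are discarded.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patches.append([row[left:left + patch_size]
                            for row in image[top:top + patch_size]])
    return patches

# A 4x6 single-channel image whose pixel value encodes its position.
image = [[y * 6 + x for x in range(6)] for y in range(4)]
patches = split_into_patches(image, 2)   # six 2x2 patches
```

The same slicing works unchanged when each pixel is an [R, G, B] list, because only the row and column axes are sliced.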

The RGB-format teacher image may also be acquired by a method other than that of Non-Patent Document 1. For example, an image captured and stored in advance may be read out, or an RGB-format teacher image may be obtained by capturing images while shifting the position of the image sensor.

In step S502, the learning data generation unit 402 generates a mosaic image (student image data) by sub-sampling the received teacher image according to the color filter array pattern.

FIG. 10 is a diagram explaining the flow of generating a mosaic image. Specifically, the R component 1001, G component 1002, and B component 1003 of the teacher image are sub-sampled based on the color filter array 1005 to obtain a student image 1004. The generated student image and the teacher image are paired and output to the demosaic learning unit 405.
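The sub-sampling step can be sketched in plain Python. The names `BAYER_RGGB` and `mosaic` and the `image[y][x][c]` layout are hypothetical; the 2x2 pattern (R at the top-left, G at the two off-diagonal positions, B at the bottom-right) is assumed from the R₀₀, G₀₁, G₁₀, B₁₁ layout described for the mosaic image.

```python
# Channel index (0 = R, 1 = G, 2 = B) observed at each position of the 2x2 tile.
BAYER_RGGB = [[0, 1],
              [1, 2]]

def mosaic(teacher, cfa=BAYER_RGGB):
    """Keep, at each pixel, only the channel selected by the color filter array."""
    return [[px[cfa[y % len(cfa)][x % len(cfa[0])]]
             for x, px in enumerate(row)]
            for y, row in enumerate(teacher)]

# A 2x2 RGB teacher image, indexed teacher[y][x] = [R, G, B].
teacher = [[[1, 2, 3], [4, 5, 6]],
           [[7, 8, 9], [10, 11, 12]]]
student = mosaic(teacher)   # [[1, 5], [8, 12]]
```

Passing a different `cfa` table would produce a mosaic for another filter arrangement of the same periodic kind.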

In step S503, the loss weight calculation unit 403 generates a loss weight map w(x, y, c) based on the color filter array. First, the loss weight calculation unit acquires from a database the color filter array information of the image sensor that produced the teacher image, and calculates the pixel positions at which observed values exist (the specific phase). Next, it creates the loss weight map based on a weight w_a for observed values, which lie at pixel positions of the specific phase, and a weight w_b for non-observed values, which lie at all other pixel positions. If the c-th channel at pixel position (x, y) holds an observed value, then w(x, y, c) = w_a; if it holds a non-observed value, then w(x, y, c) = w_b. For example, when w_a = 5 and w_b = 1, the loss weight map is as shown in FIG. 3(b).
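Building the loss weight map from a color filter array can be sketched as follows. The names `BAYER_RGGB` and `loss_weight_map` and the `map[y][x][c]` layout are hypothetical, and the RGGB pattern is assumed from the R₀₀, G₀₁, G₁₀, B₁₁ layout described earlier. The default weights follow the example values w_a = 5 and w_b = 1.

```python
# Channel index (0 = R, 1 = G, 2 = B) observed at each position of the 2x2 tile.
BAYER_RGGB = [[0, 1],
              [1, 2]]

def loss_weight_map(height, width, cfa=BAYER_RGGB, w_a=5.0, w_b=1.0, channels=3):
    """Return w[y][x][c]: w_a where channel c is observed at (x, y), else w_b."""
    return [[[w_a if cfa[y % len(cfa)][x % len(cfa[0])] == c else w_b
              for c in range(channels)]
             for x in range(width)]
            for y in range(height)]

wmap = loss_weight_map(2, 2)
# wmap[0][0] is [5.0, 1.0, 1.0]: only R is observed at the top-left pixel.

# Setting w_b = 0 turns the map into a mask that ignores non-observed values.
mask = loss_weight_map(2, 2, w_a=1.0, w_b=0.0)
```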

The weights w_a and w_b are given values, but they may be supplied by user input or updated as learning progresses. Setting w_b = 0 prevents non-observed values from contributing to the loss value calculation. Non-observed values may also be removed or masked when the loss value is calculated, which is equivalent to setting w_b = 0. The loss weight for observed values need not be uniformly w_a and may differ from pixel to pixel. The loss weight map is output to the demosaic learning unit 405.

In step S504, the network parameter acquisition unit 404 acquires the network parameters of the CNN used for demosaic learning. The network parameters are the coefficients of the filters that make up the CNN. They are set as random numbers following He's normal distribution, which is a normal distribution with mean 0 and variance σ_h as given by Equation (7) below.
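Since the text calls σ_h the variance, the form consistent with the usual definition of He initialization is sketched below. The corresponding standard deviation is the square root, \sqrt{2 / m_N}.

```latex
\[
\sigma_h = \frac{2}{m_N}
\tag{7}
\]
```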

Here, m_N is the number of neurons of that filter in the CNN. The network parameters may also be determined by other methods. The acquired network parameters are output to the demosaicing learning unit 405.
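Equation (7) is not reproduced in this text. As a hedged sketch only, the following assumes the commonly used form of He initialization, in which the variance is 2 / m_N; that value, the function name `he_normal_filter`, and the treatment of m_N as the number of inputs to one output neuron are assumptions rather than statements taken from the description.

```python
import math
import random

def he_normal_filter(m_n, n_weights, seed=0):
    """Draw n_weights coefficients from a zero-mean normal distribution
    whose variance is assumed to be sigma_h = 2 / m_n."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / m_n)  # random.gauss expects a standard deviation
    return [rng.gauss(0.0, std) for _ in range(n_weights)]

# Example: a 3x3 filter over 3 input channels (m_n = 27 assumed), 64 output channels.
theta = he_normal_filter(m_n=27, n_weights=27 * 64)
```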

In step S505, the demosaicing learning unit 405 trains the CNN using the received image pairs. The method disclosed in Non-Patent Document 1 may be used for this learning.

FIG. 11 is a diagram illustrating the flow of processing in the demosaic network. The CNN is structured as a stack of filters 1102, each of which performs the computation of Equation (1). FIG. 4(b) shows the detailed functional configuration of the demosaicing learning unit in the learning process.

First, the network parameter holding unit 410 receives the network parameters θ from the network parameter acquisition unit 404 and initializes the weights of each layer of the CNN with them.

Next, the demosaicing inference unit 411 inputs the student image 1004 to this CNN and performs demosaic inference. On input, the student image is converted into a three-channel missing image 1101. The R channel of the missing image holds only the R-component pixels of the student image, and all other pixel values are set to the missing value (0). Likewise, the G and B channels record only the G and B pixel values, respectively, with the remaining pixel values set to 0. The missing values may instead be interpolated by a method such as bilinear interpolation.
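The conversion of a student image into the three-channel missing image can be illustrated with the following minimal sketch. The RGGB tile layout and the function name `to_missing_image` are assumptions for illustration; the bilinear interpolation option is not shown.

```python
# Hypothetical 2x2 Bayer tile (RGGB assumed): observed channel index per tile position.
BAYER_RGGB = ((0, 1),
              (1, 2))

def to_missing_image(mosaic, tile=BAYER_RGGB):
    """Convert mosaic[y][x] into missing[y][x][c], leaving 0 (the missing value)
    in every channel that was not observed at that pixel."""
    missing = []
    for y, row in enumerate(mosaic):
        out_row = []
        for x, value in enumerate(row):
            pixel = [0.0, 0.0, 0.0]
            pixel[tile[y % len(tile)][x % len(tile[0])]] = value
            out_row.append(pixel)
        missing.append(out_row)
    return missing

# Example: a single 2x2 Bayer tile.
student = [[0.1, 0.2],
           [0.3, 0.4]]
missing = to_missing_image(student)
```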

Next, the filters 1102 are applied sequentially to the missing image to calculate a feature map. The concatenation layer 1103 then concatenates the calculated feature map and the missing image 1101 along the channel direction. If the feature map has n_1 channels and the missing image has n_2 channels, the concatenation result has (n_1 + n_2) channels. Further filters are applied to this result, and the final filter outputs three channels, yielding the demosaic result image 1104 (result image data).

Next, the loss value calculation unit 412 calculates the error (difference) between the obtained demosaic result image and the teacher image, and calculates the loss value by taking the weighted average of that error over the entire image. Specifically, using the loss weight map acquired from the loss weight calculation unit 403, it calculates the loss value l_i of the i-th demosaic result image according to Equation (5). The final loss value L(θ) is then calculated according to Equation (8) below.

This expression is equivalent to Equation (3). In addition to the loss value of Equation (8), one or more other loss values may be calculated by a method such as that of Non-Patent Document 1, and a linear combination of them may be used as the final loss value. The obtained loss value is output to the network parameter update unit 413.
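Equations (5) and (8) are not reproduced in this text, so the following is only a sketch of a weighted loss under stated assumptions: the per-pixel error is taken as an absolute difference, the per-image loss l_i is normalized by the sum of the weights, and L(θ) is the mean of l_i over the images. The function names are hypothetical.

```python
def weighted_loss(result, teacher, weights):
    """Weighted average of |result - teacher| over all (y, x, c).
    The absolute-difference error and the normalization are assumptions.
    At least one weight must be non-zero."""
    num = 0.0
    den = 0.0
    for res_row, tea_row, w_row in zip(result, teacher, weights):
        for res_px, tea_px, w_px in zip(res_row, tea_row, w_row):
            for r, t, w in zip(res_px, tea_px, w_px):
                num += w * abs(r - t)
                den += w
    return num / den

def batch_loss(results, teachers, weights):
    """Assumed form of the final loss: the mean of the per-image losses l_i."""
    losses = [weighted_loss(r, t, weights) for r, t in zip(results, teachers)]
    return sum(losses) / len(losses)
```

With w_a > w_b, an error at an observed position raises the loss more than an equal error at a non-observed position, which is what steers learning toward preserving observed values.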

Next, the network parameter update unit 413 updates the network parameters from the calculated loss value by, for example, error backpropagation. This technique computes the derivative of the loss value and changes the network parameters so that the loss value decreases. The updated network parameters are stored in the network parameter holding unit 410 and output to the inspection unit 406.

In step S506, the inspection unit 406 determines whether learning is complete. For this determination, test charts, which serve as test image data, are prepared from a group of image data (chart images), such as landscape and portrait photographs, that was not used for learning. A test chart is a mosaic image containing regions where observed values are prone to change, such as high-luminance or high-saturation regions. The test chart is demosaiced with the trained CNN to generate test result image data, and the degree of change ε of the observed values is evaluated according to Equation (9) below.

Here, X_i is the i-th chart image, and F^-1 is the function that mosaics an RGB image as shown in FIG. 10. ||Z||_p is the p-norm with p = 1, although p may take other values. If the degree of change ε is less than a given threshold, learning is determined to be complete. When selecting test charts, ε may be calculated for a plurality of candidate images according to Equation (9), and candidates yielding a large ε may be chosen as test charts.
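Equation (9) is not reproduced in this text. The sketch below therefore assumes one plausible form of the degree of change: each mosaic test chart X_i is demosaiced, the result is mosaicked again with F^-1, and the p-norm of the difference from X_i is averaged over the charts. This form, the RGGB layout, and the names `mosaic`, `change_degree`, and `copy_demosaic` are assumptions; `copy_demosaic` merely stands in for the trained CNN.

```python
# Hypothetical 2x2 Bayer tile (RGGB assumed): observed channel index per tile position.
BAYER_RGGB = ((0, 1),
              (1, 2))

def mosaic(rgb, tile=BAYER_RGGB):
    """F^-1: keep only the observed channel at each pixel (rgb[y][x][c] -> m[y][x])."""
    return [[px[tile[y % len(tile)][x % len(tile[0])]] for x, px in enumerate(row)]
            for y, row in enumerate(rgb)]

def change_degree(charts, demosaic, p=1):
    """Assumed form of epsilon: mean p-norm of F^-1(demosaic(X_i)) - X_i."""
    total = 0.0
    for chart in charts:
        remosaicked = mosaic(demosaic(chart))
        powered = [abs(a - b) ** p
                   for row_a, row_b in zip(remosaicked, chart)
                   for a, b in zip(row_a, row_b)]
        total += sum(powered) ** (1.0 / p)
    return total / len(charts)

def copy_demosaic(chart):
    """Stand-in for the trained CNN: copies each observed value to all channels."""
    return [[[v, v, v] for v in row] for row in chart]
```

Learning would be judged complete when `change_degree(charts, cnn)` falls below the given threshold.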

The criterion for learning completion is not limited to this. For example, the criterion may be whether the amount of change in the network parameters at an update is smaller than a specified value, or whether the residual between the inference result and the teacher image is smaller than a specified value. Learning may also be deemed complete when the number of learning iterations (network parameter updates) reaches a specified value. When learning is complete, the updated network parameters are output to the learning result storage unit 407.

If learning is not complete, processing returns to step S505 and the learning process is performed again using the acquired learning data. Not all of the learning data need be used: each time processing returns to step S505, a subset of images may be randomly extracted from the learning data and used for learning. A new group of teacher images may also be acquired from the teacher image acquisition unit 401.

In step S507, the learning result storage unit 407 stores the received network parameters.

This completes the learning process. When demosaic inference (demosaic processing) is to be performed using the learning result, processing proceeds to the following steps. In that case, the learning result storage unit 407 outputs the network parameters to the demosaicing inference unit 409. FIG. 4(c) shows the functional configuration of the image processing device when performing inference processing (demosaic processing).

In step S508, the input image acquisition unit 408 captures an image with the imaging device 111 and acquires the mosaic image (input image data) to be demosaiced. The input image may instead be read from images captured and stored in advance. The acquired input image is output to the demosaicing inference unit 409.

In step S509, the demosaicing inference unit 409 constructs the same CNN as the one used for learning in the demosaicing learning unit 405, and initializes its network parameters with those received from the learning result storage unit 407. The received input image is input to this CNN, and an output image is generated in the same way as in the demosaicing learning unit 405 to obtain the inference result (output image data).

As described above, according to the first embodiment, learning is performed so as to reduce the error (loss value) of a loss function into which loss weights are introduced. This makes it possible to train a CNN that suppresses changes in observed values before and after demosaic processing. Performing inference (demosaic processing) with the trained CNN therefore makes it possible to output a more suitable demosaic image.

(Second embodiment)
In the second embodiment, another form of CNN learning is described. In the first embodiment, learning was performed with the loss weight w_a of observed values set larger than the loss weight w_b of non-observed values. That is, defining the amplification factor r as r = w_a / w_b, learning was performed with r > 1.

However, learning with a large r can produce a CNN that causes image quality defects such as moire, particularly in high-frequency regions. To suppress such defects, learning with r = 1 is desirable. The second embodiment therefore shows an example that performs both learning with r = 1 and learning with r > 1. Parts in common with the first embodiment are not described again; the description below focuses on the differences.

<Device operation>
FIG. 12 is a diagram illustrating the flow of learning processing in the second embodiment. As shown in the figure, pre-learning with r = 1 is performed first to suppress image quality defects in high-frequency regions. Main learning with r > 1 is then performed to suppress changes in observed values.

FIG. 6(a) is a block diagram showing the functional configuration of the image processing device in the second embodiment. FIG. 7 is a flowchart showing the flow of image conversion processing in the second embodiment.

In step S701, the loss weight calculation unit 403 generates a loss weight map under the condition r = 1. That is, every value in the loss weight map is 1 (or a given constant). r only needs to be approximately 1 and may be set to a value such as r = 1.1. The loss weight map is output to the demosaicing learning unit 405.

In step S702, the demosaicing learning unit 405 trains the CNN based on the received loss weight map. The training method is the same as in step S505. This training is the pre-learning of the CNN.

In step S703, the learning result storage unit 407 stores the network parameters obtained by the pre-learning.

In step S704, the loss weight calculation unit 403 generates a loss weight map under the condition r > 1. For example, with r = 5, the loss weight map of FIG. 3(b) is obtained. The loss weight map is output to the demosaicing learning unit 405.

In step S705, the network parameter acquisition unit 404 reads the network parameters obtained by the pre-learning from the learning result storage unit 407 and outputs them to the demosaicing learning unit 405.

In step S706, the demosaicing learning unit 405 initializes the weights of the CNN with the received pre-learning network parameters, and then performs the main learning of the CNN based on the received loss weight map with r > 1. The network parameters obtained as the learning result are output to the learning result storage unit 407 after the completion determination.

In the description above, learning is first performed with the amplification factor set to r = 1 and then with r > 1. The order in which the two settings are used is not limited to this, however; for example, learning may be performed in the reverse order or in alternation. The number of learning stages is also not limited to two, and learning may be performed in multiple stages while increasing the value of r.

Furthermore, the value of r may be increased as learning progresses. For example, r may be increased in step with an increase in the number of epochs (the number of learning repetitions) or a decrease in the loss value. In this case, each time r is updated, the loss weight calculation unit 403 regenerates the loss weight map and the loss value calculation unit 412 reloads it.
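One way to tie r to the number of epochs can be sketched as follows. The linear ramp, the parameter names `warmup_epochs`, `ramp_epochs`, and `r_max`, and their default values are assumptions chosen for illustration; the description only states that r may be increased as learning progresses.

```python
def amplification_factor(epoch, warmup_epochs=10, ramp_epochs=20, r_max=5.0):
    """Return r for a given epoch: r = 1 during pre-learning, then a linear
    ramp up to r_max (the shape of the schedule is an assumption)."""
    if epoch < warmup_epochs:
        return 1.0
    progress = min(1.0, (epoch - warmup_epochs) / ramp_epochs)
    return 1.0 + (r_max - 1.0) * progress

def weights_for(r, w_b=1.0):
    """Recover (w_a, w_b) from the definition r = w_a / w_b."""
    return r * w_b, w_b

# The loss weight map would be regenerated whenever r changes.
schedule = [amplification_factor(e) for e in (0, 10, 20, 30, 40)]
```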

As described above, according to the second embodiment, learning is performed for each of r = 1 and r > 1. This makes it possible to train a CNN with fewer image quality defects, for example in high-frequency regions, than in the first embodiment.

(Modification A)
In the first and second embodiments, the loss weight of observed values was set to the same value for every channel, as in FIG. 3(b). In the Bayer array, however, there are twice as many G pixels as R pixels or B pixels, so the G channel contributes about twice as much to the loss value as the R or B channel. This modification therefore describes, with reference to FIG. 6(a), an example that corrects the weights so that the contributions become equal.

FIG. 6(a) shows the functional configuration of the image processing device when performing learning processing and is the same as FIG. 4(a), except that the internal configuration of the loss weight calculation unit 403 differs from that of the embodiments described above.

FIG. 6(b) shows the detailed functional configuration of the loss weight calculation unit in the learning process. FIG. 9(a) is a diagram illustrating the loss weight map.

The existing pixel count calculation unit 601 calculates the ratio of the numbers of pixels for each color channel of the color filter array. In the Bayer array, the pixel counts are R:G:B = 1:2:1. The color amplification factors r_R, r_G, and r_B are defined as the pixel count of the most numerous color divided by the pixel count of each color. This calculation is expressed by Equation (10) below.

Here, n_c is the number of pixels of channel c. Using this, the loss weight of observed values is calculated as r_c w_a (c ∈ {R, G, B}). The loss weight map calculation unit 603 generates the loss weight map based on this result. In the example result 902, r_R = r_B = 2, so the loss weights of R and B are twice that of G.

The color filter array to which the above embodiments and modifications can be applied is not limited to the Bayer array. For X-Trans, for example, the pixel counts are 2:5:2, and the loss weights may be calculated from this ratio. The loss weights of non-observed values may also be corrected based on the color amplification factors.

The method of calculating the color amplification factor r_c is not limited to Equation (10). For example, the color amplification factors may be set so that the mean or variance of the loss weights within each channel becomes uniform. A per-pixel random number may also be added to the color amplification factor. Furthermore, although r_c in Equation (10) is proportional to the pixel count n_c raised to the power -1, it may instead be proportional to n_c raised to the power k (k being any real number).
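The color amplification factor described above can be sketched as follows. Equation (10) itself is not reproduced in this text; the form r_c = (max n) / n_c follows the verbal definition given for it, and the exponent parameter `k`, the function name `color_amplification`, and the dictionary layout are assumptions for illustration.

```python
def color_amplification(pixel_counts, k=-1.0):
    """Return r_c = (n_c / max n) ** k for each channel c.
    With k = -1 this is (max n) / n_c, matching the verbal definition."""
    n_max = max(pixel_counts.values())
    return {c: (n / n_max) ** k for c, n in pixel_counts.items()}

# Bayer array: pixel counts R:G:B = 1:2:1.
bayer = color_amplification({"R": 1, "G": 2, "B": 1})

# X-Trans: pixel counts R:G:B = 2:5:2.
x_trans = color_amplification({"R": 2, "G": 5, "B": 2})

# Loss weight of observed values per channel, r_c * w_a, with w_a = 5.
w_a = 5.0
observed_weights = {c: r * w_a for c, r in bayer.items()}
```

For the Bayer array this gives r_R = r_B = 2 and r_G = 1, so the R and B observed values are weighted twice as heavily as the G observed values.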

Using the loss weight map generated in this way makes it possible to output a more suitable demosaic image in which the difference in contribution between color channels is compensated.

(Modification B)
In the first and second embodiments, the loss weights of pixels of the same phase were set uniformly regardless of position. In a CNN that uses padding, however, pixels near the edges of the demosaic result image are affected by the padding. Because this can make the inference result incorrect at those pixels, their contribution to the loss value should be reduced. This modification therefore describes, with reference to the block diagram of FIG. 6(b), an example that attenuates the loss weights according to the influence of padding.

The architecture information acquisition unit 602 acquires information on the architecture of the CNN. From this information, it calculates the padding width, that is, the number of pixels, counted from the image edge, that are affected by padding.

For example, in a CNN composed of two layers of 3×3 convolution filters, each filter layer causes one pixel at the edge to be affected by padding, so the overall padding width is two pixels. Pixels within a distance of 2 from the edge of the loss weight map are therefore multiplied by a given attenuation coefficient r_d, and the loss weight of observed values is calculated as r_d r_c w_a.

The loss weight map calculation unit 603 generates the loss weight map based on this result. FIG. 9(b) shows an example of the R-channel map when the loss weight map is calculated for an 8×8 image. Because the influence of padding grows toward the image edge, the loss weight map applies an attenuation coefficient of r_d = 0.2 to pixels at distance 1 from the image edge and r_d = 0.6 to pixels at distance 2. The same attenuation correction is applied to non-observed pixels.
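The padding width and the attenuation coefficients can be sketched as follows. The sketch assumes stride-1 convolutions with odd kernel sizes, and it assumes that "distance 1" denotes the outermost ring of pixels; the function names `padding_width` and `attenuation_map` are hypothetical.

```python
def padding_width(kernel_sizes):
    """Number of pixels from the image edge affected by padding, assuming
    stride-1 convolutions with odd kernel sizes (one entry per layer)."""
    return sum((k - 1) // 2 for k in kernel_sizes)

def attenuation_map(height, width, factors=(0.2, 0.6)):
    """Return r_d[y][x]: factors[d - 1] for pixels at distance d from the
    nearest image edge (d = 1 is the outermost ring, an assumed convention),
    and 1.0 for pixels farther from the edge than len(factors)."""
    r_d = []
    for y in range(height):
        row = []
        for x in range(width):
            dist = min(x, y, width - 1 - x, height - 1 - y) + 1
            row.append(factors[dist - 1] if dist <= len(factors) else 1.0)
        r_d.append(row)
    return r_d

# Two layers of 3x3 filters give a padding width of 2, as in the example above.
width_px = padding_width([3, 3])
r_d = attenuation_map(8, 8)
```

Each entry of `r_d` would then be multiplied into the corresponding loss weight, for both observed and non-observed pixels.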

Using the loss weight map generated in this way makes it possible to output a more suitable demosaic image in which the influence of padding is compensated.

(Other embodiments)
The present invention can also be realized by processing in which a program that implements one or more functions of the above embodiments is supplied to a system or device via a network or a storage medium, and one or more processors in a computer of that system or device read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.

The invention is not limited to the embodiments described above, and various changes and modifications are possible without departing from the spirit and scope of the invention. The claims are therefore appended to make the scope of the invention public.

400 image processing device; 401 teacher image acquisition unit; 402 learning data generation unit; 403 loss weight calculation unit; 404 network parameter acquisition unit; 405 demosaicing learning unit; 406 inspection unit; 407 learning result storage unit

Claims

1. An image processing device that performs image processing using a neural network (NN), comprising:
acquisition means for acquiring teacher image data;
generation means for generating student image data obtained by mosaicking the teacher image data;
weight acquisition means for acquiring a loss weight map indicating the weight of the loss value at each pixel position of the teacher image data; and
learning means for learning demosaic processing based on the loss weight map and on the difference between the teacher image data and result image data obtained by inputting the student image data to the NN,
wherein, in the loss weight map, the weight of the loss value at a pixel position of a specific phase in the teacher image data is different from the weight of the loss value at a pixel position other than the specific phase.

2. The image processing apparatus according to claim 1, wherein the specific phase is a phase in which pixel values obtained by actual observation exist in the teacher image data.

The teacher image data is color mosaic image data obtained by an imaging device having a color filter,
3. The image processing apparatus according to claim 2, wherein said specific phase is determined by a color arrangement in said color filter.

4. The image processing device according to claim 2, wherein, in the loss weight map, the weight of the loss value at the pixel position of the specific phase is greater than the weight of the loss value at the pixel position other than the specific phase.

5. The image processing apparatus according to claim 4, wherein in said loss weight map, the weight of the loss value at pixel positions other than said specific phase is zero.

The learning means is
a parameter acquiring means for acquiring network parameters of the NN;
inference means for inputting the student image data to the NN in which the network parameters are set, applying inference of demosaic processing, and acquiring the result image data;
loss value calculation means for calculating a loss value based on the difference between the result image data and the teacher image data and the loss weight map;
updating means for updating the network parameters based on the loss value;
6. The image processing apparatus according to any one of claims 1 to 5, comprising:

7. The image processing apparatus according to claim 6, wherein said updating means updates said network parameters by error backpropagation using the loss value calculated by said loss value calculating means.

input image acquisition means for acquiring input image data;
output image generating means for inputting the input image data to the NN set with network parameters obtained by learning by the learning means and applying inference of demosaic processing to generate output image data;
8. The image processing apparatus according to any one of claims 1 to 7, further comprising:

9. The image processing apparatus according to claim 1, further comprising inspection means for inspecting the network parameters based on test result image data obtained by inputting given test image data to the NN set with network parameters obtained by learning by the learning means and applying inference of demosaic processing.

The learning means is
a first learning means for performing learning by setting an amplification factor, which is a ratio of the weight of the loss value at the pixel position of the specific phase and the weight of the loss value at the pixel position other than the specific phase, to 1;
a second learning means for performing learning by setting the amplification factor to be greater than 1 using the network parameters obtained by the learning by the first learning means as initial values;
10. The image processing apparatus according to any one of claims 1 to 9, comprising:

The teacher image data is color mosaic image data obtained by an imaging device having a color filter,
11. The image processing apparatus according to any one of claims 1 to 10, wherein, in said loss weight map, the weight of the loss value at each pixel position of said teacher image data is determined based on the ratio of the number of pixels for each color channel in said color filter.

12. The image processing device according to any one of claims 1 to 11, wherein, in the loss weight map, the weight of a loss value at a pixel position affected by padding in the neural network is less than the weight of a loss value at a pixel position not affected by padding in the neural network.

A learning method in an image processing device that performs image processing using a neural network (NN),
an acquisition step of acquiring teacher image data;
a generating step of generating student image data obtained by mosaicking the teacher image data;
a weight acquisition step of acquiring a loss weight map indicating the weight of the loss value at each pixel position of the teacher image data;
a learning step of learning demosaic processing based on the difference between the result image data obtained by inputting the student image data to the NN and the teacher image data and the loss weight map;
including
A learning method, wherein, in the loss weight map, a weight of a loss value at a pixel position of a specific phase in the teacher image data is different from a weight of a loss value at a pixel position other than the specific phase.

A program for causing a computer to function as the image processing apparatus according to any one of claims 1 to 12.