JP2018165948A

JP2018165948A - Image recognition device, image recognition method, computer program, and product monitoring system

Info

Publication number: JP2018165948A
Application number: JP2017063675A
Authority: JP
Inventors: 金輝陳; Jinhui Chen; 貴志上東; Takashi Kamihigashi; 宗彦伊藤; Munehiko Ito; 泰郎高槻; Yasuo Takatsuki
Original assignee: Kobe University NUC
Current assignee: Kobe University NUC
Priority date: 2017-03-28
Filing date: 2017-03-28
Publication date: 2018-10-25
Anticipated expiration: 2037-03-28
Also published as: JP6964857B2

Abstract

PROBLEM TO BE SOLVED: To provide an image recognition device, an image recognition method, a computer program, and a product monitoring system that improve the accuracy of image recognition by a hierarchical neural network. SOLUTION: An image recognition device 10 comprises an arithmetic processing unit 1 having a data generation unit 3 that performs predetermined data processing on an original image to generate input data, and an image processing unit 2 having a hierarchical neural network that recognizes the type of an object included in the generated input data. The image processing unit performs a process of learning the parameters of the network based on the recognition result of the network if the original image is a sample image, and performs a process of outputting the recognition result of the network if the original image is a target image to be recognized. The data processing performed by the data generation unit is a process of imparting at least one invariance of rotation and inversion to the original image. SELECTED DRAWING: Figure 1

Description

The present invention relates to an image recognition apparatus, an image recognition method, a computer program, and a product monitoring system. Specifically, it relates to an image processing technique for improving the accuracy of image recognition using a hierarchical convolutional neural network.

In recent years, the performance of image recognition by deep learning has improved dramatically. Deep learning is a general term for machine learning that uses multilayer hierarchical neural networks. A convolutional neural network (hereinafter also referred to as "CNN"), for example, is used as the multilayer hierarchical neural network.

A CNN has a multilayer stacked structure in which convolution layers over local regions and pooling layers are repeated, and this stacked structure is said to improve image recognition performance.
As shown in Non-Patent Documents 1 and 2, object classes are already being recognized by deep learning using convolutional neural networks.

Non-Patent Document 1: A. Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks," in Proc. Adv. Neural Inf. Proc. Syst. (NIPS), 2012, pp. 1097-1105
Non-Patent Document 2: K. Simonyan et al., "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv:1409.1556v6 [cs.CV], 10 Apr. 2015

In image recognition using a convolutional neural network, either the pixel values (RGB values) of the original image are input to the network as they are, without preprocessing, or principal component analysis (PCA) is applied to the pixel values.
Thus, conventionally, the pixel values (raw data) of the original image are used as they are, or only preprocessing that extracts a single feature factor from the original image is performed. To improve recognition accuracy, it is therefore necessary to collect a large number of sample images, including sample images that show the same class in many different forms.

In particular, sample images containing rotated or inverted objects are rare. Consequently, even if a CNN is trained repeatedly on sample images in the normal orientation, the recognition accuracy for rotated or inverted objects cannot be improved much.
In view of these conventional problems, an object of the present invention is to improve the accuracy of image recognition by a hierarchical neural network.

(1) An image recognition apparatus according to the present invention comprises: a data generation unit that performs predetermined data processing on an original image to generate input data; and an image processing unit having a hierarchical neural network that recognizes the type of an object included in the generated input data. When the original image is a sample image, the image processing unit performs a process of learning the parameters of the network based on the recognition result of the network; when the original image is a target image to be recognized, the image processing unit performs a process of outputting the recognition result of the network. The data processing performed by the data generation unit is a process of imparting at least one invariance of rotation and inversion to the original image.

According to the image recognition apparatus of the present invention, the data processing performed by the data generation unit consists of a process of imparting at least one invariance of rotation and inversion to the original image.
Therefore, when learning with the same number of sample images, the accuracy of image recognition by the hierarchical neural network can be improved compared with the case where the original image is used as input data as it is, without the above data processing (see FIG. 10). In addition, even rotated or inverted objects can be recognized accurately.

(2) In the image recognition apparatus of the present invention, specifically, the data processing performed by the data generation unit includes a first process and a second process defined below.
First process: a process of generating an image filter having at least one invariance of rotation and inversion with respect to the original image
Second process: a process of convolving the image filter generated in the first process with the original image

(3) More specifically, the image filter consists of the elements of a plurality of color vectors obtained by dividing the color vector of an arbitrary pixel point, defined in polar coordinates whose origin is a predetermined point of the original image, into arbitrary directions that open at a predetermined angle from that pixel point.
The reason is that the plurality of color vectors obtained by dividing the color vector of a pixel point expressed in polar coordinates in this way remain equivalent under rotation by any angle about the origin, and therefore have at least one invariance of rotation and inversion with respect to the original image.

(4) Still more specifically, the image filter preferably consists of the elements of two color vectors obtained by dividing the color vector of an arbitrary pixel point, defined in polar coordinates whose origin is a predetermined point of the original image, into the radial direction and the tangential direction starting from that pixel point.
The reason is that decomposing the color vector into the two directions, radial and tangential, minimizes the number of calculation parameters and reduces the processing load on the data generation unit.
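The invariance argument in (3) and (4) can be checked numerically. The following is a minimal sketch and not the patent's filter construction: it assumes that the color vector at a pixel point can be treated as a 2-D vector (vx, vy), and that rotating the image about the origin rotates both the pixel position and the vector by the same angle. The function names `radial_tangential` and `rotate` are hypothetical.

```python
import math


def radial_tangential(px, py, vx, vy):
    """Split the vector (vx, vy) attached to the point (px, py) into its
    radial and tangential components relative to the polar origin (0, 0)."""
    r = math.hypot(px, py)
    if r == 0.0:
        raise ValueError("the radial direction is undefined at the origin")
    # Unit vectors along the radial and tangential directions at (px, py).
    er = (px / r, py / r)
    et = (-py / r, px / r)
    radial = vx * er[0] + vy * er[1]
    tangential = vx * et[0] + vy * et[1]
    return radial, tangential


def rotate(x, y, theta):
    """Rotate (x, y) counterclockwise about the origin by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return c * x - s * y, s * x + c * y


# Assumed example values: a pixel point and the vector attached to it.
point = (3.0, 4.0)
vector = (1.0, 2.0)
before = radial_tangential(*point, *vector)

# Rotate the point and its vector together, then decompose again.
theta = math.radians(37.0)
after = radial_tangential(*rotate(*point, theta), *rotate(*vector, theta))
```

Under these assumptions, `before` and `after` agree up to floating-point error, which is the sense in which the two components are unaffected by rotation about the origin.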

(5) In the image recognition apparatus of the present invention, specifically, the hierarchical neural network is a convolutional neural network.
The reason is that, among hierarchical neural networks, a convolutional neural network can achieve high performance in image recognition.

(6) In the image recognition apparatus of the present invention, the object whose type is recognized may be at least one of a handwritten character, a human, an animal, a plant, and a product.
The reason is that the data processing that characterizes the present invention, namely imparting at least one invariance of rotation and inversion to the original image, is considered applicable to a variety of objects regardless of the attributes of the object included in the original image. Accordingly, the scope of application of the image recognition apparatus of the present invention is not limited to the recognition of a specific object.

(7) A computer program of the present invention is a computer program for causing a computer to function as the image recognition apparatus according to any one of (1) to (6) above.
Accordingly, the computer program of the present invention provides the same operational effects as the image recognition apparatus according to any one of (1) to (6) above.

(8) An image recognition method of the present invention is the image recognition method executed by the image recognition apparatus according to any one of (1) to (6) above.
Accordingly, the image recognition method of the present invention provides the same operational effects as the image recognition apparatus according to any one of (1) to (6) above.

(9) A product monitoring system of the present invention comprises: an imaging device that photographs a plurality of products; a robot device that takes any of the photographed products out to the outside; and a control device that instructs the robot device which product to take out. The control device includes the image recognition apparatus according to any one of (1) to (6) above, and the image recognition apparatus instructs the robot device to take out a product that it has recognized as defective.

According to the product monitoring system of the present invention, the image recognition apparatus instructs the robot device to take out a product that it has recognized as defective. Therefore, by training the image recognition apparatus with an appropriate number of sample images, defective products can be taken out automatically and accurately.

The present invention can be realized not only as a system and an apparatus having the characteristic configurations described above, but also as a computer program for causing a computer to execute such characteristic configurations.
The present invention can also be realized as one or more semiconductor integrated circuits that implement part or all of the system and the apparatus.

According to the present invention, the accuracy of image recognition by a hierarchical neural network can be improved.

FIG. 1 is a block diagram of an image recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a schematic configuration diagram of the CNN included in the CNN processing unit.
FIG. 3 is a conceptual diagram of the processing of a convolution layer.
FIG. 4 is a conceptual diagram of the structure of a receptive field.
FIG. 5 is an explanatory diagram of the first process performed by the data generation unit.
FIG. 6 is an explanatory diagram of the second process performed by the data generation unit.
FIG. 7 is an explanatory diagram of, among other things, a rotation point obtained by rotating an arbitrary pixel point by a predetermined angle.
FIG. 8 is a structural diagram of the deep CNN constructed in the CNN processing unit.
FIG. 9 is a diagram showing an example of the handwritten characters used in the simulation experiment.
FIG. 10 is a graph showing the test results of recognition accuracy for each character class.
FIG. 11 is an overall configuration diagram of a product monitoring system according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. At least some of the embodiments described below may be combined arbitrarily.

[Overall configuration of image processing apparatus]
FIG. 1 is a block diagram of an image recognition apparatus 10 according to an embodiment of the present invention.
As shown in FIG. 1, the image recognition apparatus 10 of this embodiment includes an arithmetic processing unit 1 and an image processing unit 2 mounted on, for example, a PC (Personal Computer), not shown.

The arithmetic processing unit 1 includes a CPU (Central Processing Unit). The arithmetic processing unit 1 may have one CPU or several, and may include an integrated circuit such as an FPGA (Field-Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit).
The arithmetic processing unit 1 also includes a RAM (Random Access Memory). The RAM is composed of memory elements such as SRAM (Static RAM) or DRAM (Dynamic RAM), and temporarily stores the computer programs executed by the CPU and the data needed to execute them.

The image processing unit 2 includes a GPU (Graphics Processing Unit). The image processing unit 2 may have one GPU or several, and may include an integrated circuit such as an FPGA or an ASIC.
The image processing unit 2 also includes a RAM. The RAM is composed of memory elements such as SRAM or DRAM, and temporarily stores the computer programs executed by the GPU and the data needed to execute them.

The arithmetic processing unit 1 includes a data generation unit 3 that generates input images for the CNN processing unit 4. The data generation unit 3 is a functional unit realized by the CPU executing a computer program for arithmetic processing recorded in the memory.
The data generation unit 3 generates an input image (hereinafter also referred to as "input data") for the CNN processing unit 4 by applying the first and second processes below to a labeled sample image 7 or a captured image to be identified (hereinafter also referred to as "target image") 8.

First process: a process of generating, from the sample image 7 or the target image 8, an image filter 9 that has rotation and inversion invariance with respect to that image 7 or 8 (see FIG. 5)
Second process: a process of convolving the image filter 9 generated in the first process with the sample image 7 or the target image 8 (see FIG. 6)

Hereinafter, the "sample image" and the "target image" input to the data generation unit 3 are also collectively referred to as the "original image". The original image refers only to the object region image to be processed (for training or recognition).
The data generation unit 3 inputs the input data obtained by applying the first and second processes to the original images 7 and 8 to the CNN processing unit 4 in the image processing unit 2 at the subsequent stage.

The image processing unit 2 includes a CNN processing unit 4, a learning unit 5, and a recognition unit 6 as functional units realized by the GPU executing an image processing computer program recorded in the memory.
The CNN processing unit 4 recognizes the type of object included in the input data (for example, the type of character included in the input image) and inputs the recognition result (specifically, the probability for each classification class, for example) to the learning unit 5 or the recognition unit 6.

Specifically, when the CNN is trained using labeled sample images 7, the CNN processing unit 4 identifies the classification class of the sample image 7 and inputs the identified classification class to the learning unit 5.
On the other hand, when the trained CNN processing unit 4 is made to identify the classification class of the target image 8, that is, when the image processing unit 2 operates as a classifier, the CNN processing unit 4 inputs the identified classification class to the recognition unit 6.

The learning unit 5 updates the parameters (weights and biases) held by the CNN processing unit 4 based on the input classification class, and stores the updated parameters in the CNN processing unit 4.
The recognition unit 6 outputs a recognition result based on the input classification class. Specifically, it outputs the classification class with the highest probability input from the CNN processing unit 4 as the classification class of the target image 8. The recognition result output by the recognition unit 6 is reported to the PC operator, for example by being displayed on the PC display.
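The data flow among the data generation unit 3, the CNN processing unit 4, the learning unit 5, and the recognition unit 6 can be sketched as follows. This is only an illustration of the routing described above, assuming that each unit is passed in as a plain callable. The names `run_image_recognition` and `recognize_top_class` and the callable signatures are hypothetical and are not taken from the patent.

```python
def recognize_top_class(probabilities):
    """Recognition unit 6: return the index of the most probable class."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


def run_image_recognition(original_image, label, generate_input,
                          cnn_forward, learn, recognize):
    """Route one original image through the apparatus.

    label is a class label for a sample image and None for a target image.
    """
    input_data = generate_input(original_image)   # data generation unit 3
    probabilities = cnn_forward(input_data)       # CNN processing unit 4
    if label is not None:
        # Sample image: hand the result to the learning unit 5.
        learn(probabilities, label)
        return None
    # Target image: hand the result to the recognition unit 6.
    return recognize(probabilities)
```

Passing the units as callables keeps the sketch independent of any particular network or training procedure.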

[Processing content of CNN processing unit]
(Configuration example of CNN)
FIG. 2 is a schematic configuration diagram of the CNN included in the CNN processing unit 4.
As shown in FIG. 2, the CNN constructed in the CNN processing unit 4 includes three kinds of arithmetic processing layers, namely convolution layers C1 and C2 (also referred to as "down-sampling layers"), pooling layers P1 and P2, and a fully connected layer F, as well as a final layer E that is the output layer of the CNN.

The pooling layers P1 and P2 are placed after the convolution layers C1 and C2, respectively, and the fully connected layer F is placed after the last pooling layer P2. The final layer E of the CNN contains as many final nodes as the preset number of classification classes (10 in FIG. 2).
FIG. 2 illustrates the case of two convolution layers C1 and C2 with two corresponding pooling layers P1 and P2. However, there may be three or more convolution layers and pooling layers. At least one fully connected layer F is provided.

The j-th node in a given layer C1, P1, C2, or P2 receives an input x_i (i = 1, 2, ..., m) from each of the m nodes in the immediately preceding layer, and calculates an intermediate variable u_j by adding a bias to the weighted sum of these inputs. That is, the intermediate variable u_j is calculated by the following equation, where w_ij is a weight and b_j is a bias.
[Math 1]
$$u_j = \sum_{i=1}^{m} w_{ij} x_i + b_j$$
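The weighted sum with bias described above can be sketched in a few lines. The function name `node_pre_activation` and the numeric values are assumptions chosen for illustration. Python lists are indexed from 0, whereas the text counts inputs from i = 1.

```python
def node_pre_activation(inputs, weights, bias):
    """Return u_j: the weighted sum of the inputs x_i plus the bias b_j.

    inputs  -- the m values x_i received from the preceding layer
    weights -- the m weights w_ij attached to those inputs
    bias    -- the bias b_j of this node
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(w * x for w, x in zip(weights, inputs)) + bias


# Assumed example: m = 3 inputs.
u_j = node_pre_activation([1.0, 2.0, 3.0], [0.5, -1.0, 2.0], 0.25)
```

With these values, u_j = 0.5 - 2.0 + 6.0 + 0.25 = 4.75.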

The output of a node in this layer is the response y_j obtained by applying the activation function a(·), which is a nonlinear function, to the intermediate variable u_j, that is, y_j = a(u_j). This output is input to the next layer.
As the activation function a, a "sigmoid function" or a(x_j) = max(x_j, 0), for example, is used. The latter activation function in particular is called "ReLU (Rectified Linear Unit)". ReLU has often been used in recent years because it contributes to good convergence and faster learning.
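The two activation functions named above can be written as follows. This is a minimal sketch: the text does not give a formula for the sigmoid, so the standard logistic function 1 / (1 + exp(-u)) is assumed here, and the function names are hypothetical.

```python
import math


def sigmoid(u):
    """Logistic sigmoid, written so that exp() never overflows."""
    if u >= 0.0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)  # u < 0, so e lies in [0, 1)
    return e / (1.0 + e)


def relu(u):
    """ReLU (Rectified Linear Unit): a(u) = max(u, 0)."""
    return max(u, 0.0)
```

The sigmoid squashes any input into the range 0 to 1, while ReLU passes positive inputs unchanged and maps negative inputs to zero.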

Near the output layer of the CNN, one or more fully connected layers F, in which all nodes of adjacent layers are connected, are placed. The final layer E, which gives the output of the CNN, is designed in the same way as in an ordinary neural network.
When the purpose is class classification of the input image, as many nodes as there are classification classes are placed in the final layer E, and the "softmax function" is used as the activation function a of the final layer E.

Specifically, the following equation is calculated from the inputs u_j (j = 1, 2, ..., n) to the n nodes.
[Math 2]
$$p_j = \frac{\exp(u_j)}{\sum_{k=1}^{n} \exp(u_k)}$$
At recognition time, the index of the node at which p_j takes its maximum value, j = argmax_j p_j, is selected as the estimated class.
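The softmax normalization and the argmax selection can be sketched as follows. Subtracting the maximum input before exponentiation is a common numerical-stability step that is not mentioned in the text and does not change the result. The function names are hypothetical, and class indices here start at 0 rather than 1.

```python
import math


def softmax(u):
    """Convert the n final-layer inputs u_j into probabilities p_j."""
    m = max(u)
    # Shifting by the maximum keeps exp() from overflowing.
    exps = [math.exp(v - m) for v in u]
    total = sum(exps)
    return [e / total for e in exps]


def estimated_class(p):
    """Return the index j at which p_j is largest (argmax)."""
    return max(range(len(p)), key=lambda j: p[j])


# Assumed example inputs for a three-class problem.
p = softmax([1.0, 3.0, 2.0])
best = estimated_class(p)
```

The probabilities always sum to 1, and the largest input always receives the largest probability, so `best` is the index of the largest u_j.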

(Processing content of convolution layer)
FIG. 3 is a conceptual diagram of the processing of the convolution layers C1 and C2.
As shown in FIG. 3, the input to the convolution layers C1 and C2 takes the form of N images (N channels), each S × S pixels in height and width.
Hereinafter, an image in this form is written as S × S × N, and an S × S × N input is written as x_ijk (where (i, j, k) ∈ [0, S-1] × [0, S-1] × [1, N]).

In the CNN, the number of channels of the first input layer (convolution layer C1) is N = 1 if the input image is grayscale and N = 3 (the three RGB channels) if it is color.
In the convolution layers C1 and C2, a calculation that convolves a filter (also called a "kernel") with the input x_ijk is executed.

This calculation is basically the same as filter convolution in general image processing, for example, convolving a small image two-dimensionally with an input image to blur it (Gaussian filter) or to emphasize its edges (sharpening filter).
Specifically, a two-dimensional filter of size L × L is convolved with the S × S pixels of the input x_ijk of each channel k (k = 1 to N), and the results are summed over all channels k = 1 to N. The result of this calculation takes the form of a one-channel image u_ij.

When the filter is defined as w_ijk (where (i, j, k) ∈ [0, L-1] × [0, L-1] × [1, N]), u_ij is calculated by the following equation.
[Math 3]
$$u_{ij} = \sum_{k=1}^{N} \sum_{(p,q) \in P_{ij}} x_{pqk}\, w_{p-i,\,q-j,\,k} + b_k$$

Here, P_ij is the square region of L × L pixels in the image that has the pixel (i, j) as a vertex. That is, P_ij is defined by the following equation.
[Math 4]
$$P_{ij} = \{\, (i + i',\ j + j') \mid i' = 0, \ldots, L-1,\ \ j' = 0, \ldots, L-1 \,\}$$

b_k is a bias. In this embodiment, the bias is common to all output nodes of each channel. That is, b_ijk = b_k.
The filter may also be applied at intervals of several pixels rather than at every pixel. That is, for a predetermined number of pixels s, P_ij may be defined as in the following equation, and u_ij may be calculated with w_{p-i, q-j, k} replaced by w_{p-si, q-sj, k}.
[Math 5]
$$P_{ij} = \{\, (si + i',\ sj + j') \mid i' = 0, \ldots, L-1,\ \ j' = 0, \ldots, L-1 \,\}$$
This pixel interval s is called the "stride".
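The convolution of one multi-channel filter with the input, including the optional stride s, can be sketched as follows. The sketch makes two assumptions that the text does not state: pixels outside the image are treated as zero (zero padding), so that with s = 1 the output is S × S, and indices start at 0. The function name `convolve` and the nested-list data layout are hypothetical.

```python
def convolve(x, w, bias, stride=1):
    """Convolve one filter with a multi-channel input.

    x[k][p][q] -- input: N channels of S x S pixels
    w[k][a][b] -- filter: N channels of L x L weights
    bias       -- bias shared by all output nodes of this filter
    Returns u[i][j], a single-channel image.
    """
    n_channels = len(x)
    size = len(x[0])
    filter_size = len(w[0])
    out_size = (size + stride - 1) // stride  # ceil(S / s)
    u = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = bias
            for k in range(n_channels):
                for a in range(filter_size):
                    for b in range(filter_size):
                        # (p, q) runs over the region P_ij whose vertex is
                        # (s*i, s*j); a = p - s*i and b = q - s*j.
                        p, q = stride * i + a, stride * j + b
                        if p < size and q < size:  # zero outside the image
                            total += x[k][p][q] * w[k][a][b]
            row.append(total)
        u.append(row)
    return u
```

Applying an activation function to every element of the returned image then gives the layer output for this filter.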

The u_ij calculated as described above then passes through the activation function a(·) to become the output y_ij of the convolution layers C1 and C2. That is, y_ij = a(u_ij).
As a result, one filter w_ijk yields one channel of S × S output y_ij, with the same height and width as the input x_ijk.

If N' such filters are prepared and the above calculation is executed independently for each, S × S outputs for N' channels, that is, an output y_ijk of size S × S × N' (where (i, j, k) ∈ [0, S-1] × [0, S-1] × [1, N']), are obtained.
This N'-channel output y_ijk becomes the input x_ijk to the next layer. FIG. 3 shows the calculation for one of the N' filters.

The above calculation can be expressed as a single-layer network whose nodes are connected between layers in a special way, as shown for example in FIG. 4. FIG. 4 is a conceptual diagram of the structure of the receptive field. In the left-hand diagram the receptive field is represented by a rectangle, and in the right-hand diagram it is represented by nodes.
Specifically, each node in the upper layer is connected to only some of the nodes in the lower layer (this is called a "local receptive field"). In addition, the connection weights are common to all nodes (this is called "weight sharing").
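The effect of local receptive fields and weight sharing on the number of learnable parameters can be illustrated with simple arithmetic. The sketch below compares a convolution layer (N' filters of size L × L × N, one shared bias per output channel) with a layer that fully connects an S × S × N input to an S × S × N' output. The function names and the example sizes (S = 28, L = 5, N = 1, N' = 6) are assumptions chosen for illustration and do not come from the patent.

```python
def conv_layer_parameters(filter_size, in_channels, out_channels):
    """Weights of N' shared L x L x N filters plus one bias per channel."""
    weights = filter_size * filter_size * in_channels * out_channels
    biases = out_channels
    return weights + biases


def fully_connected_parameters(size, in_channels, out_channels):
    """Every input node connected to every output node, one bias per node."""
    inputs = size * size * in_channels
    outputs = size * size * out_channels
    return inputs * outputs + outputs


# Assumed example sizes.
conv_count = conv_layer_parameters(5, 1, 6)
dense_count = fully_connected_parameters(28, 1, 6)
```

For these sizes the convolution layer has 156 parameters, while the fully connected alternative has 3,692,640.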

(Processing content of pooling layer)
As shown in FIG. 2, the pooling layers P1 and P2 are paired with the convolution layers C1 and C2. The outputs of the convolution layers C1 and C2 are therefore the inputs to the pooling layers P1 and P2, and the inputs of the pooling layers P1 and P2 take the form S × S × N.
The purpose of the pooling layers P1 and P2 is to discard part of the information about where in the image the filter response was strong, thereby making the response invariant to slight changes in the features.

Like the convolution layers C1 and C2, a node (i, j) of the pooling layers P1 and P2 has a local receptive field P_ij in the layer on its input side. The node (i, j) aggregates the outputs y_pq of the nodes (p, q) ∈ P_ij inside the local receptive field into a single value.
The size of the local receptive field P_ij of the pooling layers P1 and P2 is set independently of that of the convolution layers C1 and C2 (the filter size).

When the input has multiple channels, the above processing is performed for each channel. That is, the number of output channels of the pooling layers P1 and P2 matches that of the convolution layers C1 and C2.
Pooling is performed while thinning out in the vertical and horizontal (i, j) directions of the image. That is, a stride s of 2 or more is set. For example, when s = 2, the height and width of the output are half those of the input, and the number of output nodes of the pooling layer is 1/s^2 times the number of input nodes.

Methods of aggregating the inputs from the nodes inside the receptive field P_ij into one value include "average pooling" and "max pooling".
Average pooling is a method that outputs the average of the inputs x_pqk from the nodes belonging to P_ij, as in the following equation.
[Math 6]
$$y_{ijk} = \frac{1}{|P_{ij}|} \sum_{(p,q) \in P_{ij}} x_{pqk}$$

最大プーリングは、次式の通り、Ｐ_ｉ，ｊに属するノードからの入力ｘ_ｐｑｋの最大値を出力する方法である。ＣＮＮの初期の研究では平均プーリングが主流であったが、現在では最大プーリングが一般的に採用される。
Maximum pooling is a method of outputting the maximum value of the inputs x _pqk from the nodes belonging to P _{i, j}, as shown in the following equation. Although average pooling was the mainstream in early CNN research, maximum pooling is now generally employed.
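The two pooling methods described above can be illustrated with a minimal sketch for a single channel. The function names, the 2 × 2 receptive field, and the lack of padding at the image border are assumptions made for illustration and are not taken from the embodiment.

```python
from typing import Callable, List

Matrix = List[List[float]]


def pool2d(x: Matrix, size: int, stride: int,
           reduce: Callable[[List[float]], float]) -> Matrix:
    """Aggregate each size x size receptive field of one channel into one value."""
    rows, cols = len(x), len(x[0])
    out: Matrix = []
    for i in range(0, rows - size + 1, stride):
        out_row = []
        for j in range(0, cols - size + 1, stride):
            # Collect the inputs x_pq with (p, q) inside the receptive field P_ij.
            field = [x[p][q] for p in range(i, i + size)
                     for q in range(j, j + size)]
            out_row.append(reduce(field))
        out.append(out_row)
    return out


def average_pool(x: Matrix, size: int = 2, stride: int = 2) -> Matrix:
    return pool2d(x, size, stride, lambda f: sum(f) / len(f))


def max_pool(x: Matrix, size: int = 2, stride: int = 2) -> Matrix:
    return pool2d(x, size, stride, max)


channel = [
    [1.0, 3.0, 2.0, 0.0],
    [5.0, 7.0, 4.0, 6.0],
    [0.0, 2.0, 9.0, 1.0],
    [4.0, 6.0, 3.0, 3.0],
]
print(average_pool(channel))  # [[4.0, 3.0], [3.0, 4.0]]
print(max_pool(channel))      # [[7.0, 6.0], [6.0, 9.0]]
```

With s = 2 the 4 × 4 input (16 nodes) is reduced to a 2 × 2 output (4 nodes), that is, 1 / s ² of the input nodes. For a multi-channel input the same function would simply be applied to each channel.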

なお、畳み込み層Ｃ１，Ｃ２と異なり、プーリング層Ｐ１，Ｐ２では、学習によって変化する重みは存在せず、活性化関数も適用されない。
本実施形態のＣＮＮにおいて、平均プーリング及び最大プーリングのいずれを採用してもよいが、図７に示すＣＮＮの実装例では最大プーリングを採用している。 Note that, unlike the convolution layers C1 and C2, in the pooling layers P1 and P2, there is no weight that changes due to learning, and no activation function is applied.
Either average pooling or maximum pooling may be employed in the CNN of the present embodiment, but maximum pooling is employed in the CNN implementation example shown in FIG. 7.

〔学習部の処理内容〕
ＣＮＮの学習（training）では、「教師あり学習」が基本である。本実施形態においても、学習部５は教師あり学習を実行する。
具体的には、学習部５は、学習データとなる多数のラベル付きのサンプル画像を含む集合を対象として、各サンプル画像の分類誤差を最小化することにより実行される。以下、この処理について説明する。 [Processing content of the learning unit]
In CNN training, “supervised learning” is fundamental. Also in this embodiment, the learning unit 5 performs supervised learning.
Specifically, the learning by the learning unit 5 is performed by minimizing the classification error of each sample image over a set containing a large number of labeled sample images serving as learning data. This process is described below.

ＣＮＮ処理部４の最終層Ｅの各ノードは、ソフトマックス関数による正規化（前述の〔数２〕）により、対応するクラスに対する確率ｐ_ｊ（ｊ＝１，２，……ｎ）を出力する。この確率ｐ_ｊは、学習部５に入力される。
学習部５は、入力された確率ｐ_ｊから算出される分類誤差を最小化するように、ＣＮＮ処理部４に設定された重みなどのパラメータを更新する。 Each node in the final layer E of the CNN processing unit 4 outputs the probability p _j (j = 1, 2, ..., n) for the corresponding class through normalization with the softmax function (the above-mentioned [Expression 2]). This probability p _j is input to the learning unit 5.
The learning unit 5 updates parameters such as weights set in the CNN processing unit 4 so as to minimize the classification error calculated from the input probability p _j .

具体的には、学習部５は、入力サンプルに対する理想的な出力ｄ１，ｄ２，……ｄｎ（ラベル）と、出力ｐ１．ｐ２．……ｐｎの乖離を、次式の交差エントロピーＣによって算出する。この交差エントロピーＣが分類誤差である。
Specifically, the learning unit 5 calculates the divergence between the ideal outputs d1, d2, ..., dn (labels) for an input sample and the actual outputs p1, p2, ..., pn using the cross entropy C given by the following equation. This cross entropy C is the classification error.

目標出力ｄ１，ｄ２，……ｄｎは、正解クラスｊのみでｄ_ｊ＝１となり、それ以外のすべてのｋ（≠ｊ）ではｄ_ｋ＝０となるように設定される。
学習部５は、上記の交差エントロピーＣが小さくなるように、各畳み込み層Ｃ１，Ｃ２のフィルタの係数ｗ_ｉｊｋと各ノードのバイアスｂ_ｋ、及び、ＣＮＮの出力層側に配置された全結合層Ｆの重みとバイアスを調整する。 The target outputs d1, d2, ..., dn are set so that d _j = 1 only for the correct class j, and d _k = 0 for all other k (≠ j).
The learning unit 5 adjusts the filter coefficients w _ijk of the convolution layers C1 and C2, the bias b _k of each node, and the weights and biases of the fully connected layer F arranged on the output layer side of the CNN so that the cross entropy C is reduced.
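The equation for the cross entropy C is not reproduced in this text. The sketch below assumes the standard form C = −Σ_j d_j log p_j together with a standard softmax normalization. The function names are illustrative and do not come from the embodiment.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Normalize final-layer values into class probabilities p_1..p_n."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(correct_class: int, n: int) -> List[float]:
    """Target outputs d_1..d_n: 1 for the correct class, 0 elsewhere."""
    return [1.0 if k == correct_class else 0.0 for k in range(n)]


def cross_entropy(d: List[float], p: List[float]) -> float:
    """C = -sum_j d_j * log(p_j); terms with d_j = 0 contribute nothing."""
    return -sum(dj * math.log(pj) for dj, pj in zip(d, p) if dj > 0.0)


p = softmax([2.0, 1.0, 0.1])      # hypothetical final-layer values
d = one_hot(0, 3)                 # class 0 is the correct class
loss = cross_entropy(d, p)        # equals -log(p[0]) for a one-hot target
```

Because the target is one-hot, C reduces to −log p_j for the correct class j, so the error falls toward 0 as the probability assigned to the correct class approaches 1.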

分類誤差Ｃの最小化には、確率的勾配降下法が用いられる。学習部５は、重みやバイアスに関する誤差勾配（∂Ｃ／∂ｗ_ｉｊ）を、誤差逆伝播法（ＢＰ法）により計算する。ＢＰ法による計算方法は、通常のニューラルネットワークの場合と同様である。
もっとも、ＣＮＮ処理部４が最大プーリングを採用する場合の逆伝播では、学習サンプルに対する順伝播の際に、プーリング領域のどのノードの値を選んだかを記憶し、逆伝播時にそのノードのみと結合（重み１で結合）させる。 A stochastic gradient descent method is used to minimize the classification error C. The learning unit 5 calculates an error gradient (∂C / ∂w _ij ) related to weights and biases by an error back propagation method (BP method). The calculation method by the BP method is the same as that of a normal neural network.
However, in back propagation when the CNN processing unit 4 adopts maximum pooling, the node of the pooling region whose value was selected during forward propagation for the learning sample is stored, and at the time of back propagation the connection is made only to that node (with a weight of 1).

学習部５による分類誤差Ｃの評価とこれに基づくパラメータ（重みなど）の更新は、全学習サンプルについて実行してもよい。しかし、収束性及び計算速度の観点から、数個から数百個程度のサンプルの集合（ミニバッチ）ごとに実行することが好ましい。この場合の重みｗ_ｉｊの更新量Δｗ_ｉｊは、次式で決定される。
The evaluation of the classification error C by the learning unit 5 and the update of parameters (weights and the like) based on the evaluation may be performed for all learning samples. However, from the viewpoint of convergence and calculation speed, it is preferable to execute for each set (mini-batch) of several to several hundred samples. In this case, the update amount Δw _ij of the weight w _ij is determined by the following equation.

上式において、Δｗ_ｉｊ ^（ｔ）は今回の重み更新量であり、Δｗ_ｉｊ ^{（ｔ−１）}は前回の重み更新量である。上式の第１項は、勾配降下法により誤差を削減するためのｗ_ｉｊの修正量を表す項であり、εは学習率である。
上式の第２項は、モメンタム（momentum）である。モメンタムは、前回更新量のα（〜０．９）倍を加算することでミニパッチの選択による重みの偏りを抑える。第３項は、重み減衰（weight decay）である。重み減衰は、重みが過大にならないようにするパラメータである。なお、バイアスｂ_ｋの更新についても同様である。 In the above equation, Δw _ij ^(t) is the current weight update amount, and Δw _ij ^(t−1) is the previous weight update amount. The first term of the above equation is a term representing the correction amount of w _ij for reducing the error by the gradient descent method, and ε is the learning rate.
The second term in the above equation is the momentum. The momentum suppresses the bias in the weights caused by the selection of the mini-batch by adding α (approximately 0.9) times the previous update amount. The third term is the weight decay, which keeps the weights from becoming excessively large. The same applies to the update of the bias b _k .
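The update equation itself is not reproduced in this text. The following sketch assumes the commonly used three-term form Δw(t) = −ε ∂C/∂w + α Δw(t−1) − ελ w, which matches the order of the terms described above. The exact form of the weight decay term, the symbol λ, and the default values of the learning rate and λ are assumptions.

```python
def weight_update(grad: float, prev_delta: float, weight: float,
                  lr: float = 0.01, momentum: float = 0.9,
                  decay: float = 0.0005) -> float:
    """Return the update amount dw(t) for one weight.

    term 1: gradient descent step   -lr * dC/dw
    term 2: momentum                +momentum * dw(t-1)
    term 3: weight decay            -lr * decay * w   (assumed form)
    """
    return -lr * grad + momentum * prev_delta - lr * decay * weight


w, delta = 1.0, 0.0
for grad in [0.5, 0.5, 0.5]:   # hypothetical mini-batch gradients
    delta = weight_update(grad, delta, w)
    w += delta
```

With a constant gradient the momentum term makes each step slightly larger than the previous one, which is the smoothing effect across mini-batches that the text attributes to the second term.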

〔画像生成部の処理内容〕
図５は、データ生成部３による第１処理の説明図である。
前述の通り、「第１処理」は、サンプル画像７又は対象画像８から、当該原画像７，８に対して回転及び反転の不変性を有する画像フィルタ９を生成する処理である。
図５において、学習段階での原画像は、サンプル画像７であり、画像処理装置２を識別器とする場合の原画像は、対象画像８である。 [Processing content of image generator]
FIG. 5 is an explanatory diagram of the first process performed by the data generation unit 3.
As described above, the “first process” is a process for generating the image filter 9 having invariance of rotation and inversion with respect to the original images 7 and 8 from the sample image 7 or the target image 8.
In FIG. 5, the original image at the learning stage is the sample image 7, and the original image when the image processing apparatus 2 is used as a discriminator is the target image 8.

原画像７，８には、直交座標の各画素点ｐ（ｘ，ｙ）におけるＲＧＢ値（０〜２５５）が含まれる。ここでは、図５（ａ）に示すように、画素点ｐでのＲＧＢ値を要素とするデータ列（Ｒ，Ｇ，Ｂ）を色ベクトル「ｇ」という。
データ生成部３は、まず、原画像７，８の中心点ｃを抽出し、抽出した中心点ｃを座標の原点とする。次に、データ生成部３は、図５（ｂ）に示すように、直交座標（ｘ，ｙ）の画素点ｐを、中心点ｃを原点とする極座標に変換する（極化処理）。 The original images 7 and 8 include RGB values (0 to 255) at the respective pixel points p (x, y) in the orthogonal coordinates. Here, as shown in FIG. 5A, the data string (R, G, B) having the RGB value at the pixel point p as an element is referred to as a color vector “g”.
The data generation unit 3 first extracts the center point c of the original images 7 and 8 and sets the extracted center point c as the origin of coordinates. Next, as shown in FIG. 5B, the data generation unit 3 converts the pixel point p of the orthogonal coordinates (x, y) into polar coordinates having the center point c as the origin (polarization process).

なお、極座標の原点は、必ずしも原画像７，８の中心点ｃでなくてもよく、中心点ｃから多少ずれた位置にある所定のポイントであってもよい。 Note that the origin of the polar coordinates does not necessarily have to be the center point c of the original images 7 and 8, but may be a predetermined point that is slightly shifted from the center point c.

次に、データ生成部３は、中心点ｃを原点とする極座標に含まれる、任意の画素点ｐの色ベクトルｇ＝（Ｒ，Ｇ，Ｂ）を、画素点ｐにおける半径方向の色ベクトルｇｒと接線方向の色ベクトルｇｔに分解する。この色ベクトルｇの分解は、次式により実行される。
Next, the data generation unit 3 decomposes the color vector g = (R, G, B) of an arbitrary pixel point p, expressed in polar coordinates with the center point c as the origin, into a radial color vector gr and a tangential color vector gt at the pixel point p. This decomposition of the color vector g is executed by the following equation.

ここで、「ｇ^Ｔ」は、色ベクトルｇ＝（Ｒ，Ｇ，Ｂ）の転置ベクトルである。「ｒ」は、次式により定義される画素点ｐにおける半径方向の単位ベクトルである。「ｔ」は、次式により定義される画素点ｐにおける接線方向の単位ベクトルである。 Here, “g ^T ” is the transpose of the color vector g = (R, G, B). “r” is the unit vector in the radial direction at the pixel point p, defined by the following equation. “t” is the unit vector in the tangential direction at the pixel point p, defined by the following equation.

上式において、「Ｒ_θ」は、単位ベクトルｒを角度θだけ回転させる回転ベクトルである。本実施形態では、単位ベクトルｔの方向は接線方向（単位ベクトルｒからの角度が９０度）であるから、回転行列Ｒ_θの角度θの値は、θ＝π／２となる。
原画像７，８の極座標に含まれるすべての画素点ｐに上記の計算を行うことにより、各画素点ｐについて、合計６種類の要素（Ｒｒ、Ｒｔ、Ｇｒ、Ｇｔ、Ｂｒ、Ｂｔ）を含むシングルチャンネルの画像フィルタ９が生成される。 In the above equation, “R _θ ” is a rotation matrix that rotates the unit vector r by an angle θ. In the present embodiment, since the direction of the unit vector t is the tangential direction (at an angle of 90 degrees from the unit vector r), the angle θ of the rotation matrix R _θ is θ = π / 2.
By performing the above calculation for all pixel points p included in the polar coordinates of the original images 7 and 8, a single-channel image filter 9 containing a total of six types of elements (Rr, Rt, Gr, Gt, Br, Bt) for each pixel point p is generated.
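The polar conversion and the unit vectors r and t described above can be sketched as follows. The defining equations are not reproduced in this text, so the sketch assumes r = (p − c) / ‖p − c‖ and t = R _{π/2} r. The function names are illustrative, and the projection of the color vector itself is deliberately left out.

```python
import math
from typing import Tuple

Vec2 = Tuple[float, float]


def to_polar(p: Vec2, c: Vec2) -> Tuple[float, float]:
    """Convert pixel point p to polar coordinates (radius, angle) about origin c."""
    dx, dy = p[0] - c[0], p[1] - c[1]
    return math.hypot(dx, dy), math.atan2(dy, dx)


def rotate(v: Vec2, theta: float) -> Vec2:
    """Apply the 2-D rotation matrix R_theta to v."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return (cos_t * v[0] - sin_t * v[1], sin_t * v[0] + cos_t * v[1])


def unit_vectors(p: Vec2, c: Vec2) -> Tuple[Vec2, Vec2]:
    """Return (r, t): radial unit vector and tangential unit vector at p."""
    radius, _ = to_polar(p, c)
    if radius == 0.0:
        raise ValueError("r is undefined at the origin c")
    r = ((p[0] - c[0]) / radius, (p[1] - c[1]) / radius)
    t = rotate(r, math.pi / 2)   # t = R_{pi/2} r
    return r, t


r, t = unit_vectors((3.0, 4.0), (0.0, 0.0))   # r = (0.6, 0.8), t is close to (-0.8, 0.6)
```

Passing a point other than the image center as `c` corresponds to the variation in which the origin of the polar coordinates is slightly shifted from the center point c.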

図６は、データ生成部３による第２処理の説明図である。
前述の通り、「第２処理」は、原画像７，８に対して、第１処理で生成した画像フィルタ９を畳み込む処理である。
図６において、学習段階での原画像は、サンプル画像７であり、画像処理装置２を識別器とする場合の原画像は、対象画像８である。 FIG. 6 is an explanatory diagram of the second process by the data generation unit 3.
As described above, the “second process” is a process for convolving the image filters 9 generated in the first process with the original images 7 and 8.
In FIG. 6, the original image at the learning stage is the sample image 7, and the original image when the image processing apparatus 2 is used as a discriminator is the target image 8.

〔画像フィルタの回転及び反転の不変性〕
図７は、画素点ｐを所定角度θだけ回転させた変換点ｐ_θ、反転させた反転点ｐ'と回転反転（若しくは反転回転）させた複数変更点ｐ'_θの説明図である。
図７に示すように、任意の画素点ｐに対して、同じ半径で左回りに所定角度θだけ進んだ点である回転点を「ｐ_θ」とする。
図７に示すように、任意の画素点ｐに対して、反転させた反転点を「ｐ'」とする。
図７に示すように、任意の画素点ｐの反転点ｐ'に対して、同じ半径で左回りに所定角度θだけ進んだ点である回転点を「ｐ'_θ」とする。但し、複数の変更があった場合に、変換される手順は反転回転でも回転反転でもよい。
また、回転点ｐ_θ、反転点ｐ'、回転反転（若しくは反転回転）点ｐ'_θにおける色ベクトルをそれぞれ「ｇ_θ」、「ｇ'」、「ｇ'_θ」とし、回転点ｐ'_θにおける半径方向及び接線方向の単位ベクトルをそれぞれ「ｒ_θ」、「ｒ'」、「ｒ'_θ」及び「ｔ_θ」、「ｔ'」、「ｔ'_θ」とする。 [Invariance of image filter rotation and reversal]
FIG. 7 is an explanatory diagram of a rotation point p _θ obtained by rotating the pixel point p by a predetermined angle θ, an inversion point p ′ obtained by inverting it, and a multiply transformed point p ′ _θ obtained by rotating and then inverting (or inverting and then rotating) it.
As shown in FIG. 7, the rotation point reached by advancing counterclockwise by a predetermined angle θ at the same radius from an arbitrary pixel point p is denoted “p _θ ”.
As shown in FIG. 7, the inversion point obtained by inverting an arbitrary pixel point p is denoted “p ′”.
As shown in FIG. 7, the point reached by advancing counterclockwise by a predetermined angle θ at the same radius from the inversion point p ′ of an arbitrary pixel point p is denoted “p ′ _θ ”. When multiple transformations are applied, the order may be either inversion followed by rotation or rotation followed by inversion.
Further, the color vectors at the rotation point p _θ , the inversion point p ′, and the rotation-inversion (or inversion-rotation) point p ′ _θ are denoted “g _θ ”, “g ′”, and “g ′ _θ ”, respectively, and the unit vectors in the radial and tangential directions at these points are denoted “r _θ ”, “r ′”, “r ′ _θ ” and “t _θ ”, “t ′”, “t ′ _θ ”, respectively.

この場合、次式に示すように、画素点ｐ、回転点ｐ_θ、反転点ｐ'、複数変換点ｐ'_θにおける半径方向及び接線方向の色ベクトルは、（ｇ^Ｔｒ，ｇ^Ｔｔ）、（(Ｒ_θｇ)^Ｔｒ，(Ｒ_θｇ)^Ｔｔ）、（(Mｇ)^Ｔｒ，(Mｇ)^Ｔｔ）、（(MＲ_θｇ)^Ｔｒ，(MＲ_θｇ)^Ｔｔ）と一致する。
ただし、次式において、Ｍは反転行列である。Ｍは、対角成分が１又は−１の対角行列であるため、Ｍ^ＴＭ＝Ｉとなる。 In this case, as shown in the following equations, the color vectors in the radial and tangential directions at the pixel point p, the rotation point p _θ , the inversion point p ′, and the multiply transformed point p ′ _θ , which are expressed as (g ^T r, g ^T t), ((R _θ g) ^T r, (R _θ g) ^T t), ((Mg) ^T r, (Mg) ^T t), and ((MR _θ g) ^T r, (MR _θ g) ^T t), respectively, coincide with one another.
In the following equations, M is an inversion (reflection) matrix. Since M is a diagonal matrix whose diagonal components are 1 or −1, M ^T M = I holds.

（回転の場合）
(For rotation)

（反転の場合）
(In the case of inversion)

（回転反転（若しくは反転回転）の場合）
(In case of rotation reversal (or reversal rotation))

上記の等式は、中心点ｃ回りの角度θの値に関係なく成立する。すなわち、画素点ｐでの半径方向の色ベクトルｇｒと接線方向の色ベクトルｇｔは、中心点ｃ回りにどのような角度θで回転しても不変である。同様、反転及び複数変更の場合も不変性を有する。なお、反転には、左右の反転（ｙ軸対称）と上下の反転（ｘ軸対称）の双方が含まれる。
従って、離散情報に構成された原画像７，８の色ベクトルｇに対して、各方向の色ベクトルｇｒ，ｇｔを要素とする画像フィルタ９を畳み込む処理を実行すれば、処理後の入力データは、中心点ｃ回りの回転及び反転に対して不変性を有する入力データとなる。 The above equation holds regardless of the value of the angle θ around the center point c. That is, the color vector gr in the radial direction and the color vector gt in the tangential direction at the pixel point p are unchanged regardless of the angle θ around the center point c. Similarly, inversion and multiple changes have invariance. Note that the inversion includes both left-right inversion (y-axis symmetry) and up-down inversion (x-axis symmetry).
Therefore, if the image filter 9 having the color vectors gr and gt in each direction as elements is convolved with the color vectors g of the original images 7 and 8, which are composed of discrete information, the processed input data becomes input data that is invariant to rotation and inversion about the center point c.
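The equalities for the rotation, inversion, and rotation-inversion cases are not reproduced in this text. The following is a sketch of why they hold, under the assumption that the unit vectors at a transformed point are the transformed unit vectors (r _θ = R _θ r, t _θ = R _θ t, r ′ = M r, t ′ = M t, r ′ _θ = M R _θ r) and using R _θ ^T R _θ = I and M ^T M = I. The tangential components of the last case follow in the same way with t in place of r.

```latex
\begin{aligned}
g_\theta^{T} r_\theta &= (R_\theta g)^{T}(R_\theta r) = g^{T} R_\theta^{T} R_\theta\, r = g^{T} r, \\
g_\theta^{T} t_\theta &= (R_\theta g)^{T}(R_\theta t) = g^{T} R_\theta^{T} R_\theta\, t = g^{T} t, \\
g'^{T} r' &= (M g)^{T}(M r) = g^{T} M^{T} M\, r = g^{T} r, \\
g'^{T} t' &= (M g)^{T}(M t) = g^{T} M^{T} M\, t = g^{T} t, \\
g'^{T}_{\theta}\, r'_{\theta} &= (M R_\theta g)^{T}(M R_\theta r) = g^{T} R_\theta^{T} M^{T} M R_\theta\, r = g^{T} r .
\end{aligned}
```

None of the steps depends on the value of θ, which is consistent with the statement that the equalities hold for any angle about the center point c.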

〔推奨されるＣＮＮの構造例〕
図８は、ＣＮＮ処理部４に構築される深層ＣＮＮの構造図である。
図８に示すように、本願発明者らが推奨する、画像認識のためのＣＮＮのアーキテクチャは、入力ボリュームを出力ボリュームに変換する畳み込み層Ｃ１〜Ｃ４と、全結合層Ａ１〜Ａ３の積層体により構成されている。 [Recommended CNN structure example]
FIG. 8 is a structural diagram of the deep CNN constructed in the CNN processing unit 4.
As shown in FIG. 8, the CNN architecture for image recognition recommended by the inventors of the present application is configured as a stack of convolution layers C1 to C4, which convert an input volume into an output volume, and fully connected layers A1 to A3.

ＣＮＮの各層Ｃ１〜Ｃ４，Ａ１〜Ａ３は、幅、高さ及び奥行きの３次元的に配列されたニューロンを有する。
最初の入力層Ｃ１の幅、高さ及び奥行きのサイズは５６×５６×３が好ましい。畳み込み層Ｃ２〜Ｃ４及び全結合層Ａ１の内部のニューロンは、１つ前の層の受容野と呼ばれる小領域のノードのみに接続されている。 Each layer C1-C4, A1-A3 of the CNN has neurons that are arranged three-dimensionally in width, height and depth.
The width, height, and depth of the first input layer C1 are preferably 56 × 56 × 3. The neurons inside the convolution layers C2 to C4 and the fully connected layer A1 are connected only to the nodes of a small region, called a receptive field, of the preceding layer.

出力ボリュームの空間的な大きさは、次式で計算することができる。
Ｗ２＝１＋（Ｗ１−Ｋ＋２Ｐ）／Ｓ
上式において、Ｗ１は、入力ボリュームのサイズである。Ｋは、畳み込み層のニューロンの核（ノード）のフィールドサイズである。Ｓはストライド、すなわち、カーネルマップにおける隣接するニューロンの受容野の中心間距離を意味する。Ｐは、ボーダー上で使用されるゼロパディングの量を意味する。 The spatial size of the output volume can be calculated by the following equation.
W2 = 1 + (W1 − K + 2P) / S
In the above equation, W1 is the size of the input volume. K is the field size of the kernel of the neurons in the convolution layer. S is the stride, that is, the distance between the centers of the receptive fields of adjacent neurons in the kernel map. P is the amount of zero padding applied at the border.

図８のＣＮＮでは、第１畳み込み層Ｃ１において、Ｗ１＝５６、Ｋ＝５、Ｓ＝２、Ｐ＝２である。従って、第２畳み込み層Ｃ２の出力ボリュームの空間的な大きさは、Ｗ２＝１＋（５６−５＋２×２）／２＝２８．５→２８となる。
図８のネットワークでは、重みを持つ７つの層を含む。最初の４つは畳み込み層Ｃ１〜Ｃ４であり、残りの３つは完全に接続された全結合層Ａ１〜Ａ３である。全結合層Ａ１〜Ａ３には、ドロップアウトが含まれる。 In the CNN of FIG. 8, W1 = 56, K = 5, S = 2, and P = 2 in the first convolution layer C1. Therefore, the spatial size of the output volume of the second convolution layer C2 is W2 = 1 + (56-5 + 2 × 2) /2=28.5→28.
The network of FIG. 8 includes seven layers with weights. The first four are the convolution layers C1 to C4, and the remaining three are the fully connected layers A1 to A3. Dropout is applied in the fully connected layers A1 to A3.
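The output-size formula and the C1 example above (W1 = 56, K = 5, S = 2, P = 2) can be checked with a short sketch. Rounding down when the division is not exact follows the 28.5 → 28 step in the text. The function name is illustrative.

```python
def conv_output_size(w1: int, k: int, s: int, p: int) -> int:
    """W2 = 1 + (W1 - K + 2P) / S, rounded down when the division is not exact."""
    if w1 + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return 1 + (w1 - k + 2 * p) // s


print(conv_output_size(56, 5, 2, 2))  # 28, the C1 example (28.5 rounded down)
```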

最後の全結合層Ａ３の出力は、この層Ａ３と完全に接続された最終層である、７クラスラベルの分布を生成する7-way SOFTMAXに供給される。
畳み込み層Ｃ２〜Ｃ４と全結合層Ａ１のニューロンは前の層の受容野に接続され、全結合層Ａ２〜Ａ３のニューロンは、前の層の全てのニューロンに接続されている。 The output of the last fully coupled layer A3 is fed to 7-way SOFTMAX, which generates a distribution of 7 class labels, which is the final layer fully connected to this layer A3.
The neurons of the convolutional layers C2 to C4 and the fully connected layer A1 are connected to the receptive field of the previous layer, and the neurons of the fully connected layers A2 to A3 are connected to all the neurons of the previous layer.

畳み込み層Ｃ１，Ｃ２の後にはバッチ正規化層が続く。各バッチ正規化層の後には、それぞれ前述の最大プーリングを実行するプーリング層が続く。
畳み込み層Ｃ１〜Ｃ４と全結合層Ａ１〜Ａ３のための非線形マッピング関数は、整流リニアユニット（ＲｅＬＵ）よりなる。 The convolution layers C1, C2 are followed by a batch normalization layer. Each batch normalization layer is followed by a pooling layer that performs the aforementioned maximum pooling.
The non-linear mapping function for the convolution layers C1 to C4 and the fully connected layers A1 to A3 is the rectified linear unit (ReLU).

第１畳み込み層Ｃ１は、サイズが５×５×３の６４個のカーネルにより、２画素のストライドで５６×５６×３の入力画像（ＡＧＥ画像）をフィルタリングする。
ストライド（歩幅）は、カーネルマップ内で隣接するニューロンの受容野の中心間の距離である。ストライドは、すべての畳み込み層において１ピクセルに設定されている。 The first convolution layer C1 filters an input image (AGE image) of 56 × 56 × 3 with a stride of 2 pixels by 64 kernels having a size of 5 × 5 × 3.
The stride is the distance between the centers of the receptive fields of adjacent neurons in the kernel map. The stride is set to 1 pixel in all convolution layers.

第２畳み込み層Ｃ２の入力は、バッチ正規化及び最大プールされた第１畳み込み層Ｃ１の出力である。第２畳込み層Ｃ２は、サイズが３×３×６４である１２８のカーネルで入力をフィルタリングする。
第３畳み込み層Ｃ３は、サイズが３×３×６４である１２８のカーネルを有し、これらは第２層Ｃ２（バッチ正規化とＭＡＸプーリング）の出力に接続されている。 The input of the second convolution layer C2 is the output of the first convolution layer C1 which is batch normalized and maximum pooled. The second convolution layer C2 filters the input with 128 kernels that are 3 × 3 × 64 in size.
The third convolutional layer C3 has 128 kernels of size 3x3x64, which are connected to the output of the second layer C2 (batch normalization and MAX pooling).

第４畳み込み層Ｃ４は、サイズが３×３×１２８である１２８のカーネルを備えている。完全に接続された全結合層Ａ１〜Ａ３は、それぞれ１０２４のニューロンを備えている。 The fourth convolution layer C4 includes 128 kernels having a size of 3 × 3 × 128. All the fully connected layers A1 to A3 are each provided with 1024 neurons.

〔推奨される学習例〕
本願発明者らは、図８の構造の深層ＣＮＮを実際に訓練（学習）させた。訓練に際しては、NVIDIA GTX745 4GBのＧＰＵを実装するＰＣに対して、オープンソースの数値解析ソフトウェアである「ＭＡＴＬＡＢ」を用いて行った。
ＣＮＮの学習ステップにおいては、重み減衰、モメンタム、バッチサイズ、学習率や学習サイクルを含むパラメータなどの重要な設定がある。以下、この点について説明する。 [Recommended learning examples]
The inventors of the present application actually trained the deep CNN having the structure of FIG. 8. The training was performed using the numerical analysis software “MATLAB” on a PC equipped with an NVIDIA GTX745 4GB GPU.
In the learning step of a CNN, there are important settings such as the weight decay, momentum, batch size, learning rate, and parameters including the number of learning cycles. These settings are described below.

本願発明者らによる訓練では、モメンタムが０．９であり、重み減衰が０．０００５である非同期の確率的勾配降下法を採用した。次式は、今回採用した重みｗの更新ルールである。
In the training by the present inventors, an asynchronous stochastic gradient descent method with a momentum of 0.9 and a weight decay of 0.0005 was adopted. The following equation is the weight w update rule adopted this time.
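The update rule itself is not reproduced in this text. A form consistent with the stated momentum (0.9), weight decay (0.0005), and the description of the terms that follows is the widely used rule below. It is offered as an assumption about what the original equation looks like.

```latex
\begin{aligned}
m_{i+1} &= 0.9\, m_i \;-\; 0.0005\, \varepsilon\, w_i \;-\; \varepsilon \left\langle \frac{\partial L}{\partial w}\Big|_{w_i} \right\rangle_{D_i}, \\
w_{i+1} &= w_i + m_{i+1}.
\end{aligned}
```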

上式において、ｉは反復回数であり、ｍはモメンタム変数である。εは学習率を意味する。右辺の第３項は、ｗｉにおいて誤差Ｌを削減するための重みｗの修正量のｉ番目のバッチＤｉに関する平均値である。
バッチサイズの増加は、より信頼性の高い勾配推定値をもたらし、学習時間を短縮できるが、それでは最大の安定した学習率εの増加が得られない。そこで、ＣＮＮのモデルに適したバッチサイズを選択する必要がある。 In the above equation, i is the number of iterations and m is a momentum variable. ε means the learning rate. The third term on the right side is an average value for the i-th batch Di of the correction amount of the weight w for reducing the error L in wi.
Increasing the batch size yields a more reliable gradient estimate and can shorten the learning time, but it does not increase the maximum stable learning rate ε. It is therefore necessary to select a batch size suited to the CNN model.

ここでは、畳み込み層Ｃ１〜Ｃ４について、それぞれ、６４、１２８、２５６及び５１２のバッチサイズを採用した訓練（学習）の結果を比較した。その結果、図８のＣＮＮでは、２５６のバッチサイズが最適であることが判明した。
また、すべての層に同等の学習率を使用し、訓練を通して手動で調整した。学習率は０．１に初期化し、エラーレートが現時点の学習率で改善を停止したときに、学習率を１０で分割した。また、訓練に際しては、約２０サイクルでネットワークを訓練した。 Here, the results of training (learning) employing batch sizes of 64, 128, 256, and 512 were compared for the convolution layers C1 to C4, respectively. As a result, it was found that the batch size of 256 is optimal for the CNN of FIG.
In addition, the same learning rate was used for all layers and was adjusted manually throughout the training. The learning rate was initialized to 0.1 and was divided by 10 whenever the error rate stopped improving at the current learning rate. The network was trained for about 20 cycles.
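The text describes a manual adjustment. Purely as an illustration, the same rule could be automated as in the sketch below. The criterion for "stopped improving" (the latest error rate is no lower than the previous one), the function name, and the error values are all assumptions.

```python
from typing import List


def adjust_learning_rate(lr: float, error_rates: List[float]) -> float:
    """Divide lr by 10 when the latest error rate is no better than the previous one."""
    if len(error_rates) >= 2 and error_rates[-1] >= error_rates[-2]:
        return lr / 10
    return lr


lr = 0.1                      # initial learning rate from the text
history: List[float] = []
for error in [0.40, 0.25, 0.25, 0.20]:   # hypothetical per-cycle error rates
    history.append(error)
    lr = adjust_learning_rate(lr, history)
```

In this run the error rate stalls once (0.25 followed by 0.25), so the learning rate is divided by 10 exactly once.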

〔実験例：手書き文字を識別する場合の効果〕
本願発明者らは、「神戸大学経済経営研究所附属企業資料総合センター」に所蔵された、「鐘紡資料データベース」の「支配人回章１」に含まれる手書き文字の画像を用いて、本願発明の有意性を試す比較実験を行った。 [Experimental example: Effect of identifying handwritten characters]
The inventors of the present application conducted a comparative experiment to test the significance of the present invention, using images of handwritten characters contained in “Manager's Circular 1” (支配人回章１) of the “Kanebo Material Database” held by the Integrated Center for Corporate Archives, Research Institute for Economics and Business Administration, Kobe University.

識別するオブジェクト（手書き文字）の種類は、支配人回章１に含まれる「支」、「配」、「人」、「工」、「場」、「長」、「会（會）」、「社」、「明」及び「治」とした。学習に用いる各文字のサンプル数は、各々４００枚（５６×５６ピクセル）とした。図９は、比較実験に用いた手書き文字（原画像）の一例を示す図である。 The types of objects (handwritten characters) to be identified were the ten characters “支”, “配”, “人”, “工”, “場”, “長”, “会 (會)”, “社”, “明”, and “治” contained in Manager's Circular 1. For each character, 400 sample images (56 × 56 pixels) were used for learning. FIG. 9 is a diagram illustrating an example of the handwritten characters (original images) used in the comparative experiment.

図１０は、文字クラスごとの認識精度の試験結果を表すグラフである。図１０において、○のグラフ（ours）は、本願発明の認識精度を示す。□のグラフ（alexnet）は、非特許文献１の場合（ただし、入力データはＲＧＢ値）の認識精度を示す。
▽のグラフ（vgg-vd-16）は、非特許文献２の場合（入力データはＲＧＢ値であり、レイヤ数は１６）の認識精度を示す。＊のグラフ（vgg-vd-19）は、非特許文献２の場合（入力データはＲＧＢ値であり、レイヤ数は１９）の認識精度を示す。 FIG. 10 is a graph showing a recognition accuracy test result for each character class. In FIG. 10, a circle (ours) indicates the recognition accuracy of the present invention. The square (alexnet) indicates the recognition accuracy in the case of Non-Patent Document 1 (however, the input data is RGB values).
The graph (vgg-vd-16) of ▽ indicates the recognition accuracy in the case of Non-Patent Document 2 (input data is RGB values and the number of layers is 16). The graph of * (vgg-vd-19) shows the recognition accuracy in the case of Non-Patent Document 2 (input data is RGB values and the number of layers is 19).

図１０に示すように、入力画像として従来通りのＲＧＢ値を用いる非特許文献１及び２の場合には、１０種類のすべての文字クラスについて、９０％を超える認識精度を達成できない。
これに対して、手書き文字の原画像に回転及び反転の不変性を与える本願発明の場合には、１０種類のすべての文字クラスについて、９０％以上の認識精度を獲得した。 As shown in FIG. 10, in the case of Non-Patent Documents 1 and 2 using conventional RGB values as an input image, recognition accuracy exceeding 90% cannot be achieved for all ten character classes.
In contrast, in the case of the present invention, which imparts rotation and inversion invariance to the original images of the handwritten characters, a recognition accuracy of 90% or more was obtained for all ten character classes.

図１０のグラフから明らかな通り、深層ＣＮＮを用いた画像認識（図１０の例では文字認識）において、原画像に回転及び反転の不変性を与える処理を行ってＣＮＮの入力データとすれば、従来の生データ（ＲＧＢ画像）を入力データとする場合に比べて、認識精度が有意に改善される。 As is apparent from the graph of FIG. 10, in the image recognition using the deep CNN (character recognition in the example of FIG. 10), if processing for giving rotation and inversion invariance to the original image is performed as the input data of the CNN, Compared with the case where conventional raw data (RGB image) is used as input data, the recognition accuracy is significantly improved.

〔画像認識装置の効果〕
以上の通り、本実施形態の画像認識装置１０によれば、原画像７，８に対して回転及び反転の不変性を有する画像フィルタ９を畳み込むことにより、ＣＮＮ処理部４への入力データを生成する。このため、サンプル画像７をそれほど多く収集しなくても、他の深層学習技術に比べて高い認識精度を発揮できる。
例えば、図１０に示す通り、サンプル画像７の数が「４００」の場合には、従来の深層学習技術よりも高い認識精度が得られる。 [Effect of image recognition device]
As described above, according to the image recognition apparatus 10 of the present embodiment, the input data to the CNN processing unit 4 is generated by convolving the original images 7 and 8 with the image filter 9, which is invariant to rotation and inversion. For this reason, a higher recognition accuracy than other deep learning techniques can be achieved even without collecting a large number of sample images 7.
For example, as shown in FIG. 10, when the number of sample images 7 is 400, a recognition accuracy higher than that of the conventional deep learning techniques is obtained.

本実施形態では、６種類の要素（Ｒｒ、Ｒｔ、Ｇｒ、Ｇｔ、Ｂｒ、Ｂｔ）を含むシングルチャンネルの画像フィルタ９をサンプル画像７に畳み込むので、１つのサンプル画像７に含まれるデータ量を少なくとも１６倍に拡張することになる。
このため、従来の深層学習では１００枚のサンプル画像７を必要とする場合には、ほぼ６枚（１００／１６＝６．２５）のサンプル画像７を収集すれば、従来の深層学習と概ね同程度の識別精度が得られる。 In the present embodiment, the single-channel image filter 9 including six types of elements (Rr, Rt, Gr, Gt, Br, Bt) is convolved with the sample image 7, so that the amount of data contained in one sample image 7 is expanded at least 16-fold.
For this reason, where conventional deep learning requires 100 sample images 7, collecting about six sample images 7 (100 / 16 = 6.25) yields roughly the same identification accuracy as conventional deep learning.

本実施形態の画像認識装置１０によれば、画像フィルタ９の畳み込みにより、回転又は反転したオブジェクトでも正確に認識できる。
従って、例えば、古文書などの撮影画像に含まれる文字を正確に認識して、テキスト又はワープロ文書データに変換する装置として利用できる。また、帳票などの文書に記載の文字を読み込んで、テキストやワープロ文書データに変換する装置や、タッチパネルに手書き入力された文字をリアルタイムに認識する装置として利用することもできる。 According to the image recognition apparatus 10 of the present embodiment, the convolution of the image filter 9 allows even a rotated or inverted object to be recognized accurately.
Therefore, for example, it can be used as an apparatus for accurately recognizing characters included in a photographed image such as an old document and converting it into text or word processor document data. It can also be used as a device that reads characters described in a document such as a form and converts them into text or word processor document data, or a device that recognizes characters handwritten on the touch panel in real time.

その他、本実施形態の画像認識装置１０は、人間の顔の表情認識、顔画像からの年齢認識、及び、動物、植物、製品などのあらゆる物体の種類の認識などに利用可能である。
このように、本実施形態の画像認識装置１０において、ＣＮＮ処理部４が種類を認識可能なオブジェクトは、手書き文字、人間、動物、植物、及び製品のうちの少なくとも１つの物体であればよく、あらゆるオブジェクトの種類の認識に利用することができる。 In addition, the image recognition apparatus 10 according to the present embodiment can be used for recognition of human facial expressions, age recognition from face images, and recognition of all types of objects such as animals, plants, and products.
Thus, in the image recognition device 10 of the present embodiment, the object that can be recognized by the CNN processing unit 4 may be at least one object of handwritten characters, humans, animals, plants, and products. It can be used to recognize any object type.

〔画像認識装置のその他の応用例〕
図１１は、本実施形態の製品監視システム２０の全体構成図である。
製品監視システム２０は、撮影画像に含まれる製品２５の種類を認識する画像認識装置１０（図１参照）を、不良品の選別及び取り出しに利用するシステムである。
図１１に示すように、製品監視システム２０は、撮影装置２１、駆動制御装置２２、及び可動アーム式のロボット装置２３を備える。 [Other application examples of image recognition device]
FIG. 11 is an overall configuration diagram of the product monitoring system 20 according to the present embodiment.
The product monitoring system 20 is a system that uses the image recognition apparatus 10 (see FIG. 1) for recognizing the type of the product 25 included in the photographed image for sorting and taking out defective products.
As shown in FIG. 11, the product monitoring system 20 includes an imaging device 21, a drive control device 22, and a movable arm type robot device 23.

撮影装置２１は、例えば、ＣＣＤ（電荷結合素子）を利用してデジタル画像を生成するデジタルカメラよりなる。撮影装置２１は、ベルトコンベア２４の上方に吊り下げられており、下流側（図１１の右側）に進行するベルトコンベア２４上の複数の製品２５を上から撮影する。
撮影装置２１は、複数の製品２５が含まれるデジタル画像よりなる撮影画像を、駆動制御装置２２に送信する。撮影画像は、静止画及び動画像のいずれでもよい。 The imaging device 21 is composed of, for example, a digital camera that generates a digital image using a CCD (charge coupled device). The imaging device 21 is suspended above the belt conveyor 24, and images a plurality of products 25 on the belt conveyor 24 traveling downstream (right side in FIG. 11) from above.
The imaging device 21 transmits a captured image formed of a digital image including a plurality of products 25 to the drive control device 22. The captured image may be either a still image or a moving image.

駆動制御装置２２は、ロボット装置２３の動作を制御するコンピュータ装置よりなる。駆動制御装置２２は、第１通信部２６、第２通信部２７、制御部２８、及び記憶部２９を備える。
第１通信部２６は、所定のＩ／Ｏインタフェース規格により、撮影装置２１と通信する通信装置よりなる。第１通信部２６と撮影装置２１との通信は、有線通信及び無線通信のいずれであってもよい。 The drive control device 22 includes a computer device that controls the operation of the robot device 23. The drive control device 22 includes a first communication unit 26, a second communication unit 27, a control unit 28, and a storage unit 29.
The first communication unit 26 includes a communication device that communicates with the imaging device 21 according to a predetermined I / O interface standard. Communication between the first communication unit 26 and the photographing apparatus 21 may be either wired communication or wireless communication.

第２通信部２７は、所定のＩ／Ｏインタフェース規格により、ロボット装置２３と通信する通信装置よりなる。第２通信部２７とロボット装置２３との通信は、有線通信及び無線通信のいずれであってもよい。 The second communication unit 27 includes a communication device that communicates with the robot device 23 according to a predetermined I / O interface standard. Communication between the second communication unit 27 and the robot apparatus 23 may be either wired communication or wireless communication.

制御部２８は、１又は複数のＣＰＵを有する情報処理装置であり、上述の本実施形態の画像識別装置（図１参照）を含む制御装置よりなる。
記憶部２９は、１又は複数のＲＡＭ及びＲＯＭなどのメモリを含む記憶装置よりなる。記憶部２９は、制御部２８に実行させる各種のコンピュータプログラムや、撮影装置２１などから受信した画像データなどの、一時的又は非一時的な記録媒体として機能する。 The control unit 28 is an information processing apparatus having one or a plurality of CPUs, and includes a control apparatus including the above-described image identification apparatus (see FIG. 1) of the present embodiment.
The storage unit 29 is a storage device including memories such as one or more RAMs and ROMs. The storage unit 29 functions as a transitory or non-transitory recording medium for the various computer programs executed by the control unit 28 and for image data received from the imaging device 21 and the like.

このように、駆動制御装置２２は、コンピュータを備えて構成される。従って、駆動制御装置２２の各機能は、当該コンピュータの記憶装置に記憶されたコンピュータプログラムが当該コンピュータのＣＰＵ及びＧＰＵによって実行されることで発揮される。
かかるコンピュータプログラムは、ＣＤ−ＲＯＭやＵＳＢメモリなどの一時的又は非一時的な記録媒体に記憶させることができる。 As described above, the drive control device 22 includes a computer. Therefore, each function of the drive control device 22 is exhibited when the computer program stored in the storage device of the computer is executed by the CPU and GPU of the computer.
Such a computer program can be stored in a transitory or non-transitory recording medium such as a CD-ROM or a USB memory.

制御部２８は、記憶部２９に格納されたコンピュータプログラムを読み出して実行することにより、第１及び第２通信部２６，２７に対する通信制御や、ロボット装置２３の動作制御を実現できる。
例えば、制御部２８は、第１通信部２６が受信した撮影画像から製品２５の画像部分（以下、「製品画像」という。）を抽出し、抽出した製品画像の分類クラス（ここでは、「正常」又は「不良」とする。）を識別する。 The control unit 28 can realize communication control for the first and second communication units 26 and 27 and operation control of the robot device 23 by reading and executing the computer program stored in the storage unit 29.
For example, the control unit 28 extracts the image portion of the product 25 (hereinafter referred to as the “product image”) from the captured image received by the first communication unit 26, and identifies the classification class (here, “normal” or “defective”) of the extracted product image.

制御部２８は、分類クラスが不良である製品画像を検出すると、撮影時刻とベルトコンベア２４の進行速度などから、製品２５がロボット装置２３の下を通過する時刻及び位置を算出し、算出した通過時刻及び通過位置を第２通信部２７に送信させる。
ロボット装置２３は、多関節のロボットアーム３０と、ロボットアーム３０を駆動するアクチュエータ３１とを備える。ロボットアーム３０は、コンベア２４上の製品２５を把持することができるハンド部（図示せず）を有する。 When the control unit 28 detects a product image whose classification class is defective, it calculates the time and position at which the product 25 will pass under the robot device 23 from the photographing time, the traveling speed of the belt conveyor 24, and the like, and causes the second communication unit 27 to transmit the calculated passage time and passage position.
The robot apparatus 23 includes an articulated robot arm 30 and an actuator 31 that drives the robot arm 30. The robot arm 30 has a hand portion (not shown) that can grip the product 25 on the conveyor 24.
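The passage time and passage position calculated by the control unit 28 are not given as a formula in the text. A minimal sketch, assuming a constant belt speed and a product that does not shift across the belt, is shown below. The coordinate convention, units, and all names are hypothetical.

```python
from typing import Tuple


def passage_under_robot(shot_time_s: float, x_at_shot_m: float, y_m: float,
                        robot_x_m: float, belt_speed_m_per_s: float
                        ) -> Tuple[float, Tuple[float, float]]:
    """Return (passage time, passage position) of a product on the conveyor.

    x runs along the conveying direction and y across the belt; the product
    is assumed to keep its y position and to move at the constant belt speed.
    """
    if belt_speed_m_per_s <= 0:
        raise ValueError("belt speed must be positive")
    travel_time = (robot_x_m - x_at_shot_m) / belt_speed_m_per_s
    return shot_time_s + travel_time, (robot_x_m, y_m)


# Photographed at t = 10 s, 0.5 m along the belt; robot at 1.5 m; belt at 0.5 m/s.
when, where = passage_under_robot(10.0, 0.5, 0.05, 1.5, 0.5)
```

Here the product needs 2 s to travel the remaining 1 m, so it passes under the robot at t = 12 s at the same lateral position at which it was photographed.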

The passing time and passing position of the defective product calculated by the control unit 28 are transmitted to the actuator 31. The actuator 31 drives each joint of the robot arm 30 so that the hand portion moves to the passing position at the passing time of the defective product, grips the product 25, and takes it out to the outside.
The storage unit 29 stores a CNN of a predetermined structure capable of judging the quality of the product 25 (for example, FIG. 8), together with the learned weights and biases for that CNN. Using the learned CNN, the control unit 28 determines whether the product image extracted from the captured image is good or defective.

The work steps that a factory manager must carry out to realize the product monitoring system 20 described above are as follows.
Step 1) A worker 32 on the downstream side of the robot arm 30 identifies a defective product among the products 25 flowing on the conveyor 24 and photographs it with a digital camera (not shown).
Step 2) The captured image data is input to the storage unit 29 of the drive control device 22 as a sample image 7 of a defective product.

Step 3) Steps 1 and 2 above are repeated until the desired identification accuracy (for example, 99% or more) is obtained.
If the representative shape of a defective product is known in advance, the sample images of defective products may be image data obtained by photographing products other than the products 25 flowing on the conveyor 24.

In the learning stage in which steps 1 and 2 are repeated, the drive control device 22 recognizes defective products with low accuracy at the start of operation because few sample images 7 of defective products are available.
However, as the number of sample images 7 of defective products increases, the recognition accuracy of the drive control device 22 improves, and defective products can be removed at a rate close to 100%. Consequently, once the drive control device 22 has learned from a predetermined number of sample images 7, visual monitoring by the worker 32 becomes unnecessary.

In particular, because this embodiment uses a classifier whose recognition accuracy remains high even for rotated or inverted objects, defective products can be identified at a rate close to 100% from an early stage even when, as shown in FIG. 11 for example, the products 25 are placed on the conveyor 24 in various orientations.
Examples of products 25 placed on the conveyor 24 in various orientations include foods such as confectionery and paste products, and molded articles produced by a molding machine. Furthermore, where products 25 are currently conveyed on the conveyor 24 while workers 32 inspect them visually and remove defective products from the conveyor 24, adopting the product monitoring system 20 of this embodiment can be expected to greatly reduce the number of workers 32 performing visual inspection.

[First Modification]
In the embodiment described above, the directions into which the color vector g is decomposed are not limited to the "radial direction" (the r direction in FIG. 5) and the "tangential direction" (the t direction in FIG. 5). That is, the color vector g may be decomposed into any two directions that open at a predetermined angle with the pixel point p as the base point.
However, decomposing into two directions other than the radial and tangential directions complicates the expressions for the color vector in each direction and increases the processing load on the data generation unit 3. Therefore, as in the embodiment described above, the color vector g is preferably decomposed into the radial and tangential directions.
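A minimal sketch of the radial and tangential decomposition may make the preference clearer. It treats g as a plain two-dimensional vector attached to the pixel point p and projects it onto the unit vectors pointing away from the center point c and perpendicular to that direction. How g itself is computed is not reproduced here, and the function names `decompose_radial_tangential` and `rotate` are hypothetical.

```python
import math


def decompose_radial_tangential(g, p, c):
    """Return (g_r, g_t): components of the 2-D vector g at pixel p
    along the radial and tangential unit vectors relative to center c."""
    dx, dy = p[0] - c[0], p[1] - c[1]
    r = math.hypot(dx, dy)
    if r == 0.0:
        raise ValueError("radial direction is undefined at the center point")
    e_r = (dx / r, dy / r)      # unit vector pointing away from c
    e_t = (-e_r[1], e_r[0])     # e_r rotated 90 degrees counter-clockwise
    g_r = g[0] * e_r[0] + g[1] * e_r[1]
    g_t = g[0] * e_t[0] + g[1] * e_t[1]
    return g_r, g_t


def rotate(v, angle_rad):
    """Rotate a 2-D vector about the origin (illustration helper)."""
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    return (cos_a * v[0] - sin_a * v[1], sin_a * v[0] + cos_a * v[1])
```

In this sketch, rotating both the pixel position and the vector about c by the same angle leaves the pair (g_r, g_t) unchanged, and because the two directions are orthogonal each component is a single dot product. A pair of non-orthogonal directions would instead require solving a 2×2 linear system per pixel, which is consistent with the remark above about more complicated expressions.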

[Second Modification]
In the embodiment described above, three-dimensional polar coordinates with the center point c as the origin may be defined, and the original images 7 and 8 may be convolved with a three-dimensional image filter whose elements are obtained by decomposing the color vector g into the radial direction and two tangential directions (three directions in total in the three-dimensional case).
This improves image recognition accuracy not only for two-dimensional rotation or inversion about the center point c but also for a target image 8 that is tilted in the depth direction about the center point c.
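The three-direction decomposition can be sketched in the same spirit. The text does not say which two tangential directions are used, so the choice below (the polar and azimuthal unit vectors of ordinary spherical coordinates) is an assumption, as is the function name `decompose_spherical`.

```python
import math


def decompose_spherical(g, p, c):
    """Return (g_r, g_theta, g_phi): components of the 3-D vector g at
    point p along the radial, polar, and azimuthal unit vectors of
    spherical coordinates centered at c (assumed choice of tangents)."""
    dx, dy, dz = (p[i] - c[i] for i in range(3))
    r = math.sqrt(dx * dx + dy * dy + dz * dz)
    if r == 0.0:
        raise ValueError("directions are undefined at the center point")
    theta = math.acos(max(-1.0, min(1.0, dz / r)))  # angle from the z axis
    phi = math.atan2(dy, dx)  # 0.0 on the z axis, where tangents are arbitrary
    st, ct = math.sin(theta), math.cos(theta)
    sp, cp = math.sin(phi), math.cos(phi)
    e_r = (st * cp, st * sp, ct)
    e_theta = (ct * cp, ct * sp, -st)
    e_phi = (-sp, cp, 0.0)

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return dot(g, e_r), dot(g, e_theta), dot(g, e_phi)
```

Because the three unit vectors are mutually orthogonal, the squared components sum to the squared length of g, so no information in g is lost by the decomposition.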

[Third Modification]
The embodiment described above employs an image filter 9 that is invariant to both rotation and inversion of the original images 7 and 8. However, the image filter 9 may instead be invariant only to rotation of the original images 7 and 8, or invariant only to inversion of the original images 7 and 8. That is, the image filter 9 of the present invention need only be invariant to at least one of rotation and inversion of the original images 7 and 8.

[Other Modifications]
The embodiments (including the modifications) disclosed herein are illustrative in all respects and are not restrictive. The scope of rights of the present invention is not limited to the embodiments described above and includes all changes within the scope equivalent to the configurations described in the claims.
For example, in the embodiments described above, the neural network is a convolutional neural network (CNN), but it may be a hierarchical neural network of another structure that has no convolutional layer.

DESCRIPTION OF SYMBOLS
1 Arithmetic processing unit
2 Image processing unit
3 Data generation unit
4 CNN processing unit
5 Learning unit
6 Recognition unit
7 Sample image (original image)
8 Target image (original image)
9 Image filter
10 Image recognition device
20 Product monitoring system
21 Imaging device
22 Drive control device (control device)
23 Robot device
24 Belt conveyor
25 Product
26 First communication unit
27 Second communication unit
28 Control unit
29 Storage unit
30 Robot arm
31 Actuator
32 Worker

Claims

An image recognition apparatus comprising:
a data generation unit that performs predetermined data processing on an original image to generate input data; and
an image processing unit having a hierarchical neural network that recognizes the type of an object included in the generated input data, wherein
when the original image is a sample image, the image processing unit performs a process of learning parameters of the network based on the recognition result of the network, and when the original image is a target image to be recognized, the image processing unit performs a process of outputting the recognition result of the network, and
the data processing performed by the data generation unit is a process of imparting at least one invariance of rotation and inversion to the original image.

The image recognition apparatus according to claim 1, wherein the data processing performed by the data generation unit includes a first process and a second process defined below.
First process: a process of generating an image filter having at least one invariance of rotation and inversion with respect to the original image.
Second process: a process of convolving the image filter generated by the first process with the original image.

The image recognition apparatus according to claim 2, wherein the image filter comprises elements that include a plurality of color vectors obtained by decomposing the color vector of an arbitrary pixel point, defined in polar coordinates whose origin is a predetermined point of the original image, into arbitrary directions that open at a predetermined angle with the pixel point as the base point.

The image recognition apparatus according to claim 2, wherein the image filter comprises elements that include two color vectors obtained by decomposing the color vector of an arbitrary pixel point, defined in polar coordinates whose origin is a predetermined point of the original image, into a radial direction and a tangential direction with the pixel point as the base point.

The image recognition apparatus according to claim 1, wherein the hierarchical neural network is a convolutional neural network.

The image recognition apparatus according to any one of claims 1 to 5, wherein the object whose type is recognized includes at least one object of handwritten characters, humans, animals, plants, and products.

A computer program for causing a computer to function as an image recognition apparatus comprising a data generation unit that performs predetermined data processing on an original image to generate input data, and an image processing unit having a hierarchical neural network that recognizes an object included in the generated input data, the computer program comprising:
a step in which the data generation unit performs, as the data processing, a process of imparting at least one invariance of rotation and inversion to the original image;
a step in which the image processing unit performs, when the original image is a sample image, a process of learning parameters of the network based on the recognition result of the network; and
a step in which the image processing unit performs, when the original image is a target image to be recognized, a process of outputting the recognition result of the network.

An image recognition method executed by an image recognition apparatus comprising a data generation unit that performs predetermined data processing on an original image to generate input data, and an image processing unit having a hierarchical neural network that recognizes an object included in the generated input data, the method comprising:
a step in which the data generation unit performs, as the data processing, a process of imparting at least one invariance of rotation and inversion to the original image;
a step in which the image processing unit performs, when the original image is a sample image, a process of learning parameters of the network based on the recognition result of the network; and
a step in which the image processing unit performs, when the original image is a target image to be recognized, a process of outputting the recognition result of the network.

A product monitoring system comprising:
an imaging device that photographs a plurality of products;
a robot device that takes out any of the photographed products to the outside; and
a control device that instructs the robot device which product to take out, wherein
the control device includes the image recognition apparatus according to any one of claims 1 to 6, and
the robot device is instructed to take out a product that the image recognition apparatus has recognized as a defective product.